Data Engineer Resume
Hartford, CA
SUMMARY
- 7+ years of IT experience in software development as a Big Data/Hadoop Developer, with strong knowledge of and hands-on experience in the Hadoop framework.
- Expertise in Hadoop architecture and its components, including HDFS, YARN, High Availability, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Experience with all phases of development, from requirement discovery and initial implementation through release, enhancement, and support (SDLC and Agile techniques).
- Experience in design, development, data migration, testing, support, and maintenance using Redshift databases.
- Experience with Apache Hadoop technologies such as the Hadoop Distributed File System (HDFS), the MapReduce framework, Hive, Pig, Sqoop, Oozie, HBase, Spark, Scala, and Python.
- Experience in AWS cloud solution development using Lambda, SQS, SNS, DynamoDB, Athena, S3, EMR, EC2, Redshift, Glue, and CloudFormation.
- Experience using Microsoft Azure SQL Database, Data Lake, Azure ML, Azure Data Factory, Functions, Databricks, and HDInsight.
- Working experience with big data in the cloud using AWS EC2 and Microsoft Azure; handled Redshift and DynamoDB databases holding large volumes of data.
- Extensive experience migrating on-premises Hadoop platforms to cloud solutions on AWS and Azure.
- Experience writing Python-based ETL frameworks and PySpark jobs to process large volumes of data daily (a representative sketch follows this summary).
- Strong experience implementing data models and loading unstructured data using HBase, DynamoDB, and Cassandra.
- Created multiple report dashboards, visualizations, and heat maps using the Tableau, QlikView, and Qlik Sense reporting tools.
- Strong experience extracting and loading data with complex business logic using Hive from different data sources, and built ETL pipelines that process terabytes of data daily.
- Experienced in transporting and processing real-time event streams using Kafka and Spark Streaming.
- Hands-on experience importing and exporting data between relational databases and HDFS, Hive, and HBase using Sqoop.
- Experienced in processing real-time data using Kafka 0.10.1 producers and stream processors; implemented stream processing with Kinesis, landing data in an S3 data lake.
- Experience in implementing multitenant models for the Hadoop 2.0 Ecosystem using various big data technologies.
- Designed and developed Spark pipelines to ingest real-time, event-based data from Kafka and other message queue systems, and processed large volumes with Spark batch jobs into the Hive data warehouse.
- Experienced in creating and analyzing Software Requirement Specifications (SRS) and Functional Specification Documents (FSD).
- Designed data models for both OLAP and OLTP applications using Erwin and used both star and snowflake schemas in the implementations.
- Capable of organizing, coordinating, and managing multiple tasks simultaneously.
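
A minimal sketch of the kind of PySpark batch ETL described above, assuming a Spark-with-Hive environment; the bucket, paths, and column names are hypothetical placeholders, not taken from any specific project.

```python
# Illustrative PySpark ETL sketch: read raw events from S3, apply a simple
# transformation, and write date-partitioned Parquet for downstream Hive use.
# All paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("daily-etl-sketch")
         .enableHiveSupport()
         .getOrCreate())

raw = spark.read.json("s3a://example-bucket/raw/events/")  # hypothetical path

cleaned = (raw
           .filter(F.col("event_type").isNotNull())
           .withColumn("event_date", F.to_date("event_ts")))

(cleaned.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3a://example-bucket/curated/events/"))  # hypothetical path
```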
TECHNICAL SKILLS
Programming: Python, R, SQL, HTML, CSS
Databases: SQL Server, MySQL, NoSQL, Hive, Hadoop, Redshift
Python: NLTK, spaCy, matplotlib, NumPy, Pandas, Scikit-Learn
Tools: Git, Docker, Flask, DVC, Keras, Tensorflow, PyTorch
Cloud: AWS (S3, EC2, Redshift, Lambda, EMR), Azure (Synapse Analytics, Azure SQL, ADF), GCP
Visualization: Tableau, Power BI, Sisense, Excel
Core Competencies: Supervised and Unsupervised ML, SVM, DNN, Text Analytics, MXNet, Big Data, NLP
PROFESSIONAL EXPERIENCE
Confidential | Hartford, CA
Data Engineer
Responsibilities:
- Extensively worked in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation.
- Supported persistent storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances.
- Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents.
- Installed and configured Hive, wrote Hive UDFs, and used MapReduce for unit testing.
- Participated in JAD meetings to gather requirements and understand the end users' system.
- Followed an SDLC methodology for data warehouse development, tracked in Kanbanize.
- Worked on managing and reviewing Hadoop log files; tested and reported defects from an Agile methodology perspective.
- Created table structures for data marts in Netezza.
- Built S3 buckets, managed their bucket policies, and used S3 and Glacier for storage and backup on AWS (see the sketch after this section).
- Built multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Configured the Hive metastore with MySQL, which stores the metadata for Hive tables.
- Created Use Case Diagrams using UML to define the functional requirements of the application.
- Worked on configuring and managing disaster recovery and backup on Cassandra Data.
- Created PL/SQL packages and Database Triggers and developed user procedures and prepared user manuals for the new programs.
- Designed, developed, and deployed projects in the GCP suite, including BigQuery, Dataflow, Dataproc, Google Cloud Storage, Composer, and Looker.
- Created jobs and transformations in Pentaho Data Integration to generate reports and transfer data from HBase to an RDBMS.
- Designed HBase schemas based on the requirements and handled HBase data migration and validation.
- Created automated pipelines in AWS CodePipeline to deploy Docker containers to AWS ECS services.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
- Performed data mapping and data design (data modeling) to integrate data across multiple databases into the EDW.
- Worked on data modeling and advanced SQL with columnar databases on AWS.
- Worked with the NoSQL database HBase for real-time data analytics.
- Utilized Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
- Generated various presentable reports and documentation using report designer and pinned reports in Erwin.
Environment: Hadoop, Agile, Hive, Netezza, PL/SQL, HBase, GCP, AWS, NoSQL, Oozie 5, MongoDB, SSRS, SSIS, OLTP, OLAP, Puppet
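
A minimal sketch, assuming boto3, of the S3 bucket, bucket policy, and Glacier backup setup referenced above; the bucket name, account ID, role name, and prefix are hypothetical placeholders.

```python
# Illustrative boto3 sketch: create an S3 bucket, attach a bucket policy, and
# add a lifecycle rule that transitions backups to Glacier after 30 days.
# Bucket name, account ID, and role name are hypothetical.
import json
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

bucket = "example-data-backup-bucket"  # hypothetical
s3.create_bucket(Bucket=bucket)

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowEtlRoleReadWrite",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/example-etl-role"},  # hypothetical
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-to-glacier",
            "Filter": {"Prefix": "backup/"},  # hypothetical prefix
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }],
    },
)
```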
Confidential | Jacksonville, FL
Data Engineer
Responsibilities:
- Involved in the full project life cycle, from design, analysis, and logical and physical architecture modeling through development, implementation, and testing.
- Set up and built AWS infrastructure across various services by writing CloudFormation templates (CFTs) in JSON and YAML.
- Developed CloudFormation scripts to build EC2 instances on demand.
- Created IAM roles, users, and groups and attached policies to provide least-privilege access to resources.
- Updated bucket policies with IAM roles to restrict user access, and configured AWS Identity and Access Management (IAM) groups and users for improved login authentication.
- Created topics in SNS to send notifications to subscribers as per the requirement.
- Moved data from Oracle to HDFS using Sqoop.
- Created Hive Tables, loaded transactional data from Oracle using Sqoop and worked with highly unstructured and semi structured data.
- Developed MapReduce (YARN) jobs for cleaning, accessing, and validating the data.
- Created and ran Sqoop jobs with incremental loads to populate Hive external tables.
- Wrote scripts to distribute queries for performance-test jobs in the Amazon data lake.
- Developed optimal strategies for distributing web log data over the cluster; imported and exported the stored web log data into HDFS and Hive using Sqoop.
- Installed and configured Apache Hadoop on multiple nodes on AWS EC2.
- Developed Pig Latin scripts to replace the existing legacy process on Hadoop, with the data fed to AWS S3.
- Worked on CDC tables using a Spark application to load data into dynamic-partition-enabled Hive tables (see the sketch after this section).
- Designed and developed automation test scripts using Python.
- Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
- Analyzed the SQL scripts and designed the solution for implementation in PySpark.
- Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in HDFS.
- Uploaded streaming data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Supported data analysis projects using Elastic MapReduce (EMR) on the Amazon Web Services (AWS) cloud; performed export and import of data into S3.
- Involved in designing the HBase row key to store text and JSON as key values, structuring the row key so data can be retrieved and scanned in sorted order.
- Created Hive tables and worked on them using HiveQL.
- Designed and implemented partitioning (static and dynamic) and bucketing in Hive.
- Developed multiple POCs using PySpark and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL; developed syllabus/curriculum data pipelines from syllabus/curriculum web services to HBase and Hive tables.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
Environment: AWS, Hadoop, Hive, YARN, HBase, SSRS, SSIS, Oracle Database 11g, Oracle BI tools, Tableau, MS Excel, Python, Naive Bayes, SVM, K-means, ANN, Regression, MS Access, SQL Server Management Studio.
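
A minimal sketch of the CDC-to-Hive load referenced above, assuming Spark with Hive support and dynamic partitioning; the source path, table, and column names are hypothetical placeholders.

```python
# Illustrative PySpark sketch: load CDC data into a dynamic-partition-enabled
# Hive table. Table, path, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("cdc-to-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Allow non-strict dynamic partitioning for the Hive insert.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

cdc = (spark.read.parquet("s3a://example-bucket/cdc/orders/")  # hypothetical path
             .withColumn("load_date", F.to_date("change_ts")))

# Append by position into a Hive table partitioned by load_date; the partition
# column must be the last column selected.
(cdc.select("order_id", "status", "change_ts", "load_date")
    .write
    .mode("append")
    .insertInto("warehouse.orders_cdc"))  # hypothetical Hive table
```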
Confidential | San Diego, CA
Data Engineer
Responsibilities:
- Installed and configured Apache Hadoop to test the maintenance of log files in Hadoop cluster.
- Designed cloud architectures for customers looking to migrate or develop new PaaS, IaaS, or hybrid solutions on Amazon Web Services (AWS).
- Designed, built, configured, tested, installed, managed, and supported all aspects and components of the application development environments in AWS.
- Utilized AWS CloudFormation to create new AWS environments following best practices in VPC/subnet design.
- Analyzed the business, technical, functional, performance and infrastructure requirements needed to access and process large amounts of data.
- Coordinated with the Dev, DBA, QA, and IT Operations environments to ensure there are no resource conflicts.
- Worked within and across Agile teams to design, develop, test, implement, and support technical solutions across a full stack of development tools and technologies, tracking all stories in JIRA.
- Responsible for developing and maintaining processes and associated scripts/tools for automated build, testing, and deployment of the products to various development environments.
- Managed the production server infrastructure environment, collaborated with the development team to troubleshoot and resolve issues, and delivered product releases through frequent, zero-downtime deployments.
- Extensively involved in infrastructure as code, execution plans, resource graphs, and change automation using Terraform; managed AWS infrastructure as code with Terraform.
- Created Terraform scripts for EC2 instances, Elastic Load balancers and S3 buckets.
- Managed different infrastructure resources, such as physical machines, VMs, and Docker containers, with Terraform, which supports multiple cloud service providers including AWS.
- Built Jenkins jobs to create AWS infrastructure from GitHub repos containing Terraform code.
- Configured the ELK stack in conjunction with AWS, using Logstash to output data to AWS S3.
- Involved in AWS EC2-based automation through Terraform, Ansible, Python, and Bash scripts (see the sketch after this section); adopted new features as Amazon released them, including ELB and EBS.
- Experience in Virtualization technologies and worked with containerizing applications.
- Automated deployment of application using deployment tool (Ansible).
Environment: AWS, PaaS, IaaS, JSON, EC2, Python, Pandas, Regression, Classification, CNN, RNN, Random Forest, TensorFlow, Keras, Seaborn, NumPy, SVM, Preprocessing, SQL, AWS SageMaker, AWS S3.
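
A minimal sketch of the Python side of the EC2 automation mentioned above, assuming boto3; the AMI ID, key pair, subnet ID, and tags are hypothetical placeholders.

```python
# Illustrative boto3 sketch: launch a tagged EC2 instance and wait until it is
# running before later automation steps use it. All IDs are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical AMI
    InstanceType="t3.medium",
    KeyName="example-keypair",            # hypothetical key pair
    SubnetId="subnet-0123456789abcdef0",  # hypothetical subnet
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Environment", "Value": "dev"}],
    }],
)

instance_id = resp["Instances"][0]["InstanceId"]

# Block until the instance reaches the running state.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
print(f"Instance {instance_id} is running")
```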
Confidential
Data Analyst
Responsibilities:
- Exported the analyzed data to the Relational databases using Sqoop for performing visualization and generating reports for the Business Intelligence team.
- Collaborated with the business to define requirements and recommend optimized solutions. Ability to quickly understand complex business processes and associated data sets.
- Consulted with internal and external stakeholders to identify specific needs within customer application modules and documented requirements for data, reports, analysis, metadata, training, service levels, data quality, performance, and troubleshooting.
- Extracted data from MySQL into HDFS using Sqoop (see the sketch after this section) and developed simple to complex MapReduce jobs.
- Responsible for writing complicated SQL queries with a good understanding of transactional databases.
- Assisted reporting teams in developing Tableau visualizations and dashboards using Tableau Desktop.
- Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior, and created partitioned tables in Hive as part of the role.
- Administered and supported the Hortonworks distribution.
- Wrote Korn shell, Bash, and Perl scripts to automate most database maintenance tasks.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Monitored running MapReduce programs on the cluster.
- Responsible for loading data from UNIX file systems to HDFS.
- Using PHP, created documents and executed software designs involving complicated workflows or multiple product areas.
- Responsible for all requests against an alternate UNIX/Oracle-based system, including bug fixes, change requests, and tuning; performed implementation, testing, and documentation for this system.
- Consulted with project managers, business analysts, and development teams on application development and business plans.
- Installed and configured Hive and created Hive UDFs.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Implemented the workflows using the Apache Oozie framework to automate tasks.
- Developed scripts and automated data management from end to end and sync up between the clusters.
- Designed, developed, tested, and deployed Power BI scripts and performed detailed analytics.
- Performed DAX queries and functions in Power BI.
Environment: Apache Hadoop, Java, Bash, ETL, MapReduce, Hive, Pig, Hortonworks, deployment tools, DataStax, flat files, Oracle 11g/10g, MySQL, Windows NT, UNIX, Sqoop, Oozie, Tableau.
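
A minimal sketch of wrapping a Sqoop import from MySQL into HDFS in Python, as referenced above; the JDBC URL, credentials file, table, and target directory are hypothetical placeholders.

```python
# Illustrative Python wrapper around a Sqoop 1 import from MySQL to HDFS.
# Connection details, table, and target directory are hypothetical.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com:3306/sales",  # hypothetical JDBC URL
    "--username", "etl_user",                               # hypothetical user
    "--password-file", "/user/etl/.mysql.password",         # hypothetical HDFS path
    "--table", "orders",                                    # hypothetical table
    "--target-dir", "/data/raw/orders",                     # hypothetical HDFS dir
    "--num-mappers", "4",
    "--fields-terminated-by", "\t",
]

# Run the import and raise an error if Sqoop exits non-zero.
subprocess.run(sqoop_cmd, check=True)
```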