
Data Engineer Resume


Charlotte, NC

SUMMARY

  • Data Engineer with 8+ years of combined experience in data engineering, Big Data implementations, and Spark technologies.
  • Around 5 years of experience specializing in the Big Data ecosystem: data acquisition, ingestion, modeling, analysis, integration, and processing.
  • In-depth experience and good knowledge in using Hadoop ecosystem tools like HDFS, MapReduce, YARN, Spark, Kafka, Hive, Sqoop, Pig, Impala, HBase, Flume, Oozie, and Zookeeper.
  • Strong understanding of Hadoop architecture and its core components, including HDFS, MapReduce, JobTracker, TaskTracker, NameNode, and DataNode.
  • Hands-on experience in working with Cloudera, Azure HDInsight, and AWS cloud.
  • Experience converting SQL queries into Spark transformations using Spark RDDs and Scala, including map-side (broadcast) joins on RDDs (see the sketch following this list).
  • Expertise in writing end to end Data processing Jobs to analyze data using MapReduce, Spark, and Hive.
  • Profound knowledge in developing production-ready Spark applications using Spark components like Spark SQL, MLlib, Spark Streaming, and GraphX.
  • Experience with Hive partitioning and bucketing, performing joins on Hive tables, and implementing Hive SerDes.
  • Strong experience working with Amazon cloud web services like EMR, Redshift, DynamoDB, Lambda, Athena, S3, RDS, and CloudWatch for efficient processing of big data.
  • Solid experience designing and operationalizing large-scale data and analytics solutions on the Snowflake data warehouse.
  • Experience developing Kafka producers and consumers for streaming millions of events per second.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL.
  • Experience extracting data from MongoDB through Sqoop, landing it in HDFS, and processing it there.
  • Worked with batch and real-time ingestion using Spark Streaming, Confluent Kafka, Storm, Flume, and Sqoop.
  • Hands-on experience with NoSQL databases including HBase, Cassandra, and MongoDB, and their integration with Hadoop and Kubernetes clusters.
  • Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data according to business requirements.
  • Experience with file formats such as text, SequenceFile, Parquet, Avro, JSON, and ORC, and compression codecs such as Snappy, Gzip, and LZO.
  • Experience with Azure components such as HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Extensive experience creating pipeline jobs and schedule triggers in Azure Data Factory.
  • Created Azure Data Factory pipelines and used Azure Databricks notebooks to prepare data and publish it as views for consumption in Power BI reports.
  • Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
  • Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
  • Hands on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
  • Knowledge of Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver monitoring, and Cloud Deployment Manager.
  • Created Tableau dashboards over large data volumes sourced from SQL Server.
  • Strong programming skills in Scala, Java, and Python.
  • Proficient with Python including NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, and TensorFlow.
  • Extensive experience in writing stored procedures and complex SQL queries using relational databases like MySQL, MS SQL, and Oracle.
  • Working experience building RESTful web services and APIs.
  • Used Informatica PowerCenter for ETL: extracting, transforming, and loading data from heterogeneous source systems into target databases.
  • Extracted, transformed, and loaded (ETL) source data into the respective target tables to build the required data marts.
  • Active involvement in all Scrum ceremonies (Sprint Planning, Daily Scrum, Sprint Review, and Retrospective) and assisted the Product Owner in creating and prioritizing user stories.
  • Strong experience in working with UNIX/LINUX environments, writing shell scripts.
  • Strong debugging and critical-thinking skills, with a solid understanding of evolving frameworks, methodologies, and strategies.
  • Worked on multiple stages of Software Development Life Cycle including Development, Component Integration, Performance Testing, Deployment and Support Maintenance.
  • Excellent analytical and communication skills that help in understanding business logic and building good relationships between stakeholders and team members.
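
The following is a minimal PySpark sketch of the map-side (broadcast) join mentioned above; the datasets, keys, and values are illustrative placeholders, and the original work used Scala on RDDs.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("map-side-join-sketch").getOrCreate()
    sc = spark.sparkContext

    # Small lookup dataset (customer_id -> region), broadcast to every executor once.
    regions = {"c1": "EAST", "c2": "WEST"}
    regions_bc = sc.broadcast(regions)

    # Large dataset of (customer_id, order_amount) pairs.
    orders = sc.parallelize([("c1", 120.0), ("c2", 75.5), ("c1", 42.0)])

    # Map-side join: each record looks up its region locally, avoiding a shuffle.
    joined = orders.map(lambda kv: (kv[0], regions_bc.value.get(kv[0], "UNKNOWN"), kv[1]))

    print(joined.collect())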

TECHNICAL SKILLS

Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, Storm, Azure Databricks, Azure Data Explorer, Azure HDInsight.

Hadoop Distribution: Apache Hadoop 2.x/1.x, Cloudera CDH, Hortonworks HDP, Amazon EMR (EMR, EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, Kinesis, SQS, DynamoDB, Redshift, ECS), Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory)

Programming Languages: Scala, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting, C, C++, Java

Script Languages: JavaScript, jQuery, Python.

Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL databases (HBase, MongoDB).

Operating Systems: Linux, Windows, Ubuntu, Unix

Web/Application Servers: Apache Tomcat, WebLogic, WebSphere

Tools: Eclipse, NetBeans

Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL, Talend.

OLAP/Reporting: SQL Server Analysis Services and Reporting Services.

Cloud Technologies: MS Azure, Amazon Web Services (AWS).

Machine Learning Models: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Principal Component Analysis, Linear Regression, Naïve Bayes.

Development Tools: Eclipse, NetBeans, IntelliJ, Hue, Microsoft Office Suite.

PROFESSIONAL EXPERIENCE

Confidential, Charlotte, NC

Data Engineer

Responsibilities:

  • Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
  • Implemented a proof of concept deploying the product on an AWS S3 bucket and Snowflake.
  • Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
  • Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs in Spark for data aggregation, queries, and writing results back into the S3 bucket.
  • Experience in data cleansing and data mining.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Used Spark Streaming to divide streaming data into micro-batches as input to the Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing the results to data scientists for further analysis.
  • Involved in migrating data from an on-premises Cloudera cluster to AWS EC2 instances deployed on an EMR cluster, and developed an ETL pipeline to extract logs, store them in the AWS S3 data lake, and process them further using PySpark.
  • Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
  • Designed and developed Spark workflows using Scala to pull data from the AWS S3 bucket and Snowflake and apply transformations to it.
  • Implemented Spark RDD transformations to map business analysis and applied actions on top of those transformations.
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
  • Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests via Amazon API Gateway.
  • Migrated data from AWS S3 bucket to Snowflake by writing custom read/write snowflake utility function using Scala.
  • Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake (Confidential AWS S3 bucket).
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
  • Troubleshoot and maintain ETL/ELT jobs running using Matillion.
  • Created DAGs using the Email Operator, Bash Operator, and Spark Livy operator to execute jobs on an EC2 instance (see the sketch following this list).
  • Deployed the code to EMR via CI/CD using Jenkins.
  • Extensively used Code cloud for code check-ins and checkouts for version control.
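
Below is a minimal sketch of the kind of daily Airflow DAG described above, assuming Airflow 1.x-style operator imports (module paths differ slightly in Airflow 2.x); the DAG name, script path, schedule, and email address are hypothetical placeholders, not the actual project values.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.email_operator import EmailOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 1,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="s3_to_snowflake_daily",      # hypothetical DAG name
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        # Submit the PySpark ingestion job (placeholder spark-submit command).
        ingest = BashOperator(
            task_id="spark_ingest_s3",
            bash_command="spark-submit /opt/jobs/ingest_s3_to_snowflake.py",
        )

        # Notify the team once the load finishes.
        notify = EmailOperator(
            task_id="notify_success",
            to="data-team@example.com",
            subject="S3 to Snowflake load complete",
            html_content="Daily ingestion finished successfully.",
        )

        ingest >> notify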

Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code cloud, AWS.

Confidential, Rochester, MN

Data Engineer

Responsibilities:

  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data to and from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool.
  • Strong experience leading multiple Azure Big Data and data transformation implementations in the banking and financial services, high tech, and utilities industries.
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML and Power BI.
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
  • Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
  • Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
  • Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning (see the sketch following this list).
  • Created partitioned and bucketed Hive tables in Parquet format with Snappy compression, and loaded data into the Parquet Hive tables from Avro Hive tables.
  • Involved in running Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL.
  • Collected JSON data from an HTTP source and developed Spark APIs that help perform inserts and updates in Hive tables.
  • Used Azure Data Factory, the SQL API, and the MongoDB API, and integrated data from MongoDB, MS SQL, and the cloud (Blob, Azure SQL DB, Cosmos DB).
  • Responsible for resolving the issues and troubleshooting related to performance of Hadoop cluster.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
  • Used Jira for bug tracking and Bit Bucket to check-in and checkout code changes.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and the MLlib libraries.
  • Performed all necessary day-to-day Git support for different projects, and was responsible for maintaining the Git repositories and access control strategies.
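
The following is a hedged PySpark sketch of a streaming ETL job of the kind described above, reading JSON events from Kafka and writing them to a partitioned Parquet sink; the broker address, topic, schema, and paths are illustrative placeholders, and the original application was written in Scala.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, to_date
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = (SparkSession.builder
             .appName("streaming-etl-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical event schema.
    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read the raw Kafka stream (requires the spark-sql-kafka package).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "events")                      # placeholder topic
           .load())

    # Parse the JSON payload and derive a date partition column.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*")
              .withColumn("dt", to_date(col("event_ts"))))

    # Append to a partitioned Parquet location that a Hive external table can sit on.
    query = (events.writeStream
             .format("parquet")
             .option("path", "/data/events_parquet")            # placeholder path
             .option("checkpointLocation", "/checkpoints/events")
             .partitionBy("dt")
             .outputMode("append")
             .start())

    query.awaitTermination()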

Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure, ADF, Blob, Cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Oracle, Teradata, Cassandra, MLlib, Tableau, Maven, Git, Jira.

Confidential, MI

Big Data Engineer

Responsibilities:

  • Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.
  • Gained real-time experience with Kafka and Storm on the HDP 2.2 platform for real-time analysis.
  • Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
  • Designed ETL processes using Informatica to load data from flat files, Oracle, and Excel files into the target Oracle data warehouse database.
  • Created reports for the BI team using Sqoop to export data into HDFS and Hive.
  • Performed analysis on unused user navigation data by loading it into HDFS and writing MapReduce jobs. The analysis provided inputs to the new APM front-end developers and the Lucent team.
  • Loaded data from multiple sources (SQL, DB2, and Oracle) into HDFS using Sqoop and then into Hive tables.
  • Created Hive queries to process large sets of structured, semi-structured, and unstructured data and store them in managed and external tables.
  • Developed complex HiveQL queries using the JSON SerDe.
  • Created HBase tables to load large sets of structured data.
  • Involved in importing real-time data into Hadoop using Kafka and implemented the Oozie job for daily imports.
  • Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
  • Managed and reviewed Hadoop log files.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala (see the sketch following this list).
  • Worked on PySpark APIs for data transformations.
  • Performed data ingestion into Hadoop via Sqoop imports and carried out validations and consolidations on the imported data.
  • Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
  • Upgraded the current Linux version to RHEL 5.6.
  • Expertise in hardening Linux servers and in compiling, building, and installing Apache Server from source with minimal modules.
  • Worked with JSON, Parquet, and other Hadoop file formats.
  • Worked on Java technologies such as Hibernate, Spring, JSP, and Servlets, and developed code for both the server side and client side of our web application.
  • Used GitHub for continuous integration services.
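
Below is a minimal PySpark sketch of rewriting a Hive/SQL aggregation as DataFrame transformations, in the spirit of the conversion work noted above (the original used RDDs and Scala); the table and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-spark-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Original HiveQL (illustrative):
    #   SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    #   FROM sales.orders
    #   WHERE order_date >= '2020-01-01'
    #   GROUP BY region;

    orders = spark.table("sales.orders")   # hypothetical Hive table

    summary = (orders
               .filter(F.col("order_date") >= "2020-01-01")
               .groupBy("region")
               .agg(F.count("*").alias("orders"),
                    F.sum("amount").alias("revenue")))

    summary.show()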

Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, GIT, JSON SerDe, Cloudera.

Confidential

Data Analyst

Responsibilities:

  • Understood the data visualization requirements of the business users.
  • Wrote SQL queries to extract data from the sales data marts per the requirements.
  • Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
  • Designed and deployed rich graphical visualizations with drill-down and drop-down menu options, parameterized using Tableau.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Explored traffic data from databases, connected it with transaction data, and presented and wrote a report for every campaign, providing suggestions for future promotions.
  • Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis (see the sketch following this list).
  • Performed data cleaning, merging, and export of the dataset in Tableau Prep.
  • Carried out data processing and cleaning techniques to reduce text noise and dimensionality in order to improve the analysis.
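
A minimal sketch of pulling data with a SQL query into Python and exporting it to Excel for further analysis, as described above; the connection string, query, and file name are illustrative placeholders.

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical SQL Server connection (requires an installed ODBC driver).
    engine = create_engine(
        "mssql+pyodbc://user:password@sales-db/SalesMart"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

    query = """
        SELECT campaign_id, region, SUM(amount) AS revenue
        FROM dbo.Transactions
        WHERE txn_date >= '2019-01-01'
        GROUP BY campaign_id, region
    """

    # Pull the aggregated result into a DataFrame.
    df = pd.read_sql(query, engine)

    # Light cleaning before hand-off: drop duplicates, fill missing revenue.
    df = df.drop_duplicates().fillna({"revenue": 0})

    # Export to Excel for the business users (requires openpyxl).
    df.to_excel("campaign_revenue.xlsx", index=False)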

Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.

Confidential

Data Analyst

Responsibilities:

  • Processed data received from vendors and loaded it into the database. The process was carried out on a weekly basis, and reports were delivered on a bi-weekly basis. The extracted data was checked for integrity.
  • Documented requirements and obtained signoffs.
  • Coordinated between the Business users and development team in resolving issues.
  • Documented data cleansing and data profiling.
  • Wrote SQL scripts to meet the business requirement.
  • Analyzed views and produced reports.
  • Tested cleansed data for integrity and uniqueness (see the sketch following this list).
  • Automated the existing system to achieve faster and more accurate data loading.
  • Generated weekly and bi-weekly reports for the client business team using BusinessObjects and documented them.
  • Used Informatica Data Transformations like Source Qualifier, Aggregator, Joiner, Normalizer, Rank, Router, Lookup, Sorter, Reusable, Transaction control, etc, to parse complex files and load them into databases.
  • Created complex SQL queries and scripts to extract, aggregate, and validate data from Oracle, MS SQL, and flat files using Informatica and loaded it into a single data warehouse repository for data analysis.
  • Learned to create Business Process Models.
  • Managed multiple projects simultaneously, tracking them against varying timelines through a combination of business and technical skills.
  • Good understanding of clinical practice management, medical and laboratory billing, and insurance claim processing, with process flow diagrams.
  • Assisted the QA team in creating test scenarios covering a day in the life of the patient for inpatient and ambulatory workflows.
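
A minimal sketch of the kind of integrity and uniqueness checks mentioned above, using pandas on an extracted vendor file; the file path and column names are hypothetical.

    import pandas as pd

    df = pd.read_csv("vendor_extract.csv")   # placeholder extract file

    # Uniqueness: the business key should not repeat.
    duplicate_keys = df[df.duplicated(subset=["claim_id"], keep=False)]

    # Integrity: required fields should not be null.
    required = ["claim_id", "patient_id", "billed_amount"]
    missing_required = df[df[required].isnull().any(axis=1)]

    print(f"{len(duplicate_keys)} rows with duplicate claim_id")
    print(f"{len(missing_required)} rows missing required fields")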

Environment: SQL, data profiling, data loading, QA team, Tableau, Python, Machine Learning models, Informatica
