
Data Engineer Resume


IL

SUMMARY

  • Dynamic and motivated IT professional with over 8 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using Cloud Data Engineering, Data Warehouse, Hadoop Ecosystem, Big Data Analytics, Data Visualization, Reporting, and Data Quality solutions.
  • Hands-on experience across the Hadoop ecosystem, including extensive work with Big Data technologies such as HDFS, MapReduce, YARN, Apache Cassandra, NoSQL, Spark, Python, Scala, Sqoop, HBase, Hive, Oozie, Impala, Pig, Zookeeper, and Flume.
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming consumer applications (see the streaming sketch after this list). Utilized Flume to analyze log files and write them into HDFS.
  • Experienced in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs, and worked extensively with PySpark.
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Hands-on experience setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
  • Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for small data set processing and storage; experienced in maintaining the Hadoop cluster on AWS EMR.
  • Hands-on experience with Amazon EC2, S3, RDS (Aurora), IAM, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift, DynamoDB, and other services of the AWS family, as well as Microsoft Azure.
  • Proven expertise in deploying major software solutions for various high-end clients, meeting business requirements such as big data processing, ingestion, analytics, and cloud migration from on-prem to the AWS cloud.
  • Experience working with AWS databases such as ElastiCache (Memcached & Redis) and NoSQL databases (HBase, Cassandra & MongoDB), along with database performance tuning and data modeling.
  • Established connectivity from Azure to on-premises environments.
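
A minimal sketch of the kind of real-time pipeline described in the streaming bullet above: a kafka-python producer publishing JSON events and a Spark Structured Streaming consumer writing them to HDFS. The topic name, broker address, schema, and paths are illustrative assumptions, not values from any of the projects below, and the Spark Kafka connector package is assumed to be on the classpath.

  # Producer side: publish JSON events to a Kafka topic.
  import json
  from kafka import KafkaProducer

  producer = KafkaProducer(
      bootstrap_servers="broker:9092",
      value_serializer=lambda v: json.dumps(v).encode("utf-8"),
  )
  producer.send("events", {"user_id": 42, "action": "click"})
  producer.flush()

  # Consumer side: Spark Structured Streaming job that reads the same topic,
  # parses the JSON payload, and appends Parquet files to HDFS.
  from pyspark.sql import SparkSession
  from pyspark.sql.functions import from_json, col
  from pyspark.sql.types import StructType, StructField, IntegerType, StringType

  spark = SparkSession.builder.appName("events-stream").getOrCreate()
  schema = StructType([
      StructField("user_id", IntegerType()),
      StructField("action", StringType()),
  ])

  events = (
      spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("e"))
      .select("e.*")
  )

  query = (
      events.writeStream.format("parquet")
      .option("path", "hdfs:///data/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
  )
  query.awaitTermination()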

TECHNICAL SKILLS

Hadoop/Big Data: HDFS, MapReduce, Yarn, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Splunk, Hortonworks, Cloudera

Programming languages: SQL, Python, R, Scala, Spark, Linux shell scripts

Databases: RDBMS (MySQL, DB2, MS SQL Server, Teradata, PostgreSQL), NoSQL (MongoDB, HBase, Cassandra), Snowflake virtual warehouse

OLAP & ETL Tools: Tableau, Spyder, Spark, SSIS, Informatica Power Center, Pentaho, Talend

Data Modelling Tools: Microsoft Visio, ER Studio, Erwin

Python and R libraries: R - tidyr, tidyverse, dplyr, reshape, lubridate; Python - Beautiful Soup, NumPy, SciPy, Matplotlib, python-twitter, pandas, scikit-learn, Keras

Machine Learning: Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, and Gradient Boost & Adaboost, Neural Networks and Time Series Analysis.

Data Analysis Tools: Machine Learning, Deep Learning, Data Warehousing, Data Mining, Data Analysis, Big Data, Data Visualization, Data Munging, Data Modelling

Cloud Computing Tools: Snowflake, SnowSQL, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

Amazon Web Services: EMR, EC2, S3, RDS, Cloud Search, Redshift, Data Pipeline, Lambda.

Reporting Tools: JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS

IDEs: PyCharm

Development Methodologies: Agile, Scrum, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, IL

Data Engineer

Responsibilities:

  • Analyzed, developed, and built modern data solutions with Azure PaaS services to enable data visualization. Understood the application's current production state and the impact of a new installation on existing business processes. Worked on migration of data from an on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
  • Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks; a minimal PySpark sketch of this pattern follows this list.
  • Created pipelines in Azure Data Factory utilizing Linked Services/Datasets/Pipelines to extract, transform, and load data between many sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool, and back again.
  • Used Azure ML to build, test, and deploy predictive analytics solutions based on data. Developed Spark applications with Azure Data Factory and Spark SQL for data extraction, transformation, and aggregation from different file formats, in order to analyze and transform the data and uncover insights into customer usage patterns.
  • Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance. Applied technical knowledge to architect solutions that meet business and IT needs, created roadmaps, and ensured the long-term technical viability of new deployments, infusing key analytics and AI technologies where appropriate (e.g., Azure Machine Learning, Machine Learning Server, Bot Framework, Azure Cognitive Services, Azure Databricks, etc.).
  • Managed the relational database service in which Azure SQL handles reliability, scaling, and maintenance. Integrated data storage solutions with Spark, particularly Azure Data Lake Storage and Blob storage.
  • Configured Stream Analytics and Event Hubs and worked to manage IoT solutions with Azure. Successfully completed a proof of concept for an Azure implementation, with the larger goal of migrating on-premises servers and data to the cloud.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster. Experienced in tuning the performance of Spark applications for the proper batch interval time, parallelism level, and memory. Extensively involved in analysis, design, and modeling.
  • Worked on Snowflake schema design, data modelling and elements, source-to-target mappings, interface matrices, and design elements.
  • Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake. Helped individual teams set up their repositories in Bitbucket and maintain their code, and helped them set up jobs that make use of the CI/CD environment.
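
A minimal PySpark sketch of the ingest-then-process pattern described above, as it might run in an Azure Databricks notebook. The storage account, container, column names, and paths are illustrative assumptions, not values from the project.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("adls-usage-aggregation").getOrCreate()

  # Read raw Parquet files landed in Azure Data Lake Storage Gen2 (hypothetical abfss path).
  raw = spark.read.parquet(
      "abfss://raw@examplestorageaccount.dfs.core.windows.net/usage/"
  )

  # Aggregate usage events per customer and day with Spark SQL functions.
  daily_usage = (
      raw.withColumn("event_date", F.to_date("event_timestamp"))
      .groupBy("customer_id", "event_date")
      .agg(
          F.count("*").alias("events"),
          F.sum("duration_sec").alias("total_duration_sec"),
      )
  )

  # Write the curated result back to the lake, partitioned by date, where
  # downstream consumers (Synapse, Power BI) can pick it up.
  (
      daily_usage.write.mode("overwrite")
      .partitionBy("event_date")
      .parquet("abfss://curated@examplestorageaccount.dfs.core.windows.net/daily_usage/")
  )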

Confidential

Data Engineer

Responsibilities:

  • Developed Apache Presto and Apache Drill setups on an AWS EMR (Elastic MapReduce) cluster to combine multiple databases such as MySQL and Hive. This enables operations such as joins and inserts across various data sources to be compared and controlled through a single platform.
  • Wrote AWS Lambda functions in Scala with cross-functional dependencies that generated custom libraries for delivering the Lambda functions in the cloud. Performed raw data ingestion into S3 from Kinesis Firehose, which triggered a Lambda function that put refined data into another S3 bucket and wrote to an SQS queue as Aurora topics.
  • Writing to the Glue metadata catalog allows the improved data to be queried from Athena, resulting in a serverless querying environment. Created PySpark data frames to bring data from DB2 to Amazon S3. Worked on the Kafka backup index and the Log4j appender, minimized logs, and pointed Ambari server logs to NAS storage.
  • Created AWS RDS (Relational Database Service) instances to serve as the Hive metastore and combined the metadata of 20 EMR clusters into a single RDS, which avoids data loss even when terminating the EMR clusters. Used an AWS CodeCommit repository to store programming logic and scripts and reuse them on new clusters. Spun up EMR clusters of 30 to 50 memory-optimized nodes, such as R2, R4, X1, and X1e instances, with the autoscaling feature. Hive being the primary query engine of EMR, we created external table schemas for the data being processed.
  • Mounted a local directory file path to Amazon S3 using s3fs-fuse to have KMS encryption enabled on the data reflected in the S3 buckets. Designed and implemented ETL pipelines over S3 Parquet files in the data lake using AWS Glue. Migrated data from the Amazon Redshift data warehouse to Snowflake. Involved in code migration of a quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.
  • Used the AWS Glue catalog with a crawler to get the data from S3 and perform SQL query operations, and used a JSON schema to define table and column mappings from S3 data to Redshift (a minimal boto3 sketch of the crawler/Athena flow follows this list).
  • Applied auto scaling techniques to scale instances in and out based on memory utilization over a given time window. This helped reduce the instance count when the cluster is not actively in use, while still honoring Hive's replication factor of 2 by leaving a minimum of 5 instances running.
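
A minimal boto3 sketch of the Glue-crawler-plus-Athena flow referenced above. The crawler name, database, table, and S3 output location are illustrative assumptions, and the crawler is assumed to already exist.

  import time
  import boto3

  glue = boto3.client("glue", region_name="us-east-1")
  athena = boto3.client("athena", region_name="us-east-1")

  # Run the pre-created crawler so the Glue catalog picks up new S3 data.
  glue.start_crawler(Name="example-raw-data-crawler")

  # Query the crawled table through Athena; results land in an S3 output location.
  response = athena.start_query_execution(
      QueryString="SELECT vendor, COUNT(*) AS orders FROM raw_db.orders GROUP BY vendor",
      QueryExecutionContext={"Database": "raw_db"},
      ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
  )
  query_id = response["QueryExecutionId"]

  # Poll until the query finishes, then fetch the result rows.
  while True:
      status = athena.get_query_execution(QueryExecutionId=query_id)
      state = status["QueryExecution"]["Status"]["State"]
      if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
          break
      time.sleep(2)

  if state == "SUCCEEDED":
      rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]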

Environment: Amazon Web Services, Elastic Map Reduce cluster, EC2s, CloudFormation, Amazon S3, Amazon Redshift, Hive, Scala, PySpark, Snowflake, Shell Scripting, Tableau, Kafka.

Confidential, IL

Hadoop Spark Engineer

Responsibilities:

  • Developed Spark applications using Scala and Spark-SQL for data extraction, transformation and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
  • Expertise in creating HDInsight clusters and storage accounts with an end-to-end environment for running the jobs.
  • Processed data into HDFS by developing solutions, analyzed the data using MapReduce, Pig, and Hive, and produced summary results from Hadoop for downstream systems.
  • Used Kettle extensively to import data from various systems/sources such as MySQL into HDFS.
  • Performed various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins (see the PySpark partitioning/bucketing sketch after this list).
  • Involved in creating Hive tables, and then applied HiveQL on those tables for data validation.
  • Used Zookeeper for various types of centralized configurations.
  • Deep understanding of schedulers, workload management, availability, scalability and distributed data platforms.
  • Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Involved in loading data from UNIX file system to HDFS.
  • Wrote MapReduce jobs to discover trends in data usage by users.
  • Involved in managing and reviewing Hadoop log files.
  • Involved in running Hadoop streaming jobs to process terabytes of text data.
  • Developed Hive queries for the analysts.
  • Implemented partitioning, dynamic partitions, and buckets in Hive.
  • Exported the result set from Hive to MySQL using shell scripts.
  • Used Git for version control.
  • Maintained system integrity of all sub-components, primarily HDFS, MR, HBase, and Flume.
  • Monitored system health and logs and responded accordingly to any warning or failure conditions.
  • Developed Spark (Scala) notebooks to transform and partition the data and organize files in ADLS.
  • Experienced in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
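
A minimal PySpark sketch of the Hive partitioning and bucketing optimization mentioned above. The dataset, column names, and table name are illustrative assumptions.

  from pyspark.sql import SparkSession

  # Hive support lets saveAsTable register the bucketed table in the Hive metastore.
  spark = (
      SparkSession.builder.appName("orders-partitioning")
      .enableHiveSupport()
      .getOrCreate()
  )

  orders = spark.read.parquet("hdfs:///data/raw/orders")  # hypothetical input path

  # Partition by date so queries filtering on order_date prune whole directories;
  # bucket by customer_id so joins on customer_id can avoid a full shuffle.
  (
      orders.write.mode("overwrite")
      .partitionBy("order_date")
      .bucketBy(16, "customer_id")
      .sortBy("customer_id")
      .saveAsTable("analytics.orders_bucketed")
  )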

Environment: Hadoop, HDFS, Map Reduce, Hive, Pig, Sqoop, Java 1.6, UNIX Shell Scripting.

Confidential, WI

Hadoop Developer

Responsibilities:

  • Observed the setup and monitoring of a scalable distributed system based on HDFS to gain a better understanding of it, and worked closely with the team to understand the business requirements and add new support features.
  • Gathered business requirement to determine the feasibility and to convert them to technical tasks in the design document.
  • Installed and configured Hadoop MapReduce jobs and HDFS, and developed multiple MapReduce jobs in Java using different UDFs for data cleaning and processing.
  • Involved in loading data from LINUX file system to HDFS.
  • Used Pig Latin and Pig scripts to process data.
  • Experienced in importing and exporting data into HDFS and assisted in exporting analyzed data to RDBMS using Sqoop.
  • Extracted the data from various SQL Servers into HDFS using Sqoop. Developed custom MapReduce code, generated JAR files for user-defined functions, and integrated them with Hive to extend the accessibility of statistical procedures to the entire analysis team.
  • Implemented partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables for more efficient data access.
  • Used Hive queries to aggregate the data and mine information sorted by volume and grouped by vendor and product (a minimal client-side sketch of this kind of aggregation follows this list).
  • Performed statistical data analysis routines using Java APIs to analyze the data.
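
A minimal sketch of the vendor/product aggregation referenced above, shown here through the PyHive client purely for illustration; PyHive, the host name, and the table and column names are assumptions, and the original queries ran as HiveQL directly on the cluster.

  from pyhive import hive

  # Connect to HiveServer2 (host, port, and username are placeholders).
  conn = hive.Connection(host="hive-server.example.com", port=10000, username="etl_user")
  cursor = conn.cursor()

  # Aggregate volume grouped by vendor and product, largest volumes first.
  cursor.execute(
      """
      SELECT vendor, product, SUM(volume) AS total_volume
      FROM sales.transactions
      GROUP BY vendor, product
      ORDER BY total_volume DESC
      """
  )

  for vendor, product, total_volume in cursor.fetchall():
      print(vendor, product, total_volume)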

Environment: Hadoop (HDFS/MapReduce), PIG, HIVE, SQOOP, SQL, Linux, Statistical analysis.

Confidential

Data Analyst

Environment: Excel, SQL server, Power View, Power Query

Responsibilities:

  • Built dashboards using Power View and deployed them on SharePoint for the sales and marketing teams to monitor the company's main KPIs.
  • Participated as a Data Analyst with an understanding of the entire lifecycle of the process for the team.
  • Performed data collection, cleaning, validation, and visualization
  • Developed SQL queries
  • Tuned and Optimized SQL Queries using Execution Plan and Profiler.
  • Imported and exported data between servers using tools such as SSIS.
  • Used SQL extensively, including complex queries, unions, multi-table joins, CTEs, and views (a minimal sketch of running such a query from Python follows this list).
  • Extensively used advanced chart visualizations such as dual axis charts, bar charts, heat maps, tree maps, etc.
  • Created dashboards and reports for ad-hoc reporting.
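
A minimal sketch of the CTE-style query work referenced above, run from Python against SQL Server via pyodbc. The connection string, table, and column names are illustrative assumptions.

  import pyodbc

  # Connection string values are placeholders for a real SQL Server instance.
  conn = pyodbc.connect(
      "DRIVER={ODBC Driver 17 for SQL Server};"
      "SERVER=sqlserver.example.com;DATABASE=SalesDW;Trusted_Connection=yes;"
  )
  cursor = conn.cursor()

  # CTE that totals monthly revenue per region, then keeps the best month for each region.
  sql = """
  WITH monthly_sales AS (
      SELECT region,
             DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS sales_month,
             SUM(amount) AS revenue
      FROM dbo.Orders
      GROUP BY region, DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1)
  )
  SELECT region, sales_month, revenue
  FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
      FROM monthly_sales
  ) ranked
  WHERE rn = 1;
  """

  for region, sales_month, revenue in cursor.execute(sql).fetchall():
      print(region, sales_month, revenue)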
