
Sr. Data Engineer Resume


Chicago, IL

SUMMARY:

  • 8 years of Big Data experience building highly scalable data analytics applications.
  • Strong experience working with Hadoop ecosystem components like HDFS, MapReduce, Spark, HBase, Oozie, Hive, Sqoop, Pig, Flume and Kafka.
  • Good hands-on experience working with various Hadoop distributions such as Cloudera (CDH), Hortonworks (HDP) and Amazon EMR.
  • Good understanding of Distributed Systems architecture and design principles behind Parallel Computing.
  • Expertise in developing production-ready Spark applications utilizing Spark Core, DataFrames, Spark SQL, Spark ML and Spark Streaming APIs (a brief sketch follows this summary).
  • Strong experience troubleshooting failures in Spark applications and fine-tuning Spark applications and Hive queries for better performance.
  • Worked extensively on Hive for building complex data analytical applications.
  • Strong experience writing complex map-reduce jobs including development of custom Input Formats and custom Record Readers.
  • Sound knowledge of map-side joins, reduce-side joins, shuffle & sort, distributed cache, compression techniques, and multiple Hadoop input and output formats.
  • Worked with Apache NiFi to automate and manage data flow between systems.
  • Good experience working with AWS Cloud services like S3, EMR, Redshift, Glue and Athena.
  • Deep understanding of performance tuning, partitioning for building scalable data lakes.
  • Worked on building real time data workflows using Kafka, Spark streaming and HBase.
  • Extensive knowledge on NoSQL databases like HBase, Cassandra and Mongo DB.
  • Solid experience working with CSV, text, SequenceFile, Avro, Parquet, ORC and JSON data formats.
  • Extensive experience in performing ETL on structured, semi-structured data using Pig Latin Scripts.
  • Designed and implemented Hive and Pig UDFs using Java for evaluation, filtering, loading and storing of data.
  • Professional knowledge of Synapse. Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, and wrote SQL queries against Snowflake.
  • Experience connecting various Hadoop sources like Hive, Impala and Phoenix to Tableau for reporting.
  • Fluent programming experience with Scala, Java, Python, SQL, T-SQL, R.
  • Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
  • Adept at configuring and installing Hadoop/Spark Ecosystem Components.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN.
  • Experience integrating various data sources like Oracle SE2, SQL Server, flat files and unstructured files into a data warehouse.
  • Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy and Beautiful Soup.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
  • Experience in developing Map Reduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
  • Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing like collecting, aggregating and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
  • Hands-on experience with Hadoop architecture and various components such as Hadoop File System HDFS, Job Tracker, Task Tracker, Name Node, Data Node and Hadoop MapReduce programming.
  • Comprehensive experience in developing simple to complex MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering and data aggregation. Also possess detailed knowledge of the MapReduce framework.
  • Good knowledge in the core concepts of programming such as algorithms, data structures, collections.
  • Developed core modules in large cross-platform applications using JAVA, JSP, Servlets, Hibernate, RESTful, JDBC, JavaScript, XML, and HTML.
  • Passionate about working and gaining more expertise on a variety of cutting-edge Big Data technologies.
  • Ability to adapt quickly to evolving technology and a strong sense of responsibility.
  • Eager to update my knowledge base constantly and learn new skills according to business needs.
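
To illustrate the Spark/DataFrame work referenced in the bullet above, here is a minimal PySpark sketch of a typical batch aggregation; the S3 paths, view name and columns are hypothetical placeholders, not taken from any client project.

    from pyspark.sql import SparkSession

    # Build a Spark session (EMR/CDH/HDP clusters add their own configuration)
    spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

    # Read raw Parquet data from S3 (placeholder path)
    orders = spark.read.parquet("s3://example-datalake/raw/orders/")

    # Register a temp view and aggregate with Spark SQL
    orders.createOrReplaceTempView("orders")
    daily_revenue = spark.sql("""
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_date
    """)

    # Write the result to the curated zone as ORC, partitioned by date
    (daily_revenue.write
        .mode("overwrite")
        .partitionBy("order_date")
        .orc("s3://example-datalake/curated/daily_revenue/"))

    spark.stop()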

TECHNICAL SKILLS:

Big Data Ecosystems: Hadoop, HDFS, Yarn, MapReduce, Hive, Pig, Sqoop, Spark, HBase, Zookeeper, Oozie, Flume, Kafka, Azure, Airflow

Programming Languages: SQL, Scala, Python and HQL

NoSQL Databases: HBase, MongoDB, Cassandra

Databases: SQL Server, MySQL, Oracle 8i/9i/10g

Cloud Ecosystem: Amazon Web Services (S3, EC2, EMR, Redshift, Athena, Glue, Lambda), Azure

Hadoop Distributions: Cloudera, Apache, Hortonworks

Operating Systems: Microsoft Windows, LINUX, UNIX

Build Tools: Maven, SBT, Ant, Jenkins

Version Control Tools: GITHUB, SVN

PROFESSIONAL EXPERIENCE:

Confidential, Chicago, IL

Sr. Data Engineer

Responsibilities:

  • Worked on building centralized Data Lake on AWS Cloud utilizing primary services like S3, EMR, Redshift and Athena.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Extracted, transformed and loaded data from various heterogeneous data sources and destinations using AWS Redshift.
  • Migrated data from on-premises systems to AWS storage buckets.
  • Developed a Python script to call REST APIs and extract data to AWS S3 (see the sketch after this role's environment line).
  • Created tables and stored procedures, and extracted data using T-SQL for business users whenever required.
  • Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using Erwin and MB MDR.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and loaded them into AWS Redshift.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed a PySpark script to encrypt raw data by applying hashing algorithms to client-specified columns (see the masking sketch after this list).
  • Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend and pair RDDs.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Responsible for design, development and testing of the database; developed stored procedures, views and triggers.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Created Tableau reports with complex calculations and worked on Ad-hoc reporting using Tableau.
  • Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
  • Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and uploading it into the data warehouse servers.
  • Performed data pre-processing using Hive and Pig.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
  • Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
  • Design, develop, and test dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
  • Developed data pipelines using Spark, Hive, Pig, Python, Impala and HBase to ingest customer data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Strong understanding of AWS components such as EC2 and S3
  • Worked on Ingesting data by going through cleansing and transformations and leveraging AWS Lambda, AWS Glue and Step Functions.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Used ETL to implement slowly changing dimension transformations to maintain historical data in the data warehouse.
  • Created a Lambda Deployment function, and configured it to receive events from S3 buckets
  • Ensured deliverables (daily, weekly and monthly MIS reports) were prepared to satisfy project requirements, cost and schedule.
  • Experience deploying code through Jenkins and creating pull requests in Bitbucket.
  • Used Git for version control with colleagues.
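
The column-masking bullet above can be illustrated with a minimal PySpark sketch like the one below; the column list, paths and the choice of SHA-256 are illustrative assumptions, not the client's actual specification.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("column-hashing").getOrCreate()

    # Columns to mask -- hypothetical examples; the real list was client-specified
    sensitive_cols = ["ssn", "email", "phone"]

    raw = spark.read.parquet("s3://example-bucket/raw/customers/")

    # Replace each sensitive column with its SHA-256 hash
    masked = raw
    for col_name in sensitive_cols:
        masked = masked.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

    masked.write.mode("overwrite").parquet("s3://example-bucket/masked/customers/")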

Environment: AWS, Redshift, S3, MapReduce, AWS Lambda, AWS Glue, EC2, Hadoop, Hive, PySpark, SQL, T-SQL, NoSQL, Python, Scala, YARN, Tableau, Git, Bitbucket, OLAP, Snowflake, Pig, Spark, Cassandra, MongoDB.
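
The REST-API extraction bullet in the list above follows a common pattern; a minimal sketch is shown here, with the endpoint, bucket and object key as hypothetical placeholders.

    import json
    import requests
    import boto3

    API_URL = "https://api.example.com/v1/orders"   # placeholder endpoint
    BUCKET = "example-raw-zone"                     # placeholder bucket

    def extract_to_s3():
        # Pull a page of records from the REST API
        response = requests.get(API_URL, params={"page": 1}, timeout=30)
        response.raise_for_status()
        records = response.json()

        # Land the raw payload in S3 as JSON for downstream Spark jobs
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket=BUCKET,
            Key="orders/extract_page_1.json",
            Body=json.dumps(records).encode("utf-8"),
        )

    if __name__ == "__main__":
        extract_to_s3()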

Confidential, CA

Data Engineer

Responsibilities:

  • Performed analytics on AWS S3 using Spark, Performed transformations and actions as per business requirements.
  • Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
  • Designed several DAGs (directed acyclic graphs) for automating ETL pipelines (a minimal DAG sketch follows this list).
  • Designed and built infrastructure for the Google Cloud environment from scratch.
  • Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
  • Performed Data Migration to GCP
  • Responsible for data services and data movement infrastructures
  • Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc and Stackdriver (see the BigQuery load sketch after this role's environment line).
  • Experienced in ETL concepts, building ETL solutions and Data modeling
  • Worked on architecting the ETL transformation layers and writing spark jobs to do the processing.
  • Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters
  • Loaded application analytics data into data warehouse in regular intervals of time
  • Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Leveraged cloud and GPU computing technologies such as AWS and GCP for automated machine learning and analytics pipelines. Developed Lambda functions and assigned IAM roles to run Python scripts with various triggers (SQS, EventBridge, SNS).
  • Writing UNIX shell scripts to automate the jobs and scheduling cron jobs for job automation using commands with Crontab.
  • Developed various mappings with collections of sources, targets and transformations using Informatica Designer; built mappings using transformations like Expression, Filter, Joiner and Lookup for better data massaging and to migrate clean, consistent data.
  • Worked with Confluence and Jira; designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
  • Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling; implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
  • Used SQL Server Reporting Services (SSRS) to create and format cross-tab, conditional, drill-down, top-N, summary, form, OLAP, sub-report, ad-hoc, parameterized, interactive and custom reports.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Power BI; developed visualizations and dashboards; compiled data from various sources to perform complex analysis for actionable results; measured the efficiency of the Hadoop/Hive environment to ensure SLAs were met.
  • Optimized the TensorFlow Model for efficiency
  • Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
  • Implemented a continuous delivery pipeline with Docker, GitHub and AWS.
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
  • Collaborate with team members and stakeholders in design and development of data environment
  • Preparing associated documentation for specifications, requirements, and testing
  • Data Ingestion to one or more Azure Services - (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing the data in Azure Databricks.
  • Experience of Reporting, Analysis and Integration Services (SSRS/SSIS/SSAS)
  • Implemented Copy activity, Custom Azure Data Factory Pipeline Activities
  • Primarily involved in Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS, PowerShell.
  • Worked on SQL Server Integration Services (SSIS), SSAS, SSRS, T-SQL skills, stored procedures, triggers
  • Architected and implemented ETL and data movement solutions using Azure Data Factory and SSIS; created and ran SSIS packages in ADF V2 using the Azure-SSIS IR.
  • Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
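
A minimal Airflow DAG of the kind referenced in the list above might look like the following; the task callables, DAG id and schedule are illustrative assumptions rather than the actual pipeline.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # pull data from the source system (placeholder)
        pass

    def transform():
        # apply business transformations (placeholder)
        pass

    def load():
        # load results into the warehouse (placeholder)
        pass

    with DAG(
        dag_id="example_etl_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # linear dependency chain: extract -> transform -> load
        t_extract >> t_transform >> t_load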

Environment: Azure, GCP, BigQuery, GCS bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, PostgreSQL, SQL Server, Python, Scala, Spark, Hive, Spark SQL, Docker, Jenkins, Azure Data Factory, Azure Databricks.
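
As a companion to the GCP bullet above, this is a minimal sketch of loading a CSV file from a GCS bucket into BigQuery with the google-cloud-bigquery client; the project, dataset, table and URI are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")   # placeholder project

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                       # let BigQuery infer the schema
        write_disposition="WRITE_TRUNCATE",
    )

    # Load a CSV landed in GCS into a BigQuery table (URI and table are placeholders)
    load_job = client.load_table_from_uri(
        "gs://example-bucket/sales/daily_sales.csv",
        "example_dataset.daily_sales",
        job_config=job_config,
    )
    load_job.result()   # wait for the load job to finish

    print(client.get_table("example_dataset.daily_sales").num_rows, "rows loaded")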

Confidential

Hadoop Developer

Responsibilities:

  • Designed and Developed data integration/engineering workflows on big data technologies and platforms - Hadoop, Spark, MapReduce, Hive, HBase.
  • Worked in Agile methodology and actively participated in standup calls, PI planning.
  • Involved in Requirement gathering and prepared the Design documents.
  • Involved in importing data into HDFS and Hive using Sqoop and involved in creating Hive tables, loading with data, and writing Hive queries.
  • Developed Hive queries and Sqooped data from RDBMS to Hadoop staging area.
  • Handled importing of data from various data sources, performed transformations using Hive, and loaded data into Data Lake.
  • Experienced in handling large datasets using Partitions, Spark in Memory capabilities.
  • Processed data stored in Data Lake and created external tables using Hive and developed scripts to ingest and repair tables that can be reused across the project.
  • Developed dataflows and processes for the Data processing using SQL (SparkSQL& Dataframes).
  • Designed and developed MapReduce (Hive) programs to analyze and evaluate multiple solutions, considering multiple cost factors across the business as well as operational impact.
  • Involved in planning process of iterations under the Agile Scrum methodology.
  • Worked on Hive metastore backups and on partitioning and bucketing techniques in Hive to improve performance (see the partitioning sketch after this list).
  • Scheduling Spark jobs using Oozie workflow in Hadoop Cluster and Generated detailed design documentation for the source-to-target transformations.
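
The partitioning bullet above can be illustrated with a small PySpark-on-Hive sketch; the staging path, database, table and partition column are hypothetical.

    from pyspark.sql import SparkSession

    # Hive support lets Spark read and write tables in the Hive metastore
    spark = (SparkSession.builder
             .appName("hive-partitioned-load")
             .enableHiveSupport()
             .getOrCreate())

    # Read staged data landed in HDFS, e.g. by Sqoop (placeholder path)
    staged = spark.read.parquet("hdfs:///staging/orders/")

    # Write it as a Hive table partitioned by load date so queries can prune partitions
    (staged.write
           .mode("overwrite")
           .partitionBy("load_date")
           .format("parquet")
           .saveAsTable("analytics_db.orders_partitioned"))

    spark.stop()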

Environment: Spark, Python, Sqoop, Hive, Hadoop, SQL, HBase, MapReduce, HDFS, Oozie, Agile

Confidential

Informatica Developer

Responsibilities:

  • Worked on Informatica tools -Source Analyzer, Mapping Designer, Mapplet Designer and Transformation Developer.
  • Developed mappings as per the given mapping Specs.
  • Involved in Data Extracting, Transforming and Loading the data from Source to Staging and Staging to Target according to the Business requirements.
  • Prepared Unit test cases for the mappings.
  • Validated ETL code while promoting it to other environments.

Environment: Informatica, SQL developer, Oracle, Putty, WinSCP, Flat Files, Control M
