
Sr. Data Engineer Resume


Chicago, IL

PROFESSIONAL SUMMARY:

  • 6+ years of experience as a Big Data Engineer, Data Engineer, and ETL Developer, comprising the design, development, and implementation of data models.
  • Evaluated big data technologies and prototyped solutions to improve the data processing architecture; experienced in data modeling, development, and administration of relational and NoSQL databases.
  • Good knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
  • In-depth understanding of Hadoop architecture, including YARN and components such as HDFS, Resource Manager, Node Manager, Name Node, and Data Node.
  • Experience with Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
  • Experience with Hadoop distributions such as Hortonworks and MapR.
  • Hands-on experience developing Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
  • Hands-on experience using Hadoop ecosystem components such as Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, Zookeeper, Kafka, Flume, MapReduce, YARN, Scala, and Hue.
  • Expertise in the Big Data/Hadoop ecosystem, including Apache Hive, Spark, MapReduce, Apache Kafka, Sqoop, Oozie, Zookeeper, HDFS, and YARN.
  • IT experience with Big Data technologies, Spark, and database development.
  • Experience in tuning and debugging Spark applications using Spark optimization techniques. Experience building machine learning solutions with PySpark for large data sets on the Hadoop ecosystem.
  • Good experience with the programming languages Python and Scala.
  • Good experience in developing web applications implementing the Model View Controller (MVC) architecture with Django and Flask, and in building data marts and Python web applications.
  • Good experience with Python web frameworks such as Django, Flask, and Pyramid.
  • Experience using various packages in R and Python, such as ggplot2, caret, dplyr, RWeka, rjson, plyr, SciPy, scikit-learn, PyTorch, NLTK, NumPy, OpenCV, Beautiful Soup, and Rpy2.
  • Experience with SQL and NoSQL databases such as HBase, Cassandra, and MongoDB.
  • Experienced in writing complex SQL queries, including stored procedures, triggers, joins, and subqueries.
  • Hands-on experience with big data tools such as Hadoop, Spark, Hive, Pig, Impala, PySpark, and Spark SQL.
  • Experience in writing SQL queries, stored procedures, functions, packages, tables, views, and triggers on relational databases such as Oracle, DB2, MySQL, PostgreSQL, and MS SQL Server.
  • Extensive experience in developing Bash, T-SQL, and PL/SQL scripts.
  • Good knowledge of technologies for systems that process massive amounts of data in highly distributed mode on Azure, Cloudera and Hortonworks Hadoop distributions, and Amazon AWS.
  • Knowledge of Amazon AWS and Microsoft Azure cloud services.
  • Expertise in configuring monitoring and alerting tools, such as AWS CloudWatch, according to requirements.
  • Good experience with Amazon Web Services (AWS) concepts such as EMR, S3, and EC2, which provide fast and efficient processing for Teradata and Big Data analytics.
  • Good knowledge of database creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server databases.
  • Extensive experience working with NoSQL databases and their integration: DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
  • Strong experience in the analysis, design, development, testing, and implementation of business intelligence solutions using data warehouse and data mart design, ETL pipelines, and ETL scripts written with regular expressions and tools such as Informatica, Pentaho, and SyncSort.
  • Ingested data into the Snowflake cloud data warehouse using Snowpipe. Good knowledge of tools such as Snowflake, SSIS, SSAS, and SSRS for designing warehousing applications.
  • Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
  • Strong experience working with HDFS, MapReduce, Spark, Hive, Sqoop, Flume, Kafka, Oozie, Pig, and HBase.
  • Created Airflow DAG tasks in Python to automate Sqoop data integration from a wide range of data sets (see the sketch after this summary).
  • Created data pipelines using state-of-the-art big data frameworks and tools.
  • Designed, built, and tested a Dynamic Locking Mechanism (DLM) for a six-node CentOS Linux cluster in Python and the Twisted framework, using an SQLite database to manage configurations.
  • Developed UNIX shell scripts for automating deployments and other routine tasks.
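
The following is a minimal sketch of an Airflow DAG of the kind referenced above, assuming Airflow 2.x and a BashOperator wrapping the sqoop CLI; the DAG name, JDBC connection string, table, and HDFS paths are illustrative placeholders, not actual project values.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

    # Hypothetical DAG: nightly Sqoop import of one source table into HDFS, then a sanity check.
    with DAG(
        dag_id="sqoop_orders_to_hdfs",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        sqoop_import = BashOperator(
            task_id="sqoop_import_orders",
            bash_command=(
                "sqoop import "
                "--connect jdbc:mysql://db-host:3306/sales "  # placeholder source database
                "--table orders "
                "--target-dir /data/raw/orders/{{ ds }} "
                "--num-mappers 4"
            ),
        )

        check_output = BashOperator(
            task_id="check_hdfs_output",
            bash_command="hdfs dfs -test -e /data/raw/orders/{{ ds }}/_SUCCESS",
        )

        sqoop_import >> check_output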

TECHNICAL SKILLS:

Hadoop/Big Data Technologies: Spark, Airflow, Hadoop, MapReduce, Sqoop, Hive, Oozie, Zookeeper, Cloudera Manager, Kafka, Flume

ETL Tools: Informatica, Teradata

NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB

Monitoring and Reporting: Tableau, Custom shell scripts, Apache Airflow

Hadoop Distributions: Hortonworks, Cloudera, Amazon EMR

Build Tools: Jenkins, Git

Programming & Scripting: Python, Scala, Java, SQL, Shell Scripting, C, C++

Databases: Oracle, MySQL, Teradata

Version Control: GIT, GitHub

Operating Systems: Linux, Unix, macOS, CentOS, Windows 10, Windows 8, Windows 7

Cloud Computing: AWS, Oracle, Azure

Database Modeling: ER modeling, dimensional modeling, star schema modeling, snowflake schema modeling

Visualization/Reporting: Tableau, ggplot2, Matplotlib, Power BI

EDA Tools: pandas, NumPy, SciPy

PROFESSIONAL EXPERIENCE:

Confidential, Chicago, IL

Sr. Data Engineer

Responsibilities:

  • Proficient in working with the Azure cloud platform (Data Lake, Databricks, HDInsight, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Designed and deployed data pipelines over the Data Lake using Databricks and Apache Airflow.
  • Enabled data scientists and analysts to work on machine learning solutions.
  • Worked with Azure Data Factory to integrate data from both on-premises (MySQL and Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load the data into Azure Synapse.
  • Extracted, transformed, and loaded (ETL) data from source systems to Azure data storage services using Azure Data Factory, T-SQL, Spark SQL, Azure SQL, Azure DW, and U-SQL (Azure Data Lake Analytics), and processed the data in Azure Databricks.
  • Configured Spark Streaming to receive real-time log data from Apache Flume and used Spark to store the streamed data in Azure Table storage.
  • Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
  • Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.
  • Used the Data Lake to store and perform numerous types of processing and analytics on semi-structured and structured data.
  • Ingested data into Blob Storage and processed it using Databricks. Involved in developing Spark Scala scripts and UDFs to perform transformations on large data sets.
  • Utilized the Spark Streaming API to ingest data from various sources.
  • Optimized the existing code to improve cluster performance.
  • Used Spark DataFrames to create various datasets and applied business transformations and data cleansing operations on the Databricks platform (see the PySpark sketch after this list).
  • Wrote efficient, reusable, and clean Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow.
  • Extensively used Kubernetes to handle the batch workloads that feed analytics and machine learning applications.
  • Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
  • Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS), which was used to create, configure, and manage a cluster of virtual machines.
  • Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication, and Apache Ranger for authorization.
  • Developed Scala code for processing large datasets via MapReduce jobs, compiled into JVM bytecode for data processing.
  • Experienced in tuning memory, batch interval time, and level of parallelism to improve processing time and efficiency.
  • Proficient in utilizing data for interactive Power BI dashboards and reporting based on business requirements.
  • Environment: Azure HDInsight, Databricks (ADBX), Data Lake (ADLS), Cosmos DB, MySQL, Snowflake, MongoDB, Teradata, Ambari, Flume, VSTS, Azure DevOps, Ranger, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Spark v2.0.2, Airflow, Hive, Sqoop, HBase.
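
A minimal PySpark sketch of the kind of DataFrame cleansing described above, as it might run on Databricks against Data Lake storage; the ABFSS paths, container names, and column names are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # On Databricks a SparkSession already exists; getOrCreate() reuses it.
    spark = SparkSession.builder.appName("event_cleansing").getOrCreate()

    # Hypothetical raw zone in ADLS Gen2 (ABFSS URI and schema are placeholders).
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("abfss://raw@examplelake.dfs.core.windows.net/events/"))

    # Typical cleansing/business transformations: dedupe, drop bad rows, derive columns.
    cleansed = (raw
                .dropDuplicates(["event_id"])
                .filter(F.col("event_ts").isNotNull())
                .withColumn("event_date", F.to_date("event_ts"))
                .withColumn("amount", F.col("amount").cast("double")))

    # Write the curated dataset back to the lake, partitioned for downstream queries.
    (cleansed.write
     .mode("overwrite")
     .partitionBy("event_date")
     .parquet("abfss://curated@examplelake.dfs.core.windows.net/events/"))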

Confidential, Atlanta, GA

Data Engineer

Responsibilities:

  • Extensive experience working with the AWS cloud platform (EC2, S3, EMR, Redshift, Lambda, and Glue).
  • Worked with the Data and Analytics team to architect and build a data lake using various AWS services such as Athena, Glue, EMR, Redshift, Hive, and Airflow.
  • Evaluated the current architecture and designed a new one allowing unlimited horizontal scalability.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Improved the performance and optimization of existing algorithms in Hadoop using Spark context, Spark SQL, DataFrames, Pair RDDs, and YARN.
  • Migrated workloads from legacy mainframes to the Hadoop stack.
  • Developed Spark applications using Python and R, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Deployed data pipelines with a CI/CD process using Jenkins and Ansible.
  • Implemented Python code for different tasks and time sensors for each job, for workflow management and automation with the Airflow tool.
  • Migrated code to the Spark execution engine for performance and optimization of existing applications in Hadoop.
  • Used Spark Streaming APIs to perform data transformations and load data in real time.
  • Created a data model that receives data from Kafka in real time and persists it to Cassandra.
  • Developed a Kafka consumer API in Python for consuming data from topics (see the consumer sketch after this list).
  • Parsed Extensible Markup Language (XML) messages received via Kafka and processed the XML files to capture real-time User Interface (UI) updates.
  • Loaded data into S3 buckets using PySpark and AWS Glue. Filtered data stored in S3 buckets using Elasticsearch and loaded the data into Hive external tables.
  • Configured Snowpipe to facilitate data pulls from S3 buckets into Snowflake tables.
  • Staged API and Kafka data (in JSON format) into Snowflake by flattening it for different functional services.
  • Stored incoming data in the Snowflake staging area.
  • Worked on Amazon Redshift for migrating on-premises data warehouses.
  • Developed Kibana dashboards based on Logstash data and integrated combinations of source and target systems into Elasticsearch for near real-time analytical monitoring of end-to-end transactions.
  • Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
  • Good understanding of Apache Cassandra architecture, replication strategies, gossip, snitches, etc.
  • Involved in the design of column families in Cassandra; ingested data from RDBMS, performed data transformations, and exported the data to Cassandra per business requirements.
  • Experienced in using Parquet, Avro, and JSON file formats; developed UDFs in Hive.
  • Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS.
  • Developed Oozie coordinators to schedule Hive scripts and create data pipelines.
  • Performed on-cluster testing of HDFS, Hive, Pig, and MapReduce to facilitate access for new users.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Provided technical mentoring to junior team members, performing code and design reviews and enforcing coding standards and best practices.
  • Environment: AWS (EC2, S3, EMR, Redshift, Lambda, Glue, SageMaker), Spark, Spark SQL, MapR, HDFS, Hive, Apache Kafka, Sqoop, Python, PySpark, Shell scripting, Linux, MySQL, Oracle Enterprise DB, Eclipse, Jenkins, Oracle, Git, Tableau, SOAP, Cassandra, Avro, and Agile methodologies.
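
A minimal sketch of a Python Kafka consumer of the kind referenced above, assuming the kafka-python client; the topic name, broker addresses, and group id are hypothetical placeholders, and the downstream handling is indicated only by a comment.

    import json

    from kafka import KafkaConsumer  # kafka-python client

    # Hypothetical topic/brokers; JSON payloads are deserialized as they are read.
    consumer = KafkaConsumer(
        "ui-updates",
        bootstrap_servers=["broker1:9092", "broker2:9092"],
        group_id="ui-update-consumers",
        auto_offset_reset="earliest",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    for message in consumer:
        event = message.value  # parsed JSON dict
        # Downstream handling (e.g., staging to S3/Snowflake or updating the UI feed) would go here.
        print(message.topic, message.partition, message.offset, event.get("event_type"))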

Confidential

Data Engineering Analyst

Responsibilities:

  • The Enterprise Insurance data warehouse is a conversion project that migrates existing data marts into an integrated, corporate-wide data warehouse. It involves enhancing existing data marts and adding new subject areas, giving business users a platform to query across subject areas using a single OLAP tool (Cognos).
  • Created the mapping design document for transferring data from the source systems to the data warehouse and built the ETL pipeline, which made analysts' jobs easier and reduced patient expenses.
  • Developed Informatica mappings, sessions, worklets, and workflows.
  • Responsible for profiling the Thunder Token server application using Golang, YAML, Ethereum, Prometheus, Grafana, Ansible, AWS EC2, S3, and Terraform.
  • Wrote shell scripts to monitor load on the database and Perl scripts to format data extracted from the data warehouse based on user requirements.
  • Implemented View Model patterns in creating and managing views, partial views, View Models, and Web APIs using ASP.NET MVC.
  • Performed EDA on raw data and identified anomalies. Created a curation layer for ML usage after transforming tables with Python data manipulation modules (see the pandas sketch at the end of this section).
  • Evaluated multiple EDA tools, including Python Pandas.
  • Designed, developed, and delivered jobs and transformations over the data to enrich it and progressively promote it for consumption in the delta lake layer.
  • Used Sqoop to import data from RDBMS into the Hadoop Distributed File System (HDFS) and later analyzed the imported data using Hadoop.
  • Performed network traffic analysis using data mining, the Hadoop ecosystem (MapReduce, HDFS, Hive), and visualization tools, considering raw packet data, network flow, and Intrusion Detection Systems (IDS). Developed frontend and backend modules in Python on Django, including the Tastypie web framework, using Git.
  • Involved in building database models, APIs, and views using Python to build interactive web-based solutions.
  • Managed large datasets using pandas DataFrames and MySQL.
  • Developed merge jobs in Python to extract and load data into a MySQL database, and worked on the MySQL data lineage process to map source-to-target DB, table, and column mappings using OO Python.
  • Experience analyzing data with Python libraries including pandas, NumPy, SciPy, and Matplotlib.
  • Collected, analyzed, and interpreted patterns within client data to make suggestions about the sales and marketing approach, using tools like Power BI and Tableau.
  • Created a chatbot to receive complaints from customers and give them an estimated waiting time to resolve the issue.

Environment: Python, pandas, NumPy, SciPy, R, AWS EMR, Apache Spark, Hadoop ecosystem (MapReduce, HDFS, Hive), Scala, LogRhythm, OpenVAS, Informatica, Ubuntu, Tableau, Power BI, Matplotlib
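
A small pandas sketch of the kind of EDA and curation-layer step described above; the file names, column names, and the z-score outlier rule are hypothetical illustrations, and writing Parquet assumes a Parquet engine such as pyarrow is installed.

    import pandas as pd

    # Hypothetical raw extract; the real source tables are not named here.
    raw = pd.read_csv("claims_raw.csv", parse_dates=["claim_date"])

    # Basic profiling to surface anomalies: summary stats and missing-value rates.
    print(raw.describe(include="all"))
    print(raw.isna().mean().sort_values(ascending=False).head(10))

    # Flag outliers with a simple z-score rule, then build a cleaned curation layer for ML usage.
    zscore = (raw["claim_amount"] - raw["claim_amount"].mean()) / raw["claim_amount"].std()
    curated = (raw[zscore.abs() <= 3]
               .drop_duplicates(subset=["claim_id"])
               .assign(claim_month=lambda df: df["claim_date"].dt.to_period("M").astype(str)))

    # Persist the curated table for downstream modeling.
    curated.to_parquet("curated_claims.parquet", index=False)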

Confidential

ETL Developer

Responsibilities:

  • Extensively used the Informatica client tools: PowerCenter Designer, Workflow Manager, Workflow Monitor, and Repository Manager.
  • Extracted data from various heterogeneous sources such as Oracle and flat files.
  • Designed Informatica mappings (PowerCenter) by translating the business requirements.
  • Extracted data from Oracle, flat files, and Excel files, and applied complex Joiner, Expression, Aggregator, Lookup, Stored Procedure, Filter, Router, and Update Strategy transformations to load data into the target systems.
  • Created Sessions, Tasks, Workflows and Worklets using Workflow manager.
  • Worked with the data modeler in developing star schemas.
  • Involved in analyzing the existence of the source feed in existing CSDR database.
  • Handled a high volume of day-to-day Informatica workflow migrations.
  • Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed.
  • Created new repositories from scratch and performed backup and restore.
  • Worked with groups, roles, and privileges and assigned them to each user group.
  • Knowledge of code change migration from Dev to QA and from QA to Production.
  • Worked on SQL queries against the repository DB to find deviations from the company's ETL standards for objects created by users, such as sources, targets, transformations, log files, mappings, sessions, and workflows.
  • Used pre-session and post-session commands to send e-mail to various business users through the Workflow Manager.
  • Leveraged the existing PL/SQL scripts for the daily ETL operations.
  • Ensured that all support requests were properly approved, documented, and communicated using the MQC tool; documented common issues and resolution procedures.
  • Extensively involved in enhancing and managing UNIX shell scripts.
  • Involved in converting the business requirements into the technical design document.
  • Documented the macro logic and worked closely with the Business Analyst to prepare the BRD.
  • Involved in setting up SFTP with the internal bank management.
  • Built UNIX scripts for cleaning up the source files.
  • Involved in loading all the sample source data using SQL*Loader and scripts.
  • Created Informatica workflows to load the source data into CSDR.
  • Involved in creating various UNIX scripts used during the ETL load process.
  • Periodically cleaned up Informatica repositories.
  • Monitored the daily load and shared the load statistics with the QA team.
  • Environment: Informatica, LoadRunner 8.x, HP QC 10/11, Toad, SQL, PL/SQL, MQS, UNIX, BRD, CSDR.
