
Senior Data Engineer Resume


MI

SUMMARY

  • Overall, 10+ years of strong experience as a Data Engineer and Data Modeler, including Requirements Analysis, Design Specification, and Testing across the full development life cycle in both Waterfall and Agile methodologies.
  • Excellent programming skills with experience in Java, PL/SQL, SQL, Scala, C++, and Python.
  • Strong experience using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
  • Involved in developing Hive DDLs to create, alter, and drop Hive tables.
  • Extensively used Alteryx for data creation, data blending and creation of data models.
  • Hands-on experience in writing MapReduce programs in Java to handle different data sets using Map and Reduce tasks.
  • Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake Modeling for Fact and Dimension tables) using Analysis Services.
  • Extensively involved in designing and developing the Power BI Data model using multiple DAX expressions to build calculated columns and calculated measures.
  • Extensive knowledge of writing Hadoop jobs for data analysis per business requirements using Hive; worked on HiveQL queries for data extraction, join operations, and custom UDFs, with good experience optimizing Hive queries.
  • Good understanding of distributed systems, HDFS architecture, Internal working details of MapReduce and Spark processing frameworks.
  • Experience with ETL concepts using Informatica PowerCenter and Ab Initio.
  • Experience in importing and exporting data using Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
  • Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
  • Experience in using Kafka and Kafka brokers to initialize the Spark context and process live streaming data.
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis.
  • Excellent understanding and knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
  • Work with client teams to design and implement modern, scalable data solutions using a range of new and emerging technologies from the Google Cloud Platform.
  • Hands-on experience with data ingestion tools such as Kafka and Flume, and workflow management tools such as Oozie.
  • Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data, and used DataFrame operations to perform the required validations on the data (a minimal sketch follows this summary).
  • Strong experience with the Amazon Web Services (AWS) cloud platform, including services like EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, and Security Groups.
  • Good working experience on Spark (spark streaming, spark SQL) with Scala and Kafka. Worked on reading multiple data formats on HDFS using Scala.
  • Good understanding and knowledge of NoSQL databases like MongoDB, HBase, and Cassandra, as well as PostgreSQL.
  • Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro, JSON, and XML files. Proficient with columnar file formats like RC, ORC, and Parquet, with a good understanding of compression techniques used in Hadoop processing such as Gzip, Snappy, and LZO.
  • Experience in Microsoft Azure/Cloud Services like SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory.
  • Experience in extracting files from MongoDB through Sqoop, placing them in HDFS, and processing them.
  • Extensive working experience with RDBMSs such as Oracle, Microsoft SQL Server, and MySQL, and with NoSQL databases like HBase, MongoDB, and Cassandra.
  • Extensive experience working on various databases and database script development using SQL and PL/SQL.
  • Worked in various programming languages using IDEs like Eclipse, NetBeans, and IntelliJ, along with tools such as PuTTY and Git.
  • Very strong interpersonal skills and the ability to work both independently and in a group; a quick learner who adapts easily to new working environments.
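
To illustrate the Hive-to-Spark conversion and DataFrame validation work summarized above, here is a minimal PySpark sketch; it is a generic example rather than code from any specific engagement, and the table and column names (sales.transactions, analytics.customer_totals, customer_id, amount, txn_date) are hypothetical.

```python
# Minimal PySpark sketch: a HiveQL aggregation rewritten as DataFrame
# transformations, plus a simple DataFrame-level validation.
# Hypothetical table and column names; assumes a configured Hive metastore.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-dataframe-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Equivalent of:
#   SELECT customer_id, SUM(amount) AS total_amount
#   FROM sales.transactions
#   WHERE txn_date >= '2021-01-01'
#   GROUP BY customer_id
txns = spark.table("sales.transactions")
totals = (
    txns.filter(F.col("txn_date") >= "2021-01-01")
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"))
)

# DataFrame-level validation: no null keys, no negative totals.
bad_rows = totals.filter(
    F.col("customer_id").isNull() | (F.col("total_amount") < 0)
).count()
if bad_rows > 0:
    raise ValueError(f"Validation failed: {bad_rows} invalid aggregate rows")

totals.write.mode("overwrite").saveAsTable("analytics.customer_totals")
```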

TECHNICAL SKILLS

Hadoop Ecosystem: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Programming Languages: Java, PL/SQL, SQL, Python, Scala, Spark, C, C++, Go

Cluster Mgmt. & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5

Databases: MySQL, SQL Server, Oracle, MS Access

NoSQL Databases: MongoDB, Cassandra, HBase

Workflow mgmt. tools: Oozie, Apache Airflow

Visualization & ETL tools: Tableau, Banana UI, D3.js, Informatica, Talend

Cloud Technologies: Azure, AWS

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Version Control Systems: Git, SVN

Operating Systems: Unix, Linux, Windows

PROFESSIONAL EXPERIENCE

Confidential, MI

Senior Data Engineer

Responsibilities:

  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics (a minimal sketch follows this list).
  • Developed Scala based Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
  • Wrote Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
  • Streamed real time data by integrating Kafka with Spark for dynamic price surging using machine learning algorithm.
  • Developed and programmed an ETL pipeline in Python to collect data from the Redshift data warehouse.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
  • Optimizing and tuning the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Responsible for designing, building, and testing workflows in Alteryx.
  • Used Alteryx for data preparation and then Tableau for visualization and reporting.
  • Processed data in Alteryx to create TDEs (Tableau Data Extracts) for Tableau reporting.
  • Proficient in the development and administration of Talend Open Studio and Talend Server.
  • Built jobs in Talend by dragging and dropping components from the palette of ETL components.
  • Developed custom multi-threaded Java based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and data warehouses.
  • Migrated data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.
  • Used Databricks for encrypting data using server-side encryption.
  • Working on Docker Hub, Docker Swarm, Docker Container network, creating Image files primarily for middleware installations & domain configurations. Evaluated Kubernetes for Docker Container Orchestration.
  • Used Delta Lake as it is an open-source data storage layer which delivers reliability to data lakes.
  • Experience with Snowflake Multi-Cluster Warehouses.
  • Experience with Snowflake Clone and Time Travel.
  • Involved in migrating objects from Teradata to Snowflake.
  • Developed data warehouse models in Snowflake for multiple data sets using WhereScape.
  • Involved in testing Snowflake to understand best possible way to use the cloud resources.
  • Experience integrating Jenkins with tools like Maven (build), Git (repository), SonarQube (code verification), and Nexus (artifact repository); implemented CI/CD automation by creating Jenkins pipelines programmatically, architecting Jenkins clusters, and scheduling day and overnight builds to support development needs.
  • Worked on fine-tuning spark applications to improve the overall processing time for the pipelines.
  • Responsible for ingesting large volumes of IoT data to Kafka.
  • Used Reporting tools like Tableau to connect with Impala for generating daily reports of data.
  • Involved in troubleshooting, performance tuning of reports, and resolving issues within Tableau Server and reports.
  • Used Informatica software to design and maintain data storage systems.
  • Installed Docker Registry for local upload and download of Docker images to and from Docker Hub, and created Dockerfiles to automate the process of capturing and using the images.
  • Programmatically created CI/CD pipelines in Jenkins using Groovy scripts and Jenkinsfiles, integrating a variety of enterprise tools and testing frameworks into Jenkins for fully automated pipelines that move code from dev workstations all the way to the prod environment.
  • Experience with AWS cloud services such as EC2, S3, RDS, ELB, EBS, VPC, Route 53, Auto Scaling groups, CloudWatch, CloudFront, and IAM to build, configure, and troubleshoot server migrations from physical hardware to the cloud on various Amazon Machine Images (AMIs).
  • Worked on troubleshooting Spark applications to make them more error tolerant.
  • Documented operational problems by following standards and procedures using JIRA.
  • Experience with Snowflake Virtual Warehouses.
  • Reviewing and Managing IBM reporting.
  • Developed Python scripts using REST APIs to transfer and extract data from on-premises systems to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
  • Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Worked extensively with Sqoop for importing data from Oracle.
  • Involved in creating Hive tables, loading, and analyzing data using Hive scripts. Implemented Partitioning, Dynamic Partitions, and Buckets in Hive.
  • Leveraged AWS cloud services such as EC2, auto-scaling and VPC to build secure, highly scalable and flexible systems that handled expected and unexpected load bursts.
  • Responsible for ingesting large volumes of user behavioral data and customer profile data to Analytics Data store.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process. Involved in writing Python scripts to automate the extraction of weblogs using Airflow DAGs (a DAG sketch follows this list).
  • Worked on SSIS, creating all the interfaces between the front-end application and the SQL Server database, and then between the legacy database and the SQL Server database and vice versa.
  • Used control flow tasks and container as well as Transformations in a complex design to build an algorithm to cleanse and consolidate data.
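
Below is a hedged sketch of the REST-to-Kafka producer pattern described in the first bullet of this list. It assumes the requests and kafka-python packages; the broker addresses, endpoint URL, topic name, and polling interval are hypothetical placeholders rather than project details.

```python
# Sketch of a producer that polls an external REST API and publishes the
# returned events to a Kafka topic. All endpoints/names are hypothetical.
import json
import time

import requests
from kafka import KafkaProducer  # kafka-python package (assumed)

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

API_URL = "https://api.example.com/events"  # hypothetical REST endpoint
TOPIC = "external-events"                   # hypothetical Kafka topic

while True:
    resp = requests.get(API_URL, timeout=10)
    resp.raise_for_status()
    for event in resp.json():               # assumes the API returns a JSON list
        producer.send(TOPIC, value=event)
    producer.flush()                         # block until queued records are sent
    time.sleep(30)                           # poll interval
```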
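
And here is a minimal Apache Airflow 2.x DAG sketch of the scheduled weblog-extraction pattern mentioned above; the DAG id, schedule, and extraction logic are illustrative assumptions only, not the actual production workflow.

```python
# Sketch of a daily Airflow DAG with a single Python task that would stage
# weblogs for downstream processing. Names and schedule are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_weblogs(**context):
    # Placeholder for real extraction logic, e.g. pulling rotated web server
    # logs from a landing directory and staging them to HDFS or S3.
    print("extracting weblogs for", context["ds"])


with DAG(
    dag_id="weblog_extraction",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_weblogs",
        python_callable=extract_weblogs,
    )
```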

Environment: Kafka, HBase, Docker, Kubernetes, AWS, EC2, S3, Lambda, Informatica, CloudWatch, Auto Scaling, EMR, Redshift, Jenkins, ETL, Spark, Hive, Athena, Sqoop, Pig, Oozie, Spark Streaming, Hue, Scala, Python, Databricks, Apache NiFi, Git, Microservices, Snowflake

Confidential, Cincinnati, Ohio

Data Engineer

Responsibilities:

  • Designed solutions to process high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools: Hive, Pig, Sqoop, Kafka, Python, Spark, Scala, NoSQL, NiFi, and Druid.
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
  • Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
  • Measured Efficiency of Hadoop/Hive environment ensuring SLA is met.
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Defined facts, dimensions and designed the data marts using the Ralph Kimball's Dimensional Data Mart modelling methodology using Erwin.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
  • Experience with a big data integration platform that delivers high-scale, in-memory fast data processing as part of the Talend Data Fabric solution, enabling the enterprise to turn more and more data into real-time decisions.
  • Experience in IBM DataStage system engineering on LINUX platform.
  • Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
  • Solved challenges using Python as an ETL tool, in contrast to ETL software such as Talend.
  • Advanced knowledge of Confidential Redshift and MPP database concepts.
  • Designing and building multi-terabyte, full end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day at large scale.
  • Strong understanding of AWS components such as EC2, S3, Lambda, Auto Scaling, Cloud Watch, Cloud Formation, Security groups and IAM.
  • Designed and implemented big data ingestion pipelines to ingest multi-TB data from various data sources using Kafka and Spark Streaming, including data quality checks and transformations, storing the results in efficient storage formats. Performed data wrangling on multi-terabyte data sets from various data sources for a variety of downstream purposes such as analytics using Spark.
  • Experience with Snowflake Multi-Cluster Warehouses.
  • Experience with Snowflake Clone and Time Travel.
  • Optimizing and tuning the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Developed data warehousing systems using Informatica tools.
  • Worked on big data using AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
  • Migrated on premise database structure to Confidential Redshift data warehouse.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Connected to Amazon Redshift through Tableau to extract live data for real time analysis.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
  • Implemented exception handling in Python to add logging to the application.
  • Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems.
  • Compiled data from various sources to perform complex analysis for actionable results.
  • Built performant, scalable ETL processes to load, cleanse, and validate data (a minimal sketch follows this list).
  • Analyzed the existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
  • Implementing and Managing ETL solutions and automating operational processes.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created various complex SSIS/ETL packages to Extract, Transform and Load data.
  • Collaborated with team members and stakeholders in the design and development of the data environment.
  • Preparing associated documentation for specifications, requirements, and testing
  • Optimized the TensorFlow Model for efficiency.
  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies.
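
As an illustration of the Redshift load-and-validate ETL steps listed above, here is a minimal Python sketch assuming the psycopg2 package and a COPY-from-S3 pattern; the cluster endpoint, credentials, schema, table, bucket, and IAM role are hypothetical placeholders.

```python
# Sketch of one Redshift load step: COPY Parquet files from S3 into a staging
# table, then fail the job if nothing was loaded. Identifiers are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="********",
)

COPY_SQL = """
    COPY staging.orders
    FROM 's3://example-bucket/orders/2021-06-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:   # connection context manager commits on success
    cur.execute(COPY_SQL)
    cur.execute("SELECT COUNT(*) FROM staging.orders;")
    (row_count,) = cur.fetchone()
    if row_count == 0:
        raise RuntimeError("Redshift COPY loaded zero rows")

conn.close()
```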

Environment: Oracle, Kafka, Python, Redshift, Informatica, AWS, EC2, S3, SQL Server, Erwin, RDS, NoSQL, Snowflake Schema, MySQL, DynamoDB, PostgreSQL, Tableau, GitHub

Confidential, San Diego, CA

GCP Data Engineer

Responsibilities:

  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
  • Hands-on experience with BigQuery, architecting the ETL transformations and writing Spark jobs to do the processing.
  • Used GCP Cloud Dataflow, Pub/Sub, Cloud Shell, GCS buckets, and BQ command-line utilities to build and deliver data solutions using GCP products and offerings (a minimal BigQuery load sketch follows this list).
  • Applied GCP data engineering experience when communicating with clients on their requirements, turning these into technical data solutions.
  • Deployed the initial Azure components like Azure Virtual Networks, Azure Application Gateway, Azure Storage and Affinity groups.
  • Developed data pipeline using Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Delivered denormalized data for Power BI consumers for modeling and visualization from the produced layer in the data lake.
  • Wrote a Kafka REST API to collect events from the front end.
  • Involved in creating an HDInsight cluster in the Microsoft Azure Portal; also created Event Hubs and Azure SQL databases.
  • Tested Apache TEZ, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Exposed transformed data in the Azure Databricks Spark platform as Parquet format for efficient data storage.
  • Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
  • Involved in running all the Hive scripts through Hive, Hive on Spark, and some through Spark SQL.
  • Migrated data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.
  • Worked on product positioning and messaging that differentiate Hortonworks in the open-source space.
  • Experience in designing and developing applications leveraging MongoDB.
  • Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
  • Involved in complete big data flow of the application starting from data ingestion from upstream to HDFS, processing and analyzing the data in HDFS.
  • Troubleshot Azure development, configuration, and performance issues.
  • Interacted with multiple teams who are responsible for Azure Platform to fix the Azure Platform Bugs.
  • Provided 24/7 on-call support for Azure configuration and performance issues.
  • Created Partitioned and Bucketed Hive tables in Parquet File Formats with Snappy compression and then loaded data into Parquet hive tables from Avro hive tables.
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers.
  • Used Jira for bug tracking and Bitbucket to check-in and checkout code changes.
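
The following is a minimal sketch of a GCS-to-BigQuery Parquet load of the sort referenced above, assuming the google-cloud-bigquery client library; the project, dataset, table, and bucket names are hypothetical.

```python
# Sketch of loading Parquet files from a GCS bucket into a BigQuery table.
# Project/dataset/table/bucket names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

table_id = "example-project.analytics.customer_events"
gcs_uri = "gs://example-bucket/events/2021-06-01/*.parquet"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```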

Environment: Scala, Azure, HDFS, YARN, MapReduce, Hive, Sqoop, Flume, Oozie, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX Shell Scripting.

Confidential

Data Engineer

Responsibilities:

  • Experience creating and organizing HDFS over a staging area.
  • Imported Legacy data from SQL Server and Teradata into Amazon S3
  • As part of data migration, wrote many SQL scripts for data mismatches and worked on loading the history data from Teradata SQL to Snowflake.
  • Wrote Python code to manipulate and organize data frames so that all attributes in each field were formatted identically (a minimal pandas sketch follows this list).
  • Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
  • Utilized Pandas to create a data frame.
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
  • Created bash scripts and all other XML configurations to automate the deployment of Hadoop VMs over AWS EMR.
  • Developed a raw layer of external tables within S3 containing copied data from HDFS.
  • Created a data service layer of internal tables in Hive for data manipulation and organization.
  • Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
  • Compared the data in a leaf-level process across various databases when data transformation or data loading took place, analyzing and investigating data quality after these loads to check for any data loss or corruption.
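
Below is a minimal pandas sketch of the kind of column-format normalization described in this list; the column names and sample values are hypothetical and chosen only to mirror the provider-data fields mentioned above.

```python
# Sketch: normalize each column so every value in a field shares one format.
# Column names and sample values are hypothetical.
import pandas as pd

df = pd.DataFrame(
    {
        "provider_name": [" Acme Health ", "beta care", "GAMMA MED"],
        "phone": ["(313) 555-0100", "313.555.0101", "3135550102"],
        "npi": ["1234567890", " 2345678901", "3456789012 "],
    }
)

# Standardize casing/whitespace and strip phone numbers to digits only.
df["provider_name"] = df["provider_name"].str.strip().str.title()
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)
df["npi"] = df["npi"].str.strip()

print(df)
```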

Environment: HDFS, AWS, SSIS, Snowflake, Hadoop, Hive, HBase, MapReduce, Spark, Sqoop, Pandas, MySQL, SQL Server, PostgreSQL, Teradata, Java, Unix, Python, Tableau, Oozie, Git.
