Data Engineer Resume
SUMMARY
- Around 7 years of professional experience as an ETL and Big Data Developer, with expertise in Python, Hadoop, Spark, and related technologies.
- Experienced in the development, implementation, deployment, and maintenance of complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
- Experienced in extending Hive and Pig core functionality by writing custom UDFs and MapReduce jobs in Python.
- Good working knowledge with Data Warehousing and ETL processes.
- Acquired profound knowledge in developing production ready Spark applications utilizing Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets and Spark-ML.
- Profound experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.
- Worked on NoSQL databases including HBase, Cassandra, and MongoDB.
- Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions - Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
- In-depth understanding of Hadoop architecture and its components, such as HDFS, YARN, Resource Manager, Node Manager, Job History Server, Job Tracker, Task Tracker, Name Node, Data Node, MapReduce, and Spark.
- Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near-real-time dashboards.
- Good knowledge of building and maintaining highly scalable and fault-tolerant infrastructure in AWS environments spanning multiple availability zones.
- Expertise in working with AWS cloud services such as EMR, S3, Redshift, Lambda, DynamoDB, RDS, SNS, SQS, Glue, Data Pipeline, and Athena for big data development.
- Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet.
- Experienced in setting up Apache NiFi and performing a POC with NiFi to orchestrate a data pipeline for data ingestion.
- Developed ETL solution for GCP Migration using GCP Dataflow, GCP Composer, Apache Airflow and GCP BigQuery.
- Experience working on creating and running Docker images with multiple microservices.
- Excellent technical and analytical skills with a clear understanding of the design goals of Entity-Relationship modeling for OLTP and dimension modeling for OLAP.
- Experienced in orchestrating, scheduling, and monitoring jobs with tools such as cron, Oozie, and Airflow.
- Expertise in writing DDLs and DMLs scripts in SQL and HQL for analytics applications in RDBMS.
- Expertise in Python scripting and shell scripting.
- Proficient in Tableau to analyze and obtain insights into large datasets, create visually powerful and actionable interactive reports and dashboards.
- Experience in infrastructure automation using Chef and Docker.
- Involved in all the phases of Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
- Experience in all phases of Data Warehouse development like requirements gathering, design, development, implementation, testing, and documentation.
- Solid knowledge of dimensional data modeling with Star and Snowflake schemas for fact and dimension tables using Analysis Services.
- Experienced with continuous integration and build tools such as Jenkins, and with Git and SVN for version control.
- Proficient knowledge of Data Analytics, Machine Learning (ML), Predictive Modeling, Natural Language Processing (NLP), and Deep Learning algorithms.
- A Data Science enthusiast with strong Problem solving, Debugging and Analytical capabilities, who actively engages in understanding and delivering business requirements.
- Strong working knowledge across the technology stack including ETL, data analysis, metadata, data quality, audit and design.
- Quick learner and excellent team player with the ability to meet tight deadlines and work under pressure.
- Strong time management skills with the ability to participate in multiple projects simultaneously, as well as strong analytical and problem-solving skills.
TECHNICAL SKILLS
Big Data Eco System: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka Connect, Impala, StreamSets, Oozie, Airflow, Zookeeper, NiFi, Amazon Web Services.
Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP
Languages: Python, Scala, Java, R, Pig Latin, HiveQL, Shell Scripting.
Software Methodologies: Agile, Waterfall (SDLC).
IDEs: Eclipse, NetBeans, IntelliJ IDEA, Spring Tool Suite.
Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, MS SQL SERVER, Snowflake.
NoSQL: HBase, MongoDB, Cassandra.
ETL/BI: Power BI, Tableau, Talend, Informatica.
Version control: GIT, SVN, Bitbucket.
Web Development: JavaScript, Node.js, HTML, CSS, Spring, JDBC, Angular, Hibernate, Tomcat.
Operating Systems: Windows (XP/7/8/10), Linux (Unix, Ubuntu), Mac OS.
Cloud Technologies: Amazon Web Services (EC2, S3, SQS, SNS, Lambda, EMR, CodeBuild, CloudWatch); Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory).
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential
Responsibilities:
- Migrated terabytes of data from the legacy data warehouse into the cloud environment in an incremental format.
- Worked on creating data pipelines with Airflow to schedule PySpark jobs for performing incremental loads and used Flume for weblog server data. Created Airflow Scheduling scripts in Python.
- Developed streaming applications using PySpark and Kafka to read messages from Kafka queues on AWS and write the JSON data to AWS S3 buckets (a minimal sketch appears at the end of this list).
- Designed and developed ETL processes in AWS Glue to migrate codes data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Assessed existing EDW (enterprise data warehouse) technologies and methods to ensure our EDW/BI architecture meets the needs of the business and enterprise and allows for business growth.
- Developed tools using Python, shell scripting, and XML to automate routine tasks.
- Developed streaming and batch processing applications using PySpark to ingest data from various sources into the HDFS data lake.
- Developed the back-end web services using Python and Django REST framework.
- Developed Spark jobs on Databricks to perform tasks such as data cleansing, data validation, and standardization, and then applied transformations per the use cases.
- Developed a high-speed BI layer on the Hadoop platform with Apache Spark, Java, and Python.
- Performed data cleansing and applied transformations using Databricks and Spark for data analysis.
- Extensively used Databricks notebooks for interactive analytics using Spark APIs.
- Wrote Hive UDFs to implement custom aggregation functions in Hive.
- Worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes.
- Processed schema-oriented and non-schema-oriented data using Scala and Spark.
- Provided architecture and design as the product was migrated to Scala, the Play framework, and Sencha UI.
- Developed DDL and DML scripts in SQL and HQL for analytics applications in RDBMS and Hive.
- Used the Oozie scheduler to automate pipeline workflows and orchestrate the MapReduce extraction jobs, and ZooKeeper to provide coordination services to the cluster.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Worked on big data integration and analytics based on Hadoop, SOLR, Spark, Kafka, Storm, and webMethods technologies.
- Provided data validation through SQL queries and UNIX commands to perform back-end testing; performed different levels of testing, including integration testing and system testing.
- Performed back-end and database testing by creating and executing SQL queries to check data validation and data integrity.
- Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, MapReduce, Pig, and Hive.
- Developed Spark programs with Scala and applied principles of functional programming to process the complex unstructured and structured data sets. Processed the data with Spark from Hadoop Distributed File System.
- Developed data pipelines using Apache Spark and other ETL solutions to get data from various applications like CAT and MyBlue to the central warehouse.
- Developed reusable ETL pipelines and defined strategy to extract Industry codes data from different vendors and build ETL Logic on top of them to feed the data to the central data warehouse.
- Involved in daily operational activities to troubleshoot ad-hoc production and data issues and to enhance infrastructure in the big data and AWS cloud space, providing better solutions to resolve existing issues.
- Performed data analysis and data profiling using advanced SQL queries.
- Developed and maintained ETL mappings using Informatica Designer to extract data from multiple source systems (SQL Server databases and flat files) into the staging area, the EDW, and then the data marts.
- Developed ETL procedures that comply with standards, avoid redundancy, and translate business rules and functional requirements into ETL logic.
- Performed Impact analysis on the downstream due to the changes to the existing mappings and provided the feedback.
- Participated in Integration regression testing and End-to-End testing as part of the Highmark Integration and Codeset projects to avoid impacts after Production Migration.
- Provided guidance and leadership to a newly onboarded offshore team.
- Participated in project management and provided project estimates for offshore and on-site development team efforts.
- Coordinated and monitored the project progress to ensure the timely flow and complete delivery of the project.
- Provided updates to management and the business to keep them informed of any system issues.
- Involved in documenting all the changes that took place throughout the development of the database and created a repository for further analysis and future reference.
- Used Informatica as the ETL tool to pull data from source systems/files, then cleanse, transform, and load the data into Teradata using Teradata utilities.
- Used multiple Informatica transformations (Source Qualifier, Lookup, Router, Update Strategy) to create SCD-type mappings that capture changes in loan-related data in a timely manner.
- Designed and implemented disaster recovery for the PostgreSQL database.
- Planned PostgreSQL backup and recovery through physical and logical backups (pg_dump/pg_restore).
- Responsible for all backup, recovery, and upgrading of the PostgreSQL databases.
- Installed and monitored PostgreSQL databases using standard monitoring tools such as Nagios.
- Involved in tuning the PySpark applications using various memory and resource allocation parameters, setting the right Batch Interval time, and varying the number of executors to meet the increasing load over time.
- Deployed Spark and Hadoop jobs on the EMR cluster.
- Set up full CI/CD pipelines so that each developer commit goes through the standard software lifecycle and is tested thoroughly before it reaches production.
- Helped individual teams set up their repositories in Bitbucket, maintain their code, and configure jobs that use the CI/CD environment.
- Developed a POC for project migration from the on-prem Hadoop MapR system to GCP/Snowflake.
- Hands-on experience implementing, building, and deploying CI/CD pipelines, including tracking multiple deployments across multiple pipeline stages.
- Involved in all the phases of Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
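A minimal, illustrative sketch of the PySpark/Kafka streaming pattern described above (reading Kafka messages and landing JSON on S3). The broker address, topic name, event schema, and S3 paths are placeholder assumptions, not project values.

```python
# Minimal PySpark Structured Streaming sketch: Kafka JSON -> S3 (illustrative names and paths).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3-stream").getOrCreate()

# Hypothetical event schema; the real payload schema comes from the upstream producers.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
    .option("subscribe", "web_events")                    # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the message body as bytes; cast to string and parse the JSON payload.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Write the parsed records to S3 as partitioned JSON, with checkpointing for recovery.
query = (
    events.writeStream.format("json")
    .option("path", "s3a://example-bucket/streams/web_events/")              # placeholder bucket
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/web_events/")
    .partitionBy("event_type")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```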
Environment: Healthcare, SQL Server, Informatica, ETL, Hortonworks, Apache Hadoop 2.6.0, HDFS, Hive 1.2.1000, Sqoop 1.4.6, HBase 1.1.2, Oozie 4.1.0, Storm 0.9.3, YARN, NiFi, Cassandra, Zookeeper, Spark, Kafka, Oracle 11g, MySQL, Shell Script, AWS, EC2, Tomcat 8, Spring 3.2.3, Source Control GIT, Teradata SQL Assistant.
DATA ENGINEER
Confidential
Responsibilities:
- Hands-on experience installing, configuring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Hive, Spark, Sqoop, Pig, Zookeeper, and Flume.
- Designed and developed data warehouse and business intelligence architecture; designed the ETL process from various sources into Hadoop/HDFS for analysis and further processing of data modules.
- Responsible for validation of target data in the data warehouse, which was transformed and loaded using Hadoop big data tools.
- Designed, developed, tested, implemented, and supported data warehousing ETL using Talend and Hadoop technologies.
- Extensively worked with MySQL for identifying required tables and views to export into HDFS.
- Designed and automated custom-built input adapters using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS to Azure Data Lake.
- Designed SQL, SSIS, and Python-based batch and real-time ETL pipelines to extract data from transactional and operational databases and load the data into target databases/data warehouses.
- Developed Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources.
- Developed multiple POCs using Scala and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL/Teradata.
- Created Automated ETL jobs in Talend and pushed the data to Azure SQL data warehouse.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Used Azure Synapse to manage processing workloads and served data for BI and prediction needs.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service.
- Used Azure Data Factory, the SQL API, and the Mongo API, and integrated data from MongoDB, MS SQL, and the cloud (Blob, Azure SQL DB).
- Analyzed business requirements, performed gap analysis, and transformed them into detailed design specifications.
- Developed data pipelines to consume data from the Enterprise Data Lake (MapReduce, Hadoop distribution - Hive tables/HDFS) for the analytics solution.
- Processed customer transaction data and developed daily, weekly, and monthly transaction summary views by customer, branch, and zone.
- Created Spark RDDs from data files and then performed transformations and actions to other RDDs.
- Created Hive tables with dynamic and static partitioning, including buckets, for efficiency; also created external tables in Hive for staging purposes.
- Loaded Hive tables with data, wrote Hive queries that run on MapReduce, and created a customized BI tool for management teams to perform query analytics using HiveQL.
- Applied Tableau Desktop to implement loan dashboards demonstrating key metrics (bad/good loans, age distribution with respect to loan amount, time to full repayment, etc.) using filters, calculated fields, and control panels.
- Visualized transformed data using Tableau Desktop dashboards containing histograms, trend lines, pie charts, and statistics.
- Wrote UDFs in Scala and PySpark to meet specific business requirements.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats to uncover insights into customer usage patterns (a minimal sketch appears at the end of this list).
- Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Created Databricks notebooks using SQL and Python and automated the notebooks using jobs.
- Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
- Primarily involved in Data Migration using SQL, SQL Azure, Azure storage, and Azure Data Factory, SSIS, PowerShell.
- Created pipelines in ADF using linked services/datasets/pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool.
- Designed and built a data discovery platform for a large system integrator using Azure HDInsight components; used Azure Data Factory and Data Catalog to ingest and maintain data sources, and enabled security on HDInsight using Azure Active Directory.
- Led the installation, integration, and configuration of Jenkins CI/CD, including installation of Jenkins plugins.
- Maintained, backed up, and recovered CI/CD tools, jobs, and scripts.
- Implemented a CI/CD pipeline with Docker, Jenkins, and GitHub by containerizing the Dev and Test environment servers with Docker and configuring the required automation.
- Performed Code Reviews and responsible for Design, Code, and Test signoff.
- Worked on designing and developing a real-time tax computation engine using Oracle, StreamSets, Kafka, and Spark Structured Streaming.
- Performed a POC to compare the time taken for Change Data Capture (CDC) of Oracle data across Striim and StreamSets.
- Worked on Vagrant boxes to set up local Kafka and StreamSets pipelines.
- Involved in cleaning and conforming the data; integrated StreamSets for data quality screening in ETL streams and configured the ETL pipeline.
- Worked with the JSON file format in StreamSets, and worked with the Oozie workflow engine to manage interdependent Hadoop jobs and automate several types of Hadoop jobs.
- Involved in all the phases of Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) andAgile methodologies.
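A minimal, illustrative sketch of the Databricks/PySpark cleansing and Spark SQL aggregation work described above (customer transaction summaries by day and branch). The table paths, column names, and mount point are assumptions for illustration only.

```python
# Illustrative Databricks-style PySpark cell: cleanse, standardize, and aggregate (names are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("databricks-cleanse-aggregate").getOrCreate()

# Hypothetical Parquet source landed by the ingestion pipeline.
txns = spark.read.parquet("/mnt/datalake/raw/transactions/")

clean = (
    txns.dropDuplicates(["txn_id"])                                    # remove duplicate events
    .filter(F.col("amount").isNotNull() & (F.col("amount") > 0))       # basic validation
    .withColumn("txn_date", F.to_date("txn_ts"))                       # standardize timestamp to a date
    .withColumn("branch", F.upper(F.trim("branch")))                   # normalize branch codes
)

clean.createOrReplaceTempView("transactions_clean")

# Day-wise / branch-wise summary served to the BI layer (Spark SQL on the cleansed view).
summary = spark.sql("""
    SELECT txn_date, branch,
           COUNT(*)    AS txn_count,
           SUM(amount) AS total_amount
    FROM transactions_clean
    GROUP BY txn_date, branch
""")

summary.write.mode("overwrite").partitionBy("txn_date").parquet("/mnt/datalake/curated/txn_summary/")
```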
Environment: HDFS, Hive 1.2.1000, Sqoop 1.4.6, HBase 1.1.2, Oozie 4.1.0, Storm 0.9.3, YARN, Tableau, Cassandra, Zookeeper, Spark, Kafka, Oracle 11g, MySQL, Shell Script, AWS, EC2, Tomcat 8, Spring 3.2.3, STS 3.6, Build Tool Gradle 2.2, Source Control GIT, Teradata SQL Assistant.
DATABASE INTERN
Confidential
Responsibilities:
- Worked on handling large datasets using partitions, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
- Worked with data analysis and visualizing Big Data in Tableau with Spark.
- Developed a Spark Streaming model that takes transactional data from multiple sources as input, creates micro-batches, and later feeds them to an already-trained fraud detection model while capturing error records.
- Developed Terraform scripts to create AWS resources such as EC2, Auto Scaling Groups, ELB, Route 53, S3, SNS, and CloudWatch alarms, and developed scripts for loading application call logs to S3.
- Extensive knowledge in Data transformations, Mapping, Cleansing, Monitoring, Debugging, performance tuning and troubleshooting Hadoop clusters.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (a minimal sketch appears at the end of this list).
- Developed DDL and DML scripts in SQL and HQL to create tables and analyze the data in RDBMS and Hive.
- Created Hive UDFs for additional functionality in Hive for analytics.
- Used Sqoop to import and export data between HDFS and RDBMS.
- Created Hive tables and involved in data loading and writing Hive UDFs.
- Exported the analyzed data to the relational database MySQL using Sqoop for visualization and to generate reports.
- Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.
- Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
- Migrated existing on-prem applications to the AWS Cloud using Server Migration Services.
- Configured security and RBAC models in AWS IAM to authenticate users and applications in the AWS environment.
- Developed scripts and batch jobs to schedule a bundle (a group of coordinators) consisting of various coordinator jobs.
- Worked on cluster coordination services through Zookeeper.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Designed data-access code along with stored procedures and supporting SQL queries.
- Extensively utilized Informatica to create the complete ETL process and load data into the database used by Reporting Services.
- Performed ad-hoc analysis of production data where necessary to analyze and develop solutions to reported incidents including member, claim, provider and network data.
- Developed test cases and completed planned unit testing and user acceptance testing (UAT).
- Validated data transformations and performed End-to-End data validations for ETL workflows loading data from XMLs to EDW.
- Provided general production support for tickets. Responsible for maintaining the accuracy and quality of the data.
- Wrote complex SQL scripts to avoid Informatica lookups and improve performance, as the volume of data was large.
- Assisted in troubleshooting and impact analysis of any changes made to database objects.
- Optimized SQL performance, integrity, and security of the project's databases/schemas; contributed to the optimization of data management and performance tuning.
- Documented and tracked the project progress using Azure DevOps.
- Created Build and Release for multiple user stories to production environment using Visual Studio Team Services (VSTS).
- Effectively contributed to Test Data Management to cover various testing needs which include data testing, warehousing, masking, cloning, data tracing, reusing and data manufacturing to satisfy the full test data life cycle.
- Developed and maintained test strategies and test plans across a multi-project, fast-paced environment.
- Utilized TFS for documenting test cases, bug tracking, and change migration to higher environments.
- Created Tidal job events to schedule the ETL extract workflows and to modify tier-point notifications.
- Coordinated and communicated the Data Masking and post validation in the lower DEV and TST environments to ensure that PHI data is only restricted to prod environment.
- Involved in daily stand-ups, retrospectives, and sprint planning with the team.
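A minimal, illustrative sketch of converting a Hive/SQL aggregation into equivalent Spark DataFrame and RDD transformations, as referenced above. The table and column names (claims, provider_id, paid_amount) are hypothetical, not the actual project schema.

```python
# Illustrative example of rewriting a Hive/SQL aggregation as Spark transformations
# (table and column names are placeholders, not the actual project schema).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive-to-spark").enableHiveSupport().getOrCreate()

# Original HiveQL (for reference):
#   SELECT provider_id, COUNT(*) AS claim_count, SUM(paid_amount) AS total_paid
#   FROM claims WHERE status = 'PAID' GROUP BY provider_id;

claims = spark.table("claims")  # assumes a Hive table registered in the metastore

# Equivalent DataFrame transformations.
paid_by_provider = (
    claims.filter(F.col("status") == "PAID")
    .groupBy("provider_id")
    .agg(F.count("*").alias("claim_count"),
         F.sum("paid_amount").alias("total_paid"))
)

# The same logic expressed with RDD transformations and an action.
paid_rdd = (
    claims.rdd.filter(lambda row: row["status"] == "PAID")
    .map(lambda row: (row["provider_id"], (1, row["paid_amount"])))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
)

paid_by_provider.show(10)
print(paid_rdd.take(10))
```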
ETL DEVELOPER
Confidential
Responsibilities:
- Used Informatica as an ETL tool to create source/target definitions, mappings, and sessions to extract, transform and load data into staging tables from various sources.
- Analyzed, designed, and developed ETL strategies and processes; wrote ETL specifications; performed Informatica development and administration.
- Designed and Developed Informatica processes to extract data from internal source systems.
- Experienced in Data Integration from multiple sources like Relational tables, flat files, MS Excel & XML files.
- Extensive experience in developing complex mappings using packaged transformations like Router, Connected and Unconnected lookups, Joiner etc.
- Extensive experience in Building, publishing customized interactive reports and dashboards, report scheduling using Tableau Desktop and Tableau Server.
- Developed Tableau visualizations and dashboards using Tableau Desktop and published the same on Tableau Server.
- Worked extensively in creating dashboards using Tableau, including Tableau Desktop, Tableau Server, and Tableau Reader in versions 9.0, 8.2, and 8.1; also involved in Tableau Server administration, such as installations, upgrades, user and user-group creation, and setting up security features.
- Used Informatica Power Center for (ETL) extraction, transformation and loading data from heterogeneous source systems into target database.
- Used debugger in Informatica Designer to resolve the issues regarding data thus reducing project delay.
- Performed various types of joins in Tableau for demonstrating integrated data purpose and validated data integrity to examine the feasibility of discussed visualization design.
- Leveraged advanced Tableau features such as calculated fields, parameters, and sets to support data analysis and data mining.
- Developed wrapper shell scripts to call Informatica workflows using the pmcmd command and created shell scripts to fine-tune the ETL flow of the Informatica workflows (a minimal sketch appears at the end of this list).
- Worked as an ETL and Tableau developer, widely involved in designing, developing, and debugging ETL mappings using the Informatica Designer tool, and created advanced chart types, visualizations, and complex calculations to manipulate data using Tableau Desktop.
- Used Informatica to parse XML data into data mart structures that are further utilized for reporting needs.
- Utilized Informatica PowerCenter to implement the full data flow, analyzing source data (Oracle, SQL Server, flat files) before extraction and transformation.
- Used the Custom SQL feature in Tableau Desktop to create complex, performance-optimized dashboards.
- Connected Tableau to various databases and performed live data connections, query auto-updates on data refresh, etc.
- Created design specifications and operational documentation.
- Wrote complex SQL scripts to avoid Informatica lookups and improve performance, as the volume of data was large.
- Coordinated and communicated the Data Masking and post validation in the lower DEV and TST environments to ensure that PHI data is only restricted to prod environment.
- Involved in daily stand-ups, retrospectives, and sprint planning with the team.
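A minimal, illustrative sketch of the kind of workflow wrapper described above, written here in Python around Informatica's pmcmd CLI rather than shell. The integration service, domain, folder, and environment-variable names are assumptions, and pmcmd flag usage should be verified against the installed PowerCenter version.

```python
# Hypothetical Python wrapper around Informatica's pmcmd CLI (the project used shell scripts;
# service/domain/folder names are placeholders and flag usage should be checked against
# the installed PowerCenter version).
import os
import subprocess
import sys


def start_workflow(folder: str, workflow: str) -> int:
    """Start an Informatica workflow with pmcmd and wait for it to finish."""
    cmd = [
        "pmcmd", "startworkflow",
        "-sv", os.environ.get("INFA_INTEGRATION_SERVICE", "IS_DEV"),  # assumed env vars
        "-d", os.environ.get("INFA_DOMAIN", "Domain_Dev"),
        "-u", os.environ["INFA_USER"],
        "-p", os.environ["INFA_PASSWORD"],      # secure variables are preferable in practice
        "-f", folder,
        "-wait",                                 # block until the workflow completes
        workflow,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode


if __name__ == "__main__":
    # Usage: python run_workflow.py <folder> <workflow_name>
    sys.exit(start_workflow(sys.argv[1], sys.argv[2]))
```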