Sr. Data Engineer Resume
Ashburn, VA
SUMMARY
- 8+ years of IT experience and technical proficiency in Data Engineering.
- Hands-on experience creating pipelines in Azure Data Factory V2 using activities such as Move & Transform, Copy, Filter, ForEach, Get Metadata, Lookup, and Databricks.
- Excellent knowledge of integrating Azure Data Factory V2/V1 with a variety of data sources and processing the data using pipelines, pipeline parameters, activities, activity parameters, and manual/window-based/event-based job scheduling.
- Hands-on experience working with different file formats such as JSON, CSV, Avro, and Parquet using Databricks and Data Factory.
- Extensive experience reading continuous JSON data from different source systems through Event Hub into various downstream systems using Stream Analytics and Apache Spark Structured Streaming on Databricks (see the sketch after this summary).
- Good experience importing and exporting data between HDFS/Hive and relational database systems using Sqoop.
- Expertise in data migration, data profiling, data cleansing, transformation, integration, data import, and data export using ETL tools such as Informatica PowerCenter.
- Excellent working experience in Scrum/Agile and Waterfall project execution methodologies.
- Strong experience working with databases such as Teradata and proficiency in writing complex SQL and PL/SQL for creating tables, views, indexes, stored procedures, and functions.
- Experience in importing and exporting terabytes of data between HDFS and relational database systems using Sqoop.
- Experience in text analytics and in developing statistical machine learning and data mining solutions to various business problems, generating data visualizations using R, SAS, and Python, and creating dashboards with tools like Tableau.
- Experienced in configuring and administering Hadoop clusters using major Hadoop distributions such as Apache Hadoop and Cloudera.
- Experience in building highly reliable, scalable big data solutions on the Cloudera, Hortonworks, and AWS EMR Hadoop distributions.
- Good experience in data modeling and data analysis; proficient in gathering business requirements and handling requirements management.
- Experience in transferring data from AWS S3 to AWS Redshift using Informatica.
- Created data models (ERD, logical) with robust data definitions, including entity-relationship-attribute, star, and snowflake models.
- Good understanding and exposure to Python programming.
- Good knowledge on NoSQL Databases including HBase, MongoDB, Cassandra, MapR-DB.
- Installation, configuration, and administration experience with big data platforms, including Cloudera Manager (Cloudera) and MCS (MapR).
- Extensive knowledge and Hands on experience implementing cloud data lakes like Azure Data Lake Gen1 and Azure Data Lake Gen2.
- Extensive working experience with Amazon Web Services along with provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
- Experience in migrating data between HDFS/Hive and relational database systems using Sqoop according to client requirements.
- Experience with RDBMS like SQL Server, MySQL, Oracle and data warehouses like Teradata and Netezza.
- Proficient knowledge and hands on experience in writing shell scripts in Linux.
- Experience in developing MapReduce jobs for data cleaning and data manipulation as required by the business.
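A minimal PySpark Structured Streaming sketch of the Event Hub JSON pattern referenced above. It assumes the Event Hubs namespace's Kafka-compatible endpoint is used and that a Delta-capable sink (Databricks) is available; the namespace, hub, schema, and paths below are placeholders, not values from any actual engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("eventhub-json-stream").getOrCreate()

# Expected shape of each JSON event (placeholder fields).
event_schema = StructType([
    StructField("eventId", StringType()),
    StructField("eventType", StringType()),
    StructField("eventTime", TimestampType()),
])

# Read the continuous JSON feed from Event Hubs via its Kafka-compatible endpoint.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
       .option("subscribe", "<event-hub-name>")
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.mechanism", "PLAIN")
       .option("kafka.sasl.jaas.config",
               'org.apache.kafka.common.security.plain.PlainLoginModule required '
               'username="$ConnectionString" password="<event-hub-connection-string>";')
       .load())

# Parse the JSON payload and push it to a downstream sink (Delta on Databricks here).
events = raw.select(from_json(col("value").cast("string"), event_schema).alias("e")).select("e.*")

(events.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/checkpoints/events")
       .outputMode("append")
       .start("/mnt/datalake/curated/events"))
```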
TECHNICAL SKILLS
Big Data & Hadoop Ecosystem: MapReduce, Spark 2.3, HBase 1.2, Hive 2.3, Pig 0.17, Solr 7.2, Flume 1.8, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hue, Cloudera Manager, StreamSets, Neo4j, Hadoop 3.0, Apache NiFi 1.6, Cassandra 3.11
AWS Cloud Platform: EC2, S3, Redshift
Azure Cloud Platform: Azure Data Factory v2, Azure Blob Storage, Azure Data Lake Gen 1 & Gen 2, Azure Synapse
Programming Languages: PySpark, Python, T-SQL, U-SQL, Linux shell scripting, Azure PowerShell
ETL Tools: Teradata SQL Assistant, TPT, BTEQ, FastLoad, MultiLoad, FastExport, TPump, Informatica PowerCenter
RDBMS: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access
Data Modeling Tools: Erwin Data Modeler 9.7/9.6, Erwin Model Manager, ER Studio v17, and Power Designer.
Reporting Tools: SSRS, Power BI, Tableau, SSAS, MS-Excel, SAS BI Platform.
OLAP Tools: Tableau 7, SAP BusinessObjects, SSAS, and Crystal Reports 9
BI Tools: Tableau 10, Tableau Server 10, Tableau Reader 10, SAP BusinessObjects, Crystal Reports
Operating Systems: Microsoft Windows Vista/7/8/10, UNIX, and Linux.
Methodologies: Agile, RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Waterfall Model.
PROFESSIONAL EXPERIENCE
Confidential - Ashburn, VA
Sr. Data Engineer
Responsibilities:
- As a Senior Data Engineer, assisted in leading the plan, build, and run states within the Enterprise Analytics team.
- Met with business/user groups to understand business processes, gather requirements, and carry out analysis, design, development, and implementation according to client requirements.
- Provided a summary of the project's goals, the specific expectations business users had of BI, and how these aligned with the project goals.
- Responsible for building scalable distributed data solutions using Big Data technologies like Apache Hadoop, Shell Scripting, Hive.
- Provided strategies and requirements for the seamless migration of applications, web services, and data from local and server-based systems to the AWS cloud.
- Used Agile (SCRUM) methodologies for Software Development.
- Applied Data Governance rules (primary qualifiers, class words, and valid abbreviations in table and column names).
- Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
- Created reports in Tableau based on client needs, enabling dynamic interaction with the data produced.
- Created dashboards in Tableau showcasing the trends the client was following.
- Worked on Proof of Concepts to decide Tableau as a BI strategy for enterprise reporting.
- Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
- Architected, Designed and Developed Business applications and Data marts for reporting.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- Developed the code for importing and exporting data into HDFS and Hive using Sqoop.
- Installed and configured Hadoop and responsible for maintaining cluster and managing and reviewing Hadoop log files.
- Developed Python scripts to automate and provide Control flow to Pig scripts.
- Developed a reconciliation process to ensure the Elasticsearch index document count matched the source records.
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Implemented the big data solution using Hadoop and Hive to pull and load the data into HDFS.
- Worked with the Oozie workflow engine to schedule time-based jobs performing multiple actions.
- Developed incremental- and full-load Python processes to ingest data into Elasticsearch from an Oracle database.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations (see the sketch at the end of this role).
- Experienced in the AWS cloud environment, including S3 storage and EC2 instances.
- Loaded data from different sources such as HDFS and HBase into Spark RDDs and implemented in-memory computation to generate the output response.
- Created Tableau scorecards and dashboards using stacked bars, bar graphs, scatter plots, geographical maps, and Gantt charts through the Show Me functionality.
- Developed REST services in Python Flask to write data into the Elasticsearch index.
- Developed complete end-to-end big data processing in the Hadoop ecosystem.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
- Utilized Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
- Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
- Supported in setting up QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
- Improved performance by optimizing computing time to process the streaming data and saved the company cost by optimizing cluster run time.
- Generated alerts on daily event metrics for the product team.
- Suggested fixes to complex issues through thorough analysis of the root cause and impact of each defect.
Environment: Hive 2.3, HDFS, Yarn, HBase, PL/SQL, Tableau, MongoDB, Pig 0.16, Sqoop 1.2, Oozie 4.3
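A minimal PySpark sketch of the HDFS pull-and-massage pattern called out above: raw delimited files from the data lake are cleaned with RDD transformations and persisted to a partitioned Hive table for reporting. The path, delimiter, column names, and table name are illustrative assumptions, not project values.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-rdd-cleansing")
         .enableHiveSupport()
         .getOrCreate())

# Pull pipe-delimited raw files from the data lake (HDFS) into an RDD.
raw = spark.sparkContext.textFile("hdfs:///data/lake/raw/transactions/*")

# Massage the records: split, drop malformed rows, and cast the amount.
cleaned = (raw.map(lambda line: line.split("|"))
              .filter(lambda f: len(f) == 3 and f[2].replace(".", "", 1).isdigit())
              .map(lambda f: (f[0].strip(), f[1].strip(), float(f[2]))))

# Convert to a DataFrame and persist as a partitioned Hive table for dashboards.
df = cleaned.toDF(["account_id", "txn_date", "amount"])
(df.write
   .mode("overwrite")
   .partitionBy("txn_date")
   .saveAsTable("analytics.transactions_cleaned"))
```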
Confidential - Dublin, OH
Data Engineer
Responsibilities:
- Worked as a Data Engineer collaborating with other Product Engineering team members to develop, test, and support data-related initiatives.
- Followed Agile methodology for the entire project.
- Defined the business objectives comprehensively through discussions with business stakeholders and functional analysts and by participating in requirement-collection sessions.
- Worked closely with business power users to create reports/dashboards using Tableau desktop.
- Developed data warehouse model in Snowflake for over 100 datasets.
- Designed and implemented a fully operational, production-grade, large-scale data solution on Snowflake Data Warehouse (see the sketch after this role).
- Exposure to implementation and operations of data governance, data strategy, data management and solutions.
- Designed and developed various analytical reports from multiple data sources by blending data on a single worksheet in Tableau Desktop.
- Identified existing business process workflows and built a data pipeline using Hive on MapReduce to depict SaaS.
- Designed and implemented a SaaS subscription data model on a 200-node Hadoop cluster with 10 TB of data Sqooped through the data lake using Hive.
- Created data visualizations and dashboards and applied various actions (URL, Filter, Parameters) in Tableau.
- Assembled large, complex data sets that meet functional/non-functional business requirements.
- Involved in creating Azure Data Factory pipelines.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Automated jobs using event, schedule, and tumbling window triggers in ADF.
- Ingested a huge volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2.
- Worked in Python to build data pipelines downstream of the data loaded from Kafka.
- Prepared Architecture Design Document (ADD) as per the HLSD.
- Developed programs in Spark to use on application for faster data processing than standard MapReduce programs.
- Worked with Data governance team to evaluate test results for fulfillment of all data requirements.
- Worked on Azure Data Factory and Azure Databricks as part of EDS transformation.
- Provided data acquisition strategy for the source systems.
- Listed each identified target system and gave an overview of the impacts for each, covering data landing, staging, core, data sharing, and data visualization.
- Provided designs for data history and retention requirements.
- Worked in the Snowflake environment to remove redundancy and loaded real-time data from various data sources into HDFS using Kafka.
- Responsible for estimating the cluster size, monitoring and troubleshooting of the Spark Databricks cluster.
- Created notifications, alerts, and report subscriptions in the Power BI service.
- Provided support once the Power BI reports were published if there were any data changes, system changes, or requirement changes.
- Provided guidance for RCM creation by calling out the systems impacted by code changes, configuration changes, and testing.
- Created S2TM (source-to-target mapping) documents for the tables sourced from staging databases to build the tables in core databases.
- Participated in weekly data analyst meetings and submitted weekly data governance status.
- Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
- Involved in designing complex PL/SQL program units to handle business critical scenarios.
Environment: Spark, Azure Data Factory, Azure Databricks, PL/SQL, SaaS, Tableau, MapReduce, Kafka, Python, HDFS, Hive, and Unix.
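A minimal Databricks PySpark sketch of the Snowflake loading pattern described in this role: a raw feed landed by ADF in Data Lake Gen2 is curated and written to Snowflake. It assumes the Snowflake Spark connector is attached to the cluster; the storage account, Snowflake account, credentials, and table names are placeholders (credentials would normally come from a secret scope).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("curated-to-snowflake").getOrCreate()

# Curate the raw subscription feed that ADF lands in Azure Data Lake Gen2.
raw = spark.read.parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/subscriptions/")
curated = (raw
           .withColumn("start_date", to_date(col("start_date")))
           .dropDuplicates(["subscription_id"]))

# Connection options for the Snowflake Spark connector (placeholder values).
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "TRANSFORM_WH",
}

# Write the curated DataFrame into a Snowflake table.
(curated.write
        .format("snowflake")
        .options(**sf_options)
        .option("dbtable", "SUBSCRIPTIONS_CURATED")
        .mode("overwrite")
        .save())
```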
Confidential - Juno Beach, FL
Azure Data Engineer
Responsibilities:
- Took initiative and ownership to provide business solutions on time.
- Created high-level technical design documents and application design documents per the requirements, delivering clear, well-communicated, and complete design documents.
- Implemented Azure Data Factory (ADF) extensively for ingesting data from different source systems like relational and unstructured data to meet business functional requirements.
- Designed and developed batch processing and real-time processing solutions using ADF, Databricks clusters, and Stream Analytics.
- Created numerous pipelines in Azure Data Factory V2 to get data from disparate source systems using different activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
- Maintained and provided support for optimal pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Automated jobs in ADF using event, schedule, and tumbling window triggers.
- Created and provisioned different Databricks clusters, notebooks, and jobs, and configured autoscaling.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations (see the sketch after this role).
- Performed data flow transformation using the data flow activity.
- Implemented Azure and self-hosted integration runtimes in ADF.
- Developed Spark code in Python in Databricks notebooks.
- Improved performance by optimizing the computing time to process streaming data and the cluster run time.
- Developed Power BI reports using Power Query with feeds from SQL Server and other data sources.
- Ensured data accuracy, integrity, and reliability in both the back end and the Power BI reports.
- Created Linked services to connect the external resources to ADF.
- Wrote complex SQL queries including joins, correlated subqueries, scalar subqueries, views, stored procedures, and triggers.
- Ensured the developed solutions were formally documented and signed off by the business.
- Worked with team members on resolving technical issues, troubleshooting, project risk and issue identification, and management.
- Worked on the cost estimation, billing, and implementation of services on the cloud.
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Windows Remote Desktop, Azure PowerShell, Databricks, Python, Azure SQL Server, Azure Data Warehouse.
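A minimal sketch of the PySpark table-to-table jobs mentioned above, run as a Databricks notebook invoked from an ADF pipeline: a staging table loaded by an ADF Copy activity is standardized and written to a curated table for Power BI. The database, table, and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, trim

spark = SparkSession.builder.appName("staging-to-curated").getOrCreate()

# Read the staging table that the ADF Copy activity loads.
staged = spark.table("staging.orders_raw")

# Standardize types, trim text columns, and drop duplicate business keys.
curated = (staged
           .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
           .withColumn("customer_name", trim(col("customer_name")))
           .dropDuplicates(["order_id"]))

# Overwrite the curated table that downstream Power BI reports query.
(curated.write
        .mode("overwrite")
        .saveAsTable("curated.orders"))
```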
Confidential
Teradata/ETL Consultant
Responsibilities:
- Involved in requirement gathering, business analysis, design, development, testing, and implementation of business rules.
- Analyzed the requirements and the existing environment to help arrive at the right strategy to load and extract data in the data warehouse.
- Prepared the ETL specifications, mapping document, and ETL framework specification.
- Implemented slowly changing dimension logic in the mappings to effectively handle change data capture, which is typical in data warehousing systems.
- Prepared functional and technical specifications and design documents.
- Responsible for data profiling, data cleansing, and data conformation.
- Actively participated in the code migration process to higher environments and created documentation for the same.
- Created BTEQ scripts for preloading the work tables prior to the main load process (see the sketch after this role).
- Proficient in understanding Teradata EXPLAIN plans, the Collect Stats option, Secondary Indexes (USI, NUSI), Partitioned Primary Indexes (PPI), and volatile, global temporary, and derived tables.
- Reviewed the SQL for missing joins, join constraints, data format issues, mismatched aliases, and casting errors.
- Used Teradata Manager, BTEQ, FASTLOAD, MULTILOAD, TPUMP, SQL and TASM for workload management.
- Wrote various TPT scripts for ad hoc requirements and used tdload with TPT to export data from one environment to another.
- Involved in Data Modeling to identify the gaps with respect to business requirements and transforming the business rules.
- Performed workload management using various tools such as Teradata Manager, FastLoad, MultiLoad, TPump, TPT, and SQL Assistant.
- Developed the scripts using Teradata Parallel Transporter and implemented the Extraction- Transformation-Loading of data with TPT.
- Identified performance bottlenecks in the production processes and key places where SQL could be tuned to improve the overall performance of the production process.
- Developed UNIX scripts to automate the different tasks involved in the loading process.
Environment: Teradata 15, Teradata SQL Assistant, PowerCenter, TPump, BTEQ, MultiLoad, FastLoad, FastExport, Erwin Designer, Informatica 9.5, Tableau, Power BI, UNIX, Korn shell scripts.
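A minimal Python sketch of the work-table preload step mentioned above, expressed with the open-source teradatasql driver rather than the original BTEQ; it is only an illustration of the pattern, and the host, credentials, and table names are placeholders.

```python
import teradatasql

# Preload statements mirroring a BTEQ preload: clear the work table, stage
# today's delta, and refresh statistics before the main load runs.
preload_statements = [
    "DELETE FROM stg_db.wt_customer_delta",
    """
    INSERT INTO stg_db.wt_customer_delta (customer_id, customer_name, load_dt)
    SELECT customer_id, customer_name, CURRENT_DATE
    FROM   landing_db.customer_landing
    WHERE  load_dt = CURRENT_DATE
    """,
    "COLLECT STATISTICS ON stg_db.wt_customer_delta COLUMN (customer_id)",
]

with teradatasql.connect(host="<tdhost>", user="<user>", password="<password>") as con:
    with con.cursor() as cur:
        for stmt in preload_statements:
            cur.execute(stmt)
```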