
Sr. Data Engineer Resume


Columbus, Indiana

SUMMARY

  • Over 8 years of working experience as a Data Engineer with highly proficient knowledge in Data Analysis and Big Data.
  • Experienced in Big Data work on Hadoop, Spark, PySpark, Hive, HDFS and other NoSQL platforms.
  • Experience in developing MapReduce programs using Apache Hadoop for analyzing Big Data as per the requirements.
  • Experienced in technical consulting and end-to-end delivery with architecture, data modeling, data governance, and design, development and implementation of solutions.
  • Experience in installation, configuration, support and management of the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
  • Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
  • Proficient in Normalization/Denormalization techniques in relational/dimensional database environments and have done normalizations up to 3NF.
  • Hands-on experience with Amazon Web Services, including provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
  • Hands-on experience with Google Cloud Platform (GCP) services such as BigQuery, GCS buckets and Cloud Functions.
  • Experienced in Informatica ILM (Information Lifecycle Management) and its tools.
  • Efficient in all phases of the development lifecycle, coherent with Data Cleansing, Data Conversion, Data Profiling, Data Mapping, Performance Tuning and System Testing.
  • Experience in Big Data Hadoop Ecosystem in ingestion, storage, querying, processing and analysis of Big data.
  • Good understanding of the Ralph Kimball (Dimensional) and Bill Inmon (Relational) modeling methodologies.
  • Experienced working extensively on Master Data Management (MDM) and the applications used for MDM.
  • Experience in transferring data from AWS S3 to AWS Redshift using Informatica.
  • Good knowledge of SQL queries and of creating database objects like stored procedures, triggers, packages and functions using SQL and PL/SQL to implement business logic.
  • Supported ad-hoc business requests, developed stored procedures and triggers, and extensively used Quest tools like TOAD.
  • Good understanding and exposure to Python programming.
  • Excellent working experience in Scrum/Agile framework and Waterfall project execution methodologies.
  • Experience in migrating data using Sqoop from HDFS and Hive to relational database systems and vice versa according to the client's requirements.
  • Extensive experience working with business users/SMEs as well as senior management.
  • Strong experience in using MS Excel and MS Access to dump the data and analyze based on business needs.
  • Good experience in Data Analysis; proficient in gathering business requirements and handling requirements management.

TECHNICAL SKILLS

Programming Languages: SQL, PL/SQL, UNIX shell Scripting, R

Methodologies: Agile, RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Waterfall Model.

Big Data & Hadoop Ecosystem: MapReduce, Spark 3.3, HBase 2.3.4, Hive 2.3, Flume 1.9, Sqoop 1.4.6, Kafka 2.6, Oozie 4.3, Hue, Cloudera Manager, Neo4j, Hadoop 3.3, Apache NiFi 1.6

Databases: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access

Operating Systems: Microsoft Windows Vista/7/8/10, UNIX, and Linux.

BI Tools: Tableau 10, SSRS, Crystal Reports, Power BI.

NoSQL Databases: MongoDB, Azure SQL DB, Cassandra 3.11.10

Data Modeling Tools: Erwin R9.7/9.6, ER Studio V17

Cloud Platforms: GCP, Google BigQuery, AWS (EC2, Redshift) & MS Azure

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, Columbus, Indiana

Responsibilities:

  • As a Data Engineer, involved in Agile Scrum meetings to help manage and organize a team of developers, with regular code review sessions.
  • Participated in code reviews, enhancement discussions, maintenance of existing pipelines and systems, and testing and bug-fix activities on an ongoing basis.
  • Worked closely with the business analysts to convert the Business Requirements into Technical Requirements and prepared low- and high-level documentation.
  • Worked on Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames and Pair RDDs.
  • Developed ETL processes in AWS Glue to migrate data from external sources like S3 (ORC/Parquet/text files) into AWS Redshift.
  • Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue and Step Functions.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases for huge volumes of data.
  • Involved in daily Scrum meetings to discuss the development/progress and was active in making scrum meetings more productive.
  • Worked seamlessly in Python to build data pipelines after data was loaded from Kafka.
  • Configured Spark Streaming to consume Kafka streams and store the data in HDFS (a minimal sketch follows this list).
  • Worked on loading data into Spark RDDs and performed advanced procedures like text analytics using Spark's in-memory computation capabilities to generate the output response.
  • Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Created AWS Lambda functions and assigned IAM roles to schedule Python scripts using CloudWatch triggers to support the infrastructure needs (SQS, EventBridge, SNS).
  • Involved in converting MapReduce programs into Spark transformations on Spark RDDs using Scala and Python.
  • Integrated Kafka with Spark Streaming for high-throughput, reliable processing.
  • Developed a Python script to hit REST APIs and extract data to AWS S3.
  • Conducted ETL data integration, cleansing and transformations using AWS Glue Spark scripts.
  • Worked on Lambda functions that aggregate data from incoming events and store the results in Amazon DynamoDB (an illustrative sketch follows the environment line below).
  • Deployed the project on Amazon EMR with S3 connectivity for backup storage.
  • Designed and developed ETL jobs to extract data from Oracle and load it into a data mart in Redshift.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Used a JSON schema to define table and column mappings from S3 data to Redshift.
  • Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
  • Used JIRA for issue tracking and change management.
  • Involved in creating Jenkins jobs for CI/CD using Git, Maven and Bash scripting.
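
As an illustration of the Kafka-to-HDFS streaming ingestion described above, a minimal PySpark Structured Streaming sketch might look like the following. The broker address, topic name and HDFS paths are placeholders, and it assumes the Spark Kafka connector package is available on the cluster.

```python
# Minimal PySpark Structured Streaming sketch: read events from Kafka and
# persist them to HDFS. Broker, topic and paths are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-ingest")
         .getOrCreate())

# Subscribe to the Kafka topic (hypothetical broker/topic names)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "ingest-topic")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing
payload = events.select(col("value").cast("string").alias("raw_event"))

# Write micro-batches to HDFS with checkpointing for fault tolerance
query = (payload.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```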

Environment: Spark 3.3, AWS S3, Redshift, Glue, EMR, IAM, EC2, Tableau, Jenkins, Jira, Python, Kafka, Agile.
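
As a rough illustration of the Lambda-based event aggregation into DynamoDB mentioned in the bullets above, a handler along these lines could be used; the table name, key schema and event payload shape are assumptions, not the actual project schema.

```python
# Illustrative AWS Lambda handler: aggregate a total from incoming SQS-style
# event records and upsert the result into DynamoDB. Table, key and attribute
# names are hypothetical placeholders.
import json
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("event_aggregates")  # hypothetical table name

def handler(event, context):
    records = event.get("Records", [])
    total_amount = Decimal("0")
    for record in records:
        # Assumes an SQS-style record with a JSON string body containing "amount"
        body = json.loads(record.get("body", "{}"))
        total_amount += Decimal(str(body.get("amount", 0)))

    # Counter-style update keyed by an assumed partition key "aggregate_id"
    table.update_item(
        Key={"aggregate_id": "daily_total"},
        UpdateExpression="ADD amount_total :inc, event_count :n",
        ExpressionAttributeValues={":inc": total_amount, ":n": len(records)},
    )
    return {"processed": len(records)}
```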

Data Engineer

Confidential, Boston, MA

Responsibilities:

  • As a Data Engineer, was responsible for building scalable distributed data solutions using Hadoop.
  • Involved in Agile Development process (Scrum and Sprint planning).
  • Handled Hadoop cluster installations in a Windows environment.
  • Migrated the on-premise environment to GCP (Google Cloud Platform).
  • Migrated data warehouses to the Snowflake data warehouse.
  • Defined virtual warehouse sizing in Snowflake for different types of workloads.
  • Involved in migrating the existing on-premise Hive code to GCP (Google Cloud Platform) BigQuery.
  • Involved in migrating an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub for triggering the Apache Airflow jobs.
  • Extracted data from data lakes and the EDW into relational databases to analyze and derive more meaningful insights using SQL queries and PySpark.
  • Developed PySpark script to merge static and dynamic files and cleanse the data.
  • Created PySpark procedures, functions and packages to load data.
  • Designed, developed and maintained data integration programs in Hadoop and RDBMS environments with both traditional and non-traditional source systems.
  • Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
  • Wrote Sqoop Scripts for importing and exporting data from RDBMS to HDFS.
  • Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery and Bigtable.
  • Developed scripts in BigQuery and connected it to reporting tools.
  • Designed workflows in Airflow to automate the services developed for change data capture (a DAG skeleton is sketched after the environment line below).
  • Carried out data transformation and cleansing using SQL queries and PySpark.
  • Used Kafka and Spark streaming to ingest real time or near real time data in HDFS.
  • Worked on downloading BigQuery data into Spark DataFrames for advanced ETL capabilities (a minimal sketch follows this list).
  • Worked on PySpark APIs for data transformations.
  • Built reports for monitoring data loads into GCP and drive reliability at the site level.
  • Participated in daily stand-ups, bi-weekly scrums and PI planning.
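
A minimal sketch of pulling BigQuery data into a Spark DataFrame for downstream ETL, assuming the spark-bigquery connector is available on the cluster (for example on Dataproc); the project, dataset, table and bucket names are placeholders.

```python
# Sketch: load a BigQuery table into a Spark DataFrame for further ETL.
# Assumes the spark-bigquery connector jar is available to the cluster.
# Project/dataset/table and bucket names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bq-to-spark-etl")
         .getOrCreate())

# Read from BigQuery (placeholder table identifier)
orders = (spark.read
          .format("bigquery")
          .option("table", "my-project.sales_ds.orders")
          .load())

# Example transformation: keep completed orders and aggregate by day
daily = (orders
         .filter(orders.status == "COMPLETED")
         .groupBy("order_date")
         .sum("order_amount"))

# Write the result back to BigQuery via a temporary GCS bucket (placeholder)
(daily.write
 .format("bigquery")
 .option("table", "my-project.sales_ds.daily_totals")
 .option("temporaryGcsBucket", "my-temp-bucket")
 .mode("overwrite")
 .save())
```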

Environment: Hadoop 3.3, GCP, BigQuery, Bigtable, Spark 3.0, PySpark, Sqoop 1.4.7, ETL, HDFS, Snowflake DW, Oracle SQL, MapReduce, Kafka 2.8 and Agile process.
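
A hedged skeleton of an Airflow DAG for a change-data-capture style workflow like the one described above; the task logic, schedule and identifiers are placeholders and assume an Airflow 2.x environment.

```python
# Illustrative Airflow DAG skeleton for a change-data-capture style workflow:
# extract changed rows, then load them to the warehouse. Task logic, schedule
# and names are placeholders, not the actual production DAG.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_changes(**context):
    # Placeholder: query the source for rows changed since the last run
    print("extracting changed rows for", context["ds"])

def load_to_warehouse(**context):
    # Placeholder: merge the extracted rows into the target tables
    print("loading changed rows for", context["ds"])

with DAG(
    dag_id="cdc_daily_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_changes", python_callable=extract_changes)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)

    extract >> load
```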

Sr. Data Analyst

Confidential, Lake Success, NY

Responsibilities:

  • As a Data Analyst, reviewed business requirements and composed source-to-target data mapping documents.
  • Interacted with Business Analysts, SMEs and other Data Engineers to understand business needs.
  • Participated in design discussions and assured functional specifications are delivered in all phases of SDLC in an Agile Environment.
  • Defined appropriate security roles related to data, including roles associated with securing and provisioning data.
  • Audited security of the data within the domain of stewardship and defined named individuals for each required role.
  • Worked closely with the business analyst and Data warehouse architect to understand the source data and need of the Warehouse.
  • Interacted with stakeholders to clarify their questions regarding the reports in Power BI.
  • Actively involved in SQL and Azure SQL DW code development using T-SQL.
  • Involved in designing a star schema-based data model with dimensions and facts.
  • Worked on a migration project, which required gap analysis between legacy systems and new systems.
  • Involved in requirement gathering and database design and implementation of star-schema, snowflake schema/dimensional data warehouse using Erwin.
  • Wrote and executed the necessary PL/SQL queries to analyze and validate the data.
  • Reviewed the Joint Requirement Documents (JRD) with the cross functional team to analyze the High Level Requirements.
  • Developed and maintained new data ingestion processes with Azure Data Factory.
  • Implemented data aggregation and business logic in Azure Data Lake.
  • Designed and developed automation test scripts using Python (a minimal validation sketch follows this list).
  • Created and published reports for stakeholders using Power BI.
  • Analyzed escalated incidents within the Azure SQL database.
  • Worked on enhancing the data quality in the database.
  • Worked on performance tuning of the database, including indexing and optimizing SQL statements.
  • Involved in capturing data lineage, table and column data definitions, valid values and other necessary information in the data model.
  • Created or modified the T-SQL queries as per the business requirements.
  • Involved in user training sessions and assisting in UAT (User Acceptance Testing).
  • Participated in design and daily stand-up meetings.
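
A minimal sketch of the kind of Python automation test script described above, run against an Azure SQL database through pyodbc; the connection string, driver version and table names are placeholder assumptions.

```python
# Sketch of an automated data-validation check against an Azure SQL database
# using pyodbc. The connection string, driver version and table/column names
# are placeholders for illustration only.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword"
)

def validate_row_counts(source_table: str, target_table: str) -> bool:
    """Compare row counts between a staging table and its target table."""
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {source_table}")
        source_count = cursor.fetchone()[0]
        cursor.execute(f"SELECT COUNT(*) FROM {target_table}")
        target_count = cursor.fetchone()[0]
    print(f"{source_table}: {source_count} rows, {target_table}: {target_count} rows")
    return source_count == target_count

if __name__ == "__main__":
    assert validate_row_counts("stg.Orders", "dw.FactOrders"), "row counts do not match"
```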

Environment: Erwin, Azure SQL DB, Azure Data Lake, T-SQL, UAT, PL/SQL, Power BI, Python and Agile/Scrum.

Data Analyst

Confidential

Responsibilities:

  • Worked with Business Analyst and helped represent the business domain details and prepared low-level analysis documentation.
  • Created Hive tables and created Sqoop jobs to import the data from Oracle/SQL Server to HDFS
  • Developed Oozie workflows and scheduled them in Control-M as daily jobs to load incremental updates from the RDBMS source systems.
  • Wrote different Pig scripts to clean up the ingested data and created partitions for the daily data.
  • Prepared Pig scripts and Spark SQL to handle all the transformations specified in the S2TMs and to support SCD2 and SCD1 scenarios.
  • Wrote different UDFs in Java to convert date formats and to create hash values using the MD5 algorithm (a PySpark equivalent is sketched after this list).
  • Implemented Partitioning and bucketing in Hive based on the requirement.
  • Involved in converting Hive SQL queries into Spark transformations using Spark SQL and Scala.
  • Implemented Spark RDD transformations and actions to carry out business analysis.
  • Created Sqoop import jobs to import source tables from Microsoft SQL Server.
  • Created Sqoop export jobs to export target tables to Teradata and make them available to the reporting layer.
  • Worked with BI and QA team to test the application and fixed the defects immediately.
  • Leveraged open-source monitoring toolkit Prometheus to capture pod metrics and built sample dashboards in Splunk.
  • Involved in Unit and Integration level testing and prepared supporting documents for deployment.
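
The date-format and MD5-hash UDFs above were implemented in Java; purely as an illustration, an equivalent transformation in PySpark using built-in functions might look like the following, with the input path and column names assumed.

```python
# Illustrative PySpark equivalent of the Java UDFs described above: convert a
# date column's format and derive an MD5 hash over key columns (e.g. for
# change detection in SCD processing). Paths and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, date_format, md5, concat_ws, col

spark = SparkSession.builder.appName("udf-equivalents").getOrCreate()

df = spark.read.parquet("hdfs:///data/staging/customers")  # placeholder path

result = (df
          # Reformat a 'MM/dd/yyyy' string date into ISO 'yyyy-MM-dd'
          .withColumn("load_date",
                      date_format(to_date(col("load_date"), "MM/dd/yyyy"), "yyyy-MM-dd"))
          # MD5 hash over concatenated business columns, used to detect changes
          .withColumn("row_hash",
                      md5(concat_ws("||", col("customer_id"), col("name"), col("address")))))

result.write.mode("overwrite").parquet("hdfs:///data/curated/customers")
```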

Environment: HDP 2.2.4, Hadoop 2.6, Hive 0.14, Pig 0.14, HBase, Spark 1.6, Scala, Kafka, Oozie, SQL Server, Jenkins, Nexus, Shell, Java, Eclipse.

ETL/Informatica Developer

Confidential

Responsibilities:

  • Experienced with Informatica 10.2; performed workflow performance tuning using pushdown optimization on both the source and target sides.
  • Developed new mapping designs using various tools in Informatica like Source Analyzer, Warehouse Designer, Mapplet Designer and Mapping Designer.
  • Developed the mappings using transformations in Informatica according to technical specifications.
  • Created, optimized, reviewed and executed Teradata SQL test queries to validate transformation rules used in source-to-target mappings/source views and to verify data in target tables (an illustrative reconciliation harness is sketched after this list).
  • Performed data manipulations using various Informatica Transformations like Filter, Expression, Lookup (Connected and Un-Connected), Aggregate, Update Strategy, Normalizer, Joiner, Router, Sorter and Union.
  • Created mappings and Mapplets according to business requirements using the Informatica Big Data edition, deployed them as applications and exported them to PowerCenter for scheduling.
  • Designed, developed and implemented the detailed layout of ETL testing plan procedures.
  • Designed Audit, Balance and Control (ABC) for the ETL process.
  • Provided estimations for ETL deliverables and oversaw progress to ensure quality ETL deliverables.
  • Designed and developed Informatica mappings and sessions based on business user requirements and business rules to load data from source to target tables.
  • Worked on Informatica pushdown optimization, as well as mapping-, session- and table-level optimization.
  • Created various UNIX Shell Scripts for pre/post session commands for automation of loads using Tidal.
  • Implemented slowly changing dimension methodology for accessing the full history of accounts.
  • Scheduled Informatica jobs and implemented dependencies as necessary using Autosys.
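
Source-to-target validation in this project was done with Teradata SQL test queries; purely as an illustration, a small Python harness for running such reconciliation checks over ODBC could look like the sketch below, where the DSN names and sample queries are hypothetical.

```python
# Illustration only: a small harness for running source-vs-target reconciliation
# queries (the kind of checks done with Teradata SQL test queries above).
# DSN names and the sample queries are hypothetical placeholders.
import pyodbc

CHECKS = [
    # (description, source query, target query) -- placeholder queries
    ("customer row count",
     "SELECT COUNT(*) FROM src_schema.customers",
     "SELECT COUNT(*) FROM dw_schema.dim_customer"),
    ("total order amount",
     "SELECT SUM(order_amt) FROM src_schema.orders",
     "SELECT SUM(order_amt) FROM dw_schema.fact_orders"),
]

def run_scalar(dsn: str, query: str):
    # Run a single-value query against the ODBC data source named by `dsn`
    with pyodbc.connect(f"DSN={dsn}") as conn:
        return conn.cursor().execute(query).fetchone()[0]

def reconcile(source_dsn: str, target_dsn: str) -> bool:
    all_ok = True
    for name, src_sql, tgt_sql in CHECKS:
        src_val = run_scalar(source_dsn, src_sql)
        tgt_val = run_scalar(target_dsn, tgt_sql)
        ok = src_val == tgt_val
        all_ok = all_ok and ok
        print(f"{name}: source={src_val} target={tgt_val} {'OK' if ok else 'MISMATCH'}")
    return all_ok

if __name__ == "__main__":
    reconcile("ORACLE_SRC_DSN", "TERADATA_TGT_DSN")  # placeholder DSN names
```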

Environment: Informatica PowerCenter 10.2.0, Oracle XE, PL/SQL, Teradata, Flat files, Erwin, UNIX, Oracle SQL Developer, Tidal.
