Data Engineer Resume

PA

SUMMARY

  • Data Engineer with 8+ years of IT experience designing and implementing complete end-to-end Hadoop ecosystems, including HDFS, YARN, Hive, HBase, Sqoop, Oozie, Spark, and Kafka.
  • Handled upgrades of Apache Ambari, CDH, and HDP clusters.
  • Excellent hands-on knowledge of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode.
  • Hands-on experience with real-time streaming into HDFS using Spark Streaming and Kafka.
  • Strong experience handling large datasets using partitioning, Spark in-memory capabilities, and broadcast variables (see the PySpark sketch at the end of this summary).
  • Good experience with Talend (Talend Data Fabric, Talend Enterprise Service Bus, Talend Data Integration).
  • Experience with ETL processes.
  • Migrated an on-premises data warehouse to Azure Synapse using Azure Data Factory and Azure Data Lake Storage Gen2.
  • Strong Knowledge of Spark concepts like RDD operations, Caching and Persistence.
  • Developed analytical components using Spark SQL and Spark Streaming.
  • Automated Parquet file generation using Azure Databricks.
  • Experience in Python and Scala, including writing user-defined functions (UDFs) for Hive in Python.
  • Expert in working with the Hive data warehouse: creating tables and distributing data through partitioning and bucketing.
  • Expertise in Hive functionality and in migrating data from databases such as Oracle, DB2, MySQL, and MongoDB.
  • Experienced working with Amazon EMR, Cloudera, and Hortonworks Hadoop distributions.
  • Hands-on Experience with AWS infrastructure services, Amazon Simple Storage Service (Amazon S3), and Amazon Elastic Compute Cloud (Amazon EC2).
  • Design & implement migration strategies for traditional systems on Azure (Lift and shift/Azure Migrate, other third-party tools) worked on Azure suite: Azure SQL Database, Azure Data Lake(ADLS), Azure Data Factory(ADF) V2, Azure SQL Data Warehouse, Azure Service Bus, Azure key Vault, Azure Analysis Service(AAS), Azure Blob Storage, Azure Search, Azure App Service, Azure data Platform Services.
  • Worked with DevOps pipeline using Jenkins, Urban code, Azure devops etc.
  • Extensive experience in using Continuous Integration tools like Jenkins and TravisCI.
  • Experienced with using most common Operators in Airflow - Python Operator, Bash Operator, Google Cloud Storage Download Operator, Google Cloud Storage Object Sensor, GoogleCloudStorageToS3Operator.
  • Worked on designing data pipeline architecture in Google Cloud platform (GCP).
  • Experienced in writing and implementing unit test cases using testing frameworks such as JUnit, EasyMock, and Mockito.
  • Experienced with the Oozie workflow scheduler, managing Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Hands-on experience in application development with Spark jobs in Scala and PySpark, Hive queries, RDBMS, and Linux shell scripting.
  • Experienced in writing build scripts using Maven and working with continuous integration systems like Jenkins.
  • Experience with Azure transformation projects, implementing ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
  • Experience with complex data processing in Spark.
  • Expertise in writing end-to-end Data Processing jobs to analyze data using Spark and Hive.
  • Hands-on experience with NoSQL databases like HBase and Cassandra, and relational databases like Oracle and MySQL.
  • Deep Analytics and understanding of Big Data and Algorithms using Hadoop, Spark, NoSQL, and distributed computing tools.
  • Experienced in data warehousing ETL concepts using Informatica PowerCenter, OLAP, OLTP, and AutoSys.
  • Good understanding of XML methodologies (XML, XSL, XSD) including Web Services and SOAP.
  • Experienced in designing both time-driven and data-driven automated workflows using Oozie to run Hadoop jobs.
  • Good experience in project impact assessment, project schedule planning, onsite-offshore team coordination, and end-user coordination, from requirements gathering through live support.
  • Successful in meeting new technical challenges and finding solutions to meet the needs of the customer.
  • Successful working in fast-paced environments, both independently and in collaborative teams.
  • Strong Business, Analytical, and Communication Skills.
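
The following is a minimal PySpark sketch of the broadcast-join pattern referenced above for handling large datasets; the table paths, the 200-partition count, and the customer_id join key are assumptions for illustration, not a specific project's code.

```python
# A minimal sketch, assuming illustrative paths and a customer_id join key.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table
customers = spark.read.parquet("/data/customers")  # small dimension table

# Repartition the large side on the join key, then broadcast the small side so
# every executor holds a local copy and the join avoids shuffling it.
enriched = (
    orders.repartition(200, "customer_id")
          .join(broadcast(customers), "customer_id", "left")
          .cache()  # keep the joined result in memory for downstream reuse
)

enriched.write.mode("overwrite").parquet("/data/orders_enriched")
```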

TECHNICAL SKILLS

Hadoop Ecosystem: HDFS, Hive, Sqoop, Spark, ZooKeeper, YARN, Kafka, NiFi, Oozie, MapReduce, Impala

Programming Languages: C++, Java, Python, SQL, PySpark, Spark with Scala, UNIX shell scripting

Big Data Platforms: Hortonworks, Cloudera, Talend

AWS Platform: EC2, S3, EMR, Redshift, Glue

Operating Systems: Linux, Windows, Unix

Databases: Oracle, MS SQL Server, MySQL, DynamoDB, PostgreSQL; Data Formats: JSON, XML

Cloud: AWS, GCP, Azure (Data Factory, Databricks, Dataflow, SQL DB), Snowflake

Tools: Jenkins, Maven, ANT

IDEs: Eclipse, NetBeans

Designing Tools: UML

Version Tools: Git

Others: PuTTY, WinSCP, AWS, Tableau

PROFESSIONAL EXPERIENCE

Confidential

Data Engineer

Responsibilities:

  • Involved in Requirement Gathering, Business Analysis, and Translated Business requirements into Technical Design in Hadoop and Big Data.
  • Hands-on experience working with Hadoop ecosystem components such as Hive, Sqoop, Spark, and Kafka.
  • In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, and Spark Streaming.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Worked on different file formats like JSON, CSV, and XML using Spark SQL.
  • Imported data from relational databases into HDFS and exported it back using Sqoop.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Developed Spark Python modules for machine learning & predictive analytics in Hadoop on Azure.
  • Worked with both CDH4 and CDH5 applications; transferred large datasets between development and production clusters.
  • Created partitions and buckets for both managed and external Hive tables to optimize performance (see the sketch after this list).
  • Implemented Hive join queries to join multiple tables of a source system and load them into Elasticsearch.
  • Built reusable Hive UDF libraries for business requirements, enabling users to apply these UDFs in Hive queries.
  • Developed Data Lake as a Data Management Platform for Hadoop.
  • Used Talend to run the ETL processes instead of Hive queries.
  • Successfully moved data from Hadoop to Cassandra using the BulkOutputFormat class.
  • Extracted data from a Teradata database and loaded it into the data warehouse using Spark JDBC.
  • Handled Data Movement, Data transformation, Analysis, and visualization across the lake by integrating it with various tools.
  • Experienced in working with code repositories such as Git.
  • Good understanding of NoSQL databases and ETL processes, with hands-on experience writing applications on NoSQL databases like HBase, Cassandra, and MongoDB.
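
A minimal PySpark sketch of the Hive partitioning and bucketing approach noted above, written with Spark's DataFrameWriter against the Hive metastore; the sales database, the staging and target table names, and the column names are illustrative assumptions.

```python
# A minimal sketch, assuming an illustrative sales schema in the Hive metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partition-bucket-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Source data; in practice this would come from a staging table or raw files.
txns = spark.table("sales.staging_transactions")

(
    txns.write
    .mode("overwrite")
    .partitionBy("txn_date")            # one partition directory per txn_date
    .bucketBy(32, "customer_id")        # 32 buckets hashed on customer_id
    .sortBy("customer_id")              # sort rows within each bucket
    .format("parquet")
    .saveAsTable("sales.transactions")  # managed table in the metastore
)
```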

Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Python, Kafka, Hive, Sqoop, Azure cloud (ADF, Data Lake Gen2, Databricks), Impala, Cassandra, Snowflake, Tableau, Talend, Oozie, Jenkins, Cloudera, Oracle 12c, Linux.

Confidential, PA

Data Engineer

Responsibilities:

  • Responsible for Building Scalable Distributed Data solutions using Hadoop.
  • Real-time data processing with Kafka, Spark Streaming, and Spark Structured Streaming; worked on Spark SQL, Structured Streaming, and MLlib, using the core Spark API to build data pipelines; implemented and fine-tuned Spark streaming applications to reduce shuffling (see the sketch after this list).
  • Handled large datasets using partitioning, Spark in-memory capabilities, broadcast variables, efficient joins, and transformations during the ingestion process itself.
  • Worked on performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
  • Worked on a cluster of 105 nodes.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Handled importing of data from various data sources, performed transformations using Hive, loaded the data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Implemented schema extraction for Parquet and Avro file formats in Hive.
  • Created Hive tables and loaded data from the local file system into HDFS.
  • Used Hive to perform transformations, event joins, and pre-aggregations before storing the data in HDFS.
  • Implemented Partitioning, Dynamic Partitions, and Buckets on huge datasets to analyze and compute various metrics for reporting.
  • Involved in HBase setup and storing data into HBase for future analysis.
  • Good experience working with Tableau and Spotfire, enabling JDBC/ODBC connectivity from those tools to Hive tables.
  • Used Oozie workflows to coordinate Pig and Hive scripts.
  • Used Stash (Bitbucket) for code control and worked on AWS components such as Airflow, Glue, Elastic MapReduce (EMR), Athena, and Snowflake.
  • Wrote Hive queries for ad hoc data analysis to meet business requirements.
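
A minimal sketch of a Spark Structured Streaming job reading from Kafka and landing Parquet files in HDFS, as referenced above; the broker address, topic, schema, and paths are assumptions for illustration.

```python
# A minimal sketch, assuming an illustrative broker, topic, schema, and paths;
# the spark-sql-kafka connector package must be available on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

schema = (
    StructType()
    .add("event_id", StringType())
    .add("event_type", StringType())
    .add("value", DoubleType())
)

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .trigger(processingTime="1 minute")  # batch interval tuned per workload
    .start()
)
query.awaitTermination()
```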

Environment: HDFS, Hive, Sqoop, HBase, Oozie, Flume, Kafka, Zookeeper, Amazon AWS, Spark SQL, Spark DataFrames, PySpark, Python, JSON, SQL scripting and Linux shell scripting, Tableau, Avro, Parquet, Hortonworks.

Confidential - Denver, CO

Data Engineer

Responsibilities:

  • Worked on analyzing Hadoop clusters and different big data analytic tools including Hive, HBase database, and Sqoop.
  • Configured Sqoop Jobs to import data from RDBMS into HDFS using Oozie workflows.
  • Experienced in job management using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
  • Loaded and transformed huge datasets of structured and semi-structured data using Hive.
  • Responsible for developing a data pipeline on Amazon AWS to extract data from weblogs and store it in HDFS (see the sketch after this list).
  • Created Hive tables and Developed Hive queries for De-normalizing the Data.
  • Created Hive Scripts to sort, group, join and filter the enterprise-wise data.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
  • Created batch analysis job prototypes using Hadoop, Oozie, Hue, and Hive.
  • Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
  • Worked on root cause analysis for all issues occurring in batch and provided permanent fixes.
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action.
  • Created and maintained technical documentation for all the workflows.
  • Created database access layer using JDBC and SQL stored procedures.
  • Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.
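
A minimal PySpark sketch of the kind of weblog extraction described above; the S3 source bucket, the Apache common-log pattern, and the HDFS output path are assumptions for illustration.

```python
# A minimal sketch, assuming an illustrative S3 source and common-log format.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("weblog-ingest-sketch").getOrCreate()

# host ident authuser [timestamp] "METHOD url PROTOCOL" status bytes
LOG_PATTERN = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)'

logs = spark.read.text("s3a://example-bucket/weblogs/")

parsed = logs.select(
    regexp_extract("value", LOG_PATTERN, 1).alias("client_ip"),
    regexp_extract("value", LOG_PATTERN, 2).alias("timestamp"),
    regexp_extract("value", LOG_PATTERN, 3).alias("method"),
    regexp_extract("value", LOG_PATTERN, 4).alias("url"),
    regexp_extract("value", LOG_PATTERN, 5).cast("int").alias("status"),
)

parsed.write.mode("append").partitionBy("status").parquet("hdfs:///data/weblogs")
```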

Environment: Hadoop YARN, Hive, Sqoop, Amazon AWS, Java, Python, Oozie, Jenkins, Cassandra, Oracle 12c, Linux.

Confidential

Data Engineer

Responsibilities:

  • Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
  • Used Oozie to orchestrate the workflow.
  • Involved in loading data from the LINUX file system to HDFS.
  • Analyzed data using Hadoop components such as Hive.
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers after aggregations for other ETL operations.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics (see the sketch after this list).
  • Moved data from Oracle and MS SQL Server into HDFS using Sqoop, imported various formats of flat files into HDFS, and performed HBase integration in Spark.
  • Designed and built infrastructure for the Google Cloud environment from scratch.
  • Leveraged cloud and GPU computing technologies, such as AWS and GCP, for automated machine learning and analytics pipelines.
  • Mentored the analyst and test teams in writing Hive queries.
  • Implemented test scripts to support test-driven development and continuous integration.
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Gained excellent hands-on knowledge of Hadoop clusters, Spark, and data migration concepts in Hive.
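
A minimal Spark SQL sketch of the kind of Hive query used to compare fresh data against EDW reference tables, as mentioned above; the edw database, table names, and column names are illustrative assumptions.

```python
# A minimal sketch, assuming illustrative EDW table and column names in Hive.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("trend-comparison-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

trends = spark.sql("""
    SELECT cur.product_id,
           cur.units_sold,
           ref.avg_units_sold,
           cur.units_sold - ref.avg_units_sold AS delta_vs_history
    FROM   edw.daily_sales          cur
    JOIN   edw.historical_sales_avg ref
      ON   cur.product_id = ref.product_id
    WHERE  cur.sale_date = current_date()
    ORDER  BY delta_vs_history DESC
""")

trends.show(20, truncate=False)
```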

Environment: Hadoop, Spark, Java, HDFS, Sqoop, Hive, Cloudera, HBase, Tableau, Linux, XML, MySQL Workbench, Eclipse, GCP (BigQuery, GCS buckets, Cloud Functions), Oracle 10g, PL/SQL, SQL*Plus.

Confidential

Data Analyst

Responsibilities:

  • Interacted with business users to identify and understand business requirements and identified the scope of the projects.
  • Identified and designed business Entities and attributes and relationships between the Entities to develop a logical model and later translated the model into a physical model.
  • Developed normalized Logical and Physical database models for designing an OLTP application.
  • Enforced referential integrity (RI) for consistent relationships between parent and child tables; worked with users to identify the most appropriate source of record and profiled the data required for sales and service.
  • Involved in defining the business/transformation rules applied for ICP data.
  • Defined the list codes and code conversions between the source systems and the data mart.
  • Developed the financial reporting requirements by analyzing the existing Business Objects reports.
  • Utilized Informatica toolset (Informatica Data Explorer, and Informatica Data Quality) to analyze legacy data for data profiling.
  • Reverse-engineered the data models, identified the data elements in the source systems, and added new data elements to the existing data models.
  • Created XSDs for applications to connect the interface and the database.
  • Compared data with original source documents and validated data accuracy.
  • Used reverse engineering to create Graphical Representation (E-R diagram) and to connect to the existing database.
  • Generated weekly and monthly asset inventory reports.
  • Evaluated data profiling, cleansing, integration, and extraction tools (e.g., Informatica).
  • Coordinated with business users to design the new reporting needs in an appropriate, effective, and efficient way, based on the existing functionality.
  • Assessed the impact of low-quality and/or missing data on the performance of the data warehouse client.
  • Worked with nzload to load flat-file data into Netezza tables.
  • Good understanding of Netezza architecture.
  • Identified design flaws in the data warehouse.
  • Executed DDL to create databases, tables, and views.
  • Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis (see the sketch after this list).
  • Involved in data mapping activities for the data warehouse.
  • Created and configured workflows, worklets, and sessions to transport the data to target Netezza warehouse tables using Informatica Workflow Manager.
  • Extensively worked on Performance Tuning and understanding Joins and Data distribution.
  • Experienced in generating and documenting Metadata while designing applications.
  • Coordinated with DBAs and generated SQL codes from data models.
  • Generated reports for better communication between business teams.
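
A minimal Python sketch of the kind of ad hoc SQL used for the analytical and inventory reports above; pyodbc, the DSN and credentials, and the asset_inventory table and its columns are assumptions for illustration only.

```python
# A minimal sketch, assuming a configured ODBC DSN and an illustrative table.
import pyodbc

# Connect through an ODBC data source configured for the reporting database.
conn = pyodbc.connect("DSN=reporting_dw;UID=report_user;PWD=example")
cursor = conn.cursor()

cursor.execute("""
    SELECT asset_type,
           COUNT(*)          AS asset_count,
           MAX(last_updated) AS most_recent_update
    FROM   asset_inventory
    GROUP  BY asset_type
    ORDER  BY asset_count DESC
""")

for asset_type, asset_count, most_recent_update in cursor.fetchall():
    print(f"{asset_type}: {asset_count} assets (last updated {most_recent_update})")

conn.close()
```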

Environment: SQL Server, Oracle 9i, MS Office, Embarcadero, Crystal Reports, Netezza, Teradata, Enterprise Architect, Toad, Informatica, Tableau, ER Studio, XML, OBIEE.
