
Data Engineer Resume


Florida

SUMMARY

  • IT experience in all phases of the Software Development Life Cycle (SDLC), with skills in data analysis, design, development, testing, and deployment of software systems.
  • Strong experience working with HDFS, Spark, Map Reduce, Hive, Pig, YARN, Oozie, Sqoop, Flume, Kafka, and NoSQL databases like HBase and Cassandra.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Worked on Data Migration Projects
  • Experience in both On-premises and Cloud Environments.
  • Prepared documentation for all the requirements and enhancements to reports.
  • Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig using Python (a Hive TRANSFORM sketch follows this list).
  • Experience in developing Map Reduce programs using Apache Hadoop to analyse big data as per requirements.
  • Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
  • Good experience in implementing and orchestrating data pipelines using Oozie and Airflow (an Airflow DAG sketch follows this list).
  • Worked with Cloudera and Hortonworks distributions.
  • Good knowledge in using cloud services like Amazon EMR, S3, EC2, Redshift, and Athena.
  • Strong expertise in building scalable applications using various programming languages (Java, Scala, and Python).
  • Strong understanding of Distributed systems design, HDFS architecture, internal working details of Map Reduce and Spark processing frameworks.
  • Solid experience developing Spark applications for performing highly scalable data transformations using RDDs, DataFrames, Spark SQL, and Spark Streaming.
  • Strong experience troubleshooting Spark failures and fine-tuning long running Spark applications.
  • Strong experience working with various Spark configurations, such as broadcast thresholds, increased shuffle partitions, caching, and repartitioning, to improve job performance (a tuning sketch follows this list).
  • Worked on Spark Streaming and Structured Streaming with Kafka for real-time data processing (a streaming sketch follows this list).
  • In-depth knowledge of import/export of data from databases using Sqoop.
  • Well versed in writing complex hive queries using analytical functions.
  • Knowledge in writing custom UDFs in Hive to support custom business requirements.
  • Experienced in working with structured data using HiveQL, join operations, writing custom UDFs and optimizing Hive queries.
  • Solid experience in using the various file formats like CSV, TSV, Parquet, ORC, JSON and AVRO.
  • Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, Amazon EMR) to fully implement and leverage various Hadoop services.
  • Strong experience in working with databases like Oracle, MySQL, Teradata, and Netezza, and proficiency in writing complex SQL queries.
  • Proficient in Core Java concepts like Multi-threading, Collections and Exception Handling concepts.
  • Strong team player with good communication, analytical, presentation and inter-personal skills.
  • Experienced working with JIRA for project management, GIT for source code management, JENKINS for continuous integration and Crucible for code reviews.
  • Excellent communication and analytical skills; a quick learner with the capacity to work independently and a highly motivated team player.
  • Experience in version control tools like SVN, GitHub and CVS.
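
A minimal sketch of the Python-based Hive UDFs mentioned above, written the usual way as a streaming script invoked through Hive's TRANSFORM clause. The table and columns (customer_id, phone) are illustrative only.

    #!/usr/bin/env python
    # clean_phone.py -- used from Hive roughly as:
    #   ADD FILE clean_phone.py;
    #   SELECT TRANSFORM (customer_id, phone) USING 'python clean_phone.py'
    #          AS (customer_id, phone_clean) FROM customers;
    import re
    import sys

    for line in sys.stdin:
        customer_id, phone = line.rstrip("\n").split("\t")  # Hive streams rows tab-separated
        digits = re.sub(r"\D", "", phone)                   # keep digits only
        print("%s\t%s" % (customer_id, digits))             # tab-separated rows back to Hive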
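
For the Oozie/Airflow orchestration bullet above, a minimal Airflow 1.10-style DAG sketch that chains a Sqoop import to a Hive load. The DAG id, Airflow variable, table, and script paths are hypothetical.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_ingest",            # hypothetical pipeline name
        default_args=default_args,
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        extract = BashOperator(
            task_id="sqoop_extract",
            bash_command=(
                "sqoop import --connect {{ var.value.jdbc_url }} "
                "--table ORDERS --target-dir /data/raw/orders/{{ ds }}"
            ),
        )

        load = BashOperator(
            task_id="hive_load",
            bash_command="hive -f /scripts/load_orders_partition.hql --hivevar dt={{ ds }}",
        )

        extract >> load  # run the Hive load only after the Sqoop import succeeds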
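
A small PySpark sketch of the tuning knobs listed above (broadcast threshold, shuffle partitions, repartitioning, caching). The paths, join key, and numbers are placeholders to be set from the actual data volume and cluster size.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuned-etl")
        # Broadcast dimension tables up to ~64 MB instead of shuffling them.
        .config("spark.sql.autoBroadcastJoinThreshold", 64 * 1024 * 1024)
        # More shuffle partitions for wide aggregations over large inputs.
        .config("spark.sql.shuffle.partitions", 800)
        .getOrCreate()
    )

    facts = spark.read.parquet("hdfs:///warehouse/facts")   # illustrative paths
    dims = spark.read.parquet("hdfs:///warehouse/dims")

    # Repartition on the join key to spread skew, then cache the reused result.
    joined = (
        facts.repartition(800, "customer_id")
             .join(dims, "customer_id")
             .cache()
    )
    joined.count()  # materialize the cache before downstream actions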
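
And a Structured Streaming sketch for the Kafka bullet: read a topic, parse JSON events, and append Parquet with checkpointing. The broker, topic, schema, and paths are assumptions, and running it requires the spark-sql-kafka connector package.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

    # Hypothetical event layout; real topics and fields are not part of the resume.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events")
        .load()
    )

    # Kafka delivers bytes; cast the value and parse the JSON payload into columns.
    events = (
        raw.select(from_json(col("value").cast("string"), schema).alias("e"))
           .select("e.*")
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///streams/events")
        .option("checkpointLocation", "hdfs:///checkpoints/events")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()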

TECHNICAL SKILLS

Big Data Tools (Hadoop Ecosystem): Map Reduce, Spark 2.3, Airflow 1.10.8, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0.

BI Tools: SSIS, SSRS, SSAS.

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, and UNIX shell scripting.

Cloud Platform: AWS, Google Cloud.

Databases: Oracle 12c/11g, Teradata R15/R14.

OLAP Tools: Tableau, SSAS, Business Objects

ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Florida

Data Engineer

Environment: Python, Hadoop, Teradata, Unix, Google Cloud, DB2, PL/SQL, MS SQL Server, Ab Initio ETL, Data Mapping, Spark, Tableau, Nebula Metadata, Scala, Git.

Responsibilities:

  • Gathered data and business requirements from end users and management. Designed and built data solutions to migrate existing source data from Teradata and DB2 to BigQuery (Google Cloud Platform).
  • Performed data manipulation on extracted data using Python Pandas (a Pandas sketch follows this list).
  • Work with subject matter experts and project team to identify, define, collate, document and communicate the data migration requirements.
  • Built custom Tableau/SAP Business Objects dashboards for Salesforce that accept parameters from Salesforce and show the relevant data for the selected object.
  • Hands-on Ab Initio ETL, data mapping, transformation, and loading in a complex and high-volume environment.
  • Design Sqoop scripts to load data from Teradata and DB2 into the Hadoop environment, and shell scripts to transfer data from Hadoop to Google Cloud Storage (GCS) and from GCS to BigQuery (a BigQuery load sketch follows this list).
  • Validate Sqoop jobs and shell scripts and perform data validation to check that data is loaded correctly without any discrepancy. Perform migration and testing of static data and transaction data from one core system to another.
  • Develop best practices, processes, and standards for effectively carrying out data migration activities. Work across multiple functional projects to understand data usage and implications for data migration.
  • Prepare data migration plans including migration risk, milestones, quality and business sign-off details.
  • Oversee the migration process from a business perspective. Coordinate between leads, process manager and project manager. Perform business validation of uploaded data.
  • Worked on retrieving data from FS to S3 using Spark commands (a Spark-to-S3 sketch follows this list).
  • Built S3 buckets, managed policies for S3 buckets, and used S3 and Glacier for storage and backup on AWS.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats like text files and CSV files.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the requirement.
  • Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
  • Optimizing the performance of dashboards and workbooks in Tableau desktop and server.
  • Proposed EDW architecture changes to the team and highlighted the benefits to improve the performance and enable troubleshooting without affecting the Analytical systems
  • Involved in debugging, monitoring, and troubleshooting issues.
  • Analyze data, identify anomalies, and provide usable insight to customers.
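
A small sketch of the kind of Pandas manipulation referred to above, applied to an extracted file before the cloud load; the file names, columns, and filter are hypothetical.

    import pandas as pd

    # Hypothetical extract pulled from Teradata/DB2 before loading to BigQuery.
    df = pd.read_csv("accounts_extract.csv", dtype={"account_id": str})

    df["open_date"] = pd.to_datetime(df["open_date"], errors="coerce")  # normalize dates
    df["balance"] = df["balance"].fillna(0).round(2)                    # fill missing balances
    df = df.drop_duplicates(subset="account_id")                        # dedupe on the key

    # Keep only active accounts and write the cleansed file for the load step.
    df[df["status"] == "ACTIVE"].to_csv("accounts_clean.csv", index=False)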
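
The GCS-to-BigQuery loads above were driven by shell scripts; as an illustration of the same step using the google-cloud-bigquery Python client, a sketch with hypothetical project, bucket, and table names.

    from google.cloud import bigquery

    client = bigquery.Client(project="analytics-prod")  # hypothetical project

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # infer the schema from the staged files
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    # Load the files staged in GCS by the Hadoop-to-GCS transfer step.
    load_job = client.load_table_from_uri(
        "gs://dw-staging/teradata/orders/*.csv",  # hypothetical bucket/path
        "analytics-prod.warehouse.orders",
        job_config=job_config,
    )
    load_job.result()  # block until the load finishes
    print(client.get_table("analytics-prod.warehouse.orders").num_rows, "rows loaded")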
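
And a PySpark sketch of the FS-to-S3 retrieval mentioned above: read the extracted CSVs and write Parquet to S3. The paths, partition column, and bucket are illustrative, and the s3a:// write assumes the hadoop-aws connector and AWS credentials are configured.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fs-to-s3").getOrCreate()

    # Read the extracted files from the cluster file system (schema inferred for brevity).
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("hdfs:///data/exports/daily/")  # illustrative source path
    )

    # Write back out as Parquet to an S3 bucket for storage/backup.
    (
        df.write
        .mode("overwrite")
        .partitionBy("load_date")            # hypothetical partition column
        .parquet("s3a://backup-bucket/daily/")
    )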

Confidential

Hadoop Developer

Environment: RHEL, HDFS, Map-Reduce, Hive, Pig, Sqoop, Flume, Oozie, Mahout, HBase, Hortonworks data platform distribution, Cassandra.

Responsibilities:

  • Involved in design and development phases of Software Development Life Cycle (SDLC) using Scrum methodology.
  • Involved in Requirement gathering, Business Analysis and translated business requirements into technical design in Hadoop and Big Data.
  • Importing and exporting data into HDFS from database and vice versa using Sqoop.
  • Developed data pipeline using Flume, Sqoop, Pig and Java Map Reduce to ingest behavioral data into HDFS for analysis.
  • Used Maven extensively for building jar files of Map Reduce programs and deployed to Cluster.
  • Created a customized BI tool for the manager team that performs query analytics using HiveQL.
  • Created partitions and buckets based on State for further processing using bucket-based Hive joins.
  • Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using the MRUnit testing library.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java map-reduce, Hive, Pig, and Sqoop.
  • Designed and implemented a Cassandra NoSQL based database that persists high-volume user profile data.
  • Migrated high-volume OLTP transactions from Oracle to Cassandra
  • Created Data Pipeline of Map Reduce programs using Chained Mappers.
  • Implemented optimized joins over different data sets to get top claims based on state using Map Reduce (a Hadoop Streaming sketch follows this list).
  • Modeled Hive partitions extensively for data separation and faster data processing and followed Pig and Hive best practices for tuning.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Loaded the aggregated data into DB2 for reporting on the dashboard.
  • Used Pig as ETL tool to do transformations, event joins, filters and some pre-aggregations before storing the data into HDFS.
  • Implemented optimization and performance tuning in Hive and Pig.
  • Developed job flows in Oozie to automate the workflow for extraction of data from warehouses and weblogs.
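
The joins and aggregations in this role were written in Java Map Reduce; purely as an illustration of the claims-by-state pattern, a minimal Hadoop Streaming sketch in Python follows (column positions and HDFS paths are hypothetical).

    #!/usr/bin/env python
    # mapper.py -- emit (state, claim_amount) pairs from tab-separated claim records.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        state, amount = fields[2], fields[5]  # hypothetical column positions
        print("%s\t%s" % (state, amount))

    #!/usr/bin/env python
    # reducer.py -- sum claim amounts per state (input arrives sorted by key).
    import sys

    current_state, total = None, 0.0
    for line in sys.stdin:
        state, amount = line.rstrip("\n").split("\t")
        if state != current_state:
            if current_state is not None:
                print("%s\t%.2f" % (current_state, total))
            current_state, total = state, 0.0
        total += float(amount)
    if current_state is not None:
        print("%s\t%.2f" % (current_state, total))

    # Submitted with the streaming jar, e.g.:
    #   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    #     -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
    #     -input /data/claims -output /data/claims_by_state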
