
Sr. Data Engineer Resume


St. Louis, MO

SUMMARY:

  • Experienced in Big Data implementation with strong knowledge of major components of the Hadoop ecosystem, including Hadoop MapReduce, HDFS, Sqoop, Hive, HBase, Oozie, Spark, Scala, and Python.
  • Good exposure to NoSQL databases such as HBase, Cassandra, and ScyllaDB.
  • Expertise in Amazon Web Services (AWS) cloud infrastructure, including EC2, S3, EMR, Glue, SNS, SQS, Lambda, Athena, Amazon RDS, and various other services.
  • Worked on various Hadoop distributions (Cloudera, EMR, etc.) to fully implement and leverage new Hadoop features.
  • Hands-on experience with Google Cloud Platform (GCP) big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer (Airflow as a service).
  • Skilled in Hadoop architecture and ecosystem, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and YARN.
  • Comprehensive experience importing and exporting data between RDBMS and HDFS using Sqoop.
  • Good understanding of and hands-on experience with Spark abstractions such as RDDs, DataFrames, Datasets, and Spark SQL.
  • Extensively worked on structured data using HiveQL: join operations, writing custom UDFs, and optimizing Hive queries.
  • Experienced in performing in-memory data processing for batch, real-time, and advanced analytics using Apache Spark (Spark Core, Spark SQL, and Spark Streaming).
  • Ingested data into Hadoop from various data sources such as Oracle, MySQL, and Teradata using Sqoop.
  • Experienced in Agile and Waterfall methodologies in project execution.
  • Experienced in securing Hadoop clusters with Kerberos and integrating with LDAP/AD at the enterprise level.
  • Working experience developing User Defined Functions (UDFs) for the Apache Hive data warehouse using Java, Scala, and Python.
  • Involved in best practices for Cassandra, migrating the application to the Cassandra database from the legacy platform for Choice, and upgrading to Cassandra 3.
  • Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
  • Good understanding of XML methodologies (XML, XSL, XSD) including Web Services and SOAP.
  • Used the Spark-Cassandra Connector to load data to and from Cassandra.
  • Hands-on experience in Apache Spark creating RDDs and DataFrames, applying transformations and actions, and converting RDDs to DataFrames.
  • Migrated various Hive UDFs and queries to Spark SQL for faster execution.
  • Experience in data processing: collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
  • Experience working with GitHub/Git 2.12 source and version control systems.
  • Experience in using Apache Kafka for log aggregation.
  • Developed a data pipeline using Kafka and Spark Streaming to store data in HDFS and performed real-time analytics on the incoming data (a minimal sketch follows this summary).
  • Experience importing real-time data into Hadoop using Kafka and implementing Oozie jobs for daily imports.
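A minimal illustrative sketch of the Kafka-to-HDFS pipeline described above, using PySpark's Structured Streaming API as one possible approach; the broker address, topic name, and HDFS paths are placeholders, and the spark-sql-kafka connector package is assumed to be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Read events from a Kafka topic (broker and topic names are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    # Kafka delivers the payload as binary; cast it to a string before storing.
    parsed = events.select(col("value").cast("string").alias("payload"), col("timestamp"))

    # Append the stream to HDFS as Parquet; a checkpoint directory is required.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()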

TECHNICAL SKILLS:

Hadoop/Big Data: HDFS, MapReduce, Hive, Impala, Spark-SQL, HBase, Kafka, Sqoop, Spark Streaming, Oozie, ZooKeeper, Hue, Scala, PySpark, Splunk; Hadoop distributions: Cloudera (CDH 12.2), Amazon AWS.

Programming/Scripting Languages: Core Java, Linux shell scripts, Python, Scala.

Databases: MySQL, PL/SQL, SQL Developer, Teradata, HBase

ETL: Ab Initio, Informatica

Real Time/Stream Processing: Apache Spark

Build Tools: Maven, SBT

Cloud: AWS, GCP, S3

PROFESSIONAL EXPERIENCE:

Confidential - St. Louis, MO

Sr. Data Engineer

Responsibilities:

  • Developed PySpark pipelines that transform raw data into useful flattened datasets.
  • Created data pipelines that read millions of JSON files with nested data structures using Spark and flattened them according to the business use case (see the flattening sketch after this list).
  • Created a Python framework that cleans the raw files in the Foundry platform.
  • Orchestrated processes in AWS using Glue ETL jobs and Lambda functions so that data flows without manual intervention.
  • Transformed complex SQL into PySpark while migrating the processes from AWS to Palantir Foundry.
  • Created Spark libraries and utilities for transforming datasets from multiple data sources to the sink.
  • Worked on creating pipelines in AWS using Glue, Lambda, S3, and AWS Step Functions that convert the data into useful formats.
  • Worked on analyzing data on the Hadoop cluster using different big data analytics tools, including Spark (Spark SQL, spark-shell), the Hive data warehouse, and Impala.
  • Implemented Spark jobs using Scala, utilizing the DataFrame and Spark SQL APIs along with RESTful APIs for faster processing of data.
  • Used Spark for fast in-memory data processing and performed joins (broadcast hash, sort-merge), pivots (data transpose), and complex transformations on terabytes of data (see the join sketch after this list).
  • Developed POCs on Spark Streaming to ingest flat files automatically when files landed in the edge node landing zone.
  • Created data catalog tables that provide an overview of where data originated and where it was sent.
  • Implemented Spring Boot microservices to process messages into the Kafka cluster and send them to downstream teams for further processing.
  • Agile Scrum team member with a T-shaped skill set as a Technical Data Analyst / ETL developer on big data applications.
  • Used the Spring Kafka API to process messages smoothly on the Kafka cluster.
  • Worked on partitioning Kafka messages and setting up replication factors in the Kafka cluster.
  • Worked diligently with the Kafka admin team to set up the Kafka cluster and implemented Kafka producer and consumer applications on it with the help of ZooKeeper.
  • Configured AWS Lambda with multiple functions.
  • Integrated the Spark Streaming service with Kafka to load data into an HDFS location.
  • Used the Kafka HDFS connector to export data from Kafka topics to HDFS files in a variety of formats and integrated it with Apache Hive to make data immediately available for HQL querying.
  • Working knowledge of Bitbucket (as a version control repository) and Bamboo (for CI/CD).
  • Automated the data flow with Bash shell scripts, from pulling data out of databases to loading it into HDFS.
  • Involved in data ingestion into HDFS using Spark, Sqoop from variety of sources using the connectors like JDBC.
  • Good experience in writing Spark applications using Python.
  • Performed ETL using AWS Glue.
  • Used AWS Athena to Query directly from AWS S3.
  • Pre-processed large sets of structured and semi-structured data in different formats such as text files, Avro, Parquet, SequenceFiles, and JSON records, and used Snappy and LZ4 compression.
  • Worked with Oozie and Zookeeper to manage the flow of jobs and coordination in the cluster.
  • Used Git with Bitbucket for code versioning and review, and SonarQube for code analysis.
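A minimal illustrative sketch of the nested-JSON flattening described above; the input path and the schema fields (items, customer, sku, price) are hypothetical placeholders, not the actual business schema.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten-json").getOrCreate()

    # Read nested JSON files from the raw zone (path is a placeholder).
    raw = spark.read.json("s3://raw-bucket/events/*.json")

    # Explode the array column so each element becomes its own row, then
    # select leaf fields from the nested structs into a flat schema.
    flat = (raw
            .withColumn("item", explode(col("items")))
            .select(col("id"),
                    col("customer.name").alias("customer_name"),
                    col("customer.address.city").alias("city"),
                    col("item.sku").alias("sku"),
                    col("item.price").alias("price")))

    flat.write.mode("overwrite").parquet("s3://curated-bucket/events_flat/")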
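A minimal illustrative sketch of the join strategies mentioned above, assuming an active SparkSession named spark and hypothetical fact/dimension DataFrames.

    from pyspark.sql.functions import broadcast

    # Broadcast the small dimension table so Spark performs a broadcast hash join
    # instead of shuffling the large fact table across the cluster.
    joined = fact_df.join(broadcast(dim_df), on="customer_id", how="left")

    # Setting the auto-broadcast threshold to -1 disables automatic broadcasting,
    # so an equi-join on large tables falls back to a shuffle-based sort-merge join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)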

Environment: Spark, Python, Hadoop, Hive, S3, RDS, EMR, EC2, SNS, Lambda, Athena, Step Functions, Jenkins, Foundry, Git.

Confidential - Houston, TX

Data Engineer

Responsibilities:

  • Used Sqoop to import data into HDFS/Hive from multiple relational databases, performed operations and exported the results back.
  • Got involved in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).
  • Extensively used Spark Streaming to analyze sales data in real time over regular window intervals from sources like Kafka.
  • Performed Spark join optimizations; troubleshot, monitored, and wrote efficient code using Scala.
  • Used big data tools Spark (PySpark, Spark SQL) to conduct real-time analysis of insurance transactions.
  • Performed Spark transformations and actions on large datasets. Implemented Spark SQL to perform complex data manipulations and to work with large amounts of structured and semi-structured data stored in a cluster using DataFrames/Datasets.
  • Migrated previously written cron jobs to Airflow/Composer in GCP.
  • Created Hive tables based on business requirements. Wrote many Hive queries and UDFs and implemented concepts such as partitioning and bucketing for efficient data access, windowing operations, and more (see the partitioning sketch after this list).
  • Integrated Hive, Sqoop with HBase and performed transactional and analytical processing.
  • Configured, designed, implemented and monitored Kafka clusters and connectors. Wrote Kafka producers and consumers using Java.
  • Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
  • Involved in setting up the Apache Airflow service in GCP.
  • Implemented proof of concept (POC) for processing stream data using Kafka -> Spark -> HDFS.
  • Developed a data pipeline using Kafka, Spark, and Hive/ HDFS to ingest, transform and analyze data. Automated jobs using Oozie.
  • Generated Tableau dashboards and worksheets for large datasets.
  • Implemented custom interceptors for Flume to filter data, and defined channel selectors to multiplex the data into different sinks.
  • Implemented many Spark jobs and wrote function definitions, case classes, and object classes using Scala.
  • Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs (DataFrames).
  • Utilized Spark, Scala, and Python for querying and preparing data from big data sources.
  • Wrote pre-processing queries in Python for internal Spark jobs.
  • Involved in the process of Cassandra data modeling, performing data operations using CQL and Java.
  • Maintained and worked with the data pipeline that transfers and processes several terabytes of data using Spark, Scala, Python, Apache Kafka, Pig/Hive, and Impala.
  • Working experience in the Apache Hadoop and Spark frameworks, including the Hadoop Distributed File System, MapReduce, PySpark, and Spark SQL.
  • Built data pipelines in Airflow in GCP for ETL-related jobs using different Airflow operators (see the DAG sketch after this list).
  • Performed data integration with a goal of moving more data effectively, efficiently and with high performance to assist in business-critical projects using Talend Data Integration.
  • Used SQL queries and other data analysis methods to assess the quality of the data.
  • Exported the aggregated data onto Oracle using Sqoop for reporting on the Tableau dashboard.
  • Involved in QA, test data creation, and unit testing activities.
  • Implemented security on Hadoop cluster using Kerberos.
  • Involved in design, development and testing phases of Software Development Life Cycle.
  • Attended weekly meetings with technical collaborators and actively participated in code review sessions with senior and junior developers.
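A minimal illustrative sketch of the Hive partitioning and windowing work described above, issued through Spark SQL with Hive support; the database, table, and column names are placeholders. In Hive DDL, bucketing would be added with a CLUSTERED BY ... INTO N BUCKETS clause on the same table.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-partitioning").enableHiveSupport().getOrCreate()

    # Placeholder database; partition by load date so date filters prune whole partitions.
    spark.sql("CREATE DATABASE IF NOT EXISTS sales")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales.transactions (
            txn_id      STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS PARQUET
    """)

    # Load one static partition from a staging table (names are illustrative).
    spark.sql("""
        INSERT OVERWRITE TABLE sales.transactions PARTITION (load_date = '2020-01-01')
        SELECT txn_id, customer_id, amount
        FROM sales.staging_transactions
        WHERE load_date = '2020-01-01'
    """)

    # Windowing example: rank each customer's transactions by amount.
    spark.sql("""
        SELECT customer_id, txn_id, amount,
               ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk
        FROM sales.transactions
        WHERE load_date = '2020-01-01'
    """).show()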
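A minimal illustrative sketch (Airflow 2.x style) of an ETL DAG like those built in Airflow/Composer above; the DAG id, schedule, and script paths are hypothetical placeholders, not the actual production jobs.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # A small daily ETL DAG: extract, then transform with spark-submit.
    with DAG(
        dag_id="daily_sales_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        extract = BashOperator(
            task_id="extract_from_source",
            bash_command="python /opt/etl/extract.py --date {{ ds }}",
        )

        transform = BashOperator(
            task_id="transform_with_spark",
            bash_command="spark-submit /opt/etl/transform.py --date {{ ds }}",
        )

        extract >> transform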

Environment: Spark, HDFS, Hive, Map Reduce, GCP, BigQuery, GCS, G-Cloud Function, Scala, Sqoop, Spark-SQL, Kafka, PySpark, Python, Linux Shell Scripting, JDBC, Git, Bit bucket, Control M, Maven.

Confidential - North Wales, PA

Data Engineer

Responsibilities:

  • Developed ETL data pipelines using Sqoop, Spark, Spark SQL, Scala, and Oozie.
  • Used Spark for interactive queries and processing of streaming data, and integrated it with popular NoSQL databases.
  • Experience with AWS Cloud IAM, Data Pipeline, EMR, S3, and EC2.
  • Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations (see the sketch after this list).
  • Developed Spark code using Scala and Spark-SQL for faster processing of data.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Created Oozie workflows to run multiple Spark jobs.
  • Developed file cleaners using Python libraries.
  • Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark-SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Experience with Terraform scripts that automate step execution in EMR to load data into ScyllaDB.
  • De-normalized data coming from Netezza as part of the transformation and loaded it into NoSQL databases and MySQL.
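A minimal illustrative sketch of the S3 batch transformation scripts mentioned above: read raw CSV from S3, apply basic transformations, and write curated Parquet back. The bucket names, columns, and formats are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("s3-batch-transform").getOrCreate()

    # Read raw CSV files from S3 (bucket and schema are placeholders).
    orders = spark.read.option("header", "true").csv("s3://raw-bucket/orders/")

    # Required transformations: cast types, parse dates, and drop bad rows.
    cleaned = (orders
               .withColumn("amount", col("amount").cast("double"))
               .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
               .filter(col("amount").isNotNull()))

    # Write back to a curated prefix as Parquet, partitioned by order date.
    cleaned.write.mode("overwrite").partitionBy("order_date").parquet("s3://curated-bucket/orders/")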

Environment: HDFS, Spark, Scala, Tomcat, Netezza, EMR, Oracle, Sqoop, AWS, Terraform, Scylla DB, Cassandra, MySQL, Oozie

Confidential, San Diego, CA

Hadoop Developer

Responsibilities:

  • Worked on a live 90-node Hadoop cluster running CDH 4.4.
  • Worked with highly unstructured and semi-structured data of 90 TB in size (270 TB).
  • Extracted data from Teradata into HDFS using Sqoop.
  • Worked with Sqoop (version 1.4.3) jobs with incremental load to populate Hive external tables.
  • Extensive experience writing Pig (version 0.10) scripts to transform raw data from several data sources into baseline data.
  • Experience in Amazon AWS services such as EMR, EC2, S3, CloudFormation, RedShift which provides fast and efficient processing of Big Data.
  • Created a data lake on Amazon S3.
  • Implemented scheduled downtime for non-prod servers for optimizing AWS pricing.
  • Developed Hive (version 0.10) scripts for end-user/analyst requirements to perform ad hoc analysis.
  • Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
  • Developed UDFs in Java as and when necessary for use in Pig and Hive queries.
  • Experience using SequenceFile, RCFile, Avro, and HAR file formats.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Worked on Performance Tuning to Ensure that assigned systems were patched, configured and optimized for maximum functionality and availability. Implemented solutions that reduced single points of failure and improved system uptime to 99.9% availability.
  • Wrote MapReduce programs in Python with the Hadoop Streaming API (see the mapper/reducer sketch after this list).
  • Extracted files from CouchDB through Sqoop, placed them in HDFS, and processed them.
  • Worked on Hive for exposing data for further analysis and for generating and transforming files from different analytical formats to text files.
  • Imported data from MySQL server and other relational databases into Apache Hadoop with the help of Apache Sqoop.
  • Created Hive tables and worked on them for data analysis to meet the business requirements.
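A minimal illustrative sketch of a Hadoop Streaming job in Python, in the spirit of the MapReduce programs mentioned above; the word-count logic and file names are generic examples, and the scripts would be submitted with the hadoop-streaming jar (-files, -mapper, -reducer, -input, -output).

    #!/usr/bin/env python
    # mapper.py: emit (token, 1) pairs, one per whitespace-separated token on stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum counts per key; Hadoop Streaming sorts mapper output by key.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)

    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))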

Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Sqoop, Pig, Flume, Oracle 11/10g, DB2, Teradata, MySQL, Eclipse, PL/SQL, Java, Linux, Shell Scripting, SQL Developer, SOLR.

Confidential

Software Developer

Responsibilities:

  • Analyzed and modified Java/J2EE applications using JDK 1.7/1.8 and developed web pages using the Spring MVC framework.
  • Coordinated with business analysts and application architects to maintain knowledge of all functional requirements and ensure compliance with all architecture standards.
  • Followed Agile methodology with TDD through all phases of the SDLC.
  • Used connection pooling to obtain JDBC connections and access database procedures.
  • Attended the daily stand-up meetings.
  • Used Rally for managing the portfolio and for creating and tracking user stories.
  • Responsible for analysis, design, development and integration of UI components with backend using J2EE technologies.
  • Used JUnit to validate input for functions as part of TDD.
  • Developed User Interface pages using HTML5, CSS3 and JavaScript.
  • Involved in development activities using Core Java/J2EE, Servlets, JSP, and JSF for creating web applications, along with XML and Spring.
  • Used the Maven build tool to build the application and ran it using the Tomcat server.
  • Used Git as version control for tracking changes in the project.
  • Used the JUnit framework for unit testing and Selenium for integration testing and test automation.
  • Assisted in the development of various applications, maintained their quality, and performed troubleshooting to resolve application issues/bugs identified during the test cycles.

Environment: Java/J2EE, JDK 1.7/1.8, LINUX, Spring MVC, Eclipse, JUnit, Servlets, DB2, Oracle 11g/12c, GIT, GitHub, JSON, RESTful, HTML5, CSS3, JavaScript, Rally, Agile/Scrum.
