Sr. Data Engineer Resume
St. Louis, MO
SUMMARY:
- Experienced in Big Data implementation with strong knowledge of the major components of the Hadoop ecosystem, including Hadoop MapReduce, HDFS, Sqoop, Hive, HBase, Oozie, Spark, Scala, and Python.
- Good exposure to NoSQL databases such as HBase, Cassandra, and ScyllaDB.
- Expertise in cloud infrastructure such as Amazon Web Services (AWS): EC2, S3, EMR, Glue, SNS, SQS, Lambda, Athena, Amazon RDS, and various other services.
- Worked on various Hadoop distributions (Cloudera, EMR, etc.) to fully implement and leverage new Hadoop features.
- Hands-on experience with Google Cloud Platform (GCP) across its big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer (Airflow as a service).
- Skilled in Hadoop architecture and ecosystem, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and YARN.
- Comprehensive experience in importing and exporting data between RDBMS and HDFS using Sqoop.
- Good understanding of and hands-on experience with Spark abstractions such as RDDs, DataFrames, Datasets, and Spark SQL.
- Extensively worked on structured data using HiveQL, join operations, writing custom UDFs, and optimizing Hive queries.
- Experienced in performing in-memory data processing for batch, real-time, and advanced analytics using Apache Spark (Spark Core, Spark SQL, and Spark Streaming).
- Ingested data into Hadoop from various data sources like Oracle, MySQL, and Teradata using the Sqoop tool.
- Experienced in Agile and Waterfall methodologies in project execution.
- Experienced in securing Hadoop clusters with Kerberos and integrating with LDAP/AD at the enterprise level.
- Working experience developing User Defined Functions (UDFs) for the Apache Hive data warehouse using Java, Scala, and Python.
- Involved in Cassandra best practices, migrating the application to the Cassandra database from the legacy platform for Choice, and upgraded Cassandra 3.
- Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
- Good understanding of XML methodologies (XML, XSL, XSD), including Web Services and SOAP.
- Used the Spark-Cassandra Connector to load data to and from Cassandra.
- Hands-on experience in Apache Spark creating RDDs and DataFrames, applying transformations and actions, and converting RDDs to DataFrames.
- Migrated various Hive UDFs and queries to Spark SQL for faster execution.
- Experienced in data processing such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Experience working with GitHub/Git 2.12 source and version control systems.
- Experience in using Apache Kafka for log aggregation.
- Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS and performed real-time analytics on the incoming data (a minimal streaming sketch follows this summary).
- Experience in importing real-time data to Hadoop using Kafka and implemented Oozie jobs for daily imports.
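A minimal, hedged sketch of the Kafka-to-HDFS streaming pattern mentioned above, using PySpark Structured Streaming for brevity; the broker address, topic name, and HDFS paths are hypothetical placeholders, and the job would need the spark-sql-kafka package on the classpath.

```python
# Hedged sketch: consume a Kafka topic and land the raw events in HDFS as Parquet.
# Broker, topic, and paths below are placeholders, not values from a specific project.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
          .option("subscribe", "events-topic")                # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"),
                  col("value").cast("string"),
                  col("timestamp")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/events")           # placeholder output path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```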
TECHNICAL SKILLS:
Hadoop/Big Data: HDFS, MapReduce, Hive, Impala, Spark SQL, HBase, Kafka, Sqoop, Spark Streaming, Oozie, ZooKeeper, Hue, Scala, PySpark, Splunk; Hadoop Distributions: Cloudera (CDH 12.2), Amazon AWS (EMR).
Programming/Scripting Languages: Core Java, Linux shell scripts, Python, Scala.
Databases: MySQL, PL/SQL, SQL Developer, Teradata, HBase
ETL: Ab Initio, Informatica
Real Time/Stream Processing: Apache Spark
Build Tools: Maven, SBT
Cloud: AWS, GCP, S3
PROFESSIONAL EXPERIENCE:
Confidential - St. Louis, MO
Sr. Data Engineer
Responsibilities:
- Developed PySpark pipelines that transform the raw data into useful flattened datasets.
- Worked on creating data pipelines that read millions of JSON files with nested data structures using Spark and flattened them according to the business use case (a PySpark flattening sketch follows this list).
- Created a Python framework that cleans the raw files in the Foundry platform.
- Orchestrated processes in AWS using Glue ETL jobs and Lambdas so that the data flows without manual intervention.
- Transformed complex SQL into PySpark while migrating the processes from AWS to Palantir Foundry.
- Created Spark libraries and utilities for transforming datasets from multiple data sources to the sink.
- Worked on creating pipelines in AWS using Glue, Lambdas, S3, and AWS Step Functions that convert the data into useful formats (a hedged Glue job sketch follows this job's environment line).
- Worked on analyzing data on the Hadoop cluster using different big data analytic tools, including Spark (Spark SQL, Spark shell), the Hive data warehouse, and Impala.
- Implemented Spark using Scala, utilizing the DataFrame and Spark SQL APIs along with RESTful APIs for faster processing of data.
- Used Spark for fast in-memory data processing and performed joins (broadcast hash, sort-merge join), pivots (data transpose), and complex transformations on terabytes of data.
- Developed POCs on Spark Streaming to automatically ingest flat files when they land in the edge node landing zone.
- Created data catalog tables that provide an overview of where data originated and where it was sent.
- Implemented Spring Boot microservices to process the messages into the Kafka cluster setup and send them to the next teams for further processing.
- Agile Scrum team member with T-shaped skills as a Technical Data Analyst / ETL Developer on big data applications.
- Used Spring Kafka API calls to process the messages smoothly on the Kafka cluster setup.
- Worked on partitioning Kafka messages and setting up the replication factors in the Kafka cluster.
- Diligently worked with the Kafka admin team to set up the Kafka cluster and implemented Kafka producer and consumer applications on the cluster with the help of ZooKeeper.
- Configured AWS Lambda with multiple functions.
- Integrated the Spark Streaming service with Kafka to load the data into an HDFS location.
- Used the Kafka HDFS connector to export data from Kafka topics to HDFS files in a variety of formats and integrated it with Apache Hive to make data immediately available for HQL querying.
- Working knowledge of Bitbucket (as version control repository) and Bamboo (for CI/CD).
- Automated the dataflow with Bash shell scripts, from pulling data from databases to loading data into HDFS.
- Involved in data ingestion into HDFS using Spark and Sqoop from a variety of sources using connectors like JDBC.
- Good experience in writing Spark applications using Python.
- Performed ETL using AWS Glue.
- Used AWS Athena to query data directly from AWS S3.
- Pre-processed large sets of structured and semi-structured data in different formats like text files, Avro, Parquet, Sequence Files, and JSON records, and used Snappy and LZ4 compression.
- Worked with Oozie and ZooKeeper to manage the flow of jobs and coordination in the cluster.
- Used Git with Bitbucket for code versioning and code reviews, and SonarQube for code analysis.
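A minimal, hedged PySpark sketch of the nested-JSON flattening described above; the S3 locations and the recursion approach are illustrative assumptions rather than the exact project code.

```python
# Hedged sketch: read nested JSON and promote struct fields to flat, top-level
# columns. Input/output locations are placeholders; array columns would need
# an additional explode() step.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/*.json")        # placeholder input

def flatten(df):
    """Repeatedly expand struct columns until the schema is flat."""
    while True:
        struct_cols = [f.name for f in df.schema.fields
                       if f.dataType.typeName() == "struct"]
        if not struct_cols:
            return df
        flat_cols = []
        for f in df.schema.fields:
            if f.name in struct_cols:
                flat_cols += [col(f"{f.name}.{c.name}").alias(f"{f.name}_{c.name}")
                              for c in f.dataType.fields]
            else:
                flat_cols.append(col(f.name))
        df = df.select(flat_cols)

flattened = flatten(raw)
flattened.write.mode("overwrite").parquet("s3://example-bucket/curated/")  # placeholder output
```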
Environment: Spark, Python, Hadoop, Hive, S3, RDS, EMR, EC2, SNS, Lambda, Athena, Step Functions, Jenkins, Foundry, Git.
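A hedged sketch of the Glue-based ETL step referenced in the bullets above, assuming a source table registered in the Glue Data Catalog; the database, table, and bucket names are hypothetical.

```python
# Hedged AWS Glue job sketch: read from the Glue Data Catalog, deduplicate,
# and write Parquet to S3. All names and paths are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source table registered in the Glue Data Catalog (placeholder names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_events")

# Drop exact duplicates via the Spark DataFrame API, then convert back.
deduped = DynamicFrame.fromDF(source.toDF().dropDuplicates(), glue_context, "deduped")

glue_context.write_dynamic_frame.from_options(
    frame=deduped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/"},   # placeholder bucket
    format="parquet")

job.commit()
```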
Confidential - Houston, TX
Data Engineer
Responsibilities:
- Used Sqoop to import data into HDFS/Hive from multiple relational databases, performed operations, and exported the results back.
- Involved in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).
- Extensively used Spark Streaming to analyze sales data in real time over regular window intervals from sources like Kafka.
- Performed Spark join optimizations; troubleshot, monitored, and wrote efficient code using Scala.
- Used big data tools such as Spark (PySpark, Spark SQL) to conduct real-time analysis of insurance transactions.
- Performed Spark transformations and actions on large datasets. Implemented Spark SQL to perform complex data manipulations and to work with large amounts of structured and semi-structured data stored in a cluster using DataFrames/Datasets.
- Migrated previously written cron jobs to Airflow/Composer in GCP.
- Created Hive tables based on business requirements. Wrote many Hive queries and UDFs and implemented concepts like partitioning and bucketing for efficient data access, windowing operations, and more.
- Integrated Hive and Sqoop with HBase and performed transactional and analytical processing.
- Configured, designed, implemented and monitored Kafka clusters and connectors. Wrote Kafka producers and consumers using Java.
- Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
- Involved in setting up the Apache Airflow service in GCP.
- Implemented proof of concept (POC) for processing stream data using Kafka -> Spark -> HDFS.
- Developed a data pipeline using Kafka, Spark, and Hive/HDFS to ingest, transform, and analyze data. Automated jobs using Oozie.
- Generated Tableau dashboards and worksheets for large datasets.
- Implemented custom interceptors for Flume to filter data and defined channel selectors to multiplex the data into different sinks.
- Implemented many Spark jobs and wrote function definitions, case classes, and object classes using Scala.
- Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
- Utilized Spark, Scala, and Python for querying and preparing data from big data sources.
- Wrote pre-processing queries in Python for internal Spark jobs.
- Involved in teh process of Cassandra data modeling, performing data operations using CQL and Java.
- Maintained and worked with the data pipeline that transfers and processes several terabytes of data using Spark, Scala, Python, Apache Kafka, Pig/Hive, and Impala.
- Working experience in the Apache Hadoop and Spark frameworks, including the Hadoop Distributed File System, MapReduce, PySpark, and Spark SQL.
- Built data pipelines in Airflow in GCP for ETL-related jobs using different Airflow operators (a sample Composer DAG sketch follows this list).
- Performed data integration with the goal of moving more data effectively, efficiently, and with high performance to assist in business-critical projects using Talend Data Integration.
- Used SQL queries and other data analysis methods to assess the quality of the data.
- Exported the aggregated data to Oracle using Sqoop for reporting on the Tableau dashboard.
- Involved in QA, test data creation, and unit testing activities.
- Implemented security on Hadoop cluster using Kerberos.
- Involved in design, development and testing phases of Software Development Life Cycle.
- Weekly meetings with technical collaborators and active participation in code review sessions with senior and junior developers.
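A hedged sketch of the kind of Composer/Airflow DAG described above, using basic Bash and Python operators; the DAG id, schedule, bucket, and dataset names are hypothetical placeholders.

```python
# Hedged Composer/Airflow sketch: a daily ETL DAG with placeholder task commands.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def validate_load(**context):
    # Placeholder data-quality check; a real task would query the warehouse here.
    print("row counts validated for", context["ds"])

with DAG(
    dag_id="daily_sales_etl",                    # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = BashOperator(
        task_id="extract_to_gcs",
        bash_command="echo 'pull source extracts into gs://example-bucket/raw/'",
    )
    load = BashOperator(
        task_id="load_to_bigquery",
        bash_command=("bq load --source_format=PARQUET "
                      "example_ds.sales gs://example-bucket/raw/*.parquet"),
    )
    check = PythonOperator(task_id="validate_load", python_callable=validate_load)

    extract >> load >> check
```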
Environment: Spark, HDFS, Hive, MapReduce, GCP, BigQuery, GCS, Google Cloud Functions, Scala, Sqoop, Spark SQL, Kafka, PySpark, Python, Linux shell scripting, JDBC, Git, Bitbucket, Control-M, Maven.
Confidential - North Wales, PA
Data Engineer
Responsibilities:
- Developed ETL data pipelines using Sqoop, Spark, Spark SQL, Scala, and Oozie.
- Used Spark for interactive queries and processing of streaming data, and integrated it with popular NoSQL databases.
- Experience with AWS IAM, Data Pipeline, EMR, S3, and EC2.
- Developed batch scripts to fetch the data from AWS S3 storage and perform the required transformations.
- Developed Spark code using Scala and Spark SQL for faster processing of data.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows big data resources.
- Created Oozie workflows to run multiple Spark jobs.
- Developed file cleaners using Python libraries.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Experience with Terraform scripts that automate step execution in EMR to load the data to ScyllaDB (see the sketch after this list).
- Denormalized the data coming from Netezza as part of the transformation and loaded it into NoSQL databases and MySQL.
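A hedged sketch of the EMR-step style load into ScyllaDB mentioned above, using the Spark Cassandra connector (ScyllaDB is CQL-compatible); the host, keyspace, table, and S3 paths are placeholders, and the job would be submitted with a connector package such as com.datastax.spark:spark-cassandra-connector_2.12.

```python
# Hedged sketch: load denormalized data staged in S3 into ScyllaDB via the
# Spark Cassandra connector. All connection details below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("load-to-scylla")
         .config("spark.cassandra.connection.host", "scylla-node-1")   # placeholder host
         .getOrCreate())

# Denormalized output staged in S3 after extraction from the source warehouse.
denorm = spark.read.parquet("s3://example-bucket/denormalized/")       # placeholder path

(denorm.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="example_ks", table="customer_profile")             # placeholder names
 .mode("append")
 .save())
```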
Environment: HDFS, Spark, Scala, Tomcat, Netezza, EMR, Oracle, Sqoop, AWS, Terraform, ScyllaDB, Cassandra, MySQL, Oozie.
Confidential - San Diego, CA
Hadoop Developer
Responsibilities:
- Worked on a live 90-node Hadoop cluster running CDH 4.4.
- Worked with highly unstructured and semi-structured data of 90 TB in size (270 TB).
- Extracted the data from Teradata into HDFS using Sqoop.
- Worked with Sqoop (version 1.4.3) jobs with incremental load to populate Hive external tables.
- Extensive experience in writing Pig (version 0.10) scripts to transform raw data from several data sources into baseline data.
- Experience in Amazon AWS services such as EMR, EC2, S3, CloudFormation, RedShift which provides fast and efficient processing of Big Data.
- Created a data lake on Amazon S3.
- Implemented scheduled downtime for non-prod servers for optimizing AWS pricing.
- Developed Hive (version 0.10) scripts for end-user/analyst requirements to perform ad hoc analysis.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Developed UDFs in Java as and when necessary for use in Pig and Hive queries.
- Experience in using SequenceFile, RCFile, Avro, and HAR file formats.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Worked on performance tuning to ensure that assigned systems were patched, configured, and optimized for maximum functionality and availability. Implemented solutions that reduced single points of failure and improved system uptime to 99.9% availability.
- Wrote MapReduce programs in Python with the Hadoop Streaming API (a mapper/reducer sketch follows this list).
- Extracted files from CouchDB through Sqoop, placed them in HDFS, and processed them.
- Worked on Hive for exposing data for further analysis and for generating transforming files from different analytical formats to text files.
- Imported data from MySQL server and other relational databases into Apache Hadoop with the help of Apache Sqoop.
- Created Hive tables and worked on them for data analysis to meet the business requirements.
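A hedged sketch of the Hadoop Streaming mapper/reducer pattern mentioned above, counting records per key; the tab-delimited input and key position are assumptions, and the pair would be run through the hadoop-streaming jar with -mapper, -reducer, -input, and -output options.

```python
# mapper.py -- hedged sketch: emit "key<TAB>1" per input record
# (tab-delimited input with the key in column 0 is a placeholder assumption).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print(f"{fields[0]}\t1")
```

```python
# reducer.py -- hedged sketch: sum counts per key; Hadoop Streaming delivers
# the mapper output sorted by key, so a simple running total works.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, int(value)

if current_key is not None:
    print(f"{current_key}\t{count}")
```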
Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Sqoop, Pig, Flume, Oracle 11/10g, DB2, Teradata, MySQL, Eclipse, PL/SQL, Java, Linux, Shell Scripting, SQL Developer, SOLR.
Confidential
Software Developer
Responsibilities:
- Analyzed and modified Java/J2EE applications using JDK 1.7/1.8 and developed webpages using the Spring MVC Framework.
- Coordinated with business analysts and application architects to maintain knowledge of all functional requirements and ensure compliance with all architecture standards.
- Followed Agile methodology with TDD through all phases of the SDLC.
- Used connection pooling to get JDBC connections and access database procedures.
- Attended the daily standup meetings.
- Used Rally for managing the portfolio and for creating and tracking user stories.
- Responsible for analysis, design, development, and integration of UI components with the backend using J2EE technologies.
- Used JUnit to validate function inputs as part of TDD.
- Developed User Interface pages using HTML5, CSS3 and JavaScript.
- Involved in development activities using Core Java/J2EE, Servlets, JSP, and JSF for creating web applications, along with XML and Spring.
- Used Maven for building the application and ran it using Tomcat Server.
- Used Git as version control for tracking changes in the project.
- Used the JUnit framework for unit testing and Selenium for integration testing and test automation.
- Assisted in development for various applications, maintained quality for the same, and performed troubleshooting to resolve all application issues/bugs identified during the test cycles.
Environment: Java/J2EE, JDK 1.7/1.8, LINUX, Spring MVC, Eclipse, JUnit, Servlets, DB2, Oracle 11g/12c, GIT, GitHub, JSON, RESTful, HTML5, CSS3, JavaScript, Rally, Agile/Scrum.