
Senior Data Engineer Resume


Phoenix, AZ

SUMMARY

  • 7+ years of strong experience in application development using PySpark, Java, Python, Scala, and R, with an in-depth understanding of Distributed Systems Architecture and Parallel Processing Frameworks
  • Strong experience using PySpark, HDFS, MapReduce, Hive, Pig, Spark, Sqoop, Oozie, and HBase
  • Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance
  • Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features
  • Experience in developing Spark Applications using Spark RDD, Spark - SQL and Dataframe APIs
  • Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka
  • Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop
  • Expertise in working with Hive data warehouse infrastructure: creating tables, distributing data through Partitioning and Bucketing, and developing and tuning HQL queries
  • Significant experience writing custom UDFs in Hive and custom Input Formats in MapReduce
  • Involved in creating Hive tables, loading them with data, and writing Hive ad-hoc queries that run internally on MapReduce and TEZ
  • Replaced existing MR jobs and Hive scripts with Spark SQL and Spark data transformations for efficient data processing (a PySpark sketch follows this summary list)
  • Experience developing Kafka producers and Kafka consumers for streaming millions of events per second
  • Strong understanding of real-time streaming technologies such as Spark Streaming and Kafka
  • Knowledge of job workflow management and coordination tools like Oozie
  • Strong experience building end-to-end data pipelines on the Hadoop platform
  • Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase
  • Strong understanding of Logical and Physical database models and entity-relationship modeling
  • Experience with Software development tools such as JIRA, Play, GIT
  • Good understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables
  • Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data
  • Strong understanding of Java Virtual Machines and multi-threading process
  • Experience in writing complex SQL queries, creating reports and dashboards
  • Proficient in using Unix-based command-line interfaces; expertise in handling ETL tools like Informatica
  • Excellent analytical, communication and interpersonal skills
  • Experienced in using Agile methodologies including Extreme Programming (XP), Scrum and Test-Driven Development (TDD)
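
The Spark SQL migration work summarized above typically looks roughly like the minimal sketch below. It is illustrative only; the database, table, and column names (sales_db.transactions, txn_ts, region, amount, daily_totals) are hypothetical placeholders, not taken from any actual project.

```python
# Minimal PySpark sketch of replacing a Hive/MR aggregation with Spark SQL and
# DataFrame transformations. All database, table, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sql")
         .enableHiveSupport()   # reuse the existing Hive metastore tables
         .getOrCreate())

# The old HiveQL aggregation, now executed by the Spark SQL engine.
daily_totals = spark.sql("""
    SELECT region, to_date(txn_ts) AS txn_date, SUM(amount) AS total_amount
    FROM sales_db.transactions
    GROUP BY region, to_date(txn_ts)
""")

# The same logic expressed with the DataFrame API instead of an MR job.
transactions = spark.table("sales_db.transactions")
daily_totals_df = (transactions
                   .groupBy("region", F.to_date("txn_ts").alias("txn_date"))
                   .agg(F.sum("amount").alias("total_amount")))

# Persist the result back to Hive for downstream consumers.
daily_totals_df.write.mode("overwrite").saveAsTable("sales_db.daily_totals")
```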

TECHNICAL SKILLS

Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS

Hadoop Ecosystem: Hadoop, MapReduce, Yarn, HDFS, Pig, Oozie, Zookeeper

BigData Ecosystem: Spark, SparkSQL, Spark Streaming, Hive, Impala, Hue

Data Ingestion: Sqoop, Flume, NiFi, Kafka

NOSQL Databases: HBase, Cassandra, MongoDB

Programming Languages: C, Scala, Core Java, J2EE (Servlets, JSP, JDBC, JavaBeans, EJB)

Frameworks: MVC, Struts, Spring, Hibernate

Web Technologies: HTML, CSS, XML, JavaScript, Maven

Scripting Languages: JavaScript, UNIX Shell, Python, R

Databases: Oracle 11g, MS Access, MySQL, SQL Server 2000/2005/2008/2012, Teradata

SQL Server Tools: SQL Server Management Studio, Enterprise Manager, Query Analyzer, Profiler, Export & Import (DTS).

IDE: Eclipse, Visual Studio, IDLE, IntelliJ

Web Services: Restful, SOAP

Tools: Bugzilla, QuickTestPro (QTP) 9.2, Selenium, Quality Center, Test Link, TWS, SPSS, SAS, Documentum, Tableau, Mahout

Methodologies: Agile, UML, Design Patterns

PROFESSIONAL EXPERIENCE

Confidential - Phoenix, AZ

Senior Data Engineer

Responsibilities:

  • Developed shell scripts that read JSON files and feed them into Sqoop and Hive jobs.
  • Ingested data from relational databases (Oracle, PostgreSQL) using Sqoop into HDFS, loaded it into Hive tables, AWS S3, and GCP (Google Cloud Platform), and transformed and analyzed large datasets by running Hive queries and using Apache Spark.
  • Worked with PySpark to migrate fixed-width, ORC, CSV, and other file formats (see the sketch after this section).
  • Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
  • Utilized Sqoop, ETL tools, and Hadoop FileSystem APIs for implementing data ingestion pipelines.
  • Worked on batch data at different granularities, ranging from hourly and daily to weekly and monthly.
  • Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
  • Assisted in upgrading, configuring, and maintaining Hadoop infrastructure components such as Ambari, Spark, and Hive.
  • Worked with StreamSets and developed data pipelines using it.
  • Developed and wrote SQL and stored procedures in Teradata; loaded data into Snowflake and wrote Snowflake SQL scripts.
  • Wrote TDCH scripts for full and incremental refreshes of Hadoop tables.
  • Optimized Hive queries by parallelizing work with partitioning and bucketing.
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and ORC.
  • Worked extensively on Teradata, Hadoop/Hive, Spark, SQL, and PL/SQL.
  • Designed and published StreamSets pipelines to migrate data.
  • Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
  • Experienced in working with Hadoop from the Hortonworks Data Platform and running services through Cloudera Manager.
  • Used the Agile Scrum methodology (Scrum Alliance) for development.

Environment: Hadoop, HDFS, AWS, Vertica, Scala, Kafka, MapReduce, YARN, Spark, Hive, MySQL, Kerberos, Maven, StreamSets.
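
The fixed-width and CSV-to-ORC migration work referenced above might look roughly like the following PySpark sketch. It is an assumption-laden illustration: the file path, column offsets, and the staging.customers_orc table are hypothetical placeholders.

```python
# Illustrative PySpark sketch: parse a fixed-width extract and land it in Hive as ORC.
# The input path, column positions, and target table name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("fixed-width-to-orc")
         .enableHiveSupport()
         .getOrCreate())

# Read the fixed-width file as raw lines, then slice columns by position.
raw = spark.read.text("/landing/customers_fixed_width.dat")
parsed = raw.select(
    F.trim(F.substring("value", 1, 10)).alias("customer_id"),
    F.trim(F.substring("value", 11, 30)).alias("customer_name"),
    F.trim(F.substring("value", 41, 8)).alias("signup_date"),
)

# Write the parsed records as ORC into a Hive staging table.
(parsed.write
       .mode("overwrite")
       .format("orc")
       .saveAsTable("staging.customers_orc"))
```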

Confidential - Charlotte, NC

Sr. Hadoop/Big Data Engineer

Responsibilities:

  • Setting up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
  • Creating shell scripts to process raw data and load it into AWS S3, GCP (Google Cloud Platform), and Redshift databases.
  • Planning and designing the data warehouse in a star schema; designing table structures and documenting them.
  • Developing BigQuery scripts in GCP and connecting them to reporting tools.
  • Designed and implemented an end-to-end big data platform on a Teradata Appliance.
  • Performing ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Hadoop Spark.
  • Involved in developing the architecture of the project's data migration solution.
  • Developed Python and Bash scripts to automate jobs and provide control flow.
  • Moving data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
  • Working with PySpark to perform ETL and generate reports.
  • Writing regression SQL to merge validated data into the production environment.
  • Developing Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
  • Writing UDFs in PySpark to perform transformations and loads.
  • Using NiFi to load data into HDFS as ORC files.
  • Writing TDCH scripts and Apache NiFi flows to load data from mainframe DB2 into the Hadoop cluster.
  • Working with ORC, Avro, and JSON file formats; creating external tables over these files and querying them using BigQuery in GCP (see the sketch after this list).
  • Working with Google Cloud Storage; researching and developing strategies to minimize Google Cloud costs.
  • Using Apache Solr for search operations on data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Working with multiple sources; migrating tables from Teradata and DB2 to the Hadoop cluster.
  • Migrating processed, ready tables from Hadoop to Google Cloud Storage (GCS) using the Aorta framework.
  • Performing source analysis: tracing data back to its sources and finding its roots through Teradata, DB2, etc.
  • Identifying the jobs that load the source tables and documenting it.
  • Being an active part of the Agile Scrum process with two-week sprints.
  • Working with Jira and Microsoft Planner to track project progress.
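
As referenced above, the BigQuery external-table work might look roughly like the sketch below, using the google-cloud-bigquery Python client. The project, dataset, bucket, table, and column names are hypothetical placeholders.

```python
# Hedged sketch: expose ORC files in Cloud Storage as a BigQuery external table
# and query them. Project, dataset, bucket, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Define an external table over ORC files landed in Cloud Storage.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders_ext
OPTIONS (
  format = 'ORC',
  uris = ['gs://my-data-lake/orders/*.orc']
)
"""
client.query(ddl).result()

# Query the external table directly; BigQuery reads the ORC files from GCS.
rows = client.query("""
    SELECT order_status, COUNT(*) AS order_count
    FROM analytics.orders_ext
    GROUP BY order_status
""").result()

for row in rows:
    print(row.order_status, row.order_count)
```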

Confidential, Charlotte, NC

Senior Hadoop/Big Data Engineer

Responsibilities:

  • Developed Hive, Bash scripts for source data validation and transformation. Automated data loading into HDFS and Hive for pre-processing the data using One Automation.
  • Gathered data from data warehouses in Teradata and Snowflake.
  • Developed Spark/Scala and Python code for a regular-expression project in the Hadoop/Hive environment.
  • Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
  • Generated reports using Tableau.
  • Experience building Big Data applications using Cassandra and Hadoop.
  • Utilized Sqoop, ETL tools, and Hadoop FileSystem APIs for implementing data ingestion pipelines.
  • Worked on batch data at different granularities, ranging from hourly and daily to weekly and monthly.
  • Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager
  • Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
  • Assisted in upgrading, configuring, and maintaining Hadoop infrastructure components such as Ambari, Pig, and Hive.
  • Developed and wrote SQL and stored procedures in Teradata; loaded data into Snowflake and wrote Snowflake SQL scripts.
  • Wrote TDCH scripts for full and incremental refreshes of Hadoop tables.
  • Optimized Hive queries by parallelizing work with partitioning and bucketing (see the sketch after this section).
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and ORC.
  • Worked extensively on Teradata, Hadoop/Hive, Spark, SQL, PL/SQL, and Snowflake SQL.
  • Designed and published visually rich and intuitive Tableau dashboards and crystal reports for executive decision making
  • Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
  • Experienced in working with Hadoop from the Hortonworks Data Platform and running services through Cloudera Manager.
  • Used the Agile Scrum methodology (Scrum Alliance) for development.

Environment: Hadoop, HDFS, AWS, Vertica, Bash, Scala, Kafka, MapReduce, YARN, Drill, Spark, Pig, Hive, Python, Java, NiFi, HBase, MySQL, Kerberos, Maven, Shell Scripting, SQL
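
The partitioning-and-bucketing optimization referenced above can be illustrated with the minimal PySpark sketch below; the staging.orders_raw and warehouse.orders_part tables, the column names, and the bucket count are hypothetical assumptions.

```python
# Illustrative sketch: write a table partitioned by date and bucketed by a join key
# so date filters prune directories and joins on the key avoid a full shuffle.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-bucket-demo")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("staging.orders_raw")  # hypothetical source table

(orders.write
       .partitionBy("load_date")        # one directory per load date
       .bucketBy(32, "customer_id")     # pre-hash rows by the join key
       .sortBy("customer_id")
       .format("orc")
       .mode("overwrite")
       .saveAsTable("warehouse.orders_part"))

# A query filtering on the partition column only reads the matching partitions.
spark.sql("""
    SELECT SUM(amount) AS total_amount
    FROM warehouse.orders_part
    WHERE load_date = '2020-01-15'
""").show()
```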

Confidential, Houston, TX

Hadoop-Spark Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop
  • Used PySpark and Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra (see the sketch after this section).
  • Loaded data into PySpark DataFrames and Spark RDDs and performed advanced procedures such as text analytics and processing, using the in-memory computation capabilities of Spark (PySpark and Scala) to generate the output response.
  • Handled large datasets using partitions, Spark in-memory capabilities, broadcasts in PySpark, and effective and efficient joins and transformations during the ingestion process itself.
  • Developed Scala and PySpark scripts using both DataFrames/SQL and RDD/MapReduce for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Worked with Impala and Kudu, creating a Spark-to-Impala/Kudu data ingestion tool.
  • Performed performance tuning of Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
  • Optimized existing algorithms in Hadoop using SparkSession, Spark SQL, DataFrames, and pair RDDs.
  • Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping
  • Designed, developed and did maintenance of data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems
  • Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis.
  • Created Hive tables for loading and analyzing data, Implemented Partitions, Buckets and developed Hive queries to process the data and generate the data cubes for visualizing
  • Implemented schema extraction for Parquet and Avro file Formats in Hive
  • Developed Hive scripts in Hive QL to de-normalize and aggregate the data
  • Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive
  • Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with the goal of adopting the former in the project.
  • Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala
  • Managed jobs using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
  • Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster
  • Worked with BI team to create various kinds of reports using Tableau based on the client's needs
  • Queried Parquet files by loading them into Spark DataFrames using Zeppelin notebooks.
  • Troubleshot problems that arose during batch data processing jobs.
  • Extracted the data from Teradata into HDFS/Dashboards using Spark Streaming
  • Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small-dataset processing and storage, and maintained the Hadoop cluster on AWS EMR.

Environment: Hadoop Yarn, Spark-Core, Spark-Streaming, Spark-SQL, Scala, Python, Kafka, Hive, Sqoop, Amazon AWS, Elastic Search, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux, Shell scripting
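
As referenced above, the Kafka-to-Cassandra flow could be sketched as below with Structured Streaming and the DataStax Spark-Cassandra connector. This is illustrative only: it assumes the connector package is on the Spark classpath, and the broker address, topic, event schema, keyspace, and table names are hypothetical.

```python
# Hedged sketch: consume events from Kafka, parse them, and append each micro-batch
# to Cassandra through the Spark-Cassandra connector (assumed to be on the classpath).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
         .getOrCreate())

event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_ts", LongType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
          .option("subscribe", "learner-events")                # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to the Cassandra table via the connector.
    (batch_df.write
             .format("org.apache.spark.sql.cassandra")
             .options(keyspace="learning", table="learner_events")
             .mode("append")
             .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/checkpoints/learner-events")
         .start())
query.awaitTermination()
```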

Confidential, Houston, TX

Hadoop Developer

Responsibilities:

  • Developed PIG scripts for source data validation and transformation. Automated data loading into HDFS and PIG for pre-processing the data using Oozie
  • Created storage with Amazon S3 for storing data. Worked on transferring data from Kafka topic into AWS S3 storage
  • Designed and implemented an ETL framework using Java and PIG to load data from multiple sources into Hive and from Hive into Vertica
  • Utilized SQOOP, Kafka, Flume and Hadoop FileSystem APIs for implementing data ingestion pipelines
  • Worked on real time streaming, performed transformations on the data using Kafka and Spark Streaming
  • Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager
  • Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
  • Assisted in upgrading, configuring, and maintaining Hadoop infrastructure components such as Pig, Hive, and HBase.
  • Developed Spark scripts in Python using the PySpark shell during development.
  • Experienced in Hadoop Production support tasks by analyzing the Application and cluster logs
  • Created Hive tables, loaded with data, and wrote Hive queries to process the data. Created Partitions and used Bucketing on Hive tables and used required parameters to improve performance. Developed Pig and Hive UDFs as per business use-cases
  • Created a data pipeline for different ingestion and aggregation events and loaded consumer response data from an AWS S3 bucket into Hive external tables to serve as a feed for Tableau dashboards (see the sketch after this section).
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and XML
  • Worked extensively on AWS Components such as Airflow, Elastic Map Reduce (EMR), Athena, SnowFlake
  • Used Apache NiFi to automate data movement between different Hadoop components
  • Used NiFi to perform conversion of raw XML data into JSON, AVRO
  • Designed and published visually rich and intuitive Tableau dashboards and crystal reports for executive decision making
  • Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
  • Experienced in working with Hadoop from Cloudera Data Platform and running services through Cloudera manager
  • Used the Agile Scrum methodology (Scrum Alliance) for development.

Environment: Hadoop, HDFS, Hive, Scala, TEZ, Teradata, Teradata Studio, TDCH, Snowflake, MapReduce, YARN, Drill, Spark, Pig, Java, MySQL, Kerberos.
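
One way the S3-to-Hive external-table feed referenced above might look is sketched below in PySpark with Hive support. The bucket, database, table, and column names are hypothetical, and it assumes the cluster's s3a access is already configured.

```python
# Illustrative sketch: expose consumer-response data landed in S3 as a partitioned
# Hive external table, so BI tools such as Tableau can query it through Hive.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-to-hive-external")
         .enableHiveSupport()
         .getOrCreate())

# External table over the S3 landing area; data stays in S3, only metadata lives in Hive.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.consumer_response (
        response_id  STRING,
        campaign_id  STRING,
        score        INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/consumer-response/'
""")

# Register the partition written by the ingestion job for a given day.
spark.sql("""
    ALTER TABLE analytics.consumer_response
    ADD IF NOT EXISTS PARTITION (event_date = '2019-06-01')
    LOCATION 's3a://my-bucket/consumer-response/event_date=2019-06-01/'
""")

# Any SQL client on Hive (including the Tableau feed) can now query the table.
spark.sql("""
    SELECT campaign_id, AVG(score) AS avg_score
    FROM analytics.consumer_response
    WHERE event_date = '2019-06-01'
    GROUP BY campaign_id
""").show()
```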
