
Senior Data Engineer Resume


Phoenix, AZ

SUMMARY

  • 7+ years of strong experience in application development using PySpark, Java, Python, Scala, and R, with an in-depth understanding of Distributed Systems Architecture and Parallel Processing Frameworks
  • Strong experience using PySpark, HDFS, MapReduce, Hive, Pig, Spark, Sqoop, Oozie, and HBase
  • Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance
  • Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features
  • Experience in developing Spark Applications using Spark RDD, Spark - SQL and Dataframe APIs
  • Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka
  • Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop
  • Expertise in working with Hive data warehouse infrastructure: creating tables, distributing data through Partitioning and Bucketing, and developing and tuning HQL queries
  • Significant experience writing custom UDFs in Hive and custom Input Formats in MapReduce
  • Involved in creating Hive tables, loading them with data, and writing Hive ad-hoc queries that run internally on MapReduce and TEZ
  • Replaced existing MR jobs and Hive scripts with Spark SQL and Spark data transformations for efficient data processing (a PySpark sketch follows this summary list)
  • Experience developing Kafka producers and Kafka consumers for streaming millions of events per second
  • Strong understanding of real-time streaming technologies such as Spark Streaming and Kafka
  • Knowledge of job workflow management and coordination tools like Oozie
  • Strong experience building end-to-end data pipelines on the Hadoop platform
  • Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase
  • Strong understanding of Logical and Physical database models and entity-relationship modeling
  • Experience with Software development tools such as JIRA, Play, GIT
  • Good understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables
  • Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data
  • Strong understanding of Java Virtual Machines and multi-threading process
  • Experience in writing complex SQL queries, creating reports and dashboards
  • Proficient in using Unix-based command-line interfaces; expertise in handling ETL tools like Informatica
  • Excellent analytical, communication and interpersonal skills
  • Experienced in using Agile methodologies including Extreme Programming (XP), Scrum and Test-Driven Development (TDD)
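
The Spark SQL migration work summarized above typically looks roughly like the minimal sketch below. It is illustrative only; the database, table, and column names (sales_db.transactions, txn_ts, region, amount, daily_totals) are hypothetical placeholders, not taken from any actual project.

```python
# Minimal PySpark sketch of replacing a Hive/MR aggregation with Spark SQL and
# DataFrame transformations. All database, table, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sql")
         .enableHiveSupport()   # reuse the existing Hive metastore tables
         .getOrCreate())

# The old HiveQL aggregation, now executed by the Spark SQL engine.
daily_totals = spark.sql("""
    SELECT region, to_date(txn_ts) AS txn_date, SUM(amount) AS total_amount
    FROM sales_db.transactions
    GROUP BY region, to_date(txn_ts)
""")

# The same logic expressed with the DataFrame API instead of an MR job.
transactions = spark.table("sales_db.transactions")
daily_totals_df = (transactions
                   .groupBy("region", F.to_date("txn_ts").alias("txn_date"))
                   .agg(F.sum("amount").alias("total_amount")))

# Persist the result back to Hive for downstream consumers.
daily_totals_df.write.mode("overwrite").saveAsTable("sales_db.daily_totals")
```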

TECHNICAL SKILLS

Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS

Hadoop Ecosystem: Hadoop, MapReduce, Yarn, HDFS, Pig, Oozie, Zookeeper

BigData Ecosystem: Spark, SparkSQL, Spark Streaming, Hive, Impala, Hue

Data Ingestion: Sqoop, Flume, NiFi, Kafka

NOSQL Databases: HBase, Cassandra, MongoDB

Programming Languages: C, Scala, Core Java, J2EE (Servlets, JSP, JDBC, JavaBeans, EJB)

Frameworks: MVC, Struts, Spring, Hibernate

Web Technologies: HTML, CSS, XML, JavaScript, Maven

Scripting Languages: JavaScript, UNIX Shell, Python, R

Databases: Oracle 11g, MS Access, MySQL, SQL Server 2000/2005/2008/2012, Teradata

SQL Server Tools: SQL Server Management Studio, Enterprise Manager, Query Analyzer, Profiler, Export & Import (DTS).

IDE: Eclipse, Visual Studio, IDLE, IntelliJ

Web Services: Restful, SOAP

Tools: Bugzilla, QuickTestPro (QTP) 9.2, Selenium, Quality Center, Test Link, TWS, SPSS, SAS, Documentum, Tableau, Mahout

Methodologies: Agile, UML, Design Patterns

PROFESSIONAL EXPERIENCE

Confidential - Phoenix, AZ

Senior Data Engineer

Responsibilities:

  • Developed shell scripts that read JSON files and feed them into Sqoop and Hive jobs.
  • Ingested data from relational databases (Oracle, PostgreSQL) using Sqoop into HDFS, loaded it into Hive tables, AWS S3, and GCP (Google Cloud Platform), and transformed and analyzed large datasets by running Hive queries and using Apache Spark.
  • Worked with PySpark to migrate fixed-width, ORC, CSV, and other file formats (see the sketch after this section).
  • Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
  • Utilized Sqoop, ETL tools, and Hadoop FileSystem APIs for implementing data ingestion pipelines.
  • Worked on batch data at different granularities, ranging from hourly and daily to weekly and monthly.
  • Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
  • Assisted in upgrading, configuring, and maintaining Hadoop infrastructure components such as Ambari, Spark, and Hive.
  • Worked with StreamSets and developed data pipelines using it.
  • Developed and wrote SQL and stored procedures in Teradata; loaded data into Snowflake and wrote Snowflake SQL scripts.
  • Wrote TDCH scripts for full and incremental refreshes of Hadoop tables.
  • Optimized Hive queries by parallelizing work with partitioning and bucketing.
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and ORC.
  • Worked extensively on Teradata, Hadoop/Hive, Spark, SQL, and PL/SQL.
  • Designed and published StreamSets pipelines to migrate data.
  • Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
  • Experienced in working with Hadoop from the Hortonworks Data Platform and running services through Cloudera Manager.
  • Used the Agile Scrum methodology (Scrum Alliance) for development.

Environment: Hadoop, HDFS, AWS, Vertica, Scala, Kafka, MapReduce, YARN, Spark, Hive, MySQL, Kerberos, Maven, StreamSets.
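
The fixed-width and CSV-to-ORC migration work referenced above might look roughly like the following PySpark sketch. It is an assumption-laden illustration: the file path, column offsets, and the staging.customers_orc table are hypothetical placeholders.

```python
# Illustrative PySpark sketch: parse a fixed-width extract and land it in Hive as ORC.
# The input path, column positions, and target table name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("fixed-width-to-orc")
         .enableHiveSupport()
         .getOrCreate())

# Read the fixed-width file as raw lines, then slice columns by position.
raw = spark.read.text("/landing/customers_fixed_width.dat")
parsed = raw.select(
    F.trim(F.substring("value", 1, 10)).alias("customer_id"),
    F.trim(F.substring("value", 11, 30)).alias("customer_name"),
    F.trim(F.substring("value", 41, 8)).alias("signup_date"),
)

# Write the parsed records as ORC into a Hive staging table.
(parsed.write
       .mode("overwrite")
       .format("orc")
       .saveAsTable("staging.customers_orc"))
```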

Confidential - Charlotte, NC

Sr. Hadoop/Big Data Engineer

Responsibilities:

  • Setting up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
  • Creating shell scripts to process raw data and load it into AWS S3, GCP (Google Cloud Platform), and Redshift databases.
  • Planning and designing the data warehouse in a star schema; designing table structures and documenting them.
  • Developing BigQuery scripts in GCP and connecting them to reporting tools.
  • Designed and implemented an end-to-end big data platform on a Teradata Appliance.
  • Performing ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Hadoop Spark.
  • Involved in developing the architecture of the project's data migration solution.
  • Developed Python and Bash scripts to automate jobs and provide control flow.
  • Moving data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
  • Working with PySpark to perform ETL and generate reports.
  • Writing regression SQL to merge validated data into the production environment.
  • Developing Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
  • Writing UDFs in PySpark to perform transformations and loads.
  • Using NiFi to load data into HDFS as ORC files.
  • Writing TDCH scripts and Apache NiFi flows to load data from mainframe DB2 into the Hadoop cluster.
  • Working with ORC, Avro, and JSON file formats; creating external tables over these files and querying them using BigQuery in GCP (see the sketch after this list).
  • Working with Google Cloud Storage; researching and developing strategies to minimize Google Cloud costs.
  • Using Apache Solr for search operations on data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Working with multiple sources; migrating tables from Teradata and DB2 to the Hadoop cluster.
  • Migrating processed, ready tables from Hadoop to Google Cloud Storage (GCS) using the Aorta framework.
  • Performing source analysis: tracing data back to its sources and finding its roots through Teradata, DB2, etc.
  • Identifying the jobs that load the source tables and documenting it.
  • Being an active part of the Agile Scrum process with two-week sprints.
  • Working with Jira and Microsoft Planner to track project progress.
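
As referenced above, the BigQuery external-table work might look roughly like the sketch below, using the google-cloud-bigquery Python client. The project, dataset, bucket, table, and column names are hypothetical placeholders.

```python
# Hedged sketch: expose ORC files in Cloud Storage as a BigQuery external table
# and query them. Project, dataset, bucket, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Define an external table over ORC files landed in Cloud Storage.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders_ext
OPTIONS (
  format = 'ORC',
  uris = ['gs://my-data-lake/orders/*.orc']
)
"""
client.query(ddl).result()

# Query the external table directly; BigQuery reads the ORC files from GCS.
rows = client.query("""
    SELECT order_status, COUNT(*) AS order_count
    FROM analytics.orders_ext
    GROUP BY order_status
""").result()

for row in rows:
    print(row.order_status, row.order_count)
```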

Confidential, Charlotte, NC

Senior Hadoop/Big Data Engineer

Responsibilities:

  • Developed Hive, Bash scripts for source data validation and transformation. Automated data loading into HDFS and Hive for pre-processing the data using One Automation.
  • Gathered data from data warehouses in Teradata and Snowflake.
  • Developed Spark/Scala and Python code for a regular-expression project in the Hadoop/Hive environment.
  • Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
  • Generated reports using Tableau.
  • Experience building Big Data applications using Cassandra and Hadoop.
  • Utilized Sqoop, ETL tools, and Hadoop FileSystem APIs for implementing data ingestion pipelines.
  • Worked on batch data at different granularities, ranging from hourly and daily to weekly and monthly.
  • Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager
  • Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
  • Assisted in upgrading, configuring, and maintaining Hadoop infrastructure components such as Ambari, Pig, and Hive.
  • Developed and wrote SQL and stored procedures in Teradata; loaded data into Snowflake and wrote Snowflake SQL scripts.
  • Wrote TDCH scripts for full and incremental refreshes of Hadoop tables.
  • Optimized Hive queries by parallelizing work with partitioning and bucketing (see the sketch after this section).
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and ORC.
  • Worked extensively on Teradata, Hadoop/Hive, Spark, SQL, PL/SQL, and Snowflake SQL.
  • Designed and published visually rich and intuitive Tableau dashboards and crystal reports for executive decision making
  • Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
  • Experienced in working with Hadoop from the Hortonworks Data Platform and running services through Cloudera Manager.
  • Used the Agile Scrum methodology (Scrum Alliance) for development.

Environment: Hadoop, HDFS, AWS, Vertica, Bash, Scala, Kafka, MapReduce, YARN, Drill, Spark, Pig, Hive, Python, Java, NiFi, HBase, MySQL, Kerberos, Maven, Shell Scripting, SQL
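
The partitioning-and-bucketing optimization referenced above can be illustrated with the minimal PySpark sketch below; the staging.orders_raw and warehouse.orders_part tables, the column names, and the bucket count are hypothetical assumptions.

```python
# Illustrative sketch: write a table partitioned by date and bucketed by a join key
# so date filters prune directories and joins on the key avoid a full shuffle.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-bucket-demo")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("staging.orders_raw")  # hypothetical source table

(orders.write
       .partitionBy("load_date")        # one directory per load date
       .bucketBy(32, "customer_id")     # pre-hash rows by the join key
       .sortBy("customer_id")
       .format("orc")
       .mode("overwrite")
       .saveAsTable("warehouse.orders_part"))

# A query filtering on the partition column only reads the matching partitions.
spark.sql("""
    SELECT SUM(amount) AS total_amount
    FROM warehouse.orders_part
    WHERE load_date = '2020-01-15'
""").show()
```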

Confidential, Houston, TX

Hadoop-Spark Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop
  • Used PySpark and Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra (see the sketch after this section).
  • Loaded data into PySpark DataFrames and Spark RDDs and performed advanced procedures such as text analytics and processing, using the in-memory computation capabilities of Spark (PySpark and Scala) to generate the output response.
  • Handled large datasets using partitions, Spark in-memory capabilities, broadcasts in PySpark, and effective and efficient joins and transformations during the ingestion process itself.
  • Developed Scala and PySpark scripts using both DataFrames/SQL and RDD/MapReduce for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Worked with Impala and Kudu, creating a Spark-to-Impala/Kudu data ingestion tool.
  • Performed performance tuning of Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
  • Optimized existing algorithms in Hadoop using SparkSession, Spark SQL, DataFrames, and pair RDDs.
  • Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping
  • Designed, developed and did maintenance of data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems
  • Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis.
  • Created Hive tables for loading and analyzing data, Implemented Partitions, Buckets and developed Hive queries to process the data and generate the data cubes for visualizing
  • Implemented schema extraction for Parquet and Avro file Formats in Hive
  • Developed Hive scripts in Hive QL to de-normalize and aggregate the data
  • Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive
  • Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with the goal of adopting the former in the project.
  • Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala
  • Managed jobs using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
  • Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster
  • Worked with BI team to create various kinds of reports using Tableau based on the client's needs
  • Queried Parquet files by loading them into Spark DataFrames using Zeppelin notebooks.
  • Troubleshot problems that arose during batch data processing jobs.
  • Extracted the data from Teradata into HDFS/Dashboards using Spark Streaming
  • Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small-dataset processing and storage, and maintained the Hadoop cluster on AWS EMR.

Environment: Hadoop Yarn, Spark-Core, Spark-Streaming, Spark-SQL, Scala, Python, Kafka, Hive, Sqoop, Amazon AWS, Elastic Search, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux, Shell scripting
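
As referenced above, the Kafka-to-Cassandra flow could be sketched as below with Structured Streaming and the DataStax Spark-Cassandra connector. This is illustrative only: it assumes the connector package is on the Spark classpath, and the broker address, topic, event schema, keyspace, and table names are hypothetical.

```python
# Hedged sketch: consume events from Kafka, parse them, and append each micro-batch
# to Cassandra through the Spark-Cassandra connector (assumed to be on the classpath).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
         .getOrCreate())

event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_ts", LongType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
          .option("subscribe", "learner-events")                # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to the Cassandra table via the connector.
    (batch_df.write
             .format("org.apache.spark.sql.cassandra")
             .options(keyspace="learning", table="learner_events")
             .mode("append")
             .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/checkpoints/learner-events")
         .start())
query.awaitTermination()
```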

Confidential, Houston, TX

Hadoop Developer

Responsibilities:

  • Developed PIG scripts for source data validation and transformation. Automated data loading into HDFS and PIG for pre-processing the data using Oozie
  • Created storage with Amazon S3 for storing data. Worked on transferring data from Kafka topic into AWS S3 storage
  • Designed and implemented an ETL framework using Java and PIG to load data from multiple sources into Hive and from Hive into Vertica
  • Utilized SQOOP, Kafka, Flume and Hadoop FileSystem APIs for implementing data ingestion pipelines
  • Worked on real time streaming, performed transformations on the data using Kafka and Spark Streaming
  • Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager
  • Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
  • Assisted in upgrading, configuring, and maintaining Hadoop infrastructure components such as Pig, Hive, and HBase.
  • Developed Spark scripts in Python using the PySpark shell during development.
  • Experienced in Hadoop Production support tasks by analyzing the Application and cluster logs
  • Created Hive tables, loaded with data, and wrote Hive queries to process the data. Created Partitions and used Bucketing on Hive tables and used required parameters to improve performance. Developed Pig and Hive UDFs as per business use-cases
  • Created a data pipeline for different ingestion and aggregation events and loaded consumer response data from an AWS S3 bucket into Hive external tables to serve as a feed for Tableau dashboards (see the sketch after this section).
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and XML
  • Worked extensively on AWS Components such as Airflow, Elastic Map Reduce (EMR), Athena, SnowFlake
  • Used Apache NiFi to automate data movement between different Hadoop components
  • Used NiFi to perform conversion of raw XML data into JSON, AVRO
  • Designed and published visually rich and intuitive Tableau dashboards and crystal reports for executive decision making
  • Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
  • Experienced in working with Hadoop from Cloudera Data Platform and running services through Cloudera manager
  • Used the Agile Scrum methodology (Scrum Alliance) for development.

Environment: Hadoop, HDFS, Hive, Scala, TEZ, Teradata, Teradata Studio, TDCH, Snowflake, MapReduce, YARN, Drill, Spark, Pig, Java, MySQL, Kerberos.
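
One way the S3-to-Hive external-table feed referenced above might look is sketched below in PySpark with Hive support. The bucket, database, table, and column names are hypothetical, and it assumes the cluster's s3a access is already configured.

```python
# Illustrative sketch: expose consumer-response data landed in S3 as a partitioned
# Hive external table, so BI tools such as Tableau can query it through Hive.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-to-hive-external")
         .enableHiveSupport()
         .getOrCreate())

# External table over the S3 landing area; data stays in S3, only metadata lives in Hive.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.consumer_response (
        response_id  STRING,
        campaign_id  STRING,
        score        INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/consumer-response/'
""")

# Register the partition written by the ingestion job for a given day.
spark.sql("""
    ALTER TABLE analytics.consumer_response
    ADD IF NOT EXISTS PARTITION (event_date = '2019-06-01')
    LOCATION 's3a://my-bucket/consumer-response/event_date=2019-06-01/'
""")

# Any SQL client on Hive (including the Tableau feed) can now query the table.
spark.sql("""
    SELECT campaign_id, AVG(score) AS avg_score
    FROM analytics.consumer_response
    WHERE event_date = '2019-06-01'
    GROUP BY campaign_id
""").show()
```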
