Sr Data Engineer Resume
Pataskala, Ohio
SUMMARY
- 7 years of strong experience building end-to-end data pipelines using PySpark, Python, and AWS services, with an in-depth understanding of distributed systems architecture and parallel processing frameworks.
- Experience writing complex SQL queries and creating reports and dashboards.
- Proficient with Unix-based command-line interfaces; expertise with ETL tools such as Informatica.
- Designed and set up an Enterprise Data Lake to support various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
- Responsible for maintaining quality data at the source by performing operations such as cleaning and transformation, and for ensuring integrity in a relational environment by working closely with the stakeholders and solution architect.
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
- Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access for new users.
- Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.
- Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item so it can be suggested automatically, using Kinesis Firehose and an S3 data lake.
- Strong experience using PySpark, HDFS, MapReduce, Hive, Pig, Spark, Sqoop, Oozie, and HBase.
- Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance
- Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new features
- Experience in developing Spark applications using Spark RDD, Spark SQL, and DataFrame APIs
- Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka (see the illustrative sketch at the end of this summary)
- Experience in moving data into and out of HDFS and relational database systems (RDBMS) using Apache Sqoop
- Expertise in working with the Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and developing and tuning HQL queries
- Significant experience writing custom UDFs in Hive and custom Input Formats in MapReduce
- Involved in creating Hive tables, loading them with data, and writing ad-hoc Hive queries that run internally in MapReduce and Tez; replaced existing MR jobs and Hive scripts with Spark SQL and Spark data transformations for efficient data processing
- Experience developing Kafka producers and Kafka consumers for streaming millions of events per second on streaming data
- Strong understanding of real-time streaming technologies such as Spark and Kafka
- Knowledge of job workflow management and coordinating tools like Oozie
- Strong experience building end-to-end data pipelines on the Hadoop platform
- Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase
- Strong understanding of Logical and Physical database models and entity-relationship modeling
- Experience with software development tools such as JIRA, Play, and Git
- Good understanding of data modeling (dimensional and relational) concepts such as star schema modeling, snowflake schema modeling, and fact and dimension tables
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data
- Experienced in using Agile methodologies including extreme programming, SCRUM and Test-Driven Development (TDD)
- Excellent analytical, communication and interpersonal skills
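Illustrative sketch of the Kafka-to-Spark streaming pattern referenced above. This is a minimal, hypothetical example, assuming a local Kafka broker, a hypothetical "events" topic, and the spark-sql-kafka package on the classpath; it is not code from any listed engagement.

    # Minimal PySpark Structured Streaming sketch: read JSON events from Kafka,
    # parse them with an explicit schema, and print a running count per item.
    # Broker address, topic, and field names are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    schema = StructType([
        StructField("user_id", StringType()),
        StructField("item_id", StringType()),
        StructField("quantity", IntegerType()),
    ])

    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*"))

    # Running count of events per item, written to the console sink for inspection.
    query = (events.groupBy("item_id").count()
        .writeStream
        .outputMode("complete")
        .format("console")
        .start())
    query.awaitTermination()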
TECHNICAL SKILLS
Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS
Hadoop Ecosystem: Hadoop, MapReduce, Yarn, HDFS, Pig, Oozie, Zookeeper
Big Data Ecosystem: Spark, Spark SQL, Spark Streaming, Hive, Impala, Hue
Data Ingestion: Sqoop, Flume, NiFi, Kafka
NOSQL Databases: HBase, Cassandra, MongoDB
Programming Languages: C, Scala, Core Java, J2EE (Servlets, JSP, JDBC, JavaBeans, EJB); Frameworks: MVC, Struts, Spring, Hibernate
Web Technologies: HTML, CSS, XML, JavaScript, Maven
Scripting Languages: JavaScript, UNIX shell, Python, R
Databases: Oracle 11g, MS Access, MySQL, SQL Server 2000/2005/2008/2012, Teradata
SQL Server Tools: SQL Server Management Studio, Enterprise Manager, Query Analyzer, Profiler, Export & Import (DTS).
IDE: Eclipse, Visual Studio, IDLE, IntelliJ
Web Services: Restful, SOAP
Tools: Bugzilla, Quick Test Pro (QTP) 9.2, Selenium, Quality Center, Test Link, TWS, SPSS, SAS, Documentum, Tableau, Mahout
Methodologies: Agile, UML, Design Patterns
PROFESSIONAL EXPERIENCE:
Sr Data Engineer
Confidential, Pataskala, Ohio
Responsibilities:
- Developed shell scripts that read JSON files and apply them to Sqoop and Hive jobs.
- Ingested data from relational databases (Oracle, PostgreSQL) using Sqoop into HDFS and AWS S3, loaded it into Hive tables, and transformed and analyzed large datasets by running Hive queries and using Apache Spark.
- Worked with PySpark to migrate fixed-width, ORC, CSV, and other file formats (see the illustrative sketch following this role).
- Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
- Utilized Sqoop, ETL, and Hadoop File System APIs for implementing data ingestion pipelines.
- Worked on Batch data of different granularity ranging from hourly, daily to weekly and monthly.
- Handled Hadoop cluster installations in various environments such as Unix, Linux, and Windows
- Assisted in upgrading, configuring, and maintaining various Hadoop infrastructure components such as Ambari, Spark, and Hive.
- Worked with StreamSets and developed data pipelines using it.
- Developed and wrote SQL and stored procedures in Teradata; loaded data into Snowflake and wrote SnowSQL scripts.
- Wrote TDCH scripts for full and incremental refreshes of Hadoop tables.
- Optimized Hive queries by parallelizing with partitioning and bucketing.
- Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and ORC.
- Worked extensively on Teradata, Hadoop Hive, Spark, SQL, and PL/SQL
- Designed and published visually rich and intuitive StreamSets pipelines to migrate data
- Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
- Experienced in working with Hadoop from the Hortonworks Data Platform and running services through Cloudera Manager
- Used Agile Scrum methodology/ Scrum Alliance for development
Environment: Hadoop, HDFS, AWS, Vertica, Scala, Kafka, MapReduce, YARN, Spark, Hive, MySQL, Kerberos, Maven, StreamSets.
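Illustrative PySpark sketch of the file-migration pattern described in this role. This is a minimal example under assumed names; the paths, schema handling, and table names are illustrative, not project specifics.

    # Hedged sketch: load a delimited extract with PySpark and write it to a
    # partitioned, Hive-managed ORC table. Paths and names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date, col

    spark = (SparkSession.builder
        .appName("file-migration-sketch")
        .enableHiveSupport()
        .getOrCreate())

    raw = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("hdfs:///landing/orders/*.csv"))

    # Normalize the date column so it can serve as the partition key.
    cleaned = raw.withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))

    (cleaned.write
        .mode("overwrite")
        .partitionBy("order_date")
        .format("orc")
        .saveAsTable("analytics.orders"))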
Sr. Hadoop/Big Data Engineer
Confidential, Tampa, FL
Responsibilities:
- Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
- Created shell scripts to process the raw data and load it into AWS S3 and Redshift databases.
- Planned and designed the data warehouse in a star schema; designed the table structures and documented them.
- Designed and implemented an end-to-end big data platform on the Teradata Appliance.
- Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Hadoop Spark.
- Involved in developing the architecture solution for the project's data migration.
- Developed Python, Bash scripts to automate and provide Control flow.
- Moved data from Teradata to the Hadoop cluster using TDCH/FastExport and Apache NiFi.
- Worked with PySpark to perform ETL and generate reports.
- Wrote regression SQL to merge the validated data into the production environment.
- Developed Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
- Wrote UDFs in PySpark to perform transformations and loads (see the illustrative sketch at the end of this role).
- Used NiFi to load data into HDFS as ORC files.
- Wrote TDCH scripts and used Apache NiFi to load data from mainframe DB2 to the Hadoop cluster.
- Worked with Google Cloud Storage; researched and developed strategies to minimize cost in Google Cloud.
- Used Apache Solr for search operations on data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Worked with multiple sources; migrated tables from Teradata and DB2 to the Hadoop cluster.
- Performed source analysis, tracing the data back to its sources and finding its roots through Teradata, DB2, etc.
- Identified the jobs that load the source tables and documented them.
- Actively participated in the Agile Scrum process with two-week sprints.
- Worked with Jira and Microsoft Planner to track the progress of the project.
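Illustrative PySpark UDF sketch for the transform-and-load work above. A minimal example: the function, column, path, and table names are assumptions, not project code.

    # Hedged sketch of a PySpark UDF applied during a transform step, with the
    # result persisted to HDFS as ORC. All names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
        .appName("udf-sketch")
        .enableHiveSupport()
        .getOrCreate())

    def normalize_account(acct):
        # Strip separators and left-pad to a fixed 12-character width.
        return acct.replace("-", "").zfill(12) if acct else None

    normalize_udf = udf(normalize_account, StringType())

    accounts = spark.table("staging.accounts")
    curated = accounts.withColumn("account_id", normalize_udf(col("account_id")))

    # Write back to HDFS as ORC, mirroring the NiFi landing format used here.
    curated.write.mode("overwrite").orc("hdfs:///curated/accounts")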
Big Data Engineer
Confidential, Greenwood Village, CO
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop
- Used PySpark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra
- Loaded data into PySpark DataFrames and Spark RDDs and performed advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities with Scala to generate the output response
- Handled large datasets using partitions, Spark in-memory capabilities, broadcasts in PySpark, and effective, efficient joins and transformations during the ingestion process itself
- Developed Scala scripts using both DataFrame/SQL and RDD/MapReduce APIs for data aggregation and queries, and wrote data back into the OLTP system through Sqoop
- Worked with Impala and Kudu to create a Spark-to-Impala/Kudu data ingestion tool
- Performance-tuned Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings
- Optimized existing algorithms in Hadoop using SparkSession, Spark SQL, DataFrames, and pair RDDs
- Used the DataStax Spark Cassandra Connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting, and grouping (see the illustrative sketch following this role)
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems
- Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis
- Created Hive tables for loading and analyzing data, implemented partitions and buckets, and developed Hive queries to process the data and generate data cubes for visualization
- Implemented schema extraction for Parquet and Avro file Formats in Hive
- Developed Hive scripts in HiveQL to de-normalize and aggregate the data
- Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive
- Worked on a POC to compare the processing time of Impala with Apache Hive for batch applications, in order to implement the former in the project
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala
- Experience in job management using the Fair Scheduler; developed job processing scripts using Oozie workflows
- Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster
- Worked with the BI team to create various kinds of reports in Tableau based on the client's needs
- Experience in querying Parquet files by loading them into Spark DataFrames using a Zeppelin notebook
- Experience in troubleshooting problems that arise during batch data processing jobs
- Extracted data from Teradata into HDFS/dashboards using Spark Streaming
- Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR
Environment: Hadoop Yarn, Spark-Core, Spark-Streaming, Spark-SQL, Scala, Python, Kafka, Hive, Sqoop, Amazon AWS, Elastic Search, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux, Shell scripting
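Illustrative sketch of the DataStax Spark Cassandra Connector write path referenced in this role. A minimal example: the connection host, keyspace, table, and source path are assumptions, and the connector package must be available on the Spark classpath.

    # Hedged sketch: write a DataFrame to Cassandra via the DataStax connector.
    # Host, keyspace, table, and source path are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("cassandra-write-sketch")
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .getOrCreate())

    learners = spark.read.orc("hdfs:///curated/learners")

    (learners.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="analytics", table="learner_profile")
        .mode("append")
        .save())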
Hadoop Developer
Confidential, New York, NY
Responsibilities:
- Developed Hive and Bash scripts for source data validation and transformation; automated data loading into HDFS and Hive for pre-processing the data using One Automation.
- Gathered data from data warehouses in Teradata and Snowflake.
- Developed Spark/Scala and Python code for a regular expression project in the Hadoop/Hive environment.
- Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
- Generated reports using Tableau.
- Experience building Big Data applications using Cassandra and Hadoop
- Utilized SQOOP, ETL and Hadoop File System APIs for implementing data ingestion pipelines
- Worked on Batch data of different granularity ranging from hourly, daily to weekly and monthly.
- Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager
- Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
- Assisted in upgrading, configuration, and maintenance of various Hadoop infrastructures like Ambari, PIG, and Hive.
- Developed and wrote SQL and stored procedures in Teradata; loaded data into Snowflake and wrote SnowSQL scripts.
- Wrote TDCH scripts for full and incremental refreshes of Hadoop tables.
- Optimized Hive queries by parallelizing with partitioning and bucketing (see the illustrative sketch following this role).
- Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and ORC.
- Worked extensively on Teradata, Hadoop Hive, Spark, SQL, PL/SQL, and SnowSQL
- Designed and published visually rich and intuitive Tableau dashboards and crystal reports for executive decision making
- Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
- Experienced in working with Hadoop from the Hortonworks Data Platform and running services through Cloudera Manager
- Used Agile Scrum methodology/ Scrum Alliance for development
Environment: Hadoop, HDFS, AWS, Vertica, Bash, Scala, Kafka, MapReduce, YARN, Drill, Spark, Pig, Hive, Python, Java, NiFi, HBase, MySQL, Kerberos, Maven, Shell Scripting, SQL.
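Illustrative sketch of the partitioning and bucketing pattern referenced in this role. A minimal example under assumed names; database, table, and column names are illustrative, not project code.

    # Hedged sketch: write a fact table partitioned by date (for partition
    # pruning) and bucketed by customer_id (to reduce shuffle in joins).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("partition-bucket-sketch")
        .enableHiveSupport()
        .getOrCreate())

    sales = spark.table("staging.sales_raw")

    (sales.write
        .mode("overwrite")
        .partitionBy("sale_date")
        .bucketBy(32, "customer_id")
        .sortBy("customer_id")
        .format("orc")
        .saveAsTable("dw.sales_fact"))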