Senior Data Engineer Resume
Dallas, TX
SUMMARY
- Cloudera Certified Professional with 12+ years of IT experience, including 8 years in Big Data, Hadoop architecture, Spark development, design, ETL development, and ecosystem analytics in the Banking and Insurance domains.
- Extensive experience in Apache Spark, MapReduce, YARN, Scala, PySpark, Hive, Impala, Pig, Sqoop, Kafka, Hue, HBase, Python, Oozie, Azure, Teradata, MongoDB, Unix shell scripting, Core Java, the DMX-h tool, Mainframe, and ETL development.
- Completed the Teradata 12 Basics certification.
- 1+ years of extensive Snowflake cloud data warehouse implementation on AWS.
- Created a data ingestion framework in Snowflake for batch data from different file formats (XML, JSON, Avro) using Snowflake Stage and Snowpipe.
- Oracle Certified Professional; completed the Java SE 6 Programmer certification.
- Strong knowledge of Spark Core, Spark SQL, GraphX, and Spark Streaming implementation.
- Experienced in writing PySpark scripts and performing joins and other ETL operations on DataFrames (see the illustrative sketch at the end of this summary).
- Good knowledge of Scala-based Spark programs.
- Strong experience with Hadoop distributions such as Cloudera, MapR, and Hortonworks.
- Good working experience with Amazon AWS.
- Used Kubernetes to orchestrate the deployment, scaling, and management of Docker containers.
- Experienced in moving data with Sqoop between HDFS and relational/non-relational database systems in both directions, including handling large incremental loads.
- Experience using Hive partitioning and bucketing, executing different types of joins on Hive tables, and implementing Hive SerDes such as JSON and Avro.
- Good understanding of NoSQL databases and hands on work experience in writing applications on NoSQL databases like HBase and MongoDB.
- Good working knowledge of RESTful APIs used for HTTP GET, PUT, POST, and DELETE requests.
- Strong Experience in working with Databases like Teradata, Oracle, and MySQL and proficiency in writing complex SQL queries.
- Extensive knowledge of Teradata utilities such as SQL Assistant, BTEQ, TPT, FastLoad, FastExport, and MLOAD.
- Good knowledge on processing of semi-structured and unstructured data.
- Worked with different file formats such as JSON, Parquet, Avro, XML, CSV, and XLS.
- Good knowledge of Normalization, Fact Tables and Dimension Tables, also dealing with OLAP and OLTP systems.
- Experience in identifying and resolving root causes of ETL production issues.
- Excellent understanding of Hadoop architecture, Hadoop Distributed File System, and various components such as HDFS, Name Node, Data Node, Job Tracker, Task Tracker, YARN, Spark Architecture and MapReduce concepts.
- Supporting the current production systems.
- Good knowledge on setting up the cluster environment in Azure for PoCs and for R&D purposes.
- Good working knowledge of Mainframe development and legacy system conversions using JCL, COBOL, and CICS.
- Experience in version control tools like Tortoise SVN.
- Experienced working with JIRA for project management, GIT for source code management, JENKINS for continuous integration and ServiceNow for change request/Incident management.
- Implemented standards and processes for Hadoop based application design and implementation.
- Experience in maintenance, enhancements, and performance tuning of ETL code.
- Worked with waterfall and Agile methodologies.
- Led the team and provided a roadmap on technical grounds.
- Strong team player with good communication, analytical, presentation and inter-personal skills.
- Strong knowledge in data modelling, effort estimation, ETL Design, development, system testing, implementation, and production support.
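Below is a minimal, illustrative PySpark sketch of the kind of DataFrame join/ETL work described in this summary. The paths, table names, and column names are hypothetical placeholders, not taken from any specific project listed here.

```python
# Minimal PySpark join/ETL sketch; all names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-join-sketch").getOrCreate()

# Read two hypothetical source datasets stored as Parquet.
accounts = spark.read.parquet("/data/raw/accounts")
customers = spark.read.parquet("/data/raw/customers")

# Join on a shared key, filter active records, and derive a load date.
curated = (
    accounts.join(customers, on="customer_id", how="inner")
            .filter(F.col("status") == "ACTIVE")
            .withColumn("load_dt", F.current_date())
)

# Write the curated output for downstream reporting.
curated.write.mode("overwrite").parquet("/data/curated/active_accounts")
```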
TECHNICAL SKILLS
Big Data Technologies: Hadoop, MapReduce, YARN, Hive, Pig, HBase, Impala, Hue, Sqoop, Kafka
Spark components: RDD, Spark SQL (Data Frames and Dataset), Spark Streaming.
Programming Languages: SQL, C, C++, Core Java, Python, Scala, Shell Scripting, Cobol
Databases: Oracle 12c/11g, Teradata 15/14, MySQL, SQL Server 2016/2014, DB2, and MongoDB.
Scripting and Query Languages: Shell scripting, PL/SQL, Java Script, HTML, DHTML, XML
Version Control / Schedulers: CVS, Tortoise SVN, Autosys, CA7, Oozie
Cloud Infrastructure: Microsoft Azure
ETL tools: Syncsort DMX-h and Informatica
Operating Systems: Windows, UNIX/Linux and Mainframe z/OS
Tools: Eclipse, Intellij, SQL Assistant, SQL Plus, SSH Tectia, Super Putty, JIRA, Incident Management (IM)
PROFESSIONAL EXPERIENCE
Confidential, Dallas, TX
Senior Data Engineer
Responsibilities:
- Worked on AWS Services like Athena, Glue, EC2 and S3.
- Worked on legacy SQL Server data systems, helping to transform and prepare the data for reporting to different teams.
- Worked on ETL using Spark, Kafka, Hive, HBase, Oozie on Hadoop.
- Created a data ingestion framework in Snowflake for batch data from different file formats (XML, JSON, Avro) using Snowflake Stage and Snowpipe (see the ingestion sketch after this role).
- Worked with Infrastructure and Network Security teams to implement Security and Authentication between Snowflake and AWS.
- Created database structures to replicate existing Teradata structures and applications being migrated to the Big Data stack (Hive, Spark, etc.).
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using Spark.
- Worked on ETL tools like TPT loads, Sqoop run book and stream ingestion.
- Created HBase tables to load large sets of structured and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
- Involved in transferring legacy relational database tables to HDFS and HBase tables using Sqoop, and vice versa.
- Consumed data from Kafka queues using Spark; configured different topologies for the Spark cluster and deployed them on a regular basis.
- Wrote PySpark code to import data from MongoDB collections to HDFS and to ingest data back into MongoDB collections.
- Worked on different file formats (AVRO, PARQUET, TEXTFILE, CSV) and different compression codecs (GZIP, SNAPPY, LZO).
- Wrote complex Hive queries involving dynamically partitioned external Hive tables that store rolling-window user viewing history (see the partitioning sketch after this role).
- Performed code reviews and walkthroughs of new applications and interfaces with all other teams in the project; reviewed all projects going live in the production environment to ensure the conditions and constraints of the coding standards and guidelines were met in design, architecture, and development.
- Automated processes through scripts and scheduled the developed jobs on daily automated schedules using schedulers.
- Prepared and executed unit test cases and provided support for User Acceptance Testing and System Integration Testing; resolved UAT and production issues within specified SLAs, tracked through the HP Quality Center tool.
- Worked on analyzing Hadoop cluster and different Big Data Components including Hive, Spark, Oozie.
- Responsible for the analysis, design, and testing phases and for documenting technical specifications.
- Involved in backing up data; log redirection was completely automated using UNIX scripts.
- Working with JIRA for project management, GIT for source code management, JENKINS for continuous integration.
Environment: Hadoop, HDFS, Spark, Kafka, Teradata, Scala, Hive, Pig, Sqoop, Oozie, HBase, MongoDB, PySpark, Impala, Cloudera Manager, Autosys, UNIX Shell Scripting.
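Illustrative sketch of the Snowflake batch-ingestion setup described above (external stage plus Snowpipe), expressed through the Python connector. The account, bucket, storage integration, and table names are hypothetical placeholders, not the project's actual objects.

```python
# Hedged sketch: create a file format, an external stage, a landing table,
# and an auto-ingest pipe (Snowpipe). All identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder
    user="etl_user",             # placeholder
    password="***",              # use a secrets manager in practice
    warehouse="LOAD_WH",
    database="RAW",
    schema="BATCH",
)

ddl_statements = [
    "CREATE FILE FORMAT IF NOT EXISTS json_ff TYPE = 'JSON'",
    """CREATE STAGE IF NOT EXISTS batch_stage
         URL = 's3://example-bucket/batch/'
         STORAGE_INTEGRATION = s3_int
         FILE_FORMAT = (FORMAT_NAME = 'json_ff')""",
    # Land raw JSON documents into a single VARIANT column.
    "CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)",
    """CREATE PIPE IF NOT EXISTS batch_pipe AUTO_INGEST = TRUE AS
         COPY INTO raw_events
         FROM @batch_stage
         FILE_FORMAT = (FORMAT_NAME = 'json_ff')""",
]

cur = conn.cursor()
try:
    for stmt in ddl_statements:
        cur.execute(stmt)
finally:
    cur.close()
    conn.close()
```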
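A hedged sketch of a dynamically partitioned external Hive table populated from Spark, similar in spirit to the viewing-history tables described above. The database, table, location, and column names are illustrative assumptions.

```python
# Dynamically partitioned external Hive table, written from PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-dynamic-partition-sketch")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()
)

# External table partitioned by view_date; data files live outside the warehouse dir.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.user_view_history (
        user_id STRING,
        content_id STRING,
        watch_seconds INT
    )
    PARTITIONED BY (view_date STRING)
    STORED AS PARQUET
    LOCATION '/data/external/user_view_history'
""")

# Dynamic-partition insert for a rolling 30-day window; partition values
# come from the SELECT itself rather than being listed explicitly.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.user_view_history
    PARTITION (view_date)
    SELECT user_id, content_id, watch_seconds, view_date
    FROM staging.view_events
    WHERE view_date >= date_sub(current_date(), 30)
""")
```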
Confidential, Dallas, TX
Big Data Engineer
Responsibilities:
- Developed data ingestion and ETL workflows in Hadoop for new applications across various EAP data sources; worked with business analysts and DBAs on requirements gathering, analysis, coding, testing, deployment, and project coordination.
- Worked on ETL using Spark, Kafka, Hive, HBase, Oozie on Hadoop.
- Created database structures to replicate existing Teradata structures and applications being migrated to the Big Data stack (Hive, Spark, etc.).
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using Spark.
- Worked on ETL tools like TPT loads, Sqoop run book and stream ingestion.
- Created HBase tables to load large sets of structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Worked on REST APIs that use HTTP requests to communicate with web services, following constraints such as client-server architecture, statelessness, and cacheability.
- Worked on moving some of the data objects onto Amazon AWS.
- Supported the current production systems with daily scheduled jobs and other on-demand tickets.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Developed Pig scripts to transform data into a structured format; these were automated through Oozie coordinators.
- Involved in transferring legacy relational database tables to HDFS and HBase tables using Sqoop, and vice versa.
- Built and maintained Docker container clusters managed by Kubernetes, using Linux, Bash, and Git.
- Utilized Kubernetes as the runtime environment of the CI/CD system to build, test, and deploy.
- Consumed data from Kafka queues using Spark; configured different topologies for the Spark cluster and deployed them on a regular basis (see the streaming sketch after this role).
- Wrote PySpark code to import data from MongoDB collections to HDFS and to ingest data back into MongoDB collections.
- Worked on different file formats (AVRO, PARQUET, TEXTFILE, CSV) and different compression codecs (GZIP, SNAPPY, LZO).
- Wrote complex Hive queries involving dynamically partitioned external Hive tables that store rolling-window user viewing history.
- Performed code reviews and walkthroughs of new applications and interfaces with all other teams in the project; reviewed all projects going live in the production environment to ensure the conditions and constraints of the coding standards and guidelines were met in design, architecture, and development.
- Automated processes through scripts and scheduled the developed jobs on daily automated schedules using schedulers.
- Prepared and executed unit test cases and provided support for User Acceptance Testing and System Integration Testing; resolved UAT and production issues within specified SLAs, tracked through the HP Quality Center tool.
- Worked on analyzing Hadoop cluster and different Big Data Components including Hive, Spark, Oozie.
- Responsible for the analysis, design, and testing phases and for documenting technical specifications.
- Involved in backing up data; log redirection was completely automated using UNIX scripts.
- Working with JIRA for project management, GIT for source code management, JENKINS for continuous integration.
Environment: Hadoop, HDFS, Spark, Kafka, Teradata, Scala, Hive, Pig, Sqoop, Oozie, HBase, MongoDB, PySpark, Impala, Cloudera Manager, Autosys, UNIX Shell Scripting.
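A minimal sketch of consuming a Kafka topic with Spark, along the lines of the Kafka-to-Spark ingestion described above, using PySpark Structured Streaming as one common approach. The broker address, topic name, and event schema are assumptions, and the job also needs the spark-sql-kafka connector package on the classpath.

```python
# Illustrative Kafka consumer using PySpark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-consumer-sketch").getOrCreate()

# Assumed event schema for the JSON messages on the topic.
event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", StringType()),
])

raw = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
         .option("subscribe", "transactions")                # placeholder topic
         .option("startingOffsets", "latest")
         .load()
)

# Kafka delivers bytes; cast the value to string and parse the JSON payload.
parsed = raw.select(
    F.from_json(F.col("value").cast("string"), event_schema).alias("e")
).select("e.*")

# Land the parsed stream as Parquet with checkpointing for fault tolerance.
query = (
    parsed.writeStream.format("parquet")
          .option("path", "/data/stream/transactions")
          .option("checkpointLocation", "/chk/transactions")
          .outputMode("append")
          .start()
)
query.awaitTermination()
```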
Confidential, Chicago, IL
Sr ETL Developer
Responsibilities:
- Worked with the FSC team to establish the connection between FSC and EAP and to transfer the Retail Bank source data to EAP.
- Created the logic and code for delta processing of customer- and account-level data during transfer to SFMC; handled the entire delta process with PySpark and Hive joins.
- Created the database structures in EAP based on the FSC source data using Hive, Spark, etc.
- Wrote complex Hive queries involving dynamically partitioned external Hive tables and different join operations in Hive to perform ETL.
- Experienced in creating a SparkContext and performing RDD transformations and actions using the Python API.
- Performed various data warehousing operations like de-normalization and aggregation on Hive using DML statements.
- Developed Pig Latin scripts for transformations, sort, group, event joins, filter.
- Understood and analyzed the functionality of, and modifications to, the existing business logic per customer specifications; developed code by adopting coding standards that promote reliability, maintainability, reusability, and security in the systems.
- Experience with Amazon AWS
- Supported Production systems with issue log and Business Manager tickets
- Performed transformations, cleaning and filtering on imported data using Hive, Spark, and loaded final data into HDFS and created hive table on it.
- Experience in Oozie and workflow scheduler to manage Hadoop jobs with control flows.
- Scheduled the production jobs using Autosys based on various success/dependency conditions, and used file watchers while receiving the data from FSC.
- Used Tortoise SVN for migrating the code to various regions (SIT, UAT, PROD).
- Created Spark DataFrames by reading the validated Parquet files and ran SQL queries using SQLContext to get the common transaction data from all the Retail Bank systems.
- Implemented repartitioning, caching, and broadcast techniques on RDDs, DataFrames, and variables to achieve better performance on the cluster (see the tuning sketch after this role).
- Designed various dimension tables using HBase and written scripts to automate the data loading to dimension tables.
- Involved in the Development using Spark SQL with Python.
- Worked on ETL using Spark, HBase (NoSQL), Hive, HDFS on Hadoop.
- Experienced in tuning Spark applications by setting the right batch interval time, the correct level of parallelism, and appropriate memory settings.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself.
- Created partitions on tables to fetch data quickly.
- Created dashboards using the Arcadia Data tool.
- Prepared and executed unit test cases and provided support for User Acceptance Testing and System Integration Testing; resolved UAT and production issues within specified SLAs.
Environment: Hadoop, HDFS, Spark, Kafka, Teradata, Scala, Hive, Pig, Sqoop, Oozie, HBase, PySpark, Impala, Cloudera Manager, Autosys, UNIX Shell Scripting, Python and Tortoise SVN.
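A brief PySpark sketch of the repartition, cache, and broadcast-join tuning referenced above. The dataset paths, join key, column names, and partition count are illustrative assumptions rather than the project's actual values.

```python
# Repartition / cache / broadcast-join tuning sketch; all names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("perf-tuning-sketch").getOrCreate()

txns = spark.read.parquet("/data/fsc/transactions")   # large fact dataset
branches = spark.read.parquet("/data/fsc/branches")   # small dimension dataset

# Repartition the large DataFrame on the join key to spread the shuffle evenly,
# and cache it because several downstream aggregations reuse it.
txns = txns.repartition(200, "branch_id").cache()

# Broadcast the small dimension so the join avoids shuffling the fact table.
joined = txns.join(F.broadcast(branches), on="branch_id", how="left")

daily_totals = joined.groupBy("branch_name", "txn_date").agg(
    F.sum("amount").alias("total_amount")
)
daily_totals.write.mode("overwrite").parquet("/data/curated/daily_totals")
```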
Confidential
Sr. Hadoop Developer
Responsibilities:
- Identify the source data from different systems and map the data into the warehouse.
- Prepare functional and technical documentation for Source - target systems mapping.
- Responsible for building scalable distributed data solutions using Hadoop.
- Working with JIRA for project management, GIT for source code management, JENKINS for continuous integration, and ServiceNow for change request/incident management.
- Design and development of all the modules of Metadata Driven Tool Application such as Data Quality Assessment, Data Profiling Assessment, Data Reconciliation, Data Ingestion, Data Standardization (ETL) and User Login module with authentication and authorization features.
- Migration of Metadata from Big Data environment to Oracle database and vice versa
- Prepare test specifications and system test plan.
- Created PySpark scripts in Hadoop using Spark Context/Spark Session, Spark-SQL, Data Frames and RDD’s.
- Developed Scripts to take the backup of the current data sets from business critical Teradata Tables and move it to Hadoop tables.
- Worked on different file formats (AVRO, PARQUET, TEXTFILE, CSV) and different compression codecs (GZIP, SNAPPY).
- Integrated Hive with HBase to upload data and perform row level operations
- Experienced in creating a SparkContext and performing RDD transformations and actions using the Python API.
- Used the SparkContext to create RDDs from incoming data and performed Spark transformations and actions.
- Created DataFrames from text files to execute Spark SQL queries.
- Used Spark's enableHiveSupport() to execute Hive queries in Spark (see the Hive-on-Spark sketch after this role).
- Experience working with Hadoop distribution of Cloudera and highly capable in installing and managing roles.
- Wrote Pig scripts using various input and output formats, and designed custom formats as per the business requirements.
Environment: Hadoop, Spark, Kafka, Oracle, HDFS, Scala, Hive, Pig, Sqoop, Oozie, HBase, PySpark, Impala, Autosys, Python and Cloudera Manager.
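A small sketch showing Hive queries executed from PySpark with Hive support enabled, in the spirit of the Teradata-to-Hadoop backup work described above. The database and table names are placeholders.

```python
# Running Hive queries from PySpark with Hive support enabled; names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-query-sketch")
    .enableHiveSupport()   # lets Spark SQL read and write Hive metastore tables
    .getOrCreate()
)

# Snapshot a business-critical table into a Hadoop-side backup table
# (illustrative of the backup flow described above).
spark.sql("CREATE DATABASE IF NOT EXISTS backup_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS backup_db.accounts_snapshot
    STORED AS PARQUET
    AS SELECT * FROM source_db.accounts
""")

# Quick sanity check on the backed-up row count.
spark.sql("SELECT COUNT(*) AS row_count FROM backup_db.accounts_snapshot").show()
```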
Confidential, Grapevine, TX
Programmer Analyst
Responsibilities:
- Involved in Design, development, testing, maintaining and documentation for implementation/upgrade software and products in Oracle Applications.
- Worked on RICEW (Reports, Interfaces, Conversions, Extensions and Workflows) components.
- Designed and Developed Custom PL/SQL packages and procedures to customize the business flows according to User/Client requirements.
- Single-handedly handled project management activities such as estimation, work allocation, analysis of project requirement scope, design, and test cycle support; coordinated onsite project management activities with the client, functional team, QA teams, cross-flow track teams, transition teams, and BSOS at various stages of the project life cycle, ensuring a smooth go-live.
- Involved in MD50 Analysis and prepared Technical Specification, Unit Test Case documents. Coding and Unit testing of Forms/Packages/Reports and other PL/SQL objects. Involved in review of other migrated Components.
- Involved in Production Support (Maintenance) and documenting Functional, Technical and migration requirements.