Sr. AWS Data Engineer Resume
Oak Brook, IL
SUMMARY
- 9+ years of professional IT experience in Big Data using the Hadoop framework, spanning analysis, design, development, documentation, deployment, and integration with SQL and Big Data technologies, as well as Java/J2EE technologies on AWS and Azure.
- Experience in Hadoop Ecosystem components like Hive, HDFS, Sqoop, Spark, Kafka, Pig.
- Experience in architecting, designing, installing, configuring, and managing Apache Hadoop clusters and the MapR, Hortonworks, and Cloudera Hadoop distributions.
- Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as ResourceManager, NodeManager, NameNode, and DataNode, as well as MapReduce concepts and the HDFS framework.
- Expertise in data migration, data profiling, data ingestion, data cleansing, transformation, data import, and data export using multiple ETL tools such as Informatica PowerCenter.
- Working knowledge of Spark RDD, Dataframe API, Data set API, Data Source API, Spark SQL and Spark Streaming.
- Experience importing and exporting data with Sqoop between HDFS and relational database systems, and loading the data into partitioned Hive tables.
- Worked with HQL for data extraction and join operations, with good experience optimizing Hive queries.
- Experience with partitioning and bucketing in Hive; designed both managed and external Hive tables to optimize performance.
- Developed Spark code using Scala, Python and Spark-SQL/Streaming for faster processing of data.
- Implemented Spark Streaming jobs in Scala by developing RDDs (Resilient Distributed Datasets), and used PySpark and spark-shell as needed.
- Extensive experience creating real-time data streaming solutions using Apache Spark/Spark Streaming, Kafka, and Flume.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Good working experience with AWS infrastructure services: Amazon Simple Storage Service (Amazon S3), EMR, Lambda functions, and Amazon Elastic Compute Cloud (Amazon EC2).
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Expertise in designing complex mappings, performance tuning, and working with slowly changing dimension tables and fact tables.
- Worked extensively with the Teradata utilities FastExport and MultiLoad to export and load data to and from different source systems, including flat files.
- Worked with Amazon Web Services (AWS) cloud services such as EC2, S3, EBS, RDS, and VPC.
- Implemented AWS computing and networking services to meet application needs.
- Adept in Agile/Scrum methodology and familiar with the SDLC from requirement analysis through system study, design, testing, debugging, documentation, and implementation.
- Techno-functional responsibilities include interfacing with users, identifying functional and technical gaps, estimating, designing custom solutions, development, producing documentation, and production support.
- Excellent interpersonal and communication skills; creative, research-minded, technically competent, and result-oriented, with problem-solving and leadership skills. Able to work effectively in cross-functional team environments.
- Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
- Good experience in handling messaging services using Apache Kafka.
- Knowledge of data mining and data warehousing using ETL tools; proficient in building reports and dashboards in Tableau (BI tool).
- Excellent knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
- Good understanding and knowledge of NoSQL databases like HBase and Cassandra.
- Good understanding of Amazon Web Services (AWS) such as EC2 for computing, S3 for storage, EMR, Step Functions, Lambda, Redshift, and DynamoDB.
- Good understanding and knowledge of Microsoft Azure services like HDInsight Clusters, BLOB, ADLS, Data Factory and Logic Apps.
- Worked with various file formats such as delimited text, JSON, and XML. Proficient with columnar file formats such as RC, ORC, and Parquet, with a good understanding of compression techniques used in Hadoop processing such as Gzip, Snappy, and LZO.
- Hands-on experience building enterprise applications using Java, J2EE, Spring, Hibernate, JSF, JMS, XML, EJB, JSP, Servlets, JSON, JNDI, HTML, CSS, JavaScript, SQL, and PL/SQL.
- Experienced in Software Development Lifecycle (SDLC) using SCRUM, Agile methodologies.
TECHNICAL SKILLS
BigData Eco Systems: Hadoop, HDFS, MapReduce, Hive
Programming: Python
Data Warehousing: Informatica PowerCenter 9.x/8.x/7.x, Informatica Cloud, Talend Open Studio & Integration Suite
Applications: Salesforce, RightNow, Eloqua
Databases: Oracle (9i/10g/11g), SQL Server 2005
BI Tools: Business Objects XI, Tableau 9.1
Query Languages: SQL, PL/SQL, T-SQL
Scripting Languages: Unix shell, Python, Windows PowerShell
RDBMS Utility: Toad, SQL Plus, SQL Loader
Scheduling Tools: ESP Job Scheduler, Autosys, Windows scheduler
PROFESSIONAL EXPERIENCE
Sr. AWS Data Engineer
Confidential, Oak Brook, IL
Responsibilities:
- Developed and added features to existing data analytics applications built with Spark and Hadoop on a Scala, Java, and Python development platform on top of AWS services.
- Programmed in Python and Scala with the Hadoop framework, using Cloudera Hadoop ecosystem projects (HDFS, Spark, Sqoop, Hive, HBase, Oozie, Impala, ZooKeeper, etc.).
- Developed Spark applications in Scala and Python for data transformation, cleansing, and validation using the Spark API.
- Worked on all the Spark APIs, like RDD, Dataframe, Data source and Dataset, to transform the data.
- Worked on both batch and streaming data sources. Used Spark Streaming and Kafka for streaming data processing.
- Worked on Cloudera distribution and deployed on AWS EC2 Instances.
- Developed a Spark Streaming script that consumes topics from the distributed messaging source Kafka and periodically pushes batches of data to Spark for real-time processing.
- Built data pipelines for reporting, alerting, and data mining. Experienced with table design and data management using HDFS, Hive, Impala, Sqoop, MySQL, and Kafka.
- Worked with Apache NiFi to automate data movement between RDBMS and HDFS.
- Created shell scripts to handle various jobs such as MapReduce, Hive, Pig, and Spark, based on requirements.
- Used Hive bucketing and partitioning when creating tables.
- Used Spark SQL to process large amounts of structured data.
- Worked with source formats including CSV, JSON, Avro, and Parquet.
- Worked on AWS to aggregate clean files in Amazon S3, and on Amazon EC2 clusters to deploy files into buckets.
- Designed and architected solutions to load multipart files which can't rely on a scheduled run and must be event driven, leveraging AWS SNS,
- Involved in Data Modeling using Star Schema, Snowflake Schema.
- Used AWS EMR to create Hadoop and Spark clusters, which are used to submit and execute Scala and Python applications in production.
- Responsible for developing a data pipeline on AWS to extract data from weblogs and store it in HDFS.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Migrated the data from AWS S3 to HDFS using Kafka.
- Integrated Kubernetes with networking, storage, and security to provide a comprehensive infrastructure, and orchestrated Kubernetes containers across multiple hosts.
- Implemented Jenkins and built pipelines to drive all microservice builds out to the Docker registry and deploy them to Kubernetes.
- Loaded and transformed large sets of structured and semi-structured data using the ingestion tool Talend.
- Worked with NoSQL databases such as HBase and Cassandra to retrieve and load data for real-time processing using a REST API.
- Worked on creating data models for Cassandra from the existing Oracle data model.
- Responsible for transforming and loading the large sets of structured, semi structured and unstructured data.
Environment: Hadoop 2.7.7, HDFS 2.7.7, Apache Hive 2.3, Apache Kafka 0.8.2.X, Apache Spark 2.3, Spark-SQL, Spark-Streaming, Zookeeper, Pig, Oozie, Java 8, Python3, S3, EMR, EC2, Redshift, Cassandra, NiFi, Talend, HBase, Cloudera (CDH
Sr. Azure Data Engineer
Confidential, Rockville, MD
Responsibilities:
- Developed a data set process for data mining and data modeling, and recommended ways to improve data quality, efficiency, and reliability.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Responsible for writing Hive Queries to analyze the data in Hive warehouse using Hive Query Language (HQL). Involved in developing Hive DDLs to create, drop and alter tables.
- Extracted the data and updated it into HDFS using Sqoop Import from various sources like Oracle, Teradata, SQL server etc.
- Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
- Created Hive staging tables and external tables and also joined the tables as required.
- Implemented Dynamic Partitioning, Static Partitioning and also Bucketing.
- Installed and configured Hadoop Map Reduce, Hive, HDFS, Pig, Sqoop, Flume and Oozie on Hadoop cluster.
- Worked with Microsoft Azure services such as HDInsight clusters, Blob storage, ADLS, Data Factory, and Logic Apps, and completed a POC on Azure Databricks.
- Analyzed SQL scripts and designed the solution for implementation in PySpark.
- Developed a capability to implement audit logging at required stages while applying business logic.
- Implemented Spark DataFrames on large incoming datasets in various formats such as JSON, CSV, and Parquet.
- Actively resolved technical challenges, such as handling nested JSON with multiple data sections in the same file and converting it into Spark-friendly DataFrames.
- Re-formatted the end results to SOR's requested formats.
- Developed DataStage jobs to load collections data from multiple sources (Aspect, CACS, Strata, FICO, Fidelity, and FDR) into the respective dimension and fact tables with the required business transformations.
- Implemented Sqoop jobs for data ingestion from Oracle to Hive.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files.
- Proficient with columnar file formats such as RC, ORC, and Parquet.
- Developed custom Unix/Bash shell scripts for pre- and post-validation of the master and slave nodes, before and after configuring the NameNode and DataNodes respectively.
- Developed job workflows in Oozie for automating the tasks of loading the data into HDFS.
- Implemented compact and efficient file storage of big data by using various file formats like Avro, Parquet, JSON and using compression methods like GZip, Snappy on top of the files.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Worked on Spark using Python, Scala, and Spark SQL for faster testing and processing of data.
- Applied data modeling concepts such as star schema and snowflake schema in the project. Used Stash, Bitbucket, and GitHub extensively for source control.
- Migrated MapReduce jobs to Spark jobs for better performance.
Environment: Hadoop 2.7, HDFS, Microsoft Azure services (HDInsight, BLOB, ADLS, Logic Apps, etc.), Hive 2.2, Sqoop 1.4.6, Snowflake, Apache Spark 2.3, Airflow, Spark-SQL, ETL, Maven, Oozie, Java 8, Python3, Unix shell scripting.
Data Engineer
Confidential, Seattle, WA
Responsibilities:
- Developed PySpark applications in Python and implemented an Apache PySpark data processing project to handle data from various RDBMS and streaming sources.
- Handled importing of data from various data sources, performed data control checks using PySpark, and loaded data into HDFS.
- Converted Hive/SQL queries into PySpark transformations using Spark RDDs and Python.
- Used PySpark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables; handled structured data using Spark SQL.
- Developed PySpark programs in Python and performed transformations and actions on RDDs.
- Used PySpark and Spark SQL to read Parquet data and create tables in Hive using the Python API.
- Implemented PySpark in Python, using DataFrames and the PySpark SQL API for faster processing of data.
- Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python
- Involved in building the ETL source to Target mapping to load data into Data warehouse.
- Developed a batch data ingestion pipeline using Sqoop and Hive to ingest, transform and analyze Supply Chain data.
- Used object-oriented programming concepts to build Hive UDFs in Python that could be reused across the pipeline.
- Worked extensively on Hive queries to load data from various sources such as Teradata, DB2, Oracle, and mainframe systems.
- Developed Python scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Handled large datasets using partitions, PySpark in-memory capabilities, broadcasts, efficient joins, and transformations during the ingestion process itself.
- Processed schema-oriented and non-schema-oriented data using Python and Spark.
- Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Developed Spark applications in Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Set up and optimized the ELK (Elasticsearch, Logstash, Kibana) stack and integrated Apache Kafka for data ingestion.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in HDFS.
- Worked on streaming pipeline that uses PySpark to read data from Kafka, transform it and write it to HDFS.
- Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry leading Data Modeling tools.
- Worked on Snowflake database queries and wrote stored procedures for normalization.
- Worked with Snowflake’s stored procedures, used procedures with corresponding DDL statements, used JavaScript API to easily wrap and execute numerous SQL queries.
Environment: Cloudera (CDH3), Snowflake, HDFS, Pig 0.15.0, Hive 2.2.0, Kafka, Sqoop, Shell Scripting, Spark 1.6, Linux (CentOS), MapReduce, Python 2, Eclipse 4.6.
Big Data/Hadoop Developer
Confidential
Responsibilities:
- With the open-source Apache distribution, Hadoop admins must manually set up all configuration files (core-site, hdfs-site, yarn-site, and mapred-site). With popular Hadoop distributions such as Hortonworks, Cloudera, or MapR, the configuration files are set up on startup and the Hadoop admin need not configure them manually.
- Used Sqoop to import data from Relational Databases like MySQL, Oracle.
- Involved in importing structured and unstructured data into HDFS.
- Responsible for fetching real-time data using Kafka and processing using Spark and Scala.
- Worked on Kafka to import real-time weblogs and ingested the data to Spark Streaming.
- Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
- Worked on Building and implementing real-time streaming ETL pipeline using Kafka Streams API.
- Worked on Hive to implement Web Interfacing and stored the data in Hive tables.
- Migrated Map Reduce programs into Spark transformations using Spark and Scala.
- Experienced with Spark Context, Spark-SQL, Spark YARN.
- Implemented Spark Scripts using Scala, Spark SQL to access hive tables into Spark for faster processing of data.
- Implemented data quality checks using Spark Streaming and flagged records as passable or bad.
- Implemented Hive Partitioning and Bucketing on the collected data in HDFS.
- Implemented Sqoop jobs for large data exchanges between RDBMS and Hive clusters.
- Used ZooKeeper extensively as a backup server and job scheduler for Spark jobs.
- Developed Spark scripts using Scala shell commands as per the business requirement.
- Experienced in loading the real-time data to a NoSQL database like Cassandra.
- Experience in retrieving the data present in Cassandra cluster by running queries in CQL (Cassandra Query Language).
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Developed multiple Kafka producers and consumers per the software requirement specifications.
- Created Spark jobs to run lightning-fast analytics on the Spark cluster.
- Evaluated Spark's performance vs Impala on transactional data. Used Spark transformations and aggregations to perform min, max and average on transactional data.
- Gathered business requirements, defined and designed the data sourcing, and worked with the data warehouse architect on the development of logical data models.
- Created sophisticated visualizations, calculated columns, and custom expressions, and developed map charts, cross tables, bar charts, tree maps, and complex reports involving property controls and custom expressions.
- Developed Spark applications in PySpark on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
- Researched reinforcement learning and control (TensorFlow, Torch) and machine learning models (Scikit-learn).
- Performed K-means clustering, Regression and Decision Trees in R. Worked on data cleaning and reshaping, generated segmented subsets using NumPy and Pandas in Python.
- Implemented various statistical techniques to manipulate the data like missing data imputation, principal component analysis and sampling.
- Responsible for design and development of Python programs/scripts to prepare, transform and harmonize data sets in preparation for modeling. Worked on connecting Cassandra database to the Amazon EMR File System for storing the database in S3.
- Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Deployed the project on Amazon EMR with S3 connectivity for setting a backup storage.
- Well versed in using Elastic Load Balancing with Auto Scaling for EC2 servers.
- Coordinated with the SCRUM team in delivering agreed user stories on time for every sprint.
Environment: Hadoop, MapReduce, Hive, Spark, Oracle, GitHub, Tableau, UNIX, Cloudera, Kafka, Sqoop, Scala, NiFi, HBase, Amazon EC2, S3.
ETL Developer
Confidential
Responsibilities:
- Migrated historical as-built data from the Link Tracker Oracle database to Teradata (TD) using Ab Initio.
- Implemented a historical purge process for Clickstream, Order Broker, and Link Tracker to TD using Ab Initio.
- Implemented the centralized graphs concept.
- Used Ab Initio components such as Reformat, Rollup, Lookup, Join, and Redefine Format extensively, and developed many subgraphs.
- Estimated the Hadoop cluster requirements
- Responsible for choosing the Hadoop components (Hive, Pig, MapReduce, Sqoop, Flume, etc.).
- Responsible for building scalable distributed data solutions using Hadoop.
- Built the Hadoop cluster and ingested data using Sqoop.
- Imported streaming logs to HDFS through Flume
- Scheduled the published dashboards from Tableau Server on weekly basis.
- Sent dashboards to users by email through subscriptions, with the help of the admin team.
- Performance tuning of Reports by creating Linked Universes, Joins, Contexts, Aliases for resolving loops and checked the integrity of the universes using Business Objects Designer module during development.
- Integrated Tableau with AngularJS to enable self-service functionality on dashboards.
- Gave training sessions and demos to users on Tableau Desktop for development.
- Used Flume to collect, aggregate, and store the web log data from different sources like web servers, mobile and network devices and pushed to HDFS
- Developed use cases and technical prototypes for implementing Hive and Pig.
- Worked in analyzing data using Hive, Pig and custom MapReduce programs in Java.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Installed and configured Hive, Sqoop, Flume, Oozie on the Hadoop cluster.
- Involved in scheduling Oozie workflow engine to run multiple Hive and Pig jobs.
- Tuned the Hadoop clusters and monitored memory management and MapReduce jobs.
- Responsible for Cluster maintenance, Adding and removing cluster nodes, Cluster Monitoring and Troubleshooting.
- Developed a custom Framework capable of solving small files problem in Hadoop.
- Deployed and administered 70 node Hadoop clusters. Administered two smaller clusters.
- Loaded transformed data files into TD staging tables through TD load utilities (FastLoad and MultiLoad scripts), and created TD macros for loading data from staging to target tables.
- Served as E-R consultant for the GoldenGate ER (Extract-Replicate) tool, which is used to extract real-time data to the warehouse without hitting the data.
Environment: Ab Initio, Oracle Database, Clickstream, Reformat, Rollup, Lookup, UNIX, and Extract-Replicate.
