AWS Data Engineer Resume
Seattle, Washington
SUMMARY
- Overall 9 years of professional experience as a Big Data Engineer working with the Apache Hadoop ecosystem, including HDFS, MapReduce, Hive, Sqoop, Oozie, HBase, Spark with Scala, Kafka, and Big Data analytics.
- Experience in designing and implementing large-scale data pipelines for data curation using Spark/Databricks along with Python and Scala.
- Excellent understanding of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Highly experienced in developing Hive Query Language and Pig Latin scripts.
- Experienced in using distributed computing architectures such as AWS products (EC2, Redshift, EMR, Elasticsearch, Athena, and Lambda), Hadoop, Python, and Spark, with effective use of MapReduce, SQL, and Cassandra to solve big data problems.
- Experience in job/workflow scheduling and monitoring tools such as Oozie, AWS Data Pipeline, and Autosys.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Built data pipelines using Azure Data Factory and Azure Databricks.
- Loaded data into Azure Data Lake and Azure SQL Database.
- Used Azure SQL Data Warehouse to control and grant database access.
- Worked with Azure services such as HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB.
- Experience in developing CI/CD (continuous integration and continuous deployment) pipelines and automation using Jenkins, Git, Docker, and Kubernetes for ML model deployment.
- Expertise in data migration, data profiling, data cleansing, transformation, integration, data import, and data export using multiple ETL tools such as Informatica PowerCenter.
- Experience in designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Impala, Pig, Sqoop, Oozie, HBase, MongoDB, and Spark.
- Extensive hands-on experience tuning Spark jobs.
- Experienced in working with structured data using HiveQL and optimizing Hive queries.
- Experience with client-server application development using Oracle PL/SQL, SQL*Plus, SQL Developer, TOAD, and SQL*Loader.
- Working experience in migrating several databases to Snowflake.
- Strong experience with architecting highly performant databases using PostgreSQL, PostGIS, MySQL, and Cassandra.
- Extensive experience in loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Flume, and Sqoop).
- Hands-on experience in application development using Java, RDBMS, Linux shell scripting, Object-Oriented Programming (OOP), multithreading in Core Java, and JDBC.
- Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
- Hands-on experience in scheduling data ingestion processes into data lakes using Apache Airflow.
- Good knowledge of and hands-on experience with Python modules such as NumPy, Pandas, Matplotlib, Scikit-learn, and PySpark.
- Good knowledge of Spark architecture and components, with excellent knowledge of Spark Core, Spark SQL, and Spark Streaming for interactive analysis, batch processing, and stream processing.
- Demonstrated expertise in building PySpark and Scala applications (see the sketch after this list).
- Worked with various streaming ingestion services with batch and real-time processing using Spark Streaming, Kafka, Confluent, Storm, Flume, and Sqoop.
- Good experience working with analysis tools such as Tableau for regression analysis, pie charts, and bar graphs.
- Experienced in implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and Zookeeper-based log collection platform.
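A minimal PySpark sketch of the kind of batch curation pipeline described above; the file paths, column names, and output location are hypothetical placeholders rather than details from any specific engagement.

```python
# Minimal PySpark sketch: curate raw JSON events into a partitioned Parquet table.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-curation").getOrCreate()

# Read raw JSON events from a landing zone.
raw = spark.read.json("s3a://example-landing/events/")

# Basic cleansing and aggregation using DataFrame transformations.
curated = (
    raw.filter(F.col("event_type").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
       .groupBy("event_date", "event_type")
       .agg(F.count("*").alias("event_count"))
)

# Write the curated result back out, partitioned by date.
curated.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-curated/events_daily/"
)
```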
TECHNICAL SKILLS
Languages: Python, Java, R, Scala, SQL, PL/SQL, T-SQL, NoSQL.
Web Technologies: HTML, CSS, XML.
Big data eco system: Hadoop, Hive, Pig, Spark, Sqoop, Oozie, Kafka, Zookeeper, Cloudera, Hortonworks.
Databases: Oracle, SQL Server, Postgres, Neo4j, MongoDB, Cassandra.
Development Tools: Jupyter, Anaconda, Eclipse, SSIS, SSRS, PyCharm.
Visualization Tools: Tableau, Power BI
Cloud Technologies: Azure, AWS (S3, Redshift, Glue, EMR, Lambda, Athena)
Automation/Scheduling: Jenkins, Docker, Kubernetes, Airflow.
Version Control: Git, SVN.
PROFESSIONAL EXPERIENCE
Confidential, Seattle, Washington
AWS Data Engineer
Responsibilities:
- Performed data ingestion into the data lake (S3) and used AWS Glue to expose the data to Redshift.
- Configured EMR clusters for data ingestion and used dbt (data build tool) to transform the data in Redshift.
- Scheduled jobs in Airflow to automate the ingestion process into the data lake (see the sketch after this list).
- Extracted, transformed, and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark.
- Implemented and developed Hive bucketing and partitioning.
- Implemented Kafka and Spark Structured Streaming for real-time data ingestion.
- Worked on writing, testing, and debugging SQL code for transformations using dbt.
- Orchestrated multiple ETL jobs using AWS Step Functions and Lambda; also used AWS Glue for loading and preparing data for customer analytics.
- Involved in writing Java and Node.js APIs for AWS Lambda to manage some of the AWS services.
- Worked on AWS Lambda to run code without managing servers and to trigger execution from S3 and SNS.
- Developed data transition programs from DynamoDB to AWS Redshift (ETL process) using AWS Lambda by creating Python functions for specific events based on use cases.
- Implemented the AWS cloud computing platform using RDS, Python, DynamoDB, S3, and Redshift.
- Developed Jenkins scripts integrated with the Git repository for build, testing, code review, and deployment.
- Worked on a CI/CD solution using Git, Jenkins, Docker, and Kubernetes to set up and configure the big data architecture on the AWS cloud platform.
- Worked on developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Primarily worked on a project to develop an internal ETL product to handle complex, large-volume healthcare claims data; designed the ETL framework and developed a number of packages to extract, transform, and load data using SQL Server Integration Services (SSIS) into local MS SQL Server 2012 databases to facilitate reporting operations.
- Performed data source investigation, developed source-to-destination mappings, and applied data cleansing while loading the data into staging/ODS regions.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro, JSON, and XML files; proficient with columnar file formats such as RCFile, ORC, and Parquet.
- Developed Scala scripts and UDFs involving both DataFrames and RDDs using Spark SQL for aggregation, queries, and writing data back into the OLTP system directly or through Sqoop.
- Performed database activities such as indexing and performance tuning.
- Collected data from AWS S3 buckets using Spark Streaming in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Responsible for loading and transforming huge sets of structured, semi-structured, and unstructured data.
- Used AWS EMR to create Hadoop and Spark clusters, which were used for submitting and executing Python applications in production.
- Designed and developed end-to-end ETL processing from Oracle to AWS using Amazon S3, EMR, and Spark.
- Wrote SQL and PL/SQL scripts to extract data from the database to meet business requirements and for testing purposes.
- Facilitated training sessions to demo the dbt tool for various teams and sent weekly communications on data engineering topics.
- Designed and implemented Hadoop architectures and configurations.
- Deployed, upgraded, and operated large-scale, multi-node Hadoop clusters.
- Customized the configurations of big data applications based on requirements.
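A minimal Apache Airflow sketch of the kind of ingestion scheduling referenced above; the DAG id, task callables, and S3 locations are hypothetical placeholders.

```python
# Minimal Airflow sketch: a daily DAG that lands raw files in the data lake and
# then exposes them downstream. DAG id, callables, and paths are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_to_s3(**context):
    # Placeholder: pull from the source system and land files in the raw zone.
    print("Ingesting raw files into s3://example-datalake/raw/")


def expose_to_redshift(**context):
    # Placeholder: kick off the Glue/dbt step that makes the data queryable.
    print("Exposing the newly landed data to Redshift")


with DAG(
    dag_id="datalake_ingestion",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = PythonOperator(task_id="ingest_to_s3", python_callable=ingest_to_s3)
    expose = PythonOperator(task_id="expose_to_redshift", python_callable=expose_to_redshift)

    ingest >> expose
```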
Confidential, Boston
Azure Data Engineer
Responsibilities:
- Used the Agile methodology for data warehouse development with Kanbanize.
- Developed data pipelines using Spark, Hive, and HBase to ingest customer behavioral data and financial histories into the Hadoop cluster for analysis.
- Working experience on the Azure Databricks cloud, organizing data into notebooks and making it easy to visualize using dashboards.
- Performed ETL on data from different source systems into Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks (see the sketch after this list).
- Worked on managing the Spark Databricks clusters on Azure through proper troubleshooting, estimation, and monitoring of the clusters.
- Implemented data ingestion from various source systems using Sqoop and PySpark.
- Hands-on experience implementing Spark and Hive job performance tuning.
- Performed data aggregation and validation on Azure HDInsight using Spark scripts written in Python.
- Performed monitoring and management of the Hadoop cluster using Azure HDInsight.
- Involved in extraction, transformation and loading of data directly from different source systems (flat files/Excel/Oracle/SQL) using SAS/SQL, SAS/macros.
- Generated PL/SQL scripts for data manipulation, validation, and materialized views for remote instances.
- Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and created Hive queries for analysis.
- Created and modified several database objects such as Tables, Views, Indexes, Constraints, Stored procedures, Packages, Functions and Triggers using SQL and PL/SQL.
- Created large datasets by combining individual datasets using various inner and outer joins in SAS/SQL and dataset sorting and merging techniques using SAS/Base.
- Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.
- Wrote Python scripts to parse XML documents and load the data into the database.
- Used Python to extract weekly information from XML files.
- Integrated NiFi with Snowflake to optimize the running of client sessions.
- Used Hive, Impala and Sqoop utilities and Oozie workflows for data extraction and data loading.
- Performed File system management and monitoring on Hadoop log files.
- Used Spark API over Hadoop YARN to perform analytics on data in Hive.
- Created stored procedures to import data into the Elasticsearch engine.
- Used Spark SQL to process huge amounts of structured data to aid in better analysis for our business teams.
- Implemented optimized joins across different data sets to get top claims by state using MapReduce.
- Created HBase tables to store various formats of data coming from different sources.
- Responsible for importing log files from various sources into HDFS using Flume.
- Worked on SAS Visual Analytics & SAS Web Report Studio for data presentation and reporting.
- Extensively used SAS/Macros to parameterize the reports so that the user could choose the summary and sub-setting variables to be used from the web application.
- Responsible for translating business and data requirements into logical data models in support of enterprise data models, ODS, OLAP, OLTP, and operational data structures.
- Created SSIS packages to migrate data from heterogeneous sources such as MS Excel, flat files, and CSV files.
- Provided thought leadership for the architecture and design of big data analytics solutions for customers, actively driving Proof of Concept (POC) and Proof of Technology (POT) evaluations to implement big data solutions.
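A minimal PySpark sketch of the kind of Databricks transformation described above; the storage account, container, table, and column names are hypothetical placeholders.

```python
# Minimal Databricks-style PySpark sketch: read raw data from Azure Data Lake
# Storage, aggregate it, and save the result as a table. Storage account,
# container, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook the `spark` session is provided by the runtime;
# getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/transactions/"
transactions = spark.read.parquet(raw_path)

daily_summary = (
    transactions
    .withColumn("txn_date", F.to_date("txn_ts"))
    .groupBy("txn_date", "account_id")
    .agg(
        F.sum("amount").alias("daily_amount"),
        F.count("*").alias("txn_count"),
    )
)

# Persist the aggregate for downstream dashboards and analysis.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
daily_summary.write.mode("overwrite").saveAsTable("analytics.daily_summary")
```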
Confidential
Hadoop Developer
Responsibilities:
- Performed Hive partitioning and bucketing, executed different types of joins on Hive tables, and implemented Hive SerDes such as JSON and Avro.
- Worked with Hadoop ecosystem components such as HBase, Sqoop, Zookeeper, Oozie, Hive, and Pig on the Cloudera Hadoop distribution.
- Experienced in loading data from the UNIX file system to HDFS.
- Used Google Cloud Platform (GCP) services to process and manage data from streaming sources.
- Extracted streaming data using Kafka.
- Worked on transferring objects from the Teradata platform to the Snowflake platform.
- Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs (see the sketch after this list).
- Migrated ETL jobs to Pig scripts to perform transformations, joins, and some pre-aggregations before storing the data in HDFS.
- Designed ETL packages from different data sources such as SQL Server, Oracle, Excel files, XML files, and Parquet into destination tables, performing transformations using SSIS.
- Worked with different file formats such as SequenceFiles, XML files, and MapFiles using MapReduce programs.
- Performed loading and transformation of large sets of structured, semi-structured, and unstructured data.
- Involved in creating Oozie workflow and coordinator jobs to kick off jobs on time based on data availability.
- Used Flume to collect, aggregate, and store web log data from different sources such as web servers and network devices, and pushed it to HDFS.
- Wrote scripts to deploy monitors and checks and to automate critical sysadmin functions.
- Managed and scheduled jobs on a Hadoop cluster.
- Performed tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
- Involved in defining job flows and managing and reviewing log files.
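A minimal sketch of creating and querying a partitioned Hive table through Spark SQL with Hive support, illustrating the Hive work described above; the database, table, and column names are hypothetical placeholders.

```python
# Minimal sketch: define a partitioned Hive table and query it via Spark SQL
# with Hive support enabled. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-analytics")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS logs")

# Partitioned table definition stored in the Hive metastore.
spark.sql("""
    CREATE TABLE IF NOT EXISTS logs.web_events (
        user_id STRING,
        url     STRING,
        status  INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
""")

# Analytical query over the partitioned data.
top_errors = spark.sql("""
    SELECT event_date, status, COUNT(*) AS hits
    FROM logs.web_events
    WHERE status >= 500
    GROUP BY event_date, status
    ORDER BY hits DESC
""")
top_errors.show()
```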
Confidential
ETL Developer
Responsibilities:
- Created, manipulated, and supported SQL Server databases.
- Involved in data modeling and the physical and logical design of the database.
- Helped integrate the front end with the SQL Server backend.
- Created stored procedures, triggers, indexes, user-defined functions, constraints, etc. on various database objects to obtain the required results.
- Imported and exported data from one server to other servers using tools such as Data Transformation Services (DTS).
- Wrote T-SQL statements for data retrieval and was involved in performance tuning of T-SQL queries (an illustrative sketch follows this section).
- Transferred data from various data sources/business systems including MS Excel, MS Access, flat files, etc. to SQL Server using SSIS/DTS with features such as data conversion; also created derived columns from existing columns for the given requirements.
- Supported the team in resolving SQL Reporting Services and T-SQL related issues; proficient in creating and formatting different types of reports such as cross-tab, conditional, drill-down, Top N, summary, form, OLAP, and sub-reports.
- Provided application support over the phone; developed and tested Windows command files and SQL Server queries for production database monitoring in 24/7 support.
- Created logging for ETL loads at the package and task level to record the number of records processed by each package and each task, using SSIS.
- Developed, monitored and deployed SSIS packages.
- Generated multiple enterprise reports (SSRS/Crystal/Impromptu) from the SQL Server database (OLTP) and SQL Server Analysis Services database (OLAP), including reporting features such as group by, drill-downs, drill-through, sub-reports, navigation reports (hyperlinks), etc.
- Created different parameterized reports (SSRS 2005/2008) consisting of report criteria in various reports to minimize report execution time and limit the number of records returned.
- Worked on all report types such as tables, matrices, charts, sub-reports, etc.
- Created linked reports, ad-hoc reports, etc. based on requirements; linked reports were created on the Report Server to reduce report duplication.
Environment: Microsoft Office, Windows 2007, T-SQL, DTS, SQL Server 2008, HTML, SSIS, SSRS, XML.
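An illustrative sketch of the kind of parameterized T-SQL retrieval behind the Top N reporting work above. The original work used DTS/SSIS and SQL Server tooling; the Python/pyodbc wrapper, along with the server, database, table, and column names, is purely a hypothetical illustration.

```python
# Illustrative sketch only: run a parameterized T-SQL summary query of the kind
# behind the Top N reports above. Connection details, table, and column names
# are hypothetical placeholders; the original work used DTS/SSIS tooling.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example-sql-server;DATABASE=ReportingDB;Trusted_Connection=yes;"
)

query = """
    SELECT TOP (10) region, SUM(sales_amount) AS total_sales
    FROM dbo.SalesFact
    WHERE sales_date >= ?
    GROUP BY region
    ORDER BY total_sales DESC;
"""

cursor = conn.cursor()
cursor.execute(query, "2008-01-01")  # parameter limits the reporting window
for region, total_sales in cursor.fetchall():
    print(region, total_sales)

cursor.close()
conn.close()
```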
Confidential
ETL Developer
Responsibilities:
- Involved as a developer on the commercial business group data warehouse.
- Performed source data profiling and analysis; reviewed data content and metadata to facilitate data mapping and validate assumptions made in the business requirements.
- Created logical and physical database designs and ER diagrams for relational and dimensional databases using Erwin.
- Extracted data from relational databases (Oracle) and flat files.
- Developed complex transformations and mapplets using Informatica PowerCenter 8.6.1 to extract, transform, and load data into the Operational Data Store (ODS).
- Led, created, and launched new automated testing tools and accelerators for SOA services and data-driven automation built within our practice.
- Designed complex mappings using Source Qualifier, Joiner, Lookup (connected and unconnected), Expression, Filter, Router, Aggregator, Sorter, Update Strategy, Stored Procedure, and Normalizer transformations.
- Ensured data consistency by cross-checking sampled data upon migration between database environments (an illustrative sketch follows this section).
- Developed a process to extract source data and load it into flat files after cleansing, transforming, and integrating.
- Designed SSIS packages to extract, transform, and load (ETL) existing data into SQL Server from different environments for the SSAS cubes.
- Worked with the architecture and modeling teams and used middleware SOA services.
- Performed data alignment and data cleansing, used the debugger to test mappings, and fixed bugs.
- Created sessions, including sequential and concurrent sessions, for proper execution of mappings in Workflow Manager.
- Provided SSRS and SSIS support for internal IT projects requiring report development.
- Involved in System Integration Testing (SIT) and User Acceptance Testing (UAT).
Environment: Informatica 8.6.1, SQL Server 2005, RDBMS, Fast load, FTP, SFTP.
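An illustrative sketch of the kind of cross-check used to validate consistency after migration between database environments. The original checks used Informatica and SQL tooling; the Python helper, table names, and connection objects below are hypothetical (any DB-API 2.0 connections could be passed in).

```python
# Illustrative sketch only: compare row counts between a source and a target
# database after a migration. Works with any DB-API 2.0 connections; the table
# names and connection objects are hypothetical placeholders.
from typing import Dict, Iterable, Tuple


def compare_row_counts(source_conn, target_conn,
                       tables: Iterable[str]) -> Dict[str, Tuple[int, int, bool]]:
    """Return {table: (source_count, target_count, counts_match)} per table."""
    results = {}
    for table in tables:
        src_cur = source_conn.cursor()
        tgt_cur = target_conn.cursor()
        src_cur.execute(f"SELECT COUNT(*) FROM {table}")
        tgt_cur.execute(f"SELECT COUNT(*) FROM {table}")
        src_count = src_cur.fetchone()[0]
        tgt_count = tgt_cur.fetchone()[0]
        results[table] = (src_count, tgt_count, src_count == tgt_count)
    return results


# Example usage (connection objects omitted):
# report = compare_row_counts(oracle_conn, ods_conn, ["CUSTOMER", "ORDERS"])
# for table, (src, tgt, ok) in report.items():
#     print(f"{table}: source={src} target={tgt} {'OK' if ok else 'MISMATCH'}")
```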