AWS Data Engineer Resume
New York, NY
SUMMARY
- 8+ years of IT experience in analysis, design, development, and Big Data in Scala, Spark, Hadoop, Pig, and HDFS environments, with additional experience in Python and Java.
- Excellent technical and analytical skills with a clear understanding of design goals for OLTP development and dimensional modeling for OLAP.
- Strong experience building fully automated Continuous Integration and Continuous Delivery (CI/CD) pipelines and DevOps processes for agile store-based applications in the Retail and Transportation domains.
- Firm understanding of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.
- Experience in data analytics, designing reports and visualization solutions using Tableau Desktop and publishing them to Tableau Server.
- Good knowledge of Amazon Web Services (AWS) concepts such as EC2, S3, EMR, ElastiCache, DynamoDB, Redshift, and Aurora.
- Proven expertise in deploying major software solutions for various high-end clients, meeting business requirements such as big data processing, ingestion, analytics, and cloud migration from on-premises to AWS Cloud using AWS EMR, S3, and DynamoDB.
- Created pipelines in Azure Data Factory (ADF) using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, as well as controlling and granting database access and migrating on-premises databases to Azure Data Lake stores using Azure Data Factory.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake (see the illustrative sketch after this summary).
- Demonstrated understanding of the fact/dimension data warehouse design model, including star and snowflake design methods.
- Experienced in building Snowpipe, with in-depth knowledge of data sharing in Snowflake and of Snowflake database, schema, and table structures.
- Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema and Slowly Changing Dimensions.
- Hands-on experience across the Hadoop ecosystem, including extensive experience with Big Data technologies such as HDFS, MapReduce, YARN, Apache Cassandra, HBase, Hive, Oozie, Impala, Pig, ZooKeeper, Flume, Kafka, Sqoop, and Spark.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for consuming data.
- Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked extensively with PySpark.
- Familiar with data processing performance optimization techniques such as dynamic partitioning, bucketing, file compression, and cache management in Hive, Impala, and Spark.
- Excellent understanding of handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, HBase, and SQL Server.
- Experience in dimensional data modeling concepts such as star join schema modeling, snowflake modeling, fact and dimension tables, and physical and logical data modeling.
- Created and configured a SQL Server Analysis Services database that introduced the company to multidimensional tracking of subscribers, applying statistical techniques in SQL and Excel.
- Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.
- Adequate knowledge and working experience in Agile and Waterfall Methodologies.
- Defined user stories and drove the agile board in JIRA during project execution; participated in sprint demos and retrospectives.
- Good interpersonal and communication skills, strong problem-solving skills, the ability to explore and adopt new technologies with ease, and a strong team orientation.
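
A minimal sketch of the Python-plus-SnowSQL loading pattern referenced in this summary, assuming the snowflake-connector-python package; the account, credentials, warehouse, stage, and table names are hypothetical placeholders:

```python
# Illustrative only: load staged JSON files into a Snowflake table and verify the row count.
# All connection parameters, the stage, and the table are assumed/hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",   # hypothetical account identifier
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()
try:
    # Copy nested JSON files from an external S3 stage into a single-VARIANT-column table
    cur.execute(
        "COPY INTO raw_events FROM @s3_events_stage FILE_FORMAT = (TYPE = 'JSON')"
    )
    # Quick sanity check on the load
    cur.execute("SELECT COUNT(*) FROM raw_events")
    print(cur.fetchone()[0])
finally:
    cur.close()
    conn.close()
```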
PROFESSIONAL EXPERIENCE
AWS Data Engineer
Confidential, New York, NY
Responsibilities:
- Used AWS Athena extensively to query structured data in S3, ingest it into other systems such as Redshift, and produce reports.
- Used the Spark Streaming APIs to perform on-the-fly transformations and actions while building the common learner data model, which receives data from Kinesis in near real time.
- Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
- Built external table schemas in Hive, the primary query engine on EMR, for the data being processed.
- Provisioned AWS RDS (Relational Database Service) as a Hive metastore, consolidating the metadata from 20 EMR clusters into a single RDS instance and avoiding data loss even if an EMR cluster was terminated.
- Involved in developing a shell script that collects user-generated logs and stores them in AWS S3 (Simple Storage Service) buckets; this record of all user actions serves as a security indicator to detect cluster termination and safeguard data integrity.
- Implemented partitioning and bucketing in the Apache Hive database, which improves query retrieval performance.
- Designed and deployed ETL pipelines on S3 Parquet files in a data lake using AWS Glue (illustrative sketch below).
- Created a CloudFormation template in JSON format to leverage content delivery with cross-region replication through Amazon Virtual Private Cloud.
- Used an AWS CodeCommit repository to store programming logic and scripts and then replicate them to new clusters.
- Used multi-node Redshift to implement columnar data storage, advanced compression, and massively parallel processing.
- Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
- Worked on migrating a quality monitoring program from AWS EC2 to AWS Lambda, and created logical datasets to administer quality monitoring on Snowflake warehouses.
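
A minimal sketch of the kind of AWS Glue ETL job described above: it reads Parquet from an S3 data lake, applies a simple column mapping, and writes curated, partitioned output back to S3. Bucket paths, column names, and the partition key are hypothetical placeholders:

```python
# Illustrative Glue job skeleton; paths and mappings are assumptions, not the actual pipeline.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read raw Parquet files from the data lake landing zone
raw = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="parquet",
    connection_options={"paths": ["s3://example-datalake/raw/events/"]},
)

# Rename/cast columns into the curated schema
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("user_id", "long", "user_id", "long"),
        ("event_date", "string", "event_date", "string"),
    ],
)

# Write curated Parquet back to S3, partitioned by date
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": "s3://example-datalake/curated/events/",
        "partitionKeys": ["event_date"],
    },
)
job.commit()
```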
Environment: Amazon Web Services, Elastic MapReduce (EMR) cluster, EC2, CloudFormation, Amazon S3, Amazon Redshift, DynamoDB, CloudWatch, Hive, Scala, Python, HBase, Apache Spark, Spark SQL, Shell Scripting, Tableau, Cloudera.
Data Engineer
Confidential, Rochester MN
Responsibilities:
- Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks in Python to automate the ingestion, processing, and delivery of structured and unstructured data, both batch and real-time streaming.
- Worked on building data warehouse structures, creating fact, dimension, and aggregate tables through dimensional modeling with star and snowflake schemas.
- Applied transformations to data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
- Identified an issue and developed a procedure to correct it, improving the quality of critical tables by eliminating the possibility of entering duplicate data into the data warehouse.
- Created Python scripts that integrated with the Amazon API to control instance operations.
- Developed a Scala framework for processing spreadsheets and joining them with other sources.
- Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Scheduled jobs using Airflow and used Airflow hooks to connect to traditional databases such as DB2, Oracle, and Teradata (illustrative sketch below).
- Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
- Managed and administered AWS services including the CLI, EC2, S3, and Trusted Advisor.
- Used Python with SQLAlchemy to connect to databases and query the sources to fetch data.
- Hands-on experience developing UDFs, DataFrames, and SQL queries in Spark SQL.
- Created and modified data ingestion pipelines using Kafka and Sqoop to ingest database tables and streaming data into HDFS for analysis.
- Finalized naming standards for data elements and ETL jobs and created a data dictionary for metadata management.
- Planned and coordinated the analysis, design, and extraction of encounter data from multiple source systems into the data warehouse relational database (Oracle) while ensuring data integrity.
- Involved in designing and developing Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, Amazon SWF, Amazon SQS, and other services of the AWS infrastructure.
- Worked on developing ETL workflows in Python to process ingested data in HDFS and HBase using Flume.
- Hands-on experience with Continuous Integration and Deployment (CI/CD) using Jenkins and Docker.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake.
- Conducted systems design, feasibility, and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).
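
A minimal sketch of the Airflow scheduling pattern mentioned above, assuming Airflow 2 with the Oracle provider installed; the connection ID, table, and query are illustrative assumptions:

```python
# Illustrative daily DAG: pull yesterday's encounter rows from Oracle via an Airflow hook.
# The downstream staging/transform tasks are omitted; names are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.oracle.hooks.oracle import OracleHook


def extract_encounters():
    """Fetch yesterday's encounter records from a hypothetical Oracle source."""
    hook = OracleHook(oracle_conn_id="oracle_dw")  # assumed Airflow connection ID
    rows = hook.get_records(
        "SELECT * FROM encounters WHERE load_date = TRUNC(SYSDATE - 1)"
    )
    return len(rows)  # downstream tasks would stage these rows to S3/HDFS


default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="encounter_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(
        task_id="extract_encounters",
        python_callable=extract_encounters,
    )
```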
Environment: Python, HDFS, Spark, Kafka, Hive, Yarn, Cassandra, HBase, Jenkins, Docker, Tableau, Splunk, BO Reports, Netezza, UDB, MySQL, Snowflake, IBM DataStage.
Big Data Engineer/Hadoop Developer
Confidential, Pataskala, Ohio
Responsibilities:
- Responsible for the design, implementation, and architecture of very large-scale data intelligence solutions around big data platforms.
- Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop and Zookeeper.
- Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
- Used AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Developed ETL procedures to ensure conformity and compliance with minimal redundancy, translating business rules and functionality requirements into ETL procedures.
- Maintained AWS Data Pipeline as a web service to process and move data between Amazon S3, Amazon EMR, and Amazon RDS resources.
- Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.
- Troubleshot and resolved data processing issues and proactively engaged in data modeling discussions.
- Worked on the RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.
- Wrote Spark programs using Python, PySpark, and Pandas for performance tuning, optimization, and data quality validations.
- Worked on developing Kafka producers and consumers for streaming millions of events per second (illustrative sketch below).
- Implemented a distributed messaging queue with Apache Kafka to integrate with Cassandra.
- Hands-on experience fetching live stream data from UDB into HBase tables using PySpark streaming and Apache Kafka.
- Worked in Tableau to build customized interactive reports, worksheets, and dashboards.
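
A minimal sketch of the Kafka-to-Spark streaming consumption pattern mentioned above, assuming the spark-sql-kafka connector is available; broker addresses, the topic, the payload schema, and output paths are hypothetical placeholders:

```python
# Illustrative Structured Streaming consumer: Kafka topic -> parsed JSON -> Parquet on HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("event_stream_consumer").getOrCreate()

# Assumed JSON payload schema for the streamed events
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read raw events from a Kafka topic
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
    .option("subscribe", "user-events")                              # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into typed columns
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Append the parsed stream to HDFS as Parquet, with checkpointing for fault tolerance
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/streams/user_events/")
    .option("checkpointLocation", "hdfs:///checkpoints/user_events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```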
Environment: HDFS, Python, SQL, Web Services, MapReduce, Spark, Kafka, Hive, Yarn, Pig, Flume, Zookeeper, Sqoop, UDB, Tableau, AWS, GitHub, Shell Scripting.
Big Data Engineer
Confidential
Responsibilities:
- Responsible for building scalable distributed data solutions in a Hadoop cluster environment with the Hortonworks distribution.
- Converted raw data to serialized formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency over the network (illustrative sketch below).
- Worked on building end-to-end data pipelines on Hadoop data platforms.
- Worked on normalization and denormalization techniques for optimal performance in relational and dimensional database environments.
- Designed, developed, and tested Extract, Transform, Load (ETL) applications with different types of sources.
- Created files and tuned SQL queries in Hive using Hue; implemented MapReduce jobs in Hive by querying the available data.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Involved in converting HiveQL into Spark transformations using Spark RDDs and Scala programming.
- Created User-Defined Functions (UDFs) and User-Defined Aggregate Functions (UDAFs) in Pig and Hive.
- Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
- Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions.
- Supported the cluster and topics in Kafka Manager; worked on CloudFormation scripting, security, and resource automation.
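
A minimal sketch of the raw-to-Parquet conversion mentioned above; input/output paths and the partition column are hypothetical placeholders (writing Avro would additionally require the spark-avro package):

```python
# Illustrative conversion of raw delimited files to compressed, partitioned Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw_to_parquet").getOrCreate()

# Read raw CSV with a header row and inferred column types
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/raw/transactions/")
)

# Write Snappy-compressed Parquet, partitioned for downstream Hive/Spark consumers
(
    raw.write
    .mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("txn_date")  # assumed partition column
    .parquet("hdfs:///data/curated/transactions_parquet/")
)
```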
Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, Web Services, Red Hat Linux, Unix.
Hadoop Engineer/Developer
Confidential
Responsibilities:
- Designed and developed applications on the data lake to transform data according to business users' requirements for analytics.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and the loading of structured and unstructured data.
- Worked on different file types such as CSV, TXT, and fixed-width to load data from various sources into raw tables (illustrative sketch below).
- Conducted data model reviews with team members and captured technical metadata through modeling tools.
- Implemented ETL processes, writing and optimizing SQL queries to perform data extraction and merging from a SQL Server database.
- Experience in loading logs from multiple sources into HDFS using Flume.
- Worked with NoSQL databases like HBase, creating HBase tables to store large sets of semi-structured data coming from various data sources.
- Involved in designing and developing tables in HBase and storing aggregated data from Hive tables.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Performed data cleaning, pre-processing, and modeling using Spark and Python.
- Strong Experience in writing SQL queries.
- Responsible for triggering jobs using Control-M.
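
A minimal sketch of loading a fixed-width file into a raw Hive table with PySpark, as mentioned above; the column offsets, field names, and paths are illustrative assumptions:

```python
# Illustrative fixed-width parse: slice each line into columns and land them in a raw table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring, trim

spark = (
    SparkSession.builder
    .appName("fixed_width_load")
    .enableHiveSupport()
    .getOrCreate()
)

# Each input line arrives as a single string column named "value"
lines = spark.read.text("hdfs:///data/raw/members_fixed_width.txt")

# Slice the fixed-width record into columns (substring positions are 1-based)
members = lines.select(
    trim(substring(col("value"), 1, 10)).alias("member_id"),
    trim(substring(col("value"), 11, 30)).alias("member_name"),
    trim(substring(col("value"), 41, 8)).alias("enroll_date"),
)

# Land the parsed records in a raw Hive table for downstream processing
members.write.mode("overwrite").saveAsTable("raw_db.members_raw")
```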
Environment: Python, SQL, ETL, Hadoop, HDFS, Spark, Scala, Kafka, HBase, MySQL, Netezza, Web Services, Shell Script, Control-M.