AWS Data Engineer Resume
Dallas, TX
SUMMARY
- 7+ years of IT experience across all phases of the Software Development Life Cycle (SDLC).
- Experienced Big Data Engineer with a background spanning big data analytics, the Hadoop ecosystem, data warehousing, data visualization, and cloud data engineering.
- Hands-on experience across the Hadoop ecosystem, including HDFS, MapReduce, YARN, Apache Cassandra, Sqoop, HBase, Hive, Oozie, Impala, Pig, ZooKeeper, Flume, and Spark.
- Experience with data formats such as JSON, Avro, Parquet, RCFile, and ORC, and with compression codecs such as Snappy and bzip2.
- Expertise in deploying cloud-based services with Amazon Web Services (Database, Migration, Compute, IAM, Storage, Analytics, Network & Content Delivery, Lambda, and Application Integration).
- Migrated an existing on-premises application to AWS, using EC2 and S3 for processing and storing smaller data sets and maintaining the Hadoop cluster on AWS EMR. Hands-on experience with Amazon RDS, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Redshift, DynamoDB, and other services in the AWS family.
- Designed and developed logical and physical data models using concepts such as Star Schema, Snowflake Schema, and Slowly Changing Dimensions.
- Designed and developed Spark applications using PySpark and Spark SQL to extract, transform, and aggregate data from multiple file formats, uncovering insights into customer usage patterns (a minimal sketch follows this summary).
- Hands-on experience with Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and moving on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Extensive knowledge of and experience with real-time data streaming techniques such as Kafka and Spark Streaming.
- Experience importing and exporting data between HDFS and relational databases with Sqoop, and migrating data according to client requirements.
- Knowledge of database architecture for OLAP and OLTP applications, database design, data migration, and data warehousing concepts, with an emphasis on ETL.
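The PySpark and Spark SQL bullet above corresponds to jobs of roughly the following shape. This is a minimal, hedged sketch: the S3 paths, column names, and aggregation logic are illustrative placeholders, not details of any engagement described here.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Read the same event data from two of the formats mentioned above
# (paths and column names are hypothetical).
events_json = spark.read.json("s3a://example-bucket/raw/events/")
events_parquet = spark.read.parquet("s3a://example-bucket/curated/events/")

# Union, clean, and aggregate to summarize customer usage patterns.
usage = (
    events_json.unionByName(events_parquet)
    .filter(F.col("event_type").isNotNull())
    .groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
    .agg(
        F.count("*").alias("events"),
        F.countDistinct("session_id").alias("sessions"),
    )
)

usage.write.mode("overwrite").parquet("s3a://example-bucket/marts/usage_daily/")
```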
TECHNICAL SKILLS
Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Hive, HBase, Flume, Kafka, YARN, Apache Spark.
Databases: Oracle, MySQL, SQL Server, MongoDB, DynamoDB, Cassandra.
Programming Languages: Python, PySpark, Shell script, Perl script, SQL, Java.
Cloud: AWS (EC2, EMR, Lambda, IAM, S3, Athena, Glue, Kinesis, CloudWatch, RDS, Redshift); Azure (Data Factory, Data Lake, Databricks, Logic Apps)
Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, SQL Navigator, SQL Server Management Studio, Postman.
Version Control & Build: SVN, Git, GitHub, Maven
Operating Systems: Windows 10/7/XP/2000/NT/98/95, UNIX, Linux, macOS
Visualization/Reporting: Tableau, ggplot2, matplotlib
Database Modeling: Dimensional Modeling, ER Modeling
Machine Learning Techniques: Linear & Logistic Regression, Decision Trees, Clustering.
PROFESSIONAL EXPERIENCE
Confidential, Dallas, TX
AWS Data Engineer
Responsibilities:
- Implemented solutions using advanced AWS components (EMR, EC2, etc.) integrated with big data/Hadoop frameworks such as ZooKeeper, YARN, Spark, Scala, and NiFi.
- Used AWS Athena extensively to query structured data on S3, feeding systems such as Redshift and generating reports.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kinesis in near real time (see the streaming sketch after this section).
- Used AWS Redshift, S3, Redshift Spectrum, and Athena to query large amounts of data stored on S3, creating a virtual data lake without a separate ETL process.
- Performed end-to-end architecture and implementation assessments of various AWS services, including Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
- Created PySpark scripts to manipulate and aggregate data, load it into DataFrames, and ultimately land it in S3 during the migration process.
- Set up Apache Presto and Apache Drill on an AWS EMR (Elastic MapReduce) cluster to combine multiple databases such as MySQL and Hive, making it possible to compare results of joins and inserts across data sources from a single platform.
- Used AWS CodeCommit repositories to store programming logic and scripts and make them available to new clusters.
- Involved in a POC on data extraction, aggregation, and consolidation within AWS Glue using PySpark (a Glue job sketch also follows this section).
- Configured a PostgreSQL database on EC2 instances, verified the application built on it was up and running, and troubleshot issues to reach the desired application state.
- Involved in designing and deploying multiple applications utilizing almost the entire AWS stack (including EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM), focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation.
Environment: Amazon Web Services, Elastic MapReduce cluster, Amazon S3, EC2, Amazon Redshift, PySpark, YARN, Spark, Scala, Hive.
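The Spark Streaming bullet above refers to consuming Kinesis in near real time. A minimal sketch of such a consumer follows, assuming Spark 2.x DStreams with the spark-streaming-kinesis-asl package on the classpath; the application name, stream name, region, and record layout are all hypothetical.

```python
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="learner-model-stream")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# All names below are placeholders, not values from this engagement.
records = KinesisUtils.createStream(
    ssc,
    kinesisAppName="learner-model",
    streamName="learner-events",
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
)

# Transform each micro-batch on the fly before it is persisted downstream.
records.map(lambda line: line.split(",")) \
       .filter(lambda fields: len(fields) > 1) \
       .pprint()

ssc.start()
ssc.awaitTermination()
```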
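The Glue POC bullet is sketched below as a standard Glue PySpark job. The catalog database, table, column names, and output bucket are hypothetical; only the awsglue APIs themselves (GlueContext, Job, create_dynamic_frame.from_catalog) are standard.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract from the Glue Data Catalog (database/table names are hypothetical).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="poc_db", table_name="raw_orders"
)

# Drop to a Spark DataFrame for the aggregation, then write the
# consolidated results back to S3 as Parquet.
orders = dyf.toDF()
daily = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))
daily.write.mode("overwrite").parquet("s3://example-bucket/consolidated/daily_orders/")

job.commit()
```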
Confidential, Englewood, CO
Azure Data Engineer
Responsibilities:
- Analyzed, developed, and constructed modern data solutions enabling data visualization on Azure PaaS services, and determined the impact of new implementations on existing business processes by understanding the current state of the application in production.
- Worked on migrating data from an on-premises SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, in both directions.
- Undertook data analysis and collaborated with the downstream analytics team to shape the data to their requirements.
- Used Azure Key Vault as a central repository for secrets and referenced those secrets in Azure Data Factory and in Databricks notebooks.
- Used Azure ML to build, test, and deploy predictive analytics solutions based on data.
- Helped individual teams set up their repositories in Bitbucket and maintain their code, and helped them set up jobs that make use of the CI/CD environment.
- Applied technical knowledge to architect solutions that meet business and IT needs, created roadmaps, and ensured the long-term technical viability of new deployments, infusing key analytics and AI technologies where appropriate (e.g., Azure Machine Learning, Machine Learning Server, Bot Framework, Azure Cognitive Services, Azure Databricks).
- Designed and built a data discovery platform for a large system integrator using Azure HDInsight components, used Azure Data Factory and Data Catalog to ingest and maintain data sources, and enabled security on HDInsight with Azure Active Directory.
- Performed data quality analyses and applied business rules in all layers of the data extraction, transformation, and loading process.
- Integrated data storage solutions with Spark, especially Azure Data Lake Storage and Blob Storage.
- Created Databricks job workflows that extract data from SQL Server and upload files over SFTP using PySpark and Python, pulling credentials from Key Vault (a hedged sketch follows this section).
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks and Azure SQL DB.
Environment: Azure Data Factory (V2), Azure Databricks, PySpark, Azure SQL, Azure Data Lake, Azure Blob Storage, Azure ML, Scala.
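The Key Vault and SFTP bullets above can be combined into one Databricks notebook sketch. In a notebook, spark and dbutils are provided by the runtime; the secret scope, hostnames, table, and paths below are hypothetical, and paramiko is assumed to be installed on the cluster.

```python
import paramiko

# Secrets come from a Key Vault-backed Databricks secret scope
# (scope and key names are hypothetical).
jdbc_password = dbutils.secrets.get(scope="kv-scope", key="sql-password")
sftp_password = dbutils.secrets.get(scope="kv-scope", key="sftp-password")

# Extract from SQL Server over JDBC (hostname and table are hypothetical).
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sqlhost.example.com:1433;database=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", jdbc_password)
    .load()
)

# Write a single CSV to DBFS (fine for small extracts), then push it
# to the SFTP server.
local_path = "/dbfs/tmp/orders.csv"
df.toPandas().to_csv(local_path, index=False)

transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="etl_user", password=sftp_password)
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put(local_path, "/incoming/orders.csv")
sftp.close()
transport.close()
```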
Confidential, Plano, TX
Big Data Engineer
Responsibilities:
- Responsible for developing prototypes for selected solutions and implementing complex big data projects focused on collecting, parsing, managing, analyzing, and visualizing large data sets across multiple platforms.
- Performed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Imported and exported data between HDFS and RDBMS with Sqoop, migrating data according to client requirements.
- Implemented a Python script to call the Cassandra REST API, performed transformations, and loaded the data into Hive.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment, working with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores for data access and analysis.
- Used reporting tools such as Tableau, connected to Hive, to generate daily data reports.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives its data from Kafka.
- Imported data from various sources into the HBase cluster using Kafka Connect and created data models for HBase from the existing data model.
- Used Cassandra to manage structured data designed to scale to very large sizes across many commodity servers with no single point of failure.
- Developed Pig scripts to transform data into a structured format, automated through Oozie coordinators.
- Worked with the MDM systems team on technical aspects and report generation.
- Developed Spark code using Scala and Spark SQL for faster processing and testing, and performed complex HiveQL queries on Hive tables.
- Worked on developing ETL processes to load data from multiple data sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Involved in the complete big data flow of the application, from ingesting data from upstream sources into HDFS to processing and analyzing the data in HDFS.
- Created partitioned and bucketed Hive tables in Parquet format with Snappy compression, then loaded data into the Parquet tables from Avro Hive tables (see the sketch after this section).
Environment: Spark (RDDs, DataFrames, UDFs), Kafka, various file formats, Scala, Oracle SQL, Cassandra, Hive.
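The last bullet above, rewriting Avro-backed tables as partitioned, bucketed Parquet tables with Snappy compression, can be sketched as follows. Note this uses Spark's native bucketing via bucketBy, which differs from Hive's bucketing scheme; database, table, column names, and the bucket count are illustrative.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("avro-to-parquet")
         .enableHiveSupport().getOrCreate())

# Read the existing Avro-backed Hive table (names are illustrative).
events_avro = spark.table("analytics.events_avro")

# Rewrite as a partitioned, bucketed Parquet table with Snappy compression.
(events_avro.write
    .format("parquet")
    .option("compression", "snappy")
    .partitionBy("event_date")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("analytics.events_parquet"))
```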
Confidential
Data Engineer
Responsibilities:
- Involved in creating Hive tables and applying HiveQL to them, which automatically invokes and runs MapReduce jobs.
- Analyzed the Cassandra database and compared it with other open-source NoSQL databases to determine which best suited the current requirements.
- Created and improved shell scripts to perform data ingestion and validation using various parameters, and wrote custom shell scripts to invoke Spark jobs.
- Performed Linux operations on the HDFS server for data lookups, modified jobs when commits were disabled, and rescheduled jobs for data storage.
- Used HQL to create Hive tables and perform data operations such as joining, filtering, and sorting across tables to retrieve the required information.
- Moved log files generated from various sources into HDFS for further processing through Elasticsearch, Kafka, Flume, and Talend, and processed the files using Piggybank.
- Performed Sqoop transfers through HBase tables to feed data into several NoSQL databases, including Cassandra and MongoDB.
- Captured data logs from the web server and Elasticsearch into HDFS using Flume for analysis.
- Performed multiple MapReduce jobs in Hive for data cleaning and pre-processing.
- Built Hadoop solutions for big data problems using MR1 and MR2 in YARN.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Loaded data from relational databases into the MapR-FS filesystem and HBase using Sqoop, and set up MapR metrics with a NoSQL database to log metrics data.
- Used Spark SQL to load JSON data, create a schema RDD, and load it into Hive tables, and handled structured data using Spark SQL (a minimal sketch follows this section).
Environment: Hadoop, Spark, MapReduce, Hive, HDFS, YARN, MobaXterm, Linux, Cassandra, NoSQL databases, Python, Spark SQL, Tableau, Flume, Spark Streaming.
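The Spark SQL bullet above, loading JSON into Hive tables, looks roughly like the following using the DataFrame API that superseded SchemaRDDs. The HDFS path, view name, and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("json-to-hive")
         .enableHiveSupport().getOrCreate())

# Infer the schema directly from the JSON files (path is a placeholder).
events = spark.read.json("hdfs:///data/raw/events/")

# Register the data for ad-hoc Spark SQL over the structured columns ...
events.createOrReplaceTempView("events_raw")
daily = spark.sql("""
    SELECT to_date(event_ts) AS event_date, COUNT(*) AS events
    FROM events_raw
    GROUP BY to_date(event_ts)
""")

# ... and persist both the raw and aggregated data as Hive tables.
events.write.mode("overwrite").saveAsTable("staging.events_json")
daily.write.mode("overwrite").saveAsTable("staging.events_daily")
```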
