Sr. AWS Data Engineer Resume


Alpharetta, GA

SUMMARY

  • 8+ years of IT experience across various industries working on Big Data, primarily with the Cloudera and Hortonworks distributions.
  • The Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
  • Hands-on experience in developing and deploying enterprise applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, GraphX, Spark SQL, and Kafka.
  • Experience in programming with Scala, Java, Python, and SQL.
  • Experience developing PySpark code to create RDDs, paired RDDs, and DataFrames (a brief sketch follows this list).
  • Experience analyzing data using HiveQL, Pig, HBase, and custom MapReduce programs in Java 8.
  • Experienced working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Confidential EMR) to fully implement and leverage new Hadoop features.
  • Experience implementing real-time and batch data pipelines using AWS services such as Lambda, S3, DynamoDB, Kinesis, Redshift, and EMR.
  • Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
  • Experience in AWS services such as S3, EC2, Glue, AWS Lambda, Athena, AWS Step Function, and Redshift.
  • Adept at configuring and installing Hadoop/Spark Ecosystem Components.
  • Experienced in optimizing PySpark jobs to run on Kubernetes clusters for faster data processing.
  • Experience loading data from various sources like Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
  • Able to use Sqoop to migrate data between RDBMS, NoSQL databases, and HDFS.
  • Experienced in building data warehouses on the Azure platform using Azure Databricks and Data Factory.
  • Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
  • Experience with Snowflake Multi-Cluster Warehouses
  • Experience working with different Google Cloud Platform technologies like BigQuery, Dataflow, Dataproc, Pub/Sub, and Airflow.
  • Experience in Extraction, Transformation, and Loading (ETL) data from various sources into Data Warehouses, as well as data processing like collecting, aggregating, and moving data from multiple sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
  • Hands-on experience with Hadoop architecture and various components such as the Hadoop Distributed File System (HDFS), Job Tracker, Task Tracker, NameNode, DataNode, and Hadoop MapReduce programming.
  • Used IDEs like Eclipse, IntelliJ, PyCharm, Notepad++, and Visual Studio for development.
  • Well practiced in machine learning algorithms and predictive modeling such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
  • Ample knowledge of data architecture, including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning, and advanced data processing.
  • Experience working with NoSQL databases like Cassandra and HBase and developing real-time read/write access to large datasets via HBase.
  • Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
  • Strong experience with ETL and/or orchestration tools (e.g., Talend, Oozie, Airflow)
  • Experience working with GitHub/Git 2.12 source and version control systems.
  • Experience in privacy and data security laws and regulations, including GDPR, COPPA, and VPPA
  • Strong in core Java concepts, including Object-Oriented Design (OOD) and Java components like Collections Framework, Exception Handling, and I/O system.
  • Knowledge of best practices prevalent in data governance, data quality, and data privacy. Familiarity with data management principles and practices.
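
As a brief illustration of the PySpark experience above, the sketch below shows how an RDD, a paired RDD, and a DataFrame are typically created; the sample records and column names are invented for illustration only:

    from pyspark.sql import SparkSession

    # Build (or reuse) a SparkSession; the application name is illustrative only.
    spark = SparkSession.builder.appName("rdd-dataframe-sketch").getOrCreate()
    sc = spark.sparkContext

    # Plain RDD from an in-memory collection of hypothetical click events.
    events = sc.parallelize([
        ("user1", "login"), ("user2", "search"), ("user1", "purchase"),
    ])

    # Paired RDD: key by user id and count events per user.
    event_counts = events.map(lambda e: (e[0], 1)).reduceByKey(lambda a, b: a + b)

    # DataFrame over the same data, queryable through Spark SQL.
    df = spark.createDataFrame(events, schema=["user_id", "action"])
    df.createOrReplaceTempView("events")
    spark.sql("SELECT user_id, COUNT(*) AS actions FROM events GROUP BY user_id").show()

    print(event_counts.collect())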

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop MapReduce, Impala, HDFS, Hive, Pig, HBase, Flume, Storm, Sqoop, Oozie, Kafka, Spark, and Zookeeper

Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Confidential EMR (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, DynamoDB, Redshift, ECS, QuickSight)

Programming Languages: Python, R, Scala, C++, SAS, Java, SQL, HiveQL, PL/SQL, UNIX shell Scripting, Pig Latin

Machine Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, XGBoost, Naïve Bayes, PCA, LDA, K-Means, KNN, Neural Network

Cloud Technologies: AWS, Azure, Google Cloud Platform Cloud Services (PaaS & IaaS), Active Directory, Application Insights, Azure Monitoring, Azure Search, Data Factory, Key Vault and SQL Azure, Azure DevOps, Azure Analysis Services, Azure Synapse Analytics (DW), Azure Data Lake, AWS Lambda

Databases: Snowflake, MySQL, Teradata, Oracle, MS SQL Server, PostgreSQL, DB2

NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB and Cosmos DB

Version Control: Git, SVN, Bitbucket

ETL/BI: Informatica, SSIS, SSRS, SSAS, Tableau, Power BI, QlikView, Arcadia, Erwin, Matillion, Rivery

Operating System: Mac OS, Windows 7/8/10, Unix, Linux, Ubuntu

Methodologies: RAD, JAD, UML, System Development Life Cycle (SDLC), Jira, Confluence, Agile, Waterfall Model

PROFESSIONAL EXPERIENCE

Confidential, Alpharetta, GA

Sr. AWS Data Engineer

Responsibilities:

  • Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Hadoop and Spark.
  • Devised a machine learning algorithm using Python for facial recognition
  • Developed Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
  • Used Scala scripts to run the Spark machine learning library (MLlib) APIs for decision tree, ALS, and logistic and linear regression algorithms.
  • Moved data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
  • Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure services.
  • Integration of data stored in S3 with Databricks to perform ETL processes using PySpark and Spark SQL.
  • Worked on Migrating an on-premises virtual machine to Azure Resource Manager Subscription with Azure Site Recovery
  • Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment; gained experience in DWH/BI project implementation using Azure Data Factory.
  • Creating Spark clusters and configuring high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data
  • Extracted data from Azure Data Lake into an HDInsight cluster, applied Spark transformations and actions, and loaded the results into HDFS.
  • Provided consulting and cloud architecture for premier customers and internal projects running on the MS Azure platform for high availability of services and low operational costs.
  • Developed a web service using Windows Communication Foundation (WCF) and .NET to receive and process XML files, deployed as a Cloud Service on Microsoft Azure.
  • Wrote AWS Lambda code in Python to convert, compare, and sort nested JSON files (a hedged handler sketch appears after this list).
  • Installed and configured Apache Airflow for workflow management and created workflows in Python (an illustrative DAG follows this section's environment listing).
  • Implemented an ETL process to move data from Cosmos DB to SQL Azure Database using SQLizer, SSIS, and SQL Azure Database.
  • Used Apache NiFi to copy data from the local file system to HDFS.
  • Application of various machine learning algorithms and statistical modeling like decision trees, regression models, clustering, SVM to identify Volume using Scikit-learn package in Python
  • Extensively worked on the naming standards which incorporated the enterprise data modeling
  • Used RStudio packages and Python libraries like scikit-learn to improve model accuracy from 65% to 86%.
  • Experienced with various Python libraries like Pandas and NumPy (one- and two-dimensional arrays).
  • Worked on POC to check various cloud offerings including Google Cloud Platform (GCP).
  • Developed a POC for project migration from on prem Hadoop MapR system to GCP.
  • Registered datasets to AWS Glue through the REST API.
  • Performed Data Migration to GCP
  • Implemented scripts that load Google BigQuery data and run queries to export data.
  • Used Keras library to build and train deep learning models and fetched good results
  • Continuously monitor and manage data pipeline (CI/CD) performance alongside applications from a single console with GCP.
  • Ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, and Azure DW) and processed the data in Azure Databricks.
  • Developed Merge jobs in Python to extract and load data into a MySQL database
  • Used a test-driven approach for developing the application and implemented unit tests using the Python unittest framework.
  • Designed and documented REST/HTTP, SOAP APIs, including JSON data formats and API versioning strategy
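
As a rough picture of the Lambda work described above, here is a minimal sketch of a Python handler that flattens and sorts a nested JSON document pulled from S3. It assumes an S3-triggered event, and the output prefix is a made-up example rather than the actual project layout:

    import json
    import boto3

    s3 = boto3.client("s3")

    def flatten(obj, prefix=""):
        """Recursively flatten nested dicts into dotted keys (illustrative helper)."""
        flat = {}
        for key, value in obj.items():
            name = f"{prefix}{key}"
            if isinstance(value, dict):
                flat.update(flatten(value, prefix=f"{name}."))
            else:
                flat[name] = value
        return flat

    def lambda_handler(event, context):
        # Assumes an S3 trigger; bucket and key come from the triggering event.
        record = event["Records"][0]["s3"]
        bucket, key = record["bucket"]["name"], record["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        flat = flatten(json.loads(body))

        # Write the flattened, key-sorted document under a hypothetical prefix.
        out_key = f"flattened/{key}"
        s3.put_object(Bucket=bucket, Key=out_key,
                      Body=json.dumps(flat, sort_keys=True).encode("utf-8"))
        return {"statusCode": 200, "output_key": out_key}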

Environment: Machine Learning, R Language, Hadoop, Big Data, Azure, NiFi, Python, PySpark, Java, J2EE, Spring, Dojo, JavaScript, GCP, PL/SQL, JDBC, MongoDB, Apache CXF, Web Services, Eclipse.
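
The Airflow workflows created for this role followed the standard DAG pattern; the sketch below is a generic, assumed example (the DAG id, schedule, and task bodies are placeholders) using the Airflow 2.x Python operator:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder for an extraction step (e.g., pulling files or query results).
        print("extracting source data")

    def load():
        # Placeholder for a load step (e.g., writing to a warehouse table).
        print("loading transformed data")

    with DAG(
        dag_id="example_etl",                 # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task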

Confidential, Florham Park, NJ

AWS Data Engineer

Responsibilities:

  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (a minimal streaming sketch appears after this list).
  • Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Created multi-node Hadoop and Spark clusters in AWS instances to generate terabytes of data and stored it in AWS HDFS.
  • Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Zookeeper and Sqoop.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing cluster, commissioning & decommissioning of Data Nodes, Name Node recovery, capacity planning, and slots configuration.
  • Developed Spark scripts to import large files from Confidential S3 buckets and imported the data from different sources like HDFS/HBase into Spark RDD.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, and Java.
  • Involved in migration of ETL processes from Oracle to Hive to test the easy data manipulation and worked on importing and exporting data from Oracle and DB2 into HDFS and HIVE using Sqoop.
  • Worked on installing Cloudera Manager and CDH, installed the JCE policy file to create a Kerberos principal for the Cloudera Manager Server, and enabled Kerberos using the wizard.
  • Developed Spark jobs using Scala and Java on top of Yarn/MRv2 for interactive and Batch Analysis.
  • Developed Python code to gather data from HBase and designed the solution to implement it using PySpark.
  • Monitored the cluster for performance, networking, and data integrity issues, and was responsible for troubleshooting failures in MapReduce jobs by inspecting and reviewing log files.
  • Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
  • Developed and analyzed SQL scripts and designed the solution to implement them using Spark.
  • Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
  • Supported Map Reduce Programs and distributed applications running on the Hadoop cluster and scripting Hadoop package installation and configuration to support fully automated deployments.
  • Migrated an existing on-premises application to AWS, used AWS services like EC2 and S3 for large data set processing and storage, worked with Elastic MapReduce, and set up the Hadoop environment on AWS EC2 instances.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Wrote UDFs in PySpark to perform transformations and loads (see the UDF sketch following this section's environment listing).
  • Analyzed SQL scripts and designed solutions to implement them using PySpark.
  • Used NiFi to load data into HDFS as ORC files.
  • Wrote TDCH scripts and used Apache NiFi to load data from mainframe DB2 into the Hadoop cluster.
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
  • Monitored the Hadoop cluster using tools like Nagios, Ganglia, and Cloudera Manager, and maintained the cluster by adding and removing nodes with the same tools.
  • Integrated the Snowflake data warehouse into the pipeline, with ingestion from the ETL pipeline.
  • Worked on Hive to expose data for further analysis and to transform files from different analytical formats into text files.
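
A rough sketch of that Kafka-to-Parquet ingestion is shown below. It is written with the Structured Streaming API rather than the RDD-based DStreams referenced above, the broker address, topic, and HDFS paths are placeholders, and it assumes the spark-sql-kafka connector is on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()

    # Read a Kafka topic as a streaming DataFrame (broker and topic are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "customer-events")
              .load()
              .select(col("key").cast("string"), col("value").cast("string")))

    # Continuously append the parsed records to HDFS as Parquet.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events/parquet")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()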

Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Spark, Spark Streaming, Spark SQL, AWS EMR, AWS S3, AWS Redshift, Scala, Spark, MapR, Java, Oozie, Flume, HBase, NiFi, Nagios, Ganglia, Hue, Snowflake, Hortonworks, Cloudera Manager, Zookeeper, Cloudera, Oracle, Kerberos and RedHat 6.5
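
PySpark UDFs of the kind mentioned above usually look like the minimal sketch below; the column names and the normalization rule are invented purely for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    # Hypothetical input: raw customer records with inconsistent state codes.
    df = spark.createDataFrame(
        [("alice", " ny "), ("bob", "Ca"), ("carol", None)],
        ["name", "state"],
    )

    @udf(returnType=StringType())
    def normalize_state(value):
        # Trim and upper-case the state code; pass nulls through unchanged.
        return value.strip().upper() if value else None

    df.withColumn("state", normalize_state(col("state"))).show()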

Confidential

AWS Data Engineer

Responsibilities:

  • Extracted, cleansed, and transformed data using Databricks and Spark for data analysis.
  • Worked thoroughly on data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters.
  • Worked in the data science team on preprocessing and feature engineering and assisted with machine learning algorithms running in production.
  • Designed and implemented test environment on AWS.
  • Used AWS API Gateway to trigger Lambda functions. Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.
  • Integrated and developed a PySpark application as an ETL tool.
  • Used DynamoDB to store metadata and logs.
  • Queried data residing in AWS S3 buckets with Athena.
  • Implemented AWS EMR Spark using PySpark and utilized DataFrames and the Spark SQL API for faster data processing.
  • Used AWS Step Functions to run a data pipeline.
  • Monitored and managed services with AWS CloudWatch
  • Implemented Airflow configuration as a code approach to automate the generation of workflows using Jenkins and Git.
  • Implemented Airflow visualizations in Grafana dashboards and connected failure notifications to Slack channels.
  • Used Dremio in AWS as the query engine for faster joins and complex queries over AWS S3 buckets.
  • Performed transformations using Apache Spark SQL.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/text files) into AWS Redshift (a hedged Glue job sketch appears after this list).
  • Created a REST API using Flask in Python.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
  • Developed a NiFi workflow to pick up data from a REST API server, the Data Lake, and an SFTP server and send it to a Kafka broker.
  • Reduced access time through data model refactoring and query optimization, and implemented a Redis cache to support Snowflake.
  • Experienced in automating, configuring, and deploying instances on AWS environments and data centers; also familiar with EC2, CloudWatch, CloudFormation, and managing security groups on AWS.
  • Built interactive Power BI dashboards and worked on reporting analytics
  • Extensively used Databricks notebooks for interactive analytics using Spark APIs.
  • Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
  • Designed, developed, and tested dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
  • Implemented scalable microservices to handle concurrency and high traffic. Optimized existing Scala code and improved the cluster performance.
  • Involved in building an Enterprise Data Lake using Data Factory and Blob storage, enabling other teams to work with more complex scenarios and ML solutions.
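
The Glue ETL described above generally follows the standard Glue job skeleton. The sketch below is a simplified, assumed version: the catalog database, table, Redshift connection, and temp directory are placeholder names, not the real campaign pipeline:

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read campaign files from S3 via the Glue Data Catalog (placeholder names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="campaign_db", table_name="raw_campaigns")

    # Keep and rename only the columns the warehouse needs.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("campaign_id", "string", "campaign_id", "string"),
                  ("spend", "double", "spend_usd", "double")])

    # Load into Redshift through a catalog connection (connection name is assumed).
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "campaigns", "database": "analytics"},
        redshift_tmp_dir="s3://example-bucket/glue-temp/")

    job.commit()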

Environment: Azure Data Factory, Python, Pandas, Hive, PySpark, Spark SQL, RDD, Tableau, Git, Apache Airflow, Azure Cosmos DB, Azure Data Share, HDInsight, Data Lake, Databricks, Snowflake, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Data Storage Explorer.

Confidential

Azure Data Engineer

Responsibilities:

  • Extensively worked with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Azure Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Created pipelines in Azure Data Factory (ADF) using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources like Azure SQL, Blob storage, and Azure SQL DW, and to write data back.
  • Created an Application Interface Document for downstream teams to create a new interface to transfer and receive files through Azure Data Share.
  • Designed and configured Azure Cloud relational servers and databases, analyzing current and future business requirements.
  • Worked on PowerShell scripts to automate Azure cloud system creation of resource groups, web applications, Azure Storage Blobs & Tables, and firewall rules.
  • Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
  • Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes.
  • Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
  • Developed Elastic pool databases and scheduled Elastic jobs to execute T-SQL procedures.
  • Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
  • Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
  • Created data pipelines for different events from Azure Blob storage into Hive external tables. Used various Hive optimization techniques like partitioning, bucketing, and map joins.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas (a brief feature-engineering sketch follows this list).
  • Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
  • Utilized Agile and Scrum methodology for team and project management.
  • Used Git for version control with colleagues.
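
The Pandas preprocessing and feature-engineering step can be pictured with the short sketch below; the columns and derived features are illustrative assumptions rather than the actual dataset:

    import pandas as pd

    # Hypothetical usage records; in practice these came from the ingested pipeline data.
    df = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "signup_date": ["2021-01-05", "2021-03-20", None],
        "monthly_spend": [42.0, None, 77.5],
    })

    # Basic cleansing: parse dates and fill missing spend with the median.
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

    # Simple derived features for downstream predictive models.
    df["tenure_days"] = (pd.Timestamp("2021-12-31") - df["signup_date"]).dt.days
    df["high_spender"] = (df["monthly_spend"] > df["monthly_spend"].mean()).astype(int)

    print(df)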

Environment: HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Data Storage Explorer, Azure Data Factory, Python, Pandas, Hive, PySpark, Spark SQL, RDD, Tableau, Git, Apache Airflow, Azure Cosmos DB, Azure Data Share.

Confidential

Data Analyst

Responsibilities:

  • Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
  • Performed Data Analysis, Migration, Cleansing, Transformation, Integration, Data Import, and Data Export through Python.
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation, and Materialized views to optimize query performance.
  • Developed logistic regression models (using R and Python) to predict subscription response rate based on customer variables like past transactions, response to initial mailings, promotions, demographics, interests, and hobbies (a minimal scikit-learn sketch follows this list).
  • Created Tableau dashboards/reports for data visualization, Reporting, and Analysis and presented them to Business.
  • Created Data Connections, Published on Tableau Server for usage with Operational or Monitoring Dashboards.
  • Knowledge of the Tableau administration tool for configuration, adding users, managing licenses and data connections, scheduling tasks, and embedding views by integrating with other platforms.
  • Worked with senior management to plan, define and clarify dashboard goals, objectives, and requirements.
  • Responsible for daily communications to management and internal organizations regarding the status of all assigned projects and tasks.
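
A minimal sketch of the response-rate modeling described above, using scikit-learn on synthetic data; the real models used many more customer variables and were built in both R and Python:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-ins for customer features: past transactions, mailing responses, etc.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Fit a logistic regression and report the predicted probability of responding.
    model = LogisticRegression().fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))
    print("response probabilities:", model.predict_proba(X_test[:5])[:, 1])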

Environment: SQL, Tableau, R, Python, Excel, Lookups, Access
