Sr. AWS Big Data Engineer Resume
Charlotte, NC
SUMMARY
- 8+ years of professional IT experience in requirements gathering, analysis, architecture, design, documentation, and implementation of applications using Big Data technologies.
- Experienced in programming languages including Scala, Java, Python, SQL, T-SQL, and R.
- Expertise in Hadoop ecosystem components such as MapReduce, Pig, NiFi, Hive, ZooKeeper, HBase, Sqoop, Oozie, Flume, Drill, and Spark for data storage and analysis.
- Experience in converting Hive/SQL queries into Spark transformations using Spark RDD and PySpark concepts.
- Experience in Implementing Apache Airflow for authoring, scheduling and monitoring Data Pipelines.
- Experience in importing and exporting data between HDFS and databases such as MySQL, Oracle, Netezza, Teradata, and DB2 using Sqoop and Talend.
- Experience in the Hadoop ecosystem, including Spark, Kafka, HBase, Impala, Mahout, Storm, Tableau, and Talend big data technologies.
- Experienced in Python data manipulation for loading and extraction, as well as Python libraries such as NumPy, SciPy, and pandas for data analysis and numerical computation.
- Experienced in Spark technologies like Spark Core, Spark RDD, Pair RDD, Spark Deployment architectures.
- Proficient with Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data with in-memory computing, written in Scala. Improved the efficiency of existing algorithms in Spark using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Experience in implementing Big Data engineering, cloud data engineering, data warehouse, data mart, data visualization, reporting, data quality, and data virtualization solutions.
- Experience in developing and scheduling ETL workflows in Hadoop using Oozie.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
- Experience with relational databases such as SQL Server, MySQL, Oracle, and DB2, and NoSQL databases such as MongoDB, HBase, DynamoDB, Cosmos DB, and Cassandra.
- Experienced in developing and designing the automation framework using Python and Shell scripting.
- Hands on experience in machine learning, big data, data visualization, Python development, Java, Linux, Windows, SQL, GIT/GitHub.
- Experienced in data analysis, design, development, implementation, and testing using data conversion and extraction, transformation, and loading (ETL) with SQL Server, Oracle, and other relational and non-relational databases.
- Experience in building, deploying, and integrating applications with Ant and Maven.
- Experience with Big Data platforms such as the Cloudera and Hortonworks distributions.
- Experience in creating test cases for JUnit testing.
- Experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and aggregation, with detailed knowledge of the MapReduce framework.
- Experience in creating and maintaining CI/CD (continuous integration and deployment) pipelines and applying automation to environments and applications using Jenkins, Docker, and Kubernetes.
- Expertise in Snowflake for creating and maintaining tables and views.
- Experience developing applications using Java and J2EE technologies: Servlets, JSP, Java Web Services, JDBC, XML, Cascading, Spring, and Hibernate.
- Experience with big data on AWS cloud services: EC2, S3, EMR, Glue, Athena, RDS, VPC, SQS, ELK, Kinesis, DynamoDB, and CloudWatch.
- Experience with Amazon Redshift, the AWS data warehouse service, and with configuring servers for Auto Scaling and Elastic Load Balancing.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experienced in creating pipelines that move, transform, and analyze data from a wide variety of sources using methods such as the Azure PowerShell utility.
- Experience in various SDLC methodologies such as Agile, Scrum, and Waterfall.
- Experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Experience using source code management tools such as GIT, SVN, and Perforce.
TECHNICAL SKILLS
Hadoop Distributions: Cloudera, AWS EMR and Azure Data Factory.
Languages: Scala, Python, SQL, HiveQL, KSQL.
IDE Tools: Eclipse, IntelliJ, PyCharm.
Cloud platform: AWS, Azure, GCP
AWS Services: VPC, IAM, S3, Elastic Beanstalk, CloudFront, Redshift, Lambda, Kinesis, DynamoDB, Direct Connect, Storage Gateway, EKS, DMS, SMS, SNS, and SWF.
Reporting and ETL Tools: Tableau, Power BI, Talend, AWS Glue.
Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)
Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, Databricks, Kafka, Cloudera
Machine Learning and Statistics: Regression, Random Forest, Clustering, Time-Series Forecasting, Hypothesis Testing, Exploratory Data Analysis
Containerization: Docker, Kubernetes
CI/CD Tools: Jenkins, Bamboo, GitLab CI, uDeploy, Travis CI, Octopus
Operating Systems: UNIX, LINUX, Ubuntu, CentOS.
Other Software: Control M, Eclipse, PyCharm, Jupyter, Apache, Jira, Putty, Advanced Excel
Frameworks: Django, Flask, WebApp2
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, NC
Sr AWS Big Data Engineer
Responsibilities:
- Worked on Big Data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods.
- Utilized Agile Methodology to help manage and organize a team with regular code review sessions.
- Worked on analyzing large and critical datasets using Cloudera, HDFS, MapReduce, Hive, Hive UDF, Pig, Sqoop and Spark.
- Implemented a proof of concept deploying the product in an AWS S3 bucket and Snowflake.
- Implemented a generalized solution model using AWS SageMaker.
- Worked with structured and semi-structured data ingestion and processing on AWS using S3 and Python; migrated on-premises big data workloads to AWS.
- Developed Python scripts to automate ETL processes using Apache Airflow, along with cron scripts on Unix (illustrative sketch after this section).
- Worked with the Amazon EC2 command-line interface along with Python to automate repetitive work.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Worked with Spark for improving performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark-SQL, Spark MLlib, Data Frame, Pair RDD's, Spark YARN.
- Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR, performing the necessary transformations based on the source-to-target mappings (STMs) developed.
- Worked in writing Spark applications in Scala and Python.
- Worked with Spark SQL and PySpark scripts in the Databricks environment to validate the monthly account-level customer data stored in S3.
- Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
- Worked with Oozie Workflow Engine in running workflow jobs with actions that run Hadoop MapReduce, Hive, Spark jobs.
- Used Pentaho Data Integration/Kettle to design all ETL processes, extracting data from various sources including live systems and external files, cleansing it, and loading it into the target data warehouse.
- Involved in design reviews, ETL code reviews with teammates.
- Assisted users in creating/modifying worksheets and data visualization dashboards in Tableau.
- Developed analytical components using Scala, Spark, Apache Mesos and Spark Stream.
- Created Informatica and Talend mappings/jobs to build one-time, full-load, and incremental loads, and applied data fixes as agreed with the business.
- Worked with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.
- Utilized cloud-based technologies on Amazon Web Services (AWS): VPC, EC2, Route 53, S3, DynamoDB, ElastiCache, Glacier, RRS, CloudWatch, CloudFront, Kinesis, Redshift, SQS, SNS, and RDS.
- Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing. Knowledge of web/application servers such as Apache Tomcat and Oracle WebLogic.
- Used ETL to implement the Slowly Changing Dimension transformation to maintain historical data in the data warehouse.
- Involved in working with CI/CD tools like Docker, Kubernetes and Jenkins.
- Designed and developed an ETL process in AWS Glue to migrate usage data from the S3 data source to Redshift.
- Worked on ETL Migration services by developing and deploying AWS Lambda functions for generating a serverless data pipeline which can be written to Glue Catalog and can be queried from Athena.
- Worked on UNIX Shell Scripting for splitting group of files to various small files and file transfer automation.
- Responsible for using a Flume sink to remove data from the Flume channel and deposit it in a NoSQL database like MongoDB.
- Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
Environment: Agile, Spark, Python, AWS, Pig, HBase, Oozie, Sqoop, Kafka, MapReduce, Glue, Talend, Scala, Hive, Cloudera, ZooKeeper, NiFi, MongoDB, SQL, ETL, UNIX.
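Illustrative sketch for the Airflow-based ETL automation mentioned above: a minimal daily DAG that runs a PySpark transform and a downstream load step. The DAG name, S3 path, and load_to_snowflake callable are assumptions for illustration, not details from the project.

```python
# Minimal Apache Airflow DAG sketch for a daily ETL run (hypothetical names throughout).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def load_to_snowflake(**context):
    # Placeholder: in a real pipeline this step would push curated S3 data into Snowflake.
    print("Loading curated data for", context["ds"])


default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_s3_etl",                      # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Run the PySpark transformation via spark-submit (script path is illustrative).
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit s3://my-bucket/jobs/transform.py {{ ds }}",
    )
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)

    transform >> load
```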
Confidential, Chicago, IL
Sr. Data Engineer
Responsibilities:
- Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
- Worked on Python/Bash/SQL scripts to load data from the data lake to the data warehouse.
- Created Databricks notebooks using SQL and Python and automated the notebooks using jobs.
- Implemented Spark using Python/Scala and utilizing Spark Core, Spark Streaming and Spark SQL for faster processing of data instead of MapReduce in Java.
- Explored and analyzed customer-specific features using Matplotlib and Seaborn in Python and dashboards in Tableau.
- Worked on Apache Spark with Python to develop and execute Big Data Analytics and Machine learning applications, executed machine learning use cases under Spark ML and MLlib.
- Involved in working with big data tools like Hadoop, Spark, Hive.
- Worked on architecting the ETL transformation layers and writing spark jobs to do the processing.
- Worked on analyzing Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.
- Worked on importing data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Worked extensively with HDInsight clusters, using Hadoop ecosystem tools such as Kafka, Spark, and Databricks for real-time streaming analytics, and Sqoop, Pig, Hive, and Cosmos DB for batch jobs.
- Worked on building the models using Python and PySpark to predict probability of attendance for various campaigns and events.
- Used Pentaho Data Integration to create all ETL transformations and jobs.
- Worked in creating Spark clusters and configuring high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Wrote PySpark user-defined functions (UDFs) for various use cases and applied business logic where necessary in the ETL process (illustrative sketch after this section).
- Involved in querying data using Spark SQL on top of the Spark engine and implementing Spark RDDs in Scala.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications of any failures.
- Enabled concurrent access to Hive tables with shared/exclusive locks by implementing ZooKeeper in the cluster.
- Used Tableau extracts to perform offline analysis.
- Responsible for data services and data movement infrastructure; good experience with ETL concepts, building ETL solutions, and data modeling.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store, HDInsight.
- Worked on Azure Blob Storage, Azure Data Lake, Azure Data Factory, Azure SQL, Azure SQL Data Warehouse, Azure Analytics, PolyBase, Azure HDInsight, and Azure Databricks.
- Developed pipelines to move data from Azure Blob Storage/file shares to Azure SQL Data Warehouse and Blob Storage.
- Worked extensively on the migration of different data products from Oracle to Azure.
- Developed and deployed data pipeline in cloud such as GCP.
- Worked on building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Worked on developing PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns.
- Worked on leveraging cloud and GPU computing technologies, such as GCP, for automated machine learning and analytics pipelines.
- Worked on GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Performed Data Migration to GCP.
- Worked on NoSQL databases such as HBase and Cassandra.
Environment: Azure, GCP, Agile, Spark, Hadoop, Scala, Hive, NIFI, Zookeeper, ETL, Pig, Kafka, SQL, NoSQL, Cassandra, Oracle, Linux, Tableau.
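Illustrative sketch for the PySpark UDF work noted above: a minimal example of applying a business rule as a UDF during an ETL step. Column names and the tiering rule are hypothetical.

```python
# Minimal PySpark UDF sketch (column names and business rule are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("etl-udf-example").getOrCreate()

# Small sample frame standing in for monthly account-level data.
df = spark.createDataFrame(
    [("A123", 1200.0), ("B456", 80.0)],
    ["account_id", "monthly_spend"],
)


@udf(returnType=StringType())
def spend_tier(amount):
    # Example business rule applied during the ETL step.
    if amount is None:
        return "unknown"
    return "high" if amount >= 1000 else "standard"


df.withColumn("spend_tier", spend_tier(col("monthly_spend"))).show()
```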
Confidential, San Francisco, CA
Python Developer
Responsibilities:
- Working experience with Agile and Scrum methodologies.
- Created data pipelines for different events to load the data from DynamoDB to AWS S3 bucket and then into HDFS location.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Worked on package configuration to set up automated ETL load processing for one-time and incremental data loads.
- Created data ingestion modules using AWS Glue for loading data into various layers in S3 and reporting using Athena and QuickSight.
- Created ETL mappings for operational dashboards covering various KPIs and business metrics, allowing powerful drill-down into detail reports to understand the data at a granular level.
- Used pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop machine learning algorithms for predictive modeling in R and Python.
- Implemented a Python-based distributed random forest via Python streaming.
- Used predictive modeling with tools in SAS, SPSS, and Python.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala
- Worked on some of the CI/CD tools like Docker and Kubernetes.
- Migrated ETL code from Talend to Informatica; involved in development, testing, and post-production support for the entire migration project.
- Worked on ETL Migration services by developing and deploying AWS Lambda functions.
- Designed line, bar, and column charts to generate reports with cumulative totals.
- Worked on NoSQL databases like MongoDB and HBase.
- Developed and designed automation framework using Python and Shell scripting.
- Involved in developing Python scripts and using Ab Initio, Informatica, and other ETL tools for extraction, transformation, and loading of data into the data warehouse.
- Designed and developed an ETL process in AWS Glue to migrate usage data from the S3 data source to Redshift.
- Operated the cluster on AWS using EC2, VPC, RDS, EMR, S3, and CloudWatch.
- Wrote Spark programs to model data for extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed formats.
- Strong understanding of AWS components such as EC2 and S3.
- Worked on CI/CD pipelines with GitHub and AWS.
- Worked on data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
- Developed a Spark Streaming module to consume Avro messages from Kafka (illustrative sketch after this section).
- Developed Spark SQL applications for big data migration from Teradata to Hadoop, reducing memory utilization in Teradata analytics.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Worked on ETL jobs in the new environment after fully understanding the existing code.
- Extracted data from HDFS using Hive and Presto, performed data analysis using Spark with Scala and PySpark, carried out feature selection, and created nonparametric models in Spark.
- Worked on developing PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns.
- Used Jenkins scheduler to schedule the ETL workflows.
- Worked with Kafka and Kafka brokers, initiated the Spark context, processed live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
- Working experience with RDDs and DataFrames (Spark SQL) using PySpark for analyzing and processing data.
- Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.
- Worked with Spark for improving performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark-SQL, Data Frames and Pair RDD's.
- Worked on JSON-style documents in NoSQL databases like MongoDB and deployed the data to the Amazon Redshift cloud service.
- Worked on data pipelines for ingestion and aggregation events, loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as a feed for Tableau dashboards.
- Created dynamic BI reports/dashboards for production support in Excel, PowerPoint, Power BI, Tableau, MySQL Server, and PHP.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources.
Environment: Agile, AWS, Cloudera, Hive, Pig, Scala, Spark, Kubernetes, Redshift, ETL, HDFS, Tableau, MySQL, Hadoop, Kafka
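Illustrative sketch for the Spark Streaming consumption of Avro messages from Kafka noted above, assuming Spark Structured Streaming with the external spark-avro package. The broker address, topic, schema, and output paths are hypothetical.

```python
# Structured Streaming sketch for consuming Avro records from Kafka
# (broker, topic, schema, and paths are hypothetical; requires the spark-avro package).
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-avro-consumer").getOrCreate()

# Avro schema for the assumed event payload.
event_schema = """
{"type": "record", "name": "Event",
 "fields": [{"name": "event_id", "type": "string"},
            {"name": "amount", "type": "double"}]}
"""

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Decode the Avro-encoded Kafka value into columns.
parsed = raw.select(from_avro(col("value"), event_schema).alias("event")).select("event.*")

query = (
    parsed.writeStream.format("parquet")
    .option("path", "hdfs:///data/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```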
Confidential
Big Data Engineer
Responsibilities:
- Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and Agile methodologies.
- Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment for big data resources (illustrative sketch after this section).
- Responsible for design and development of advanced Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Implemented a Python-based distributed random forest via Python streaming.
- Developed various shell scripts and Python scripts to address production issues.
- Worked on Apache Spark, writing Python applications to parse and convert TXT and XLS files.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Worked on Power BI reports as required and made Power BI dashboards available in web clients and mobile apps.
- Involved in the design, development, and testing of ETL processes using Informatica.
- Worked on package configuration to set up automated ETL load processing for one-time and incremental data loads.
- Migrated ETL processes from RDBMS to Hive to enable easier data manipulation.
- Created DAX queries to generate computed columns in Power BI.
- Used Power BI and Power Pivot to develop data analysis prototypes, and used Power View and Power Map to visualize reports.
- Responsible for the logical dimensional data model and used ETL skills to load the dimensional physical layer from various sources, including DB2, SQL Server, Oracle, and flat files.
- Designed and developed custom data flows using Apache NiFi to fully automate the ETL process, taking various worst-case scenarios into account.
- Deep understanding of monitoring and troubleshooting mission critical Linux machines.
Environment: Power BI, Hive, Pig, YARN, Hadoop, GIT, HBase, EC2, CloudWatch, Apache NiFi, Oracle, SQL, Glue, MongoDB, Spark.
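Illustrative sketch for the regex project noted above: PySpark regexp_extract applied to a Hive-backed log table. The database, table, column names, and pattern are hypothetical.

```python
# Sketch of regex-based field extraction over a Hive-backed log table
# (table and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = (
    SparkSession.builder.appName("regex-cleansing").enableHiveSupport().getOrCreate()
)

logs = spark.table("raw_db.web_logs")  # assumed table with a single 'line' string column

# Common-log-style pattern: client IP, timestamp, HTTP method, request path.
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\w+) (\S+)'

parsed = logs.select(
    regexp_extract("line", pattern, 1).alias("client_ip"),
    regexp_extract("line", pattern, 2).alias("event_time"),
    regexp_extract("line", pattern, 3).alias("http_method"),
    regexp_extract("line", pattern, 4).alias("request_path"),
)

parsed.write.mode("overwrite").saveAsTable("curated_db.web_logs_parsed")
```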
Confidential
Data Analyst
Responsibilities:
- Gathered business requirements and prepared technical design documents, target to source mapping document, mapping specification document.
- Worked in Agile environment and used rally tool to maintain the user stories and tasks.
- Imported required modules such as Keras and NumPy in the Spark session and created directories for data and output.
- Performed data gathering, data cleaning, and data wrangling using Python.
- Designed and developed custom data integration pipelines on Facebook's big data stack, including Python, YAML, Hive, Vertica, and Dataswarm.
- Optimized and tuned ETL processes & SQL Queries for better performance.
- Executed multiple Spark SQL queries after forming the database to gather specific data corresponding to an image (illustrative sketch after this section).
- Involved in querying data using Spark SQL on top of the Spark engine and implementing Spark RDDs in Scala.
- Designed ETL/SSIS packages to process data from various sources to target databases.
- Used the DataFrame API in Scala to convert distributed collections of data into named columns, and developed predictive analytics using Apache Spark Scala APIs.
- Worked with BI teams in generating the reports and designing ETL workflows on Tableau.
- Involved in migrating tables from RDBMS into Hive tables using Sqoop, and later generated data visualizations using Tableau.
- Worked on data visualization: designed dashboards and generated complex reports, including charts, summaries, and graphs, to interpret findings for the team and stakeholders.
- Involved in the operational data mart and reporting for Sales and Service Analytics.
- Implemented both object-level and row-level security based on end-user roles and responsibilities.
- Created complex mappings using Unconnected Lookup, Aggregator, and Router transformations to populate target tables efficiently.
- Developed UNIX shell scripts for running batch jobs and scheduling them.
Environment: Tableau, Hive, ZooKeeper, Scala, SVN, Ant, Maven, UNIX, Sqoop, ETL, ELK, SparkSQL.
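Illustrative sketch for the Spark SQL querying noted above: a minimal aggregate query over a Hive table of image metadata. The database, table, and column names are hypothetical.

```python
# Spark SQL query sketch against a Hive table of image metadata
# (database, table, and column names are hypothetical).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("image-metadata-queries").enableHiveSupport().getOrCreate()
)

# Aggregate annotations per image and surface the most-annotated ones.
spark.sql("""
    SELECT image_id,
           COUNT(*)        AS annotation_count,
           MAX(capture_ts) AS latest_capture
    FROM analytics.image_metadata
    GROUP BY image_id
    ORDER BY annotation_count DESC
    LIMIT 20
""").show()
```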