AWS Big Data Engineer Resume
Reston, VA
SUMMARY
- Experience with technologies and systems that process massive volumes of data running in highly distributed mode on Cloudera and Hortonworks Hadoop distributions and Amazon AWS.
- Hands-on experience using Hadoop ecosystem components such as Hive, Pig, Sqoop, HBase, Spark, Spark Streaming, Spark SQL, Oozie, ZooKeeper, Kafka, Flume, the MapReduce framework, YARN, Scala and Hue.
- Extensive experience working with and integrating NoSQL databases such as DynamoDB, MongoDB and HBase.
- Experienced in developing data ingestion frameworks to ingest data from sources such as Oracle, SQL Server and flat files into a data lake in the Hadoop ecosystem.
- Designed data models for data-intensive AWS Lambda applications that perform complex analysis and produce analytical reports for end-to-end traceability, lineage and definition of key business elements from Aurora.
- Automated AWS volume snapshot backups for the enterprise using Lambda. Created functions and assigned roles in AWS Lambda to run Python scripts. Built S3 buckets, managed S3 bucket policies and used S3 and Glacier for storage and backup on AWS.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that registers its output in the Glue Data Catalog and can be queried from Athena.
- Experience configuring Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS (a streaming sketch follows this list), and expertise in using Spark SQL with data sources such as JSON, Parquet and Hive.
- Proficient in Python scripting; worked with statistical functions in NumPy, visualization with Matplotlib and data organization with Pandas.
- Wrote complex HiveQL queries to extract required data from Hive tables and developed Hive user-defined functions (UDFs) as required.
- Excellent knowledge of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
- Strong knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
- Used Python and Boto3 to write scripts that automate launching, starting and stopping EC2 instances and taking snapshots of servers (a minimal Boto3 sketch follows this list).
- Used Kubernetes to orchestrate the deployment, scaling and management of Docker containers. Developed microservice onboarding tools leveraging Python and Jenkins, allowing easy creation and maintenance of build jobs and Kubernetes deployments and services.
- Created Docker containers and Docker consoles for managing the application life cycle. Set up Docker on Linux and configured Jenkins to run under a Docker host.
- Skilled in using Amazon Redshift to perform large-scale database migrations.
- Skilled in using Kerberos and Ranger to maintain authentication and authorization for Jupyter notebooks.
- Experience importing and exporting data using Sqoop from HDFS to relational database systems and from relational database systems to HDFS.
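A minimal sketch of the Lambda/Boto3 snapshot automation referenced above. The region, tag filter and return shape are illustrative assumptions rather than the exact configuration used.

```python
# Sketch of Lambda-driven EBS snapshot automation. The tag filter and region
# are assumptions, not the project's exact settings.
import boto3

REGION = "us-east-1"      # assumed region
BACKUP_TAG = "Backup"     # assumed tag marking volumes to snapshot

def lambda_handler(event, context):
    ec2 = boto3.client("ec2", region_name=REGION)
    # Find EBS volumes tagged for backup.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": f"tag:{BACKUP_TAG}", "Values": ["true"]}]
    )["Volumes"]
    snapshot_ids = []
    for vol in volumes:
        snap = ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description=f"Automated backup of {vol['VolumeId']}",
        )
        snapshot_ids.append(snap["SnapshotId"])
    return {"snapshots_created": snapshot_ids}
```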
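A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow referenced above; the broker address, topic name and HDFS paths are placeholders.

```python
# Sketch of a Kafka-to-HDFS stream. Requires the spark-sql-kafka connector
# package on the classpath; broker, topic and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/events")            # placeholder HDFS path
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```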
TECHNICAL SKILLS
Cloud Platform Services: AWS EC2, S3, ELB, Auto Scaling Groups, Glacier, EBS, Elastic Beanstalk, CloudFormation/Terraform, RDS, Redshift, VPC, Direct Connect, Route 53, CloudWatch, CloudTrail, IAM, DynamoDB, SNS, SQS, ElastiCache, EMR, Lambda, Elasticsearch, DMS, SCT, Kinesis Streams, Kinesis Firehose
Apache / Hadoop Ecosystem: Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Oozie, Apache Spark, Apache Tez, Apache ZooKeeper, Cloudera Impala, HDFS, Airflow
Scripting: HiveQL, MapReduce, XML, Python, UNIX/Linux shell scripting, Scala
Distributions: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK)
Data processing (compute) engines and Containers: Apache Spark, Spark Streaming, Docker, Kubernetes
Database Tools: Microsoft SQL Server (2005, 2008 R2, 2012), Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB
Languages: PL/SQL, HTML, XML, C++, Java, Python
File Formats: Parquet, Avro, JSON, ORC, Text, CSV
Scheduling tools: Airflow
PROFESSIONAL EXPERIENCE
AWS Big Data Engineer
Confidential, Reston, VA
Responsibilities:
- Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation and visualization to deliver Python solutions.
- Extracted data from a SQL Server database, copied it into the HDFS file system and used Hadoop tools such as Hive and Pig Latin to retrieve the data required for building models.
- Performed data cleaning, including transforming variables and handling missing values, and ensured data quality, consistency and integrity using Pandas and NumPy.
- Fine-tuned PySpark applications and jobs to improve the efficiency and overall processing time of the pipelines.
- Wrote PySpark jobs in AWS Glue to merge data from multiple tables (a Glue job sketch follows this list).
- Utilized Glue Crawlers to populate the AWS Glue Data Catalog with table metadata definitions.
- Worked with MongoDB database concepts such as locking, transactions, indexes, sharding, replication and schema design.
- Imported and exported data using Sqoop between HDFS and relational database systems (RDBMS).
- Designed and developed a data ingestion pipeline to ingest data from sources such as Oracle, SQL Server and flat files into the data lake in the Hadoop ecosystem.
- Developed scalable transformation, aggregation and rollup operations with Hive, and optimized SLAs using Hive-based partitions and buckets, storing the data in different file formats (Parquet, Avro, ORC) with suitable compression codecs (Snappy, LZ4, GZIP, LZO, BZIP2) based on application needs.
- Developed optimized distributed applications with Spark Core and Spark SQL in Python, integrating REST, fact and dimensional data, and fed the data to HDFS and SQL Server.
- Implemented Watcher alerts for next-day alerting and monitoring of application logs partitioned by day, using an ELK stack deployed on Docker containers, and configured Logstash forwarders.
- Used Jenkins pipelines to drive all microservice builds out to the Docker registry and then deploy to Kubernetes; created and managed pods using Kubernetes.
- Worked with the Elasticsearch CRUD API for document indexing and reindexing using the Bulk API (a bulk-indexing sketch follows this list), with the X-Pack extension added to monitor the Elastic Stack and audit both REST and transport calls.
- Designed data models for data-intensive AWS Lambda applications that perform complex analysis and produce analytical reports for end-to-end traceability, lineage and definition of key business elements from Aurora.
- Developed data ingestion modules (both real-time and batch data loads) to ingest data into various layers in S3, Redshift and DynamoDB using AWS Kinesis, AWS Glue, AWS Lambda and AWS Step Functions.
- Migrated the data warehouse to Redshift and the data lake layers to S3 cloud storage, and migrated the PySpark applications to run as EMR jobs on AWS.
- Built ephemeral EMR clusters to execute data preparation and data transformation tasks using Python and Spark. Populated raw, curated, conformed and analytics layers using PySpark jobs executed on EMR. Managed batch job performance using EMR autoscaling. Reporting was done on the Redshift-based data warehouse with the Tableau reporting tool.
- Automated AWS volume snapshot backups for the enterprise using Lambda. Created functions and assigned roles in AWS Lambda to run Python scripts. Built S3 buckets, managed S3 bucket policies and used S3 and Glacier for storage and backup on AWS.
- Migrated an existing on-premises application to AWS. Designed and developed Spark jobs to process batch workloads that ran on AWS EMR.
- Designed and developed Glue jobs to transform survey data from various sources ingested into S3 and load it into Redshift.
- Imported data from Amazon S3 into Spark DataFrames using Python (a PySpark sketch follows this list). Performed transformations and actions on the DataFrames, storing intermediate results in Parquet format in HDFS on the EMR cluster and final results on S3. Python modules were used for logging, data manipulation, config parsing, reading environment variables, etc.
- Used HiveQL to analyze the partitioned and bucketed data; executed Hive queries on Parquet tables stored in Hive to perform data analysis meeting the business specification logic.
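A minimal sketch of the AWS Glue PySpark merge job referenced above; the database, table names, join key and S3 output path are illustrative placeholders.

```python
# Sketch of a Glue job that merges two catalog tables and writes Parquet to S3.
# All names (database, tables, key, bucket) are placeholders.
import sys
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Tables previously registered in the Glue Data Catalog by a crawler.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")
customers = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="customers")

# Merge the two tables on their shared key and write the result back to S3.
merged = Join.apply(orders, customers, "customer_id", "customer_id")
glue_context.write_dynamic_frame.from_options(
    frame=merged,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/merged/"},
    format="parquet",
)
job.commit()
```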
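A minimal PySpark sketch of the S3-to-DataFrame processing referenced above, writing partitioned, Snappy-compressed Parquet; the bucket names, columns and filter are assumptions for illustration.

```python
# Sketch: read raw CSV from S3, transform, write partitioned Snappy Parquet.
# Bucket names, columns and the filter condition are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("s3-transform").getOrCreate()

raw = spark.read.option("header", "true").csv("s3://example-bucket/raw/orders/")

curated = (
    raw.withColumn("order_date", to_date(col("order_ts")))
       .filter(col("order_status") == "COMPLETE")
)

(curated.write
    .mode("overwrite")
    .partitionBy("order_date")                 # enables partition pruning in Hive/Athena
    .option("compression", "snappy")
    .parquet("s3://example-bucket/curated/orders/"))
```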
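A minimal sketch of bulk document indexing with the Elasticsearch Python client, in the spirit of the Bulk API work referenced above; the host, index name and sample documents are placeholders.

```python
# Sketch of Bulk API indexing of application-log documents.
# Host, index name and documents are placeholders.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])   # placeholder host

def actions(docs, index_name):
    # Yield one bulk action per log document.
    for doc in docs:
        yield {"_index": index_name, "_source": doc}

logs = [
    {"service": "ingest", "level": "ERROR", "message": "retry exhausted"},
    {"service": "ingest", "level": "INFO", "message": "batch committed"},
]
success, errors = bulk(es, actions(logs, "app-logs-2021.01.01"))
print(f"indexed {success} documents, {len(errors) if errors else 0} errors")
```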
Big Data Engineer
Confidential, Maryland
Responsibilities:
- Involved in the end-to-end process of Hadoop jobs that used technologies such as Sqoop, Pig, Hive, Spark and shell scripts (for job scheduling); extracted and loaded data into the data lake environment.
- Worked on building a data lake for the Loss Information system on an on-premises cluster, including building a pipeline for data ingestion as well as developing PySpark jobs for data transformations.
- Solid understanding of and experience in applying and implementing machine learning algorithms and concepts such as classification and regression, resampling statistics and bootstrapping using the R language.
- Expertise in designing and deploying Hadoop clusters and various big data analytic tools including Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Flume, Spark and Kafka on the MapR distribution.
- Assisted in upgrading, configuring and maintaining various Hadoop infrastructure components such as Pig, Hive and HBase.
- Used Spark for interactive queries, processing of streaming data and integration with NoSQL databases handling huge volumes of data.
- Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark 1.x/2.x for data aggregation and queries, writing data back into the OLTP system through Sqoop (a PySpark equivalent is sketched after this list).
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into HBase.
- Implemented design patterns in Scala for the Spark application.
- Worked on improving the performance and optimization of existing algorithms in Hadoop with Spark, using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Loaded log data into HDFS using Flume and Kafka and performed ETL integrations.
- Used the Tableau reporting tool to connect to Hive and generate daily data reports.
- Collaborated with the infrastructure, network, database, application and BA teams to ensure data quality.
- Worked with different file formats such as TextFile, SequenceFile, Avro, ORC and Parquet for Hive querying and processing.
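A hedged PySpark equivalent of the Scala DataFrame aggregation referenced above (rendered in Python to keep these sketches in one language); the Hive table, columns and staging path are placeholders, and the staged output stands in for the input to the Sqoop export step.

```python
# Sketch of a daily rollup aggregation over a Hive table. In the original work
# this was written in Scala and the result was exported to the OLTP system via
# Sqoop; here the result is staged to HDFS as delimited text for that export.
# Table, column names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("daily-claims-rollup")
         .enableHiveSupport()
         .getOrCreate())

claims = spark.table("loss_db.claims")          # placeholder Hive table

daily_rollup = (
    claims.groupBy("claim_date", "region")
          .agg(
              F.count("*").alias("claim_count"),
              F.sum("paid_amount").alias("total_paid"),
          )
)

# Stage as pipe-delimited text on HDFS; a Sqoop export job then loads this
# directory into the downstream OLTP table.
(daily_rollup.write
    .mode("overwrite")
    .option("sep", "|")
    .csv("hdfs:///staging/daily_claims_rollup"))
```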