AWS Big Data Engineer Resume
Seattle, WA
SUMMARY
- A dedicated and diligent Big Data Engineer who combines a strong academic background with significant experience gained in the following areas:
- 8+ years of experience in Big Data across 5 projects, in roles including Hadoop Developer, Big Data Developer, Data Engineer, Big Data Engineer, and AWS Big Data Engineer.
- 11+ years of combined experience in Big Data and IT/database infrastructure.
- Quick to grasp new ideas and concepts.
- Identify key improvements to working practices and procedures to maximize efficiency without compromising quality or service.
- Pay attention to detail when undertaking research and analysis and when preparing key reports for verbal and written presentation.
- Work effectively on own initiative, with a proven record of meeting targets, deadlines, and objectives.
- Utilize exceptional interpersonal skills and communicative abilities to build positive and lasting relationships with customers and colleagues.
- Strong knowledge of Hadoop and its components (HDFS, MapReduce, Flume, Kafka, Hive, Sqoop, Pig, Spark, HBase).
- Sound knowledge of programming languages (Scala, Python, Java).
- Experienced in Machine Learning algorithms such as KNN, K-Means, Random Forest, Linear Regression, Logistic Regression, Naive Bayes, and SVM, as well as Word2Vec and NLTK.
- Expertise in Deep Learning and neural network architectures such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and LSTMs.
- Experienced in transfer learning with TensorFlow and Keras to build neural networks (see the sketch after this list).
- Strong knowledge of Natural Language Processing (NLP) using Python.
- Design, build and deploy Machine Learning applications to solve real-world problems.
- Hands-on experience in Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.
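To illustrate the transfer-learning point above, a minimal sketch of the pattern in TensorFlow/Keras follows. It is an illustrative assumption rather than code from any specific project below; the base model (MobileNetV2), input shape, and class count are placeholders.

```python
# Minimal transfer-learning sketch with TensorFlow/Keras (illustrative only;
# model choice, input size, and class count are assumptions, not project details).
import tensorflow as tf

NUM_CLASSES = 3          # hypothetical number of target classes
IMG_SHAPE = (224, 224, 3)

# Reuse ImageNet-pretrained convolutional features and freeze them.
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SHAPE, include_top=False, weights="imagenet")
base.trainable = False

# Attach a small trainable classification head.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # datasets are hypothetical
```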
TECHNICAL SKILLS
HADOOP DISTRIBUTIONS: Apache Hadoop, Cloudera Hadoop, Hortonworks Hadoop
CLOUD PLATFORMS: Amazon AWS (S3), Microsoft Azure, Google Cloud Platform (GCP)
VERSIONING: Git, GitHub, BitBucket
CLOUD DATABASE & TOOLS: Apache HBase, SQL, Cassandra, Hive, Amazon Redshift, Amazon RDS
PROGRAMMING LANGUAGES: Python, Java, Scala
SCRIPTING: Hive, Pig, MapReduce, SQL, Spark SQL
FILE FORMAT AND COMPRESSION: CSV, ORC, JSON, Avro, Parquet
FILE SYSTEMS: HDFS
ETL TOOLS: Apache Flume, Kafka, Spark, AWS Glue, AWS EMR, Sqoop
DATA VISUALIZATION TOOLS: Tableau, Power BI, Kibana
OPERATING SYSTEMS: Unix/Linux, Windows
PROFESSIONAL EXPERIENCE
AWS BIG DATA ENGINEER
Confidential, Seattle, WA
Responsibilities:
- Develop multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.
- Responsible for designing logical and physical data models for various data sources on AWS Redshift.
- Define and implement the schema for a custom HBase table.
- Create Apache Airflow DAGs using Python (see the sketch after this list).
- Write numerous Spark programs in Scala for data extraction, transformation, and aggregation from multiple file formats.
- Work with AWS Lambda functions for event-driven processing using the boto3 module in Python.
- Execute Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 Buckets.
- Configure access for inbound and outbound traffic to RDS database services, DynamoDB tables, and EBS volumes, and set alarms for notifications or automated actions on AWS.
- Develop AWS CloudFormation templates to create the custom infrastructure for our pipeline.
- Implement AWS IAM user roles and policies to authenticate and control access.
- Specify nodes and perform data analysis queries on Amazon Redshift clusters on AWS.
- Define the Spark/Python (PySpark) ETL framework and best practices for development.
- Develop Spark programs using PySpark.
- Create User Defined Functions (UDFs) using Python in Spark.
- Work on AWS Kinesis for processing huge amounts of real-time data.
- Develop scripts for collecting high-frequency log data from various sources and integrating it into AWS using Kinesis, staging data in the Data Lake for further analysis.
- Work with different data science teams and provide respective data as required on an ad-hoc request basis.
- Move transformed data to the Spark cluster, where it goes live to the application via Kafka.
- Create a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
- Handle schema changes in data stream using Kafka.
- Responsible for ensuring Systems and Network Security, maintaining performance and setting up monitoring using CloudWatch and Nagios.
- Experience working with version control tools such as Git (GitHub) and Subversion (SVN), and with build tools such as Apache Maven.
- Design, develop, and test Spark SQL jobs with Scala and Python (PySpark) consumers.
- Work on the CI/CD pipeline for code deployment, engaging different tools (Git, Jenkins, CodePipeline) from developer code check-in through production deployment.
- Create and maintain ETL pipelines in AWS using Glue, Lambda, and EMR.
- Apply Google Dataproc, a managed Spark and Hadoop service, to optimize batch processing, querying, streaming, and machine learning workloads.
- Automate cluster creation and management using Google Dataproc.
- Apply Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.
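A minimal sketch of the Airflow DAG pattern referenced in the "Create Apache Airflow DAGs" bullet above. The DAG id, schedule, and task callables are hypothetical placeholders, not the production pipeline:

```python
# Hypothetical Airflow DAG illustrating the pattern; ids, schedule, and
# callables are placeholders, not the actual pipeline.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_s3(**context):
    # Placeholder: pull raw files from an S3 landing prefix.
    print("extracting raw data from S3")


def run_spark_transform(**context):
    # Placeholder: trigger a Spark job on EMR (e.g., via boto3 add_job_flow_steps).
    print("submitting Spark transform to EMR")


default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="example_s3_to_redshift",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract_from_s3", python_callable=extract_from_s3)
    transform = PythonOperator(task_id="run_spark_transform", python_callable=run_spark_transform)

    extract >> transform
```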
Big Data Engineer
Confidential, Lake Forest, CA
Responsibilities:
- Implemented a Jenkins server for CI/CD, integrated with Git version control.
- Monitored application logs using Elasticsearch.
- Used Kibana for dashboards and reporting, visualizing log data and streaming data.
- Used Kafka messaging to carry data between multiple endpoints.
- Configured ZooKeeper to coordinate the servers in clusters, maintain data consistency, and monitor services.
- Installed, configured, and tested an AWS Lambda function workflow in Python.
- Ingested data from various sources into S3 through AWS Kinesis Data Streams and Firehose.
- Created a POC using PySpark to handle deduplication, null values, and data integrity checks (see the sketch after this list).
- Utilized the Spark DataFrame and Dataset APIs from Spark SQL extensively for data processing.
- Worked on streaming the processed data to DynamoDB using Spark for making it available for visualization and report generation by the BI team.
- Used Spark SQL with Structured Streaming for real-time processing of structured data.
- Developed and debugged applications in Python, Scala, and Java.
- Monitored logs and generated visual representations of them using the ELK stack.
- Applied Google Dataproc to streamline data processing between clusters and Google Cloud Storage.
- Created a Lambda function to move data from S3 into Spark Structured Streaming, applying a schema to produce structured data.
- Set up Spark jobs to write the processed data to Redshift and EMR HDFS (Hadoop).
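A minimal sketch of the PySpark POC described above (deduplication, null handling, and a basic integrity check). The bucket paths and column names are hypothetical:

```python
# Hypothetical PySpark POC: deduplication, null handling, and a simple
# integrity check. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup_poc").getOrCreate()

df = spark.read.json("s3://example-bucket/raw/events/")   # hypothetical S3 path

# Drop exact duplicates, then duplicates on the business key.
deduped = df.dropDuplicates().dropDuplicates(["event_id"])

# Handle nulls: drop rows missing the key, backfill an optional field.
cleaned = (deduped
           .dropna(subset=["event_id"])
           .fillna({"country_code": "UNKNOWN"}))

# Simple integrity check: count rows with a non-positive amount.
bad_rows = cleaned.filter(F.col("amount") <= 0).count()
print(f"rows failing integrity check: {bad_rows}")

cleaned.write.mode("overwrite").parquet("s3://example-bucket/clean/events/")
```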
Data Engineer
Confidential, Orange, CA
Responsibilities:
- Wrote shell scripts for automating the process of data loading.
- Performed streaming data ingestion to the Spark distribution environment using Kafka.
- Wrote complex API queries into Apache Hive on Hortonworks Sandbox.
- Parsed the JSON response into a data frame using a schema containing country code, artist name, number of plays, and genre.
- Wrote producer/consumer scripts in Python to process the JSON responses (see the sketch after this list).
- Developed distributed query agents for performing distributed queries against shards.
- Wrote queries, stored procedures, functions, and triggers by using SQL.
- Supported development, testing, and operations teams during new system deployments.
- Implemented Spark using Scala and utilized Data Frames and Spark SQL API for faster processing of data.
- Created Hive queries to summarize and aggregate business queries by comparing Hadoop data with historical metrics.
- Worked closely with stakeholders and data scientists/data analysts to gather requirements and create an engineering project plan.
- Created ETL pipelines into the Hadoop file system (HDFS) and wrote Hive UDFs.
- Developed JDBC/ODBC connectors between Hive and Spark to transfer the newly populated data frames.
- Wrote Hive queries for analyzing data in the Hive warehouse using Hive Query Language.
- Used the Spark engine and Spark SQL for data analysis and provided the results to data scientists for further analysis.
- Implemented parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational KQL queries, including joins.
- Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration, and Migration.
- Created materialized views, partitions, tables, views, and indexes.
- Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL).
- Created and executed Hadoop ecosystem installation and configuration scripts with Google Dataproc on Google Cloud Platform, and documented the configuration.
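A minimal sketch of the producer/consumer pattern referenced above, assuming the kafka-python package; the broker address, topic, and payload fields are placeholders:

```python
# Hypothetical producer/consumer pair using the kafka-python package;
# broker address, topic, and payload fields are placeholders.
import json

from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]        # placeholder broker
TOPIC = "api_responses"             # placeholder topic

# Producer: serialize an API response dict as JSON and send it.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
response = {"country_code": "US", "artist_name": "Example", "plays": 42, "genre": "rock"}
producer.send(TOPIC, response)
producer.flush()

# Consumer: read JSON records back and process them.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,
)
for message in consumer:
    record = message.value
    print(record["artist_name"], record["plays"])
```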
Big Data Developer
Confidential, Bloomington, IN
Responsibilities:
- Configured, installed, and managed Hortonworks (HDP) Distributions.
- Enabled security on the cluster using Kerberos and integrated the clusters with LDAP at the enterprise level.
- Worked on tickets related to various Hadoop/Big data services, including HDFS, Yarn, Hive, Oozie, Spark, and Kafka.
- Worked on Hortonworks Hadoop distributions (HDP 2.5).
- Performed cluster tuning and ensured high availability.
- Established Cluster coordination services through Zookeeper and Kafka.
- Managed Hadoop clusters via the command line and the Hortonworks Ambari agent.
- Monitored multiple Hadoop clusters environments using Ambari.
- Worked with cluster users to ensure efficient resource usage in the cluster and alleviate multi-tenancy concerns.
- Managed and scheduled batch jobs on a Hadoop Cluster using Oozie.
- Performed cluster and system performance tuning.
- Ran multiple Spark jobs in sequence for processing data.
- Performed analytics on data using Spark.
- Moved data from Spark and persisted it to HDFS.
- Used Spark SQL and UDFs to perform transformations and actions on data residing in Hive, as sketched below.
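A minimal PySpark sketch of the Spark SQL / UDF pattern in the last bullet; the Hive table and column names are hypothetical:

```python
# Hypothetical sketch: register a Python UDF and use Spark SQL against a
# Hive table. Table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("hive_udf_example")
         .enableHiveSupport()          # read tables from the Hive metastore
         .getOrCreate())

# Simple UDF: normalize a free-text region field.
def normalize_region(region):
    return region.strip().upper() if region else "UNKNOWN"

spark.udf.register("normalize_region", normalize_region, StringType())

# Use the UDF in a Spark SQL query over a Hive table (placeholder name).
result = spark.sql("""
    SELECT normalize_region(region) AS region, COUNT(*) AS events
    FROM warehouse.events
    GROUP BY normalize_region(region)
""")
result.show()
```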
Hadoop Developer
Confidential, Rochester, NY
Responsibilities:
- Monitored Hadoop cluster using tools like Nagios, Ganglia, Ambari.
- Managed Hadoop clusters via Cloudera Manager, the command line, and the Hortonworks Ambari agent.
- Installed and configured Tableau Desktop to connect to the Hortonworks Hive framework (database), which contains the bandwidth data from the locomotive, through the Hortonworks ODBC connector for further analytics of the data.
- Developed Oozie workflow for scheduling and orchestrating the ETL process within the Cloudera Hadoop system.
- Automated workflows using shell scripts to pull data from various databases into Hadoop.
- Involved in cluster-level security: perimeter security (authentication via Cloudera Manager, Active Directory, Kerberos/Ranger), access (authorization and permissions via Sentry), visibility (audit and lineage via Navigator), and data (encryption at rest).
- Balanced the Hadoop cluster using the balancer utility to spread data across the cluster evenly.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Configured Yarn capacity scheduler to support various business SLAs.
- Implemented the Capacity Scheduler on the YARN ResourceManager to share cluster resources among users' MapReduce jobs.
MySQL Database Administrator
Confidential, Chicago, IL
Responsibilities:
- Designed and configured MySQL server cluster and managed each node on the Cluster.
- Responsible for MySQL installations, upgrades, performance tuning, etc.
- Collected and analyzed business requirements to derive conceptual and logical data models.
- Developed database architectural strategies at the modeling, design, and implementation stages.
- Translated a logical database design or data model into an actual physical database implementation.
- Mentored and worked with developers and analysts to review scripts and better querying.
- Performed security audit of MySQL internal tables and user access. Revoked access for unauthorized users.
- Set up replication for disaster and point-in-time recovery. Replication was used to segregate various types of queries and simplify backup procedures.
- Defined procedures to simplify future upgrades. Standardized all MySQL installs on all servers with custom configurations.
- Applied performance tuning to resolve issues with a large, high-volume, multi-server MySQL installation for a client's job-applicant site.
- Modified database schema as needed.
- Analyzed and profiled data for quality and reconciled data issues using SQL.
- Performed regular database maintenance.
- Created and implemented database standards and procedures for management.
- Prepared documentation and specifications.
Database Administrator
Confidential, Redmond, WA
Responsibilities:
- Created and maintained databases in SQL Server 2010.
- Designed and established SQL applications.
- Created tables and views in the SQL database.
- Supported schema changes and maintained the database to perform in optimal conditions.
- Created and managed tables, views, user permissions, and access control.
- Sent requests to source REST Based API from a Scala script via Kafka producer.
- Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.
- Wrote Hive queries for analyzing data in the Hive warehouse using Hive Query Language.
- Built Hive views on top of the source data tables and set up secured provisioning.
- Created and managed dynamic web parts.
- Customized library attributes, imported and exported existing data, and set up data connections.
- Provided a workflow and initiated the workflow processes.
- Worked on SharePoint Designer and InfoPath Designer and developed workflows and forms.