
Sr. Big Data Engineer Resume


San Jose, CA

SUMMARY

  • Experienced analytic data engineer with demonstrated professional experience in Big Data technologies including ingestion, data modelling, querying, processing and analysis, and in implementing enterprise-level systems spanning Big Data and data integration.
  • Hands-on experience with the Hadoop distribution platforms IBM BigInsights, Hortonworks and Cloudera, and with the GCP and AWS cloud platforms.
  • Expertise in Big Data technologies and Hadoop ecosystem components such as PySpark, Spark with Scala, HDFS, GPFS, Hive, Sqoop, Pig, Spark SQL, Kafka, Hue, YARN, Trifacta and EPIC data sources.
  • Good knowledge of Amazon AWS services such as EMR and EC2, which provide fast and efficient processing of Big Data, and of machine learning concepts.
  • Hands-on experience in building data pipelines and data marts using the Hadoop stack.
  • Hands-on experience in Apache Spark: creating RDDs and DataFrames, applying transformations and actions, and converting RDDs to DataFrames (a minimal sketch follows this list).
  • Experienced in data processing, including collecting, aggregating and moving data from various sources using Apache Flume and Kafka.
  • Experience in writing REST APIs in Python for large-scale applications.
  • Extensive experience working with AWS Cloud services and AWS SDKs for services such as AWS API Gateway, Lambda, S3, IAM and EC2.
  • Developed a data pipeline using Kafka and Spark Streaming to store data in HDFS and performed real-time analytics on the incoming data.
  • In-depth understanding of Apache Spark job execution components such as the DAG, executors, task scheduler, stages and Spark Streaming.
  • Experience in creating and executing data pipelines on the GCP and AWS platforms.
  • Hands-on experience in GCP: BigQuery, Cloud Functions and Dataproc.
  • Strong experience with the Control-M job scheduler, Apache Airflow, ESP and D-series; monitored jobs on an on-call basis to close incident tickets.
  • Hands-on experience with Amazon EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift, DynamoDB and other services of the AWS family.
  • Expertise in using CI/CD Jenkins pipelines to deploy code into production.
  • Designed and developed the program paradigm to support data collection and filtering processes in the data warehouse and Hadoop data mart.
  • Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions - Cloudera, Amazon EMR and Hortonworks.
  • Hands-on experience working with globally distributed teams in Europe, Mexico and India, and with the Agile implementation methodology.
  • Deep understanding of cyber security and pen testing, and of working with those teams to get approvals to deploy code into production.
  • Hands-on experience working in an Agile environment and following release management and golden rules.
  • Experience with version control tools such as Git and the UrbanCode Deploy (UCD) tool.
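
A minimal PySpark sketch of the RDD-to-DataFrame pattern referenced above; the sample records, field names and parsing logic are illustrative assumptions, not production code.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("rdd-to-dataframe-demo").getOrCreate()

# Hypothetical raw event lines; in practice these would come from HDFS, S3 or Kafka.
raw = spark.sparkContext.parallelize([
    "1,login,2023-01-01",
    "2,purchase,2023-01-02",
])

def parse(line):
    user_id, event, event_date = line.split(",")
    return Row(user_id=int(user_id), event=event, event_date=event_date)

# Transformation: parse each line into a Row; action: count the records.
rows = raw.map(parse)
print("record count:", rows.count())

# Convert the RDD of Rows into a DataFrame for Spark SQL-style processing.
events_df = spark.createDataFrame(rows)
events_df.show()
```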

TECHNICAL SKILLS

Programming Languages: Python, Scala, Shell Scripting

Big Data Eco-System: HDFS, GPFS, Hive, Sqoop, Spark, YARN, Pig, Kafka

Operating Systems: Windows, Linux (Centos, Ubuntu)

Hadoop Distributions: Hortonworks, Cloudera, IBM BigInsights

Databases: Hive, MySQL, Netezza, SQL Server

IDE Tools & Utilities: IntelliJ IDEA, Eclipse, PyCharm, Aginity Workbench, GIT

Markup Languages: HTML

Job Scheduler: Control-M, IBM Platform Symphony, Ambari, Apache Airflow

Cloud Computing Tools: GCP, AWS, Snowflake

Scrum Methodologies: Agile, Asana, Jira

Others: MS Office, RTC, ServiceNow, OPTIM, IGC (InfoSphere Governance Catalog), WinSCP, MS Visio

PROFESSIONAL EXPERIENCE

Confidential - San Jose, CA

Sr. Big Data Engineer

Responsibilities:

  • Built data streaming pipelines in Google Cloud Platform (IoT Core registry, Pub/Sub, Dataflow, BigQuery, Dataprep, Data Studio, AI Platform).
  • Developed applications and deployed them in Google Cloud Platform using Dataproc, Dataflow, Composer, BigQuery, Bigtable, Cloud Storage (GCS) and various operators in DAGs.
  • Migrated existing Hive data pipelines to the GCP platform.
  • Designed and implemented data transformation, ingestion and curation functions on GCP using GCP-native services and Python.
  • Optimized data pipelines for performance and cost for large scale data lakes.
  • Designed and automated BigQuery tables and Google Cloud Functions to enable reporting, analysis and modeling.
  • Used Node.js to write custom UDFs in BigQuery and used them in the data pipeline.
  • Used Python for scripting and for leveraging a wide range of technologies.
  • Worked on Developing and supporting databases and related ETL (batch and real-time processing)
  • Good understanding of issue triaging and resolution protocols in Big Data systems
  • Designed, tested and implemented data migration, ingestion, processing and quality frameworks able to handle hundreds of GBs of data using Airflow, PySpark, Python and BigQuery (a minimal Airflow sketch follows this list).
  • Conducted design and code reviews to ensure high-quality work is delivered.
  • Constantly engaged with data customers to get feedback on the data solutions developed and built documentation for data engineering best practices.
  • Developed Hive scripts, Hive UDFs and Python scripts, and used Spark (Spark SQL, Spark shell) to process data in Hortonworks.
  • Built a system for analyzing column names from all tables and identifying columns containing personal information across on-premises databases during data migration to GCP.
  • Designed and developed Scala code to pull data from cloud-based systems and apply transformations to it.
  • Used Sqoop to import data into HDFS from a MySQL database and vice versa.
  • Implemented optimized joins to perform analysis on different data sets using MapReduce programs.
  • Created a continuous integration and continuous delivery (CI/CD) pipeline on AWS that helps automate steps in the software delivery process.
  • Experienced in loading and transforming large data sets of structured, unstructured and semi-structured data in Hortonworks.
  • Implemented partitioning, dynamic partitions and buckets in Hive and Impala for efficient data access.
  • Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
  • Extensively worked on HiveQL and join operations, wrote custom UDFs, and have good experience optimizing Hive queries.
  • Performed analysis on data discrepancies and recommended solutions based upon root cause.
  • Designed and developed job flows using Apache Airflow.
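
A hypothetical Apache Airflow sketch of the kind of ingest-transform-load-to-BigQuery flow described above; the DAG id, schedule and callables are placeholders, and a real pipeline would use the appropriate Dataproc or BigQuery operators rather than these stub functions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    # Placeholder: pull raw files into a staging location.
    print("ingesting raw data")

def transform():
    # Placeholder: run a PySpark/Dataproc transformation job.
    print("transforming with PySpark")

def load_to_bigquery():
    # Placeholder: load curated output into a BigQuery table.
    print("loading into BigQuery")

with DAG(
    dag_id="example_gcp_curation_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)

    # Run the steps in sequence: ingest, then transform, then load.
    ingest_task >> transform_task >> load_task
```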

Environment: HDFS, MapReduce, Hive, Impala, Spark SQL, Spark Streaming, Sqoop, AWS (S3), GCP, BigQuery, JDBC, Java, Python, Scala, UNIX shell scripting, Git.

Confidential - St. Louis, MO

Data Engineer

Responsibilities:

  • Developed PySpark pipelines that transform raw data from several formats into Parquet files for consumption by downstream systems (see the sketch after this list).
  • Used AWS Glue services such as crawlers and ETL jobs to catalog all the Parquet files and transform the data according to business needs.
  • Worked with AWS services such as S3, Glue, EMR, SNS, SQS, Lambda, EC2, RDS and Athena to process data for downstream customers.
  • Created libraries and SDKs that help make JDBC connections to the Hive database and query the data using the Play framework and various AWS services.
  • Developed Spark scripts used to load data from Hive to Amazon RDS (Aurora) at a faster rate.
  • Created views on top of the data in Hive, which are used by the application via Spark SQL.
  • Applied security on data using Apache Ranger to set row level filters and group level policies on data.
  • Normalized the data according to business needs, including data cleansing, datatype modification and various transformations, using Spark, Scala and AWS EMR.
  • Worked on creating CI/CD pipelines using tools such as Jenkins and Rundeck, which are responsible for scheduling the daily jobs.
  • Developed Sqoop jobs responsible for importing data from Oracle to AWS S3.
  • Developed a utility that transforms and exports data from AWS S3 to AWS Glue and sends alerts and notifications to downstream systems (AI and data analytics) once the data is ready for use.
  • Worked on Groovy-scripted Jenkins CI/CD pipelines to automate Hadoop cluster scaling; provisioned servers and deployed features using Ansible playbooks.
  • Worked on Jenkins for CI/CD: pulled code from version control systems such as GitHub and built with Apache Maven and Gradle; built artifacts are stored in repositories such as Nexus.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Developed pipelines for auditing the metrics of all applications using AWS Lambda and Kinesis Firehose.
  • Developed an end-to-end pipeline that exports data from Parquet files in S3 to Amazon RDS.
  • Worked on optimizing Hive query performance using Hive LLAP and various other techniques.
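
An illustrative PySpark sketch of the raw-to-Parquet conversion and RDS/Aurora export pattern described above; the bucket names, schema and connection settings are assumptions, not the actual pipeline.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-parquet-export").getOrCreate()

# Read raw CSV files from a hypothetical S3 prefix and rewrite them as Parquet.
raw_df = spark.read.option("header", "true").csv("s3://example-bucket/raw/orders/")
raw_df.write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")

# Export the curated data to a hypothetical Aurora MySQL table over JDBC.
(raw_df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://example-aurora-endpoint:3306/analytics")
    .option("dbtable", "orders")
    .option("user", "example_user")
    .option("password", "example_password")
    .mode("append")
    .save())
```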

Environment: AWS, Spark, PySpark, Python, Hadoop, Hive, Sqoop, Play framework, Apache Ranger, S3, EMR, EC2, SNS, SQS, Lambda, Zeppelin, Kinesis, Athena, Jenkins CI/CD, Rundeck and AWS Glue.

Confidential - North Wales, PA

Data Engineer

Responsibilities:

  • Adept in the Agile project management methodology and the SDLC (Software Development Life Cycle): requirement gathering, analysis, design, development and testing of applications using the Agile and Scrum methodology.
  • Detailed understanding of the existing build system and of tools related to information on various products, releases and test results.
  • Expertise in process improvement, data extraction, data cleansing and data manipulation, and in normalization and denormalization concepts and principles.
  • Created ETL data pipelines in a state-of-the-art AWS environment (EC2, S3, Lambda, etc.) with AWS Glue.
  • Worked on Amazon Web Services (EC2, ELB, VPC, S3, CloudFront, Elasticsearch, IAM, RDS, Route 53, CloudWatch, SNS, Redshift, Kinesis, Lambda, Glue, SageMaker, Personalize).
  • Set up and configured AWS Virtual Private Cloud (VPC) components: subnets, internet gateways, security groups, EC2 instances, Elastic Load Balancers and NAT gateways for an Elastic MapReduce cluster.
  • Deployed, managed and operated scalable, highly available and fault-tolerant systems on AWS.
  • Actively managed the day-to-day AWS accounts, made recommendations on how best to support our global infrastructure, and interacted with developers and architects in cross-functional areas.
  • Hands-on experience with the AWS CLI and with designing scalable AWS solutions.
  • Strong experience in implementing data warehouse solutions in Redshift; worked on various projects to migrate data from on-premises databases to Redshift, RDS and S3.
  • Hands-on working knowledge of AWS SageMaker.
  • Developed and deployed a product recommendation system using AWS SageMaker based on Matrix Factorization and KNN algorithms.
  • Researched AWS Personalize extensively and deployed the event ingestion code snippet on our product website.
  • Assisted in implementation and maintenance of security and data encryption technologies.
  • Conducted complete analysis of database capacity and performance requirements.
  • Experience building reusable ETL components using Postgres and Snowflake.
  • Worked extensively on writing Snowpipe triggers to load Snowflake data automatically using Amazon SQS (Simple Queue Service) notifications for an S3 bucket (a minimal sketch follows this list).
  • Extensively wrote Postgres triggers to automate the ETL process in the Postgres database.
  • Extracted data from legacy systems into the staging area using ETL jobs and SQL queries.
  • Performed quality assurance testing and automated error-record detection on Postgres.
  • Worked closely with the developers on API development using Node.js with the Express framework and wrote a service for data obfuscation.
  • Performed unit and integration testing on different API components using Mocha and Chai.
  • Utilized Google Analytics to track visitor flow and interaction throughout the company website.
  • Linked local Jupyter notebooks to the Google Analytics platform to perform analysis on the customer interaction data.
  • Extensively researched and implemented various regression, classification and clustering Machine Learning algorithms in Jupyter notebooks.
  • Created a customer analytics metrics dashboard on Google Cloud Platform using BigQuery.
  • Strategic expertise in design of experiments, data collection, data analysis and visualization using various tools and technologies.
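
A hedged sketch of the Snowpipe auto-ingest setup referenced above: the account, warehouse, stage, table and pipe names are placeholders. The pipe's SQS notification channel (shown by SHOW PIPES) is what the S3 bucket event notification would be pointed at so that loads run automatically.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    user="example_user",
    password="example_password",
    account="example_account",
    warehouse="example_wh",
    database="analytics",
    schema="staging",
)

# Create a pipe that auto-ingests staged Parquet files into a target table.
create_pipe_sql = """
CREATE PIPE IF NOT EXISTS staging.orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO staging.orders
  FROM @staging.orders_stage
  FILE_FORMAT = (TYPE = 'PARQUET')
"""

cur = conn.cursor()
try:
    cur.execute(create_pipe_sql)
    # SHOW PIPES exposes the notification channel ARN used for the SQS setup.
    cur.execute("SHOW PIPES LIKE 'orders_pipe' IN SCHEMA staging")
    for row in cur.fetchall():
        print(row)
finally:
    cur.close()
    conn.close()
```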

Environment: Apache Hadoop, EC2, ELB, VPC, S3, CloudFront, IAM, RDS, Route 53, AWS CloudWatch, SNS, AWS Lambda, AWS Glue, AWS SageMaker, AWS Personalize, Redshift, Python, Maven, Git, MySQL, PostgreSQL, Oozie, Sqoop, Flume, JDK 1.8, Agile and Scrum development process, Google Analytics, BigQuery, Dialogflow.

Confidential - Foster City, CA

Data Engineer

Responsibilities:

  • Worked on Hadoop ecosystem components including Hive, HBase, Oozie, Pig, ZooKeeper, Spark Streaming and MCS (MapR Control System) with the MapR distribution.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Built code for real-time data ingestion using Java, MapR Streams (Kafka) and Storm.
  • Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
  • Involved in the development of the Hadoop system and in improving multi-node Hadoop cluster performance.
  • Worked on analyzing Hadoop stack and different big data tools including Pig, Hive, HBase database and Sqoop.
  • Developed a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS.
  • Worked with different data sources such as Avro data files, XML files, JSON files, SQL Server and Oracle to load data into Hive tables.
  • Used Spark to create structured data from large amounts of unstructured data from various sources (a minimal sketch follows this list).
  • Used Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Performed transformations, cleaning and filtering on imported data using Hive, MapReduce and Impala, and loaded the final data into HDFS.
  • Developed Python scripts to find vulnerabilities with SQL Queries by doing SQL injection.
  • Experienced in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Specified the cluster size, resource pool allocation and Hadoop distribution by writing the specification text in JSON file format.
  • Imported weblogs and unstructured data using Apache Flume and stored the data in a Flume channel.
  • Exported event weblogs to HDFS by creating an HDFS sink that directly deposits the weblogs in HDFS.
  • Used RESTful web services with MVC for parsing and processing XML data.
  • Managed the OpenShift cluster, which involved scaling the Amazon Web Services application nodes up and down.
  • Collaborated with and communicated the results of analysis to decision makers by presenting actionable insights using visualization charts and dashboards in Amazon QuickSight.
  • Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape.
  • Worked on various data modeling concepts such as star schema and snowflake schema in the project.
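
An illustrative PySpark sketch of structuring semi-structured weblog data of the kind described above; the log format, HDFS path and regular expressions are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("weblog-structuring").getOrCreate()

# Hypothetical Flume-delivered weblogs in HDFS, one access-log line per record.
logs = spark.read.text("hdfs:///data/weblogs/")

# Pull a few common access-log fields out of the raw line with regular expressions.
structured = logs.select(
    regexp_extract("value", r"^(\S+)", 1).alias("client_ip"),
    regexp_extract("value", r"\[(.*?)\]", 1).alias("timestamp"),
    regexp_extract("value", r'"(\S+)\s(\S+)', 2).alias("request_path"),
    regexp_extract("value", r"\s(\d{3})\s", 1).alias("status_code"),
)

# Register the structured data as a temporary view and query it with Spark SQL.
structured.createOrReplaceTempView("weblogs")
spark.sql("SELECT status_code, COUNT(*) AS hits FROM weblogs GROUP BY status_code").show()
```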

Environment: Hadoop, Apache Spark, HDFS, Hive, Spark SQL, PySpark, Python, Django, Oracle SQL, Tableau, AWS, Hortonworks and Cloudera Hadoop distributions, Pig, HBase, Linux, XML, ZooKeeper, Snowflake.

Confidential

ETL Developer

Responsibilities:

  • Performed extraction, transformation, and data loading using stored procedures in the database; involved in logical and physical modeling of the drugs database.
  • Based on the requirements, created functional design documents and technical design specification documents for ETL.
  • Created tables, views, indexes, sequences, and constraints.
  • Developed stored procedures, functions, and database triggers using PL/SQL according to specific business logic (a hedged sketch follows this list).
  • Transferred data to the database using SQL*Loader.
  • Involved in testing of Stored Procedures and Functions.
  • Designed and developed table structures, stored procedures, and functions to implement business rules.
  • Extracted and loaded data from legacy systems, Oracle, and SQL Server sources.
  • Involved in the design and development of data validation, load processes, and error control routines.
  • Analyzed the database for performance issues and conducted detailed tuning activities for improvement.
  • Generated monthly and quarterly drug inventory/purchase reports.
  • Coordinated database requirements with Oracle programmers and wrote reports for sales data.
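
A hedged Python sketch of invoking a PL/SQL stored procedure of the kind described above, for example while testing the load logic; the connection details, procedure name and parameters are hypothetical, and the original work was done directly in PL/SQL rather than through cx_Oracle.

```python
import cx_Oracle

# Hypothetical Oracle connection; credentials and DSN are placeholders.
conn = cx_Oracle.connect(user="example_user",
                         password="example_password",
                         dsn="example_host:1521/ORCLPDB1")
cur = conn.cursor()

# Output bind variable to capture the number of rows loaded by the procedure.
rows_loaded = cur.var(int)

# Call a hypothetical load procedure that validates and inserts staged drug
# inventory records, returning a row count through the OUT parameter.
cur.callproc("load_drug_inventory", ["2023-Q1", rows_loaded])
print("rows loaded:", rows_loaded.getvalue())

conn.commit()
cur.close()
conn.close()
```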

Environment: Java 1.5, JavaScript, Struts 2.0, Hibernate 3.0, Ajax, JAXB, XML, XSLT, Eclipse, Tomcat.
