
Sr. Big Data Engineer Resume


Plano, TX

SUMMARY

  • Experienced Data Engineer with 8+ years of experience in Big Data using the Hadoop framework and related technologies, including HDFS, HBase, Spark, Kafka, MapReduce, Hive, Pig, Flume, Oozie, Sqoop, Impala, MapR-DB, Drill, and ZooKeeper.
  • Hands-on experience in business process analysis; cloud data products and applications built on Amazon Web Services (AWS); data lakes using Snowflake SQL; programming in Python; and reporting tools such as Power BI and Tableau, working in Agile methodology across the Big Data ecosystem as well as Java and .NET technologies.
  • Worked with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, Kinesis, CloudFront, CloudWatch, SNS, SES, SQS, and other services of the AWS family.
  • 4+ years of experience in Big Data analytics using various Hadoop ecosystem tools and the Spark framework; currently working extensively with Spark and Spark Streaming using Scala as the main programming language.
  • Good understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode.
  • Strong understanding of Hadoop daemons and MapReduce concepts.
  • Hands-on experience in installing, configuring, and using Hadoop ecosystem components such as MapReduce, Hive, Pig, Sqoop, Flume, and Oozie.
  • Experience in Apache Spark, Spark Streaming, Spark SQL, and NoSQL databases like HBase, Cassandra, and MongoDB.
  • Responsible for migrating applications running on-premises onto the Azure cloud.
  • Experience installing, configuring, and maintaining Apache Hadoop clusters for application development, along with Hadoop tools like Sqoop, Hive, Pig, Flume, HBase, Kafka, Hue, Storm, ZooKeeper, Oozie, Cassandra, and Python.
  • Worked with major distributions like Cloudera (CDH 3 & 4), Hortonworks, and AWS; also worked on UNIX and DWH in support of various distributions.
  • Extensively used Informatica client tools: Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, ETL transformations, Informatica Repository Manager, Informatica Server Manager, Workflow Manager, and Workflow Monitor.
  • Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions: Cloudera, Amazon EMR, and Azure.
  • Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, and DynamoDB.
  • Experience with JIRA, tracking test results and interacting with developers to resolve issues.
  • Experience in NoSQL column-oriented databases like Cassandra and their integration with Hadoop clusters.
  • Experience developing data pipelines that use Kafka to land data in HDFS (see the ingestion sketch at the end of this summary).
  • Good knowledge of loading data from Oracle and MySQL databases into HDFS using Sqoop (structured data) and Flume (log files and XML).
  • Extensive experience in developing Pig Latin scripts and using Hive Query Language for data analytics.
  • Experienced in writing custom Hive UDFs to incorporate business logic into Hive queries.
  • Experienced in building data pipelines using Kafka and Akka to handle terabytes of data.
  • Experience with AWS components like Amazon EC2 instances, S3 buckets, CloudFormation templates, and the Boto library.
  • Managed migration of on-prem servers to AWS by creating golden images for upload and deployment.
  • Created AWS VPC networks for the installed instances and configured security groups and Elastic IPs accordingly.
  • Developed AWS CloudFormation templates to create custom-sized VPCs, subnets, EC2 instances, ELBs, and security groups.
  • Implemented real-time streaming ingestion using Kafka and Spark Streaming.
  • Loaded data using Spark Streaming with Scala and Python.
  • Knowledge of analyzing data interactively using Apache Spark and Apache Zeppelin.
  • Good understanding of Apache Storm-Kafka pipelines.
  • Good experience in optimizing MapReduce jobs using mappers, reducers, combiners, and partitioners to deliver the best results for large datasets.
  • Extensive experience in working with application servers like WebSphere, WebLogic, and Tomcat.
  • Good knowledge of job/workflow scheduling and monitoring tools like Oozie and ZooKeeper.
  • Implemented data pipelines by chaining multiple mappers using ChainMapper.
  • Hands-on experience in application development using core Scala, RDBMS, and Linux shell scripting; developed UNIX shell scripts to automate various processes.
  • Proficiency in using BI tools like Tableau/Pentaho.
  • Experience in understanding the security requirements for Hadoop and integrating it with a Key Distribution Center (KDC).
  • Extensive experience using RDBMS database applications in Oracle, MS Access, and SQL Server.
  • Detailed understanding of the Software Development Life Cycle (SDLC) and sound knowledge of project implementation methodologies, including Scrum, Waterfall, and Agile.
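
The following is a minimal illustrative sketch of the Kafka-to-HDFS ingestion pattern referenced above, written as a PySpark Structured Streaming job (assuming the spark-sql-kafka connector is on the classpath). The broker addresses, topic name, and HDFS paths are hypothetical placeholders, not details of any specific engagement listed here.

    # Minimal sketch: stream events from a Kafka topic and land them in HDFS as Parquet.
    # Brokers, topic, and paths below are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (SparkSession.builder
             .appName("kafka-to-hdfs-ingest")
             .getOrCreate())

    # Read the raw event stream from Kafka.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
              .option("subscribe", "events-topic")                             # placeholder topic
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers the payload as binary; keep it as a string column plus the event timestamp.
    payload = events.select(col("value").cast("string").alias("raw_event"), col("timestamp"))

    # Continuously append the stream to HDFS as Parquet, with checkpointing for recovery.
    query = (payload.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/raw/events")                 # placeholder output path
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()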

TECHNICAL SKILLS

Big Data technologies: Apache Hadoop, Hive, Sqoop, HBase, Spark, ZooKeeper, Oozie, NiFi, Ranger, Kafka

Programming Languages: Java, Scala, Python, SQL, JavaScript, Bash, PySpark

Methodologies: Agile, RAD, V-model, Waterfall

Databases: Oracle, MySQL, HBase, MS SQL Server, MongoDB, Teradata, Snowflake

Web Technologies: HTML, ASP, XML, SOAP, XSLT, JavaScript

IDE’s: Eclipse, IntelliJ, Visual Studio

Build tools: Maven, Ant, Jenkins

Web services: SOAP & RESTful Web Services

Portals/Application servers: WebLogic, WebSphere Application Server, WebSphere Portal Server, JBoss

Cloud technologies: Amazon Web Services (AWS), Azure

Monitoring Tools: Splunk 8.2.6, Nagios

Operating System: Windows, Ubuntu, Red Hat Linux, CentOS.

BI Tools: Power BI

PROFESSIONAL EXPERIENCE

Confidential, Plano, TX

Sr. Big Data Engineer

Responsibilities:

  • Developed ETL processes (Data Stage Open Studio) to load data from multiple sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
  • Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark.
  • Developed ETL data pipelines using Spark, Spark Streaming, and Scala.
  • Responsible for building data pipelines that load data from web servers using Sqoop, Kafka, and the Spark Streaming API.
  • Experience working on the Snowflake data warehouse.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML, and Power BI.
  • Used Azure Databricks for a fast, easy, and collaborative spark-based platform on Azure.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Developed various UDFs in MapReduce and Python for Pig and Hive (see the UDF sketch after this list).
  • Worked on performing transformations and actions on RDDs and Spark Streaming data with Scala.
  • Implemented machine learning algorithms using Spark with Python.
  • Defined job flows and developed simple to complex MapReduce jobs as per the requirements.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Developed Pig UDFs for manipulating the data according to business requirements and worked on developing custom Pig loaders.
  • Worked on scalable distributed data systems using the Hadoop ecosystem in AWS EMR and MapR (MapR Data Platform).
  • Developed analytical components using Scala, Spark, and Spark Streaming.
  • Installed Oozie workflow engine to run multiple Hive and Pig jobs.
  • Managed resources and scheduling across the cluster using Azure Kubernetes Service.
  • Designed and developed Apache NiFi jobs to get the files from transaction systems into the data lake raw zone.
  • Developed Pig Latin scripts for the analysis of semi-structured data.
  • Experienced with the Databricks platform, following best practices for securing network access to cloud applications.
  • Used Hive and created Hive tables and was involved in data loading and writing Hive UDFs.
  • Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Used Azure Data Factory, the SQL API, and the MongoDB API, and integrated data from MongoDB, MS SQL, and the cloud (Blob Storage, Azure SQL DB, Cosmos DB).
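
As a concrete illustration of the Python UDF work mentioned above, here is a minimal sketch of registering a Python function as a Spark SQL UDF for use against Hive tables. The masking rule, table, and column names are hypothetical assumptions, not the project's actual business logic.

    # Minimal sketch: register a Python UDF and use it from Spark SQL against a Hive table.
    # The masking rule and the table/column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .appName("hive-udf-example")
             .enableHiveSupport()
             .getOrCreate())

    def mask_account(account_id):
        # Keep only the last four characters and mask the rest.
        if account_id is None:
            return None
        return "*" * max(len(account_id) - 4, 0) + account_id[-4:]

    # Register the function so it can be called from SQL as well as the DataFrame API.
    spark.udf.register("mask_account", mask_account, StringType())

    # Example usage against a hypothetical Hive table.
    spark.sql("""
        SELECT mask_account(account_id) AS masked_id, txn_amount
        FROM transactions_raw
    """).show()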

Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, Azure, Azure Databricks, Azure data grid, Azure Synapse Analytics, Azure Data Catalog, ETL, Pig, PySpark, UNIX, Linux, Tableau, Teradata, Snowflake, Sqoop, Hue, Oozie, Java, Scala, Python, Git, GitHub

Confidential, Foster city, CA

Sr. Data Engineer

Responsibilities:

  • Translated business and data requirements into logical data models in support of the Enterprise Communication Data Model, OLTP, OLAP, Operational Data Store (ODS), and analytical systems.
  • Implemented a core framework leveraging Spark that can handle the whole pipeline from a single JSON config.
  • Implemented optimal solutions for migrating the data into Data Lake.
  • Developed Kafka producers and consumers, HBase clients, and Spark jobs, along with components on HDFS and Hive.
  • Worked on analyzing source systems and their connectivity, discovery, data profiling and data mapping.
  • Created Hive tables, developed and modified HQL queries to solve the issues reported by QA.
  • Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, logistic regression, and gradient boosting machines, to build predictive models using the scikit-learn package in Python.
  • Implemented the dimensional model (logical and physical data modeling) in the existing architecture using ER/Studio, and developed, managed, and validated existing data models, including logical and physical models of the data warehouse and source systems, utilizing a 3NF model.
  • Updated Python scripts to match training data with our database stored in AWS Cloud Search, so that we would be able to assign each document a response label for further classification.
  • Worked with databases including Oracle, XML, DB2, Teradata 14.1, Netezza, and SQL Server, as well as Big Data and NoSQL stores such as MongoDB and Cassandra.
  • Extracted data from MySQL and AWS Redshift into HDFS using Sqoop and worked with the Hadoop ecosystem covering HDFS, HBase, YARN, and MapReduce.
  • Designed source-to-target mappings, primarily from flat files, SQL Server, Oracle 11g, and Netezza, using Informatica PowerCenter 9.6.
  • Designed logical data models and physical/conceptual data documents between source systems and the target data warehouse, and worked on normalization and de-normalization techniques for both OLTP and OLAP systems.
  • Developed PySpark code for AWS Glue jobs and for EMR, performing data extraction, aggregation, and consolidation of Adobe data within AWS Glue (see the Glue sketch after this list).
  • Worked on data modeling using dimensional modeling, star/snowflake schemas, fact and dimension tables, and physical and logical data modeling; created, managed, and modified logical and physical data models using a variety of data modeling philosophies and techniques, including Inmon and Kimball.
  • Used external loaders like MultiLoad, TPump, and FastLoad to load data into Teradata across analysis, development, testing, implementation, and deployment, and worked on Teradata SQL queries, Teradata indexes, and utilities such as MLoad, TPump, FastLoad, and FastExport.
  • Developed MapReduce jobs in Java for data cleaning and preprocessing, and developed UDFs in Java as needed for use in Pig and Hive queries.
  • Used the Spark framework for the development of Spark jobs on a MapR cluster to perform analytics on data in Hive.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, and Java.
  • Built and maintained scalable data pipelines using the Hadoop ecosystem and other open-source components like Hive and HBase.
  • Implemented APIs for different vendors to make sales data available for marketing and investment purposes.
  • Extensively used Agile methodology as the organization standard to implement the data models.
  • Performed data reconciliation between integrated systems, carried out extensive data validation with SQL queries, and was involved in regression testing and investigating data quality issues.
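
To illustrate the AWS Glue work referenced above, the following is a minimal PySpark Glue job skeleton that reads a table from the Glue Data Catalog, aggregates it, and writes Parquet to S3. The catalog database, table name, and S3 path are hypothetical placeholders.

    # Minimal sketch of an AWS Glue PySpark job: catalog read -> aggregate -> S3 write.
    # The catalog database, table name, and S3 target are hypothetical placeholders.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the source table from the Glue Data Catalog.
    source_dyf = glueContext.create_dynamic_frame.from_catalog(
        database="analytics_db",            # placeholder database
        table_name="adobe_clickstream")     # placeholder table

    # Aggregate with the DataFrame API, then convert back to a DynamicFrame.
    daily_counts = source_dyf.toDF().groupBy("event_date", "page_id").count()
    result_dyf = DynamicFrame.fromDF(daily_counts, glueContext, "daily_counts")

    # Write the consolidated output to S3 as Parquet.
    glueContext.write_dynamic_frame.from_options(
        frame=result_dyf,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/adobe_daily/"},  # placeholder
        format="parquet")

    job.commit()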

Environment: MAPR 5.2, Map Reduce V2, YARN, HDFS, Hive, Pig, Tez, Sqoop, Kafka, NoSQL, HBase, Sqoop Connectors, Spark, XML.

Confidential, Atlanta, GA

Big Data Engineer

Responsibilities:

  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Informatica BDM, T-SQL, Spark SQL, and Azure Data Lake Analytics.
  • Responsible for ingesting large volumes of user behavioral data and customer profile data into the analytics data store.
  • Developed custom multi-threaded Java-based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and data warehouses.
  • Developed PySpark and Scala-based Spark applications for performing data cleaning, event enrichment, data aggregation, de-normalization, and data preparation needed for machine learning and reporting teams to consume.
  • Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase and MongoDB.
  • Worked on troubleshooting Spark applications to make them more error-tolerant.
  • Worked on fine-tuning Spark applications to improve the overall processing time for the pipelines.
  • Experienced in developing scripts for doing transformations using Scala.
  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics (see the producer sketch after this list).
  • Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
  • Good experience with continuous Integration of applications using Bamboo.
  • Worked extensively with Sqoop for importing data from Oracle.
  • Created a private cloud using Kubernetes that supports DEV, TEST, and PROD environments.
  • Designed and customized data models for a data warehouse supporting data from multiple sources in real time.
  • Wrote HBase bulk-load jobs to load processed data into HBase tables by converting it to HFiles.
  • Experience working with EMR clusters in the AWS cloud and with S3, Redshift, and Snowflake.
  • Wrote Glue jobs to migrate data from HDFS to the S3 data lake.
  • Experience binding services in the cloud; installed Pivotal Cloud Foundry (PCF) on Azure to manage the containers created by PCF.
  • Involved in creating Hive tables, loading and analyzing data using hive scripts.
  • Implemented Partitioning, Dynamic Partitions, Buckets in Hive.
  • Designed and documented operational problems following standards and procedures using JIRA.
  • Used Azure Synapse to manage processing workloads and served data for BI and prediction needs.
  • Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
  • Used Reporting tools like Tableau to connect with Impala for generating daily reports of data.
  • Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
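
As an illustration of the REST-to-Kafka producers mentioned above, here is a minimal sketch using the kafka-python client and the requests library. The API endpoint, broker list, topic name, and poll interval are hypothetical assumptions.

    # Minimal sketch: poll an external REST API and publish each record to a Kafka topic.
    # Endpoint, brokers, topic, and poll interval below are hypothetical placeholders.
    import json
    import time

    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092", "broker2:9092"],            # placeholder brokers
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    API_URL = "https://api.example.com/v1/events"   # placeholder endpoint
    TOPIC = "external-events"                       # placeholder topic

    while True:
        response = requests.get(API_URL, timeout=10)
        response.raise_for_status()
        for record in response.json():
            # Send each event to the topic; keying/partitioning is left to Kafka defaults here.
            producer.send(TOPIC, value=record)
        producer.flush()
        time.sleep(30)  # poll interval; tune to the source API's rate limits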

Environment: Hadoop, Spark, Scala, Python, Hive, HBase, MongoDB, Sqoop, Oozie, Kafka, Snowflake, Amazon EMR, Glue, YARN, JIRA, Amazon S3, Shell Scripting, SBT, GitHub, Maven, Azure.

Confidential

Big Data Engineer/ETL Developer

Responsibilities:

  • Migrated existing Hive-based ETL pipelines to PySpark and developed new pipelines directly in PySpark.
  • Involved in creating shell scripts for ingesting SFTP files from external vendor servers.
  • Developed shell scripts to run the Oozie workflows that execute the job runs.
  • Involved in data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for downstream model learning and reporting.
  • Wrote complex Hive queries to extract data from heterogeneous sources and persist it into HDFS.
  • Created Hive tables and loaded and analyzed data using Hive scripts; implemented partitioning, dynamic partitions, and buckets in Hive (see the sketch after this list).
  • Prepared Hive SQL to perform data integrity and data quality checks.
  • Ran PySpark jobs to load data into Hive tables for test data validation and used Hue to check the Hive tables loaded by the daily job.
  • Experience binding services in the cloud; installed Pivotal Cloud Foundry (PCF) on Azure to manage the containers created by PCF.
  • Designed the distribution strategy for tables in Azure SQL Data Warehouse.
  • Handled large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
  • Troubleshooting Spark applications for improved error tolerance.
  • Fine-tuned Spark applications/jobs to improve efficiency and overall processing time for the pipelines.
  • Created a Kafka producer API to send live-stream data into various Kafka topics.
  • Developed Spark-Streaming applications to consume the data from Kafka topics and to insert the processed streams to HBase.
  • In accordance with architectural layers, generated DDL and created the tables and views.
  • Expertise in extracting, transforming, and loading data from JSON, DB2, SQL Server, Excel, and flat files.
  • Implemented CI/CD using in-house automation frameworks and GitHub as the VCS.
  • Used Spark SQL on top of the Spark engine for querying data and implemented Spark RDDs in Scala.
  • Created schema checks for Hive tables against the corresponding HBase tables.
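
The following is a minimal sketch of the Hive partitioning and dynamic-partition loading pattern mentioned above, issued through Spark SQL with Hive support; a bucketed variant would add a CLUSTERED BY (...) INTO N BUCKETS clause to the same DDL. Table and column names are hypothetical placeholders.

    # Minimal sketch: create a partitioned Hive table and load it with dynamic partitions.
    # Table and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-example")
             .enableHiveSupport()
             .getOrCreate())

    # Partitioned, ORC-backed target table.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_curated (
            order_id    STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        STORED AS ORC
    """)

    # Allow partitions to be derived from the data itself (dynamic partitioning).
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    # Load from a raw staging table; the partition column comes last in the SELECT.
    spark.sql("""
        INSERT OVERWRITE TABLE sales_curated PARTITION (order_date)
        SELECT order_id, customer_id, amount, order_date
        FROM sales_raw
    """)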

Environment: Spark, Python, Hive, HDFS, Kafka, Oozie, HBase, Scala, MapReduce, UNIX, Shell scripts, GitHub, SQL, Confluence

Confidential

Data Engineer

Responsibilities:

  • Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of structured and semi-structured data, both batch and real-time streaming.
  • Applied efficient and scalable data transformations on the ingested data using the Spark framework.
  • Built Spark Scripts by utilizing Scala shell commands depending on the requirement.
  • Worked closely with machine learning teams to deliver feature datasets in an automated manner to help them with model training and model scoring.
  • Gained good knowledge in troubleshooting and performance tuning Spark applications and Hive scripts to achieve optimal performance.
  • Developed various custom UDFs in Spark for performing transformations on date fields and complex string columns and for encrypting PII fields, etc.
  • Performed Spark join optimizations, troubleshooting, monitored, and wrote efficient codes using Scala.
  • Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
  • Wrote complex Hive scripts for performing various data analyses and creating reports requested by business stakeholders.
  • Used AWS Simple Workflow (SWF) for automating and scheduling our data pipelines.
  • Utilized the AWS Glue Data Catalog as a common metastore between EMR clusters and the Athena query engine, with S3 as the storage layer for both.
  • Worked extensively on migrating our existing on-prem data pipelines to the AWS cloud for better scalability and infrastructure maintenance.
  • Worked extensively on automating the creation/termination of EMR clusters as part of starting the data pipelines (see the boto3 sketch after this list).
  • Worked extensively on migrating/rewriting existing Oozie jobs to AWS Simple Workflow.
  • Loaded the processed data into Redshift tables for allowing downstream ETL and reporting teams to consume the processed data.
  • Good experience working with analysis tools like Tableau and Splunk for regression analysis, pie charts, and bar graphs.
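
To illustrate the EMR automation referenced above, here is a minimal boto3 sketch that creates a transient cluster for a pipeline run and terminates it afterwards. The region, release label, instance types/counts, roles, and log path are hypothetical placeholders.

    # Minimal sketch: create and terminate a transient EMR cluster with boto3.
    # Region, release label, instance sizing, roles, and log path are hypothetical placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

    def start_cluster():
        """Create a transient EMR cluster for a pipeline run and return its cluster id."""
        response = emr.run_job_flow(
            Name="pipeline-transient-cluster",          # placeholder name
            ReleaseLabel="emr-5.30.0",                  # placeholder release
            Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
            LogUri="s3://example-bucket/emr-logs/",     # placeholder log path
            Instances={
                "InstanceGroups": [
                    {"Name": "Master", "InstanceRole": "MASTER",
                     "InstanceType": "m5.xlarge", "InstanceCount": 1},
                    {"Name": "Core", "InstanceRole": "CORE",
                     "InstanceType": "m5.xlarge", "InstanceCount": 2},
                ],
                "KeepJobFlowAliveWhenNoSteps": True,
                "TerminationProtected": False,
            },
            JobFlowRole="EMR_EC2_DefaultRole",          # default EC2 instance profile
            ServiceRole="EMR_DefaultRole",              # default EMR service role
        )
        return response["JobFlowId"]

    def stop_cluster(cluster_id):
        """Terminate the cluster once the pipeline run has finished."""
        emr.terminate_job_flows(JobFlowIds=[cluster_id])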

Environment: AWS Cloud, S3, EMR, Redshift, Athena, Scala, Spark, Kafka, Hive, Yarn, HBase, Jenkins, Docker
