
Sr. Big Data Engineer Resume


Columbus, Indiana

SUMMARY

  • 7+ years of experience in application development/architecture and data analytics, specializing in web, client-server, and Big Data applications, with expertise in Java, Scala, Python, Spark, Hadoop MapReduce, Pig, Hive, Oozie, Sqoop, and NoSQL databases.
  • Strong experience with frameworks and platforms such as Spring MVC, Struts, Hibernate, Spark, HBase, Hive, Hadoop, Sqoop, and HDFS. Expertise in developing Spark code using Scala and Spark SQL/Streaming for faster testing and data processing.
  • Experience using Sqoop to import data into HDFS from RDBMS. Experienced with Spark in improving the performance and optimization of existing Hadoop algorithms using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and the MapReduce programming paradigm. Good understanding of cloud configuration on Amazon Web Services (AWS).
  • Experienced working with Hadoop Big Data technologies (HDFS and MapReduce programs), Hadoop ecosystem components (HBase, Hive), and the NoSQL database MongoDB.
  • Experienced with the column-oriented NoSQL database HBase and with Hive. Extensive experience working with semi-structured and unstructured data by implementing complex MapReduce programs using design patterns.
  • Knowledge of BI/DW solutions (ETL, OLAP, Data Mart), Informatica, and BI reporting tools such as Tableau and QlikView. Expertise in loading data from different sources (Teradata and DB2) into HDFS using Sqoop and into partitioned Hive tables.
  • Experienced in Hadoop cluster maintenance, including data and metadata backups, file system checks, commissioning and decommissioning nodes, and upgrades.
  • Wrote Python scripts to manage AWS resources through API calls using the Boto SDK and worked with the AWS CLI (see the sketch after this list).
  • In-depth knowledge of AWS cloud services such as compute, networking, storage, and Identity & Access Management (IAM).
  • Hands-on experience configuring network architecture on AWS with VPCs, subnets, Internet Gateways, NAT, and route tables.
  • Significant expertise analyzing massive data sets, producing Pig scripts and Hive queries, and substantial experience working with structured data using HiveQL, join operations, custom UDFs, and Hive query optimization.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models using Data warehouse techniques, developing Data Mining and reporting solutions that scale across a massive volume of structured and unstructured data.
  • Experienced in importing and exporting data between HDFS and relational databases using Sqoop, with expertise in job workflow scheduling and monitoring tools such as Oozie and TWS (Tivoli Workload Scheduler).
  • Involved in integrating applications with tools like TeamCity, GitLab, Bitbucket, and JIRA for issue and story tracking.
  • Extensive experience in SQL, Stored Procedures, Functions, and Triggers with databases such as Oracle, IBM DB2, and MS SQL Server.
  • Familiar with Angular 2 and TypeScript for building application components. Hands-on experience in system and network administration, and knowledge of cloud technology and distributed computing.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover customer usage patterns.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse) and processed the data in Azure Databricks.
  • Experienced in automating, configuring, and deploying instances on AWS, Azure, and data-center environments; familiar with EC2, CloudWatch, CloudFormation, and managing security groups on AWS.
  • Private cloud environment: leveraged AWS and Puppet to rapidly provision internal computer systems for various clients.
  • Developed Puppet modules and roles/profiles for installing and configuring software required for various applications/blueprints.
  • Excellent communication, presentation, and interpersonal skills. Ability to work independently or in a group with minimal supervision to meet deadlines.
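
A minimal sketch of the kind of Boto-driven AWS resource management described above (not project code; the region, bucket, and key names are hypothetical):

```python
# Sketch only: list stopped EC2 instances and upload a file to S3 with Boto3.
# Region, bucket, and key names below are hypothetical placeholders.
import boto3


def list_stopped_instances(region="us-east-1"):
    """Return the IDs of all stopped EC2 instances in the given region."""
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["stopped"]}]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def upload_report(local_path, bucket="example-analytics-bucket", key="reports/daily.csv"):
    """Upload a local report file to S3."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)


if __name__ == "__main__":
    print(list_stopped_instances())
```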

TECHNICAL SKILLS

Big Data: Apache Spark, Hadoop, HDFS, Map Reduce, Pig, Hive, Sqoop, Oozie, Flume, HBase, YARN, Cassandra, Phoenix, Airflow

Frameworks: Hibernate, Spring, Cloudera CDH, Hortonworks HDP, MapR

Programming & Scripting Languages: Java, Python, R, C, C++, HTML, JavaScript, XML, Git

Database: Oracle 10g/11g, PostgreSQL, DB2, SQL Server, MySQL, Redshift

NoSQL Database: HBase, Cassandra, MongoDB

IDE: Eclipse, NetBeans, Maven, STS (Spring Tool Suite), Jupyter Notebook

ETL Tools: Pentaho, Informatica, Talend

Reporting Tools: Tableau, Power BI

Operating Systems: Windows, UNIX, Linux, Sun Solaris

Testing Tools: JUnit, MRUnit

AWS: EMR, Glue, Athena, DynamoDB, Redshift, RDS, Data Pipeline, Lake Formation, S3, IAM, CloudFormation, EC2, ELB/CLB

Azure: Data Lake, Data Factory, SQL Data Warehouse, Data Lake Analytics, Databricks, and other Azure services

PROFESSIONAL EXPERIENCE

Sr. Big Data Engineer

Confidential, Columbus, Indiana

Responsibilities:

  • Installed and configured a multi-node, fully distributed Hadoop cluster.
  • Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Zookeeper and Spark.
  • Worked with NoSQL databases like HBase to create tables and store data. Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Wrote Pig scripts to store the data into HBase.
  • Created Hive tables, dynamic partitions, and buckets for sampling, and worked on them using HiveQL.
  • Stored the data in tabular formats using Hive tables and Hive SerDe.
  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
  • Collaborated with internal application teams to fit our business models onto the existing on-prem platform setup. Implemented algorithms for real-time analysis in Spark.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
  • Used the Spark-Cassandra Connector to load data to and from Cassandra.
  • Streamed data in real time using Spark with Kafka and SOA.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.
  • Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team
  • Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis
  • Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed Pig Latin Scripts to perform Map Reduce jobs.
  • Developed product profiles using Pig and commodity UDFs.
  • Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
  • Experience in creating, dropping, and altering tables at run time without blocking updates and queries, using HBase and Hive.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD's.
  • Scheduled batch jobs using the event engine while creating dependency jobs.
  • Created flow diagrams and UML diagrams of the designed architecture to explain it to, and get approval from, product owners and business teams for all requested user requirements.
  • Integrated with RESTful APIs to create ServiceNow incidents when a process failure occurs within a batch job.
  • Developed a capability to implement audit logging at required stages while applying business logic.
  • Implemented Spark DataFrames on huge incoming datasets in various formats such as JSON, CSV, and Parquet.
  • Actively resolved technical challenges, for example handling nested JSON with multiple data sections in the same file and converting it into Spark-friendly DataFrames (see the sketch after this list).
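
A minimal PySpark sketch of the nested-JSON flattening described above (the project code was Scala/Java; the path and the header/transactions field names are hypothetical):

```python
# Sketch only: flatten nested JSON (multiple data sections per file) into a
# flat DataFrame. Paths and field names ("header", "transactions") are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("nested-json-flatten").getOrCreate()

# multiLine lets Spark parse one large JSON document per file.
raw = spark.read.option("multiLine", "true").json("hdfs:///data/incoming/events.json")

# Promote nested struct fields to top-level columns and explode the array
# section into one row per element.
flat = (
    raw.select(
        col("header.source").alias("source"),
        col("header.load_date").alias("load_date"),
        explode(col("transactions")).alias("txn"),
    )
    .select("source", "load_date", "txn.id", "txn.amount", "txn.status")
)

flat.write.mode("overwrite").parquet("hdfs:///data/staged/transactions")
```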

Environment: Hadoop, MapReduce, Hive, Pig, HBase, Sqoop, Flume, Cassandra, Scala, Spark, Oozie, Kafka, Linux, Java (JDK), Tableau, Eclipse, HDFS, MySQL

Big Data Engineer

Confidential, Englewood, CO

Responsibilities:

  • Prepared design blueprints and application flow documentation, gathering requirements from business partners.
  • Expertise in Microsoft Azure cloud services (PaaS & IaaS), Application Insights, Document DB, Internet of Things (IoT), Azure Monitoring, Key Vault, Visual Studio Online (VSO), and SQL Azure.
  • Hands-on experience in Azure Development, worked on Azure web application, App services, Azure storage, Azure SQL Database, Virtual machines, Fabric controller, Azure AD, Azure search, and notification hub.
  • Designed, configured, and deployed Microsoft Azure for many applications utilizing the Azure stack (including Compute, Web & Mobile, Blobs, Resource Groups, Azure SQL, Cloud Services, and ARM), focusing on high-availability, fault tolerance, and auto-scaling.
  • Worked on data pre-processing and cleaning to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
  • Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; directed machine learning use cases using Spark ML and MLlib.
  • Maintained the data in Data Lake (ETL), coming from the Teradata Database, writing on an average of 80 GB daily. Overall, the data warehouse had 5 PB of data and used a 135-node cluster to process the data.
  • Responsible for creating Hive tables to load data from MySQL using Sqoop and for writing Java snippets to perform cleaning, pre-processing, and data validation.
  • Experienced in creating Hive schemas, external tables, and managing views. Performed join operations in Spark using Hive and wrote HQL statements per user requirements.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and working with the Spark shell; developed Spark code using Java and Spark SQL for faster testing and data processing (see the sketch after this list).
  • Imported millions of structured records from relational databases using Sqoop to process with Spark, and stored the data in HDFS in Parquet format.
  • Worked on data cleaning and reshaping and generated segmented subsets using NumPy and Pandas in Python.
  • Used Spark SQL to process the massive volume of structured data; implemented Spark DataFrame transformations and steps to migrate MapReduce algorithms.
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Used DataFrame API solutions to pre-process massive volumes of structured data in various file formats, including text files, CSV, Sequence files, XML, JSON, and Parquet, and then turned the data into named columns.
  • Ensured necessary system security using best-in-class AWS cloud security solutions. Experienced in deploying Java projects using Maven/Ant and Jenkins.
  • DevOps and CI/CD pipeline knowledge, mainly TeamCity and Selenium; implemented continuous integration/delivery (CI/CD) pipelines in AWS when necessary.
  • Experienced with batch processing of data sources using Apache Spark and developing predictive analytics using Apache Spark Java APIs. Expert in implementing advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Java.
  • Worked extensively on the core and Spark SQL modules of Spark; used broadcast variables and accumulators extensively for better performance.
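
A minimal PySpark sketch of converting a HiveQL query into equivalent DataFrame transformations over Sqoop-landed Parquet, as described above (the project code was Java; table, column, and path names are hypothetical):

```python
# Sketch only: the same aggregation expressed as HiveQL and as DataFrame
# transformations over Sqoop-landed Parquet files. Names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("hive-to-dataframe")
    .enableHiveSupport()
    .getOrCreate()
)

# HiveQL form of the aggregation.
hive_df = spark.sql("""
    SELECT customer_id, COUNT(*) AS order_cnt, SUM(order_amount) AS total_spend
    FROM sales.orders
    WHERE order_date >= '2020-01-01'
    GROUP BY customer_id
""")

# Equivalent DataFrame transformations on Parquet data imported via Sqoop.
orders = spark.read.parquet("hdfs:///warehouse/sqoop/orders")
df_version = (
    orders.filter(F.col("order_date") >= "2020-01-01")
          .groupBy("customer_id")
          .agg(
              F.count(F.lit(1)).alias("order_cnt"),
              F.sum("order_amount").alias("total_spend"),
          )
)
```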

Environment: Hadoop, HDFS, Hive, Java 1.7, Spark 1.6, SQL, HBase, UNIX Shell Scripting, MapReduce, Putty, WinSCP, IntelliJ, Teradata, Linux.

Big Data Engineer

Confidential, Costa Mesa, CA

Responsibilities:

  • Involved in all phases of the SDLC (Software Development Life Cycle), including requirement collection, design, analysis, development, and application deployment.
  • Architected and designed solutions for business requirements, creating Visio diagrams for designing and developing the application and deploying it across various environments.
  • Developed Spark 2.1/2.4 Scala components to process the business logic and store the computation results of 10 TB of data in an HBase database, accessed by downstream web apps through the Big SQL Db2 database.
  • Uploaded and processed more than ten terabytes of data from various structured and unstructured sources into HDFS using Sqoop and Flume. Tested the developed modules in the application using the JUnit library and JUnit testing framework.
  • Analyzed structured, unstructured, and file system data and loaded it into HBase tables per project requirements using IBM Big SQL with the Sqoop mechanism, processing the data with Spark SQL in-memory computation and writing the results to Hive and HBase.
  • Handled integrating additional enterprise data into HDFS using JDBC, loading Hadoop data into Big SQL, and performing transformations on the fly using the Spark API to develop the standard learner data model, which obtains data from upstream in near real time and persists it into HBase.
  • Worked with different file structures and Hive file formats such as text files, Sequence files, ORC, Parquet, and Avro to analyze the data and build a data model, reading them from HDFS, processing them through Parquet files, and loading them into HBase tables.
  • Developed batch jobs using the Scala programming language to process data from files and tables, transform it with the business logic, and deliver it to the user.
  • Worked in the continuous deployment module, used to create new tables or update existing table structures as needed in different environments, along with DDL (Data Definition Language) creation for the tables. Also wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS/Blob storage.
  • Created pipelines in ADF (Azure Data Factory) using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and the write-back tool.
  • Developed JSON Scripts for deploying the Azure Data Factory (ADF) Pipeline that processes the data.
  • Loaded data from the Linux/Unix file system to HDFS and worked with PuTTY for better communication between Unix and Windows systems and for accessing data files in the Hadoop environment.
  • Developed and implemented HBase capabilities for big de-normalized data sets and then applied transformations on the de-normalized data using Spark/Scala.
  • Involved in Spark tuning to improve job performance based on Pepperdata monitoring tool metrics. Worked on building application platforms in the cloud by leveraging Azure Databricks.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Applied the Spark DataFrame API to complete data manipulation within a Spark session.
  • Imported data from various systems/sources like MySQL into HDFS.
  • Involved in creating tables and then applied HiveQL on those tables for data validation.
  • Developed shell scripts for configuration checks and file transformations to be done before loading data into the Hadoop landing area (HDFS).
  • Developed and implemented a custom Spark ETL component to extract data from upstream systems, push it to HDFS, and finally store it in HBase in wide-row format. Worked with the Apache Hadoop environment from Hortonworks.
  • Enhanced the application with new features and improved performance across all modules of the application.
  • Gained exposure to Microsoft Azure in the process of migrating on-prem data to the Azure cloud, studying and implementing Spark techniques such as partitioning the data by keys and writing it to Parquet files, which boosted performance (see the sketch after this list).
  • Understood the mapping documents and existing source data, prepared load strategies for the different source systems, and implemented them using Hadoop technology.
  • Worked with continuous integration tools like Maven, TeamCity, and IntelliJ, and scheduled jobs with the TWS (Tivoli Workload Scheduler) tool, creating and cloning jobs and job streams in TWS and promoting them to higher environments.
  • Contributed to DevOps to automate the build and deployment process using Jenkins, shell scripting, Chef, Python, AWS Lambda, and CloudFormation templates.
  • Built on-premise data pipelines using Kafka and Spark Streaming, consuming the feed from the API streaming gateway REST service.
  • Implemented the application using Spring Boot Framework and handled the security using Spring Security.
  • Used a microservice architecture with Spring Boot-based services interacting through a combination of REST and Apache Kafka message brokers, and worked with a Kafka cluster using ZooKeeper.
  • Coordinated with co-developers, the agile development and project management teams, and external systems, and was responsible for demos and presentations of developed modules to the project management team.
  • Performed code review activities with peer developers and the team architect to deliver exception/error-free applications, tested them with real-time test scenarios, and deployed them to the next-level environments. Analyzed and fixed data processing issues and troubleshot the process.
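
A minimal PySpark sketch of the key-partitioned Parquet write described above (the project component was written in Scala; the storage account, container, and column names are hypothetical):

```python
# Sketch only: repartition by key columns and write partitioned Parquet to an
# Azure Data Lake Storage path. Storage account, container, and columns are
# hypothetical; credential configuration is omitted.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-parquet-write").getOrCreate()

learner = spark.read.parquet("hdfs:///data/curated/learner_model")

(
    learner
    .repartition("region", "load_date")      # co-locate rows that share a partition key
    .write
    .mode("overwrite")
    .partitionBy("region", "load_date")      # one output directory per key value
    .parquet("abfss://curated@examplelake.dfs.core.windows.net/learner_model")
)
```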

Environment: Scala, Java, Spark framework, Linux, Jira, Bitbucket, IBM Big SQL, Hive, HBase, IntelliJ IDEA, Maven, Db2 Visualizer, ETL, TeamCity, WinSCP, PuTTY, IBM TWS (Tivoli Workload Scheduler), Windows, Azure Data Factory.

Big Data /Hadoop Developer

Confidential

Responsibilities:

  • Gathered project requirements and liaised with Business stakeholders to gain the application knowledge and architect solutions for existing problems.
  • Played an active role in end-to-end implementation of the Hadoop infrastructure using Pig, Sqoop, Hive, and Spark and migrated the data from legacy systems (like Teradata, MySQL) into HDFS.
  • Ingested data from Teradata using Sqoop into HDFS and worked with highly unstructured and semi-structured data.
  • Developed Oozie workflows to perform daily, weekly, and monthly incremental loads into Hive tables.
  • Migrated complex MapReduce programs into Spark RDD transformations using PySpark.
  • Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches of data.
  • Used Oozie workflow engine to run multiple jobs which run independently.
  • Worked on Kafka while dealing with raw data, by transforming into new Kafka topics for further consumption.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
  • Wrote MapReduce (Hadoop) programs to convert text files into Avro and load them into Hive tables.
  • Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
  • Developed design documents considering all possible approaches and identifying the best of them.
  • Imported data from different sources like HDFS/HBase into Spark RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python (see the sketch after this list).
  • Created a POC to orchestrate the migration from existing Teradata platform to Hadoop infrastructure to increase the efficiency of the data used for analytics and decision making.
  • Utilized Informatica Power Center ETL tool to extract the data from heterogeneous sources and load them into the target systems.
  • Implemented the Slowly Changing Dimensions to capture the updated master data and load into the target Teradata system according to the business logic.
  • Created mapping variables and data flow logic from source to target systems.
  • Participated in daily status calls with internal team and weekly calls with client and updated the status report.
  • Ability to work with onsite and offshore team members.
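
A minimal PySpark sketch of rewriting a MapReduce job as Spark RDD transformations, as described above (the input path and record layout are hypothetical):

```python
# Sketch only: a MapReduce-style count rewritten as Spark RDD transformations.
# The input path and tab-delimited record layout are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-to-rdd").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/raw/clickstream/*.log")

event_counts = (
    lines.map(lambda line: line.split("\t"))       # mapper: parse the record
         .filter(lambda fields: len(fields) > 2)   # drop malformed rows
         .map(lambda fields: (fields[2], 1))       # emit (event_type, 1)
         .reduceByKey(lambda a, b: a + b)          # reducer: sum counts per key
)

event_counts.saveAsTextFile("hdfs:///data/output/event_counts")
```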

Environment: Hadoop, MapReduce, HDFS, HiveQL, Pig, Java, Spark, Kafka, AWS, SBT, Maven, Sqoop, ZooKeeper, Python, Informatica Power Center, Teradata.

Data Engineer

Confidential

Responsibilities:

  • Involved in complete project life cycle starting from design discussion to production deployment.
  • Worked closely with the business team to gather their requirements and new support features.
  • Developed a 16-node cluster while designing the data lake with the Cloudera distribution.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Implemented and configured High Availability Hadoop Cluster.
  • Installed and configured Hadoop Clusters with required services (HDFS, Hive, HBase, Spark, and Zookeeper).
  • Developed Hive scripts to analyze data; PHI was categorized into different segments, and promotions were offered to customers based on those segments.
  • Worked on POC to check various cloud offerings including Google Cloud Platform (GCP).
  • Compared self-hosted Hadoop with GCP's Dataproc and explored Bigtable (managed, HBase-compatible) use cases and performance evaluation.
  • Developed pipeline for POC to compare performance/efficiency while running pipeline using the AWS EMR Spark cluster and Cloud Dataflow on GCP.
  • Created KPIs, calculated members, and named sets using SSAS.
  • Extensive experience in writing Pig scripts to transform raw data into baseline data.
  • Developed UDFs in Java as and when necessary to use in Pig and HIVE queries.
  • Worked on Oozie workflow engine for job scheduling.
  • Developed, maintained, monitored, and performance-tuned the data mart databases and SSAS OLAP cube(s).
  • Created Hive tables, partitions and loaded the data to analyze using HiveQL queries.
  • Created different staging tables like ingestion tables and preparation tables in Hive environment.
  • Optimized Hive queries and used Hive on top of Spark engine.
  • Worked on Sequence files, map-side joins, bucketing, and static and dynamic partitioning for Hive performance enhancement and storage improvement (see the sketch after this list).
  • Experience in retrieving data from Oracle using PHP and Java programming.
  • Tested Apache TEZ, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Created tables in HBase to store the variable data formats of data coming from different upstream sources.
  • Experience in managing and reviewing Hadoop log files.
  • Good understanding of ETL tools and how they can be applied in a Big Data environment.
  • Followed Agile Methodologies while working on the project.
  • Provided bug fixing and 24x7 production support for running processes.
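
A minimal PySpark sketch of the Hive dynamic-partitioning load pattern referenced above (database, table, and column names are hypothetical):

```python
# Sketch only: create a partitioned Hive table and load it from a staging table
# with dynamic partitioning. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-dynamic-partition-load")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.claims_prepared (
        claim_id     STRING,
        member_id    STRING,
        claim_amount DOUBLE
    )
    PARTITIONED BY (claim_year INT)
    STORED AS ORC
""")

# The dynamic partition column goes last in the SELECT list.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.claims_prepared PARTITION (claim_year)
    SELECT claim_id, member_id, claim_amount, year(claim_date) AS claim_year
    FROM staging.claims_ingest
""")
```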

Environment: Hadoop, MapReduce, HDFS, Sqoop, Flume, Kafka, Hive, Pig, HBase, SQL, Shell Scripting, Eclipse, DBeaver, Datagrid, SQL Developer, IntelliJ, Git, SVN, JIRA, Unix, SSIS, SSAS, Azure, Azure Data Factory.
