
Big Data Engineer Resume

Vienna, Virginia


  • 8+ years of IT experience in software analysis, design, development, testing and implementation of Big Data, Hadoop, NoSQL and Java/J2EE technologies.
  • Skilled in installing, configuring and using Apache Hadoop ecosystem components such as MapReduce, Hive, Pig, Sqoop, Flume, YARN, Spark, Kafka and Oozie.
  • Strong understanding of Hadoop daemons and MapReduce concepts.
  • Strong experience in importing data into and exporting data from HDFS.
  • Expertise in Java and Scala.
  • Experienced in developing UDFs for Hive using Java.
  • Worked with Apache Falcon, a data governance engine that defines, schedules, and monitors data management policies.
  • Experience with Amazon AWS services such as EMR, EC2, S3, CloudFormation, Redshift, and DynamoDB for fast and efficient processing of Big Data.
  • Hands-on experience with Hadoop, HDFS, MapReduce and the Hadoop ecosystem (Pig, Hive, Oozie, Flume and HBase).
  • Good experience in data transformation and storage with HDFS, MapReduce, and Spark.
  • Hands-on experience in developing Spark applications using RDD transformations, Spark Core, Spark Streaming and Spark SQL.
  • Strong understanding of NoSQL databases such as HBase, MongoDB, and Cassandra.
  • Experience in working with Angular 4, Node.js, Bookshelf, Knex, and MariaDB.
  • Understanding of data storage and retrieval techniques, ETL, and databases, including graph stores, relational databases, and tuple stores.
  • Good skills in developing reusable solutions to maintain consistent coding standards across different Java projects.
  • Good knowledge of Python collections, Python scripting and multithreading.
  • Wrote multiple MapReduce programs in Python for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
  • Experience with EJB, Hibernate, Java web services, SOAP, REST services, Java threads, Java sockets, Java servlets, JSP, and JDBC.
  • Used Pandas, NumPy, seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop and apply machine learning algorithms such as linear regression.
  • Expertise in debugging and performance tuning of Oracle and Java, with strong knowledge of Oracle 11g and SQL.
  • Worked with YARN Queue Manager to allocate queue capacities for different service accounts.
  • Hands-on experience with Hortonworks and Cloudera Hadoop environments.
  • Familiar with handling complex data processing jobs using Cascading.
  • Strong database skills in IBM DB2 and Oracle; proficient in database development, including constraints, indexes, views, stored procedures, triggers and cursors.
  • Led testing efforts in support of projects/programs across a large landscape of technologies (Unix, AngularJS, AWS, Sauce Labs, Cucumber JVM, MongoDB, GitHub, SQL, NoSQL databases, APIs, Java, Jenkins).
  • Automated testing with Cucumber JVM to develop a world-class ATDD process.
  • Set up JDBC connections for database testing using the Cucumber framework.
  • Experience in component design using UML use case, class, sequence, deployment, and component diagrams for the requirements.
  • Expertise in installing, configuring, supporting and managing Hadoop clusters using Apache and Cloudera (CDH3, CDH4) distributions, Hortonworks, and Amazon Web Services (AWS).
  • Excellent analytical and programming abilities in using technology to create flexible and maintainable solutions for complex development problems.
  • Good communication and presentation skills; willing to learn and adapt to new technologies and third-party products.


Confidential, Vienna, Virginia

Big Data Engineer


  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Set up the AWS cloud environment, including S3 storage and EC2 instances.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
  • Implemented programs to retrieve results from unstructured data sets.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Designed and worked on a Big Data analytics platform for processing customer interface preferences and comments using Hadoop, Hive, Pig, and Cloudera.
  • Imported data from Oracle into HDFS and Hive using Sqoop, and exported data back to Oracle.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Provided thought leadership for the architecture and design of Big Data analytics solutions for customers, and actively drove Proof of Concept (POC) and Proof of Technology (POT) evaluations to implement a Big Data solution.
  • Developed numerous MapReduce jobs in Scala for data cleansing and for analyzing data in Impala.
  • Created a data pipeline using processor groups and multiple processors in Apache NiFi for flat-file and RDBMS sources as part of a POC on Amazon EC2.
  • Worked on reading multiple data formats on HDFS using Scala.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Analyzed the SQL scripts and designed a solution to implement them in Scala.
  • Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm and search technologies such as Elasticsearch.
  • Implemented POCs to migrate iterative MapReduce programs into Spark transformations using Scala.
  • Developed Spark scripts using Python and Scala shell commands as required.
  • Involved in batch processing of data sources using Apache Spark and Elasticsearch.
  • Developed Spark jobs using Scala in a test environment for faster data processing and used Spark SQL for querying.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
  • Designed and implemented SOLR indexes for the metadata that enabled internal applications to reference Scopus content.
  • Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.
  • Wrote Python scripts to parse XML documents and load the data into the database.
  • Used Python to extract weekly information from XML files.
  • Developed Python scripts to clean the raw data.
  • Used Spark with Scala for parallel data processing and better performance.
  • Extensively used Pig for data cleansing and to extract data from web server output files for loading into HDFS.
  • Developed a data pipeline using Kafka and Storm to store data into HDFS.
  • Implemented Kafka producers with custom partitions, configured brokers, and implemented high-level consumers for the data platform.
  • Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them using MapReduce programs.
  • Developed simple to complex MapReduce streaming jobs using Python.

Environment: Pig, Hive, HBase, Sqoop, Flume, Cassandra, ZooKeeper, AWS, MapReduce, HDFS, Oracle, Cloudera, Scala, Spark, SQL, Apache Kafka, Apache Storm, Python, Unix and Solr.

Confidential, Smyrna, GA

Big Data Engineer


  • Architected, designed and developed business applications and data marts for reporting.
  • Developed Big Data solutions focused on pattern matching and predictive modeling.
  • The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
  • Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2.
  • Created Hive external tables to stage data and then moved the data from staging to main tables.
  • Exported data from Hive tables into the Netezza database.
  • Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into HDFS.
  • Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Created a data pipeline using processor groups and multiple processors in Apache NiFi for flat-file and RDBMS sources as part of a POC on Amazon EC2.
  • Built Hadoop solutions for Big Data problems using MR1 and MR2 on YARN.
  • Loaded data from sources such as HDFS and HBase into Spark RDDs and implemented in-memory computation to generate the output response.
  • Developed complete end-to-end Big Data processing in the Hadoop ecosystem.
  • Used AWS Cloud for infrastructure provisioning and configuration.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
  • Worked on configuring and managing disaster recovery and backup for Cassandra data.
  • Utilized Oozie workflows to run Pig and Hive jobs.
  • Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
  • Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
  • Implemented partitioning, dynamic partitions and buckets in Hive.
  • Used Flume to collect, aggregate, and store web log data from sources such as web servers, mobile devices and network devices, and pushed it to HDFS.
  • Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.

Environment: Apache Spark, Hive, Informatica, HDFS, MapReduce, Scala, Apache NiFi, YARN, HBase, PL/SQL, MongoDB, Pig, Sqoop, Flume.

Confidential, Santa Clara, CA

Big Data Engineer


  • Extracted files from DB2 through Kettle, placed them in HDFS, and processed them.
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Developed Sqoop scripts to enable interaction between Hive and the Vertica database.
  • Involved in creating Hive tables and in loading and analyzing data using Hive queries.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Involved in running Hadoop jobs for processing millions of records of text data.
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Involved in unit testing of MapReduce jobs using MRUnit.
  • Involved in loading data from the Linux file system to HDFS.
  • Managed workflows and scheduling for complex MapReduce jobs using Apache Oozie.
  • Loaded data from multiple sources onto AWS S3 cloud storage.
  • Ran Hadoop streaming jobs to process terabytes of XML-format data.
  • Loaded and transformed large sets of structured and semi-structured data.
  • Assisted in exporting analyzed data to relational databases using Sqoop.
  • Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Implemented data integrity and data quality checks in Hadoop using Hive and Linux scripts.
  • Used Avro and Parquet file formats for serialization of data.

Environment: Hadoop, Oozie, HDFS, Pig, Hive, MapReduce, AWS S3, Sqoop, Linux, MRUnit

Confidential, Richardson, TX

Hadoop developer


  • Worked on Spark SQL to handle structured data in Hive.
  • Worked on creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Migrated tables from RDBMS into Hive tables using Sqoop and later generated visualizations using Tableau.
  • Worked on complex MapReduce programs to analyze data that exists on the cluster.
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Wrote a Hive UDF to sort struct fields and return a complex data type.
  • Worked in AWS environment for development and deployment of custom Hadoop applications.
  • Created files and tuned SQL queries in Hive using Hue (Hadoop User Experience).
  • Worked on collecting and aggregating large amounts of log data using Storm and staging data in HDFS for further analysis.
  • Worked on Tableau to build customized interactive reports, worksheets and dashboards.
  • Managed real-time data processing and real-time data ingestion into MongoDB and Hive using Storm.
  • Developed Spark scripts using Python shell commands.
  • Stored the processed results in the data warehouse and maintained the data using Hive.
  • Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Created Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability.
  • Worked with and learned a great deal from Amazon Web Services (AWS) cloud services such as EC2, S3, and EMR.
  • Developed a proof of concept that uses Apache NiFi to ingest data from Kafka and convert raw XML data into JSON and Avro, and implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.

Environment: Cloudera, HDFS, MapReduce, Storm, Hive, Pig, Sqoop, Apache Spark, Python, Accumulo, Oozie Scheduler, Kerberos, AWS, Tableau, Java, UNIX shell scripts, Hue, NiFi, Git, Maven.


Data Analyst


  • Interacted with business users to identify and understand business requirements and identified the scope of the projects.
  • Identified and designed business entities, their attributes, and the relationships between them to develop a logical model, and later translated it into a physical model.
  • Developed normalized Logical and Physical database models for designing an OLTP application.
  • Enforced referential integrity (RI) for consistent relationships between parent and child tables.
  • Worked with users to identify the most appropriate source of record and profile the data required for sales and service.
  • Involved in defining the business/transformation rules applied for ICP data.
  • Defined the list codes and code conversions between the source systems and the data mart.
  • Developed the financing reporting requirements by analyzing the existing Business Objects reports.
  • Utilized the Informatica toolset (Informatica Data Explorer and Informatica Data Quality) to analyze legacy data for data profiling.
  • Reverse engineered the data models, identified the data elements in the source systems, and added new data elements to the existing data models.
  • Created XSDs for applications to connect the interface and the database.
  • Compared data with original source documents and validated data accuracy.
  • Used reverse engineering to create a graphical representation (E-R diagram) and to connect to the existing database.
  • Generated weekly and monthly asset inventory reports.
  • Evaluated data profiling, cleansing, integration and extraction tools (e.g. Informatica).
  • Coordinated with business users to provide an appropriate, effective and efficient way to design new reports based on user needs and the existing functionality.
  • Worked on the impact of low-quality and/or missing data on the performance of the data warehouse client.
  • Worked with NZLoad to load flat file data into Netezza tables. Good understanding of Netezza architecture.
  • Identified design flaws in the data warehouse and executed DDL to create databases, tables and views.
  • Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
  • Involved in Data Mapping activities for the data warehouse.
  • Created and configured workflows, worklets, and sessions to transport data to target Netezza warehouse tables using Informatica Workflow Manager.
  • Extensively worked on Performance Tuning and understanding Joins and Data distribution.
  • Coordinated with DBAs and generated SQL code from data models.
  • Generated reports using Crystal Reports for better communication between business teams.

Environment: SQL Server, Oracle 9i, MS Office, Embarcadero, Crystal Reports, Netezza, Teradata, Enterprise Architect, Toad, Informatica, ER Studio, XML, OBIEE


System Analyst/Admin


  • Responsible for administering large, multi-site UNIX/Linux server environments, including operating systems, software package installation, upgrades, system integrity, security, disaster recovery and performance.
  • Implemented Cloudera Hadoop clusters for three different environments on HPE ProLiant servers.
  • Administered core Hadoop components such as HDFS, YARN, Hive and MapReduce.
  • Commissioned and decommissioned master and worker nodes in Hadoop clusters.
  • Configured/Managed Cisco and Brocade Fabric environment for soft/hard Zoning.
  • Worked on MySQL and ran queries to provide required data to the department.
  • Created users, managed user permissions, and maintained user and file system quotas on Red Hat Linux.
  • Installed and configured Logical Volume Manager (LVM) and RAID.
  • Automated administration tasks through scripting and job scheduling using cron.
  • Wrote shell scripts for taking data backups, cleaning junk content and updating software regularly.
  • Experience using protocols/services such as HTTP, HTTPS, TLS/SSL, DHCP, DNS, SSH, TCP/IP, FTP/SFTP, and SMTP.
  • Provided Linux system administration, Linux system security, project management and risk management in information systems.
  • Handled day-to-day provisioning of storage, including storage device/LUN/volume selection and creation, fabric zoning, and LUN masking and mapping.
  • Administered an environment running VMware ESXi hosts and virtual machines.
  • Worked successfully towards improving and maintaining a backup success rate above 98%.
  • Worked with server teams to ensure the configuration and installation of the proper drivers, firmware, and multipath drivers to support the SAN environment.
  • Provided performance tuning and regular maintenance to minimize downtime and maximize performance.
  • Maintained the data center (DC) by ordering and upgrading necessary hardware, supporting RAID arrays, maintaining servers and installing new ones.

Environment: Linux, Hadoop Ecosystem, VMware, Storage, CDH, Cloudera Manager
