
Big Data Engineer Resume

Houston, TX


  • Over 7 years of professional IT experience in the analysis, design, development, and testing of large-scale applications using Java, SQL, Hadoop, and other Big Data technologies.
  • 3+ years of data analytics experience designing and implementing complete end-to-end Big Data/Hadoop infrastructure solutions using HDFS, Pig, Hive, HBase, Sqoop, Flume, Oozie, MapReduce, and Spark.
  • Hands-on experience installing, configuring, supporting, and managing Hadoop clusters using the Cloudera and Hortonworks distributions of Hadoop.
  • Experience in Hadoop administration activities such as installation and configuration of clusters using Cloudera Manager.
  • Good knowledge of writing complex MapReduce jobs, Pig scripts, and Hive data models.
  • Excellent understanding of Hadoop distributed system architecture and design principles.
  • In-depth understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
  • Experience in fine-tuning MapReduce jobs for better scalability and performance and converting them to Spark.
  • Experience working with BI teams to translate big data requirements into Hadoop-centric solutions.
  • Experience in developing solutions to analyze large data sets efficiently.
  • Developed multiple POCs using spark-shell, deployed them on a YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
  • Worked with cloud services such as Amazon Web Services (AWS) and was involved in ETL, data integration, and migration.
  • Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), and Athena.
  • Involved in data processing using an ETL pipeline orchestrated by AWS Data Pipeline using Pig and Hive on Amazon EMR.
  • Good experience creating real-time data streaming solutions using Apache Spark/Spark Streaming/Apache Storm.
  • Extended Hive and Pig core functionality by writing custom UDFs.
  • Experienced in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing, and internal/external tables.
  • Extensively worked on Informatica IDE/IDQ.
  • Involved in massive data profiling using IDQ (Analyst Tool) prior to data staging.
  • Experience in developing and implementing web applications using Java, J2EE, JSP, Servlets, JSF, HTML, JSON, jQuery, CSS, XML, JDBC, and JNDI.
  • Experience writing SQL and PL/SQL queries and stored procedures for accessing and managing databases such as Oracle, SQL Server, MySQL, and IBM DB2.
  • Good experience providing production support for applications.
  • Strong experience with Software Development Life Cycle (SDLC) models and with the Agile (Scrum) process.


Operating Systems: UNIX, Linux, Ubuntu, Fedora, Windows Vista/7/8/10

Big Data Technologies: HDFS, Hive, MapReduce, Pig, Sqoop, Oozie, YARN and Spark

Scripting Languages: Shell, Python, Perl

Programming Languages: Java, Python, C, SQL, PL/SQL, HQL, Hive, Pig and HBase.

Methodologies: OOAD, UML, Design Patterns.

Frameworks: Spring, Hibernate

SCM Tools: SVN, GitHub

Web Services: SOAP, JMS, Apache Tomcat, IBM WebSphere 7 and 8.0, JBoss, IBM HTTP Server, Apache HTTP Server.


Web Technologies: HTML, CSS, Java, JavaScript, JDBC, XML, JSF, DHTML, XSD, XPath, Hibernate, HQL, Criteria

Web servers: WebLogic, WebSphere, Apache Tomcat, JBoss

Databases: Oracle, SQL Server, MySQL, DB2.


Confidential, Houston, TX

Big Data Engineer


  • Developed enterprise data pipelines using agile methodology and planned the Scrum meetings.
  • Involved in design of functional and technical documents.
  • Worked on fine-tuning SQL and Hive queries.
  • Ingested data from CSV files and loaded it into Hive external tables using Hive and Spark.
  • Worked with the JSON SerDe to parse JSON data and load it into Hive external tables stored on HDFS.
  • Worked on production deployments and troubleshot production and platform issues.
  • Worked on performance tuning of multiple production jobs.
  • Worked on Sqoop to import the data from Oracle database and store the data in HDFS.
  • Worked on file formats like Avro and Parquet to store the data.
  • Wrote Spark scripts to improve the performance and optimization of existing Hadoop algorithms using SparkContext/SparkSession, Spark SQL, DataFrames, pair RDDs, transformations, and actions.
  • Wrote Spark programs in Python to handle batch processing and used Spark SQL to query the data for ad hoc use.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Worked with Apache Kafka for collecting, aggregating, and moving large amounts of data in real time.
  • Used Hive for ETL and used static and dynamic partitions.
  • Documented the data pipelines and prepared basic troubleshooting documents for day-to-day use by business and production support teams.
  • Developed shell scripts to automate ad hoc tasks.
  • Worked on Cloudera distribution of Hadoop and used tools like Hue, Cloudera Manager, Cloudera Navigator, Impala and Oozie.
  • Used the continuous integration tool Jenkins for deployments.
  • Used Rundeck for automating tasks on multiple nodes.
  • Used Git/Bitbucket for source code management.

Environment: Hadoop 2.0, YARN, Hive, Sqoop, Apache Drill, MapReduce, Git, Python, Rundeck, EMR, S3, EC2, CDH 5.7, HDFS, Apache Spark, SOAP, Unix, Shell Scripting, Jenkins, JIRA, Hue.

Confidential, Minneapolis, MN

Hadoop Developer


  • Responsible for building scalable distributed data solutions using Hadoop.
  • Developed custom User Defined Functions (UDFs) in Hive to transform large volumes of data according to business requirements.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Involved in loading data from edge node to HDFS using shell scripting.
  • Developed Spark code using Python and Spark-SQL for faster testing and processing of data.
  • Worked with Hadoop 2.x and Spark (Python and Scala).
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
  • Involved in the installation and configuration of patches and version upgrades.
  • Involved in Hadoop cluster environment administration, including adding and removing cluster nodes, cluster capacity planning, performance tuning, and cluster monitoring.
  • Used tools such as Cloudera Manager, Ganglia, and Nagios to monitor Hadoop cluster performance and collect metrics.
  • Wrote Spark scripts to improve the performance and optimization of existing Hadoop algorithms using SparkContext/SparkSession, Spark SQL, DataFrames, and pair RDDs.
  • Worked on Cloudera distribution of Hadoop CDH 5.x.
  • Worked on AWS services like EC2, EMR, S3 and RDS.
  • Monitored and troubleshot Hadoop jobs using the YARN ResourceManager, and EMR job logs using Genie and Kibana.
  • Triggered Spark jobs on AWS Elastic MapReduce (EMR) cluster resources and fine-tuned them based on cluster scalability.
  • Used Spark transformations and actions.
  • Used IDQ's standardized plans for address and name cleanup.
  • Worked on IDQ file configuration on users' machines and resolved related issues.
  • Performed performance tuning of long-running jobs.
  • Worked with querying tools such as Hue, Zeppelin, Presto, and Impala.
  • Used Impala for ad hoc querying of larger datasets.
  • Involved in requirement gathering meetings.
  • Experience with Jenkins for deploying big data solutions (CI/CD).
  • Involved in moving data from HDFS to AWS Simple Storage Service (S3).
  • Worked on importing the data from RDBMS (Oracle, MySQL) Databases to HDFS and S3 and vice versa.
  • Managed Hadoop jobs using the Oozie and Airflow workflow schedulers to schedule and orchestrate MapReduce, Hive, Pig, and Spark jobs.
  • Followed Agile methodology and delivered tasks in sprints.
  • Wrote Hive queries for data analysis to meet business requirements.
  • Involved in loading generated HFiles into HBase for faster access to a large customer base without a performance hit.
  • Created HBase tables to store various data formats coming from different portfolios.
  • Experienced in managing and reviewing Hadoop log files.

Technologies: Apache Hadoop, HDFS, Hive, HBase, Pig, Spark Transformations, AWS, EMR, UNIX, Shell Scripting, Spark, Scala, Oozie, ZooKeeper, Cloudera CDH 5.x

Confidential, LA, CA



  • Evaluated the suitability of Hadoop and its ecosystem for the project, implementing and validating various proof-of-concept (POC) applications for eventual adoption under the Big Data Hadoop initiative.
  • Involved in gathering and analyzing business requirements.
  • Worked with Sqoop import and export functionality to handle large dataset transfers between Oracle databases and HDFS.
  • Developed Custom Input Formats in MapReduce jobs to handle custom file formats.
  • Handled data in file formats such as Text, SequenceFile, Avro, Parquet, and ORC.
  • Used Spark to parse JSON data and handle the incremental update logic within Spark code.
  • Used various Spark transformations and actions to operate on and aggregate datasets, storing the results in HDFS and S3.
  • Developed MapReduce/EMR jobs to analyze the data and provide heuristics and reports. The heuristics were used for improving campaign targeting and efficiency.
  • Used EMR to perform big data operations in AWS. Created ORC tables to improve reporting performance.
  • Wrote MapReduce jobs for data processing and stored the results in HDFS for BI reporting.
  • Developed Pig Latin and HiveQL scripts and used other Hadoop ecosystem tools for trend analysis and pattern recognition on user data.
  • Used IDQ to profile the project source data, define or confirm the metadata definitions, cleanse and accuracy-check the project data, check for duplicate or redundant records, and provide guidance on how to proceed with ETL processes.
  • Experienced in massive data profiling using IDQ (Analyst tool) prior to data staging.
  • Wrote complex Hive queries and UDFs.
  • Good understanding of Python scripting.
  • Worked on Hortonworks distribution (HDP) of Hadoop and used Ambari to monitor the health of the Hadoop Cluster and perform various activities on the Hadoop cluster.
  • Provided BI production support on an on-call rotation.
  • Performed validation and standardization of raw data from XML and JSON files with Pig and MapReduce.
  • Developed scripts and batch jobs to schedule various Hadoop programs using Oozie.
  • Developed Hive scripts for implementing control tables logic in HDFS.
  • Designed and implemented partitioning (static and dynamic) and bucketing in Hive.
  • Responsible for performing extensive data validation using Hive.
  • Implemented complex MapReduce programs in Java to perform map-side joins using the Distributed Cache.
  • Ingested data from a service provider (web service call) using a REST API and Python.

Technologies: Linux, Eclipse, SVN, Jira, JSON, RESTful Web Services, AWS, Hadoop, Hive, Oozie, Hortonworks, MapReduce, Pig.


Java Developer


  • As a team member, was involved in the design, development, and testing of the Collaborative and Administration modules, including workflow and access control submodules.
  • Designed functional and technical design documents.
  • Designed and developed Java classes using Object Oriented Methodology.
  • Coded front-end and back-end components using JSP, JDBC, JavaScript, and HTML.
  • Developed an application that provides an interface between the middle tier and the database using JDBC.
  • Responsible for gathering and analysis of the specifications, providing estimates through interfacing with Business Analysts.
  • Developed/Modified SQL queries, Stored Procedures and Triggers for data retrieval and modification on Oracle 10g.
  • Developed servlets to handle requests from multiple clients.
  • Worked on SOAP and RESTful web services.
  • Used Spring Beans to encapsulate business logic and implemented the application's MVC architecture using the Spring MVC framework.
  • Implemented the data access layer using Hibernate.
  • Used Apache POI to generate Excel documents.
  • Implemented Struts action classes.
  • Used core J2EE Design Patterns like singleton and MVC.
  • Used Hibernate to extract data from the database and bind it to corresponding Java objects.
  • Used the Dependency Injection feature in Spring to instantiate classes.
  • Used the DAO, Abstract Factory, and Factory design patterns.
  • Worked extensively with the back-end Oracle database.

Technologies: JDK 1.3.1, JSP 2.0, JDBC, Java Bean, JavaScript, HTML, MS Access 2000, Tomcat 4.1.17, Eclipse 3.0, CVS, Toad 6.5.


Java Developer


  • Analyzed the client's business processes and worked with the client to gather all requirements.
  • Designed the entire system, including the graphical user interface and database, using data flow and entity-relationship diagrams.
  • Provided valuable information in response to clients' ad hoc inquiries.
  • Verified and validated infrastructure upgrade support, including patch fixes, memory upgrades, and version upgrades.
  • Used JSP, Apache Struts tag libraries, JavaScript, Dojo, XML, and CSS in the Eclipse 5.1 IDE to develop the web interface.
  • Actively involved in designing and implementing the Data Access Object (DAO), Service Locator, Singleton, and Data Transfer Object design patterns.
  • Created and configured JMS resources for queue-based messaging.
  • Used connection pooling to handle data transmission from the database.
  • Involved in bug fixing and production support.
  • Prepared unit, integration, and system test cases.

Technologies: Java, JDBC, JFC/Swing, Java Beans, Oracle 8, Windows NT, Linux, JDK 1.4.2_15, J2EE 1.4, Rational Application Developer (RAD) 6.0
