Big Data Engineer Resume

Dallas, TX


  • 7+ years of IT experience in software development and support, including developing strategic methods for deploying Big Data technologies to efficiently meet Big Data processing requirements.
  • Expertise in Hadoop ecosystem components HDFS, MapReduce, YARN, HBase, Pig, Sqoop, Spark, Spark SQL, Spark Streaming, and Hive, as well as Spring Boot, for scalability, distributed computing, and high-performance computing.
  • Experience using Hive Query Language (HQL) for data analytics.
  • Experienced in installing, maintaining, and configuring Hadoop clusters.
  • Strong knowledge of creating and monitoring Hadoop clusters on Amazon EC2, VMs, Hortonworks Data Platform 2.1 and 2.2, and CDH3/CDH4 with Cloudera Manager, on Linux and Ubuntu.
  • Capable of processing large sets of structured, semi-structured, and unstructured data and supporting systems application architecture.
  • Good knowledge of single-node and multi-node cluster configurations.
  • Strong knowledge of NoSQL databases such as HBase, Cassandra, MongoDB, and MarkLogic, and their integration with Hadoop clusters.
  • Expertise in the Scala programming language and Spark Core.
  • Worked with AWS-based data ingestion and transformations.
  • Worked with Cloudbreak and Blueprints to configure the AWS platform.
  • Worked with data warehouse tools such as Informatica and Talend.
  • Experienced in job workflow scheduling and monitoring tools such as Oozie and ZooKeeper.
  • Good knowledge of Amazon EMR, Amazon RDS, S3 buckets, DynamoDB, and Redshift.
  • Analyze data, interpret results, and convey findings in a concise and professional manner.
  • Partner with the Data Infrastructure team and business owners to implement new data sources and ensure consistent definitions are used in reporting and analytics.
  • Promote a full-cycle approach, including request analysis, creating and pulling datasets, report creation and implementation, and providing final analysis to the requestor.
  • Good experience with Kafka and Storm.
  • Worked with Docker to establish a connection between Spark and a Neo4j database.
  • Knowledge of Java virtual machines (JVM) and multithreaded processing.
  • Hands-on experience working with ANSI SQL.
  • Strong programming skills in designing and implementing applications using Core Java, J2EE, JDBC, JSP, HTML, Spring Framework, Spring Batch, Spring AOP, Spring Boot, Struts, JavaScript, and Servlets.
  • Experience writing build scripts with Maven and working with continuous integration systems like Jenkins.
  • Java developer with extensive experience with various Java libraries, APIs, and frameworks.
  • Hands-on development experience with RDBMSs, including writing complex SQL queries, stored procedures, and triggers.
  • Very good understanding of SQL, ETL, and data warehousing technologies.
  • Knowledge of MS SQL Server 2012/2008/2005 and Oracle 11g/10g/9i and E-Business Suite.
  • Expert in T-SQL, including creating and using stored procedures, views, and user-defined functions, and implementing business intelligence solutions using SQL Server 2000/2005/2008.
  • Developed a web services module for integration using SOAP and REST.
  • NoSQL database experience with HBase, Cassandra, and DynamoDB.
  • Comfortable in Unix/Linux and Windows environments, working with operating systems such as CentOS 5/6, Ubuntu 13/14, and Cosmos.
  • Sound knowledge of designing data warehousing applications using tools such as Teradata, Oracle, and SQL Server.
  • Experience working with Solr for text search.
  • Experience using the Talend ETL tool.
  • Experience working with job schedulers such as Autosys and Maestro.
  • Strong in databases such as Sybase, DB2, Oracle, and MS SQL, and in working with clickstream data.
  • Strong understanding of Agile Scrum and Waterfall SDLC methodologies.
  • Strong working experience with Snowflake.
  • Hands-on experience with automation and monitoring tools such as Puppet, Jenkins, Chef, Ganglia, and Nagios.
  • Strong communication, collaboration, and team-building skills, with the ability to grasp new technical concepts quickly and apply them productively.
  • Adept in analyzing information system needs, evaluating end-user requirements, custom designing solutions and troubleshooting information systems.
  • Strong analytical and problem-solving skills.


Hadoop/Big Data Technologies: HDFS, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Spark, ZooKeeper, Cloudera Manager, Splunk

NoSQL Databases: HBase, Cassandra

Monitoring and Reporting: Tableau, Custom shell scripts

Hadoop Distributions: Hortonworks, Cloudera, MapR

Build Tools: Maven, SQL Developer

Programming & Scripting: Java, C, SQL, Shell Scripting, Python, Scala

Java Technologies: Servlets, JavaBeans, JDBC, Spring, Hibernate, SOAP/REST services

Databases: Oracle, MySQL, MS SQL Server, Teradata

Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript, AngularJS

Version Control: SVN, CVS, Git

Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7, Windows Server 2008/2003


Confidential - Dallas, TX

Big Data Engineer


  • Developed a data pipeline with Kafka and Spark.
  • Contributed to designing the data pipeline using the Lambda architecture.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
  • Involved in installing, configuring, supporting, and managing Hadoop clusters.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Used Spark for interactive queries and processing of streaming data.
  • Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables; and optimized Hive queries.
  • Developed Spark applications using Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop, using SparkContext, Spark SQL, DataFrames, and Spark on YARN.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building common.
  • Configured a data model to get data from Kafka in near real time and persist it to Cassandra.
  • Developed a Kafka consumer in Python to consume data from Kafka topics.
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Migrated an existing on-premises application to AWS.
  • Used AWS services such as EC2 and S3 for processing and storing small data sets.
  • Experienced in maintaining Hadoop clusters on AWS EMR.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.

Environment: Hortonworks, Apache Hadoop, Hive, Python, Hue, ZooKeeper, MapReduce, Sqoop, Crunch API, Pig 0.10 and 0.11, HCatalog, Unix, Java, JSP, Eclipse, Maven, Oracle, SQL Server, Linux, MySQL.

Confidential - San Francisco, CA

Big Data Engineer


  • Communicated deliverables status to stakeholders and facilitated periodic review meetings.
  • Developed a Spark Streaming application to pull data from the cloud into Hive and HBase.
  • Built Real-Time Streaming Data Pipelines with Kafka, Spark Streaming and Hive.
  • Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
  • Handled schema changes in data stream using Kafka.
  • Responsible for Kafka operation and monitoring, and handling of messages funneled through Kafka topics.
  • Coordinated Kafka operation and monitoring with DevOps personnel; worked on balancing the impact of Kafka producers and consumers on message (topic) consumption.
  • Designed and developed ETL workflows using Python and Scala for processing data in HDFS.
  • Collected, aggregated, and shuffled data from servers to HDFS using Apache Spark & Spark Streaming.
  • Worked on importing and exporting claims information between HDFS and RDBMS.
  • Created Hive external tables, loaded data into them, and queried the data using HQL.
  • Streamed the prepared data to HBase using Spark.
  • Tuned Spark Streaming performance, e.g., setting the right batch interval, the correct number of executors, and appropriate publishing and memory settings.
  • Used HBase connector for Spark.
  • Performed gradual cleansing and modeling of datasets.
  • Used avro-tools to build Avro schemas for creating external Hive tables with PySpark.
  • Created and managed external tables to store ORC and Parquet files using HQL.
  • Developed Apache Airflow DAGs to automate the pipeline.
  • Created a NoSQL HBase database to store the processed data from Apache Spark.

Environment: Snowflake Web UI, SnowSQL, Hadoop MapR 5.2, Hive, Hue, Azure, Control-M, AWS, Teradata Studio, Oracle 12c, Tableau, Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Spark MLlib

Confidential - San Diego, CA

Data Engineer/Hadoop Spark Developer


  • Extensively worked with Spark-SQL context to create data frames and datasets to preprocess the model data.
  • Data analysis: expertise in analyzing data using Pig scripts, Hive queries, Spark (Python), and Impala.
  • Used Hive to implement a data warehouse and stored data in HDFS on Hadoop clusters set up in AWS EMR.
  • Involved in designing the row key in HBase to store text and JSON as key values in an HBase table, designing the row key so that data can be retrieved (get/scan) in sorted order.
  • Wrote JUnit tests and integration test cases for microservices.
  • Worked in an Azure environment for development and deployment of custom Hadoop applications.
  • Worked heavily with Python, C++, Spark, SQL, Airflow, and Looker.
  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzing them by running Hive queries and Pig scripts.
  • Involved in various phases of development; analyzed and developed the system using the Agile Scrum methodology.
  • Responsible for data extraction and data ingestion from different data sources into the Hadoop data lake by creating ETL pipelines using Pig and Hive.
  • Built pipelines to move hashed and unhashed data from XML files to the data lake.
  • Developed a NiFi workflow to pick up multiple files from an FTP location and move them to HDFS on a daily basis.
  • Wrote templates for Azure infrastructure as code using Terraform to build staging and production environments. Integrated Azure Log Analytics with Azure VMs to monitor and store log files and track metrics, and used Terraform to manage different infrastructure resources: cloud, VMware, and Docker containers.
  • Scripting: expertise in Hive, Pig, Impala, shell scripting, Perl scripting, and Python.
  • Worked with developer teams on a NiFi workflow to pick up data from a REST API server, the data lake, and an SFTP server and send it to Kafka.
  • Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
  • Proven experience with ETL frameworks (Airflow, Luigi, and the open-source garcon).
  • Created Hive schemas using performance techniques like partitioning and bucketing.
  • Used Hadoop YARN to perform analytics on data in Hive.
  • Developed and maintained batch data flows using HiveQL and Unix scripting.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
  • Built large-scale data processing systems for data warehousing solutions and worked with unstructured data mining on NoSQL.
  • S3 data lake management: responsible for maintaining and handling inbound and outbound data requests through the big data platform.
  • Specified the cluster size, resource pool allocation, and Hadoop distribution by writing the specifications in JSON format.
  • Developed workflows in Oozie to manage and schedule jobs on the Hadoop cluster, triggering daily, weekly, and monthly batch cycles.
  • Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
  • Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Queried both Managed and External tables created by Hive using Impala.
  • Developed custom Hive UDFs and UDAFs in Java, set up JDBC connectivity with Hive, and developed and executed Pig scripts and Pig UDFs.

Environment: Hadoop, Microservices, Java, MapReduce, Agile, HBase, JSON, Spark, Kafka, JDBC, AWS (EMR/EC2/S3), Hive, Pig, Flume, ZooKeeper, Impala, Sqoop


Hadoop Developer


  • Used Sqoop to efficiently transfer data between databases and HDFS, and used Flume to stream log data from servers.
  • Implemented partitioning and bucketing in Hive for better organization of the data.
  • Worked with different file formats and compression techniques according to standards.
  • Loaded data from a UNIX system to HDFS.
  • Used UNIX shell scripts to automate the build process and to perform regular jobs like file transfers between different hosts.
  • Assigned to production support, which involved monitoring server and error logs, foreseeing and preventing potential problems, and escalating issues when necessary.
  • Documented technical specs, dataflow, data models, and class models using Confluence.
  • Documented requirements gathered from stakeholders.
  • Successfully loaded files to HDFS from Teradata and from HDFS into Hive.
  • Used ZooKeeper and Oozie for coordinating the cluster and scheduling workflows.
  • Involved in researching various available technologies, industry trends, and cutting-edge applications.
  • Performed data ingestion using Flume, with Kafka as the source and HDFS as the sink.
  • Performed storage capacity management, performance tuning, and benchmarking of clusters.

Environment: Hadoop, Zookeeper, Kafka, UNIX


Data Engineer


  • Created and executed Hadoop Ecosystem installation and document configuration scripts on Google Cloud Platform.
  • Transformed batch data from several tables containing hundreds of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into DataFrames using PySpark.
  • Developed a PySpark program that writes DataFrames to HDFS as Avro files.
  • Utilized Spark's parallel processing capabilities to ingest data.
  • Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
  • Developed a script that copies Avro-formatted data from HDFS to external tables in the raw layer.
  • Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
  • In charge of PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse.
  • Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
  • Configured files that allow Airflow to communicate with its PostgreSQL database.
  • Developed Airflow DAGs in Python by importing the Airflow libraries.
  • Used Airflow to schedule, automatically trigger, and execute the data ingestion pipeline.

Environment: Cloudera Manager, HDFS, Sqoop, Pig, Hive, Oozie, Spark SQL, Tableau, MySQL, Python, Kafka, Flume, Java, Scala, Git.