
Big Data Developer Resume


Piscataway, NJ

SUMMARY:

  • Extensive IT experience across a variety of industries working on Big Data technologies, including the Cloudera 4 and Hortonworks 2.6.5 distributions and Hadoop environments with Hadoop 2.8.3, Hive 1.2.2, Sqoop 1.4.7, MapReduce, HBase 2.0.0, Apache Spark 2.2.1, Impala, and Kafka 1.3.2.
  • Comfortable with installation and configuration of Hadoop Ecosystem Components.
  • Experience in integrating various data sources such as Oracle, SQL Server, flat files, and unstructured files into the data warehouse.
  • Hands-on experience with Hadoop architecture and its various components, such as the Hadoop Distributed File System (HDFS), Job Tracker, Task Tracker, NameNode, DataNode, and the Hadoop MapReduce programming paradigm.
  • Hands-on experience in developing and deploying enterprise-level applications using major Hadoop ecosystem components such as MapReduce, YARN, Hive, Pig, HBase, Flume, Sqoop, Spark Streaming, Spark SQL, Storm, Kafka, and Oozie.
  • Familiar with data architecture including data ingestion pipeline design, Hadoop architecture, data modeling, data mining, machine learning and advanced data processing. Experience optimizing ETL workflows.
  • Strong experience in Extraction, Transformation and Loading (ETL) of data from various sources into Data Warehouses and Data Marts using Informatica PowerCenter (Repository Manager, Designer, Workflow Manager, Workflow Monitor, Metadata Manager) and PowerConnect as ETL tools on Oracle, DB2, and SQL Server databases.
  • Developed Apache Spark applications for data processing projects handling data from various RDBMS (MySQL, Oracle Database) and streaming sources.
  • Extensive SQL experience in querying, data extraction, and data transformation.
  • Capable of processing large sets of structured, semi-structured and unstructured data.
  • Experience in handling various file formats like Avro, ORC, and Parquet files.
  • Experience with Spark Core, Spark SQL, and Spark Streaming for complex data transformations and processing, using Spark's in-memory computing capabilities in Scala. Worked with Spark to improve the performance and optimization of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN (a brief sketch of this style of Spark work follows this summary).
  • Extensive experience in developing MapReduce Jobs using Java and thorough understanding of MapReduce infrastructure framework.
  • Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java.
  • Experience in the installation, configuration, management, support, and monitoring of Hadoop clusters using distributions such as Apache Hadoop and Cloudera.
  • Worked on Build Management tools like SBT, Maven and Version control tools like Git.
  • Used IDEs and editors such as Eclipse, IntelliJ IDEA, EditPlus, and Notepad++ for development.
  • Expert in analyzing and reporting using tools like Tableau, IBM Cognos, Oracle BI, SSRS.
  • Good Knowledge in HTML, CSS, JavaScript and web-based applications.
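
The following is a minimal, hypothetical Scala sketch of the Spark SQL and pair-RDD style of work described in this summary; the input path, column names, and object name are illustrative assumptions rather than details from any specific engagement.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TransactionSummary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TransactionSummary")
      .getOrCreate()
    import spark.implicits._

    // Load a Parquet data set into a DataFrame and aggregate with Spark SQL functions.
    val txns = spark.read.parquet("hdfs:///data/transactions")   // hypothetical path
    val dailyTotals = txns
      .filter($"amount" > 0)
      .groupBy($"account_id", to_date($"event_ts").as("event_date"))
      .agg(sum($"amount").as("daily_total"))

    // The same aggregation expressed as a pair RDD, as mentioned in the summary.
    val byAccount = txns
      .select($"account_id", $"amount").as[(String, Double)]
      .rdd
      .reduceByKey(_ + _)

    dailyTotals.write.mode("overwrite").parquet("hdfs:///marts/daily_totals")
    byAccount.take(5).foreach(println)
    spark.stop()
  }
}
```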

TECHNICAL SKILLS:

Languages: Python 3.7.3, Java 1.8, Scala 2.12.8, SQL, R, C++, HTML5, CSS3, JavaScript

Cluster Mgmt. & Monitoring: Cloudera 5.7.6, Hortonworks Ambari 2.5

Hadoop Ecosystem: Hadoop 2.8.3, MapReduce v1 & v2, YARN, HDFS, HBase, Sqoop 1.4.7, Hive 1.2.2, Pig, Kafka

Apache Spark: Spark 2.3, Spark SQL, Spark Streaming, Spark with Scala, Spark with Python

Databases: MySQL, SQL Server, Oracle 11g, MS Access

Virtualization: VMware Workstation, AWS

NoSQL Databases: MongoDB, Cassandra

Visualization: R, Tableau, IBM Cognos, Oracle BI

Cloud Computing: Google Cloud

IDEs: Eclipse, NetBeans, GitHub, Maven, IntelliJ

Operating Systems: Unix, Linux, Windows

Versioning Systems: Git, SVN

Markup Languages: HTML5, CSS3, XML.

PROFESSIONAL EXPERIENCE:

Confidential, Piscataway, NJ

Big Data Developer

Responsibilities:

  • Performed data profiling to learn about user behavior and merged data from multiple data sources.
  • Implemented big data processing applications to collect, clean, and normalize large volumes of open data using Hadoop ecosystem components such as Pig, Hive, and HBase.
  • Designed and developed Hive tables to store staging and historical data.
  • Developed multiple MapReduce jobs in Hive for data cleaning and data pre-processing.
  • Created Hive tables as per requirements; internal and external tables were defined with appropriate static and dynamic partitions for efficiency. Improved the organization of the data using techniques like Hive partitioning and bucketing (see the Hive table sketch after this list).
  • Created and worked on Sqoop jobs with incremental load to populate Hive External tables.
  • Used the ORC file format with Snappy compression for optimized storage of Hive tables.
  • Solved performance issues in Hive and Pig scripts by understanding the behavior of joins, grouping, and aggregation, and ran them on the Impala processing engine.
  • Involved in migrating MapReduce jobs into Spark jobs and used SparkSQL and DataFrames API to load structured and semi-structured data into Spark clusters.
  • Used Hive for data analysis, with SparkContext and Spark SQL for optimizing the analysis; Spark RDDs were used to store data and perform in-memory computations.
  • Created Oozie workflows for Sqoop to migrate the data from source to HDFS and then to target tables. Developed Oozie workflow for scheduling and orchestrating the ETL process.
  • Developed Oozie workflow jobs to execute Hive, Pig, Sqoop and MapReduce actions.
  • Performed data cleaning and pre-processing to ensure the data was suitable to be fed into the machine learning model.
  • Used Apache Mahout for the recommendation algorithm, choosing the n-nearest-neighbors approach and the similarity method.
  • Applied Spark MLlib to train analytical models including linear regression, decision trees, logistic regression, and k-means. Evaluated the accuracy of the parallelized machine learning models and computed metrics using evaluators (a brief MLlib sketch follows this role's environment line).
  • Created a web-based user interface for creating, monitoring, and controlling data flows using Apache NiFi.
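
The Hive table design mentioned above could look roughly like the following Scala sketch, issued through Spark with Hive support enabled. Database, table, and column names are hypothetical; the bucketing mentioned above (a CLUSTERED BY ... INTO n BUCKETS clause) is noted in a comment rather than shown.

```scala
import org.apache.spark.sql.SparkSession

object HiveStagingTables {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveStagingTables")
      .enableHiveSupport()
      .getOrCreate()

    // External staging table over raw files landed by the ingestion jobs.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS staging.events_raw (
        |  event_id STRING, user_id STRING, payload STRING, event_dt STRING)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        |LOCATION '/data/staging/events'""".stripMargin)

    // Managed historical table, partitioned by date and stored as ORC with
    // Snappy compression (bucketing would be added in the Hive DDL with a
    // CLUSTERED BY ... INTO n BUCKETS clause).
    spark.sql(
      """CREATE TABLE IF NOT EXISTS warehouse.events_hist (
        |  event_id STRING, user_id STRING, payload STRING)
        |PARTITIONED BY (event_dt STRING)
        |STORED AS ORC
        |TBLPROPERTIES ('orc.compress' = 'SNAPPY')""".stripMargin)

    // Dynamic-partition insert from staging into the historical table.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql(
      """INSERT INTO TABLE warehouse.events_hist PARTITION (event_dt)
        |SELECT event_id, user_id, payload, event_dt FROM staging.events_raw""".stripMargin)

    spark.stop()
  }
}
```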

Environment: Apache Hadoop 2.8.3, HDFS, MapReduce, Sqoop, Flume, Pig, Hive 1.2.2, HBase, Oozie 4.2.0, Scala, Apache Spark 2.2.1, Kafka, Apache NiFi
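
For the Spark MLlib model training and evaluation described in the role above, the following is a minimal, hypothetical sketch of one of the listed model types (logistic regression) with a standard evaluator; the feature columns, label column, and input path are assumptions.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MlModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MlModelSketch").getOrCreate()

    // Hypothetical training set with numeric feature columns and a 0/1 label.
    val data = spark.read.parquet("hdfs:///marts/model_input")
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)

    // Area under ROC computed on held-out data with a standard evaluator.
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .evaluate(model.transform(test))
    println(s"Test AUC = $auc")
    spark.stop()
  }
}
```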

Big Data Developer

Confidential - Boston

Responsibilities:

  • Loaded data from various data sources into the Cassandra database and designed data models using Cassandra Query Language (CQL).
  • Experience in creating keyspaces, tables, and secondary indexes in Cassandra.
  • Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it for adding topics and partitions.
  • Generated consumer-group lag reports from Kafka using its API.
  • Designed and developed a system to collect data from multiple portals using Kafka, with Kafka acting as the IoT data producer.
  • Used the IoT data producer to generate IoT data in JSON format and wrote a custom Kafka serializer class to serialize the IoT data objects.
  • Developed a custom deserializer class that converts the IoT data JSON string back into an IoT data object.
  • Configured Spark Streaming to consume ongoing information from Kafka, creating a JavaStreamingContext from a SparkConf object with a batch Duration of five seconds (see the streaming sketch after this list).
  • Set the checkpoint directory in the streaming context and read the IoT data stream using the KafkaUtils.createDirectStream API.
  • Obtained a DStream of IoT data objects by applying a map transformation and created key-value pairs in Spark.
  • Developed Spark code in Scala and the Spark SQL environment for faster testing and processing of data, loading the data into Spark RDDs and performing in-memory computation to generate the output response with lower memory usage.
  • Used the DataStax Spark Cassandra Connector library to save data to the Cassandra database, and its API to save DStreams and RDDs to Cassandra.
  • Created a Tableau dashboard based on Cassandra data and used the Spark ODBC driver to integrate Cassandra and Spark.
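
A minimal Scala sketch of the Kafka-to-Spark-Streaming-to-Cassandra flow described above. The broker address, topic, checkpoint path, keyspace, table, and column names are hypothetical, and the custom IoT JSON serializer/deserializer and JavaStreamingContext mentioned above are simplified here to plain string values and the Scala StreamingContext.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

object IotStreamToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("IotStreamToCassandra")
      .set("spark.cassandra.connection.host", "cassandra-host")  // hypothetical host
    // Five-second micro-batches, as described in the bullets above.
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs:///checkpoints/iot")                    // hypothetical path

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "iot-consumer",
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Set("iot-events"), kafkaParams))

    // Map each record to a (device_id, payload) pair (the Kafka record key is
    // assumed to carry the device id) and write it to Cassandra with the
    // DataStax Spark Cassandra Connector.
    stream
      .map(record => (record.key, record.value))
      .saveToCassandra("iot", "events", SomeColumns("device_id", "payload"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```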

Environment: Apache Spark 2.2.1, Cassandra 2.2, Kafka 1.3.2, Scala 2.11.12, Maven, Tableau.

Confidential

Data Engineer

Responsibilities:

  • Built an ETL (extract, transform, and load) pipeline with stream processing from the source database (Oracle) to the data warehouse using Kafka.
  • Extracted data into Kafka using the JDBC connector, pulled data from Kafka topics, and created the Avro schema files and KStream objects.
  • Used the Kafka Streams API to transform data in KStream objects (see the sketch after this list).
  • Carried out ETL data processing on the Hadoop platform with Hive and Pig running on MapReduce.
  • Participated in the model-building team, including building an early-warning model using R and Python.
  • Independently responsible for analyzing and displaying data with Tableau; developed a Tableau dashboard with various complex worksheets to improve team efficiency around equipment quality data.
  • Involved in migrating MapReduce jobs into Spark jobs and used Spark SQL and the DataFrames API to load structured data into Spark clusters.
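
A minimal Scala sketch of the Kafka Streams transformation described above, using the Java Streams API. Topic names are hypothetical, and the Avro serialization mentioned above is simplified to string serdes.

```scala
import java.util.Properties

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{KStream, Predicate, ValueMapper}

object OrdersStreamEtl {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-etl")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // hypothetical broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()
    // Topic assumed to be populated by the Kafka Connect JDBC source connector.
    val source: KStream[String, String] = builder.stream("oracle.orders")

    // Drop empty records and apply a simple value transformation before
    // writing to the cleansed topic consumed by the warehouse loader.
    source
      .filter(new Predicate[String, String] {
        override def test(key: String, value: String): Boolean =
          value != null && value.nonEmpty
      })
      .mapValues(new ValueMapper[String, String] {
        override def apply(value: String): String = value.trim.toUpperCase
      })
      .to("orders.clean")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```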

Environment: Apache Hadoop 2.0, Apache Spark 2.2.1, MapReduce, Hive, Java API, Python 3, R, Oracle, SQL, Tableau, Kafka

Hadoop Developer

Confidential

Responsibilities:

  • Designed and configured the data lake with technologies from the Hadoop ecosystem, including but not limited to HDFS, YARN, Spark, Hive, Kafka, Flume, and Sqoop.
  • Used Apache Kafka for ingestion and Apache Spark Streaming for processing with the Spark APIs.
  • Installed, configured, and load-tested Kafka following best practices, with monitoring through Kafka Manager; set up customized Grafana dashboards using JMX metrics from the Kafka brokers.
  • Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS.
  • Designed solution to perform ETL tasks like data acquisition, data transformation, data cleaning and efficient data storage on HDFS.
  • Developed Sqoop scripts to import and export data between HDFS and MySQL Database.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Supported MapReduce programs running on the cluster.
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Worked on tuning the performance of Pig queries and was involved in developing Pig scripts for processing data.
  • Wrote Hive queries to transform the data into a tabular format and processed the results using HiveQL.
  • Developed Spark code using Scala and Spark Streaming for faster testing and processing of data.
  • Trained and validated a random-forest model for identifying fraud. Challenges included dealing with a highly imbalanced dataset of 175 frauds in a million transactions and capping the false-positive rate at 2% owing to manpower limitations (see the sketch after this list).
  • Performed k-means clustering, regression, and decision trees in R. Worked with Naive Bayes and skilled in random forests, decision trees, and linear and logistic regression.
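
For the fraud model described above, the following is a minimal, hypothetical Spark MLlib sketch of one way to handle the class imbalance: down-sampling the majority class before training a random forest and applying a custom probability threshold when flagging transactions. The input path, feature columns, sampling fraction, and threshold are illustrative assumptions; as noted above, parts of this modeling may equally have been done in R.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FraudModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FraudModelSketch").getOrCreate()
    import spark.implicits._

    val data = spark.read.parquet("hdfs:///marts/transactions_labeled") // hypothetical path
    // Down-sample the majority (non-fraud) class to soften the heavy imbalance.
    val fraud = data.filter($"label" === 1.0)
    val legit = data.filter($"label" === 0.0).sample(withReplacement = false, 0.01, seed = 42)
    val train = fraud.union(legit)

    val assembler = new VectorAssembler()
      .setInputCols(Array("amount", "merchant_risk", "hour_of_day"))
      .setOutputCol("features")
    val rf = new RandomForestClassifier()
      .setLabelCol("label").setFeaturesCol("features").setNumTrees(100)

    val model = rf.fit(assembler.transform(train))

    // Score all data and apply a custom probability threshold so that the
    // number of flagged transactions stays within the operational limit.
    val probOfFraud = udf((v: Vector) => v(1))
    val scored = model.transform(assembler.transform(data))
      .withColumn("fraud_score", probOfFraud($"probability"))
      .withColumn("flagged", ($"fraud_score" > 0.9).cast("int"))

    scored.groupBy("label", "flagged").count().show()
    spark.stop()
  }
}
```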

Environment: Apache Hadoop 2.0, Sqoop, HDFS, Pig, MapReduce, Hive, SQL, Kafka, R, Scala

Web Application Developer

Responsibilities:

  • Developed an internal audit application for the audit department to assist audit management, including annual risk management, risk calculation, audit scheduling, time tracking, and issue tracking.
  • Implemented using React.js and Bootstrap on frontend.
  • Followed the audit application design requirements to build a user-friendly layout using HTML and CSS.
  • Used JavaScript and jQuery to handle all events that are triggered by users, such as hover and click.
  • Database implementation using MongoDB.
  • Used Redux Developer Tools and reducers to maintain application state.
  • Used Git for version control.
  • Daily website maintenance and updating content.

Environment: HTML, CSS, JavaScript, jQuery, React.js, Node.js, Redux, Bootstrap, MongoDB
