
Sr Data Engineer Resume


New York City, NY

SUMMARY

  • Overall 8+ years of professional IT experience across multiple technologies, including the Hadoop Big Data ecosystem and Java/J2EE related technologies
  • 4+ years of experience in Hadoop ecosystem components like MapReduce, Spark, Hive, Spark Streaming, Spark SQL, Spark MLlib, Kafka, Pig, HBase, Cassandra, Zookeeper, Sqoop, Flume, Oozie
  • Good experience with various Hadoop distributions (Cloudera, Hortonworks)
  • Good understanding of Spark architecture and its components, with hands-on experience in Spark Streaming, Spark SQL and Spark Core
  • Performed various Spark RDD transformations and actions on large datasets.
  • Developed jobs in Spark SQL and Spark Streaming using DataFrames/Datasets and DStream RDDs
  • Experience in designing and developing Spark applications in Scala; also implemented Scala scripts and UDFs for data aggregation, queries, and writing data into HDFS through Sqoop
  • Configured Spark Streaming to receive real-time data from messaging platforms such as Apache Kafka and persist it to HDFS (a minimal sketch follows this summary)
  • Experience in configuring and monitoring Kafka clusters and connectors
  • Performed read and write operations on multi-gigabyte streaming datasets using Apache Kafka
  • Worked with the Kafka Producer and Consumer APIs to publish data to and consume data from Kafka topics
  • Integrated and leveraged newer data streaming tools such as Apache Flink
  • Implemented transformations on bounded and unbounded data streams using Flink's DataStream API
  • Strong experience writing MapReduce and Spark jobs in Scala, Java and Python using the Apache Hadoop, Spark and PySpark APIs for analyzing data
  • Knowledge of Flume for extracting clickstream data from web servers
  • Experience in building data ingestion pipelines using the Kafka, Flume and NiFi frameworks
  • Implemented NiFi dataflows and monitored and controlled streaming and batch processing in HDP 2.6.4
  • Worked on various NoSQL databases such as Cassandra, MongoDB, HBase
  • Designed data models in Cassandra and worked with CQL
  • Experience in creating keyspaces, tables and secondary indexes in Cassandra
  • Good knowledge of CQL; performed CRUD operations on Cassandra tables
  • Worked with Spark for parallel computation over Cassandra data using RDDs
  • Imported data from sources like AWS S3, local file system into Spark RDD
  • Designed and implemented test environments on AWS
  • Good experience in designing row keys and schemas for NoSQL databases like MongoDB
  • Managed the MongoDB lifecycle, including sizing, automation, monitoring and tuning
  • Designed and developed functionality to fetch JSON documents from MongoDB and export them to clients using a REST API
  • Experience with HBase for loading and retrieving data for real-time processing via its RESTful API
  • Created Hive tables per requirements and defined appropriate static and dynamic partitions for query efficiency
  • Very good experience writing HiveQL queries using partitioning, bucketing and windowing operations
  • Experience in exporting and importing data from Hive/HDFS to RDBMS using Sqoop
  • Worked on loading CSV/TXT/AVRO/PARQUET file formats for Hive Querying and Processing
  • Exposure to Talend for designing ETL jobs for data processing
  • Performed job scheduling and workflow management using Oozie
  • Developed Impala scripts for ad-hoc queries and created Tableau dashboards for the results
  • Used AWS cloud services, monitored them with CloudWatch and performed operations with Lambda
  • Developed modules in Applications using Java, J2EE, Spring, Hibernate frameworks and Web Services like REST, SOAP
  • Experience in using different SDLC facets like Agile, Waterfall models along with enterprise tools like JIRA, Confluence, Jenkins to develop projects
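
A minimal PySpark sketch of this kind of Kafka-to-HDFS streaming pipeline, for illustration only: the broker address, topic name and paths are placeholders, and it uses Structured Streaming, whereas the work summarized above was done in Scala with DStream-based Spark Streaming.

    from pyspark.sql import SparkSession

    # Requires the spark-sql-kafka connector package on the classpath.
    spark = (SparkSession.builder
             .appName("kafka-to-hdfs")
             .getOrCreate())

    # Subscribe to a Kafka topic of clickstream events (placeholder names).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "clickstream")
              .option("startingOffsets", "latest")
              .load()
              .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

    # Persist the raw stream to HDFS as Parquet, checkpointing offsets.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/clickstream/raw")
             .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
             .outputMode("append")
             .start())

    query.awaitTermination()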

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Apache Kafka, Apache Flink, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Mahout, MongoDB, Cassandra, Avro, Storm, Parquet and ORC

Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks (HDP 2.6), MapR and DSE

Languages: Java, Python, Scala, SQL, HTML, DHTML, JavaScript, XML, C/C++, CQL, HQL, Pig Latin and PySpark

NoSQL Databases: Cassandra, MongoDB and HBase

Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts

XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB

Development Methodology: Agile, Waterfall

Web Design Tools: HTML, DHTML, AJAX, JavaScript, jQuery and CSS, AngularJS, ExtJS and JSON

Development / Build Tools: Eclipse, Ant, Maven, IntelliJ, JUNIT and log4J

Frameworks: Struts, Spring and Hibernate

App/Web servers: WebSphere, WebLogic, JBoss and Tomcat

DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle

RDBMS: Oracle 9i, 10g, 11i, MS SQL Server, MySQL and DB2

Operating systems: UNIX, LINUX, Mac OS and Windows Variants

Data analytical tools: R and MATLAB

ETL Tools: Informatica, Talend

PROFESSIONAL EXPERIENCE

Confidential, New York City, NY

Sr Data Engineer

Responsibilities:

  • Our team was responsible for the build and support of vital technology solutions supporting scheduling, acquisition, processing, and distribution of Confidential content.
  • Responsible for detailed technical design, development and implementation of applications using Spark Core and Spark Streaming RDDs
  • Created RDDs in Spark and loaded data from the data warehouse into them
  • Configured Spark Streaming to receive real-time data from Kafka and persist the stream to HDFS using Scala
  • Initially used the Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping
  • Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames and RDDs
  • Created, altered and deleted Kafka topics as required, and tuned performance through partitioning and bucketing of Hive tables
  • Used a Kafka producer application to publish clickstream events into a Kafka topic and later explored the data with Spark SQL
  • Worked with the Producer API and created a custom partitioner to publish data to Kafka (see the producer sketch after this list)
  • Imported streaming logs through Kafka and aggregated the data into HDFS and relational databases such as MySQL and Oracle
  • Integrated and leveraged newer data streaming tools such as Apache Flink
  • Implemented transformations on bounded and unbounded data streams using Flink's DataStream API
  • Implemented NiFi dataflows and monitored and controlled streaming and batch processing in HDP 2.6.4
  • Worked entirely in Agile methodology and used the Rally scrum tool to track user stories and team performance
  • Converted SQL code to Spark code using Java, Spark SQL and Spark Streaming for faster testing and processing of data; imported and indexed data from HDFS for secure searching, reporting and analysis
  • Wrote Python scripts to call a Cassandra REST API, performed transformations and loaded the data into Spark
  • Performed CRUD operations on Cassandra using CQL; also created keyspaces, tables and secondary indexes (see the CQL sketch after this list)
  • Designed and implemented test environments on AWS
  • Used AWS cloud services, monitored them with CloudWatch and performed operations with Lambda
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets; used big data tooling to load large volumes of source files from S3 into Redshift
  • Hands-on expertise running Spark and Spark SQL on Amazon Elastic MapReduce (EMR), with good knowledge of cloud integration with EMR
  • Monitored and troubleshot Hadoop jobs using the YARN Resource Manager, and EMR job logs using Genie and Kibana
  • Created indexes for various statistical parameters in Elasticsearch and generated visualizations using Kibana
  • Used Spark and Spark SQL with the Scala API to read Parquet data and create tables in Hive
  • Worked on PySpark SQL jobs that fetch the non-null records from two different tables and load the results
  • Configured various property files like core-site.xml, hdfs-site.xml, mapred-site.xml based upon the job requirement.
  • Involved in configuring core-site.xml and mapred-site.xml for the multi-node cluster environment
  • Developed data pipelines using Flume, Sqoop, Pig and Java MapReduce to ingest claim data and financial histories into HDFS for analysis; used the Curator API on Elasticsearch for data backup and restore
  • Managed workflow and scheduling of complex MapReduce jobs using Apache Oozie
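
A minimal sketch of publishing clickstream events with a custom partitioner, as referenced in the list above. It uses the kafka-python client for illustration (the original work used the Kafka Producer API directly); the broker, topic and field names are placeholders.

    import json
    from kafka import KafkaProducer

    # Illustrative partitioner: route events for the same user to the same
    # partition. hash() is fine for a sketch, not for stable production use.
    def by_user(key_bytes, all_partitions, available_partitions):
        return all_partitions[hash(key_bytes) % len(all_partitions)]

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],                  # placeholder broker
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        partitioner=by_user,
    )

    event = {"user_id": "u123", "page": "/home", "ts": "2018-01-01T00:00:00Z"}
    producer.send("clickstream", key=event["user_id"], value=event)
    producer.flush()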
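
A minimal sketch of the Cassandra keyspace, table and secondary-index work mentioned above, using the DataStax Python driver; the contact point, keyspace, table and column names are placeholders.

    from datetime import datetime
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # placeholder contact point
    session = cluster.connect()

    # Keyspace with simple replication (illustrative settings).
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS clickstream
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.set_keyspace("clickstream")

    # Table keyed for per-user lookups, plus a secondary index on page.
    session.execute("""
        CREATE TABLE IF NOT EXISTS events (
            user_id    text,
            event_time timestamp,
            page       text,
            PRIMARY KEY (user_id, event_time)
        )
    """)
    session.execute("CREATE INDEX IF NOT EXISTS ON events (page)")

    # Basic CRUD through CQL.
    session.execute(
        "INSERT INTO events (user_id, event_time, page) VALUES (%s, %s, %s)",
        ("u123", datetime(2018, 1, 1), "/home"),
    )
    rows = session.execute("SELECT * FROM events WHERE user_id = %s", ("u123",))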

Environment: Spark Streaming, Hive, Spark, Spark SQL, Impala, Kafka, Pig, PySpark, HBase, Cassandra, Zookeeper, Sqoop, Flume, Oozie, Splunk, Elasticsearch, Python, Java, NiFi, Agile, HDP 2.6.4

Confidential, Louisville, KY

Data Engineer

Responsibilities:

  • Responsible for taking customer behavioral data, store- level consumer information, digital analytics, and a variety of other disparate data sources and generating descriptive and predictive models
  • Responsible for migrating the existing RDBMS system to Hadoop
  • Responsible for leading the development of programs to clean and organize data sets, using specific tools together with more general data cleaning and wrangling knowledge
  • Implemented a Hadoop cluster on Cloudera and assisted with its performance tuning, monitoring and troubleshooting
  • Developed a data pipeline using Sqoop and Java MR to ingest customer behavioral data and financial histories into HDFS for analysis
  • Used Hive partitioning and bucketing for performance optimization of Hive tables, creating around 20,000 partitions; imported and exported data between HDFS and Hive using Sqoop (a PySpark/HiveQL sketch follows this list)
  • Used Hive to analyze the partitioned and bucketed data and computed various metrics for reporting
  • Good experience developing Hive DDL to create, alter and drop Hive tables
  • Developed Hive UDFs for required functionality not available out of the box in Apache Hive
  • Involved in performance tuning to optimize jobs in Hive, Pig and HBase
  • Expert in importing and exporting terabytes of data into HDFS and Hive using Sqoop from traditional relational database systems
  • Worked on a file-optimization framework to convert CSV, JSON, XML and Avro files on S3 into Parquet, created external partitioned Hive tables on top of the Parquet S3 files, and automated the process (see the conversion sketch after this list)
  • Experience with indexing, replication, aggregation and ad-hoc queries in MongoDB
  • Imported and exported data into and out of MongoDB
  • Experienced in managing MongoDB while balancing availability, performance and scalability trade-offs
  • Involved in collecting and aggregating large amounts of data into HDFS using Flume and defined channel selectors to multiplex data into different sinks
  • Wrote Sqoop scripts to export and import data into HDFS and Hive
  • Developed Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS
  • Imported data from various sources such as HDFS and HBase into Spark RDDs
  • Load balancing of ETL processes, database performance tuning and capacity monitoring using Talend
  • Involved in pivoting HDFS data from rows to columns and columns to rows
  • Created Tableau reports on Hive data
  • Used Sqoop to import customer information data from SQL server database into HDFS for data processing
  • Loaded and transformed large sets of structured, semi structured data using Pig Scripts
  • Involved in developing Shell scripts to orchestrate execution of all other scripts (Pig, Hive and MapReduce) and move the data files within and outside of HDFS
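
A minimal PySpark sketch of the partitioned Hive table pattern referenced in the list above; the database, table and column names are placeholders, and bucketing (done in Hive with CLUSTERED BY ... INTO n BUCKETS) is omitted for brevity.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("CREATE DATABASE IF NOT EXISTS sales")

    # Partition the table by transaction date so date filters prune partitions.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales.transactions (
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (txn_date STRING)
        STORED AS ORC
    """)

    # Dynamic-partition load from a hypothetical staging table; the partition
    # column must come last in the SELECT list.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE sales.transactions PARTITION (txn_date)
        SELECT customer_id, amount, txn_date
        FROM sales.staging_transactions
    """)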
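
A minimal PySpark sketch of the file-format conversion described above: read CSV from S3, rewrite it as partitioned Parquet, and expose it through an external Hive table. Bucket names, paths and the schema are placeholders, and the cluster is assumed to have S3 (s3a) connectivity configured.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-to-parquet")
             .enableHiveSupport()
             .getOrCreate())

    # Read raw CSV from S3 (schema inferred here for brevity; assumes the
    # files carry claim_id, amount and load_date columns).
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("s3a://raw-bucket/claims/"))

    # Rewrite as Parquet, partitioned by load date.
    (raw.write
        .mode("overwrite")
        .partitionBy("load_date")
        .parquet("s3a://curated-bucket/claims_parquet/"))

    # External Hive table over the Parquet files; recover the partitions.
    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.claims (
            claim_id STRING,
            amount   DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS PARQUET
        LOCATION 's3a://curated-bucket/claims_parquet/'
    """)
    spark.sql("MSCK REPAIR TABLE analytics.claims")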

Environment: Hive, SQL, Sqoop, Flume, Mongo DB, Talend, Tableau, Scala, Python, Java, HBase, Pig, Java MR

Confidential, Livermore, CA

Hadoop Developer

Responsibilities:

  • Was involved in the requirement analysis, design, coding and implementation of Hadoop Cluster
  • Was responsible for business logic using Java, JavaScript and JDBC for querying the database
  • Gathered business requirements from the business partners and subject matter experts
  • Responsible for designing and development of Hive Data Model
  • Imported Bulk Data into HBase using HiveQL and MapReduce programs.
  • Developed Hive and Impala scripts on Avro and Parquet file formats.
  • Deployed Hive and HBase integration to perform OLAP operations on HBase data.
  • Processed flat files using Pig, loaded them into Hive and further converted them into fixed-width files
  • Responsible to manage data coming from different sources
  • Implemented Apache Solr for fast retrieval
  • Wrote Hive UDFs in Java and imported data into Hive
  • Analyzed large datasets in Hive Data Model by using Hive queries and produced results
  • Designed and developed Talend jobs to extract data from Oracle into MongoDB
  • Extensively used SQL, PL/SQL, triggers and views in IBM DB2
  • Hands on experience in importing the data from RDBMS to HDFS
  • Created partitioning, bucketing and Map side joins for performance optimization
  • Created Pig scripts and compared their development effort with equivalent Java implementations
  • Was responsible for technical documentation of Hadoop Clusters and how to execute Hive queries
  • Also performed real-time analytics on HBase using the Java API and the REST API (see the REST sketch after this list)
  • Wrote a Java program to retrieve data from HDFS and expose it through REST services
  • Installed the Oozie workflow engine to run multiple Hive jobs
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the Sqoop, Hive and Pig jobs that extract the data in a timely manner
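
A minimal sketch of a real-time read against the HBase REST gateway from Python, illustrating the REST-API access mentioned above; the host, table, row key and column are placeholders, and cell values come back base64-encoded.

    import base64
    import requests

    HBASE_REST = "http://hbase-rest-host:8080"     # placeholder REST gateway

    def get_cell(table, row_key, column):
        """Fetch the latest value of one cell for a row via the HBase REST API."""
        url = "{}/{}/{}/{}".format(HBASE_REST, table, row_key, column)
        resp = requests.get(url, headers={"Accept": "application/json"})
        resp.raise_for_status()
        cell = resp.json()["Row"][0]["Cell"][0]
        return base64.b64decode(cell["$"]).decode("utf-8")

    # Example: read cf:price for a product row (illustrative names).
    print(get_cell("products", "sku-12345", "cf:price"))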

Environment: Hive, Java, Pig, MapReduce, HBase, HDFS, Oozie, REST API, Oracle, MongoDB, SQL, PL/SQL, Apache Solr

Confidential, Columbia, MO

Hadoop Developer

Responsibilities:

  • Was responsible for building MapReduce programs in Java for data cleaning and preprocessing (an illustrative Python streaming-mapper sketch follows this list)
  • Was responsible for continuous monitoring and managing the Hadoop Cluster using Cloudera Manager
  • Responsible to manage data coming from different sources
  • Created reports for the BI team using Sqoop to export data into HDFS and Hive
  • Involved in loading data from the Unix file system to HDFS
  • Written shell scripts to monitor the health checks of Hadoop Daemon Services and respond accordingly
  • Imported and exported data into HDFS and Hive using Sqoop
  • Created HBase tables to store variable data formats coming from different portfolios
  • Performed real-time analytics on HBase using the Java API and REST API
  • Was involved in the review of functional and non- functional requirements
  • Was involved in creating entity-relationship (ER) diagrams for the relational database
  • Monitoring the Hadoop clusters using Cloudera Manager
  • Managing and scheduling jobs on Hadoop Cluster
  • Extracted feeds from social media platforms such as Facebook and Twitter using Python scripts
  • Strong experience with J2SE, XML, web services, WSDL, SOAP and TCP/IP
  • Also have hands-on experience with JSP, Servlets, JDBC, Struts, Maven, JUnit and SQL
  • Worked with various database systems such as Oracle 8i, Oracle 9i and DB2
  • Have experience with WebLogic Application Server, WebSphere Application Server and J2EE application deployment technology
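
The data-cleaning MapReduce programs here were written in Java; as an illustrative Python stand-in, a Hadoop Streaming mapper that drops malformed records and normalizes fields could look like the sketch below (the record layout is a placeholder). It would be submitted with the hadoop-streaming jar via its -input, -output and -mapper options.

    #!/usr/bin/env python
    # Hadoop Streaming mapper: reads raw CSV lines on stdin and emits cleaned
    # tab-separated records on stdout. Malformed rows are skipped.
    import sys

    EXPECTED_FIELDS = 5          # placeholder record width

    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) != EXPECTED_FIELDS:
            continue                          # drop malformed records
        cleaned = [f.strip().lower() for f in fields]
        if not cleaned[0]:                    # require a non-empty key field
            continue
        print("\t".join(cleaned))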

Environment: MapReduce, Java, Python, Hive, Pig, Apache Solr, XML, Web Services, SOAP, TCP/IP, Oracle 8i, HBase, Oracle 9i, WebLogic Application Server

Confidential

Java developer

Responsibilities:

  • Was responsible for the design and development of an MVC 2 (Model-View-Controller) architecture using the Front Controller design pattern
  • Used Core Java concepts to design the applications
  • Used JDBC to connect to databases such as Oracle and SQL Server 2005
  • Wrote servlets to generate dynamic HTML pages
  • Wrote SQL queries to retrieve and insert data into multiple database schemas
  • Developed XML schemas and web services for data maintenance and structures; wrote JUnit test cases for unit testing of classes
  • Used DOM and DOM Functions
  • Debugged the application using Firebug to traverse the documents
  • Provided technical support for production environments by resolving issues, analyzing defects, and providing and implementing solutions
  • Created database program in SQL server to manipulate data accumulated by internet transactions.
  • Involved in writing SQL Queries, Stored Procedures and used JDBC for database connectivity with MySQL Server

Environment: Java 7, Oracle, SQL, XML, JUnit, DOM, SQL Server 2005, Eclipse

Confidential

Java Developer

Responsibilities:

  • Used exception handling and multi-threading for optimum performance of the application
  • Responsible for developing and maintaining the necessary Java Components, Enterprise Java Beans, Servlets
  • Developed applications using Java object-oriented concepts such as inheritance, polymorphism, multi-threading and the Collections classes
  • Implemented JavaScript for client-side validations
  • Involved in the analysis, design, development and testing phases of the SDLC using an agile development methodology
  • Developed the application under the J2EE architecture and designed dynamic, browser-compatible user interfaces using JSP, Custom Tags, HTML, CSS and JavaScript
  • Was responsible for defects allocation and ensuring the defects are resolved
  • Used JDBC for database connectivity in web applications
  • Developed complex SQL queries, PL/ SQL stored procedures and functions
  • Used Eclipse IDE for the development and debugging

Environment: Java 7, JavaScript, HTML, CSS, Servlets, J2EE, SQL, Eclipse, Agile Methodologies
