Sr Big Data Engineer Resume

Schaumburg, IL

PROFESSIONAL SUMMARY:

  • 4+ years of experience in Hadoop components like MapReduce, Flume, Kafka, Pig, Hive, Spark, HBase, Oozie, Sqoop and Zookeeper.
  • Experience working with different Hadoop distributions such as CDH and Hortonworks. Good knowledge of the MapR distribution and Amazon EMR.
  • Experienced in using Spark to improve the performance of, and optimize, existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Experience developing Pig Latin and HiveQL scripts for data analysis and ETL, and extending the default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom, data-specific processing.
  • Strong knowledge of the architecture of distributed systems and parallel processing; in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Good experience in creating data ingestion pipelines, data transformations, data management, data governance and real time streaming at an enterprise level.
  • Solid experience in using the various file formats like CSV, TSV, Parquet, ORC, JSON and AVRO.
  • Experience working with data lakes: repositories of data stored in its natural/raw format, usually as object blobs or files.
  • Excellent understanding of Data Ingestion, Transformation and Filtering.
  • Provided output for multiple stakeholders at the same time.
  • Coordinated with the Machine Learning team to perform data visualization using Cognos TM1, Power BI and Tableau.
  • Developed Spark and Scala applications for performing event enrichment, data aggregation, and de-normalization for different stakeholders.
  • Designed new data pipelines and made the existing data pipelines more efficient.
  • Expert in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.
  • In depth understanding of Hadoop Architecture and its various components such as YARN, Resource Manager, Application Master, Name Node, Data Node, HBase design principles etc.
  • Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
  • Experience with migrating data to and from RDBMS and unstructured sources into HDFS using Sqoop.
  • Experience with job workflow scheduling and monitoring tools like Oozie, and good knowledge of Zookeeper for coordinating the servers in clusters and maintaining data consistency.
  • Profound understanding of Partitions and Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
  • Worked on NoSQL databases like HBase, Cassandra and MongoDB.
  • Experienced with performing CRUD operations using the HBase Java Client API and Solr API.
  • Good experience working with cloud environments like Amazon Web Services (AWS) EC2 and S3.
  • Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins, and AWS.
  • Experience writing Shell scripts in Linux OS and integrating them with other solutions.
  • Strong Experience in working with Databases like Oracle 10g, DB2, SQL Server 2008 and MySQL and proficiency in writing complex SQL queries.
  • Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
  • Hands-on experience on fetching the live stream data from DB2 to HBase table using Spark Streaming and Apache Kafka.
  • Good knowledge of creating data pipelines in Spark using Scala.
  • Experience in developing Spark Programs for Batch and Real-Time Processing. Developed Spark Streaming applications for Real Time Processing.
  • Good knowledge on Spark components like Spark SQL, MLlib, Spark Streaming and GraphX.
  • Expertise in integrating data from multiple data sources using Kafka.
  • Knowledge of unifying data platforms using Kafka producers/consumers and implementing pre-processing using Storm topologies.
  • Experience with Kafka deployment and integration with Oracle databases.
  • Experience with data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Kafka.
  • Experienced in moving data from Hive tables into Cassandra for real-time analytics, and in using Cassandra Query Language (CQL) to perform analytics on time series data.
  • Good knowledge of custom UDFs in Hive and Pig for data filtering.
  • Experience with Apache NiFi and with integrating Apache NiFi and Apache Kafka.
  • Hands-on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS.
  • Excellent communication, interpersonal and analytical skills. Also a highly motivated team player with the ability to work independently.
  • Working experience with Linux distributions like Red Hat and CentOS.
  • Experience with ETL concepts using Informatica PowerCenter and Ab Initio.
  • Good knowledge of AWS components like EC2 instances, S3 and EMR.
  • Comprehensive knowledge of Software Development Life Cycle (SDLC).
  • Exposure to Waterfall, Agile and Scrum models.
  • Highly adept at promptly and thoroughly mastering new technologies with a keen awareness of new industry developments and the evolution of next generation programming solutions.

SKILL SET:

Big Data Space: Hadoop, MapReduce, Pig, Hive, HBase, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Elasticsearch, Solr, MongoDB, Cassandra, Avro, Storm, Parquet, Snappy, AWS

Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, Amazon EMR

Databases & warehouses: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, Teradata.

Java Space: Core Java, J2EE, JDBC, JNDI, JSP, EJB, Struts, Spring Boot, REST, SOAP, JMS

Languages: Python, Java, JRuby, SQL, PL/SQL, Scala, JavaScript, Shell Scripts, C/C++

Web Technologies: HTML, CSS, JavaScript, AJAX, JSP, DOM, XML, XSLT

IDE: Eclipse, NetBeans, JDeveloper, IntelliJ IDEA.

Operating systems: UNIX, Linux, Mac OS, Windows variants

RDBMS: Teradata, Oracle 9i/10g/11g, MS SQL Server, MySQL, DB2.

Version controls: GIT, SVN, CVS

ETL Tools: Informatica, Ab Initio, Talend

Reporting: Cognos TM1, Tableau, SAP BO, Power BI

PROFESSIONAL EXPERIENCE:

Confidential, Schaumburg, IL

Sr Big Data Engineer

Responsibilities:

  • Designed and deployed a Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Oozie, Sqoop, Kafka, and Spark with the Cloudera distribution.
  • Worked on the Cloudera distribution and deployed it on AWS EC2 instances.
  • Hands-on experience using the Cloudera Hue GUI to import data.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Performed Data Ingestion from multiple internal clients using Apache Kafka.
  • Worked on integrating Apache Kafka with Spark Streaming process to consume data from external REST APIs and run custom functions.
  • Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
  • Developed Spark scripts using Scala shell commands as per the requirements.
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
  • Implemented a real-time system with Kafka and Zookeeper.
  • Configured Spark Streaming to receive real-time data from Kafka and store it in HDFS.
  • Scheduled Oozie workflows to run multiple Hive and Pig jobs.
  • Involved in running Hadoop streaming jobs to process terabytes of text data. Worked with different file formats such as Text, Sequence files, Avro, ORC and Parquet.
  • Configured, supported and maintained all network, firewall, storage, load balancers, operating systems, and software in AWS EC2.
  • Implemented the use of Amazon EMR for Big Data processing on a Hadoop cluster of virtual servers backed by Amazon EC2 and S3.
  • Worked on custom Pig loaders and storage classes to work with a variety of data formats such as JSON and XML.
  • Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Created Hive tables and wrote Hive queries for data analysis to meet business requirements; used Sqoop to import and export data from Oracle and MySQL.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala.
  • Good knowledge in using Data Manipulations, Tombstones, Compactions in Cassandra. Well experienced in avoiding faulty Writes and Reads in Cassandra.
  • Performed data analysis with Cassandra using Hive External tables.
  • Designed the Column families in Cassandra.
  • Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.
  • Used the Spark API over Hadoop YARN as the execution engine for data analytics using Hive.
  • Implemented the YARN Capacity Scheduler in various environments and tuned configurations according to per-application job loads.
  • Configured a Continuous Integration system to execute suites of automated tests at desired frequencies using Jenkins, Maven and Git.
  • Involved in loading data from LINUX file system to HDFS.
  • Followed Agile Methodologies while working on the project.

Environment: Hadoop, HDFS, Hive, Spark, Cloudera, AWS EC2, S3, EMR, Sqoop, Kafka, YARN, Shell Scripting, Scala, Pig, Cassandra, Oozie, Agile methods, MySQL

Confidential, Minneapolis, MN

Sr Data Engineer

Responsibilities:

  • Developed applications on the Cloudera distribution.
  • As a Hadoop Developer, managed the data pipelines and data lake.
  • Worked with the Snowflake data warehouse.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data
  • Designed custom Spark REPL application to handle similar datasets
  • Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation
  • Performed Hive test queries on local sample files and HDFS files
  • Used AWS services like EC2 and S3 for small data sets.
  • Developed the application on Eclipse IDE
  • Developed Hive queries to analyze data and generate results
  • Used Spark Streaming to divide streaming data into batches as an input to the Spark engine for batch processing.
  • Worked on analyzing the Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Spark and Sqoop.
  • Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
  • Used Scala to write code for all Spark use cases.
  • Analyzed user request patterns and implemented various performance optimization measures, including partitions and buckets in Hive tables
  • Assigned a name to each of the columns using the case class option in Scala.
  • Developed multiple Spark SQL jobs for data cleaning
  • Created Hive tables and worked on them using HiveQL
  • Assisted in loading large sets of data (structured, semi-structured, and unstructured) to HDFS
  • Developed Spark SQL jobs to load tables into HDFS and run select queries on top of them.
  • Developed an analytical component using Scala, Spark and Spark Streaming.
  • Used visualization tools such as Power View for Excel and Tableau for visualizing data and generating reports.
  • Worked on the NoSQL databases HBase and MongoDB.

Environment: Hadoop, Hive, Oozie, Java, Linux, Maven, Oracle 11g/10g, Zookeeper, MySQL, Spark.

Confidential, Northbrook, IL

Data Engineer

Responsibilities:

  • Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis
  • Migrated existing MapReduce programs to Spark models using Python.
  • Migrated data from the data lake (Hive) into an S3 bucket.
  • Validated data between the data lake and the S3 bucket.
  • Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
  • Designed batch processing jobs using Apache Spark to increase speeds by ten-fold compared to that of MR jobs.
  • Used Kafka for real time data ingestion.
  • Created different topics for reading data in Kafka
  • Read data from different topics in Kafka.
  • Moved data from the S3 bucket to the Snowflake data warehouse for generating reports.
  • Wrote Hive queries for data analysis to meet the business requirements
  • Migrated an existing on-premises application to AWS.
  • Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting
  • Created many UDFs and UDAFs for functions that were not already available in Hive and Spark SQL.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Good knowledge of Spark platform parameters like memory, cores and executors
  • Using the Zookeeper implementation in the cluster, provided concurrent access to Hive tables with shared and exclusive locking

Environment: Linux, Apache Hadoop Framework, HDFS, YARN, Hive, HBase, AWS (S3, EMR), Scala, Spark, Sqoop

Confidential

Data Analyst

Responsibilities:

  • Documented the complete process flow to describe program development, logic, coding, testing, implementation, and application integration.
  • Recommended structural changes and enhancements to systems and databases.
  • Conducted Design reviews and Technical reviews with other project stakeholders.
  • Took part in the complete life cycle of the project, from requirements to production support
  • Created test plan documents for all back-end database modules
  • Used MS Excel, MS Access, and SQL to write and run various queries.
  • Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
  • Worked with internal architects and assisted in the development of current and target state data architectures.
  • Coordinated with business users to design effective and efficient solutions for new reporting needs based on the existing functionality.
  • Remained knowledgeable in all areas of business operations to identify system needs and requirements.

Environment: SQL, SQL Server, MS Office, and MS Visio
