
Data Engineer Resume

Chicago, IL

SUMMARY:

  • Seeking to apply my technical and management skills to meet project targets and deliver strong performance.
  • Eager to bring innovative ideas, skills, and creativity to accomplishing projects.
  • 4+ years of overall IT industry experience, including Java, Hadoop administration, Big Data technologies, and multi-tiered web applications using Hadoop, Spark, Hive, HBase, Pig, Sqoop, and Kafka.
  • Strong technical and working knowledge of Core Java.
  • Strong working knowledge of database management systems, including MySQL (relational) and Cassandra (NoSQL).
  • Hands-on experience in programming languages such as C and C++ for building new applications.
  • Experience in managing and reviewing Hadoop log files.
  • Building a Data Quality framework, which consists of a common set of model components and patterns that can be extended to implement complex process controls and data quality measurements using Hadoop.
  • Implemented Spark Scala code for data validation in Hive.
  • Working experience on Hortonworks distribution and Cloudera Hadoop distribution versions CDH4 and CDH5 for executing the respective scripts.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
  • Experience in setting up HIVE, PIG, HBASE, and SQOOP on Linux Operating System.
  • Experience on Apache Oozie for scheduling and managing the Hadoop Jobs. Extensive experience with Amazon Web Services (AWS).
  • Work with Data Engineering Platform team to plan and deploy new Hadoop Environments and expand existing Hadoop clusters.
  • Created visuals and dashboards to report insights effectively.
  • Familiar with Git/SVN for software development version control.
  • Managing and scheduling Jobs on a Hadoop cluster using Airflow DAG.
  • Loaded data into Spark RDDs and performed in-memory computation to generate output responses.
  • Worked on extracting files from Cassandra through Sqoop and placing them in HDFS for processing.
  • Hands-on experience in loading data from UNIX file systems to HDFS.
  • Worked with Agile methodology; involved in daily Scrum meetings and sprint planning, using development process tools such as Jira.
  • Experience with procedures, functions, packages, views, materialized views, function-based indexes, triggers, dynamic SQL, and ad-hoc reporting using SQL.
  • Good experience working on Linux and Windows operating systems.
  • Provide high-level customer support to remote clients using a support e-ticketing system.
  • Collected, organized, and documented infrastructure project attributes, data, and project metrics.
  • Perform testing, install, configure, and troubleshoot various software programs. Write, modify, and maintain software documentation and specifications.
  • Processed data load requests, manually entered data, reconciled data conflicts, and created data extracts and reports.
  • Involved in all Software Development Life Cycle (SDLC) phases of the project from domain knowledge sharing, requirement analysis, system design, implementation and deployment.
  • Experience in configuring the Zookeeper to coordinate the servers in clusters and to maintain the data consistency.
  • Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage mechanism.
  • Knowledge on CSS and leveraging best practices, modifications of existing CSS files to enhance the user experience.
  • Knowledge of front-end web application development using Angular 2 with modern HTML5 and CSS3 techniques.
  • Bachelor’s and Master’s degrees in a Computer Science-related field, with equivalent work experience.
  • Excellent verbal and written communication skills.
  • Ability to meet deadlines in a fast-paced environment.
  • A strong passion for learning new things and solving problems.
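Several bullets above (real-time Kafka feeds processed with Spark Streaming and saved to HDFS as Parquet) follow one common pattern. A minimal sketch in Spark/Scala using Structured Streaming; the broker address, topic name, and HDFS paths are placeholders, not details from this resume:

```scala
import org.apache.spark.sql.SparkSession

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-parquet")
      .getOrCreate()

    // Read the real-time feed from Kafka as a streaming DataFrame.
    // Broker address and topic name are hypothetical.
    val feed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Kafka delivers key/value as binary; cast the payload to a string column.
    val rows = feed.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    // Persist the stream to HDFS in Parquet format, checkpointing for fault tolerance.
    rows.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
      .awaitTermination()
  }
}
```

The sketch uses the DataFrame-based Structured Streaming API; the older DStream/RDD route mentioned in the bullet achieves the same ingest-transform-persist flow with `KafkaUtils.createDirectStream`.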

TECHNICAL SKILLS:

Programming Languages: Java, Scala, Unix Shell Scripting

Big Data Ecosystem: HDFS, HBase, MapReduce, Hive, Pig, Spark, Kafka, Sqoop, Impala, Cassandra, Oozie, Zookeeper, Flume.

DBMS: Oracle 11g, MySQL

Modeling Tools: UML on Rational Rose 4.0

Web Technologies: AngularJS, HTML5, CSS3.

IDEs and Tools: Eclipse, NetBeans, WinSCP, Visual Studio, and IntelliJ.

Operating Systems: Windows, UNIX, Linux (Ubuntu), Solaris, CentOS.

Servers: Apache Tomcat

Frameworks: MVC, Maven, Ant.

PROFESSIONAL EXPERIENCE:

Confidential - Chicago IL

Data Engineer

Responsibilities:

  • Developed solutions to process data into HDFS (Hadoop Distributed File System), process within Hadoop and emit the summary results from Hadoop to downstream systems.
  • Managed and reviewed Hadoop log files.
  • Involved in loading data from UNIX/Linux file systems to HDFS.
  • Imported and exported data into HDFS, Pig, Hive, and HBase using Sqoop.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Developed a Spark Streaming application in Scala to accept messages from Kafka and store the data in Cassandra.
  • Set up secured Kafka clusters along with Zookeeper in cloud environment.
  • Developed Kafka consumers to consume data from Kafka topics.
  • Used different Spark modules such as Spark Core, Spark SQL, Spark Streaming, Spark Datasets, and DataFrames.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Worked on different file formats such as text files, Parquet, Sequence Files, Avro, Record Columnar (RC), and ORC files.
  • Experienced in implementing Spark RDD transformations, actions to implement business analysis.
  • Understood complex data structures of different types (structured and semi-structured) and de-normalized them for storage in Hadoop.
  • Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
  • Developed Kafka producer and consumer components for real time data processing.
  • Consumed data from the Kafka queue using Spark. Configured different topologies for the Spark cluster and deployed them on a regular basis.
  • Tested Apache Tez, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Used Maven for building and deployment purpose.

Environment: Hadoop, MapReduce, HDFS, Spark, AWS, Hive, Java, Scala, Kafka, SQL, Pig, Sqoop, HBase, Zookeeper, MySQL, Jenkins, Git, Agile.
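The Kafka-to-Cassandra path described in this role can be sketched with Structured Streaming and the spark-cassandra-connector (assumed to be on the classpath); the host, broker, topic, keyspace, and table names below are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-cassandra")
      .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
      .getOrCreate()

    // Consume messages from a Kafka topic as a streaming DataFrame.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "readings")                     // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS reading")

    // foreachBatch lets each micro-batch reuse the batch Cassandra sink.
    stream.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("org.apache.spark.sql.cassandra")
          .option("keyspace", "sensors") // hypothetical keyspace
          .option("table", "readings")   // hypothetical table
          .mode("append")
          .save()
      }
      .start()
      .awaitTermination()
  }
}
```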

Confidential, Chicago

Software Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • The project ingested data generated by car sensors; the data was collected into HDFS from online aggregators via Kafka.
  • Experience in creating Kafka producers and consumers for Spark Streaming, receiving data from the different learning systems.
  • Spark Streaming collects this data from Kafka in near-real-time and performs necessary transformations and aggregation on the fly to build the common learner data model.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Experience using AWS to spin up EMR clusters to process large datasets stored in S3 and push them to HDFS. Implemented automation and related integration technologies.
  • Implemented Spark SQL to access Hive tables from Spark for faster data processing.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Used Apache Oozie for scheduling and managing the Hadoop Jobs. Extensive experience with Amazon Web Services (AWS).
  • Developed and updated social media analytics dashboards on regular basis.
  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Worked on migrating Pig scripts and MapReduce programs to the Spark DataFrame API and Spark SQL to improve performance. Involved in moving log files generated from various sources to HDFS for further processing through Flume, and processed the files using Piggybank functions.
  • Used Flume to collect, aggregate, and store web log data from sources such as web servers, mobile devices, and network devices, pushing it into HDFS. Used Flume to stream log data from various sources.
  • Used the Avro file format compressed with Snappy in intermediate tables for faster processing, and the Parquet file format for published tables; created views on the tables.
  • Created Sentry policy files to grant business users access to the required databases and tables in Impala across the dev, test, and prod environments.

Environment: Hadoop, MapReduce, Cloudera, Spark, Kafka, HDFS, Hive, Pig, Oozie, Scala, Eclipse, Flume, Oracle, UNIX Shell Scripting.
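The Hive-to-Spark migration described above amounts to replacing HiveQL with equivalent DataFrame operations. A small illustrative sketch, assuming Hive support is enabled and that the table name `web_logs` and output path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-spark")
      .enableHiveSupport() // lets Spark read tables registered in the Hive metastore
      .getOrCreate()

    // The original HiveQL, runnable as-is through Spark SQL:
    val viaSql = spark.sql(
      "SELECT host, COUNT(*) AS hits FROM web_logs GROUP BY host")

    // The same query expressed with the DataFrame API:
    val viaApi = spark.table("web_logs")
      .groupBy("host")
      .agg(count("*").as("hits"))

    // Publish the result as Parquet (path is a placeholder).
    viaApi.write.mode("overwrite").parquet("hdfs:///out/hits")
  }
}
```

Both forms compile to the same logical plan; the DataFrame version is simply easier to compose and test incrementally than a monolithic query string.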

Confidential, Brentwood, TN

Hadoop Developer

Responsibilities:

  • Building a Data Quality framework, which consists of a common set of model components and patterns that can be extended to implement complex process controls and data quality measurements using Hadoop.
  • Experience working on Solr to develop search engine on unstructured data in HDFS.
  • Used Solr indexing to enable searches on non-primary-key columns in Cassandra keyspaces.
  • Created and populated bucketed tables in Hive to allow faster map-side joins, more efficient jobs, and more efficient sampling. Also partitioned data to optimize Hive queries.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Implemented DDL Curated Data Store logic using Spark Scala and DataFrame concepts.
  • Used Spark and Hive to implement the transformations needed to join daily ingested data to historic data.
  • Enhanced the performance of queries and daily Spark jobs through efficient design of partitioned Hive tables and Spark logic.
  • Implemented Spark Scala code for data validation in Hive.
  • Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
  • Implemented the automated workflows for all the jobs using the Oozie and shell script.
  • Used Spark SQL functions to move data from staging Hive tables to fact and dimension tables.
  • Implemented dynamic partitioning in hive tables and used appropriate file format, compression technique to improve the performance of map reduce jobs.
  • Worked with the Data Engineering Platform team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Good understanding of ETL tools and how they can be applied in a Big Data environment.
  • Collaborate with BI teams to create reporting data structures.
  • Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
  • Experienced with AWS cloud services such as EC2, EMR, RDS, and S3 to support big data tools, address data storage needs, and work on deployment solutions.

Environment: Spark, Scala, Hadoop, Hive, Sqoop, Oozie, Design Patterns, SOLID & DRY principles, SFTP, Code Cloud, Jira, Bash.
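The dynamic partitioning and bucketing work in this role can be sketched through Spark SQL with Hive support; the database, table, and column names below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object PartitionedHiveLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-hive-load")
      .enableHiveSupport()
      .getOrCreate()

    // Allow Hive to derive partition values from the data itself.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // A bucketed, partitioned table: partitions prune scans by date,
    // buckets enable map-side joins and efficient sampling on user_id.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS curated.events (
        user_id BIGINT,
        payload STRING)
      PARTITIONED BY (event_date STRING)
      CLUSTERED BY (user_id) INTO 32 BUCKETS
      STORED AS ORC
    """)

    // Dynamic-partition insert from a staging table: the partition column
    // is listed last in the SELECT, and Hive routes rows to partitions.
    spark.sql("""
      INSERT OVERWRITE TABLE curated.events PARTITION (event_date)
      SELECT user_id, payload, event_date FROM staging.events
    """)
  }
}
```

ORC with partitioning and bucketing is one choice; the same pattern applies with Parquet and a Snappy codec, as described in the earlier roles.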
