
Big Data Engineer Resume


Detroit, MI

SUMMARY:

  • Seven years of work experience as a software developer in the IT industry.
  • Strong programming experience in Java, Python, and Scala.
  • Extensive experience with Big Data technologies and Hadoop ecosystem components such as Spark, MapReduce, Hive, Pig, YARN, HDFS, Oozie, Sqoop, Flume, and Kafka, and with NoSQL systems such as Cassandra, Couchbase Server, and Elasticsearch.
  • Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce framework and the Spark execution framework.
  • Good experience creating data ingestion pipelines, data transformations, data management and data governance processes, and real-time streaming engines at an enterprise level.
  • Very good experience building real-time data streaming solutions using Apache Spark/Spark Streaming.
  • Hands-on experience deploying Spark applications to production on multi-node distributed clusters.
  • Expertise in writing end-to-end batch data processing jobs to analyze data using MapReduce and Spark.
  • Strong hands-on experience with Confluent's Kafka APIs, such as the Kafka Connect API and the Kafka Streams API; experienced in deploying Kafka applications to production.
  • Experience integrating Flume and Kafka to move data from sources to sinks in real time.
  • Experience working with Hadoop distributions such as Cloudera and Hortonworks.
  • Experience working with Kerberos-integrated Hadoop clusters.
  • Hands-on experience in Hive data modeling, with a very good understanding of partitioning and bucketing concepts; designed both managed and external Hive tables to optimize performance (a DDL sketch follows this summary).
  • Extensive experience importing data from and exporting data to RDBMSs within the Hadoop ecosystem using Apache Sqoop.
  • Experience building, deploying, and integrating applications with Maven.
  • Hands-on experience developing Oozie workflows that execute MapReduce, Sqoop, Flume, Hive, and Pig scripts.
  • Experience building distributed cache systems using Couchbase Server.
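
A minimal DDL sketch of the managed-versus-external distinction noted above, issued through a Hive-enabled PySpark session; the database, table, column, and path names are hypothetical, and a bucketed variant would additionally carry a CLUSTERED BY ... INTO n BUCKETS clause.

    # Hypothetical DDL: external vs. managed partitioned Hive tables stored as ORC.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-ddl-sketch")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

    # External table: Hive tracks only metadata; dropping it leaves the HDFS files in place.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders_ext (
            order_id    BIGINT,
            customer_id BIGINT,
            amount      DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        STORED AS ORC
        LOCATION 'hdfs:///data/analytics/orders'
    """)

    # Managed table: Hive owns the data; dropping the table also deletes its files.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.orders_managed (
            order_id    BIGINT,
            customer_id BIGINT,
            amount      DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        STORED AS ORC
    """)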

TECHNICAL SKILLS:

Big Data Ecosystem Components: HDFS, MapReduce, Hive, Pig, ZooKeeper, YARN, Spark, Kafka, Sqoop, Cassandra, Hue, Oozie, Apache Flume

Programming Languages: C, Java, PL/SQL, Scala, HiveQL, Pig Latin

Scripting Languages: PHP, Python, Shell

Distribution Platform: Cloudera, Hortonworks HDP

UI/UX Technologies: HTML, CSS3, JavaScript, jQuery, AngularJS, Bootstrap, JSON, XML

Data Visualization: Tableau Desktop, Tableau Server, Microsoft Power BI

Databases: Couchbase Server, Oracle 9i, Cassandra, Elasticsearch

IDE & Build Tools: Eclipse, IntelliJ IDEA, Maven

Version Control Systems: Bitbucket, TFS

PROFESSIONAL EXPERIENCE:

Confidential, Detroit MI

Big Data Engineer

Responsibilities:

  • Implemented data ingestion pipelines to import data from various sources into HDFS and S3 buckets.
  • Set up Attunity tasks to ingest data from SQL Server into the raw access layer of HDFS in near real time with low latency.
  • Configured Apache NiFi flows to load data from non-relational data sources into the raw access layer of HDFS.
  • Created managed and external Hive tables in the analytics layer of HDFS.
  • Worked with Hadoop file formats such as ORC, Parquet, and Avro.
  • Used Spark SQL and the DataFrame API for data processing (see the sketch after this list).
  • Performance-tuned Hive queries.
  • Built Spark applications in Apache Zeppelin on Spark 2.0.
  • Implemented Oozie workflows to schedule jobs.
  • Integrated Collibra with the data lake using the Collibra Connect API.
  • Implemented shell scripts to automate ad hoc activities.
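
A sketch of the raw-to-analytics step referenced above, assuming a Hive-enabled PySpark session; the paths, columns, and table name are illustrative, not taken from the project.

    # Illustrative raw-layer to analytics-layer job using Spark SQL / DataFrame API.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("raw-to-analytics")
             .enableHiveSupport()
             .getOrCreate())

    # Raw access layer landed by Attunity/NiFi; format and path are assumptions.
    raw = spark.read.parquet("hdfs:///data/raw/transactions")

    # Basic cleansing and enrichment with the DataFrame API.
    curated = (raw
               .filter(F.col("txn_id").isNotNull())
               .withColumn("txn_date", F.to_date("txn_ts"))
               .dropDuplicates(["txn_id"]))

    # Write ORC into the analytics layer, partitioned by date.
    (curated.write
     .mode("overwrite")
     .format("orc")
     .partitionBy("txn_date")
     .save("hdfs:///data/analytics/transactions"))

    # Register new partitions with the external Hive table defined over that location.
    spark.sql("MSCK REPAIR TABLE analytics.transactions")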

Environment: Hortonworks HDP 2.6, AWS, PySpark, Attunity, Apache NiFi, Hive, Oozie, Python, Hadoop, HDFS, Shell Scripting, Collibra Connect

Data Quality

Responsibilities:

  • Created standardized rules for data quality dimensions including completeness, accuracy, integrity, and consistency.
  • Profiled structured, unstructured, and semi-structured data across various sources to identify patterns.
  • Implemented data quality rules in the query or scripting language suited to each source, e.g. T-SQL and HiveQL for structured data and Python for semi-structured data (a rule sketch follows this list).
  • Worked with the team to design and develop a solution architecture in SQL Server, using SSIS and ETL processes to automate the end-to-end data quality framework.
  • Wrote ETL jobs and stored procedures to support the data quality architecture built in SQL Server.
  • Collaborated with team members in creating a Data Quality Playbook.
  • Created data quality dashboards in Microsoft Power BI to visualize data quality results.
  • Ran proofs of concept with data profiling tools such as Pitney Bowes, Control force, and Informatica to evaluate the best fit for the organization's use cases.
  • Integrated the data quality architecture with Collibra, a metadata management tool, using the Collibra Connect API.
  • Followed agile methodologies.
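
A small sketch of how such rules might look in Python for a semi-structured source; the field names, file path, and validation pattern are hypothetical and only illustrate the completeness and accuracy dimensions.

    # Hypothetical data quality rules for a semi-structured (JSON) source.
    import json
    import re

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def completeness(records, field):
        """Percent of records where the field is present and non-empty."""
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        return 100.0 * filled / len(records) if records else 0.0

    def accuracy_email(records, field="email"):
        """Percent of populated values matching a simple e-mail pattern."""
        values = [r.get(field) for r in records if r.get(field)]
        valid = sum(1 for v in values if EMAIL_RE.match(v))
        return 100.0 * valid / len(values) if values else 0.0

    if __name__ == "__main__":
        with open("customers.json") as fh:   # assumed sample extract
            records = json.load(fh)
        print({
            "completeness.email": completeness(records, "email"),
            "accuracy.email": accuracy_email(records),
        })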

Environment: SQL Server, T-SQL, HiveQL, Python, Microsoft Power BI, Informatica, Pitney Bowes, TFS, DynamoDB, JSON, XML, RESTful API

Confidential, Bentonville AR

Big Data Engineer

Responsibilities:

  • Designed and developed Spark Streaming applications that consume data from Kafka, perform enrichment and validation, and load the results into an HDFS sink (see the sketch after this list).
  • Performance-tuned the Spark Streaming applications.
  • Designed and developed a shell script that merges the small files produced by Spark Streaming into files of at least 128 MB, improving the performance of all jobs running on top of the data.
  • Designed and developed an Elasticsearch connector using the Kafka Connect API, with Kafka as the source and Elasticsearch as the sink.
  • Designed and developed a Cassandra sink connector using the Kafka Connect API, with Kafka as the source and Cassandra as the sink.
  • Instrumented code with metrics and integrated them with Medusa, a visualization tool.
  • Extensively used Spark SQL and the DataFrame API in building Spark applications.
  • Helped set up the Kafka Streams framework that forms the core of the enterprise inventory system.
  • Designed and built the item-relationship cache in Couchbase.
  • Developed a Java-based application using the Couchbase Java Client API to load data from a remote server into Couchbase, which is consumed by many other applications for various purposes.
  • Configured Flume with multiple Kafka topics as sources, a Kafka channel, and HDFS as the sink.
  • Developed a MapReduce job that runs over encoded, encrypted JSON messages; it decodes, decrypts, and parses each message, extracts the necessary fields, and stores them as text files.
  • Created partitioned external Hive tables and loaded data into them.
  • Modeled and implemented partitioned Hive tables in ORC, Parquet, and text formats based on the consuming applications.
  • Developed an Oozie coordinator workflow that runs hourly and executes MapReduce and Hive actions.
  • Familiar with file formats such as Avro, Parquet, and SequenceFile, chosen based on the scenario.
  • Developed shell scripts that purge data based on the retention period.
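
A condensed sketch of the Kafka-to-HDFS flow described in the first bullet, written here with PySpark Structured Streaming rather than the DStream API; the broker, topic, schema, and paths are assumptions.

    # Illustrative Kafka -> enrich/validate -> HDFS streaming job.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    schema = StructType([
        StructField("item_id", StringType()),
        StructField("store_id", StringType()),
        StructField("qty", LongType()),
    ])

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
              .option("subscribe", "inventory-events")             # assumed topic
              .load())

    parsed = (events
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*")
              .filter(F.col("item_id").isNotNull()))               # basic validation

    # A longer trigger interval also keeps output files larger, which eases the
    # small-file merging concern mentioned above.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/inventory/events")
             .option("checkpointLocation", "hdfs:///checkpoints/inventory")
             .trigger(processingTime="1 minute")
             .start())

    query.awaitTermination()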

Environment: Hortonworks HDP 2.7.4, Spark Streaming, Spark SQL, HDFS, MapReduce, Hive, Flume, Oozie, Kafka, Couchbase, Elasticsearch, Cassandra, Java, Scala, Bitbucket, IntelliJ IDEA

Confidential

Software Developer

Responsibilities:

  • Developed the GUI of the system using JSP and HTML.
  • Used the Struts framework in conjunction with JSP and tag libraries to develop user interfaces.
  • Developed session beans for necessary transactions, such as fetching the required data.
  • Designed the database according to the specific requirements given by our professor to meet the needs of the project.
  • Implemented the approved database design using Oracle.
  • Wrote DDL and DML statements and other programs required to bring the database live.
  • Accessed the database from the front end using the JDBC API.
  • Wrote PL/SQL procedures, packages, and triggers whenever required.

Environment: Java, SQL, PL/SQL, Eclipse, Oracle 9i, PL/SQL Developer

Confidential

Systems Analyst

Responsibilities:

  • Understood business needs, analyzed functional specifications, mapped them to Hadoop ecosystem components, and extracted results accordingly.
  • Developed MapReduce jobs for cleansing, accessing, and validating the data.
  • Developed Sqoop scripts to import/export data between Oracle and HDFS and to load it into Hive tables.
  • Stored the data in tabular formats using Hive tables.
  • Effectively used Hive partitioning and bucketing to improve HiveQL query performance.
  • Implemented Hive generic UDFs to incorporate business logic into Hive queries.
  • Analyzed web log data using HiveQL to extract the required columns.
  • Developed Pig scripts and Pig UDFs to process unstructured data and store it in HDFS.
  • Used Hive join optimizations to improve performance.
  • Created partitioned tables and loaded data using both static and dynamic partition methods (see the sketch after this list).
  • Used different data formats (text, Avro) when loading data into HDFS.
  • Used Oozie to automate end-to-end pipelines and Oozie coordinators to schedule the workflows.
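
A brief sketch of static versus dynamic partition loads, driven from Python through PyHive; the host, table, column names, and dates are assumptions used only for illustration.

    # Hypothetical static vs. dynamic partition loads in HiveQL via PyHive.
    from pyhive import hive

    cursor = hive.connect(host="hiveserver2.example.com", port=10000).cursor()

    # Static partition: the partition value is fixed in the statement.
    cursor.execute("""
        INSERT OVERWRITE TABLE weblogs PARTITION (log_date='2016-01-01')
        SELECT host, url, status
        FROM staging_weblogs
        WHERE event_date = '2016-01-01'
    """)

    # Dynamic partition: Hive derives the partition value from the trailing column.
    cursor.execute("SET hive.exec.dynamic.partition=true")
    cursor.execute("SET hive.exec.dynamic.partition.mode=nonstrict")
    cursor.execute("""
        INSERT OVERWRITE TABLE weblogs PARTITION (log_date)
        SELECT host, url, status, event_date AS log_date
        FROM staging_weblogs
    """)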

Environment: HDFS, MapReduce, Pig, Java, Sqoop, Oozie, Cloudera, Eclipse, Shell, Hive, Linux.
