Sr. Big Data Engineer Resume

Boston, MA

SUMMARY

  • Over 8 years of experience in Information Technology, delivering data warehousing solutions and developing, maintaining, and supporting client requirements across data extraction, data modeling, data wrangling, statistical modeling, data mining, machine learning, and data visualization.
  • Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, YARN, Oozie, and Zookeeper.
  • High exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and the Hadoop infrastructure.
  • Expertise in writing end-to-end data processing jobs to analyze data using MapReduce and Spark.
  • Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, and knowledge of Spark MLlib.
  • Developed Spark Streaming jobs by creating RDDs (Resilient Distributed Datasets) using Scala, PySpark, and Spark-Shell.
  • Experienced in data manipulation using Python for loading and extraction, as well as Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
  • Experienced in using Pig scripts to do transformations, event joins, filters, and pre-aggregations before storing the data into HDFS.
  • Working experience on Hive analytical functions, extending Hive functionality by writing custom UDFs.
  • Expertise in writing MapReduce jobs in Python for processing large sets of structured, semi-structured, and unstructured data and storing them in HDFS.
  • Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, and Fact & Dimension tables.
  • Strong knowledge of creating and monitoring Hadoop clusters on Amazon EC2, VMs, Hortonworks Data Platform 2.6, and CDH5 with Cloudera Manager on Linux.
  • Hands-on experience working with Amazon Web Services (AWS) using Elastic Map Reduce (EMR), Redshift, and EC2 for data processing.
  • Hands-on experience in SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
  • Hands-on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
  • Built and implemented big data ingestion pipelines to process TB-scale data from various data sources using Kafka, PySpark, and Snowpipe (a minimal sketch follows this summary).
  • Developed data audit checks and stored the results in efficient columnar storage formats, enabling data wrangling for various downstream analytics purposes.
  • Excellent knowledge of J2EE architecture, design patterns, and object modeling using various J2EE technologies, with comprehensive experience in web-based applications built on J2EE frameworks such as Spring, Hibernate, Struts, and JMS.
  • Expert in machine learning techniques such as clustering analysis, market basket analysis, and association analysis.
  • Extensive knowledge of using Azure DevOps for deploying notebooks and pipelines.
  • Experienced in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML Files.
  • Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently or in a team and to follow the best practices and principles defined for the team.
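
The ingestion and audit pattern summarized above can be illustrated with a minimal PySpark sketch. This is a hedged illustration only: the topic name, schema, and output paths are hypothetical placeholders, and it assumes the spark-sql-kafka connector package is available on the cluster.

    # Minimal sketch: batch-ingest JSON records from Kafka, run a simple null-count
    # audit, and land both the audit results and the data in columnar (Parquet) form.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("ingest_audit_sketch").getOrCreate()

    schema = StructType([
        StructField("order_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_ts", StringType()),
    ])

    raw = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "orders")                      # placeholder topic
           .load())
    parsed = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
                 .select("r.*"))

    # Audit check: number of nulls per column, persisted for downstream review.
    audit = parsed.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in parsed.columns])
    audit.write.mode("append").parquet("/data/audit/orders_null_counts")

    # Curated data stored in a columnar format for downstream analytics.
    parsed.write.mode("append").parquet("/data/curated/orders")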

TECHNICAL SKILLS

Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Sqoop, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.

Hadoop Distribution: Cloudera and Hortonworks.

Programming Languages: Python (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), Oracle, T-SQL, PL/SQL, Scala, Spring, Spark, Hibernate, JDBC, JSON, HTML, CSS, Java

Scripting Languages: JavaScript, jQuery, Python, Azure PowerShell

Databases: Oracle 11g/10g/9i, Sybase, Netezza, Hive, Impala, MySQL, DB2, MS SQL Server; NoSQL: Cassandra, HBase, MongoDB

Cloud: AWS, Azure, GCP

Scheduling Tools: Control-M, Active Batch, Zena.

Software Methodologies: Agile, Scrum, Waterfall

IDE: IntelliJ, Eclipse, Jupyter, and NetBeans

Version Control Tools: Git, GitLab, GitHub, Subversion

PROFESSIONAL EXPERIENCE

Confidential, Boston, MA

Sr. Big data Engineer

Responsibilities:

  • Involved in requirements gathering, business analysis, design and development, testing, and implementation of business rules.
  • Developed Spark programs using Scala to compare the performance of Spark and Spark SQL with Hive.
  • Developed a Spark Streaming application to consume JSON messages from Kafka and perform transformations (a hedged sketch appears after this list).
  • Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Involved in developing a MapReduce framework that filters bad and unnecessary records.
  • Ingested data from RDBMS sources, performed data transformations, and exported the transformed data to Cassandra as per the business requirement.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs with Scala.
  • Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
  • Exported the analyzed data to the relational databases using Sqoop to further visualize and generate reports for the BI team.
  • Migrated computational code from HQL to PySpark.
  • Worked with the Spark ecosystem using Scala and Hive queries on different data formats such as text and Parquet.
  • Worked on migrating HiveQL into Impala to minimize query response time.
  • Responsible for migrating the code base to Amazon EMR and evaluated Amazon ecosystem components such as Redshift.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Developed Python scripts to clean the raw data.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Worked on AWS services such as EC2 and S3 for data set processing and storage.
  • Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.
  • Worked on different file formats (ORCFILE, Parquet, Avro) and different Compression Codecs (GZIP, SNAPPY, LZO).
  • Created Kafka applications that monitor consumer lag within Apache Kafka clusters.
  • Worked on importing and exporting data into HDFS and Hive using Sqoop, and built analytics on Hive tables using HiveContext in Spark jobs.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS.
  • Worked in an Agile environment using the Scrum methodology.
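
A hedged sketch of the Kafka-to-HDFS streaming job referenced in this section: the original application may have used the DStream API, while this version uses Structured Streaming, and the broker address, topic, schema, and checkpoint paths are assumptions for illustration.

    # Consume JSON messages from Kafka, apply a simple transformation, and write
    # the results to HDFS as Parquet with checkpointing.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka_json_stream_sketch").getOrCreate()

    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("action", StringType()),
        StructField("event_time", TimestampType()),
    ])

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
              .option("subscribe", "clickstream")                 # placeholder topic
              .option("startingOffsets", "latest")
              .load())

    # Parse the JSON payload and keep only well-formed events.
    events = (stream
              .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
              .select("e.*")
              .filter(F.col("action").isNotNull()))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/streams/clickstream")
             .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
             .outputMode("append")
             .start())
    query.awaitTermination()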

Environment: Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, PySpark, Cassandra, Oozie, NiFi, Solr, Shell Scripting, HBase, Scala, AWS, Maven, Java, JUnit, Agile methodologies, Hortonworks, SOAP, Python 3, Boto3 SDK, Teradata, MySQL

Confidential, Richmond, VA

Sr. Data Engineer

Responsibilities:

  • Worked on requirements gathering, analysis, and design of the systems.
  • Actively involved in designing the Hadoop ecosystem pipeline.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Involved in designing Kafka for a multi-data-center cluster and monitoring it.
  • Responsible for ingesting real-time data from source systems into Kafka clusters.
  • Worked with Spark performance-tuning techniques such as refreshing tables, handling parallelism, and modifying Spark defaults.
  • Implemented Spark RDD transformations to map business logic and applied actions on top of the transformations.
  • Involved in migrating MapReduce jobs to Spark jobs and used Spark SQL and the DataFrames API to load structured data into Spark clusters.
  • Involved in using the Spark API over Hadoop YARN as the execution engine for data analytics with Hive; after processing and analyzing the data in Spark SQL, submitted the results to the BI team for report generation.
  • Performed SQL Joins among Hive tables to get input for Spark batch process.
  • Worked with the data science team to build statistical models with Spark MLlib and PySpark.
  • Involved in importing data from various sources into the Cassandra cluster using Sqoop.
  • Worked on creating data models for Cassandra from the existing Oracle data model.
  • Designed column families in Cassandra, ingested data from RDBMS sources, performed data transformations, and exported the transformed data to Cassandra as per the business requirement (see the sketch after this list).
  • Used Sqoop import functionality to load historical data from RDBMS into HDFS.
  • Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs on an Apache Hadoop environment on Hortonworks (HDP 2.2).
  • Configured Hive bolts and wrote data to Hive in Hortonworks as part of a POC.
  • Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
  • Developed Python scripts to start and end jobs smoothly within a UC4 workflow.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Created data pipelines and scheduled them using Oozie coordinators.
  • Wrote Python scripts to parse XML documents and load the data into the database.
  • Worked extensively on Apache NiFi, building NiFi flows for the existing Oozie jobs to handle incremental loads, full loads, and semi-structured data, pull data from REST APIs into Hadoop, and run all the flows incrementally.
  • Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications on failures.
  • Developed shell scripts to periodically perform incremental imports of data from third-party APIs into Amazon AWS.
  • Worked extensively on importing metadata into Hive using Scala and migrated existing tables and applications to Hive and the AWS cloud.
  • Used version control tools such as Git, GitHub, and Subversion (SVN) to share code among team members.
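
One possible shape of the RDBMS-to-Cassandra flow described above, sketched in PySpark. It is an illustration under stated assumptions: the JDBC URL, table, keyspace, and column names are hypothetical, and it assumes the DataStax spark-cassandra-connector is on the classpath (parts of the actual flow also used Sqoop, as noted above).

    # Read from the RDBMS over JDBC, apply business transformations, and write to
    # a Cassandra column family designed for the query pattern.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("rdbms_to_cassandra_sketch")
             .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder
             .getOrCreate())

    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")  # placeholder URL
              .option("dbtable", "SALES.ORDERS")                       # placeholder table
              .option("user", "etl_user")
              .option("password", "****")
              .load())

    # Illustrative transformation: derive a partition-friendly column and filter.
    curated = (orders
               .withColumn("order_month", F.date_format("order_date", "yyyy-MM"))
               .filter(F.col("status") == "COMPLETE"))

    (curated.write
     .format("org.apache.spark.sql.cassandra")
     .options(table="orders_by_month", keyspace="sales")  # placeholder keyspace/table
     .mode("append")
     .save())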

Environment: Hadoop, MapReduce, HDFS, HiveQL, Pig, Java, Spark, Kafka, AWS, SBT, Maven, Sqoop, Zookeeper, Python, Informatica PowerCenter, Teradata.

Confidential, Conway, AR

Big data Engineer

Responsibilities:

  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS using PySpark.
  • Processed data into HDFS by developing solutions and analyzed the data using Map Reduce, Pig, and Hive to produce summary results from Hadoop to downstream systems.
  • Wrote MapReduce code to process and parse data from various sources and stored the parsed data in HBase and Hive using HBase-Hive integration.
  • Involved in loading and transforming large sets of Structured, Semi-Structured, and Unstructured data and analyzed them by running Hive queries and Pig scripts.
  • Created Managed tables and External tables in Hive and loaded data from HDFS.
  • Using Azure Data Factory to ingest, egress, and transform data from multiple sources.
  • Developed Spark code by using Python and Spark-SQL for faster processing and testing and performed complex HiveQL queries on Hive tables.
  • Scheduled several time-based Oozie workflows by developing Python scripts.
  • Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
  • Exported data to RDBMS servers using Sqoop and processed it for ETL operations.
  • Designed ETL data pipeline flows to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, and MySQL.
  • Optimized Hive tables using techniques such as partitioning and bucketing to provide better query performance (a sketch follows this list).
  • Developed pipelines in Azure using ADF for data extraction and transformation.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
  • Implemented POC to migrate Map Reduce jobs into Spark RDD transformations using Python.
  • Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
  • Involved in cluster maintenance, cluster monitoring, troubleshooting, and managing and reviewing data backups and log files.
  • Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig.
  • Analyzed Hadoop clusters and various Big Data analytic tools, including Pig, Hive, HBase, and Sqoop.
  • Improved performance by tuning Hive and MapReduce jobs.
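
The partitioning and bucketing optimization mentioned above can be sketched as follows. The project work was done with Hive tables; this hedged PySpark equivalent lays out the same kind of table through the DataFrame writer, and the table names, partition column, and bucket count are illustrative assumptions.

    # Partition by load date so date filters prune whole directories, and bucket by
    # customer_id so joins on that key can avoid full shuffles.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partition_bucket_sketch")
             .enableHiveSupport()
             .getOrCreate())

    transactions = spark.table("staging.transactions")   # placeholder source table

    (transactions.write
     .partitionBy("load_date")
     .bucketBy(32, "customer_id")
     .sortBy("customer_id")
     .format("parquet")
     .mode("overwrite")
     .saveAsTable("curated.transactions"))               # placeholder target table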

Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Vertica, Oozie Scheduler, Java, Shell Scripts, Teradata, Oracle, Azure Data Factory, HBase, Cassandra, Cloudera, JavaScript, JSP, Kafka, Spark, Scala, ETL, Python.

Confidential

Data Engineer

Responsibilities:

  • Involved in technical and business decisions for business requirements, interacting with Business Analysts, the client team, and the development team through the Agile Kanban process.
  • Created Azure Data Factory pipelines for loading data into the SQL database from the Cosmos platform.
  • Acted as build and release engineer, deploying services via VSTS pipelines; created and maintained pipelines to manage the IaC for all applications.
  • Created complex Power BI dashboards.
  • Developed Databricks Python notebooks to join, filter, pre-aggregate, and process files stored in Azure Data Lake Storage based on business logic (see the notebook sketch after this list).
  • Performed Column Mapping, Data Mapping and Maintained Data Models and Data Dictionaries.
  • Built a system to perform real-time data processing using Spark Streaming and Kafka.
  • Involved in retrieving multi-million-record data loads using SSIS, querying against heterogeneous data sources such as SQL Server, Oracle, text files, and some legacy systems.
  • Expertise in using different transformations such as Lookup, Derived Column, Merge Join, Fuzzy Lookup, For Loop, Foreach Loop, Conditional Split, Union All, and Script Component.
  • Transferred data from various data sources/business systems including MS Excel, MS Access, and Flat Files to SQL Server using SSIS/DTS packages using various features.
  • Involved in Performance tuning of ETL transformations, data validations and stored procedures.
  • Strong experience in designing and implementing ETL packages using SSIS for integrating data using OLE DB connection from heterogeneous sources.
  • Created complex ETL packages using SSIS that upsert data from staging tables to database tables.
  • Developed and designed a system to collect data from multiple portals using Kafka and process it using Spark.
  • Worked extensively on UNIX shell scripting for splitting groups of files into smaller files.
  • Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Experience in creating reports from scratch using Power BI.
  • Created roles using SSAS to restrict cube properties.
  • Implemented cell-level security in cubes using MDX expressions to restrict users of one region from seeing another region's data in SSAS.
  • Created calculated measures using MDX to implement business requirements.
  • Experienced working with Star and Snowflake schemas, using fact and dimension tables to build cubes, perform processing, and deploy them to the SSAS database.
  • Designed aggregations and pre-calculations in SSAS.
  • Involved in designing Partitions in Cubes to improve performance using SSAS.
  • Experienced in Developing Power BI Reports and Dashboards from multiple data sources using Data Blending.
  • Responsible for creating and changing the visualizations in Power BI reports and Dashboards on client requests.
  • Created Calculated Columns and Measures in Power BI and Excel depending on the requirement using DAX queries.
  • Created hierarchies in Power BI reports using visualizations like Bar chart, Line chart, etc.
  • Worked with both live and import data into Power BI for creating reports.
  • Managed relationship between tables in Power BI using star schema.
  • Used the different types of slicers available in Power BI when creating reports.
  • Converted business requirements into views and dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to control the data and filter down further as desired.
  • Experience in data visualization, including producing tables, graphs, and listings using various procedures and tools such as Tableau.
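
An illustrative Databricks-style PySpark cell for the join/filter/pre-aggregation step described above; the ADLS container, storage account, file layout, and column names are hypothetical assumptions, not actual project values.

    # Join order and customer files from Azure Data Lake Storage, filter to completed
    # orders, and pre-aggregate daily revenue per region for downstream reporting.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()   # provided automatically in Databricks

    base = "abfss://curated@examplestorage.dfs.core.windows.net"   # placeholder account

    orders = spark.read.parquet(f"{base}/orders")
    customers = spark.read.parquet(f"{base}/customers")

    daily_revenue = (orders
                     .filter(F.col("status") == "COMPLETE")
                     .join(customers, "customer_id")
                     .groupBy("region", F.to_date("order_ts").alias("order_date"))
                     .agg(F.sum("amount").alias("revenue"),
                          F.countDistinct("customer_id").alias("active_customers")))

    # Pre-aggregated output written back to the lake for reporting (e.g. Power BI).
    daily_revenue.write.mode("overwrite").parquet(f"{base}/aggregates/daily_revenue")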

Environment: SQL, Azure Data Factory, Kafka, SQL Server, MS Excel, Microsoft Teams, Visual Studio 2014, SSIS, Power BI, UNIX.

Confidential

Jr. Data Engineer

Responsibilities:

  • Responsible for loading the customer’s data and event logs from MySQL database into HDFS using Sqoop.
  • Worked on loading data from the Linux file system to HDFS.
  • Imported and exported data into HDFS and Hive using Sqoop and Flume.
  • Knowledge of using Cloudera Manager, an end-to-end tool to manage Hadoop operations.
  • Developed UDFs to provide custom Hive and Pig capabilities (a Python-based illustration follows this list).
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Performed data transformations in Hive and used partitions and buckets for performance improvement.
  • Created Hive external tables on the MapReduce output before partitioning and bucketing were applied.
  • Scheduled Sqoop jobs using Oozie.
  • Involved in the development of REST web services using Spring MVC to extract data from databases.
  • Developed and consumed REST web services using the Jersey API.
  • Involved in multi-tiered J2EE design utilizing Spring Inversion of Control (IOC) architecture, Spring MVC, Spring Annotations, Hibernate, JDBC and Tomcat Web server.
  • Implemented navigation using Spring MVC controllers, configured controllers using Spring MVC annotations and configuration files.
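
As a Python-based illustration of extending Hive with custom logic, the sketch below uses Hive's streaming TRANSFORM interface (the project UDFs may equally have been written in Java); the script name, table, and column names are hypothetical.

    #!/usr/bin/env python
    # Hive pipes rows to this script as tab-separated lines on stdin and reads the
    # transformed rows back from stdout.
    #
    # Illustrative HiveQL usage:
    #   ADD FILE normalize_phone.py;
    #   SELECT TRANSFORM (customer_id, phone)
    #          USING 'python normalize_phone.py'
    #          AS (customer_id, phone_normalized)
    #   FROM customers;
    import re
    import sys

    for line in sys.stdin:
        customer_id, phone = line.rstrip("\n").split("\t")
        digits = re.sub(r"\D", "", phone)                  # keep digits only
        normalized = digits[-10:] if len(digits) >= 10 else digits
        print(customer_id + "\t" + normalized)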

Environment: REST Web Services, Java, Spring MVC, JUnit, Maven, TFS, Apache Hadoop, Pig, Hive, Impala, Sqoop, Oozie, IBM WebSphere DataStage 8.5, MS SQL Server 2012 R2/2008 R2, Flat Files, Active Batch V9, Windows Server 2008.
