Sr. Data Engineer Resume

Lake Success, NY

SUMMARY

  • 8+ years of professional experience in Big Data development, primarily using the Hadoop and Spark ecosystems.
  • Experience in design, development, and implementation of Big Data applications using Hadoop ecosystem frameworks and tools like HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Spark, Storm, HBase, Kafka, Flume, NiFi, Impala, Oozie, ZooKeeper, Airflow, etc.
  • Expertise in developing applications using Scala, Java, and Python.
  • Expertise in ingesting, processing, exporting, and analyzing terabytes of structured and unstructured data on Hadoop clusters.
  • In-depth knowledge of Hadoop architecture and experience working with Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce.
  • Demonstrated experience in delivering data and analytics solutions using AWS, Azure, or similar cloud data lakes.
  • Experience streaming data from cloud (AWS, Azure) and on-premises sources using tools like Spark.
  • Hands-on experience with AWS (Amazon Web Services), Elastic MapReduce (EMR), S3 storage, EC2 instances, and data warehousing.
  • Expertise in developing and tuning Spark applications using various optimization techniques for executor tuning, memory management, garbage collection, and serialization, ensuring optimal application performance by following industry best practices.
  • Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet.
  • Worked with various compression techniques like BZIP, GZIP, Snappy, and LZO.
  • Expertise in writing DDL and DML scripts in SQL and HQL for analytics applications on RDBMS and Hive.
  • Expertise in Hive optimization techniques like partitioning, bucketing, vectorization, map-side joins, bucket-map joins, skew joins, and creating indexes.
  • Developed, deployed, and supported several MapReduce applications in Java to handle different types of data.
  • Expertise in developing streaming applications in Scala using Kafka and Spark Structured Streaming (see the sketch after this list).
  • Experience in importing and exporting data between HDFS and RDBMS systems like Teradata (sales data warehouse) and SQL Server, and non-relational systems like HBase, using Sqoop with efficient column mappings while maintaining data uniformity.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Good working knowledge of AWS cloud services like EMR, S3, Redshift, and CloudWatch for big data development.
  • Experience with various build and automation tools like Maven, SBT, Git, SVN, and Jenkins.
  • Experience in understanding specifications for data warehouse ETL processes and interacting with designers and end users to gather informational requirements.
  • Experienced in Microsoft Business Intelligence tools, developing SSIS (Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services), building Key Performance Indicators, and OLAP cubes.
  • Good exposure to star and snowflake schemas and data modeling across different data warehouse projects.
  • Strong experience in the design and development of relational databases with multiple RDBMS platforms, including Oracle 10g, MySQL, MS SQL Server, and PL/SQL.
  • Troubleshooting production incidents requiring detailed analysis of issues in web and desktop applications, Autosys batch jobs, and databases.
  • Experience in working with various SDLC methodologies like Waterfall, Agile Scrum, and TDD for developing and delivering applications.
  • Developed SQL queries to research, analyze, and troubleshoot data and to create business reports.
  • Translated database reporting needs into SQL queries that extract data and compile it into meaningful reports.
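
For illustration, a minimal sketch (in Scala) of the kind of Kafka plus Spark Structured Streaming job described in the streaming bullet above; the broker address, topic, JSON schema, and output paths are assumed placeholders rather than details from any actual project:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-structured-streaming-sketch")
          .getOrCreate()

        // Assumed JSON payload schema; adjust to the real message layout.
        val schema = new StructType()
          .add("eventId", StringType)
          .add("eventTime", TimestampType)
          .add("amount", DoubleType)

        // Read the Kafka topic as an unbounded streaming DataFrame.
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
          .option("subscribe", "events")                     // placeholder topic
          .option("startingOffsets", "latest")
          .load()

        // Kafka values arrive as bytes; cast to string and parse the JSON.
        val events = raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).as("e"))
          .select("e.*")

        // Write the parsed stream to HDFS as Parquet with checkpointing.
        events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/events")             // placeholder output path
          .option("checkpointLocation", "hdfs:///checkpoints/events")
          .start()
          .awaitTermination()
      }
    }

The checkpoint location is what lets the query recover its Kafka offsets and resume cleanly after a restart.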

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, Lake Success, NY

Responsibilities:

  • Developed applications using Spark to implement various aggregation and transformation functions with Spark RDDs and Spark SQL.
  • Connected Spark Scala code to DB2 over a SQL connection to select, insert, and update data in the database (see the first sketch after this list).
  • Used broadcast joins in Spark to join smaller datasets to large datasets without shuffling data across nodes (see the second sketch after this list).
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the Spark jobs.
  • Created Spark Streaming jobs using Python to read messages from Kafka and download JSON files from AWS S3 buckets.
  • Used Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS and NoSQL databases such as HBase and Cassandra using Python.
  • Prototyped analysis and joining of customer data using Spark in Scala and persisted the results to HDFS.
  • Implemented Spark on EMR to process big data across our One Lake data lake in AWS.
  • Developed the AWS strategy and handled planning and configuration of S3, security groups, IAM, EC2, EMR, and Redshift.
  • Developed Spark code in Scala and Python for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python and SQL queries.
  • Developed data processing applications in Scala using Spark RDDs as well as DataFrames with the Spark SQL API.
  • Used pandas UDFs and array functions (array_contains, distinct, flatten, map, sort, split, overlaps) for filtering the data.
  • Designed and implemented incremental Sqoop jobs to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 to generate interactive reports.
  • Built automated pipelines using Jenkins and Groovy scripts.
  • Used shell commands to push environment and test files to AWS through automated Jenkins pipelines.
  • Developed database application forms using MS Access in coordination with SQL tables and stored procedures.
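
A minimal sketch of the DB2-to-Spark JDBC pattern referenced in the second bullet above; the connection URL, table names, and credential environment variables are assumed placeholders, and the driver class shown is the typical IBM JDBC driver:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object Db2JdbcSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("db2-jdbc-sketch").getOrCreate()

        val jdbcUrl = "jdbc:db2://db2host:50000/SALES"       // placeholder connection URL

        // Select: load a DB2 table into a DataFrame over JDBC.
        val orders = spark.read
          .format("jdbc")
          .option("url", jdbcUrl)
          .option("dbtable", "APP.ORDERS")                   // placeholder source table
          .option("user", sys.env("DB2_USER"))               // placeholder credentials
          .option("password", sys.env("DB2_PASSWORD"))
          .option("driver", "com.ibm.db2.jcc.DB2Driver")
          .load()

        // Transform, then write the aggregated results back to a DB2 target table (append mode).
        val dailyTotals = orders.groupBy("order_date").sum("amount")

        dailyTotals.write
          .format("jdbc")
          .option("url", jdbcUrl)
          .option("dbtable", "APP.ORDER_TOTALS")             // placeholder target table
          .option("user", sys.env("DB2_USER"))
          .option("password", sys.env("DB2_PASSWORD"))
          .option("driver", "com.ibm.db2.jcc.DB2Driver")
          .mode(SaveMode.Append)
          .save()
      }
    }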
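
A second short sketch covers the broadcast-join pattern from the third bullet above, which ships the small lookup table to every executor instead of shuffling the large dataset; the table names and join key are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

        val transactions = spark.table("sales.transactions") // large fact table (placeholder)
        val stores = spark.table("sales.stores")             // small dimension table (placeholder)

        // broadcast() hints Spark to replicate the small side to every executor,
        // so the large side is joined locally without a shuffle.
        val enriched = transactions.join(broadcast(stores), Seq("store_id"))

        enriched.write.mode("overwrite").parquet("hdfs:///data/enriched_transactions")
      }
    }

Broadcasting is only appropriate when the small side comfortably fits in executor memory.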

Environment: Spark, Scala, AWS, Python, Spark SQL, Redshift, PostgreSQL, Databricks, Jupyter, Kafka

Sr Data Engineer

Confidential, Indianapolis, IN

Responsibilities:

  • Worked on requirements gathering, analysis, and design of the systems.
  • Developed Spark programs using Scala to compare the performance of Spark and Spark SQL with Hive.
  • Developed a Spark Streaming application to consume JSON messages from Kafka and perform transformations.
  • Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Involved in developing a MapReduce framework that filters bad and unnecessary records.
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra per the business requirements.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs with Scala (see the sketch after this list).
  • Used the Spark API over Hadoop YARN as an execution engine for data analytics with Hive.
  • Exported the analyzed data to relational databases using Sqoop for the BI team to visualize and generate reports.
  • Responsible for migrating the code base to Amazon EMR and evaluating Amazon ecosystem components like Redshift.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Used AWS services like EC2 and S3 for small data set processing and storage.
  • Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.
  • Worked with different file formats (ORC, Parquet, Avro) and different compression codecs (GZIP, Snappy, LZO).
  • Created applications that monitor consumer lag within Apache Kafka clusters.
  • Worked on importing and exporting data into HDFS and Hive using Sqoop, and built analytics on Hive tables using HiveContext in Spark jobs.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS.
  • Worked in Agile environment using Scrum methodology.
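
As an illustration of the Hive-to-Spark conversion mentioned above, a minimal Scala sketch that expresses the same aggregation first as a Hive SQL statement and then with the DataFrame API; the database, table, and column names are assumed placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object HiveToSparkSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-to-spark-sketch")
          .enableHiveSupport()        // lets Spark read existing Hive tables
          .getOrCreate()

        // The original Hive-style query...
        val viaSql = spark.sql(
          """SELECT customer_id, SUM(amount) AS total
            |FROM sales.orders
            |GROUP BY customer_id""".stripMargin)

        // ...and the equivalent Spark DataFrame transformation.
        val viaApi = spark.table("sales.orders")
          .groupBy("customer_id")
          .agg(sum("amount").as("total"))

        // Persist the DataFrame result back to a Hive table for reporting.
        viaApi.write.mode("overwrite").saveAsTable("sales.customer_totals")
      }
    }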

Environment: Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, PySpark, Cassandra, Oozie, NiFi, Solr, Shell Scripting, HBase, Scala, AWS, Maven, Java, JUnit, Hortonworks, SOAP, Python, MySQL.

Data Engineer

Confidential, New York, NY

Responsibilities:

  • Worked on requirements gathering, analysis, and design of the systems.
  • Actively involved in designing the Hadoop ecosystem pipeline.
  • Involved in designing and monitoring a multi-data-center Kafka cluster.
  • Responsible for pulling real-time data from source systems into Kafka clusters.
  • Worked with Spark techniques like refreshing tables, handling parallelism, and modifying Spark defaults for performance tuning.
  • Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
  • Involved in migrating MapReduce jobs to Spark jobs and used Spark SQL and the DataFrames API to load structured data into Spark clusters.
  • Used the Spark API over Hadoop YARN as the execution engine for data analytics with Hive, and after processing and analyzing the data in Spark SQL, delivered it to the BI team for report generation.
  • Worked with the data science team to build statistical models with Spark MLlib and PySpark.
  • Involved in importing data from various sources into the Cassandra cluster using Sqoop.
  • Worked on creating data models for Cassandra from the existing Oracle data model.
  • Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra per the business requirements.
  • Used Sqoop import functionality to load historical data from RDBMS into HDFS.
  • Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs on an Apache Hadoop environment from Hortonworks (HDP 2.2).
  • Configured Hive bolts and wrote data to Hive in Hortonworks as part of a POC.
  • Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Created data pipelines per the business requirements and scheduled them using Oozie coordinators.
  • Wrote Python scripts to parse XML documents and load the data into a database.
  • Worked extensively on Apache NiFi to build flows for the existing Oozie jobs to handle incremental loads, full loads, and semi-structured data, to pull data from REST APIs into Hadoop, and to automate all NiFi flows to run incrementally.
  • Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications on any failures.
  • Worked extensively on importing metadata into Hive using Scala and migrated existing tables and applications to work on Hive and the AWS cloud.
  • Developed batch jobs to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework (see the sketch after this list).
  • Used version control tools like GitHub to share code among the team members.
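
A minimal sketch of the S3-to-Spark batch pattern referenced above; the bucket, prefix, column names, and Hive table are assumed placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object S3BatchSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("s3-batch-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Read raw CSV files from S3 (placeholder bucket/prefix).
        val raw = spark.read
          .option("header", "true")
          .csv("s3a://my-bucket/raw/orders/")

        // Required transformations: type the columns and drop bad rows.
        val cleaned = raw
          .withColumn("amount", col("amount").cast("double"))
          .filter(col("order_id").isNotNull)

        // Persist the result to a Hive table for downstream reporting.
        cleaned.write.mode("append").saveAsTable("analytics.orders_clean")
      }
    }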

Environment: Hadoop, HDFS, Hive, Python, HBase, NiFi, Spark, MySQL, Oracle 12c, Linux, Hortonworks, Oozie, MapReduce, Sqoop, Shell Scripting, Apache Kafka, Scala, AWS.

Data Engineer

Confidential

Responsibilities:

  • Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop, Flume, and Kafka.
  • Extended Hive core functionality by writing custom UDFs in Java (a sketch of the idea follows this list).
  • Worked on multiple POCs implementing a Data Lake for multiple data sources, ranging from TeamCenter and SAP to Workday and machine logs.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Worked on the MS SQL Server PDW migration for the MSBI warehouse.
  • Planned, scheduled, and implemented Oracle to MS SQL Server migrations for AMAT in-house applications and tools.
  • Integrated Tableau with the Hadoop data source to build dashboards providing various insights into the organization's sales.
  • Worked on Spark to build BI reports in Tableau; Tableau was integrated with Spark using Spark SQL.
  • Developed Spark jobs using Scala and Python on top of YARN/MRv2 for interactive and batch analysis.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
  • Developed workflows in Live Compare to analyze SAP data and reporting.
  • Worked on Java development, support, and tooling for in-house applications.
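
The custom UDFs in the bullet above were written in Java; purely as an illustration, in the same language used for the other sketches here, a class against Hive's reflection-based UDF API looks roughly like the following, with the class name, masking logic, and registration commands invented for the example:

    import org.apache.hadoop.hive.ql.exec.UDF

    // Simple Hive UDF that masks all but the last four characters of a value.
    // Hive discovers the evaluate() method by reflection.
    class MaskUdf extends UDF {
      def evaluate(input: String): String = {
        if (input == null) null
        else if (input.length <= 4) input
        else "*" * (input.length - 4) + input.takeRight(4)
      }
    }

    // After packaging into a JAR, the function would be registered in Hive roughly like:
    //   ADD JAR hdfs:///udfs/mask-udf.jar;
    //   CREATE TEMPORARY FUNCTION mask AS 'MaskUdf';
    //   SELECT mask(card_number) FROM payments;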

ETL/Data Warehouse Developer

Confidential

Responsibilities:

  • Gathered requirements from the business and documented them for project development.
  • Coordinated design reviews, ETL code reviews with teammates.
  • Developed mappings using Informatica to load data from sources such as Relational tables, Sequential files into the target system.
  • Extensively worked with Informatica transformations.
  • Created datamaps in Informatica to extract data from Sequential files.
  • Extensively worked on UNIX Shell Scripting for file transfer and error logging.
  • Scheduled processes in ESP Job Scheduler.
  • Performed Unit, Integration and System testing of various jobs.

Environment: Informatica Power Center 8.6, Oracle 10g, SQL Server 2005, UNIX Shell Scripting, ESP job scheduler
