Sr. Big Data/Data Engineer Resume

Harrison, NY


  • 7+ years of working experience as a Data Engineer, with strong proficiency in data analysis and Big Data.
  • Experienced in Big Data work on Hadoop, Spark, PySpark, Hive, HDFS, and NoSQL platforms.
  • Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
  • Experienced in technical consulting and end-to-end delivery, covering architecture, data modeling, data governance, and the design, development, and implementation of solutions.
  • Experience in installing, configuring, supporting, and managing the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
  • Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
  • Proficient in normalization/de-normalization techniques in relational/dimensional database environments, with normalization performed up to 3NF.
  • Hands-on experience with Amazon Web Services, including provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS, and others.
  • Hands-on experience with Google Cloud Platform (GCP) services such as BigQuery, GCS buckets, and Cloud Functions.
  • Experienced in Informatica Information Lifecycle Management (ILM) and its tools.
  • Efficient in all phases of the development lifecycle, including data cleansing, data conversion, data profiling, data mapping, performance tuning, and system testing.
  • Experience with the Big Data Hadoop ecosystem for ingestion, storage, querying, processing, and analysis of big data.
  • Good understanding of the Ralph Kimball (dimensional) and Bill Inmon (relational) modeling methodologies.
  • Experienced in working extensively on Master Data Management (MDM) and the applications used for MDM.
  • Experience in transferring data from AWS S3 to AWS Redshift using Informatica.
  • Good knowledge of SQL queries and of creating database objects such as stored procedures, triggers, packages, and functions using SQL and PL/SQL to implement business logic.
  • Supported ad-hoc business requests, developed stored procedures and triggers, and made extensive use of Quest tools such as TOAD.
  • Good understanding and exposure to Python programming.
  • Excellent working experience in Scrum/Agile framework and Waterfall project execution methodologies.
  • Experience in migrating data using Sqoop from HDFS and Hive to relational database systems and vice versa, according to client requirements.
  • Extensive experience working with business users/SMEs as well as senior management.
  • Strong experience in using MS Excel and MS Access to extract data and analyze it based on business needs.
  • Good experience in data analysis; proficient in gathering business requirements and handling requirements management.


Big Data & Hadoop Ecosystem: MapReduce, Spark 3.3, HBase 2.3.4, Hive 2.3, Flume 1.9, Sqoop 1.4.6, Kafka 2.6, Oozie 4.3, Hue, Cloudera Manager, Neo4j, Hadoop 3.3, Apache NiFi 1.6

Cloud Platforms: GCP, Google BigQuery, AWS, EC2, S3, Redshift & MS Azure

NoSQL Databases: MongoDB, Azure SQL DB, Cassandra 3.11.10

Data Modeling Tools: Erwin R9.7/9.6, ER Studio V17

Databases: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access

BI Tools: Tableau 10, SSRS, Crystal Reports, Power BI.

Programming Languages: SQL, PL/SQL, UNIX shell Scripting, R

Operating Systems: Microsoft Windows Vista/7/8 and 10, UNIX, and Linux.

Methodologies: Agile, RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Waterfall Model.


Sr. Big Data/Data Engineer

Confidential, Harrison, NY


  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark.
  • Developed ETL data pipelines using Spark and PySpark.
  • Analyzed SQL scripts and designed solutions to implement them using PySpark.
  • Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading it into target data destinations.
  • Implemented the Kafka-to-Hive streaming process flow and batch loading of data into MariaDB using Apache NiFi.
  • Implemented end-to-end data flows using Apache NiFi.
  • Responsible for loading data from web servers into data pipelines using Kafka and the Spark Streaming API.
  • Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for large volumes of data.
  • Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
  • Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
  • Implemented Spark using Scala and Spark-SQL for faster testing and processing of data.
  • Processed data using MapReduce and YARN; worked on Kafka as a proof of concept for log processing.
  • Handled data integrity checks using Hive queries, Hadoop, and Spark.
  • Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
  • Defined job flows and developed simple to complex MapReduce jobs as per requirements.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
  • Installed and configured Hive, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
  • Involved in NoSQL database design, integration, and implementation.
  • Developed Kafka producers and consumers, HBase clients, and Spark and Hadoop MapReduce jobs, along with components on HDFS and Hive.

Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, AWS, ETL, UNIX, Linux, Tableau, Teradata, Sqoop, Scala, Python.

Sr. Data Engineer

Confidential, Pleasanton, CA


  • As a Data Engineer, participated in Agile Scrum meetings to help manage and organize a team of developers, with regular code review sessions.
  • Participated in code reviews, enhancement discussions, maintenance of existing pipelines and systems, testing, and bug-fix activities on an ongoing basis.
  • Worked closely with the business analysts to convert business requirements into technical requirements and prepared low-level and high-level documentation.
  • Worked on Spark to improve the performance of and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Developed ETL processes in AWS Glue to migrate data from external sources such as S3 and ORC/Parquet/text files into AWS Redshift.
  • Worked on ingesting data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
  • Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for large volumes of data.
  • Involved in daily Scrum meetings to discuss development progress and was active in making Scrum meetings more productive.
  • Built data pipelines in Python to process data after it was loaded from Kafka.
  • Configured Spark Streaming to consume information from Kafka streams and then store it in HDFS.
  • Worked on loading data into Spark RDDs and performing advanced procedures such as text analytics, using Spark's in-memory computation capabilities to generate the output response.
  • Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Created AWS Lambda functions and assigned IAM roles to schedule Python scripts using CloudWatch triggers to support infrastructure needs (SQS, EventBridge, SNS).
  • Involved in converting MapReduce programs into Spark transformations on Spark RDDs using Scala and Python.
  • Integrated Kafka with Spark Streaming for high throughput and reliability.
  • Developed a Python script to call REST APIs and extract data to AWS S3.
  • Conducted ETL data integration, cleansing, and transformations using AWS Glue Spark scripts.
  • Worked on Lambda functions that aggregate data from incoming events and then store the resulting data in Amazon DynamoDB.
  • Deployed the project on Amazon EMR with S3 connectivity for backup storage.
  • Designed and developed ETL jobs to extract data from Oracle and load it into a data mart in Redshift.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Used a JSON schema to define table and column mappings from S3 data to Redshift.
  • Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
  • Used JIRA for issue tracking and change management.
  • Involved in creating Jenkins jobs for CI/CD using Git, Maven, and Bash scripting.

Environment: Spark 3.3, AWS S3, Redshift, Glue, EMR, IAM, EC2, Tableau, Jenkins, Jira, Python, Kafka, Agile.

Sr. Data Engineer

Confidential, Chicago, IL


  • As a Data Engineer, was responsible for building scalable distributed data solutions using Hadoop.
  • Involved in Agile Development process (Scrum and Sprint planning).
  • Handled Hadoop cluster installations in Windows environment.
  • Migrated the on-premise environment to GCP (Google Cloud Platform).
  • Migrated data warehouses to the Snowflake data warehouse.
  • Defined virtual warehouse sizing in Snowflake for different types of workloads.
  • Involved in porting the existing on-premise Hive code to GCP (Google Cloud Platform) BigQuery.
  • Involved in migrating an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub triggering the Apache Airflow jobs.
  • Extracted data from data lakes and the EDW to relational databases for analysis and more meaningful insights using SQL queries and PySpark.
  • Developed a PySpark script to merge static and dynamic files and cleanse the data.
  • Created PySpark procedures, functions, and packages to load data.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
  • Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
  • Wrote Sqoop scripts for importing and exporting data between RDBMS and HDFS.
  • Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
  • Developed scripts in BigQuery and connected them to reporting tools.
  • Designed workflows using Airflow to automate the services developed for change data capture.
  • Carried out data transformation and cleansing using SQL queries and PySpark.
  • Used Kafka and Spark Streaming to ingest real-time or near-real-time data into HDFS.
  • Worked on loading BigQuery data into Spark DataFrames for advanced ETL capabilities.
  • Worked on PySpark APIs for data transformations.
  • Built reports for monitoring data loads into GCP and driving reliability at the site level.
  • Participated in daily stand-ups, bi-weekly scrums, and PI planning.

Environment: Hadoop 3.3, GCP, BigQuery, Bigtable, Spark 3.0, PySpark, Sqoop 1.4.7, ETL, HDFS, Snowflake DW, Oracle SQL, MapReduce, Kafka 2.8, and Agile process.

Python Developer



  • Wrote scripts and an indexing strategy for migration to Amazon Redshift from SQL Server and MySQL databases.
  • Implemented software enhancements to port legacy software systems to Spark and Hadoop ecosystems on Azure Cloud.
  • Involved in relational and dimensional data modeling for creating logical and physical database designs and ER diagrams, with all related entities and their relationships based on the rules provided by the business manager.
  • Used SAS, SQL, Oracle, Teradata, and MS Office analysis tools to complete analysis requirements. Created SAS data sets by extracting data from Oracle databases and flat files.
  • Analyzed existing systems and proposed process and system improvements, including the use of modern scheduling tools such as Airflow and the migration of legacy systems into an enterprise data lake built on Azure Cloud.
  • Designed and Implemented Sharding and Indexing Strategies for MongoDB servers.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Generated weekly and monthly reports and maintained manipulated data using SAS macros and Tableau.
  • Expertise in all facets of Business Intelligence applications with a strong background in Data extraction, data visualization, report generation, infographics, and information visualization.
  • Utilized Power BI and SSRS to produce parameter-driven, matrix, sub-report, drill-down, and drill-through reports and dashboards, with integrated report hyperlink functionality to access external applications, and made dashboards available in web clients and mobile apps.
  • Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry-leading Data modeling tools.
  • Used PROC SQL, PROC IMPORT, and the SAS DATA step to clean, validate, and manipulate data with SAS and SQL.
  • Optimized Hive queries using best practices and appropriate parameters, with technologies such as Hadoop, YARN, Python, and PySpark.
  • Prepared and uploaded SSRS reports. Managed database and SSRS permissions.
  • Developed SQL queries using stored procedures, common table expressions (CTEs), and temporary tables to support SSRS and Power BI reports.

Environment: Azure, Teradata, Sqoop, MongoDB, MySQL, HDFS, Linux, Shell scripts, SSIS, SSAS, HBase.
