
Sr. Big Data/data Engineer Resume

New York, NY


  • Around 8 years of professional experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
  • Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
  • Experience in working with different Hadoop distributions like CDH and Hortonworks.
  • Experienced in using Spark to improve the performance and optimization of existing Hadoop algorithms with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Proficient in various Big Data technologies such as Hadoop, Apache NiFi, Hive Query Language, the HBase NoSQL database, Sqoop, Spark, Scala, Oozie, and Pig, as well as Oracle Database and Unix shell scripting.
  • Experience in change implementation, monitoring, and troubleshooting of Snowflake databases on AWS and cluster-related issues.
  • 3 years of experience as an ETL/Snowflake developer.
  • Worked on loading data into Snowflake DB in the cloud from various sources.
  • Implemented Enterprise Data Lakes using Apache NiFi.
  • Developed and designed Microservices components for the business by using Spring Boot.
  • Experience developing Pig Latin and HiveQL scripts for data analysis and ETL, and extending the default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom data-specific processing.
  • Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Good experience in creating data ingestion Pipelines, Data Transformations, Data Management, Data Governance, and real time streaming at an enterprise level.
  • Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
  • Involved in designing and deploying multi-tier applications using AWS services (EC2, Route53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
  • Experience in using SDLC methodologies like Waterfall, Agile Scrum for design and development.
  • Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
  • Experience in implementing Azure data solutions, including provisioning storage accounts, Azure Data Factory, SQL Server, SQL Database, SQL Data Warehouse, Azure Databricks, and Azure Cosmos DB.
  • In depth understanding of Hadoop Architecture and its various components such as Resource Manager, Application Master, Name Node, Data Node, HBase design principles etc.
  • Experience with Cloudera distributions (CDH4/CDH5).
  • Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
  • Experience with migrating data to and from RDBMS and unstructured sources into HDFS using Sqoop.
  • Excellent programming skills at a higher level of abstraction using Scala, AWS, and Python.
  • Experience in job workflow scheduling and monitoring tools like Oozie, and good knowledge of Zookeeper.
  • Profound understanding of Partitions and Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
  • Experience in writing real time query processing using Cloudera Impala.
  • Strong working experience in planning and carrying out of Teradata system extraction using Informatica, Loading Process and Data warehousing, Large-scale Database Management and Reengineering.
  • Highly experienced in creating complex Informatica mappings and workflows working with major transformations.
  • In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAGScheduler, TaskScheduler, and stages; worked on NoSQL databases including HBase and MongoDB.
  • Experienced with performing CRUD operations using HBase Java Client API and Solr API.
  • Good experience in working with cloud environment like Amazon Web Services (AWS) EC2 and S3.
  • Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins, and AWS.
  • Good understanding of Spark architecture with Databricks and Structured Streaming, including setting up Databricks on AWS and Microsoft Azure, using the Databricks workspace for business analytics, managing clusters in Databricks, and managing the machine learning lifecycle.
  • Experience in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch.
  • Managed Docker orchestration and Docker containerization using Kubernetes.
  • Experience writing Shell scripts in Linux OS and integrating them with other solutions.
  • Strong Experience in working with Databases like Oracle 10g, DB2, SQL Server 2008 and MySQL and proficiency in writing complex SQL queries.
  • Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
  • Experience in automation and building CI/CD pipelines using Jenkins and Chef.
  • Experience with the Agile Scrum methodology.
  • Experience with Snowflake multi-cluster warehouses.
  • Understanding of Snowflake cloud technology.
  • Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Keen to learn the newer technology stack that Google Cloud Platform (GCP) adds.
  • Research and resolve issues regarding Scrum/Kanban methodologies and process improvement.
  • Strong knowledge of data warehousing implementation concepts in Redshift. Completed a POC with Matillion and Redshift for DW implementation.
  • Very good exposure and rich experience in ETL tools like Alteryx, Matillion & SSIS


Hadoop Technologies: HDFS, MapReduce, YARN, Hive, Pig, HBase, Impala, Zookeeper, Sqoop, Oozie, Apache Cassandra, Flume, Spark, AWS, EC2


Web Technologies: HTML, CSS, JavaScript, AJAX, Servlets, JSP, DOM, XML, XSLT

Languages: C, Java, SQL, PL/SQL, Scala, Shell Scripts

Operating Systems: Linux, UNIX, Windows

Databases: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, HBase

Application Servers: WebLogic, WebSphere, Apache Tomcat, JBOSS

IDEs: Eclipse, NetBeans, JDeveloper, IntelliJ IDEA

Version Control: CVS, SVN, Git

Reporting Tools: Jaspersoft, Qlik Sense, Tableau, JUnit


Confidential, New York, NY

Sr. Big Data/Data Engineer


  • Developed data pipelines using Spark and PySpark.
  • Analyzed SQL scripts and designed the solutions to implement using PySpark.
  • Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
  • Used Pandas, NumPy, Spark in Python for developing Data Pipelines.
  • Perform Data Cleaning, features scaling, features engineering using pandas and NumPy packages in Python.
  • Part of team conducting logical Data analysis and Data modeling JAD sessions, communicated data-related standards.
  • Worked collaboratively to manage build outs of large data clusters and real time streaming with Spark.
  • Implemented the Kafka-to-Hive streaming process flow and batch loading of data into MariaDB using Apache NiFi.
  • Involved in the analysis, design, and development and testing phases of Software Development Lifecycle (SDLC)
  • Involved in SDLC requirements gathering, analysis, design, development and testing of application, developed using AGILE/Scrum methodology
  • Implement end-end data flow using Apache NiFi.
  • Responsible for loading Data pipelines from web servers using Kafka and Spark Streaming API.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
  • Developed the batch scripts to fetch the data from AWS S3 storage and do required transformations in Scala using Spark framework.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Data Processing: Processed data using Map Reduce and Yarn. Worked on Kafka as a proof of concept for log processing.
  • Evaluate Snowflake Design considerations for any change in the application
  • Build the Logical and Physical data model for snowflake as per the changes required
  • Define virtual warehouse sizing for Snowflake for different type of workloads.
  • Consulting on Snowflake Data Platform Solution Architecture, Design, Development, and deployment focused to bring the data driven culture across the enterprises
  • Monitored the Hive Metastore and the cluster nodes with the help of Hue.
  • Created Data Pipeline using Processor Groups and multiple processors using Apache NiFi for Flat File, RDBMS as part of a POC using Amazon EC2.
  • Created AWS EC2 instances and used JIT Servers.
  • Extensively worked on various GCP infrastructure design and implementation strategies and experienced in Designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.
  • Developed various UDFs in Map-Reduce and Python for Pig and Hive.
  • Data Integrity checks have been handled using Hive queries, Hadoop, and Spark.
  • Built Hadoop solutions for big data problems using MR1 and MR2 on S3.
  • Moved data from S3 bucket to Snowflake Data Warehouse for generating the reports.
  • Bulk loading and unloading data into Snowflake tables using COPY command.
  • Created DWH, Databases, Schemas, Tables, write SQL queries against Snowflake.
  • Validate the data feed from the source systems to Snowflake DW cloud platform.
  • Integrated and automated data workloads to Snowflake Warehouse.
  • Ensure ETL/ELTs succeeded and loaded data successfully in Snowflake DB
  • Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
  • Implemented the Machine learning algorithms using Spark with Python.
  • Defined job flows and developed simple to complex Map Reduce jobs as per the requirement.
  • Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Developed PIG UDFs for manipulating the data according to Business Requirements and worked on developing custom PIG Loaders.
  • Used different AWS Data Migration Services and Schema Conversion Tool along with Matillion ETL tool.
  • Responsible in handling Streaming data from web server console logs.
  • Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
  • Developed PIG Latin Scripts for the analysis of semi structured data.
  • Used Hive and created Hive Tables and involved in data loading and writing Hive UDFs.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Installed and configured Apache Hadoop to test the maintenance of log files in Hadoop cluster.
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using FLUME and SQOOP, and performed structural modifications using Map Reduce, HIVE.
  • Involved in NoSQL database design, integration, and implementation.
  • Loaded data into NoSQL database HBase.
  • Developed Kafka producer and consumers, HBase clients, Spark, and Hadoop MapReduce jobs along with components on HDFS, Hive.
  • Very good understanding of Partitions, bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
  • Extensive experience in Amazon Web Services (AWS) cloud services such as EC2, VPC, S3, IAM, EBS, RDS, ELB, Route53, OpsWorks, DynamoDB, Auto Scaling, CloudFront, CloudTrail, CloudWatch, CloudFormation, Elastic Beanstalk, AWS SNS, AWS SQS, AWS SES, AWS SWF & AWS Direct Connect.
  • Leveraged Spark as an ETL tool for building data pipelines on various cloud platforms like AWS EMR, Azure HDInsight, and MapR CLDB architectures.
  • Experience with designing, building, and operating solutions using virtualization using private hybrid/public cloud technologies.
  • Automated infrastructure creation for Kafka clusters using Terraform, provisioning multiple EC2 instances per cluster component and attaching ephemeral or EBS volumes according to instance type across different availability zones and multiple AWS regions.

Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, AWS, ETL, Pig, UNIX, Linux, Tableau, Teradata, Sqoop, Hue, Oozie, Java, Scala, Python, Git, Alteryx, Matillion & SSIS, Azure Databricks, Snowflake, GCP.

Confidential, Philadelphia, PA

Data Engineer


  • Experienced in development using Cloudera Distribution System.
  • Experience in designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.
  • Able to work in both GCP and Azure clouds in parallel.
  • Build data pipelines in airflow in GCP for ETL related jobs using different airflow operators.
  • Experience in moving data between GCP and Azure using Azure Data Factory.
  • Design and develop ETL integration patterns using Python on Spark.
  • Optimized the PySpark jobs to run on secured clusters for faster data processing.
  • Developed Spark scripts by using Python and Scala shell commands as per the requirement.
  • Used Python for SQL/CRUD operations in DB, file extraction/transformation/generation.
  • Developed spark applications in Python (PySpark) on distributed environment to load huge number of CSV files with different schema in to Hive ORC tables.
  • Designing and Developing Apache NiFi jobs to get the files from transaction systems into data lake raw zone.
  • Analyzed the user requirements and implemented the use cases using Apache NiFi.
  • Proficient working experience on big data tools like Hadoop, Azure Data Lake, and AWS Redshift.
  • Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
  • Excellent working with Data modeling tools like Erwin, Power Designer and ER Studio.
  • Managed the data pipelines and data lake as a Hadoop developer.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Designed custom Spark REPL application to handle similar datasets.
  • Used Hadoop scripts for HDFS (Hadoop File System) data loading and manipulation.
  • Performed Hive test queries on local sample files and HDFS files.
  • Used AWS services like EC2 and S3 for small data sets.
  • Managed user and AWS access using AWS IAM and KMS.
  • Deployed microservices into AWS - EC2.
  • Implementation of data movements from on-premises to cloud in Azure.
  • Designed relational and non-relational data stores on Azure
  • Strong working knowledge of Kubernetes and Docker.
  • Developed the application on Eclipse IDE.
  • Developed Hive queries to analyze data and generate results.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark Engine for batch processing.
  • Worked on analyzing Hadoop cluster and different Big Data analytic tools including Pig, hive, HBase, Spark and Sqoop.
  • Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
  • Used Scala to write code for all Spark use cases.
  • Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
  • Assigned name to each of the columns using case class option in Scala.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
  • Involved in converting HQL queries into Spark transformations using Spark RDDs with Python and Scala.
  • Developed multiple Spark SQL jobs for data cleaning.
  • Created Hive tables and worked on them using Hive QL.
  • Assisted in loading large sets of data (Structure, Semi Structured, and Unstructured) to HDFS.
  • Developed Spark SQL to load tables into HDFS to run select queries on top.
  • Developed analytical component using Scala, Spark, and Spark Stream.
  • Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Used visualization tools such as Power View for Excel and Tableau for visualizing and generating reports.
  • Worked on the NoSQL databases HBase and MongoDB.

Environment: Hadoop, Hive, Oozie, Java, Linux, Maven, Apache NiFi, Oracle 11g/10g, Zookeeper, MySQL, Spark, Managing Clusters Databricks, Azure, GCP

Confidential, Boston, MA

Big Data Engineer


  • Using Sqoop to import and export data from Oracle and PostgreSQL into HDFS to use it for the analysis.
  • Migrated Existing MapReduce programs to Spark Models using Python.
  • Migrating the data from Data Lake (hive) into S3 Bucket.
  • Performed data validation between the data present in the Data Lake and the S3 bucket.
  • Used Spark Data Frame API over Cloudera platform to perform analytics on hive data.
  • Designed batch processing jobs using Apache Spark to increase speeds ten-fold compared to that of MR jobs.
  • Used Kafka for real time data ingestion.
  • Created different topic for reading the data in Kafka.
  • Read data from different topics in Kafka.
  • Involved in converting the hql’s in to spark transformations using Spark RDD with support of python and Scala.
  • Implemented Azure Data Factory operations and deployment into Azure for moving data from on-premises into cloud.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Migrated an existing on premises application to AWS.
  • Used AWS Cloud with Infrastructure Provisioning / Configuration.
  • Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Created many Spark UDFs and UDAFs in Hive for functions that were not preexisting in Hive and Spark SQL.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Implementing different performance optimization techniques such as using distributed cache for small datasets, partitioning, and bucketing in hive, doing map side joins etc.
  • Good knowledge of Spark platform parameters like memory, cores, and executors.
  • By using Zookeeper implementation in the cluster, provided concurrent access for Hive Tables with shared and exclusive locking.
  • Configured the monitoring solutions for the project using Data Dog for infrastructure, ELK for app logging.

Environment: Linux, Apache Hadoop Framework, HDFS, YARN, HIVE, HBASE, AWS (S3, EMR), Scala, Spark, SQOOP.


Data Analyst


  • Document the complete process flow to describe program development, logic, testing, and implementation, application integration, coding.
  • Recommended structural changes and enhancements to systems and Databases.
  • Conducted Design reviews and technical reviews with other project stakeholders.
  • Was a part of the complete life cycle of the project from the requirements to the production support.
  • Created test plan documents for all back-end database modules.
  • Used MS Excel, MS Access, and SQL to write and run various queries.
  • Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
  • Worked with internal architects and assisting in the development of current and target state data architectures.
  • Coordinated with business users to provide appropriate, effective, and efficient ways to design new reporting needs based on user requirements and existing functionality.
  • Remained knowledgeable in all areas of business operations to identify system needs and requirements.
