Sr. Data Engineer Resume

Bellevue, WA


  • Over 7 years of IT experience in the design, development, maintenance, and support of big data applications.
  • Over 5 years of experience with Hadoop core and Azure data components such as HDFS, MapReduce, YARN, Hive, HBase, ADLS, Blob Storage, Databricks, and Azure Data Factory.
  • Exposure to Spark, Spark Streaming, Spark MLlib, and Scala, including creating and handling DataFrames in Spark with Scala.
  • Hands-on experience writing Spark SQL queries, working with DataFrames, importing data from data sources, performing transformations and read/write operations, and saving the results to an output directory in HDFS.
  • Experience using DStreams, accumulators, broadcast variables, and RDD caching for Spark Streaming.
  • Hands-on experience developing Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
  • Developed Python code to gather data from HBase and designed the solution for implementation in Spark.
  • Strong experience and knowledge of real-time data analytics using Spark, Kafka, and Flume.
  • Hands-on experience using Sqoop to capture data from existing relational databases that provide SQL interfaces (Oracle, MySQL, SQL, and Teradata).
  • Experience analyzing SQL scripts and designing solutions for implementation in Spark.
  • Worked with join patterns and implemented map-side and reduce-side joins using MapReduce.
  • Developed multiple MapReduce jobs to perform data cleaning and preprocessing.
  • Designed Hive queries and Pig scripts to perform data analysis, data transfer, and table design for loading data into the Hadoop environment.
  • Expertise in writing Hive UDFs and generic UDFs to incorporate complex business logic into Hive queries.
  • Extensive experience importing and exporting data using stream-processing platforms such as Flume and Kafka.
  • Good working experience with Azure infrastructure services: Azure Data Lake Storage Gen2 (ADLS), HDInsight, Azure Functions, and virtual machines.
  • Expertise in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
  • Worked with Amazon Web Services (AWS) cloud services such as EC2, S3, EBS, RDS, and VPC.
  • Implemented a variety of AWS computing and networking services to meet the needs of applications.
  • Experience composing shell scripts to dump shared data from MySQL servers to HDFS.
  • Experience using ZooKeeper and the Oozie workflow scheduler to manage Hadoop jobs as a directed acyclic graph (DAG) of actions with control flows.
  • Experienced in performance tuning and real-time analytics in both relational and NoSQL (HBase) databases.
  • Implemented and optimized Hadoop/MapReduce algorithms for big data analytics.
  • Experience with MongoDB, Cassandra, and various NoSQL databases such as HBase, Neon, and Redis.
  • Experience setting up Hadoop clusters, both in-house and in the cloud.
  • Profound experience working with Cloudera (CDH4 and CDH5), Hortonworks, and Amazon EMR Hadoop distributions on multi-node clusters.
  • Exposure to simplifying and automating big data integration using Talend's graphical tools and wizards that generate native code.
  • Worked on different file formats (ORCFILE, TEXTFILE) and different Compression Codecs (GZIP, SNAPPY, LZO).
  • Good understanding of testing practices, including unit, regression, Agile, white-box, and black-box testing.
  • Adept in Agile/Scrum methodology and familiar with the SDLC from requirements analysis through system study, design, testing, debugging, documentation, and implementation.
  • Techno-functional responsibilities include interfacing with users, identifying functional and technical gaps, estimating, designing custom solutions, development, producing documentation, and production support.
  • Excellent interpersonal and communication skills; creative, research-minded, technically competent, and result-oriented with strong problem-solving skills.


Confidential - Bellevue, WA

Sr. Data Engineer


  • Analyze and cleanse raw data using SparkSQL and PySpark.
  • Performed data transformations on different file formats using Azure HDInsight and Hive.
  • Analyzed user needs, interacted with various SORs to understand their incoming data structures, and ran POCs with the best possible processing framework on the big data platform.
  • Documented the results along with the tools and technologies that could be implemented for each business use case.
  • Developed Spark and Spark SQL code to process data in Apache Spark on Azure HDInsight and perform the necessary transformations based on the STMs developed.
  • Used Spark to improve the performance of existing Hadoop algorithms through SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Developed UNIX shell scripts to load a large number of files into HDFS from the Linux file system.
  • Played a key role in finalizing the tech stack for our project (GPC) and ran rigorous end-to-end testing to validate both user needs and technical requirements.
  • Ran data formatting scripts in Java and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs.
  • Developed Python code on Vagrant machines, using version control tools such as GitHub and SVN.
  • Collaborated with internal application teams to fit our business models onto the existing on-premises platform setup.
  • Created, dropped, and altered tables at run time without blocking updates and queries using HBase and Hive.
  • Encoded and decoded JSON objects using Spark to create and modify DataFrames.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs.
  • Migrated an existing on-premises application to Azure.
  • Used Azure services such as ADLS and Synapse Analytics for small data sets.
  • Created flow diagrams and UML diagrams of the designed architecture to explain it to product owners and business teams and obtain their approval for all requested user requirements.
  • Created Azure Functions and configured them to receive events from the Synapse warehouse.
  • Developed the ETL data pipeline to load data from the centralized data lake (Azure Data Lake) into Azure Synapse using Spark.
  • Integrated with RESTful APIs to create ServiceNow incidents when a process fails within the batch job.
  • Analyzed SQL scripts and designed the solution for implementation in Spark.
  • Developed a capability to implement audit logging at required stages while applying business logic.
  • Used Spark DataFrames on large incoming datasets in various formats such as JSON, CSV, and Parquet.
  • Actively resolved many technical challenges, such as handling nested JSON with multiple data sections in the same file and converting the sections into Spark-friendly DataFrames.
  • Reformatted the end results into the formats requested by the SORs.
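
Illustration of the nested-JSON handling mentioned above: a minimal sketch in plain Python, not the project's actual code. The section names (`orders`, `payments`), the sample document, and the helpers `flatten` and `section_rows` are all hypothetical. It assumes each top-level section is a list of records and turns each into flat rows, which is one shape a Spark DataFrame can be built from.

```python
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Collapse nested dicts into one level, joining key names with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the joined prefix along.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def section_rows(document: Dict[str, Any], section: str) -> List[Dict[str, Any]]:
    """Return one flat row per entry of a list-valued section of the document."""
    return [flatten(entry) for entry in document.get(section, [])]


# Hypothetical input: a single file holding two unrelated data sections.
document = {
    "orders": [{"id": 1, "customer": {"name": "A", "address": {"city": "Bellevue"}}}],
    "payments": [{"order_id": 1, "amount": 25.0}],
}

orders = section_rows(document, "orders")
payments = section_rows(document, "payments")
```

Each resulting list of flat dictionaries could then be handed to a DataFrame constructor, one DataFrame per section.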

Environment: Spark, Azure, Python, Spark SQL, Cassandra, Azure Data Lake, HDInsight, Databricks, HDFS, Hive, Apache Kafka, Scala, Shell scripting, Linux, Jenkins, Eclipse, Git

Confidential - Pittsburgh, PA

Data Engineer


  • Worked in a fast-paced Agile development environment to quickly analyze, develop, and test potential use cases for the business.
  • Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that receives data from Kafka in near real time and persists it to Cassandra.
  • Developed a Kafka consumer API in Scala to consume data from Kafka topics.
  • Consumed XML messages from Kafka and processed them with Spark Streaming to capture UI updates.
  • Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
  • Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
  • Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
  • Optimized Spark jobs to run on a Kubernetes cluster for faster data processing.
  • Implemented Elasticsearch on the Hive data warehouse platform.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs, Java, and Scala.
  • Designed column families in Cassandra, ingested data from RDBMS sources, performed data transformations, and then exported the transformed data to Cassandra per business requirements.
  • Used the DataStax Spark Cassandra Connector to load data to and from Cassandra.
  • Created data models for client data sets and analyzed data in Cassandra tables for quick searching, sorting, and grouping using the Cassandra Query Language (CQL).
  • Tested cluster performance using the cassandra-stress tool to measure and improve reads and writes.
  • Used HiveQL to analyze partitioned and bucketed data, and executed Hive queries on Parquet tables to perform data analysis that met business requirements.
  • Used Kafka features such as distribution, partitioning, and the replicated commit log service to maintain feeds for messaging systems.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis.
  • Experience using Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive.
  • Developed Autosys jobs for scheduling.
  • Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
  • Generated various kinds of reports using Power BI and Tableau based on client’s requirements.
  • Used Jira for bug tracking and Bitbucket to check code changes in and out.
  • Worked with Network, database, application and BI teams to ensure data quality and availability.
  • Prepared ITSM documents (implementation plan, DMIO document, runbook, PT metrics) and obtained sign-off from the respective teams to deploy the code to production.
  • Assisted in deployment and provided technical and operational support during installation.
  • Provided post-implementation support.
  • Coordinated with the offshore team.
  • Reviewed code developed by the offshore team and validated the test results.
  • Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.
  • Read and wrote multiple data formats such as JSON, ORC, and Parquet on HDFS using Spark.
  • Generated actionable insights from complex data to drive business results for various application teams, and worked extensively in Agile projects.
  • Worked with the Scrum team to deliver agreed user stories on time in every sprint.

Environment: Spark, Spark SQL, Azure, HDFS, Hive, Apache Kafka, Sqoop, Scala, Shell scripting, Linux, MySQL, Oracle Enterprise DB, Python, Jenkins, Git, Oozie, SOAP, NiFi, Cassandra, and Agile methodologies.

Confidential - New York, NY

Data Engineer


  • Worked on migrating MapReduce programs into Spark transformations using Spark and Python.
  • Developed Spark jobs using Scala on top of YARN/MRv2 for interactive and batch analysis.
  • Queried data using Spark SQL on top of the Spark engine for faster data set processing.
  • Worked on ad hoc queries, indexing, replication, load balancing, and aggregation in MongoDB.
  • Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded them into MongoDB for further analysis.
  • Expert knowledge of MongoDB NoSQL data modeling, tuning, and disaster recovery backup; used it for distributed storage and processing with CRUD operations.
  • Extracted and restructured data into MongoDB using the import and export command-line utilities.
  • Extracted files from MongoDB through StreamSets, placed them in HDFS, and processed them.
  • Used Amazon DynamoDB to gather and track event-based metrics.
  • Set up a fan-out workflow in Flume, designing a V-shaped architecture that takes data from many sources and ingests it into a single sink.
  • Implemented custom serializers and interceptors in Flume to mask confidential data and filter unwanted records from the event payload.
  • Worked with Apache Solr to implement indexing and wrote custom Solr query segments to optimize search.
  • Used Apache Solr for indexing and load-balanced querying to search for specific data in large datasets, and implemented a near-real-time Solr index on HBase and HDFS.
  • Worked with different join patterns and implemented both map-side and reduce-side joins.
  • Used AWS Step Functions to monitor the series of ETL tasks that make up workflows.
  • Developed data load functions that read the schema of the input data and load the data into a table.
  • Used Spark SQL to analyze and transform DataFrames created from the SQS queue and load them into database tables.
  • Used Amazon S3 to persist transformed Spark DataFrames in S3 buckets and as a data lake for the data pipeline running on Spark and MapReduce.
  • Wrote Flume configuration files to import streaming log data into HBase.
  • Imported web server logs into HDFS with Flume, and used Flume with a spool directory to load data from the local system into HDFS.
  • Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
  • Created Partitioned Hive tables and worked on them using HiveQL.
  • Loaded data into HBase using bulk and non-bulk loads.
  • Installed and configured Talend ETL in single- and multi-server environments.
  • Monitored the Hadoop cluster using Cloudera Manager, worked with Cloudera support, logged issues in the Cloudera portal, and fixed them per the recommendations.
  • Worked with Tableau, integrated Hive with Tableau Desktop reports, and published them to Tableau Server.
  • Set up the whole application stack, and set up and debugged Logstash to send Apache logs to AWS Elasticsearch.
  • Used ZooKeeper to coordinate servers in clusters and maintain data consistency.
  • Worked in an Agile development environment using the Kanban methodology. Actively involved in daily scrums and other design-related meetings.
  • Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive, and Sqoop.

Environment: Hadoop, HDFS, Hive, MapReduce, AWS EC2, Solr, Impala, MySQL, Sqoop, Kafka, Spark, SQL, Talend, Python, YARN, Pig, Oozie, Linux-Ubuntu, Scala, Ab Initio, Maven, Jenkins, Cloudera, JUnit, Agile methodologies.

Confidential - Dayton, OH

ETL Developer


  • Prepared functional requirement specifications and performed coding, bug fixing, and support.
  • Involved in various phases of the software development life cycle (SDLC), including requirements gathering, data modeling, analysis, architecture design, and development for the project.
  • Developed SSIS pipelines to automate ETL activities and migrate SQL Server data to Azure SQL Database.
  • Built SSIS packages and scheduled jobs to migrate data from disparate sources into SQL Server and vice versa.
  • Created SSIS packages that loaded data daily from different sources to create and maintain a centralized data warehouse. Made the packages dynamic so they fit the environment.
  • Developed schemas for business applications, provided full life-cycle architectural guidance, and ensured quality technical deliverables.
  • Developed data profiling, munging, and missing-value imputation scripts in Python on raw data as part of understanding the data and its structure.
  • Managed a team of analysts responsible for executing business reporting functions, performing analysis, and driving operational metrics.
  • Formulated the strategy, development, and implementation of executive and line-of-business dashboards through SSRS.
  • Managed internal sprints, release schedules, and milestones through JIRA. Functioned as the primary point of contact for the client's business analysts, directors, and data engineers for project communications.
  • Involved in the design, development, and modification of PL/SQL stored procedures, functions, packages, and triggers to implement business rules in the application.
  • Developed ETL processes to load data from flat files, SQL Server, and Access into the target Oracle database, applying business logic in the transformation mappings to insert and update records during loading.
  • Gained solid Informatica ETL development experience in an offshore/onsite model and took part in ETL code reviews and testing of ETL processes.
  • Scheduled sessions to extract, transform, and load data into the warehouse database based on business requirements.

Environment: MSBI, SSIS, SSRS, SSAS, Informatica, ETL, PL/SQL, SQL Server 2000, Ant, CVS, Hibernate, Eclipse, Linux


BI Developer


  • Designed SSIS packages to transfer data from various sources. Created various MSBI (SSIS) packages using data transformations such as Merge, Aggregate, Sort, Multicast, Conditional Split, SCD (Slowly Changing Dimension), and Derived Column.
  • Used various transformations in SSIS control flow, including loop containers.
  • Implemented event handlers and error handling in SSIS packages.
  • Created SSIS packages for data conversion using the Data Conversion transformation.
  • Planned, designed, and implemented application database code objects, such as stored procedures and views.
  • Used SQL to query databases and perform various validation and mapping activities.
  • Built and maintained SQL scripts, indexes, and complex queries for data analysis and extraction.
  • Validated accuracy of data to ensure database integrity.
  • Created subreports, linked reports, snapshot reports, cached reports, and ad hoc reports using SSRS.
  • Designed and built star and snowflake dimensional models, creating facts, dimensions, measures, and cubes, and established data granularity.
  • Created calculated measures and dimension members using Multidimensional Expressions (MDX).
  • Monitored SQL Server performance using SQL Profiler, Query Analyzer, Database Engine Tuning Advisor (DTA), and performance monitoring.
  • Developed functionality using Agile methodology.

Environment: SSIS, SSRS, SQL Server, SSAS, T-SQL, SSDT, MDX, DAX, OLAP.
