Sr. Data Engineer Resume

Boston, MA


  • 7+ years of experience as a Big Data Engineer and Data Analyst with data applications, relational databases, NoSQL databases, data warehousing, and cloud technologies like AWS, Azure and GCP
  • Extensive experience in working with Cloudera (CDH) and Hortonworks Hadoop distributions and Amazon EMR on AWS, to fully leverage and implement new Hadoop features
  • Hands-on experience with Hadoop, HDFS, Hive, Sqoop, Pig, HBase, Oozie, Flume, Spark, MapReduce, Cassandra, Zookeeper, YARN, Kafka, Scala, PySpark, Airflow, Snowflake, SQL, Python
  • Utilized Sqoop to migrate data between RDBMS, NoSQL databases and HDFS
  • Experience in importing streaming data into HDFS using Flume sources and sinks, and transforming the data using Flume interceptors
  • Developed Pig Latin scripts to extract data from web server output files, load it into HDFS and transform it into a format that can be used by business users
  • Configured Oozie workflows to run multiple Hive and Pig jobs that run independently based on time and data availability
  • Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra and MongoDB
  • Experience in developing and scheduling ETL workflows in Hadoop using Oozie, along with deploying and managing Hadoop clusters using Cloudera and Hortonworks
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing like collecting, aggregating and moving data from various sources using Apache Flume, Kafka, Power BI, Microsoft SSIS and Databricks
  • Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL and Python scripting
  • Experience in developing and designing data integration solutions using ETL tools such as Informatica PowerCenter for handling large volumes of data
  • Working on AWS services: IAM, EC2, VPC, AMI, SNS, RDS, SQS, EMR, Lambda, Glue, Athena, DynamoDB, CloudWatch, Auto Scaling, S3, and Route 53
  • Implemented Lambda to configure the DynamoDB auto scaling feature and implemented a Data Access Layer to access AWS DynamoDB data
  • Developed and deployed various Lambda functions in AWS with in-built AWS Lambda libraries and also deployed Lambda functions in Scala with custom libraries
  • Experience with developing and maintaining applications written for AWS S3, AWS EMR (Elastic MapReduce), and AWS CloudWatch
  • Used Amazon EMR to create Spark clusters and EC2 instances and imported data stored in S3
  • Experience in tuning EMR to meet requirements for importing and exporting data using stream processing platforms like Kafka
  • Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka and Flume
  • Experience in Apache Airflow to author workflows as directed acyclic graphs (DAGs), to visualize batch and real-time data pipelines running in production, monitor progress, and troubleshoot issues when needed
  • Streamed data from various cloud (AWS, Azure) and on-premises sources using Spark
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities in Scala
  • Worked within Azure Cloud Services (PaaS & IaaS), Azure Databricks, Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure HDInsight, Key Vault and Azure Data Lake for data ingestion, ETL processes, data integration, data migration and AI solutions
  • Ingested data into Azure Blob storage and processed the data using Databricks. Involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets
  • Experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics
  • Created Databricks notebooks to streamline and curate the data for various business use cases and also mounted Blob storage on Databricks
  • Experience in building data pipelines and computing large volumes of data using Azure Data Factory
  • Developed Python scripts to do file validations in Databricks and automated the process using ADF
  • Developed streaming pipelines using Azure Event Hubs and Stream Analytics to analyze data coming in from IoT devices for efficiency and open table counts
  • Excellent SQL programming skills; developed stored procedures, triggers, functions and packages using SQL and PL/SQL
  • Good experience in shell scripting, SQL Server, UNIX, T-SQL, PL/SQL scripts and Linux, and knowledge of version control with GitHub
  • Worked on different libraries related to data science and machine learning like Pandas, NumPy, SciPy, Matplotlib, Seaborn, Bokeh, NLTK, scikit-learn, OpenCV, TensorFlow
  • Hands-on experience in using visualization tools like Tableau and Power BI
  • Involved in the design, development and testing phases of applications using Agile methodology
  • Used JIRA as an Agile tool to keep track of the tickets that were worked on
  • Experienced in working in SDLC, Agile and Waterfall methodologies


Hadoop/Big Data Technologies: Hadoop, Apache Spark, HDFS, MapReduce, Sqoop, Hive, Oozie, Zookeeper, Cloudera Manager, Kafka, Flume

Programming & Scripting: Python, Scala, SQL, Shell Scripting, R

Databases: MySQL, Oracle, PostgreSQL, MS SQL Server, Teradata

NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB, Cosmos DB

Hadoop Distribution: Hortonworks, Cloudera, Spark

Version Control: Git, Bitbucket, SVN

Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7

Cloud Computing: AWS, Azure, GCP


Confidential, Boston, MA

Sr. Data Engineer


  • Importing data from sources such as the customer database, MySQL, MongoDB and SFTP folders to convert raw data into a structured format and analyze customer patterns/trends
  • Utilizing Python/API calls to import data from databases like MySQL, PostgreSQL, MongoDB and different web applications
  • Importing data from various data sources with Apache Airflow, performing data transformation using Hive and MySQL, and loading data into HDFS
  • Parsing data from S3 via Python API calls through Amazon API Gateway, generating a batch source for processing
  • Extracting real-time feeds using Kafka and Spark Streaming, converting them to RDDs, processing the data as DataFrames and saving it in Parquet format in HDFS
  • Using Spark, performing various transformations/actions and saving the final result back to HDFS, and from there to the target database, Snowflake
  • Migrating data from external sources like S3 in text/Parquet file formats into AWS Redshift, and designing and developing ETL processes in AWS Glue
  • Developing PySpark jobs on EMR to transform raw data into a structured format
  • Utilizing Apache Spark to write code in PySpark and Spark SQL to process data on Amazon EMR and perform necessary transformations
  • Processing data with stateless and stateful transformations in Spark Streaming programs to handle near-real-time data from Kafka
  • Building the structured data model with Elasticsearch using Python/Spark and developing the ETL pipelines for further analysis
  • Building ETL pipelines to automate data transformation, increasing the efficiency, scalability and reusability of data
  • Receiving events from S3 buckets by creating and configuring a Lambda deployment function
  • Writing AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters
  • Deploying AWS Lambda code from Amazon S3 buckets and implementing a 'serverless' architecture using API Gateway, Lambda, and DynamoDB
  • Loaded data into S3 buckets using AWS Lambda functions, AWS Glue and PySpark, filtered data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables. Maintained and operated a Hadoop cluster on AWS EMR
  • Created Databricks notebooks to streamline and curate the data for various business use cases and also mounted S3 on Databricks hosted on AWS
  • Developing Airflow workflows to schedule batch and real-time data loads from source to target
  • Monitoring resources and applications using AWS CloudWatch, including creating alarms on metrics for EBS, EC2, ELB, RDS, S3 and SNS, and configuring notifications for the alarms generated based on defined events
  • Working on Apache Airflow for data ingestion, Hive & Spark for data processing & Oozie for designing complex workflows in the Hadoop framework
  • Developed and ran UNIX shell scripts and implemented an auto deployment process
  • Automated resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production
  • Generating the necessary reports in Tableau by creating the workflow model with the data lake in Hadoop

Environment: AWS (EC2, S3, EBS, ELB, RDS, CloudWatch), Cassandra, PySpark, Apache Spark, HBase, Apache Kafka, Hive, Python, Spark Streaming, Machine Learning, Snowflake, Oozie, Tableau, Power BI, NoSQL, PostgreSQL, Shell Scripting, Scala

Confidential, New York, NY

Data Engineer


  • Collected data from various sources like the customer transaction database, MongoDB, Azure Blob Storage and MS SQL Server, and converted it into an analysis-ready format to detect fraudulent transactions and customer churn
  • Worked on Python scripts to import data from sources like MS SQL Server, SQLite, Oracle DB
  • Imported and exported data between different sources and HDFS for further processing using Apache Sqoop
  • Developed Python scripts to do file validations in Databricks and automated the process using ADF
  • Implemented data pipelines in Azure Data Factory to extract, transform and load data from multiple sources like Azure SQL, Blob storage and Azure SQL Data Warehouse
  • Deployed the data pipeline in Azure Data Factory using JSON scripts to process the data
  • Extensively used Azure services like Azure Data Factory and Logic Apps for ETL, to push data in and out between databases and Blob storage, HDInsight (HDFS) and Hive tables
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
  • Extracted and loaded data into the data lake with ETL jobs and developed shell scripts to add dynamic partitions to the Hive stage
  • Used notebooks, Spark DataFrames, Spark SQL and Python scripting to build ETL pipelines in Databricks
  • Developed and maintained the data pipeline on Azure Analytics platform using Azure Databricks, PySpark, and Python
  • Involved in building an Enterprise Data Lake utilizing Data Factory and Blob storage, enabling different groups to work with more complex scenarios and ML solutions
  • Worked on Azure Functions written in Python to perform different transformation tasks and performed analytics in HDInsight on large volumes of data
  • Automated the loading of data into Blob Storage with Data Factory and PySpark, extracted the required data from Blob Storage using HDInsight and loaded it into HDFS
  • Developed an automated process in Azure cloud which can ingest data daily from a web service and load it into Azure SQL DB
  • Analyzed & transformed data from multiple file formats to uncover insights, with data transformation and aggregation in Spark applications using PySpark and Spark SQL
  • Monitored resources using Azure Automation and created alarms for VM, Blob Storage, ADF, Databricks and Synapse Analytics based on different events that occur
  • Wrote complex SQL queries using stored procedures, common table expressions (CTEs) and temporary tables to support Power BI reports
  • Worked in a SAFe (Scaled Agile Framework) team with daily standups and monthly/quarterly planning

Environment: Azure Cloud, Data Factory (ADF v2), Data Lake, Blob Storage, SQL Server, Teradata Utilities, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modelling Tool, Cosmos DB, Stream Analytics, Event Hubs

Confidential, MA

Data Analytics Engineer


  • Led a team of 2 people for telecom inventory maintenance of 5+ customers using SQL Server and achieved historic annual savings of 8% with service providers AT&T, Verizon, Granite Telecom, CenturyLink
  • Collaborated with cross-functional operations teams to gather, organize, customize and analyze product-related data to maintain inventory of customers worth around $10M-$40M
  • Used SQL on data sets to fulfill ad-hoc data requests and created report metrics which accelerated reporting by 3%
  • Created Tableau dashboards showing $96K in savings credited back to the client, which resulted in more business
  • Developed SQL queries with an 8% efficiency gain to fetch required data and analyzed expenses based on requirements
  • Wrote stored procedures and triggers to automate database tasks and speed up data retrieval
  • Communicated the progress of ongoing projects to CSMs and clients on weekly calls
  • Worked on ServiceNow tickets on a daily basis to resolve inventory-related issues and coordinated with audit, invoice and IT teams to update the charges
  • Developed interactive Tableau dashboards with calculated fields, parameters and sets as per the client's requirements and presented them on calls
  • Prepared technical specifications to develop ETL mappings to load data into various tables conforming to the business rules
  • Designed and implemented a data profiling and data quality improvement solution to analyze, match, cleanse, and consolidate data before loading into the data warehouse
  • Utilized Power BI functions and pivot tables to further analyze complex data
  • Responsible for designing and developing SSIS packages using various control flow tasks and data flow tasks for performing ETL operations
  • Extracted data from CRM systems to the staging area and loaded the data to the target database by ETL process using Informatica PowerCenter
  • Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior
  • Developed Python programs and batch scripts on Windows and Linux environments to automate ETL processes into AWS Redshift

Environment: MS SQL Server, Tableau, Power BI, MS Office, SSRS, SSIS, SAS, PL/SQL


Junior Data Engineer


  • Imported data from sources such as logs, APIs and databases like MySQL, Neo4j, Cassandra and DynamoDB to understand customer behavior and produce the necessary reports per client requirements
  • Worked on Python/Scala APIs to get data from DynamoDB, MySQL and Cassandra and loaded it into Databricks for further analysis
  • Created workflows with Sqoop to import data from databases and wrote Oozie workflows to perform the necessary ETL operations
  • Loaded Python scripts into AWS S3, DynamoDB and Snowflake to read CSV, JSON and Parquet files from S3 buckets
  • Extracted data with Apache Kafka, used Scala to convert it into DataFrames (distributed collections of data organized into named columns), and developed predictive analytics using the Apache Spark Scala APIs
  • Imported data from S3 into AWS Glue to perform ETL operations and stored the data in Redshift
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries and writing back into S3 buckets
  • Converted Hive/SQL queries into Spark transformations using Spark RDD and PySpark concepts; pulled data from Salesforce, ingested it into Redshift and saved it in Amazon S3 buckets
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests using Amazon API Gateway
  • Loaded data back into S3 using AWS Lambda functions, AWS Glue and PySpark
  • Used the Spark engine and Spark SQL for data analysis, provided the results to data scientists for further analysis, and wrote Spark applications for data validation, cleansing, transformation, and custom aggregation
  • Collaborated with business owners of products to understand business needs, and automated business processes and data storytelling in Tableau

Environment: AWS (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), Apache Spark, SQL, Scala, Snowflake, Python


ETL Developer


  • Designed and built a terabyte-scale, end-to-end data warehouse infrastructure from the ground up on Redshift for large-scale data, handling millions of records
  • Developed workflows in Oozie for business requirements to extract the data using Sqoop
  • For the data exploration stage, used Hive to get important insights about the processed data from HDFS
  • Worked on big data on AWS cloud services, i.e., EC2, S3, EMR and DynamoDB
  • Expert knowledge of Hive SQL, Presto SQL and Spark SQL for ETL jobs, choosing the right technology for the job
  • Responsible for ETL and data validation using SQL Server Integration Services
  • Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Redshift
  • Involved in data extraction from Oracle and flat files using SQL*Loader; designed and developed mappings using Informatica
  • Developed PL/SQL procedures/packages to kick off the SQL*Loader control files/procedures to load the data into Oracle
  • Built and maintained complex SQL queries for data analysis, data mining and data manipulation
  • Developed matrix and tabular reports by grouping and sorting rows
  • Actively participated in weekly meetings with the technical teams to review the code
  • Participated in the requirement gathering and analysis phase of the project, documenting the business requirements by conducting workshops/meetings with various business users
  • Built machine learning algorithms like linear regression, decision tree and random forest for continuous-variable problems, and estimated their performance on time series data
  • Analyzed large data sets using pandas to identify different trends/patterns about data
  • Utilized regression models using SciPy to predict future data and visualized them
  • Managed large datasets using Pandas DataFrames and MySQL for analysis purposes
  • Developed schemas to handle reporting requirements using Tableau

Environment: Python, Hadoop, MapReduce, HiveQL, Hive, HBase, Sqoop, Cassandra, Flume, Tableau, Impala, Oozie, MySQL, Oracle SQL, Pig Latin, AWS, NumPy, SciPy