
Sr. Data Engineer Resume


Boston, MA

SUMMARY

  • 7+ years of experience as a Big Data Engineer and Data Analyst working with data applications, relational databases, NoSQL databases, data warehousing, and cloud technologies such as AWS, Azure and GCP
  • Extensive experience working with Cloudera (CDH) and Hortonworks Hadoop distributions and AWS Amazon EMR to fully leverage and implement new Hadoop features
  • Hands-on experience with Hadoop, HDFS, Hive, Sqoop, Pig, HBase, Oozie, Flume, Spark, MapReduce, Cassandra, Zookeeper, YARN, Kafka, Scala, PySpark, Airflow, Snowflake, SQL, Python
  • Utilized Sqoop to migrate data between RDBMS, NoSQL databases and HDFS
  • Experience in importing streaming data into HDFS using Flume sources and sinks, and transforming the data using Flume interceptors
  • Developed Pig Latin scripts to extract data from web server output files, load it into HDFS and transform it into a format usable by business users
  • Configured Oozie workflows to run multiple Hive and Pig jobs that run independently based on time and data availability
  • Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra and MongoDB
  • Experience in developing and scheduling ETL workflows in Hadoop using Oozie, along with deploying and managing Hadoop clusters using Cloudera and Hortonworks
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing tasks such as collecting, aggregating and moving data using Apache Flume, Kafka, Microsoft SSIS, Power BI and Databricks
  • Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL and Python scripting
  • Experience in developing and designing data integration solutions using ETL tools such as Informatica Power Center for handling large volumes of data
  • Working with AWS services including IAM, EC2, VPC, AMI, SNS, RDS, SQS, EMR, Lambda, Glue, Athena, DynamoDB, CloudWatch, Auto Scaling, S3, and Route 53
  • Implemented Lambda to configure the DynamoDB auto scaling feature and implemented a data access layer to access AWS DynamoDB data
  • Developed and deployed various Lambda functions in AWS with built-in AWS Lambda libraries, and deployed Lambda functions in Scala with custom libraries
  • Experience with developing and maintaining applications written for AWS S3, AWS EMR (Elastic MapReduce), and AWS CloudWatch
  • Used Confidential EMR to create Spark clusters and EC2 instances and imported data stored in S3
  • Experience in tuning EMR according to requirements on importing and exporting data using stream processing platforms like Kafka
  • Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume
  • Experience with Apache Airflow to author workflows as directed acyclic graphs (DAGs), visualize batch and real-time data pipelines running in production, monitor progress, and troubleshoot issues when needed (a minimal DAG sketch follows this list)
  • Streamed data from various sources, both cloud (AWS, Azure) and on-premises, using Spark
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities in Scala
  • Worked with Azure Cloud Services (PaaS & IaaS), Azure Databricks, Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure HDInsight, Key Vault and Azure Data Lake for data ingestion, ETL processes, data integration, data migration and AI solutions
  • Ingested data into Azure Blob storage and processed the data using Databricks; involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets
  • Experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics
  • Created Databricks notebooks to streamline and curate data for various business use cases and mounted Blob storage on Databricks
  • Experience in building data pipelines and computing large volumes of data using Azure Data Factory
  • Developed Python scripts to do file validations in Databricks and automated the process using ADF
  • Developed streaming pipelines using Azure Event Hubs and Stream Analytics to analyze data efficiency and open table counts for data coming in from IoT devices
  • Excellent SQL programming skills and developed Stored Procedures, Triggers, Functions, Packages using SQL, PL/SQL
  • Good experience in shell scripting, SQL Server, UNIX, T-SQL, PL/SQL scripts and Linux, and knowledge of the version control software GitHub
  • Worked with different libraries related to data science and machine learning such as Pandas, NumPy, SciPy, Matplotlib, Seaborn, Bokeh, NLTK, Scikit-learn, OpenCV and TensorFlow
  • Hands on Experience in using Visualization tools like Tableau, Power BI
  • Involved in the design, development and testing phases of applications using the Agile methodology
  • Used JIRA as an agile tool to keep track of the tickets being worked on
  • Experienced in working in SDLC, Agile and Waterfall Methodologies
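
As a concrete illustration of the Airflow experience referenced above, here is a minimal sketch of a batch DAG (Airflow 2.x style); the dag_id, schedule and the extract/load callables are hypothetical placeholders rather than details from any actual project:

    # Minimal Airflow 2.x DAG sketch; task names and schedule are illustrative only
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # placeholder: pull data from a source system
        pass

    def load():
        # placeholder: write transformed data to the target store
        pass

    with DAG(
        dag_id="daily_batch_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task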

TECHNICAL SKILLS

Hadoop/Big Data Technologies: Hadoop, Apache Spark, HDFS, MapReduce, Sqoop, Hive, Oozie, Zookeeper, Cloudera Manager, Kafka, Flume

Programming & Scripting: Python, Scala, SQL, Shell Scripting, R

Databases: MySQL, Oracle, PostgreSQL, MS SQL Server, Teradata

NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB, Cosmos DB

Hadoop Distributions: Hortonworks, Cloudera

Version Control: Git, Bitbucket, SVN

Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7

Cloud Computing: AWS, Azure, GCP

PROFESSIONAL EXPERIENCE

Confidential, Boston, MA

Sr. Data Engineer

Responsibilities:

  • Importing data from different sources like customer databases, MySQL, MongoDB and SFTP folders, converting raw data into a structured format and analyzing different customer patterns/trends
  • Utilizing Python/API calls to import data from databases like MySQL, PostgreSQL and MongoDB and from different web applications
  • Importing data from various data sources with Apache Airflow, performing data transformations using Hive and MySQL, and loading data into HDFS
  • Parsing data from S3 via Python API calls through Amazon API Gateway, generating a batch source for processing
  • Extracting real-time feeds using Kafka and Spark Streaming, converting them to RDDs, processing the data as DataFrames and saving it in Parquet format in HDFS (sketched after this list)
  • Using Spark, performing various transformations/actions and saving the final result data back to HDFS and from there to the target Snowflake database
  • Migrating data from external sources like S3, using text/Parquet file formats to load into AWS Redshift, and designing and developing ETL processes in AWS Glue
  • Developing code in PySpark to transform raw data into a structured format in jobs on EMR
  • Utilizing Apache Spark to write code in PySpark and Spark SQL to process data on Amazon EMR and perform the necessary transformations
  • Processing data with stateless and stateful transformations in Spark Streaming programs to handle near real-time data from Kafka
  • Building a structured data model with Elasticsearch using Python/Spark and developing ETL pipelines for further analysis
  • Building ETL pipelines to automate data transformation, increasing the efficiency, scalability and reusability of data
  • Receiving events from S3 buckets by creating and configuring Lambda deployment functions
  • Writing AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large datasets in EMR clusters
  • Deploying AWS Lambda code from Amazon S3 buckets and implementing a 'serverless' architecture using API Gateway, Lambda, and DynamoDB
  • Loaded data into S3 buckets using AWS Lambda functions, AWS Glue and PySpark; filtered data stored in S3 using Elasticsearch and loaded it into Hive external tables; maintained and operated a Hadoop cluster on AWS EMR
  • Created Databricks notebooks to streamline and curate data for various business use cases and mounted S3 on Databricks hosted on AWS
  • Developing Airflow Workflow to schedule batch and real-time data from source to target
  • Monitoring resources and applications using AWS CloudWatch, including creating alarms to monitor metrics for EBS, EC2, ELB, RDS, S3 and SNS, and configuring notifications for the alarms generated based on defined events
  • Working on Apache Airflow for data Ingestion, Hive & Spark for data processing & Oozie for designing complex workflows in Hadoop framework
  • Developed and ran UNIX shell scripts and implemented an automated deployment process
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production
  • Generating the necessary reports in Tableau by creating the workflow model with the data lake in Hadoop
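
A minimal sketch of the Kafka-to-Parquet flow described above, shown here with Spark Structured Streaming rather than the DStream API; the broker address, topic name and HDFS paths are hypothetical, and the spark-sql-kafka connector is assumed to be available on the cluster:

    # Sketch: read a Kafka topic and persist it to HDFS as Parquet (Structured Streaming)
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka_to_parquet").getOrCreate()

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
        .option("subscribe", "customer-events")              # hypothetical topic
        .load()
        .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/events/parquet")              # hypothetical output path
        .option("checkpointLocation", "hdfs:///checkpoints/events")
        .start()
    )
    query.awaitTermination()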

Environment: AWS (EC2, S3, EBS, ELB, RDS, CloudWatch), Cassandra, PySpark, Apache Spark, HBase, Apache Kafka, Hive, Python, Spark Streaming, Machine Learning, Snowflake, Oozie, Tableau, Power BI, NoSQL, PostgreSQL, Shell Scripting, Scala

Confidential, New York, NY

Data Engineer

Responsibilities:

  • Collected data from various sources like a customer transaction database, MongoDB, Azure Blob Storage and MS SQL Server, converting the data into an analysis-ready format to find fraudulent transactions and customer churn
  • Worked on Python scripts to import data from sources like MS SQL Server, SQLite and Oracle DB
  • Imported and exported data between different sources and HDFS for further processing using Apache Sqoop
  • Developed Python scripts to do file validations in Databricks and automated the process using ADF (a validation sketch follows this list)
  • Implemented data pipelines in Azure Data Factory to extract, transform and load data from multiple sources like Azure SQL, Blob storage and Azure SQL Data Warehouse
  • Deployed the data pipeline in Azure Data Factory using JSON scripts to process the data
  • Extensively used Azure services like Azure Data Factory and Logic Apps for ETL, to push data in/out between databases, Blob storage, HDInsight HDFS and Hive tables
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
  • Extracted and loaded data into the Data Lake with ETL jobs and developed shell scripts to add dynamic partitions to Hive staging tables
  • Used notebooks, Spark DataFrames, Spark SQL and Python scripting to build ETL pipelines in Databricks
  • Developed and maintained the data pipeline on Azure Analytics platform using Azure Databricks, PySpark, and Python
  • Involved in building an enterprise Data Lake using Data Factory and Blob storage, enabling different teams to work on more complex scenarios and ML solutions
  • Worked on Azure Functions, writing Python code to perform different transformation tasks, and performed analytics in HDInsight on large volumes of data
  • Automated the loading of data into Blob Storage with Data Factory and PySpark, extracted the required data from Blob Storage using HDInsight, and loaded it into HDFS
  • Developed an automated process in the Azure cloud to ingest data daily from a web service and load it into Azure SQL DB
  • Analyzed and transformed data from multiple file formats to uncover insights, performing data transformation and aggregation in Spark applications using PySpark and Spark SQL
  • Monitored resources using Azure Automation and created alerts for VMs, Blob Storage, ADF, Databricks and Synapse Analytics based on different events
  • Wrote complex SQL queries using stored procedures, common table expressions (CTEs) and temporary tables to support Power BI reports
  • Worked in a SAFe (Scaled Agile Framework) team with daily standups and monthly/quarterly planning
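
A minimal sketch of the Databricks file-validation pattern mentioned in the bullets above, assuming the Blob container is already mounted; the mount path, expected columns and output location are hypothetical:

    # Sketch: validate an incoming file on a mounted Blob path before curating it
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("blob_file_validation").getOrCreate()  # available as `spark` in Databricks

    df = spark.read.option("header", True).csv("/mnt/raw/daily/customers.csv")  # hypothetical mount path

    expected_columns = {"customer_id", "txn_amount", "txn_date"}                # hypothetical schema
    missing = expected_columns - set(df.columns)

    if missing:
        raise ValueError(f"Validation failed, missing columns: {missing}")
    if df.count() == 0:
        raise ValueError("Validation failed, file is empty")

    df.write.mode("overwrite").parquet("/mnt/curated/customers/")               # hypothetical target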

Environment: Azure Cloud, Data Factory (ADF v2), Data Lake, Blob Storage, SQL Server, Teradata Utilities, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modelling Tool, Cosmos DB, Stream Analytics, Event Hub

Confidential, MA

Data Analytics Engineer

Responsibilities:

  • Chaired a team of 2 people for telecom inventory maintenance of 5+ customers using SQL Server and achieved historic annual savings of 8% with service providers AT&T, Verizon, Granite Telecom and CenturyLink
  • Collaborated with cross functional operations teams to gather, organize, customize and analyze product related data to maintain inventory of customers worth around $10M- $40M
  • Used SQL on data sets to serve ad-hoc data requests and created report metrics which accelerated reporting by 3%
  • Created Tableau dashboards showing $96K in savings credited back to the client, which resulted in more business
  • Developed SQL queries that improved data-fetch efficiency by 8% and analyzed expenses based on requirements
  • Wrote stored procedures and triggers to automate database tasks and speed up data retrieval
  • Communicated the progress of ongoing projects to CSMs and clients on weekly calls
  • Worked on ServiceNow tickets on a daily basis to resolve inventory-related issues and coordinated with audit, invoice and IT teams to update the charges
  • Developed the interactive Tableau dashboards with calculated fields, parameters, sets as per client’s requirements and presented on calls
  • Prepared technical specifications to develop ETL mappings to load data into various tables conforming to the business rules
  • Designed and implemented data profiling and data quality improvement solution to analyze, match, cleanse, and consolidate data before loading into data warehouse
  • Utilized Power BI functions and pivot tables to further analyze the given complex data
  • Responsible for designing, development of SSIS Packages using various control flow tasks and dataflow tasks for performing ETL operations
  • Extracted data from CRM systems to Staging Area and loaded the data to the target database by ETL process using Informatica Power Center
  • Analyzed the data by performing Hive queries and running Pig scripts to know user behavior
  • Developed Python programs and batch scripts on Windows and Linux environments to automate ETL processes to AWS Redshift (a load sketch follows this list)
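
A minimal sketch of automating a Redshift load from Python, as in the ETL automation bullet above; the cluster endpoint, schema, table, S3 path and IAM role are hypothetical placeholders:

    # Sketch: trigger a server-side Redshift COPY from S3 using psycopg2
    import os
    import psycopg2

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
        port=5439,
        dbname="analytics",
        user="etl_user",
        password=os.environ.get("REDSHIFT_PASSWORD"),
    )

    copy_sql = """
        COPY inventory.charges
        FROM 's3://example-bucket/exports/charges.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
        CSV IGNOREHEADER 1;
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # Redshift pulls the file directly from S3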

Environment: MS SQL Server, Tableau, Power BI, MS Office, SSRS, SSIS, SAS, PL/SQL

Confidential

Junior Data Engineer

Responsibilities:

  • Imported data from sources like different logs, APIs and databases like MySQL, Neo4j, Cassandra and DynamoDB to understand customer behavior and build the necessary reports per client requirements
  • Worked on Python/Scala APIs to get data from DynamoDB, MySQL and Cassandra and loaded it into Databricks for further analysis
  • Created workflows with Sqoop to import data from databases and wrote Oozie workflows to perform the necessary ETL operations
  • Loaded Python scripts into AWS S3, DynamoDB and Snowflake to read CSV, JSON and Parquet files from S3 buckets
  • Extracting data with Apache Kafka and using Scala to convert the distributed collection of data into named columns, developing predictive analytics using Apache Spark Scala APIs
  • Importing data from S3 into AWS Glue to perform ETL operations and store the data in Redshift
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries and writing back into S3 buckets
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD and PySpark concepts; experience in pulling data from Salesforce, ingesting it into Redshift and saving the data in Amazon S3 buckets
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests via Amazon API Gateway (a handler sketch follows this list)
  • Loaded data back into S3 using AWS Lambda Functions, AWS Glue and PySpark
  • Used the Spark engine and Spark SQL for data analysis, provided results to data scientists for further analysis, and wrote Spark applications for data validation, cleansing, transformation and custom aggregation
  • Collaborated with product business owners to understand business needs, automated business processes, and delivered data storytelling in Tableau
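
A minimal sketch of a Python Lambda handler responding to S3 events, per the Lambda bullet above; the per-object processing shown here is illustrative only:

    # Sketch: S3-triggered Lambda that inspects each newly arrived object
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # each record describes one object that landed in the bucket
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)
            print(f"Received s3://{bucket}/{key} ({obj['ContentLength']} bytes)")
        return {"status": "ok"}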

Environment: AWS (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), Apache Spark, SQL, Scala, Snowflake, Python

Confidential

ETL Developer

Responsibilities:

  • Designed and built a terabyte-scale, full end-to-end data warehouse infrastructure from the ground up on Redshift for large-scale data, handling millions of records
  • Developed workflows in Oozie for business requirements to extract the data using Sqoop
  • For the data exploration stage, used Hive to get important insights about the processed data from HDFS
  • Worked on Big data on AWS cloud services i.e., EC2, S3, EMR and DynamoDB
  • Expert knowledge of Hive SQL, Presto SQL and Spark SQL for ETL jobs, using the right technology to get the job done
  • Responsible for ETL and data validation using SQL Server Integration Services
  • Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift
  • Involved in data extraction from Oracle and flat files using SQL*Loader; designed and developed mappings using Informatica
  • Developed PL/SQL procedures/packages to kick off SQL*Loader control files/procedures to load the data into Oracle
  • Built and maintained complex SQL queries for data analysis, data mining and data manipulation
  • Developed Matrix and tabular reports by grouping and sorting rows
  • Actively participated in weekly meetings with the technical teams to review the code
  • Participated in the requirement gathering and analysis phase of the project, documenting the business requirements by conducting workshops/meetings with various business users
  • Built machine learning algorithms like linear regression, decision tree and random forest for continuous variable problems, and estimated the algorithms' performance on time series data
  • Analyzed large data sets using pandas to identify different trends/patterns about data
  • Utilized regression models using SciPy to predict future data and visualized them (see the sketch after this list)
  • Managed large datasets using Pandas data frames and MySQL for analysis purposes
  • Developed schemas to handle reporting requirements using Tableau
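
A minimal sketch of the pandas/SciPy regression analysis mentioned above; the input file and column names are hypothetical:

    # Sketch: fit a simple linear trend to a time-indexed series with SciPy
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("monthly_usage.csv")  # hypothetical file with month_index, usage columns

    slope, intercept, r_value, p_value, std_err = stats.linregress(
        df["month_index"], df["usage"]
    )

    df["predicted_usage"] = intercept + slope * df["month_index"]
    print(f"trend: {slope:.2f} units/month, R^2 = {r_value ** 2:.3f}")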

Environment: Python, Hadoop, MapReduce, HiveQL, Hive, HBase, Sqoop, Cassandra, Flume, Tableau, Impala, Oozie, MySQL, Oracle SQL, Pig Latin, AWS, NumPy, SciPy
