
Sr. Data Engineer Resume


SUMMARY

  • Qualified professional with over 7 years of experience in the IT industry, including Big Data and Hadoop environments on premises and in the cloud, hosting cloud-based data warehouses and databases using Redshift, Cassandra, and RDBMS sources.
  • Adept at writing Python scripts to parse XML documents and load the data into databases.
  • Extensive experience implementing solutions using AWS services (EC2, S3, and Redshift), the Hadoop HDFS architecture, and the MapReduce framework.
  • Worked in AWS environments to develop and deploy custom Hadoop applications.
  • Strong understanding of developing MapReduce programs using Apache Hadoop for working with Big Data.
  • Hands-on experience in big data analysis using Pig and Hive.
  • In-depth knowledge of Cloudera distributions (CDH 4/CDH 5) and experience working with the Hortonworks and Amazon EMR Hadoop distributions.
  • Developed PySpark scripts to process streaming data from data lakes using Spark Streaming.
  • Experienced in creating Spark jobs that run in EMR clusters using EMR Notebooks.
  • Hands-on experience with Python Boto3 for developing Lambda functions in AWS (see the Boto3 Lambda sketch after this summary).
  • Performed analysis using Python libraries such as PySpark.
  • Hands-on expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala and Python.
  • Strong experience working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
  • Involved in the development of a data processing framework (flow master and DCT) using PySpark.
  • Hands-on experience installing, configuring, supporting, and managing Hadoop clusters using Apache, Cloudera (CDH3, CDH4), and YARN distributions (CDH 5.x).
  • Experience working with big data and real-time/near-real-time analytics on big data platforms such as Hadoop and Spark using Python.
  • Familiar with data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, machine learning, and advanced data processing; experienced in optimizing ETL workflows.
  • Solid programming knowledge of Python and shell scripting.
  • Experience migrating on-premises workloads to Windows Azure for disaster recovery using Azure Recovery Services vaults and Azure Backup.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Strong understanding of data modeling and experience with data cleansing, data profiling, and data analysis.
  • Good experience with SDLC (Software Development Life Cycle) methodologies such as Agile and Scrum.
  • Experience in all phases of the SDLC, including requirement analysis, implementation, and maintenance.
  • Prepared technical reports by collecting, analyzing, and summarizing information and trends.
  • Expert in writing business analytics scripts using HiveQL.
  • Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and applied DataFrame operations to carry out the required validations on the data (see the DataFrame validation sketch after this summary).
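Illustrative sketch for the Boto3 Lambda bullet above: a minimal AWS Lambda handler, written with Boto3, that reacts to S3 object-created events and copies new files under a processed/ prefix. The bucket layout, prefix, and handler wiring are hypothetical examples, not details from this resume.

# Hypothetical sketch: an AWS Lambda handler using Boto3 that reacts to
# S3 "ObjectCreated" events and copies new files under a processed/ prefix.
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Copy the newly landed object under a processed/ prefix so downstream
        # jobs (for example EMR or Glue) read from a stable location.
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=f"processed/{key}",
        )
    return {"status": "ok", "records": len(records)}

In practice a handler like this would be wired to S3 event notifications or invoked from Step Functions.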
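Illustrative sketch for the DataFrame validation bullet above: a minimal PySpark job, assuming a Hive-enabled Spark session and an existing staging table, that applies simple validations and writes the valid rows to a curated table. The database, table, and column names are hypothetical.

# Hypothetical sketch: validate a Hive table with DataFrame operations and
# persist the valid rows to a curated table.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("hive-dataframe-validation")
    .enableHiveSupport()
    .getOrCreate()
)

orders = spark.table("staging.orders")

# Basic validations: non-null keys, positive amounts, parsable dates.
valid = (
    orders
    .filter(F.col("order_id").isNotNull())
    .filter(F.col("amount") > 0)
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .filter(F.col("order_date").isNotNull())
)

rejected_count = orders.count() - valid.count()
print(f"rejected rows: {rejected_count}")

spark.sql("CREATE DATABASE IF NOT EXISTS curated")
valid.write.mode("overwrite").saveAsTable("curated.orders")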

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Spark, Airflow, HBase, Pig, Zookeeper, Hive

Cloud Environment: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

Hadoop Distributions & Cloud Services: Apache Hadoop 2.x/1.x, AWS (EMR, S3, EC2, RDS, Athena, SQS, DynamoDB, Redshift, CloudWatch, Kinesis), Microsoft Azure (Databricks, Azure Data Factory, SQL Database, SQL Data Warehouse, Data Lake, Azure Active Directory)

Scripting Languages: Python, Scala, R, PowerShell Scripting, Pig Latin, HiveQL

Databases: MySQL, Teradata, DynamoDB, Snowflake, Redshift (spectrum tables, materialized views), HBase

PROFESSIONAL EXPERIENCE

Confidential

Sr. Data Engineer

Responsibilities:

  • Developed various data-loading strategies and performed transformations to analyze datasets using the Hortonworks Distribution of the Hadoop ecosystem.
  • Implemented solutions for ingesting data from various sources utilizing Big Data technologies such as Hadoop, the MapReduce framework, and Hive.
  • Developed PySpark applications using Spark SQL and DataFrame transformations through the Python APIs to apply business rules to Hive staging tables and load the final transformed data into Hive master tables (see the staging-to-master sketch after this section).
  • Worked as a Hadoop consultant on technologies such as MapReduce, Pig, and Hive.
  • Ingested large volumes of credit data from multiple provider data sources into AWS S3 and created modular, independent components for S3 connections and data reads.
  • Implemented data warehouse solutions in AWS Redshift by migrating data from S3 to Redshift.
  • Developed Spark code using Python to run in EMR clusters.
  • Created user-defined functions (UDFs) in Scala to encapsulate business logic in the applications.
  • Automated jobs and data pipelines using AWS Step Functions and AWS Lambda, and configured performance metrics using AWS CloudWatch (see the Step Functions/CloudWatch sketch after this section).
  • Worked with Apache Hadoop ecosystem components such as HDFS, Hive, Pig, and MapReduce.
  • Designed AWS Glue pipelines to ingest, process, and store data, interacting with different AWS services.
  • Used Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Developed a process to migrate local logs to CloudWatch for better integration and monitoring.
  • Executed programs through the Python API for Apache Spark (PySpark).
  • Helped DevOps engineers deploy code and debug issues.
  • Wrote Hadoop jobs to analyze data in text, sequence, and Parquet file formats using Hive and Pig.
  • Analyzed the Hadoop cluster and various Big Data components, including Pig, Hive, Spark, and Impala.
  • Populated database tables via AWS Kinesis Firehose and AWS Redshift.
  • Developed Spark code using Python and Spark SQL for faster testing and data processing.
  • Created Hive external tables, loaded data into them, and queried the data using HQL.
  • Developed ETL modules and data workflows for solution accelerators using PySpark and Spark SQL.
  • Used Spark SQL to process large volumes of structured data.
  • Extracted data from MySQL and AWS Redshift into HDFS using Kinesis.
  • Developed a PySpark application to create reporting tables with different masking rules in both Hive and MySQL and made them available to newly built fetch APIs.
  • Wrote Spark code in Scala for data extraction, transformation, and aggregation across multiple file formats.

Environment: Big Data, Spark, Hive, Pig, Python, Hadoop, AWS, Databases, AWS Redshift, Agile, SQL, HQL, Impala, CloudWatch, AWS Kinesis
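Illustrative sketch for the staging-to-master bullet above: a minimal PySpark/Spark SQL flow, assuming a Hive-enabled session, that defines an external staging table over files in S3 and loads validated rows into a partitioned master table. The bucket path, schemas, and table names are hypothetical.

# Hypothetical sketch: Hive staging table over S3 files, loaded into a
# partitioned Hive master table after applying simple business rules.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-staging-to-master")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS staging")
spark.sql("CREATE DATABASE IF NOT EXISTS master")

# External staging table over raw Parquet files landed in S3.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.credit_events (
        customer_id STRING,
        event_type  STRING,
        amount      DOUBLE,
        event_ts    TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://example-bucket/landing/credit_events/'
""")

# Partitioned master table that downstream reporting reads from.
spark.sql("""
    CREATE TABLE IF NOT EXISTS master.credit_events (
        customer_id STRING,
        event_type  STRING,
        amount      DOUBLE,
        event_ts    TIMESTAMP
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
""")

# Allow dynamic partition inserts, then apply the business rules and load.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE master.credit_events PARTITION (event_date)
    SELECT customer_id,
           event_type,
           ROUND(amount, 2) AS amount,
           event_ts,
           TO_DATE(event_ts) AS event_date
    FROM staging.credit_events
    WHERE customer_id IS NOT NULL AND amount > 0
""")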
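Illustrative sketch for the Step Functions/CloudWatch bullet above: a small Boto3 script that starts a Step Functions execution and publishes a custom CloudWatch metric for it. The state machine ARN, namespace, and metric name are hypothetical.

# Hypothetical sketch: trigger a Step Functions pipeline and record a
# custom CloudWatch metric so runs can be graphed and alerted on.
import json

import boto3

sfn = boto3.client("stepfunctions")
cloudwatch = boto3.client("cloudwatch")

def trigger_pipeline(run_date: str) -> str:
    """Kick off the data-pipeline state machine for a given run date."""
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:daily-etl",
        input=json.dumps({"run_date": run_date}),
    )

    # Publish a simple "pipeline triggered" counter metric.
    cloudwatch.put_metric_data(
        Namespace="DataPipelines",
        MetricData=[{
            "MetricName": "PipelineTriggered",
            "Dimensions": [{"Name": "Pipeline", "Value": "daily-etl"}],
            "Value": 1.0,
            "Unit": "Count",
        }],
    )
    return response["executionArn"]

if __name__ == "__main__":
    print(trigger_pipeline("2023-01-01"))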

Confidential

Cloud Data Engineer

Responsibilities:

  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and U-SQL Azure Data Lake Analytics.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Worked with Azure IaaS and PaaS services and with storage such as Blob (page and block blobs) and SQL Azure.
  • Implemented OLAP multi-dimensional functionality using Azure SQL Data Warehouse.
  • Retrieved data using Azure SQL and used Azure ML to build, test, and run predictions on the data.
  • Worked on cloud databases such as Azure SQL Database, SQL Managed Instance, SQL Elastic Pool on Azure, and SQL Server.
  • Architected and implemented medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Responsible for estimating cluster size, monitoring, and troubleshooting the Spark Databricks cluster.
  • Designed and developed Azure Data Factory pipelines to extract, load, and transform data from different source systems (mainframe, SQL Server, IBM DB2, shared drives, etc.) to Azure data storage services using a combination of Azure Data Factory, Azure Databricks (PySpark, Spark SQL), Azure Stream Analytics, and U-SQL Azure Data Lake Analytics; ingested data into various Azure storage services such as Azure Data Lake, Azure Blob Storage, and Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) (see the Databricks sketch after this section).
  • Configured and deployed Azure Automation scripts for a multitude of applications using the Azure stack (Compute, Web & Mobile, Blobs, ADF, Resource Groups, Azure Data Lake, HDInsight clusters, Azure Data Factory, Azure SQL, Cloud Services, and ARM), with services and utilities focused on automation.
  • Involved in migrating objects from Teradata to Snowflake and created Snowpipe for continuous data loading.
  • Increased consumption of solutions including Azure SQL Database, Azure Cosmos DB, and Azure SQL.
  • Created a continuous integration and continuous delivery (CI/CD) pipeline on Azure to automate steps in the software delivery process.
  • Deployed and managed applications in the datacenter, virtual environments, and the Azure platform.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
  • Processed and analyzed log data stored in HBase and imported it into the Hive warehouse, enabling business analysts to write HQL queries.
  • Handled importing data from various data sources, performed transformations using Hive, and loaded data into HDFS.
  • Designed, developed, and implemented performant ETL pipelines using PySpark and Azure Data Factory.

Environment: Azure Data Factory (V2), Azure Databricks (PySpark, Spark SQL), Azure Data Lake, Azure Blob Storage, Azure ML, Azure SQL, Hive, Git, GitHub, JIRA, HQL, Snowflake, Teradata
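Illustrative sketch for the Databricks bullet above: a minimal PySpark step of the kind an ADF pipeline might run on Azure Databricks, reading raw CSV files from ADLS Gen2, cleaning them, and saving a curated table. The storage account, container, and table names are hypothetical, and the cluster is assumed to already have access to the storage account.

# Hypothetical sketch: read raw files from ADLS Gen2, clean them, and save a
# curated table for downstream reporting.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls-to-curated").getOrCreate()

raw_path = "abfss://landing@examplestorage.dfs.core.windows.net/sales/"

sales = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

curated = (
    sales
    .dropDuplicates(["sale_id"])
    .withColumn("sale_date", F.to_date("sale_date"))
    .filter(F.col("amount").isNotNull())
)

# On Databricks this lands as a managed table that reporting tools can query.
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
curated.write.mode("overwrite").saveAsTable("curated.sales")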

Confidential

Data Engineer

Responsibilities:

  • Worked on AWS CloudFormation templates and used Terraform with existing plugins.
  • Used AWS CloudFormation to ensure successful database deployments.
  • Configured Sqoop to import/export data between databases and HDFS and the data lake on AWS.
  • Implemented Spark on EMR for processing Big Data across our data lake in AWS.
  • Developed Docker images to support the development and testing teams and their pipelines; distributed images such as Jenkins, Selenium, JMeter, and Elasticsearch, Logstash, and Kibana (ELK); and handled containerized deployment using Kubernetes.
  • Installed, configured, and managed the ELK stack (Elasticsearch, Logstash, and Kibana) for log management on AWS EC2 behind an Elastic Load Balancer, with cloud automation through the Ansible configuration management system.
  • Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS (see the streaming sketch after this section).
  • Configured AWS IAM and security groups per requirements and distributed them across the availability zones of the VPC.
  • Implemented a Spark Streaming consumer job to consume data in near real time from AWS Kinesis and sink it to S3 for downstream systems.
  • Involved in deploying Spark and Hive applications on the AWS stack.
  • Worked on architecting a serverless design using AWS APIs, Lambda, S3, and DynamoDB, optimized with auto scaling for performance.
  • Populated database tables via AWS Kinesis Firehose and AWS Redshift.
  • Led several critical on-prem data migrations to the AWS cloud, assisted in performance tuning, and provided a successful path to Redshift clusters and AWS RDS DB engines.
  • Set up Scala scripts to create snapshots of AWS S3 buckets and delete old snapshots.
  • Worked on AWS S3 bucket integration for application and development projects.
  • Used Agile Scrum methodology to help manage and organize a team of 4 developers with regular code review sessions.
  • Managed and reviewed Hadoop log files in AWS S3.

Environment: AWS, Spark, YARN, Hive, Flume, Pig, Python, Hadoop, Databases, Redshift
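Illustrative sketch for the streaming bullet above: a minimal Spark Structured Streaming job, assuming S3 access is configured on the cluster, that watches an S3 prefix for new JSON files, computes windowed aggregates, and persists the results to HDFS. The paths, schema, and window sizes are hypothetical.

# Hypothetical sketch: near-real-time ingestion from S3 with Structured
# Streaming, windowed aggregation, and persistence to HDFS as Parquet.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("s3-stream-to-hdfs").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream
    .schema(schema)
    .json("s3a://example-bucket/incoming/events/")
)

# Rolling 5-minute totals per event type; the watermark bounds late data.
totals = (
    events
    .withWatermark("event_ts", "10 minutes")
    .groupBy(F.window("event_ts", "5 minutes"), "event_type")
    .agg(F.sum("amount").alias("total_amount"))
)

query = (
    totals.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "hdfs:///data/events/aggregated/")
    .option("checkpointLocation", "hdfs:///checkpoints/events/")
    .start()
)
query.awaitTermination()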

Confidential

Hadoop Developer

Responsibilities:

  • Installed and configured Flume, Hive, Pig, and Oozie on the Hadoop cluster.
  • Collected and aggregated large amounts of web log data from different sources such as web servers and mobile and network devices using Apache Flume, and stored the data in HDFS for analysis.
  • Used Spark SQL to handle structured data in Hive.
  • Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Teradata into HDFS using Sqoop.
  • Wrote and implemented Teradata FastLoad, MultiLoad, and BTEQ scripts, along with DML and DDL.
  • Involved in migrating tables from RDBMS sources into Hive using Sqoop and later generating visualizations using Tableau.
  • Analyzed substantial datasets by running Hive queries and Pig scripts.
  • Created partitions and buckets based on state for further processing using bucket-based Hive joins (see the bucketing sketch after this section).
  • Involved in transforming data from mainframe tables to HDFS and HBase tables using Sqoop.
  • Defined Accumulo tables and loaded data into them for near-real-time data reports.
  • Created Hive external tables using the Accumulo connector.
  • Spun up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
  • Imported data from different sources such as AWS S3 and the local file system into Spark RDDs.
  • Involved in creating shell scripts to simplify the execution of other scripts (Pig, Hive, Impala, and MapReduce) and to move data into and out of HDFS.
  • Created files and tuned SQL queries in Hive using Hue.
  • Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Designed Power BI data visualizations using cross tabs, maps, scatter plots, and pie, bar, and density charts.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data, and managed data from different sources.
  • Worked with NoSQL databases such as HBase, creating tables to load large sets of semi-structured data.
  • Designed the ETL process and created the high-level design document, including logical data flows, the source data extraction process, database staging, job scheduling, and error handling.
  • Created ETL mappings with Talend Integration Suite to pull data from sources, apply transformations, and load data into the target database.

Environment: Hadoop, Cloudera, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, HBase, Apache Spark, Accumulo, Oozie Scheduler, Kerberos, AWS, Tableau, Talend, Hue, HCatalog, Flume, Git, Maven.
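Illustrative sketch for the bucketing bullet above, shown here as a PySpark analogue of the Hive partitioning/bucketing technique: write a table partitioned by state and bucketed by customer_id so joins on that key can avoid a full shuffle. The table names, bucket count, and assumed orders_bucketed table are hypothetical.

# Hypothetical sketch: partitioned, bucketed table plus a join on the bucket key.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-bucket-example")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
customers = spark.table("staging.customers")

# Partition by state, bucket and sort by customer_id within each partition.
(
    customers.write
    .partitionBy("state")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("analytics.customers_bucketed")
)

# Joining against another table bucketed on the same key and bucket count
# lets the planner skip shuffling both sides.
orders = spark.table("analytics.orders_bucketed")
per_state = (
    spark.table("analytics.customers_bucketed")
    .join(orders, "customer_id")
    .groupBy("state")
    .count()
)
per_state.show()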

Confidential

Data Analyst

Responsibilities:

  • Involved in the complete life cycle of the project, performing tasks such as data understanding through exploratory data analysis, data cleansing, and data modeling to create visual representations, statistics, and marketing insights from several sources with the business, as well as evaluating and deploying with deep knowledge of consumer and marketing analysis.
  • Worked closely with business requirements, converting them into technical requirements, and worked with data owners and stewards to gather all data requirements for analysis, managing reports with Excel pivot tables and presentation charts with applied knowledge of HEDIS metrics.
  • Developed, analyzed, reported, and interpreted complex data for ongoing activities/projects while ensuring that all data were accurate.
  • Worked with data governance tools and extract-transform-load (ETL) processing tools for data mining, data warehousing, and data cleaning using SQL.
  • Used data to identify trends, needs, and opportunities, and prepared reports, visualizations, and recommendations to help the district determine what was working and what needed to change.
  • Integrated Word, Excel, and PowerPoint to make business communication more effective by organizing scattered information in one place for easy access and analysis.
  • Optimized SQL performance, integrity, and security of the project’s databases/schemas.
  • Performed data cleaning, pre-processing, and manipulation using Python (see the data cleaning sketch after this section).
  • Developed intricate algorithms based on deep-dive statistical analysis and predictive and time-series data modeling.
  • Devised effective metrics, KPIs, and visualizations, and built dashboards in Tableau Desktop for daily, weekly, and monthly summary, trending, and benchmark reports to management.
  • Good working knowledge of Michigan reporting requirements, including Registry of Educational Personnel (REP) reporting.
  • Performed statistical analysis and data mining, using software to look for patterns in large batches of data to learn more about customers.

Environment: Tableau (Power BI), R, Python, SQL, PL/SQL, PostgreSQL, Excel, Spark, Databricks (Jupyter Notebooks), Redshift, ETL, AWS, NiFi, Netezza, Hadoop, Hive, Quantitative Analysis
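Illustrative sketch for the Python data cleaning bullet above: a small pandas script covering typical cleaning and pre-processing steps. The file name and column names are hypothetical.

# Hypothetical sketch: data cleaning and pre-processing with pandas.
import pandas as pd

df = pd.read_csv("survey_results.csv")

# Normalize column names and drop exact duplicate rows.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.drop_duplicates()

# Basic cleaning: parse dates, coerce numeric fields, fill or drop missing values.
df["response_date"] = pd.to_datetime(df["response_date"], errors="coerce")
df["score"] = pd.to_numeric(df["score"], errors="coerce")
df = df.dropna(subset=["respondent_id", "response_date"])
df["score"] = df["score"].fillna(df["score"].median())

# Simple derived field used in downstream reporting.
df["response_month"] = df["response_date"].dt.to_period("M").astype(str)

df.to_csv("survey_results_clean.csv", index=False)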

Confidential

Data Analyst

Responsibilities:

  • Responded to and actioned incoming metrics and reporting requests from senior management and the global team.
  • Provided management support by identifying trends and developing strategies to assist management in decision-making processes.
  • Discussed workforce management data/reports and documented findings and suggested improvements using knowledge of relational databases.
  • Provided timely and accurate status reports on projects in progress.
  • Built spreadsheets to analyze business requirements.
  • Used the Tableau business intelligence tool to generate reports and dashboard overviews for trend analysis.
  • Extracted and manipulated data from various reporting environments and produced regular reports in the agreed format.
  • Wrote queries to retrieve and analyze data from various sources for projects, programs, or reports.
  • Worked with the team to evaluate data reporting enhancements and query optimizations to increase report performance.
  • Created SQL views, triggers, and stored procedures, and handled defect management for ETL and BI systems.
  • Developed and published reports via Excel and PowerPoint to provide operational and analytical insights to management.
  • Modified existing data visualizations and made adjustments per requirements.
  • Liaised with stakeholders to understand data requirements and developed tools and models such as segmentations, dashboards, data visualizations, decision aids, and business case analyses to support the organization.
  • Developed and supported analytic solutions aimed at improving outcomes, efficiencies, costs, and patient experience.
  • Ensured design quality by creating, conducting, and documenting testing.
  • Identified technical roadblocks and troubleshot and resolved functional and performance-related issues.

Environment: Jira, SharePoint, Confluence, Tableau, Oracle, Microsoft Excel, SQL Server, Power BI
