Sr. Data Engineer Resume
OH
SUMMARY
- Overall 8 years of professional IT experience, including over 5 years in the big data ecosystem: ingestion, storage, querying, processing, and analysis of big data with Databricks and cloud technologies.
- Hands-on experience with Azure cloud services (PaaS & IaaS); ingested data into Azure Blob Storage and processed it using Databricks, writing Spark scripts and UDFs to perform transformations on large datasets.
- Experience working with Azure Blob and Data Lake Storage and loading data into Azure Synapse Analytics.
- Experience building and maintaining multiple Hadoop clusters of different sizes and configurations.
- Created Databricks notebooks to streamline and curate data for various business use cases, and mounted Blob storage on Databricks.
- Experience building data pipelines and processing large volumes of data using Azure Data Factory.
- Developed Python scripts to do file validations in Databricks and automated the process using ADF.
- In-depth knowledge of Hadoop and Spark; experience with data mining and stream-processing technologies (Kafka, Spark Streaming).
- Expertise in big data architectures: Hadoop distributions (Azure, Hortonworks, Cloudera), MongoDB and NoSQL stores, HDFS, and parallel processing with the MapReduce framework.
- Developed Spark-based applications to load streaming data with low latency using Kafka and PySpark.
- Extensive hands-on experience tuning Spark jobs; experienced in working with structured data using HiveQL and optimizing Hive queries.
- Hands-on experience with Hadoop and related big data technologies for storage, querying, processing, and analysis of data.
- Experience developing big data projects using open-source tools including Hadoop, Hive, HDP, Pig, Flume, Storm, and MapReduce.
- Experience installing, configuring, supporting, and managing Hadoop clusters.
- Experience writing MapReduce programs with Apache Hadoop to work with big data.
- Experience developing, supporting, and maintaining ETL (Extract, Transform, Load) processes using Talend Integration Suite.
- Experience installing, configuring, supporting, and monitoring Hadoop clusters using Apache and Cloudera distributions and AWS.
- Excellent working experience with Scrum/Agile and Waterfall project execution methodologies.
- Hands-on experience in the Hadoop ecosystem, including Spark, Kafka, HBase, Scala, Pig, Hive, Impala, Sqoop, Oozie, Flume, and Storm.
- Worked with Spark and Spark Streaming, using the core Spark API to build data pipelines.
- Experienced with scripting technologies such as Python and UNIX shell scripts.
- Good knowledge of Amazon Web Services (AWS) concepts such as EMR and EC2; successfully loaded files to HDFS from Oracle, SQL Server, Teradata, and Netezza using Sqoop.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Developed Python code for task dependencies, SLA watchers, and time sensors for each job, automating workflow management with Airflow (a minimal sketch follows this summary).
- Experience in database design, entity relationships, and database analysis; programmed SQL, PL/SQL stored procedures, packages, and triggers in Oracle.
- Experience working with different data sources such as flat files, XML files, and databases.
- Hands-on experience with Continuous Integration and Deployment (CI/CD).
- Strong communication and analytical skills; a good team player and quick learner, organized and self-motivated.
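A minimal sketch of the Airflow pattern referenced above, assuming Airflow 2.x; the DAG id, schedule, wait window, and validation callable are hypothetical placeholders, not production code.

```python
# Minimal Airflow 2.x DAG sketch: a time-delta sensor gates a daily file-validation
# task, and an SLA flags late runs. All names and timings here are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.time_delta import TimeDeltaSensor


def validate_files(**context):
    """Placeholder for file-validation logic (row counts, schema checks, etc.)."""


with DAG(
    dag_id="daily_file_validation",      # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",       # run once a day at 06:00
    catchup=False,
) as dag:
    # Stand-in for the "time sensor": wait 30 minutes into the schedule window
    # before validating, so upstream files have time to land.
    wait = TimeDeltaSensor(task_id="wait_for_window", delta=timedelta(minutes=30))

    validate = PythonOperator(
        task_id="validate_files",
        python_callable=validate_files,
        sla=timedelta(hours=1),          # the "SLA watcher": Airflow records a miss
    )

    wait >> validate
```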
TECHNICAL SKILLS
Languages: Python, SQL, PL/SQL, Scala, R, Shell scripting
Big Data Technologies: Apache Hadoop, Apache Spark, Apache Kafka, Apache Sqoop, Apache Crunch, Apache Hive, MapReduce, Oozie, Apache NiFi and Apache Pig
Reporting Tools: Power BI, Tableau, BO
Integration Tools: Jenkins, Git
Operating Systems: Mac OS, Windows XP/Vista/7
Packages & Tools: MS Office Suite (Word, Excel, PowerPoint, SharePoint, Outlook, Project), Visual Studio, Informatica
Databases: MySQL, SQL Server, Snowflake (JDBC connectivity)
NoSQL Database: HBase and MongoDB
Cloud Services: Azure (ADF, AKS, Azure Analytics, HDInsight, ADL, Synapse), AWS (S3, EMR, Glue, Redshift, Lambda, Athena)
PROFESSIONAL EXPERIENCE
Confidential, OH
Sr. Data Engineer
Responsibilities:
- Implemented large Lambda architectures using Azure data platform capabilities such as Azure Data Lake, Azure Data Factory, HDInsight, and Azure SQL Server.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Analyzed data from different sources on Hadoop by implementing Azure Data Factory, Azure Data Lake, Azure Data Lake Analytics, HDInsight, Hive, and Sqoop.
- Used notebooks, Spark DataFrames, Spark SQL, and Python scripting to build ETL pipelines in Databricks (sketched after this list).
- Developed and maintained data pipelines on the Azure analytics platform using Azure Databricks, PySpark, and Python.
- Helped build an enterprise data lake using Data Factory and Blob Storage, enabling different groups to work with more complex scenarios and ML solutions.
- Wrote Python code in Azure Functions to perform various transformation tasks, and performed analytics on large volumes of data in HDInsight.
- Automated loading data into Blob Storage with Data Factory and PySpark, then extracted the required data from Blob Storage using HDInsight and loaded it into HDFS.
- Developed an automated process in Azure that ingests data daily from a web service and loads it into Azure SQL DB.
- Analyzed and transformed data from multiple file formats to uncover insights, performing transformations and aggregations in Spark applications using PySpark and Spark SQL.
- Monitored resources using Azure Automation and created alerts for VMs, Blob Storage, ADF, Databricks, and Synapse Analytics based on different events.
- Designed end-to-end scalable architectures to solve business problems using Azure components such as HDInsight, Data Factory, Data Lake, Azure Monitor, Key Vault, Function Apps, and Event Hubs.
- Good experience tracking and logging end-to-end software application builds using Azure DevOps.
- Used Terraform scripts to deploy applications to higher environments.
- Involved in various SDLC phases: development, deployment, testing, documentation, implementation, and maintenance of application software.
- Responsible for estimating cluster size and for monitoring and troubleshooting Spark Databricks clusters.
- Worked on an Azure copy activity to load data from an on-premises SQL Server to an Azure SQL Data Warehouse.
- Worked on re-designing the existing architecture and implementing it on Azure SQL.
- Updated the EDW scripts in Informatica.
- Experience with Azure SQL Database configuration and tuning automation, vulnerability assessment, auditing, and threat detection.
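A minimal sketch of the kind of Databricks ETL cell described in this role, assuming a Blob container mounted at /mnt/raw; the paths, columns, and aggregation are hypothetical.

```python
# Databricks notebook cell sketch: read raw CSVs from mounted Blob storage, clean a
# key column with a UDF, aggregate, and write a curated Delta output. Illustrative only.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# `spark` is provided by the Databricks runtime; /mnt/raw is a mounted Blob container.
raw = (spark.read
       .option("header", "true")
       .csv("/mnt/raw/usage/*.csv"))

normalize = F.udf(lambda s: s.strip().lower() if s else None, StringType())

curated = (raw
           .withColumn("customer_id", normalize(F.col("customer_id")))
           .filter(F.col("event_date").isNotNull())
           .groupBy("customer_id", "event_date")
           .agg(F.count("*").alias("events")))

(curated.write
        .mode("overwrite")
        .format("delta")
        .save("/mnt/curated/usage_daily"))
```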
Confidential, FL
Sr. Data Engineer
Responsibilities:
- Collaborated with Business Analysts and Engineers across departments to gather business requirements and identify workable items for development.
- Selected and generated data into CSV files, stored them in AWS S3 using EC2, and then structured and stored the data in AWS Redshift.
- Hands-on experience working with AWS EMR, EC2, S3, Redshift, DynamoDB, Lambda, Athena, and Glue.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, RDDs, and memory optimization.
- Worked with the Hadoop ecosystem; implemented Spark using Scala and utilized the DataFrame and Spark SQL APIs for faster data processing.
- Subscribed to Kafka topics with the Kafka consumer client and processed events in real time using Spark.
- Developed Spark Streaming jobs to consume data from Kafka topics of different source systems and push it into S3 (sketched after this list).
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.
- Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS. Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
- Hands-on experience working with the Snowflake database.
- Loaded data to Snowflake from S3 and worked on performance tuning of Spark and Snowflake jobs.
- Moved data from S3 buckets to the Snowflake data warehouse for generating reports.
- Worked on Databricks with Delta tables. Used Python to write an event-based service on AWS Lambda that delivers real-time data to One-Lake (a data lake solution in the Cap-One enterprise).
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Boosted the performance of regression models by applying polynomial transformations and feature selection, and used those methods to select stocks.
- Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
- Utilized Agile and Scrum methodology for team and project management.
- Used Git for version control with colleagues.
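A minimal sketch of the Kafka-to-S3 flow referenced above, written with Structured Streaming as a stand-in; the broker, topic, schema, and bucket names are hypothetical.

```python
# Structured Streaming sketch: consume JSON events from a Kafka topic and land them
# in S3 as Parquet with checkpointing. Broker, topic, schema, and paths are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "source-events")              # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://landing-bucket/events/")            # target prefix
         .option("checkpointLocation", "s3a://landing-bucket/_chk/events/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```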
Confidential, PA
Application Developer/Data Engineer
Responsibilities:
- Responsible for running Spark jobs along with optimization, data validation, and automation; proficient in converting SQL queries into Spark transformations using Spark RDDs and Python.
- Configured EMR to process millions of customers' records in less time using Spark applications.
- Worked with MongoDB (a NoSQL database) to store unstructured data before processing it with HiveQL.
- Developed Spark applications utilizing PySpark and Spark SQL for data extraction, transformation, and aggregation from numerous file formats, analyzing and transforming the data to reveal insights into customer usage patterns.
- Developed Spark jobs on Databricks to perform data cleansing, data validation, and standardization, then applied transformations per the use cases on a mixture of clinical and healthcare data.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (a sketch follows this list).
- Customized Hive UDFs to produce a structured format from unstructured customer data, and loaded data from databases into HBase using Sqoop.
- Loaded data into S3 buckets, filtered the data stored in S3 using Elasticsearch, and loaded it into Hive external tables; utilized Spark's in-memory capabilities to handle large datasets on the S3 data lake.
- Used Spark Streaming to consume event-based data from Kafka and joined this dataset with existing Hive table data to generate performance indicators for an application.
- Used AWS Lambda to run scripts/code snippets in response to events occurring in CloudWatch.
- Integrated applications using Apache Tomcat servers on EC2 instances and automated data pipelines into AWS using Jenkins and Git.
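A minimal sketch of a Glue job along the lines described above; the catalog database, table, connection name, and staging bucket are hypothetical.

```python
# AWS Glue job sketch: read campaign data cataloged over S3 and write it to Redshift
# through a Glue connection, staging via S3. All names here are illustrative.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a crawled S3 table (ORC/Parquet/text) in the Glue Data Catalog.
campaigns = glue_context.create_dynamic_frame.from_catalog(
    database="marketing",              # hypothetical catalog database
    table_name="campaigns_raw",        # hypothetical crawled table
)

# Sink: Redshift via a preconfigured Glue connection; COPY stages through S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=campaigns,
    catalog_connection="redshift-conn",                            # hypothetical
    connection_options={"dbtable": "public.campaigns", "database": "analytics"},
    redshift_tmp_dir="s3://etl-temp-bucket/redshift/",             # staging area
)

job.commit()
```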
Confidential, IL
Big Data Engineer
Responsibilities:
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Used Hadoop technologies such as Spark and Hive, including the PySpark library, to create Spark DataFrames and convert them to pandas DataFrames for analysis.
- Played a key role in migrating the Hadoop cluster to Azure and defined different read/write strategies.
- Designed and built a data lake using Hadoop and its ecosystem components.
- Developed Spark and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Worked with data investigation, discovery, and mapping tools to scan every data record from many sources.
- Imported millions of structured records from relational databases using Sqoop, processed them with Spark, and stored the data in HDFS in ORC format.
- Executed multiple Spark SQL queries after forming the database to gather specific data corresponding to an image.
- Developed a prototype for big data analysis using Spark, RDDs, DataFrames, and the Hadoop ecosystem with CSV, JSON, and distributed files.
- Knowledgeable about partitioning Kafka messages and setting replication factors in a Kafka cluster; implemented reprocessing of failed Kafka messages using offset IDs.
- Reviewed Kafka cluster configurations and provided best practices for peak performance.
- Designed and implemented error-free data warehouse ETL and Hadoop integration.
- Enhanced conventional data warehouses based on the star schema, updated data models, and delivered Tableau data analytics and reporting.
- Evaluated the performance of the Databricks environment by converting complex Redshift scripts to Spark SQL as part of a new technology adoption project.
- Created Hive tables, loaded them with data, and wrote Hive queries that invoke and run MapReduce jobs in the backend.
- In-depth understanding of Hadoop architecture and its components, such as HDFS, the ApplicationMaster, NodeManager, ResourceManager, NameNode, DataNode, and MapReduce concepts.
- Developed a MapReduce framework that filters out bad and unnecessary records.
- Worked with the source team to understand the format and delimiters of the data file.
- Ran periodic MapReduce jobs to load data from Cassandra into Hadoop.
- Moved log files generated from various sources to HDFS via Flume for further processing.
- Created HBase tables to store data in variable formats coming from different legacy systems.
- Heavily involved in setting up the CI/CD pipeline using Jenkins, Maven, Nexus, and GitHub.
- Developed scripts to automate end-to-end data management and synchronization between all the clusters.
- Used Hive for transformations, event joins, and some pre-aggregations before storing the data in HDFS.
- Created Hive internal or external tables as required, defined with appropriate static and dynamic partitions for efficiency.
- Transformed the data using Hive and Pig for the BI team to perform visual analytics per client requirements.
- Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, handling structured data with Spark SQL (sketched after this list); implemented fair schedulers on the JobTracker to share cluster resources among users' MapReduce jobs.
- Experience analyzing the Cassandra database and comparing it with other open-source NoSQL databases to determine which best suits the current requirements.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Developed design documents considering all possible approaches and identifying the best of them.
- Wrote MapReduce code that takes log files as input and parses and structures them in tabular format to facilitate effective querying of the log data.
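A minimal sketch of the Spark SQL-to-Hive step referenced above; the HDFS path, schema, and table names are hypothetical placeholders.

```python
# Spark SQL sketch: load JSON logs, aggregate with SQL, and persist the result to a
# partitioned Hive table. Paths and table names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-to-hive")
         .enableHiveSupport()           # lets Spark read/write Hive metastore tables
         .getOrCreate())

logs = spark.read.json("hdfs:///data/raw/logs/*.json")   # schema inferred from JSON
logs.createOrReplaceTempView("raw_logs")

daily = spark.sql("""
    SELECT user_id, event_type, to_date(ts) AS event_date, COUNT(*) AS cnt
    FROM raw_logs
    GROUP BY user_id, event_type, to_date(ts)
""")

(daily.write
      .mode("overwrite")
      .partitionBy("event_date")        # dynamic partitions on the Hive table
      .saveAsTable("analytics.daily_log_counts"))
```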
Confidential
Azure Specialist
Responsibilities:
- Participated in all phases of development, from requirements definition and design to development, deployment, and maintenance.
- Worked directly with the Microsoft Azure lead to better understand the requirements.
- Certified in Azure Machine Learning and as an Azure sales specialist.
- Interacted with business and development teams to understand application workflows and plan/design appropriate tests for the same.
- Analyzed requirements and worked with BAs to resolve clarifications on Business Requirements Documents and Technical Requirements Documents.
- Developed relationships with customers and found and qualified new Azure opportunities.
- Implemented sales strategies regarding new business, client retention, negotiations, identification of entrepreneurial enterprises, and relationship management.
- Supported and independently completed testing activities for the application under test across the SDLC.
- Provided updates, status, and completion information to the manager.
- Coordinated with development, functional, and configuration teams as required to resolve defects.
- Participated in status meetings with vendors.
- Hands-on experience with RDBMSs and SQL queries for data extraction and validation.
Confidential
Data Analyst
Responsibilities:
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
- Recommended structural changes and enhancements to systems and databases.
- Conducted Design reviews and technical reviews with other project stakeholders.
- Was part of the complete project life cycle, from requirements through production support.
- Created test plan documents for all back-end database modules.
- Used MS Excel, MS Access, and SQL to write and run various queries.
- Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
- Worked with internal architects, assisting in the development of current- and target-state data architectures.
- Examined different datasets and identified and implemented solutions to improve the quality of all data types.
- Coordinated with business users to design new reporting in an appropriate, effective, and efficient way based on user needs and existing functionality.
- Remained knowledgeable in all areas of business operations to identify system needs and requirements.
- Supported and resolved data issues accurately and in a timely manner.
- Built and maintained reports and dashboards.
- Worked with portfolio managers and quantitative researchers to optimize the model testing and production framework.
- Ensured assigned projects were completed within budget and on schedule while meeting business objectives.
