
Senior Data Engineer Resume


SUMMARY:

  • Accomplished IT professional with 9 years of experience, specializing in the Big Data ecosystem - data acquisition, storage, analysis, integration, and data processing - as well as SQL and ETL tools.
  • A Data Science enthusiast with strong problem-solving, debugging, and analytical capabilities who actively engages in understanding and delivering business requirements.
  • Collaborated closely and regularly with business, product, production support, and engineering teams to dive deep into data, support effective decision making, and back analytics platforms.
  • Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions - Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
  • Experience with Python data science libraries such as NumPy, pandas, and scikit-learn for developing Machine Learning models.
  • Extensive working experience with the Big Data ecosystem - Hadoop (HDFS, MapReduce, YARN), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Oozie, ZooKeeper.
  • Sound experience with the AWS cloud (EMR, EC2, RDS, EBS, S3, Lambda, Glue, Athena, Redshift, ECS).
  • Working knowledge of Azure cloud components.
  • Excellent knowledge of Hadoop cluster architecture and its key concepts - Distributed file systems, Parallel processing, High availability, Fault tolerance, and Scalability.
  • Obtained and processed data from Enterprise applications, Clickstream events, API gateways, Application logs, and database updates.
  • Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data as per business requirements.
  • Expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
  • Acquired profound knowledge in developing production-ready Spark applications utilizing Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets, and Spark ML (a brief stream-processing sketch follows this summary).
  • Experienced in writing Spark scripts in Python, Scala, SQL, and HQL for development and analysis.
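
The following is a minimal, illustrative sketch of the kind of Spark Structured Streaming job referenced above, reading events from Kafka and aggregating them. The broker address, topic name, and event schema are assumptions for illustration only, and the job assumes the spark-sql-kafka connector package is available.

    # Minimal sketch of a PySpark Structured Streaming job over Kafka.
    # Broker, topic, and schema below are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    schema = StructType([
        StructField("user_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    events = (spark.readStream
              .format("kafka")                                    # requires the spark-sql-kafka package
              .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
              .option("subscribe", "transactions")                # assumed topic
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Running total per user; "complete" output mode re-emits the full aggregate each trigger.
    totals = events.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

    query = (totals.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()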

PROFESSIONAL EXPERIENCE:

Confidential

Senior Data Engineer

Responsibilities:

  • Imported data from various sources into HDFS and Hive using Sqoop, and built data platforms for analytics and advanced analytics in Azure Databricks.
  • Designed and automated custom-built input adapters using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS sources into Azure Data Lake.
  • Developed automated workflows for daily incremental loads, moving data from RDBMS systems into the Data Lake.
  • Built an Enterprise Data Lake using Data Factory and Blob storage, enabling other teams to work with more complex scenarios and ML solutions.
  • Coordinated with the Data Science team to implement advanced analytical models on the Hadoop cluster over large datasets.
  • Managed resources and scheduling across the cluster using Azure Kubernetes Service.
  • Analyzed SQL scripts and designed solutions implemented with PySpark.
  • Created tabular models on Azure Analysis Services to meet business reporting requirements.
  • Worked with Azure Blob and Data Lake storage and loaded data into Azure SQL Synapse Analytics (DW).
  • Partnered with ETL developers to ensure data was well cleaned and the data warehouse stayed up to date for reporting purposes.
  • Built the logical and physical data models for Snowflake as changes required; developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake, writing SQL queries against Snowflake; redesigned Snowflake views to improve performance.
  • Created Hive tables and wrote Hive queries for data analysis to meet business requirements; used Sqoop to import and export data from Oracle and MySQL.
  • Developed Spark scripts using Python on Azure HDInsight for data aggregation and validation, and verified their performance against MapReduce jobs.
  • Built pipelines to move hashed and un-hashed data from Azure Blob to the Data Lake, and to move data from on-premises servers to Azure Data Lake.
  • Performed advanced procedures such as text analytics and processing using Spark's in-memory computing capabilities in Python; used pandas DataFrames for data analysis.
  • Enhanced and optimized Spark scripts to aggregate, group, and run data mining tasks; loaded data into Spark RDDs and performed in-memory computations to generate output responses.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark (a minimal sketch follows this list).
  • Performed data analysis with Cassandra using Hive external tables; worked on data processing, transformations, and actions in Spark using Python (PySpark).
  • Implemented Kafka producer and consumer applications on a Kafka cluster set up with ZooKeeper.
  • Handled large datasets using partitions, Spark in-memory capabilities, and effective and efficient joins and transformations.
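
As a minimal sketch of the Hive-to-Spark conversions mentioned above, the snippet below rewrites a simple Hive aggregation as PySpark DataFrame transformations; the database, table, and column names (sales_db.orders, order_date, amount) are illustrative assumptions rather than names from the actual project.

    # Minimal sketch: a Hive aggregation re-expressed as PySpark DataFrame transformations.
    # Table and column names are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-pyspark-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Equivalent of:
    #   SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
    #   FROM sales_db.orders
    #   GROUP BY order_date;
    orders = spark.table("sales_db.orders")
    daily = (orders
             .groupBy("order_date")
             .agg(F.count("*").alias("orders"),
                  F.sum("amount").alias("revenue")))

    daily.write.mode("overwrite").saveAsTable("sales_db.daily_order_summary")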

Confidential

Senior Data Engineer

Responsibilities:

  • Built a data pipeline and performed analytics using the AWS stack (EMR, EC2, S3, RDS, Lambda, Athena, Glue, SQS, Redshift, and ECS).
  • Developed multiple POCs using PySpark and deployed machine learning models on the YARN cluster.
  • Utilized Spark's in-memory capabilities to handle large datasets on S3; loaded data into S3 buckets, then filtered and loaded it into Hive external tables.
  • Migrated Java analytical applications to Scala, using Scala where performance and logic were critical.
  • Developed batch and stream processing applications requiring functional pipelining using Spark Scala and the Streaming API.
  • Extracted and enriched multiple Cassandra tables using joins in Spark SQL; converted Hive queries into Spark transformations.
  • Developed a POC to execute machine learning models using the Spark ML library.
  • Designed and developed APIs using Spring Boot for data movement across different systems.
  • Fetched live data from an Oracle database using Spark Streaming and Kafka, fed by an API Gateway REST service.
  • Performed ETL operations using Python, Spark SQL, S3, and Redshift on terabytes of data to obtain customer insights.
  • Performed interactive analytics such as cleansing, validation, and quality checks on data stored in S3 buckets using AWS Athena.
  • Wrote Python scripts to automate ETL pipelines and DAG workflows using Airflow, including a workflow that extracts weblogs into the S3 Data Lake (a minimal DAG sketch follows the Environment line below).
  • Integrated applications using Apache Tomcat servers on EC2 instances and automated data pipelines into AWS using Jenkins.
  • Worked with Apache Hadoop big data components such as HDFS, MapReduce, YARN, Hive, HBase, Sqoop, and Pig.
  • Worked with the Data Science team running machine learning models on the Spark EMR cluster and delivered data as required by the business.
  • Facilitated data for analytical reporting and QuickSight and Tableau dashboards.
  • Used Git for version control and Jira for project management, issue tracking, and bug tracking.

Environment: AWS, EC2, S3, Athena, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Python, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, HBase, Oracle, Cassandra, MLlib, QuickSight, Tableau, Maven, Git, Jira.
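
A minimal sketch of the kind of Airflow workflow described above, moving a day's weblogs into S3. It follows the Airflow 1.x operator style listed in the environment; the bucket name, log path, schedule, and helper function are illustrative assumptions only.

    # Minimal Airflow 1.x-style DAG sketch: copy the previous day's weblog file into S3.
    # Bucket, paths, and schedule are illustrative assumptions.
    from datetime import datetime, timedelta

    import boto3
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 1,
        "retry_delay": timedelta(minutes=10),
    }

    def extract_weblogs(**context):
        """Upload the execution date's weblog file to the S3 data lake (assumed layout)."""
        ds = context["ds"]  # execution date string, e.g. "2019-06-01"
        s3 = boto3.client("s3")
        s3.upload_file(
            Filename="/var/log/webapp/access-%s.log" % ds,  # assumed local log path
            Bucket="example-weblog-datalake",               # assumed bucket name
            Key="raw/weblogs/dt=%s/access.log" % ds,
        )

    with DAG(
        dag_id="weblogs_to_s3",
        default_args=default_args,
        start_date=datetime(2019, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="extract_weblogs",
            python_callable=extract_weblogs,
            provide_context=True,  # needed in Airflow 1.x to receive the context kwargs
        )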

Confidential

Data Engineer

Responsibilities:

  • Worked with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob, Data Factory, Synapse, SQL DB, SQL DWH).
  • Performed data cleansing and applied transformations using Databricks and Spark for data analysis (a minimal PySpark sketch follows the Environment line below).
  • Used Azure Synapse to manage processing workloads and serve data for BI and prediction needs.
  • Developed Spark Scala scripts for mining data and performed transformations on large datasets to provide real-time insights and reports.
  • Supported the analytical platform, handled data quality, and improved performance using Scala's higher-order functions, lambda expressions, pattern matching, and collections.
  • Implemented scalable microservices to handle concurrency and high traffic; optimized existing Scala code and improved cluster performance.
  • Designed and automated custom-built input adapters using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS sources into Azure Data Lake; developed automated workflows for daily incremental loads from RDBMS into the Data Lake.
  • Monitored the Spark cluster using Log Analytics and the Ambari Web UI.
  • Transitioned log storage from MS SQL to Cosmos DB and improved query performance.
  • Created automated ETL jobs in Talend and pushed the data to the Azure SQL data warehouse.
  • Managed resources and scheduling across the cluster using Azure Kubernetes Service.
  • Built an Enterprise Data Lake using Data Factory and Blob storage, enabling other teams to work with more complex scenarios and ML solutions.
  • Used Azure Data Factory, the SQL API, and the Mongo API to integrate data from MongoDB, MS SQL, and cloud storage (Blob, Azure SQL DB).
  • Performed data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters.
  • Worked with the data science team on preprocessing and feature engineering and assisted with machine learning algorithms running in production.
  • Reduced access time by refactoring data models and optimizing queries, and implemented a Redis cache to support Snowflake.
  • Facilitated data for interactive Power BI dashboards and reporting.

Environment: Azure (HDInsight, Databricks, DataLake, Blob Storage, Data Factory, SQL DB, SQL DWH, AD, AKS), Scala, Python, Hadoop 2.x, Spark v2.0.2, NLP, Airflow v1.8.2, Hive v2.0.1, Sqoop v1.4.6, HBase, Oozie, Talend, CosmosDB, MS SQL, MongoDB, Ambari, PowerBI, Azure DevOps, Ranger, Git.
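
The snippet below is a minimal sketch of the Databricks/PySpark cleansing and warehouse-loading pattern described above. The Blob storage path, JDBC URL, table, column names, and credentials are illustrative assumptions; in practice secrets would come from a secret scope, and the actual load may go through Talend or a dedicated Synapse connector rather than plain JDBC.

    # Minimal sketch: cleanse raw Blob-storage data with PySpark and write it to an
    # Azure SQL data warehouse over JDBC. All names and the URL are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("blob-cleanse-to-dw-sketch").getOrCreate()

    raw = spark.read.parquet(
        "wasbs://raw@examplestorage.blob.core.windows.net/events/")  # assumed container/path

    cleaned = (raw
               .dropDuplicates(["event_id"])                  # assumed key column
               .filter(F.col("event_ts").isNotNull())
               .withColumn("event_date", F.to_date("event_ts")))

    (cleaned.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://example-dw.database.windows.net:1433;database=dw")
        .option("dbtable", "dbo.events_cleaned")
        .option("user", "etl_user")        # placeholders; real credentials belong in a secret store
        .option("password", "<secret>")
        .mode("append")
        .save())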

Confidential

Software Engineer/ Data Engineer

Responsibilities:

  • Implemented new dimensions in the Spark application based on business requirements.
  • Analyzed and understood the project's architectural design and data flow in a step-by-step process.
  • Configured and developed triggers, workflows, and validation rules, and handled the deployment process from one sandbox to another.
  • Extracted data from multiple databases using Sqoop import queries and ingested it into Hive tables.
  • Implemented Spark SQL to update queries based on business requirements.
  • Developed a Spark job that indexes data from external Hive tables in HDFS into Elasticsearch (a minimal sketch follows the Environment line below).
  • Validated, manipulated, and performed exploratory data analysis using pandas, NumPy, scikit-learn, and PySpark to interpret and extract insights from large datasets consisting of millions of records.
  • Developed ETL pipelines using Spark and Hive to perform various business-specific transformations.
  • Created Hive tables and loaded and analyzed data using Hive scripts, implementing partitioning and bucketing in Hive.
  • Developed Spark code using Scala, Spark SQL, and Spark Streaming to perform real-time analysis and processing of data.
  • Enabled concurrent access to various Hive tables with shared and exclusive locking, implemented with ZooKeeper in the cluster.
  • Migrated terabytes of on-premises enterprise data to AWS S3.
  • Automated jobs and data pipelines using AWS Step Functions and AWS Lambda, and configured performance metrics using AWS CloudWatch.
  • Used Jenkins for builds and continuous integration in software development.
  • Worked with the data science team on preprocessing and feature engineering and assisted in running a machine learning algorithm in production.

Environment: Hive, AWS, Spark Scala, PowerBI, HDFS, SQL, ETL, ZooKeeper.
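
As a minimal sketch of the Spark-to-Elasticsearch indexing job mentioned above (written in PySpark here for brevity, though the project code was Scala), the snippet assumes the elasticsearch-hadoop connector is on the Spark classpath; the Hive table, id column, host, and index names are illustrative assumptions.

    # Minimal sketch: index an external Hive table into Elasticsearch via the
    # elasticsearch-hadoop connector (assumed to be on the Spark classpath).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-to-elasticsearch-sketch")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.table("logs_db.web_events")  # assumed external Hive table backed by HDFS

    (events.write
        .format("org.elasticsearch.spark.sql")           # provided by elasticsearch-hadoop
        .option("es.nodes", "es-node.example.internal")  # assumed Elasticsearch host
        .option("es.port", "9200")
        .option("es.mapping.id", "event_id")             # assumed unique id column
        .mode("append")
        .save("web_events/doc"))                         # index/type naming as in older ES releases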

Confidential

Software Engineer

Responsibilities:

  • Used principles of normalization to improve performance.
  • Developed ETL code using PL/SQL to meet requirements for extraction, transformation, cleansing, and loading of data from source to target data structures.
  • Developed web applications using HTML, CSS, Bootstrap, C#, ASP.NET MVC, and JavaScript.
  • Designed ETL processes for extracting, transforming, and loading OLTP data into a central data warehouse; integrated new systems with the existing data warehouse structure and refined system performance and functionality.
  • Created SSIS packages using Pivot Transformation, Fuzzy Lookup, Derived Columns, Conditional Split, Term Extraction, Aggregate, Execute SQL Task, Data Flow Task, and Execute Package Task to generate underlying data for reports and to export cleaned data from Excel spreadsheets, text files, MS Access, and CSV files to the data warehouse.
  • Created datasets using different data sources such as Microsoft SQL Server and Excel; worked on DTS packages and DTS Import/Export for transferring data between heterogeneous databases.
  • Created triggers to enforce data and referential integrity.
  • Actively participated in gathering business requirements to implement functional and technical specifications.
  • Worked with T-SQL (DDL and DML) to implement and develop triggers, stored procedures, nested queries, joins, cursors, views, user-defined functions, indexes, user profiles, and relational database models.
  • Used recursive CTEs, CTEs, temp tables, and effective DDL/DML triggers to enable efficient data manipulation and support existing applications.
  • Created shell scripts for automated execution of batch processes.
  • Used tasks such as Data Flow, Execute SQL, Script, FTP, and Send Mail, along with transformation techniques such as Data Conversion, Merge, Row Count, Multicast, Sort, Lookup, and Conditional Split, in developing SSIS packages.
  • Identified, tested, and resolved database performance issues (monitoring and tuning) to ensure database optimization.
  • Performed database administration of all database objects including tables, clusters, indexes, views, sequences, packages, and procedures.
  • Responsible for all project lifecycle phases, from specifications and coding through deployment, testing, debugging, documentation, and maintenance.

Environment: .NET MVC, MS Excel, Data Quality, MS Access, SQL, Data Maintenance, PL/SQL, SQL Plus, Metadata, Tableau, Data Analysis, SSIS, SSRS, SSAS.
