Sr Data Engineer Resume
Baltimore, MD
SUMMARY
- 8+ years of professional experience in IT, including work across the Big Data ecosystem covering ingestion, query processing, and analysis of big data.
- Experience processing large sets of structured, unstructured, and semi-structured data using Spark and Scala.
- Excellent hands-on experience importing and exporting data between relational database systems such as MySQL and Oracle and HDFS/Hive using Sqoop.
- Knowledge of job workflow scheduling and monitoring tools such as Control-M and Zookeeper.
- Experienced in writing Spark applications in Scala and Python (PySpark).
- Experienced in writing and implementing Shell Scripts for automating jobs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD and PySpark concepts.
- Experienced in handling messaging services using Apache Kafka.
- Designed and implemented database solutions in SQL Data Warehouse and Azure SQL.
- Hands-on with real-time data processing using distributed technologies such as Kafka and Storm.
- Experience in performance tuning of MapReduce jobs and complex Hive queries.
- Experience building efficient ETL with Spark: in-memory processing, Spark SQL, and Spark Streaming with the Kafka distributed messaging system.
- Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as dbt and DataStage.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure Synapse, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Good knowledge of job orchestration tools such as Oozie, Zookeeper, and Airflow.
- Proficient with Amazon Web Services utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
- Wrote PySpark jobs in AWS Glue to merge data from multiple tables and used Glue Crawlers to populate the AWS Glue Data Catalog with metadata table definitions (a minimal sketch follows this summary).
- Authored AWS Glue scripts to transfer data and used Glue to run ETL jobs and aggregations in PySpark.
- Hands-on experience with Azure Data Factory, covering both ADF v1 and ADF v2.
- Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
- Expert in designing ETL data flows: creating mappings/workflows to extract data from SQL Server and performing data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
- Created and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications using tools such as Git, Terraform, and Ansible.
- Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
- Experienced with JSON-based RESTful web services and XML-based SOAP web services; worked on various applications using Python IDEs such as Sublime Text and PyCharm.
- Experience with Agile development methodologies and Azure GitHub repository.
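Illustrative sketch (referenced in the Glue bullet above): a minimal AWS Glue PySpark job that merges two catalog tables and writes the result to S3. The database, table, key, and bucket names are hypothetical placeholders, not the actual project objects.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read two tables that a Glue Crawler has already registered in the Data Catalog
orders = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
customers = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="customers")

# Merge the two tables on their shared key
merged = Join.apply(orders, customers, "customer_id", "customer_id")

# Write the merged result back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=merged,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/merged/"},
    format="parquet",
)
job.commit()
```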
TECHNICAL SKILLS
Big Data/Hadoop Technologies: HDFS, MapReduce, YARN, Hive, Pig, HBase, Impala, Zookeeper, Sqoop, Oozie, Apache Cassandra, Flume, Spark, Azure, AWS, EC2
Languages: C, Java, HTML5, DHTML, CSS3, SQL, JSON, PL/SQL, Scala, Shell Scripts
Databases: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, HBase
NoSQL Databases: Cassandra, MongoDB, HBase
Cloud: Azure (Azure Data Lake, Azure Storage, Azure Synapse), AWS (S3, EC2, Redshift, Lambda, Glue)
Application Servers: WebLogic, WebSphere, Apache Tomcat, JBOSS
Build Tools: Jenkins, PostgreSQL, Oozie, SoapUI
Reporting Tools: Jaspersoft, Qlik Sense, Tableau, JUnit
PROFESSIONAL EXPERIENCE
Confidential, Baltimore MD
Sr Data Engineer
Responsibilities:
- Developed AWS pipelines that extract customer data from various sources into Hadoop HDFS, including Excel, Oracle, flat files, SQL Server, log data, and Teradata.
- Created data pipelines for gathering, cleaning, and transforming data using Spark and Hive, and used Spark Streaming APIs to build the common learner data model, which consumes data from AWS Kinesis in real time and persists it.
- Wrote and executed Spark code using Spark SQL and Scala for quicker testing and processing of data, transforming it with SparkContext, pair RDDs, Spark SQL, and Spark on YARN.
- Worked on Elasticsearch, Logstash, and Kibana (ELK stack) for centralized logging and analytics in the continuous delivery pipeline, storing logs and metrics in an S3 bucket via a Lambda function.
- Developed a Lambda function to monitor log files, triggered whenever the log files change (a minimal sketch follows this section).
- Wrote Python scripts to automate AWS services including CloudFront, ELB, Lambda, database security, and application configuration, and to back up EBS volumes using CloudWatch and AWS Lambda.
- Built a data pipeline in AWS using AWS Glue to pull data from weblogs and store it in HDFS.
- Worked with distributed frameworks such as Apache Spark and Presto on Amazon EMR, alongside Redshift.
- Identified query duplication, dependencies, and complexity to reduce migration effort. Technology stack: AWS Cloud, Oracle, and DynamoDB.
- Performed troubleshooting and optimization, integrated test cases into the CI/CD pipeline using Docker images, and implemented continuous integration and deployment (CI/CD) for Hadoop jobs through Jenkins.
- Built a Spark Streaming application to extract data from the cloud into Hive tables.
- Used Spark SQL to process large volumes of structured data and executed programs in Python using Spark.
- Performed analysis and optimization of RDDs by controlling partitions for the given data, and executed business analytics scripts using Hive SQL.
- Integrated AWS Kinesis streaming with the on-premises cluster and wrote automation scripts in Python to manage and deploy applications.
- Built Hadoop jobs to analyze data using Pig and Hive, accessing sequence files, text files, and Parquet files.
- Converted Hive/SQL queries into Spark transformations with the help of Python, Spark RDDs, and Scala.
- Worked with Spark Streaming through the core Spark API, running Scala to transform raw data into baseline data.
- Executed Python programs with a variety of packages such as Matplotlib, NumPy, & Pandas.
- Executed SQL scripts and reworked them to improve performance using PySpark SQL.
Environment: Hadoop, YARN, Spark, Pig, Hive, SQL, PySpark, Python, Chef, AWS Lambda, AWS S3, Snowflake, AWS EMR, DynamoDB, Redshift, Kinesis, HBase, NoSQL, Sqoop, MySQL, Docker, Data Warehouse, and ETL.
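Illustrative sketch (referenced above): a minimal log-monitoring Lambda of the kind described in this role, invoked by S3 object-created notifications. The bucket layout and the "ERROR" filter are assumptions for illustration, not the production logic.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Triggered by S3 object-created notifications on the log prefix."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Fetch the changed log object and scan it for error lines
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8", errors="replace")
        error_lines = [line for line in body.splitlines() if "ERROR" in line]

        # Surface a summary in the Lambda's own CloudWatch log stream
        print(json.dumps({"bucket": bucket, "key": key, "error_lines": len(error_lines)}))

    return {"statusCode": 200}
```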
Confidential, Chicago, IL
Sr. Data Engineer
Responsibilities:
- Created Azure Data Factory pipelines for loading data into Azure SQL Database from different platforms and sources.
- Designed ADF pipelines with multiple chained activities, using parameters, incremental and delta loads, and automated triggers.
- Worked on migration of pipelines from ADFv1 to ADFv2 connecting to multiple sources and destinations.
- Developed multiple apps using the Power Apps platform with SQL, and SharePoint as databases. Created and managed the Power Platform solutions across different environments.
- Extracted, transformed, and loaded (ETL) data from source systems into Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks (a PySpark sketch follows this section).
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load (ETL) data from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
- Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Developed SQL Scripts for automation purposes.
- Migrated on-premises databases to Snowflake via a shift-and-load approach in ADF.
- Extensive experience creating pipeline jobs, scheduling triggers, and mapping data flows using Azure Data Factory (V2), with Key Vault used to store credentials.
- Good experience working with Azure Blob and Data Lake Storage and loading data into Azure Synapse Analytics (SQL DW).
- Identified root causes of long-running, deadlocked, stuck, and timed-out processes, and provided fixes in SSAS, Power BI, Azure data warehouses, agent jobs, and ETL packages.
- Handled various production failures across agent jobs, databases, Databricks, cube processing, Windows servers, services, and Azure VMs.
- Performed the requirement gathering, requirement analysis, design, and development of Azure Analysis Services Tabular models (SSAS) and Power BI visualizations.
- Automated the processing of Azure Analysis Services Tabular models using Azure Automation.
- Implemented role-based security in SSAS Tabular and Power BI models for various Azure Active Directory groups based on company location.
- Implemented code check-in/check-out and managed multiple versions of complicated code within TFS.
- Handled critical production requests and analyzed and resolved SQL job failures and Azure pipeline issues.
Environment: Azure Analysis Services, Power BI, Azure SQL Database, Python, NumPy, Pandas, Keras, TensorFlow, Azure CLI, Azure HDInsight, Eclipse, IntelliJ, SSAS, Azure Data Warehouse, Azure Data Lake, Azure Data Factory (V2), Spark SQL, Blob, JSON.
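Illustrative sketch (referenced above): a minimal Databricks-style PySpark flow that reads landed data from ADLS Gen2 and loads it into Azure SQL over JDBC, approximating the ingestion described in this role. The storage account, container, server, table, and credential values are placeholders; in practice, credentials were held in Key Vault-backed secrets.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Azure Databricks a SparkSession already exists as `spark`;
# getOrCreate() keeps this sketch runnable elsewhere too.
spark = SparkSession.builder.appName("adls-to-azure-sql").getOrCreate()

# Read raw Parquet landed in ADLS Gen2 (container/account/path are hypothetical)
raw = spark.read.parquet("abfss://landing@examplestorage.dfs.core.windows.net/sales/2021/")

# Light transformation before loading into the warehouse
curated = (
    raw.dropDuplicates(["order_id"])
       .withColumn("load_date", F.current_date())
)

# Write to Azure SQL over JDBC (server, database, and credentials are placeholders)
jdbc_url = "jdbc:sqlserver://example-server.database.windows.net:1433;database=example_db"
(
    curated.write.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.sales_curated")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save()
)
```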
Confidential, St Louis, MO
Data Engineer
Responsibilities:
- Designed and built Spark/PySpark-based ETL pipelines to migrate service transaction, account, and customer data into the enterprise Hadoop data lake. Developed strategies for handling large datasets using partitioning, Spark SQL, broadcast joins, and performance tuning (a broadcast-join sketch follows this section).
- Built and implemented performant data pipelines using Apache Spark on AWS EMR. Maintained data integration programs feeding Hadoop and RDBMS environments from both structured and semi-structured source systems.
- Performance-tuned existing Hive queries and UDFs used to analyze the data. Used Pig to analyze datasets and perform transformations according to requirements.
- Oversaw data profiling and data validation to ensure data accuracy between source and target systems. Performed job scheduling and monitoring using AutoSys and quality testing using the ALM tool.
- Built Tableau Desktop reports and dashboards to report on customer data.
- Built and published customized interactive Tableau reports and dashboards along with data refresh scheduling using Tableau Desktop.
- Used the Snowflake data warehouse to consume data from the C3 platform.
- Set up S3 event notifications, an SNS topic, an SQS queue, and a Lambda function that sends messages to a Slack channel.
- Transformed Teradata scripts and stored procedures to SQL and Python running on Snowflake's cloud platform.
- Deployed and monitored scalable infrastructure on Amazon Web Services (AWS), managed servers on AWS with Ansible configuration management, created AWS instances, and migrated data to AWS from the data center.
- Automated extraction of metadata and lineage from tools using Python scripts, saving 70+ hours of manual effort.
- Analyzed system requirement specifications and interacted with clients during requirements definition.
- Provided daily reports to the Development Manager and participated in both the design and development phases.
Environment: AWS, Hadoop, Python, PySpark, SQL, Snowflake, Databricks/Delta Lake, AWS S3, AWS Athena, AWS EMR, Teradata, Tableau.
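Illustrative sketch (referenced above): a minimal PySpark broadcast join with a partitioned write, of the kind used for the large-dataset tuning described in this role. The paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("customer-etl").getOrCreate()

# Large fact-style dataset and a small dimension table (paths are hypothetical)
transactions = spark.read.parquet("s3://example-bucket/raw/transactions/")
accounts = spark.read.parquet("s3://example-bucket/raw/accounts/")

# Broadcast the small side so the join avoids a full shuffle of the large dataset
enriched = (
    transactions.join(broadcast(accounts), on="account_id", how="left")
                .withColumn("load_date", F.current_date())
)

# Partition the output by load date so downstream Hive/Athena queries can prune partitions
(
    enriched.repartition("load_date")
            .write.mode("overwrite")
            .partitionBy("load_date")
            .parquet("s3://example-bucket/curated/transactions/")
)
```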
Confidential, San Jose, CA
Big Data Engineer
Responsibilities:
- Collected and aggregated large amounts of web log data from sources such as web servers, mobile devices, and network devices using Apache Flume and stored the data in HDFS for analysis.
- Collected data from multiple Flume agents deployed on various servers using multi-hop flows.
- Ingested real-time and near-real-time (NRT) streaming data into HDFS using Flume.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data.
- Transformed data from mainframe tables into HDFS and HBase tables using Sqoop.
- Loaded data into HBase using the HBase shell and the HBase client API.
- Handled administration activities using Cloudera Manager.
- Developed Impala scripts for extraction, transformation, and loading of data into the data warehouse.
- Experience working with Apache SOLR for indexing and querying.
- Created custom Solr query segments to optimize search matching.
- Performed data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads across a variety of sources such as web servers, RDBMS, and data APIs.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that execute independently based on time and data availability.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Migrated tables from RDBMS into Hive using Sqoop and later generated visualizations using Tableau (a comparable PySpark load is sketched after this section).
- Created and maintained Technical documentation for launching Hadoop Clusters and for executing Pig Scripts.
- Involved in writing optimized Pig Script along with developing and testing Pig Latin Scripts.
- Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
- Created ETL Mapping with Talend Integration Suite to pull data from Source, apply transformations, and load data into target database.
Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr, RDBMS.
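Illustrative sketch (referenced above): the RDBMS-to-Hive movement in this role was done with Sqoop; a comparable load expressed in PySpark over JDBC might look like the following. The host, database, table, and credential values are placeholders, and the MySQL JDBC driver must be on the Spark classpath.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets the session write managed Hive tables
spark = (
    SparkSession.builder
        .appName("rdbms-to-hive")
        .enableHiveSupport()
        .getOrCreate()
)

# Pull a source table over JDBC (connection details are placeholders)
source = (
    spark.read.format("jdbc")
        .option("url", "jdbc:mysql://example-host:3306/source_db")
        .option("dbtable", "customers")
        .option("user", "<user>")
        .option("password", "<password>")
        .load()
)

# Land the data as a Hive table for downstream Hive/Tableau use
source.write.mode("overwrite").saveAsTable("analytics.customers")
```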
Confidential
SQL Server BI Developer
Responsibilities:
- Analyzed the existing reporting environment to understand current reports and data flows.
- Worked closely with business analysts on requirements gathering, database design, and creating report workflows.
- Created one central SSIS Master Package to execute multiple child packages with Control Flow and Data Flow. Utilized Parent Package and XML Configuration to dynamically pass variable values on runtime.
- Participated in identifying data migration issues and resolved them.
- Extensively used variables, breakpoints, checkpoints, logging, package configurations, and event handlers in SSIS packages to meet business needs.
- Performed performance tuning to optimize queries and enhance the performance of databases, SQL queries, and stored procedures using SQL Profiler, execution plans, and the Index Tuning Wizard (a simple timing sketch follows this section).
- Created Complex SSAS Cubes with multiple fact measure groups, and multiple hierarchies based on the OLAP reporting needs.
- Built MDX queries for Analysis Services & Reporting Services.
- Designed, developed, and tested PivotTable/PivotChart reports based on OLAP cubes and offline cubes.
- Created multiple partitions and aggregations for the different measure groups for improving performance of the cubes.
- Resolving the SSAS cube connectivity and data issues as and when needed.
- Interacted with business users to help them generate reports and explore business data with various drill-down options in Excel by connecting to SQL Server Analysis Services (SSAS).
Environment: SQL Server 2012, SQL Server Integration Services (SSIS), Reporting Services (SSRS), TFS, Pivot Tables, MS Visual Studio.Net, C#, SQL Profiler, Windows 2003/2007 Server OS.
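Illustrative sketch (referenced above): the stored-procedure tuning in this role was done with SQL Profiler and related SQL Server tooling; the same before/after timing check, expressed in Python with pyodbc, might look like the following. The server, database, and procedure names are placeholders.

```python
import time

import pyodbc

# Connection-string values are placeholders; the real tuning ran against SQL Server 2012
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example-server;DATABASE=example_db;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Time a stored procedure before and after an index or query change
start = time.perf_counter()
cursor.execute("EXEC dbo.usp_LoadDailySales @RunDate = ?", "2015-06-01")
rows = cursor.fetchall() if cursor.description else []
elapsed = time.perf_counter() - start

print(f"usp_LoadDailySales returned {len(rows)} rows in {elapsed:.2f}s")
conn.close()
```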