
Senior Data Engineer Resume


Sioux Falls, SD

SUMMARY

  • A Certified Data Engineering specialist with 9+ years of experience implementing a range of Big Data/Cloud Engineering, Snowflake, Data Warehouse, Data Modeling, Data Mart, Data Visualization, Reporting, Data Quality, and Data Analytics solutions. Willing to contribute to team success through hard work, meticulous attention to detail, and excellent organizational skills.
  • Evaluating the technology stack for building Cloud-Based Analytics solutions by conducting research and identifying appropriate strategies, tools, and methodologies for developing end-to-end analytics solutions, and assisting in the development of a technology roadmap for Data Ingestion, Data Lakes, Data Processing, and Data Visualization.
  • Strong practical experience in Cloud data migration utilizing AWS, Azure, and Snowflake.
  • Experienced in AWS and Azure Deployments, with the overarching objective of transferring on-premises servers and data to the Cloud.
  • Hands-on experience with AWS services such as Amazon EC2, S3, RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and others.
  • Used Amazon Web Services such as EC2 and S3 for small-scale data processing and storage. Extensive experience administering Hadoop clusters running on AWS EMR.
  • Detailed exposure to Azure tools such as Azure Data Lake, Azure Databricks, Azure Data Factory, HDInsight, Azure SQL Server, and Azure DevOps.
  • Built batch data pipelines based on data varieties using Azure Databricks Notebooks.
  • Good understanding of the Snowflake Data Platform. Experience with Snowflake Multi-Cluster Warehouses and Snowpipe, importing data from the local file system and AWS S3 buckets; in-depth knowledge of Snowflake database, schema, and table structures, and experience with Snowflake Cloning and Time Travel.
  • Experience designing and optimizing Data Pipelines, Architectures, and Data Sets for "Big Data".
  • Extensive knowledge of Spark Streaming, Spark SQL, and other Spark components such as accumulators, broadcast variables, various cache levels, and optimization approaches for Spark jobs.
  • Experience implementing the Big Data ecosystem, including Hadoop MapReduce, NoSQL, Apache Spark, PySpark, Python, Scala, Hive, Impala, Sqoop, AWS, Azure, and Oozie.
  • Familiarity with data processing performance optimization strategies such as dynamic partitioning, bucketing, file compression, and cache management in Hive, Impala, and Spark.
  • Experience building ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL to extract, load, and transform data; writing SQL queries against Snowflake (a minimal sketch follows this summary).
  • Working knowledge of Data Build Tool (dbt) with Snowflake.
  • Experience using dbt to perform schema tests, referential integrity tests, and custom tests on data to ensure data quality.
  • Experience with Data Pipelines, ETL and ELT data processes, and converting Big Data/unstructured data sets (JSON, log data) to structured data sets for Product Analysts and Data Scientists.
  • Responsible for Data Modeling, Data Migration, Design, and ETL Pipeline preparation for both cloud and Exadata platforms.
  • Developed ETL Scripts using Informatica for data acquisition and transformation.
  • Experience in both planning and implementation of data warehouse ETL Architecture using Teradata, Oracle, SQL, and PL/SQL as well as Informatica, UNIX Shell scripts and SQL*Plus and SQL*Loader.
  • Extensive experience with SQL Server, DB2, PostgreSQL, Oracle, and Excel Data Integration.
  • Experience with JSON, Avro, Parquet, RC, and ORC data formats, as well as compression codecs such as Snappy and BZip2.
  • Working knowledge of Object-Oriented Programming (OOP) principles using Python.
  • A thorough understanding of Usability Engineering, User Interface Design and Development.
  • Knowledgeable with Reporting tools such as Power BI, Data Studio and Tableau.
  • Good backend skills, including designing SQL objects such as tables, stored procedures, triggers, indexes, and views to enable data manipulation and consistency.
  • Expertise in utilizing and implementing the most effective SDLC and ITIL approaches.
  • Defined product requirements and produced high-level architectural specifications to guarantee that existing platforms are practical and useful.
  • Benchmarked prototyped components and provided development teams with templates to test design solutions.
  • Experience managing a team, including task planning, allocation, tracking, and execution. Relationship-driven, results-driven, creative, and able to think outside the box.
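A minimal sketch of the Python-plus-SnowSQL loading pattern referenced above, using the Snowflake Python connector; the account, credentials, stage, and table names are hypothetical placeholders, not the actual environment.

```python
# Minimal sketch (assumed names/credentials): load a staged S3 file into Snowflake
# with the Python connector, then query the target table for a quick quality check.
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="xy12345",          # hypothetical account identifier
    user="etl_user",            # hypothetical service user
    password="***",             # use a secrets manager in practice
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)

try:
    cur = conn.cursor()
    # COPY INTO pulls files from an external stage that points at an S3 bucket.
    cur.execute("""
        COPY INTO RAW.ORDERS
        FROM @S3_ORDERS_STAGE          -- hypothetical external stage
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
    """)
    # Simple row-count check after the load.
    cur.execute("SELECT COUNT(*) FROM RAW.ORDERS")
    print("rows loaded:", cur.fetchone()[0])
finally:
    conn.close()
```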

TECHNICAL SKILLS

AWS: Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon MQ, Amazon ECS, AWS Lambda, Amazon RDS, Elastic Load Balancing, Elasticsearch, Amazon SQS, AWS Identity and Access Management (IAM), Amazon CloudWatch, Amazon EBS, AWS CloudFormation, Amazon SageMaker, AWS Glue, Amazon Athena

MS Azure: Cloud Services (PaaS & IaaS), Active Directory, Application Insights, Azure Monitoring, Azure Search, Data Factory, Key Vault, SQL Azure, Azure DevOps, Azure Analysis Services, Azure Synapse Analytics (DW), Azure Data Lake.

Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Oozie, Spark, ZooKeeper, Cloudera Manager, Kafka, Flume.

ETL Tools: Snowflake, Data Build Tool (dbt), Informatica

Reporting: PowerBI, Tableau

Hadoop Distribution: Hortonworks, Cloudera

Application Servers: Apache Tomcat, JDBC, ODBC

Programming & Scripting: Python, Scala, SQL, Shell Scripting

Databases: Oracle, MySQL, Teradata, HBase, Cassandra, DynamoDB.

Version Control: Git

IDE Tools: Eclipse, Jupyter, Anaconda

Development Methodologies: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Confidential

Senior Data Engineer

Responsibilities:

  • Developing and maintaining an optimal data pipeline architecture.
  • Responsible for loading data from the internal server and the Snowflake data warehouse into S3 buckets.
  • Developed a system for data extraction, transformation, and loading (ETL) from a range of sources.
  • Launched Amazon EC2 cloud instances and configured them for individual applications using Amazon Web Services (Linux/Ubuntu).
  • Performed extensive data migration from Snowflake to S3 for the TMCOMP/ESD feeds.
  • Implemented usage of Amazon EMR for processing Big Data across the Hadoop Cluster of virtual servers running on Amazon EC2, S3, and Redshift.
  • Supported continuous storage in AWS using Elastic Block Storage, S3, and Glacier. Created volumes and configured snapshots for EC2 instances. Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
  • Used the Oozie workflow engine to run several Hive and Python jobs.
  • Hands-on experience with Oozie workflows for ingesting data on an hourly basis.
  • Wrote data pipeline definitions in JSON format for production.
  • Used AWS Athena heavily to import structured data from S3 into numerous systems, including Redshift, and to generate reports. Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which collects data from Kinesis in near real time.
  • Developed Snowflake views for loading and unloading data from and to an AWS S3 bucket, as well as deploying the code.
  • Worked on Snowflake modeling; highly proficient in data warehousing techniques for data cleansing, Slowly Changing Dimensions, surrogate key assignment, and change data capture.
  • Consulting on Snowflake Data Platform Solution Architecture, Design, Development, and deployment focused on bringing the data-driven culture across the enterprises.
  • Extracted, transformed, and loaded data from source systems into CSV data files using Python and SQL queries.
  • Created SQL transformations with the Data Build Tool (dbt) to generate base datasets and models in Snowflake.
  • Built extract/load/transform (ELT) processes in the Snowflake data warehouse using dbt to manage and store data from internal and external sources.
  • Created DAGs in Airflow to automate and schedule Python jobs (see the Airflow sketch after this list).
  • Designed, developed, and maintained data integration applications that worked with both standard and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis in a Hadoop and RDBMS context. Used Spark's in-memory computing capabilities to accomplish advanced tasks such as text analytics and processing. Wrote Spark SQL queries over RDDs and DataFrames that combine Hive queries with Scala and Python data operations.
  • Analyzed Hive data using the Spark API on the EMR cluster running Hadoop YARN. Enhanced existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Provided production support for the EMR cluster, mainly troubleshooting memory and Spark job application issues.
  • Developed AWS Lambda function to monitor the EMR cluster status updates and the jobs.
  • Aided in the development of Hive tables, as well as the loading and analysis of data using Hive queries.
  • Designed various Jenkins jobs to continuously integrate the processes and executed CI/CD pipeline using Jenkins.
  • Designed and developed the applications on the data lake to transform the data according to business users to perform analytics.
  • Used Python to conduct exploratory data analysis and data visualization (matplotlib, NumPy, pandas, seaborn).
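A minimal Airflow 2.x sketch of the kind of scheduled pipeline referenced above; the DAG name, task names, hourly schedule, and placeholder callables are illustrative assumptions, not the actual production jobs.

```python
# Minimal Airflow 2.x sketch (hypothetical task names and schedule): an hourly DAG
# that chains an extract step and a load-to-S3 step with PythonOperator.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_snowflake(**context):
    # Placeholder: run a SnowSQL/connector query and stage the results locally.
    print("extracting feed data from Snowflake")


def load_to_s3(**context):
    # Placeholder: upload the staged files to an S3 bucket (e.g., with boto3).
    print("uploading staged files to S3")


with DAG(
    dag_id="snowflake_to_s3_feed",      # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_snowflake)
    load = PythonOperator(task_id="load_to_s3", python_callable=load_to_s3)

    extract >> load                     # run extract before load
```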

Environment: AWS S3, EC2, EMR, Redshift, Snowflake, Data Build Tool (dbt), Hadoop YARN, SQL Server, Spark, Spark Streaming, Scala, Kinesis, Python, Hive, Linux, Sqoop, Tableau, Cassandra, Oozie, Control-M, RDS, DynamoDB, Oracle 12c.

Confidential

Data Engineer III

Responsibilities:

  • Analyzed, created, and developed data solutions that enable data visualization using Azure PaaS services.
  • Contributed to developing PySpark DataFrames in Azure Databricks that allow users to read data from Data Lake or Blob storage and manipulate it with the Spark SQL context (a PySpark sketch follows this list).
  • Performed ETL on data from numerous source systems into Azure Data Lake Storage (ADLS) using a combination of Azure Data Factory (ADF), Spark SQL, and Azure Databricks for processing the data.
  • Designed, developed, and implemented efficient, complex ETL pipelines using PySpark and Azure Data Factory.
  • Developed a cloud POC to select the best cloud vendor based on stringent success criteria.
  • Integrating data storage solutions like Azure Data Lake and Blob storage with Spark.
  • Created multiple Databricks Spark jobs in PySpark to perform various table-to-table operations.
  • Developed Spark code in Python within Databricks notebooks.
  • Migrated data from SAP and Oracle and developed a data mart using Cloud Composer (Airflow), converting Hadoop operations to Datapost workflows.
  • Built ETL pipelines into and out of the data warehouse using Python and Snowflake's SnowSQL, and created Snowflake SQL queries.
  • Designed and built a fully operational, production-ready, large-scale data solution on Snowflake Data Warehouse.
  • Used Snowflake to create a data warehouse model for more than 80 datasets.
  • Using Azure Monitor, continuously monitored and managed the performance of the CI/CD data pipeline along with the applications from a single console.
  • Created Hive external tables to stage data and then moved the data from staging to the main tables. Built and designed a Data Discovery Platform for a large system integrator using Azure HDInsight components.
  • Developed datasets from the Azure data warehouse for Power BI reports.
  • Identified data in various data stores, including tables, files, folders, and documents, in order to construct a dataset in the Azure HDInsight pipeline.
  • Improved the performance of Hive and Spark jobs.
  • Created Hive scripts from Teradata SQL scripts to handle data in Hadoop.
  • Applied a good understanding of Hive partitioning and bucketing concepts to create Hive managed and external tables for improved efficiency.
  • Wrote generic scripts to automate tasks such as creating Hive tables and mounting ADLS in Azure Databricks.
  • Developed JSON scripts for deploying the pipeline in the Azure Data Factory (ADF) that processes data with the SQL Activity.
  • Worked on creating correlated and non-correlated sub-queries to resolve complex business queries involving multiple tables from different databases.
  • Implemented Disaster Recovery and Failover servers in Cloud by replicating data across regions.
  • Employed Hive queries to analyze massive data sets containing structured, unstructured, and semi-structured data.
  • Worked with structured data in Hive to improve performance, using sophisticated techniques such as bucketing, partitioning, and optimized self-joins.
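A minimal PySpark sketch of the Databricks/ADLS pattern referenced above; the storage account, container, paths, column names, and aggregation are hypothetical assumptions for illustration.

```python
# Minimal PySpark sketch for Azure Databricks (hypothetical storage account,
# container, and table names): read raw files from ADLS Gen2, apply a Spark SQL
# transformation, and write the curated result back to the lake as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls_transform").getOrCreate()

raw_path = "abfss://raw@examplelake.dfs.core.windows.net/sales/"          # hypothetical
curated_path = "abfss://curated@examplelake.dfs.core.windows.net/sales_daily/"

# Read the raw zone (assumes the cluster's credentials already grant ADLS access).
df = spark.read.format("parquet").load(raw_path)

# Register as a temp view so the transformation can be expressed in Spark SQL.
df.createOrReplaceTempView("sales_raw")
daily = spark.sql("""
    SELECT order_date, region, SUM(amount) AS total_amount
    FROM sales_raw
    GROUP BY order_date, region
""")

# Write the curated dataset back to the lake, partitioned by date.
daily.write.mode("overwrite").partitionBy("order_date").parquet(curated_path)
```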

Environment: Azure Data Lake, Azure SQL, Azure Data Factory (V2), Azure Databricks, Airflow, Snowflake, Power BI, Teradata, Python 2.0, SSIS, Azure Blob Storage, Spark 2.0, Hive.

Confidential

Data Engineer II

Responsibilities:

  • Used Hadoop ecosystem with AWS EMR and MapR to build a scalable distributed data system (MapR data platform).
  • Worked with Python, Hive, and Pig to create simple to complex MapReduce streaming jobs.
  • Optimized MapReduce jobs using multiple compression methods to make HDFS more efficient.
  • Used Sqoop as an ETL component to extract data from MySQL and load it into HDFS.
  • Performed ETL operations on the business data and developed a Spark pipeline that efficiently executes ETL activities.
  • Wrote Hive and Pig scripts to analyze customer behavior data.
  • Developed Python scripts to handle semi-structured data in JSON format.
  • Worked extensively with AWS S3 buckets and was involved in file transfers between HDFS and AWS S3.
  • Loaded data into Amazon Redshift and used Amazon CloudWatch to capture and monitor AWS RDS instances within the Confidential environment.
  • Developed Kafka producers and consumers using the Python API to produce Avro-encoded messages (see the producer sketch following this list).
  • Developed and implemented a migration strategy from an Oracle platform to AWS Redshift for the Data Warehouse.
  • Developed PySpark code for EMR and AWS Glue jobs.
  • Installed the Ganglia monitoring tool to generate reports on Hadoop cluster operations such as CPU utilization and hosts up/down.
  • Imported data from several sources, transformed with Spark, then loaded into Hive.
  • Experience with the Spark Core, Spark Streaming, and Spark SQL modules.
  • Used Scala to create code for all use cases in Spark and have substantial experience with Scala for data analytics on Spark clusters and map-side joins on RDD.
  • Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions.
  • Extensive experience using Splunk to monitor log data in real-time.
  • Built clusters on AWS using EMR, S3, EC2, and Redshift.
  • Expertise in establishing an optimum data integration platform that can handle growing data volumes.
  • Used Sqoop to export the analyzed data to relational databases for visualization and report generation by our BI team.
  • Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes coupled with Spark, Kafka, and Postgres running on other instances via SSL handshakes in QA and Production environments.
  • Worked with DataFrames, RDDs, and SparkContext while researching Spark's modules.
  • Strong hands-on experience with PySpark, using Spark's libraries through Python scripting to analyze data.
  • Collaborated with BI (Tableau) teams to meet dataset requirements and worked with them to generate reports, dashboards, and visualizations.
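A minimal sketch of a Python Kafka producer with Avro serialization using the confluent-kafka library; the broker address, Schema Registry URL, topic, and record schema are hypothetical placeholders.

```python
# Minimal sketch of a Python Kafka producer with Avro serialization
# (confluent-kafka library; broker, Schema Registry URL, and topic are hypothetical).
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Avro schema for the event payload (assumed field layout).
SCHEMA_STR = """
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page",    "type": "string"},
    {"name": "ts",      "type": "long"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
value_serializer = AvroSerializer(registry, SCHEMA_STR)

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "value.serializer": value_serializer,
})

# Produce a single Avro-encoded record; flush() blocks until delivery completes.
producer.produce(topic="click-events", value={"user_id": "u1", "page": "/home", "ts": 1700000000})
producer.flush()
```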

Environment: AWS EMR, S3, EC2, Redshift, Hadoop, Hive v1.0.0, HBase, Sqoop v1.4.4, Zookeeper, Kafka v0.8.1, Python, SQL, Teradata, Splunk, Oracle, MySQL, Tableau v9.x, SVN, Jira.

Confidential - Sioux Falls, SD

Enterprise Business Intelligence Developer

Responsibilities:

  • Converted batch data from SQL Server, MySQL, PostgreSQL, and CSV files into data frames using PySpark.
  • Performed research and downloaded jars for programming using Spark-Avro.
  • Developed a PySpark program to save data frames as Avro files to HDFS.
  • Utilized Spark's parallel processing capabilities to ingest data.
  • Created and ran HQL scripts to generate external tables in a Hive raw layer database.
  • Developed a script to copy Avro-formatted data from HDFS to raw layer external tables.
  • Created PySpark code that uses Spark SQL to build DataFrames from the Avro-formatted raw layer and write them to data service layer internal tables in ORC format (a minimal sketch follows this list).
  • Installed Airflow and set up a PostgreSQL database to hold Airflow metadata.
  • Configured documents that allow Airflow to communicate with its PostgreSQL database.
  • Imported the Airflow libraries into Python to create Airflow DAGs.
  • Utilized Airflow to automatically schedule and trigger data ingestion pipeline execution.
  • Implemented clustering methods such as DBSCAN, K-means, K-means++, and hierarchical clustering for customer profiling to create insurance plans based on their behavior patterns.
  • Worked with Customer Churn Models, including Random Forest regression and lasso regression, along with data pre-processing.
  • Performed data cleaning, feature scaling, and feature engineering in Python using the pandas and NumPy tools and built models using deep learning frameworks.
  • Involved with the creation of Hive tables, data loading, and building hive queries that would run internally in a map-reduce way.
  • Created new repositories from scratch and handled backup and restore operations.
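A minimal PySpark sketch of the Avro-to-ORC flow referenced above; the HDFS path, database/table names, and cleanup step are illustrative assumptions, and the spark-avro package is assumed to be on the classpath.

```python
# Minimal PySpark sketch (hypothetical paths and table names): read Avro files from
# the HDFS raw layer and write them to a data service layer table in ORC format.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("raw_to_dsl")
    .enableHiveSupport()                 # needed to write managed Hive tables
    .getOrCreate()
)

# Raw layer: Avro files landed by the ingestion job (hypothetical path;
# requires the spark-avro package).
raw_df = spark.read.format("avro").load("hdfs:///data/raw/customers/")

# Light cleanup before publishing to the data service layer (assumed key column).
clean_df = raw_df.dropDuplicates(["customer_id"]).filter("customer_id IS NOT NULL")

# Data service layer: managed Hive table stored as ORC
# (assumes the "dsl" database already exists).
(
    clean_df.write
    .format("orc")
    .mode("overwrite")
    .saveAsTable("dsl.customers")
)
```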

Environment: Spark, Redshift, Python, HDFS, Hive, Sqoop, Scala, Kafka, Shell scripting, Linux, Git, Oozie, Cloudera, Oracle 10g, PL/SQL, PostgreSQL, Pandas, NumPy.

Confidential

Tableau Developer

Responsibilities:

  • Involved in all aspects of the SDLC (Analyzing, Gathering Requirements, Development, Quality Assurance, and Deployment).
  • Developed BRDs (Business Requirement Documents) and FRDs (Functional Requirement Documents) for self-servicing the tool in collaboration with users.
  • Performed Tableau type conversion functions when connected to relational data sources.
  • Developed strategic, analytical, and operational dashboards using Tableau.
  • Developed and presented Tableau's advanced features, including calculated fields, parameters, table calculations, row-level security, the Python interface, joins, data blending, and dashboard actions.
  • Maintained and installed Tableau Server in all environments and performed incremental refreshes for data sources on Tableau server.
  • Created extracts, published data sources to Tableau Server, and refreshed extract in Tableau Server from Tableau Desktop.
  • Performed performance tuning of Dashboards and Collaborated with Database and User Teams for Design Review and Modeling.
  • Performed data blending to combine data from numerous sources and constructed dynamic dashboards.
  • Developed stored procedures, user-defined functions, views, tables, and T-SQL scripts for complicated business logic.
  • Developed interactive Tableau dashboards based on large SQL Server and Apache Hive datasets.
  • Used tabcmd commands and the tabadmin utility.
  • Developed detailed documentation for the design, development, and implementation of the mappings and provided knowledge transfer to the end-users.
  • Performed Tableau server and desktop upgrades from 9.3 to 10.4.
  • Defined visualization pixel sizes, created dashboards with vertical/horizontal containers, and combined dashboards into Tableau Stories.
  • Performed Tableau dashboards integration with Salesforce to provide users with a continuous data experience.

Environment: Tableau Desktop, Tableau Server, Tabadmin, Salesforce, SQL Server, Oracle, DB2, Informatica, Jira, SharePoint.
