Sr. Data Engineer Resume
St Louis, MO
SUMMARY
- A self-driven, goal-oriented IT professional with about 7 years of experience who can apply the right techniques in difficult circumstances.
- An adept learner with a background in Data Modeling, Data Warehousing, Data Marts, ETL pipelines, and Data Visualization.
- Extensive knowledge in working with the AWS cloud platform using S3, EMR, RDS, Redshift, Athena, SageMaker, Glue, Lambda, DynamoDB, Aurora, and Kinesis.
- Extensive knowledge in working with the Azure cloud platform using HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Data Storage Explorer.
- Knowledge in writing Terraform scripts for multi-cloud deployment using a single configuration.
- Extensive experience in working with relational databases like PostgreSQL, MySQL, SQL Server, AWS RDS, Azure SQL Database, and BigQuery.
- Extensive experience in working with NoSQL databases and their integration, including DynamoDB, Cosmos DB, MongoDB, Cassandra, Cloud Datastore, and HBase.
- Hands-on experience in using Hadoop ecosystem components like Hadoop, Hive, Pig, Sqoop, HBase, MapReduce framework, HDFS, and YARN.
- Hands-on experience in using cloud Hadoop ecosystem components like Dataproc, HDInsight, and EMR.
- Good knowledge of the architecture and components of Spark; efficient in working with Spark Core, Spark SQL, and Spark Streaming; expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing and stream processing.
- Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and store the data to HDFS, and expertise in using Spark SQL with various data sources like JSON, Parquet, and CSV.
- Experience in and understanding of implementing large-scale data warehousing programs and end-to-end data integration solutions on Snowflake Cloud.
- Extensive experience in working with micro-batching to ingest millions of files on Snowflake Cloud.
- Hands-on experience in architecting legacy data migration projects such as Teradata to AWS Redshift.
- Worked on ETL by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
- Experienced in building ETL workflows on the Azure platform using Azure Databricks and Data Factory.
- Experience in developing ETL solutions using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats.
- Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling jobs.
- Working knowledge of data migration, data profiling, data cleaning and transformation utilizing a variety of ETL tools such as Pentaho, Talend, Apache NiFi, Oracle Data Integrator, Informatica and SSIS.
- Leveraged AWS KMS and Chef encrypted data bags for encryption and security of credentials such as DB passwords.
- Hands-on experience using Spark MLlib utilities including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying on either public or private cloud.
- Extensive experience in setting up CI/CD pipelines using tools such as Jenkins, Bitbucket, GitHub, Maven, SVN and Azure DevOps.
- Knowledge in data visualization and analytics using Tableau Desktop, Tableau Server, and Power BI Desktop.
- Worked with CloudWatch to monitor AWS cloud resources and the applications deployed on AWS by creating alarms and enabling notification services.
- Capable of working with SDLC, Agile and Waterfall methodologies.
- Excellent communication, interpersonal, and problem-solving skills, and a team player. Able to adapt quickly to new environments and technologies.
TECHNICAL SKILLS
Languages: Python, R, SQL, PySpark, Scala, Java, HiveQL
Databases: SQL Server, SQLite, HBase, MongoDB, Cassandra, Oracle, PostgreSQL, DynamoDB, Cosmos DB
AWS: Glue, Athena, RDS, Redshift, DynamoDB, EC2, EMR, S3, IAM, CloudFormation, Lambda Functions, Step Functions, CloudWatch
Azure: HDInsight, Data Lake, Blob Storage, Data Factory, Synapse, Cosmos DB, Data Warehouse, Key Vault, SQL DB, DevOps.
Big Data Ecosystem: Hadoop, MapReduce, HDFS, Sqoop, Hive, Oozie, Spark, ZooKeeper, Kafka, Flume, Cloudera, Hortonworks
ETL: Airflow, AWS Glue, Azure Data Factory, Informatica, Talend, Apache NiFi, SSIS, Oracle Data Integrator.
IDEs: Eclipse, IntelliJ, VS Code, PyCharm, Jupyter Notebook, Google Colab, SQL Server Management Studio, Notepad++.
Visualization Tools: Tableau, Power BI
Operating Systems: Windows, Linux, Unix, macOS, CentOS.
CI/CD and Containerization: Jenkins, GitLab CI, Docker, Kubernetes.
Methodologies: Software Development Lifecycle (SDLC), Waterfall, Agile
Version control: GitHub, GitLab, Bitbucket
PROFESSIONAL EXPERIENCE
Confidential, St. Louis, MO
Sr. Data Engineer
Responsibilities:
- Involved in building data pipelines to extract, transform, and load the raw data and perform analytics on the transformed data using AWS services S3, Glue, Redshift, EMR, RDS, DynamoDB, Lambda, Athena, and Kinesis, focusing on high availability, fault tolerance, and auto-scaling using AWS CloudFormation.
- Modified existing AWS CloudFormation templates to create custom-sized VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
- Wrote reusable modules using Terraform as infrastructure as code, to provision AWS resources like Glue jobs, EMR cluster, EC2, S3, CloudWatch to safely deploy applications on the cloud.
- Extracted files from a NoSQL database (MongoDB) and processed them using the MongoDB Spark Connector.
- Extensive experience in SQL programming, stored procedure creation and optimization as well as tuning and maintenance of highly available and highly transactional databases.
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Designed and developed ETL processes using AWS Glue to migrate the data collected from external sources like S3, SQL Server, MongoDB, and an SFTP server into AWS Redshift.
- Created Redshift Spectrum external schemas and tables for S3 data on a running Redshift instance to query S3 data from Redshift and load it into other fact and dimension tables, rather than using the COPY command, when the data volume is huge.
- Automated data storage from streaming sources to AWS data lakes like S3, Redshift and RDS by configuring AWS Kinesis (Data Firehose).
- Set up and accessed AWS DynamoDB environments, including creation of tables, on-demand backups, and access control.
- Worked on distributed frameworks such as Apache Spark (using Python) and Presto in Amazon EMR and Redshift, and interacted with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
- Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker, and used Structured Streaming to obtain structured data by schema.
- Developed multiple POCs using PySpark and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL.
- Developed scripts to load data into Hive from HDFS and was involved in ingesting data into the data warehouse using various data loading techniques.
- Worked with Spark using SparkContext, Spark SQL, Java, Scala, PySpark, DataFrame API, Pair RDD, Spark Streaming, MLlib and User Defined Functions to apply transformations on the data.
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Scala.
- Implemented data migration of multistate-level data from SQL Server to Snowflake by using Python, Spark and SnowSQL.
- Designed and developed data warehouse, data marts and business intelligence using multi-dimensional models - star schema and snowflake schema.
- Worked on migrating the data from AWS Redshift data warehouse to Snowflake.
- Developed data ingestion modules (both real-time and batch data load) to load data into various layers in S3, Redshift and Snowflake using AWS Kinesis, AWS Glue, AWS Lambda and AWS Step Functions.
- Utilized AWS Glue crawler to create and to update one or more tables in AWS Glue Data Catalog by editing Scala and Python scripts and to populate the tables by using data from the new S3 Buckets.
- Developed PySpark jobs on Databricks to perform tasks like data cleansing, data validation, applying transformations for creating Datasets as per the use cases for Machine Learning algorithms.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats for analyzing and transforming the data.
- Designed and Implemented the ETL process using Talend Enterprise Big Data Edition to load the data from Source to Target Database.
- Altered the security architecture for Confidential AWS production services and led best-practice creation and implementation around credentials/secrets rotation with AWS Key Management Service.
- Developed microservice onboarding tools leveraging Python and Jenkins, allowing for easy creation and maintenance of build jobs and Kubernetes deployments and services.
- Analyzed business user requirements, analyzed data, and designed software solutions in Tableau Desktop based on the requirements.
- Worked as Tableau Server administrator and Tableau Desktop developer. Responsible for preparing the report design specifications based on users' requirements.
- Developed Python code for different tasks, dependencies, SLA watcher and time sensor for each job for workflow management and automation using Airflow.
- Scheduled all jobs using Airflow scripts written in Python. Also added different tasks to DAGs and dependencies between the tasks.
- Built an ETL framework for data migration from on-premises data sources such as SQL Server, MongoDB, and an SFTP server to AWS using Apache Airflow, Apache Sqoop and Apache Spark (Python).
- Monitored AWS Relational Database Service (RDS) for performance and availability in CloudWatch based on events/metrics such as CPU utilization, database connections, read/write IOPS, and read/write latency.
- Implemented ETL workflows in the AWS cloud using S3 storage and the EMR platform, and automated workflows using AWS Step Functions and notification services like SNS and SQS via event-based triggering using Lambda.
- Used Git version control to manage the source code, integrated Git with Jenkins to support build automation, and integrated with Jira to monitor the commits.
Environment: AWS (S3, Glue, Redshift, EMR, RDS, DynamoDB, Lambda, Athena, Kinesis, KMS), MongoDB, SQL, Python, PySpark, Scala, Airflow, Hive, HDFS, Snowflake, Databricks, Talend, Kubernetes, Tableau, Git, Linux.
Confidential, Dallas
Data Engineer
Responsibilities:
- Designed, developed, and deployed data pipelines using Azure cloud platform services HDInsight, Data Lake, Blob Storage, Data Factory, Synapse, Cosmos DB, Data Warehouse, Key Vault, SQL DB, and DevOps.
- Created performance measurements to monitor resources across Azure using Azure native monitoring tools utilizing ARM templates.
- Responsible for configuring, integrating, and maintaining all Development, QA, Staging and Production PostgreSQL databases within the organization.
- Set up Automatic Tuning on the production database to create indexes based on inheritance from Azure SQL; performance improved sharply, with results returned in a few milliseconds.
- Implemented SSIS packages to run in a SQL Agent job on an on-premises SQL Server and connect to Azure SQL Database using an encrypted connection.
- Used the DataFrame API and Scala API, with Python and Spark SQL, to ingest data into SQL and NoSQL databases in Azure like Azure SQL DB, PostgreSQL, MySQL, Cassandra, and Cosmos DB.
- Configured input and output bindings of an Azure Function with an Azure Cosmos DB collection to read and write data from the container whenever the function executes.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory, Azure Databricks and Spark SQL.
- Extracted data from Azure Data Lake into an HDInsight cluster (Intelligence + Analytics), applied PySpark transformations and actions on the data, and loaded it into HDFS.
- Took part in deploying HDInsight clusters in Azure, and installed, configured, and maintained Apache Hadoop clusters for application development.
- Worked on implementing data integrity and data quality checks in Hadoop using Hive and Linux scripts.
- Involved in loading and transforming large datasets from relational databases into HDFS and vice versa using Sqoop imports and exports.
- Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load back to Azure Synapse.
- Created pipelines, data flows and complex data transformations and manipulations using Azure Data Factory and PySpark with Databricks.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data using Scala to Azure Table.
- Ingested data into Azure Blob Storage and processed the data using Databricks. Involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets.
- Designed and wrote the entire ETL/ELT process to support Data Warehouse with complex dependencies in hybrid Business Intelligence environment (Azure & SQL Server).
- Designed and implemented an Azure Data Factory framework (V2) with error logging to populate data in Azure SQL Data Warehouse from Azure Blob Storage.
- Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing and transforming the data to uncover insights into customer usage patterns.
- Created several Databricks Spark jobs with PySpark to perform several table-to-table operations.
- Verified JSON schema changes in source files and checked for duplicate files in the source location. Worked on creating a query parser script in Python.
- Configured Azure Encryption for Azure Storage and Virtual Machines, Azure Key Vault services to protect and secure the data for cloud applications.
- Closely worked with Artificial Intelligence Team to create and build a Machine learning layer in the final Product.
- Created CI/CD pipelines in Azure DevOps environments by providing their dependencies and tasks, and created end-to-end automation with CI procedures using Jenkins, including automated Maven builds integrated with Jenkins.
- Worked with a Kubernetes pipeline of deployment and operation activities where all code is written in Java and Python and stored in Bitbucket, for staging and testing purposes.
- Designed and developed business intelligence dashboards, analytical reports and data visualizations using Power BI by creating multiple measures using DAX expressions to meet the needs of user groups such as the sales, operations and finance teams.
- Worked with Log Analytics and configured all Azure resources to send logs to specific Log Analytics instances using VDC, ARM templates, and DevOps pipelines.
- Used JIRA for issues and bug tracking and added several options to the application to choose algorithm for data and address generation.
Environment: Azure (HDInsight, Data Lake, Blob Storage, Data Factory, Synapse, Cosmos DB, Data Warehouse, Key Vault, SQL DB, DevOps), Databricks, SQL, Python, PySpark, Scala, Hadoop, Cassandra, Power BI, Java, Bitbucket, Git, Jenkins.
Confidential
Data Engineer
Responsibilities:
- Developed entire frontend and backend modules using Python on Django Web Framework.
- Designed and developed the UI of the website using HTML, AJAX, CSS and JavaScript.
- Worked on Django REST framework and integrated new and existing API endpoints.
- Configured AWS Identity and Access Management (IAM) Groups and Users for improved login authentication.
- Conducted systems design, feasibility and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).
- Involved in developing a RESTful service using the Python Flask framework.
- Designed ETL Process using Informatica to load data from Flat Files, and Excel Files to target Oracle Data Warehouse database.
- Used Informatica PowerCenter Designer to analyze the source data and to extract and transform it from various source systems (Oracle 10g, DB2, SQL Server and flat files) by incorporating business rules using different objects and functions that the tool supports.
- Created and Configured Workflows and Sessions to transport the data to target warehouse Oracle tables using Informatica Workflow Manager.
- Created various transformations according to the business logic like Source Qualifier, Normalizer, Lookup, Stored Procedure, Sequence Generator, Router, Filter, Aggregator, Joiner, Expression and Update Strategy.
- Worked on SQL queries to query the Repository DB to find the deviations from Company's ETL Standards for the objects created by users such as Sources, Targets, Transformations, Log Files, Mappings, Sessions, and Workflows.
- Used lambda functions with filter(), map() and reduce() on pandas DataFrames to perform various operations.
- Involved in the entire lifecycle of the projects including design, development, deployment, testing, implementation and support.
- Built various graphs for business decision making using Python matplotlib library.
- Developed applications, especially in UNIX environments; familiar with UNIX commands.
- Used NumPy for numerical analysis of insurance premiums.
- Handled day-to-day issues and fine-tuned the applications for enhanced performance.
- Experienced working with Agile Methodologies and SCRUM Process.
Environment: Python, Django, MySQL, AWS, Linux, Informatica PowerCenter, HTML, XHTML, CSS, AJAX, JavaScript, ETL, Oracle, NumPy, Pandas, Unix, SDLC, Jira.
Confidential
Data Engineer
Responsibilities:
- Worked on projects involving machine learning, big data, data visualization, Python development, Unix, and SQL.
- Performed exploratory data analysis using NumPy, matplotlib and pandas.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.
- Experience analyzing data with the help of Python libraries including Pandas, NumPy, SciPy and Matplotlib.
- Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear and concise specifications and queries.
- Prepared high-level analysis reports with Excel. Provided feedback on the quality of data, including identification of billing patterns and outliers.
- Identify and document limitations in data quality that jeopardize the ability of internal and external data analysts and
- Wrote standard SQL queries to perform data validation and created Excel summary reports (pivot tables and charts), and gathered analytical data to develop functional requirements using data modeling and ETL tools.
- Read data from different sources like CSV files, Excel, HTML pages and SQL, performed data analysis, and wrote the results to data sources like CSV files, Excel or databases.
- Used the Pandas API for analyzing time series. Created a regression test framework for new code.
- Developed and handled business logic through backend Python code.
- Worked on Django REST framework and integrated new and existing API endpoints.
- Extensive knowledge in loading data into charts using Python code.
- Using Highcharts, passed data and created interactive JavaScript charts for the web application.
- Extensive knowledge in using Python libraries like os, pickle, NumPy and SciPy.
- Involved in using Bitbucket for version control and coordinating with the team.
Environment: Python, HTML5, CSS3, Django, SQL, UNIX, Linux, Windows, Oracle, NoSQL, PostgreSQL, Python libraries, NumPy, Bitbucket.
