Sr Data Engineer Resume
Nashville, TN
SUMMARY
- Over 8 years of IT experience across a variety of industries working on Big Data using the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
- Hands-on experience with programming languages such as Python, PySpark, and Scala, and query languages such as SQL and PL/SQL.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
- Experience in Extracting, Transforming, and Loading (ETL) data from various sources into data warehouses, as well as in collecting, aggregating, and moving data using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
- Hands-on experience with Hadoop architecture and its components, including the Hadoop Distributed File System (HDFS), Job Tracker, Task Tracker, Name Node, Data Node, and Hadoop MapReduce programming.
- Experience in handling the Python and Spark contexts when writing PySpark programs for ETL.
- Seasoned in Machine Learning algorithms and Predictive Modeling, including Linear Regression, Logistic Regression, Naïve Bayes, Decision Trees, Random Forests, KNN, Neural Networks, and K-means Clustering.
- Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
- Hands-on experience with Amazon Redshift, EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
- Used IDEs such as Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
- Experience in using Databricks for big data processing using technologies such as Apache Spark and Delta Lake.
- Experience in integrating Databricks with other cloud services such as AWS, Azure, and Google Cloud Platform.
- Proficient in Snowflake data warehousing, data engineering, and data analysis, and in setting up and configuring Snowflake accounts, warehouses, databases.
- Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
- Worked on Dimensional Data Modeling with Star and Snowflake schemas and Slowly Changing Dimensions (SCDs).
- Experience working with NoSQL databases like Cassandra and HBase, and developed real-time read/write access to very large datasets via HBase.
- Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
- Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.
- Used SQL concepts, Presto SQL, Hive SQL, Python (Pandas, NumPy, SciPy, Matplotlib), and PySpark to cope with increasing data volumes.
- Comprehensive experience in developing simple to complex MapReduce and Streaming jobs using Scala and Python for data cleansing, filtering, and aggregation, with detailed knowledge of the MapReduce framework.
- Carried out data transformation and cleansing using SQL queries, Python, and PySpark.
- Adept at configuring and installing Hadoop/Spark Ecosystem Components.
- Solid experience and understanding of implementing large-scale data warehousing programs and end-to-end data integration solutions on Snowflake Cloud and AWS Redshift.
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Experience working with GitHub/Git 2.12 source and version control systems.
TECHNICAL SKILLS
Languages: SQL, PL/SQL, Python, PySpark, Scala, Unix, Linux
Data Modeling Tools: ERwin, Power Designer, MS Visio, ER Studio.
ETL Tools: AWS Redshift, Alteryx, Informatica PowerCenter.
Big Data: HDFS, MapReduce, Spark, Airflow, YARN, NiFi, HBase, Hive, Pig, Flume, Sqoop, Kafka, Oozie, Hadoop, ZooKeeper, Spark SQL.
Concepts and Methods: Business Intelligence, Data Warehousing, Data Modeling, Requirement Analysis
RDBMS: Oracle 9i/10g/11g/12c, Teradata, MySQL, MS SQL Server
NoSQL: DynamoDB, HBase, Cassandra
Cloud Platform: AWS (Amazon Web Services), Microsoft Azure
Other Tools: Azure Databricks, Azure Data Explorer, Azure HDInsight, Tableau, Snowflake, Airflow, Kafka
PROFESSIONAL EXPERIENCE
Sr Data Engineer
Confidential, Nashville, TN
Responsibilities:
- Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancing, and Auto Scaling groups; optimized volumes and EC2 instances.
- Wrote Terraform templates for AWS Infrastructure as Code to build staging and production environments, and set up build automation with Jenkins.
- Collected data from an AWS S3 bucket using Spark Streaming in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Designed and Developed Real Time Stream Processing Application using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
- Developed automated regression scripts in Python to validate the ETL process across multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
- Utilized S3 bucket and Glacier for storage and backup on AWS.
- Used Amazon Identity and Access Management (IAM) to create groups and permissions so users could work collaboratively.
- Implemented and set up the continuous build and deployment delivery process using Subversion, Git, Jenkins, IIS, and Tomcat.
- Day-to-day responsibilities included developing ETL pipelines in and out of the data warehouse and building major regulatory and financial reports using advanced SQL queries in Snowflake.
- Implemented a one-time migration of multistate-level data from SQL Server to Snowflake using Python and SnowSQL.
- Subscribed to Kafka topics with a Kafka consumer client and processed the events in real time using Spark; designed a Kafka producer client using Confluent Kafka and produced events into Kafka topics (a minimal sketch of this streaming pattern follows this list).
- Designed, developed, and implemented ETL pipelines using the Spark Python API (PySpark) on AWS EMR.
- Developed Python, PySpark, and SQL scripts on Databricks to implement various transformation logic per business requirements, and built end-to-end data pipelines on Databricks.
- Used MySQL as the backend database and Python's MySQLdb module as the database connector to interact with the SQL server.
- Involved in Agile Development process (Scrum and Sprint planning).
- Involved in various sectors of business, with in-depth knowledge of the SDLC (System Development Life Cycle) across all phases of Agile (Scrum) and Waterfall.
- Created Airflow jobs in Python to automate processing of large data sets on a fixed schedule (see the Airflow DAG sketch after this list).
- Worked on functional programming using Python and Scala.
- Designed AWS cloud migration covering AWS EMR, DynamoDB, and Redshift, plus event processing using Lambda functions.
- Analyzed the existing data flow to the warehouses and took a similar approach to migrate the data into HDFS.
- Derived valuable insights from datasets through statistics and creative visualization; increased data quality and operational efficiency by designing data pipelines, databases, processes, programs, and dashboards.
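Illustrative only: a minimal PySpark Structured Streaming sketch of the Kafka-consumer pattern described above. The broker address, topic name, event schema, and HDFS paths are placeholders, not the actual project values.

    # Minimal sketch: consume a Kafka topic with Spark Structured Streaming,
    # parse the JSON payload, and persist the result to HDFS as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("learner-events-stream").getOrCreate()

    # Hypothetical event schema for the learner data model
    schema = (StructType()
              .add("learner_id", StringType())
              .add("course_id", StringType())
              .add("score", DoubleType()))

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
              .option("subscribe", "learner-events")             # placeholder topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Checkpointed Parquet sink so the stream can recover after restarts
    (events.writeStream
           .format("parquet")
           .option("path", "hdfs:///data/learner_events")
           .option("checkpointLocation", "hdfs:///checkpoints/learner_events")
           .start()
           .awaitTermination())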
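Also illustrative: a minimal Airflow DAG of the kind referenced above for scheduling a PySpark ETL job. The DAG id, schedule, and script path are assumptions, not the production values.

    # Minimal Airflow DAG sketch: run a PySpark ETL script every night at 02:00.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="nightly_pyspark_etl",          # placeholder DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 2 * * *",         # daily at 02:00
        catchup=False,
    ) as dag:
        run_etl = BashOperator(
            task_id="run_pyspark_etl",
            bash_command="spark-submit /opt/etl/transform.py",  # placeholder script path
        )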
Environment: Python, PySpark, AWS, ETL, Airflow, SQL, UNIX, NoSQL, Sqoop, Pig, MapReduce, Spark MLlib.
Big Data Engineer
Confidential, Sunnyvale, CA
Responsibilities:
- Built data pipelines in Airflow for ETL-related jobs using different Airflow operators.
- Built Power BI reports on Azure Analysis Services for better performance.
- Implemented an ETL framework providing features such as Master Data Management, ETL restart capability, a security model, and version control.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed a PySpark script to protect raw data by applying hashing algorithms to client-specified columns (a minimal sketch of this approach follows this list).
- Responsible for design, development, and testing of the database; developed stored procedures, views, and triggers.
- Created ETL scripts for ad-hoc requests to retrieve data from analytics sites.
- Created tables and stored procedures and extracted data using T-SQL for business users as needed; designed and developed the ETL framework for the Lucas, Confidential, and Star Wars digital data hub (Pentaho Data Integration, Talend Data Integration).
- Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
- Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
- Coordinated with the Data Science team to design and implement advanced analytical models over large datasets in the Hadoop cluster; wrote Hive SQL scripts to create complex tables with performance features such as partitioning, clustering, and skew handling.
- Created a POC utilizing ML models and Cloud ML for table quality analysis in the batch process.
- Knowledge of Cloud Dataflow and Apache Beam.
- Good knowledge of using Cloud Shell for various tasks and deploying services.
- Expertise in designing and deploying Hadoop clusters and various Big Data analytics tools, including Pig, Hive, Sqoop, and Apache Spark, with the Cloudera distribution.
- Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
- Created a data model that correlates all the metrics and produces valuable output.
- Tuned SQL queries to reduce run time by working on indexes and execution plans.
- Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and uploading it to the data warehouse servers.
- Performed data pre-processing using Hive and Pig.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
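A minimal sketch of the column-hashing approach mentioned above, assuming SHA-256 via PySpark's built-in sha2 function; the source and target paths and the column names are illustrative placeholders.

    # Minimal sketch: hash client-specified columns with SHA-256 before publishing.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import sha2, col

    spark = SparkSession.builder.appName("mask-sensitive-columns").getOrCreate()

    raw = spark.read.parquet("hdfs:///data/raw/customers")   # placeholder source path

    sensitive_cols = ["ssn", "email", "phone"]                # placeholder column list

    masked = raw
    for c in sensitive_cols:
        # Replace each sensitive value with its SHA-256 digest
        masked = masked.withColumn(c, sha2(col(c).cast("string"), 256))

    masked.write.mode("overwrite").parquet("hdfs:///data/masked/customers")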
Environment: Python, PySpark, SQL, Azure, Power BI, Hadoop, MS SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), SQL Server Management Studio (SSMS), Advanced Excel, ETL, Tableau.
AWS Data Engineer
Confidential, Dallas, TX
Responsibilities:
- Developed MapReduce programs to parse the raw data, populate tables and store the refined data in partitioned tables.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Implemented AWS EC2, RDS, S3, Redshift, CloudTrail, Route 53, etc., and worked with various Hadoop tools such as Hive, Pig, Sqoop, Oozie, HBase, Flume, and PySpark.
- Automated HQL generation, Hive table creation, and loading of data into Hive tables using Apache NiFi and Oozie.
- Created Hive Tables, loaded transactional data from Teradata using Sqoop.
- Developed MapReduce (Yarn) jobs for cleaning, accessing and validating the data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, with good experience using Spark Streaming.
- Analyzed the SQL scripts and designed the solution for implementation in PySpark.
- Responsible for developing a data pipeline on AWS to extract data from weblogs and store it in HDFS.
- Worked with Apache NiFi as an ETL tool for batch and real-time processing.
- Uploaded streaming data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Performed various transformations and storage in the Hadoop architecture using HDFS, MapReduce, and PySpark.
- Worked and learned a great deal from Amazon Web Services (AWS) Cloud services like EC2, S3, and EMR.
- Used Apache Spark and Scala to find patients with similar symptoms in the past and the medications used for them to achieve results.
- Supported data analysis projects using Elastic MapReduce on the AWS cloud and performed export and import of data to and from S3.
- Implemented KBB's Big Data ETL processes in AWS using Hive, Spark, AWS Lambda, S3, EMR, Data Pipeline, EC2, Redshift, Athena, SNS, IAM, and VPC.
- Implemented AWS cost savings by writing Lambda functions to automatically spin up and shut down Redshift clusters (a minimal sketch follows this list).
- Integrated Kafka with PySpark Streaming for real time data processing.
- Worked with cloud provisioning team on a capacity planning and sizing of the nodes (Master and Slave) for an AWS EMR Cluster.
- Exported the analyzed data to the RDBMS using Sqoop to generate reports for the BI team.
- Worked collaboratively with all levels of business stakeholders to implement and test Big Data based analytical solution from disparate sources.
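A minimal sketch of the cost-saving Lambda idea described above, using boto3's pause/resume calls for Redshift; the cluster identifier and event shape are assumptions, and an actual deployment would also need a schedule trigger (e.g. EventBridge) and appropriate IAM permissions.

    # Minimal AWS Lambda sketch: pause or resume a Redshift cluster on demand.
    import boto3

    redshift = boto3.client("redshift")

    def lambda_handler(event, context):
        cluster_id = event.get("cluster_id", "analytics-cluster")  # placeholder identifier
        action = event.get("action", "pause")                      # "pause" or "resume"

        if action == "resume":
            redshift.resume_cluster(ClusterIdentifier=cluster_id)
        else:
            redshift.pause_cluster(ClusterIdentifier=cluster_id)

        return {"cluster": cluster_id, "action": action}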
Environment: AWS, Python, PySpark, Kafka, Hadoop, HDFS, Teradata r15, Sqoop, Linux, YARN, MapReduce, Pig, SQL, Oozie, RDBMS.
Data Engineer
Confidential, St Louis, MO
Responsibilities:
- Involved in requirement gathering/analysis, design, development, testing, and production rollout of reporting and analysis projects.
- Analyzed the metric dashboard reports, identified their formulas and functionality, and digitized the metric dashboards into the Tableau application.
- Published the dashboard reports to Tableau Server so the developed dashboards could be navigated on the web.
- Scheduled the published dashboards on Tableau Server on a weekly basis.
- Sent the dashboards to users by email via subscriptions, with help from the admin team.
- Performance tuned reports by creating linked universes, joins, contexts, and aliases to resolve loops, and checked the integrity of the universes using the Business Objects Designer module during development.
- Involved in integrating Tableau with AngularJS to enable self-service functionality on dashboards.
- Gave demos to users on Tableau Desktop for development.
- Created Tableau worksheets involving schema import and implementing business logic through customization.
- Created Data Connections, published on Tableau Server for usage with Operational/Monitoring Dashboards.
- Administered user, user groups, and scheduled instances for reports in Tableau.
- Built complex formulas in Tableau for various business calculations.
- Resolved various performance issues and analyzed the best process distribution for different projects.
- Provided support to Tableau users and wrote custom SQL to support business requirements.
Environment: Tableau, Python, SQL, Oracle, SSIS, MySQL, Microsoft Office Suite
SQL Developer
Confidential, Dublin, CA
Responsibilities:
- Participated in JAD sessions with business users and SMEs for a better understanding of the reporting requirements.
- Designed and developed the end-to-end ETL process from various source systems to the staging area, and from staging to data marts.
- Analyzed the source data to assess data quality using Talend Data Quality.
- Broad design, development, and testing experience with Talend Integration Suite, and knowledge of performance tuning of mappings.
- Developed jobs in Talend Enterprise Edition across the source, stage, intermediate, conversion, and target layers.
- Involved in writing SQL queries and used joins to access data from Oracle and MySQL.
- Developed Talend jobs to populate the claims data into the data warehouse (star schema).
- Loaded and unit tested the mappings.
- Created Context Variables and Groups to run Talend jobs against different environments
- Experienced in writing expressions within tMap per business needs.
- Handled insert and update strategies using tMap; used ETL methodologies and best practices to create Talend ETL jobs.
- Extracted data from flat files and databases, applied business logic, and loaded the data into the staging database as well as flat files.
Environment: ETL, Talend 5.5, Oracle 11g, Teradata SQL Assistant, HDFS, MS SQL Server 2012/2008, PL/SQL, Agile Methodology, TOAD, ERwin, AIX, Shell Scripts, AutoSys, SVN
