Data Engineer Resume
Irving, TX
SUMMARY:
- Senior Data Engineer with over 8 years of experience in data engineering, data pipeline design, development, and implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
- Excelled at driving data-backed projects and preparing dashboards using Tableau. Hands-on experience with common data science toolkits such as Python and R, the data visualization tool Tableau, and ETL tools.
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, boto3, psycopg2, embedPy, NumPy, and Beautiful Soup.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Experienced in big data analysis and developing data models using Hive, Pig, MapReduce, and SQL, with strong data architecture skills for designing data-centric solutions.
- Experience working with data modeling tools like Erwin and ER/Studio.
- Strong experience in writing scripts using the Python, PySpark, and Spark APIs to analyze data.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data per requirements.
- Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
- Expertise in the Amazon Web Services (AWS) cloud platform, including EC2, S3, EBS, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Hands-on with Spark MLlib utilities including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Knowledge of working with proofs of concept (PoCs) and gap analysis; gathered data for analysis from different sources and prepared it for exploration using data munging and Teradata.
- Well versed in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (see the sketch at the end of this summary).
- Expertise in designing complex mappings, performance tuning, and slowly changing dimension and fact tables.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
- Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Good knowledge of data marts, OLAP, and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake modeling for fact and dimension tables).
- Excellent at performing data transfer activities between SAS and various databases and data file formats such as XLS and CSV.
- Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
- Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto HDFS.
- Experienced in developing and supporting Oracle SQL, PL/SQL, and T-SQL queries.
- Experience in designing and implementing data structures and using common business intelligence tools for data analysis.
- Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
- Experience in working with Excel Pivot and VBA macros for various business scenarios.
- Expertise in SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS).
- Experience with Amazon Web Services (AWS) cloud offerings such as S3, EC2, and EMR, and with Microsoft Azure.
- Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage, Azure AD Licenses, Office365)
- Experience with Windows Azure IaaS: Virtual Networks, Virtual Machines, Cloud Services, Resource Groups, ExpressRoute, Traffic Manager, VPN, Load Balancing, Application Gateways, and Auto Scaling.
- Exposure to Apache Kafka for developing log data pipelines as streams of messages using producers and consumers.
- Excellent understanding and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Created Cassandra tables to store data in various formats coming from different sources.
- Extensive knowledge of and experience with real-time data streaming techniques such as Kafka and Spark Streaming.
- Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
- Experienced with data formats such as JSON, Parquet, Avro, RCFile, and ORC.
- Utilized Flume to ingest log files and write them into HDFS.
- Experience in importing and exporting data between HDFS and RDBMSs with Sqoop, and migrating data according to client requirements.
- Used Git and GitHub for version control, pushing and pulling to keep code in sync with the repository.
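A minimal sketch of the kind of Python UDF used to extend Hive, written as a script invoked through Hive's TRANSFORM clause; the column layout and cleaning rules are illustrative assumptions, not a specific production job.

    #!/usr/bin/env python
    # clean_events.py - illustrative Hive TRANSFORM script: reads tab-separated
    # rows from stdin, normalizes the email column, and emits cleaned rows.
    import sys

    for line in sys.stdin:
        user_id, email, amount = line.rstrip("\n").split("\t")
        email = email.strip().lower()                           # normalize case/whitespace
        amount = amount if amount not in ("", "\\N") else "0"   # default missing amounts
        print("\t".join([user_id, email, amount]))

In HiveQL it would be wired in with something like ADD FILE clean_events.py; SELECT TRANSFORM (user_id, email, amount) USING 'python clean_events.py' AS (user_id, email, amount) FROM raw_events; (table and column names are hypothetical).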
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, Zookeeper, MapR, Amazon Web Services (AWS), EMR
Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc
Cloud Technologies: AWS, Azure, Google cloud platform (GCP)
IDE’s: IntelliJ, Eclipse, Spyder, Jupyter
Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble, Auto ML - Scikit-Learn, MLjar, etc.
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBASE
Programming / Query Languages: Java, SQL, R, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala.
Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, GCP, Google Cloud Shell, Linux, PuTTY, Bash Shell, Unix, Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design.
PROFESSIONAL EXPERIENCE:
Data Engineer
Confidential
Responsibilities:
- Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.
- Partnered with ETL developers to ensure that data was well cleaned and the data warehouse stayed up to date for reporting purposes, using Pig.
- Extensive use of embedded SQL database calls in C code.
- Selected and exported data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and loaded the data into AWS Redshift.
- Expertise in analyzing data quality issues using SnowSQL while building analytical warehouses on Snowflake.
- Experience in evaluating Snowflake designs for application changes and building logical and physical data models for Snowflake.
- Hands-on experience in storage, compute, and networking services, with implementation experience in data engineering using key AWS services such as EC2, S3, ELB, EBS, RDS, IAM, EFS, CloudFormation, Redshift, DynamoDB, Glue, Lambda, Step Functions, Kinesis, Route 53, SQS, SNS, SES, and AWS Systems Manager.
- Performed simple statistical profiling of the data, such as cancel rate, variance, skew, and kurtosis of trades and runs for each stock daily, grouped into 1-, 5-, and 15-minute intervals.
- Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse (see the sketch after this section).
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Practical understanding of data modeling (dimensional and relational) concepts like Star Schema Modeling, Snowflake Schema Modeling, and fact and dimension tables.
- Utilized Agile and Scrum methodology for team and project management.
- Used Git for version control with colleagues.
Environment: Spark (PySpark, Spark SQL, Spark MLlib), SQL, Python 3.x (scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift/SQS/SNS, and Pig.
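A minimal sketch of the moving-average and RSI calculation mentioned above, using Pandas; the column names, the 14-period window, and the simplified rolling-mean RSI variant (rather than Wilder's smoothing) are illustrative assumptions.

    import pandas as pd

    def add_indicators(prices: pd.DataFrame, window: int = 14) -> pd.DataFrame:
        """Add a simple moving average and an RSI column per stock symbol.

        Assumes columns ["symbol", "date", "close"]; names are illustrative.
        """
        def per_symbol(g: pd.DataFrame) -> pd.DataFrame:
            g = g.sort_values("date").copy()
            g["moving_avg"] = g["close"].rolling(window).mean()
            delta = g["close"].diff()
            avg_gain = delta.clip(lower=0).rolling(window).mean()
            avg_loss = (-delta.clip(upper=0)).rolling(window).mean()
            g["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
            return g

        return prices.groupby("symbol", group_keys=False).apply(per_symbol)

The same per-symbol logic can be distributed with PySpark (for example via a grouped Pandas UDF) before the results are written to the warehouse.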
Azure Data Engineer
Confidential, Irving, TX
Responsibilities:
- Designed the business requirement collection approach based on the project scope and SDLC methodology.
- Installing, configuring and maintaining Data Pipelines
- Creating pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Working with Data Governance and Data Quality to design various models and processes.
- Involved in all steps and the scope of the project's data approach to MDM; created a data dictionary and source-to-target mappings in the MDM data model.
- Automated data processing with Oozie, including data loading into the Hadoop Distributed File System (HDFS).
- Designing and Developing Oracle PL/SQL and Shell Scripts, Data Import/Export, Data Conversions and Data Cleansing.
- Performing data analysis, statistical analysis, generated reports, listings and graphs using SAS tools, SAS/Graph, SAS/SQL, ANSI SQL, SAS/Connect and SAS/Access.
- Developing Spark applications using Scala and Spark-SQL for data extraction, transformation and aggregation from multiple file formats.
- Using Kafka and integrating it with Spark Streaming.
- Developed data analysis tools using SQL and Python code.
- Authoring Python (PySpark) scripts and custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks (see the sketch after this section).
- Working with relational database systems (RDBMS) such as Oracle and database systems like HBase.
- Using ORC and Parquet file formats on HDInsight, Azure Blobs, and Azure Tables to store raw data.
- Involved in writing T-SQL working on SSIS, SSAS, Data Cleansing, Data Scrubbing and Data Migration.
- Working on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
- Performing PoC for Big data solution using Hadoop for data loading and data querying
- Writing Pig scripts to generate MapReduce jobs and performing ETL procedures on the data in HDFS.
- Using Sqoop to move data between RDBMS sources and HDFS.
- Wrote UDFs in Scala and stored procedures to meet specific business requirements; replaced existing MapReduce programs and Hive queries with Spark applications written in Scala.
- Involved in Normalization and De-Normalization of existing tables for faster query retrieval.
- Hands-on experience with Airflow for orchestration and building custom Airflow operators.
- Proficient in Azure Data Factory and Airflow 1.8/1.10 on multiple cloud platforms, and in leveraging Airflow operators.
- Developed and maintained a data dictionary to create metadata reports for technical and business purposes.
- Developing JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Cosmos activity.
- Experience in designing and developing DataStage jobs to process full data loads from a SQL Server source to an Oracle stage.
- Extensively using MS Access to pull data from various databases and integrate the data.
- Writing HiveQL per requirements, processing data in the Spark engine, and storing it in Hive tables.
- Responsible for importing data from PostgreSQL to HDFS and Hive using Sqoop.
- Experienced in migrating HiveQL into Impala to minimize query response time.
- Responsible for performing extensive data validation using Hive.
- Created Sqoop jobs and Hive scripts for data ingestion from relational databases to compare with historical data.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
Environment: Erwin 9.8, Big Data 3.0, Hadoop 3.0, Oracle 12c, PL/SQL, Scala, Spark SQL, PySpark, Python, Kafka 1.1, SAS, SNS, SQL, MDM, Oozie 4.3, SSIS, T-SQL, ETL, HDFS, Cosmos, Pig 0.17, Sqoop 1.4, MS Access
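A minimal sketch of the kind of PySpark cleansing UDF described above; the column names, paths, and trimming rule are illustrative assumptions rather than the production logic.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("cleansing-sketch").getOrCreate()

    # Hypothetical UDF: trim whitespace and collapse empty strings to None.
    @F.udf(returnType=StringType())
    def clean_text(value):
        if value is None:
            return None
        value = value.strip()
        return value or None

    df = spark.read.parquet("/data/raw/customers")  # illustrative input path
    cleaned = (df
               .withColumn("customer_name", clean_text(F.col("customer_name")))
               .withColumn("load_label", F.lit("daily_batch")))  # simple data labeling
    cleaned.write.mode("overwrite").parquet("/data/conformed/customers")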
Data Engineer
Confidential, Fort Worth, TX
Responsibilities:
- Extensively used Agile methodology as the organization standard to implement the data models.
- Created several types of data visualizations using Python and Tableau.
- Extracted large volumes of data from AWS using SQL queries to create reports.
- Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
- Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models.
- Developed a data pipeline using Kafka to store data into HDFS.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
- Performed Regression testing for Golden Test Cases from State (end to end test cases) and automated the process using python scripts.
- Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying
- Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
- Primarily Responsible for converting Manual Report system to fully automated CI/CD Data Pipeline that ingest data from different Marketing platform to AWS S3 data lake.
- Utilized AWS services with focus on big data analytics, enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility
- Designed AWS architecture, Cloud migration, AWS EMR, DynamoDB, Redshift and event processing using lambda function
- Gathered data from Google AdWords, Apple search ad, Facebook ad, Bing ad, Snapchat ad, Omniture data and CSG using their API.
- Importing existing datasets from Oracle into the Hadoop system using Sqoop.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
- Hands-on experience in importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
- Created Sqoop jobs with incremental load to populate Hive external tables.
- Writing Spark Core programs to process and cleanse data, then loading it into Hive or HBase for further processing.
- Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)
- Used AWS Systems Manager to automate operational tasks across AWS resources.
- Wrote Lambda function code and set a CloudWatch Events rule with a cron expression as the trigger (see the sketch after this section).
- Connected Redshift to Tableau for creating dynamic dashboard for analytics team.
- Set up the connection between S3 and AWS SageMaker (machine learning platform), used for predictive analytics, and uploaded inference results to Redshift.
- Good experience in implementing and orchestrating data pipelines using Oozie and Airflow; worked with Cloudera and Hortonworks distributions.
- Writing UNIX shell scripts to automate jobs and scheduling them as cron jobs via crontab.
- Working with big data technologies such as Spark 2.3 and 3.0 with Scala, Hive, and Hadoop clusters (Cloudera platform).
- Deployed the project on Amazon EMR with S3 connectivity for backup storage.
- Conducted ETL data integration, cleansing, and transformations using AWS Glue Spark scripts.
- Wrote Python modules to extract data from the MySQL source database.
- Worked on Cloudera distribution and deployed on AWS EC2 Instances.
- Migrated highly available web servers and databases to AWS EC2 and RDS with minimal or no downtime.
- Worked with AWS IAM to generate new accounts, assign roles and groups.
- Created Jenkins jobs for CI/CD using Git, Maven and Bash scripting
Environment: AWS, Redshift, PySpark, Cloudera, Hadoop, Spark, Sqoop, MapReduce, Python, Tableau, EC2, EMR, Glue, S3, Kafka, IAM, Azure, PostgreSQL, MySQL, Jenkins, Maven, AWS CLI, Cucumber, Java, Unix, Shell Scripting, Git.
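A minimal sketch of the sort of scheduled Lambda described above: a CloudWatch Events (EventBridge) rule with a cron expression invokes a handler that copies a marketing export into the S3 data lake. Bucket names, prefixes, and the copy logic are illustrative assumptions.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical buckets/prefixes for illustration only.
    SOURCE_BUCKET = "marketing-exports"
    LAKE_BUCKET = "company-data-lake"

    def lambda_handler(event, context):
        """Triggered on a schedule, e.g. cron(0 6 * * ? *) for 06:00 UTC daily."""
        resp = s3.list_objects_v2(Bucket=SOURCE_BUCKET, Prefix="daily/")
        contents = resp.get("Contents", [])
        for obj in contents:
            key = obj["Key"]
            # Copy each exported object into the raw zone of the data lake.
            s3.copy_object(
                Bucket=LAKE_BUCKET,
                Key=f"raw/marketing/{key}",
                CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
            )
        return {"copied": len(contents)}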
Data Engineer
Confidential
Responsibilities:
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Evaluated existing infrastructure, systems, and technologies; provided gap analysis; documented requirements, evaluations, and recommendations for systems, upgrades, and technologies; and created a proposed architecture and specifications.
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, Zookeeper and Sqoop.
- Installed and Configured Sqoop to import and export the data into Hive from Relational databases.
- Administered large Hadoop environments, including cluster setup, support, performance tuning, and monitoring in an enterprise environment.
- Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
- Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
- Integrated HDP clusters with Active Directory and enabled Kerberos for Authentication.
- Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
- Set up alerting and monitoring using Stackdriver in GCP.
- Designed and implemented large-scale distributed solutions in the AWS and GCP clouds.
- Monitoring the Hadoop cluster functioning through MCS and worked on NoSQL databases including HBase.
- Created Hive tables, loaded data, and wrote Hive UDFs; worked with the Linux server admin team to administer the server hardware and operating system.
- Designed, developed, and maintained data integration programs in Hadoop and RDBMS environments with traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.
- Configured Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS (see the sketch after this section).
- Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and Python.
- Experienced in developing Spark scripts for data analysis in both python and Scala.
- Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
- Built on premise data pipelines using Kafka and spark for real-time data analysis.
- Created reports in Tableau for visualization of the data sets created and tested Spark SQL connectors.
- Implemented Hive complex UDF's to execute business logic with Hive Queries
- Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
- Implemented Spark using Scala and utilizing Data frames and Spark SQL API for faster processing of data.
- Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive, and then loading the data into HDFS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Collecting and aggregating large amounts of log data and staging data in HDFS for further analysis.
- Experience in managing and reviewing Hadoop Log files.
- Used Sqoop to transfer data between relational databases and Hadoop.
- Worked on HDFS to store and access huge datasets within Hadoop.
Environment: Hadoop YARN, Spark 1.6, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop 1.4.6, Impala, Tableau, Talend, Oozie, Java, AWS S3, Oracle 12c, Linux.
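A minimal sketch of the Kafka-to-HDFS Spark Streaming flow described above, assuming the DStream-based Kafka integration available in Spark 1.6; the topic name, broker address, batch interval, and output path are illustrative assumptions.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # DStream-based Kafka integration (Spark 1.x)

    sc = SparkContext(appName="kafka-to-hdfs-sketch")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Hypothetical topic and broker for illustration only.
    stream = KafkaUtils.createDirectStream(
        ssc, ["clickstream"], {"metadata.broker.list": "broker1:9092"}
    )

    # Each record is a (key, value) pair; keep the value and land it on HDFS,
    # one directory per micro-batch under the given prefix.
    stream.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/raw/clickstream/batch")

    ssc.start()
    ssc.awaitTermination()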
Data Engineer
Confidential
Responsibilities:
- Participated in testing of procedures and data utilizing PL/SQL to ensure the integrity and quality of data in the data warehouse.
- Gathered data from the help desk ticketing system and wrote ad-hoc reports, charts, and graphs for analysis.
- Worked to ensure high levels of data consistency between diverse source systems including flat files, XML, and SQL databases.
- Developed and ran ad-hoc data queries against multiple database types to identify systems of record, data inconsistencies, and data quality issues.
- Developed complex SQL statements to extract data, and packaged/encrypted the data for delivery to customers (see the sketch after this section).
- Provided business intelligence analysis to decision-makers using an interactive OLAP tool
- Created T-SQL statements (select, insert, update, delete) and stored procedures.
- Defined Data requirements and elements used in XML transactions.
- Created Informatica mappings using various Transformations like Joiner, Aggregate, Expression, Filter and Update Strategy.
- Performed Tableau administering by using tableau admin commands.
- Involved in defining the source to target Data mappings, business rules and Data definitions.
- Ensured the compliance of the extracts to the Data Quality Center initiatives
- Performed metrics reporting, data mining, and trend analysis in the help desk environment using Access.
- Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources.
- Built reports and report models using SSRS to enable end user report builder usage.
- Created Excel charts and pivot tables for the Ad-hoc Data pull.
- Created columnstore indexes on dimension and fact tables in the OLTP database to enhance read operations.
Environment: SQL, PL/SQL, T-SQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP
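A minimal sketch of the kind of extract-and-package step described above, shown here in Python with pyodbc for consistency with the rest of this document; the DSN, query, and output layout are illustrative assumptions (the role itself relied on SQL/T-SQL and Informatica).

    import csv
    import pyodbc

    # Hypothetical connection string and query for illustration only.
    CONN_STR = "DSN=warehouse;UID=report_user;PWD=***"
    QUERY = """
        SELECT ticket_id, opened_date, status, category
        FROM helpdesk.tickets
        WHERE opened_date >= ?
    """

    def extract_to_csv(since: str, out_path: str) -> int:
        """Run the extract query and write the result set to a CSV for delivery."""
        with pyodbc.connect(CONN_STR) as conn:
            cursor = conn.cursor()
            cursor.execute(QUERY, since)
            columns = [col[0] for col in cursor.description]
            rows = cursor.fetchall()
            with open(out_path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(columns)   # header row
                writer.writerows(rows)     # extracted data
        return len(rows)

    # Example usage: extract_to_csv("2015-01-01", "helpdesk_extract.csv")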