Sr Data Engineer Resume
Chandler, AZ
OBJECTIVE
- To contribute to a demanding workplace by applying my efficiency, intellect, and software engineering skills as an adept IT professional with 7+ years of IT experience in Data Warehousing/Big Data, including Big Data ecosystem technologies such as Hadoop, MapReduce, Pig, Hive, and Spark, data visualization, reporting, and data quality solutions.
SUMMARY
- Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, YARN/MRv2, Pig, Hive, HDFS, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, and Zookeeper.
- Experience in installation, configuration, support, and management of Cloudera's Hadoop platform along with CDH3 and CDH4 clusters.
- Experience in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience in implementing Azure data solutions, provisioning storage account, Azure Data Factory, SQL server, SQL Databases, SQL Data warehouse, Azure Data Bricks and Azure Cosmos DB.
- Good understanding of Spark Architecture with Databricks, Structured Streaming. Setting Up AWS and Microsoft Azure with Databricks, Databricks Workspace for Business Analytics, Managing Clusters in Databricks, Managing the Machine Learning Lifecycle.
- Experience in data extraction (extracts, schemas, corrupt-record handling, and parallelized code), transformations and loads (user-defined functions, join optimizations), and production (optimizing and automating Extract, Transform, and Load).
- Good experience in designing cloud-based solutions in Azure by creating Azure SQL databases, setting up elastic pool jobs, and designing tabular models in Azure Analysis Services.
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.
- Experience in using the Stackdriver service and Dataproc clusters in GCP to access logs for debugging, and in building efficient pipelines for moving data between GCP and Azure using Azure Data Factory.
- Experience in building Power BI reports on Azure Analysis Services for better performance compared to direct queries against GCP BigQuery.
- Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
- Experience in using Snowflake Clone and Time Travel and building Snowpipe (a minimal sketch follows at the end of this summary).
- Worked with Matillion, which leverages Snowflake's separate compute and storage resources for rapid transformation and gets the most from Snowflake-specific features such as Alter Warehouse and flattening of Variant, Object, and Array types.
- Experience in working with RIVERY ELT platform which performs data integration, data orchestration, data cleansing, and other vital data functionalities.
- Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises ETL workloads to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
- Experienced with Dimensional modelling, Data migration, Data cleansing, Data profiling, and ETL Processes features for data warehouses.
- Expertise in building CI/CD in an AWS environment using AWS CodeCommit, CodeBuild, CodeDeploy, and CodePipeline, and experience in using AWS CloudFormation, API Gateway, and AWS Lambda to automate and secure infrastructure on AWS.
- Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, Hive, Pig, Sqoop, Job Tracker, Task Tracker, Name Node, and Data Node.
- Expert in designing Parallel jobs using various stages like Join, Merge, Lookup, remove duplicates, Filter, Dataset, Lookup file set, Complex flat file, Modify, Aggregator, XML.
- Good knowledge in database creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server databases.
- Experience in installing, configuring, and administering Hadoop clusters for major Hadoop distributions such as CDH4 and CDH5.
- Involved with the Design and Development of ETL process related to benefits and offers data into the data warehouse from different sources.
- Possess strong documentation and knowledge-sharing skills; conducted data modeling sessions for different user groups, facilitated common data models between different applications, and participated in requirement sessions to identify logical entities.
- Extensive experience in relational Data modeling, Dimensional data modeling, logical/Physical Design, ER Diagrams and OLTP and OLAP System Study and Analysis.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, Data Frame, Pair RDD's and Spark YARN.
- Extensive knowledge and experience in producing tables, reports, graphs, and listings using various procedures and handling large databases to perform complex data manipulations.
- Excellent knowledge in preparing required project documentation and tracking and reporting regularly on the status of projects to all project stakeholders.
- Experience in UNIX shell scripting for processing large volumes of data from varied sources and loading into databases like Teradata.
- Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
- Proficient in data modeling techniques using Star Schema, Snowflake Schema, fact and dimension tables, RDBMS, and physical and logical data modeling for Data Warehouses and Data Marts.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
- Highly skilled in using visualization tools like Tableau, ggplot2, dash, PowerBI, flask for creating dashboards.
- Experience with AngularJS, Node.js, MongoDB, GitHub, Git, Amazon AWS, EC2, S3, and CloudFront.
- Built a SQL reference mapper using regular expressions that successfully mapped over a hundred thousand SQL references inside SQL object source code, SSRS reports, and DTS packages.
- Good experience in developing web applications and implementing Model View Control (MVC) architecture using server-side applications like Django, Flask, and Pyramid.
- Experience in application development using Java, RDBMS, TALEND and Linux shell scripting and DB2.
- Experienced in Software Development Lifecycle (SDLC) using SCRUM, Agile methodologies.
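As an illustration of the Snowflake features listed above (Time Travel, zero-copy cloning, and Snowpipe), the following is a minimal sketch issued through the Python Snowflake connector; the account, credentials, table, and stage names are hypothetical placeholders rather than actual project objects.

```python
# Minimal sketch (not production code): Snowflake Time Travel, zero-copy clone,
# and Snowpipe DDL issued through the Python connector. All object names and
# credentials are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # hypothetical account identifier
    user="etl_user",             # hypothetical credentials
    password="********",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Time Travel: query the table as it looked one hour ago
cur.execute("SELECT COUNT(*) FROM orders AT (OFFSET => -60*60)")
print(cur.fetchone())

# Zero-copy clone: instant sandbox copy without duplicating storage
cur.execute("CREATE OR REPLACE TABLE orders_clone CLONE orders")

# Snowpipe: continuously load JSON files landing in an external stage
# (assumes orders_raw has a single VARIANT column for the raw JSON)
cur.execute("""
    CREATE OR REPLACE PIPE orders_pipe AUTO_INGEST = TRUE AS
    COPY INTO orders_raw
    FROM @orders_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

cur.close()
conn.close()
```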
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop Map Reduce, Impala, HDFS, Hive, Pig, HBase, Flume, Storm, Sqoop, Oozie, Kafka, Spark, and Zookeeper
Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon EMR (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, DynamoDB, Redshift, ECS, QuickSight)
Programming Languages: Python, R, Scala, C++, SAS, Java, SQL, HiveQL, PL/SQL, UNIX shell scripting, Pig Latin
Machine Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, XGBoost, Naïve Bayes, PCA, LDA, K-Means, KNN, Neural Networks
Cloud Technologies: AWS, Azure, Google Cloud Platform Cloud Services (PaaS & IaaS), Active Directory, Application Insights, Azure Monitoring, Azure Search, Data Factory, Key Vault, SQL Azure, Azure DevOps, Azure Analysis Services, Azure Synapse Analytics (DW), Azure Data Lake, AWS Lambda
Databases: Snowflake, MySQL, Teradata, Oracle, MS SQL SERVER, PostgreSQL, DB2
NoSQL Databases: HBase, Cassandra, Mongo DB, DynamoDB and Cosmos DB
Version Control: Git, SVN, Bitbucket
ETL/BI: Informatica, SSIS, SSRS, SSAS, Tableau, Power BI, QlikView, Arcadia, Erwin, Matillion, Rivery
Operating System: Mac OS, Windows 7/8/10, Unix, Linux, Ubuntu
Methodologies: RAD, JAD, UML, System Development Life Cycle (SDLC), Jira, Confluence, Agile, Waterfall Model
PROFESSIONAL EXPERIENCE
Confidential, Chandler, AZ
Sr Data Engineer
Responsibilities:
- Consult leadership/stakeholders to share design recommendations and thoughts to identify product and technical requirements, resolve technical problems and suggest Big Data based analytical solutions.
- Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Implemented Azure Data Lake, Azure Data Factory, and Azure Databricks to move and conform data from on-premises to the cloud to serve the company's analytical needs.
- Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insight into Confidential customer usage patterns. Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster; applied the Spark DataFrame API to complete data manipulation within the Spark session.
- Worked on Spark architecture for performance tuning, covering Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors and tasks, deployment modes, the execution hierarchy, fault tolerance, and collection.
- Created Azure Blob and Data Lake storage and loaded data into Azure SQL Synapse Analytics (DW).
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, and to write data back in the reverse direction.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL Activity, and created UNIX shell scripts for database connectivity and for executing queries in parallel job execution.
- Collected and aggregated large amounts of web log data from different sources such as webservers, mobile and network devices using Apache Flume and stored the data into HDFS for analysis.
- Implemented Apache Sqoop for efficiently transferring bulk data between Apache Hadoop and relational databases (Oracle) for product level forecast. Extracted the data from Teradata into HDFS using Sqoop.
- Controlled and granted database access and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Worked on the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases; extracted the real-time feed using Kafka and Spark Streaming, converted it to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS (a minimal sketch follows at the end of this section).
- Analyzed existing systems and proposed improvements in processes and systems through the use of modern scheduling tools like Airflow, migrating the legacy systems into an enterprise data lake built on the Azure cloud.
- Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications. Worked on various automation tools such as Git, Terraform, and Ansible.
- Created a data pipeline package to move data from Blob Storage to a MySQL database and executed MySQL stored procedures using events to load data into tables.
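A minimal PySpark sketch of the Kafka-to-HDFS flow described above, shown with Structured Streaming rather than the original RDD/DStream approach; the broker, topic, event schema, and paths are hypothetical placeholders.

```python
# Minimal sketch (not the original job): consume a Kafka feed with Spark
# Structured Streaming and land it as Parquet in HDFS. Broker, topic, schema,
# and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Hypothetical event schema for the incoming JSON messages
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
       .option("subscribe", "usage_events")                  # hypothetical topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers bytes; cast the value to string and parse the JSON payload
events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

# Append each micro-batch as Parquet files in HDFS with checkpointing
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/usage_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/usage_events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```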
Environment: Databricks, Azure Synapse, Cosmos DB, ADF, SSRS, Power BI, Azure Data Lake, ARM, Azure HDInsight, Blob storage, Apache Spark, Azure ADF V2, ADLS, Spark SQL, Python/Scala, Ansible scripts, Azure SQL DW (Synapse), Azure SQL DB
Confidential, Scottsdale, AZ
Sr Data Engineer
Responsibilities:
- Designed and set up an Enterprise Data Lake to support various use cases including analytics, processing, storing, and reporting of voluminous, rapidly changing data.
- Responsible for maintaining quality reference data in source by performing operations such as cleaning, transformation and ensuring Integrity in a relational environment by working closely with the stakeholders & solution architect.
- Constructed AWS data pipelines using various AWS resources, including AWS API Gateway to receive responses from AWS Lambda functions that retrieve data from Snowflake and convert the response into JSON format, with Snowflake, DynamoDB, AWS Lambda, and AWS S3 as the underlying services (a minimal Lambda sketch follows at the end of this section).
- Developed and implemented data acquisition jobs using Scala, implemented with Sqoop, Hive, and Pig, optimizing MR jobs to use HDFS efficiently through various compression mechanisms with the help of Oozie workflows.
- Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR and Redshift.
- Designed and Developed Spark workflows using Scala for data pull from AWS S3 bucket and Snowflake applying transformations on it.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage, Experienced in Maintaining the Hadoop cluster on AWS EMR.
- Analyzed large and critical datasets using Cloudera, HDFS, MapReduce, Hive, Hive UDF, Pig, Sqoop and Spark.
- Used Git version control to manage the source code, integrated Git with Jenkins to support build automation, and integrated with Jira to monitor commits.
- Wrote Terraform scripts to automate AWS services including ELB, CloudFront distributions, RDS, EC2, database security groups, Route 53, VPC, subnets, security groups, and S3 buckets, and converted existing AWS infrastructure to AWS Lambda deployed via Terraform and AWS CloudFormation.
- Worked on Snowflake schemas and data warehousing and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake in the Confidential AWS S3 bucket.
- Responsible for the design, development, and administration of complex T-SQL queries (DDL/DML), stored procedures, views, and functions for transactional and analytical data structures.
- Developed Hive queries for analysts by loading and transforming large sets of structured and semi-structured data using Hive. Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.
- Involved in migrating tables from RDBMS into Hive tables using SQOOP and later generated data visualizations using Tableau.
- Collaborated with Data engineers and operation team to implement ETL process, Snowflake models, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Implemented AWS EC2, Key Pairs, Security Groups, Auto Scaling, ELB, SQS, and SNS using the AWS API and exposed them as RESTful web services.
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Scala.
- Interfacing with business customers, gathering requirements and creating data sets/data to be used by business users for visualization.
- Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near real-time log analysis and monitoring of end-to-end transactions.
- Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
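A minimal sketch of the API Gateway → Lambda → Snowflake pattern described above; the environment variables, warehouse, table, and query are hypothetical placeholders, and in practice the credentials would come from a secrets store.

```python
# Minimal sketch (not the original code): a Lambda handler behind API Gateway
# that pulls rows from Snowflake and returns them as JSON. The environment
# variables, warehouse, table, and query are hypothetical placeholders.
import json
import os

import snowflake.connector


def lambda_handler(event, context):
    conn = snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],      # hypothetical env vars; in practice
        user=os.environ["SF_USER"],            # these would come from a secrets store
        password=os.environ["SF_PASSWORD"],
        warehouse="REPORTING_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute("SELECT product_id, forecast_qty FROM product_forecast LIMIT 100")
        rows = [{"product_id": pid, "forecast_qty": float(qty)}
                for pid, qty in cur.fetchall()]
    finally:
        conn.close()

    # Response shape expected by API Gateway's Lambda proxy integration
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(rows),
    }
```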
Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau
Confidential, Newport, RI
Big Data Engineer
Responsibilities:
- Involved in various phases of Software Development Lifecycle (SDLC) of the application, like gathering requirements, design, development, deployment, and analysis of the application.
- Worked on creating MapReduce programs to parse the data for claim report generation and running the JARs in Hadoop; coordinated with the Java team in creating MapReduce programs (a minimal streaming sketch in Python follows at the end of this section).
- Designed and Developed Spark Workflows using Scala for data pull from AWS S3 bucket and snowflake applying transformations on it.
- Defined, designed, and developed Java applications, especially using Hadoop MapReduce, by leveraging frameworks such as Cascading and Hive.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from SQL into HDFS using Sqoop.
- Developed analytical components using Scala, Spark, Apache Mesos, and Spark Streaming; installed Hadoop, MapReduce, and HDFS and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Worked on Big Data integration and analytics based on Hadoop, SOLR, Spark, Kafka, Storm, and webMethods.
- Worked on CI/CD tools like Jenkins and Docker in the DevOps team, setting up the application process end to end using Deployment for lower environments and Delivery for higher environments, with approvals in between.
- Integrated Hadoop with Oracle to load and then cleanse raw unstructured data in Hadoop ecosystem to make it suitable for processing in Oracle using stored procedures and functions.
- Developed workflow using Oozie for running MapReduce jobs and Hive Queries.
- Responsible for loading data from the BDW Oracle database and Teradata into HDFS using Sqoop.
- Implemented AJAX, JSON, and JavaScript to create interactive web screens.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries.
- Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
- Created session beans and controller servlets for handling HTTP requests from Talend. Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
- Developed end to end ETL batch and streaming data integration into Hadoop (MapR), transforming data.
- Developed custom UDFs in Pig Latin using Python scripts to extract the data from sensor devices output files to load into HDFS.
- Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation using PySpark.
- Designed and developed a Java API (Commerce API) which provides functionality to connect to Cassandra through Java services.
- Successfully designed and developed a Java multi-threading based collector, parser, and distributor process, when the requirement was to collect, parse, and distribute data arriving at thousands of messages per second.
- Used Pig as an ETL tool to do transformations with joins and pre-aggregations before storing the data onto HDFS, and assisted the manager by providing automation strategies, Selenium/Cucumber automation, and JIRA reports.
- Worked on the Java Message Service (JMS) API for developing a message-oriented middleware (MOM) layer for handling various asynchronous requests.
- Performed Data Engineering including Glue Sync of Semantic Layers, Data Cleansing, Data joins and calculations based on the User Stories defined
- Implemented Google BigQuery to add a data layer between Google Analytics and Power BI. A lot of web behavior data was tracked in Google Analytics and needed to be pulled into a BI system for better reporting; with the native Power BI connector for GA, we were getting sampled data that did not give accurate results.
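A minimal Hadoop Streaming sketch in Python in the spirit of the claim-report MapReduce jobs described above (the original jobs were written in Java); the tab-delimited record layout and claim fields are hypothetical.

```python
# Minimal sketch (the original jobs were Java MapReduce): a claim-report
# aggregation expressed as a Hadoop Streaming job in Python. The tab-delimited
# record layout (claim_id, claim_type, amount) is hypothetical.
import sys


def mapper():
    # Emit (claim_type, amount) for each well-formed claim record on stdin
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                      # skip malformed records
        claim_type, amount = fields[1], fields[2]
        try:
            print(f"{claim_type}\t{float(amount)}")
        except ValueError:
            continue


def reducer():
    # Input arrives sorted by key, so totals can be accumulated per claim_type
    current_type, total = None, 0.0
    for line in sys.stdin:
        claim_type, amount = line.rstrip("\n").split("\t")
        if claim_type != current_type:
            if current_type is not None:
                print(f"{current_type}\t{total}")
            current_type, total = claim_type, 0.0
        total += float(amount)
    if current_type is not None:
        print(f"{current_type}\t{total}")


# Submitted with the standard streaming jar, e.g.:
#   hadoop jar hadoop-streaming.jar -files claims_mr.py \
#     -mapper "python claims_mr.py map" -reducer "python claims_mr.py reduce" \
#     -input /claims/raw -output /claims/report
if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)()
```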
Environment: Hadoop, HDFS, Map Reduce, Hive, Pyspark, Flume, ETL, AWS, Oozie, Sqoop, Oracle, PIG, Eclipse, MySQL, Java
Confidential
Data Analyst
Responsibilities:
- Consulted with application development business analysts to translate business requirements into data design requirements used for driving innovative data designs that meet business objectives.
- Involved in information-gathering meetings and JAD sessions to gather business requirements, deliver business requirements document and preliminary logical data model.
- Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3.
- Compared data at the leaf level across various databases when data transformation or data loading took place, analyzing and investigating data quality after these types of loads.
- As part of data migration, wrote many SQL scripts to identify data mismatches and worked on loading history data from Teradata SQL into Snowflake.
- Developed SQL scripts to Upload, Retrieve, Manipulate, and handle sensitive data in Teradata, SQL Server Management Studio and Snowflake Databases for the Project.
- Used Git, GitHub, and Amazon EC2, with deployment using Heroku; used extracted data for analysis and carried out various mathematical operations for calculation purposes using the Python libraries NumPy and SciPy.
- Incorporated predictive modelling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations, and integrated with the Tableau viz.
- Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
- Cleansed, mapped, and transformed data; created the job stream; and added and deleted components in the job stream on Data Manager based on requirements.
- Developed Teradata SQL scripts using RANK functions to improve the query performance while pulling the data from large tables.
- Used normalization methods up to 3NF and de-normalization techniques for effective performance in OLTP and OLAP systems. Generated DDL scripts using the forward engineering technique to create objects and deploy them into the database.
- Used Star Schema methodologies extensively in building and designing the logical data model into dimensional models.
- Designed and deployed reports with Drill Down, Drill Through and Drop-down menu option and Parameterized and Linked reports using Tableau.
- Worked with data compliance teams, Data governance team to maintain data models, Metadata, Data Dictionaries; define source fields and its definitions.
- Conducted Statistical Analysis to validate data and interpretations using Python and R, as well as presented Research findings, status reports and assisted with collecting user feedback to improve the processes and tools.
- Applied concepts of probability, distributions, and statistical inference on the given dataset to unearth interesting findings through the use of comparisons, t-tests, F-tests, R-squared, p-values, etc. (a minimal SciPy sketch follows at the end of this section).
- Reported and created dashboards for Global Services & Technical Services using SSRS, Oracle BI, and Excel. Deployed Excel VLOOKUP, PivotTable, and Access Query functionalities to research data issues.
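A minimal SciPy sketch of the kind of statistical validation described above; the sample arrays stand in for real campaign metrics and are illustrative only.

```python
# Minimal sketch: two-sample t-test and a simple R-squared check with SciPy/NumPy.
# The arrays below are illustrative stand-ins for real campaign metrics.
import numpy as np
from scipy import stats

# Hypothetical conversion rates for two campaign variants
variant_a = np.array([0.031, 0.028, 0.035, 0.030, 0.033, 0.029])
variant_b = np.array([0.037, 0.041, 0.036, 0.039, 0.042, 0.038])

# Welch's two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Simple linear fit (e.g., spend vs. conversions) with R-squared from the fit
spend = np.array([100, 150, 200, 250, 300, 350], dtype=float)
conversions = np.array([12, 17, 21, 27, 31, 36], dtype=float)
slope, intercept, r_value, p_fit, std_err = stats.linregress(spend, conversions)
print(f"R-squared = {r_value ** 2:.3f}, p = {p_fit:.4f}")
```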
Environment: Informatica PowerCenter v8.6.1, PowerExchange, IBM Rational Data Architect, MS SQL Server, Teradata, PL/SQL, IBM Control Center, TOAD, Microsoft Project Plan, Repository Manager, Workflow Manager, ERWIN 3.0, Oracle 10g/9i, UNIX, and shell scripting