Sr. Data Engineer Resume
Columbus, OH
SUMMARY
- Over 8 years of experience as a Data Engineer, Data Analyst and Big Data professional, with data modeling (logical and physical) and implementation of business applications using the Oracle Relational Database Management System (RDBMS).
- Strong experience in analysis, design, development, testing and implementation of database applications in client/server environments using Oracle 12c/11g/10g/9i/8i, SQL, SQL*Loader and Open Interface.
- Experienced in database conversion from Oracle and SQL Server to RESTSQL and MySQL.
- Developed reports using SSAS on SQL Server (2000/2005/2008). Sound experience and understanding of SSAS, OLAP cubes and architecture.
- Extensive knowledge of Client/Server technology, GUI design, Relational Database Management Systems (RDBMS), and Rapid Application Development methodology.
- Extensively worked in PL/SQL for creating stored procedures, clusters, packages, database triggers, exception handlers, cursors, cursor variables.
- In-depth understanding of monitoring/auditing tools in AWS such as CloudWatch and CloudTrail.
- Expert understanding of AWS DNS services through Route 53, including Simple, Weighted, Latency, Failover and Geolocation routing policies.
- Hands-on experience in installing, configuring, monitoring, and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Zookeeper, Hortonworks and Flume.
- Expert in Amazon EMR, Spark, Kinesis, S3, Boto3, Elastic Beanstalk, ECS, CloudWatch, Lambda, ELB, VPC, ElastiCache, DynamoDB, Redshift, RDS, Athena, Zeppelin and Airflow.
- Experience in handling, configuration, and administration of databases like MySQL and NoSQL databases like MongoDB and Cassandra.
- Experience working with Azure Data Factory, the SQL API and the MongoDB API, and integrating data from MongoDB, MS SQL and cloud sources (Blob, Azure SQL DB). Experience in creating separate virtual data warehouses with different size classes in AWS Aurora.
- Worked on Data Virtualization using Teiid and Spark, RDF graph Data and Fuzzy Algorithm.
- Strong knowledge of Massively Parallel Processing (MPP) databases, where data is partitioned across multiple servers or nodes, each with its own memory and processors to process data locally.
- Data modeling and database development for OLTP and OLAP (star schema, snowflake schema, data warehouses, data marts, multi-dimensional modeling and cube design), business intelligence and data mining.
- Extensively used SQL, NumPy, Pandas, Scikit-learn, Spark, Hive for Data Analysis and Model building.
- Developed and maintained multiple Power BI dashboards/reports and content packs.
- Created Power BI visualizations and dashboards as per the requirements.
- Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB and ElastiCache (Memcached & Redis).
- Responsible for designing and building a data lake using Hadoop and its ecosystem components.
- Working experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka, and built Spark DataFrames using Python.
- Used AWS Lambda to develop APIs to manage servers and run code in AWS.
- Experience with ETL workflow management tools like Apache Airflow and significant experience writing Python scripts to implement workflows.
- Experience in working with databases like MongoDB, MySQL, and Cassandra.
- Working knowledge of SQL Trace, TKPROF, EXPLAIN PLAN, and SQL*Loader for performance tuning and database optimization.
- Provided regional MySQL database migrations and hot-standby servers via asynchronous replication on Amazon EC2 and RDS, with solutions tailored for managed RDS.
- Extensive experience in dynamic SQL, records, arrays, exception handling, data sharing, data caching and data pipelining, including complex processing using nested arrays and collections.
- Experience in integrating databases like MongoDB and MySQL with web pages built in HTML, PHP and CSS to update, insert, delete and retrieve data with simple ad-hoc queries.
- Developed heavy load Spark Batch processing on top of Hadoop for massive parallel computing.
- Strong knowledge of Extraction, Transformation and Loading (ETL) processes using UNIX shell scripting, SQL, PL/SQL and SQL*Loader.
- Developed Spark RDD and Spark DataFrame APIs for distributed data processing (a minimal sketch follows this list).
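A minimal PySpark sketch of the distributed DataFrame processing described in this summary; the paths, column names and aggregation are illustrative assumptions, not taken from any specific engagement.

```python
# Illustrative PySpark sketch: distributed DataFrame processing.
# Paths, columns and table names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-aggregation-sketch").getOrCreate()

# Read raw data from the data lake (hypothetical HDFS path).
orders = spark.read.parquet("hdfs:///data/raw/orders")

# Typical transformations: filter, derive a column, aggregate.
daily_revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("order_count"))
)

# Persist the curated aggregate back to the lake.
daily_revenue.write.mode("overwrite").parquet("hdfs:///data/curated/daily_revenue")
```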
TECHNICAL SKILLS
Big Data Ecosystem: MapReduce, HDFS, HBase, Spark, Kafka, Scala, Zookeeper, Hive, Pig, Sqoop, Cassandra, Oozie, MongoDB, Flume.
Cloud Ecosystem: Amazon Web Services (EC2, EMR, IAM and S3), Azure
ETL Tools: SQL Server Integration Services, AWS Data Pipeline, Informatica PowerCenter 10.x, Talend
Languages: SQL, Python, PL/SQL, T-SQL, C, C#.NET, VB, ASP.NET, HTML, XML, XSLT
Database: Oracle 12c, 11g, 10g, 9i, SQL Server, Oracle Internet Application Server, RDS (Aurora), Redshift, DynamoDB and ElastiCache (Memcached & Redis)
Special Tools: SQL*Plus, SQL*Loader, Toad, SQL Developer, Enterprise Manager, FTP, WinSCP, Rational ClearCase and ClearQuest, SQL Server Management Studio, MS Visual Studio, Team Foundation Server (TFS), Jira, CloudWatch and CloudTrail
Data Migration Tools: SQL Loader, Export/Import, Azure, AWS
Methodologies: Agile, RUP, Waterfall Model
Packages: MS Office (Word, Excel, Project, Access, Visio, PowerPoint)
Operating Systems: Linux, UNIX, AIX, Windows 10
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Columbus, OH
Responsibilities:
- Worked closely with Business Analysts to gather requirements and design reliable and scalable data pipelines.
- Worked with various complex queries, subqueries and joins to check the validity of loaded and imported data.
- Designed and implemented ETL pipelines from various relational databases to the data warehouse using Apache Airflow.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Pulled data from the data lake (HDFS) and transformed it with various RDD transformations.
- Worked on data transformation and retrieval from mainframes to Oracle using SQL*Loader and control files.
- Created Tableau visualizations by connecting to Hadoop on AWS Elastic MapReduce (EMR).
- Developed Custom ETL Solution, Batch processing and Real-Time data ingestion pipeline to move data in and out of Hadoop using Python and Shell Script.
- Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs developed.
- Developed data integration strategies for data flow between disparate source systems and Big Data enabled Enterprise Data Lake.
- Built a serverless ETL in AWS Lambda so that new files landing in the S3 bucket are cataloged immediately (see the sketch at the end of this section).
- Worked on AWS SQS to consume the data from S3 buckets.
- Heavily involved in testing Snowflake to understand the best possible way to use cloud resources.
- Work with relational SQL and NoSQL databases, including PostgreSQL and Hadoop.
- Demonstrated expertise in designing, developing and deploying Business Intelligence solutions using SSAS.
- The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark and to provide visualization of the ETL orchestration using the CDAP tool.
- Worked with and learned a great deal from Amazon Web Services (AWS) cloud services like EC2, S3, IAM and EMR.
- Worked on data cleaning and reshaping, generated segmented subsets using NumPy and Pandas in Python.
- Developed and deployed to production multiple projects using Jenkins in the CI/CD pipeline for real-time data distribution, storage, and analytics.
- Collaborated with the Power BI team on performance tuning tasks and patching the Power BI reporting environment.
- Ingested real-time data from the source as file streams into Spark Streaming and saved the data in HDFS and Hive.
- Used RESTful web services for Salesforce integration and to retrieve contacts from the Oracle database.
- Configured CloudWatch, Lambda, SQS, and SNS to send alert notifications.
- Created S3 buckets, managed bucket policies, and utilized S3 and Glacier for storage and backup on AWS.
- Designed a data flow to pull data from a third-party vendor using a REST API with OAuth authentication.
- Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Designed and implemented a fully operational, production-grade, large-scale data solution on the Snowflake Data Warehouse.
- Worked on Amazon EMR to process data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2).
- Designed and implemented MapReduce jobs for distributed and parallel programming.
- Created architecture stack blueprint for data access with NoSQL Database Cassandra.
- Worked with Jira to plan, track and manage various development projects throughout their phases, contributing to workflow customization and team collaboration.
- Worked in the Snowflake environment to remove redundancy and loaded real-time data from various data sources into HDFS using Kafka.
- Performed Oracle/Aurora query optimization.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.
Environment: Big Data Tools, Hadoop, Hive, HBase, Spark, Oozie, Kafka, MySQL, API, Snowflake, PowerShell, GitHub, AWS, Oracle Database 12c/11g, DataStage, SQL Server 2017/2016/2012/2008, RDBMS, PostgreSQL, Power BI, MongoDB, ETL, Data Pipelining, NoSQL, SDLC, CI/CD, SQS, Python, Waterfall, Agile methodologies.
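A hedged sketch of the serverless cataloging pattern mentioned in the AWS Lambda/S3 bullet above: a Lambda handler, triggered by S3 object-created events, that starts an AWS Glue crawler so newly landed files are cataloged immediately. The crawler name and environment variable are hypothetical placeholders.

```python
# Hedged sketch: S3-triggered Lambda that starts a Glue crawler so new
# files landing in the bucket are cataloged right away.
# GLUE_CRAWLER_NAME and the default crawler name are hypothetical.
import os
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    crawler_name = os.environ.get("GLUE_CRAWLER_NAME", "raw-zone-crawler")

    # Log which objects triggered this invocation (S3 put events).
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object detected: s3://{bucket}/{key}")

    # Start the crawler; tolerate the case where it is already running.
    try:
        glue.start_crawler(Name=crawler_name)
    except glue.exceptions.CrawlerRunningException:
        print(f"Crawler {crawler_name} is already running; skipping.")

    return {"status": "ok"}
```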
Data Engineer
Confidential, Nashville, TN
Responsibilities:
- Worked with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob, Data Factory, Synapse, SQL DB and SQL DWH).
- Used Azure Data Factory, SQL API and MongoDB API and integrated data from MongoDB, MS SQL, and cloud (Blob, Azure SQL DB).
- Created ADF pipelines using Linked Services/Datasets/Pipelines to extract, transform and load data to and from various sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse and write-back tools.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using SQL activities.
- Performed data cleansing and applied transformations using Databricks and Spark data analysis.
- Extensively used Databricks notebooks for interactive analysis with Spark APIs.
- Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
- Designed and automated custom-built input connectors using Spark, Sqoop and Oozie to ingest and analyze data from RDBMS sources into Azure Data Lake.
- Involved in building an enterprise Data Lake using Data Factory and Blob storage, enabling other teams to work with more complex scenarios and ML solutions.
- Worked on ETL tool Informatica, Oracle Database and PL/SQL, Python and Shell Scripts.
- Used Delta Lake merge, update and delete operations to enable complex use cases (illustrated in the sketch after this section).
- Used Azure Synapse to manage processing workloads and serve data for BI and predictions.
- Used Azure Event Grid as a managed event service to easily route events across many different Azure services and applications.
- Developed Spark Scala scripts for data mining and performed transformations on large datasets to deliver near real-time insights and reports.
- Ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS).
- Reduced access time by refactoring data models and optimizing queries, and implemented a Redis cache to support Snowflake.
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.
- Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Created database components such as tables, views and triggers using T-SQL to structure and maintain data effectively.
- Broad experience working with SQL, with deep knowledge of T-SQL (MS SQL Server).
- Worked with the data science team on preprocessing and feature engineering, and helped take machine learning algorithms to production.
- Used machine learning algorithms such as linear regression, multivariate regression, PCA, K-means and KNN for data analysis.
- Prepared data for interactive Power BI dashboards and reporting.
Environment: Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AKS), Scala, Python, Hadoop 2.x, Spark v2.0.2, NLP, Redshift, Airflow v1.8.2, Hive v2.0.1, Sqoop v1.4.6, HBase, Oozie, Talend, Cosmos DB, MS SQL, MongoDB, Ambari, Power BI, Azure DevOps, Ranger, Git, Microservices, K-Means, KNN.
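A minimal sketch of the Delta Lake merge/update/delete pattern referenced in the bullet above, as it might appear in a Databricks PySpark notebook; the table path, key column and incoming staging DataFrame are illustrative assumptions.

```python
# Minimal Databricks/PySpark sketch of a Delta Lake upsert (MERGE).
# The paths and column names are hypothetical placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-merge-sketch").getOrCreate()

# Incoming batch of changed rows (hypothetical staging location).
updates = spark.read.parquet("/mnt/staging/customer_updates")

# Target Delta table in the lake (hypothetical path).
target = DeltaTable.forPath(spark, "/mnt/delta/customers")

# MERGE: delete soft-deleted rows, update existing customers, insert new ones.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.is_deleted = true")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```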
Data Engineer
Confidential, Indianapolis, IN
Responsibilities:
- Built new universes in Business Objects as per user requirements by identifying the required tables from the data mart and defining the universe connections.
- Designed, developed and documented the new architecture and development process to convert the existing ETL pipeline into Hadoop-based systems.
- Configured high availability using geographically distributed MongoDB replica sets across multiple data centers.
- Developed scripts for PostgreSQL, EDB Postgres Advanced Server databases for monitoring and tuning procedures.
- Used Git, GitHub, and Amazon EC2, and carried out various mathematical operations using the Python libraries NumPy and SciPy.
- Developed multiple POCs using PySpark, deployed them on the YARN cluster, and compared the performance of Spark with Hive and Teradata.
- Developed and deployed various Lambda functions in AWS with in-built AWS Lambda Libraries and deployed Lambda Functions in Scala with custom Libraries.
- Designed the ETL to capture data from streaming web sources as well as RDBMS source data.
- Developed and ran MapReduce jobs on YARN and Hadoop clusters to produce daily and monthly reports per user needs.
- Performed data analysis and mapping, database normalization, performance tuning, query optimization, data extraction, transfer and loading (ETL), and cleanup.
- Developed PL/SQL procedures, functions and packages and used SQL*Loader to load data into the database.
- Developed complex calculated measures using Data Analysis Expression language (DAX).
- Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model that gets data from Kafka in near real time and persists it to Cassandra (see the sketch after this section).
- Worked on MongoDB database concepts such as locking, transactions, indexes, replication, and schema design.
- Trained and mentored development colleagues in translating and writing NoSQL queries versus legacy RDBMS queries.
- Worked with different data feeds like JSON, CSV, XML and DAT and implemented the Data Lake concept.
- Used DataStage as an ETL tool to extract data from source systems and load it into the Oracle database.
- Analyzed data by performing Hive queries and running Pig scripts to understand customer behavior.
- Performed transformations, cleaning and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.
- Analyzed the SQL scripts and designed the solution to implement them using PySpark.
- Worked with the Data Governance team to implement the rules and build the physical data model on Hive in the data lake.
- Good understanding of performance tuning with NoSQL, Kafka, Storm and SQL Technologies.
- Implemented the AWS cloud computing platform using S3, RDS, DynamoDB, Redshift, and Python.
- Worked with NoSQL databases like HBase in creating tables to load large sets of semi structured data coming from source systems.
- Expertise in implementing DevOps culture through CI/CD tools like Repos, Jenkins, CodePipeline and GitHub.
- Extensively used Erwin for data modeling and dimensional data modeling.
- Used EXPLAIN PLAN, TKPROF to tune SQL queries.
- Developed Shell and Python scripts to automate and provide Control flow to Pig scripts. Imported data from Linux file system to HDFS.
- Expertise in designing Python scripts to interact with middleware/back end services.
Environment: Spark, Hive, Pig, Oozie, Flume, Kafka, HBase, AWS, SQL Server, PostgreSQL, J2EE, UNIX, MS Project, Oracle, WebLogic, JavaScript, RDBMS, Git, HTML, NoSQL, Microsoft Office Suite 2010, Excel, Oracle Database 11g, Python, Windows 2007 Enterprise, TOAD, ETL, SDLC, Waterfall, Agile methodologies.
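A hedged sketch of the Kafka-to-Cassandra pattern described in the Spark Streaming bullet above, written with PySpark Structured Streaming and the Spark Cassandra Connector via foreachBatch; the topic, schema, keyspace and table names are illustrative assumptions, and the sketch presumes a Spark build with the Kafka and Cassandra connector packages available.

```python
# Hedged sketch: read learner events from Kafka and persist each micro-batch
# to Cassandra with the Spark Cassandra Connector. Topic, schema, keyspace
# and table names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-cassandra-sketch").getOrCreate()

event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the Kafka topic as a streaming DataFrame.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "learner-events")
    .load()
)

# Kafka values arrive as bytes; parse the JSON payload into columns.
events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", event_schema).alias("e"))
    .select("e.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to the Cassandra table.
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="learning", table="learner_events")
        .mode("append")
        .save())

query = (
    events.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/checkpoints/learner_events")
    .start()
)
query.awaitTermination()
```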
Data Engineer
Confidential
Responsibilities:
- Built new universes in Business Objects as per user requirements by identifying the required tables from the data mart and defining the universe connections.
- Used Business Objects to create reports based on SQL queries. Generated executive dashboard reports with the latest company financial data by business unit and by product.
- Performed data analysis and mapping, database normalization, performance tuning, query optimization, data extraction, transfer and loading (ETL), and cleanup.
- Implemented Teradata RDBMS analysis with Business Objects to develop reports, interactive drill charts, balanced scorecards and dynamic Dashboards.
- Responsible for requirements gathering, status reporting, creating various metrics, and project deliverables.
- Responsible for managing MongoDB environment with high availability, performance, and scalability perspectives.
- Developed reports using SSAS on SQL Server (2000/2005/2008). Sound experience and understanding of SSAS, OLAP cubes and architecture.
- Developed a NoSQL database using CRUD, indexing, replication and sharding in MongoDB (a minimal sketch follows this section).
- Involved in migrating the warehouse database from Oracle 9i to Oracle 10g.
- Involved in analyzing and adding new Oracle 10g features such as DBMS_SCHEDULER, CREATE DIRECTORY, Data Pump and CONNECT BY ROOT to the existing Oracle 9i application.
- Tuned report performance by exploiting Oracle's new built-in functions and rewriting SQL statements.
- Extensively used Erwin for data modeling and dimensional data modeling.
- Used EXPLAIN PLAN, TKPROF to tune SQL queries.
- Developed BO full-client reports, Web Intelligence 6.5 and XI R2 reports, and universes with contexts and loops.
- Worked on ETL tool Informatica, Oracle Database and PL/SQL, Python and Shell Scripts.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
Environment: QuickTest Professional 8.2, SQL Server, J2EE, UNIX, .NET, Python, NoSQL, MS Project, Oracle, WebLogic, Shell script, JavaScript, HTML, Microsoft Office Suite 2010, Microsoft Excel.
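A minimal PyMongo sketch of the CRUD and indexing work described above (the index here simply supports lookups; replica-set and shard-key setup are cluster-side concerns). The connection string, database, collection and fields are illustrative assumptions.

```python
# Minimal PyMongo sketch of CRUD operations and index creation.
# Connection string, database, collection and fields are hypothetical.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")  # hypothetical connection
accounts = client["portfolio"]["accounts"]

# Create: insert a document.
accounts.insert_one({"account_id": "A-1001", "owner": "J. Doe", "balance": 2500.0})

# Read: simple ad-hoc query.
doc = accounts.find_one({"account_id": "A-1001"})

# Update: adjust a field on a matching document.
accounts.update_one({"account_id": "A-1001"}, {"$inc": {"balance": -100.0}})

# Delete: remove closed accounts.
accounts.delete_many({"status": "closed"})

# Index to support frequent lookups by account_id (unique constraint).
accounts.create_index([("account_id", ASCENDING)], unique=True)
```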
Data Analyst
Confidential
Responsibilities:
- Worked with Demand Gen and interactive marketing teams to build descriptive and actionable insights.
- Experience in creating various views in Tableau (Tree maps, Heat Maps, Scatter plot).
- Created action filters, parameters, calculated fields, sets and table calculations for preparing dashboards and worksheets in Tableau.
- Worked on multiple Tableau visualization charts like Area Chart, Line Chart, Heat and Tree Maps, Bar Chart, Stacked Bar Chart, Waterfall Chart and many more custom charts.
- Designed data driven B2B demand gen solutions to improve ROI on media spend.
- Hands-on experience extracting and manipulating data and building complex formulas in Tableau for various business calculations.
- Gathered business, system, and functional requirements by conducting detailed interviews with business users, stakeholders, and Subject Matter Experts (SME's). Defined the scope of the project, financial projections, and Cost/benefit analysis.
- Responsible for revelation, engagement, and churn analysis.
- Implemented engagement-score-driven content personalization and optimization (a simple scoring sketch follows this list).
- Generated complete digital insights reporting using online and offline data sources.
- Responsible for increasing user engagement and conversion and enabling data-driven product development.
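A small pandas sketch of the kind of engagement scoring mentioned above; the activity columns and weights are illustrative assumptions, not the actual scoring model.

```python
# Small pandas sketch: a simple weighted engagement score per user.
# Column names and weights are hypothetical placeholders.
import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "page_views": [12, 8, 3, 20],
    "email_clicks": [2, 1, 0, 5],
    "downloads": [1, 0, 0, 2],
})

# Aggregate activity per user, then apply hypothetical weights.
per_user = events.groupby("user_id", as_index=False).sum()
per_user["engagement_score"] = (
    0.5 * per_user["page_views"]
    + 2.0 * per_user["email_clicks"]
    + 5.0 * per_user["downloads"]
)

# Rank users for personalization and churn-risk segmentation.
per_user = per_user.sort_values("engagement_score", ascending=False)
print(per_user)
```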