Senior Data Engineer Resume
Des Moines, IA
SUMMARY
- Over 7 years of experience in Analysis, Design, Development, and Implementation as a Data Engineer
- Expert in providing ETL solutions and ETL processes for any type of business model
- Develop effective working relationships with client teams to understand and support requirements, develop tactical and strategic plans to implement technology solutions, and effectively manage client expectations
- An excellent team member who can also work independently, with good interpersonal skills, strong communication skills, a strong work ethic, and a high level of motivation
- Excellent knowledge of Machine Learning, Mathematical Modelling, and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB, and relational databases. Deep understanding of and exposure to the Big Data ecosystem
- Experience in development and design of various scalable systems using Hadoop technologies in various environments
- Extensive experience in analysing data using Hadoop ecosystem components including HDFS, MapReduce, Hive, and Pig
- Extensively used Python libraries PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup
- Experience in understanding the security requirements for Hadoop
- Extensive experience in working with Informatica PowerCenter
- Good hands-on expertise with AWS storage services such as S3, EBS, EFS, and Storage Gateway and AWS compute services such as EC2 and Elastic MapReduce (EMR), and with accessing instance metadata
- Implemented integration solutions for cloud platforms with Informatica Cloud
- Proficient in SQL, PL/SQL, and Python coding. Worked with the Java-based ETL tool Talend
- Expertise in debugging and performance tuning of Oracle and Java applications, with strong knowledge of Oracle 11g and SQL
- Experience in data warehousing and business intelligence using tools such as Informatica and Business Objects
- Experience in developing customized UDFs in Java to extend Hive and Pig Latin functionality
- Experience developing on-premises and real-time processes
- Excellent understanding of Enterprise Data Warehouse best practices and involved in full life cycle development of Data Warehousing
- Experience in Data Analysis, Data Migration, Data Validation, Data Cleansing, Data Verification and identifying Data Mismatch
- Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems
- Experience in Big Data technologies like Spark, Spark SQL, PySpark, Hadoop, HDFS, Hive
- Expertise in DBMS concepts
- Experience in working with Azure Monitoring, Data Factory, Traffic Manager, Service Bus, Key Vault
- Involved in building Data Models and Dimensional Modelling with 3NF, Star, and Snowflake schemas for OLAP and Operational Data Store (ODS) applications
- Skilled in designing and implementing ETL Architecture for a cost-effective and efficient environment
- Optimized and tuned ETL processes & SQL Queries for better performance
- Performed complex data analysis and provided critical reports to support various departments
- Worked with Business Intelligence tools like Business Objects and Data Visualization tools like Tableau
- Extensive Shell/Python scripting experience for Scheduling and Process Automation
- Good exposure to Development, Testing, Implementation, Documentation, and Production support
- Experience in one or more data platform services such as SQL, CosmosDB, MongoDB, Oracle, Hadoop
- Proficiency in multiple databases like MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server
- Data Platform development using Spark, Greenplum, and Hadoop
- Exposure to NoSQL databases such as MongoDB, HBase, and Cassandra. Created Java apps to handle data in MongoDB and HBase
- Designing, building, and publishing Cognos Multi-Dimensional OLAP Cube solutions
- Good experience in the design and implementation of fully automated Continuous Integration, Continuous Delivery, and Continuous Deployment (CI/CD) pipelines and DevOps processes for Agile projects
- Database design (conceptual, logical) and programming with Amazon Redshift, Microsoft Azure, the Big Data ecosystem, Oracle PL/SQL, Teradata, Erwin, Power Designer, and OLAP on Hadoop using HDInsight
- Experience building ETL data pipelines and ETL workflows on Hadoop/Teradata using Hadoop, Pig, Hive, and UDFs
- Well-versed in version control and CI/CD tools such as SVN, Git, SourceTree, and Bitbucket
- Experience in Amazon Web Services (AWS) products S3, EC2, EMR, and RDS
- Strong experience in the design and development of Business Intelligence solutions using Data Modelling, Dimensional Modelling, ETL Processes, Data Integration, OLAP, and client/server applications
- Extensive experience in Agile software development methodology
- Experience in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory
- Good understanding of Big Data Hadoop and YARN architecture, including Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, and Resource/Cluster Manager, as well as Kafka (distributed stream processing)
TECHNICAL SKILLS
Languages: PL/SQL, SQL, T-SQL, C, C++, XML, HTML, DHTML, HTTP, MATLAB, Python
Databases: SQL Server 2017, MS-Access, Oracle 11g, Sybase, and DB2
Database Design Tools and Data Modelling: Fact & Dimension tables, physical & logical data modelling, Normalization and Denormalization techniques, Kimball
Tools and Utilities: SQL Server 2016/2017, SQL Server Enterprise Manager, TOAD, SQL Server Profiler, Import & Export Wizard, Visual Studio v14, .Net, Microsoft Management Console, Visual SourceSafe 6.0, DTS, Crystal Reports, Power Pivot, ProClarity, Microsoft Office 2007/10/13, Excel Power Pivot, Excel Data Explorer, Tableau 8/10, JIRA
Web Services: REST, SOAP
Development Build & Integration Tools: Eclipse, Maven, Jenkins, IntelliJ, Log4J
Operating Systems: Microsoft Windows 8/7/XP, Linux and UNIX
Cloud Technologies: AWS, Azure
Testing Management Tools: Bugzilla, JIRA, Quality Centre, QTP
SDLC Methodologies: Agile, Scrum, Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Des Moines, IA
Senior Data Engineer
Responsibilities:
- Worked on designing and developing the Real-Time Tax Computation Engine using Oracle, StreamSets, Kafka, Spark Structured Streaming, and MySQL
- Implemented Spark jobs in Scala, using DataFrames and the Spark SQL API for faster processing of data
- Involved in ingestion, transformation, manipulation, and computation of data using StreamSets, Kafka, MySQL, and Spark
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks
- Involved in data ingestion into MySQL using a Kafka-MySQL pipeline for full and incremental loads from a variety of sources such as web servers, RDBMS, and data APIs
- Worked on Spark data sources, Spark DataFrames, Spark SQL, and Streaming using Scala
- Worked extensively on AWS components such as Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), and Simple Storage Service (S3)
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations
- Built an end-to-end ETL pipeline from AWS S3 to the key-value store DynamoDB and the Snowflake data warehouse for analytical queries, specifically for cloud data
- Experience in developing Spark applications using Scala and SBT
- Experience in integrating the Spark-MySQL connector and JDBC connector to save the data processed in Spark to MySQL
- Responsible for creating tables and automated MySQL pipelines that load data into tables from Kafka topics
- Performed a POC to compare the time taken for Change Data Capture (CDC) of Oracle data across Striim, StreamSets, and Dbvisit
- Created instances in AWS, migrated data from the data center to AWS using Snowball and AWS migration services, and implemented a generalized solution model using AWS SageMaker
- Leveraged AWS SageMaker to build, train, tune, and deploy state-of-the-art Machine Learning and Deep Learning models
- Created a continuous integration and continuous delivery (CI/CD) pipeline on AWS to automate steps in the software delivery process
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP
- Expertise in using different file formats like text files, CSV, Parquet, and JSON
- Developed custom compute functions using Spark SQL and performed interactive querying
- Responsible for masking and encrypting the sensitive data on the fly
- Responsible for creating multiple applications for reading the data from different Oracle instances to Kafka topics using Striim
- Extensive experience in deploying, managing, and developing MongoDB clusters. Created, configured, and monitored shard sets
- Analysed the current-state reporting database (Access-based), identified the front-end user screen functionality, and provided solutions and a detailed summary of the existing database functionality to the business teams. Provided detailed data workflow diagrams for the existing reporting database
- Responsible for setting up a MySQL cluster on an AWS EC2 instance
- Configured high availability using geographically distributed MongoDB replica sets across multiple data centers
- Experience in real-time streaming of data using Spark with Kafka
- Imported data from various sources into the Cassandra cluster using Java APIs or Sqoop
- Responsible for creating a Kafka cluster using multiple brokers
- Experience working on Vagrant boxes to set up local Kafka and StreamSets pipelines
Environment: Spark 2.2, Scala, Linux, MySQL 5.8, Kafka 1.0, Striim, StreamSets, Spark SQL, Spark Structured Streaming, AWS EC2, EMR, IntelliJ, SBT, Git, Vagrant, Metadata, MS Excel, Mainframes, MS Visio, MapReduce, Rational Rose, PySpark, SQL, MongoDB, Workday HCM, Workday conversions, Workday Report Writer, Data Modeling
Confidential, Dorchester, MA
Data Engineer
Responsibilities:
- Implemented machine learning, optimization, visualization, and statistical modelling methods such as Regression Models, Decision Trees, Naïve Bayes, Ensemble Classifiers, Hierarchical Clustering, and Semi-Supervised Learning on different datasets using Python
- Configured a Workday system to meet each client's unique business requirements. Also developed test scripts for other outside systems that interface with Workday
- Researched and implemented various Machine Learning Algorithms using the R language
- Devised a machine learning algorithm using Python for facial recognition
- Used R to prototype on sample data and identify the best algorithmic approach, then wrote Scala scripts using the Spark machine learning module
- Used Scala scripts to run Spark machine learning library APIs for decision trees, ALS, and logistic and linear regression algorithms
- Configured new benefit plans in the Workday system and performed mass uploads of employees to those plans
- Integrated data stored in S3 with Databricks to perform ETL processes using PySpark and Spark SQL
- Migrated an on-premises virtual machine to an Azure Resource Manager subscription with Azure Site Recovery
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment. Gained experience in DWH/BI project implementation using Azure Data Factory
- Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data
- Extracted data from Azure Data Lake into an HDInsight cluster (Intelligence + Analytics), applied Spark transformations and actions, and loaded the results into HDFS
- Provided consulting and cloud architecture for premier customers and internal projects running on the MS Azure platform, targeting high availability of services and low operational costs
- Developed structured, efficient, and error-free code for Big Data requirements using Hadoop and its ecosystem
- Developed a web service using Windows Communication Foundation and .NET to receive and process XML files, deployed as a Cloud Service on Microsoft Azure
- Used CosmosDB for partitioning the data for high availability and scalability
- Implemented an ETL process to move data from CosmosDB to Azure SQL Database using SQLizer and SSIS
- Used Apache NiFi to copy data from the local file system to HDFS
- Analyzed the pre-existing predictive model developed by the advanced analytics team and the factors considered during model development
- Focused on Test Driven Development, writing detailed JUnit tests for each piece of functionality before implementing it
- Involved in preparing logical and physical data models
- Validated the MapReduce, Pig, and Hive scripts by pulling data from Hadoop and comparing it with the data in the files and reports
- Experienced in all phases of data mining: data collection, data cleaning, developing models, validation, and visualization
- Analyzed metadata and processed data to gain better insight into the data
- Created initial data visualizations in Tableau to provide basic insights into the data to the project stakeholders
- Applied various machine learning algorithms and statistical models such as decision trees, regression models, clustering, and SVM to identify Volume using the scikit-learn package in Python
- Conducted regular communications with leaders of other teams to get a better understanding of the data at a deeper level
- Worked extensively on the naming standards used in enterprise data modelling
- Developed visualizations using R packages such as ggplot2 and choroplethr to identify patterns and trends in the preprocessed data
- Used RStudio packages and Python libraries such as scikit-learn to improve model accuracy from 65% to 86%
- Provided conceptual and technical modeling assistance to developers and DBAs using Erwin and Model Mart. Validated data models with IT, team members, and clients
- Experienced in various Python libraries such as pandas and NumPy (one- and two-dimensional arrays)
- Experienced in using the PyTorch library and implementing natural language processing
- Developed data visualizations in Tableau to display the day-to-day accuracy of the model with newly incoming data
- Hold a point-of-view on the strengths and limitations of statistical models and analyses in various business contexts and can evaluate and effectively communicate the uncertainty in the results
- Used the Keras library to build and train deep learning models with good results
- Developed a propensity model that delivered a greater ROI than other models
- Achieved an ROI of 0.95 million dollars per cycle, with a cycle duration of one quarter
- Implemented a complete data science project involving data acquisition, data wrangling, exploratory data analysis (EDA), model development, and model evaluation
- Worked on various methods including data fusion and machine learning and improved the accuracy of distinguishing right rules from potential rules
- Developed Merge jobs in Python to extract and load data into a MySQL database
- Used a test-driven approach to develop the application and implemented the unit tests using the Python unittest framework
- Designed and documented REST/HTTP, SOAP APIs, including JSON data formats and API versioning strategy
- Cached application-specific data in in-memory data clusters such as Redis and exposed it through RESTful endpoints
- Wrote unit test cases in Python and Objective-C for other API calls in the customer frameworks
- Tested various Machine Learning algorithms such as Support Vector Machine (SVM), Random Forest, and trees with XGBoost, and selected Decision Trees as the champion model
Environment: Machine Learning, R Language, Hadoop, Big Data, Azure, Python, PySpark, Java, J2EE, Spring, Struts, JSF, Dojo, JavaScript, DB2, CRUD, PL/SQL, JDBC, Coherence, MongoDB, Apache CXF, SOAP, Web Services, Eclipse, MS Access, Teradata, Advanced SQL, RStudio (ggplot2, caret), Tableau, Excel, Workday HCM, Workday conversions, Workday Report Writer
Confidential
Data Analyst
Responsibilities:
- Collected data from the end client, performed ETL, and defined the uniform standard format
- Wrote queries to retrieve data from a SQL Server database to get the sample dataset containing needed fields
- Performed string formatting on the dataset converting hours from date format to a numerical integer
- Used Python libraries like Matplotlib and Seaborn to visualize the numerical columns of the dataset such as day of the week, age, hour, and number of screens
- Created VBA programs to automatically update Excel workbooks, encompassing class and program modules and external data queries
- Developed and implemented predictive models like Logistic Regression, Decision Tree, and Support Vector Machine (SVM) to predict the probability of enrollment
- Used ensemble learning methods like Random Forest, Bagging, and Gradient Boosting to predict the probability of customer enrollment, and selected the final model based on the confusion matrix, ROC, and AUC
- Worked on missing value imputation and outlier identification with statistical methodologies using pandas and NumPy
- Tuned the hyperparameters of the above models using Grid Search to find the optimum models
- Designed and implemented K-Fold Cross-validation to test and verify the model’s significance
- Developed a dashboard and story in Tableau showing the benchmarks and summary of the model’s measure
- Made extensive use of tools such as R, Python, ODS, DB2, Metadata, and MS Excel to analyze data from multiple perspectives and deliver a robust Machine Learning algorithm
- Created new tools and business processes that simplify and standardize work and enable operational excellence
- Used tools like Tableau for drilling down into data, creating insightful reports, and gathering actionable business insights
- Documented business requirements, technical requirements, application and data workflows, use cases, and test plans
- Performed database testing and report-level testing as per the requirements, with an excellent understanding of the data workflow gained by referring to FSDs (Functional Specification Documents)
- Excellent understanding of the mapping between Source and Target by referring to the mapping document
- Performed end-to-end mapping testing for the database as well as reports
- The mapping is one-to-one and follows a lift-and-shift process, so testing verified that the data gathered in the target table is mapped properly to the source table and that the same target table populates the same records into the reporting tool (SAP-BO, QlikView)
- Performed smoke tests for primary checks such as record counts and column matching for the database, and for dashboard testing
- Worked with data owners, Business Units, the Data Integration team, and customers in a fast-paced Agile/Scrum environment
- Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team
- Performed testing at SIT (System Integration Testing) level and UAT (User Acceptance testing) level
- Gathered requirements from the development team and database developers to analyze the tables and entity relationships for understanding the database
- Designed the integration document (XLS) to derive the input and output of each of the integration points
- Documented the acceptance criteria for each of the test cases. Built the test cases based on test scenarios
- Created a test plan and strategy for the given LOB (Line of Business)
- Wrote queries in BigQuery to look up all Customer-, Product-, and Order-level data
- Verified import/export and obfuscation data
- Verified known issues and developed workarounds and wrappers as required
- Identified data scenarios and business cases, and developed test cases
- Scripted and automated test cases and identified source data patterns for generating reports
- Developed scripts for comparison with the target. Planned and ran the SIT (System Integration Testing) for the given LOB
- Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy
- Developed test cases and established traceability between requirements and test cases
- Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python
- Provided inputs to the test lead for documentation and reporting purposes
- Identified, documented, and updated testing dependencies and participants
- Identified primary point of contact to raise the risks/issues around testing dependencies
- Reported status on test execution including risks/issues and targets
- Updated latest information in regular testing status meetings with all involved constituencies to ensure smooth test execution and timely issue resolution
Environment: Informatica Power Center, HP-ALM, SharePoint, MS-Visio, MS-Excel, Teradata SQL Assistant, QlikView, SAP-BO, Oracle 11g, Microsoft SQL Server, Tableau report builder, MS Outlook, SQL Server 2012/2014, Python (Scikit-Learn, NumPy, Pandas, Matplotlib, Dateutil, Seaborn), Tableau, Hadoop
