Sr. Data Engineer Resume
Eden Prairie, MN
SUMMARY
- Over 9 years of experience as a Data Engineer, including designing, developing and implementing data models for enterprise-level applications and systems.
- Proficient in managing the entire data science project life cycle and actively involved in all phases of a project.
- Experienced in Big Data work on Hadoop, Spark, PySpark, Hive, HDFS and NoSQL platforms.
- Experienced working extensively on Master Data Management (MDM) and the applications used for MDM.
- Experience in transferring data from AWS S3 to AWS Redshift using Informatica.
- Hands-on experience with Amazon Web Services, provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
- Hands-on experience with Google Cloud Platform (GCP) services such as BigQuery, GCS buckets and Cloud Functions.
- Experienced in Informatica ILM (Information Lifecycle Management) and its tools.
- Efficient in all phases of the development lifecycle, including Data Cleansing, Data Conversion, Data Profiling, Data Mapping, Performance Tuning and System Testing.
- Good knowledge of SQL queries and of creating database objects such as stored procedures, triggers, packages and functions using SQL and PL/SQL to implement business logic.
- Supported ad-hoc business requests, developed stored procedures and triggers, and extensively used Quest tools such as TOAD.
- Good understanding and exposure to Python programming.
- Excellent working experience in Scrum/Agile framework and Waterfall project execution methodologies.
- Extensive experience working with business users/SMEs as well as senior management.
- Experience with the Big Data Hadoop ecosystem for ingestion, storage, querying, processing and analysis of big data.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
- Experienced in technical consulting and end-to-end delivery covering architecture, data modeling, data governance, and the design, development and implementation of solutions.
- Experience in installing, configuring, supporting and managing the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
- Strong experience with and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Proficient in normalization/de-normalization techniques in relational and dimensional database environments, having performed normalization up to 3NF.
- Good understanding of Ralph Kimball (dimensional) and Bill Inmon (relational) modeling methodologies.
- Strong experience in using MS Excel and MS Access to load and analyze data based on business needs.
- Good experience in data analysis; proficient in gathering business requirements and handling requirements management.
- Experience in migrating data between HDFS/Hive and relational database systems using Sqoop, per client requirements.
TECHNICAL SKILLS
Data Modeling Tools: Erwin R9.7/9.6, ER Studio V17
Machine Learning: Linear regression, Logistic regression, Decision tree, Random forest, K-nearest neighbors, K-means
Big Data & Hadoop Ecosystem: MapReduce, Spark 3.3, HBase 2.3.4, Hive 2.3, Flume 1.9, Sqoop 1.4.6, Kafka 2.6, Oozie 4.3, Hue, Cloudera Manager, Neo4j, Hadoop 3.3, Apache NiFi 1.6
NoSQL Databases: MongoDB, Azure SQL DB, Cassandra 3.11.10
Databases: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access
Cloud Platforms: GCP, Google BigQuery, AWS (EC2, S3, Redshift) & MS Azure
BI Tools: Tableau 10, SSRS, Crystal Reports, Power BI.
Programming Languages: SQL, PL/SQL, UNIX shell Scripting, R
Operating Systems: Microsoft Windows Vista/7/8/10, UNIX, and Linux.
Methodologies: Agile, RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Waterfall Model.
PROFESSIONAL EXPERIENCE
Confidential
Sr. Data Engineer
Responsibilities:
- Worked as a Data Engineer at BNY, driving projects using Spark, SQL and the Azure cloud environment.
- Worked on data governance to provide operational structure to previously ungoverned data environments.
- Participated in the requirement gathering sessions to understand the expectations and worked with system analysts to understand the format and patterns of the upstream source data.
- Performed data migration from an RDBMS to a NoSQL database and provided a consolidated view of the data deployed across the various data systems.
- Designed and implemented end-to-end data solutions (storage, integration, processing and visualization) in Azure.
- Analyzed, designed and built modern data solutions using Azure PaaS services to support data visualization.
- Designed and configured Azure cloud relational servers and databases, analyzing current and future business requirements.
- Worked on migrating data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
- Designed and developed a data pipeline in Azure that gets customer data from an API and processes it into Azure SQL DB.
- Created external tables in Azure SQL Database for data visualization and reporting purposes.
- Created and set up a self-hosted integration runtime on virtual machines to access private networks.
- Orchestrated all Data pipelines using Azure Data Factory and built a custom alerts platform for monitoring.
- Extracted, transformed and loaded data from source systems to Azure data storage services using Azure Data Factory.
- Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Performed SAP Data Migration by using Business Objects Data Services as the ETL tool.
- Worked with Azure Blob and Data Lake storage, loading data into Azure Synapse Analytics (DW).
- Built visuals and dashboards using the Power BI reporting tool.
- Developed streaming pipelines using Apache Spark with Python.
- Created PL/SQL packages and database triggers, developed user procedures and prepared user manuals for the new programs.
- Wrote UDFs in Scala and PySpark to meet specific business requirements (a PySpark sketch follows this section).
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using SQL activities.
- Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
- Worked with enterprise data support teams to install Hadoop updates, patches and version upgrades as required, and fixed problems that arose after the upgrades.
- Implemented test scripts to support test-driven development and continuous integration.
- Used Spark for parallel data processing and better performance.
- Used Azure Key Vault as the central repository for maintaining secrets, and referenced the secrets in Azure Data Factory as well as in Databricks notebooks.
- Used Python for web scraping to extract data.
- Collected data from social media websites such as Twitter via scraping to identify trending topics.
- Conducted numerous training sessions, demonstration sessions on Big Data.
Environment: Hadoop 3.3, Spark 3.3, Azure, ADF, Scala 3.0, JSON, Power BI, Azure SQL DB, Azure Synapse, Python 3.9, PL/SQL and Agile.
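A minimal sketch of the kind of PySpark UDF mentioned above; the column names, sample data and masking logic are illustrative assumptions, not the project's actual code.

```python
# Illustrative PySpark UDF sketch; names and data are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

@F.udf(returnType=StringType())
def mask_account(account_id):
    # Keep the last four characters, mask the rest.
    if account_id is None:
        return None
    return "*" * max(len(account_id) - 4, 0) + account_id[-4:]

customers_df = spark.createDataFrame(
    [("1234567890", "Alice"), ("9876543210", "Bob")],
    ["account_id", "name"],
)
customers_df.withColumn("masked_id", mask_account("account_id")).show()
```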
Confidential - Eden Prairie, MN
Sr. Data Engineer
Responsibilities:
- Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents.
- Conducted technical orientation sessions using documentation and training materials.
- Gathered the business requirements from the Business Partners and Subject Matter Experts.
- Served as technical expert guiding choices to implement analytical and reporting solutions for the client.
- Worked closely with the business, other architecture team members and global project teams to understand, document and design data warehouse processes and needs.
- Involved in the Agile development methodology as an active member in scrum meetings.
- Designed and architected various layers of the Data Lake.
- Designed star schema in BigQuery.
- Used REST APIs with Python to ingest data from external sites into BigQuery.
- Monitored BigQuery, Dataproc and Cloud Dataflow jobs via Stackdriver across the entire environment.
- Opened SSH tunnels to Google Dataproc to access the YARN resource manager and monitor Spark jobs.
- Submitted Spark jobs, staged with gsutil, for execution on the Dataproc cluster.
- Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in the GCS bucket (see the sketch after this section).
- Wrote a program to download a SQL dump from the maintenance site and then load it into a GCS bucket.
- Loaded the SQL dump from the GCS bucket into MySQL (hosted in Google Cloud SQL) and loaded the data from MySQL into BigQuery using Python, Scala, Spark and Dataproc.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
- Wrote a Python program to maintain raw file archival in the GCS bucket.
- Wrote Scala programs for Spark transformations in Dataproc.
Environment: BigQuery, DataStage, Agile/Scrum, Python, Spark, Scala, Dataproc, G-Cloud, GCS bucket, SQL, Star Schema & MySQL.
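A minimal sketch of a GCS-triggered Cloud Function that loads an arriving CSV into BigQuery, along the lines described above; the target dataset/table name is a hypothetical placeholder, not the project's actual code.

```python
# Illustrative GCS-triggered Cloud Function (background/1st-gen style) that
# loads an arriving CSV into BigQuery. The target table is a hypothetical name.
from google.cloud import bigquery

TABLE_ID = "my-project.staging.raw_events"  # hypothetical target table

def gcs_to_bigquery(event, context):
    """Triggered when an object is finalized in the GCS bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # Wait for the load job to complete.
    print(f"Loaded {uri} into {TABLE_ID}")
```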
Confidential
Data Engineer
Responsibilities:
- Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents.
- Architected, designed and developed business applications and data marts for reporting.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
- Developed complete end-to-end big data processing in the Hadoop ecosystem.
- Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) on EC2.
- Developed a reconciliation process to ensure the Elasticsearch index document count matches the source records.
- Developed data pipelines to consume data from the Enterprise Data Lake (MapR Hadoop distribution - Hive tables/HDFS) for the analytics solution.
- Created a data pipeline using processor groups and multiple processors in Apache NiFi for flat files, as part of a POC on Amazon EC2.
- Created Hive external tables to stage data and then moved the data from staging to main tables.
- Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into HDFS.
- Developed incremental and full-load Python processes to ingest data into Elasticsearch from an Oracle database.
- Pulled the data from the data lake (HDFS) and massaged it with various RDD transformations.
- Created Airflow scheduling scripts (DAGs) in Python (see the sketch after this section).
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Developed REST services to write data into the Elasticsearch index using Python Flask.
- Used AWS Cloud with Infrastructure Provisioning / Configuration.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
- Worked on configuring and managing disaster recovery and backup for the NoSQL database.
- Utilized Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS and processed them.
- Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
- Implemented partitioning, dynamic partitions and buckets in Hive.
- Used Flume to collect, aggregate and store web log data from different sources such as web servers, mobile and network devices, and pushed it to HDFS.
- Supported in setting up QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
Environment: Apache Spark, Hive 2.3, Informatica, HDFS, Airflow, MapReduce, Scala, Apache NiFi 1.6, YARN, PL/SQL, MongoDB, Pig 0.16, Sqoop 1.2, Flume 1.8
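A minimal sketch of the kind of Airflow scheduling script referenced above; the DAG id, schedule and task callables are illustrative assumptions, not the project's actual pipeline.

```python
# Illustrative Airflow DAG sketch; dag_id, schedule and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_hive():
    # Placeholder: pull the day's partition from a Hive table.
    print("extracting data from Hive")

def index_into_elasticsearch():
    # Placeholder: push transformed records into an Elasticsearch index.
    print("indexing documents into Elasticsearch")

with DAG(
    dag_id="daily_ingest_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_hive)
    index = PythonOperator(task_id="index", python_callable=index_into_elasticsearch)
    extract >> index  # Run the extract task, then the indexing task.
```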
Confidential - Houston, TX
Data Analyst/ Data Engineer
Responsibilities:
- Worked as a Data Analyst/Data Engineer on data validation to ensure the accuracy of the data between the warehouse and source systems.
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Used Agile methodology and the Scrum process for project development.
- Deployed and monitored scalable infrastructure on the Amazon Web Services (AWS) cloud environment.
- Analyzed and prepared data, identifying patterns in the dataset by applying historical models.
- Involved in Migrating Objects from Teradata to Snowflake.
- Designed and implemented effective Analytics solutions and models with Snowflake.
- Involved with data profiling for multiple sources and answered complex business questions by providing data to business users.
- Performed Data scrubbing for removing incomplete, irrelevant data and maintained consistency in the target data warehouse by cleaning the dirty data.
- Loaded data into Hive Tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.
- Performed data manipulation, data preparation, normalization and predictive modeling.
- Improved efficiency and accuracy by evaluating the model in R.
- Tested the ETL process both before and after the data validation process.
- Used the AWS Glue catalog with crawlers to get the data from S3 and perform SQL query operations.
- Designed data profiles for processing, including running SQL and PL/SQL queries and using R for data acquisition and data integrity checks, consisting of dataset comparisons and dataset schema checks.
- Used Python and R programming to improve the models.
- Written SQL queries against Snowflake.
- Designed the database tables and created table- and column-level constraints using the suggested naming conventions for constraint keys.
- Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark (a Glue sketch follows this section).
- Automated solutions to manual processes with big data tools (Spark, Python and AWS).
- Involved in loading data from Unix file system to HDFS.
- Worked on enhancing the data quality in the database.
- Maintained PL/SQL objects like packages, triggers, procedures etc.
- Designed and developed T-SQL stored procedures to extract, aggregate, transform, and insert data.
- Created several reports for claims handling which had to be exported out to PDF formats.
- Created Tableau dashboards, datasets, data sources and worksheets.
Environment: Snowflake, Hadoop 2.5, Teradata, R, Python, Spark, AWS S3, Glue, PySpark, PL/SQL, T-SQL, Tableau, and agile/Scrum.
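A minimal sketch of a Glue PySpark job along the lines described above; the database, table and S3 path are hypothetical placeholders, not the project's actual names.

```python
# Illustrative AWS Glue PySpark job sketch; catalog names and S3 path are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a crawler-cataloged S3 dataset via the Glue Data Catalog.
claims = glue_context.create_dynamic_frame.from_catalog(
    database="claims_db", table_name="raw_claims"
)

# Light transformation, then write the result back to S3 as Parquet.
cleaned = claims.drop_fields(["temp_column"])
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/claims/"},
    format="parquet",
)
job.commit()
```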
Confidential - Boston, MA
Data Analyst
Responsibilities:
- As a Data Analyst I was responsible for gathering data migration requirements.
- Identified problematic areas and conducted research to determine the best course of action to correct the data.
- Analyzed problem and solved issues with current and planned systems as they relate to the integration and management of order data.
- Under the supervision of a Sr. Data Scientist, performed data transformation methods for rescaling and normalizing variables.
- Developed and implemented predictive models using Natural Language Processing Techniques and machine learning algorithms.
- Involved in Data Mapping activities for the data warehouse.
- Analyzed reports of data duplicates or other errors to provide ongoing appropriate inter-departmental communication and monthly or daily data reports.
- Monitored select data elements for timely and accurate completion.
- Collected, analyzed and interpreted complex data for reporting and/or performance trend analysis.
- Monitored data dictionary statistics.
- Involved in analyzing and adding new Oracle 10g features such as DBMS_SCHEDULER, Create Directory, Data Pump and CONNECT BY ROOT to the existing Oracle 10g application.
- Coded, tested, debugged, implemented and documented data using R.
- Applied the K-Means algorithm to determine the position of an agent based on the data collected (a clustering sketch follows this section).
- Applied regression to estimate the probability of an agent's location with respect to the insurance policies sold.
- Archived the old data by converting it into SAS data sets and flat files.
- Extensively used the Erwin tool for forward and reverse engineering, following corporate naming-convention standards and using conformed dimensions whenever possible.
- Enabled a smooth transition from the legacy system to the newer system through the change management process.
- Planned project activities for the team based on project timelines using Work Breakdown Structure.
- Compared data with original source documents and validated data accuracy.
- Used reverse engineering to create graphical representations (E-R diagrams) and to connect to the existing database.
- Generated weekly and monthly asset inventory reports.
- Created Technical Design Documents, Unit Test Cases.
- Wrote SQL and PL/SQL scripts to extract data from the database to meet business requirements and for testing purposes.
- Wrote complex SQL queries to validate the data against different kinds of reports generated by Business Objects XI R2.
- Involved in test case/data preparation, execution and verification of the test results.
- Created user guidance documentations.
- Created reconciliation report for validating migrated data.
Environment: UNIX, Shell Scripting, XML Files, K-Means, R, XSD, XML, SAS, PL/SQL, Oracle 10g, Erwin 9.5, Autosys.
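A minimal Python analogue of the K-Means clustering described above (the original work was done in R); the feature columns, sample values and cluster count are illustrative assumptions.

```python
# Illustrative K-Means clustering sketch in Python (the original work used R);
# the agent-location features and cluster count are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical agent location features (latitude, longitude).
agent_locations = np.array([
    [44.85, -93.47],
    [44.86, -93.45],
    [42.36, -71.06],
    [42.35, -71.05],
    [29.76, -95.37],
])

# Rescale variables before clustering, mirroring the normalization step above.
scaled = StandardScaler().fit_transform(agent_locations)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
print(kmeans.labels_)           # Cluster assignment per agent
print(kmeans.cluster_centers_)  # Centroids in scaled feature space
```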