Data Engineer Resume
Pataskala, Ohio
SUMMARY
- Data Engineering professional with 8+ years of combined experience in data engineering, Big Data implementations, and Spark technologies.
- Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
- High exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
- Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
- Experience with the Apache Spark ecosystem using Spark Core, SQL, DataFrames, and RDDs, and knowledge of Spark MLlib.
- Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
- Solid experience in designing and operationalizing large-scale data and analytics solutions in the Snowflake Data Warehouse.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL (a minimal sketch follows this summary).
- Experience extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Implemented a cluster for the NoSQL tool HBase as part of a POC to address HBase limitations.
- Strong knowledge of the architecture and components of Spark, and efficient in working with Spark Core.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience with Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB).
- Extensive experience in creating pipeline jobs and schedule triggers using Azure Data Factory.
- Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
- Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
- Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
- Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
- Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
- Experience in the data warehousing and business intelligence area across various domains.
- Created Tableau dashboards designed over large data volumes sourced from SQL Server.
- Extracted, transformed, and loaded (ETL) source data into the respective target tables to build the required data marts.
- Active involvement in all scrum ceremonies - Sprint Planning, Daily Scrum, Sprint Review and Retrospective meetings and assisted Product owner in creating and prioritizing user stories.
- Strong experience in working with UNIX/Linux environments and writing shell scripts.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files.
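A minimal sketch of the Python + SnowSQL style ETL load mentioned in this summary. All connection details, the source file name, and the STG_ORDERS staging table are hypothetical placeholders, not details from any actual project; the target table is assumed to already exist in Snowflake.

    # Minimal ETL sketch: extract a CSV, apply a light transform, load to Snowflake.
    # Account, warehouse, database, and table names below are hypothetical placeholders.
    import pandas as pd
    import snowflake.connector
    from snowflake.connector.pandas_tools import write_pandas

    # Extract: read a delimited source file into a DataFrame.
    df = pd.read_csv("orders_extract.csv")

    # Transform: normalize column names and drop exact duplicates.
    df.columns = [c.strip().upper() for c in df.columns]
    df = df.drop_duplicates()

    # Load: push the frame into an existing Snowflake staging table.
    conn = snowflake.connector.connect(
        user="ETL_USER", password="***", account="my_account",
        warehouse="ETL_WH", database="SALES_DB", schema="STAGING",
    )
    try:
        write_pandas(conn, df, table_name="STG_ORDERS")
    finally:
        conn.close()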
TECHNICAL SKILLS
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, Storm, Azure Databricks, Azure Data Explorer, Azure HDInsight
Hadoop Distribution: Cloudera and Hortonworks.
Programming Languages & Frameworks: Scala, SQL, R, Shell Scripting, Hibernate, JDBC, JSON, HTML, CSS
Script Languages: JavaScript, jQuery, Python.
Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL databases (HBase, MongoDB).
Operating Systems: Linux, Windows, Ubuntu, Unix
Web/Application Servers: Apache Tomcat, WebLogic, WebSphere
Tools: Eclipse, NetBeans
Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL
OLAP/Reporting: SQL Server Analysis Services and Reporting Services.
Cloud Technologies: MS Azure, Amazon Web Services (AWS).
Machine Learning Models: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Principal Component Analysis, Linear Regression, Naïve Bayes.
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, Pataskala, Ohio
Responsibilities:
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Implemented a proof of concept deploying the product in an AWS S3 bucket and Snowflake.
- Utilized AWS services with a focus on big data architecture/analytics/enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
- Developed Scala scripts and RStudio Shiny UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing results back into the S3 bucket.
- Experience in data cleansing and data mining.
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
- Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to the data scientists for further analysis.
- Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Designed and developed Spark workflows using Scala to pull data from AWS S3 and Snowflake and apply transformations to it.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of those transformations.
- Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
- Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
- Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests via Amazon API Gateway.
- Migrated data from an AWS S3 bucket to Snowflake by writing a custom read/write Snowflake utility function in Scala.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake in an AWS S3 bucket.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run Airflow workflows.
- Created DAGs using the Email Operator, Bash Operator, and Spark Livy Operator to execute jobs on an EC2 instance (a minimal DAG sketch follows this list).
- Deployed the code to EMR via CI/CD using Jenkins.
- Extensively used Code cloud for code check-in and checkouts for version control.
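A minimal Airflow 2.x DAG sketch of the daily scheduling described above, limited to the Bash and Email operators named in the bullets; the DAG id, script path, schedule, and email address are hypothetical placeholders.

    # Minimal Airflow 2.x DAG sketch: daily ingestion task plus a notification task.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.email import EmailOperator

    with DAG(
        dag_id="s3_to_snowflake_daily",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Run the ingestion/transformation script (e.g. a spark-submit wrapper).
        ingest = BashOperator(
            task_id="ingest_s3_to_snowflake",
            bash_command="python /opt/etl/s3_to_snowflake.py",
        )

        # Notify the team when the daily load completes.
        notify = EmailOperator(
            task_id="notify_success",
            to="data-team@example.com",
            subject="Daily S3 to Snowflake load complete",
            html_content="The daily ingestion DAG finished successfully.",
        )

        ingest >> notify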
Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code cloud, AWS.
Data Engineer | Analyst
Confidential, Tampa, FL
Responsibilities:
- Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
- Strong experience leading multiple Azure Big Data and data transformation implementations in the Banking and Financial Services, High Tech, and Utilities industries.
- Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML and Power BI.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
- Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
- Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and memory tuning.
- Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning (a minimal streaming sketch follows this list).
- Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables.
- Involved in running all the Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL.
- Collected JSON data from an HTTP source and developed Spark APIs that help perform inserts and updates in Hive tables.
- Used Azure Data Factory, the SQL API, and the MongoDB API, and integrated data from MongoDB, MS SQL, and the cloud (Blob, Azure SQL DB, Cosmos DB).
- Responsible for resolving the issues and troubleshooting related to performance of Hadoop cluster.
- Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and the MLlib libraries.
- Performed all necessary day-to-day Git support for different projects and was responsible for maintaining the Git repositories and access control strategies.
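A minimal PySpark Structured Streaming sketch of the Kafka-to-Parquet streaming ETL pattern described above; the project bullets describe Scala, so this is an illustrative Python equivalent only, with a hypothetical broker, topic, schema, and paths.

    # Minimal streaming ETL sketch: Kafka -> parse JSON -> Parquet files (Snappy).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = (SparkSession.builder
             .appName("streaming-etl-sketch")
             .config("spark.sql.parquet.compression.codec", "snappy")
             .getOrCreate())

    # Hypothetical event schema for the incoming JSON messages.
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read the raw stream from a Kafka topic.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "transactions")
           .load())

    # Parse the JSON payload into typed columns.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    # Append the parsed events to a Parquet location with checkpointing.
    query = (events.writeStream
             .format("parquet")
             .option("path", "/data/warehouse/transactions")
             .option("checkpointLocation", "/data/checkpoints/transactions")
             .outputMode("append")
             .start())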
Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure, ADF, Blob, Cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Oracle, Teradata, Cassandra, MLlib, Tableau, Maven, Git, Jira.
Big Data Engineer
Confidential, New York, NY
Responsibilities:
- Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.
- Has real-time experience with Kafka-Storm on the HDP 2.2 platform for real-time analysis.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Created reports for the BI team using Sqoop to export data into HDFS and Hive.
- Performed analysis on unused user navigation data by loading it into HDFS and writing MapReduce jobs; the analysis provided inputs to the new APM front-end developers and the Lucent team.
- Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
- Created Hive queries to process large sets of structured, semi-structured, and unstructured data and store them in managed and external tables.
- Developed complex HiveQL queries using the JSON SerDe.
- Created HBase tables to load large sets of structured data.
- Involved in importing real-time data into Hadoop using Kafka and implemented Oozie jobs for daily imports.
- Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
- Managed and reviewed Hadoop log files.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Worked on PySpark APIs for data transformations (a minimal sketch follows this list).
- Performed data ingestion into Hadoop (Sqoop imports) and performed validations and consolidations on the imported data.
- Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
- Upgraded current Linux version to RHEL version 5.6
- Expertise in hardening Linux servers and in compiling, building, and installing Apache Server from source with minimal modules.
- Worked on JSON, Parquet, Hadoop File formats.
- Worked on different Java technologies such as Hibernate, Spring, JSP, and Servlets, and developed code for both the server side and the client side of our web application.
- Used GitHub for continuous integration services.
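A minimal PySpark sketch of converting a Hive/SQL aggregation into DataFrame transformations, as described in the bullets above; the table and column names are hypothetical placeholders.

    # Minimal sketch: a Hive aggregation rewritten as PySpark DataFrame transformations.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-spark-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Equivalent Hive query:
    #   SELECT customer_id, SUM(amount) AS total_amount
    #   FROM sales.orders
    #   WHERE order_date >= '2020-01-01'
    #   GROUP BY customer_id;
    orders = spark.table("sales.orders")

    totals = (orders
              .filter(F.col("order_date") >= "2020-01-01")
              .groupBy("customer_id")
              .agg(F.sum("amount").alias("total_amount")))

    # Persist the aggregated result back to a Hive table.
    totals.write.mode("overwrite").saveAsTable("sales.customer_totals")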
Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, GIT, JSON SerDe, Cloudera.
Data Modeler | Analyst
Confidential
Responsibilities:
- Understand the data visualization requirements from the Business Users.
- Wrote SQL queries to extract data from the sales data marts as per the requirements.
- Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
- Designed and deployed rich graphic visualizations with drill-down and drop-down menu options, parameterized using Tableau.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Explored traffic data from databases, connected it with transaction data, and presented as well as wrote reports for every campaign, providing suggestions for future promotions.
- Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis (a minimal extraction sketch follows this list).
- Performed data cleaning, merging, and export of the dataset in Tableau Prep.
- Carried out data processing and cleaning techniques to reduce text noise and dimensionality in order to improve the analysis.
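A minimal sketch of the SQL-to-Python/Excel extraction step described above, using pandas and SQLAlchemy; the connection string, query, table, and output file name are hypothetical placeholders.

    # Minimal extraction sketch: run a SQL query, summarize in Python, export to Excel.
    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical SQL Server DSN-based connection.
    engine = create_engine("mssql+pyodbc://user:pass@sales_dsn")

    query = """
        SELECT campaign_id, region, SUM(revenue) AS revenue
        FROM dbo.campaign_sales
        GROUP BY campaign_id, region
    """

    df = pd.read_sql(query, engine)

    # Quick summary for the campaign report.
    summary = df.groupby("region")["revenue"].sum().sort_values(ascending=False)
    print(summary)

    # Hand the extract to the Excel-based review.
    df.to_excel("campaign_sales_extract.xlsx", index=False)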
Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.
Data Modeler
Confidential
Responsibilities:
- Processed data received from vendors and loaded it into the database. The process was carried out on a weekly basis and reports were delivered on a bi-weekly basis. The extracted data had to be checked for integrity.
- Documented requirements and obtained signoffs.
- Coordinated between the Business users and development team in resolving issues.
- Documented data cleansing and data profiling.
- Wrote SQL scripts to meet the business requirement.
- Analyzed views and produced reports.
- Tested cleansed data for integrity and uniqueness.
- Automated the existing system to achieve faster and more accurate data loading.
- Generated weekly and bi-weekly reports for the client business team using Business Objects and documented them.
- Learned to create Business Process Models.
- Ability to manage multiple projects simultaneously, tracking them toward varying timelines effectively through a combination of business and technical skills.
- Good understanding of clinical practice management, medical and laboratory billing, and insurance claim processing, with process flow diagrams.
- Assisted the QA team in creating test scenarios that cover a day in the life of the patient for inpatient and ambulatory workflows.
Environment: SQL, data profiling, data loading, QA team, Tableau, Python, Machine Learning models.