- Data Engineering professional with 8+ years of combined experience in data engineering, Big Data implementations, and Spark technologies.
- Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, YARN, Oozie, and Zookeeper.
- High exposure to Big Data technologies and the Hadoop ecosystem; in-depth understanding of MapReduce and Hadoop infrastructure.
- Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
- Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, and knowledge of Spark MLlib.
- Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computation.
- Solid experience in and understanding of designing and operationalizing large-scale data and analytics solutions on Snowflake Data Warehouse.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL.
- Experience in extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Implemented an HBase cluster as part of a POC to address HBase limitations.
- Strong knowledge of the architecture and components of Spark, and efficient in working with Spark Core.
- Strong knowledge of Hive analytical functions and of extending Hive functionality by writing custom UDFs.
- Expertise in writing MapReduce jobs in Python to process large structured, semi-structured, and unstructured data sets and store them in HDFS.
- Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
- Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
- Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
- Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
- Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
- Experience in data warehousing and business intelligence across various domains.
- Created Tableau dashboards over large data volumes sourced from SQL Server.
- Extracted, transformed, and loaded (ETL) source data into the respective target tables to build the required data marts.
- Actively involved in all Scrum ceremonies (Sprint Planning, Daily Scrum, Sprint Review, and Retrospective meetings) and assisted the Product Owner in creating and prioritizing user stories.
- Strong experience in working with UNIX/LINUX environments, writing shell scripts.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files.
- Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
Hadoop Distribution: Cloudera and Hortonworks.
Programming Languages: Scala, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting
Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL databases (HBase, MongoDB).
Operating Systems: Linux, Windows, Ubuntu, Unix
Web/Application Server: Apache Tomcat, WebLogic, WebSphere
Tools: Eclipse, NetBeans
Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL
OLAP/Reporting: SQL Server Analysis Services and Reporting Services.
Cloud Technologies: MS Azure, Amazon Web Services (AWS).
Machine Learning Models: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Principal Component Analysis, Linear Regression, Naïve Bayes.
Confidential, Charlotte, NC
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Implemented a proof of concept deploying this product in an AWS S3 bucket and Snowflake.
- Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing results back to an S3 bucket.
- Experience in data cleansing and data mining.
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing the results to the data scientists for further analysis.
- Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Designed and developed Spark workflows using Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
- Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
- Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests through Amazon API Gateway.
- Migrated data from an AWS S3 bucket to Snowflake by writing a custom read/write Snowflake utility function in Scala.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake in the AWS S3 bucket.
- Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the data, and implemented data quality metrics using queries or Python scripts as appropriate for each source.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
- Created DAGs using the EmailOperator, BashOperator, and Spark Livy operator to execute jobs on an EC2 instance.
- Deployed the code to EMR via CI/CD using Jenkins.
- Extensively used Code cloud for code check-in and checkouts for version control.
Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code cloud, AWS.
Confidential, Rochester, MN
- Worked on designing and developing the real-time Tax Computation Engine using Oracle, StreamSets, Kafka, Spark Structured Streaming, and MySQL.
- Implemented Spark applications in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Involved in ingestion, transformation, manipulation, and computation of data using StreamSets, Kafka, MySQL, and Spark.
- Involved in data ingestion into MySQL using a Kafka-MySQL pipeline for full and incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
- Worked on Spark data sources, Spark DataFrames, Spark SQL, and Streaming using Scala.
- Worked extensively on AWS components such as Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), and Simple Storage Service (S3).
- Experience in developing Spark applications using Scala and SBT.
- Experience in integrating the Spark-MySQL connector and JDBC connector to save data processed in Spark to MySQL.
- Responsible for creating tables and automated MySQL pipelines that load data into tables from Kafka topics.
- Performed a POC to compare the time taken for Change Data Capture (CDC) of Oracle data across Stream, StreamSets, and Dbvisit.
- Expertise in using different file formats such as text files, CSV, Parquet, and JSON.
- Experience in writing custom compute functions using Spark SQL and performing interactive querying.
- Responsible for masking and encrypting sensitive data on the fly.
- Responsible for creating multiple applications for reading data from different Oracle instances into Kafka topics using Stream.
- Responsible for setting up a MySQL cluster on an AWS EC2 instance.
- Experience in real-time streaming of data using Spark with Kafka.
- Responsible for creating a Kafka cluster using multiple brokers.
- Experience working with Vagrant boxes to set up local Kafka and StreamSets pipelines.
Environment: Spark 2.2, Scala, Linux, MySQL 5.8, Kafka 1.0, Stream, StreamSets, Spark SQL, Spark Structured Streaming, AWS EC2, EMR, IntelliJ, SBT, Git, Vagrant.
Big Data Engineer
- Extensively involved in the installation and configuration of the Cloudera Hadoop distribution.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.
- Real-time experience with Kafka-Storm on the HDP 2.2 platform for real-time analysis.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Created reports for the BI team using Sqoop to export data into HDFS and Hive.
- Performed analysis on the unused user navigation data by loading into HDFS and writing MapReduce jobs. The analysis provided inputs to the new APM front end developers and lucent team.
- Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
- Created Hive queries to process large sets of structured, semi-structured, and unstructured data and stored the results in managed and external tables.
- Developed complex HiveQL queries using the JSON SerDe.
- Created HBase tables to load large sets of structured data.
- Involved in importing real-time data into Hadoop using Kafka and implemented the Oozie job for daily imports.
- Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
- Managed and reviewed Hadoop log files.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Worked on PySpark APIs for data transformations.
- Performed data ingestion into Hadoop (Sqoop imports) and ran validations and consolidations on the imported data.
- Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
- Upgraded the current Linux version to RHEL 5.6.
- Expertise in hardening Linux servers and in compiling, building, and installing Apache Server from source with minimal modules.
- Worked on JSON, Parquet, Hadoop File formats.
- Worked on different Java technologies such as Hibernate, Spring, JSP, and Servlets, and developed both server-side and client-side code for our web application.
- Used GitHub for continuous integration services.
Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, Git, JSON SerDe, Cloudera.
- Gathered the data visualization requirements from business users.
- Wrote SQL queries to extract data from the Sales data marts as per the requirements.
- Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
- Designed and deployed rich graphic visualizations with drill-down and drop-down menu options and parameters using Tableau.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Explored traffic data from databases, connected it with transaction data, and presented and wrote a report for every campaign, providing suggestions for future promotions.
- Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis.
- Performed data cleaning, merging, and exporting of the dataset in Tableau Prep.
- Applied data processing and cleaning techniques to reduce text noise and dimensionality in order to improve the analysis.
Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.
- Processed data received from vendors and loaded it into the database. The process was carried out weekly, and reports were delivered bi-weekly. The extracted data was checked for integrity.
- Documented requirements and obtained signoffs.
- Coordinated between the Business users and development team in resolving issues.
- Documented data cleansing and data profiling.
- Wrote SQL scripts to meet the business requirement.
- Analyzed views and produced reports.
- Tested cleansed data for integrity and uniqueness.
- Automated the existing system to achieve faster and more accurate data loading.
- Generated weekly and bi-weekly reports for the client business team using Business Objects and documented them.
- Learned to create Business Process Models.
- Ability to manage multiple projects simultaneously, tracking them effectively against varying timelines through a combination of business and technical skills.
- Good understanding of clinical practice management, medical and laboratory billing, and insurance claim processing with process flow diagrams.
- Assisted QA team in creating test scenarios that cover a day in the life of the patient for Inpatient and Ambulatory workflows.
Environment: SQL, data profiling, data loading, QA team, Tableau, Python, Machine Learning models.