Senior Big Data Engineer Resume
Plano, Texas
SUMMARY
- Data Engineering professional with 6+ years of combined experience in Data Engineering, Big Data implementations, and Spark technologies.
- Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
- High exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
- Hands-on experience in Unified Data Analytics with Databricks, including the Databricks Workspace user interface, managing Databricks notebooks, Delta Lake with Python, and Delta Lake with Spark SQL.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Ensured quality standards were defined and met; optimized PySpark jobs to run on Kubernetes clusters for faster data processing.
- Good understanding of Spark architecture with Databricks and Structured Streaming; experienced in setting up Databricks on AWS and Microsoft Azure, using the Databricks Workspace for business analytics, managing clusters in Databricks, and managing the machine learning lifecycle.
- Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive (a minimal sketch of such a job appears after this list).
- Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, with knowledge of Spark MLlib.
- Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
- Solid experience in designing and operationalizing large-scale data and analytics solutions on the Snowflake Data Warehouse.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL.
- Experience extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Implemented Cluster for NoSQL tool HBase as a part of POC to address HBase limitations.
- Strong Knowledge of architecture and components of Spark, and efficient in working with Spark Core.
- Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom UDFs.
- Expertise in writing MapReduce jobs in Python for processing large sets of structured, semi-structured, and unstructured data and storing them in HDFS.
- Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, and Fact and Dimension tables.
- Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
- Hands-on experience working with Amazon Web Services (AWS) using Elastic Map Reduce (EMR), Redshift, and EC2 for data processing.
- Hands-on experience in SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
- Hands-on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
- Experience in data warehousing and business intelligence areas in various domains.
- Created Tableau dashboards over large data volumes sourced from SQL Server.
- Extracted, transformed, and loaded (ETL) source data into the respective target tables to build the required data marts.
- Active involvement in all Scrum ceremonies (Sprint Planning, Daily Scrum, Sprint Review, and Retrospective) and assisted the Product Owner in creating and prioritizing user stories.
- Strong experience in working with UNIX/LINUX environments and writing shell scripts.
- Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML Files.
- Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently or in a team and to follow the best practices and principles defined for the team.
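Below is a minimal PySpark sketch of the kind of end-to-end processing job referenced in the summary above. It is illustrative only: the input path, column names, and Hive table name are hypothetical placeholders.

```python
# Minimal PySpark sketch (illustrative only): read raw files, aggregate,
# and write the result to a Hive table. Paths, columns, and table names
# are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily-aggregation")
    .enableHiveSupport()
    .getOrCreate()
)

# Read semi-structured JSON input (placeholder path).
events = spark.read.json("hdfs:///data/raw/events/")

# Aggregate with DataFrame functions, similar to a Hive GROUP BY.
daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("event_count"))
)

# Persist the result as a Hive table for downstream analysis.
daily_counts.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")
```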
TECHNICAL SKILLS
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
Hadoop Distributions: Cloudera and Hortonworks.
Programming Languages & Technologies: Scala, SQL, R, Shell Scripting, Hibernate, JDBC, JSON, HTML, CSS
Script Languages: JavaScript, jQuery, Python.
Databases: Oracle, SQL Server, MySQL, Teradata, PostgreSQL, MS Access, Snowflake; NoSQL databases: HBase, Cassandra, MongoDB.
Operating Systems: Linux, Windows, Ubuntu, Unix
Web/Application Servers: Apache Tomcat, WebLogic, WebSphere
Tools: Eclipse, NetBeans
Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL
OLAP/Reporting: SQL Server Analysis Services and Reporting Services.
Cloud Technologies: GCP, MS Azure, Amazon Web Services (AWS).
Machine Learning Models: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Principal Component Analysis, Linear Regression, Naïve Bayes.
PROFESSIONAL EXPERIENCE
Confidential, Plano, Texas
Senior Big Data Engineer
Responsibilities:
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Implemented a proof of concept deploying the product in an AWS S3 bucket and Snowflake.
- Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful, valuable information for better decision-making.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing back into the S3 bucket.
- Experience in data cleansing and data mining.
- Wrote, compiled, and executed programs as necessary using Spark in Scala to perform ETL jobs with ingested data.
- Provided workload estimates to the client, developed a framework for Behavior-Driven Development (BDD), and migrated on-premises Informatica ETL processes to the AWS cloud and Snowflake.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation and used Spark engine, and Spark SQL for data analysis and provided them to the data scientists for further analysis.
- Prepared scripts in Python and Scala as needed to automate the ingestion process from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Designed and Developed Spark workflows using Scala for data pull from AWS S3 bucket and Snowflake applying transformations on it.
- Implemented PySpark RDD transformations to map business logic and applied actions on top of the transformations.
- Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
- Created scripts in Python to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake (a sketch of this load pattern appears after this list).
- Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests via Amazon API Gateway (see the Lambda handler sketch after this list).
- Migrated data from the AWS S3 bucket to Snowflake by writing a custom read/write snowflake utility function using Scala.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs, and created a PySpark data frame to bring data from DB2 to Amazon S3.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake AWS S3 bucket.
- Profiled structured, unstructured, and semi-structured data across various sources to identify patterns, and implemented data quality metrics using queries or Python scripts based on the source.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
- Created DAGs using the EmailOperator, BashOperator, and Spark Livy operator to execute on the EC2 instance (a minimal DAG sketch appears after this list).
- Deploy the code to EMR via CI/CD using Jenkins.
- Extensively used Code Cloud for code check-ins and checkouts for version control.
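A hedged sketch of the S3-to-Snowflake load pattern referenced above, assuming the Spark-Snowflake connector and S3 filesystem libraries are available on the cluster; the bucket name, table name, and connection options are hypothetical placeholders, and credentials would come from a secrets manager rather than being hard-coded.

```python
# Illustrative sketch of the S3-to-Snowflake load pattern (not production code).
# Bucket names, credentials handling, and Snowflake connection options are
# hypothetical; the Spark-Snowflake connector must be on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-snowflake").getOrCreate()

# Read a Parquet dataset from S3 (CSV/JSON reads follow the same pattern).
orders = spark.read.parquet("s3a://example-bucket/curated/orders/")

# Assumed Snowflake connection options for the Spark-Snowflake connector.
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "LOAD_WH",
    "sfUser": "etl_user",
    "sfPassword": "********",  # in practice, pulled from a secrets manager
}

# Write the DataFrame into a Snowflake table.
(
    orders.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "ORDERS")
    .mode("overwrite")
    .save()
)
```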
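A minimal sketch of an S3-triggered AWS Lambda handler like the ones referenced above; the downstream `process_payload` helper is a hypothetical placeholder.

```python
# Hedged sketch of an S3-triggered AWS Lambda handler (bucket/key handling only).
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each S3 event can carry multiple records.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Fetch the newly created object and hand it to downstream processing.
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read()
        process_payload(payload)  # hypothetical downstream step

    return {"status": "ok"}

def process_payload(payload: bytes) -> None:
    """Placeholder for validation / load logic."""
    pass
```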
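A minimal Airflow DAG sketch showing the BashOperator and EmailOperator usage referenced above. Airflow 2.x import paths are assumed; the command, schedule, and email address are placeholders, and the Spark Livy step is represented here by a plain spark-submit command.

```python
# Minimal Airflow DAG sketch with BashOperator and EmailOperator
# (Airflow 2.x import paths assumed; commands and addresses are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

with DAG(
    dag_id="daily_s3_to_snowflake",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off the Spark job (placeholder command; the actual pipeline
    # submitted via a Spark Livy operator on the EC2 instance).
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit /opt/jobs/s3_to_snowflake.py",
    )

    # Notify the team once the load completes.
    notify_team = EmailOperator(
        task_id="notify_team",
        to="data-team@example.com",
        subject="Daily S3-to-Snowflake load finished",
        html_content="The daily load DAG completed successfully.",
    )

    run_spark_job >> notify_team
```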
Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code cloud.
Confidential - Chicago, IL
Lead Data Engineer
Responsibilities:
- Involved in Requirement gathering, Business Analysis, Design and Development, testing, and implementation of business rules.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Understood business use cases and integration requirements; wrote business and technical requirements documents, logic diagrams, process flow charts, and other application-related documents.
- Design and develop ETL integration patterns using Python on Spark.
- Translated business requirements into maintainable software components and understood their technical and business impact.
- Used Pandas in Python for data cleansing and validation of the source data (a cleansing sketch appears after this list).
- Designed and developed ETL pipeline in Azure cloud which gets customer data from API and processes it to Azure SQL DB.
- Orchestrated all Data pipelines using Azure Data Factory and built a custom alerts platform for monitoring.
- Created custom alert queries in Log Analytics and used Webhook actions to automate custom alerts.
- Created Databricks job workflows that extract data from SQL Server and upload the files to SFTP using PySpark and Python (see the sketch after this list).
- Used Azure Key Vault as a central repository for maintaining secrets and referenced them in Azure Data Factory and in Databricks notebooks.
- Built Teradata ELT frameworks which ingest data from different sources using Teradata Legacy load utilities.
- Built a common SFTP download/upload framework using Azure Data Factory and Databricks; maintained and supported the Teradata architectural environment for EDW applications.
- Involved in the full lifecycle of projects, including requirement gathering, system designing, application development, enhancement, deployment, maintenance, and support
- Involved in logical modeling, physical database design, data sourcing, data transformation, data loading, SQL, and performance tuning.
- Provided project development estimates to the business and, upon agreement, delivered the project accordingly; created proper Teradata Primary Indexes (PI), taking into consideration both the planned access of data and the even distribution of data across all available AMPs.
- Considering both the business requirements and these factors, created appropriate Teradata NUSIs for fast and easy access to data.
- Developed data extraction, transformation, and loading jobs from flat files, Oracle, SAP, and Teradata sources into Teradata using BTEQ, FastLoad, FastExport, MultiLoad, and stored procedures.
- Designed process-oriented UNIX scripts and ETL processes for loading data into the data warehouse; developed mappings in Informatica to load data from various sources into the data warehouse using transformations such as Source Qualifier, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
- Worked on advanced Informatica concepts, including implementation of Informatica Pushdown Optimization and pipeline partitioning.
- Performed bulk data load from multiple data sources (ORACLE 8i, legacy systems) to TERADATA RDBMS using BTEQ, MultiLoad, and FastLoad.
- Used various transformations such as Source Qualifier, Aggregator, Lookup, Filter, Sequence Generator, Router, Update Strategy, Expression, Sorter, Normalizer, Stored Procedure, and Union.
- Used Informatica Power Exchange to handle the change data capture (CDC) data from the source and load into Data Mart by following the slowly changing dimensions (SCD) Type II process.
- Used PowerCenter Workflow Manager to create workflows and sessions, and used various tasks such as Command, Event Wait, Event Raise, and Email.
- Designed, created, and tuned physical database objects (tables, views, indexes, PPI, UPI, NUPI, and USI) to support normalized and dimensional models.
- Created a cleanup process for removing all the Intermediate temp files that were used before the loading process.
- Used volatile tables and derived queries for breaking up complex queries into simpler queries.
- Responsible for performance monitoring, resource and priority management, space management, user management, index management, access control, and executing disaster recovery procedures.
- Used Python and Shell scripts to Automate Teradata ELT and Admin activities.
- Performed application-level DBA activities such as creating tables and indexes, and monitored and tuned Teradata BTEQ scripts using the Teradata Visual Explain utility.
- Performance tuning, monitoring, UNIX shell scripting, and physical and logical database design.
- Developed UNIX scripts to automate different tasks involved as part of the loading process.
- Worked on Tableau software for reporting needs.
- Created several Tableau dashboard reports and heat map charts, and supported numerous dashboards, pie charts, and heat map charts built on the Teradata database.
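A hedged Pandas cleansing and validation sketch of the kind referenced above; the column names and rules are hypothetical placeholders.

```python
# Hedged pandas cleansing/validation sketch (columns and rules are placeholders).
import pandas as pd

def cleanse_customers(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Normalize column names and trim whitespace in string fields.
    df.columns = [c.strip().lower() for c in df.columns]
    df["customer_name"] = df["customer_name"].str.strip()

    # Drop exact duplicates and rows missing the business key.
    df = df.drop_duplicates().dropna(subset=["customer_id"])

    # Coerce types and flag invalid dates for review instead of dropping them.
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df["is_valid_date"] = df["signup_date"].notna()

    return df
```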
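An illustrative Databricks notebook sketch of the SQL Server extract and SFTP upload flow referenced above. Hostnames, secret scope and key names, and table names are hypothetical; `spark` and `dbutils` are provided by the Databricks runtime, and the Key Vault-backed secret scope is assumed to exist.

```python
# Illustrative Databricks notebook sketch: extract from SQL Server over JDBC
# and upload the file to an SFTP server. Names below are placeholders.
import paramiko

# Secrets referenced from an Azure Key Vault-backed secret scope.
jdbc_password = dbutils.secrets.get(scope="kv-scope", key="sqlserver-password")
sftp_password = dbutils.secrets.get(scope="kv-scope", key="sftp-password")

# Read the source table from SQL Server.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://example-sql.database.windows.net:1433;database=sales")
    .option("dbtable", "dbo.daily_orders")
    .option("user", "etl_user")
    .option("password", jdbc_password)
    .load()
)

# Land a single CSV file on DBFS (small extracts only).
local_path = "/dbfs/tmp/daily_orders.csv"
df.toPandas().to_csv(local_path, index=False)

# Upload the file over SFTP.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="etl_user", password=sftp_password)
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put(local_path, "/incoming/daily_orders.csv")
sftp.close()
transport.close()
```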
Environment: Spark Streaming, Hive, Scala, Hadoop, Kafka, Spark, Sqoop, Docker, Spark SQL, TDD, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper, Azure, Unix/Linux Shell Scripting, Python, PyCharm, Informatica, Informatica PowerCenter, Linux, Shell Scripting.
Confidential
Data Analyst
Responsibilities:
- Understand the data visualization requirements of the Business Users.
- Writing SQL queries to extract data from the Sales data marts as per the requirements.
- Developed Tableau data visualization using Scatter Plots, Geographic maps, Pie Charts, Bar Charts, and Density charts.
- Designed and deployed rich Graphic visualizations with Drill Down and Drop-down menu options and Parameterized using Tableau.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Explored traffic data from databases, connected it with transaction data, and presented and wrote reports for every campaign, providing suggestions for future promotions.
- Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis (see the sketch after this list).
- Performed data cleaning, merging, and export of the dataset in Tableau Prep.
- Carried out data processing and cleaning techniques to reduce text noise and dimensionality and to improve the analysis.
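A hedged sketch of the SQL-to-Excel/Python extraction step referenced above; the connection string, query, and output path are placeholders (requires pandas, SQLAlchemy, and openpyxl).

```python
# Hedged sketch of the SQL extraction step (connection string, query, and
# output path are placeholders).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://analyst@sales-db/SalesDM?driver=ODBC+Driver+17+for+SQL+Server"
)

query = """
    SELECT campaign_id, region, SUM(sales_amount) AS total_sales
    FROM fact_sales
    GROUP BY campaign_id, region
"""

# Pull the result set into a DataFrame for further analysis in Python.
sales = pd.read_sql(query, engine)

# Hand the same data to Excel for the business users.
sales.to_excel("campaign_sales_summary.xlsx", index=False)
```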
Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.
Confidential
Data Analyst
Responsibilities:
- Processed data received from vendors and loaded them into the database.
- The process was carried out every week and reports were delivered on a bi-weekly basis.
- The extracted data had to be checked for integrity.
- Documented requirements and obtained sign-offs.
- Coordinated between the Business users and development team in resolving issues.
- Documented data cleansing and data profiling.
- Wrote SQL scripts to meet business requirements.
- Analyzed views and produced reports.
- Tested cleansed data for integrity and uniqueness.
- Automated the existing system to achieve faster and more accurate data loading.
- Generated weekly and bi-weekly reports for the client business team using BusinessObjects, and documented them.
- Learned to create Business Process Models.
- Managed multiple projects simultaneously, tracking them toward varying timelines effectively through a combination of business and technical skills.
- Good understanding of clinical practice management, medical and laboratory billing, and insurance claims processing, with process flow diagrams.
- Assisted the QA team in creating test scenarios that cover a day in the life of the patient for Inpatient and Ambulatory workflows.
Environment: SQL, data profiling, data loading, QA team, Tableau, Python, Machine Learning models.