Data Engineer Resume
Los Angeles, California
SUMMARY
- 6+ years of combined IT experience as a Big Data Engineer, Python Engineer, and Business Analyst, with expertise in designing, developing, and implementing data models for big data applications
- Hands-on experience in developing Spark applications using Spark components such as RDD transformations, Spark Core, Spark Streaming, and Spark SQL
- Experience with big data technologies, tools, and databases including AWS, Azure, SQL, Hadoop, Hive, Pig, Sqoop, HBase, Spark, Cassandra, and Hue
- Excellent knowledge of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance
- Proficient in relational databases like Oracle, MySQL and SQL Server
- Extensive experience in developing Bash, T-SQL, and PL/SQL scripts
- Strong experience working with NoSQL databases and their integration: DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase
- Extensively worked on AWS services such as EC2, S3, EMR, Athena, Lambda, Step Functions, Glue Data Catalog, SNS, RDS (Aurora), Redshift, DynamoDB, and QuickSight, among other services in the AWS family
- Extensive knowledge of the Azure cloud platform (HDInsight, VMs, Blob Storage, Data Lake, Databricks, ADLS Gen2, Azure Data Factory, Synapse, and Data Storage Explorer)
- Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API
- Hands-on experience with Unified Data Analytics on Databricks, the Databricks Workspace user interface, managing Databricks notebooks, Delta Lake with Python, and Delta Lake with Spark SQL
- Expertise in Spark architecture with Databricks and Structured Streaming: setting up AWS and Microsoft Azure with Databricks, Databricks Workspace for business analytics, managing clusters in Databricks, and managing the machine learning lifecycle
- Strong experience in working with ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis
- Extensively used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data, and used DataFrame operations to perform required validations on the data
- Authored complex HiveQL queries for required data extraction from Hive tables and wrote Hive user-defined functions (UDFs) as required
- Knowledge of job workflow scheduling and coordination tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi
- Strong experience and knowledge of real time data analytics using Spark Streaming and Flume
- Extensive experience configuring Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS, and expertise in using Spark SQL with various data sources such as JSON, Parquet, and Hive
- Proficient in Python scripting; worked with statistical functions in NumPy and Pandas for organizing data, and with Matplotlib for visualization
- Designed and developed automation frameworks using Python and shell scripting
- Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets (see the sketch after this summary)
- Experience working with Git and Bitbucket version control systems
- Used a Flume sink to drain data from the Flume channel and deposit it in NoSQL databases such as MongoDB
- Knowledge in using Integrated Development environments like Eclipse, NetBeans, IntelliJ, STS
- Involved in loading data from UNIX file system and FTP to HDFS
- Hands on Experience in using Visualization tools like Tableau, Power BI
- Experienced in working within the SDLC using Agile and Waterfall methodologies
- Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
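As referenced in the summary, a minimal PySpark sketch of converting a HiveQL aggregate query into DataFrame transformations; the `sales` table and its columns (`region`, `amount`, `year`) are hypothetical placeholders, not taken from a specific project.

```python
# Minimal sketch: a HiveQL aggregate rewritten as Spark DataFrame transformations.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("hive-to-dataframe")
    .enableHiveSupport()
    .getOrCreate()
)

# HiveQL equivalent:
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales WHERE year = 2020 GROUP BY region
df = (
    spark.table("sales")
    .filter(F.col("year") == 2020)
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
df.show()
```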
TECHNICAL SKILLS
Hadoop/Big Data Ecosystem: Apache Spark, HDFS, MapReduce, HIVE, Kafka, Sqoop
Programming & Scripting: Python, PySpark, SQL, Scala
NoSQL Databases: MongoDB, DynamoDB
SQL Databases: MS SQL Server, MySQL, Oracle, PostgreSQL
Cloud Computing: AWS, Azure
Operating Systems: Ubuntu (Linux), macOS, Windows 10/8
Reporting: PowerBI, Tableau
Version Control: Git, GitHub, SVN
Methodologies: Agile/Scrum, Rational Unified Process, and Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Los Angeles, California
Data Engineer
Responsibilities:
- Created an end-to-end data pipeline covering data ingestion, curation, and provisioning using AWS cloud services.
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from both batch and streaming sources
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop.
- Ingested data into S3 buckets from different sources, including MySQL, Oracle, MongoDB, and SFTP.
- Proficient in working with AWS services like S3, EC2, EMR, Redshift, Athena, Glue, DynamoDB, RDS, IAM
- Worked with big data services/concepts such as Spark RDDs, the DataFrame API, the Dataset API, the Data Source API, Spark SQL, and Spark Streaming
- Created an EMR cluster on EC2 instances, developed PySpark applications to perform data transformations on it, and stored the results in Redshift
- Configured AWS Identity and Access Management (IAM) Groups and Users for improved login authentication
- Used AWS Lambda with Python scripts to automate loading data in multiple formats (Parquet, JSON, Avro, CSV) from AWS S3 into AWS Redshift (a Lambda sketch follows this list)
- Created an event-driven AWS Glue ETL pipeline using a Lambda function that reads data from the S3 bucket and stores it in Redshift on a daily basis
- Developed Python scripts using the Boto3 library to configure AWS Glue, EC2, S3, and DynamoDB
- Tuned Spark applications by setting the batch interval time, the correct level of parallelism, and memory usage
- Used Spark Streaming APIs to perform transformations and actions on data coming from Kafka in real time and persisted it to AWS S3 (a streaming sketch follows this list)
- Developed a Kafka consumer in Python for consuming data from Kafka topics
- Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering
- Extracted data from HDFS using Hive, performed data analysis using PySpark and Redshift for feature selection, and created nonparametric models in Spark
- Used SQL DDL, DQL, and DML commands for creating, querying, and modifying tables
- Used AWS Redshift, S3, and Athena services to query large amounts of data stored on S3 to create a Virtual Data Lake without having to go through the ETL process
- Developed Oozie coordinators to schedule Hive scripts to create Data pipelines.
- Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.
- Involved in writing custom MapReduce programs using Java API for data processing.
- Designed and Developed ETL Processes in AWS Glue to migrate Campaign data from external sources with file formats like ORC/Parquet/Text Files into AWS S3
- Built import and export jobs to copy data to and from HDFS using Sqoop, and developed Spark code using SQL for faster testing and processing of data
- Developed Sqoop and Kafka Jobs to load data from RDBMS into HDFS and HIVE
- Developed applications primarily in the Linux environment and am familiar with its commands; worked with the Jenkins continuous integration tool for project deployment, using the Git version control system.
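A hedged sketch of the event-driven S3-to-Redshift load referenced above, assuming the Redshift Data API is used to issue a COPY from Lambda; the bucket handling, schema, table, cluster, and IAM role names are hypothetical placeholders, not the actual project configuration.

```python
# Hedged sketch: Lambda handler that submits a Redshift COPY for each object
# landing in S3 (Parquet shown here). All identifiers are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        copy_sql = (
            f"COPY analytics.events FROM 's3://{bucket}/{key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
            "FORMAT AS PARQUET;"
        )
        # Submit the COPY asynchronously through the Redshift Data API
        redshift_data.execute_statement(
            ClusterIdentifier="analytics-cluster",
            Database="dev",
            DbUser="etl_user",
            Sql=copy_sql,
        )
    return {"status": "submitted"}
```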
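A hedged PySpark Structured Streaming sketch of the Kafka-to-S3 flow referenced above; it assumes the spark-sql-kafka package is available, and the broker, topic, and bucket names are hypothetical placeholders.

```python
# Hedged sketch: consume a Kafka topic and persist the raw events to S3 as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
    .option("subscribe", "web-logs")                     # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://my-data-lake/raw/web-logs/")               # placeholder bucket
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/")  # placeholder path
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```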
Environment: Spark, Spark-Streaming, PySpark, Spark SQL, AWS EMR, S3, EC2, Redshift, Athena, Lambda, Glue, DynamoDB, MapReduce, Java, HDFS, Hive, Pig, Apache Kafka, Python, Shell scripting, Linux, MySQL, NoSQL, SOLR, Jenkins, Oracle, Git, Airflow, Tableau, Power BI.
Confidential, Tennessee
Data Engineer
Responsibilities:
- Extensively worked on Azure cloud platform (HDInsight, DataLake, DataBricks, Blob Storage, Data Factory, Synapse, SQL, DWH and Data Storage Explorer)
- Designed and deployed data pipelines using DataLake, DataBricks, and Apache Airflow
- Enabled other teams to work with more complex scenarios and machine learning solutions
- Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations before loading back to Azure Synapse
- Developed Spark Scala functions for mining data to provide real-time insights and reports
- Configured Spark Streaming to receive real-time data from Apache Flume and store the stream data in Azure Table storage using Scala
- Used Data Lake to store data and perform all types of processing and analytics
- Ingested data into Azure Blob storage and processed the data using Databricks. Involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets
- Utilized Spark Streaming API to stream data from various sources. Optimized existing Scala code and improved the cluster performance
- Involved in using Spark DataFrames to create various datasets and applied business transformations and data cleansing operations using Databricks notebooks (a PySpark sketch follows this list)
- Efficient in writing Python scripts to build ETL pipelines and directed acyclic graph (DAG) workflows using Airflow and Apache NiFi (an Airflow sketch also follows this list)
- Distributed tasks across Celery workers to manage communication between multiple services
- Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance
- Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also Worked with Cosmos DB (SQL API and Mongo API)
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data from Snowflake, MS SQL, and MongoDB into HDFS and analyze it
- Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API
- Used Flume sink to write directly to indexers deployed on cluster, allowing indexing during ingestion
- Migrated from Oozie to Apache Airflow. Involved in developing Oozie and Airflow workflows for daily incremental loads, pulling data from source databases (MongoDB, MS SQL)
- Managed resources and scheduling across the cluster using Azure Kubernetes Service. AKS can be used to create, configure, and manage a cluster of Virtual machines
- Extensively used Kubernetes, which makes it possible to handle all the online and batch workloads required to feed analytics and machine learning applications
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication and Apache Ranger for authorization
- Experience tuning Spark applications (batch interval time, level of parallelism, memory) to improve processing time and efficiency
- Used Scala for its strong concurrency support; Scala plays a key role in parallelizing the processing of large datasets
- Developed MapReduce jobs in Scala, which compiles to JVM bytecode, for data processing
- Proficient in utilizing data to build interactive Power BI dashboards and reports based on business requirements.
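A hedged Databricks-style PySpark sketch of the DataFrame cleansing step referenced above, assuming a notebook where `spark` is pre-defined; the storage account, container, and column names are hypothetical placeholders.

```python
# Hedged sketch: read raw CSV from Blob/ADLS, apply simple cleansing, write Delta.
# `spark` is assumed pre-defined, as in a Databricks notebook. Paths are placeholders.
from pyspark.sql import functions as F

raw = (
    spark.read.option("header", "true")
    .csv("wasbs://raw@mystorageaccount.blob.core.windows.net/orders/")
)

cleansed = (
    raw.dropDuplicates(["order_id"])                              # remove duplicate orders
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .filter(F.col("amount").isNotNull())                          # drop rows missing amounts
)

(
    cleansed.write.format("delta")
    .mode("overwrite")
    .save("abfss://curated@mystorageaccount.dfs.core.windows.net/orders/")
)
```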
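A hedged Airflow sketch of a daily extract-transform-load DAG like the ones referenced above; the task callables are stubs, and the DAG id, schedule, and retry settings are hypothetical placeholders.

```python
# Hedged sketch: a small daily Airflow DAG chaining extract -> transform -> load.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    print("extract from Blob storage")   # placeholder task body


def transform(**context):
    print("run Spark transformation")    # placeholder task body


def load(**context):
    print("load curated data to Synapse")  # placeholder task body


with DAG(
    dag_id="daily_incremental_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```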
Environment: Azure HDInsight, DataBricks, DataLake, CosmosDB, MySQL, Snowflake, MongoDB, Teradata, Ambari, Flume, VSTS, Tableau, PowerBI, Azure DevOps, Ranger, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, Yarn), Spark v2.0.2, Airflow, Hive, Sqoop, HBase.
Confidential
Big Data Engineer
Responsibilities:
- Experienced working with Big Data, Data Visualization, Python Development, SQL, and UNIX.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights
- Handled high volume of day-to-day Informatica workflow migrations
- Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed
- Designed and implemented complex ETL data processes using Informatica PowerCenter and advanced SQL queries (analytical functions)
- Created Informatica mappings using various transformations such as Joiner, Aggregator, Expression, Filter, and Update Strategy
- Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear, concise specifications and queries
- Prepared high-level analysis reports using Excel and provided feedback on data quality, including identification of billing patterns and outliers
- Identified and documented data quality limitations that jeopardized the work of internal and external data analysts; wrote standard SQL queries to perform data validation, created Excel summary reports (pivot tables and charts), and gathered analytical data to develop functional requirements using data modeling and ETL tools
- Performed ETL data cleansing, integration, and transformation using Hive and PySpark; responsible for managing data from disparate sources
- Used Spark optimization techniques such as caching/refreshing tables, broadcast variables, coalesce/repartitioning, increasing memory overhead limits, handling parallelism, and modifying Spark default configuration variables for performance tuning (a tuning sketch follows this list)
- Read data from different sources such as CSV files, Excel, HTML pages, and SQL databases, performed data analysis, and wrote the results to targets such as CSV files, Excel, and databases (see the pandas sketch after this list)
- Developed and handled business logic through backend Python code.
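A hedged PySpark sketch of the tuning techniques referenced above (caching a reused DataFrame, broadcasting a small lookup table, coalescing before a write, and overriding a default configuration variable); paths, table names, and columns are hypothetical placeholders.

```python
# Hedged sketch of common Spark tuning techniques. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("tuning-sketch")
    # Override a default configuration variable for this job
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

transactions = spark.read.parquet("/data/transactions")
lookup = spark.read.parquet("/data/merchant_lookup")

# Cache a DataFrame that is reused by several downstream aggregations
transactions.cache()

# Broadcast the small dimension table to avoid a shuffle join
enriched = transactions.join(F.broadcast(lookup), "merchant_id")

# Reduce the number of output files with coalesce before writing
enriched.coalesce(32).write.mode("overwrite").parquet("/data/enriched")
```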
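A hedged pandas sketch of the multi-source read/analyze/write flow referenced above; the file names, URL, connection string, table names, and columns are hypothetical placeholders.

```python
# Hedged sketch: read from CSV, Excel, HTML, and SQL; analyze; write back out.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

csv_df = pd.read_csv("input.csv")
excel_df = pd.read_excel("input.xlsx", sheet_name="Sheet1")
html_tables = pd.read_html("https://example.com/report.html")  # list of DataFrames
sql_df = pd.read_sql("SELECT * FROM sales", engine)

# Simple analysis: combine and summarize (assumes matching columns)
combined = pd.concat([csv_df, excel_df], ignore_index=True)
summary = combined.groupby("region", as_index=False)["amount"].sum()

# Write the results back to different targets
summary.to_csv("summary.csv", index=False)
summary.to_excel("summary.xlsx", index=False)
summary.to_sql("sales_summary", engine, if_exists="replace", index=False)
```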
Environment: Python, UNIX, SQL, ETL, Informatica, Spark, HTML.
Confidential
Business Analyst
Responsibilities:
- Gathered business requirements and designed a roadmap to assess and implement predictive analytics across the team handling Data warehouse projects
- Involved in analyzing and profiling data and identifying data quality issues to deliver MIS reports
- Documented Business Requirements Documents (BRD), System Specifications Documents, Data models, Data flow Diagrams, User stories and Test cases to have a common understanding of the business and technical requirements across the multiple stakeholders involved
- Identified the most valuable potential customers through pre-campaign analysis for promoted products, using a series of consumer analyses (market research, customer segmentation, and profiling) to increase response rates, sales, and profits
- Analyzed the business requirements to develop and maintain data model and data flow diagrams to deconstruct the business needs to the technology team
- Involved in requirements analysis and legacy system data analysis to design and implement ETL jobs using Microsoft SQL Server Integration Services (SSIS)
- Analyzed the data sources from Oracle, SQL Server for design, development, testing, and production rollover of reporting and analysis projects within x Desktop
- Partnered with Development and Quality Assurance teams to ensure the product quality is always intact
- Resolved production support issues using SQL and PL/SQL to prevent any disruption to the business
- Led a team of 3 interns to develop prototypes of predictive models using MS Excel to forecast the data team's future infrastructure costs (Software, Hardware, Development and Operational) for the growing data based on historical data
- Estimated a 5% improvement in operational efficiency with implementation of predictive analytics to estimate future infrastructure costs (Software, Hardware, Development and Operational)
- Analyzed the supply chain management revenue reports to understand the business profits generated from the existing suppliers and sub-suppliers for circuit breaker components.
- Revised and proposed a business solution to change the existing suppliers to improve the marginal profits.
Environment: Python, SQL, ETL, Oracle, SSIS, PL SQL, MS Excel.