Data Engineer Resume
Los Angeles, California
SUMMARY
- 6+ years of combined IT experience as a Big Data Engineer, Python Engineer, and Business Analyst, with expertise in designing, developing, and implementing data models for big data applications
- Hands-on experience in developing Spark applications using Spark components such as RDD transformations, Spark Core, Spark Streaming, and Spark SQL
- Experience with big data technologies, tools, and databases including AWS, Azure, SQL, Hadoop, Hive, Pig, Sqoop, HBase, Spark, Cassandra, and Hue
- Excellent knowledge of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance
- Proficient in relational databases like Oracle, MySQL and SQL Server
- Extensive experience in developing Bash, T-SQL, and PL/SQL scripts
- Strong experience working with NoSQL databases and their integration: DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase
- Extensively worked on AWS services such as EC2, S3, EMR, Athena, Lambda, Step Functions, Glue Data Catalog, SNS, RDS (Aurora), Redshift, DynamoDB, and QuickSight, among other services in the AWS family
- Extensive knowledge of the Azure cloud platform (HDInsight, VMs, Blob Storage, Data Lake, Databricks, ADLS Gen2, Azure Data Factory, Synapse, and Data Storage Explorer)
- Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API
- Hands-on experience with Unified Data Analytics on Databricks, the Databricks Workspace user interface, managing Databricks notebooks, Delta Lake with Python, and Delta Lake with Spark SQL
- Expertise in Spark architecture with Databricks and Structured Streaming: setting up AWS and Microsoft Azure with Databricks, Databricks Workspace for business analytics, managing clusters in Databricks, and managing the machine learning lifecycle
- Strong experience in working with ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis
- Extensively used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data, and used DataFrame operations to perform required validations on the data
- Authored complex HiveQL queries for required data extraction from Hive tables and wrote Hive user-defined functions (UDFs) as required
- Knowledge of job workflow scheduling and coordination tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi
- Strong experience and knowledge of real time data analytics using Spark Streaming and Flume
- Extensive experience configuring Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS, and expertise in using Spark SQL with various data sources such as JSON, Parquet, and Hive
- Proficient in Python scripting; worked with statistical functions in NumPy and Pandas for organizing data, and with Matplotlib for visualization
- Designed and developed automation frameworks using Python and shell scripting
- Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets (see the sketch after this summary)
- Experience working with Git and Bitbucket version control systems
- Used a Flume sink to drain data from the Flume channel and deposit it in NoSQL databases such as MongoDB
- Knowledge in using Integrated Development environments like Eclipse, NetBeans, IntelliJ, STS
- Involved in loading data from UNIX file system and FTP to HDFS
- Hands on Experience in using Visualization tools like Tableau, Power BI
- Experienced in working within the SDLC using Agile and Waterfall methodologies
- Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
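As referenced in the summary, a minimal PySpark sketch of converting a HiveQL aggregate query into DataFrame transformations; the `sales` table and its columns (`region`, `amount`, `year`) are hypothetical placeholders, not taken from a specific project.

```python
# Minimal sketch: a HiveQL aggregate rewritten as Spark DataFrame transformations.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("hive-to-dataframe")
    .enableHiveSupport()
    .getOrCreate()
)

# HiveQL equivalent:
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales WHERE year = 2020 GROUP BY region
df = (
    spark.table("sales")
    .filter(F.col("year") == 2020)
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
df.show()
```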
TECHNICAL SKILLS
Hadoop/Big Data Ecosystem: Apache Spark, HDFS, MapReduce, HIVE, Kafka, Sqoop
Programming & Scripting: Python, PySpark, SQL, Scala
NoSQL Databases: MongoDB, DynamoDB
SQL Databases: MS SQL Server, MySQL, Oracle, PostgreSQL
Cloud Computing: AWS, Azure
Operating Systems: Ubuntu (Linux), macOS, Windows 10/8
Reporting: PowerBI, Tableau
Version Control: Git, GitHub, SVN
Methodologies: Agile/Scrum, Rational Unified Process, and Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Los Angeles, California
Data Engineer
Responsibilities:
- Created an end-to-end data pipeline covering data ingestion, curation, and provisioning using AWS cloud services.
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from both batch and streaming sources
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop.
- Ingested data into S3 buckets from different sources, including MySQL, Oracle, MongoDB, and SFTP.
- Proficient in working with AWS services like S3, EC2, EMR, Redshift, Athena, Glue, DynamoDB, RDS, IAM
- Worked with big data services/concepts such as Spark RDDs, the DataFrame API, the Dataset API, the Data Source API, Spark SQL, and Spark Streaming
- Created an EMR cluster on EC2 instances, developed PySpark applications to perform data transformations on it, and stored the results in Redshift
- Configured AWS Identity and Access Management (IAM) Groups and Users for improved login authentication
- Used AWS Lambda with Python scripts to automate loading data in multiple formats (Parquet, JSON, Avro, CSV) from AWS S3 into AWS Redshift (a Lambda sketch follows this list)
- Created an event-driven AWS Glue ETL pipeline using a Lambda function that reads data from the S3 bucket and stores it in Redshift on a daily basis
- Developed Python scripts using the Boto3 library to configure AWS Glue, EC2, S3, and DynamoDB
- Tuned Spark applications by setting the batch interval time, the correct level of parallelism, and memory usage
- Used Spark Streaming APIs to perform transformations and actions on data coming from Kafka in real time and persisted it to AWS S3 (a streaming sketch follows this list)
- Developed a Kafka consumer in Python for consuming data from Kafka topics
- Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering
- Extracted data from HDFS using Hive, performed data analysis using PySpark and Redshift for feature selection, and created nonparametric models in Spark
- Used SQL DDL, DQL, and DML commands for creating, querying, and modifying tables
- Used AWS Redshift, S3, and Athena services to query large amounts of data stored on S3 to create a Virtual Data Lake without having to go through the ETL process
- Developed Oozie coordinators to schedule Hive scripts to create Data pipelines.
- Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.
- Involved in writing custom MapReduce programs using Java API for data processing.
- Designed and Developed ETL Processes in AWS Glue to migrate Campaign data from external sources with file formats like ORC/Parquet/Text Files into AWS S3
- Built import and export jobs to copy data to and from HDFS using Sqoop, and developed Spark code using SQL for faster testing and processing of data
- Developed Sqoop and Kafka Jobs to load data from RDBMS into HDFS and HIVE
- Developed applications primarily in the Linux environment and am familiar with its commands; worked with the Jenkins continuous integration tool for project deployment, using the Git version control system.
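A hedged sketch of the event-driven S3-to-Redshift load referenced above, assuming the Redshift Data API is used to issue a COPY from Lambda; the bucket handling, schema, table, cluster, and IAM role names are hypothetical placeholders, not the actual project configuration.

```python
# Hedged sketch: Lambda handler that submits a Redshift COPY for each object
# landing in S3 (Parquet shown here). All identifiers are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        copy_sql = (
            f"COPY analytics.events FROM 's3://{bucket}/{key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
            "FORMAT AS PARQUET;"
        )
        # Submit the COPY asynchronously through the Redshift Data API
        redshift_data.execute_statement(
            ClusterIdentifier="analytics-cluster",
            Database="dev",
            DbUser="etl_user",
            Sql=copy_sql,
        )
    return {"status": "submitted"}
```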
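A hedged PySpark Structured Streaming sketch of the Kafka-to-S3 flow referenced above; it assumes the spark-sql-kafka package is available, and the broker, topic, and bucket names are hypothetical placeholders.

```python
# Hedged sketch: consume a Kafka topic and persist the raw events to S3 as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
    .option("subscribe", "web-logs")                     # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://my-data-lake/raw/web-logs/")               # placeholder bucket
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/")  # placeholder path
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```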
Environment: Spark, Spark-Streaming, PySpark, Spark SQL, AWS EMR, S3, EC2, Redshift, Athena, Lambda, Glue, DynamoDB, MapReduce, Java, HDFS, Hive, Pig, Apache Kafka, Python, Shell scripting, Linux, MySQL, NoSQL, SOLR, Jenkins, Oracle, Git, Airflow, Tableau, Power BI.
Confidential, Tennessee
Data Engineer
Responsibilities:
- Extensively worked on Azure cloud platform (HDInsight, DataLake, DataBricks, Blob Storage, Data Factory, Synapse, SQL, DWH and Data Storage Explorer)
- Designed and deployed data pipelines using DataLake, DataBricks, and Apache Airflow
- Enabled other teams to work with more complex scenarios and machine learning solutions
- Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations before loading back to Azure Synapse
- Developed Spark Scala functions for mining data to provide real-time insights and reports
- Configured Spark Streaming to receive real-time data from Apache Flume and store the stream data in Azure Table storage using Scala
- Used Data Lake to store data and perform all types of processing and analytics
- Ingested data into Azure Blob storage and processed the data using Databricks. Involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets
- Utilized Spark Streaming API to stream data from various sources. Optimized existing Scala code and improved the cluster performance
- Involved in using Spark DataFrames to create various datasets and applied business transformations and data cleansing operations using Databricks notebooks (a PySpark sketch follows this list)
- Efficient in writing Python scripts to build ETL pipelines and directed acyclic graph (DAG) workflows using Airflow and Apache NiFi (an Airflow sketch also follows this list)
- Distributed tasks across Celery workers to manage communication between multiple services
- Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance
- Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also Worked with Cosmos DB (SQL API and Mongo API)
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data from Snowflake, MS SQL, and MongoDB into HDFS and analyze it
- Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API
- Used Flume sink to write directly to indexers deployed on cluster, allowing indexing during ingestion
- Migrated from Oozie to Apache Airflow. Involved in developing Oozie and Airflow workflows for daily incremental loads, pulling data from source databases (MongoDB, MS SQL)
- Managed resources and scheduling across the cluster using Azure Kubernetes Service. AKS can be used to create, configure, and manage a cluster of Virtual machines
- Extensively used Kubernetes, which makes it possible to handle all the online and batch workloads required to feed analytics and machine learning applications
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication and Apache Ranger for authorization
- Experience tuning Spark applications (batch interval time, level of parallelism, memory) to improve processing time and efficiency
- Used Scala for its strong concurrency support; Scala plays a key role in parallelizing the processing of large datasets
- Developed MapReduce jobs in Scala, which compiles to JVM bytecode, for data processing
- Proficient in utilizing data to build interactive Power BI dashboards and reports based on business requirements.
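A hedged Databricks-style PySpark sketch of the DataFrame cleansing step referenced above, assuming a notebook where `spark` is pre-defined; the storage account, container, and column names are hypothetical placeholders.

```python
# Hedged sketch: read raw CSV from Blob/ADLS, apply simple cleansing, write Delta.
# `spark` is assumed pre-defined, as in a Databricks notebook. Paths are placeholders.
from pyspark.sql import functions as F

raw = (
    spark.read.option("header", "true")
    .csv("wasbs://raw@mystorageaccount.blob.core.windows.net/orders/")
)

cleansed = (
    raw.dropDuplicates(["order_id"])                              # remove duplicate orders
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .filter(F.col("amount").isNotNull())                          # drop rows missing amounts
)

(
    cleansed.write.format("delta")
    .mode("overwrite")
    .save("abfss://curated@mystorageaccount.dfs.core.windows.net/orders/")
)
```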
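A hedged Airflow sketch of a daily extract-transform-load DAG like the ones referenced above; the task callables are stubs, and the DAG id, schedule, and retry settings are hypothetical placeholders.

```python
# Hedged sketch: a small daily Airflow DAG chaining extract -> transform -> load.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    print("extract from Blob storage")   # placeholder task body


def transform(**context):
    print("run Spark transformation")    # placeholder task body


def load(**context):
    print("load curated data to Synapse")  # placeholder task body


with DAG(
    dag_id="daily_incremental_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```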
Environment: Azure HDInsight, DataBricks, DataLake, CosmosDB, MySQL, Snowflake, MongoDB, Teradata, Ambari, Flume, VSTS, Tableau, PowerBI, Azure DevOps, Ranger, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, Yarn), Spark v2.0.2, Airflow, Hive, Sqoop, HBase.
Confidential
Big Data Engineer
Responsibilities:
- Experienced working with Big Data, Data Visualization, Python Development, SQL, and UNIX.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights
- Handled high volume of day-to-day Informatica workflow migrations
- Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed
- Designed and implemented complex ETL data processes using Informatica PowerCenter and advanced SQL queries (analytical functions)
- Created Informatica mappings using various transformations such as Joiner, Aggregator, Expression, Filter, and Update Strategy
- Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear, concise specifications and queries
- Prepared high-level analysis reports using Excel and provided feedback on data quality, including identification of billing patterns and outliers
- Identified and documented data quality limitations that jeopardized the work of internal and external data analysts; wrote standard SQL queries to perform data validation, created Excel summary reports (pivot tables and charts), and gathered analytical data to develop functional requirements using data modeling and ETL tools
- Performed ETL data cleansing, integration, and transformation using Hive and PySpark; responsible for managing data from disparate sources
- Used Spark optimization techniques such as caching/refreshing tables, broadcast variables, coalesce/repartitioning, increasing memory overhead limits, handling parallelism, and modifying Spark default configuration variables for performance tuning (a tuning sketch follows this list)
- Read data from different sources such as CSV files, Excel, HTML pages, and SQL databases, performed data analysis, and wrote the results to targets such as CSV files, Excel, and databases (see the pandas sketch after this list)
- Developed and handled business logic through backend Python code.
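A hedged PySpark sketch of the tuning techniques referenced above (caching a reused DataFrame, broadcasting a small lookup table, coalescing before a write, and overriding a default configuration variable); paths, table names, and columns are hypothetical placeholders.

```python
# Hedged sketch of common Spark tuning techniques. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("tuning-sketch")
    # Override a default configuration variable for this job
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

transactions = spark.read.parquet("/data/transactions")
lookup = spark.read.parquet("/data/merchant_lookup")

# Cache a DataFrame that is reused by several downstream aggregations
transactions.cache()

# Broadcast the small dimension table to avoid a shuffle join
enriched = transactions.join(F.broadcast(lookup), "merchant_id")

# Reduce the number of output files with coalesce before writing
enriched.coalesce(32).write.mode("overwrite").parquet("/data/enriched")
```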
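A hedged pandas sketch of the multi-source read/analyze/write flow referenced above; the file names, URL, connection string, table names, and columns are hypothetical placeholders.

```python
# Hedged sketch: read from CSV, Excel, HTML, and SQL; analyze; write back out.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

csv_df = pd.read_csv("input.csv")
excel_df = pd.read_excel("input.xlsx", sheet_name="Sheet1")
html_tables = pd.read_html("https://example.com/report.html")  # list of DataFrames
sql_df = pd.read_sql("SELECT * FROM sales", engine)

# Simple analysis: combine and summarize (assumes matching columns)
combined = pd.concat([csv_df, excel_df], ignore_index=True)
summary = combined.groupby("region", as_index=False)["amount"].sum()

# Write the results back to different targets
summary.to_csv("summary.csv", index=False)
summary.to_excel("summary.xlsx", index=False)
summary.to_sql("sales_summary", engine, if_exists="replace", index=False)
```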
Environment: Python, UNIX, SQL, ETL, Informatica, Spark, HTML.
Confidential
Business Analyst
Responsibilities:
- Gathered business requirements and designed a roadmap to assess and implement predictive analytics across the team handling Data warehouse projects
- Involved in analyzing and profiling data and identifying data quality issues to deliver MIS reports
- Documented Business Requirements Documents (BRD), System Specifications Documents, Data models, Data flow Diagrams, User stories and Test cases to have a common understanding of the business and technical requirements across the multiple stakeholders involved
- Identified the most valuable potential customers through pre-campaign analysis for promoted products, using a series of consumer analyses (market research, customer segmentation, and profiling) to increase response rates, sales, and profits
- Analyzed the business requirements to develop and maintain data model and data flow diagrams to deconstruct the business needs to the technology team
- Involved in requirements analysis and legacy system data analysis to design and implement ETL jobs using Microsoft SQL Server Integration Services (SSIS)
- Analyzed the data sources from Oracle, SQL Server for design, development, testing, and production rollover of reporting and analysis projects within x Desktop
- Partnered with Development and Quality Assurance teams to ensure the product quality is always intact
- Resolved production support issues using SQL and PL/SQL to prevent any disruption to the business
- Led a team of 3 interns to develop prototypes of predictive models using MS Excel to forecast the data team's future infrastructure costs (Software, Hardware, Development and Operational) for the growing data based on historical data
- Estimated a 5% improvement in operational efficiency with implementation of predictive analytics to estimate future infrastructure costs (Software, Hardware, Development and Operational)
- Analyzed the supply chain management revenue reports to understand the business profits generated from the existing suppliers and sub-suppliers for circuit breaker components.
- Revised and proposed a business solution to change the existing suppliers to improve the marginal profits.
Environment: Python, SQL, ETL, Oracle, SSIS, PL SQL, MS Excel.