Sr. Data Engineer Resume
New York, NY
PROFESSIONAL SUMMARY
- 8+ years of professional IT experience in designing, developing, and analyzing big data solutions in Spark, Hadoop, Pig, and HDFS environments, along with experience in Python.
- Highly experienced in importing and exporting data between HDFS and Relational Systems like MySQL and Teradata using Sqoop.
- Knowledge of the big data database HBase and the NoSQL databases MongoDB and Cassandra.
- Hands-on scripting experience in Python and Linux/UNIX shell.
- Thorough understanding of core big data components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and the MapReduce programming paradigm.
- Comfortable working with Agile, Scrum, and Waterfall methodologies; used Git for version control.
- Expertise with Hadoop ecosystem tools including Pig, Hive, HDFS, MapReduce, Sqoop, Spark, Kafka, YARN, Oozie, and ZooKeeper, as well as Hadoop architecture and its components.
- Experience with AWS services such as S3, Athena, Redshift, Redshift Spectrum, EMR, Glue, Data Pipeline, Step Functions, CloudWatch, SNS, and CloudFormation.
- Experience in Agile Methodologies and extensively used Jira for Sprints and issue tracking.
- Experience in creating Spark clusters and configuring high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Experience in working with Map Reduce Programs, Pig Scripts and Hive commands to deliver the best results.
- Expertise in writing Apache Spark Streaming API on Big Data distribution in the active cluster environment.
- Developed distributed data processing applications using the Spark RDD and DataFrame APIs (a minimal sketch follows this summary).
- Good understanding and exposure to Python programming.
- Extensive knowledge of Amazon Web Services (AWS) including EC2, S3, Elastic MapReduce (EMR), Redshift, Identity and Access Management (IAM), Data Pipeline, DynamoDB, WorkSpaces, RDS, SNS, and SQS, as well as Snowflake.
- Good experience with Azure cloud components such as HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB.
- Experience building highly reliable, scalable big data solutions on the Cloudera, Hortonworks, and AWS Hadoop distributions. Hands-on experience with Amazon Web Services (AWS) cloud services such as EC2, S3, EBS, RDS, and VPC.
- Proficient in managing the entire data science project life cycle and actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling, dimensionality reduction using Principal Component Analysis, and testing and validation using ROC plots, K-fold cross validation, and data visualization.
- Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
- Experience using various packages in R and Python, such as scikit-learn, ggplot2, caret, dplyr, plyr, pandas, numpy, seaborn, scipy, matplotlib, Beautiful Soup, and Rpy2.
- Experience building data pipelines using Azure Data Factory and Azure Databricks, loading data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
- Determined, committed and hardworking individual with strong communication, interpersonal and organizational skills.
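For context on the Spark RDD and DataFrame API experience noted above, the following is a minimal PySpark sketch; the HDFS path, application name, and column handling are illustrative assumptions, not taken from any specific project.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD API: low-level transformations on raw text records (hypothetical HDFS path).
rdd = spark.sparkContext.textFile("hdfs:///data/events/*.txt")
counts_rdd = (rdd.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

# DataFrame API: the same word count expressed as a declarative, optimizer-friendly plan.
df = spark.read.text("hdfs:///data/events/*.txt")
counts_df = (df.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
               .groupBy("word")
               .count())

counts_df.show(10)
spark.stop()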
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, New York, NY
Responsibilities:
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Working extensively on Hive, SQL, Scala, Spark, and Shell.
- Developed a data pipeline using Kafka to store data into HDFS.
- Built real-time streaming data pipelines with Kafka, Spark streaming and Cassandra.
- Built ETL pipelines that ingest and transform huge volumes of data from different source systems.
- Built data lake in Hadoop cluster from various RDBMS sources (using SQOOP/Hive).
- Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster and developed simple to complex MapReduce streaming jobs in Python, implemented using Hive and Pig.
- Responsible for the planning and execution of big data analytics, predictive analytics, and machine learning initiatives.
- Implemented a proof of concept deploying the product on Amazon Web Services (AWS).
- Implemented solutions for ingesting data from various sources and processing the Data-at-Rest utilizing Big Data technologies using Hadoop, MapReduce, HBase, Hive and Cloud Architecture.
- Designed and implemented distributed data processing pipelines using Apache Spark, Hive, Python, Airflow DAGs and other tools and languages in Hadoop Ecosystem.
- Developed solutions for importing/exporting data from Teradata and Oracle to HDFS and S3, and from S3 to Snowflake.
- Worked in a Hadoop ecosystem implementation/administration, installing software patches along with system upgrades and configuration.
- Worked on Agile methodology & used Git for version control & Jira for project management, tracking issues and bugs.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (a minimal sketch follows this list).
- Conducted performance tuning of Hadoop clusters while monitoring and managing Hadoop cluster job performance, capacity forecasting, and security.
- Automated data movements using Python scripts. Involved in splitting, validating, and processing of files.
- Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm, and search technologies such as Elasticsearch.
- Developed Oozie workflow jobs to execute Hive, Sqoop, and MapReduce actions.
- Used Pig UDF's in Python and used sampling of large data sets.
- Used Jupyter Notebook and Spark-Shell to develop, test, and analyze Spark jobs before scheduling the customized Active Batch jobs.
- Performed data analysis and data quality check using Apache Spark Machine learning libraries in Python.
- Analyzed the data model document to identify the source systems and transformations required to ingest the data into the data lake.
- Integrated applications using Apache Tomcat servers on EC2 instances and automated data pipelines into AWS using Jenkins, Git, Maven, and Artifactory.
- Developed and ran serverless Spark-based applications using the AWS Lambda service and PySpark to compute metrics for various business requirements.
- Implemented Amazon Redshift, Spectrum and Glue for the migration of the Fact and Dimensions tables to the Production environment.
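The following is a minimal PySpark/Spark SQL sketch of the kind of multi-format extraction, transformation, and aggregation described above; the S3 paths, schema, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Extract: read the same logical data from two hypothetical file formats.
json_events = spark.read.json("s3://example-bucket/raw/events/json/")
csv_events = spark.read.option("header", "true").csv("s3://example-bucket/raw/events/csv/")

# Transform: align the schemas and union the two sources.
events = (json_events.select("user_id", "event_type", "event_ts")
          .unionByName(csv_events.select("user_id", "event_type", "event_ts")))

# Aggregate with Spark SQL to summarize usage patterns per user.
events.createOrReplaceTempView("events")
usage = spark.sql("""
    SELECT user_id, event_type, COUNT(*) AS event_count
    FROM events
    GROUP BY user_id, event_type
""")

# Load: write the aggregate back out as partitioned Parquet.
usage.write.mode("overwrite").partitionBy("event_type").parquet("s3://example-bucket/curated/usage/")
spark.stop()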
Environment: Python, Hadoop, Apache Kafka, Pig, PySpark, Hive, Sqoop, Scala, Spark, Oozie, HBase, AWS, Redshift, Glue, Athena, EC2, EMR, S3, MapReduce, Jupyter Notebook, Git
Data Engineer
Confidential, Charlotte, NC
Responsibilities:
- Extensive knowledge of and hands-on experience in architecting and designing data warehouses/databases, data modeling, and building SQL objects such as tables, views, user-defined/table-valued functions, stored procedures, triggers, and indexes.
- Created HBase tables from Hive and wrote HiveQL statements to access HBase table data.
- Developed complex Hive scripts for processing the data and created dynamic partitions and bucketing in Hive to improve query performance.
- Developed MapReduce applications using Hadoop Map-Reduce programming framework for processing and used compression techniques to optimize MapReduce Jobs.
- Developed Pig UDFs to analyze customer behavior and Pig Latin scripts for processing the data in Hadoop.
- Used Struts tag libraries and custom tag libraries extensively while coding JSP pages.
- Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.
- Developed Oozie actions such as Hive, shell, and Java to submit and schedule applications to run in the Hadoop cluster.
- Experienced in building data warehouses on the Azure platform using Azure Databricks and Data Factory.
- Worked with production support team to provide necessary support for issues with CDH cluster and the data ingestion.
- Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, and standardization, and then applied transformations as per the use cases.
- Worked in Azure environment for development and deployment of Custom Hadoop Applications.
- Used QlikView to create custom reports, charts, and bookmarks for data analysis.
- Worked with Azure cloud platform services such as HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Synapse, SQL DB, and SQL DWH. Architected, designed, and developed business applications and data marts for reporting.
- Designed and implemented scalable cloud data and analytics solutions for various public and private cloud platforms using Azure.
- Involved in migrating Spark Jobs from Qubole to Databricks.
- Designed and implemented distributed data processing pipelines using Apache Spark, Hive, Python, Airflow DAGs, and other tools and languages in the Hadoop ecosystem.
- Designed, developed, and deployed data pipelines for moving data across various systems.
- Developed solutions for importing/exporting data from Teradata and Oracle to HDFS and S3, and from S3 to Snowflake.
- Resolved Spark and YARN resource management issues, including shuffle issues, out-of-memory errors, heap space errors, and schema compatibility problems.
- Monitored and troubleshot application performance, took corrective action in case of failures, and evaluated possible enhancements to meet SLAs.
- Imported and exported data using Sqoop between HDFS and relational databases (Oracle and Netezza).
- Responsible for estimating the cluster size, monitoring and troubleshooting of the Spark Databricks cluster.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and RDDs (see the sketch after this list).
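As an illustration of converting a Hive/SQL query into Spark DataFrame transformations, a minimal PySpark sketch follows; the table and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive-to-dataframe").enableHiveSupport().getOrCreate()

# Original-style HiveQL (hypothetical table):
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales.orders
#   WHERE order_date >= '2020-01-01'
#   GROUP BY region;

# Equivalent Spark DataFrame transformations:
totals = (spark.table("sales.orders")
          .filter(F.col("order_date") >= "2020-01-01")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount")))

totals.show()
spark.stop()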
Environment: Hadoop, Sqoop, MapReduce, SQL, Teradata, Snowflake, Hive, Pig, Azure, Databricks, Kafka, Azure Data Factory, Glue, AWS, HBase, Apache, Informatica.
Data Engineer
Confidential, Irving, TX
Responsibilities:
- Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of structured and unstructured batch and real-time streaming data using Python programming.
- Worked on building data warehouse structures and creating facts, dimensions, and aggregate tables through dimensional modeling with Star and Snowflake schemas.
- Applied transformations on the data loaded into Spark DataFrames and performed in-memory data computation to generate the output response.
- Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Used Spark Data Frames API over platforms to perform analytics on Hive data and used Spark Data Frame operations to perform required validations in the data.
- Hands-on experience developing UDFs, DataFrames, and SQL queries in Spark SQL.
- Created and modified existing data ingestion pipelines using Kafka and Sqoop to ingest database tables and streaming data into HDFS for analysis.
- Finalized naming standards for data elements and ETL jobs and created a data dictionary for metadata management.
- Worked on developing ETL workflows on the data obtained using Python for processing it in HDFS and HBase using Flume.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (a minimal DAG sketch follows this list).
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop and Zookeeper.
- Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Monitored Dataproc clusters and jobs using the GCP Console and Stackdriver dashboards, performed tuning and optimization of memory-intensive jobs, and provided L3 support for applications in the production environment.
- Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
- Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
- Used AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Maintained AWS Data Pipeline as a web service to process and move data between Amazon S3, Amazon EMR, and Amazon RDS resources.
- Performed data cleaning, pre-processing, and modeling using Spark and Python.
- Implemented real-time, data-driven, secured REST APIs for data consumption using AWS (Lambda, API Gateway, Route 53, Certificate Manager, CloudWatch, Kinesis), Swagger, Okta, and Snowflake.
- Developed automation scripts to transfer data from on-premises clusters to Google Cloud Platform (GCP).
- Loaded file data from the ADLS server to Google Cloud Platform (GCP) buckets and created Hive tables for the end users.
- Involved in performance tuning and optimization of long-running Spark jobs and queries (Hive/SQL).
- Implemented Real-time streaming of AWS CloudWatch Logs to Splunk using Kinesis Firehose.
- Developed AWS CloudWatch Dashboards for monitoring API Performance.
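A minimal Airflow DAG sketch for the kind of GCP ETL pipeline described above (see the bullet on Airflow operators); the DAG id, bucket name, schedule, and task logic are hypothetical, and GCP-specific operators from the Google provider package could replace the generic operators used here.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def validate_load(**context):
    # Placeholder validation step; real logic would query BigQuery or inspect GCS objects.
    print("validating load for run", context["ds"])


with DAG(
    dag_id="example_gcs_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Stage extracted files into a hypothetical GCS bucket.
    stage_files = BashOperator(
        task_id="stage_files",
        bash_command="gsutil cp /tmp/extract/*.csv gs://example-bucket/staging/",
    )

    validate = PythonOperator(
        task_id="validate_load",
        python_callable=validate_load,
    )

    stage_files >> validate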
Environment: Python, Spark, Flume, HDFS, HBase, Hive, Pig, Sqoop, Scala, Zookeeper, Snowflake, EC2, EMR, S3, Google Cloud Platform, CloudWatch.
Data Analyst/Engineer
Confidential
Responsibilities:
- Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
- Converted raw data to serialized and columnar formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency across the network (a minimal sketch follows this list).
- Worked on building end to end data pipelines on Hadoop Data Platforms.
- Worked on normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
- Designed, developed, and tested Extract, Transform, Load (ETL) applications with different types of sources.
- Created files and tuned SQL queries in Hive using Hue. Implemented MapReduce jobs in Hive by querying the available data.
- Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Involved in converting HiveQL into Spark transformations using Spark RDD and through Scala programming.
- Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
- Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
- Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions.
- Provided support for the cluster and topics through Kafka Manager; handled CloudFormation scripting, security, and resource automation.
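A minimal PySpark sketch of converting raw delimited files to Parquet, as mentioned in the first bullet of this list; the input path, header/schema options, and compression codec are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

# Read raw CSV files (hypothetical HDFS path), inferring the schema for brevity.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///data/raw/transactions/"))

# Writing to a compressed, columnar format reduces processing time and the amount
# of data moved over the network for downstream jobs.
(raw.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("hdfs:///data/curated/transactions_parquet/"))

spark.stop()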
Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, Web Services, Linux RedHat, Unix.
Data Analyst/Engineer
Confidential
Responsibilities:
- Responsible for data identification, collection, exploration, cleaning for modeling.
- Involved in creating database solutions, evaluating requirements, preparing design reports, and migrating data from legacy systems to new solutions.
- Supported analytical platform, handled data quality, and improved the performance using Scala’s higher order functions, lambda expressions, pattern matching and collections.
- Compared data at the leaf level across various databases when data transformation or data loading took place, analyzing data quality after these loads to check for data loss or corruption.
- Built S3 buckets, managed policies for S3 buckets, and used S3 and Glacier for storage and backup on AWS (a minimal sketch follows this list).
- Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders.
- Worked with various database administrators, operations teams, and analysts to secure easy access to data.
- Built and maintained data in HDFS by identifying structural and installation solutions.
- Analyzed structural requirements for new applications to be sourced.
- Created new ETL workflows and maintained existing ETL workflows, data management, and data query components.
- Designed, developed, and orchestrated data pipelines for real-time and batch data processing using AWS Redshift.
- Performed Exploratory Data Analysis and Data visualizations using Python and Tableau.
- Prepared Test Plan to ensure QA and Development phases are in parallel.
- Used report form maps to import data into Configuration Management in ServiceNow.
- Maintained the product catalog to import configuration item records in ServiceNow.
- Worked with stakeholders to communicate campaign results, strategy, issues or needs.
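A minimal boto3 sketch of the S3/Glacier setup referenced above; the bucket name, region, prefix, and lifecycle rule are hypothetical.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "example-analytics-backup"

# Create the bucket (us-east-1 needs no CreateBucketConfiguration).
s3.create_bucket(Bucket=bucket)

# Transition objects under backup/ to Glacier after 30 days for low-cost archival.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-backups",
                "Filter": {"Prefix": "backup/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)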
Environment: Python, HDFS, MapReduce, Kafka, Hive, HBase, ETL, Web Services, Linux RedHat, Unix.
TECHNICAL SKILLS
Hadoop Ecosystem: HDFS, SQL, YARN, Pig Latin, MapReduce, Hive, Sqoop, Spark, Zookeeper, Oozie, Kafka, Storm, Flume
Programming Languages: Python, PySpark, JavaScript, Shell Scripting
Big Data Platforms: Hortonworks, Cloudera
AWS Platform: EC2, S3, EMR, Redshift, DynamoDB, Aurora, VPC, Glue, Kinesis, Boto3
Operating Systems: Linux, Windows, UNIX
Databases: Netezza, MySQL, UDB, HBase, MongoDB, Cassandra, Snowflake
Development Methods: Agile/Scrum, Waterfall
IDEs: PyCharm, IntelliJ, Ambari, Jupyter Notebook
Data Visualization: Tableau, BO Reports, Splunk, Microsoft SQL Server