Big Data Engineer Resume
New York City, NY
SUMMARY
- Over 7 years of experience as a Data Engineer in the IT industry, with a solid understanding of Big Data architecture, cloud services, and data modeling.
- Proficient in optimizing data pipelines using Big Data tools such as Hadoop and Spark, and in creating ETL pipelines on AWS and Azure to develop various business intelligence applications.
- Experienced with multi-node, highly available Hadoop cluster environments, handling Hadoop environment builds on Hortonworks Data Platform, Cloudera, and Amazon EMR.
- In-depth understanding of Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, NameNode, DataNode, Secondary NameNode, and MapReduce concepts.
- Performed grouping operations to analyze data using Apache Pig on Hadoop clusters; managed incoming data from different sources, processing and sorting it before loading into the Hadoop file system.
- Designed and implemented Hive queries and functions for evaluating, filtering, loading, and storing data into target tables.
- Wrote complex HiveQL queries to extract required data from Hive tables and developed Hive user-defined functions (UDFs) as needed.
- Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs in the Apache Hadoop environment.
- Experience in importing and exporting data with Sqoop between HDFS and relational database systems.
- Experienced in writing complex SQL queries, PL/SQL stored procedures, functions, packages, triggers, and materialized views, as well as data integration and performance tuning.
- Designed a Python package that was used within a large ETL process to move data from an existing Oracle database into a new PostgreSQL database.
- Extensive experience working with NoSQL databases, including DynamoDB, Cosmos DB, MongoDB, and Cassandra.
- Good knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming.
- Expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing and stream processing.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Designed and developed a POC in Spark using Scala to compare the performance of Spark against Hive and SQL/Oracle.
- Developed various Spark applications using Scala to perform cleansing, transformation, and enrichment of clickstream data.
- Hands-on experience in creating real-time data streaming solutions using Apache Spark Core, Spark SQL, and DataFrames.
- Developed Spark-Streaming applications to consume the data from Kafka topics and to insert the processed streams to HBase.
- Extensive knowledge in working with AWS EC2, S3, EMR, RDS, Athena, Glue, Redshift, DynamoDB and other services of the AWS ecosystem.
- Extensive knowledge in working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL DB, Virtual Machines, DWH & Data Storage Explorer).
- Good experience in programming languages such as Python and Scala, with strong proficiency in Python scripting; worked with statistical functions in NumPy, visualization in Matplotlib, and data organization in Pandas.
- Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
- Experienced in UNIX Shell Scripting for automating deployments and other routine tasks.
- Developed various Shell and Python scripts to automate Spark jobs and Hive scripts.
- Experience in designing, developing, and implementing Extraction, Transformation, and Load (ETL) techniques on multiple database platforms and operating system environments.
- Worked with Project Management tools such as JIRA and Confluence.
- Capable of working with SDLC, Agile, Kanban, and Waterfall methodologies.
- Experience working with Git and Bitbucket version control systems.
- Extensive experience working in Test-Driven Development and Agile/Scrum environments.
- Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
TECHNICAL SKILLS
Big Data/Hadoop Ecosystem: HDFS, MapReduce, Hive, Apache Spark, YARN, Sqoop, Oozie, Zookeeper, Kafka, Flume, Pig
Programming Languages: Python, Scala, R, SQL, PL/SQL, Linux Shell Scripts, C
Cloud Computing: AWS, Azure
NoSQL Database: HBase, Cassandra, MongoDB, DynamoDB
Database: MySQL, MS SQL Server, Oracle, PostgreSQL
Version Control: GIT, Bitbucket
Operating Systems: Linux, Unix, Windows, Mac OS X
Distributed Platforms: Cloudera, Hortonworks, MapR
Other tools: Databricks, MS Visual Studio, IntelliJ
PROFESSIONAL EXPERIENCE
Confidential, New York City, NY
Big Data Engineer
Responsibilities:
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data based on business requirements to uncover insights into customer usage patterns.
- Worked on several components of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL DB, DWH, and Data Storage Explorer).
- Ingested real-time data using Kafka into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL DW) and processed the real-time data in Azure Databricks.
- Created a process for schedule-driven batch data ingestion into Azure by triggering events, specifically targeting the marketing data on a daily basis.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches of data by using Spark Streaming to perform streaming analytics in Databricks.
- Ingested data into Azure Blob Storage, processed the data using Databricks, and wrote Spark Scala scripts and UDFs to perform transformations on large datasets.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.
- Extracted, transformed, and loaded data from RDBMS into Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Used Spark DataFrames to create various datasets and applied business transformations and data cleansing operations using Databricks notebooks.
- Created data pipelines for different events from Azure Blob Storage into Hive external tables and utilized various Hive optimization techniques such as partitioning, bucketing, and map joins.
- Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
- Worked with Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load the data into Azure Synapse.
- Created Airflow scheduling scripts in Python to automate the process of Sqooping a wide range of data sets.
- Developed Python scripts to build ETL pipeline and Directed Acyclic Graph (DAG) workflows using Apache Airflow.
- Developed Python code for different tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using the Airflow tool.
- Performed migration from Oozie to Apache Airflow for faster and more efficient scheduling.
- Verified JSON schema changes of source files and checked for duplicate files in the source location by creating a query parser script in Python.
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data from Snowflake, MS SQL, and MongoDB into HDFS.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in ADLS Gen2 using Scala (a PySpark sketch of a comparable streaming job follows this section).
- Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering roles.
- Utilized Spark Streaming API to stream data from various sources and optimized existing Scala code and improved the cluster performance for quicker response time.
- Responsible for estimating the cluster size, monitoring, and troubleshooting the Spark Databricks cluster.
- Created complex SQL queries and scripts to extract and aggregate data to validate the accuracy of the data and business requirements by gathering and translating them into clear and concise specifications and queries.
- Performed processing using Scala, whose strong concurrency support plays a key role in parallelizing the processing of large data sets.
- Developed MapReduce jobs using Scala, compiling the program code into bytecode for the JVM for data processing.
- Participated with cross-functional teams in Technical Architecture Documents, project design, and implementation-through-deployment discussions.
- Organized and managed real time data for interactive Power BI dashboards and reporting purposes based on business requirements from the clients.
Environment: Azure Data Factory, Blob Storage, Data Storage Explorer, Synapse, Azure SQL DW, Azure HDInsight, Databricks, ADLS Gen 2, CosmosDB, ADF, MySQL, Snowflake, MongoDB, Cassandra, Python, Scala, Hadoop 2.x (HDFS, MapReduce, Yarn), Spark v2.0.2, RDD, Kafka, PySpark, Airflow, DAG, Hive, Sqoop, HBase, Tableau, Power BI.
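The Kafka-to-ADLS streaming work above was done in Scala on Databricks; the following is a minimal PySpark sketch of a comparable Structured Streaming job. The broker, topic, schema, and storage-account names are hypothetical, and it assumes the Spark-Kafka connector package is available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-adls-sketch").getOrCreate()

# Hypothetical clickstream event schema, used only for illustration
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

# Read the raw stream from a Kafka topic (broker and topic names are placeholders)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "clickstream")
       .option("startingOffsets", "latest")
       .load())

# Parse the Kafka value bytes as JSON into typed columns
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Append micro-batches as Parquet files to an ADLS Gen2 container (placeholder account)
query = (events.writeStream
         .format("parquet")
         .option("path", "abfss://curated@examplestorage.dfs.core.windows.net/clickstream/")
         .option("checkpointLocation",
                 "abfss://curated@examplestorage.dfs.core.windows.net/_checkpoints/clickstream/")
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```

The checkpoint location is what lets the job recover its Kafka offsets and resume cleanly across restarts.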
Confidential, Wilmington, MA
Data Engineer
Responsibilities:
- Developed a roadmap for migrating enterprise data from multiple data sources such as MySQL Server and Postgres into S3, which serves as a centralized data hub across the organization.
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Migrated an existing on-premises RDBMS to AWS and used AWS services such as EC2 and S3 for processing and storing small data sets.
- Loaded and transformed large sets of structured and semi-structured data from various downstream systems.
- Used AWS Redshift and Athena to query large amounts of data stored in S3 and create a virtual data lake without having to go through an ETL process.
- Configured AWS Glue, S3, and DynamoDB using the Python Boto3 SDK (see the sketch after this section).
- Extracted the data from HDFS using Hive and performed data analysis using PySpark, Redshift for feature selection and created non-parametric models in Spark.
- Designed and built multiple Elastic MapReduce (EMR) clusters on the AWS cloud for Hadoop MapReduce applications, enabling multiple proofs of concept.
- Created import and export data jobs to copy data to and from HDFS using Sqoop, and developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
- Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive, developed a Kafka consumer API in Python using Kafka Connect, and utilized AWS MSK for consuming data from Kafka topics.
- Created SchemaRDDs, loaded them into Hive tables, and handled structured data using Spark SQL.
- Performed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading it into target data destinations.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model that gets data from Kafka in real time and persists it to Cassandra.
- Worked on ETL pipeline to source the tables and to deliver the calculated ratio data from AWS to Datamart (SQL Server).
- Created AWS data pipelines using various AWS resources, including AWS API Gateway to receive responses from AWS Lambda; used Lambda functions to retrieve data from Snowflake and convert the responses into JSON format, with Snowflake and DynamoDB as the databases.
- Loaded data into S3 buckets using PySpark and AWS Glue, filtered data stored in S3 buckets using Elasticsearch, and loaded data into Hive external tables.
- Performed tuning of Spark Applications to set batch interval time and correct level of Parallelism and memory tuning.
- Used AWS Lambda to automate reading datasets (Parquet, Avro) from AWS S3 into AWS RDS.
- Shipped Linux logs into AWS CloudWatch by creating an IAM role and attaching it to the instance, then configured the file on the instance to push the logs and monitored them.
- Configured AWS Identity and Access Management (IAM) Groups and Users for improved login authentication.
- Setup Hadoop and Spark cluster for various POCs, specifically to load the cookie level data and real-time streaming.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
- Deployed the project into Jenkins using the Git version control system and worked on application development, especially in the Linux environment.
- Generated various kinds of reports using Power BI and Tableau based on client specifications.
Environment: Spark, Spark SQL, Spark Streaming, AWS EMR, S3, EC2, Redshift, Athena, Lambda, Glue, DynamoDB, MapR, Snowflake, HDFS, Hive, Pig, Kafka, Sqoop, Python, PySpark, Scala, Shell scripting, Linux, NoSQL, Jenkins, Oracle, Git, Docker, Oozie, Tableau, Power BI, Cassandra.
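A minimal Boto3 sketch of the Glue/S3/DynamoDB configuration pattern mentioned above. The bucket, Glue job, and table names are hypothetical, and AWS credentials are assumed to come from the environment.

```python
import boto3

# Hypothetical resource names used for illustration only
RAW_BUCKET = "example-analytics-raw"
GLUE_JOB = "example-daily-curation-job"
AUDIT_TABLE = "example_pipeline_audit"

s3 = boto3.client("s3")
glue = boto3.client("glue")
dynamodb = boto3.resource("dynamodb")

# Land an extracted file in the raw S3 bucket
s3.upload_file("extract.parquet", RAW_BUCKET, "landing/2021/01/01/extract.parquet")

# Trigger the Glue ETL job that curates the landed data
run = glue.start_job_run(JobName=GLUE_JOB)

# Record an audit entry for the run in DynamoDB
dynamodb.Table(AUDIT_TABLE).put_item(Item={
    "run_id": run["JobRunId"],
    "source_key": "landing/2021/01/01/extract.parquet",
    "status": "STARTED",
})
print("Started Glue run:", run["JobRunId"])
```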
Confidential
Junior Data Engineer
Responsibilities:
- Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization.
- Understood the current production state of applications and determined the impact of new implementations on existing business processes.
- Designed and configured Azure Cloud relational servers and databases, analyzing current and future business requirements.
- Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
- Configured input and output bindings of Azure Functions with an Azure Cosmos DB collection to read and write data from the container whenever the function executes.
- Developed end-to-end ETL batch and streaming data integration into Hadoop (MapR), transforming the data.
- Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data from Snowflake, MS SQL, and MongoDB into HDFS.
- Created Hive tables with dynamic and static partitioning, including buckets for efficiency, and created external tables in Hive for staging purposes (see the partitioned-table sketch after this section).
- Loaded Hive tables with data, wrote Hive queries that run on MapReduce, and created customized BI tools for the teams that perform query analytics using HiveQL.
- Created and provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
- Built data pipelines to load data from SQL Server to Azure databases using Azure Data Factory, API Gateway services, SSIS packages, and Python scripts.
- Created data pipelines for different events from Azure Blob Storage into Hive external tables, using various Hive optimization techniques such as partitioning, bucketing, and map joins.
- Developed a Flume ETL job to handle data from an HTTP source with HDFS as the sink, and configured the data pipelining.
- Designed and deployed data pipelines using Data Lake, Databricks, and Apache Nifi.
- Utilized Oozie Workflows to orchestrate all the flows on the cluster.
- Developed automated job flows that run MapReduce jobs internally and ran them through Oozie daily and on demand.
- Worked on PowerShell scripts to automate the creation of Azure resources such as resource groups, web applications, Azure Storage blobs and tables, and firewall rules.
- Scored the model outputs for attributes of interest as defined by business stakeholders and applied business rules for thresholding the outputs using Hive and Spark.
- Used Python (NumPy, SciPy, pandas, scikit-learn, and NLTK) to develop a variety of models and algorithms for analytic purposes.
- Created Application Interface Document for the downstream to create a new interface to transfer and receive the files through Azure Data Share.
- Developed Elastic pool databases and scheduled Elastic jobs to execute T-SQL procedures.
Environment: Azure HDInsight, Databricks, CosmosDB, Azure SQL DB, Snowflake, MongoDB, Cassandra, Teradata, Flume, PowerBI, Azure Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, Yarn), Spark v2.0.2, PySpark, Airflow, Hive, Sqoop, HBase, Oozie.
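A minimal PySpark/HiveQL sketch of the staging pattern described above: an external, date-partitioned Hive table backed by blob storage, plus a static-partition load. The database, table, and storage paths are hypothetical, and bucketing/map-join tuning is left to separate HiveQL settings.

```python
from pyspark.sql import SparkSession

# Hive support lets Spark create and load the external staging table
spark = (SparkSession.builder
         .appName("hive-staging-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical external, date-partitioned staging table over blob storage
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.events (
        event_id STRING,
        user_id  STRING,
        payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    LOCATION 'wasbs://events@examplestorage.blob.core.windows.net/staging/events'
""")

# Read one day of raw JSON blobs and load it into the matching static partition
daily = spark.read.json("wasbs://events@examplestorage.blob.core.windows.net/raw/2019-06-01/")
daily.createOrReplaceTempView("daily_events")

spark.sql("""
    INSERT OVERWRITE TABLE staging.events PARTITION (event_date = '2019-06-01')
    SELECT event_id, user_id, payload
    FROM daily_events
""")
```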
Confidential
SQL Developer
Responsibilities:
- Established schedule and resource requirements by planning, analyzing and documenting development effort to include timelines, risks, test requirements and performance targets.
- Analyzed, designed, and developed databases using ER diagrams, normalization, and relational database concepts.
- Converted MS Excel sheet reports into SSRS-based reports by migrating the data using SSIS packages and then using views and stored procedures to develop the new reports.
- Extensively used Oracle PL/SQL to develop complex stored packages, functions, triggers, text queries, text indexes, etc., to process raw data and prepare it for statistical analysis.
- Involved in data replication and high-availability design scenarios with Oracle Streams. Developed UNIX shell scripts to automate repetitive database processes.
- Developed SQL Server stored procedures, tuned SQL queries (using indexes and execution plans), and created user-defined functions and views.
- Rebuilt indexes and tables as a part of performance tuning exercise on a routine basis.
- Worked on client requirements and wrote complex SQL queries to generate crystal reports.
- Tuned and optimized SQL queries using execution plan and profiler.
- Designed and developed internal client reporting systems using SSRS and SQL Server, including Power Pivot for ad-hoc reporting.
- Handled errors extensively using exception handling, for ease of debugging and to display error messages in the application (a call-and-error-handling sketch follows this section).
Environment: MS SQL Server, Oracle PL/SQL, SSRS, SSIS, SSAS, DB2, HTML, XML, JSP, Servlet, JavaScript, EJB, JMS, MS Excel.
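The stored-procedure and exception-handling work above was done in T-SQL; below is a minimal Python sketch of how a reporting application might call such a procedure and surface errors, assuming pyodbc, a hypothetical connection string, and a hypothetical procedure name.

```python
import pyodbc

# Hypothetical SQL Server connection string used for illustration
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=reporting-db;DATABASE=SalesDW;Trusted_Connection=yes;"
)

def monthly_sales(year: int, month: int):
    """Call a hypothetical tuned stored procedure and return its rows."""
    try:
        with pyodbc.connect(CONN_STR) as conn:
            cursor = conn.cursor()
            # Parameterized call keeps the plan cacheable and avoids injection
            cursor.execute(
                "EXEC dbo.usp_MonthlySalesSummary @Year = ?, @Month = ?", year, month
            )
            return cursor.fetchall()
    except pyodbc.Error as exc:
        # Surface the database error to the caller instead of failing silently
        raise RuntimeError(f"Monthly sales report failed: {exc}") from exc

for row in monthly_sales(2015, 6):
    print(row)
```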
Confidential
Junior Data Analyst
Responsibilities:
- Created SSIS packages for data migration and transformation to support and maintain the data warehouse environment, and tackled problems using SQL Profiler.
- Designed ETL/data Integration solutions from source systems and historical data scripts utilizing SQL Queries and/or stored procedures using T-SQL.
- Worked on large sets of structured and unstructured data and was involved in data cleansing mechanisms to eliminate duplicate and inaccurate data.
- Performed spend analysis for design solutions delivered in an Agile environment.
- Reconciled daily financial reports using pivot tables and VLOOKUP (a pandas-based equivalent is sketched after this section).
- Analyzed and processed complex data sets using advanced querying, visualization, and analytics tools.
- Mastered the design and deployment of rich graphical visualizations using Tableau and converted existing BusinessObjects reports into Tableau dashboards.
Environment: ETL, SQL, T-SQL, MS Excel, Agile, Tableau, SQL Server, ad-hoc analysis.
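A minimal pandas sketch of the Excel-style reconciliation described above (a pivot table plus a VLOOKUP-style join); the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical daily extract files used only for illustration
ledger = pd.read_excel("daily_ledger.xlsx")        # columns: cost_center, amount
payments = pd.read_excel("payment_export.xlsx")    # columns: cost_center, amount

# Pivot the ledger by cost center, mirroring the Excel pivot table
ledger_totals = (ledger.pivot_table(index="cost_center", values="amount", aggfunc="sum")
                 .rename(columns={"amount": "ledger_total"})
                 .reset_index())

# Aggregate payments and left-join on cost center, mirroring a VLOOKUP
payment_totals = (payments.groupby("cost_center", as_index=False)["amount"].sum()
                  .rename(columns={"amount": "payment_total"}))

recon = ledger_totals.merge(payment_totals, on="cost_center", how="left")
recon["variance"] = recon["ledger_total"] - recon["payment_total"].fillna(0)

# Rows with a non-trivial variance need manual follow-up
print(recon[recon["variance"].abs() > 0.01])
```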