Sr. Data Engineer Resume
Branchburg, NJ
SUMMARY
- 9+ years of experience in Data Engineering: designing algorithms, building models, and developing Data Mining, Data Acquisition, Data Preparation, Data Manipulation, Feature Engineering, Machine Learning, Validation, Visualization, and reporting solutions that scale across massive volumes of structured and unstructured data.
- Expertise in job scheduling and monitoring tools such as Azkaban and Airflow.
- Experience developing ETL solutions using Spark SQL in Azure Databricks for data extraction, transformation, and aggregation from multiple file formats and data sources, transforming the data to uncover insights into customer usage patterns (see the PySpark sketch after this summary).
- Experience working with NiFi to ingest data from various sources and to transform, enrich, and load data into various destinations (Kafka, databases, etc.).
- Worked on visualization dashboards using Power BI, Pivot Tables, Charts, and DAX Commands.
- Designed and developed Business Intelligence applications using Azure SQL and Power BI.
- Experience with data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Designed and implemented stored procedures, views, and other application database code objects.
- Maintained SQL scripts, indexes, and complex queries for analysis and extraction.
- Expertise in various phases of the project life cycle (analysis, design, implementation, and testing).
- Participated in business requirements gathering and documentation.
- Collaborated with others to develop database solutions within a distributed team.
- Wrote Azkaban jobs to orchestrate and automate the data pipeline process.
- Extensive experience in loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Flume, Sqoop, Spark, Impala, Scala) and NoSQL databases like MongoDB, HBase, and Cassandra.
- Experience in developing MapReduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
- Automated workflows and scheduled jobs using Oozie and UC4 Automata.
- Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
- Hands-on experience working with Amazon Web Services (AWS) using Elastic Map Reduce (EMR), Redshift, and EC2 for data processing.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, Cloud Front, CloudWatch, SNS, SES, SQS, and other services of the AWS family.
- Selected appropriate AWS services to design and deploy applications based on given requirements.
- ETL framework design and development for Lucas, Confidential, Star Wars digital data hub (Pentaho data integration, Talend Data Integration).
- Strong experience in the Analysis, design, development, testing, and implementation of Business Intelligence solutions using Data Warehouse/Data Mart Design, ETL, BI, Client/Server applications, and writing ETL scripts using Regular Expressions and custom tools (Informatica, Pentaho, and Sync Sort) to ETL data.
- Skilled in performing Data Parsing, Data Manipulation, and Data Preparation with methods including describing Data contents, computing descriptive statistics of Data, regex, split and combine, remap, merge, subset, reindex, melt and reshape. Good understanding of web design based on HTML5, CSS3, and JavaScript.
- Hands-on experience with Big Data tools like Hadoop, Spark, Hive, Pig, Impala, PySpark, and Spark SQL. Hands-on experience implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis.
- Good knowledge of Proofs of Concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for exploration using data munging.
- Deep understanding of MapReduce with Hadoop and Spark. Good knowledge of Big Data ecosystems like Hadoop 2.0 (HDFS, Hive, Pig, Impala), and Spark (SparkSQL, Spark MLLib, Spark Streaming).
- Used Kubernetes to manage containerized applications using its nodes, Config Maps, Selector, Services, and deployed application containers as Pods.
- Managed Kubernetes manifest files and Helm packages to deploy, scale, load balance, and manage Docker containers with multiple namespace versions.
- Responsible for User Management, Plugin Management, and End-to-End automation of the Build and Deployment process using Jenkins.
- Utilized robust CI/CD tools such as Jenkins, GitHub, Bitbucket, Bamboo, and other utilities that enable automation around software delivery pipelines.
- Configured and managed source code using Git and resolved code merging conflicts in collaboration with application developers.
- Experience in building and deploying Java/J2EE applications to Tomcat application servers in an agile continuous integration process and automating the whole process.
- Experience troubleshooting build issues to support development teams on both .NET and Java applications.
- Built scripts using the Ant and Maven build tools in Jenkins to move builds from one environment to another.
- Excellent track record of building and publishing customized interactive reports and dashboards with customized parameters and user filters, including tables, graphs, and listings, using Tableau.
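Illustrative only: a minimal PySpark sketch of the Spark SQL / Databricks ETL pattern referenced in this summary. The file paths and column names are hypothetical placeholders, not taken from any actual project.

```python
# Minimal PySpark ETL sketch: read raw usage events, aggregate, and write a curated output.
# Paths and column names are illustrative placeholders only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Ingest raw events (Parquet here; the JSON/CSV readers follow the same pattern).
events = spark.read.parquet("/mnt/raw/usage_events/")

# Aggregate daily usage per customer to surface usage patterns.
daily_usage = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("customer_id", "event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("duration_sec").alias("total_duration_sec"),
    )
)

# Persist the curated output, partitioned by date, for downstream reporting.
daily_usage.write.mode("overwrite").partitionBy("event_date").parquet("/mnt/curated/daily_usage/")
```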
TECHNICAL SKILLS
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Databases: Oracle 12c/11g, Teradata R15/R14, MySQL, SQL Server
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
Big Data Tools: Hadoop 3.0 ecosystem (HDFS, MapReduce), Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Sqoop 1.4, Kafka, Oozie
PROFESSIONAL EXPERIENCE
Sr. DATA Engineer
Confidential, Branchburg, NJ
Responsibilities:
- Designed and developed an architecture for a data services ecosystem spanning Relational, NoSQL, and Big Data technologies. Extracted large volumes of data from Amazon Redshift and Elasticsearch on AWS using SQL queries to create reports.
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Hands-on experience in developing SQL Scripts for automation.
- Worked on building data pipelines in Airflow on GCP for ETL-related jobs using different Bash operators (see the Airflow sketch after this section).
- Experience with Data pipelines, end-to-end ETL, and ELT process for data ingestion and transformation in GCP.
- Experience in Google Cloud components, Google container builders, and GCP client libraries.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Built pipelines in Azure Data Factory to move data from on-prem to Azure SQL Data Warehouse and from Amazon S3 buckets to Azure Blob Storage.
- Created notebooks to load XML files into Azure SQL Data Warehouse using Azure Databricks.
- Applied the Spark DataFrame API to complete data manipulation within a Spark session.
- Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors and tasks, deployment modes, the execution hierarchy, fault tolerance, and collections.
- Played a critical role in establishing the Databricks infrastructure for building data pipelines, providing best practices and recommendations for using Databricks, working closely with data scientists, and serving as the point of contact for all things Databricks.
- Developing scalable and re-usable frameworks for ingesting large data in Databricks.
- Designed and developed Security Framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
- Set up Apache Airflow, including installing, configuring, and monitoring the Airflow cluster.
- Experience in Automating, Configuring, and Deploying Instances on Azure environments and in Data centers and migrating on-premise to Windows Azure using Azure Site Recovery (ASR) and Azure backups.
- Automated provisioning of Hybrid solutions connecting Azure to on-premises resources via Azure Express Route and Azure Hybrid connections.
- Onboarding to the Cloud - Moved critical instances & components of core infrastructure to the cloud (Azure).
- Experience in setting up a CI/CD pipeline integrating various tools with Jenkins to build and run Terraform jobs to create infrastructure in Azure.
- Knowledge in syncing on-premises Windows Server Active Directory to Azure AD (AAD) using Azure AD Connect.
- Deployed and maintained various .NET applications and web services hosted on various versions of IIS web servers (7.0/7.5/8.0/8.5), SQL Server, and Azure.
- Good knowledge of Pivotal Cloud Foundry (PCF) and application infrastructure and architectures. Installed and configured Pivotal Cloud Foundry (PCF) environments.
- Ensured successful architecture and deployment of enterprise-grade PaaS solutions using Pivotal Cloud Foundry (PCF), as well as proper operation during initial application migrations and new development.
- Used Terraform as "Infrastructure as a code" and modified Terraform scripts as and when configuration changes are required.
- Wrote PowerShell scripts for administrative tasks and management of the server infrastructure.
- Used Ansible server and workstation to manage and configure nodes and wrote several Ansible playbooks for the automation that was defined through tasks using YAML format and run Ansible Scripts to provision Dev servers.
- Creating new Ansible YAML, Playbooks, Roles, and Bash Shell scripts for application deployments.
- Hands-on experience in creating Docker images using a Docker file, working on Docker container snapshots, removing images, and managing Docker volumes for branching purposes.
- Worked on several Docker components like Docker Engine, Hub, Machine, Docker images, compose, Docker registry, and handling multiple images primarily for middleware installations and domain configurations.
- Experience in deploying Kubernetes Cluster on Azure with master/minion architecture and wrote YAML files to create services like pods, deployments, auto-scaling, load balancers, labels, health checks, and namespaces.
- Managed Kubernetes charts using Helm packages. Created reproducible builds of the Kubernetes applications, managed Kubernetes clusters using pods and nodes, and managed releases of Helm packages.
- Set up Jenkins server and build jobs to provide continuous automated builds based on polling the Git source control system to support development needs using Jenkins, Gradle, Git, and Maven.
- Strong knowledge/experience in creating Jenkins CI pipelines and good experience in automating deployment pipelines.
- Expert in setting up a production-ready architecture for Airflow on AWS EKS.
- Enabled remote logging to AWS S3 for Airflow on EKS and stored sensitive data in AWS Secrets Manager.
- Experience deploying DAGs from Git and sharing DAGs and logs across workers using AWS EFS.
- Performed end-to-end Architecture & implementation assessment of various AWS services like Amazon EMR, Redshift, and S3.
- Worked on ETL Migration services by developing and deploying AWS Lambda functions for generating a serverless data pipeline that can be written to Glue Catalog and can be queried from Athena.
- Constructed AWS data pipelines using VPC, EC2, S3, Auto Scaling Groups (ASG), EBS, Snowflake, IAM, CloudFormation, Route 53, CloudWatch, CloudFront, and CloudTrail.
- Designed and constructed AWS data pipelines using various AWS resources, including AWS API Gateway to receive responses from AWS Lambda, Lambda functions that retrieve data from Snowflake and convert the responses into JSON, DynamoDB, and AWS S3.
- Building ETL data pipeline on Hadoop/Teradata using Hadoop/Pig/Hive/UDFs
- Developing ETL processes based on the necessity, to load and analyze data from multiple data sources using MapReduce, Hive, and Pig Latin Scripting.
- Integrated MapReduce with HBase to import large volumes of data using MapReduce programs.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
- Implemented AWS Step Functions to automate and orchestrate the Amazon SageMaker-related tasks such as publishing data to S3, training ML model, and deploying it for prediction.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
- Proficient with business intelligence tools such as SSIS, SSRS, TOAD, Teradata SQL Assistant, VBA, Tableau, and Actimize.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLLib, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Used Kafka for live streaming data and performed analytics on it. Worked on Sqoop to transfer data between relational databases and Hadoop.
- Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Participated in all data collection, data cleaning, developing models, validation, and visualization. Designed and developed an architecture for a data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
- Developed Spark/Scala, Python for regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources. Worked with data investigation, discovery, and mapping tools to scan every single data record from many sources.
- Involved with writing scripts in Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis and worked in importing and cleansing of data from various sources like DB2, Oracle, flat files onto SQL Server with high volume data
- Developed automated data pipelines from various external data sources (web pages, API, etc) to the internal data warehouse (SQL server, AWS), then export to reporting tools.
- Used Informatica PowerCenter for ETL (extraction, transformation, and loading) of data from heterogeneous source systems, and studied and reviewed the Kimball data warehouse methodology as well as the SDLC across various industries to work successfully with a range of data-handling scenarios.
- Worked on analyzing Hadoop clusters and different big data analytic tools including Pig, HBase database, and Sqoop.
- Worked on Google Cloud Platform (GCP) services like compute engine, cloud load balancing, cloud storage, and cloud SQL.
- Used Spark Data frames, Spark-SQL, and Spark MLLib extensively and developed and designed POCs using Scala, Spark SQL, and MLLib libraries.
Environment: Spark, Scala, PySpark, Python, AWS EMR, EC2, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, MapReduce, Amazon SageMaker, Airflow, Oozie, HBase, Netezza, Apache Kafka, Hive, Sqoop, Snowflake, Apache Pig, Informatica PowerCenter, Tableau, GCP, BigQuery, Postgres, MS SQL Server, Salesforce, SQL, Postman, Unix Shell Scripting, GitLab
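Illustrative only: a minimal Airflow DAG sketch (Airflow 1.10-style imports) of the BashOperator-driven GCP ETL pattern referenced above. The DAG id, schedule, bucket names, and gsutil/bq commands are hypothetical placeholders.

```python
# Minimal Airflow DAG sketch for a BashOperator-driven daily ETL job on GCP.
# DAG id, schedule, paths, and commands are illustrative placeholders only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="gcp_daily_etl",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Stage the day's extract into a GCS landing bucket.
    stage_to_gcs = BashOperator(
        task_id="stage_to_gcs",
        bash_command="gsutil cp /data/extracts/{{ ds }}/*.csv gs://landing-bucket/{{ ds }}/",
    )

    # Load the staged files into BigQuery for downstream reporting.
    load_to_bq = BashOperator(
        task_id="load_to_bq",
        bash_command=(
            "bq load --source_format=CSV --skip_leading_rows=1 "
            "analytics.daily_events gs://landing-bucket/{{ ds }}/*.csv"
        ),
    )

    stage_to_gcs >> load_to_bq
```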
Data Engineer
Confidential, Mountain view, CA
Responsibilities:
- Experienced in development using Cloudera Distribution System. Performed Data Analytics on Data Lake using PySpark on Databricks Platform.
- Design and develop ETL integration patterns using Python on Spark. Participated in Normalization /De-normalization, Normal Form, and database design methodology. Expertise in using data modeling tools like MS Visio and Erwin Tool for the logical and physical design of databases.
- Optimize the PySpark jobs to run on Secured Clusters for faster data processing.
- Used Python for SQL/CRUD operations in DB, file extraction/transformation/generation.
- Developed Spark applications in Python and PySpark on the distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables (see the sketch after this section).
- Designing and Developing Apache NiFi jobs to get the files from transaction systems into the data lake raw zone.
- Worked on reading and writing multiple data formats like JSON, ORC, and Parquet on HDFS using PySpark.
- Develop a framework for converting existing PowerCenter mappings to PySpark, Python, and Spark Jobs.
- Used UC4 and Oozie Scheduler to automate the workflows based on time and data availability.
- Created a PySpark framework to bring data from DB2 to Amazon S3.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Developed Talend Bigdata jobs to load a heavy volume of data into an S3 data lake and then into the Snowflake data warehouse.
- Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources like Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Designed Star and Snowflake Data Models for Enterprise Data Warehouse using ERWIN.
- Experience in Google Cloud components, Google container builders, and GCP client libraries.
- Created Spark clusters and configured high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Primarily involved in Data Migration process using SQL, Azure SQL, SQL Azure DW, Azure Storage, and Azure Data Factory (ADF) for Azure Subscribers and Customers.
- Responsible for ingesting data from various source systems (RDBMS, Flat files, Big Data) into Azure (Blob Storage) using the framework model.
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
- Involved in application design and data architecture using cloud and Big Data solutions on Microsoft Azure.
- Led the effort to migrate a legacy system to a Microsoft Azure cloud-based solution, re-designing the legacy application with minimal changes to run on the cloud platform.
- Worked on building the data pipeline using Azure services like Data Factory to load data from a legacy SQL Server to Azure databases using Data Factories, API Gateway services, SSIS packages, Talend jobs, and custom .NET and Python code.
- Built Azure Web Job for Product Management teams to connect to different APIs and sources to extract the data and load it into Azure Data Warehouse using Azure Web Job and Functions.
- Participated in the derivation of logical requirements into physical requirements and in the preparation of high-level design documents for ETL jobs.
- Performed Hive test queries on local sample files and HDFS files.
- Developed Hive queries to analyze data and generate results.
- Used Spark Streaming to divide streaming data into batches as an input to Spark Engine for batch processing.
- Worked on analyzing Hadoop clusters and different Big Data analytic tools including Pig, Hive, HBase, Spark, and Sqoop.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
- Used Scala to write code for all Spark use cases.
- Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
- Assigned names to columns using the case class option in Scala.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python and PySpark.
- Involved in converting HQL queries to Spark transformations using Spark RDDs with the support of Python and Scala.
- Extensively worked on Informatica tools like source analyzer, mapping designer, workflow manager, workflow monitor, Mapplets, Worklets, and repository manager.
- Used Debugger in Informatica Power Center Designer to check the errors in mapping.
- Developed multiple Spark SQL jobs for data cleaning.
- Created Hive tables and worked on them using Hive QL.
- Assisted in loading large sets of data (Structured, Semi-Structured, and Unstructured) to HDFS.
- Developed Spark SQL to load tables into HDFS to run select queries on top.
- Used Visualization tools such as Power view for excel, and Tableau for visualizing and generating reports.
- Implemented Custom Azure Data Factory (ADF) pipeline Activities and SCOPE scripts.
- Primarily responsible for creating new Azure Subscriptions, data factories, Virtual Machines, SQL Azure Instances, SQL Azure DW instances, HD Insight clusters, and installing DMGs on VMs to connect to on-premise servers.
Environment: Hadoop, Hive, Oozie, Spark, Spark SQL, Python, PySpark, Azure Data Factory, Azure SQL, Azure Databricks, Azure DW, Blob Storage, Java, Scala, AWS, Linux, Maven, Apache NiFi, Oracle 11g/10g, Zookeeper, MySQL, Snowflake, GCP
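Illustrative only: a minimal PySpark sketch of the CSV-to-Hive-ORC load pattern referenced above; the paths, database, and table names are hypothetical placeholders.

```python
# Minimal PySpark sketch: load raw CSV files into a partitioned Hive ORC table.
# Paths, database, and table names are illustrative placeholders only.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the raw CSV drop; schema inference keeps the sketch short,
# an explicit schema would normally be supplied for production loads.
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/landing/transactions/")
    .withColumn("load_date", F.current_date())
)

# Append into a Hive-managed ORC table, partitioned by load date.
(
    raw.write
    .format("orc")
    .mode("append")
    .partitionBy("load_date")
    .saveAsTable("raw_zone.transactions")
)
```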
Data Engineer
Confidential, Vernon Hills, IL
Responsibilities:
- Successfully created a plan aligned with the technology roadmap, delivering efficiency through Big Data tools.
- Provided feedback during design sessions for future services and application integrations onto the CBB environment.
- Completed a PySpark framework with all the runtime-critical parameters for execution on the CBB environment (see the configuration sketch after this section).
- Successfully provided primary operational support for Data Science users, covering best practices for Jupyter Notebook, Python, Spark, Hive, and connections to Teradata, Greenplum, and Oracle databases.
- Completed user onboarding and governance processes for private cloud onboarding; partnered with Global Technology Infrastructure teams to obtain storage and compute capacity based on the use cases.
- Provided primary integration support documents for onboarding of new clients to the CBB environment.
- Successfully Onboarded Jupyter notebook/Hub on CBB environment.
- Supported, maintained, and troubleshot Spark memory usage, YARN queues, and other Hadoop services.
- Successfully tested and provided a distribution of the various APIs and components that CBB provides and supports.
- Built the infrastructure required for optimal extraction, transformation, and loading (ETL) of data from a wide variety of data sources such as Teradata and Oracle using Spark, Python, Hive, Kafka, and other Big Data technologies.
- Optimized Hive queries using best practices and the right parameters, and technologies like Hadoop, YARN, Python, and PySpark.
- Used Python for SQL/CRUD operations in DB, file extraction/transformation/generation.
- Optimized the PySpark jobs to run on CBB for faster data processing and developed RESTful APIs for capturing logs.
Environment: Spark, PySpark, Teradata, Hive, Python, Kafka, RESTful APIs, Azure
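Illustrative only: a minimal sketch of the kind of runtime-critical Spark parameters such a PySpark framework might set; the specific values and queue name are hypothetical placeholders that would depend on cluster sizing and YARN queues.

```python
# Minimal sketch of runtime-critical parameters for a PySpark session on YARN.
# Values and queue name are illustrative placeholders only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cbb-pyspark-framework")
    .config("spark.sql.shuffle.partitions", "400")       # match partition count to data volume
    .config("spark.executor.memory", "8g")               # executor heap sized for the YARN queue
    .config("spark.executor.cores", "4")
    .config("spark.dynamicAllocation.enabled", "true")   # let YARN scale executors with load
    .config("spark.yarn.queue", "analytics")             # submit to the team's YARN queue
    .enableHiveSupport()
    .getOrCreate()
)
```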
Data Analyst
Confidential
Responsibilities:
- Work with large amounts of data in service of concrete conclusions and actionable insights.
- Communicate complicated technical topics and nuanced insights to diverse audiences.
- Analyzed SQL queries that were used by the testing team to validate the data in the back-end Hadoop Hive tables.
- Experienced in designing and implementing large-scale data loading, manipulation, processing, and exploration solutions using Hadoop/NoSQL technologies. Working knowledge of Apache Spark and PySpark.
- Worked with reporting tools such as Cognos Reporting.
- Strong skills in understanding, writing, and maintaining SQL queries.
- Experience with ETL (Extract, Transform, Load) using Pig.
- Wrote, analyzed, and reviewed programs using workflow chart diagrams.
- Experience with HiveQL and scripting/programming languages, including the ability to develop, test, and debug ad-hoc queries, stored procedures, and data migration scripts (see the query sketch after this section).
- Strong analysis, coding, testing, documenting, and implementation experience in both a development and maintenance environment.
- Basic knowledge of statistics.
Environment: Oracle 10g, TOAD, SQL, Tableau, Power BI, SSIS 11.0, Informatica PowerCenter, MS Office
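Illustrative only: a minimal sketch of an ad-hoc HiveQL validation query run through PySpark; the database, table, and column names are hypothetical placeholders.

```python
# Minimal sketch: run an ad-hoc HiveQL validation query via PySpark.
# Database, table, and column names are illustrative placeholders only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-validation").enableHiveSupport().getOrCreate()

# Check row counts and distinct keys for a given load date in the Hive table.
result = spark.sql("""
    SELECT load_date,
           COUNT(*)                   AS row_count,
           COUNT(DISTINCT account_id) AS distinct_accounts
    FROM analytics.account_snapshot
    WHERE load_date = '2020-06-30'
    GROUP BY load_date
""")
result.show()
```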
Data Analyst
Confidential
Responsibilities:
- Responsible for the analysis, development, installation, modification, and support of software applications.
- Scheduling the build process for software applications and staging it to QA, UAT, and production environments.
- Collaborating with high-performance teams and individuals throughout the firm to accomplish a common goal.
- Involved in the process of Oracle data modeling and building data structures for efficient querying.
- Coded, debugged, revised, and documented objects or systems with limited supervision, designing based on the assigned functional department.
- Work with Oracle technologies and ETL tools such as Ab Initio.
- Sustain engineering to keep the product current with technology dependencies.
- Engage in the system, integration performance, and regression testing.
- Responsible for creating Hive tables, loading the structured data resulting from MapReduce jobs into the tables, and writing Hive queries to further analyze the logs to identify issues and behavioral patterns.
- Implemented Kafka, Spark, and Spark Streaming frameworks for real-time data processing (see the streaming sketch after this section).
- Using Jira for bug tracking and BitBucket to check in and checkout code changes.
- Performing and supporting code deployment walkthrough and validation of deployments.
- Sustain engineering to keep the products current with technology dependencies.
- Worked with application servers including Tomcat and IBM WebSphere, and with databases such as Oracle and Teradata.
- Created automation processes in UNIX and Linux and scheduled them in Control-M.
Environment: Hortonworks, Linux, Python, Spark, Java, ETL - Ab Initio, Cognos, Git, Hadoop, Hive, Jupyter Notebook, Control-M, Teradata, Exadata, Oracle, SQL Server
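Illustrative only: a minimal Spark Structured Streaming sketch of the Kafka real-time processing pattern referenced above; broker addresses, topic, and checkpoint path are hypothetical placeholders, and the Kafka source assumes the spark-sql-kafka package is on the classpath.

```python
# Minimal Spark Structured Streaming sketch reading from Kafka.
# Broker addresses, topic, and checkpoint path are illustrative placeholders only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to the events topic; Kafka delivers key/value as binary columns.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "events")
    .load()
)

# Decode the payload and write micro-batches to the console for inspection
# (a real job would write to HDFS, Hive, or a database sink instead).
query = (
    stream.select(F.col("value").cast("string").alias("payload"))
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start()
)

query.awaitTermination()
```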