Sr. Data Engineer Resume

New York, NY

SUMMARY

  • Data Engineering professional with solid foundational skills and a proven track record of implementations across a variety of data platforms.
  • Self-motivated, with a strong adherence to personal accountability in both individual and team scenarios.
  • Expertise in the Talend Data Integration and Big Data Integration suites for designing and developing ETL/Big Data code and mappings for enterprise DWH ETL Talend projects.
  • Expert in using Talend troubleshooting and DataStage to understand errors in jobs, and in using the tMap/expression editor to evaluate complex expressions and inspect the transformed data to solve mapping issues.
  • 7+ years of experience in Data Engineering, data pipeline design, development, and implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • In-depth experience with SaaS, PaaS, and IaaS concepts of cloud computing architecture and their implementation using Azure, AWS, and Google Cloud Platform.
  • Extensively used the Python libraries PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
  • Expertise in designing complex mappings, performance tuning, and working with slowly changing dimension tables and fact tables.
  • Worked extensively with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Excellent working knowledge of Big Data technologies: Hortonworks, HDFS architecture, R, Python, Jupyter, Pandas, NumPy, scikit-learn, Matplotlib, PyHive, Keras, Hive, NoSQL (HBase), Sqoop, Pig, MapReduce, Oozie, and Spark MLlib.
  • Hands-on experience in Linear and Logistic Regression, K-Means Cluster Analysis, Decision Tree, KNN, SVM, Random Forest, Market Basket, NLTK/Naïve Bayes, Sentiment Analysis, Text Mining/Text Analytics, and Time Series Forecasting.
  • Worked on a Scala codebase for Apache Spark, performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.
  • Experienced in data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, machine learning, and advanced data processing.
  • Hands-on experience in implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, neural networks, and Principal Component Analysis.
  • Strong experience and knowledge in data visualization with Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.
  • Experienced in building automated regression scripts in Python for validating ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Expertise in creating Pods using Kubernetes; worked with Jenkins pipelines to drive all microservice builds out to the Docker registry and then deploy them to the Kubernetes cluster.

TECHNICAL SKILLS

ETL Tools: AWS Glue, Airflow, Spark, Sqoop, Flume, Apache Kafka, Spark Streaming

NoSQL Databases: MongoDB, Cassandra, Amazon DynamoDB, HBase

Data Warehouse: AWS Redshift, Snowflake, Teradata

SQL Databases: Oracle DB, Microsoft SQL Server, IBM DB2, PostgreSQL, Teradata, Amazon RDS

Web Development: HTML, XML, JSON, CSS, jQuery, JavaScript

Monitoring Tools: Splunk, Chef, Nagios, ELK

Source Code Management: JFrog Artifactory, Nexus, GitHub, CodeCommit

Containerization: Docker & Docker Hub, Kubernetes, OpenShift

Hadoop Distribution: Cloudera, Hortonworks, MapR, AWS EMR

Programming and Scripting: Spark Scala, Python, Java, MySQL, PostgreSQL, Shell Scripting, Pig, HiveQL

AWS: EC2, S3, Glacier, Redshift, RDS, EMR, Lambda, Glue, CloudWatch, Kinesis, CloudFront, Route 53, DynamoDB, CodePipeline, EKS, Athena, QuickSight

Hadoop Tools: HDFS, HBase, Hive, YARN, MapReduce, Pig, Apache Storm, Sqoop, Oozie, ZooKeeper, Spark, Solr, Atlas

Build & Development Tools: Jenkins, Maven, Gradle, Bamboo

Methodologies: Agile/Scrum, Waterfall

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, New York, NY

Responsibilities:

  • Worked on designing and developing the Real-Time Tax Computation Engine using Oracle, StreamSets, Kafka, Spark Structured Streaming, and MySQL.
  • Worked with the Data Mapping Team to understand the source-to-target mapping rules.
  • Analyzed the requirements and framed the business logic for the ETL process using Talend.
  • Involved in the ETL design and its documentation.
  • Developed jobs in Talend Enterprise Edition from stage to source, intermediate, conversion, and target.
  • Worked on Talend ETL to load data from various sources into Oracle DB.
  • Used tMap, tReplicate, tFilterRow, tSortRow, and various other features in Talend.
  • Implemented Spark jobs in Python, using DataFrames and the Spark SQL API for faster data processing.
  • Involved in ingestion, transformation, manipulation, and computation of data using StreamSets, Kafka, and Spark.
  • Involved in data ingestion into MySQL using a Kafka-MySQL pipeline for full and incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
  • Worked extensively on AWS components such as Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), and Simple Storage Service (S3).
  • Integrated the Spark-MySQL connector and JDBC connector to save data processed in Spark to MySQL.
  • Created tables and automated MySQL pipelines that load data into tables from Kafka topics.
  • Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra.
  • Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries.
  • Launched Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configured the launched instances for specific applications to improve robustness.
  • Strengthened the business and supported clients by using data to describe and model the outcomes of investment and business decisions.
  • Validated findings using an experimental and iterative approach.
  • Presented findings to the business team, laying out assumptions and validation work in a way that business counterparts could easily understand.
  • Integrated and prepared large, varied datasets, implemented specialized database and computing environments, and communicated results.
  • Improved organizational performance by applying original thinking to existing and emerging analytic methods, processes, products, and services, and used sound judgment in determining how innovations would be deployed to produce a return on investment.
  • Worked with Data Engineers to determine how best to source data, including identifying potential proxy data sources, and designed business analytics solutions that account for current and future needs, infrastructure, security requirements, and load frequencies.
  • Implemented Chef recipes for build deployment on internal data center servers.
  • Re-used and modified the same Chef recipes to deploy directly to Amazon EC2 instances.
  • Used Splunk APM for log aggregation and analysis on different application servers, and integrated Splunk with Single Sign-On authentication and the ServiceNow ticketing tool.

Environment: Spark, Scala, Linux, MySQL, Kafka, Spark SQL, Spark Structured Streaming, AWS EC2, EMR, Tableau, Power BI, AWS S3, AWS Redshift.

Data Engineer

Confidential, Cleveland, Ohio

Responsibilities:

  • Managed the AWS cloud platform and the code in Git (version control), deploying and operating AWS services, specifically VPC, EC2, S3, EBS, IAM, ELB, CloudFormation, and CloudWatch, using the AWS console and AWS CLI.
  • Worked in all areas of Jenkins: setting up CI for new branches, build automation, plugin management, securing Jenkins, and setting up master/slave configurations.
  • Created projects in the OpenShift console with quotas for non-prod and prod, and troubleshot the OpenShift EFK stack and ELK with LMA for central logging.
  • Used Ansible playbooks to set up a Continuous Delivery pipeline, consisting primarily of a Jenkins and Sonar server, the infrastructure to run these packages, and supporting software components such as Maven.
  • Primarily responsible for converting the manual report system to a fully automated CI/CD data pipeline that ingests data from different marketing platforms into the AWS S3 data lake.
  • Developed jobs in Talend Enterprise Edition from stage to source, intermediate, conversion, and target.
  • Worked on Talend ETL to load data from various sources into Oracle DB.
  • Utilized AWS services with a focus on big data analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, and flexibility.
  • Designed AWS architecture, cloud migration, AWS EMR, DynamoDB, Redshift, and event processing using Lambda functions.
  • Gathered data from Google AdWords, Apple Search Ads, Facebook Ads, Bing Ads, Snapchat Ads, Omniture, and CSG using their APIs.
  • Developed MapReduce programs to parse the raw data and create intermediate data to be loaded into partitioned Hive tables.
  • Involved in creating Hive tables, loading data into them, and writing Hive queries to analyze the data.
  • Involved in data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Created UDFs to calculate the pending payment for the given customer data based on the last day of every month, and used them in Hive scripts.
  • Used Elasticsearch and MongoDB for storing and querying the offers and non-offers data.
  • Responsible for data modeling in MongoDB to load both structured and unstructured data.
  • Leveraged advanced methods to synthesize, clean, visualize, and investigate data in order to deliver analytical recommendations aligned with business needs.
  • Analyzed large data sets to apply Machine Learning techniques and develop predictive models.
  • Used data visualization techniques to communicate analytical results, demonstrate model performance, and track progress toward objectives.
  • Demonstrated Key Performance Indicator (KPI) dashboards using Tableau.

Environment: Hadoop, HDFS, Flume, Hive, MapReduce, Sqoop, LINUX, MapR, Big Data, UNIX Shell Scripting, TWS, Python, SQL Server, Tableau, PySpark, Cassandra

Data Engineer

Confidential, Rochester, MN

Responsibilities:

  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool.
  • Strong experience in leading multiple Azure Big Data and Data transformation implementations in Banking and Financial Services, High Tech, and Utility industries.
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML, and Power BI.
  • Designed end-to-end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
  • Created Build and Release for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
  • Designed and Developed Real-time Stream Processing Applications using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
  • Created Partitioned and Bucketed Hive tables in Parquet File Formats with Snappy compression and then loaded data into Parquet hive tables from Avro hive tables.
  • Used Azure Kubernetes Service (AKS) to deploy a managed Kubernetes cluster in Azure, built the AKS cluster with the Azure portal and Azure CLI, and used template-driven deployment options such as Resource Manager templates and Terraform.
  • Used Kubernetes to deploy, load balance, scale, and manage Docker containers with multiple namespaced versions.
  • Designed strategies for optimizing all aspects of the continuous integration, release, and deployment processes using container and virtualization techniques such as Docker and Kubernetes. Built Docker containers for the microservices project and deployed them to Dev.
  • Collected JSON data from an HTTP source and developed Spark APIs that perform inserts and updates in Hive tables.
  • Responsible for resolving issues and troubleshooting related to the performance of the Hadoop cluster.
  • Utilized Machine Learning algorithms such as linear regression, multivariate regression, PCA, K-means, and KNN for data analysis.

Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure, ADF, Blob, Cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Oracle, Teradata, Cassandra, MLlib, Tableau, Maven, Git, Jira.

Data Engineer

Confidential, Austin, TX

Responsibilities:

  • Created infrastructure for optimal extraction, transformation, and loading of data from a wide variety of data sources.
  • Designed and created optimal pipeline architecture on the Azure platform.
  • Created pipelines in Azure using ADF to get the data from different source systems and transform the data by using many activities.
  • Created Linked Services to land data from different sources in Azure Data Factory.
  • Designed and developed Python scripts to ingest data from REST API endpoints.
  • Used pandas to transform the raw data and upload it to FTP sites.
  • Ingested data from Hive tables into Azure Data Lake Gen1 and Gen2.
  • Developed Teradata macros and stored procedures to load data into incremental/staging tables, then move it from staging to journal tables and from journal to base tables.
  • Implemented authentication mechanism using Azure Active Directory for data access and ADF.
  • Created different types of triggers to automate the pipeline in ADF.
  • Created and provisioned the Databricks clusters needed for batch and continuous streaming data processing, and installed the required libraries on the clusters.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
  • Developed Azure SQL Data Warehouse SQL scripts with PolyBase support to process files stored in Azure Storage and Azure Data Lake.
  • Worked on SQL scripts, T-SQL stored procedures, triggers, queries, and packages to load data into SQL Server and SQL Data Warehouse.
  • Worked with version control systems and applied strong knowledge of DevOps.

Environment: Azure SQL Server, Azure Data Warehouse, Azure Storage, Teradata, Azure Data Lake, Azure Data Lake Analytics, Azure Data Factory, Logic Apps, Function Apps, Event Hubs, Event Grids, SQL Server, Visual Studio.

Data Engineer

Confidential

Responsibilities:

  • Interacted with the business requirements and design teams and prepared the low-level and high-level design documents.
  • Provided in-depth technical and business knowledge to ensure efficient design, programming, implementation, and ongoing support for the application.
  • Involved in identifying possible ways to improve the efficiency of the system.
  • Handled logical implementation of, and interaction with, HBase.
  • Efficiently put and fetched data to/from HBase by writing MapReduce jobs.
  • Developed MapReduce jobs to automate the transfer of data to and from HBase.
  • Assisted with the addition of Hadoop processing to the IT infrastructure.
  • Implemented and executed MapReduce jobs to process the log data from the ad servers.
  • Prepared multi-cluster test harness to exercise the system for better performance.

Environment: Hadoop, HDFS, MapReduce, HBase, Hive, Cassandra, Hortonworks and Cloudera Hadoop distributions, SQL*Plus, and Oracle 10g.
