
Data Engineer Resume


Vernon Hills, IL

SUMMARY

  • 8+ years of IT experience spanning ETL tools, Machine Learning, Data Extraction, Data Modeling, Statistical Modeling, Data Mining, and Data Visualization.
  • Data Engineer with hands-on experience in ETL tools and Machine Learning, passionate about implementing and exploring ML techniques.
  • Implemented REST APIs building on ETL mappings for data collection from various data feeds.
  • Extensive experience in data warehousing projects implementing Talend ETL; developed mappings to populate data into dimension and fact tables.
  • Skilled in designing and implementing ETL architecture for cost-effective and efficient environments, with experience providing ETL solutions for a wide range of business models.
  • Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
  • Good experience designing cloud-based solutions in Azure by creating Azure SQL databases, setting up Elastic Pool jobs, and designing tabular models in Azure Analysis Services.
  • Extensive experience in creating pipeline jobs and scheduling triggers using Azure Data Factory.
  • Developed Scala applications on Hadoop and Spark SQL for high-volume and real-time data processing.
  • Good understanding of classic Hadoop and YARN architecture along with the various Hadoop daemons such as Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, Resource Manager, Node Manager, Application Master, and Containers.
  • Developed batch processing solutions using Azure Data Factory and Azure Databricks; implemented Databricks clusters, notebooks, jobs, and autoscaling.
  • Knowledge of implementing data cleaning, data validation, data mapping, data analysis, data profiling, feature scaling, feature engineering, statistical modeling, testing and validation, and data visualization.
  • Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, developing data mining and reporting solutions which scale across a massive volume of structured and unstructured data.
  • Proficient in Machine Learning algorithms and Predictive Modeling, including Regression models, Decision Tree, Random Forests, Sentiment Analysis, Naïve Bayes Classifier, SVM, and Ensemble Models.
  • Knowledge of Natural Language Processing (NLP) algorithms and Text Mining.
  • Hands on experience with different programming languages such as Java, Python, R.
  • Good team player with the ability to work independently, strong interpersonal and communication skills, a strong work ethic, and a high level of motivation.
  • Strong business sense and the ability to communicate data insights to both technical and non-technical clients.
  • Proficient in analyzing problems and translating business concepts into functional requirements.

TECHNICAL SKILLS

Operating Systems: Unix, Linux, Windows

Programming Languages: Java, Python 3, Scala 2.12.8, PySpark, C, C++

Hadoop Ecosystem: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Cluster Management & Monitoring: CDH, Hortonworks Ambari

Databases: MySQL, SQL Server, Oracle 12c, MS Access

NoSQL Databases: MongoDB, Cassandra, HBase, KairosDB

Workflow Management Tools: Oozie, Apache Airflow

Visualization & ETL Tools: Tableau, BananaUI, D3.js, Informatica, Talend

Cloud Technologies: Azure, AWS

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Version Control Systems: Git, SVN

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential, Vernon Hills, IL

Responsibilities:

  • Analyze, design and build modern data solutions using Azure PaaS services to support data visualization.
  • Create pipelines in Azure Data Factory (ADF) using Linked Services, Datasets, and Pipelines to extract, transform, and load data from various sources, such as Azure SQL, Blob Storage, and Azure Synapse Analytics (formerly known as SQL Data Warehouse).
  • Develop highly optimized Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from various file formats to analyze and uncover insights into customer usage patterns (a minimal sketch follows this list).
  • Work with Azure Blob and Data Lake Storage and load data into Azure Synapse Analytics (DW).
  • Design and develop schema data models, perform data cleaning and preparation on XML files, and develop SQL scripts for automation purposes.
  • Develop conceptual solutions and create proof-of-concepts to demonstrate the viability of solutions.
  • Build complex distributed systems involving handling large amounts of data, collecting metrics, building data pipelines, and performing analytics.
  • Understand the current production state of applications and determine the impact of new implementations on existing business processes.
  • Engage with business users to gather requirements, design visualizations, and provide training to use self-service BI tools.
  • Implement and manage ETL solutions and automate operational processes.
  • Work with data pipelines consisting of Spark, Hive, and custom-built Input Adapters to ingest, transform, and analyze operational data.
  • Develop highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities.
  • Create correlated and non-correlated sub-queries to resolve complex business queries involving multiple tables from different databases.
  • Provide technical guidance for projects to ensure completion within target timeframes.
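
A minimal PySpark sketch of the extract-transform-aggregate pattern referenced in the list above; the storage paths and column names are placeholders, not the actual project assets.

    # Hypothetical PySpark job: extract from Parquet/CSV, transform, aggregate usage patterns
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

    # Extract from different file formats (placeholder paths)
    events = spark.read.parquet("abfss://raw@<storage-account>.dfs.core.windows.net/events/")
    customers = spark.read.option("header", True).csv("/mnt/raw/customers.csv")

    # Transform: derive an event date and enrich events with customer attributes
    usage = (events
             .withColumn("event_date", F.to_date("event_ts"))
             .join(customers, "customer_id", "left"))

    # Aggregate daily usage per customer
    daily_usage = (usage.groupBy("customer_id", "event_date")
                   .agg(F.count("*").alias("event_count"),
                        F.countDistinct("session_id").alias("sessions")))

    # Expose to Spark SQL for downstream reporting or load into Synapse
    daily_usage.createOrReplaceTempView("daily_usage")
    spark.sql("SELECT event_date, AVG(event_count) AS avg_events "
              "FROM daily_usage GROUP BY event_date").show()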

Environment: SQL Server Management, Azure Data Factory, Power BI, Azure Data Lake Analytics, Azure Analysis Services, ETL, Databricks.

Data Engineer

Confidential, Fort Lauderdale, FL

Responsibilities:

  • Extract, transform, and load data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics.
  • Ingest data into one or more Azure services, such as Azure Data Lake, Azure Storage, Azure SQL, and Azure Synapse Analytics (DW), and process the data in Azure Databricks.
  • Develop Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from various file formats to analyze and uncover insights into customer usage patterns.
  • Work in large-scale database environments like Hadoop and MapReduce, with an understanding of the working mechanism of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).
  • Estimate cluster size, monitor, and troubleshoot the Hadoop cluster.
  • Analyze data using Hadoop components, such as Hive and Pig, and run Hadoop streaming jobs to process terabytes of data (a minimal streaming-job sketch follows this list).
  • Design ETL processes using Informatica to load data from flat files and Excel files to target Oracle data warehouse databases.
  • Improve workflow performance by shifting filters as close as possible to the source and selecting tables with fewer rows as the master during joins.
  • Use connected and unconnected lookups whenever appropriate, along with appropriate caches.
  • Create tasks and workflows in Workflow Manager and monitor sessions in Workflow Monitor.
  • Set up permissions for groups and users in all development environments.
  • Move mappings from development environments to test environments.

Environment: Azure Data Factory, HDFS, PySpark, Oracle, Spark SQL, Azure Data Lake, Azure Data Storage, Informatica, Hive.

Data Engineer

Confidential, Tampa, FL

Responsibilities:

  • Developed ETL Pipeline using Spark and Hive for ingesting data from multiple sources.
  • Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Redshift (a minimal load sketch follows this list).
  • Created ETL Pipeline using SSIS/ETL framework from the ground up.
  • Responsible for designing logical and physical data models for data sources on Confidential Redshift.
  • Extensively used Erwin for Data modeling, Staging and Target Models for the Enterprise Data Warehouse.
  • Performed logical data modeling, physical Data modeling (including reverse engineering) using the Erwin Data modeling tool.
  • Created dimensional model for the reporting system by identifying required dimensions and facts using Erwin.
  • Involved in normalization/denormalization techniques for optimum performance in relational and dimensional database environments.
  • Resolved the data type inconsistencies between the source systems and the target system using the Mapping Documents and analyzing the database using SQL queries.
  • Involved in performance tuning, stored procedures, views, triggers, cursors, PIVOT/UNPIVOT functions, and CTEs.
  • Created reports using SQL Server Reporting Services (SSRS) for customized and ad-hoc queries.
  • Developed stored procedures in MS SQL to fetch the data from different servers using FTP and processed these files to update the tables.
  • Worked on MS SQL Server, including SSRS, SSIS, and T-SQL.
  • Used the SAP SD module for handling the client's customers and generating sales reports.
  • Worked on ETL testing and used the SSIS Tester automated tool for unit and integration testing.
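
A minimal sketch of the Redshift staging-load step referenced in the list above, issuing a COPY from S3 via psycopg2; the cluster endpoint, bucket, IAM role, and table names are placeholders.

    # Hypothetical Redshift bulk-load step: COPY a Salesforce extract from S3 into a staging table
    import psycopg2

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="datamart", user="etl_user", password="***")

    copy_sql = """
        COPY sales_mart.opportunity_stage
        FROM 's3://example-bucket/salesforce/opportunity/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)   # bulk-load into the staging table; commit happens on clean exit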

Environment: Tableau 7, Python 2.6.8, Numpy, Pandas, Matplotlib, Scikit-Learn, MongoDB, Oracle 10g, SQL

Data Analyst

Confidential

Responsibilities:

  • Used Python 3.x (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
  • Developed and implemented predictive models using machine learning algorithms such as linear regression, classification, multivariate regression, Naïve Bayes, Random Forests, K-means clustering, KNN, PCA, and regularization for data analysis.
  • Built regression models, including Lasso, Ridge, SVR, and XGBoost, to predict Customer Lifetime Value.
  • Built classification models, including Logistic Regression, SVM, Decision Tree, and Random Forest, to predict customer churn rate (a minimal sketch follows this list).
  • Performed univariate and multivariate analysis on data to identify any underlying pattern in the data and associations between the variables.
  • Applied clustering algorithms such as hierarchical and K-means using scikit-learn and SciPy.
  • Used F-score, AUC/ROC, confusion matrix, MAE, and RMSE to evaluate model performance.
  • Performed data imputation using Scikit-learn package in Python.
  • Implemented NLP techniques to optimize Customer Satisfaction.
  • Worked with data engineers and the operations team to implement the ETL process; wrote and optimized SQL queries to perform data extraction that fit the analytical requirements.
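
A minimal scikit-learn sketch of the churn-classification workflow referenced in the list above; the input file and column names are placeholders rather than the original dataset.

    # Hypothetical churn model: train a Random Forest and evaluate with AUC and a confusion matrix
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score, confusion_matrix

    df = pd.read_csv("customer_churn.csv")                # placeholder input file
    X, y = df.drop(columns=["churned"]), df["churned"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X_train, y_train)

    probs = model.predict_proba(X_test)[:, 1]
    print("AUC:", roc_auc_score(y_test, probs))
    print(confusion_matrix(y_test, model.predict(X_test)))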

Environment: Python 2.x, NLP, R, Machine Learning (Regressions, KNN, SVM, Decision Tree, Random Forest, XGboost, LightGBM, Collaborative filtering, Ensemble), pandas, numpy.

Python Developer

Confidential

Responsibilities:

  • Worked on the project from requirements gathering through development of the entire application, using the Anaconda Python environment.
  • Created, activated, and programmed in Anaconda environments.
  • Wrote programs for performance calculations using NumPy.
  • Developed statistical machine learning and data mining solutions to various business problems and generated data visualizations using R, Python, and Tableau.
  • Designed and built a text classification application using different text classification models.
  • Analyzed the codebase and reduced code redundancy to an optimal level.
  • Worked on development of SQL queries and stored procedures on MySQL.
  • Wrote and executed various MySQL database queries from Python using the MySQL Connector and MySQLdb packages (a minimal sketch follows this list).
  • Responsible for designing, developing, testing, deploying and maintaining the web application.
  • Developed Python routines to log into websites and fetch data for selected options.
  • Wrote and read data in CSV and Excel file formats.
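
A minimal sketch of querying MySQL from Python, as referenced in the list above, using the mysql-connector-python package; the connection details, table, and columns are placeholders.

    # Hypothetical query routine using mysql-connector-python with a parameterized query
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="app_user", password="***", database="reports")
    cursor = conn.cursor()

    cursor.execute(
        "SELECT item_code, AVG(daily_value) FROM performance "
        "WHERE record_date >= %s GROUP BY item_code",
        ("2016-01-01",))

    for item_code, avg_value in cursor:
        print(item_code, avg_value)

    cursor.close()
    conn.close()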

Environment: Python 2.x, Anaconda, Spyder (IDE), Tableau, Python libraries such as NumPy, SQLAlchemy, MySQLdb.
