Data Engineer Resume
St. Louis, MO
SUMMARY
- Data Engineer with 9+ years of experience executing data-driven solutions to increase the efficiency, accuracy, and utility of internal data processing.
- Extensive experience in analyzing, developing, managing, and implementing various stand-alone and client-server enterprise applications using Python and Django, and mapping the requirements to the systems.
- Well versed in Agile with Scrum, Waterfall, and Test-Driven Development (TDD) methodologies.
- Experience in developing web applications using Python, Django, C++, XML, CSS, HTML, JavaScript, and jQuery.
- Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for data mining, data cleansing, and machine learning.
- Experience working on healthcare data, developing data preprocessing pipelines for data such as DICOM and non-DICOM images of X-rays, CT scans, etc.
- Managed metadata alongside the data to provide visibility into where data came from and its lineage, making it quick and efficient to find data for customer projects using an AWS data lake and services such as AWS Lambda and AWS Glue.
- Strong experience in implementing data warehouse solutions in Confidential Redshift; worked on various projects to migrate data from on-premises databases to Confidential Redshift, RDS, and S3.
- Sound knowledge in Data Quality & Data Governance practices & processes.
- Experience in developing machine learning models for classification, regression, and clustering, including decision trees.
- Good experience in developing web applications implementing the Model-View-Controller (MVC) architecture using the Django, Flask, and Pyramid Python web application frameworks.
- Experience in working with a number of public and private cloud platforms such as Amazon Web Services (AWS) and Microsoft Azure.
- Experience with cloud databases and data warehouses (SQL Azure and Confidential Redshift/RDS).
- Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
- Extensive experience in Amazon Web Services (Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon RDS, Elastic Load Balancing, Elasticsearch, Amazon MQ, AWS Lambda, Amazon SQS, AWS Identity and Access Management, Amazon CloudWatch, Amazon EBS, and AWS CloudFormation).
- Proficient in SQLite, MySQL and SQL databases with Python.
- Experienced in working with various Python IDEs, including PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, and Sublime Text.
- Experience with the Requests, ReportLab, NumPy, SciPy, PyTables, cv2, imageio, Python-Twitter, Matplotlib, httplib2, urllib2, Beautiful Soup, and Pandas (DataFrames) Python libraries during the development lifecycle.
- Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, Redis, CouchDB, and DynamoDB by installing and configuring various packages in Python.
- Strong ability to conduct qualitative and quantitative analysis for effective data-driven decision making.
- Conducted ad-hoc data analysis on large datasets from multiple data sources to provide data insights and actionable advice to support business leaders according to self-service BI goals.
- Experience in data preprocessing, data analysis, machine learning to get insights into structured and unstructured data.
- Experienced in working on application servers such as WebSphere, WebLogic, and Tomcat, and web servers such as Apache HTTP Server and NGINX.
- Good knowledge of writing different kinds of tests, such as unit tests with Pytest, and building them.
- Experienced with version control systems like Git, GitHub, CVS, and SVN to keep the versions and configurations of the code organized.
- Experienced with containerization and orchestration services like Docker, Kubernetes.
- Good experience in Big Data analytics using Hadoop, MapReduce, Spark, Sqoop, Oozie, AWS, NiFi, and Snowflake.
- Expertise in Build Automation and Continuous Integration tools such as Apache ANT, Maven, Jenkins.
- Strong experience in developing SOAP and RESTful web services with the Python programming language.
- Experienced in writing SQL queries, stored procedures, functions, packages, tables, views, and triggers using relational databases such as Oracle, DB2, MySQL, Sybase, PostgreSQL, and MS SQL Server.
- Experience in using Docker and Ansible to fully automate the deployment and execution of the benchmark suite on a cluster of machines.
- Good Experience in Linux Bash scripting and following PEP-8 Guidelines in Python.
- Extensive knowledge of developing Spark SQL jobs using DataFrames.
- Executed complex HiveQL queries for required data extraction from Hive tables and wrote Hive UDFs.
- Experience in building applications in different operating systems like Linux (Ubuntu, CentOS, Debian), Mac OS.
TECHNICAL SKILLS
Operating Systems: Windows 98/2000/XP/7/8, macOS, and Linux (CentOS, Debian, Ubuntu)
Programming Languages: Python, R, C, C++
Web Technologies: HTML/HTML5, CSS/CSS3, XML, jQuery, JSON, Bootstrap, Angular JS
Python Libraries/Packages: NumPy, SciPy, Boto, Pickle, PySide, PyTables, Pandas (DataFrames), Matplotlib, SQLAlchemy, httplib2, urllib2, Beautiful Soup, PyQuery
Statistical Analysis Skills: A/B Testing, Time Series Analysis, Marko
IDE: PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, Sublime Text, Visual Studio Code
Machine Learning and Analytical Tools: Supervised Learning (Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, Classification), Unsupervised Learning (Clustering, KNN, Factor Analysis, PCA), Natural Language Processing, Google Analytics, Fiddler, Tableau.
Cloud Computing: AWS, Azure, Rackspace, OpenStack, Redshift and AWS Glue.
AWS Services: Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon MQ, Amazon ECS, AWS Lambda, Amazon SageMaker, Amazon RDS, Elastic Load Balancing, Elasticsearch, Amazon SQS, AWS Identity and Access Management, Amazon CloudWatch, Amazon EBS, and AWS CloudFormation
Databases/Servers: MySQL, SQLite3, Cassandra, Redis, PostgreSQL, CouchDB, MongoDB, Teradata, Apache Web Server 2.0, NGINX, Tomcat, JBoss, WebLogic
ETL: Informatica 9.6, Data Stage, SSIS
Web Services/Protocols: TCP/IP, UDP, FTP, HTTP/HTTPS, SOAP, REST, RESTful
Miscellaneous: Git, GitHub, SVN, CVS
Build and CI tools: Docker, Kubernetes, Maven, Gradle, Jenkins, Hudson, Bamboo
SDLC/Testing Methodologies: Agile, Waterfall, Scrum, TDD
PROFESSIONAL EXPERIENCE
Confidential, St. Louis, MO
Data Engineer
Responsibilities:
- Developed a data platform from scratch and took part in the requirements gathering and analysis phase of the project, documenting the business requirements.
- Designed tables in Hive and MySQL, used Sqoop to import and export databases to and from HDFS, and was involved in processing large datasets of different forms, including structured, semi-structured, and unstructured data.
- Created external and permanent tables in Snowflake on the AWS data lake.
- Migrated the on-premises database structure to the Confidential Redshift data warehouse.
- Created data pipelines for the Kafka cluster, processed the data using Spark Streaming, and created Glue jobs in AWS to load incremental data to the S3 staging and persistence areas (an illustrative sketch follows this list).
- Developed REST APIs in Python using the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.
- Worked with the Hadoop architecture and the Hadoop daemons, including NameNode, DataNode, JobTracker, TaskTracker, and ResourceManager.
- Worked on SQL and database optimization for MySQL, Postgres DB, AWS Redshift, Cassandra clusters, Ember DB, and SQL Server.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous or heterogeneous data sources, and built various graphs for business decision-making using the Python Matplotlib library.
- Developed scripts to load data into Hive from HDFS and was involved in ingesting data into the data warehouse using various data loading techniques.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue.
- Scheduled jobs using crontab, Rundeck, and Control-M.
- Built Cassandra queries for performing CRUD operations (create, read, update, and delete), and used Bootstrap as a mechanism to manage and organize the HTML page layout.
- Developed entire frontend and backend modules using Python on the Django web framework and created the user interface (UI) using JavaScript, Bootstrap, and HTML5/CSS, with Cassandra and MySQL as data stores.
- Built import and export data jobs to copy data to and from HDFS using Sqoop, and developed Spark and Spark SQL/Streaming code for faster testing and processing of data.
- Analyzed SQL scripts and designed the solutions to implement using PySpark.
- Used JSON and XML SerDes for serialization and deserialization to load JSON and XML data into Hive tables.
- Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, handling structured data with Spark SQL (a second sketch follows this list).
- Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
- Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
- Worked on development of applications, especially in a Linux environment, and familiar with all of its commands; worked with the Jenkins continuous integration tool for project deployment and deployed the project through Jenkins using the Git version control system.
- Managed data imported from different data sources, performed transformations using Hive, Pig, and MapReduce, and loaded the data into HDFS.
- Used the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability, and developed an Oozie workflow to run jobs when transaction data becomes available.
- To achieve continuous delivery goals in a highly scalable environment, used Docker coupled with the load-balancing tool NGINX.
- Used MongoDB to store data in JSON format, and developed and tested many features of the dashboard using Python, Bootstrap, CSS, and JavaScript.
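Illustrative sketch for the Kafka streaming bullet above: a minimal PySpark Structured Streaming job that reads a Kafka topic and writes incremental micro-batches to an S3 staging area. The broker address, topic name, message schema, and S3 paths are hypothetical placeholders, not the actual project configuration.

    # Minimal PySpark Structured Streaming sketch: Kafka -> S3 staging area.
    # Requires the spark-sql-kafka connector package on the Spark classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-to-s3-staging").getOrCreate()

    # Assumed (hypothetical) schema of the incoming JSON messages.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_ts", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Read the Kafka topic as a streaming DataFrame and parse the JSON payload.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
        .option("subscribe", "transactions")                 # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    # Write incremental micro-batches to the S3 staging area as Parquet files.
    query = (
        events.writeStream.format("parquet")
        .option("path", "s3a://example-bucket/staging/transactions/")
        .option("checkpointLocation", "s3a://example-bucket/checkpoints/transactions/")
        .trigger(processingTime="5 minutes")
        .start()
    )
    query.awaitTermination()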
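Second sketch, for the Spark SQL bullet above: loading JSON data, registering it for SQL queries, and persisting the result as a Hive table. The HDFS path, column names, and table name are hypothetical, not the actual project objects.

    # Minimal Spark SQL sketch: JSON data -> Hive table.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("json-to-hive")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Infer a schema from the JSON source and expose it to SQL queries.
    raw = spark.read.json("hdfs:///data/raw/orders/*.json")   # placeholder path
    raw.createOrReplaceTempView("orders_raw")

    # Shape the structured data with Spark SQL, then save it as a Hive table.
    orders = spark.sql("""
        SELECT order_id,
               customer_id,
               CAST(order_total AS DOUBLE) AS order_total
        FROM orders_raw
        WHERE order_id IS NOT NULL
    """)
    orders.write.mode("overwrite").saveAsTable("analytics.orders")  # placeholder table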
Environment: Hadoop, Hive, Sqoop, Pig, Java, Python 3.3, Django, Flask, Snowflake, XML, MySQL, MS SQL Server, SQL, Linux, Shell Scripting, MongoDB, Cassandra, HTML5/CSS, JavaScript, jQuery, Bootstrap, PyCharm, Git, RESTful, Docker, Jenkins, JIRA, AWS, EC2, S3.
Confidential, Farmington Hills, MI
Data Engineer
Responsibilities:
- Involved in Design, Development and Support phases of Software Development Life Cycle (SDLC).
- Performed data ETL by collecting, exporting, merging, and massaging data from multiple sources and platforms, including SSRS (SQL Server Reporting Services) and SSIS (SQL Server Integration Services) in SQL Server.
- Worked with cross-functional teams (including data engineer team) to extract data and rapidly execute from MongoDB through MongoDB connector.
- Used JSON schema to define table and column mapping from S3 data to Redshift
- Performed data cleaning and feature selection using the scikit-learn package in Python.
- Partitioned the data into 100 clusters with k-means clustering using the scikit-learn package in Python, so that similar hotels for a search are grouped together (an illustrative sketch follows this list).
- Advanced knowledge of Confidential Redshift and MPP database concepts.
- Used Python to perform an ANOVA test to analyze the differences among hotel clusters (also illustrated in the sketch after this list).
- Implemented application of various machine learning algorithms and statistical modeling like Decision Tree, Text Analytics, Sentiment Analysis, Naive Bayes, Logistic Regression and Linear Regression using Python to determine the accuracy rate of each model.
- Worked with ARIMAX, Holt-Winters, and VARMAX models to predict sales at regular and seasonal intervals.
- Worked on automating the provisioning of the AWS cloud using CloudFormation for ticket routing techniques.
- Worked with Amazon Redshift tools such as SQL Workbench/J, pgAdmin, DBHawk, and SQuirreL SQL.
- Determined the most accurate prediction model based on the accuracy rate.
- Used a text-mining process on reviews to determine customer concentrations.
- Delivered result analysis to support team for hotel and travel recommendations.
- Designed Tableau bar graphs, scatter plots, and geographical maps to create detailed summary reports and dashboards.
- Developed hybrid model to improve the accuracy rate.
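Illustrative sketch of the clustering and ANOVA steps above, assuming a hypothetical hotel feature table; the input file, feature columns, and metric column below are placeholders, not the actual project data.

    # Minimal scikit-learn / SciPy sketch: 100-cluster k-means, then a one-way
    # ANOVA comparing a metric across clusters. Names are hypothetical.
    import pandas as pd
    from scipy.stats import f_oneway
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    hotels = pd.read_csv("hotel_features.csv")               # placeholder input
    features = hotels[["price", "rating", "distance"]]       # placeholder features

    # Scale features so no single attribute dominates the distance metric.
    X = StandardScaler().fit_transform(features)

    # Group similar hotels for a search into 100 clusters.
    kmeans = KMeans(n_clusters=100, random_state=42, n_init=10)
    hotels["cluster"] = kmeans.fit_predict(X)

    # One-way ANOVA: does a metric (e.g., booking rate) differ across clusters?
    groups = [grp["booking_rate"].values for _, grp in hotels.groupby("cluster")]
    f_stat, p_value = f_oneway(*groups)
    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")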
Environment: ETL, SQL Server, MongoDB, Python, AWS cloud, Redshift, Tableau
Confidential, Seattle, WA
Data Scientist
Responsibilities:
- Worked on Python OpenStack APIs and used Python scripts to update content in the database and manipulate files.
- Involved in using AWS for the Tableau server scaling and secured Tableau server on AWS to protect the Tableau environment using Amazon VPC, security group, AWS IAM and AWS Direct Connect.
- Configured EC2 instances and configured IAM users and roles and created S3 data pipe using Boto API to load data from internal data sources.
- Built a mechanism for automatically moving the existing proprietary binary format data files to HDFS using a service called Ingestion service.
- Worked on Python OpenStack APIs and used several Python libraries such as wxPython, NumPy, and Matplotlib.
- Performed Data transformations in HIVE and used partitions, buckets for performance improvements.
- Ingested data into Hadoop using Sqoop and applied data transformations using Pig and Hive.
- Used Python and Django for creating graphics, XML processing, data exchange, and business logic implementation.
- Used Git, GitHub, and Amazon EC2 with deployment via Heroku, and used the extracted data for analysis, carrying out various mathematical operations for calculation purposes using the NumPy and SciPy Python libraries.
- Developed a server-based web traffic statistical analysis tool with RESTful APIs using Flask and Pandas.
- Used the Pandas API to put the data into time series and tabular format for easy timestamp-based data manipulation and retrieval (an illustrative sketch follows this list).
- Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
- Participated in the design, build, and deployment of NoSQL implementations such as MongoDB.
- Wrote and executed various MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package.
- Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis and developed scripts to migrate data from proprietary database to PostgreSQL.
- Involved in development of Web Services using SOAP for sending and getting data from the external interface in the XML format.
- Performed troubleshooting, fixed and deployed many Python bug fixes of the two main applications that were a main source of data for both customers and internal customer service team.
- Developed and executed complex SQL queries to pull data from data sources such as SQL Server databases and Oracle. Evaluated the Information Management System database to improve data quality issues using DQ Analyzer and other data preprocessing tools.
- Implemented data governance policies and procedures in the Students Information Management Database.
- Executed data analysis and data visualization on survey data using Tableau Desktop, and compared respondents' demographic data with univariate analysis using Python (Pandas, NumPy, Seaborn, Sklearn, and Matplotlib).
- Developed a machine learning model to recommend friends to students based on their similarities.
- Used Alteryx for data preparation in such a way that it is useful for developing reports and visualizations.
- Analyzed university research budget with peer universities budgets in collaboration with the research team, and recommended data standardization and usage to ensure data integrity.
- Reviewed basic SQL queries and edited inner, left, & right joins in Tableau Desktop by connecting live/dynamic and static datasets.
- Conducted statistical analysis to validate data and interpretations using Python and R, presented research findings and status reports, and assisted with collecting user feedback to improve processes and tools.
- Reported and created dashboards for Global Services & Technical Services using SSRS, Oracle BI, and Excel. Deployed Excel VLOOKUP, PivotTable, and Access Query functionalities to research data issues.
- Cleaned, reformatted, and documented the user satisfaction survey data, and developed data-gathering applications using C#.NET.
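A small Pandas sketch of putting data into time-series form for timestamp-based manipulation and retrieval, as described in the bullet above. The CSV file, column names, and date window are hypothetical placeholders.

    # Minimal Pandas time-series sketch: index by timestamp, resample, slice.
    import pandas as pd

    traffic = pd.read_csv("web_traffic.csv", parse_dates=["timestamp"])  # placeholder file
    traffic = traffic.set_index("timestamp").sort_index()

    # Resample to hourly totals and pull a specific window by timestamp.
    hourly = traffic["requests"].resample("1H").sum()
    window = hourly.loc["2016-03-01":"2016-03-07"]
    print(window.head())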
Environment: Python 2.7, Django, Hive, Oozie, Pig, Hadoop, Amazon AWS S3, AWS EC2, AWS CloudWatch, AWS Redshift, AWS, MySQL, MS SQL Server, Cassandra, MongoDB, HTML5, CSS, XML, JavaScript, AJAX, Git, Jenkins, JIRA, SQL, SOAP, REST APIs, Linux, Shell Scripting
Confidential, Cincinnati, OH
Data Analyst
Responsibilities:
- Developed applications of Machine Learning, Statistical Analysis and Data Visualizations with challenging data Processing problems in sustainability and biomedical domain.
- Compiled data from various sources, including public and private databases, to perform complex analysis and data manipulation for actionable results.
- Applied concepts of probability, distributions, and statistical inference to the given dataset to unearth interesting findings through the use of comparison, t-tests, F-tests, R-squared, p-values, etc.
- Designed and developed Natural Language Processing models for sentiment analysis.
- Used predictive modeling with tools in SAS, SPSS, R, Python.
- Applied linear regression, multiple regression, the ordinary least squares method, mean-variance analysis, the law of large numbers, logistic regression, dummy variables, residuals, Poisson distribution, Bayes, Naive Bayes, fitting functions, etc. to data with the help of the scikit-learn, SciPy, NumPy, and Pandas modules in Python.
- Applied clustering algorithms such as hierarchical and k-means with the help of scikit-learn and SciPy.
- Performed complex pattern recognition of financial time series data and forecasting of returns through ARMA and ARIMA models and exponential smoothing for multivariate time series data.
- Pipelined (ingest/clean/munge/transform) data for feature extraction toward downstream classification.
- Built and analyzed datasets using R, SAS, Matlab, and Python (in decreasing order of usage).
- Applied linear regression in Python and SAS to understand the relationship between different attributes of the dataset and the causal relationship between them (an illustrative sketch follows this list).
- Worked in large-scale database environments like Hadoop and MapReduce, with working mechanism of Hadoop clusters, nodes and Hadoop Distributed File System (HDFS).
- Interfaced with large-scale database system through an ETL server for data extraction and preparation.
- Identified patterns, data quality issues, and opportunities and leveraged insights by communicating opportunities with business partners.
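Illustrative scikit-learn sketch of the linear regression step above, relating dataset attributes to a target variable. The input file, feature columns, and target column are hypothetical placeholders, not the actual study data.

    # Minimal scikit-learn sketch: fit and evaluate a linear regression model.
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    data = pd.read_csv("study_data.csv")                       # placeholder input
    X = data[["attribute_a", "attribute_b", "attribute_c"]]    # placeholder features
    y = data["target"]                                         # placeholder target

    # Hold out a test split to check how well the fit generalizes.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("Coefficients:", dict(zip(X.columns, model.coef_)))
    print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))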
Environment: Machine learning, AWS, MS Azure, Cassandra, Spark, HDFS, Hive, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau.
Confidential, Boston, MA
Data Analyst
Responsibilities:
- Worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
- Expertise in writing automation scripts using Java.
- Responsible for defining the key identifiers for each mapping/interface.
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
- Involved in defining the business/transformation rules applied for sales and service data.
- Worked with internal architects, assisting in the development of current and target state data architectures.
- Implemented a metadata repository and maintained data quality and data cleanup procedures.
- Worked on transformations, data standards, the data governance program, scripts, stored procedures, triggers, and the execution of test plans.
- Performed data quality checks in Talend Open Studio.
- Documented data quality and traceability for each source interface.
- Established standards and procedures.
- Generated weekly and monthly asset inventory reports.