Sr. Data Engineer / Python Spark Developer Resume
Charlotte, NC
SUMMARY:
- 7+ years of strong experience in Data Analysis and Data Mining with large data sets of Structured and Unstructured data, Data Acquisition, Data Validation, Predictive Modeling, Statistical Modeling, Data Modeling, Data Visualization, Web Crawling, and Web Scraping. Adept in statistical programming languages like R and Python, as well as SAS, Apache Spark, and MATLAB, including Big Data technologies like Hadoop, Hive, and Pig.
- Excellent knowledge of Machine Learning, Mathematical Modeling and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB, and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
- Experienced in data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, machine learning, and advanced data processing.
- Deep understanding of Big Data analytics and algorithms using Hadoop, MapReduce, NoSQL, and distributed computing tools.
- Expertise in synthesizing Machine learning, Predictive Analytics and Big data technologies into integrated solutions.
- Experienced in Dimensional Data Modeling using Relational Data Modeling, ER/Studio, Erwin, and Sybase PowerDesigner; Star Schema/Snowflake Schema modeling; Fact and Dimension tables; and Conceptual, Logical, and Physical data modeling.
- Hands-on experience in implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis.
- Expertise in managing the entire data science project life cycle and actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross validation, and data visualization (a representative validation sketch follows this summary).
- Experienced in writing Pig Latin scripts, MapReduce jobs and HiveQL.
- Extensively used SQL, Numpy, Pandas, Scikit-learn, Spark, Hive for Data Analysis and Model building.
- Extensively worked on the Erwin tool with features like Reverse Engineering, Forward Engineering, Subject Areas, Domains, Naming Standards documents, etc.
- Experience in using various packages in R and Python, like ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, Beautiful Soup, and rpy2.
- Experienced in importing and exporting data using Sqoop from HDFS to relational database systems/mainframe and vice versa.
- Extensively worked on Sqoop, Hadoop, Hive, Spark, and Cassandra to build ETL and data processing systems with various data sources, data targets, and data formats.
- Strong experience and knowledge in Data Visualization with Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.
- Experienced with Integration Services (SSIS), Reporting Service (SSRS) and Analysis Services (SSAS)
- Expertise in Normalization to 3NF/De-normalization techniques for optimum performance in relational and dimensional database environments.
- Extensive experience in ER Modeling, Dimensional Modeling (Star Schema, Snowflake Schema), Data Warehousing, and OLAP tools.
- Expertise in database programming (SQL, PL/SQL), XML, DB2, Informix, Teradata, database tuning, and query optimization.
- Experience in designing, developing, scheduling reports/dashboards using Tableau and Cognos.
- Expertise in performing data analysis and data profiling using complex SQL on various source systems including Oracle and Teradata.
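As a hedged illustration of the validation workflow described above (feature scaling, PCA-based dimensionality reduction, and K-fold cross validation scored with ROC AUC), a minimal scikit-learn sketch is shown below; the synthetic dataset, component count, and classifier are placeholders rather than details from any specific engagement.

```python
# Minimal sketch: scaling + PCA + logistic regression validated with
# 5-fold cross validation and ROC AUC. Data and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),            # feature scaling
    ("pca", PCA(n_components=10)),          # dimensionality reduction
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print("Mean ROC AUC across folds: %.3f" % scores.mean())
```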
TECHNICAL SKILLS:
Database Design Tools and Data Modeling: Fact & Dimensions tables, physical & logical data modeling, Normalization and De-normalization techniques, Kimball.
Databases: SQL Server 2017, MS Access, Oracle 11g, Sybase, and DB2.
Languages: PL/SQL, SQL, T-SQL, C, C++, XML, HTML, DHTML, HTTP, Matlab, Python.
Tools and Utilities: SQL Server 2016/2017, SQL Server Enterprise Manager, SQL Server Profiler, Import & Export Wizard, Visual Studio v14, .Net, Microsoft Management Console, Visual Source Safe 6.0, DTS, Crystal Reports, Power Pivot, ProClarity, Microsoft Office 2007/10/13, Excel Power Pivot, Excel Data Explorer, Tableau 8/10, JIRA
Operating Systems: Microsoft Windows 8/7/XP, Linux and UNIX
PROFESSIONAL EXPERIENCE:
Confidential, Charlotte, NC
Sr. Data Engineer / Python Spark Developer
Responsibilities:
- Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib (a representative pipeline sketch follows this list).
- Identified areas of improvement in the existing business by unearthing insights from vast amounts of data using machine learning techniques.
- Interpreted problems and provided solutions to business problems using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
- Designed and developed NLP models for sentiment analysis.
- Led discussions with users to gather business processes requirements and data requirements to develop a variety of Conceptual, Logical and Physical Data Models. Expert in Business Intelligence and Data Visualization tools: Tableau, Microstrategy.
- Worked on machine learning on large-scale data using Spark and MapReduce.
- Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms and utilized optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
- Stored and retrieved data from data-warehouses using Amazon Redshift.
- Worked on Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad, TPump, FastLoad, and FastExport.
- Used Data Warehousing concepts like the Ralph Kimball methodology, Bill Inmon methodology, OLAP, OLTP, Star Schema, Snowflake Schema, Fact tables, and Dimension tables.
- Refined time-series data and validated mathematical models using analytical tools like R and SPSS to reduce forecasting errors.
- Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
- Created Data Quality scripts using SQL and Hive to validate successful data load and the quality of the data. Created various types of data visualizations using Python and Tableau.
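A minimal PySpark sketch of the kind of Spark ML pipeline and missing-value imputation described in this role; the file path, column names, and choice of logistic regression are hypothetical placeholders, not specifics from the project.

```python
# Minimal sketch: imputation + feature assembly + logistic regression in Spark ML.
# The input path and column names ("balance", "tenure", "label") are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()
df = spark.read.csv("/data/input.csv", header=True, inferSchema=True)

imputer = Imputer(inputCols=["balance", "tenure"],
                  outputCols=["balance_imp", "tenure_imp"])   # fill missing values (mean)
assembler = VectorAssembler(inputCols=["balance_imp", "tenure_imp"],
                            outputCol="features")             # build the feature vector
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[imputer, assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show(5)
```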
Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, VBA, SAS, MATLAB, AWS, SPSS, Cassandra, Oracle, MongoDB, SQL Server 2012, DB2, T-SQL, PL/SQL, XML.
Confidential, Dallas, TX
Sr Data Analyst / Python Developer
Responsibilities:
- Collaborated with cross-functional teams in support of business case development and identified modeling method(s) to provide business solutions. Determined the appropriate statistical and analytical methodologies to solve business problems within specific areas of expertise.
- Integrated Teradata with R for the BI platform and implemented corporate business rules.
- Participated in Business meetings to understand the business needs & requirements.
- Arranged and chaired data workshops with SMEs and related stakeholders to build understanding of the requirements data catalogue.
- Designed a Logical Data Model that fits and adopts the Teradata Financial Services Logical Data Model (FSLDM11) using the Erwin data modeler tool.
- Generated data models using Erwin 9.6, developed the relational database system, and performed logical modeling using dimensional modeling techniques such as Star Schema and Snowflake Schema.
- Guided the full lifecycle of a Hadoop solution, including requirements analysis, platform selection, technical architecture design, application design and development, testing, and deployment.
- Consulted on broad areas including data science, spatial econometrics, machine learning, information technology and systems, and economic policy using R.
- Performed data mapping from source systems to target systems and logical data modeling, created class diagrams and ER diagrams, and used SQL queries to filter data.
- Enabled speedy reviews and first mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and PIG to pre-process the data.
- Used various techniques with R data structures to get the data into the right format for analysis; the results were later used by other internal applications to calculate thresholds.
- Maintained conceptual, logical, and physical data models along with the corresponding metadata.
- Performed data migration from an RDBMS to a NoSQL database, providing a complete picture of the data deployed across the various data systems (see the migration sketch after this list).
- Developed triggers, stored procedures, functions, and packages using cursor and ref cursor concepts in PL/SQL.
- Used a metadata tool for importing metadata from the repository, adding new job categories, and creating new data elements.
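The RDBMS-to-NoSQL migration mentioned above could look roughly like the Python sketch below, using pandas with SQLAlchemy and pymongo; the connection strings, table name, and collection name are hypothetical, and the original work may well have used different tooling.

```python
# Minimal sketch: copy rows from a relational table into a MongoDB collection.
# Connection strings, table name, and collection name are hypothetical.
import pandas as pd
from sqlalchemy import create_engine
from pymongo import MongoClient

engine = create_engine("oracle+cx_oracle://user:password@dbhost:1521/?service_name=ORCL")
client = MongoClient("mongodb://localhost:27017")
target = client["analytics"]["customers"]

# Read the table in chunks so large tables do not have to fit in memory
for chunk in pd.read_sql("SELECT * FROM customers", engine, chunksize=10000):
    target.insert_many(chunk.to_dict(orient="records"))
```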
Environment: R, Oracle 12c, MS-SQL Server, Hive, NoSQL, PL/SQL, MS- Visio, Informatica, T-SQL, SQL, Crystal Reports 2008, Java, SPSS, SAS, Tableau, Excel, HDFS, PIG, SSRS, SSIS, Metadata.
Confidential, Charlotte, NC
Data Scientist/Data Analyst
Responsibilities:
- Worked on data cleaning and reshaping, generated segmented subsets using Numpy and Pandas in Python
- Wrote and optimized complex SQL queries involving multiple joins and advanced analytical functions to perform data extraction and merging from large volumes of historical data stored in Oracle 11g, validating the ETL processed data in target database
- Good understanding of Teradata SQL Assistant, Teradata Administrator, and data loading.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, Pivot Tables, and OLAP reporting.
- Identified the variables that significantly affect the target
- Continuously collected business requirements during the whole project life cycle.
- Conducted model optimization and comparison using stepwise selection based on AIC values
- Applied various machine learning algorithms and statistical models such as decision trees, logistic regression, and Gradient Boosting Machines to build predictive models using the scikit-learn package in Python (see the sketch after this list)
- Developed Python scripts to automate data sampling process. Ensured the data integrity by checking for completeness, duplication, accuracy, and consistency
- Generated data analysis reports using Matplotlib and Tableau, and successfully delivered and presented the results to C-level decision makers
- Generated a cost-benefit analysis to quantify the impact of the model implementation compared with the prior situation
- Worked on model selection based on confusion matrices, minimizing Type II error
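A minimal sketch of the Gradient Boosting workflow and confusion-matrix-based evaluation described above; the synthetic data and hyperparameters are placeholders, not values from the actual models.

```python
# Minimal sketch: Gradient Boosting classifier evaluated with a confusion matrix,
# reporting false negatives (Type II errors). The data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
model.fit(X_train, y_train)

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print("False negatives (Type II errors): %d" % fn)
```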
Environment: Tableau 7, Python 2.6.8, Numpy, Pandas, Matplotlib, Scikit-Learn, MongoDB, Oracle 10g, SQL
Confidential, Pittsburgh, PA
Python Developer
Responsibilities:
- Used Celery with RabbitMQ, MySQL, Django, and Flask to create a distributed worker framework (a representative task sketch follows this list).
- The application was based on a service-oriented architecture and used Python 2.7, Django 1.5, JSF 2, Spring 2, Ajax, HTML, and CSS for the frontend.
- Created a server monitoring daemon with psutil, supported by a Django analytics app that I created; also researched big data solutions with the Cassandra database.
- Migrated data from SQLite3 to the Apache Cassandra database; designed, implemented, maintained, and monitored the Cassandra data model using DSE, DevCenter, and DataStax OpsCenter.
- Built the Silent Circle Management System (Confidential) in Django, Python, Node.js, and MongoDB while integrating with infrastructure services.
- Developed entire frontend and backend modules using Python on Django Web Framework.
- Designed and developed data management system using MySQL
- Created a Python/Django based web application using Python scripting for data processing, MySQL for the database, and HTML/CSS/jQuery and Highcharts for data visualization of the served pages.
- Strong socket programming experience in Python. Implemented complex networking operations such as traceroute, an SMTP mail server, and a web server.
- Used existing Deal Model in Python to inherit and create object data structure for regulatory reporting.
- Created UI and Implemented the presentation layer with HTML, DHTML, Ajax, CSS and JavaScript.
- Involved in writing stored procedures using MySQL.
- Used Python modules such as csv, robotparser, itertools, pickle, jinja2, and lxml for development.
- Managed, developed, and designed a dashboard control panel for customers and administrators using Django, HTML, CSS, JavaScript, Bootstrap, jQuery, and REST API calls.
- Automated RabbitMQ cluster installations and configuration using Python/Bash.
- Deployed the project into Heroku using GIT version control system.
- Improved coding standards and code reuse; increased performance of the extended applications by making effective use of various design patterns (Front Controller, DAO).
- Worked extensively with Bootstrap, Angular.js, JavaScript, and jQuery to optimize the user experience.
- Built various graphs for business decision making using the Python matplotlib library.
- Used the Python library BeautifulSoup for web scraping to extract data for building graphs.
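A minimal sketch of a Celery worker backed by RabbitMQ, as a hedged illustration of the distributed worker framework mentioned in the first bullet of this list; the broker URL, task name, and retry policy are hypothetical.

```python
# tasks.py -- minimal sketch of a Celery task using a RabbitMQ broker.
# The broker URL and task body are illustrative placeholders.
from celery import Celery

app = Celery("workers", broker="amqp://guest:guest@localhost:5672//")

@app.task(bind=True, max_retries=3)
def process_record(self, record_id):
    """Process a single record; retry on transient failures."""
    try:
        # ... fetch the record, transform it, persist the result ...
        return {"record_id": record_id, "status": "done"}
    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)

# Start a worker with:   celery -A tasks worker --loglevel=info
# Enqueue work from application code with:   process_record.delay(42)
```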
Environment: Python, Django, Oracle, Linux, REST, PyChecker, PyCharm, Sublime, HTML, jinja2, SASS, Bootstrap, Java script, jQuery, JSON, Shell scripting, GIT.