Software Engineer Resume
SUMMARY:
- Python engineer with a background in data science and big data; 1+ years of hands-on experience building machine learning models, big data pipelines, and statistical analyses.
- Skilled in Python, R, Jupyter Notebook, big data (Spark, Hadoop, MapReduce), machine learning, statistics, SQL, problem solving, and programming.
SKILLS:
Languages: Python, R, SQL
Python: scikit-learn, Jupyter, pandas, NumPy, Matplotlib, Seaborn, IDLE, Spyder
R: RMD, R Shiny, RStudio, tidyr, dplyr, lattice graphics, caret, ggplot2, MASS, base, grid, cluster, e1071, class, ROCR, randomForest, glmnet, leaps, car
Big data: Spark, Hadoop, MapReduce
Statistical and Machine Learning Topics:
Data Wrangling and Exploratory Data Analysis: inferential statistics, descriptive statistics, statistical graphics, plot-based data analysis, sampling error analysis, hypothesis testing, A/B testing
Regression: K-Nearest Neighbors (KNN), simple linear regression, multiple linear regression, interaction terms
Classification: naive Bayes classifier, classification with KNN, logistic regression, support vector machines (SVM), tree-based methods (decision trees), boosting, bagging, and random forests
Unsupervised Learning and Clustering: Principal Component Analysis (PCA), K-means clustering and hierarchical clustering
Model Selection and Regularization: ridge regression, lasso, dimensionality reduction (PCA), stepwise/forward selection, ANOVA, nested models, prediction accuracy tests, cross-validation, bootstrapping
Tools/Other: Confidential SPSS Statistics tool, REST APIs, JSON, CSV, MySQL, Git, IntelliJ, Microsoft Office, Microsoft Excel, Unix
WORK EXPERIENCE:
Software Engineer
Confidential
Responsibilities:
- Analyzed the airline dataset using SparkContext; created a Resilient Distributed Dataset (RDD) from the airline data
- Explored the data using lambda functions
- Computed the average distance travelled per flight by applying MapReduce operations
- Computed the average delay using the aggregate function
- Created a frequency histogram of delays using countByValue and GraphX
- Used Jupyter Notebook to prepare the report
Tools/Technologies: Python, Spark, lambda functions, MapReduce, Jupyter Notebook
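The map/reduce aggregations above can be sketched in plain Python with toy data (the actual work used Spark RDDs on a confidential dataset; the records and values here are hypothetical):

```python
from functools import reduce
from collections import Counter

# Toy airline records: (flight_id, distance_miles, delay_minutes).
# Hypothetical sample data standing in for the confidential dataset.
flights = [
    ("AA1", 2475, 10),
    ("UA2", 337, -5),
    ("DL3", 1096, 25),
    ("AA4", 2475, 0),
]

# Map step: extract distances; reduce step: sum them; then average.
distances = list(map(lambda rec: rec[1], flights))
avg_distance = reduce(lambda a, b: a + b, distances) / len(distances)

# Average delay computed with the same map/reduce pattern.
delays = list(map(lambda rec: rec[2], flights))
avg_delay = reduce(lambda a, b: a + b, delays) / len(delays)

# countByValue analogue: frequency of each delay value (histogram input).
delay_counts = Counter(delays)

print(avg_distance, avg_delay, delay_counts)
```

In Spark the same logic maps onto `rdd.map(...)`, `rdd.reduce(...)`, and `rdd.countByValue()`, executed in parallel across partitions.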
Software Engineer Intern
Confidential
Responsibilities:
- Explored the dataset using the NumPy, pandas, Matplotlib, Seaborn, scikit-learn, and SciPy libraries; visualized each variable's distribution with histograms and plotted a correlation matrix with Matplotlib and Seaborn to look for strong relationships between variables
- Unsupervised learning: applied random forest and k-nearest neighbors to define an outlier (fraud) detection method; used the IsolationForest and LocalOutlierFactor estimators
- Fitted and compared both models
- Calculated fraud prediction errors, model accuracy, and confusion matrices for both models; concluded that random forest predicts fraud with better precision than k-nearest neighbors
- Used Jupyter Notebook to prepare the report
Tools/Technologies: Python, Spark, lambda functions, MapReduce, Jupyter Notebook, NumPy, pandas, Matplotlib, Seaborn, scikit-learn, SciPy, IsolationForest, LocalOutlierFactor, CSV
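A minimal sketch of the outlier-based fraud detection step, using the IsolationForest and LocalOutlierFactor estimators named above on synthetic data (the real dataset is confidential; the cluster sizes and thresholds here are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic transactions: a dense cluster of normal points plus a few
# far-off points standing in for fraudulent records (hypothetical data).
rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
frauds = rng.uniform(low=6, high=8, size=(5, 2))
X = np.vstack([normal, frauds])

# IsolationForest isolates anomalies with random splits; -1 marks outliers.
iso = IsolationForest(contamination=0.025, random_state=42)
iso_labels = iso.fit_predict(X)

# LocalOutlierFactor flags points whose local density is unusually low.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.025)
lof_labels = lof.fit_predict(X)

# The injected fraud rows (the last 5) should mostly be flagged as -1.
print((iso_labels[-5:] == -1).sum(), (lof_labels[-5:] == -1).sum())
```

Comparing the two label vectors against known fraud labels yields the prediction errors and confusion matrices mentioned above.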
Confidential
Python Engineer
Responsibilities:
- Explored the data using pandas
- Created density plots for each variable using the Seaborn library
- Developed clustering methods for show and no-show medical appointments:
- Principal Component Analysis (PCA) using the scikit-learn library
- Random forest using scikit-learn and Seaborn
- Created a confusion matrix and evaluated how well the model fitted the dataset
- Used cross-validation to estimate test errors
Tools/Technologies: Python, Jupyter Notebook, NumPy, pandas, Seaborn, scikit-learn
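The PCA + random forest + cross-validation pipeline above can be sketched with scikit-learn on synthetic data (the appointment dataset is confidential; `make_classification` and all parameter values here are stand-in assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the show/no-show appointment data (hypothetical).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# PCA: project the features onto the top two principal components.
X2 = PCA(n_components=2).fit_transform(X)

# Random forest fit on the reduced data, evaluated with a confusion matrix.
X_tr, X_te, y_tr, y_te = train_test_split(X2, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, rf.predict(X_te))

# Cross-validation to estimate generalization accuracy.
cv_acc = cross_val_score(rf, X2, y, cv=5).mean()
print(cm, cv_acc)
```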
Confidential
Python Engineer
Responsibilities:
- Explored and cleaned the data
- Performed univariate and unsupervised analyses, including principal component analysis (PCA)
- Built a logistic regression model using multiple predictors as input
- Built a random forest model using the categorized income variable and plotted variable importance
- Built a support vector machine (SVM) model with various parameter choices and plotted variable importance
- Built a k-nearest neighbors model
- Summarized the statistical models in terms of accuracy/error and sensitivity/specificity
- Used bootstrapping to compare the performance of the logistic regression, random forest, and SVM models
Tools/Technologies: R, RMD, cluster, e1071, class, ROCR, caret, ggplot2, PCA, logistic regression, random forest, SVM, KNN, bootstrapping
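The bootstrap comparison of logistic regression, random forest, and SVM described above can be sketched as follows (the original work was done in R; this is an equivalent scikit-learn sketch on synthetic data, with all parameter values assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.utils import resample

# Hypothetical data in place of the confidential income dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "svm": SVC(),
}

# Bootstrap: refit each model on resampled training sets and score on the
# out-of-bag rows, yielding a distribution of accuracies per model.
scores = {name: [] for name in models}
for b in range(10):
    idx = resample(np.arange(len(X)), random_state=b)
    oob = np.setdiff1d(np.arange(len(X)), idx)
    for name, model in models.items():
        model.fit(X[idx], y[idx])
        scores[name].append(model.score(X[oob], y[oob]))

mean_acc = {name: float(np.mean(s)) for name, s in scores.items()}
print(mean_acc)
```

Comparing the mean out-of-bag accuracies (and their spread) across the three models is the bootstrap comparison summarized in the bullet above.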
Confidential
Python Engineer
Responsibilities:
- Data exploration: loaded, cleaned, and summarized the data
- Selected optimal models using exhaustive, forward, and backward selection methods
- Selected an optimal set of variables for developing clustering models; described differences and similarities between the attributes deemed important in each case
- Used cross-validation to estimate test errors with different numbers of variables
- Used lasso and ridge regularized approaches; compared the resulting models in terms of the number of variables and their effects via regression subset selection and resampling
- Principal Component Analysis (PCA): merged the red and white wine datasets, plotted the data projection onto the first two principal components (biplots and similar plots), and built a PCA-based model to assess wine quality
Tools/Technologies: R, RMD; R libraries: glmnet, leaps, ggplot2, MASS, corrplot, car; data exploration: exhaustive, forward, and backward selection; cross-validation; regularized approaches: lasso and ridge
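The lasso-versus-ridge comparison above (number of retained variables under each penalty, with cross-validated penalty strength) can be sketched as follows (the original analysis used R's glmnet; this is an equivalent scikit-learn sketch on synthetic stand-in data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic regression data standing in for the wine datasets (hypothetical).
X, y = make_regression(n_samples=200, n_features=12, n_informative=4,
                       noise=5.0, random_state=0)

# Cross-validated lasso: the L1 penalty drives some coefficients exactly
# to zero, performing variable selection.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Cross-validated ridge: the L2 penalty shrinks coefficients toward zero
# but never eliminates them, so every variable is retained.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)

n_lasso = int(np.sum(lasso.coef_ != 0))
n_ridge = int(np.sum(ridge.coef_ != 0))
print(n_lasso, n_ridge)
```

The coefficient counts make the comparison in the bullet concrete: lasso typically keeps a subset of the variables, while ridge keeps all of them with shrunken effects.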
Confidential
Python Engineer
Responsibilities:
- Cleaned the dataset by imputing missing values with the series mean
- Checked the linear regression model's assumptions; transformed the data using reflected and logarithmic transformations
- Split the data into training and test sets; developed a linear regression model on the training set using predictors and an interaction term
- Cross-validated the linear regression model
Tools/Technologies: Confidential SPSS Statistics tool, logarithmic transformation, linear regression, cross-validation
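The workflow above (mean imputation, log transformation, a model with an interaction term, train/test split, cross-validation) can be sketched in Python (the original was done in SPSS; the data here is synthetic and all coefficients are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical right-skewed data standing in for the confidential dataset.
rng = np.random.RandomState(0)
x1 = rng.uniform(1, 10, 200)
x2 = rng.uniform(1, 10, 200)
y = np.exp(0.3 * x1 + 0.1 * x1 * x2 + rng.normal(0, 0.2, 200))

# Impute missing values with the series mean (as in the original workflow).
x1[::25] = np.nan
x1 = np.where(np.isnan(x1), np.nanmean(x1), x1)

# Log-transform the skewed response so residuals look more normal.
log_y = np.log(y)

# Design matrix with both predictors and an interaction term.
X = np.column_stack([x1, x2, x1 * x2])

# Train/test split, fit, and cross-validate.
X_tr, X_te, y_tr, y_te = train_test_split(X, log_y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
r2_test = model.score(X_te, y_te)
r2_cv = cross_val_score(model, X, log_y, cv=5).mean()
print(r2_test, r2_cv)
```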