Data Analyst Resume
Princeton, NJ
SUMMARY
- 7 years of experience across a variety of industries, including Big Data technologies (the Apache Hadoop stack and Apache Spark), Python/Java, web technologies, and ETL
- Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance
- Experienced working with various Hadoop Distributions (Amazon EMR, Cloudera, Hortonworks, MapR) to fully implement and leverage new Hadoop features
- Experience with SQL on Hadoop using tools such as Hive, Impala, and Spark SQL, and with Sqoop for data transfer
- Experience developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs (see the first PySpark sketch after this summary)
- Worked with real-time data processing and streaming using Spark Streaming and Kafka (see the streaming sketch after this summary)
- Experience moving data between HDFS and relational database systems (RDBMS) using Apache Sqoop
- Expertise in working with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning Hive queries (see the Hive DDL sketch after this summary)
- Significant experience writing custom UDFs in Hive and custom InputFormats in MapReduce
- Replaced existing MapReduce jobs and Hive scripts with Spark SQL and Spark data transformations for more efficient data processing
- Validated data using PySpark programs
- Experience developing Kafka producers and consumers that stream millions of events per second
- Strong understanding of real-time streaming technologies (Spark and Kafka)
- Knowledge of workflow management and coordination tools such as Oozie
- Strong experience building end to end data pipelines on Hadoop platform
- Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase
- Strong understanding of logical and physical database models and entity-relationship modeling
- Experience with Software development tools such as JIRA, Play, GIT, Bitbucket, Bamboo
- Good understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data
- Strong understanding of the Java Virtual Machine and multithreaded processing
- Experienced in using Agile methodologies, including Extreme Programming (XP), Scrum, and Test-Driven Development (TDD)
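The Spark bullets above reference three API levels; here is a minimal PySpark sketch, not drawn from any project listed here, showing the same per-user aggregation through the DataFrame API, Spark SQL, and the RDD API. The events.json path and column names are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("summary-sketch").getOrCreate()

    # DataFrame API: read JSON events (hypothetical file) and count events per user
    df = spark.read.json("events.json")
    per_user = df.groupBy("user_id").agg(F.count("*").alias("events"))

    # The same aggregation through Spark SQL over a temporary view
    df.createOrReplaceTempView("events")
    per_user_sql = spark.sql(
        "SELECT user_id, COUNT(*) AS events FROM events GROUP BY user_id")

    # The same aggregation through the lower-level RDD API
    per_user_rdd = (df.rdd.map(lambda row: (row["user_id"], 1))
                          .reduceByKey(lambda a, b: a + b))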
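For the Spark Streaming and Kafka bullets, a minimal Structured Streaming sketch that reads a Kafka topic and maintains a running record count; the broker address and topic name are placeholders, and it assumes the spark-sql-kafka connector is on the classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

    # Read a Kafka topic as a streaming DataFrame (broker and topic are placeholders)
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "transactions")
              .load())

    # Kafka delivers the value as binary; cast to string, then keep a running count
    counts = (stream.select(F.col("value").cast("string").alias("payload"))
                    .groupBy()
                    .count())

    # Emit the updated count to the console on each micro-batch
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()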
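For the Hive partitioning and bucketing bullet, an illustrative DDL sketch issued through a Hive-enabled Spark session; the table and column names are made up.

    from pyspark.sql import SparkSession

    # Hive support must be enabled to create partitioned/bucketed warehouse tables
    spark = (SparkSession.builder.appName("hive-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Illustrative DDL: partition by load date, bucket by customer id
    spark.sql("""
        CREATE TABLE IF NOT EXISTS txns (
            txn_id BIGINT,
            customer_id BIGINT,
            amount DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)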
TECHNICAL SKILLS
Programming Languages: Python, PySpark, SQL, Shell/Bash, Java, Spark SQL, Hive, R, C, C++
Internet Technologies: JavaScript, Chart.js, D3.js, HTML5, CSS3, PHP, Bootstrap, Angular, REST APIs, Airflow
Databases: Hive (DW), MySQL, MongoDB, Cassandra, PostgreSQL, Redshift (DW)
IDEs/Development tools: Jupyter Notebook, Postman, IntelliJ, Eclipse (Java EE), GitHub, MongoDB Compass, Tableau
Platforms: Linux, Ubuntu, macOS, Windows
PROFESSIONAL EXPERIENCE
Confidential, Paoli, PA
Data Engineer
Responsibilities:
- Working on the legal compliance team within financial services; day-to-day work includes large-scale financial data management, strategizing and implementing efficient data architecture for financial crime (FC) detection teams, and performing scalable batch/stream data processing.
- Responsible for migrating, ingesting, and transforming large-scale raw transactional datasets into standardized, scalable data products for the FC teams (see the PySpark sketch below).
Technologies used: PySpark, Python, Bash/Shell, SQL, Hadoop, Spark SQL, Splunk, Hive, MapReduce, Sqoop, Flume, AWS, EMR, S3, EC2, Hue, Tableau
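A minimal sketch of the kind of batch standardization job described above; the S3 paths and column names are hypothetical, since the real datasets are confidential.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("standardize-sketch").getOrCreate()

    # Hypothetical raw transactions landed on S3 as headered CSV
    raw = spark.read.option("header", "true").csv("s3://bucket/raw/transactions/")

    # Standardize: typed amounts, trimmed ids, derived partition date, drop rows missing keys
    clean = (raw.withColumn("amount", F.col("amount").cast("double"))
                .withColumn("account_id", F.trim(F.col("account_id")))
                .withColumn("txn_date", F.to_date("txn_ts"))
                .dropna(subset=["account_id", "txn_ts"]))

    # Write a partitioned, columnar data product for downstream teams
    (clean.write.mode("overwrite")
          .partitionBy("txn_date")
          .parquet("s3://bucket/products/transactions/"))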
Confidential, Princeton, NJ
Data Engineer
Responsibilities:
- Creating data pipelines, strategizing and implementing microservice-based data infrastructure and REST APIs, scraping raw web content, and storing it in the cloud.
- Ensuring efficient data management to reduce cost, writing ML models for client-focused project solutions, and participating in the whole project lifecycle.
- Generated client-facing reports and created visualizations using Plotly and Tableau; worked with Big Data technologies, Python, Beautiful Soup, REST web services, AWS (S3, EC2, EMR), MS Azure, Flask, and MySQL.
- Worked on REST APIs, scraping and crawling large volumes of web data, data cleaning, data preprocessing, creating visualizations, performing machine learning, and implementing data pipelines (see the scraping sketch below).
- Worked on Big Data technologies in the Hadoop ecosystem.
Technologies used: Python, Beautiful Soup, Requests, REST APIs, MySQL, Spark, Hive, MapReduce, S3, EC2
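A minimal sketch of the scraping work described above, using Requests and BeautifulSoup; the URL and the choice of h2 headings are purely illustrative.

    import requests
    from bs4 import BeautifulSoup

    def scrape_headings(url):
        """Fetch a page and return the text of its h2 headings."""
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        return [h.get_text(strip=True) for h in soup.find_all("h2")]

    if __name__ == "__main__":
        # example.com is a placeholder target
        print(scrape_headings("https://example.com"))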
Confidential
Responsibilities:
- Asynchronously scraped text from thousands of websites.
- Implemented parallelized data-processing operations using the Dask framework to clean and filter text data (see the Dask sketch below).
- Implemented ML algorithms to extract the required information accurately at scale.
- Built ML-based optimizers for contact sourcing to retrieve client-focused results and tag searches.
Technologies used: Python, asyncio, Dask, BeautifulSoup, Requests, JSON, Selenium, Scrapy, Matplotlib, pandas, AWS, MongoDB, XGBoost, NLP, NER, PySpark
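A minimal sketch of the parallelized text cleaning with Dask mentioned above; the input and output globs are hypothetical.

    import dask.bag as db

    def clean(line):
        # Normalize case and whitespace
        return " ".join(line.lower().split())

    # Read scraped text files in parallel, clean each line, drop empties
    bag = db.read_text("scraped/*.txt")
    cleaned = bag.map(clean).filter(lambda s: len(s) > 0)
    cleaned.to_textfiles("cleaned/*.txt")  # one output part per partition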
Confidential
Data Analyst
Responsibilities:
- Created a structured data pipeline with 40+ integrations of various data sources to filter, transform, and validate the inflow of raw data.
- Performed data cleaning, preprocessing, and transformations, and built predictive models.
- Performed targeted analysis of sales and customer acquisition.
- The goal was to find key insights and opportunities for leveraging the data intelligently, improving customer targeting and overall data value to increase sales.
- Performed RFM analysis (see the pandas sketch below), customer-churn prediction, recommendation systems, association rule mining, data enrichment, and data-quality improvement.
Technologies used: Python, GraphLab, NumPy, pandas, scikit-learn, TensorFlow, Keras, Tableau, Chart.js, D3.js
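A minimal pandas sketch of the RFM (recency, frequency, monetary) analysis mentioned above, run on a tiny made-up order table; real scoring schemes vary.

    import pandas as pd

    # Made-up order history: one row per order
    orders = pd.DataFrame({
        "customer_id": [1, 1, 2, 3, 3, 3],
        "order_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                      "2024-01-20", "2024-02-25", "2024-03-15"]),
        "amount": [50.0, 20.0, 200.0, 35.0, 35.0, 80.0],
    })

    now = orders["order_date"].max()
    rfm = orders.groupby("customer_id").agg(
        recency=("order_date", lambda d: (now - d.max()).days),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )

    # Tercile scores; ranking first guarantees unique bin edges on small data
    rfm["r_score"] = pd.qcut(rfm["recency"].rank(method="first"), 3, labels=[3, 2, 1])
    rfm["f_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 3, labels=[1, 2, 3])
    rfm["m_score"] = pd.qcut(rfm["monetary"].rank(method="first"), 3, labels=[1, 2, 3])
    print(rfm)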
Confidential
Responsibilities:
- Developed robust machine learning models to predict the direction of cryptocurrency price movements.
- Instrumental in creating the infrastructure for the project's complete pipeline.
- Provided a framework for identifying key features for stacked models (see the stacking sketch below).
- Identified key features of price-direction movement useful for day traders.
Technologies used: Generative & Discriminative Models, Python, MongoDB, Neural Network, Bitcoin, Quandl
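A minimal scikit-learn sketch of a stacked model of the kind referenced above; the data is synthetic, since the real engineered market features are not shown here.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for engineered price/volume features
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Two base learners; a logistic meta-learner combines their predictions
    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(),
    )
    stack.fit(X_train, y_train)
    print("held-out accuracy:", stack.score(X_test, y_test))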
Confidential
Data Engineer
Responsibilities:
- Participated in all phases of the project life cycle, including data collection, data mining, data cleaning, model development, validation, and report creation.
- Implemented business intelligence dashboards using Tableau, producing different summary results based on requirements and user roles.
- Utilized MapReduce and PySpark programs to process data for analysis reports.
- Worked on data cleaning and ensured data quality, consistency, and integrity using Pandas and Numpy.
- Performed data preprocessing on messy data, including imputation, normalization, scaling, and feature engineering, using scikit-learn.
- Conducted exploratory data analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlations between features.
- Built classification models based on Logistic Regression, Decision Trees, Random Forest, Support Vector Machine, and ensemble algorithms to predict the probability of patient absence (see the cross-validation sketch below).
- Used metrics such as F-score, ROC, and AUC to evaluate the performance of each model, and K-fold cross-validation to test the models on different batches of data and optimize them.
- Implemented and tested the model on AWS EC2 and collaborated with development team to get the best algorithm and parameters.
- Performed data visualization, designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
Environment: Microsoft SQL Server, SQL Server Management Studio, T-SQL, MLlib, MapReduce, Python, JIRA, AWS, and Tableau.
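A minimal sketch of the evaluation workflow described above, pairing a classifier with K-fold cross-validation scored by ROC AUC; synthetic data stands in for the patient dataset.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic, imbalanced stand-in for the patient data
    X, y = make_classification(n_samples=2000, n_features=15,
                               weights=[0.8, 0.2], random_state=0)

    # 5-fold cross-validation with ROC AUC as the metric
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print("mean ROC AUC over 5 folds:", scores.mean())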
Confidential
Jr. Data Scientist
Responsibilities:
- Responsible for applying machine-learning techniques (regression/classification) to predict outcomes.
- Responsible for the design and development of advanced R/Python programs to prepare, transform, and harmonize datasets in preparation for modeling.
- Designed and automated the process of score cuts that achieved increased close and good rates using advanced R programming.
- Utilized standard Python modules such as csv, itertools, and pickle for development.
- Analyzed large datasets to answer business questions by generating reports and outcomes.
- Worked in a team of programmers and data analysts to develop insightful deliverables that support data-driven marketing strategies.
- Executed SQL queries from R/Python on complex table configurations (see the SQLAlchemy sketch below).
- Retrieved data from databases through SQL as per business requirements.
- Created, maintained, modified, and optimized SQL Server databases.
- Manipulated data using Python programming.
- Adhered to best practices for project support and documentation.
- Understood the business problem, built hypotheses, and validated them using the data.
- Managed the reporting/dashboarding for the key metrics of the business.
- Involved in data analysis using different analytic and modeling techniques.
Environment: R, Python (NumPy, SciPy, pandas, scikit-learn, NLTK), SQL, exploratory analysis, feature engineering, machine learning, NLP, Tableau.
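A minimal sketch of executing SQL from Python as described above, using SQLAlchemy and pandas; the connection string and table are hypothetical.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; substitute the real SQL Server DSN/credentials
    engine = create_engine("mssql+pyodbc://user:password@my_dsn")

    query = """
        SELECT customer_id, SUM(amount) AS total_spend
        FROM sales
        GROUP BY customer_id
    """
    df = pd.read_sql(query, engine)  # results land in a DataFrame
    print(df.head())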
