- Around 5+ years of professional IT experience, including 3+ years in Data Science/Machine Learning and the Big Data ecosystem.
- Highly experienced in Statistical Learning, Deep Learning, Machine Learning, Data Science, and Data Mining methods and algorithms. Worked with a large number of datasets (200 TB) and handled structured and unstructured data.
- Designed and developed algorithms for Natural Language Processing (NLP) and worked with Data Validation, Data Visualization, and Predictive Modeling using Python, R, Hadoop, Apache Spark, and Kafka.
- Highly proficient in managing the data science project life cycle, including data acquisition, cleaning, and augmentation, using PCA and other machine learning and statistical algorithms.
- Expert in Python, R, NumPy, scikit-learn, Matplotlib, Pandas, Beautiful Soup, NLTK, TensorFlow, PyTorch, PyCharm, and other packages.
- Working experience building Spark applications using build tools like SBT, Maven, and Gradle.
- Good experience with different file formats like Text, SequenceFile, RCFile, ORC, Parquet, Avro, and JSON, and different compression formats like GZip, LZO, BZip2, and Snappy.
- Good knowledge of relational databases like MySQL and Oracle, and NoSQL databases like HBase and MongoDB.
- Excellent understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Hands-on experience installing and deploying Hadoop ecosystem components like MapReduce, YARN, HDFS, NoSQL (HBase), Oozie, Hive, Tableau, Sqoop, ZooKeeper, and Flume.
- Working knowledge of the PySpark APIs.
- Strong experience on Hadoop distributions like Cloudera and Hortonworks.
- Excellent hands-on experience developing Hadoop architecture on Windows and Linux platforms.
- Good technical skills in Oracle 11i, SQL Server, and ETL development using Informatica.
- Expert in importing and exporting data between Oracle/MySQL databases and HDFS using Sqoop and Flume.
- Experience in ingesting the streaming data to Hadoop clusters using Flume and Kafka.
- Performed data analytics using Pig and Hive for Data Architects and Data Scientists within the team.
- Experience with NoSQL databases like HBase and Cassandra, as well as other ecosystem tools like ZooKeeper, Oozie, and Storm.
- Experience in Job scheduling using Autosys.
- Developed stored procedures and queries using PL/SQL.
- Expertise in RDBMSs like Oracle, MS SQL Server, Teradata, MySQL, and DB2.
- Strong analytical skills with the ability to quickly understand clients' business needs. Involved in meetings to gather information and requirements from clients. Led the team and coordinated onsite/offshore efforts.
Analytical Tools: SQL, Jupyter Notebook, Tableau, Zeppelin
Machine Learning: Regression, Decision Trees, PCA, Time Series, Random Forest, Probabilistic Models, Neural Nets (RNN, CNN, ResNet, Reinforcement Learning, Boosting), Optimization Methods, Ensemble Models, Image Recognition, Computer Vision, TensorFlow, Deep Learning, NLP, Statistical Modeling, Naive Bayes, KNN, K-Means Clustering
NoSQL: Cassandra, HBase, MongoDB
Hadoop Distributions: Cloudera, Hortonworks, AWS (S3, EC2, EMR, Redshift), MapR.
Programming: Python, R, C++, Scala, Python - Data Manipulation, NumPy, Pandas, Matplotlib, Plotly, SciPy, NLTK, Beautiful Soup
Big Data: Spark (Spark Core, Spark SQL, Spark Streaming, PySpark), Hive, Sqoop, HBase, Hadoop, HDFS, MapReduce, Flume, Shell Scripting, Scala, AWS, Hue
Databases: Oracle 11g/10g, DB2 8.1, MS SQL Server, MySQL
Operating Systems: Unix/Linux, macOS, Windows 2000/NT/XP
Confidential - Albany, NY
- Defined the requirements for data lakes/pipelines.
- Led several big data machine learning initiatives involving the design, development, and deployment of advanced machine learning algorithms that impact core products, allowing the business to grow and scale to over 100 million customers and process over a billion transactions annually.
- Designed and developed neural network models to prevent account takeover activity at the time of user authentication.
- Performed feature selection from over 2,000 features and applied best practices in neural network model development.
- Developed Gradient Boosted Trees using the GBM package in R.
- Implemented new features to improve algorithm performance.
- Designed several tools in Python to visualize GBT models and developed processes to extract model parameters for deployment.
- Developed end-to-end data pipelines.
- Involved in requirements gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
- Involved in capacity planning and configuration of a Cassandra cluster on DataStax.
- Designed, developed, and implemented performant ETL pipelines using the Python API of Apache Spark (PySpark).
- Wrote reusable, testable, and efficient code. Experienced in using Sqoop to import data into Cassandra tables from different relational databases.
- Imported and exported data between databases and HDFS using Sqoop.
- Worked extensively on the Spark Core and Spark SQL modules using Python and Scala.
- Utilized Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS using Python/PySpark and Scala, as well as to databases such as HBase.
- Created partitions and buckets based on state to enable further processing with bucket-based Hive joins.
- Responsible for exporting analyzed data to relational databases using Sqoop.
- Created tables in Hive and integrated data between Hive and Spark.
Environment: Python, Deep Learning, Jupyter Notebook, TensorFlow, DataStax Cassandra, MapReduce, Cloudera Manager, Hive, Oozie, and Sqoop
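The gradient-boosted-trees work above can be sketched in miniature. This is a simplified, stdlib-only illustration of the boosting idea (decision stumps on a single feature, squared-error loss), not the production GBM/R code; all data and function names here are hypothetical.

```python
# Minimal gradient boosting for regression with decision stumps.
# Simplified illustration only -- real GBT uses full trees, many
# features, and a tuned learning rate / number of rounds.

def fit_stump(xs, residuals):
    """Find the single-feature split that minimizes squared error
    when predicting the mean residual on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_gbt(xs, ys, n_rounds=200, lr=0.3):
    """Boost stumps against the residuals of the running prediction."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# Toy training data (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
model = fit_gbt(xs, ys)
```

Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble, which is the same additive-model structure the production GBT followed.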
Confidential, Grumman, CA
Big Data Consultant
- Data analysis using open source tools
- Defining the requirements for data lakes/pipelines
- Managed the entire data science project life cycle and was actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling, dimensionality reduction using Principal Component Analysis, testing and validation using K-fold cross-validation, and data visualization.
- Used statistical learning algorithms such as logistic regression, linear regression, hypothesis testing, and ANOVA throughout the project life cycle.
- Transforming the data using Spark applications for analytics consumption
- Designed and implemented data ingestion techniques for real-time data coming from various source systems
- Created regulatory reports and analyses; defined the data streams
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting
- Importing and exporting data into HDFS and Hive using Sqoop
- Wrote Hive queries for data analysis to meet the business requirements
- Experience in managing and reviewing Hadoop log files
- Worked in an aggressive Agile environment and participated in daily stand-up/Scrum meetings
Environment: Hadoop, HDFS, MapReduce, Cloudera Manager, Pig, Hive, Sqoop, HBase, Oozie, Flume.
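The K-fold cross-validation step mentioned above can be sketched as follows. This is a minimal stdlib-only illustration; a mean-value predictor on hypothetical toy data stands in for the real models that were validated.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n)
    into k disjoint folds, shuffled reproducibly."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Toy use: score a mean-value predictor with 5-fold CV (hypothetical data).
ys = [2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.3, 2.1, 1.9, 2.0]
errors = []
for train, test in k_fold_splits(len(ys), 5):
    mean = sum(ys[j] for j in train) / len(train)  # "fit" on the train fold
    errors.extend((ys[j] - mean) ** 2 for j in test)  # score on the held-out fold
cv_mse = sum(errors) / len(errors)
```

Every observation is held out exactly once, so `cv_mse` estimates out-of-sample error rather than training error, which is the point of the validation step.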
- Creating database objects such as tables, views, stored procedures, triggers, etc.
- Identifying columns for primary keys in all tables at design time and creating them.
- Creating functions to provide custom functionality per the requirements.
- Identifying potential blocking and deadlocking situations and writing code to avoid them.
- Ensuring that code is written with security issues such as SQL injection in mind.
- Developing reports in SQL Server Reporting Services.
- Creating Entity Relationship (ER) diagrams for the proposed database.
Environment: Python, Shell scripting, PL/SQL, Oracle, Quality Center, Windows.
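The SQL-injection concern noted above is conventionally addressed with parameterized queries. A minimal sketch using Python's stdlib sqlite3 module (standing in for the SQL Server/Oracle environment; the table and values are hypothetical):

```python
import sqlite3

# Parameterized queries keep user input out of the SQL text itself,
# which is the standard defense against SQL injection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
conn.execute("INSERT INTO users (name) VALUES (?)", ("bob",))

# A classic injection payload is treated as plain data, not as SQL,
# because it is bound to the placeholder instead of concatenated in.
evil = "alice' OR '1'='1"
rows = conn.execute(
    "SELECT id, name FROM users WHERE name = ?", (evil,)
).fetchall()

# A legitimate lookup through the same placeholder works normally.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", ("alice",)
).fetchall()
```

The injection attempt matches no row, while the legitimate lookup returns the expected user; the same placeholder discipline applies to T-SQL and PL/SQL bind variables.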