Kafka Administrator/Data Scientist Resume
San Francisco, CA
SUMMARY
- Around 5 years of professional experience with a focus on Hadoop Administration, Machine Learning, Data Science and related technologies.
- Passionate about leveraging ever-evolving technology to bridge the gap between theoretical math and real-life applications. Well versed in Machine Learning, Data Visualization & Analysis.
- Experience in performing major & minor Kafka/Hadoop upgrades in large environments.
- As an admin, involved in cluster maintenance, troubleshooting and monitoring, and followed proper backup & recovery strategies.
- Experienced in installing, configuring, supporting and monitoring large Hadoop clusters using CDH and HDP.
- Experience in HDFS data storage and support for running MapReduce jobs.
- Experience in designing and implementing HDFS access controls, directory and file permissions, and user authorization to facilitate stable, secure access in a large multi-tenant cluster.
- Experience in using Cloudera Manager for Installation and management of Hadoop clusters.
- Experience working in large environments and leading infrastructure support & operations.
- Installing and configuring Kafka and monitoring the cluster using Nagios and Ganglia.
- Loaded log data into HDFS using Kafka and performed ETL integrations. Experience ingesting data from RDBMS sources (Oracle, SQL) into HDFS using Sqoop.
- Experience in big data technologies: Hadoop HDFS, Pig, Hive, Sqoop, Zookeeper.
- Experience in benchmarking and performing backup and disaster recovery of NameNode metadata and important sensitive data residing on the cluster.
- Monitoring and support through Nagios and Ganglia. Migrated applications from existing systems like MySQL and Oracle to Hadoop.
- Experience in commissioning, decommissioning, balancing and managing nodes, and tuning servers for optimal cluster performance.
- Expertise in applying machine learning techniques such as clustering (K-Means), Linear Regression, classification techniques (Random Forest, KNN), ANN (Artificial Neural Networks) and SVM (Support Vector Machines); see the clustering sketch after this summary.
- Used natural language processing (NLP) techniques combined with machine learning models to classify customers' complaints
- Proficient in Data Cleansing, preprocessing, Dimensionality reduction, Exploratory Data Analysis using Python and R
- Well versed in handling heterogeneous data sources and implementing intelligent solutions that define and shape the business landscape.
- Developed statistical methods to evaluate the extremity of a business phenomenon that indicated potential non-compliance behaviors. Performed numeric data analysis using Pandas, NumPy and Matplotlib.
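A minimal sketch of the K-Means clustering workflow referenced above, using scikit-learn; the feature values and cluster count are illustrative assumptions, not drawn from any project described here.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Toy feature matrix (age, income) standing in for real data (hypothetical values).
    X = np.array([[25, 40000], [32, 52000], [47, 88000],
                  [51, 95000], [23, 38000], [45, 91000]], dtype=float)

    # Standardize features so one scale does not dominate the distance metric.
    X_scaled = StandardScaler().fit_transform(X)

    # Fit K-Means with 2 clusters and read off each row's assignment.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    print(labels)           # e.g. [0 0 1 1 0 1]
    print(kmeans.inertia_)  # within-cluster sum of squares, useful for the elbow method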
TECHNICAL SKILLS
Big Data Technologies: Pig, Hive, Sqoop, Flume, HBase, Spark, Oozie, ZooKeeper, Hadoop Distributions (Cloudera)
Programming Languages: Python, Java, C/C++, SQL, JavaScript, Node.js
Operating Systems: Windows 10/7, Linux, Unix, Android
Java Technologies: J2SE, J2EE - JSP, Servlets, JDBC, JSTL, EJB, Junit, RMI, JMS
Web Technologies: Ajax, JavaScript, JQuery, HTML 5, CSS 3, XML
Databases: MySQL Server 2018, Oracle 10g, Amazon DynamoDB, PostgreSQL, MySQL
Frameworks/IDEs: Struts, JSF, Hibernate; SOAP; Eclipse
Tools/Version Control Systems: Git, CVS, SVN; Maven, Ant, JUnit
Expertise: Scikit-learn, NumPy, SciPy, OpenCV, Deep Learning, NLP, RNN, CNN, TensorFlow, Keras, Matplotlib
Machine Learning Algorithms: Linear Regression, Logistic Regression, Decision Trees, Random Forest, K Means Clustering, Support Vector Machines
Data Analysis Skills: Data Cleaning, Data Visualization, Feature Selection, Pandas
Tools: Tableau, Anaconda
PROFESSIONAL EXPERIENCE
Confidential, San Francisco, CA
Kafka Administrator/Data Scientist
Responsibilities:
- Working as a Hadoop admin responsible for all aspects of clusters totaling 90 nodes, ranging from POC (proof-of-concept) to production clusters.
- Provided regular user and application support for highly complex issues involving multiple components such as Hive, Spark, Kafka, MapReduce.
- Responsible for cluster maintenance, monitoring, commissioning and decommissioning DataNodes, troubleshooting, and managing and reviewing data backups and log files.
- Created Kafka topics, provided ACLs to users, and set up REST mirror and MirrorMaker to transfer data between two Kafka clusters. Used MapR, HDFS, MapReduce, Pig, Hive and Oozie on Amazon EMR, leveraging appropriate AWS services.
- Day-to-day responsibilities included resolving developer issues, handling deployments that move code between environments, providing access to new users, delivering quick solutions to reduce impact, and documenting them to prevent future issues.
- Added/installed new components and removed them through Cloudera Manager.
- Collaborated with application teams to install operating system and Hadoop updates, patches.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
- Experience integrating Kafka with Spark for real-time data processing; see the sketch after this section.
- Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
- Installing and configuring Kafka cluster and monitoring the cluster using Nagios and Ganglia.
- Retrieved data from HDFS into relational databases with Sqoop.
- Involved in extracting the data from various sources into Hadoop HDFS for processing.
- Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, HBase database and Sqoop.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Supported in setting up QA environment and updating configurations for implementing scripts with Pig and Sqoop.
Environment: HDFS, MapReduce, Hive, Hue, Pig, Flume, Oozie, Sqoop, CDH5, Apache Hadoop, Spark, Knox, Kafka, Cloudera Manager, MySQL and Oracle.
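A minimal sketch of the Kafka-to-Spark integration referenced above, using PySpark Structured Streaming; the broker addresses, topic name and HDFS paths are illustrative assumptions, and the spark-sql-kafka connector package must be available on the Spark classpath.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("kafka-stream-sketch")
             .getOrCreate())

    # Read a Kafka topic as a streaming DataFrame (hypothetical brokers/topic).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
              .option("subscribe", "app-logs")
              .option("startingOffsets", "latest")
              .load())

    # Kafka keys/values arrive as bytes; cast to strings before applying Spark SQL logic.
    parsed = events.selectExpr("CAST(key AS STRING) AS key",
                               "CAST(value AS STRING) AS value",
                               "timestamp")

    # Persist the stream to HDFS as Parquet, with a checkpoint for fault tolerance.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/app_logs")
             .option("checkpointLocation", "hdfs:///checkpoints/app_logs")
             .start())

    query.awaitTermination()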
Confidential
Machine Learning Consultant/Hadoop Engineer
Responsibilities:
- Trained a linear regression model to predict the score, achieving an RMSE of 0.85.
- Performed Data Cleansing, preprocessing, Dimensionality reduction, Exploratory Data Analysis and Feature engineering.
- Evaluated the model performance using metrics like RMSE, MAE, R squared and Adjusted R squared.
- Used PySpark for processing data in HDFS and building the regression model; see the sketch after this section.
- Data preprocessing and modeling were done on the big data ecosystem. Implemented web-service endpoints to return the score using Flask.
- Containerized the machine learning project using Docker.
- Coordinated with fellow data scientists, software engineers and DevOps teams in productionizing the project to serve the model at scale.
- Developed MapReduce program to convert mainframe fixed length data to delimited data.
- Used Pig Latin to apply transformations on systems of record.
- Experience with the Hadoop cluster monitoring tool Cloudera Manager.
- Extensively worked with the Cloudera Distribution of Hadoop (CDH 5.x, CDH 4.x).
- Designed and developed custom Avro storage to use in Pig Latin to load and store data.
- Used Ganglia to monitor the cluster and Nagios to send alerts around the clock.
- Experience in managing and reviewing Hadoop log files.
- Actively involved in design analysis, coding and strategy development.
- Developed Sqoop commands to pull data from Teradata and push to HDFS.
- Developed Hive scripts for implementing dynamic partitions and buckets for retail history data.
Environment: Python, Spark, SQL, Pandas, NumPy, Matplotlib
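A minimal sketch of the PySpark regression workflow referenced above, covering training and RMSE/MAE/R-squared evaluation; the input path, feature column names and label column are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.appName("lr-sketch").getOrCreate()

    # Load a prepared feature table from HDFS (hypothetical path and columns).
    df = spark.read.parquet("hdfs:///data/scores/features")

    # Spark ML expects a single vector column of features.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    data = assembler.transform(df).select("features", "score")

    train, test = data.randomSplit([0.8, 0.2], seed=42)

    model = LinearRegression(featuresCol="features", labelCol="score").fit(train)
    predictions = model.transform(test)

    # Evaluate hold-out predictions with RMSE, MAE and R-squared.
    for metric in ("rmse", "mae", "r2"):
        evaluator = RegressionEvaluator(labelCol="score", predictionCol="prediction",
                                        metricName=metric)
        print(metric, evaluator.evaluate(predictions))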
Confidential
Hadoop Administrator
Responsibilities:
- Installed and configured Hortonworks HDP Hadoop using Ambari.
- Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, HBase database and Sqoop.
- Responsible for building scalable distributed data solutions using Hadoop.
- Worked on installing the cluster, commissioning & decommissioning of DataNodes, NameNode recovery, capacity planning, and slots configuration.
- Installed, configured and tested a DataStax Enterprise Cassandra multi-node cluster with 4 data centers of 5 nodes each.
- Installed and configured Cassandra cluster and CQL on the cluster.
- Loaded log data into HDFS using Flume and Kafka and performed ETL integrations.
- Performed data ingestion using Hadoop distcp, Java and Python; see the sketch after this section.
- Created HBase tables to store variable data formats of PII data coming from different portfolios.
- Managing and reviewing Hadoop log files and debugging failed jobs.
- Implemented Kerberos Security Authentication protocol for production cluster.
- Implemented a script to transmit sysprin information from Oracle to HBase using Sqoop.
- Implemented test scripts to support test driven development and continuous integration.
- Worked on tuning the performance of Pig queries.
- Responsible for adding new ecosystem components, such as Spark, Storm, Flume and Knox, with the required custom configurations based on the requirements.
- Managed the design and implementation of data quality assurance and data governance processes.
- Worked with Infrastructure teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Implemented Fair scheduler to allocate fair amount of resources to small jobs.
- Assisted the BI team by Partitioning and querying the data in Hive.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
Environment: Hadoop HDFS, MapReduce, Hortonworks, Hive, Pig, Kafka, Oozie, Solr, Zeppelin, Flume, Sqoop, HBase.
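A minimal sketch of driving Hadoop distcp from Python, as referenced above; the NameNode hosts and directory paths are hypothetical.

    import subprocess

    def distcp(src, dst, overwrite=False):
        """Run 'hadoop distcp' between two cluster paths and raise on failure."""
        cmd = ["hadoop", "distcp"]
        if overwrite:
            cmd.append("-overwrite")
        cmd.extend([src, dst])
        # distcp launches a MapReduce copy job; check=True surfaces a non-zero exit code.
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        # Hypothetical source and target cluster paths.
        distcp("hdfs://source-nn:8020/data/portfolios",
               "hdfs://target-nn:8020/data/portfolios")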
Confidential
Hadoop Engineer
Responsibilities:
- Analyze large datasets to provide strategic direction to the company.
- Perform quantitative analysis of product sales trends to recommend pricing decisions.
- Conduct cost and benefit analysis on new ideas.
- Scrutinize and track customer behavior to identify trends and unmet needs.
- Develop statistical models to forecast inventory and procurement cycles.
- Assist in developing internal tools for data analysis.
- Used PySpark for processing data in HDFS.
- Data preprocessing and modeling were done on the big data ecosystem. Implemented web-service endpoints to return the score using Flask; see the sketch after this section.
- Containerized the machine learning project using docker framework.
- Coordinated with fellow data scientists, software engineers and dev-ops teams in productionizing the project to serve this model at scale.
- Developed MapReduce program to convert mainframe fixed length data to delimited data.
- Used Pig Latin to apply transformations on systems of record.
- Experience with the Hadoop cluster monitoring tool Cloudera Manager.
Environment: Hadoop HDFS, Hortonworks, Hive, Pig, Kafka, Oozie, HBase.
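A minimal sketch of a Flask scoring endpoint like the one referenced above; the route, payload shape and model artifact name (a scikit-learn model persisted with joblib) are illustrative assumptions.

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.pkl")  # hypothetical model artifact

    @app.route("/score", methods=["POST"])
    def score():
        # Expect a JSON payload such as {"features": [1.0, 2.0, 3.0]}.
        payload = request.get_json(force=True)
        prediction = model.predict([payload["features"]])[0]
        return jsonify({"score": float(prediction)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)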
Confidential
Linux Administrator
Responsibilities:
- System installation and configuration of the AIX 5.3 operating system and Red Hat 3.x, 4.x & 5.x servers.
- User Administration, adding and removing user accounts, changing user attributes.
- Working with paging spaces, creating, increasing and decreasing paging spaces as per requirement.
- Configured VGs and LVs and extended LVs for file-system growth needs using LVM commands; see the sketch after this section.
- Patch Management. Problem determination in File systems and Logical Volumes.
- Creating and managing the default and user-defined paging spaces.
- Creating and updating crontab files.
- NFS Administration. System Resource Controller Administration.
- Corporate client support for mission critical environments.
- Responsible for 200 Linux servers (RHEL 3.0, 4.0 & 5.x); Bash scripting for automation of tasks.
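A minimal sketch of automating one such LVM growth task from Python, as referenced above; the VG/LV names are hypothetical and an ext filesystem is assumed (hence resize2fs).

    import subprocess

    def extend_lv(lv_path, extra="+5G"):
        """Grow a logical volume and the ext filesystem on it."""
        # Add space to the LV, e.g. lvextend -L +5G /dev/datavg/applv
        subprocess.run(["lvextend", "-L", extra, lv_path], check=True)
        # Grow the ext2/3/4 filesystem to fill the resized LV.
        subprocess.run(["resize2fs", lv_path], check=True)

    if __name__ == "__main__":
        extend_lv("/dev/datavg/applv")  # hypothetical VG/LV names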