Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop
Languages: Python, Scala, Linux Shell Scripts, MySQL, Java
Platforms: Cloudera, JavaFX, Eclipse, Visual Studio, Git, Tomcat, Vagrant, Packer, VMBox, Tableau, Power BI
Confidential
- Ran all Hive scripts through Hive, Impala, Hive on Spark, and Spark SQL.
- Created Hive managed (internal) tables using bucketing and partitioning with different file formats, both column-oriented (Parquet, RCFile, ORC) and row-oriented (Avro, SequenceFile), and imported data from HDFS and MySQL using Sqoop. Skilled in writing UDFs and embedding them in Hive.
- Transformed data using Hive and Sqoop and optimized job performance.
- Worked with JSON and XML data, converted it to Avro format, and used Hive queries to extract the desired data. Ingested, organized, and maintained data in a data lake for querying.
- Used Tableau for visualizations, performed data quality checks to analyze false alarms, and analyzed critical application issues.
- Extracted, processed, and transformed data using Python scripts for various analytical purposes.
- Implemented optimization techniques for data retrieval, storage and data transfer.
- Developed scripting modules for scheduling jobs and troubleshot critical issues such as missing, inconsistent, and garbage data, as well as optimization and query analysis problems.
- Involved in the application's Big Data flow, from ingesting upstream data into HDFS to processing and analyzing the data in HDFS.
- Used Sqoop to import data from Oracle and MySQL into HDFS and HBase. Developed Sqoop jobs to import data from an Oracle database in Avro format and built Hive tables.
- Analyzed data using complex SQL queries and took part in the maintenance phase of the application in an Agile environment.
- Used R, Excel, and various machine learning techniques to forecast and analyze Earth’s temperature.
- Designed an application that manages insurance policies and their clients.
- Classified Yelp reviews into different categories using Python, NLP, and various data mining techniques.
- Scraped the IMDb movie list from the IMDb website using Python and built a classifier that predicts a movie's IMDb rating and revenue.
- Used PySpark and spark-shell (Scala) on a local Spark cluster to process big data results.
- Created jobs and transformations in Kettle Confidential to implement a dimensional star schema with dimension tables having role-playing dimensions, a bridge table, and a fact table with numeric flags on admission status. Implemented slowly changing dimensions (SCD Type 1 and Type 2) and surrogate keys.