- Hadoop/Spark developer with 4+ years of experience in big data application development using the Hadoop and Spark frameworks.
- Excellent knowledge and understanding of Distributed Computing and Parallel processing frameworks.
- Experience with Hadoop ecosystem components such as HDFS (Hadoop Distributed File System), MapReduce, Hive, Sqoop, Oozie, and Pig.
- Hands-on experience with the Cloudera and Hortonworks distributions.
- Experience in data cleansing using Spark Map and Filter Functions.
- Experience in creating Hive Tables and loading the data from different file formats.
- Implemented partitioning and bucketing in Hive.
- Experience developing and debugging Hive queries.
- Good experience importing and exporting data to and from Hive and HDFS with Sqoop.
- Experience with Apache Spark, Spark SQL, Spark Streaming.
- Explored Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Hands-on experience in in-memory data processing with Apache Spark.
- Experience with file formats such as text files, SequenceFiles, JSON, Parquet, and ORC.
- Experienced in performing read and write operations on HDFS.
- Experience working with large data sets and making performance improvements.
- Strong knowledge of UNIX/Linux commands.
- Adequate knowledge of the Scala language.
- Adequate knowledge of Scrum, Agile and Waterfall methodologies.
- Highly motivated and committed to the highest levels of professionalism.
- Strong written and oral communication skills. Quick to learn and adapt to emerging technologies and paradigms.
- Developed Spark code using Scala and Spark SQL/Streaming for faster data processing.
- Proficient in using SQL to manipulate data, including query expressions, join statements, and subqueries.
- Good experience in Scala programming.
- Detail-oriented professional, ensuring the highest level of quality in reports and data analysis.
- Experience in processing the data using Hive QL and Pig Latin scripts for data Analytics.
- Experience using the Spark Streaming programming model with Kafka for real-time data streaming and faster data processing.
- Experience in designing and developing application in Spark using Scala.
- Experience using the Producer and Consumer APIs of Apache Kafka.
- Experienced in using Apache Kafka to collect logs and error messages across the cluster.
- Experience creating and driving large scale ETL pipelines.
- Good with version control systems such as Git.
- Provided batch processing solutions for large volumes of unstructured data using Spark.
- Extended core Hive functionality by writing UDFs for data analysis.
- Experience in data modeling, data warehousing, ETL processing, database programming, ETL testing, and DBMS.
- Advanced written and verbal communication skills.
- Detailed understanding of Software Development Life Cycle (SDLC) and sound knowledge of project implementation methodologies including Waterfall and Agile.
- Substantial experience working in a fast-paced Agile software development environment, applying Scrum principles in an ownership- and results-oriented culture.
HADOOP
Detailed knowledge of Hadoop components and MapReduce.
Wrote complex HQL queries using analytic functions such as RANK, DENSE_RANK, and CUME_DIST; developed complex join queries.
Developed optimized, complex Pig scripts using advanced operators such as COGROUP and nested FOREACH.
Imported data from RDBMS to HDFS.
Designed and developed applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 2.1 for data aggregation.
Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
Scheduled various jobs using Oozie scripts.
Spark with Scala
Hadoop and Spark Developer
Confidential, Framingham, MA
- Helped build a data lake by extracting customer data from various sources, including Excel files, databases, and server logs, into HDFS.
- Developed Sqoop jobs to import data from RDBMS to HDFS in Avro file format and created Hive tables on top of the imported data.
- Developed Spark jobs performing transformations and actions on RDDs using Scala.
- Used Spark SQL to access Hive tables from Spark for faster data processing.
- Converted queries written in Hive into Spark applications.
- Explored Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files.
- Worked with various HDFS file formats like Avro, Sequence File, Parquet and various compression formats.
- Good understanding of NoSQL databases such as HBase and MongoDB.
- Good understanding of Kafka architecture, including topics, consumers, producers, brokers, partitions, and clusters.
- Supported data analysts in running Pig and Hive queries.
Environment: Cloudera (CDH 5), HDFS, Spark, Hive, MapReduce, Hue, Sqoop, Flume, Oozie, PuTTY, Spark SQL, Scala, Linux, YARN, and Agile methodology.
Confidential, Columbus, OH
- Designed and developed data migration from legacy systems to the Hadoop environment.
- Imported data from RDBMS to HDFS and Hive, and performed incremental imports using Sqoop jobs in formats such as Avro, text, and Parquet.
- Performed various Sqoop operations such as eval, import, export, and job.
- Created external and managed tables in Hive, loaded data into them, and ran Hive queries, which execute internally as MapReduce jobs.
- Pre-processed log data in Pig Latin, parsing it with regular expressions.
- Processed XML and JSON files, partitioned the data, and implemented bucketing in Hive for performance optimization.
- Used the JSON, XML, and Avro SerDes packaged with Hive for serialization and deserialization when parsing file contents.
- Loaded data with complex data types such as Maps, Arrays and Structs into Hive tables.
- Built Pig Latin scripts to extract, transform, and load data into HDFS.
- Developed a workflow in Oozie to automate the task of loading the data into HDFS using Sqoop and processing it with Hive.
- Exported analyzed data to relational databases using Sqoop so the BI team could generate reports and visualizations.
- Used Cloudera Manager to pull metrics on cluster features such as JVM usage and running map and reduce tasks.
Environment: Hadoop (CDH 5), Hive, Pig, Sqoop, Flume, MapReduce, HDFS, Hue
Confidential, Jersey City, NJ
- Ingested data from relational databases into HDFS using Sqoop and processed it with Hive jobs.
- Created a data pipeline using Hive, Spark, and HBase to ingest, transform, and analyze customer behavioral data.
- Performed partitioning and bucketing in Hive to improve the performance.
- Worked extensively on importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
- Implemented Spark RDD transformations and actions to support business analysis.
- Involved in creating Spark DataFrames.
- Created Spark jobs to write the data to HBase tables.
- Involved in improving the performance of Spark jobs at the application level.
- Involved in tuning the Hive jobs.
- Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Used Spark for interactive queries, batch data processing, and integration with a NoSQL database for large volumes of data.
- Involved in writing the data to HBase tables using Hive and Pig.
- Used Hive scripts to compute aggregations and store them on HBase for low latency applications.
- Collected, aggregated, and moved log data from servers to HDFS using Flume.
- Attended weekly meetings with technical collaborators and actively participated in code review sessions with senior and junior developers, following Agile methodology.
- Worked with Network, Database, Application, QA and BI teams to ensure data quality and availability.
Environment: Sqoop, Flume, Hive, HBase, HDFS, YARN, Spark, Cloudera (CDH5), Zookeeper, Shell Scripting, Linux
- Extracted the data from MySQL into HDFS using Sqoop.
- Created and worked Sqoop jobs with incremental load to populate Hive External tables.
- Collected log data from web servers and loaded it into HDFS using Flume.
- Developed Hive scripts for end user / analyst requirements to perform ad hoc analysis.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Solved performance issues in Hive and Pig scripts by applying an understanding of joins, grouping, and aggregation.
- Developed Pig-Latin scripts to extract data from the web server output files to load into HDFS.
- Used the Hue interface to query data in the Hive and Pig editors.
- Experience in using Sequence File, Avro, Text File and Parquet formats.
- Developed Oozie workflow for scheduling and orchestrating the ETL process.
- Exported the processed data from HDFS to RDBMS using Sqoop Export.
Environment: Hadoop, Sqoop, Hive, Pig, MapReduce, HDFS