- 8+ years of experience in the IT industry. Started my career as a Business Development Associate and, given the strong demand for data-driven business, moved into Big Data technologies such as Hadoop and Spark. 4+ years of experience as a Big Data Engineer working with the Hadoop stack and Spark components in Scala.
- Extensive experience in architecting, designing, installing, configuring, and managing Apache Hadoop clusters, the Hadoop ecosystem, and Spark with Scala and Python.
- Very good knowledge of Hadoop distribution components such as HDFS, Hive, HBase, Sqoop, Kafka, and MapReduce.
- Experience with Pig, infrastructure support environments, and Apache Spark.
- Experienced with NoSQL databases such as HBase for batch processing.
- Experienced in managing Hadoop infrastructure with Cloudera Manager and AWS EMR.
- Excellent understanding of Hadoop architecture and its components, such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode, and DataNode (Hadoop 1.x), YARN concepts such as ResourceManager and NodeManager (Hadoop 2.x), and the Hadoop MapReduce programming paradigm.
- Experience with the functions of every Hadoop daemon, resource utilization, and dynamic tuning to keep the cluster available and efficient.
- Ability to think creatively to help design innovative solutions to complex analytical questions.
- Very good understanding of Hadoop's multiple data processing engines, such as interactive SQL, real-time streaming (using Flume/Kafka), and batch processing.
- Experience in analyzing log files for Hadoop and ecosystem services and finding root causes.
- Can handle files in multiple formats (JSON, text, XML, Avro, SequenceFile, and Parquet).
- Experience in setting up and managing the batch scheduler on Oozie.
- Experience in extracting data from RDBMS into HDFS using Sqoop ingestion.
- Experience in collecting logs from log collectors into HDFS using Flume.
- Experience handling NoSQL databases such as HBase and MongoDB.
- Experience in deploying Hadoop clusters in public and private cloud environments, using Amazon Web Services (AWS) EMR, S3, and RDS.
- Involved in balancing loads for optimal performance of the cluster.
- Prepared Hadoop clusters for development teams working on POCs.
- Experience in working with ZooKeeper for cluster coordination services.
- Strong experience working with databases including MySQL, Oracle 10g/9i, and SQL Server.
- Familiar with Data Analysis, Cleansing, Validation, Verification, Conversion, Migration and Mining.
- Excellent problem-solving, analytical, and interpersonal skills.
- Ability to handle multiple tasks and work independently as well as in a team.
Big Data / Hadoop Ecosystem Components: Apache Hadoop, HDFS, MapReduce, Hive, Pig, HBase, Impala, Sqoop, Oozie, Flume, ZooKeeper, Ambari, Kafka, Spark, Cassandra
Operating Systems: Linux (Red Hat, CentOS), UNIX, Windows
Programming Languages: Scala, Shell Scripting, Core Java, C, XML, SQL, HiveQL, and PL/SQL
Databases: SQL Server, DB2, Oracle 9i/10g/11g, Teradata, MySQL
NoSQL: HBase, Cassandra
Development Tools: IntelliJ, Eclipse
Hadoop / Spark with Scala Developer
Confidential, San Jose, CA
- Used the Cloudera distribution for the Hadoop ecosystem.
- Converted MapReduce jobs into Spark transformations and actions using Spark RDDs, and into Spark SQL using DataFrames and Datasets.
- Wrote Spark jobs in Scala to analyze customer data and sales history.
- Used Kafka to ingest data from many sources into HDFS.
- Involved in designing the rows and keys in HBase to store text, JSON, Parquet, and Avro format files, and in creating schemas for HBase tables for batch processing of historical data.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Good experience in Hive partitioning, bucketing, and collections, and in performing different types of joins on Hive tables.
- Wrote Spark applications in Scala to interact with the MySQL database using Spark SQLContext, and accessed Hive tables using HiveContext.
- Created Hive external tables to perform ETL on data that is generated on a daily basis.
- Created HBase tables for random lookups as required by the business logic.
- Performed transformations using Spark and loaded data into HBase tables.
- Validated ingested data to filter and cleanse it in Hive.
- Created Sqoop jobs to handle incremental loads from RDBMS into HDFS for subsequent Spark transformations and actions.
- Imported data as Parquet files for some use cases using Sqoop to improve processing speed for later analytics.
- Collected log data from web servers and from the NoSQL database Cassandra, and pushed it to HDFS using Flume.
- Good understanding of AWS S3 for storage and for cluster management.
Technical Environment: Hadoop, Hive, Flume, Shell Scripting, Java, Eclipse, HBase, Kafka, Spark, Python, Oozie, ZooKeeper, HQL/SQL, Oracle 11g.
Hadoop and Spark Developer
Confidential, San Ramon, CA
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Installed Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Analyzed Hadoop clusters and other analytical tools using Hive, Pig, and databases such as HBase and AWS RDS (Relational Database Service).
- Worked on analyzing/transforming the data with Hive and Pig.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark 1.5 for data aggregation and queries, writing data back into the OLTP system directly or through Sqoop.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Imported data from different sources such as HDFS, HBase, and Cassandra (NoSQL) into Spark RDDs using Spark Streaming.
- Developed a data pipeline using Kafka and Storm to store data in HDFS, using Spark Streaming for batch and near-real-time processing.
- Performed transformations, cleaning, and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Involved in importing real-time data into Hadoop using Kafka, and implemented an Oozie job for daily imports.
- Involved in loading data from the Linux file system to HDFS.
- In-depth knowledge of core Java for writing Hive and Pig UDFs.
Technical Environment: Cloudera CDH5, Spark, Hive, Pig, Oozie, Spark Streaming, Spark SQL, Sqoop.
Hadoop / Spark Developer
Confidential, San Jose, CA
- Experience in Hadoop 2.x with Spark and Scala.
- Developed managed, external, and partitioned tables as required.
- Ingested structured data into appropriate schemas and tables to support rules and analytics.
- Developed custom user-defined functions (UDFs) in Hive to transform large volumes of data according to business requirements.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved in loading data from edge node to HDFS using shell scripting.
- Developed Spark code using Scala and Spark SQL for faster testing and processing of data.
- Implemented scripts for loading data from UNIX file system to HDFS.
- Implemented a script to transmit sprint information from MySQL to Hive and HBase/Cassandra for batch processing.
- Automated workflows using shell scripts.
- Familiarity with Agile development methodology
- Good experience in Hive partitioning and bucketing, and in performing different types of joins on Hive tables.
- Developed Pig Scripts, Pig UDFs and Hive Scripts, Hive UDFs to load data files.
- Managed Hadoop jobs using the Oozie workflow scheduler for MapReduce, Hive, Pig, and Spark transformations and actions.
- Used Spark SQL to process large amounts of structured data.
- Experience in managing and reviewing Hadoop log files.
- Used the Oozie workflow engine to run multiple Hive and Pig jobs.
- Explored Spark for improving the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Technical Environment: Apache Hadoop, HDFS, Hive, Pig, Spark Transformations, HBase, Unix, Shell Scripting, Spark, Scala, Oozie, ZooKeeper, Cloudera CDH 5.x
Confidential, Atlanta, GA
- Collaborated with different teams on cluster planning, hardware requirements, and network equipment to implement a nine-node Cloudera Distributed Hadoop cluster.
- Involved in implementation and ongoing administration of Hadoop infrastructure.
- Monitored Hadoop cluster job performance and performed cluster capacity planning.
- Worked on analyzing the Hadoop stack and its different components, including MapReduce, Pig, Hive, HBase, Sqoop, and Flume.
- Implemented commissioning and decommissioning of DataNodes, killed unresponsive TaskTrackers, and dealt with blacklisted TaskTrackers.
- Resolved tickets submitted by users, and troubleshot and resolved documented errors.
- Involved in creating Hive tables and in loading and analyzing data using Hive queries.
- Copied data from one cluster to another using DistCp (distributed copy).
- Assisted in setting up the QA environment and updating configurations for implementing scripts with Pig, Hive, and Sqoop.
- Developed workflows in Oozie to automate loading data into HDFS and pre-processing it with Pig.
- Implemented a script to transmit information from Oracle to HBase and Cassandra using Sqoop.
- Assisted in exporting analyzed data to the NoSQL databases Cassandra and HBase using Sqoop.
- Implemented test scripts to support test-driven development and continuous integration.
- Worked on tuning the performance of Hive and Pig queries.
- Tuned the performance of Hadoop clusters and Hadoop MapReduce routines.
- Managed and reviewed Hadoop log files.
- Involved in HDFS maintenance and loading of structured and unstructured data from Linux machines; wrote MapReduce jobs using the Java API and Pig Latin.
- Monitored Hadoop cluster connectivity and security.
- Monitored multiple Hadoop cluster environments using Ganglia and Nagios.
- Implemented the Fair Scheduler on the JobTracker to share cluster resources among the MapReduce jobs submitted by users.
- Worked with application teams to install OS and Hadoop updates, patches, and version upgrades as required.
- This project followed the Agile methodology.
- Worked with the systems engineering team to propose and deploy new hardware and software environments required for Hadoop and to expand existing environments.
- Provided 24x7 on-call support on a rotation basis.
- Proficient in Java for writing MapReduce business logic and UDFs for Pig and Hive.
Technical Environment: Cloudera Distributed Hadoop (CDH4), HDFS, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Oozie, Impala, Kafka
Confidential, Irvine, CA
- Created SSIS packages for File Transfer from one location to the other using FTP task.
- Scheduled the ETL packages as jobs in SQL Server Agent.
- Created database objects (tables, indexes, views, stored procedures, and user-defined functions) according to the requirements of the project.
- Created SSRS reports using report parameters, drop-down parameters, and multi-valued parameters; debugged parameter issues; and built matrix reports and charts.
- Created user-defined functions and complex stored procedures to retrieve search results and produce well-formatted output for reports.
- Maintained the Development, QA, and UAT databases, including backups and restores from production.
- Developed and configured the reports using the data in data marts to be sent at regular intervals to the vendor either through email or in shared folders.
Technical Environment: MS SQL Server 2008 R2, SQL Server Reporting Services (SSRS), SQL Server Integration Services (SSIS), XML, MS Visio, SQL Report.