- 7 years of overall experience in the IT industry, including hands-on experience with Big Data ecosystem technologies such as MapReduce, Hive, HBase, Pig, Sqoop, Flume, Oozie and HDFS.
- Experienced in developing and implementing MapReduce programs using Hadoop to meet Big Data requirements.
- Hands-on experience with Big Data ingestion tools such as Flume and Sqoop.
- Worked extensively with Sqoop to import data from and export data to RDBMS.
- Worked with assorted flavors of Hadoop distributions such as Cloudera and Hortonworks.
- Experienced with the CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters.
- Experience working with different data formats such as XML, JSON, Parquet and Avro, as well as databases.
- Hands-on NoSQL database experience with Apache HBase and MongoDB.
- Knowledge of job workflow scheduling and coordination tools such as Oozie and Zookeeper.
- Good knowledge of Apache Spark and Scala.
- Good knowledge of Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, MRv1 and MRv2 (YARN).
- Experience with scripting and programming languages such as Linux/Unix shell, Python 2.7 and Scala.
- Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB and ElastiCache (Memcached & Redis).
- Involved in configuring and developing Hadoop environments on the AWS cloud using services such as Lambda, S3, EC2 and EMR (Elastic MapReduce).
- Good hands-on expertise with AWS storage services such as S3, EFS and Storage Gateway, and partial familiarity with Snowball.
- Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline whose output can be written to the Glue Catalog and queried from Athena.
- Experienced in using Spark RDD transformations and actions to implement business analysis.
- Used Flume to collect, aggregate and store the web log data onto HDFS.
- Used Zookeeper for various types of centralized configurations.
- Extensive knowledge of and experience with real-time data streaming technologies such as Kafka, Storm and Spark Streaming.
- Experience in analyzing data using HiveQL, Pig Latin and custom MapReduce programs in Java.
- Experience in collecting, aggregating and moving data from various sources using Apache Flume and Kafka.
- Knowledge of handling Hive queries using Spark SQL, integrated with a Spark environment implemented in Scala.
- Hands-on experience with message brokers such as Apache Kafka.
- Experience in loading data using Hive and writing scripts for data transformations using Hive and Pig.
- Experience in creating Impala views on Hive tables for fast access to data.
- Developed UDFs and used them in Hive queries.
- Developed Pig Latin scripts for handling business transformations.
- Comprehensive knowledge of the Software Development Life Cycle (SDLC), with a thorough understanding of its phases, such as requirements analysis, design, development and testing.
Big Data Skillset Frameworks & Environments: Cloudera CDH, Hortonworks HDP, Hadoop 1.0, Hadoop 2.0, HDFS, MapReduce, Pig, Hive, Impala, HBase, Data Lake, Cassandra, MongoDB, Mahout, Sqoop, Oozie, Zookeeper, Flume, Splunk, Spark, Storm, Kafka, YARN, Falcon, Avro.
Amazon Web Services (AWS): Elastic MapReduce, EC2 Instances, Airflow, Amazon S3, Amazon Redshift, DynamoDB, ElastiCache, Storage Gateway, Route 53 (DNS), Encryption, Virtual Private Cloud, SQS, SNS, SWF, Athena, Glue, CloudWatch Logs, IAM Roles, Ganglia, EMRFS, s3cmd (batches), Ruby EMR Utility (monitoring), Boto, Amazon Cognito, AWS API Gateway, AWS Lambda, Kinesis (Streams, Firehose & Analytics).
Programming: Java, Scala, Python.
IDE Tools: Eclipse, NetBeans, Spring Tool Suite, Hue (Cloudera specific).
Databases & Application Servers: Oracle (8i, 9i, 10g, 11i), MySQL, DB2 8.x/9.x, Cassandra, HBase, MongoDB, MS Access, Microsoft SQL Server 2000, PostgreSQL.
Other Tools: PuTTY, WinSCP, FileZilla, Data Lake, Talend, Tableau, GitHub, SVN, CVS.
Big Data Developer
Confidential, Cincinnati, OH
- Working in an Agile team to deliver and support required business objectives using Scala, Spark, Hive, shell scripting and other related technologies.
- Worked on Hadoop ecosystem components such as HDFS, Hive, Pig and Impala, as well as Spark and Scala.
- Created Hive external tables, loaded data into them and queried the data using HQL.
- In-depth knowledge of Hadoop architecture components such as HDFS, Job Tracker, Task Tracker, Name Node and Data Node.
- Developed Sqoop jobs for moving data between relational databases and HDFS.
- Good knowledge of AWS services such as EMR and EC2, which provide fast and efficient processing of data.
- Converted ETL pipelines to a Scala code base and enabled data access to and from S3.
- Wrote AWS Lambda functions in Scala with cross-functional dependencies, generating custom libraries for deploying the Lambda functions in the cloud.
- Implemented Spark applications using Scala for faster processing of large amounts of data.
- Worked on Spark SQL: created data frames by loading data from Hive tables, prepared the data and stored it in AWS S3.
- Created RDDs and performed Spark transformations and actions for cleansing the data.
- Experience in creating Data Frames.
- Created data frames using Spark on R and performed data cleansing operations.
- Used different types of packages and functions in R.
- Collaborated with the data science team on developing machine learning algorithms in R.
- Experienced with Spark Context, Spark SQL, data frames and RDDs.
- Developed a script that takes Hive SQL, or any SQL that can be run on the Spark engine, and stores the results in a text file or an output database table, as specified by the arguments passed.
- The script can also be called from shell scripts, R, etc.
- Used try/catch blocks for exception handling in Scala.
- Used Spark SQL and Spark data frames extensively to cleanse and integrate imported data and derive more meaningful insights.
- Experience in Unix Shell Scripting.
- Performed data visualization in R using ggplot.
- Worked on different file formats such as Parquet, text, Avro, JSON and ORC for Hive querying and processing.
- Optimized Hive queries using joins, partitioning and bucketing.
- Built Spark applications using IntelliJ and Maven.
Environment: Cloudera distribution, HDFS, Hive, Impala, Pig, Sqoop, Spark, Scala, Unix/Linux, SQL.
Role: Hadoop Developer
Confidential, Chicago, IL
- Working in an Agile team to deliver and support required business objectives using Java, Python, shell scripting and other related technologies to acquire, ingest, transform and publish data to and from the Hadoop ecosystem.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Used Flume to collect, aggregate and store the web log data onto HDFS.
- Used Scala to store streaming data in HDFS and to implement Spark jobs for faster processing of data.
- Integrated user data from Cassandra into HDFS and integrated Cassandra with Storm for real-time user attribute lookup.
- Performed daily Sqoop incremental imports scheduled through Oozie.
- Installed and configured Hadoop MapReduce and HDFS; developed MapReduce jobs in Java for data cleaning and pre-processing.
- Parsed data from S3 using Python API calls through Amazon API Gateway, generating the batch source for processing.
- Scheduled batch jobs through AWS Batch to run data processing jobs using the Apache Spark APIs in Scala.
- Good familiarity with AWS services such as DynamoDB, Redshift, Simple Storage Service (S3) and Amazon Elasticsearch Service.
- Created Pig scripts to transform the HDFS data and loaded the data into Hive external tables.
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using connectors, Spark Core, Spark SQL, Sqoop, Pig, Hive and NoSQL databases.
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.
- Optimized Hive queries using map-side joins, dynamic partitions and bucketing.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Used Spark RDD transformations and actions to implement business analysis.
- Worked on Spark Streaming and Spark SQL to run sophisticated applications on Hadoop.
- Created and maintained technical documentation for launching Hadoop clusters and executing Pig scripts.
Environment: Hadoop, CDH 4, CDH 5, Scala, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, Flume, Spark SQL, Spark Streaming, UNIX Shell Scripting and Cassandra.
Confidential, Chicago, IL
- Worked on Spark SQL to handle structured data in Hive.
- Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
- Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generated visualizations using Tableau.
- Analyzed substantial data sets by running Hive queries and Pig scripts.
- Wrote Hive UDFs to sort struct fields and return complex data types.
- Worked in AWS environment for development and deployment of custom Hadoop applications.
- Created files and tuned SQL queries in Hive using Hue.
- Involved in collecting and aggregating large amounts of log data using Storm and staging data in HDFS for further analysis.
- Created Hive external tables using the Accumulo connector.
- Stored the processed results in the data warehouse and maintained the data using Hive.
- Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
- Experience in writing Pig user-defined functions and Hive UDFs.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as MapReduce, Hive, Pig and Sqoop.
- Used Sqoop to import data from RDBMS into HDFS to improve data reliability.
- Implemented a POC for using Apache Impala for data processing on top of Hive.
- Responsible for managing and reviewing Hadoop log files. Designed and developed data management system using MySQL.
- Created Oozie workflow and coordinator jobs to kick off jobs on schedule for timely data availability.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
Environment: HDFS, MapReduce, Storm, Hive, Pig, Sqoop, MongoDB, Apache Spark, Scala, Oozie Scheduler, AWS, Tableau, Java, UNIX Shell scripts, Hue, Git, Maven.
Hadoop Developer/SQL Developer
- Developed Pig scripts using operators such as LOAD, STORE, DUMP, FILTER, GROUP, COGROUP, DISTINCT, UNION, LIMIT and SPLIT to extract data from data files and load it into HDFS.
- Planned and implemented system upgrades, including hardware, operating system and periodic patch upgrades.
- Installed Ubuntu 10.10 and 12.04 and installed Hadoop ecosystem components such as Pig, Hive, Sqoop, HBase, Flume and Kafka on top of them.
- Developed Pig UDFs to pre-process the data for analysis.
- Developed Hive queries for the analysts.
- Configured big data workflows to run on top of Hadoop using Oozie; these workflows comprise heterogeneous jobs such as Pig, Hive and Sqoop, with cluster coordination services provided through Zookeeper.
- Applied appropriate support packages/patches to maintain system integrity.
- Prepared numerous SQL*Loader control files to load data from flat files into Oracle, applying various performance tuning techniques.
- Wrote procedures to receive collections of objects from the front end and store them in DB tables, and to send collections of objects back to the front end.
- Worked on stored procedures, views, tables, triggers, functions, query tuning, optimization and indexing.
- Created and implemented new data models for all new modules and enhancements.
- Ensured data integrity and successful backup and restoration ability.
- Actively involved in conducting walkthroughs, demos and presentations to the client on new functionality.
Environment: Pig, Oozie, MapReduce, Hive, Zookeeper, SQL, MySQL, Unix.