- 7 years of technical IT experience in all phases of Software Development Life Cycle (SDLC) with skills in data analysis, design, development, testing and deployment of software systems.
- 4+ years of industry experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie and AWS, as well as Spark integration with Cassandra, Avro, Solr and ZooKeeper.
- Hands-on experience with Hadoop ecosystem components such as Spark, SQL, Hive, Pig, Sqoop, Flume, ZooKeeper/Kafka, HBase and MapReduce.
- Experience in converting SQL queries into Spark transformations using Spark RDDs and Scala, and in performing map-side joins on RDDs.
- Experience in importing and exporting data with Sqoop between HDFS and relational (RDBMS) and non-relational database systems.
- Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS.
- Experience in development, support and maintenance of ETL (Extract, Transform and Load) processes using Talend Integration Suite.
- Proficiency in multiple databases, including MongoDB, Cassandra, MySQL, Oracle and MS SQL Server.
- Excellent programming skills at a higher level of abstraction using Scala, Java and Python.
- Experience in Hive partitioning and bucketing, performing joins on Hive tables and implementing Hive SerDes.
- Worked on different file formats such as delimited files, Avro, JSON and Parquet.
- Experience developing Kafka producers and consumers for streaming millions of events per second.
- Hands-on experience in designing and developing Spark applications using Scala and PySpark to compare the performance of Spark with Hive and SQL/Oracle.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Solid experience in working with CSV, text, sequential, Avro, Parquet, ORC and JSON data formats.
- Made extensive use of Teradata features such as BTEQ, FastLoad, MultiLoad, SQL Assistant, and DDL and DML commands; very good understanding of Teradata UPI and NUPI, secondary indexes and join indexes.
- Experience in writing complex SQL queries, creating reports and dashboards.
- Scheduled jobs using the Airflow scheduler.
- Ability to tune Big Data solutions to improve performance and end-user experience.
- Working experience building RESTful web services and RESTful APIs.
- Experience in creating and executing PySpark scripts and Spark Scala JARs using the IntelliJ IDE.
- Managed multiple tasks and worked under tight deadlines in a fast-paced environment.
- Worked on multiple stages of Software Development Life Cycle including Development, Component Integration, Performance Testing, Deployment and Support Maintenance.
- Excellent analytical and communication skills, which help in understanding business logic and building good relationships between stakeholders and team members.
- Strong communication skills, analytic skills, good team player and quick learner, organized and self-motivated.
Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, ZooKeeper, MapR, Amazon Web Services (AWS), EMR
Cloud Technologies: AWS, Azure, Google Cloud Platform (GCP)
IDEs: IntelliJ, Eclipse, Spyder, Jupyter
Operating Systems: Windows, Linux
Programming languages: Python, Scala, Linux shell scripts, PL/SQL, Java
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS SQL Server, HBase
Java & J2EE Technologies: Core Java, Servlets, JSP, JDBC, Java Beans
Business Tools: Web Intelligence, Crystal Reports, Dashboard Design, Tableau
Confidential, Boise, Idaho
- Analysed large and critical datasets using HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Zookeeper and Spark.
- Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
- Performed data transformations in Hive and used partitions and buckets for performance improvements.
- Developed Spark scripts and UDFs using both Spark Confidential and Spark SQL queries for data aggregation and querying, and wrote data back into RDBMS through Sqoop.
- Designed and developed a Data Lake using Hadoop for processing raw and processed claims via Hive and Informatica.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Ingested data into HDFS using Sqoop and scheduled an incremental load to HDFS.
- Used Hive to analyse data ingested into HBase through Hive-HBase integration and computed various metrics for dashboard reporting.
- Involved in creating Hive tables from a wide range of data formats such as CSV, text, sequential, Avro, Parquet, ORC, JSON and custom formats using SerDes.
- Experience in testing Big Data Hadoop (HDFS, Hive, Sqoop and Flume), Master Data Management (MDM) and Tableau Reports.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Implemented a build framework for new projects using Jenkins and Maven.
- Created continuous integration jobs for build, testing and deployment using Jenkins and Maven.
- Loaded real-time data into NoSQL databases such as Cassandra.
- Developed Pig scripts for transforming data, making extensive use of event joins, filtering and pre-aggregations.
- Performed data scrubbing and processing with Apache NiFi, and used it for workflow automation and coordination.
- Used Sqoop to import data into HDFS and Hive from Oracle database.
- Used Talend for Big Data integration using Spark and Hadoop.
- Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.
- Built Azure Data Warehouse Table Data sets for Power BI Reports.
- Imported data from sources such as HDFS/HBase into Spark RDDs.
- Good experience in developing Hive DDLs to create, alter and drop Hive tables.
- Worked on BI reporting with AtScale OLAP for Big Data.
- Implemented Kafka for streaming data, and filtered and processed the data.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala and Hive to perform streaming ETL and apply machine learning.
- Developed a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS.
- Developed Shell scripts for scheduling and automating the job flow.
- Developed a workflow using NiFi to automate the tasks of loading data into HDFS.
- Performed load balancing of ETL processes and database performance tuning for ETL processing tools.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
Environment: Spark, YARN, Hive, Pig, Scala, Mahout, NiFi, Python, Hadoop, Azure, DynamoDB, Kibana, NoSQL, Sqoop, MySQL.
Confidential, St. Louis, MO
Big Data Developer
- Experience in Big Data analytics and design in the Hadoop ecosystem using MapReduce programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala and Kafka.
- Built an Oozie pipeline that performs several actions: moving files, using Sqoop to pull data from the source Teradata or SQL databases into Hive staging tables, performing aggregations per business requirements and loading the results into the main tables.
- Used fork actions wherever processing could run in parallel, to reduce data latency.
- Scheduled the Oozie coordinator to automate the ETL.
- Applied Hive tuning techniques such as partitioning, bucketing and memory optimization.
- Hands-on experience with Sqoop import, export and eval.
- Worked on different file formats such as Parquet, ORC, JSON and text files.
- Migrated MapReduce programs to Spark transformations using Spark and Scala; the work was initially done in Python (PySpark).
- Used DistCp while loading historical data into Hive.
- Used Spark SQL to load data and created SchemaRDDs on top of it to load into Hive tables; handled structured data using Spark SQL.
- Involved in converting HQL queries into Spark transformations using Spark RDDs with Python and Scala.
- Wrote a Pig script that reads data from one HDFS path, aggregates it and loads it into another path, from which it is later populated into another domain table. Converted this script into a JAR and passed it as a parameter in the Oozie script.
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL activity.
- Built an ETL that uses a Spark JAR to execute the business analytical model.
- Hands-on experience with Git Bash commands: git pull to fetch code from the source repository for development per the requirements, git add to stage files, git commit after the code build, and git push to the pre-production environment for code review; later used screwdriver.yaml to build the code and generate artifacts that are released to production.
- Performed data validation by comparing record counts between source and destination.
- Knowledge of Apache Hue.
- Good hands-on experience with Git and GitHub.
- Created a feature node on GitHub
- Worked on the data support team, handling bug fixes, schedule changes, memory tuning, schema changes and loading of historical data.
- Implemented checkpoints such as Hive count checks, Sqoop record checks, done-file creation checks, done-file checks and touch-file lookups.
- Documented the workflow action process and bug fixes.
- Communicated and collaborated with the team to clear blockers, and maintained good communication with stakeholders.
- Worked on both Agile and Kanban methodologies
Environment: Hadoop, MapReduce, HDFS, Hive, Cassandra, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, AWS, GitHub, Talend Big Data Integration, Solr, Impala.
Confidential, Dania Beach, FL
- Working as a Big Data Engineer on Hortonworks distribution. Responsible for Data Ingestion, Data Cleansing, Data Standardization and Data Transformation.
- Working with Hadoop 2.x version and Spark 2.x (Python and Scala).
- Worked on creating Hive managed and external tables based on the requirement.
- Implemented Partitioning and Bucketing on Hive tables for better performance.
- Used Spark-SQL to process the data and to run on Spark engine.
- Worked on Spark for improving performance and optimization of existing algorithms in Hadoop using Spark-SQL and Scala.
- Worked on various file formats such as Parquet, JSON and ORC.
- Developed an end-to-end ETL pipeline using Spark-SQL and Scala on the Spark engine.
- Worked with external vendors or partners to onboard external data into Target S3 buckets.
- Worked on Oozie to develop workflows to automate ETL data pipeline.
- Developed Spark jobs to clean data obtained from various feeds to make it suitable for ingestion into Hive tables for analysis.
- Imported data from various sources into Spark RDD for analysis.
- Configured Oozie workflows to run multiple Hive jobs that run independently based on time and data availability.
- Utilized Hive tables and HQL queries for daily and weekly reports. Worked on complex data types in Hive like Structs and Maps.
- Created Cassandra tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Supported code/design analysis, strategy development and project planning.
- Created reports for the BI team using Sqoop to export data into HDFS and Hive.
- Assisted with data capacity planning and node forecasting.
- Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.
- Designed ETL processes using Informatica to load data from flat files, Oracle and Excel files into the target Oracle data warehouse database.
Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, ZooKeeper.
- Designed use case diagrams, class diagrams and sequence diagrams using Microsoft Visio tool.
- Extensively used Spring IoC, Hibernate and Core Java features such as Exceptions and Collections.
- Deployed the applications on IBM WebSphere Application Server.
- Built and deployed WAR files on WebSphere Application Server.
- Implemented Patterns such as Singleton, Factory, Facade, Prototype, Decorator, Business Delegate and MVC.
- Involved in frequent meetings with clients to gather business requirements and convert them into technical specifications for the development team.
- Involved in the bug fixing and enhancement phase; used the FindBugs tool.
- Used SVN for version control.
- Developed the application in the Eclipse IDE.
- Used the Struts framework to build an MVC architecture and separate presentation from business logic.
- Involved in rewriting middle-tier on WebLogic application server.
- Developed the administrative UI using AngularJS and Ext JS.
- Wrote stored procedures in PL/SQL.