- 7 years of experience in the IT industry as a software engineer, including 3 years of design and development using Hadoop big data ecosystem tools.
- Experience in using different Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, HBase, Kafka, and Crontab.
- Experience in developing ETL applications on large volumes of data using different tools: MapReduce, PySpark, Spark SQL, Hive, and Pig.
- Well versed in big data on AWS cloud services, i.e., EC2, S3, EMR, Lambda, and CloudWatch.
- Well versed in using the MapReduce programming model for analyzing data stored in HDFS, with experience in writing MapReduce code in Java as per business requirements.
- Experience in importing and exporting data between RDBMS and HDFS/Hive using Sqoop.
- Expert in creating Spark UDFs in Python to analyze data sets for complex aggregate requirements.
- Experience in developing Spark jobs to run entity resolution on PII data for the United States population.
- Responsible for maintaining, running, and optimizing Spark jobs, along with data validation and automation.
- Experience in performing extensive data analysis in Spark using complex DataFrame and RDD transformations, and creating external Hive tables on top of the aggregated data for downstream consumption.
- Used shell scripting to automate the triggering of Spark jobs.
- Used HBase for real-time, low-latency reads and writes for multiple applications.
- Well versed in developing complex SQL queries using Hive and Spark SQL.
- Experienced in preparing and executing unit test plans and unit test cases during software development.
- Strong understanding of object-oriented programming concepts and implementation.
- Experience in providing training and guidance to new team members on the project.
- Experience in detailed system design using use case analysis, functional analysis, and modeling with class, sequence, activity, and state diagrams using UML and Rational Rose.
- Used JIRA for creating tasks, logging work, and tracking issues.
- Very good experience in customer specification study, requirements gathering, system architectural design, and turning requirements into the final product/service.
- Experience in interacting with customers and working at client locations for real-time field testing of products and services.
- Ability to communicate and work effectively with associates at all levels within the organization.
- Strong background in mathematics, with very good analytical and problem-solving skills.
Programming: Python, Core Java and SQL.
Big Data Ecosystem: HDFS, YARN, MapReduce, Spark Core, Spark SQL, Impala, Hive, Pig, Kafka and Sqoop
AWS Stack: EC2, S3, EMR, Lambda and CloudWatch
ELK Stack: Elasticsearch, Logstash and Kibana
Scripting Languages: UNIX Shell scripting and Python scripting.
DBMS / RDBMS: Oracle 11g, SQL Server
Version Control: Git and Bitbucket
CI/CD Tool: Jenkins
Confidential, New Jersey
AWS Spark Developer
- Develop software to combine data from legacy databases and files into various data marts using PySpark.
- Worked extensively on AWS components such as Elastic MapReduce (EMR), DynamoDB, Lambda, and RDS.
- Worked with Parquet files; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
- Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Deployed a scalable Hadoop cluster on AWS using S3 as the underlying file system for Hadoop.
- Connected to a MySQL database through the Spark driver.
- Experienced in Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics, data wrangling, and Excel data extracts.
- Develop and run serverless Spark-based applications using AWS Lambda and PySpark to compute metrics for various business requirements.
- Develop Python, shell-script, and Spark-based applications using the PyCharm and Anaconda development environments.
- Use Git as the version control tool for maintaining software and Jenkins as the continuous integration and continuous deployment tool for deploying applications to production servers.
- Push AWS EMR and AWS Lambda logs to Elasticsearch using Logstash for log analysis.
- Create dashboards from documents in Elasticsearch, using Kibana as the visualization tool.
- Analyze failed jobs on the AWS EMR Spark cluster, identify the causes of failure, and improve job performance and minimize failures using PySpark.
- Use JIRA to update tasks/stories created by the Agile lead.
- Analyze, store, and process data captured from different sources using AWS cloud services such as S3, CloudFormation, Lambda, and EMR.
- Automate jobs using shell scripting and schedule those jobs to run at a specific time using crontab.
- Test and validate the developed applications in development and QA environments and deploy them to the production environment.
- Develop unit test cases for software before deploying it to production servers.
- Use Git with Jenkins to integrate the work done by all team members.
- Use the Scrum agile methodology, along with JIRA, for the project.
- Responsible for debugging and troubleshooting the running applications in production.
- Participated in writing scripts for test automation.
Environment: Spark Core, AWS S3, EMR, Lambda, CloudFormation, CloudWatch, Python, Hive, Presto, Crontab, Elasticsearch and Kibana.
- Extensively used Spark Core (RDDs, DataFrames, and Spark SQL) to develop multiple applications in both Python and Scala.
- Built multiple data pipelines using Pig scripts to process data for specific applications.
- Used different file formats such as Parquet, Avro, and ORC for storing and retrieving data in Hadoop.
- Used Spark Streaming to consume event-based data from Kafka and joined this data set with existing Hive table data to generate performance indicators for an application.
- Developed analytical queries on different tables using Spark SQL to find insights, and built data pipelines so that data scientists could consume this data when applying ML models.
- Tuned Spark performance by choosing optimal parallelism and the serialization format used while shuffling data, using broadcast variables, and optimizing joins, aggregations, and memory management.
- Wrote multiple custom Sqoop import scripts to load data from Oracle into HDFS directories and Hive tables.
- Used NiFi for automating and managing data flows between multiple systems.
- Used different compression techniques (Snappy and Gzip) when storing data in Hive tables to improve performance.
- Used Impala for faster querying to generate reports for a time-critical application.
- Used HBase for OLTP purposes in an application requiring high scalability on Hadoop.
- Wrote Sqoop export scripts to write data from HDFS into an Oracle database.
- Used Control-M to simplify and automate different batch workload applications.
- Worked closely with multiple data science and machine learning teams in building a data ecosystem to support AI.
- Developed a Java-based application to automate most of the manual work in onboarding a tenant to a multi-tenant environment, saving around 4 to 5 hours of manual work per tenant per person every day.
- Applied different job tuning techniques while processing data using the Hive and Spark frameworks to improve job performance.
Environment: Spark Core, Spark Streaming, Core Java, Python, Hive, Impala, HBase, Sqoop, Kerberos (security), LDAP, and Control-M.
- Analyzed the Hadoop cluster and different big data analytics tools, including Pig, Hive, and Sqoop.
- Implemented a POC to migrate MapReduce jobs to Spark RDD transformations using Scala.
- Developed an ETL framework using Spark, Pig and Hive.
- Developed Spark programs for batch and real-time processing to consume incoming streams of data from Kafka sources, transform them into DataFrames, and load those DataFrames into Hive and HDFS.
- Experience in developing SQL scripts using Spark for handling different data sets and verifying the performance of MapReduce jobs.
- Developed Spark programs using the Spark SQL library to perform analytics on data in Hive.
- Developed various Java UDFs for use in both Hive and Impala for ease of use across various requirements.
- Created multiple MapReduce jobs in Pig and Hive for data cleaning and preprocessing.
- Created Hive views/tables to provide a SQL-like interface.
- Loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Used Hive to analyze the partitioned data and compute various metrics for reporting.
- Converted Impala queries into Hive scripts that can be run directly from shell commands for better performance.
- Created shell scripts that can be scheduled using Oozie workflows and Oozie coordinators.
- Developed Oozie workflows to generate monthly report files automatically.
- Managed and reviewed Hadoop log files.
- Exported data from HDFS into RDBMS using Sqoop for report generation and visualization.
Environment: Hadoop, MapReduce, Sqoop, HDFS, Hive, Pig, Oozie, Java, Oracle 10g, MySQL, and Impala.