- 7 years of experience in the design, development, production, and maintenance of software applications using Python and Java in big data environments.
- 4+ years of experience with Spark, using SparkContext, Spark SQL, the Spark APIs, DataFrames, and pair RDDs; developed Spark code in Python.
- Strong experience with Big Data tools and technologies such as Spark, Kafka, Hive, Hadoop, HDFS, HBase, Cassandra, Sqoop, Oozie, Pig, ZooKeeper, YARN, and Flume.
- Experience in writing Spark applications using spark-shell and PySpark. Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Worked extensively with Spark Streaming and Spark SQL: consumed real-time data from Kafka, applied transformations to it, and queried it using Spark SQL.
- Extensive knowledge of Spark transformations and actions, DStream operations, DataFrame functions and actions, and pair RDD transformations for key-value datasets.
- Developed multiple POCs using Spark, deployed them on the cluster, and compared the performance of Spark with Hive.
- Developed big data solutions with Cloudera Distribution, Hortonworks, Amazon Web Services (AWS), and Google Cloud Platform (GCP).
- Experienced in developing custom ETL and data ingestion pipelines in Python using Spark SQL and PySpark.
- Expertise in importing and exporting data using stream processing platforms like Flume and Kafka.
- Experience in writing MapReduce jobs in the Hadoop ecosystem, including Hive and Pig.
- Experience in data analysis, design, development, implementation, and testing of data warehousing solutions, including data conversion, extraction, transformation, and loading (ETL).
- Worked with the Oozie workflow engine to schedule time-based jobs that perform multiple actions.
- Expertise in performing real-time analytics on distributed NoSQL databases such as Cassandra, HBase, and MongoDB.
- Procedural knowledge in cleansing and analyzing data using HiveQL, Pig, and custom MapReduce programs.
- Experience with different file formats such as Avro, Parquet, ORC, and JSON.
- Built data visualization dashboards using Tableau and Business Objects.
- Experience using the build tools sbt and Maven.
- Experience in test-driven development and Software Development Life Cycle (SDLC) methodologies such as Agile and Scrum.
Big Data Technologies: Spark, HDFS, MapReduce, Hive, Sqoop, YARN, Oozie, Kafka, Flume, Zookeeper
Programming Languages: C, Java, Python
Databases: Cassandra, MongoDB, Spark SQL, MS SQL Server, Oracle
IDEs: IntelliJ, Eclipse, NetBeans, PyCharm
Machine Learning Tools: NumPy, pandas, TensorFlow
Tools: YARN, Mesos
Confidential, New York City, NY
Big Data Developer
- Used Kafka to consume raw data coming from network devices.
- Performed aggregations and transformations with Structured Streaming through the Python API to shape the data according to requirements.
- Used Spark Streaming to process data and load it into Hive.
- Developed HiveQL queries to support a reporting tool based on historical data.
- Used Python and PySpark to implement a scoring algorithm developed by data scientists and applied it to score transaction authenticity.
- Created partitioned tables to store the daily transaction data efficiently.
- Used Autosys to schedule batch jobs.
- Wrote shell scripts for jobs that run Hive queries on the daily incoming transaction data.
- Managed the streaming jobs so that they consume data, write it on time, and run at specific intervals to keep feeding data to the next job cycles.
- Built test cases for production-related challenges and long-running jobs.
Environment: Hadoop, Hive, Spark, Kafka, Teradata, PostgreSQL, Shell Scripting, Autosys (Job Scheduling).
Confidential, Boca Raton, FL
Spark Developer / Hadoop Developer
- Used Spark Streaming to consume topics from Kafka, the distributed messaging source, and periodically push batches of data to Spark for real-time processing.
- Implemented Spark RDD transformations and actions to support business analysis.
- Extensively used Spark DataFrame APIs to process terabytes of data.
- Migrated HiveQL queries on structured data to Spark SQL to improve performance.
- Integrated Spark Streaming with Kafka.
- Created AWS Data pipelines to move data in and out of S3 buckets.
- Gained expertise in transporting and processing real-time event streams using Kafka.
- Imported and exported data into HDFS and Hive using Sqoop.
- Wrote MapReduce jobs to cleanse data.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze customer behavioral data.
- Used Hive to analyze the partitioned, bucketed data and compute various metrics for reporting, and implemented business logic with Hive UDFs to support ad-hoc queries on structured data.
- Configured Oozie workflows to automate data flow, preprocessing, and cleaning tasks using Hadoop actions.
- Converted existing Hive tables to the ORC file format and queried them via the Vertica ORC file reader for better performance and concurrency.
- Hands-on experience writing Linux/Unix shell scripts.
Environment: Spark, PySpark, Hive, HDFS, Kafka, Oozie, Spark Streaming, Spark SQL, Cloudera, Cloudera Manager, Python.
Confidential, Dunwoody, GA
- Involved in Data Warehousing and worked extensively on RDBMS (Oracle, MS SQL), ETL - Informatica and Oracle DRM (Data Relationship Management), and Reporting (BIDS - Visual Studio, and Oracle Reports).
- Provided solutions using ETL tools such as SSIS.
- Provided knowledge transfer for new resources.
- Developed Oozie workflow for scheduling and orchestrating the ETL process.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java/Python for data cleaning and preprocessing.
- Worked on live Hadoop clusters running Cloudera, with 200 nodes in pre-production and 400 nodes in production.
- Worked with terabytes of semi-structured data.
- Gave extensive presentations about the Hadoop ecosystem, best practices, and data architecture in Hadoop.
- Used AWS S3 and Redshift storage for data analysis.
- Experience in running Hadoop Streaming jobs to process terabytes of XML data.
- Supported MapReduce programs running on the cluster.
- Loaded data from the UNIX file system into HDFS.
- Executed queries using Hive and developed MapReduce jobs to analyze data.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Developed a custom file system plugin for Hadoop so it can access files on Data Platform. This plugin allows Hadoop MapReduce programs, HBase, Pig, and Hive to work unmodified and access files directly.
- Extracted feeds from social media sites such as Facebook and Twitter using Python scripts.
Environment: ETL, Hadoop, Hive, HBase, MapReduce, HDFS, Pig, Cassandra, Storm, Flume, IBM DataStage 8.1, Oracle 11g / 10g, PL/SQL, SQL*Plus, Linux, UNIX Shell Scripting, Java, Python
- Developed Python scripts to parse XML and JSON reports and load the information in a database.
- Used advanced features like pickle/unpickle in Python to share data across the applications.
- Developed and executed MySQL queries from Python using the Python-MySQL connector.
- Designed and built REST APIs to add scalability to the application.
- Developed views and templates with Python and Django’s view controller and templating language to create a user-friendly website interface.
- Created Python scripts for data access and analysis to help with system monitoring and reporting.