
Big Data Developer/Spark Developer Resume

Malvern, PA

SUMMARY

  • 4+ years of experience in software development, building enterprise and web-based applications using Java and J2EE technologies.
  • 4 years of experience as a Big Data Engineer with a good understanding of the Hadoop framework and Big Data tools such as MapReduce, HDFS, YARN/MRv2, Pig, Hive, Sqoop, Kafka, Flume, HBase, Apache Spark, and Oozie, and of technologies for implementing data analytics.
  • Hadoop developer: Excellent hands-on experience using Hadoop tools like HDFS, Hive, Pig, Apache Spark, Apache Sqoop, Flume, Oozie, Apache Kafka, Apache Storm, YARN, Impala, ZooKeeper, and Hue. Experience in analyzing data using HiveQL, Pig Latin, and MapReduce programs.
  • Experienced in ingesting data into HDFS from various relational databases like MySQL, Oracle, DB2, Teradata, and Postgres using Sqoop.
  • Experienced in importing real-time streaming logs and aggregating the data into HDFS using Kafka and Flume.
  • Excellent knowledge of creating real-time data streaming solutions using Apache Storm and Apache Spark Streaming, and of building Spark applications using Scala.
  • Well versed in various Hadoop distributions, including Cloudera (CDH) and Hortonworks (HDP), with knowledge of the MapR distribution.
  • Experienced in creating various tables in Hive, including managed tables and external tables, and loading data into Hive from HDFS.
  • Extended Hive and Pig core functionality using custom User Defined Functions (UDF), User Defined Table-Generating Functions (UDTF), and User Defined Aggregating Functions (UDAF).
  • Implemented Pig scripts for analyzing large data sets in HDFS by performing various transformations.
  • Experience in analyzing data using HiveQL, Pig Latin, and HBase.
  • Capable of processing large sets of structured, semi-structured and unstructured data and supporting system application architecture.
  • Experience working on NoSQL Databases like HBase, Cassandra and MongoDB.
  • Experience in Python, Scala, shell scripting, and SparkR.
  • Experience in creating various Oozie jobs to manage processing workflows with actions that run Hadoop MapReduce and Pig jobs.
  • Experience in using AWS Cloud components S3, EC2, EMR, IAM, RDS, Elastic Beanstalk, and DynamoDB.
  • Knowledge of the Kerberos network authentication protocol.
  • Experience in using various file formats, including XML, JSON, CSV, text, sequence files, Avro, ORC, and Parquet, with compression techniques like Snappy, gzip, and LZO.
  • Experience in testing MapReduce programs using MRUnit, JUnit, and EasyMock.
  • Knowledge of machine learning algorithms and predictive analysis using Spark MLlib and Mahout, and of leveraging them using SparkR.
  • Experience with ETL methodology for supporting data extraction, transformation, and loading using Hadoop.
  • Worked with data visualization tools like Tableau and integrated data using the ETL tool Talend.
  • Worked on various relational databases like Teradata, Postgres, MySQL, Oracle 10g, and DB2.
  • Hands-on development experience with Java, shell scripting, and RDBMS, including writing complex SQL queries, PL/SQL, views, stored procedures, triggers, etc.
  • Diverse experience in utilizing Java tools in business, web, and client-server environments, including Java Platform, J2EE, EJB, JSP, Java Servlets, JUnit, and Java Database Connectivity (JDBC) technologies, and application servers like WebSphere and WebLogic.
  • Experience with various build tools like Ant, Maven, Gradle, and SBT.
  • Knowledge of creating dashboards and reports using reporting tools like Tableau and QlikView.
  • Development experience with the IDEs Eclipse, NetBeans, and IntelliJ, and with the version control systems SVN, Git, and CVS.
  • Good experience with different software methodologies, including Waterfall and Agile.
  • Knowledge of writing YARN applications.
  • Familiarity with popular frameworks like Struts, Hibernate, Spring MVC, and AJAX, and with web services using XML, HTML, and SOAP.
  • Passionate about working on the most cutting-edge Big Data technologies.
  • Ability to adapt to evolving technology, strong sense of responsibility and accomplishment.
  • Willing to update my knowledge and learn new skills according to business requirements.

TECHNICAL SKILLS

Hadoop Technologies: HDFS, MapReduce, Hive, Impala, Pig, Sqoop, Flume, Oozie, ZooKeeper, Ambari, Hue, Apache Spark, Storm, Kafka, YARN, NiFi, Ganglia, Tez

Operating System: Windows, Unix, Linux

Languages: Java, J2EE, SQL, PL/SQL, Shell Script, Python, Scala, R

Testing tools: JUnit, MRUnit, EasyMock

Front-End: HTML, JSTL, DHTML, JavaScript, CSS, XML, XSL, XSLT

SQL Databases: MySQL, Oracle 11g/10g/9i, SQL Server, Teradata, Postgres

NoSQL Databases: HBase, Cassandra, MongoDB, Neo4j, Redshift

File System: HDFS

Reporting Tools: Tableau, QlikView

IDE Tools: Eclipse, NetBeans, Spring Tool Suite, IntelliJ

Application Server: IBM WebSphere, WebLogic, JBoss

Version control: SVN, Git, CVS

Build Tools: Maven, Gradle, Ant, SBT

ETL Tools: Talend, DataStage, Informatica

Messaging & Web Services Technology: SOAP, WSDL, REST, UDDI, XML, SOA, JAX-RPC, IBM WebSphere MQ v5.3, JMS

PROFESSIONAL EXPERIENCE

Confidential - Malvern, PA

Big Data Developer/Spark Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • Involved in importing data from various data sources into HDFS using Sqoop, applying various transformations using Hive and Apache Spark, and then loading the data into Hive tables or AWS S3 buckets.
  • Involved in moving data from various DB2 tables to AWS S3 buckets using Sqoop.
  • Configured Splunk alerts to capture log files during execution and stored them in an S3 bucket location while the cluster was running.
  • Worked on Hive/SQL queries and performed Spark transformations using Spark RDDs and Python (PySpark).
  • Wrote Oozie scripts to schedule and automate jobs on the EMR cluster.
  • Used Bitbucket as the code repository and integrated it with Bamboo for continuous integration.
  • Experienced in bringing up EMR clusters and deploying code to the cluster from S3 buckets.
  • Migrated the existing on-prem code to an AWS EMR cluster.
  • Experienced in using NoMachine and PuTTY to SSH into the EMR cluster and run spark-submit.
  • Developed Apache Spark applications using Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Experience in developing various Spark Streaming jobs using Python (PySpark) and Scala.
  • Developed Spark code using PySpark to apply various transformations and actions for faster data processing.
  • Working knowledge of Apache Spark Streaming, which enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Used Spark stream processing with Scala to load data in memory, implemented RDD transformations, and performed actions.
  • Used various Python libraries with PySpark to create DataFrames and store them in Hive.
  • Created Sqoop jobs and Hive queries for data ingestion from relational databases to compare with historical data.
  • Experience in working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
  • Experienced in migrating HiveQL queries to Impala to minimize query response time.
  • Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment.
  • Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets.
  • Knowledge of creating user-defined functions (UDFs) in Hive.
  • Worked with different file formats like text file, Avro, and Parquet for Hive querying and processing based on business logic.
  • Knowledge of pulling data from AWS S3 buckets into the data lake, building Hive tables on top of it, and creating DataFrames in Spark for further analysis.
  • Worked with sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
  • Involved in test-driven development, writing unit and integration test cases for the code.
  • Implemented Hive UDFs for business logic and performed extensive data validation using Hive.
  • Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API.
  • Developed code to generate various DataFrames based on business requirements and created temporary tables in Hive.
  • Used AWS CloudWatch to monitor performance-environment instances for operational and performance metrics during load testing.
  • Experience in writing build scripts with Maven and in continuous integration with Bamboo.
  • Used JIRA for creating user stories and created branches in Bitbucket repositories based on each story.
  • Knowledge of Sonar for validating code and following coding standards.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.

Environment: Cloudera, MapReduce, HDFS, Scala, Hive, Sqoop, Spark, Oozie, Linux, Maven, Control-M, Splunk, NoMachine, PuTTY, HBase, Python, AWS EMR Cluster, EC2 instances, S3 Buckets, STS, Bamboo, Bitbucket.

Confidential

Big Data Developer/Spark Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • Experienced in migrating HiveQL queries to Impala to minimize query response time.
  • Imported various log files into HDFS using Apache Kafka and performed data analytics using Apache Spark.
  • Involved in importing data from various data sources into HDFS using Sqoop, applying various transformations using Hive and Apache Spark, and then loading the data into Hive tables.
  • Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that reads data from Kafka in near real time and persists it to Cassandra.
  • Collected logs from the physical machines and the OpenStack controller and integrated them into HDFS using Kafka.
  • Experience in developing Kafka consumers and Kafka producers by extending the low-level and high-level consumer and producer APIs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python (PySpark).
  • Involved in running analytics workloads and long-running services on the Apache Mesos cluster manager.
  • Developed Apache Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Experience in developing with various Spark Streaming APIs using Python (PySpark).
  • Developed Spark code using PySpark to apply various transformations and actions for faster data processing.
  • Working knowledge of the Apache Spark Streaming API, which enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Used Spark stream processing to load data in memory, implemented RDD transformations, and performed actions.
  • Developed various Kafka producers and consumers for importing various transaction logs.
  • Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Involved in integrating HBase with PySpark to import data into HBase and performed CRUD operations on HBase.
  • Used various HBase commands to generate different datasets as per requirements and provided access to the data when required using grant and revoke.
  • Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
  • Developed various Spark applications using PySpark and NumPy.
  • Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
  • Experience in working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
  • Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment.
  • Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets and AWS Redshift.
  • Loaded and transformed large sets of structured data into the Hadoop cluster using Talend Big Data Studio.
  • Worked with different file formats like text file, Avro, and ORC for Hive querying and processing based on business logic.
  • Knowledge of pulling data from AWS S3 buckets into the data lake, building Hive tables on top of it, and creating DataFrames in Spark for further analysis.
  • Involved in writing custom Talend jobs to ingest, enrich, and distribute data in the Hadoop ecosystem.
  • Worked with sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
  • Developed efficient ETL processes, including workflows and job scheduling, to move data from source to target according to requirements using Talend.
  • Implemented Hive and Pig UDFs for business logic and performed extensive data validation using Hive.
  • Implemented daily cron jobs that automate parallel tasks of loading data into HDFS using Oozie coordinator jobs.
  • Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API.
  • Used Pig as an ETL tool to perform transformations, event joins, filters, and some pre-aggregations.
  • Used visualization tools such as Power View for Excel and Tableau for visualizing data and generating reports.
  • Knowledge of machine learning algorithms like clustering, classification, and regression.
  • Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
  • Implemented various machine learning algorithms based on business logic using Spark MLlib.
  • Used Talend Open Studio for data integration and for data migration from various locations across the business.
  • Integrated data quality plans as a part of ETL processes using Talend.
  • Experience in writing build scripts with Maven and in continuous integration with Jenkins.
  • Used JIRA for bug tracking and Git for version control.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.

Environment: Cloudera, MapReduce, HDFS, Pig, Scala, Hive, Sqoop, Spark, Kafka, Oozie, Java, Linux, Maven, HBase, ZooKeeper, Kerberos, Tableau, Python, Talend Open Studio, AWS.
