Big Data/Spark Developer Resume
WI
PROFESSIONAL SUMMARY:
- Over 8 years of professional IT experience, including 4 years of Big Data ecosystem experience in the ingestion, querying, processing and analysis of big data.
- Experience using Hadoop ecosystem components such as Spark, Scala, MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive, Sqoop, Pig, Flume, Kafka, Storm and Cassandra on Cloudera and Hortonworks distributions.
- Well-versed in Big Data implementations across business domains such as Banking, Healthcare, Insurance, Entertainment and Travel.
- Hands-on experience using various Hadoop distributions (Cloudera, Hortonworks, MapR).
- Experience analyzing data using HiveQL, Pig Latin, HBase and custom MapReduce programs in Java.
- In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming and Spark MLlib.
- Experience implementing real-time event processing and analytics with Spark Streaming on top of messaging systems such as Kafka.
- Experienced with SparkContext, Spark SQL, DataFrames, pair RDDs and YARN, including implementing Spark SQL queries.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs in Scala and Python.
- Developed Spark Streaming applications that ingest data with NiFi, write the stream into Kafka and analyze it with Spark.
- Worked on Spark scripts in Scala to find the most trending products on a day-wise and week-wise basis.
- Good working experience using Sqoop to move data between RDBMSs and HDFS in both directions.
- Expertise in workflow scheduling and monitoring with Oozie and in cluster coordination with ZooKeeper.
- Configured Spark Streaming to consume ongoing data from Kafka and persist the stream to HDFS (a minimal sketch of this pattern follows this summary).
- Hands-on with ad-hoc queries on structured data using HiveQL, applying partitioning, bucketing and join techniques in Hive for faster data access.
- Extensively worked on Hive, Pig and Sqoop for sourcing and transformations.
- Hands-on expertise designing row keys and schemas for NoSQL databases such as MongoDB, HBase, Cassandra and DynamoDB (AWS).
- Used Spark for interactive queries, batch processing and integration with NoSQL databases holding large volumes of data.
- Deployed Kafka and integrated it with Oracle databases.
- Good experience in developing multiple Kafka Producers and Consumers as per business requirements.
- Wrote Storm topologies to accept events from Kafka producers and emit them into Cassandra.
- Developed quality code adhering to Scala coding standards and best practices.
- Experience working with Solr to build search over unstructured data in HDFS.
- Used Solr indexing to enable searches on non-primary-key columns in Cassandra keyspaces.
- Implemented CRUD operations using CQL on top of Cassandra.
- Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
- Hands-on exposure to Amazon Web Services, the AWS Command Line Interface and AWS Data Pipeline.
- Good knowledge of Cloudera distributions and of AWS services including Amazon S3, EC2 and EMR.
- Used the ELK stack (Elasticsearch, Logstash and Kibana) to implement name-search patterns for a customer.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze logs produced by the Spark cluster.
- Experienced in writing Spark applications in Scala and Python (PySpark).
- Experience in cloud platforms like AWS, Azure.
- Worked on HBase to perform real time analytics and experienced in CQL to extract data from Cassandra tables.
- Well-versed with the web technologies such as HTML, CSS and JavaScript.
- Good understanding of Data Mining and Machine Learning techniques.
- Experienced in performing real-time analytics on NoSQL databases like HBase and Cassandra.
- Hands-on experience with ad-hoc queries, indexing, replication, load balancing and aggregation in MongoDB.
- Competent with configuration and automation tools such as Chef, Puppet and Ansible; configured and administered CI tools like Jenkins, Hudson and Bamboo for automated builds.
- Strong experience with data warehousing ETL concepts using Informatica and Talend.
- Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions, and streamed the data in real time using Spark with Kafka for faster processing.
- Designed solutions for various system components using Microsoft Azure.
- Wrote MapReduce programs in Java for data extraction, transformation and aggregation across various file formats including XML, JSON, CSV, Avro, Parquet, ORC, SequenceFile and plain text.
- Strong Knowledge in Informatica ETL Tool, Data warehousing and Business intelligence.
- Good level of experience in Core Java and JEE technologies such as JDBC, Servlets and JSP.
- Expert in developing web applications using Struts, Hibernate and Spring Frameworks.
- Hands on Experience in writing SQL and PL/SQL queries.
- Good understanding of and experience with software development methodologies like Agile and Waterfall, and performed unit, regression, agile, white-box and black-box testing.
- Used project management tools like JIRA for tracking issues and code-related bugs, GitHub for code reviews, and version control tools such as CVS, Git and SVN.
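Below is a minimal sketch of the Kafka-to-HDFS Spark Streaming pattern referenced above, written against the spark-streaming-kafka-0-10 API; the broker address, topic name, consumer group, batch interval and output path are illustrative assumptions rather than values from a specific engagement.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToHdfs")
    val ssc  = new StreamingContext(conf, Seconds(30)) // 30-second micro-batches (assumed)

    // Kafka consumer settings; broker address and group id are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "hdfs-ingest",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream over an assumed "events" topic.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Persist each non-empty micro-batch of record values to HDFS, one directory per batch.
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty())
        rdd.saveAsTextFile(s"hdfs:///data/raw/events/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Packaged with sbt, a job like this would typically be launched with spark-submit on YARN.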
TECHNICAL SKILLS:
Hadoop/Big Data: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Spark, Impala, ZooKeeper, Kafka, AWS, Solr
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB
Languages: C, C++, Java, PL/SQL, Pig Latin, HiveQL, UNIX shell scripts, Scala, Python
Java/J2EE Technologies: Applets, Swing, JDBC, JSON, JSTL, JMS, JavaScript, JSP, Servlets, EJB, JSF, jQuery
Frameworks: MVC, Struts, Spring, Hibernate
ETL: IBM WebSphere/Oracle, Talend, Informatica
Operating Systems: UNIX, Red Hat Linux, Ubuntu Linux, Mac and Windows XP/Vista/7/8
Web Technologies: HTML, DHTML, XML, AJAX, WSDL, SOAP
Web/Application servers: Apache Tomcat, WebLogic, JBoss
Databases: Oracle, DB2, SQL Server, MySQL
Tools and IDE: Eclipse, NetBeans, JDeveloper, DB Visualizer.
Version control: SVN, CVS and Git
Network Protocols: TCP/IP, UDP, HTTP, DNS
EXPERIENCE:
Confidential -WI
Big Data/Spark Developer
Responsibilities:
- Involved in importing and exporting data between the Hadoop data lake and relational systems like Oracle and MySQL using Sqoop.
- Worked on creating Kafka topics and partitions and on writing custom partitioner classes.
- Experienced in writing Spark Applications in Scala and Python (PySpark).
- Imported Avro files using Apache Kafka and performed analytics on them using Spark in Scala.
- Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them and storing the results in Cassandra.
- Experience in building Real-time Data Pipelines with Kafka Connect and Spark Streaming.
- Configured, deployed and maintained multi-node Dev and Test Kafka Clusters.
- Processed and transferred the data from Kafka into HDFS through Spark Streaming APIs.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building the common learner data model, which consumes data from Kafka in near real time and persists it into Cassandra.
- Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping (see the sketch following this list).
- Developed scripts that load data into Spark DataFrames and perform in-memory computation to generate the output response.
- Involved in migrating MapReduce jobs to RDDs (Resilient Distributed Datasets) and creating Spark jobs for better performance.
- Used sbt to build Scala-based Spark projects and executed them with spark-submit.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
- Built Cassandra nodes on AWS and set up the Cassandra cluster using Ansible automation.
- Worked extensively with Amazon Web Services (AWS) cloud services such as EC2, S3, EMR, EBS, RDS and VPC.
- Used highly available AWS environments to launch applications in different regions and implemented CloudFront with AWS Lambda to reduce latency.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation and queries, and wrote data back into the RDBMS through Sqoop.
- Involved in executing various Oozie workflows and automating parallel Hadoop MapReduce jobs.
- Developed Oozie Bundles to Schedule Pig, Sqoop and Hive jobs to create data pipelines.
- Experience in using ORC, Avro, Parquet, RCFile and JSON file formats and developed UDFs using Hive and Pig.
- Developed Hive queries to do analysis of the data and to generate the end reports to be used by business users.
- Wrote extensive Hive queries to transform the data used by downstream models.
- Used Spark and Spark SQL with the Scala API to read Parquet data and create tables in Hive.
- Involved in loading data from UNIX file system to HDFS and responsible for writing generic scripts in UNIX.
- Involved in a SolrCloud implementation to provide real-time search capabilities over a repository holding terabytes of data.
- Used Sqoop to efficiently transfer data between databases and HDFS, and used Flume to stream log data from servers and sensors.
- Experience writing and tuning extensive Impala queries and creating views for ad-hoc and business processing.
- Designed solutions for various system components using Microsoft Azure.
- Configured Azure cloud services for endpoint deployment.
- Wrote an extensive generic data quality check framework, built on Impala, for use by the application.
- Involved in restarting failed Hadoop jobs in production environment.
- Generated various marketing reports using Tableau with Hadoop as a source for data.
- Involved in complete project life cycle starting from design discussion to production deployment.
- Used Hadoop FS actions to move the data from upstream location to local data lake locations.
- Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
- Involved in ingesting data into Cassandra and consuming the ingested data from Cassandra to Hadoop Data Lake.
- Involved in the process of Cassandra data modelling and building efficient data structures.
- Wrote Storm topologies to emit data into Cassandra.
- Understanding of Kerberos authentication in Oozie workflow for Hive and Cassandra.
- Developed complex Talend ETL jobs to migrate the data from flat files to database.
- Developed MapReduce programs as part of predictive analytical model development.
- Used Jira for ticket tracking and workflow.
- Extensively used Git as the code repository and VersionOne for managing the day-to-day agile development process and keeping track of issues and blockers.
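A minimal sketch of the Spark-to-Cassandra load path referenced above, using the DataStax spark-cassandra-connector; the keyspace, table, column layout, input path and host name are hypothetical.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Case class mirroring a hypothetical learner.events table
// (learner_id text, event_ts bigint, event_type text).
case class LearnerEvent(learnerId: String, eventTs: Long, eventType: String)

object LearnerModelLoad {
  def main(args: Array[String]): Unit = {
    // The connector locates the cluster through spark.cassandra.connection.host.
    val conf = new SparkConf()
      .setAppName("LearnerModelLoad")
      .set("spark.cassandra.connection.host", "cassandra-node1")
    val sc = new SparkContext(conf)

    // Parse staged CSV records; the connector's default mapper translates
    // camelCase fields (learnerId) to snake_case columns (learner_id).
    val events = sc.textFile("hdfs:///data/staged/learner_events")
      .map(_.split(","))
      .collect { case Array(id, ts, kind) => LearnerEvent(id, ts.toLong, kind) }

    // Bulk write into the learner.events table.
    events.saveToCassandra("learner", "events")

    sc.stop()
  }
}
```

Reading the table back into Spark for analysis uses the same connector via sc.cassandraTable("learner", "events"), while ad-hoc searching, sorting and grouping were done directly in CQL.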
Environment: Hadoop, Hive, Impala, Oracle, Spark, Python, Pig, Sqoop, Oozie, MapReduce, Git, HDFS, Cassandra, Apache Kafka, Storm, Linux, Tableau, Solr, Confluence, Jenkins, Jira
Confidential -NY
Big Data/Spark Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved in transforming data from legacy tables to HDFS and Hive tables using Sqoop.
- This project downloads data generated by sensors tracking patients' body activities; the data is collected into HDFS from online aggregators by Flume.
- Used Flume to collect, aggregate and store the web log data from different sources like web servers, mobile and network devices and pushed into HDFS.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala.
- Expertise in implementing Spark and Spark SQL in Scala for faster testing and processing of data, with responsibility for managing data from different sources.
- Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala.
- Created end-to-end Spark-Solr applications in Scala to perform data cleansing, validation, transformation and summarization activities according to the requirements.
- Used Flume to stream through the log data from various sources.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
- Experience with AWS, spinning up EMR clusters to process large volumes of data stored in S3 and push the results to HDFS.
- Implemented Spark SQL to access Hive tables from Spark for faster data processing (see the sketch after this list).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Enhanced Hive query performance using Tez for customer attribution datasets.
- Worked on migrating Pig scripts and MapReduce programs to the Spark DataFrames API and Spark SQL to improve performance. Involved in moving log files generated from various sources into HDFS through Flume for further processing, and processed the files using PiggyBank UDFs.
- Integrated HiveServer2 with Tableau using the Hortonworks Hive ODBC driver for auto-generation of Hive queries by non-technical business users.
- Used Pig, Hive and MapReduce to analyze health insurance data, extracting data sets with meaningful information such as medicines, diseases, symptoms, opinions and geographic region details. Used Pig in three distinct workloads: pipelines, iterative processing and research.
- Managed and reviewed Hadoop and MongoDB log files.
- Developed an Oozie workflow to orchestrate a series of Pig scripts that cleanse data, for example removing personal information or merging many small files into a handful of large compressed files, using Pig pipelines in the data preparation stage.
- Worked on the NoSQL database MongoDB for storing images and URIs.
- Performed data analysis on MongoDB data using Hive external tables, and exported the analyzed data using Sqoop to generate reports for the BI team.
- Validating the source file for Data Integrity and Data Quality by reading header and trailer information and column validations.
- Worked with three data storage layers: raw, intermediate and publish.
- Used the Avro file format compressed with Snappy for intermediate tables for faster processing of data.
- Used the Parquet file format for published tables and created views on those tables.
- Created Sentry policy files to give business users access to the required databases and tables through Impala in the dev, test and prod environments.
- Implemented test scripts to support test driven development and continuous integration.
- Good understanding of ETL tools and how they can be applied in a Big Data environment.
- Processed web server logs by developing multi-hop Flume agents using the Avro sink, and loaded the logs into MongoDB for further analysis.
- Worked on NiFi to automate the data movement between different Hadoop systems.
- Designed and implemented a custom NiFi processor that reacts to and processes data for the data pipeline.
- Responsible for generating actionable insights from complex data to drive real business results for various application teams and worked in Agile Methodology projects extensively.
- Worked with the Scrum team to deliver agreed user stories on time in every sprint.
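A minimal Spark 2.x-style sketch of the Hive-to-Parquet publish step referenced above; the database, table, column names and output path are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PublishLayerJob {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark SQL resolve tables registered in the Hive metastore.
    val spark = SparkSession.builder()
      .appName("PublishLayerJob")
      .enableHiveSupport()
      .getOrCreate()

    // Pull the previous day's rows from an assumed Snappy/Avro intermediate table.
    val readings = spark.sql(
      """SELECT patient_id, device_id, reading_ts, heart_rate
        |FROM intermediate_db.sensor_readings
        |WHERE reading_dt = date_sub(current_date(), 1)""".stripMargin)

    // Publish as Parquet and expose a view for downstream reporting queries.
    readings.write.mode("overwrite")
      .parquet("hdfs:///data/publish/sensor_readings_daily")

    readings.createOrReplaceTempView("sensor_readings_daily_vw")
    spark.sql(
      "SELECT device_id, avg(heart_rate) AS avg_hr " +
      "FROM sensor_readings_daily_vw GROUP BY device_id").show()

    spark.stop()
  }
}
```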
Environment: Hadoop, MapReduce, Hortonworks, Spark, NiFi, MongoDB, Impala, AWS, Sqoop, HDFS, Hive, Pig, Oozie, Flume, Oracle, UNIX Shell Scripting.
Confidential - KS
Big data Developer
Responsibilities:
- Developed solutions to process data into HDFS (Hadoop Distributed File System), process within Hadoop and emit the summary results from Hadoop to downstream systems.
- Worked on Sqoop extensively to ingest data from various source systems into HDFS.
- Involved in transforming data from Mainframe tables to HDFS, and HBase tables using Sqoop.
- Developed simple to complex MapReduce jobs using Hive, Pig and Python.
- Analyzed the Hadoop cluster and different big data analytics tools including MapReduce, Pig and Hive.
- Created partitions and buckets based on state to enable further processing with bucket-based Hive joins.
- Responsible for developing data pipeline using Flume, Sqoop and Pig to extract the data from weblogs and store in HDFS.
- Handled different types of joins in Hive such as inner join, left outer join, right outer join and full outer join.
- Involved in developing Pig UDFs for needed functionality such as custom Pig loaders.
- Involved in writing optimized Pig scripts and in developing and testing Pig Latin scripts.
- Written Pig scripts for sorting, joining, and grouping data.
- Developed an HBase data model on top of HDFS data to perform real-time analytics using the Java API (see the sketch after this list).
- Used Impala to analyze data ingested into HBase and compute various metrics for reporting on the dashboard.
- Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV etc.
- Implemented counters on HBase data to count total records on different tables.
- Developed different kinds of custom filters and applied pre-defined filters on HBase data using the client API.
- Integrated the NoSQL database HBase with MapReduce to move bulk data into HBase.
- Integrated SQL Server, DB2 and TD sources into the Hadoop cluster and analyzed the data through Hive-HBase integration.
- Created HBase tables to store variable data formats of data coming from different portfolios.
- Used Oozie to schedule workflows that perform shell and Hive actions.
- Worked on a stand-alone as well as a distributed Hadoop application.
- Experienced with working on Avro Data files using Avro Serialization system.
- Kerberos security was implemented to safeguard the cluster.
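A minimal sketch of the HBase write-and-scan pattern referenced above. The original work used the HBase Java client API; the same HBase 1.x-style client calls are shown here in Scala, and the table name, row-key format, column family and filter values are hypothetical.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object PortfolioHBaseExample {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("portfolio_events"))

    // Write one cell: composite row key, column family "d", qualifier "state".
    val put = new Put(Bytes.toBytes("acct-001#20160101"))
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("state"), Bytes.toBytes("KS"))
    table.put(put)

    // Scan with a pre-defined filter: keep only rows whose d:state equals "KS".
    val scan = new Scan()
    scan.setFilter(new SingleColumnValueFilter(
      Bytes.toBytes("d"), Bytes.toBytes("state"),
      CompareFilter.CompareOp.EQUAL, Bytes.toBytes("KS")))

    val scanner = table.getScanner(scan)
    scanner.asScala.take(10).foreach(r => println(Bytes.toString(r.getRow)))

    scanner.close()
    table.close()
    connection.close()
  }
}
```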
Environment: Hadoop, HDFS, Pig, Hive, MapReduce, Sqoop, Oozie, Zookeeper, HBase, Java, Eclipse, SQL Server, Shell Scripting.
Confidential - IL
Hadoop/Java Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, loaded data into HDFS and extracted the data from MySQL into HDFS using Sqoop.
- Developed simple to complex MapReduce jobs in Java that were implemented using Hive and Pig.
- Implemented business logic by writing UDFs in Java and used various UDFs from other sources.
- Involved in scheduling Oozie workflow engine to run multiple Hive and Pig jobs.
- Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed formats.
- Developed several new MapReduce programs to analyze and transform the data to uncover insights into the customer usage patterns.
- Created Hive tables, loaded the data and performed data manipulations using Hive queries in MapReduce execution mode.
- Extensively involved in Design phase and delivered Design documents.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, manage and review data backups, manage and review Hadoop log files.
- Managed and reviewed Hadoop log files, and deployed and maintained the Hadoop cluster.
- Supported MapReduce programs running on the cluster.
- Experienced in loading and transforming large sets of structured, semi-structured and unstructured data.
- Experienced with different scripting languages such as Python and shell scripts.
- Performed unit testing using the JUnit framework to support code reviews.
- Worked with GitHub.
- Involved in building the modules in a Linux environment with Ant scripts.
Environment: Java, HDFS, Map Reduce, Sqoop, Pig, Hive, Junit, Linux, Oozie, Eclipse, MySQL, SQL Server, Python.
Confidential
Java/J2EE Developer
Responsibilities:
- Involved in Java, J2EE, struts, web services and Hibernate in a fast-paced development environment.
- Followed agile methodology, interacted directly with the client on the features, implemented optimal solutions, and tailor application to customer needs.
- Involved in design and implementation of web tier using Servlets and JSP.
- Used Apache POI for Excel files reading.
- Developed the user interface using JSP and JavaScript to view all online trading transactions.
- Designed and developed Data Access Objects (DAO) to access the database.
- Used the DAO Factory and Value Object design patterns to organize and integrate the Java objects.
- Coded Java Server Pages for the Dynamic front end content that use Servlets and EJBs.
- Coded HTML pages using CSS for static content generation with JavaScript for validations.
- Used JDBC API to connect to the database and carry out database operations.
- Used JSP and JSTL Tag Libraries for developing User Interface components.
- Performing Code Reviews.
- Performed unit testing, system testing and integration testing.
- Involved in building and deployment of application in Linux environment.
Environment: Java, J2EE, JDBC, Struts, SQL, Hibernate, Eclipse, Apache POI, CSS
Confidential
Java/J2EE Developer
Responsibilities:
- Responsible for understanding the scope of the project and requirement gathering.
- Developed the web tier using JSP, Struts MVC to show account details and summary.
- Created and maintained the configuration of the Spring Application Framework.
- Implemented various design patterns - Singleton, Business Delegate, Value Object and Spring DAO.
- Used Spring JDBC to write some DAO classes which interact with the database to access account information.
- Mapped business objects to database using Hibernate.
- Involved in writing Spring configuration XML files containing bean declarations and declarations of other dependent objects.
- Used Tomcat web server for development purpose.
- Involved in creation of Test Cases for Unit Testing.
- Used Oracle as the database and Toad for query execution, and was involved in writing SQL scripts and PL/SQL code for procedures and functions.
- Used CVS, Perforce as configuration management tool for code versioning and release.
- Developed the application using Eclipse and used Maven as the build and deploy tool.
- Used Log4j to print debug, warning and info logging to the server console.
Environment: Java, J2EE, JSON, LINUX, XML, XSL, CSS, Java Script, Eclipse, Maven, CVS, Tomcat