Hadoop/Spark Developer Resume
Sterling, VA
SUMMARY
- 7+ years of IT experience across a variety of industries, including hands-on experience in Big Data (Hadoop) and Java development.
- Expertise with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Kafka, YARN, Oozie, and ZooKeeper.
- Excellent knowledge of Hadoop ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle (see the illustrative sketch at the end of this summary).
- Strong experience in writing Python applications using libraries such as Pandas, NumPy, SciPy, and Matplotlib.
- Good knowledge of machine learning in Python, including concepts such as data preprocessing, regression, classification, and appropriate model selection techniques.
- Good exposure to the Agile software development process.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Strong experience with Hadoop distributions such as Cloudera, MapR, Microsoft Azure HDInsight, and Hortonworks.
- Experience in implementing OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
- Good understanding of NoSQL databases and hands-on experience writing applications on HBase, Cassandra, and MongoDB.
- Experienced in writing complex MapReduce programs that work with file formats such as Text, SequenceFile, XML, Parquet, and Avro.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Experience in migrating data between HDFS and relational database systems using Sqoop.
- Extensive experience importing and exporting data using streaming platforms such as Flume and Kafka.
- Good understanding of Teradata, Zeppelin and SOLR.
- Strong experience across the complete project life cycle (design, development, testing, and implementation) of client-server and web applications.
- Excellent Java development skills using J2EE, J2SE, Servlets, JSP, EJB, JDBC, SOAP and RESTful web services.
- Strong experience with data warehousing and ETL concepts using Informatica PowerCenter, OLAP, OLTP, and AutoSys.
- Experience in database design using PL/SQL to write Stored Procedures, Functions, Triggers, and strong experience in writing complex queries for Oracle.
- Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage mechanism.
- Strong experience in Object-Oriented Design, Analysis, Development, Testing and Maintenance.
- Excellent implementation knowledge of Enterprise/Web/Client Server using Java, J2EE.
- Experienced in using agile approaches, including Extreme Programming, Test-Driven Development and Agile Scrum.
- Worked in large and small teams on systems requirements, design, and development.
- Key participant in all phases of the software development life cycle: analysis, design, development, integration, implementation, debugging, and testing of software applications in a client-server environment.
- Object-oriented development experience using IDEs such as Eclipse and IntelliJ, and repositories such as SVN and Git.
- Experience using build tools such as Ant and Maven.
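Illustrative sketch of the Spark-versus-Hive/Oracle comparison work mentioned above: a minimal Scala harness that times a Spark SQL aggregation over a Hive-registered table. The table and column names (sales, region, amount) are hypothetical, and the matching Hive and Oracle timings would be gathered by running the equivalent HiveQL/SQL directly on those engines, so this only shows the Spark side of such a benchmark.

```scala
// Sketch only: Spark side of a Spark-vs-Hive/Oracle comparison. Names are hypothetical.
import org.apache.spark.sql.SparkSession

object SparkSideBenchmark {
  // Tiny timing helper for a single query run.
  def time[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  def main(args: Array[String]): Unit = {
    // Hive support lets Spark SQL read tables registered in the Hive metastore.
    val spark = SparkSession.builder()
      .appName("spark-vs-hive-benchmark")
      .enableHiveSupport()
      .getOrCreate()

    // The same aggregation would be run as HiveQL on Hive and as SQL on Oracle;
    // those timings are collected separately on those engines.
    time("Spark SQL aggregation") {
      spark.sql(
        """SELECT region, SUM(amount) AS total
          |FROM sales
          |GROUP BY region""".stripMargin)
        .collect()
    }

    spark.stop()
  }
}
```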
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, Spark, Storm, Kafka, HCatalog, Impala, Datameer.
Distributed Platforms: Cloudera, Hortonworks, MapR, Azure HDInsight, and Apache Hadoop
Languages: C, C++, Java, Scala, SQL, PL/SQL, Linux shell scripts, HL7.
NoSQL Databases: MongoDB, Cassandra, HBase
Java Technologies: Servlets, JavaBeans, JSP, JDBC
XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB
Methodology: Agile/Scrum, Rational Unified Process and Waterfall
Monitoring tools: Ganglia, Nagios.
Hadoop/Big Data Technologies: HDFS, MapReduce, Spark SQL, Sqoop, Flume, Pig, Hive, Oozie, Impala, ZooKeeper, Cloudera Manager, MongoDB, HBase (NoSQL)
Version Control: GitHub, Bitbucket, CVS, SVN, Clear Case, Visual Source Safe
Build & Deployment Tools: Maven, ANT, Hudson, Jenkins
Database: Oracle, MS SQL Server 2005, MySQL, Teradata
PROFESSIONAL EXPERIENCE
Confidential, Sterling, VA
Hadoop/Spark Developer
Responsibilities:
- In-depth understanding of Hadoop architecture and its components, such as HDFS, NameNode, DataNode, master node, ResourceManager, ApplicationMaster, and MapReduce concepts.
- Managed a fully distributed Hadoop cluster as an additional responsibility; trained to take over Hadoop administrator duties, including cluster management, upgrades, and installation of tools in the Hadoop ecosystem.
- Installed and configured ZooKeeper to coordinate and monitor cluster resources.
- Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS storage.
- Created data pipelines in the cloud using Azure Data Factory.
- Implemented test scripts to support test driven development and continuous integration.
- Worked with and learned a great deal from Amazon Web Services (AWS): EC2, S3, RDS, and ELK.
- Consumed data from Kafka using Apache Spark (see the sketch at the end of this list).
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Involved in loading data from the Linux file system to HDFS.
- Imported and exported data into HDFS and Hive using Sqoop.
- Worked on creating HBase tables to load large sets of semi-structured data coming from various sources.
- Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregate Functions (UDAFs) written in Python.
- Experience in importing data from S3 to HIVE using Sqoop and Kafka.
- Good Experience working with Amazon AWS for accessing Hadoop cluster components.
- Responsible for loading data files from external sources such as Oracle and MySQL into the data lake.
- Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
- Actively involved in code reviews and bug fixing to improve performance.
- Involved in developing, building, testing, and deploying to the Hadoop cluster in distributed mode.
- Created Linux shell scripts to automate the daily ingestion of IVR data.
- Automated the History and Purge Process.
- Monitored production jobs daily using Control-M.
- Created HBase tables to store various data formats of incoming data from different portfolios.
- Created Pig Latin scripts to sort, group, join and filter the enterprise wise data.
- Developed the verification and control process for daily load.
- Provided daily production support to monitor and troubleshoot Hadoop/Hive jobs.
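Sketch for the Kafka consumption work above: a minimal Spark Structured Streaming job in Scala that reads a Kafka topic and lands it on HDFS as Parquet. Broker addresses, the ivr-events topic, and all paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector is on the classpath.

```scala
// Sketch only: consume Kafka events with Spark and persist them to HDFS as Parquet.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hdfs")
      .getOrCreate()

    // Read the Kafka topic as a streaming DataFrame.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("subscribe", "ivr-events")
      .option("startingOffsets", "latest")
      .load()
      .select(col("key").cast("string"), col("value").cast("string"), col("timestamp"))

    // Persist micro-batches to HDFS in Parquet; the checkpoint tracks Kafka offsets.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/raw/ivr")
      .option("checkpointLocation", "hdfs:///checkpoints/ivr")
      .start()

    query.awaitTermination()
  }
}
```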
Environment: Hadoop, HDFS, Pig, Apache Hive, Sqoop, Kafka, Apache Spark, Scala, AWS, S3, Shell Scripting, Azure, HBase, Python, Kerberos, Agile, Zookeeper, Maven, Ambari, Hortonworks, Control-M
Confidential, Hartford, CT
BigData Hadoop/Data Developer
Responsibilities:
- Developing and maintaining a Data Lake containing regulatory data for federal reporting with big data technologies such as Hadoop Distributed File System (HDFS), Apache Impala, Apache Hive and Cloudera distribution.
- Developing ETL jobs to extract data from sources such as Oracle and Microsoft SQL Server, transform the extracted data using Hive Query Language (HQL), and load it into the Hadoop Distributed File System (HDFS).
- Involved in importing data from different sources into HDFS using Sqoop, applying transformations using Hive and Spark, and then loading the data into Hive tables (see the sketch at the end of this list).
- Fixing data related issues within the Data Lake.
- Primarily involved in the data migration process on Azure, integrating with a GitHub repository and Jenkins.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment, with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores for data access and analysis.
- Used the in-memory computing capabilities of Spark to perform procedures such as text analysis and processing, using Scala.
- Primarily responsible for designing, implementing, testing, and maintaining the database solution for Azure.
- Worked with Spark Streaming, dividing data into batches for processing through the Spark engine.
- Implementing new functionality in the Data Lake using big data technologies such as Hadoop Distributed File System (HDFS), Apache Impala and Apache Hive based on the requirements provided by the client.
- Communicating regularly with the business teams along with the project manager to ensure that any gaps between the client’s requirements and project’s technical requirements are resolved.
- Developing Python scripts that use the Hadoop Distributed File System APIs to generate curl commands for migrating data and preparing different environments within the project.
- Monitoring production jobs using Control-M daily.
- Coordinating production releases with the change management team using the Remedy tool.
- Communicating effectively with team members and conducting code reviews.
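Sketch for the import-transform-load flow above: a minimal Spark job in Scala that reads raw extracts already landed on HDFS by Sqoop, applies light cleanup, and loads a partitioned Hive table. The landing path, the regulatory.positions table, and the column names are hypothetical placeholders.

```scala
// Sketch only: transform Sqoop-landed extracts and load them into a Hive table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, trim}

object LoadRegulatoryData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("regulatory-data-load")
      .enableHiveSupport() // write into Hive-managed tables in the data lake
      .getOrCreate()

    // Raw delimited extracts landed by Sqoop from Oracle / SQL Server.
    val raw = spark.read
      .option("header", "true")
      .option("delimiter", "|")
      .csv("hdfs:///landing/regulatory/positions")

    // Light cleanup comparable to the HQL transformations: trim keys, type the dates.
    val cleaned = raw
      .withColumn("account_id", trim(col("account_id")))
      .withColumn("report_date", to_date(col("report_date"), "yyyy-MM-dd"))
      .filter(col("account_id").isNotNull)

    // Load into a partitioned Hive table for downstream Hive/Impala reporting.
    cleaned.write
      .mode("append")
      .partitionBy("report_date")
      .saveAsTable("regulatory.positions")
  }
}
```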
Environment: Hadoop, Data Lake, Azure, Python, Spark, Hive, Cassandra, ETL Informatica, Cloudera, Oracle 10g, Microsoft SQL Server, Control-M, Linux
Confidential, Cary, NC
BigData Hadoop/Spark Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Migrated Pig scripts and MapReduce jobs to the Spark DataFrames API and Spark SQL to improve performance.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (see the sketch at the end of this list).
- Expertise in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
- Developed DataFrames and case classes for the required input data and performed data transformations using Spark Core.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark 1.6 for data aggregation and queries, writing data back into the OLTP system through Sqoop; also developed applications in Scala.
- Expertise in deploying Hadoop YARN, Spark, and Storm integrated with Cassandra, Ignite, Kafka, etc.
- Strong working experience with Cassandra, retrieving data from Cassandra clusters to run queries.
- Developed a POC in Scala, deployed it on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Deployed and maintained multi-node Dev and Test Kafka Clusters.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Experience in using Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze logs produced by the Spark cluster.
- Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark with Scala.
- Developed equivalent Spark Scala code for existing SAS code to extract summary insights from Hive tables.
- Responsible for importing data from sources such as MySQL databases into HDFS, saving it in Avro and JSON file formats.
- Experience in importing data from S3 to HIVE using Sqoop and Kafka.
- Good Experience working with Amazon AWS for accessing Hadoop cluster components.
- Involved in creating partitioned Hive tables and loading and analyzing data using Hive queries; implemented partitioning and bucketing in Hive.
- Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with the goal of adopting the former in the project.
- Developed Hive queries to process the data and generate data cubes for visualization.
- Good experience with Talend Open Studio for designing ETL jobs for data processing.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Configured Hadoop clusters and coordinated with BigData Admins for cluster maintenance.
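Sketch for the Kafka-to-Cassandra learner pipeline above: a minimal Spark 1.6-style DStream job in Scala using the direct Kafka API and the DataStax spark-cassandra-connector. The broker, topic, keyspace/table names, and the three-field message layout are hypothetical placeholders.

```scala
// Sketch only: near-real-time Kafka ingestion persisted to Cassandra (Spark 1.6 DStream API).
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

// Case class fields map to Cassandra columns (learner_id, course, event_time by default).
case class LearnerEvent(learnerId: String, course: String, eventTime: String)

object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("learner-event-stream")
      .set("spark.cassandra.connection.host", "cassandra-host")
    val ssc = new StreamingContext(conf, Seconds(10)) // batch interval chosen during tuning

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("learner-events"))

    // Parse comma-delimited messages into case classes and persist each micro-batch.
    // (A real job would guard against malformed records instead of pattern-matching blindly.)
    stream.map { case (_, value) =>
      val Array(id, course, ts) = value.split(",", 3)
      LearnerEvent(id, course, ts)
    }.saveToCassandra("learning", "learner_events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```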
Environment: Hadoop, YARN, Spark-Core, Spark-Streaming, Spark-SQL, Scala, Python, Kafka, Hive, Sqoop, Amazon AWS, Elastic Search, Impala, Cassandra, Tableau, Informatica, Cloudera, Oracle 10g, Linux.
Confidential
Hadoop Developer
Responsibilities:
- Experience in developing custom UDFs in Java to extend Hive and Pig Latin functionality.
- Responsible for installing, configuring, supporting, and managing of Hadoop Clusters.
- Imported and exported data between HDFS and an Oracle 10.2 database using Sqoop.
- Installed and configured Pig and wrote Pig Latin scripts.
- Designed and implemented HIVE queries and functions for evaluation, filtering, loading and storing of data.
- Created HBase tables and column families to store the user event data.
- Wrote automated HBase test cases for data quality checks using HBase command-line tools.
- Developed a data pipeline using HBase and Hive to ingest, transform, and analyze customer behavioral data.
- Experience in collecting log data from different sources (web servers and social media) using Flume and storing it in HDFS to run MapReduce jobs.
- Handled importing of data from machine logs using Flume.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud.
- Configured, monitored, and optimized Flume agent to capture web logs from the VPN server to be put into Hadoop Data Lake.
- Responsible for loading data from UNIX file systems to HDFS. Installed and configured Hive and wrote Pig/Hive UDFs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python (see the sketch at the end of this list).
- Exported the analyzed data to the relational databases using Sqoop to further visualize and generate reports for the BI team.
- Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
- Wrote Java code to format XML documents; upload them to Solr server for indexing.
- Used NoSQL technology (Amazon DynamoDB) to gather and track event-based metrics.
- Maintained all services in the Hadoop ecosystem using ZooKeeper.
- Designed and implemented Spark jobs to support distributed data processing.
- Expertise in extracting, transforming, and loading data from Oracle, DB2, SQL Server, MS Access, Excel, flat files, and XML using Talend.
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Helped design scalable Big Data clusters and solutions.
- Followed Agile methodology for the entire project.
- Involved in review of functional and non-functional requirements.
- Involved in Hadoop cluster tasks such as adding and removing nodes without any effect on running jobs and data.
- Developed workflows using Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Developed interactive shell scripts for scheduling various data cleansing and data loading processes.
- Converted the existing relational database model to the Hadoop ecosystem.
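Sketch for the Hive/SQL-to-Spark conversion above: a Scala RDD version of a simple HiveQL aggregation. The HDFS paths, file layout, and column positions are hypothetical placeholders.

```scala
// Sketch only: RDD transformations equivalent to
//   SELECT category, SUM(amount) FROM orders GROUP BY category;
import org.apache.spark.{SparkConf, SparkContext}

object OrdersByCategory {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orders-by-category"))

    // orders files: order_id,category,amount (comma-delimited text on HDFS)
    val totals = sc.textFile("hdfs:///warehouse/orders")
      .map(_.split(","))
      .filter(_.length == 3)                          // drop malformed rows
      .map(fields => (fields(1), fields(2).toDouble)) // (category, amount)
      .reduceByKey(_ + _)                             // GROUP BY category / SUM(amount)

    totals.saveAsTextFile("hdfs:///output/orders_by_category")
    sc.stop()
  }
}
```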
Environment: Hadoop, HDFS, Pig, Hive, Flume, Sqoop, Oozie, Python, Shell Scripting, SQL, Talend, Spark, HBase, Elasticsearch, Linux (Ubuntu), Kafka.
Confidential
Java/Hadoop Developer
Responsibilities:
- Developed JSP, JSF and Servlets to dynamically generate HTML and display the data to the client side.
- Used the Hibernate framework for persistence to an Oracle database.
- Wrote and debugged the Ant scripts for building the entire web application.
- Developed web services in Java; experienced with SOAP and WSDL, and used WSDL to publish the services to other applications.
- Implemented Java Message Services (JMS) using JMS API.
- Involved in managing and reviewing Hadoop log files.
- Installed and configured Hadoop, YARN, MapReduce, Flume, and HDFS; developed multiple MapReduce jobs in Java for data cleaning (see the sketch at the end of this list).
- Coded Hadoop Map Reduce jobs for energy generation and PS.
- Coded Servlets, SOAP clients, and Apache CXF REST APIs to deliver data from our application to external and internal consumers over the appropriate communication protocols.
- Worked on Cloudera distribution system for running Hadoop jobs on it.
- Expertise in writing Hadoop jobs to analyze data using MapReduce, Hive, Pig, Solr, and Splunk.
- Created a SOAP web service using JAX-WS to enable clients to consume it.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and vice-versa.
- Experienced in designing and developing multi-tier scalable applications using Java and J2EE Design Patterns.
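Sketch for the data-cleaning MapReduce work above. The production jobs in this role were written in Java; purely to keep all sketches in this resume in one language, the same map-only cleaning pattern is shown here in Scala against the standard Hadoop MapReduce API. The record layout (five comma-separated fields) is a hypothetical placeholder.

```scala
// Sketch only: map-only MapReduce job that drops malformed comma-delimited records.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Keeps only records with the expected number of non-empty fields, trimming whitespace.
class CleaningMapper extends Mapper[LongWritable, Text, NullWritable, Text] {
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, NullWritable, Text]#Context): Unit = {
    val fields = value.toString.split(",", -1).map(_.trim)
    if (fields.length == 5 && fields.forall(_.nonEmpty)) {
      context.write(NullWritable.get(), new Text(fields.mkString(",")))
    }
  }
}

object CleaningJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "record-cleaning")
    job.setJarByClass(classOf[CleaningMapper])
    job.setMapperClass(classOf[CleaningMapper])
    job.setNumReduceTasks(0) // map-only cleaning job
    job.setOutputKeyClass(classOf[NullWritable])
    job.setOutputValueClass(classOf[Text])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```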
Environment: Java, HTML, Java Script, SQL Server, PL/SQL, JSP, Spring, Hibernate, Web Services, SOAP, SOA, JSF, Java, JMS, Junit, Oracle, Eclipse, SVN, XML, CSS, Log4j, Ant, Apache Tomcat.
Confidential
Java/J2EE developer
Responsibilities:
- Designed and developed a Struts-like MVC 2 web framework using the front-controller design pattern, which is used successfully in several production systems.
- Spearheaded the Quick Wins project by working closely with the business and end users to improve the website's ranking from 23rd to 6th in just 3 months.
- Normalized Oracle database conforming to design concepts and best practices.
- Resolved product complications at customer sites and funneled the insights to the development and deployment teams to adopt long term product development strategy with minimal roadblocks.
- Convinced business users and analysts with alternative solutions that are more robust and simpler to implement from technical perspective while satisfying the functional requirements from the business perspective.
- Applied design patterns and OO design concepts to improve the existing Java/JEE based code base.
- Identified and fixed transactional issues due to incorrect exception handling and concurrency issues due to unsynchronized block of code.
Environment: Java 1.2/1.3, Swing, Applets, Servlets, JSP, custom tags, JNDI, JDBC, XML, XSL, DTD, HTML, CSS, JavaScript, Oracle, DB2, PL/SQL, WebLogic, JUnit, Log4j, and CVS