Hadoop/Spark Developer Resume
Columbia, MD
SUMMARY:
- Over 9 years of extensive hands-on experience in Hadoop/Big Data and Java/J2EE technologies, along with a range of related IT technologies.
- 6+ years of experience designing and implementing complete end-to-end Hadoop infrastructure using MapReduce, Pig, Hive, Sqoop, Oozie, Flume, Spark, HBase, and ZooKeeper.
- Experience working with BI teams to translate big data requirements into Hadoop-centric solutions.
- Expertise in setting up Hadoop standalone and multi-node clusters.
- Experience in data analysis using Hive, Pig Latin, HBase, and custom MapReduce programs in Java.
- Experience writing custom UDFs in Java and Scala to extend Hive and Pig functionality.
- Experience with Cloudera and Hortonworks distributions.
- Core understanding of the main Hadoop modules: Hadoop Common, HDFS, MapReduce, YARN, JobTracker, TaskTracker, NameNode, and DataNode.
- Developed analytical components using Kafka, Scala, Spark, HBase, and Spark Streaming.
- Experience working with Flume to load log data from multiple sources directly into HDFS.
- Good knowledge of Hortonworks administration and security features such as Apache Ranger, Knox Gateway, and High Availability.
- Experience importing and exporting data with Sqoop between HDFS and relational database systems (RDBMS).
- Involved in creating an HDInsight cluster in the Microsoft Azure portal, as well as Event Hubs and Azure SQL databases.
- Expertise in implementing enterprise-level security using AD/LDAP, Kerberos, Knox, Sentry, and Ranger.
- Good knowledge of Hive optimization techniques such as vectorization and column-based optimization.
- Wrote Oozie workflows to invoke jobs at predefined intervals.
- Expert in scheduling Oozie coordinators on input data events, so that the workflow starts as soon as input data becomes available.
- Worked on a POC with Kafka and NiFi to pull real-time events into the Hadoop cluster.
- Used Spark to improve the performance and optimization of existing algorithms in Hadoop, working with SparkContext, Spark SQL, DataFrames, pair RDDs, and YARN (see the PySpark sketch following this summary).
- Implemented a Hadoop backup strategy covering Hive, HDFS, HBase, Oozie, etc.
- Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW (Enterprise Data Warehouse).
- Good knowledge of Hadoop architecture, with working experience on components such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode.
- Experience developing MapReduce programs using Apache Hadoop to analyze data.
- Experience with Druid for real-time data ingestion.
- Hands-on experience with the YARN (MapReduce 2.0) architecture and components such as ResourceManager, NodeManager, Container, and ApplicationMaster, and with the execution flow of a MapReduce job.
- Worked with multiple databases, including RDBMS technologies (MySQL, Oracle) and NoSQL databases (Cassandra, HBase, Neo4j).
- Capable of provisioning, installing, configuring, monitoring, and maintaining HDFS, YARN, HBase, Sqoop, Pig, and Hive.
- Experienced in integrating various data sources (DB2-UDB, SQL Server, PL/SQL, Oracle, Teradata, XML, and MS Access) into a data staging area.
- Experience administering Red Hat Linux: installation, configuration, troubleshooting, security, backup, performance monitoring, and fine-tuning.
- Extensive experience working with Oracle, DB2, SQL Server, and MySQL databases. Strong scripting skills in Shell, Perl, and Python.
- Experience designing and coding web applications using Core Java and J2EE technologies (JSP, Servlets, and JDBC).
- Excellent knowledge of Java and SQL for application development and deployment.
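The following is a minimal, illustrative PySpark sketch of the SparkContext / pair RDD / DataFrame / Spark SQL workflow referenced above; the input path and column names are hypothetical.

```python
# Illustrative sketch only (not project code): the same aggregation expressed
# as a pair RDD, a DataFrame, and a Spark SQL query. Path and column names
# are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("summary-sketch").getOrCreate()
sc = spark.sparkContext

# Pair RDD: (user_id, 1) tuples reduced by key.
lines = sc.textFile("hdfs:///data/clickstream/*.log")        # hypothetical path
clicks_rdd = (lines
              .map(lambda line: (line.split(",")[0], 1))     # key = first field
              .reduceByKey(lambda a, b: a + b))

# DataFrame: the same counts with named columns.
clicks_df = spark.createDataFrame(clicks_rdd, ["user_id", "clicks"])

# Spark SQL: query the DataFrame through a temporary view.
clicks_df.createOrReplaceTempView("clicks")
spark.sql("SELECT user_id, clicks FROM clicks ORDER BY clicks DESC LIMIT 10").show()
```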
TECHNICAL SKILLS:
Operating Systems: Linux (Ubuntu, RHEL 7/6.x/5.x/4.x, CentOS 4.x/5.x/6.x/7), Solaris, UNIX, iOS, Windows XP/Vista/2003/2007/2010
Big Data Technologies: HDFS, MapReduce, YARN, Hive, Pig, Pentaho, HBase, Oozie, ZooKeeper, Sqoop, Cassandra, Spark, Scala, Storm, Flume, Kafka, Avro, Parquet, Snappy.
NoSQL Databases: HBase, Cassandra, MongoDB, Neo4j, Redis.
Cloud Services: Amazon AWS, Google Cloud.
Languages: C, C++, Java, Scala, Python, HTML, SQL, PL/SQL, Pig Latin, HiveQL, JavaScript, UNIX Shell Scripting.
ETL Tools: Informatica, IBM DataStage, Talend.
Application Servers: WebLogic, WebSphere, JBoss, Tomcat.
Databases: Oracle, MySQL, DB2, Teradata, Microsoft SQL Server.
Build Tools: Jenkins, Maven, Ant, Azure DevOps
Version Control: Subversion, Git, Bitbucket, GitHub
Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans
Methodologies: Agile, Waterfall
WORK EXPERIENCE:
Hadoop/Spark Developer
Confidential, Columbia, MD
Responsibilities:
- Worked directly with the Big Data Architecture team, which created the foundation of this enterprise analytics initiative in a Hadoop-based data lake.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Enhanced and optimized production Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data.
- Installed, configured, and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, ZooKeeper, and Sqoop.
- Loaded JSON from upstream systems using Spark Streaming and wrote it to Elasticsearch.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging the data in HDFS for further analysis.
- Developed a data pipeline to ingest Hive tables and file feeds and generate insights into Cassandra.
- Upgraded the Hadoop cluster from CDH 4.7 to CDH 5.2 and worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slots configuration.
- Developed Spark scripts to import large files from Amazon S3 buckets, and imported data from sources such as HDFS and HBase into Spark RDDs.
- Involved in migrating ETL processes from Oracle to Hive to test easier data manipulation, and worked on importing and exporting data between Oracle/DB2 and HDFS/Hive using Sqoop.
- Installed Cloudera Manager and CDH, installed the JCE policy files, created a Kerberos principal for the Cloudera Manager server, and enabled Kerberos using the wizard.
- Worked closely with data scientists to build predictive models using PySpark.
- Worked on Spark Structured Streaming to develop a live streaming data pipeline with Kafka as the source and insights written to Cassandra; the data arrived in JSON/XML format and was stored in Cassandra (see the streaming sketch following this list).
- Developed Spark jobs using Scala and Python on top of YARN/MRv2 for interactive and batch analysis.
- Monitored the cluster for performance, networking, and data integrity issues, and troubleshot failures in MapReduce job execution by inspecting and reviewing log files.
- Used Impala to read, write, and query Hadoop data in HDFS, HBase, or Cassandra.
- Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
- Installed the OS and administered the Hadoop stack on the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
- Responsible for developing Kafka components per the software requirement specifications.
- Used Spark Streaming to consume ongoing information from Kafka and store the stream data in HDFS.
- Supported MapReduce programs and distributed applications running on the Hadoop cluster, and scripted Hadoop package installation and configuration to support fully automated deployments.
- Migrated an existing on-premises application to AWS, used AWS services such as EC2 and S3 for processing and storing large data sets, and worked with Elastic MapReduce to set up the Hadoop environment on AWS EC2 instances.
- Worked with the systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
- Performed maintenance, monitoring, deployments, and upgrades across the infrastructure supporting all of our Hadoop clusters.
- Exported data from Impala to the Tableau reporting tool and created dashboards on a live connection.
- Created Hive external tables, loaded data into them, and queried the data using HQL; worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Worked extensively with AWS cloud services such as EC2, S3, EBS, RDS, and VPC.
- Monitored and maintained the Hadoop cluster, including adding and removing nodes, using tools such as Nagios, Ganglia, and Cloudera Manager.
- Used Hive to expose data for further analysis and to transform files from different analytical formats into text files.
- Wrote PySpark code to calculate aggregate statistics such as mean, covariance, and standard deviation (see the aggregation sketch at the end of this role).
- Involved in cluster maintenance, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
- Helped create end-to-end process documentation for a couple of projects and made it available to the business.
- Worked in Agile and DevOps environments with several teams (development, QA, build and release management, content distribution) to speed up application/project deployments in a continuous delivery pipeline.
- Analyzed the Hadoop cluster and various big data analytics tools, including Hive and Sqoop.
- Helped the team lead and the team resolve production support issues and ensured that all jobs ran successfully.
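A minimal sketch of the Kafka-to-HDFS streaming path described above (a Cassandra sink would use the Spark Cassandra connector instead); the broker address, topic, schema, and paths are hypothetical, and the Spark Kafka connector package is assumed to be on the classpath.

```python
# Hedged sketch only: Kafka source -> parse JSON -> Parquet on HDFS with Spark
# Structured Streaming. Broker, topic, schema, and paths are hypothetical;
# the spark-sql-kafka connector package is assumed to be available.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

schema = (StructType()
          .add("customer_id", StringType())
          .add("event", StringType())
          .add("amount", DoubleType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
       .option("subscribe", "transactions")                 # hypothetical topic
       .load())

# Kafka delivers the payload as bytes; cast to string and parse the JSON.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(F.from_json("json", schema).alias("rec"))
             .select("rec.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/transactions/parquet")
         .option("checkpointLocation", "hdfs:///checkpoints/transactions")
         .outputMode("append")
         .start())
query.awaitTermination()
```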
Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Python, Spark, Spark Streaming, Spark SQL, AWS EMR, AWS S3, AWS Redshift, Scala, PySpark, MapR, Java, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Cloudera Manager, ZooKeeper, Cloudera, Oracle, Kerberos, and Red Hat 6.5.
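An illustrative PySpark sketch of the aggregate-statistics work noted in this role; the table and column names are hypothetical.

```python
# Illustrative only: mean, standard deviation, and covariance with PySpark.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("aggregate-stats")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("edw.daily_positions")   # hypothetical Hive table

stats = df.agg(
    F.mean("balance").alias("mean_balance"),
    F.stddev("balance").alias("stddev_balance"),
    F.covar_samp("balance", "exposure").alias("covar_balance_exposure"),
)
stats.show()

# DataFrame.stat also exposes sample covariance directly.
print(df.stat.cov("balance", "exposure"))
```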
Hadoop Developer
Confidential, Chicago, IL
Responsibilities:
- Responsible for building scalable distributed data solutions on the Cloudera distribution of Hadoop.
- Used Spark Streaming and Spark jobs to process ongoing customer transactions, and Spark SQL to handle structured data in Hive.
- Migrated tables from RDBMS into Hive tables using Sqoop and later generated visualizations using Tableau.
- Gathered requirements from the client and estimated timelines for developing complex queries using Hive and Impala for a logistics application.
- Wrote PySpark code to calculate aggregate statistics such as mean, covariance, and standard deviation.
- Analyzed the Hadoop cluster and different big data components, including Pig, Hive, Storm, Spark, HBase, Kafka, Elasticsearch, databases, and Sqoop.
- Integrated Kafka with Flume in a sandbox environment using a Kafka source and a Kafka sink.
- Installed Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Developed the Insight Store data model for Cassandra, which was used to store the transformed data.
- Responsible for writing Hive queries to analyze data in the Hive warehouse using Hive Query Language (HQL).
- Wrote UDFs in Scala and used them for sampling large data sets.
- In the data exploration stage, used Hive and Impala to gain insights into the customer data.
- Secured the Kafka cluster with Kerberos and implemented Kafka security features using SSL, both with and without Kerberos; for finer-grained security, set up Kerberos users and groups to enable more advanced security features, and integrated Apache Kafka for data ingestion.
- Worked with the AWS Athena serverless query service.
- Implemented highly scalable and robust ETL processes using AWS (EMR, CloudWatch, IAM, EC2, S3, Lambda functions, DynamoDB).
- Used a variety of data formats while loading data into HDFS.
- Worked in an AWS environment for the development and deployment of custom Hadoop applications.
- Developed NiFi workflows for data ingestion from multiple sources; involved in architecture and design discussions with the technical team and interfaced with other teams to create efficient and consistent solutions.
- Created shell scripts to simplify the execution of the other scripts (Pig, Hive, Sqoop, Impala, and MapReduce) and to move data in and out of HDFS.
- Created files and tuned SQL queries in Hive using Hue.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs (see the sketch following this list).
- Worked with the Spark ecosystem using Spark SQL and Scala on different formats such as text and CSV files.
- Wrote various key Elasticsearch queries for effective data retrieval.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data, and was responsible for managing data from different sources.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data.
- Created CI/CD pipelines in Azure DevOps environments, defining their dependencies and tasks.
- Loaded data into HBase using both the HBase shell and the HBase client API.
- Used Pig as an ETL tool for transformations, joins, and pre-aggregations before storing the data in HDFS.
- Used Kafka to build a customer activity tracking pipeline as a set of real-time publish-subscribe feeds.
- Wrote Elasticsearch templates for the index patterns.
- Developed workflows in Oozie to automate jobs.
- Provided design recommendations and thought leadership to sponsors and stakeholders, improving the review process and resolving technical problems.
- Developed complete end-to-end big data processing in the Hadoop ecosystem.
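A hedged sketch of converting a HiveQL query into Spark RDD transformations, as referenced above; the database, table, and column names are hypothetical.

```python
# Hedged sketch only: a HiveQL join/aggregate rewritten as pair-RDD
# transformations. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-to-rdd")
         .enableHiveSupport()
         .getOrCreate())

# Original HiveQL, for reference:
#   SELECT c.region, SUM(s.quantity) AS total_qty
#   FROM logistics.shipments s JOIN logistics.customers c
#     ON s.customer_id = c.customer_id
#   GROUP BY c.region

shipments = (spark.table("logistics.shipments").rdd
             .map(lambda r: (r["customer_id"], r["quantity"])))   # (customer_id, quantity)
customers = (spark.table("logistics.customers").rdd
             .map(lambda r: (r["customer_id"], r["region"])))     # (customer_id, region)

totals = (shipments.join(customers)                # (customer_id, (quantity, region))
          .map(lambda kv: (kv[1][1], kv[1][0]))    # (region, quantity)
          .reduceByKey(lambda a, b: a + b))        # SUM ... GROUP BY region

print(totals.take(10))
```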
Environment: Hadoop, HDFS, Hive, Sqoop, Oozie, NiFi, Spark, Scala, Kafka, Python, Cloudera, Linux, Spark Streaming, Pig.
Hadoop Developer
Confidential, San Jose, CA
Responsibilities:
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data from HBase through Sqoop, placed in HDFS for further processing.
- Worked extensively with Hadoop ecosystem tools such as HDFS, MapReduce, Pig, Hive, HBase, Sqoop, and Spark.
- Generated Scala and Java classes from the respective APIs so that they could be incorporated into the overall application.
- Optimized MapReduce jobs to use HDFS efficiently through various compression mechanisms.
- Administered, installed, and managed distributions of Hadoop, Hive, and HBase.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model, which receives data from Kafka in near real time and persists it to Cassandra.
- Hands-on experience loading data from the UNIX file system and Teradata into HDFS; installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
- Created Hive tables, loaded data, and ran Hive queries on that data (see the partitioned-table sketch following this list).
- Extensive working knowledge of partitioned tables, UDFs, performance tuning, compression-related properties, and the Thrift server in Hive.
- Wrote optimized Pig scripts and was involved in developing and testing Pig Latin scripts.
- Working knowledge of writing Pig Load and Store functions.
- Developed Java MapReduce programs to transform log data into a structured form and derive user location, age group, and time spent.
- Developed optimal strategies for distributing the web log data over the cluster, and imported and exported the stored web log data into HDFS and Hive using Sqoop.
- Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Flume, and stored the data in HDFS for analysis.
- Monitored multiple Hadoop cluster environments using Ganglia.
- Developed Pig scripts for the analysis of semi-structured data.
- Developed industry-specific UDFs (user-defined functions).
- Used Flume to collect, aggregate, and store web log data from different sources such as web servers, mobile and network devices, and pushed it to HDFS.
- Used Sqoop to import data into Cassandra tables from relational databases such as Oracle and MySQL, and designed column families.
- Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased products on the website (see the HiveQL sketch at the end of this role).
- Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
- Monitored workload, job performance, and capacity planning using Cloudera Manager.
- Managed and scheduled jobs on the Hadoop cluster using Oozie.
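A hedged sketch of creating and loading a partitioned Hive table from PySpark, as referenced above; the database, table, and column names are hypothetical.

```python
# Hedged sketch only: create a partitioned Hive table and load it with a
# dynamic-partition insert. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitioned-hive-table")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS weblogs.page_events (
        visitor_id STRING,
        url        STRING,
        event_time TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
""")

# Allow dynamic partitioning so event_date is derived from the data itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    INSERT INTO TABLE weblogs.page_events PARTITION (event_date)
    SELECT visitor_id,
           url,
           event_time,
           date_format(event_time, 'yyyy-MM-dd') AS event_date
    FROM weblogs.page_events_staging          -- hypothetical staging table
""")
```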
Environment: Amazon EC2, Apache Hadoop 1.0.1, MapReduce, HDFS, CentOS 6.4, Spark, Impala, HBase, Kafka, Elasticsearch, Hive, Pig, Oozie, Flume, Java (JDK 1.6), Eclipse, Sqoop, Ganglia, Linux.
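An illustrative HiveQL query (run here through PySpark) for the web-log metrics described in this role; the table and column names are hypothetical.

```python
# Illustrative only: unique visitors per day and page views from web logs.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("weblog-analysis")
         .enableHiveSupport()
         .getOrCreate())

daily_metrics = spark.sql("""
    SELECT to_date(event_time)          AS visit_date,
           COUNT(DISTINCT visitor_id)   AS unique_visitors,
           COUNT(*)                     AS page_views
    FROM weblogs.page_events
    GROUP BY to_date(event_time)
    ORDER BY visit_date
""")
daily_metrics.show()
```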
Hadoop Developer
Confidential
Responsibilities:
- Collaborated in identifying current problems, constraints, and root causes in the data sets to arrive at descriptive and predictive solutions supported by Hadoop HDFS, MapReduce, Pig, Hive, and HBase, and further developed reports in Tableau.
- Architected the Hadoop cluster in pseudo-distributed mode, working with ZooKeeper and Apache services; stored and loaded data between HDFS and Amazon S3, including backups, and created tables in the AWS cluster with S3 storage.
- Evaluated existing infrastructure, systems, and technologies; provided gap analysis; documented requirements, evaluations, and recommendations for systems, upgrades, and technologies; and created the proposed architecture and specifications along with recommendations.
- Installed, configured, and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, ZooKeeper, and Sqoop.
- Installed and configured Sqoop to import and export data into MapR-FS, HBase, and Hive from relational databases.
- Administered large MapR Hadoop environments, including cluster build and support, performance tuning, and monitoring in an enterprise environment.
- Installed and configured the MapR services: ZooKeeper, CLDB, JobTracker, TaskTracker, ResourceManager, NodeManager, FileServer, and WebServer.
- Installed and configured the Knox gateway to secure Hive over ODBC, WebHCat, and Oozie services.
- Loaded data from relational databases into the MapR-FS filesystem and HBase using Sqoop, and set up MapR metrics with a NoSQL database to log metrics data.
- Closely monitored and analyzed MapReduce job executions on the cluster at the task level, and optimized Hadoop cluster components to achieve high performance.
- Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
- Integrated HDP clusters with Active Directory and enabled Kerberos for authentication.
- Worked on commissioning and decommissioning of DataNodes, NameNode recovery, and capacity planning, and installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Worked on creating the data model for HBase from the current Oracle data model.
- Implemented High Availability and automatic failover infrastructure, using ZooKeeper services, to overcome the NameNode single point of failure.
- Leveraged Chef to manage and maintain builds in various environments, planned hardware and software installation on the production cluster, and coordinated with multiple teams to get it done.
- Monitored Hadoop cluster health through MCS and worked on NoSQL databases including HBase.
- Used Hive, created Hive tables, loaded data, and wrote Hive UDFs (see the UDF sketch following this list); worked with the Linux server admin team on administering the server hardware and operating system.
- Worked closely with data analysts to construct creative solutions for their analysis tasks, and managed and reviewed Hadoop and HBase log files.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation, and worked on importing and exporting data between Oracle and HDFS/Hive using Sqoop.
- Collaborated with application teams to install operating system and Hadoop updates, patches, and version upgrades when required.
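The Hive UDFs in this role were written in the usual Hive way; purely as an illustration, here is a PySpark UDF registered for use from SQL. The function name, logic, and table are hypothetical.

```python
# Illustration only (not the project's Hive UDF): a Python UDF registered
# with Spark SQL. Function name, logic, and table are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-sketch")
         .enableHiveSupport()
         .getOrCreate())

def normalize_phone(value):
    """Strip non-digits and keep the last 10 digits of a phone number."""
    if value is None:
        return None
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits[-10:] if len(digits) >= 10 else digits

spark.udf.register("normalize_phone", normalize_phone, StringType())
spark.sql("SELECT normalize_phone(phone) AS phone FROM crm.contacts").show()
```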
Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Kafka, ZooKeeper, Oozie, Impala, Cloudera, Oracle, Teradata, SQL Server, Python, UNIX Shell Scripting, ETL, Flume, Scala, Spark, Sqoop, AWS, S3, EC2, MySQL, Hortonworks, YARN.
Java Developer
Confidential
Responsibilities:
- Involved in the analysis, design, development, and testing phases of the Software Development Life Cycle (SDLC).
- Designed and developed framework components involved in implementing the Confidential pattern using the Struts and Spring frameworks.
- Designed and developed message flows, message sets, and other service components to expose mainframe applications to enterprise J2EE applications.
- Developed the Action classes and ActionForm classes, created JSPs using Struts tag libraries, and configured them in the struts-config.xml and web.xml files.
- Wrote several Action classes and ActionForms to capture user input and created different web pages using JSTL, JSP, HTML, custom tags, and Struts tags.
- Deployed and configured applications on WebLogic Server.
- Used SOAP for exchanging XML-based messages.
- Used Microsoft Visio to develop use case diagrams, sequence diagrams, and class diagrams in the design phase.
- Used standard data access technologies such as JDBC and an ORM tool, Hibernate.
- Developed custom tags to simplify the JSP code and designed UI screens using JSP and HTML.
- Actively involved in designing and implementing the Factory Method, Singleton, Confidential, and Data Access Object design patterns.
- Used web services to send and receive data from different applications via SOAP messages, then used a DOM XML parser for data retrieval.
- Wrote JUnit test cases for the controller, service, and DAO layers using Mockito and DbUnit.
- Developed unit test cases using a proprietary framework similar to JUnit.
- Used the JUnit framework for unit testing of the application and Ant to build and deploy the application on WebLogic Server.
Environment: Java, J2EE, JSP, Servlets, HTML, DHTML, XML, JavaScript, Struts, C/C++, Eclipse, WebLogic, PL/SQL, and Oracle.