Hadoop Spark Developer Resume
VA
SUMMARY:
- Around 5 years of strong experience in software development.
- Around 4 years of experience with Big Data ecosystems, machine learning and data analytics.
- Extensive experience in Apache Spark with Scala and Python.
- Solid foundation in mathematics, probability and statistics, with broad practical experience in statistical and data mining techniques gained through industry work and academic programs.
- Involved in all Software Development Life Cycle (SDLC) phases, including Analysis, Design, Implementation, Testing and Maintenance.
- Strong technical, administration and mentoring knowledge in Linux and Big Data/Hadoop technologies.
- Sound knowledge of the in-memory database MemSQL.
- Hands-on experience with major components of the Hadoop ecosystem, including MapReduce, HDFS, Hive, Pig, Pentaho, HBase, ZooKeeper, Sqoop, Oozie, Cassandra, Flume and Avro.
- Experienced in deploying Hadoop clusters using Puppet.
- Work experience with cloud infrastructure like Amazon Web Services (AWS).
- Experience importing and exporting data with Sqoop between HDFS and relational database systems/mainframes.
- Expertise in working with ETL architects, data analysts and data modelers to translate business rules/requirements into conceptual, logical and physical dimensional models, and in working with complex normalized and denormalized data models.
- Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Installing, configuring and managing Hadoop clusters and data science tools.
- Managing the Hadoop distribution with Cloudera Manager, Cloudera Navigator and Hue.
- Setting up high availability for Hadoop cluster components and edge nodes.
- Strong experience in writing Python applications using libraries such as Pandas, NumPy, SciPy and Matplotlib.
- Experience in developing Shell scripts and Python Scripts for system management.
- Well versed in using Software development methodologies like Rapid Application Development (RAD), Agile Methodology and Scrum software development processes.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.
- Used Pig Latin scripts, join operations and custom user-defined functions (UDFs) to perform ETL operations.
- Experience in Performance Tuning and Debugging of existing ETL processes.
- Worked with version control systems such as Subversion, Perforce and Git to provide a common platform for all developers.
- Experience with Agile Environment
- Articulate in written and verbal communication along with strong interpersonal, analytical, and organizational skills.
- Hands on experience with Microsoft Azure Cloud services, Storage Accounts and Virtual Networks.
- Proficient in managing hosting plans for Azure infrastructure, and in implementing and deploying workloads on Azure virtual machines (VMs).
- Highly motivated team player with the ability to work independently and adapt quickly to new and emerging technologies.
- Creatively communicate and present models to business customers and executives, utilizing a variety of formats and visualization methodologies.
TECHNICAL SKILLS:
Languages: C, C++, Python, Scala
Big Data Skills: MapReduce, Hadoop, Spark, Kafka, Storm
Servers: WebSphere, Tomcat 6.x, IIS (Microsoft Internet Information Services)
Case Tools and IDEs: Eclipse, NetBeans, RAD, IntelliJ, Netezza
Frameworks in Hadoop: Spark, Kafka, Storm
Databases: DB2, Oracle and MySQL Server
Version Tools: GIT
Web Services: SOAP, REST
PROFESSIONAL EXPERIENCE:
Confidential, VA
Hadoop Spark Developer
Responsibilities:
- Migrated Oracle tables to HDFS using Sqoop.
- Designed and developed rich front-end screens using JSF (ICEfaces), JSP, Docker, CSS, HTML, AngularJS and jQuery.
- Developed managed beans and defined navigation rules for the application using JSF.
- Developed AngularJS 2.0 code and migrated pre-existing code to the updated AngularJS 2.0 framework. Wrote a custom Sqoop class for the XMLTYPE datatype in the Oracle database.
- Used Java Messaging Services (JMS) for reliable and asynchronous exchange of important information such as payment status report to MQ Server using MQ Series.
- Designed and implemented an Apache Spark job that takes Sequence Files from HDFS and migrates them to HBase.
- Motivated and assisted a team of six members in reaching individual and team goals for quality, productivity and revenue generation.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data
- Used WebSphere Application Server Developer Tools for Eclipse (WDT) to create Java batch projects based on the Java Batch 1.0 standard (JSR 352) and submit them to a Liberty profile server
- Guided the team in preparing the technical specifications.
- Involved in development, build and deployment of the application.
- Implemented microservices to separate tasks and avoid dependencies on other parallel ongoing tasks of the same application.
- Developed various shell scripts and Python scripts to automate Spark jobs and Hive scripts.
- Used Impala to read, write and query the Hadoop data in HDFS from HBase or Cassandra and configured Kafka to read and write messages from external programs.
- Extended Hive and Pig core functionality with custom user-defined functions (UDFs), user-defined table-generating functions (UDTFs) and user-defined aggregating functions (UDAFs) written in Python.
- Identified opportunities to improve infrastructure that effectively and efficiently utilizes Microsoft Azure, Windows Server 2008/2012/R2, Microsoft SQL Server, Microsoft Visual Studio, Windows PowerShell and cloud infrastructure.
- Deployed Azure IaaS virtual machines (VMs) and Cloud services (PaaS role instances) into secure VNets and subnets.
- Developed RESTful web services with Jersey, implemented JAX-RS and provided security using SSL.
- Created HBase tables and implemented salting of the HBase row keys (see the salting sketch after this list).
- Migrated tables from Oracle to HBase on a per-tenant basis.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Designed and developed ETL code using Informatica mappings to load data from heterogeneous source systems such as flat files, XML files, MS Access files and Oracle into an Oracle staging area, then into the data warehouse and finally into data mart tables for reporting.
- Developed ETL with SCDs, caches and complex joins with optimized SQL queries.
- Wrote programs in Spark using Scala and Python for data quality checks (see the data-quality sketch after this list).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark
- One-time migration of 300 billion records using Apache Spark bulk loading.
- Wrote ETL jobs using Pig Latin and tuned the performance of Hive queries.
- Involved in delta migration using Sqoop incremental imports.
- Wrote Apache Spark jobs using the Scala API.
- Involved in administering and configuring the MapR distribution.
- Integrated MapR Streams (Kafka 0.9 API) with Spark Streaming using the Java API.
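The row-key salting mentioned above can be illustrated with a small, self-contained helper; the bucket count and key layout below are assumptions for illustration, not the production scheme.

```python
# Plain-Python sketch of HBase row-key salting; the bucket count and key
# layout are illustrative assumptions, not the production values.
import hashlib

SALT_BUCKETS = 16  # assumed number of pre-split regions

def salted_row_key(tenant_id: str, record_id: str) -> bytes:
    """Prefix the natural key with a stable hash-based salt so that
    monotonically increasing keys do not hotspot a single region."""
    natural_key = f"{tenant_id}|{record_id}"
    bucket = int(hashlib.md5(natural_key.encode()).hexdigest(), 16) % SALT_BUCKETS
    return f"{bucket:02d}|{natural_key}".encode()

# Example: sequential ids for the same tenant land in different buckets.
print(salted_row_key("tenant42", "0001"))
print(salted_row_key("tenant42", "0002"))
```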
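A minimal sketch of the Hive/SQL-to-Spark conversion and data-quality checks described above; the table name, column names and the 5% null-rate rule are hypothetical placeholders. The point is the pattern: run the same aggregate through Spark SQL, express it as DataFrame transformations, and add a simple quality check.

```python
# Minimal PySpark sketch (hypothetical table/column names) of replacing a
# HiveQL aggregate with Spark SQL plus a DataFrame-based data-quality check.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("payments-data-quality")   # hypothetical app name
         .enableHiveSupport()
         .getOrCreate())

# Original HiveQL-style query, now executed by the Spark SQL engine.
daily_totals = spark.sql("""
    SELECT tenant_id, to_date(event_ts) AS event_date, SUM(amount) AS total
    FROM payments                            -- hypothetical Hive table
    GROUP BY tenant_id, to_date(event_ts)
""")

# Equivalent DataFrame transformations replacing the SQL text.
payments = spark.table("payments")
daily_totals_df = (payments
                   .withColumn("event_date", F.to_date("event_ts"))
                   .groupBy("tenant_id", "event_date")
                   .agg(F.sum("amount").alias("total")))

# Simple data-quality check: flag columns whose null rate exceeds a threshold.
total_rows = payments.count()
null_counts = payments.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in payments.columns]
).first().asDict()
bad_columns = {c: n / total_rows for c, n in null_counts.items()
               if total_rows and n / total_rows > 0.05}
print("columns failing the 5% null-rate check:", bad_columns)
```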
Environment: Apache Spark, Spark Streaming, Spark SQL, ETL, Hadoop Security, MapR Streams, MapR 5.1, OpenShift, Scala, Java, HBase, Eclipse, Maven (MVN), Sequence Files.
Confidential, Dallas, TX
Data Scientist
Responsibilities:
- Full-stack experience across the SDLC, involving data collection, data analysis, visualization and automation.
- Experience in loading data from different databases to HDFS.
- Converted the data to the desired formats to support data cleaning and preprocessing.
- Performed EDA (exploratory data analysis) to gain insights into the data.
- Extensively worked on cleaning and preprocessing the data and on dimensionality reduction, using techniques like PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce the dimensions of higher-dimensional data.
- Performed feature engineering and NLP using techniques such as Word2Vec, BOW (Bag of Words), TF-IDF, average Word2Vec and TF-IDF-weighted Word2Vec (see the classification sketch after this list).
- Used a Naïve Bayes classifier to train on the reviews dataset.
- Built frequency distribution for all words and frequency distribution for words within positive and negative labels.
- Developed Web Services for online text polarity classification.
- Deploying and maintaining sentiment analysis web app on AWS.
- Visualized and interpreted the data, generated reports and developed uses of the data using Python libraries such as Pandas, NumPy, scikit-learn, Matplotlib and Seaborn.
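A minimal scikit-learn sketch of the review-classification pipeline described above (TF-IDF features feeding a Naïve Bayes classifier); the CSV path and column names are hypothetical placeholders.

```python
# Minimal scikit-learn sketch of the review-sentiment pipeline described
# above; the file path and column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

reviews = pd.read_csv("reviews.csv")  # hypothetical dataset

X_train, X_test, y_train, y_test = train_test_split(
    reviews["text"], reviews["label"], test_size=0.2, random_state=42)

# Bag-of-words / TF-IDF features, as in the BOW and TF-IDF work above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Naive Bayes classifier trained on the vectorized reviews.
model = MultinomialNB()
model.fit(X_train_vec, y_train)
print(classification_report(y_test, model.predict(X_test_vec)))
```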
Confidential, Southborough, MA
Hadoop Developer-Java
Responsibilities:
- Involved in architecture design, development and implementation of Hadoop deployment, backup and recovery systems.
- Experience in working on multi-Petabyte clusters both administration and development.
- Developed Chef modules to automate the installation, configuration and deployment of ecosystem tools, OSes and network infrastructure at the cluster level.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Performed cluster co-ordination and assisted with data capacity planning and node forecasting using Zookeeper.
- Developed a fault-tolerant data warehouse cluster using Amazon S3, monitored the nodes, and implemented automatic replication using Amazon Redshift.
- Involved in performance Tuning of Hadoop clusters
- Implemented Hadoop framework to capture user navigation across the application to validate the user interface and provide analytic feedback/result to the UI team.
- Executed custom interceptors for Flume to filter data and defined channel selectors to multiplex the data into different sinks.
- Analyzed the SQL scripts and designed the solution to implement them using PySpark (see the PySpark sketch after this list).
- Extracted data from Oracle, SQL Server and MySQL databases to HDFS using Sqoop.
- Optimized MapReduce jobs to use HDFS efficiently by using Gzip, LZO, Snappy compression techniques.
- Experience in writing Pig scripts to transform raw data from several data sources into forming baseline data.
- Created Hive tables to store the processed results in a tabular format and written Hive scripts to transform and aggregate the disparate data.
- Improved search results using Solr and customized Lucene/Solr code.
- Worked on Lucene and Solr and led the index- and search-related development work.
- Queried huge data sets in near real time while feeding log files into Solr.
- Worked with the team to improve the ranking of search results using Solr.
- Experience in using Avro, Parquet, RCFile and JSON file formats and developed UDFs using Hive and Pig.
- Responsible for cluster maintenance, rebalancing blocks, commissioning and decommissioning of nodes, monitoring and troubleshooting, manage and review data backups and log files.
- Driving the application from development phase to production phase using Continuous Integration and Continuous Deployment (CICD) model using Chef, Maven and Jenkins.
- Developed Pentaho Kettle graphs to cleanse and transform the raw data into useful information and load it to a Kafka queue (further loaded into HDFS) and a Neo4j database for the UI team to display through the web application.
- Automated the process for extraction of data from warehouses and weblogs into HIVE tables by developing workflows and coordinator jobs in Oozie.
- Scheduled snapshots of volumes for backup, performed root cause analysis of failures, and documented bugs and fixes for cluster downtime and maintenance.
- Tune/Modify SQL for batch and online processes.
- Commissioning and decommissioning the nodes.
- Manage cluster through performance tuning and enhancement.
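A minimal sketch of re-implementing one of the analyzed SQL scripts as PySpark DataFrame transformations and writing Snappy-compressed Parquet output; the paths, table and column names are hypothetical placeholders.

```python
# Minimal PySpark sketch of re-implementing a SQL aggregation as DataFrame
# transformations and storing the result as Snappy-compressed Parquet.
# Paths, table names and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-to-pyspark").getOrCreate()

# Raw data previously landed in HDFS with Sqoop (hypothetical layout).
orders = spark.read.parquet("hdfs:///data/raw/orders")

# Equivalent of: SELECT customer_id, COUNT(*) AS orders, SUM(total) AS spend
#                FROM orders GROUP BY customer_id
summary = (orders
           .groupBy("customer_id")
           .agg(F.count(F.lit(1)).alias("orders"),
                F.sum("total").alias("spend")))

# Columnar output; Snappy keeps the files compact and splittable.
(summary.write
        .mode("overwrite")
        .option("compression", "snappy")
        .parquet("hdfs:///data/marts/customer_summary"))

spark.stop()
```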
Environment: Hortonworks (HDP 2.2), HDFS, Batch Processing, MapReduce, Apache Cassandra, Apache Solr, YARN, Spark, Scala, Hive, Pig, Flume, Sqoop, Chef, Puppet, Python, Oozie, ZooKeeper, Ambari, Oracle Database, MySQL, HBase, Spark SQL, AWS Redshift, Avro, Parquet, RCFile, JSON, UDF, Java (JDK 1.7), Multi-Threading, Performance Tuning, CentOS
Confidential, Malvern, PA
Big Data Engineer
Responsibilities:
- Worked with the business users to gather, define business requirements and analyze the possible technical solutions.
- Hadoop installation and configuration of multiple nodes on the Amazon EMR platform.
- Setup and optimize Standalone-System/Pseudo-Distributed/Distributed Clusters.
- Developed simple to complex MapReduce streaming jobs (see the Hadoop Streaming sketch after this list).
- Analyzing data with Hive, Pig and Hadoop Streaming.
- Built, tuned and maintained HiveQL and Pig scripts for reporting purposes.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Processed data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) using Amazon EMR.
- Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
- Stored the data in an Apache Cassandra Cluster
- Used Impala to query the Hadoop data stored in HDFS.
- Manage and review Hadoop log files.
- Support/Troubleshoot Map/Reduce programs running on the cluster
- Load data from Linux file system into HDFS.
- Install and configure Hive and write Hive UDFs.
- Create tables, load data, and write queries in Hive.
- Develop scripts to automate routine DBA tasks using Linux Shell Scripts, Python
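A minimal sketch of a Hadoop Streaming job of the kind described above, with the mapper and reducer written as small Python filters over stdin; the word-count logic, paths and job options are illustrative, not a specific production job.

```python
# Sketch of a Hadoop Streaming word count; in practice mapper() and reducer()
# would live in separate mapper.py / reducer.py scripts submitted with the
# streaming jar, roughly (paths and jar name are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input /data/logs -output /data/wordcount \
#       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys

def mapper(stream=sys.stdin):
    """Emit one '<word>\t1' line per token read from the input stream."""
    for line in stream:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(stream=sys.stdin):
    """Sum the counts per word; streaming hands the reducer sorted keys."""
    current_word, current_count = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
            continue
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")
```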
Environment: Hadoop 0.20.2 - Pig, Hive, Java, AWS, AWS EMR, Cloudera Manager, 30-node cluster with Linux (Ubuntu)