Hadoop Developer Resume
Deerfield, IL
SUMMARY
- Overall 6 years of IT experience across a variety of industries, including hands-on experience as a Hadoop developer.
- Expertise with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, MapReduce, Sqoop, Flume, Spark, HBase, YARN, Oozie, and Zookeeper.
- Hands-on experience in machine learning, big data, data visualization, R and Python development, Linux, SQL, and Git/GitHub.
- Excellent knowledge of Hadoop ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Strong experience in writing applications using Python, Scala, and MySQL.
- Hands on experience on configuring a Hadoop cluster in a professional environment and on Amazon Web Services (AWS) using an EC2 instance.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Strong experience with Hadoop distributions such as Cloudera, MapR, and Hortonworks.
- Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases like HBase.
- Experienced in writing complex MapReduce programs that work with different file formats such as Text, SequenceFile, XML, Parquet, and Avro.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Experienced in python data manipulation for loading and extraction as well as with python libraries such as NumPy.
- Experience in migrating data using Sqoop from HDFS to relational database systems and vice versa.
- Extensive experience importing and exporting data using streaming data-ingestion tools like Flume.
- Very good experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications.
- Successfully migrated multiple Scala and PySpark applications from old clusters to new LCM clusters.
- Excellent Java development skills using J2EE, J2SE, and web services.
- Extensive working experience with Python, including scikit-learn, SciPy, pandas, and NumPy, for developing machine learning models and manipulating and handling data.
- Strong experience in Object-Oriented Design, Analysis, Development, Testing and Maintenance.
- Excellent implementation knowledge of Enterprise/Web/Client Server using Java, J2EE.
- Worked in large and small teams for systems requirement, design & development.
- Preparation of Standard Code guidelines, analysis and testing documentations.
- Extracted data from HDFS using Hive and Presto; performed data analysis and feature selection using Spark with Scala and PySpark, and created nonparametric models in Spark.
- Experience working with Hadoop in standalone, pseudo-distributed, and fully distributed modes.
- Good knowledge of cloud computing with Amazon Web Services such as EC2 and S3, which provide fast and efficient processing of big data.
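The MapReduce experience above follows the standard map → shuffle/sort → reduce pattern. A minimal, self-contained Python sketch in the Hadoop Streaming style (the word-count task, sample input, and all names are illustrative, not from a specific project):

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit (word, 1) pairs, as a Hadoop Streaming mapper
    # would write tab-separated key/value lines to stdout.
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(sorted_pairs):
    # Reduce phase: sum counts per word. Hadoop's shuffle/sort delivers
    # pairs grouped by key; sorting emulates that guarantee locally.
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(n for _, n in group))

if __name__ == "__main__":
    sample = ["big data big wins", "data pipelines"]
    result = dict(reducer(sorted(mapper(sample))))
    print(result)  # {'big': 2, 'data': 2, 'pipelines': 1, 'wins': 1}
```

On a real cluster the same two functions run as separate mapper and reducer scripts under `hadoop jar hadoop-streaming.jar`, with HDFS providing the input and the framework providing the sort.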
TECHNICAL SKILLS
Big Data Tools: Hadoop, HDFS, Sqoop, HBase, Hive, Spark, Kafka, Airflow, PySpark
Cloud Technologies: Snowflake, SnowSQL, AWS, Azure, Databricks
ETL Tools: SSIS, Talend
Modeling and Architecture Tools: Erwin, ER Studio, Star-Schema and Snowflake-Schema Modeling, FACT and dimension tables, Pivot Tables
Database: Snowflake Cloud Database, Oracle, MS SQL Server, Teradata, MySQL, DB2
Operating Systems: Microsoft Windows and Unix
Reporting Tools: MS Excel, Tableau, Tableau server, Tableau Reader, Power BI, QlikView
Methodologies: Agile, UML, System Development Life Cycle (SDLC), Ralph Kimball, Waterfall Model
Machine Learning: Regression Models, Classification Models, Clustering, Linear Regression, Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, K-Nearest Neighbors (KNN), K-Means, Naïve Bayes, Time Series Analysis, PCA, Avro, MLbase
Python and R Libraries: R - tidyr, tidyverse, dplyr, lubridate, ggplot2, tseries; Python - NumPy, SciPy, matplotlib, seaborn, pandas, scikit-learn
Programming Languages: SQL, R (shiny, R-studio), Python (Jupyter Notebook, PyCharm IDE)
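One of the simplest techniques in the machine-learning list above, simple linear regression, reduces to a closed-form least-squares fit. A self-contained plain-Python sketch (the data points are made up for illustration):

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b: slope is the covariance
    # of x and y divided by the variance of x; intercept follows from
    # the means. This is the closed form behind simple linear regression.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 4.1, 6.1, 8.1]  # roughly y = 2x + 0.1
    a, b = fit_line(xs, ys)
    print(a, b)  # a ≈ 2.0, b ≈ 0.1
```

Libraries such as scikit-learn (`LinearRegression`) or R's `lm()` compute the same coefficients, with the multivariate generalization handled by linear algebra.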
PROFESSIONAL EXPERIENCE
Confidential, Deerfield, IL
Hadoop Developer
Responsibilities:
- Worked on analyzing Hadoop cluster using different big data analytic tools including Pig, Hive and MapReduce.
- Managed a fully distributed Hadoop cluster as an additional responsibility; trained to take over Hadoop Administrator duties, including cluster management, upgrades, and installation of Hadoop ecosystem tools.
- Used the PySpark DataFrame approach to create the Cal, Dapply, and Payfone reporting tables.
- Installed and configured Zookeeper to coordinate and monitor cluster resources.
- Implemented test scripts to support test driven development and continuous integration.
- Worked on POCs with Apache Spark using Scala to implement Spark in the project.
- Consumed data from Kafka using Apache Spark.
- Configured Linux native device mappers (MPIO) and EMC PowerPath for RHEL 5.5, 5.6, and 5.7.
- Load and transform large sets of structured, semi structured and unstructured data.
- Involved in loading data from the Linux file system to HDFS.
- Imported and exported data into HDFS and Hive using Sqoop.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Provided daily production support, monitoring and troubleshooting Hadoop/Hive jobs.
- Created HBase tables to load large sets of semi-structured data coming from various sources.
- Worked with structured/semi-structured data ingestion and processing on AWS using S3 and Python; migrated on-premises big data workloads to AWS Snowflake/Redshift and Azure Databricks.
- Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregate Functions (UDAFs) written in Python.
- Experienced in running Hadoop Streaming jobs to process terabytes of XML-format data.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Responsible for loading data files from various external sources, such as MySQL, into the staging area in MySQL databases.
- Executed Hive queries on Parquet tables stored in Hive to perform data analysis and meet business requirements.
- Developed real-time data processing applications using Scala and Python, and implemented Apache Spark Streaming from streaming sources such as Kafka.
- Successfully migrated multiple Scala and PySpark applications from old clusters to new LCM clusters.
- Actively involved in code review and bug fixing for improving the performance.
- Good experience handling data manipulation using Python scripts.
- Involved in developing, building, testing, and deploying applications to the Hadoop cluster in distributed mode.
- Created Linux shell scripts to automate the daily ingestion of IVR data.
- Used Python and R scripting to implement machine learning algorithms for prediction and forecasting.
- Created HBase tables to store various data formats of incoming data from different portfolios.
- Actively involved in a proof of concept for a Hadoop cluster in AWS; used EC2 instances, EBS volumes, and S3 to configure the cluster.
- Involved in migrating on-premises data to AWS.
- Used Hive and created Hive tables, loaded data from Local file system to HDFS.
- Production experience in large environments using configuration management tools like Chef and Puppet, supporting a Chef environment with 250+ servers and developing manifests.
- Created EC2 instances and implemented large multi-node Hadoop clusters in the AWS cloud from scratch using automation tools such as Terraform.
Environment: Hadoop, HDFS, Pig, Apache Hive, Sqoop, Apache Spark, Shell Scripting, HBase, Python, Zookeeper, MySQL.
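The Python UDF work for Hive listed above is typically wired up through Hive's TRANSFORM clause, which pipes each row of a SELECT to a script over stdin/stdout. A minimal sketch, assuming a hypothetical `user_id`/`amount` schema and an invented "tier" derived column:

```python
import sys

def transform_row(line):
    # Hypothetical input schema: user_id \t amount.
    # Emits user_id \t amount \t tier - the kind of derived column
    # a Hive TRANSFORM script would add to each row.
    user_id, amount = line.rstrip("\n").split("\t")
    tier = "high" if float(amount) >= 100.0 else "low"
    return "\t".join([user_id, amount, tier])

def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive streams rows to this script, e.g.:
    #   SELECT TRANSFORM(user_id, amount) USING 'python udf.py'
    #   AS (user_id, amount, tier) FROM orders;
    for line in stdin:
        if line.strip():
            stdout.write(transform_row(line) + "\n")

if __name__ == "__main__":
    import io
    # Local demo with canned rows instead of a live Hive pipe.
    main(stdin=io.StringIO("u42\t150.00\nu7\t19.99\n"))
```

The same stdin/stdout contract applies to Pig's STREAM operator, which is why one Python script can often serve both engines.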
Confidential, TX
Hadoop Developer
Responsibilities:
- Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Configured Sqoop Jobs to import data from RDBMS into HDFS using Oozie workflows.
- Involved in creating Hive internal and external tables, loading data, and writing Hive queries, which run internally as MapReduce jobs.
- Used the PySpark DataFrame approach to create the Cal, Dapply, and Payfone reporting tables.
- Created batch analysis job prototypes using Hadoop, Pig, Oozie and Hive.
- Assisted with data capacity planning and node forecasting.
- Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and shell scripts).
- Involved in analyzing system failures, identifying root causes, and recommending courses of action.
- Documented system processes and procedures for future reference.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
- Used Python and R scripting to implement machine learning algorithms for prediction and forecasting.
- Built these applications using the Spark Scala API and the PySpark API.
- Performed CRUD operations in HBase.
- Developed Hive queries to process the data.
- Monitored and performance-tuned Hadoop clusters, screened Hadoop cluster job performance, handled capacity planning, monitored cluster connectivity and security, and managed and reviewed Hadoop log files.
- Load and transform large sets of structured, semi structured and unstructured data.
- Developed data access logic for storing, retrieving, and acting on housed data.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Hands-on experience provisioning and managing multi-node Hadoop clusters on the Amazon Web Services (AWS) public cloud (EC2) and on private cloud infrastructure.
Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Sqoop, Oozie, HBase, Linux, Cluster Management
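The S3-to-Spark RDD work above chains transformations such as map, filter, and reduceByKey before an action materializes the result. A rough local sketch in plain Python (no cluster; the clickstream records and field names are invented for illustration, with reduceByKey emulated via sort-and-group):

```python
from itertools import groupby
from operator import itemgetter

def reduce_by_key(pairs, fn):
    # Local stand-in for Spark's RDD.reduceByKey: group (key, value)
    # pairs by key after sorting, then fold each group's values with fn.
    out = []
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        values = [v for _, v in group]
        acc = values[0]
        for v in values[1:]:
            acc = fn(acc, v)
        out.append((key, acc))
    return out

# Hypothetical "page,latency_ms" records, standing in for lines read from S3.
records = ["home,200", "cart,500", "home,100", "checkout,900"]

# map -> filter -> reduceByKey, mirroring the RDD transformation chain.
pairs = [(page, int(ms)) for page, ms in (r.split(",") for r in records)]
slow = [(page, ms) for page, ms in pairs if ms >= 100]
totals = reduce_by_key(slow, lambda a, b: a + b)
print(totals)  # [('cart', 500), ('checkout', 900), ('home', 300)]
```

In actual PySpark the same pipeline would be `sc.textFile("s3://...").map(parse).filter(pred).reduceByKey(add)`, with the shuffle doing the grouping that `sorted` + `groupby` fake here.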