
Big Data Hadoop Developer Resume


St. Louis, Missouri

SUMMARY

  • Over 10 years of IT experience as a Developer, Designer & QA Engineer with cross-platform integration experience using the Hadoop ecosystem, Java, and functional automation
  • Hands-on experience in installing, configuring, and architecting Hadoop and Hortonworks clusters and services - HDFS, MapReduce, YARN, Pig, Hive, Oozie, Flume, HBase, Spark, and Sqoop
  • Responsible for writing MapReduce programs
  • Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate data transfer from HBase
  • Experienced in developing Java UDFs for Hive and Pig
  • Experienced in NoSQL DBs like HBase, MongoDB and Cassandra and wrote advanced queries and sub-queries
  • Scheduled all Hadoop/Hive/Sqoop/HBase jobs using Oozie
  • Set up clusters in Amazon EC2 and S3, including automating the setup and extension of clusters in AWS
  • Practiced Agile Scrum methodology, contributed to TDD, CI/CD and all aspects of the SDLC
  • Experienced in defining detailed application software test plans, including organization, participants, schedule, and test and application coverage scope
  • Gathered and defined functional and UI requirements for software applications
  • Experienced in real time analytics with Apache Spark RDD, Data Frames and Streaming API
  • Used Spark Data Frames API over Cloudera platform to perform analytics on Hive data
  • Experienced in integrating Hadoop with Kafka and in uploading clickstream data to HDFS.
  • Expert in utilizing Kafka as a messaging and publish-subscribe system.
  • Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying them on public or private clouds.
  • Installed and configured the OpenShift platform for managing Docker containers and Kubernetes clusters.
  • Practiced DevOps for microservices using Kubernetes as the orchestrator.
  • Created templates and wrote Bash, Ruby, Python, and PowerShell scripts to automate tasks.
  • Good knowledge of and hands-on experience with monitoring tools like Splunk and Nagios.
  • Knowledge of protocols and services such as FTP, SSH, HTTP, HTTPS, TCP/IP, DNS, VPNs, and firewall groups.
  • Completed application builds for web applications, web services, Windows services, console applications, and client GUI applications.
  • Experienced in troubleshooting and automated deployment to web and application servers like WebSphere, WebLogic, JBoss, and Tomcat.
  • Experienced in integrating deployments with multiple build systems and providing an application model that handles multiple projects.
  • Hands-on experience integrating REST APIs with the cloud environment to access resources.
  • Developed PySpark programs, created DataFrames, and worked on transformations.
  • Worked on data processing, transformations, and actions in Spark using Python (PySpark).
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Experienced in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (a brief sketch follows this summary).
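
As an illustration of the Spark SQL work summarized above, the following is a minimal PySpark sketch of an extract-transform-aggregate job; the bucket paths, table layout, and column names (customer_id, event_ts, duration_sec) are hypothetical placeholders rather than details from an actual engagement.

```python
# Minimal PySpark sketch of a Spark-SQL extract/transform/aggregate job.
# All paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-pattern-etl").getOrCreate()

# Extract: read two hypothetical source formats
events = spark.read.json("s3a://example-bucket/raw/usage_events/")          # JSON feed
customers = spark.read.parquet("s3a://example-bucket/curated/customers/")   # Parquet dimension

# Transform: join and aggregate usage per customer per day
daily_usage = (
    events.join(customers, "customer_id")
          .withColumn("event_date", F.to_date("event_ts"))
          .groupBy("customer_id", "event_date")
          .agg(F.count("*").alias("event_count"),
               F.sum("duration_sec").alias("total_duration_sec"))
)

# Load: write the aggregate back out as Parquet for downstream analysis
daily_usage.write.mode("overwrite").parquet("s3a://example-bucket/analytics/daily_usage/")
```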

TECHNICAL SKILLS

Hadoop/Big Data: Hadoop, MapReduce, HDFS, Zookeeper, Kafka, Hive, Pig, Sqoop, Oozie, Flume, YARN, HBase, Spark with Scala

NoSQL Databases: HBase, Cassandra, MongoDB

Languages: Java, Python, Scala, PySpark, UNIX shell scripts

Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL

Frameworks: Spring, Hibernate

Operating Systems: Red Hat Linux, Ubuntu Linux and Windows XP/Vista/7/8

Web/Application servers: Apache Tomcat, WebLogic, JBoss

Databases: SQL Server, MySQL

IDE: Eclipse, IntelliJ

PROFESSIONAL EXPERIENCE

BIG DATA HADOOP DEVELOPER

Confidential, St Louis, Missouri

Responsibilities:

  • Preparing Design Documents (Request-Response Mapping Documents, Hive Mapping Documents).
  • Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
  • Experienced in implementing Spark RDD transformations and actions to implement business analysis
  • Migrated HiveQL queries on structured data into Spark SQL to improve performance
  • Documented the data flow from the application through Kafka and Storm to HDFS and Hive tables
  • Configured, deployed, and maintained a single-node Storm cluster in the DEV environment
  • Developed predictive analytics using Apache Spark Scala APIs
  • Developed solutions to pre-process large sets of structured and semi-structured data with different file formats (text, Avro, SequenceFile, XML, JSON, ORC, and Parquet).
  • Handled importing of data from RDBMS into HDFS using Sqoop.
  • Collected JSON data from an HTTP source and developed Spark APIs that perform inserts and updates in Hive tables.
  • Developed Spark scripts to import large files from Amazon S3 buckets.
  • Developed Spark core and Spark SQL scripts using Scala for faster data processing.
  • Experienced in data cleansing processing using Pig Latin operations and UDFs.
  • Experienced in writing Hive Scripts for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Involved in creating Hive tables, loading with data and writing hive queries to process the data.
  • Load the data into Spark RDD and performed in-memory data computation to generate the output response.
  • Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
  • Created scripts to automate the process of Data Ingestion.
  • Experienced in using big data testing frameworks such as MRUnit and PigUnit for testing raw data, and executed performance scripts.
  • Extracted, transformed, and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark.
  • Integrated visualizations into a Spark application using Databricks and popular visualization libraries (ggplot, matplotlib).
  • Implemented discretization and binning, data wrangling, cleaning, transforming, merging, and reshaping data frames using Python.
  • Wrote an MR2 batch job to fetch required data from the database and store it in a static CSV file.
  • Developed a Spark job to process files from Vision EMS and AMN Cache, identify violations, and send them to Smarts as SNMP traps.
  • Automated workflows using shell scripts to schedule Spark jobs via crontab.
  • Developed data pipeline using Flume, Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Experience in deploying data from various sources into HDFS and building reports using Tableau.
  • Developed a data pipeline using Kafka and Storm to store data into HDFS.
  • Developed REST APIs using Scala and Play framework to retrieve processed data from Cassandra database.
  • Performed real time analysis on the incoming data.
  • Re-engineered n-tiered architecture involving technologies like EJB, XML and JAVA into distributed applications.
  • Loading data into HBase using Bulk Load and Non-bulk load.
  • Created HBase column families to store various data types coming from various sources.
  • Loaded data into the cluster from dynamically generated files
  • Assisted in upgrading, configuration, and maintenance of various Hadoop infrastructures
  • Created common audit and error logging processes and a job monitoring and reporting mechanism
  • Troubleshot performance issues with ETL/SQL tuning.
  • Developed and maintained the continuous integration and deployment systems using Jenkins
  • Created PySpark data frames to bring data from DB2 to Amazon S3 (see the sketch after this list).
  • Optimized the PySpark jobs to run on a Kubernetes cluster for faster data processing
  • Worked on reading and writing multiple data formats like JSON, ORC, and Parquet on HDFS using PySpark.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Built S3 buckets, managed policies for S3 buckets, and used S3 buckets and Glacier for storage and backup on AWS.
  • Worked with other teams to help develop the Puppet infrastructure to conform to various requirements.
  • Performed troubleshooting and monitoring of the Linux servers on AWS using Nagios and New Relic
  • Managed and administered AWS services: CLI, EC2, VPC, S3, ELB, Glacier, Route 53, CloudTrail, IAM, and Trusted Advisor.
  • Created automated pipelines in AWS CodePipeline to deploy Docker containers in AWS ECS using services like CloudFormation, CodeBuild, CodeDeploy, S3, and Puppet.
  • Worked on JIRA for defect/issue logging and tracking and documented all work using Confluence.
  • Integrated services like GitHub, AWS CodePipeline, Jenkins, and AWS Elastic Beanstalk to create a deployment pipeline.
  • Strong exposure to automating maintenance tasks in the big data environment through the Cloudera Manager API.
  • Good knowledge of Oracle 9i, 10g, and 11g databases and excellent at writing SQL queries and scripts.
  • Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources
  • Experienced in building S3 buckets, managing policies for S3 buckets, and using S3 buckets and Glacier for storage and backup on AWS.
  • Able to lead a team of developers and coordinate smooth delivery of the project.
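
The DB2-to-S3 transfer called out in the list above can be sketched roughly as below; the JDBC URL, credentials, table, partition column, and bucket names are placeholders, and the IBM DB2 JDBC driver is assumed to be available on the Spark classpath.

```python
# Hedged sketch of a DB2 -> S3 PySpark transfer; all connection details are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-to-s3").getOrCreate()

orders = (
    spark.read.format("jdbc")
         .option("url", "jdbc:db2://db2-host:50000/SALESDB")   # placeholder host/database
         .option("driver", "com.ibm.db2.jcc.DB2Driver")        # DB2 driver assumed on classpath
         .option("dbtable", "SALES.ORDERS")                    # placeholder table
         .option("user", "etl_user")
         .option("password", "********")
         .load()
)

# Land the extract on S3 as Parquet, partitioned by a hypothetical date column
(orders.write.mode("append")
       .partitionBy("ORDER_DATE")
       .parquet("s3a://example-bucket/landing/orders/"))
```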

Environment: HDFS, Apache Spark, Kafka, Cassandra, Hive, Scala, Java, Sqoop, shell scripting, AWS

BIG DATA HADOOP DEVELOPER

Confidential, Stamford, CT

Responsibilities:

  • Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux
  • Designed and developed applications on the data lake to transform the data according to business users' requirements for analytics.
  • Developed shell scripts to perform data quality validations such as record counts, file name consistency, and duplicate files, and for creating tables and views.
  • Created views that mask the PHI columns of a table so that PHI data in the view cannot be seen by unauthorized teams.
  • Worked with the Parquet file format to get better storage and performance for publish tables.
  • Developed analytical components using Scala, Spark, Apache Mesos, and Spark Streaming.
  • Experience in using the Docker container system with the Kubernetes integration
  • Developed a Web Application using Java with the Google Web Toolkit API with PostgreSQL
  • Used R for prototype on a sample data exploration to identify the best algorithmic approach and then wrote Scala scripts using spark machine learning module.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, Caffe, TensorFlow, MLlib, Python, and a broad variety of machine learning methods including classification, regression, dimensionality reduction, etc.
  • Excellent understanding/knowledge of Hadoop architecture and various components such as HDFS, HBase, Hive, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm
  • Built a Kafka-Spark-Cassandra Scala simulator and prototypes for Met stream, a big data consultancy.
  • Designed a data analysis pipeline in Python, using Amazon Web Services such as S3, EC2 and Elastic Map Reduce
  • Expert in implementing advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • The system is a Python- and Scala-based analytics platform with ML libraries.
  • Worked with NoSQL platforms and gained an extensive understanding of relational databases versus NoSQL platforms; created and worked on large data frames with schemas of more than 300 columns.
  • Ingestion of data into Amazon S3 using Sqoop and apply data transformations using Python scripts.
  • Creating Hive tables, loading and analyzing data using hive scripts. Implemented Partitioning, Dynamic Partitions in HIVE.
  • Deployed and analyzed large chunks of data using HIVE as well as HBase.
  • Worked on querying data using SparkSQL on top of pyspark engine.
  • Used Amazon EMR to perform the Pyspark Jobs on the Cloud.
  • Created Hive tables to store various data formats of PII data coming from the raw hive tables.
  • Developed Sqoop jobs to import/export data from RDBMS to S3 data store.
  • Designed and implemented PySpark UDFs for evaluating, filtering, loading, and storing data (a brief sketch follows this list).
  • Fine-tuned PySpark applications/jobs to improve efficiency and overall processing time for the pipelines.
  • Knowledge of writing Hive queries and running scripts in Tez mode to improve performance on the Hortonworks Data Platform.
  • Used a microservices architecture, with Spring Boot based services interacting through REST.
  • Built Spring Boot microservices for the delivery of software products across the enterprise
  • Created the ALB, ELBs and EC2 instances to deploy the applications into cloud environment.
  • Providing service discovery for all microservices using Spring Cloud Kubernetes project
  • Developed fully functional responsive modules based on Business Requirements using Scala.
  • Worked in building servers like DHCP, PXE with kick-start, DNS and NFS and used them in building infrastructure in a Linux Environment. Automated build, testing and integration with Ant, Maven and JUnit.
  • Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.
  • Built pipelines to move hashed and un-hashed data from Azure Blob to the Data Lake.
  • Utilized Azure HDInsight to monitor and manage the Hadoop Cluster.
  • Collaborated on insights with Data Scientists, Business Analysts and Partners.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark with Python.
  • Created pipelines to move data from on-premises servers to Azure Data Lake
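
The PySpark UDF work referenced above might look roughly like the following sketch, which combines a simple masking UDF with the PHI-masked view idea; the table, column names, and masking rule are assumptions for illustration only.

```python
# Illustrative-only PySpark UDF sketch; tables, columns, and masking logic are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("phi-masking").enableHiveSupport().getOrCreate()

def mask_phi(value):
    """Keep only the last four characters of a PHI value; mask the rest."""
    if value is None:
        return None
    return "*" * max(len(value) - 4, 0) + value[-4:]

mask_phi_udf = F.udf(mask_phi, StringType())   # register the Python function as a Spark UDF

# Hypothetical publish table with PHI columns
patients = spark.table("curated.patients")
masked = (patients
          .withColumn("ssn", mask_phi_udf("ssn"))
          .withColumn("phone", mask_phi_udf("phone")))

# Expose only the masked result to downstream consumers
masked.createOrReplaceTempView("patients_masked_v")
```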

Environment: Apache Hive, HBase, Spark, Azure, Python, Agile, StreamSets, Bitbucket, Cloudera, Shell Scripting, Amazon EMR, Amazon S3, PyCharm, Jenkins, Scala, Java.

BIG DATA HADOOP DEVELOPER/ARCHITECT

Confidential, New York, NY

Responsibilities:

  • Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux
  • Experienced in Spark Streaming, creating RDDs, and applying transformations and actions
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Used Spark and Spark-SQL to read the parquet data and create the tables in hive using the Scala API
  • Developed Spark code using Scala and Spark-SQL for faster processing and testing.
  • Implemented Spark programs using PySpark and analyzed the SQL scripts and designed the solutions
  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Created multi-node Hadoop and Spark clusters in AWS instances to generate terabytes of data and stored it in AWS HDFS.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
  • Developed data pipeline using Flume, Sqoop, Pig and Java map reduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing the cluster, commissioning and decommissioning of Data Nodes, Name Node recovery, capacity planning, and slots configuration.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported the data from different sources like HDFS/HBase into SparkRDD.
  • Involved in migration of ETL processes from Oracle to Hive to test the easy data manipulation and worked on importing and exporting data from Oracle and DB2 into HDFS and HIVE using Sqoop.
  • Worked on installing Cloudera Manager and CDH, installed the JCE policy file, created a Kerberos principal for the Cloudera Manager Server, and enabled Kerberos using the wizard.
  • Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
  • Monitored the cluster for performance, networking, and data integrity issues, and was responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
  • Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
  • Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
  • Supported MapReduce Programs and distributed applications running on the Hadoop cluster and scripting Hadoop package installation and configuration to support fully automated deployments.
  • Migrated existing on-premises application to AWS and used AWS services like EC2 and S3 for large data sets processing and storage and worked with ELASTIC MAPREDUCE and setup Hadoop environment in AWS EC2 Instances.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Perform maintenance, monitoring, deployments, and upgrades across infrastructure that supports all our Hadoop clusters and worked on Hive for further analysis and for generating transforming files from different analytical formats to text files.
  • Created Hive external tables, loaded data into them, and queried the data using HQL (see the sketch after this list); worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
  • Monitored the Hadoop cluster and maintained it by adding and removing nodes, using tools such as Nagios, Ganglia, and Cloudera Manager.
  • Worked on Hive to expose data for further analysis and to generate and transform files from different analytical formats to text files.
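
A minimal sketch of the Hive external table pattern mentioned above, driven through spark.sql; the database, schema, HDFS location, and partition value are placeholders.

```python
# Sketch of creating and querying a Hive external table; names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-external-table").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_logs (
        user_id  STRING,
        url      STRING,
        event_ts TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/landing/web_logs'
""")

# Register a newly landed partition, then query it with ordinary HQL
spark.sql("ALTER TABLE analytics.web_logs ADD IF NOT EXISTS PARTITION (event_date='2016-01-01')")
spark.sql("""
    SELECT event_date, COUNT(*) AS hits
    FROM analytics.web_logs
    GROUP BY event_date
""").show()
```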

Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Python, Spark, Spark Streaming, Spark SQL, AWS EMR, AWS S3, AWS Redshift, Scala, PySpark, MapR, Java, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Cloudera Manager, Zookeeper, Cloudera, Oracle, Kerberos and RedHat 6.5

BIG DATA HADOOP DEVELOPER/ARCHITECT

Confidential, New York, NY

Responsibilities:

  • Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux
  • Experienced in Spark Streaming, creating RDDs, and applying transformations and actions
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Used Spark and Spark-SQL to read the parquet data and create the tables in hive using the Scala API
  • Developed Spark code using Scala and Spark-SQL for faster processing and testing.
  • Implemented Spark programs using PySpark and analyzed the SQL scripts and designed the solutions
  • Loaded data pipelines from web servers and Teradata using Sqoop with Kafka and Spark Streaming API.
  • Developed Kafka pub-sub and Cassandra clients, along with Spark components on HDFS and Hive
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka.
  • Experienced with different scripting languages such as Python and shell scripts.
  • Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis.
  • Tested Apache TEZ, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Designed and implemented Incremental Imports into Hive tables and writing Hive queries to run on TEZ
  • Wrote shell scripts that run multiple Hive jobs, automating incremental loads of different Hive tables that are used to generate reports in Tableau for business use.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala, and Python.
  • Worked on Spark SQL, created Data frames by loading data from Hive tables and created prep data and stored in AWS S3.
  • Used Spark Streaming APIs to perform on-the-fly transformations and actions to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (a brief sketch follows this list).
  • Involved in creating custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HQL (HiveQL)
  • Extensively worked on Text, ORC, Avro and Parquet file formats and compression techniques like Snappy, Gzip and Zlib.
  • Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
  • Created, managed, and utilized policies for S3 buckets and Glacier for storage and backup on AWS.
  • Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded into HDFS using Java and Talend.
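
The Kafka-to-Cassandra streaming flow described above can be sketched as follows. The original work used the older DStream-based Spark Streaming API; this version is written against Structured Streaming and assumes the DataStax Spark Cassandra Connector is on the classpath. Brokers, topic, schema, keyspace, and table names are all placeholders.

```python
# Hedged sketch: Kafka -> Structured Streaming -> Cassandra via foreachBatch.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("learner-events").getOrCreate()

# Hypothetical schema for learner events arriving as JSON on Kafka
event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
            .option("subscribe", "learner-events")               # placeholder topic
            .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Persist each micro-batch into a placeholder Cassandra keyspace/table
    (batch_df.write.format("org.apache.spark.sql.cassandra")
             .options(keyspace="learning", table="learner_events")
             .mode("append")
             .save())

query = (events.writeStream
               .foreachBatch(write_to_cassandra)
               .option("checkpointLocation", "hdfs:///checkpoints/learner-events")  # placeholder path
               .start())
query.awaitTermination()
```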

Environment: Hadoop 2.0, YARN Resource Manager, SQL, Python, Kafka, Hive, Sqoop 1.4.6, Qlik Sense, Tableau, Oozie, Jenkins, Linux, Scala 2.12, Spark 2.4.3.

BIG DATA HADOOP DEVELOPER

Confidential, New York, NY

Responsibilities:

  • Worked on installing Kafka on Virtual Machine and created topics for different users
  • Installed Zookeepers, brokers, schema registry, control Center on multiple machines.
  • Developed SSL security layers, set up ACL/SSL security for users, and assigned multiple topics
  • Worked on Hadoop cluster and data querying tools Hive to store and retrieve data.
  • Involved in the complete Software Development Life Cycle (SDLC) while developing applications.
  • Reviewed and managed Hadoop log files by consolidating logs from multiple machines using Flume.
  • Developed Oozie workflow for scheduling ETL process and Hive Scripts.
  • Started using Apache NiFi to copy data from the local file system to HDFS.
  • Worked with teams to analyze anomaly detection and data ratings.
  • Implemented custom input format and record reader to read XML input efficiently using SAX parser.
  • Involved in writing queries in SparkSQL using Scala. Worked with Splunk to analyze and visualize data.
  • Integrated Cassandra to provide metadata resolution for network entities on the network
  • Experienced in Spark RDD operations and optimized transformations and actions
  • Involved in working with Impala for data retrieval process.
  • Exported data from Impala to Tableau reporting tool, created dashboards on live connection.
  • Designed multiple Python packages that were used within a large ETL process used to load 2TB of data from an existing Oracle database into a new PostgreSQL cluster
  • Loaded data from Linux file system to HDFS and vice-versa
  • Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries and wrote the results back into OLTP systems through Sqoop (see the sketch after this list).
  • Leveraged ETL methods for ETL solutions and data warehouse tools for reporting and analysis
  • Used CSVExcelStorage to parse files with different delimiters in Pig.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
  • Developed multiple MapReduce jobs in java to clean datasets.
  • Developed code to write canonical model JSON records from numerous input sources to Kafka queues.
  • Performed streaming of data into Apache ignite by setting up cache for efficient data analysis.
  • Collected the logs data from web servers and integrated in to HDFS using Flume.
  • Developed UNIX shell scripts for creating the reports from Hive data.
  • Prepared Avro schema files for generating Hive tables, created the Hive tables, loaded data into them, and queried the data using HQL
  • Installed and Configured Hadoop cluster using AWS for POC purposes.
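
The aggregation-and-write-back pattern noted above can be sketched like this. In the project the export step went through Sqoop; to keep the example self-contained it writes back over Spark's JDBC writer instead, and the Hive table, PostgreSQL URL, and columns are placeholders.

```python
# Rough sketch: aggregate a Hive-landed table and push results to an OLTP store over JDBC.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aggregate-writeback").enableHiveSupport().getOrCreate()

# Aggregate raw transactions previously landed in a (placeholder) Hive table
daily_totals = (spark.table("staging.transactions")
                     .groupBy("account_id", F.to_date("txn_ts").alias("txn_date"))
                     .agg(F.sum("amount").alias("daily_amount"),
                          F.count("*").alias("txn_count")))

# Write the aggregates back to a (placeholder) OLTP PostgreSQL database
(daily_totals.write.format("jdbc")
             .option("url", "jdbc:postgresql://oltp-host:5432/reporting")
             .option("driver", "org.postgresql.Driver")
             .option("dbtable", "daily_account_totals")
             .option("user", "report_user")
             .option("password", "********")
             .mode("append")
             .save())
```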

Environment: Hadoop MapReduce 2 (YARN), NiFi, HDFS, PIG, Hive, Flume, Cassandra, Eclipse, Sqoop, Spark, Splunk, Maven, Cloudera, Linux shell scripting

BIG DATA HADOOP DEVELOPER

Confidential, New York, NY

Responsibilities:

  • Developed NiFi workflows to automate the data movement between different Hadoop systems.
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Imported large datasets from DB2 to Hive Table using Sqoop
  • Implemented Apache PIG scripts to load data from and to store data into Hive.
  • Partitioned and bucketed Hive tables and compressed data with Snappy to load data into Parquet Hive tables from Avro Hive tables (a brief sketch follows this list)
  • Involved in running all the Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
  • Responsible for implementing ETL process through Kafka-Spark-HBase Integration as per the requirements of customer facing API
  • Worked on Batch processing and real-time data processing on Spark Streaming using Lambda architecture
  • Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive
  • Utilized Spark Core, Spark Streaming and Spark SQL API for faster processing of data instead of using MapReduce in Java
  • Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, MapReduce, Pig, and Hive.
  • Fetched live stream data from DB2 to HBase table using Spark Streaming and Apache Kafka.
  • Loaded the data into Spark RDDs and performed in-memory computation to generate the output response.
  • Used Spark for interactive queries, processing of streaming data and integration with MongoDB
  • Wrote different pig scripts to clean up the ingested data and created partitions for the daily data.
  • Developed Spark programs with Scala to process the complex unstructured and structured data sets
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
  • Analyzed the SQL scripts and designed the solution to implement using Spark.
  • Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala.
  • Used Oozie workflows to coordinate Pig and Hive scripts
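
The Avro-to-Parquet reload with partitioning and Snappy compression mentioned above might be driven through spark.sql roughly as below (bucketing is omitted here for brevity); database, table, and column names are placeholders.

```python
# Sketch: reload an Avro staging table into a partitioned, Snappy-compressed Parquet Hive table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-to-parquet").enableHiveSupport().getOrCreate()

# Allow dynamic partitioning and request Snappy-compressed Parquet output
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET parquet.compression=SNAPPY")

spark.sql("""
    CREATE TABLE IF NOT EXISTS curated.orders_parquet (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")

# Reload the Avro-backed staging table into the compressed, partitioned Parquet table
spark.sql("""
    INSERT OVERWRITE TABLE curated.orders_parquet PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM staging.orders_avro
""")
```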

Environment: Hadoop, HDFS, Pig, Sqoop, Shell Scripting, Ubuntu, Linux Red Hat, Spark, Scala, Hortonworks, Cloudera Manager, Apache Yarn.

BIG DATA HADOOP DEVELOPER

Confidential, New York, NY

Responsibilities:

  • Responsible for implementation, administration and management of Hadoop infrastructures
  • Evaluated Hadoop infrastructure requirements and designed/deployed solutions (high availability, big data clusters); involved in cluster monitoring and troubleshooting Hadoop issues
  • Worked with application teams to install OSs and Hadoop updates, patches, version upgrades as required
  • Helped maintain and troubleshoot UNIX and Linux environment
  • Analyzed and evaluated system security threats and safeguards
  • Developed Pig program for loading and filtering the streaming data into HDFS using Flume.
  • Experienced in handling data from different datasets, join them and preprocess using Pig join operations.
  • Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
  • Developed different kind of custom filters and handled pre-defined filters on HBase data using API.
  • Imported and exported data from Teradata to HDFS and vice-versa.
  • Strong understanding of Hadoop eco system such as HDFS, MapReduce, HBase, Zookeeper, Pig, Hadoop streaming, Sqoop, Oozie and Hive
  • Implemented Secondary sorting to sort reducer output globally in map reduce.
  • Implemented data pipeline by chaining multiple mappers by using Chained Mapper.
  • Created Hive dynamic partitions to load time-series data (a brief sketch follows this list)
  • Handled different types of joins in Hive such as map joins, bucket map joins, and sorted bucket map joins.
  • Created tables, partitions, and buckets and performed analytics using Hive ad-hoc queries.
  • Experienced in importing/exporting data into HDFS/Hive from relational databases and Teradata using Sqoop.
  • Handled continuous streaming data from different sources using Flume and set destination as HDFS.
  • Integrated Spring schedulers with the Oozie client as beans to handle cron jobs.
  • Experience with CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters
  • Actively participated in software development lifecycle (scope, design, implement, deploy, test), including design and code reviews.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
  • Worked on the Spring framework for multi-threading.
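
A minimal sketch of the Hive dynamic-partition load for time-series data mentioned above. The original work ran these statements in Hive itself; they are wrapped in spark.sql here only to keep the example in Python, and the database, tables, and columns are placeholders.

```python
# Sketch: Hive dynamic-partition insert for time-series data, expressed through spark.sql.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-partitions").enableHiveSupport().getOrCreate()

# Enable dynamic partitioning so Hive derives the partition from the data itself
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
        sensor_id STRING,
        reading   DOUBLE,
        read_ts   TIMESTAMP
    )
    PARTITIONED BY (read_date STRING)
""")

# The read_date partition for each row is taken from the last column of the SELECT
spark.sql("""
    INSERT INTO TABLE metrics.sensor_readings PARTITION (read_date)
    SELECT sensor_id, reading, read_ts, to_date(read_ts) AS read_date
    FROM staging.sensor_readings_raw
""")
```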

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, RDBMS/DB, flat files, Teradata, MySQL, CSV, Avro data files, Java, J2EE.
