Hadoop Developer/ Data Engineer Resume
SUMMARY
- Overall 8+ years of IT experience across Java, SQL, ETL, and Big Data, including 5+ years of experience in Big Data, Hadoop, and NoSQL technologies in domains such as Insurance, Finance, and Health Care. Passionate about working in Big Data environments.
- Strong knowledge of the Hadoop architecture and the functioning of its components, such as HDFS, NameNode, DataNode, Job Tracker, Task Tracker, MapReduce, and Spark.
- Extensive experience in providing Big Data solutions using Hadoop 2.x, HDFS, MR2, YARN, Kafka, Pig, Hive, Sqoop, HBase, Cloudera Manager, Hortonworks, ZooKeeper, Oozie, and Hue.
- Experience in importing and exporting data with Sqoop between HDFS/Hive/HBase and relational database systems. Skilled in data migration and data generation in the Big Data ecosystem.
- Experienced in building highly scalable Big Data solutions on multiple Hadoop distributions (Cloudera, Hortonworks) and NoSQL platforms (HBase).
- Implemented Big Data batch processes using Hadoop, MapReduce, YARN, Pig, and Hive.
- Hands-on experience in in-memory data processing with Apache Spark using Scala and Python (see the sketch following this summary).
- Experience on the GCP platform with IAM roles, application migration, Cloud Storage, BigQuery, Dataflow, and Dataproc.
- Good understanding of developer tools and CI/CD practices.
- Worked with Spark on EMR clusters alongside other Hadoop applications, leveraging the EMR File System (EMRFS) to directly access data in Amazon S3.
- Responsible for interacting with clients to understand their business problems related to Big Data, Cloud Computing, and NoSQL technologies.
- Experienced in using Kafka as a distributed publisher-subscriber messaging system.
- Good experience in writing Pig scripts and Hive Queries for processing and analyzing large volumes of data.
- Developed Spark scripts using the Scala shell as per requirements.
- Experience in optimizing MapReduce jobs using Combiners and Partitioners to deliver the best results.
- Extended Hive and Pig core functionality by writing custom UDFs.
- Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and in coordinating tasks within the team.
- Experience with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
- Good knowledge of AWS concepts such as EMR and EC2 web services, which provide fast and efficient processing of Big Data.
- Experience in understanding Hadoop security requirements and integrating with Kerberos authentication and authorization infrastructure.
- Strong scripting skills in Python and UNIX shell.
- Experience in managing and reviewing Hadoop log files.
- Hands-on experience in application development using RDBMS and Linux shell scripting. Good working experience in Agile/Scrum methodologies, including technical discussions with clients and daily scrum calls covering project analysis, specifications, and development.
- Ability to work independently as well as in a team, and to communicate effectively with customers, peers, and management at all levels inside and outside the organization.
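A minimal, illustrative sketch of the kind of in-memory Spark processing in Scala referenced above; the input path and the customer_id/amount columns are hypothetical placeholders rather than project specifics:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object InMemoryProcessingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("InMemoryProcessingSketch")
          .getOrCreate()

        // Read raw CSV data, cache it in memory, and aggregate per customer.
        val events = spark.read.option("header", "true").csv("/data/raw/events.csv")
        events.cache()

        val totals = events
          .groupBy("customer_id")
          .agg(sum(col("amount").cast("double")).as("total_amount"))

        totals.show()
        spark.stop()
      }
    }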
TECHNICAL SKILLS
Big Data/Hadoop ecosystem: HDFS, MapReduce, YARN, Apache NiFi, Hive, Pig, HBase, Impala, ZooKeeper, Sqoop, Flume, Oozie, Spark, Apache Phoenix, Zeppelin, EMR
Programming Languages: C, C++, Java, Scala, SQL, PL/SQL, Python, Linux shell scripts.
Methodologies: Agile, Scrum, Waterfall
NoSQL Databases: HBase, Cassandra, MongoDB
Databases: Oracle, DB2, MS Azure
Tools Used: Eclipse, IntelliJ, Git, PuTTY, WinSCP
Operating systems: Windows, Unix, Linux, Ubuntu
PROFESSIONAL EXPERIENCE
Confidential
Hadoop Developer/ Data Engineer
Responsibilities:
- Implemented a data pipeline using Spark and Hive to ingest customer behavioral data into the Hadoop platform for user behavioral analytics.
- Developed Spark applications using Scala for ingesting data from one environment to another, along with test cases.
- Created Hive tables to load large sets of data after transformation of raw data.
- Enabled and automated data pipelines for moving over 25 GB of data from Oracle to Hadoop and Google BigQuery, using GitHub for source control and Jenkins.
- Created a BigQuery table by writing a Python program that checks a Linux directory for incoming XML files and uploads all new files to a Google Cloud Storage location before the data is parsed and loaded.
- Utilized Google BigQuery SQL and Amazon Athena to build and drive reporting.
- Created Google Dataflow pipelines for loading large public datasets into Google BigQuery.
- Implemented end-to-end tests between Dataflow and BigQuery.
- Optimized Hive tables using techniques such as partitioning and bucketing for better HQL query performance.
- Experience in guiding the classification, planning, implementation, growth, adoption of, and compliance with enterprise architecture strategies, processes, and standards.
- Designed and developed highly scalable and highly available systems.
- Worked with AWS services such as EC2, Lambda, SES, SNS, VPC, CloudFront, and CloudFormation.
- Demonstrated expertise in creating architecture blueprints and detailed documentation. Created bills of materials, including required cloud services (such as EC2 and S3) and tools.
- Hands-on experience with EC2, ECS, ELB, EBS, S3, VPC, IAM, SQS, RDS, Lambda, CloudWatch, Storage Gateway, CloudFormation, Elastic Beanstalk, and Auto Scaling.
- Created Hive external tables for semantic data, loaded the data into the tables, and queried them using HQL (see the sketch at the end of this section).
- Identified data sources, created source-to-target mappings, estimated storage, and provided support for setup and data partitioning.
- Developed workflows in Atomic to cleanse and transform raw data into useful information and load it into HDFS.
- Used the HDFS FileSystem API to connect to FTP servers and HDFS, and the AWS S3 SDK to connect to S3 buckets.
- Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) on EC2.
- Responsible for building and configuring a distributed data solution using the MapR distribution of Hadoop.
Environment: AWS, EC2, HDFS, Dataflow, and BigQuery.
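A minimal sketch of the external, partitioned Hive table pattern mentioned in this section, written in Scala against Spark's Hive support; the database, table, HDFS path, and column names are hypothetical:

    import org.apache.spark.sql.SparkSession

    object HiveExternalTableSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HiveExternalTableSketch")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS semantic")

        // External table over semantic data already landed in HDFS;
        // partitioning by load_date keeps HQL scans narrow.
        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS semantic.customer_events (
            |  customer_id STRING,
            |  event_type  STRING,
            |  amount      DOUBLE
            |)
            |PARTITIONED BY (load_date STRING)
            |STORED AS PARQUET
            |LOCATION '/data/semantic/customer_events'""".stripMargin)

        // Register a newly landed partition, then query it with HQL.
        spark.sql(
          "ALTER TABLE semantic.customer_events ADD IF NOT EXISTS PARTITION (load_date='2020-01-01')")
        spark.sql(
          """SELECT event_type, COUNT(*) AS events
            |FROM semantic.customer_events
            |WHERE load_date = '2020-01-01'
            |GROUP BY event_type""".stripMargin).show()

        spark.stop()
      }
    }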
Confidential, Texas
Hadoop Developer
Responsibilities:
- Implemented solutions utilizing advanced AWS components (EMR, EC2, etc.) integrated with Big Data/Hadoop frameworks such as ZooKeeper, YARN, Spark, Scala, and NiFi.
- Designed and Implemented Spark Jobs to be deployed and run on existing Active clusters.
- Configured Postgres databases on EC2 instances, ensured the application that was created was up and running, and troubleshot issues to reach the desired application state.
- Worked on creating and configuring secure VPCs, subnets, and security groups across private and public networks.
- Created alarms, alerts, and notifications for Spark jobs to report job status via email and Slack group messages, and logged it in CloudWatch.
- Worked on NiFi data pipelines to process large data sets and configured lookups for data validation and integrity.
- Worked on generating large sets of test data with data integrity using Java, which was used in the development and QA phases.
- Developed and deployed Spark and Scala code on a Hadoop cluster running on GCP.
- Implemented an adapter to ingest data from clinical files such as XML and JSON, processed those files, and stored the output in AWS S3 for reporting purposes.
- Migrated previously written cron jobs to Airflow on GCP.
- Developed a deep understanding of the vast AWS data sources and used them to provide solutions to business problems.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Worked in Spark with Scala, improving the performance and optimization of existing applications running on the EMR cluster.
- Worked on a Spark job to convert CSV data to custom HL7/FHIR objects using FHIR APIs.
- Stored the data in tabular formats using Hive tables and Hive SerDes.
- Executed Hive queries that helped analyze trends by comparing new data with existing data warehouse reference tables and historical data.
- Developed Hive user-defined functions in Java, compiled them into JARs, added them to HDFS, and executed them from Hive queries.
- Deployed SNS, SQS, Lambda functions, IAM roles, custom policies, and EMR with Spark and Hadoop, along with bootstrap scripts to set up additional software needed to run the jobs in QA and production environments, using Terraform scripts.
- Worked on a Spark job to perform Change Data Capture (CDC) on Postgres tables and update target tables using JDBC properties (see the sketch at the end of this section).
- Worked on a Kafka publisher integrated into the Spark job to capture errors from the Spark application and push them into a Postgres table.
- Worked extensively on building NiFi data pipelines in a Docker container environment during the development phase.
- Worked with the DevOps team to clusterize the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka, and Postgres running on other instances, using SSL handshakes in QA and production environments.
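A minimal sketch of the JDBC-based CDC pattern referenced above: changed rows are pulled from a Postgres source table by a watermark column and appended to a target table. The connection details, table names, and updated_at column are hypothetical, and the watermark would normally be persisted between runs:

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object PostgresCdcSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("PostgresCdcSketch").getOrCreate()

        val jdbcUrl = "jdbc:postgresql://db-host:5432/appdb"  // hypothetical host and database
        val props = new Properties()
        props.setProperty("user", sys.env.getOrElse("DB_USER", "app"))
        props.setProperty("password", sys.env.getOrElse("DB_PASSWORD", ""))
        props.setProperty("driver", "org.postgresql.Driver")

        // Watermark from the last successful run (hard-coded here for brevity).
        val lastRun = "2020-01-01 00:00:00"

        // Push the change filter down to Postgres as a subquery.
        val changed = spark.read.jdbc(
          jdbcUrl,
          s"(SELECT * FROM public.orders WHERE updated_at > '$lastRun') AS changed_rows",
          props)

        // Append the changed rows to the target table over JDBC.
        changed.write.mode(SaveMode.Append).jdbc(jdbcUrl, "public.orders_target", props)

        spark.stop()
      }
    }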
Confidential, Raleigh NC
Big Data Developer
Responsibilities:
- Working in Agile, successfully completed stories related to ingestion, transformation, and publication of data on time.
- Performed validations and consolidations for the imported data, data migration, and data generation.
- Ingested data sets from different databases and servers using Sqoop, the Talend import tool, and the MFT (Managed File Transfer) inbound process with Elasticsearch.
- Expert in implementing advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala and Java.
- Created PySpark programs to load data into Hive and MongoDB databases from PySpark DataFrames.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Optimized performance when dealing with large datasets using partitions and broadcasts in Spark, and effective, efficient joins and transformations during the ingestion process.
- Developed Spark scripts using Scala shell commands and Java as per requirements.
- Used Spark Streaming to consume topics from distributed messaging sources (Talend, Kafka) and periodically push batches of data to Spark for real-time processing into Elasticsearch (see the streaming sketch at the end of this section).
- Worked in teams to analyze anomaly detection and ratings of the data using the ETL tool Talend.
- Developed complex HiveQL queries using a JSON SerDe.
- Implemented partitioning, dynamic partitions, and bucketing in Hive for efficient data access.
- Developed Hive queries for data sampling and analysis for the analysts.
- Responsible for writing Hive queries to analyze data in the Hive warehouse using HQL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Involved in importing real-time data from Elasticsearch to Hadoop using Kafka and Talend, and implemented Oozie jobs for daily imports.
- Wrote Pig Latin scripts and Hive queries using Avro schemas to transform the data sets in HDFS.
- As part of support, responsible for troubleshooting MapReduce jobs, Java scripts, Pig jobs, and Hive.
- Worked on performance tuning of Hive & Pig Jobs.
- Installed Hadoop ecosystem components (Hive, Pig, Sqoop, HBase, Oozie) on top of the Hadoop cluster.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data in Amazon EMR.
Environment: Hadoop, Cloudera, MapReduce, Spark, Shark, Hive, Apache NiFi, Pig, Sqoop, shell scripting, Storm, Talend, Kafka, Datameer, Oracle, Teradata, SAS, Arcadia, Java 7.0, Nagios, Spring, JIRA, EMR.
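A minimal sketch of the Kafka-to-Spark streaming flow referenced above, shown here with the Structured Streaming API rather than the original DStream-based Spark Streaming job; the broker address, topic, and output paths are hypothetical, and a downstream Elasticsearch indexing step is assumed:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object KafkaStreamingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("KafkaStreamingSketch")
          .getOrCreate()

        // Subscribe to a Kafka topic of JSON events (requires the
        // spark-sql-kafka connector on the classpath).
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load()

        // Kafka delivers key/value as binary; keep the value as a string payload.
        val events = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

        // Write micro-batches to HDFS for downstream indexing into Elasticsearch.
        val query = events.writeStream
          .format("parquet")
          .option("path", "/data/streaming/clickstream")
          .option("checkpointLocation", "/checkpoints/clickstream")
          .trigger(Trigger.ProcessingTime("1 minute"))
          .start()

        query.awaitTermination()
      }
    }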
Confidential
Hadoop and Spark Developer
Responsibilities:
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and NiFi for data cleaning and preprocessing.
- Imported and exported data between HDFS and an Oracle database using Sqoop.
- Installed and configured Hadoop Cluster for major Hadoop distributions.
- Used Hive, Pig, and Talend as ETL tools for event joins, filters, transformations, and pre-aggregations.
- Created partitions and bucketing across state in Hive to handle structured data, using Elasticsearch.
- Developed a workflow in Oozie to orchestrate a series of Pig scripts that cleanse data, for example removing personal information or merging many small files into a handful of very large, compressed files, using Pig pipelines in the data preparation stage.
- Involved in moving all log files generated from various sources to HDFS for further processing through Elasticsearch, Kafka, Flume, and Talend, and processed the files using Piggybank.
- Extensively used Pig to communicate with Hive using HCatalog and with HBase using handlers.
- Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs (see the sketch at the end of this section).
- Used Spark SQL to read and write tables stored in Hive and on Amazon EMR.
- Performed Sqoop-based transfers through the HBase tables for processing data into several NoSQL databases, including Cassandra and MongoDB.
- Created tables, secondary indices, and join indices in the Teradata development environment for testing.
- Captured data logs from web servers and Elasticsearch into HDFS using Flume for analysis.
- Managed and reviewed Hadoop log files.
Environment: Hive, Pig, MapReduce, Apache NiFi, Sqoop, Oozie, Flume, Kafka, Talend, EMR, Storm, HBase, Unix, Linux, Python, SQL, Hadoop 1.x, HDFS, GitHub.
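A minimal sketch of the Spark SQL pattern noted above, where an RDD of case classes is converted to a DataFrame with an inferred schema, queried, and saved as a Hive table; the file path, fields, and table name are hypothetical:

    import org.apache.spark.sql.SparkSession

    // The case class fields become the DataFrame schema.
    case class WebEvent(userId: String, page: String, durationMs: Long)

    object RddToDataFrameSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("RddToDataFrameSketch")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Build an RDD of case classes from a tab-delimited file in HDFS.
        val events = spark.sparkContext
          .textFile("/data/raw/web_events.tsv")
          .map(_.split("\t"))
          .filter(_.length == 3)
          .map(f => WebEvent(f(0), f(1), f(2).toLong))

        // toDF() derives the schema from the case class automatically.
        val eventsDf = events.toDF()
        eventsDf.createOrReplaceTempView("web_events")

        // Query with Spark SQL and persist the result as a Hive table.
        spark.sql("SELECT page, COUNT(*) AS hits FROM web_events GROUP BY page")
          .write.mode("overwrite").saveAsTable("top_pages")

        spark.stop()
      }
    }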
Confidential
Java developer
Responsibilities:
- Developed web pages using JSP, HTML, CSS, JavaScript, jQuery, and Ajax, and developed SOAP-based web services using JAXB.
- Involved in documenting application test results, fixing bugs, and delivering enhancements, following the Agile Scrum methodology.
- Added Maven support to existing projects.
- Applied strong fundamentals of object-oriented programming, data structures, and algorithms.
- Wrote design specifications and developed various test cases
- Involved in deploying and running the application on JBoss and fixed issues at the time of production.
- Deployed the application on the WebSphere server using the Eclipse IDE.
- Communicated with the developed web services using a REST client and JSON.
- Used the Spring Framework for dependency injection of Action classes via the ApplicationContext XML file.
- Gained strong exposure to Hibernate and deployed the applications on the WebLogic Application Server.
- Implemented the logging mechanism using the Log4j framework.
- Designed and developed the integration layer for calling EJB backend APIs.
- Used IBM ClearQuest for bug tracking and ticket management.
- Used Git as the version control system for source code and project documents.
Environment: Java, Hibernate, JSP, JavaScript, JAXB, Maven, JBoss, WebSphere, MQ, CSS, Spring, EJB, Log4j, jQuery, SOAP, Eclipse, Ajax, HTML, Agile, WebLogic, Git, IBM ClearQuest, Oracle.