Sr. Data Engineer/Hadoop Developer Resume
Atlanta, GA
SUMMARY
- Around 8 years of professional IT experience, with 6+ years of Big Data/Hadoop ecosystem experience in the ingestion, storage, querying, processing, and analysis of big data.
- Hands-on experience architecting and implementing Hadoop clusters on Amazon Web Services (AWS) using EMR, EC2, S3, Redshift, Cassandra, ArangoDB, CosmosDB, SimpleDB, Amazon RDS, DynamoDB, PostgreSQL, SQL, and MS SQL.
- Experience in Hadoop administration activities such as installation, configuration, and management of clusters in Cloudera (CDH4, CDH5) and Hortonworks (HDP) distributions using Cloudera Manager and Ambari.
- Hands-on experience installing, configuring, and using Hadoop ecosystem components such as HDFS, MapReduce, Hive, Impala, Sqoop, Pig, Oozie, Zookeeper, Spark, Solr, Hue, Flume, Storm, Kafka, and YARN.
- Very good knowledge of and experience with Amazon AWS services such as EMR and EC2, which provide fast and efficient processing of big data.
- Experienced in performance tuning of YARN, Spark, and Hive, and in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
- Extensively worked on Spark with Scala on clusters for computational analytics; installed it on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Expert in big data ecosystem using Hadoop, Spark, Kafka with column-oriented big data systems on cloud platforms such as Amazon Cloud (AWS), Microsoft Azure and Google Cloud Platform.
- Experienced in importing and exporting data between HDFS and relational database management systems using Sqoop, and in troubleshooting any related issues.
- Exposure to data lake implementation using Apache Spark; developed data pipelines, applied business logic using Spark, and used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Experienced in extending Hive and Pig core functionality by writing custom UDFs and MapReduce scripts using Java and Python.
- Good understanding of and experience with the NameNode HA architecture, and experience monitoring cluster health using Ambari, Nagios, Ganglia, and cron jobs.
- Experienced in cluster maintenance and commissioning/decommissioning of DataNodes, with a good understanding of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
- Experienced in implementing security controls using Kerberos principals, ACLs, and data encryption with dm-crypt to protect entire Hadoop clusters.
- Well-versed in Spark components such as Spark SQL, MLlib, Spark Streaming, and GraphX.
- Expertise in installation, administration, patching, upgrades, configuration, performance tuning, and troubleshooting of Red Hat Linux, SUSE, CentOS, AIX, and Solaris.
- Experienced in scheduling recurring Hadoop jobs with Apache Oozie, and in Jumpstart, Kickstart, infrastructure setup, and installation methods for Linux.
- Good troubleshooting skills and understanding of system capacity, bottlenecks, and the basics of memory, CPU, OS, storage, and networking.
- Experience in administration activities for RDBMS databases such as MS SQL Server.
- Experienced with the Hadoop Distributed File System and its ecosystem (MapReduce, Pig, Hive, Sqoop, YARN, MongoDB, and HBase), with knowledge of NoSQL databases such as HBase, Cassandra, and MongoDB.
- Major strengths include familiarity with multiple software systems, the ability to learn new technologies quickly and adapt to new environments, and a focused, adaptive, quick-learning approach, with excellent interpersonal, technical, and communication skills.
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Pig, Hive, Impala, YARN, Hue, Oozie, Zookeeper, Apache Spark, Apache Storm, Apache Kafka, Sqoop, Flume
Operating Systems: Windows, Ubuntu, RedHat Linux, Unix
Programming Languages: C, C++, Java, Python, SCALA
Scripting Languages: Shell Scripting, JavaScript
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, SQL, PL/SQL, Teradata
NoSQL Databases: HBase, Cassandra, and MongoDB
Hadoop Distributions: Cloudera, Hortonworks
Build Tools: Ant, Maven, sbt
Development IDEs: NetBeans, Eclipse IDE
Web Servers: Web Logic, Web Sphere, Apache Tomcat 6
Cloud: AWS
Version Control Tools: SVN, Git, GitHub
Packages: Microsoft Office, PuTTY, MS Visual Studio
PROFESSIONAL EXPERIENCE
Confidential, Atlanta, GA
Sr. Data Engineer/Hadoop Developer
Responsibilities:
- Involved in analysis, design, system architecture design, process interface design, and design documentation.
- Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Used Spark SQL to process huge amounts of structured data.
- Developed a Spark Streaming application to pull data from the cloud into Hive tables.
- Developed various Big Data workflows using custom MapReduce, Pig, Hive and Sqoop.
- Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation using PySpark.
- Involved in designing a multi-data-center Kafka cluster and monitoring it.
- Responsible for importing real-time data from the sources into Kafka clusters.
- Developed predictive analytics using the Apache Spark Scala APIs.
- Created Oozie workflows and coordinator jobs for recurrent triggering of Hadoop jobs such as Java MapReduce, Pig, Hive, and Sqoop, as well as system-specific jobs (such as Java programs and shell scripts), by time (frequency) and data availability.
- Developed Spark Streaming jobs in Scala to consume data from Kafka topics, applied transformations to the data, and inserted it into HBase.
- Worked on NiFi data pipelines to process large data sets and configured lookups for data validation and integrity.
- Worked extensively on building NiFi data pipelines in a Docker container environment during the development phase.
- Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka, and Postgres running on other instances over SSL handshakes, in QA and production environments.
- Worked on Google Cloud Platform services such as the Vision API and instances.
- Worked on various file formats such as JSON, CSV, Avro, SequenceFile, text, and XML.
- Experience with Azure transformation projects and Azure architecture decision making; architected and implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
- Implemented a build framework for new projects using Jenkins and Maven as the build tools.
- Created jobs for continuous integration, testing, and deployment using Jenkins and Maven.
- Used Spark as a fast, general-purpose processing engine compatible with Hadoop data.
- Used Spark to design and perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
- Created various parser programs in Scala to extract data from Autosys, Tibco Business Objects, XML, Informatica, Java, and database views.
- Performed tuning of Hive queries, SQL queries, and Informatica sessions.
- Migrated mappings from lower to higher environments in Informatica.
- Set up a Hadoop cluster on AWS, including configuring the different Hadoop components.
- Built concurrency on Akka-based actors handling queries from in-memory data stores, and used the Play Framework's web services API to handle SSL certificates. The persistence layer was built on Slick 3.0, and the dispatcher was built on Akka, using actors to concurrently handle cache reads and the dispatching of transactions.
- Developed a Python script to start and end jobs smoothly for a UC4 workflow.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Analyzed large data sets by running Hive queries and Pig scripts.
- Introduced Cascade jobs to make the data analysis more efficient as per the requirements.
- Expertise in creating mappings in Talend using tMap, tJoin, tReplicate, tParallelize, tConvertType, tflowtoIterate, tAggregate, tSortRow, tFlowMeter, tLogCatcher, tRowGenerator, tNormalize, tDenormalize, tSetGlobalVar, tHashInput, tHashOutput, tJava, tJavarow, tAggregateRow, tWarn, tMysqlScd, tFilter, tGlobalmap, tDie, etc.
- Created Talend ETL jobs to receive attachment files from POP email using tPop, tFileList, and tFileInputMail, and then loaded data from the attachments into the database and archived the files.
- Well versed with Talend Big Data, Hadoop, Hive and used Talend Big data components like tHDFSInput, tHDFSOutput, tPigLoad, tPigFilterRow, tPigFilterColumn, tPigStoreResult, tHiveLoad, tHiveInput, tHbaseInput, tHbaseOutput, tSqoopImport and tSqoopExport.
- Created PySpark programs to load data into Hive and MongoDB databases from PySpark DataFrames.
- Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in a GCS bucket (an illustrative sketch follows this list).
- Wrote a program to download a SQL dump from the equipment maintenance site and then load it into a GCS bucket; on the other side, loaded this SQL dump from the GCS bucket into MySQL (hosted on Google Cloud SQL) and loaded the data from MySQL into BigQuery using Python, Scala, Spark, and Dataproc.
- Processed and loaded bounded and unbounded data from a Google Pub/Sub topic into BigQuery using Cloud Dataflow with Python.
- Used Angular modules such as angular-animate, angular-cookies, angular-filter, angular-mocks, angular-resource, angular-route, angular-sanitize, angular-touch, and Angular Material.
- Knowledge of developing SPA web UIs using Angular, TypeScript 1.8, and jQuery; also expertise in developing, maintaining, and debugging with the Rails framework.
- Developed responsive Angular Material web application pages using Angular Material services, controllers, and directives for the front-end UI, and consumed RESTful web service APIs.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, PySpark, Spark SQL, DataFrames, and Pair RDDs.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experienced in managing and reviewing Hadoop log files
- Built reusable Hive UDF libraries that enabled various business analysts to use these UDFs in Hive querying.
- Developed Python scripts to clean the raw data.
- Developed Simple to complex MapReduce Jobs using Hive and Pig.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Developed Spark scripts by using Scala as per the requirement.
- Loaded the data into Spark RDDs and performed in-memory data computation to generate the output response.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data (an illustrative streaming sketch follows this list).
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters, and implemented data ingestion and cluster handling for real-time processing using Kafka.
- Applied MapReduce framework jobs in Java for data processing by installing and configuring Hadoop and HDFS.
- Worked on NoSQL databases including HBase, MongoDB, and Cassandra.
- Performed data analysis in Hive by creating tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, HBase, NoSQL database and Sqoop.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs; worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
- Involved in Migrating Objects from Teradata to Snowflake.
- Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
- Designed a data analysis pipeline in Python, using Amazon Web Services such as S3, EC2 and Elastic MapReduce
- Maintained Hadoop Cluster on AWS EMR. Used AWS services like EC2 and S3 for small data sets processing and storage
- Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
- Involved in creating Hive tables, and loading and analyzing data using Hive queries.
- Used Flume to export application server logs into HDFS.
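The Kafka-to-Hive streaming pipeline referenced above can be illustrated with a minimal PySpark Structured Streaming sketch. This is not the production job (which was written in Scala); the broker, topic, schema, and table names are hypothetical placeholders, and it assumes Spark 3.1+ with the spark-sql-kafka package available.

```python
# Illustrative sketch only (not the production Scala job): broker, topic,
# schema, and table names are hypothetical; assumes Spark 3.1+ with the
# spark-sql-kafka package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-to-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Assumed JSON payload schema for incoming events.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw stream from a Kafka topic (placeholder brokers/topic).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "transactions")
       .load())

# Parse the JSON value and apply a simple transformation.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*")
          .filter(col("amount") > 0))

# Append micro-batches into a Hive-backed table (placeholder name).
query = (events.writeStream
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/transactions")
         .toTable("analytics.transactions_stream"))

query.awaitTermination()
```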
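Similarly, the GCS-to-BigQuery load referenced in the list above might look like the following background Cloud Function sketch; the destination table, trigger configuration, and header/schema assumptions are hypothetical rather than the exact production code.

```python
# Illustrative sketch of a 1st-gen background Cloud Function triggered on
# google.storage.object.finalize; the destination table is a placeholder.
from google.cloud import bigquery

BQ_TABLE = "my_project.landing.csv_arrivals"  # hypothetical destination table


def gcs_to_bq(event, context):
    """Load a newly arrived CSV object from GCS into BigQuery."""
    name = event["name"]
    if not name.lower().endswith(".csv"):
        return  # ignore non-CSV objects

    uri = f"gs://{event['bucket']}/{name}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # assume a header row
        autodetect=True,       # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    print(f"Loaded {uri} into {BQ_TABLE}")
```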
Environment: Hadoop, HDFS, Sqoop, Hive, Pig, MapReduce, Spark, Scala, Kafka, AWS, HBase, Azure, MongoDB, Cassandra, Python, NoSQL, Flume, Oozie.
Confidential, Dallas, TX
Data Engineer
Responsibilities:
- Developed big data analytic models for customer fraud transaction pattern detection using Hive on customer transaction data. This also involved transaction sequence analysis with and without gaps, and network analysis between common customers for the top fraud patterns.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Created Spark jobs to see trends in data usage by users.
- Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API (an illustrative sketch follows this list).
- Worked with the Play framework and Akka parallel processing.
- Hands-on experience in multithreaded programming using Akka actors.
- Deployed Azure Resource Manager JSON templates from PowerShell; worked on the Azure suite: Azure SQL Database, Azure Data Lake, Azure Data Factory, Azure SQL Data Warehouse, and Azure Analysis Services.
- Loaded data pipelines from web servers and Teradata using Sqoop with Kafka and Spark Streaming API.
- Developed Kafka pub-sub and Cassandra clients, and Spark components on HDFS and Hive.
- Designed and implemented migration strategies for traditional systems to Azure (lift and shift, Azure Migrate, and other third-party tools); worked on the Azure suite: Azure SQL Database, Azure Data Lake (ADLS), Azure Data Factory (ADF) V2, Azure SQL Data Warehouse, Azure Service Bus, Azure Key Vault, Azure Analysis Services (AAS), Azure Blob Storage, Azure Search, Azure App Service, and Azure data platform services.
- Involved in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Created data lakes and data pipelines for different mobile application events to filter and load consumer response data from Urban Airship in an AWS S3 bucket into Hive external tables in HDFS. Good experience with the Apache NiFi ecosystem.
- Populated HDFS and HBase with huge amounts of data using Apache Kafka.
- Developed set analysis to provide custom functionality in QlikView applications.
- Enhanced existing dashboards with new functionality and established a secure environment using Qlik Sense/QlikView Section Access.
- Configured deployed and maintained multi-node Dev and Test Kafka Clusters.
- Used the Prefuse open-source Java framework for the GUI.
- Designed and Implemented the ETL process using Talend Enterprise Big Data Edition to load the data from Source to Target Database.
- Loaded and transformed large sets of structured data from Oracle/SQL Server into HDFS using Talend Big Data Studio.
- Designed and developed end-to-end ETL processes from various source systems to the staging area, and from staging to the data marts.
- Developed Pig UDFs to pre-process the data for analysis.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig and HiveQL.
- Created Hive tables to store data and wrote Hive queries.
- Extracted the data from Teradata into HDFS using Sqoop.
- Exported the patterns analyzed back to Teradata using Sqoop.
- Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using the CDH4 distribution.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
- Experienced with data pipelines using Kafka and Akka for handling terabytes of data.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on theHadoopcluster.
- Developed Pig Latin scripts to extract data from the web server output files and load it into HDFS.
- Developed shell scripts to periodically perform incremental imports of data from third-party APIs into Amazon AWS.
- Designed and implemented MapReduce jobs to support distributed data processing.
- Processed large data sets utilizing the Hadoop cluster.
- Designed NoSQL schemas in HBase.
- Developed MapReduce ETL in Java/Pig.
- Performed extensive data validation using Hive.
- Imported and exported data using Sqoop between HDFS and relational database systems.
- Involved in weekly walkthroughs and inspection meetings, to verify the status of the testing efforts and the project as a whole.
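As referenced in the Parquet/Hive bullet above, a minimal PySpark sketch of that step is shown below (the actual job used the Scala API); the input path, column names, and output table are hypothetical.

```python
# Sketch only; the production job used the Scala API. Paths, column names,
# and the output table are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("parquet-to-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read Parquet files landed on HDFS (placeholder path).
usage = spark.read.parquet("hdfs:///data/landing/usage/")

# Example trend: daily data usage per user (mirrors the "trends in data usage" jobs).
daily_usage = (usage
               .withColumn("usage_date", F.to_date("event_time"))
               .groupBy("user_id", "usage_date")
               .agg(F.sum("bytes_used").alias("total_bytes")))

# Persist as a partitioned Hive table for downstream Hive/Spark SQL queries.
(daily_usage.write
 .mode("overwrite")
 .partitionBy("usage_date")
 .saveAsTable("analytics.daily_user_usage"))
```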
Environment: Hadoop MapReduce, Pig Latin, Zookeeper, Azure, Oozie, Sqoop, Java, Hive, HBase, AWS, Python, UNIX Shell Scripting.
Confidential, Dallas, TX
Hadoop Developer
Responsibilities:
- Installed and configured Hadoop MapReduce, HDFS, Developed multiple MapReduce jobs for data cleaning and preprocessing.
- Involved in creating Hive tables, writing complex Hive queries to populate Hive tables.
- Load and transform large sets of structured, semi structured and unstructured data.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Worked on Spark RDD transformations to map business analysis and apply actions on top of transformations.
- Experienced in working with the Spark ecosystem using Spark SQL and Scala on different formats such as text, Avro, and Parquet files (an illustrative sketch follows this list).
- Optimized HiveQL scripts by using execution engines such as Tez.
- Wrote complex Hive queries to extract data from heterogeneous sources (data lake) and persist the data into HDFS.
- Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
- Used different file formats like Text files, Avro, Parquet and ORC.
- Worked with different file formats such as text and Parquet for Hive querying and processing based on business logic.
- Used JIRA for creating the user stories and creating branches in the bitbucket repositories based on the story.
- Knowledge of creating various repositories and version control using Git.
- Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
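A minimal PySpark sketch of the multi-format reads and partitioned Hive queries referenced above; the paths, schemas, and table names are hypothetical, and reading Avro assumes the spark-avro package is available.

```python
# Sketch: reading the same logical data in different formats and querying a
# partitioned Hive table. Paths, schemas, and table names are hypothetical;
# format("avro") assumes the spark-avro package is on the classpath.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("formats-and-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Plain text (one record per line), Avro, and Parquet reads of raw event data.
text_df = spark.read.text("hdfs:///data/raw/events_text/")
avro_df = spark.read.format("avro").load("hdfs:///data/raw/events_avro/")
parquet_df = spark.read.parquet("hdfs:///data/raw/events_parquet/")

# Dashboard-style metric from a partitioned (and bucketed) Hive table,
# relying on partition pruning over the date column.
metrics = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM warehouse.events
    WHERE event_date >= '2019-01-01'
    GROUP BY event_date, event_type
""")
metrics.show()
```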
Environment: Python, Hadoop, MapReduce, CDH 5.9, Cloudera Manager, Control M Scheduler, Shell Scripting, Agile Methodology, JIRA, Git, Tableau.
Confidential
JAVA/J2EE Developer
Responsibilities:
- Developed JMS API using J2EE package.
- Made use of JavaScript for client-side validation.
- Used Struts Framework for implementing the MVC Architecture.
- Wrote various Struts action classes to implement the business logic.
- Involved in the design of the project using UML Use Case Diagrams, Sequence Diagrams, Object diagrams, and Class Diagrams.
- Understood concepts related to, and wrote code for, advanced topics such as Java I/O, serialization, and multithreading.
- Used DISPLAY TAGS in the presentation layer for better look and feel of the web pages.
- Developed Packages to validate data from Flat Files and insert into various tables in Oracle Database.
- Provided UNIX scripting to drive automatic generation of static web pages with dynamic news content.
- Participated in requirements analysis to figure out various inputs correlated with their scenarios in Asset Liability Management (ALM).
- Assisted design and development teams in identifying DB objects and their associated fields in creating forms for ALM modules.
- Also involved in developing PL/SQL Procedures, Functions, Triggers and Packages to provide backend security and data consistency.
- Responsible for performing Code Reviewing and Debugging.
Environment: Java, J2EE, UML, Struts, HTML, XML, CSS, JavaScript, Oracle 9i, SQL*Plus, PL/SQL, MS Access, UNIX Shell Scripting.