
Sr. Big Data Engineer/ Hadoop Developer Resume


Waterston, MA

SUMMARY

  • 7+ years of experience as a Big Data Developer with a major focus on Big Data technologies - Hadoop ecosystem, HDFS, MapReduce, HBase, Hive, Sqoop, Kafka, Oozie, Spark, Teradata.
  • Involved in all the phases of the Software Development Life Cycle (SDLC): requirements gathering, analysis, design, development, testing, production and post-production support.
  • Experience with the Oozie Workflow Engine in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs.
  • Hands-on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Zookeeper and Flume.
  • Comprehensive knowledge of system engineering, reliability life-cycle management, and reliability modelling.
  • Played the role of site reliability engineer as well; knowledge of hardware and software design, functional programming, and system operation and maintenance.
  • Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Developed Python and PySpark programs for data analysis on MapR, Cloudera and Hortonworks Hadoop clusters.
  • Knowledge of installation and administration of multi-node virtualized clusters using Cloudera Hadoop and Apache Hadoop.
  • Hands-on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop, MapReduce, HDFS, HBase, Oozie, Hive, Kafka, Zookeeper, Spark, Storm, Sqoop and Flume.
  • Strong experience using Apache Spark, Spark SQL and other data processing tools and languages
  • Good understanding of Apache Spark high-level architecture and performance tuning patterns.
  • Strong knowledge of Hive (architecture, Thrift servers), HQL, Beeline and other third-party JDBC connectivity services to Hive.
  • Ease of operability with the UNIX file system through the command-line interface.
  • Good working knowledge of UNIX commands, such as changing file and group permissions.
  • Developed Python code to gather data from HBase and designed the solution for implementation using PySpark.
  • Maintained and optimized AWS infrastructure (EMR, EC2, S3, EBS, Redshift, and Elasticsearch).
  • Experience in writing MapReduce programs for analyzing Big Data with different file formats, both structured and unstructured data.
  • Experience in Deep Learning using libraries such as Theano, TensorFlow and Keras.
  • Expertise in machine learning models such as Linear and Logistic Regression, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, clustering (K-means, hierarchical) and Bayesian methods.
  • Hands-on experience in the entire Data Science project life cycle, including data acquisition, data cleaning, data manipulation, data mining, machine learning algorithms, data validation, and data visualization.
  • In-depth knowledge of dimensionality reduction (PCA, LDA), hyper-parameter tuning, model regularization (Ridge, Lasso, Elastic Net) and grid search techniques to optimize model performance.
  • Skilled in Python, SQL, R and Object-Oriented Programming (OOP) concepts such as inheritance, polymorphism, abstraction and encapsulation.
  • Experience in developing supervised deep learning algorithms, including Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, LSTM and GRU, and unsupervised deep learning techniques like Self-Organizing Maps (SOMs), in Keras and TensorFlow.
  • Experience building machine learning solutions using PySpark for large data sets on the Hadoop ecosystem.
  • Hands-on experience with machine learning algorithms such as Linear Regression, Logistic Regression, Decision Trees (CART), Random Forest, SVM, K-Nearest Neighbors, Naïve Bayes, K-means clustering and Principal Component Analysis.
  • Professional knowledge of deep learning algorithms such as Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
  • Proficient in model validation and optimization using k-fold cross-validation, ROC curves, confusion matrices and F1 score (a minimal sketch of this workflow follows this summary).
  • Strong skills in statistical methodologies such as hypothesis testing, ANOVA, Monte Carlo simulation, principal component analysis, correspondence analysis, ARIMA time-series analysis and structural equation modelling.
  • Excellent experience in Python, with packages such as pandas, NumPy, matplotlib, seaborn, scikit-learn, SciPy, statsmodels and PySpark, applied to data cleaning, data manipulation, data mining, machine learning, data validation and data visualization.
  • Involved in requirement gathering, analysis, design, estimation and testing of the assigned tasks in Python.
  • Hands-on experience with message brokers such as Apache Kafka and NoSQL databases such as Cassandra and MongoDB.
  • Developed PIG Latin scripts for handling business transformations
  • Implemented Sqoop for large dataset transfer between Hadoop and RDBMs.
  • Worked on Jenkins for continuous integration and end-to-end automation of builds and deployments, managing plugins such as Maven and Ant.
  • Hands-on experience in SCM tools like Git and SVN for merging and branching.
  • Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
  • Experience in providing highly available and fault tolerant applications utilizing orchestration technologies like Kubernetes and Apache Mesos on AWS.
  • Experience setting up instances behind Elastic Load Balancer in AWS for high availability.
  • Good understanding of Open shift platform in managing Docker containers using Docker swarm, Kubernetes Clusters.
  • Excellent communications skills, configuration skills and technical documentation skills.
  • Ability to work closely with teams to ensure high-quality, timely delivery of builds and releases.
  • Excellent relationship management skills and the ability to conceive efficient solutions utilizing technology. Industrious individual who thrives on a challenge, working effectively with all levels of management.
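
A minimal, self-contained sketch of the model-validation workflow referenced above (k-fold cross-validation, confusion matrix, F1 and ROC AUC), using a synthetic scikit-learn dataset; the estimator and parameters are illustrative only, not taken from a specific project:

    # Illustrative only: synthetic data, arbitrary estimator and hyper-parameters.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
    from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42)

    # k-fold cross-validation on the training split
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
    print("CV F1 per fold:", np.round(cv_f1, 3))

    # Fit once and score the held-out split
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print("Confusion matrix:\n", confusion_matrix(y_test, pred))
    print("F1:", round(f1_score(y_test, pred), 3))
    print("ROC AUC:", round(roc_auc_score(y_test, proba), 3))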

TECHNICAL SKILLS

Big Data Ecosystem: MapReduce, HDFS, HBase, Spark, Scala, Zookeeper, Hive, Pig, Sqoop, Cassandra, Oozie, MongoDB, Flume.

ETL Tools: Informatica, Talend.

Java Technologies: Core Java, Servlets, JSP, JDBC, Java 6, Java Help API.

Frameworks: MVC, Struts, Hibernate and Spring.

Programming Languages: C, C++, Java, Python (TensorFlow, PyTorch, Keras, NumPy, SciPy, NLTK, Gensim, spaCy, pandas, Matplotlib, Plotly), Linux shell scripts, R (tidyr, ggplot2, tidyverse, Shiny).

Methodologies: Agile, waterfall, UML, Design Patterns

Database: Oracle 10g/11g, MySQL, NoSQL, SQL Server 2008 R2, HBase.

Application Server: Apache Tomcat 5.x, 6.0.

Web Tools: HTML, XML, AJAX, JavaScript, DTD, Schemas.

Tools: SQL developer, Toad, Maven, SQL Loader

Operating System: Windows 7, Linux Ubuntu

Testing API: JUNIT

PROFESSIONAL EXPERIENCE

Sr. Big Data Engineer/ Hadoop Developer

Confidential, Waterston, MA

Responsibilities:

  • Working on processing large volumes of data using different Big Data analytic tools including Spark, Hive, Sqoop, Pig, Flume, Apache Kafka, PySpark, Oozie, HBase, Python and Scala.
  • Implementation and data integration in developing large-scale system software, with experience in Hadoop ecosystem components like HBase, Sqoop, Zookeeper, Oozie, Hive and Pig.
  • Developed Hive UDFs for extended use and wrote HiveQL for sorting, joining, filtering and grouping the structured data.
  • Developed ETL applications using Hive, Spark, Impala and Sqoop, automated with Oozie. Used Pig as an ETL tool to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
  • Hands-on experience with Cloudera Hue to import data through the GUI.
  • Worked on a scalable distributed data system using the Hadoop ecosystem on AWS EMR and MapR (MapR Data Platform).
  • Used Spark Streaming to divide streaming data into batches as an input to the Spark engine for batch processing.
  • Performed data ingestion from multiple internal clients using Apache Kafka.
  • Worked on integrating Apache Kafka with the Spark Streaming process to consume data from external REST APIs and run custom functions (see the streaming sketch after this list).
  • Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
  • Hands-on experience in Python and PySpark programming on Cloudera, Hortonworks and MapR Hadoop clusters, AWS EMR clusters, AWS Lambda functions and CFTs.
  • Extracted files from Cassandra and MongoDB through Sqoop, placed them in HDFS and processed them.
  • AWS experience developing data streaming pipelines.
  • Worked on querying data using Spark SQL on top of the Spark engine.
  • Automated the process of extracting data from warehouses and weblogs by developing workflows and coordinator jobs in Oozie and NiFi.
  • Developed workflows in Oozie and NiFi to automate the tasks of loading the data into HDFS.
  • Created Hive tables, dynamic partitions and buckets for sampling, and worked on them using HiveQL.
  • Used Sqoop for importing the data into HBase and Hive, and exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Wrote Hive queries to analyze the data and generate the end reports used by business users. Experience with streaming toolsets such as Kafka, Flink, Spark Streaming and NiFi/StreamSets.
  • Created detailed AWS security groups, which behaved as virtual firewalls that controlled the traffic allowed to reach one or more AWS EC2 instances.
  • Worked on scalable distributed computing systems, software architecture, data structures and algorithms using Hadoop, Apache Spark and Apache Storm, and ingested streaming data into Hadoop using Spark, the Storm framework and Scala.
  • Analyzed and modelled structured data and implemented algorithms to support analysis, using advanced statistical and mathematical methods from statistics, machine learning and data mining in Python.
  • Experience in transferring streaming data and data from different data sources into HDFS and NoSQL databases using Apache Flume; cluster coordination services through Zookeeper.
  • Built clusters in the AWS environment using EMR with S3, EC2 and Redshift.
  • Involved in the complete SDLC of the project, including requirements gathering, design documents, development, testing and production environments. Packaged the Spark development environment into a custom Vagrant box.
  • Designed data flows running on Spark and Spark SQL.
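
As referenced in the Kafka/Spark Streaming bullet above, the sketch below shows one way such a pipeline can be wired up. It uses the Structured Streaming Kafka source rather than the classic DStream API, and the broker address, topic, schema and HDFS paths are placeholders rather than project details (running it also assumes the spark-sql-kafka connector package is available on the cluster):

    # Illustrative only: topic, broker, schema and paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = (SparkSession.builder
             .appName("kafka-ingest-stream")          # hypothetical app name
             .getOrCreate())

    # Hypothetical event schema for illustration
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("source", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Read micro-batches from a Kafka topic
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "ingest-events")
           .load())

    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("e"))
              .select("e.*"))

    # Land the parsed events on HDFS as Parquet for downstream batch processing
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/landing/ingest_events")       # placeholder path
             .option("checkpointLocation", "hdfs:///checkpoints/ingest") # placeholder path
             .trigger(processingTime="1 minute")
             .start())
    query.awaitTermination()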

Environment: Hadoop, Hive, Impala, Amazon S3, Beeline, PySpark, Accelerator, NiFi/StreamSets, Bamboo, Control-M, ICEDQ, UNIX, Cloudera Navigator, Spark SQL, SharePoint, Confluence, AWS, Bitbucket, JIRA, Flume, Oozie, Zookeeper, Cassandra, MongoDB, Spark, Kafka, MapReduce, S3, EC2, EMR.

Senior Data Engineer

Confidential, Evansville IN

Responsibilities:

  • Worked on analyzing the Hadoop cluster and different Big Data analytic tools including Pig, Hive, the HBase database and Sqoop.
  • Developed multiple Spark jobs in PySpark for data cleaning and pre-processing.
  • Installed Hadoop, MapReduce and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Configured Zookeeper and worked on Hadoop high availability with the Zookeeper failover controller, adding support for a scalable, fault-tolerant data solution.
  • Good knowledge of Amazon EMR (Elastic MapReduce).
  • Coordinated with business customers to gather business requirements, interacted with other technical peers to derive technical requirements, and delivered the BRD and TDD documents.
  • Extensively involved in the design phase and delivered design documents.
  • Involved in testing and coordination with the business during user testing.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data.
  • Extracted BSON files from MongoDB, placed them in HDFS and processed them.
  • Experience with Big Data/NoSQL frameworks: Python, Spark Streaming, DataFrames, Scala, Redshift, HBase, Hive and Cassandra.
  • Designed and implemented partitioning (static and dynamic) and bucketing in Hive on AWS (see the partitioning sketch after this list).
  • Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
  • Python and PySpark programming on MapR Hadoop clusters, AWS EMR clusters, AWS Lambda functions and CFTs.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Experienced in defining job flows.
  • Worked on creating pipelines to migrate data from Hadoop to Azure Data Factory.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Experienced in managing and reviewing the Hadoop log files.
  • Used Pig as an ETL tool to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
  • Loaded and transformed large sets of structured and semi-structured data.
  • Responsible for managing data coming from different sources.
  • Involved in creating Hive tables, loading data and writing Hive queries.
  • Utilized the Apache Hadoop environment from Cloudera.
  • Created the data model for Hive tables.
  • Involved in unit testing and delivered unit test plans and results documents.
  • Exported data from the HDFS environment into an RDBMS using Sqoop for report generation and visualization purposes.
  • Worked on Oozie workflow engine for job scheduling.
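
A minimal sketch of the dynamic-partitioning work referenced above, submitted through Spark SQL with Hive support enabled. Database, table and column names are hypothetical, and the bucketing step is omitted here since writing to Hive bucketed tables from Spark needs extra handling:

    # Illustrative only: database/table/column names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partitioning")
             .enableHiveSupport()
             .getOrCreate())

    # Allow dynamic partitions without a static partition key
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.orders_clean (
            order_id  STRING,
            customer  STRING,
            amount    DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        STORED AS ORC
    """)

    # Load a pre-processed staging table; order_date drives the dynamic partitions
    spark.sql("""
        INSERT OVERWRITE TABLE analytics.orders_clean PARTITION (order_date)
        SELECT order_id, customer, amount, order_date
        FROM staging.orders_raw
    """)

    # A reporting-style metric over the partitioned data
    spark.sql("""
        SELECT order_date, COUNT(*) AS order_cnt, SUM(amount) AS total_amount
        FROM analytics.orders_clean
        GROUP BY order_date
    """).show()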

Environment: Hadoop, PySpark, Cloudera, HDFS, Hive, AWS, Azure Data Factory, Azure Storage, Spark, Spark SQL, HBase, Sqoop, Kafka, HP ALM/Quality Center, Agile, SQL, Teradata, XML, UNIX, Shell Scripting, WinSQL, MySQL, MongoDB, Oozie

Big Data Engineer

Confidential, Bridgewater NJ

Responsibilities:

  • Responsible for developing applications on the Data Lake per the client requirements and exposing that data to the client.
  • Developed the code to move the data from one zone to another in the Data Fabric platform.
  • Created applications such as Claims Sweep WGS for transforming the data per the client requirements using Spark, Hive and Python.
  • Developed automation scripts to perform validations such as record counts and schema checks, and loaded the data into the corresponding partitions (see the validation sketch after this list).
  • Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Used PySpark to write the code for all the use cases in Spark, with extensive experience using Scala for data analytics on the Spark cluster, and performed map-side joins on RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, with good experience using spark-shell and Spark Streaming.
  • Built near real-time pipelines that operate efficiently to handle huge volumes of incoming business activity.
  • Developed programs to validate the data after ingesting it into the Data Lake using UNIX.
  • Developed scripts to generate reconciliation reports using Python.
  • Involved in moving data from different source systems such as Oracle, SQL and DB2 to the Data Lake.
  • Identified the layout for COBOL copybooks, cleaned up the copybooks and ingested the data per the layout.
  • Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
  • Ingested data from the Oracle database through Oracle GoldenGate to the Hadoop Data Lake with the help of Kafka.
  • Responsible for providing technical solutions when the team faced issues.
  • Responsible for guiding the team when they had issues.
  • Responsible for providing design and architecture to the team to develop applications.
  • Responsible for reviewing the code and bringing it in line with the client's standards.
  • Created the data model for the data to be ingested for each table.
  • Identified the appropriate file formats for the tables to retrieve the data faster.
  • Chose column data types carefully so that no data is lost or missed.
  • Responsible for identifying the user stories and work items for the initiatives in PI planning.
  • Responsible for developing masking algorithms to mask PHI columns, so the data could be exposed to offshore teams without revealing the actual values.
  • Set up GCP firewall rules to allow or deny traffic to and from the VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
  • Supported site reliability engineering.
  • Created projects, VPCs, subnetworks and GKE clusters for the QA3, QA9 and prod environments using Terraform.
  • Worked on a Jenkinsfile with multiple stages: checking out a branch, building the application, testing, pushing the image to GCR, deploying to QA3, deploying to QA9, acceptance testing and finally deploying to production.
  • Identified user stories for the requirements/EPICs.
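
As referenced in the validation bullet above, a sketch of a record-count and schema check before data is promoted into its partition. The paths, expected count and schema are placeholders invented for illustration:

    # Hypothetical paths, counts and schema, for illustration only.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = (SparkSession.builder
             .appName("ingest-validation")
             .getOrCreate())

    expected_schema = StructType([
        StructField("claim_id", StringType()),
        StructField("member_id", StringType()),
        StructField("amount", DoubleType()),
    ])
    landing_path = "hdfs:///datalake/landing/claims/dt=2023-01-01"   # placeholder
    expected_count = 125000                                          # e.g. from a source control file

    df = spark.read.parquet(landing_path)

    # Record-count check
    actual_count = df.count()
    if actual_count != expected_count:
        raise ValueError(f"Record count mismatch: expected {expected_count}, got {actual_count}")

    # Schema check: compare field names and types, ignoring nullability and order
    actual = {(f.name, f.dataType.simpleString()) for f in df.schema.fields}
    expected = {(f.name, f.dataType.simpleString()) for f in expected_schema.fields}
    if actual != expected:
        raise ValueError(f"Unexpected schema: {df.schema.simpleString()}")

    # Checks passed: append the validated batch under its date partition in the curated zone
    (df.withColumn("dt", F.lit("2023-01-01"))
       .write.mode("append")
       .partitionBy("dt")
       .parquet("hdfs:///datalake/curated/claims"))                  # placeholder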

Environment: Kafka, RESTful services, Amazon Web Services, Scala, Hive, Jira, StreamSets, HDFS, Control-M, GCP, Spark, Teradata, Hortonworks, Scrum, Pig, Tez, Oozie, HBase, PySpark, Spark SQL, Python, Linux, PuTTY, Cassandra

Data Engineer

Confidential

Responsibilities:

  • Worked on analyzing the Hadoop cluster and different Big Data analytic tools including Pig, Hive, the HBase database and Sqoop.
  • Installed Hadoop, MapReduce and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Coordinated with business customers to gather business requirements, interacted with other technical peers to derive technical requirements, and delivered the BRD and TDD documents.
  • Extensively involved in the design phase and delivered design documents.
  • Involved in testing and coordination with the business during user testing.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data.
  • Experience building data and/or ETL pipelines.
  • Designed and implemented partitioning (static and dynamic) and bucketing in Hive on AWS.
  • Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs (see the sketch after this list).
  • Experienced in defining job flows.
  • Worked on creating pipelines to migrate data from Hadoop to Azure Data Factory.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Experienced in managing and reviewing the Hadoop log files.
  • Used Pig as an ETL tool to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
  • Loaded and transformed large sets of structured and semi-structured data.
  • Responsible for managing data coming from different sources.
  • Involved in creating Hive tables, loading data and writing Hive queries.
  • Utilized the Apache Hadoop environment from Cloudera.
  • Created the data model for Hive tables.
  • Involved in unit testing and delivered unit test plans and results documents.
  • Exported data from the HDFS environment into an RDBMS using Sqoop for report generation and visualization purposes.
  • Worked on Oozie workflow engine for job scheduling.
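
A short sketch of the S3-to-RDD step referenced above: read raw lines from S3, apply transformations, then run actions. The bucket, prefix and field positions are placeholders, and the cluster is assumed to have an S3 connector (s3a/EMRFS) configured:

    # Illustrative only: bucket/prefix and field positions are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # Read raw CSV lines from S3 into an RDD
    lines = sc.textFile("s3a://example-bucket/logs/2023/01/*.csv")

    # Transformations: split, drop malformed rows, key by the first field
    parsed = (lines.map(lambda line: line.split(","))
                   .filter(lambda cols: len(cols) >= 3)
                   .map(lambda cols: (cols[0], float(cols[2]))))

    # Actions: aggregate per key and pull a small sample back to the driver
    totals = parsed.reduceByKey(lambda a, b: a + b)
    for key, total in totals.take(10):
        print(key, total)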

Environment: Hadoop, Cloudera, HDFS, Hive, AWS, Azure Data Factory, Azure Storage, HBase, Sqoop, Kafka, Agile, SQL, Teradata, XML, UNIX, Shell Scripting, WINSQL

Tableau Developer

Confidential

Responsibilities:

  • Worked as a Tableau administrator, in charge of report development and maintenance.
  • Independently owned IT support tasks related to Tableau reports on the server.
  • Built and maintained many dashboards.
  • Delivered training classes to employees on Tableau software.
  • Created dataflow diagrams and documented the whole data-movement process.
  • Responsible for writing the complete requirements document by interacting directly with the business to ascertain the business rules and logic; wrote high-level and detailed design documents.
  • Created backup scripts to take periodic backups of the content on the Tableau server (see the sketch after this list).
  • Worked on requirement gathering for the reports.
  • Analyzed the database tables and created database views based on the columns needed.
  • Created dashboards displaying sales, variance of sales between planned and actual values.
  • Implemented calculations and parameters wherever needed.
  • Published dashboards to Tableau server.
  • Created users; added users to a site; added users to a group; viewed, edited and deleted users and activated their licenses; and handled distributed environments, including installing worker servers, maintaining a distributed environment and high availability.
  • Developed SQL Server impersonation: impersonation requirements, how impersonation works, impersonating with a Run As User account, and impersonating with embedded SQL credentials.
  • Good knowledge of and hands-on work with tabcmd and tabadmin, database maintenance and troubleshooting.
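
A hedged sketch of the kind of periodic backup script mentioned above, assuming the classic tabadmin CLI (pre-TSM Tableau Server); the backup directory and file naming are hypothetical:

    # Hypothetical paths/naming; assumes tabadmin is on the PATH of the Tableau Server host.
    import datetime
    import subprocess

    backup_dir = r"D:\tableau_backups"                    # placeholder backup location
    stamp = datetime.date.today().strftime("%Y%m%d")
    backup_file = rf"{backup_dir}\ts_backup_{stamp}.tsbak"

    # "tabadmin backup <file>" writes a .tsbak backup of Tableau Server content
    subprocess.run(["tabadmin", "backup", backup_file], check=True)
    print(f"Backup written to {backup_file}")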
