
Hadoop Big Data Architect Resume

Savannah, GA


  • 5+ years of Hadoop engineering and 7 years total in Information Technology/Database.
  • Expert infrastructure engineers skilled in design, documentation and implementation of Enterprise Data Services solutions.
  • Hands-on experience architecting and implementing Hadoop clusters on Amazon (AWS), using EMR, EC2, S3, Redshift, Cassandra, ArangoDB, CosmosDB, SimpleDB, Amazon RDS, DynamoDB, PostgreSQL, SQL, MS SQL.
  • Research and present potential solutions for the current AWS platform in relation to data integration, visualization, and reporting.
  • Able to work within a team and cross-functionally to research and design solutions that speed up or enhance delivery within the current platform.
  • Expertise in dealing with multiple petabytes of data from mobile ads, social media, and IoT in various formats: structured, unstructured, and semi-structured.
  • Use of Amazon Cloud (AWS) using Elastic MapReduce, Elasticsearch, Cloudera Impala.
  • Able to design and document the technology infrastructure for all pre-production environments and partner with technology Operations on the design of production implementations.
  • Provide end-to-end data analytics solutions and support using Hadoop systems and tools on cloud services as well as on-premises nodes.
  • Expert in the big data ecosystem using Hadoop, Spark, and Kafka with column-oriented big data systems on cloud platforms such as Amazon Cloud (AWS), Microsoft Azure, and Google Cloud Platform.
  • Ability to conceptualize innovative data models for complex products, and create design patterns.
  • Fluent in architecture and engineering of the Cloudera Hadoop ecosystem.
  • Experience working with Cloudera Distributions, Hortonworks Distributions and Hadoop.
  • Managed architecture and integration of real-time and near real-time processing systems using Apache Spark, Spark Streaming and Apache Storm.
  • Uses Flume and HiveQL scripts to extract, transform, and load the data into databases.
  • Able to perform cluster and system performance tuning.
  • Works with cluster subscribers to ensure efficient resource usage in the cluster and alleviate multi-tenancy concerns.
  • Inform and recommend cluster upgrades and improvements to functionality and performance.
  • Interaction with NOC team to work with Hadoop to provide large-scale solutions.
  • Experience with large-scale Hadoop deployments (40+ nodes; 3+ clusters).
  • Experience with data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
  • Hands-on experience in ETL, data integration, and migration; extensively used ETL methodology to support data extraction, transformation, and loading.


Hadoop File System: HDFS, MapReduce

Amazon Cloud: AWS, EMR, EC2, SQS, S3, DynamoDB, Redshift, CloudFormation

Hadoop Distributions: Cloudera (CDH), Hortonworks (HDP)

Hadoop Platforms: MapR, Elastic Cloud, Anaconda Cloud

Visualization Tools: Pentaho, Qlikview, Tableau, Informatica, Power BI

Query Engines: SQL, HiveQL, Impala

Databases: SQL, MySQL, Oracle, DB2, Redshift, Amazon Aurora, MongoDB, DynamoDB, CassandraDB, Amazon RDS, ArangoDB

Frameworks: Hive, Pig, Spark, Spark Streaming, Storm

File Formats: Parquet, Avro, ORC, JSON, XML, CSV

ETL Tools/Frameworks: Hive, Spark, Kafka, Sqoop, Flume, Camel, Apatar, Talend, Tez

Data Modeling: Toad Database Management Toolset, Podium, Informatica, Talend

Admin Tools: Zookeeper, Oozie, Cloudera Manager, Ambari

Apache Misc: Apache Ant, Flume, HCatalog, Maven, Oozie, Tez

Databases: Apache Cassandra, Datastax Cassandra, Apache HBase, Couchbase, DB2, MySQL, PostgreSQL, DynamoDB, AuroraDB, ArangoDB, SimpleDB, CosmosDB, Amazon RDS

Scripting: Hive, Pig, MapReduce, Yarn, Python, Scala, Spark, Spark Streaming, Storm


Confidential, Savannah, GA

Hadoop Big Data Architect


  • Worked directly with the Big Data Architecture Team, which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data, and stored the data in HDFS on AWS.
  • Design included a variation of the lambda architecture, with near real-time processing using Spark SQL on a Spark cluster.
  • Developed a task execution framework on EC2 instances using Amazon RedShift and ArangoDB.
  • Extracted a real-time feed using Kafka and Spark Streaming, converted it to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS.
  • Manipulated and analyzed complex, high volume, and high dimensional data in AWS using various querying tools.
  • Involved in loading data from LINUX file system to AWS S3 and HDFS.
  • Implemented Partitioning, Dynamic Partitions and Buckets in Hive for optimized data retrieval.
  • Used a Kafka producer to ingest the raw data into Kafka topics and ran the Spark Streaming app to process clickstream events.
  • IAM user, group, roles & policy management. AWS Access Key management
  • Fetched live stream data from IoT data sources and streamed to Hadoop data lake using Spark Streaming and Apache Kafka
  • VPC, Route 53, Security Groups, manage Route, Firewall policy, Load Balance DNS setup.
  • EC2 Instance creation and Auto Scaling, snapshot backup and managing template.
  • CloudFormation scripting, security and resources automation.
  • CloudWatch monitoring; S3 & Glacier storage management, access control and policing.

Confidential, St. Louis, MO

Hadoop Architect & Engineer


  • Used Spark SQL to perform transformations and actions on data residing in Hive.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.
  • Imported data from disparate sources into Spark RDD for processing.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Implemented Spark using Scala, and utilized DataFrames and Spark SQL API for faster processing of data.
  • Involved in converting HiveQL/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Configured Jupyter for Spark web clients.
  • Integrated Zeppelin daemon with Spark master node.
  • Spark notebooks.
  • Implemented Partitioning, Dynamic Partitions and Buckets in Hive for optimized data retrieval.
  • Used Flume to handle streaming data and loaded the data into the Hadoop cluster.
  • Integrated Kafka with Spark Streaming for high-speed data processing.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Handled real-time streaming data from different sources such as Cassandra DB, CosmosDB and SQL, using Flume with HDFS as the destination.
  • Exported analyzed data to relational databases using Sqoop for visualization, and to generate reports for the BI team.
  • Extracted the data from RDBMS (Oracle, MySQL) to HDFS using Sqoop.
  • Worked with various compression techniques, such as LZO and Snappy, to save data and optimize data transfer over the network.
  • Involved in running Hadoop jobs for processing millions of records and data which was updated daily/weekly.
  • Developed the build script for the project using Maven.

Confidential, McLean, VA

Big Data Engineer


  • Created materialized views, partitions, tables, views and indexes.
  • Used IoT with Big Data Analytics architecture on AWS to transform supply chain management.
  • Implemented a data lake on S3 with AWS cloud services for batch processing and real-time processing.
  • Developed dynamic parameter file and environment variables to run jobs in different environments.
  • Worked on installing clusters, commissioning & decommissioning of data node, configuring slots, and on name node high availability, and capacity planning.
  • Executed tasks for upgrading clusters on the staging platform before doing so on the production cluster.
  • Used different file formats like text files, sequence files, and Avro.
  • Used a Kafka producer to ingest the raw data into Kafka topics and ran the Spark Streaming app to process clickstream events.
  • Collected the real-time data from Kafka using Spark Streaming and performed transformations.
  • Extensively used Impala to read, write, and query Hadoop data in HDFS.
  • Implemented Partitioning, Dynamic Partitions and Buckets in Hive for optimized data retrieval.
  • Used Spark to transfer data with Kafka to and from Redshift and DynamoDB stored on S3.

Confidential, Eau Claire, WI

Big Data Engineer


  • Designed a cost-effective archival platform for storing big data using MapReduce jobs.
  • Designed jobs using the Oracle, ODBC, Join, Merge, Lookup, Remove Duplicates, Copy, Filter, Funnel, Data Set, File Set, Change Data Capture, Modify, Row Merger, Aggregator, Peek, and Row Generator stages.
  • Design of Kibana dashboard over Elasticsearch for log monitoring
  • Analyzed MapReduce programs of moderate complexity.
  • Implemented Partitioning, Dynamic Partitions and Buckets in Hive for optimized data retrieval.
  • Wrote MapReduce code to process and parse data from various sources and store parsed data into HBase and Hive using HBase-Hive Integration.
  • Maintained and ran MapReduce jobs on YARN clusters to produce daily and monthly reports per requirements.
  • Responsible for developing a data pipeline using Sqoop, MapReduce and Hive to extract the data from weblogs and store the results for downstream consumption.
  • Worked on various file formats AVRO, ORC, Text, CSV, and Parquet using Snappy compression.
  • Used Sqoop to export the data back to a relational database for business reporting.

Confidential, Pittsburgh, PA

Big Data Developer


  • Transformed the log data into a data model using Pig and wrote UDFs to format the log data.
  • Experienced in loading and transforming large sets of structured and semi-structured data through Sqoop and placing it in HDFS for further processing.
  • Wrote Sqoop scripts to move data into and out of HDFS, and validated the data before loading to check for duplicates.
  • Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
  • Involved in transferring data from legacy tables to HDFS and HBase tables using Sqoop.
  • Connected various data centers and transferred data between them using Sqoop and various ETL tools.
  • Worked with Flume to load the log data from multiple sources directly into HDFS.
  • Analyzed large sets of structured, semi-structured and unstructured data by running Hive queries and Pig scripts.
  • Analyzed the data by running Hive queries (HiveQL), Impala queries, and Pig Latin scripts to study customer behavior.
  • Involved in writing Pig scripts for cleansing the data and implemented Hive tables for the processed data in tabular format.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries, which internally run MapReduce jobs.
  • Implemented Partitioning, Dynamic Partitions and Buckets in Hive for optimized data retrieval.
  • Used Hive JDBC to verify the data stored in the Hadoop cluster.
