Lead Hadoop Big Data Engineer Resume
Atlanta, GA
PROFESSIONAL SUMMARY
- Proven track record of increasing responsibilities, turning around low-performing teams, and enhancing operational processes. Strong analytical and problem-solving skills.
- Experience with extremely large-scale Big Data systems, databases, and data warehouses, backed by broad domain knowledge.
- Large-scale distributed systems; knowledge of all aspects of database technology, from hardware to tuning to modeling.
- Integrated Kafka with Spark Streaming for real time data processing (a brief sketch follows this summary).
- Used Spark to fine-tune query responsiveness for better user experience.
- Experienced in the Spark DataFrame, Spark SQL, and Spark Streaming APIs, as well as system architecture and infrastructure planning.
- Implemented Hadoop based data warehouses, integrated Hadoop with Enterprise Data Warehouse systems
- Built real-time Big Data solutions using HBase handling billions of records
- Troubleshot Spark applications to make them more fault tolerant.
- Built scalable, cost-effective solutions using Cloud technologies
- Implemented Big Data analytical solutions that 'close the loop' and provide actionable intelligence
- Developed a free-text search solution with Hadoop and Solr for analyzing emails for compliance and eDiscovery.
- Expert knowledge of Hadoop/HDFS, MapReduce, HBase, Pig, and Sqoop; cloud: Amazon Elastic MapReduce (EMR), Amazon EC2, Rackspace, and Google Cloud.
- Distributed systems, large-scale non-relational data stores, RDBMS, NoSQL map-reduce systems, data modeling, database performance, and multi-terabyte data warehouses.
- Hadoop framework, Hadoop Distributed File System (HDFS), and parallel processing implementation.
- Experience with the Hadoop framework and its ecosystem, including but not limited to HDFS architecture, MapReduce programming, Hive, Pig, Sqoop, HBase, and Oozie.
- Built and supported large-scale Hadoop environments, including design, configuration, installation, performance tuning, and monitoring.
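The Kafka-to-Spark Streaming integration noted above can be illustrated with a minimal PySpark sketch. It assumes Spark 2.x with Structured Streaming and the spark-sql-kafka-0-10 package on the classpath; the broker address, topic name, and HDFS paths are placeholders rather than details from any specific engagement.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# persist the decoded records to HDFS. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-streaming-sketch")
         .getOrCreate())

# Subscribe to a Kafka topic; records arrive as binary key/value columns.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "events")                       # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

# Write the stream to HDFS in micro-batches with checkpointing for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/streams/events")               # placeholder output path
         .option("checkpointLocation", "/checkpoints/events")  # placeholder checkpoint path
         .start())

query.awaitTermination()
```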
TECHNICAL SKILLS
DATABASE: Cassandra, DataStax, HBase, Phoenix, Redshift, DynamoDB, MongoDB, MS Access, SQL, MySQL, Oracle, PL/SQL, PostgreSQL, RDBMS
DATABASE SKILLS: Database partitioning, database optimization, building communication channels between structured and unstructured databases.
DATA STORES (repositories): Data Lake, Data Warehouse, SQL Database, RDBMS, NoSQL Database
PROGRAMMING AND SCRIPTING: Spark, Python, Scala, Hive, Pig, Kafka, SQL
SEARCH TOOLS: Apache Lucene, Elasticsearch, Elastic Cloud, Kibana, Apache Solr
DATA PIPELINES/ETL: Apache Camel, Flume, Apache Storm, Apache Spark, Apache NiFi, Apache Kafka, Logstash, Talend
HADOOP DISTRIBUTIONS & CLOUD PLATFORMS: Cloudera CDH 4/5, Hortonworks HDP 2.3/2.4, Amazon Web Services (AWS)
BATCH & STREAM PROCESSING: Apache Hadoop, Spark, Storm, Tez, Flink
PROFESSIONAL EXPERIENCE
Confidential, Atlanta, GA
Lead Hadoop Big Data Engineer
Responsibilities:
- Provide proofs of concept to reduce engineering churn.
- Give extensive presentations on the Hadoop ecosystem, best practices, and data architecture in Hadoop.
- Constructed custom data pipelines and managed the ETL and transformation processes, data lakes, etc. for Confidential &T.
- Designed and developed ETL workflows using Python and Scala for processing data in HDFS & MongoDB.
- Implemented Spark using Scala and utilized the DataFrame and Spark SQL APIs for faster processing of data.
- Provide mentorship and guidance to other architects to help them become independent.
- Implemented partitioning, dynamic partitions, and buckets in Hive. Involved in converting HiveQL/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Used AWS (Amazon Web Services) components, downloading and uploading data files (with ETL) to the AWS system using Talend S3 components.
- Integrated Kerberos into Hadoop to harden clusters and secure them against unauthorized users.
- Used Flume to handle streaming data and load it into the Hadoop cluster.
- Responsible for building Hadoop clusters with the Hortonworks/Cloudera distributions and integrating them with Pentaho.
- Provide review and feedback on existing physical architecture, data architecture, and individual code. Moved transformed data to the Spark cluster, where it is set to go live on the application using Kafka.
- Debug and solve issues with Hadoop as on-the-ground subject matter expert. This could include everything from patching components to post-mortem analysis of errors.
- Experience importing data into and exporting data out of HDFS and Hive using Sqoop.
- Provided proofs of concept converting Avro data into Parquet format to improve Hive query processing (see the sketch after this list).
- Involved in installing AWS EMR framework.
- Cassandra data modeling for storing and transformation in Apache Spark using Datastax Connector.
- Integrated Kafka with Spark Streaming for real time data processing.
- Migrated MapReduce jobs to Spark, using the Spark SQL and DataFrame APIs to load structured data into Spark clusters.
- Handled schema changes in the data stream using Kafka.
- Experienced in managing and reviewing Hadoop log files.
- Participated in development/implementation of Cloudera Hadoop environment.
- Load and transform large sets of structured, semi-structured, and unstructured data.
- Experience in working with various kinds of data sources such as MongoDB and Oracle.
- Successfully loaded files to Hive and HDFS from MongoDB.
- Installed the Oozie workflow engine to run multiple MapReduce programs that execute independently based on time and data availability.
- Performed Data scrubbing and processing with Oozie and Spark.
- Responsible for managing data coming from different sources.
- Experience in working with Flume to load the log data from multiple sources directly into HDFS.
- Involved in loading data from UNIX file system to HDFS.
- Installed and configured Hive and wrote Hive UDFs.
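A minimal sketch of the Avro-to-Parquet proof of concept mentioned above. It assumes the spark-avro package is available on the classpath (older Spark releases used the com.databricks.spark.avro format); the HDFS paths are placeholders.

```python
# Hypothetical sketch: convert Avro files in HDFS to Parquet so that Hive queries
# over an external table scan a columnar, compressed format instead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-to-parquet-poc").getOrCreate()

# Read the raw Avro data (placeholder path); requires the spark-avro package.
avro_df = spark.read.format("avro").load("/data/raw/events_avro")

# Write it back as Snappy-compressed Parquet (placeholder output path).
(avro_df.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("/data/curated/events_parquet"))

# A Hive external table pointed at the Parquet path can then serve the same queries faster.
```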
Confidential, Atlanta, GA
Hadoop Data Engineer
Responsibilities:
- Involved in creating Hive tables, loading them with data, and writing Hive queries, which run internally as MapReduce jobs.
- Worked on installing the cluster, commissioning and decommissioning data nodes, NameNode recovery, capacity planning, and cluster configuration.
- Implemented best income logic using Pig scripts.
- Load and transform large sets of structured, semi-structured, and unstructured data.
- Exported the analyzed data to the relational databases using Sqoop for ingestion and Tableau for data visualization to generate reports for the BI team.
- Supported in setting up QA environment and updating configurations for implementing scripts with Pig and Sqoop.
- Programmed ETL functions between Oracle and Amazon Redshift.
- Built and supported several AWS multi-server environments using Amazon EC2, EBS, and Redshift.
- Worked with Hive on Tez, and various configuration options for improving query performance.
- Performed analytics on time-series data stored in Cassandra using the Cassandra API (see the sketch after this list).
- Worked on analyzing the Hadoop cluster and different big data analytic tools, including Pig, the HBase database, and Sqoop.
- Integrated Kafka with Spark Streaming for real time data processing.
- Created columnar storage formats in Hive, such as Parquet and ORC, for use with compression codecs such as Gzip and Snappy.
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved in loading data from the Linux file system to HDFS.
- Devised and led the implementation of the next-generation architecture for more efficient data ingestion and processing.
- Proficient in mentoring and onboarding engineers new to Hadoop and getting them up to speed quickly.
- Experience serving as technical lead of a team of engineers.
- Proficiency with modern natural language processing and general machine learning techniques and approaches.
- Extensive experience with Hadoop and HBase, including multiple public presentations about these technologies.
- Experience with hands-on data analysis and performing under pressure.
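A minimal sketch of the Cassandra time-series analytics referenced above. The bullet mentions only the "Cassandra API", so the use of the DataStax Spark Cassandra Connector here is an assumption, and the host, keyspace, table, and column names are placeholders.

```python
# Hypothetical sketch: aggregate time-series readings stored in Cassandra.
# Assumes the Spark Cassandra Connector is on the classpath; names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("cassandra-timeseries-sketch")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
         .getOrCreate())

# Load the time-series table as a DataFrame (placeholder keyspace/table).
readings = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="metrics", table="sensor_readings")
            .load())

# Roll the raw readings up to hourly averages per sensor.
hourly = (readings
          .groupBy("sensor_id", F.window("event_time", "1 hour"))
          .agg(F.avg("value").alias("avg_value")))

hourly.show()
```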
Confidential, Menlo Park, CA
Hadoop Big Data Engineer
Responsibilities:
- Worked on tuning the performance of Pig queries.
- Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required. Worked with Flume to load log data from multiple sources directly into HDFS.
- Performance tuning and troubleshooting of MapReduce by reviewing and analyzing log files.
- Deployed the Big Data Hadoop application using Talend on the AWS cloud.
- Responsible for managing data coming from different sources.
- Involved in loading data from UNIX file system to HDFS.
- Load and transform large sets of structured, semi-structured, and unstructured data.
- Provided cluster coordination services through ZooKeeper.
- Experience in managing and reviewing Hadoop log files.
- Job management using Fair scheduler. Involved in scheduling Oozie workflow engine to run multiple Hive, Sqoop and Pig jobs.
- Developed Oozie workflow for scheduling and orchestrating the ETL process.
- Integrated Kafka with Spark Streaming for real time data processing.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
- Developed Sqoop jobs to populate Hive external tables using incremental loads.
- Supported in setting up QA environment and updating configurations for implementing scripts with Pig and Sqoop.
- Crawled some websites using Python and collected information about users, questions asked and the answers posted.
- Analyzed SQL scripts and designed solutions to reimplement them using PySpark (a brief sketch follows this list).
- Hands-on experience in developing web applications using Python on Linux and UNIX platforms.
- Experience in Automation Testing, Software Development Life Cycle (SDLC) using the Waterfall Model and good understanding of Agile Methodology.
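A minimal sketch of porting a SQL script to PySpark, as referenced above: the same aggregation expressed first through spark.sql and then with the DataFrame API. The table, columns, and date filter are illustrative placeholders.

```python
# Hypothetical sketch: one aggregation written as SQL and as DataFrame code,
# the pattern used when converting SQL scripts to PySpark. Names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-to-pyspark-sketch").getOrCreate()

orders = spark.read.parquet("/data/warehouse/orders")   # placeholder source
orders.createOrReplaceTempView("orders")

# Original SQL form.
sql_result = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    WHERE order_date >= '2016-01-01'
    GROUP BY customer_id
""")

# Equivalent DataFrame API form.
df_result = (orders
             .filter(F.col("order_date") >= "2016-01-01")
             .groupBy("customer_id")
             .agg(F.sum("amount").alias("total_amount")))

df_result.show()
```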
Confidential, New York, NY
Big Data Developer
Responsibilities:
- Exported data from DB2 to HDFS using Sqoop.
- Developed MapReduce jobs using the Java MapReduce API.
- Installed and configured Pig and wrote Pig Latin scripts.
- Wrote MapReduce jobs using Pig Latin.
- Developed workflow using Oozie for running MapReduce jobs and Hive Queries.
- Worked on cluster coordination services through ZooKeeper.
- Worked on loading log data directly into HDFS using Flume.
- Involved in loading data from the Linux file system to HDFS.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats.
- Responsible for managing data from multiple sources.
- Experienced in running Hadoop Streaming jobs to process terabytes of XML-format data (see the sketch after this list).
- Assisted in exporting analyzed data to relational databases using Sqoop.
- Implemented JMS for asynchronous auditing purposes.
- Involved in developing Message Driven and Session beans for claimant information integration with MQ based JMS queues.
- Created and maintained technical documentation for launching Cloudera Hadoop clusters and for executing Hive queries and Pig scripts, including UDFs.
- Integrated Kafka with Spark Streaming for real time data processing.
- Worked on developing custom MapReduce programs and User Defined Functions (UDFs) in Hive to transform large volumes of data with respect to business requirements.
- Worked on installing the cluster, commissioning & decommissioning of data nodes, NameNode recovery, capacity planning, and slots configuration.
- Created HBase tables to store variable data formats of PII data coming from different portfolios.
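A minimal sketch of a Python mapper for the Hadoop Streaming XML processing referenced above. It assumes one XML record per input line; the tag names and fields are placeholders. The matching reducer would aggregate the emitted counts, and both scripts would be passed to the hadoop-streaming jar via the -mapper and -reducer options.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: reads one XML record per line from stdin
# and emits tab-separated key/value pairs for the reducer. Tag names are placeholders.
import sys
import xml.etree.ElementTree as ET

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        # e.g. <event><user>u1</user><count>3</count></event>
        record = ET.fromstring(line)
    except ET.ParseError:
        continue  # skip malformed records rather than failing the job
    user = record.findtext("user", default="unknown")
    count = record.findtext("count", default="0")
    print("%s\t%s" % (user, count))
```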
Confidential, Verona, WI
Database Developer
Responsibilities:
- Provided technical support for Care Everywhere, the interoperability product for exchanging electronic medical records between organizations.
- Collaborated with customers on a frequent basis to define and accomplish long-term objectives for the organization's success.
- Assisted with setup and ongoing support issues for product’s end-user facing side, back-end configuration, and Windows Server-side.
- Troubleshot issues regarding Windows Server and networking, working alongside customer and outside healthcare organizations.
- Designed and developed code to enhance and support the product.
- Worked on internal project to prevent downtime across the network.