Big Data Engineer Resume
Dallas, TX
SUMMARY
- 8+ years of professional IT experience in analysis, development, integration, and maintenance of web-based and client/server applications using Java and Big Data technologies.
- 4+ years of relevant experience using Hadoop ecosystem tools and their architecture (HDFS, MapReduce, YARN, Spark, Pig, Hive, HBase, Sqoop, Flume, Oozie).
- Strong experience working with Spark (DataFrames and Spark SQL) for high-performance data processing and data preparation.
- Experience in real time analytics with Apache Spark Streaming and Kafka.
- Ingested real-time streaming events into Kafka topics using the Kafka producer API (a minimal producer sketch follows this summary).
- Good hands-on experience working with various Hadoop distributions, mainly Cloudera (CDH), Hortonworks (HDP), and Amazon EMR.
- Expertise in developing production-ready Spark applications utilizing the Spark Core, DataFrames, Spark SQL, Spark ML, and Spark Streaming APIs.
- Strong experience troubleshooting failures in Spark applications and fine-tuning them for better performance.
- Experience using DStreams in Spark Streaming, accumulators, broadcast variables, and various caching levels and optimization techniques in Spark.
- Worked extensively on Hive for building complex data analytical applications.
- Strong experience writing complex map-reduce jobs including development of custom Input Formats and custom Record Readers.
- Sound knowledge of map-side joins, reduce-side joins, shuffle & sort, distributed cache, compression techniques, and multiple Hadoop input and output formats.
- Worked extensively on Sqoop for performing bulk and incremental ingestion of large datasets from Teradata to HDFS.
- Good experience working with AWS cloud services such as S3, EMR, Redshift, and Athena.
- Experience in monitoring and managing Hadoop clusters using Cloudera Manager.
- Experience in job workflow designing and scheduling using Oozie.
- Worked with Apache NiFi to develop Custom Processors for processing and distributing data among cloud systems.
- Good knowledge of Scala programming concepts.
- Expertise in distributed and web environments with a focus on core Java technologies such as Collections, multithreading, IO, exception handling, and memory management.
- Expertise in developing web applications using J2EE technologies such as Servlets, JSP, web services, Spring, Hibernate, HTML5, JavaScript, jQuery, and AJAX.
- Knowledge of standard build and deployment tools such as Eclipse, Scala IDE, Maven, Subversion, SBT.
- Extensive knowledge in Software Development Lifecycle (SDLC) using Waterfall, Agile methodologies.
- Facilitated sprint planning, daily scrums, retrospectives, stakeholder meetings, and software demonstrations.
- Excellent communication skills, with the ability to communicate complex issues to technical and non-technical audiences including peers, partners, and senior IT and business management.
- Implemented Spark applications in the Scala programming language.
- Worked on GCP technologies such as Dataproc, Dataflow, and BigQuery.
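Illustrative sketch of the Kafka producer pattern mentioned above, assuming the kafka-python client; the broker address, topic name, and event payload are placeholders rather than details from any specific engagement.

import json
from kafka import KafkaProducer   # kafka-python client (assumed)

# Serialize events as JSON and send them to a placeholder topic.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_event(event: dict) -> None:
    # Send one streaming event to a hypothetical topic.
    producer.send("clickstream-events", value=event)

publish_event({"user_id": 42, "action": "page_view"})
producer.flush()   # ensure buffered records reach the brokers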
TECHNICAL SKILLS
Languages: Java, Scala, SQL, PL/SQL, Pig Latin, Python, Hive QL
Web Technologies: JEE (JDBC, JSP, Servlets, JSF, JSTL), AJAX, JavaScript
Big Data Systems: Hadoop, HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Flume, Oozie, Impala, Spark, Kafka, Cloudera CDH4/CDH5, Hortonworks HDP, Solr, and Ranger
RDBMS: Oracle 10g/11g, MySQL, SQL Server 2005/2008 R2, PostgreSQL, DB2, Teradata
NoSQL Databases: HBase, MongoDB, Cassandra
App/Web Servers: Apache Tomcat, WebLogic
SOA: Web services, SOAP, REST
Frameworks: Struts 2, Hibernate, Spring 3.x
Version Control: GIT, CVS, SVN
IDEs: Eclipse, Scala IDE, NetBeans, IntelliJ IDEA
Operating Systems: UNIX, Linux, Windows
PROFESSIONAL EXPERIENCE
Confidential, Dallas, TX
Big Data Engineer
Responsibilities:
- Developed solutions to process data into HDFS, analyzed the data using MapReduce and Hive, and produced summary results from Hadoop for downstream systems.
- Extracted, transformed, and loaded data from source systems to generate CSV data files using Python and SQL queries.
- Developed Spark applications using Python and Spark SQL for data extraction, transformation, and aggregation across multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the PySpark sketch after this role).
- Developed the strategy and implementation for integrating Hadoop/Impala with the existing RDBMS ecosystem using Apache Spark.
- Created Hive tables and applied HiveQL on those tables for data validation; performed optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Developed simple to complex MapReduce jobs in Python, alongside Hive and Spark.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Developed multiple MapReduce jobs in Python for data processing and cleaning.
- Loaded data from the Linux file system into HDFS and managed data from multiple sources.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
- Extracted data from Oracle through Sqoop and processed it in Spark.
- Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data, and assisted in exporting analyzed data to relational databases using Sqoop.
Environment: Hadoop, HDFS, Hive, MapReduce, Sqoop, Spark, Kafka, Airflow, Linux.
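Minimal PySpark sketch of the Spark SQL extraction/aggregation pattern described above; the HDFS paths, column names, and output location are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Read two source formats (paths are placeholders) and align them by column name.
json_events = spark.read.json("hdfs:///data/raw/events_json/")
csv_events = spark.read.option("header", "true").csv("hdfs:///data/raw/events_csv/")
events = (json_events.select("user_id", "event_type", "event_date")
          .unionByName(csv_events.select("user_id", "event_type", "event_date")))

# Aggregate with Spark SQL to surface usage patterns.
events.createOrReplaceTempView("events")
daily_usage = spark.sql("""
    SELECT event_date, event_type, COUNT(DISTINCT user_id) AS active_users
    FROM events
    GROUP BY event_date, event_type
""")
daily_usage.write.mode("overwrite").parquet("hdfs:///data/curated/daily_usage/")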
Confidential -San Francisco, CA
AWS Big Data Engineer
Responsibilities:
- Deployed Lambda functions and other dependencies into AWS to automate EMR cluster spin-up for data lake jobs (see the boto3 sketch after this role).
- Set up continuous integration/deployment of Spark jobs to EMR clusters.
- Scheduled Spark applications/steps on the AWS EMR cluster.
- Installed and configured Apache Hadoop, Hive and Pig environment on the prototype server.
- Configured database to store Hive metadata.
- Loaded unstructured data into the Hadoop Distributed File System (HDFS).
- Created ETL jobs to load application and server data into S3 buckets and transported the S3 data into the data warehouse.
- Created reports and dashboards using structured and unstructured data.
- Joined various tables using Spark and Scala and ran analytics on top of them in EMR.
- Applied Spark Streaming for real-time data transformation.
- Created multiple dashboards in tableau for multiple business needs.
- Implemented partitioning, dynamic partitions, and bucketing in Hive for efficient data access.
- Implemented test scripts to support test-driven development and continuous integration
- Developed and executed a migration strategy to move Data Warehouse from SAP to AWS Redshift.
- Designed and built a multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Confidential's Redshift cluster, handling millions of records every day at large scale.
- Implemented and managed ETL solutions and automated operational processes.
- Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.
- Published interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
Environment: Data Engineering, Databases, EMR, Redshift, Hadoop, Spark, ETL, Tableau.
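Hedged sketch of a Lambda handler that automates EMR spin-up and submits one Spark step with boto3; the release label, instance types, IAM roles, and S3 script path are assumptions, not values from the engagement.

import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    # Launch a transient cluster that terminates after the single Spark step.
    response = emr.run_job_flow(
        Name="datalake-transient-cluster",
        ReleaseLabel="emr-6.2.0",                               # assumed release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "datalake-spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://example-bucket/jobs/datalake_job.py"],  # placeholder path
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return {"cluster_id": response["JobFlowId"]}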
Confidential -San Diego, CA
Big Data Engineer
Responsibilities:
- Worked hands-on with Spark SQL queries and DataFrames: imported data from data sources, performed transformations and read/write operations, and saved the results to output directories in HDFS.
- Worked on data pre-processing and cleaning to support feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
- Created data quality scripts using SQL and Hive to validate successful data loads and overall data quality; created various data validations using Python in Spark.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark (a conversion sketch follows this role).
- Analyzed system failures, identified root causes, and recommended courses of action.
- Managed data imported from different sources, performed transformations using Hive, Spark, and MapReduce, and loaded the data into HDFS.
- Used the Oozie workflow engine to run multiple Hive and Spark jobs that execute independently based on time and data availability, and developed an Oozie workflow triggered by the availability of transaction data.
- Developed Kafka scripts to extract data from SFTP server output files and load it into HDFS.
- Implemented custom UDFs for Confidential's Kudu, developed Hive UDFs to pre-process the data for analysis, and built Spark jobs for the analysts.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Spark.
- Collected log data from web servers and integrated it into HDFS using Kafka.
- Implemented the Fair Scheduler on the JobTracker to share cluster resources among users' MapReduce jobs written in Python.
- Managed and reviewed Hadoop log files and used Spark to analyze point-of-sale data and coupon usage.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
- Worked with highly engaged Informatics, Scientific Information Management and enterprise IT teams.
Environment: Hadoop, HBase, HDFS, Hive, Spark, Spark Sql, Pig, Zookeeper, Oozie, Impala, Kafka.
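Brief illustration of converting a HiveQL aggregate into equivalent Spark transformations, as referenced above; the table and column names (transactions, store_id, amount) are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder.appName("hive-to-spark")
         .enableHiveSupport().getOrCreate())

# Original HiveQL:
#   SELECT store_id, SUM(amount) AS total_sales FROM transactions GROUP BY store_id;

# Equivalent DataFrame transformation:
totals_df = (spark.table("transactions")
             .groupBy("store_id")
             .agg(F.sum("amount").alias("total_sales")))

# The same logic with the lower-level RDD API:
totals_rdd = (spark.table("transactions").rdd
              .map(lambda row: (row["store_id"], row["amount"]))
              .reduceByKey(lambda a, b: a + b))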
Confidential
Hadoop Developer/Spark
Responsibilities:
- Created data pipelines to ingest data from various sources including devices, apps, survey data, S3 buckets, and databases (SAP HANA, MySQL, SQL Server, Amazon Redshift, PostgreSQL).
- Collected metrics about data pipelines, stored indexes in Elasticsearch, and created Kibana dashboards that helped us quickly identify data loss and anomalies.
- Responsible for General Data Protection Regulation (GDPR) compliance: provided the legal team with users' data upon request in the desired format and deleted users' data from various schemas in the Hadoop Distributed File System.
- Worked on custom PySpark libraries to push columns from various data formats to a data governance tool called Collibra.
- Successfully migrated Data Pipeline jobs from Oozie to Airflow.
- Worked on an anonymization project to de-identify PII present in various schemas within the data lake, with a provision to maintain a mapping between de-identified hash values and original values in a separate RDBMS database (a hashing sketch follows this role).
- Worked on the AWS EMR migration: successfully moved resource-intensive ETLs to run on AWS EMR and created custom libraries.
- To meet growing demand for system resources and to provide more efficient data access for end users, migrated the platform architecture from the Cloudera Distribution to Snowflake and Databricks; started migrating data pipelines and data from CDH to the new system.
- Worked on a POC to explore the features of Apache Pulsar.
- Built NiFi flows (processors for Kafka, JoltTransformJSON, files, S3, HDFS, etc.) as per the needs.
Environment: Apache Hadoop, Cloudera, AWS EC2, S3, EMR, Glue, Kafka, Pulsar, Linux, Java, MapReduce, HBase, Hive, Sqoop, Oozie and SQL, Spark, Elasticsearch, Kibana, Snowflake, Databricks.
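Hedged sketch of the PII de-identification approach: hash PII columns with SHA-256 in PySpark while persisting a hash-to-original mapping to a relational store. The schema, table and column names, and the JDBC target are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder.appName("pii-anonymizer")
         .enableHiveSupport().getOrCreate())

users = spark.table("datalake.users")            # hypothetical schema/table
pii_cols = ["email", "phone_number"]             # hypothetical PII columns

# Keep a mapping between original values and their hashes in a separate RDBMS.
for col in pii_cols:
    mapping = (users.select(F.col(col).alias("original_value")).dropDuplicates()
               .withColumn("hashed_value", F.sha2(F.col("original_value"), 256))
               .withColumn("column_name", F.lit(col)))
    (mapping.write.format("jdbc")
        .option("url", "jdbc:postgresql://mapping-db:5432/anonymization")  # placeholder
        .option("dbtable", "pii_mapping")
        .mode("append")
        .save())

# Overwrite the PII columns in the lake with their hashed values.
anonymized = users
for col in pii_cols:
    anonymized = anonymized.withColumn(col, F.sha2(F.col(col), 256))
anonymized.write.mode("overwrite").saveAsTable("datalake.users_anonymized")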
Confidential
Hadoop Developer
Responsibilities:
- Built a scalable distributed data solution using Hadoop on a 30-node cluster using AWS cloud to run analysis on 25+ Terabytes of data.
- Developed several new MapReduce and Spark programs to analyze and transform the data to uncover insights into customer usage patterns (a MapReduce-style indexing sketch follows this role).
- Used MapReduce to index large amounts of data so that specific records could be accessed easily.
- Performed ETL using Pig, Hive and MapReduce to transform transactional data to de-normalized form.
- Configured periodic incremental imports of data from DB2 into HDFS using Sqoop.
- Worked extensively on importing metadata into Hive using Java and migrated existing tables and applications to work on Hive and the AWS cloud.
- Wrote Pig and Hive UDFs to analyze the complex data to find specific user behavior.
- Used Kafka and Solr workflow engine to schedule multiple recurring and ad-hoc Hive and Pig jobs.
- Responsible for maintaining and implementing code versioning techniques using Cassandra for the entire project.
- Created HBase tables to store various data formats coming from different portfolios.
- Utilized cluster co-ordination services through ZooKeeper.
- Assisted the team responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing & reviewing data backups and Hadoop log files.
- Worked with teams in various locations nationwide and internationally to understand and accumulate data from different sources.
- Worked with the testing teams to fix bugs and ensure smooth and error-free code.
Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Sqoop, Oozie, HBase, ZooKeeper, PL/SQL, MySQL, DB2, Teradata.
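Illustrative sketch of the MapReduce indexing step as Hadoop Streaming-style Python functions that build an inverted index (term -> record ids); the tab-delimited record layout is an assumption.

import sys
from itertools import groupby

def mapper(lines):
    # Emit (term, record_id) pairs from tab-delimited records: "record_id<TAB>text".
    for line in lines:
        record_id, text = line.rstrip("\n").split("\t", 1)
        for term in text.lower().split():
            yield term, record_id

def reducer(pairs):
    # Group sorted (term, record_id) pairs into an inverted index.
    for term, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield term, sorted({record_id for _, record_id in group})

if __name__ == "__main__":
    # Run locally over stdin; in a real Hadoop Streaming job the mapper and
    # reducer would be packaged as separate scripts.
    for term, record_ids in reducer(mapper(sys.stdin)):
        print(term + "\t" + ",".join(record_ids))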