Sr. Big Data Engineer Resume
Greenwood Village, CO
SUMMARY
- Over 8 years of experience in the IT industry, including Big Data environments, the Hadoop ecosystem, Java, and the design, development, and maintenance of various applications.
- Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL).
- Expertise in core Java and JDBC, and proficient in using Java APIs for application development.
- Expertise in JavaScript, JavaScript MVC patterns, object-oriented JavaScript design patterns, and AJAX calls.
- Good experience in Tableau for data visualization and analysis of large data sets, drawing various conclusions.
- Leveraged and integrated Google Cloud Storage and BigQuery applications, which connected to Tableau for end-user web-based dashboards and reports.
- Good working experience with application and web servers such as JBoss and Apache Tomcat.
- Good knowledge of Amazon Web Services (AWS) offerings such as Athena, EMR, and EC2, which provide fast and efficient processing for Teradata big data analytics.
- Expertise in Big Data architectures such as Hadoop distributed systems (Azure, Hortonworks, Cloudera), MongoDB and other NoSQL stores, HDFS, and the MapReduce parallel-processing framework.
- Developed Spark-based applications to load streaming data with low latency using Kafka and PySpark.
- Hands-on experience with Hadoop/Big Data technologies for the storage, querying, processing, and analysis of data.
- Experience in developing Big Data projects using open-source tools such as Hadoop, Hive, HDP, Pig, Flume, Storm, and MapReduce.
- Experience in installation, configuration, supporting and managing Hadoop clusters.
- Experience in writing MapReduce programs with Apache Hadoop to process Big Data.
- Experience in developing, supporting, and maintaining ETL (Extract, Transform and Load) processes using Talend Integration Suite.
- Experience in installation, configuration, supporting and monitoring Hadoop clusters using Apache, Cloudera distributions and AWS.
- Strong hands-on experience with AWS services, including EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB, and CloudFormation.
- Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
- Hands-on experience with the Hadoop ecosystem and related big data technologies, including Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, and Storm.
- Worked with Spark and Spark Streaming, using the core Spark API to explore Spark features and build data pipelines.
- Experienced in working with scripting technologies such as Python and UNIX shell scripts.
- Good knowledge of Amazon Web Services (AWS) offerings such as EMR and EC2; successfully loaded files into HDFS from Oracle, SQL Server, Teradata, and Netezza using Sqoop.
- Extensive knowledge of IDE tools such as MyEclipse, RAD, IntelliJ, and NetBeans.
- Expert in Amazon EMR, Spark, Kinesis, S3, ECS, ElastiCache, DynamoDB, and Redshift.
- Experience in the installation, configuration, support, and management of the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
- Responsible for designing and implementing data pipelines using Big Data tools including Hive, Oozie, Airflow, Spark, Drill, Kylin, Sqoop, Kylo, NiFi, EC2, ELB, S3, and EMR.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation with Airflow (a minimal DAG sketch follows this summary).
- Proficiency in multiple databases like NoSQL databases (MongoDB, Cassandra), MySQL, ORACLE, and MS SQL Server.
- Experience in database design, entity relationships, and database analysis; programming SQL, PL/SQL stored procedures, packages, and triggers in Oracle.
- Experience in working with different data sources like Flat files, XML files and Databases.
- Ability to tune Big Data solutions to improve performance and end-user experience.
- Working experience building RESTful web services and RESTful APIs.
- Managed multiple tasks and worked under tight deadlines in fast-paced environments.
- Excellent analytical and communication skills that help in understanding business logic and building good relationships between stakeholders and team members.
- Strong communication and analytical skills; a good team player and quick learner, organized and self-motivated.
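As referenced above, a minimal Airflow sketch in Python of the time-sensor and SLA pattern; the DAG id, task names, schedule, and delays are illustrative assumptions rather than the actual production workflows, and module paths assume Airflow 2.x.

```python
# Hypothetical Airflow 2.x DAG: a time sensor gates a daily task that carries an SLA.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.time_delta import TimeDeltaSensor

with DAG(
    dag_id="daily_ingest_example",        # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait a fixed delay after the scheduled time before running downstream work.
    wait_for_window = TimeDeltaSensor(
        task_id="wait_for_window",
        delta=timedelta(hours=1),
    )

    # Example job with a 2-hour SLA; misses surface through Airflow's SLA machinery.
    run_ingest = BashOperator(
        task_id="run_ingest",
        bash_command="echo 'run ingest job here'",  # placeholder command
        sla=timedelta(hours=2),
    )

    wait_for_window >> run_ingest
```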
TECHNICAL SKILLS
Languages: C, C++, Java (J2SE 1.6/1.7/1.8), JEE, Scala, JUnit & Shell Scripting
Big Data Technologies: Apache Hadoop, Apache Spark, Apache Kafka, Apache Sqoop, Apache Crunch, Apache Hive, MapReduce, Oozie, Apache NiFi and Apache Pig
Frameworks: Spring, Spring Boot.
Web Services: RESTFUL
Data Formats: JSON, Avro, ORC, CSV, XML and Protocol Buffers.
Data Indexing Technology: Apache Solr
Deployments: Pivotal Cloud Foundry, Chef.
Integration Tools: Jenkins, TeamCity.
Operating Systems: Mac OS, Windows XP/Vista/7
Packages & Tools: MS Office Suite (Word, Excel, PowerPoint, SharePoint, Outlook, Project)
Development Tools: Eclipse Juno, IntelliJ
Database: JDBC, MySQL, SQL Server, Oracle 10g
NoSQL Database: HBase and MongoDB
UML Modeling Tools: Visual Paradigm for UML 10.1, Visio
BI Tools: SAP Business Objects 4.1, Information Design Tool and Web Intelligence
Cloud services: AWS, Azure
PROFESSIONAL EXPERIENCE
Confidential, Greenwood Village, CO
Sr. Big Data Engineer
Responsibilities:
- Evaluated business requirements and prepared detailed specifications, following project guidelines, for the programs to be developed.
- Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
- Loaded and transformed large sets of structured, semi structured and unstructured data using Hadoop/Big Data concepts.
- Installed and Configured Apache Hadoop clusters for application development and Hadoop tools.
- Installed and configured Hive, wrote Hive UDFs, and used a repository of UDFs for Pig Latin.
- Developed data pipeline using Pig, Sqoop to ingest cargo data and customer histories into HDFS for analysis.
- Migrated the existing on-prem code to AWS EMR cluster.
- Installed and configured Hadoop Ecosystem components and Cloudera manager using CDH distribution.
- Coordinated with Hortonworks support team through support portal to sort out the critical issues during upgrades.
- Involved in all phases of SDLC using Agile and participated in daily scrum meetings with cross teams.
- Worked on modeling of Dialog process, Business Processes and coding Business Objects, Query Mapper and JUnit files.
- Created automated pipelines in AWS CodePipeline to deploy Docker containers to AWS ECS using S3.
- Used HBase NoSQL Database for real time and read/write access to huge volumes of data in the use case.
- Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded the data into HBase.
- Developed an AWS Lambda function to invoke a Glue job as soon as a new file arrives in the inbound S3 bucket (see the sketch after this list).
- Created Spark jobs to apply data cleansing and validation rules to new source files in the inbound bucket, routing rejected records to a reject-data S3 bucket.
- Developed AWS CloudFormation templates, set up Auto Scaling for EC2 instances, and was involved in the automated provisioning of the AWS cloud environment using Jenkins.
- Created HBase tables to load large sets of semi-structured data coming from various sources.
- Responsible for loading the customer's data and event logs from Kafka into HBase using REST API.
- Created tables along with sort and distribution keys in AWS Redshift.
- Created shell and Python scripts to automate daily tasks, including production tasks.
- Created, altered, and deleted Kafka topics as required.
- Used cloud computing on a multi-node cluster, deployed the Hadoop application with S3 storage, and used Elastic MapReduce (EMR) to run MapReduce jobs.
- Developed analytics enablement layer using ingested data that facilitates faster reporting and dashboards.
- Worked with production support team to provide necessary support for issues with CDH cluster and the data ingestion platform.
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Implemented the Big Data solution using Hadoop, Hive, and Informatica to pull and load the data into HDFS.
- Developed applications using Angular 6 and lambda expressions in Java to store and process the data.
- Implemented the Angular 6 Router to enable navigation from one view to the next as the agent performs application tasks.
- Pulled data from the Hadoop data lake ecosystem and massaged the data with various RDD transformations.
- Used PySpark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and Python.
- Developed and maintained batch data flow using HiveQL and Unix scripting.
- Experienced in writing and deploying UNIX Korn Shell Scripts as part of the standard ETL processes and for job automation purposes.
- Designed and Developed Real time processing Application using Spark, Kafka, Scala and Hive to perform streaming ETL and apply Machine Learning.
- Developed and executed data pipeline testing processes and validated business rules and policies.
- Used PySpark to read data from various data sources using SQL or HiveQL.
- Built code for real time data ingestion using MapR-Streams.
- Experienced in AWS Glue and explored Apache Airflow for ETL Processes.
- Implemented Spark using Python and Spark SQL for faster processing of data.
- Automated unit testing using Python and applied different testing methodologies such as unit testing and integration testing.
- Used Hive join queries to join multiple tables of a source system and loaded them into Elasticsearch tables.
- Implemented different data formatter capabilities and publishing to multiple Kafka Topics.
- Extensively worked on Jenkins to implement Continuous Integration (CI) and Continuous Deployment (CD) processes.
- Wrote automated HBase test cases for data quality checks using HBase command-line tools.
- Involved in development of Hadoop System and improving multi-node Hadoop Cluster performance.
- Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
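A hedged Python sketch of the S3-triggered Lambda-to-Glue handoff described above; the Glue job name and argument keys are hypothetical placeholders, not the actual job configuration.

```python
# Hypothetical Lambda handler: on an S3 ObjectCreated event, start a Glue job run
# with the new object's location passed as job arguments.
import urllib.parse

import boto3

glue = boto3.client("glue")

GLUE_JOB_NAME = "inbound-file-etl"  # placeholder Glue job name


def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Kick off the Glue job for the newly arrived file.
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={
                "--source_bucket": bucket,   # placeholder argument names
                "--source_key": key,
            },
        )
        print(f"Started Glue run {response['JobRunId']} for s3://{bucket}/{key}")

    return {"status": "ok"}
```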
Environment: Hadoop 3.0, MapReduce, Hive 3.0, Agile, HBase 1.2, NoSQL, AWS, EC2, Kafka, Pig 0.17, HDFS, Java 8, Hortonworks, Spark, PL/SQL, Python, Jenkins.
Confidential, Nashville, TN
Sr. Data Engineer
Responsibilities:
- Evaluated business requirements and prepared detailed specifications, following project guidelines, for the programs to be developed.
- Loaded and transformed large sets of structured, semi structured and unstructured data using Hadoop/Big Data concepts.
- Created the automated build and deployment process for application, re-engineering setup for better user experience, and leading up to building a continuous integration system.
- Implemented MapReduce programs to retrieve results from unstructured data set.
- Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Worked on and designed a Big Data analytics platform for processing customer interface preferences and comments using Hadoop, Hive, and Pig on Cloudera.
- Imported and exported data between Oracle/MongoDB and HDFS/Hive using Sqoop.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Worked on reading multiple data formats on HDFS using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Installed and configured Pig and wrote Pig Latin scripts.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm, and search technologies such as Elasticsearch.
- Worked on querying data using Spark SQL on top of the PySpark engine.
- Experienced in implementing POCs to migrate iterative MapReduce programs into Spark transformations using Scala.
- Developed Spark scripts using Python and Scala shell commands as per the requirements.
- Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
- Designed dimensional data models using Star and Snowflake Schemas.
- Experienced in AWS cloud environment and on S3 storage and EC2 instances
- Developed Spark jobs using Scala in test environment for faster data processing and used Spark SQL for querying.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS (a minimal sketch follows this list).
- Worked in a Snowflake environment to remove redundancy and loaded real-time data from various data sources into HDFS using Kafka.
- Designed and implemented SOLR indexes for the metadata that enabled internal applications to reference Scopus content.
- Used Spark for parallel data processing and better performance using Scala.
- Extensively used Pig for data cleansing and to extract data from web server output files for loading into HDFS.
- Implemented a fully operational production grade large scale data solution on Snowflake Data Warehouse.
- Implemented Kafka producers with custom partitioners, configured brokers, and implemented high-level consumers to build the data platform.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Used Kubernetes to orchestrate the deployment, scaling and management of Docker Containers.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it using MapReduce programs.
- Developed simple to complex MapReduce streaming jobs using Python.
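A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow referenced above; broker addresses, the topic name, and HDFS paths are placeholders, and it assumes the spark-sql-kafka connector package is available on the cluster.

```python
# Hypothetical streaming job: read a Kafka topic and persist it to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka_to_hdfs_example").getOrCreate()

# Read the raw stream from Kafka; key/value arrive as bytes and are cast to strings.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
    .option("subscribe", "customer-events")             # placeholder topic
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

# Write the stream to HDFS with checkpointing for fault tolerance.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/customer_events")                  # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/customer_events")
    .start()
)

query.awaitTermination()
```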
Environment: Pig 0.17, Hive 2.3, HBase 1.2, Sqoop 1.4, Flume 1.8, Zookeeper, AWS, MapReduce, HDFS, Cloudera, Scala, Spark 2.3, SQL, Apache Kafka 1.0.1, Apache Storm, Python, Unix.
Confidential, New York, NY
Big Data Developer
Responsibilities:
- Contributed to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop and other big data technologies for leading organizations, using major Hadoop distributions such as Hortonworks.
- Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
- Performed Data transformations in HIVE and used partitions, buckets for performance improvements.
- Created Hive external tables over the MapReduce output, with partitioning and bucketing applied on top.
- Developed business-specific custom UDFs in Hive and Pig.
- Developed end-to-end architecture designs for big data solutions based on a variety of business use cases.
- Worked as a Spark Expert and performance Optimizer.
- Member of Spark COE (Center of Excellence) in Data Simplification project at Cisco.
- Experienced with Spark Context, Spark SQL, DataFrames, Pair RDDs, and PySpark.
- Handled data skew in Spark SQL (one common salting approach is sketched after this list).
- Implemented Spark using Scala and Java, utilizing DataFrames and the Spark SQL API for faster data processing.
- Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
- Developed a data pipeline using Kafka, HBase, Spark, and Hive to ingest, transform, and analyze customer behavioral data; also developed Spark and Hive jobs to summarize and transform data.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and PySpark.
- Implemented Sqoop imports from Oracle and MongoDB into Hadoop and loaded the data back in Parquet format.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data; worked with the MapR distribution and familiar with HDFS.
- Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the processed data back into HDFS.
- Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
- Designed and maintained Test workflows to manage the flow of jobs in the cluster.
- Worked with the testing teams to fix bugs and ensured smooth and error-free code.
- Designed dimensional data models using Star and Snowflake Schemas.
- Prepared documents such as the functional specification document and deployment instruction documents.
- Experience building DevOps pipelines using OpenShift and Kubernetes for the microservices architecture.
- Fixed defects during the QA phase, supported QA testing, troubleshot defects, and identified the source of defects.
- Involved in installing Hadoop Ecosystem components (Hadoop, MapReduce, Spark, Pig, Hive, Sqoop, Flume, Zookeeper and HBase).
- Leveraged and integrated Google Cloud Storage and BigQuery applications, which connected to Tableau for end-user web-based dashboards and reports.
- Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solution from disparate sources.
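One common way to handle join skew in Spark SQL is key salting; the PySpark sketch below illustrates the idea under assumed table and column names (events, customers, customer_id) and is not the project's actual code.

```python
# Hypothetical salted join: spread hot keys on the large side across partitions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew_salting_example").getOrCreate()

SALT_BUCKETS = 16  # placeholder salt factor

facts = spark.table("events")      # large table, skewed on customer_id (placeholder)
dims = spark.table("customers")    # smaller dimension table (placeholder)

# Add a random salt to the skewed side so hot keys land in multiple partitions.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("long"))

# Replicate the dimension side across every salt value so the join still matches.
salted_dims = dims.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

joined = (
    salted_facts.join(salted_dims, on=["customer_id", "salt"], how="inner")
                .drop("salt")
)
```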
Environment: AWS S3, RDS, EC2, Redshift, Hadoop 3.0, Hive 2.3, Pig, Sqoop 1.4.6, Oozie, HBase 1.2, Flume 1.8, Hortonworks, MapReduce, Kafka, HDFS, Oracle 12c, Microsoft, Java, GIS, Spark 2.2, Zookeeper, PySpark, Snowflake.
Confidential, Costa Mesa, CA
Hadoop Developer
Responsibilities:
- Implemented solutions for ingesting data from various sources and processing the data-at-rest, utilizing Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.
- Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
- Developed the full SDLC of an AWS Hadoop cluster based on the client's business needs.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data from relational databases into HDFS using Sqoop imports.
- Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
- Responsible for importing log files from various sources into HDFS using Flume
- Analyzed data using HiveQL to generate payer reports from payment summaries for transmission to payers.
- Imported millions of structured records from relational databases using Sqoop, processed them with Spark, and stored the data in HDFS in CSV format.
- Used the DataFrame API in Scala to work with distributed collections of data organized into named columns.
- Performed data profiling and transformation on the raw data using Pig, Python, and Java.
- Developed predictive analytics using Apache Spark Scala APIs.
- Involved in big data analysis using Pig and user-defined functions (UDFs).
- Created Hive External tables and loaded the data into tables and query data using HQL.
- Implemented Spark Graph application to analyze guest behavior for data science segments.
- Made enhancements to the traditional data warehouse based on a star schema, updated data models, and performed data analytics and reporting using Tableau.
- Involved in the migration of data from existing RDBMSs (Oracle and SQL Server) to Hadoop using Sqoop for processing the data.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
- Developed a prototype for Big Data analysis using Spark, RDDs, DataFrames, and the Hadoop ecosystem with CSV, JSON, Parquet, and HDFS files (a minimal PySpark sketch follows this list).
- Developed Hive SQL scripts for performing transformation logic and loading the data from staging zone to landing zone and Semantic zone.
- Involved in creating Oozie workflow and Coordinator jobs for Hive jobs to kick off the jobs on time for data availability.
- Worked on the Oozie scheduler to automate the pipeline workflow and orchestrate the Sqoop, Hive, and Pig jobs that extract the data in a timely manner.
- Exported the generated results to Tableau for testing by connecting to the corresponding Hive tables using Hive ODBC connector.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
- Managed and led the development effort with the help of a diverse internal and overseas group.
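A minimal PySpark sketch of the CSV-to-analysis flow referenced above (Sqoop-landed CSV in HDFS read with Spark and written out for the downstream zones); paths, column names, and the filter date are illustrative assumptions.

```python
# Hypothetical batch job: read staged CSV from HDFS, transform, write Parquet.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv_to_parquet_example").getOrCreate()

# Read the structured data that Sqoop landed in HDFS as CSV.
orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/staging/orders")                 # placeholder staging path
    .withColumn("order_date", F.to_date(F.col("order_date")))
)

# Example transformation: keep recent records and derive a partition column.
recent = (
    orders
    .where(F.col("order_date") >= F.lit("2019-01-01").cast("date"))
    .withColumn("order_year", F.year(F.col("order_date")))
)

# Write to the landing zone as Parquet, partitioned by year.
recent.write.mode("overwrite").partitionBy("order_year").parquet(
    "hdfs:///data/landing/orders"                       # placeholder landing path
)
```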
Environment: Big Data, Spark, YARN, Hive, Pig, JavaScript, JSP, HTML, Ajax, Scala, Python, Hadoop, AWS, DynamoDB, Kibana, Cloudera, EMR, JDBC, Redshift, NoSQL, Sqoop, MySQL.
Confidential
Hadoop Developer/Admin
Responsibilities:
- Involved in the end-to-end process of Hadoop cluster setup, including installation, configuration, and monitoring of the Hadoop cluster.
- Automated Hadoop cluster setup and implemented Kerberos security for various Hadoop services using Hortonworks.
- Responsible for Cluster maintenance, commissioning and decommissioning Data nodes, Cluster Monitoring, Troubleshooting, Manage and review data backups, Manage & review Hadoop log files.
- Monitoring systems and services, architecture design and implementation of Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
- Installation of various Hadoop Ecosystems and Hadoop Daemons.
- Responsible for Installation and configuration of Hive, Pig, HBase and Sqoop on the Hadoop cluster.
- Configured various property files like core-site.xml, hdfs-site.xml, mapred-site.xml based upon the job requirement
- Involved in loading data from UNIX file system to HDFS, Importing and exporting data into HDFS using Sqoop, experienced in managing and reviewing Hadoop log files.
- Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake ecosystem by creating ETL pipelines using Pig, and Hive.
- Managed and reviewed Hadoop Log files as a part of administration for troubleshooting purposes. Communicate and escalate issues appropriately.
- Extracted meaningful data from dealer CSV files, text files, and mainframe files and generated Python pandas reports for data analysis (see the sketch after this list).
- Developed Python code using version control tools such as GitHub and SVN on Vagrant machines.
- Performed data analysis, feature selection, and feature extraction using Apache Spark machine learning and streaming libraries in Python.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference.
- Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters. Involved in Installing and configuring Kerberos for the authentication of users and Hadoop daemons.
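A hedged pandas sketch of the dealer-file reporting mentioned above; the file name, columns, and aggregation are illustrative assumptions rather than the actual feeds.

```python
# Hypothetical report: load a dealer CSV extract, check data quality, summarize by month.
import pandas as pd

# Load the dealer extract; parsing options depend on the actual feed.
dealers = pd.read_csv("dealer_extract.csv", parse_dates=["sale_date"])  # placeholder file

# Simple data-quality summary: row count and null counts per column.
quality = {
    "row_count": len(dealers),
    "null_counts": dealers.isna().sum().to_dict(),
}

# Monthly sales totals for the analysis report.
monthly_sales = (
    dealers
    .assign(sale_month=dealers["sale_date"].dt.to_period("M"))
    .groupby("sale_month", as_index=False)["sale_amount"]   # placeholder column
    .sum()
)

monthly_sales.to_csv("monthly_sales_report.csv", index=False)
print(quality)
```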
Environment: Hortonworks, Hadoop, HDFS, Pig, Hive, Sqoop, Flume, Kafka, Storm, UNIX, Cloudera Manager, Zookeeper, HBase, Python, Spark, Apache, SQL, ETL
Confidential
Big Data Developer
Responsibilities:
- Involved in the complete SDLC of the big data project, including requirement analysis, design, coding, testing, and production.
- Extensively used Sqoop to import/export data between RDBMS and Hive tables, performed incremental imports, and created Sqoop jobs that track the last saved value.
- Built custom MapReduce programs to analyze data and used Pig Latin to clean unwanted data.
- Installed and configured Hive and wrote Hive UDF to successfully implement business requirements.
- Involved in creating Hive tables, loading data into tables, and writing Hive queries that run as MapReduce jobs.
- Experienced with different compression techniques such as LZO and Snappy to save storage and optimize data transfer over the network for Hive tables.
- Implemented custom interceptors for Flume to filter data and defined channel selectors to multiplex the data into different sinks.
- Experience in working with Spark SQL for processing data in the Hive tables.
- Developed scripts and Tidal jobs to schedule a bundle (a group of coordinators) consisting of various Hadoop programs using Oozie.
- Involved in writing test cases, implementing unit test cases.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Hands-on experience accessing and performing CRUD operations against HBase data using the Java API.
- Analyzed the data by performing Hive queries and running Pig scripts to know user behavior.
- Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala (a Python analogue is sketched after this list).
- Developed Spark applications using Scala for easy Hadoop transitions.
- Extensively used Hive queries to query data according to the business requirement.
- Used Pig for analysis of large data sets and loaded data back into HBase with Pig.
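The MapReduce-to-Spark POC above was done in Scala; as a hedged illustration of the same RDD-transformation pattern, here is a Python analogue of a classic map/reduce job, with placeholder input and output paths.

```python
# Hypothetical word count expressed as Spark RDD transformations instead of MapReduce.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mr_to_rdd_example").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input/logs")            # placeholder input path

counts = (
    lines.flatMap(lambda line: line.split())              # map: tokenize each line
         .map(lambda word: (word, 1))                     # map: emit (key, 1)
         .reduceByKey(lambda a, b: a + b)                 # reduce: sum counts per key
)

counts.saveAsTextFile("hdfs:///data/output/word_counts")  # placeholder output path
```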
Environment: Hadoop, HDFS, MapReduce, Hive, Flume, Sqoop, Pig, MySQL, Ubuntu, Zookeeper, CDH3/4 Distribution, Java Eclipse, Oracle, Shell Scripting.