Big Data Engineer / Hadoop Developer Resume
Chicago
SUMMARY:
- Big Data/Hadoop Developer with eight-plus years of experience in the Hadoop ecosystem (Hive, Pig, YARN, MapReduce, Impala, Sqoop, Spark, Oozie, ZooKeeper, HBase, Hue, Ambari, Kafka, and Flume), providing and implementing solutions for Big Data applications, with excellent knowledge of Hadoop architecture (HDFS, NameNode, and DataNode).
- Good experience with the Hortonworks and Cloudera distributions of Apache Hadoop, administered over PuTTY.
- Experience building bi-directional data pipelines between HDFS and relational databases with Sqoop; created Sqoop jobs and ran import, export, and eval operations.
- Hands-on experience in writing MapReduce programs and Pig & Hive scripts.
- Designed and created Hive external tables using a shared metastore with static & dynamic partitioning, bucketing, and indexing.
- Expertise in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Python and Java.
- Good knowledge of Pig for loading data, transformations, event joins, filters, grouping, and other aggregation functions.
- Experience using Spark to improve the performance and optimize existing algorithms in Hadoop, working with SparkContext, Spark SQL, DataFrames, and pair RDDs (see the sketch after this list).
- Good exposure to running Spark on YARN with Python and Scala.
- Contributed to building Apache Spark applications using Python and Scala.
- Experience writing complex SQL queries using joins, GROUP BY, and nested queries.
- Experience loading data into HBase using connectors and writing queries against it as a NoSQL store.
- Improved the performance of Hive and Pig queries by running them on Apache Tez.
- Parsed and gathered data from popular file formats such as CSV, JSON, XML, and HTML.
- Experience in writing machine learning algorithms using Spark MLlib library.
- Ran and scheduled workflows using Oozie and ZooKeeper, identifying failures and integrating jobs.
- Hands-on experience with Kafka and Flume for loading log data from multiple sources directly into HDFS.
- Solid capabilities in exploratory data analysis, statistical analysis, and visualization using R, Python, SQL, and Tableau.
- Integrated Hadoop with Tableau to generate visualizations like Tableau Dashboards.
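The following is a minimal PySpark sketch of the Spark SQL / DataFrame / pair-RDD work described above; the Hive table name ("sales") and its columns are hypothetical placeholders, not from an actual engagement.

```python
# Minimal PySpark sketch: DataFrames, Spark SQL, and a pair-RDD aggregation.
# Table/column names ("sales", "region", "amount") are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("summary-example")
         .enableHiveSupport()      # read Hive tables through the shared metastore
         .getOrCreate())

# DataFrame API over a Hive table
sales = spark.table("sales")
sales.groupBy("region").sum("amount").show()

# Equivalent Spark SQL
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

# Same aggregation expressed with a pair RDD (reduceByKey)
pair_rdd = sales.rdd.map(lambda row: (row["region"], row["amount"]))
print(pair_rdd.reduceByKey(lambda a, b: a + b).take(5))
```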
TECHNICAL SKILLS:
Operating Systems: Linux, Mac OS & Windows
Hadoop eco system: HDFS, MapReduce, Hive, Yarn, Pig, Impala, Spark SQL, HBase, Kafka, Sqoop, Flume, Spark Streaming, Oozie, Zookeeper, Hue, Ambari.
Hadoop Distribution: Hortonworks-2.6.1, Cloudera-5.10.
Programming Languages: R, Python, Linux shell scripting, Java and Scala
Databases: MySQL, MongoDB, Cassandra, Teradata and HBase
ETL: Talend
Cloud: AWS, Microsoft Azure, Google Cloud
Build Tools: Ant, Maven
Processing: Apache Spark, Apache Storm
Visualization: Tableau, R, Python
PROFESSIONAL EXPERIENCE:
Big Data Engineer/ Hadoop Developer
Confidential, Chicago
Responsibilities:
- Ingested data from relational databases into HDFS using Sqoop import/export, and created Sqoop jobs, eval statements, and incremental import jobs.
- Applied partitioning, bucketing, and indexing for optimization as part of Hive data modelling.
- Involved in developing Hive DDLs to create, alter and drop Hive tables.
- Built reusable Hive UDF libraries for business requirements, which enabled users to apply these UDFs in Hive queries.
- Responsible for analyzing and cleansing raw data by performing Hive queries and running Pig Scripts on data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and experience in using Spark-Shell and Spark Streaming.
- Used Spark with Python and Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Built a recommendation system using association rule mining in Spark MLlib to find frequent buying patterns among customers and recommend products accordingly; also implemented pruning of obvious items (see the sketch after this list).
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed formats.
- Developed Hive and Impala queries for end-user/analyst requirements to perform ad-hoc analysis.
- Strong knowledge of NoSQL column-oriented databases such as HBase and Cassandra, and of their integration with the Hadoop cluster using connectors.
- Experienced with the Hue UI for accessing HDFS files and data.
- Streamed real time data by integrating Spark with Kafka for dynamic price surging.
- Developed a data pipeline using Kafka and Spark to store data into HDFS.
- Designed workflow by scheduling Hive processes for Log file data which is streamed into HDFS using Flume.
- Designed workflows and coordinators, managed in Oozie and Zookeeper to automate and parallelize Hive, Sqoop and Pig jobs in Cloudera Hadoop using XML.
- Involved in building runnable JARs for the module framework through Maven clean and Maven dependency management.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Developed SQL scripts to compare all the records for every field and table at each phase of the data movement process from the original source system to the final target.
- Pre-processed large sets of structured and semi-structured data, with different formats like Text Files, Avro, Parquet, ORC, Sequence Files, and JSON Record.
- Used Talend to build data integration workflows that clean, transform and integrate data on Hadoop on top of YARN.
- Responsible for continuous monitoring and managing the Hadoop Cluster using Cloudera Manager.
- Creating customized Tableau Dashboards, integrating Custom SQL from Hadoop and performing data blending in reports.
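A minimal sketch of the association-rule mining mentioned above, using Spark MLlib's FP-Growth; the sample transactions, support/confidence thresholds, and the pruning filter are illustrative assumptions rather than the production logic.

```python
# Sketch: frequent-pattern mining with Spark MLlib FP-Growth (pyspark.ml.fpm).
# Transactions, thresholds, and the pruning rule below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("assoc-rules").getOrCreate()

transactions = spark.createDataFrame(
    [(0, ["bread", "milk"]),
     (1, ["bread", "milk", "eggs"]),
     (2, ["milk", "eggs"]),
     (3, ["bread", "eggs"])],
    ["basket_id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.4, minConfidence=0.6)
model = fp.fit(transactions)

# Prune "obvious" rules, e.g. keep only rules with multi-item antecedents.
rules = model.associationRules.filter("size(antecedent) > 1")

model.freqItemsets.show()
rules.show()
```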
Environment: Cloudera CDH, Linux, HDFS, MapReduce, Shell Scripting, Java, Talend, Hive, Pig, Spark, Storm, Impala, Sqoop, Flume, Oozie, Kafka, Eclipse, Apache Tez, YARN, DistCp, Maven, Tableau.
Big Data Developer
Confidential, Washington, D.C.
Responsibilities:
- Working in an Agile team to deliver and support required business objectives by using Python, Shell Scripting and other related technologies to acquire, ingest, transform and publish data both to and from Hadoop Ecosystem.
- Extracted data from MySQL into HDFS using Sqoop import/export; handled importing of data from various data sources, performed transformations using Pig, and loaded the data into HDFS.
- Assisted application teams in installing Hadoop updates, operating system patches, and version upgrades when required.
- Imported data from RDBMS to Hadoop using Sqoop import.
- Hands-on experience in loading data from the UNIX file system to HDFS and vice versa.
- Performed transformations using Python and Scala to analyze and deliver data in the format required by the customer.
- Used a combination of Flume and Kafka to get log data from web and mobile app servers.
- Worked on migrating custom workflows built using tools in Hadoop to a third-party tool.
- Created Hive external and internal tables on top of data in HDFS using various SerDes.
- Created Hive tables using ORC for faster access and compression as part of data modelling.
- Ran Analytic queries and gathered stats for tables in hive using Impala.
- Solved performance issues in Hive and Pig scripts with understanding of Joins, Group and aggregation and how it translates to MapReduce jobs.
- Proficient in designing row keys and schemas for the NoSQL database HBase, with knowledge of another NoSQL database, Cassandra.
- Worked on creating and automating reports in Excel using data imported from Hive via ODBC.
- Wrote RDD and DataFrame transformations to process data in Spark before it is ingested for different uses such as reporting.
- Worked on Python scripts to help our team internally with data management.
- Migrated complex MapReduce programs into Spark RDD transformations and actions (see the sketch after this list).
- Developed RDDs using Python and coded Python applications for business requirements.
- Created workflows to automate the batch jobs using third party tools.
- Used R and R Studio for statistical models, machine learning algorithms and creating executive reports.
- Used statistical inference, linear regression and maximization techniques.
- Configured cluster files, Direct Connect, and VPN with AWS VPC.
- Created CloudFormation templates to build a repeatable process for standing up application deployment environments in AWS.
- Worked in production support team to ensure data availability, data quality and data integrity for the enterprise.
- Participated in regular stand-up meetings, status calls, and business-owner meetings with stakeholders and risk-management teams in an Agile environment.
- Supported code/design analysis, strategy development and project planning.
- Followed Scrum implementation of scaled agile methodology for entire project.
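A minimal sketch of the MapReduce-to-Spark migration mentioned above: a classic log-level count expressed as RDD transformations and actions; the HDFS path and log line format are hypothetical.

```python
# Sketch: a MapReduce-style count rewritten as Spark RDD transformations/actions.
# The HDFS path and log format ("<timestamp> <LEVEL> <message>") are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="log-level-counts")

lines = sc.textFile("hdfs:///data/app/logs/*.log")  # hypothetical input path

level_counts = (lines
                .map(lambda line: line.split())
                .filter(lambda parts: len(parts) >= 2)
                .map(lambda parts: (parts[1], 1))     # map step: (LEVEL, 1)
                .reduceByKey(lambda a, b: a + b))     # reduce step: sum per LEVEL

for level, count in level_counts.collect():          # action triggers the job
    print(level, count)

sc.stop()
```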
Skills: Cloudera CDH 5.1, Hadoop, Spark, Kafka, Flume, Sqoop, Hive, Impala, HBase, R, R Studio, Scala, Python, AWS, ZooKeeper, Shell Scripting, Oozie, SQL and Tableau
Big Data Developer
Confidential, Charlotte
Responsibilities:
- Responsible for loading the customer’s data and event logs from Kafka into HBase using REST API.
- Worked on debugging, performance tuning, and analyzing data using the Hadoop components Hive and Pig.
- Imported streaming data using Apache Storm and Apache Kafka into HBase and designed Hive tables on top.
- Created Hive tables from JSON data using data serialization frameworks like Avro.
- Developed multiple POCs using PySpark, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Deployed Hadoop cluster using Cloudera Hadoop 4 (CDH4) with Pig, Hive, HBase and Spark.
- Developed restful webservice using Spring Boot and deployed to pivotal web services.
- Used build and deployment tools like Maven.
- Involved in Test Driven Development (TDD).
- Developed Kafka producers and consumers, HBase clients, and Spark and Hadoop MapReduce jobs, along with components on HDFS and Hive (see the sketch after this list).
- Importing and exporting data into HDFS and Hive using Sqoop.
- Responsible for processing unstructured data using Pig and Hive.
- Managed and reviewed Hadoop log files. Used Scala for integrating Spark into Hadoop.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Extensively used Pig for data cleansing and HIVE queries for the analysts.
- Created Pig script jobs while maintaining query optimization.
- Very good understanding of Partitions, Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
- Worked on various Business Object Reporting functionalities such as Slice and Dice, Master/detail, User Response function and different Formulas.
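A minimal sketch of a Kafka producer/consumer pair like those mentioned above, written with the kafka-python client; the client library, broker address, topic name, and message schema are illustrative assumptions.

```python
# Sketch: a simple Kafka producer/consumer pair using the kafka-python client.
# Broker address ("localhost:9092") and topic name ("events") are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a JSON event
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("events", {"user_id": 42, "action": "click"})
producer.flush()

# Consumer: read events from the beginning of the topic
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000)  # stop iterating after 5s of inactivity

for message in consumer:
    print(message.topic, message.offset, message.value)
```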
Skills: Linux, Hadoop, HDFS, Pig, Hive, HBase, MapReduce, Sqoop, Oozie, Spark, Hue, Teradata, Java APIs, Java Collections, SQL, Business Objects XI R2, Apache Storm, PySpark, Spring Boot, Maven, Kafka, Scala, Spark SQL.
Data Analyst
Confidential
Responsibilities:
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data from RDBMSs through Sqoop, placed in HDFS for further processing.
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Built and maintained scalable data pipelines using the Hadoop ecosystem and other opensource components like Hive.
- Managing and scheduling of Jobs on a Hadoop cluster using Oozie.
- Created tables using Hive and performed queries using HiveQL, which invokes and runs MapReduce jobs under the hood.
- Involved in creating Hive tables, loading data, and running Hive queries on that data.
- Extensive working knowledge of partitioned tables, UDFs, performance tuning, compression-related properties, and the Thrift server in Hive (see the sketch after this list).
- Involved in writing, developing, and testing optimized Pig Latin scripts.
- Working knowledge in writing Pig's Load and Store functions.
- Developed SQL queries to join tables in MySQL and prepare data for statistical models.
- Prepared reports using Excel Pivot Tables and Pivot Charts.
- Assimilated and stitched unstructured customer data into a MySQL database, ensuring consistency.
- Created a MySQL database to capture online enquiries placed on the website.
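A minimal sketch of the partitioned Hive table work described above, submitted from Python through the PyHive client; the client library, HiveServer2 host, and table/column names are illustrative assumptions (the same DDL could be run directly from the Hive CLI or Beeline).

```python
# Sketch: create and query a partitioned Hive table through HiveServer2 (Thrift).
# The PyHive client, host, and table/column names are hypothetical choices.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# Partitioned, ORC-backed table (schema is illustrative)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS web_logs (
        user_id STRING,
        url     STRING,
        status  INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
""")

# Query a single partition so only that partition's data is scanned
cursor.execute(
    "SELECT status, COUNT(*) FROM web_logs "
    "WHERE event_date = '2016-01-01' GROUP BY status")
for status, cnt in cursor.fetchall():
    print(status, cnt)

conn.close()
```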
Skills: Hortonworks, Hadoop, Ambari, HDFS, Sqoop, Hive, HBase, Pig, Oozie, MySQL, Flume, SQL and Tableau
Software Analyst
Confidential
Responsibilities:
- Created extract files for improving the performance. Used different Mark types and Mark properties in views to provide better insights into large data sets.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets.
- Responsible for gathering and analyzing the business requirements and translating them into technical report specifications.
- Designed Tableau Reports, graphs and dashboards as per requirements.
- Created Tableau scorecards, dashboards using stack bars, bar graphs, scattered plots, geographical maps, Gantt charts using show me functionality.
- Delivered reports to the business team in a timely manner.
- Worked on functional-requirements sessions with business and technology stakeholders on data modeling, integration, and configuration for the data warehouse with automated and manual field data collection systems.
- Created SQL queries for testing the Tableau dashboards.
- Migrated workbooks and handled Tableau upgrade/migration work; implemented new Tableau features in existing workbooks and dashboards.
- Created Dashboards with interactive views, trends and drill downs. Published Workbooks and Dashboards to the Tableau server.
- Combined visualizations into interactive dashboards and published them to the web.
- Involved in installation and configuration of Tableau Server.
- Publishing dashboards and extracts to Tableau server.
- Defined best practices for Tableau report development.
- Developed training plan to cross train new team members and facilitated knowledge base content management.
Skills: Tableau, Dashboards, Tableau Desktop, Tableau Server, SQL, MS-Excel and MS-Office.