Big Data Engineer/Hadoop Developer Resume
Chicago, IL
SUMMARY
- Big Data Developer/Hadoop Developer with 8+ years of experience in the Hadoop ecosystem (Hive, Pig, YARN, MapReduce, Impala, Sqoop, Spark, Oozie, ZooKeeper, HBase, Hue, Ambari, Kafka and Flume), providing and implementing solutions for Big Data applications, with excellent knowledge of Hadoop architecture (HDFS, NameNode and DataNode).
- Good knowledge of distributed computing, the Spark Core API and Spark SQL.
- Used various file formats like Avro, Parquet, SequenceFile, JSON, ORC and text for loading data, parsing, gathering and performing transformations.
- Good experience with the Hortonworks and Cloudera distributions of Apache Hadoop.
- Experience with bidirectional data pipelines between HDFS and relational databases using Sqoop.
- Designed and created Hive external tables using a shared metastore with static and dynamic partitioning, bucketing and indexing.
- Expertise in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Python and Java.
- Good knowledge of Pig for loading data, transformations, event joins, filtering, grouping and other aggregation functions.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Familiarity with Python libraries like PySpark, NumPy, Pandas, Starbase and Matplotlib.
- Contributed towards building Apache Spark applications using Python and Scala.
- Wrote complex SQL queries using joins, GROUP BY and nested queries.
- Improved the performance of Hive and Pig queries by running them on Apache Tez.
- Solid capabilities in exploratory data analysis, statistical analysis, and visualization using R, Python, SQL and Tableau.
- Ran and scheduled workflows using Oozie and ZooKeeper, including identifying failures and integrating, coordinating and scheduling jobs.
- Hands-on experience with Kafka and Flume to load log data from multiple sources directly into HDFS.
- Integrated Hadoop with Tableau to generate visualizations such as Tableau dashboards.
TECHNICAL SKILLS
Operating Systems: Linux, Mac OS & Windows
Hadoop Ecosystem: HDFS, MapReduce, Hive, YARN, Pig, Impala, Spark SQL, HBase, Kafka, Sqoop, Flume, Spark Streaming, Oozie, ZooKeeper, Hue, Ambari.
Hadoop Distribution: Hortonworks-2.6.1, Cloudera-5.10.
Programming Languages: R, Python, Linux shell scripts, Java and Scala
Databases: MySQL, MongoDB, Cassandra, Teradata and HBase
Cloud: AWS, Microsoft Azure, Google Cloud
Build Tools: Ant, Maven
Streaming/Real-Time Processing: Apache Spark, Apache Storm
Visualization: Tableau, R, Python
PROFESSIONAL EXPERIENCE
Confidential, Chicago, IL
Big Data Engineer/Hadoop Developer
Responsibilities:
- Ingested data from relational databases into HDFS using Sqoop import/export, and created Sqoop job, eval and incremental import jobs.
- Applied partitioning, bucketing and indexing for optimization as part of Hive data modelling.
- Responsible for installation and configuration of Hive, Pig, Sqoop, Flume and Oozie on the Hadoop Cluster.
- Involved in developing Hive DDLs to create, alter and drop Hive tables.
- Built reusable Hive UDF libraries for business requirements, enabling users to call these UDFs in Hive queries.
- Responsible for analyzing and cleansing raw data by performing Hive queries and running Pig Scripts on data.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs; worked with spark-shell and Spark Streaming.
- Used Spark with Python and Scala, utilizing DataFrames, Datasets and the Spark SQL API for faster processing of data.
- Built a recommendation system using an association rule mining algorithm in Spark MLlib to find frequent buying patterns among customers and recommend products accordingly; also implemented an idea for pruning obvious items.
- Streamed real-time data by integrating Kafka with Spark for dynamic price surging using a machine learning algorithm.
- Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
- Developed Hive and Impala queries for end-user/analyst requirements to perform ad hoc analysis.
- Worked with NoSQL column-oriented databases like HBase and their integration with the Hadoop cluster using connectors.
- Troubleshot, debugged and resolved Talend issues; worked on maintenance and performance of the ETL tools.
- Experienced with the Hue UI for accessing HDFS files and data.
- Developed a data pipeline using Kafka and Spark to store data into HDFS.
- Designed a workflow scheduling Hive processes for log file data streamed into HDFS using Flume.
- Involved in loading data from the UNIX file system to HDFS.
- Extracted, modified and loaded data from files, Oracle and other input sources into HDFS.
- Designed workflows and coordinators, managed in Oozie and Zookeeper to automate and parallelize Hive, Sqoop and Pig jobs in Cloudera Hadoop using XML.
- Experienced in performing in-memory batch processing using Spark Streaming (Spark, Spark SQL and spark-shell).
- Imported and exported data from Oracle and DB2 into HDFS and Hive using Sqoop.
- Used HBase to load data using connectors and wrote NoSQL queries.
- Involved in building the runnable jars for the module framework through Maven clean, Maven dependencies.
- Integrated BI SSIS with Hadoop and performed ETL operations.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Developed SQL scripts to compare all the records for every field and table at each phase of the data movement process from the original source system to the final target.
- Pre-processed large sets of structured and semi-structured data in different formats like text files, Avro, Parquet, ORC, SequenceFiles and JSON records.
- Responsible for continuous monitoring and managing the Hadoop Cluster using Cloudera Manager.
- Created customized Tableau dashboards, integrating custom SQL from Hadoop and performing data blending in reports.
Environment: Cloudera CDH 5.8, Linux, HDFS, MapReduce, Shell Scripting, Java, Talend, Hive, Pig, Spark, Storm, Impala, Sqoop, Flume, Oozie, Kafka, Eclipse, Apache Tez, YARN, ETL, Maven, Tableau.
Confidential, Nashville, TN
Big Data Developer
Responsibilities:
- Worked in an Agile team to deliver and support required business objectives, using Python, shell scripting and other related technologies to acquire, ingest, transform and publish data both to and from the Hadoop ecosystem.
- Extracted data from MySQL into HDFS using Sqoop export/import, handled importing of data from various data sources, performed transformations using Pig and loaded data into HDFS.
- Assisted application teams in installing Hadoop updates, operating system patches and version upgrades when required.
- Imported data from RDBMS to Hadoop using Sqoop import.
- Hands-on experience in loading data from the UNIX file system to HDFS and vice versa.
- Performed transformations using Python and Scala to analyze and gather the data in the format required by the customer.
- Developed ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
- Used a combination of Flume and Kafka to get log data from web and mobile app servers.
- Worked on migrating custom workflows built using tools in Hadoop to a third-party tool.
- Used Pig as ETL tool to do transformations, event joins, filtering and some pre-aggregations before storing the data into HDFS.
- Wrote SQL queries and PL/SQL stored procedures/functions for relational databases like Oracle and SQL Server, and for graph databases.
- Wrote Machine Learning algorithms using Spark MLlib library.
- Created Hive external and internal tables on top of data in HDFS using various SerDes.
- Created Hive tables using ORC for faster access and compression in data modelling.
- Ran analytic queries and gathered stats for Hive tables using Impala.
- Solved performance issues in Hive and Pig scripts through an understanding of joins, grouping and aggregation and how they translate to MapReduce jobs.
- Proficient in designing row keys and schemas for the NoSQL database HBase, with knowledge of the NoSQL database Cassandra.
- Worked on creating and automating reports in Excel using data imported from Hive via ODBC.
- Wrote RDD and DataFrame code to process data in Spark before it is ingested for different uses like reporting.
- Worked on Python scripts to help our team internally with data management.
- Migrated complex MapReduce programs into Spark RDD transformations and actions.
- Developed RDDs using Python and coded Python applications for business requirements.
- Created workflows to automate the batch jobs using third party tools.
- Created CloudFormation templates to build a repeatable process for standing up various application deployment environments in AWS, such as EC2 and EMR.
- Provisioned, monitored and maintained AWS EC2 instances, oversaw security, and managed AWS S3 bucket storage in the AWS cloud environment.
- Wrote Unix/Linux shell scripts for scheduling jobs, along with Pig scripts and HiveQL.
- Experience running Spark algorithms on an EMR cluster through PuTTY.
- Worked in the production support team to ensure data availability, data quality and data integrity for the enterprise.
- Participated in regular stand-up meetings, status calls and business owner meetings with stakeholders and risk management teams in an agile environment.
- Supported code/design analysis, strategy development and project planning.
- Followed Scrum implementation of scaled agile methodology for entire project.
Environment: Cloudera CDH4 and CDH5, Elasticsearch, AWS EC2, Hadoop, Spark, Kafka, Flume, Sqoop, Hive, Impala, HBase, R, RStudio, Scala and Python, AWS, ZooKeeper, Shell Scripting, Oozie, ETL, SQL and Tableau
Confidential, Charlotte, NC
Hadoop Developer
Responsibilities:
- Responsible for loading the customer’s data and event logs from Kafka into HBase using REST API.
- Worked on debugging, performance tuning and analyzing data using the Hadoop components Hive and Pig.
- Imported streaming data using Apache Storm and Apache Kafka into HBase and designed Hive tables on top.
- Created Hive tables from JSON data using a data serialization framework like Avro.
- Developed multiple POCs using PySpark, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Deployed Hadoop cluster using Hortonworks with Pig, Hive, HBase and Spark.
- Developed a RESTful web service using Spring Boot and deployed it to Pivotal Web Services.
- Used build and deployment tools like Maven.
- Involved in Test Driven Development (TDD).
- Developed Kafka producer and consumers, HBase clients, Spark and Hadoop MapReduce jobs along with components on HDFS, Hive.
- Importing and exporting data into HDFS and Hive using Sqoop.
- Responsible for processing unstructured data using Pig and Hive.
- Managed and reviewed Hadoop log files. Used Scala to integrate Spark into Hadoop.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Extensively used Pig for data cleansing and Hive queries for the analysts.
- Created Pig script jobs, maintaining minimal query optimization.
- Very good understanding of Partitions, Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
- Worked on various Business Object Reporting functionalities such as Slice and Dice, Master/detail, User Response function and different Formulas.
Environment: Hortonworks HDP, Linux, Hadoop, HDFS, Pig, Hive, HBase, MapReduce, Sqoop, Oozie, Spark, Hue, Teradata, Java APIs, Java Collections, SQL, Business Objects XI R2, Apache Storm, PySpark, Spring Boot, Maven, Kafka, Scala, Spark SQL.
Confidential
Data Analyst
Responsibilities:
- Experienced in loading and transforming large sets of structured, semi-structured and unstructured data from RDBMS through Sqoop and placing it in HDFS for further processing.
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Built and maintained scalable data pipelines using the Hadoop ecosystem and other open-source components like Hive.
- Managed and scheduled jobs on a Hadoop cluster using Oozie.
- Created tables using Hive and queries are performed using HiveQL which will invoke and run
- Involved in creating Hive tables, loading data and running Hive queries on that data.
- Extensive working knowledge of partitioned tables, UDFs, performance tuning, compression-related properties and the Thrift server in Hive.
- Involved in writing optimized Pig scripts and in developing and testing Pig Latin scripts.
- Working knowledge in writing Pig's Load and Store functions.
- Developed SQL queries to join tables in MySQL and prepare data for statistical models
- Prepared reports using Excel Pivot Tables and Pivot Charts
- Assimilated and stitched unstructured customer data into the MySQL database, ensuring consistency
- Created MySQL database to capture online enquiries placed on website
Environment: Hortonworks, Hadoop, Ambari, HDFS, Sqoop, Hive, HBase, Pig, Oozie, MySQL, Flume, SQL and Tableau.
Confidential
Software Analyst
Responsibilities:
- Created extract files to improve performance. Used different mark types and mark properties in views to provide better insights into large data sets.
- Created action filters, parameters and calculated sets for preparing dashboards and worksheets
- Responsible for gathering and analyzing the business requirements and translating them into technical report specifications.
- Designed Tableau reports, graphs and dashboards as per requirements.
- Created Tableau scorecards and dashboards using stacked bars, bar graphs, scatter plots, geographical maps and Gantt charts with the Show Me functionality.
- Delivered reports to the business team in a timely manner.
- Worked on functional requirements sessions with business and technology stakeholders on data modeling, integration, and configuration for a data warehouse with automated and manual field data collection systems.
- Created SQL queries for testing the Tableau dashboards.
- Migrated workbooks and performed Tableau upgrade/migration work. Implemented new Tableau features in the existing workbooks and dashboards.
- Created dashboards with interactive views, trends and drill-downs. Published workbooks and dashboards to the Tableau server.
- Combined visualizations into interactive dashboards and published them to the web.
- Involved in installation and configuration of Tableau Server.
- Published dashboards and extracts to the Tableau server.
- Defined best practices for Tableau report development.
- Developed training plan to cross train new team members and facilitated knowledge base content management.
Environment: Tableau, Dashboards, Tableau Desktop, Tableau Server, SQL, MS-Excel and MS-Office.
