Hadoop/Big Data Developer Resume
Charlotte, North Carolina
SUMMARY
- Data Warehouse (ETL, BI) and Big Data Developer passionate about technology, who enjoys working with people and exploring the data architecture landscape.
- 7+ years of experience with technologies such as Big Data, Pentaho, Amazon Redshift, S3, EC2, Tableau, and Business Objects, and with databases such as Oracle, DB2, Vertica, MySQL, and Redshift.
- Extensive experience in installing, configuring, and architecting Hadoop and Hortonworks clusters and services - HDFS, MapReduce, Yarn, Pig, Hive, HBase, Spark, Sqoop, Flume, and Oozie.
- Experienced in loading data into Hive partitions and creating buckets in Hive, as well as developing MapReduce jobs to automate the transfer of data from HBase (see the Hive partitioning sketch after this summary).
- Scheduled all Hadoop/Hive/Sqoop/HBase jobs using Oozie, and set up clusters on Amazon EC2 and S3, including automating the process of setting up and extending clusters in AWS.
- Extensive experience with the Apache Spark RDD, DataFrame, and Streaming APIs on the Cloudera platform for Hive data analytics.
- Expert in integrating Hadoop with Kafka, uploading clickstream data to HDFS, and using Kafka as a publish-subscribe messaging system.
- Hands-on experience with network protocols and services such as FTP, SSH, HTTP/HTTPS, TCP/IP, DNS, VPNs, and firewall groups, as well as monitoring tools like Splunk and Nagios.
- Familiar with Pentaho components including Database Lookup & Join, Calculator, Row Normalizer & Denormalizer, JavaScript, Add Constants, and Add Sequence.
- Worked on a Continuous Integration framework using Jenkins, and developed interactive reports using Tableau and the Pentaho BA tool to support clients' monthly statistical analysis and decision making.
- Good knowledge of Talend Big Data, Hadoop, and Hive, including Talend big data components such as tHDFSOutput, tHDFSInput, and tHiveLoad, and creation of complex mappings using Joblets, tMap, tJoin, tReplicate, tParallelize, tJava, tJavaFlex, tAggregateRow, and XML components.
- Implemented change data capture techniques with slowly growing target and simple pass-through mappings, implemented slowly changing dimensions (SCD) Type 1 and Type 2, and created mappings to populate dimension and fact tables.
- Proficient in data modeling techniques using Star Schema, Snowflake Schema, Fact and Dimension tables, RDBMS, and Physical and Logical Modeling for Data Warehouses and Data Marts.
- Experience with all phases of the Data Warehouse life cycle, including Requirement Analysis, Design, Coding, Testing, and Deployment.
- Expertise in data visualization with Tableau, including line and scatter plots, bar charts, histograms, pie charts, dot plots, box plots, time series, error bars, multiple chart types, multiple axes, and subplots.
- Experience with object-oriented programming (OOP) using Python, Scala, and Java, as well as statistical software, such as SAS, SPSS, MATLAB, and R.
- Experience with SQL and NoSQL databases such as MongoDB, Cassandra, Redis, CouchDB, and DynamoDB, including installing and configuring the corresponding Python client packages.
- Experience managing a variety of file formats, such as text files, sequence files, and ORC data files, using different SerDes in Hive.
- Hands-on experience with J2EE-based technologies (JSP, Servlets, JDBC, JMS, and EJB); installed Jira (various versions from 5.x) on Ubuntu and Amazon EC2, and created Jira Agile boards, assigned workflows, customized screens, and created custom fields in Jira.
- Extensive experience working with application servers, such as WebLogic, Tomcat, and Apache Server, as well as web servers, such as NGINX.
- Exceptional interpersonal and communication skills, effective time management and organization skills, and ability to work well in a team environment.
- Ability to combine technical expertise with strong Conceptual, Business, and Analytical skills to deliver quality solutions to complex problems and lead by example.
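As a concrete illustration of the Hive partitioning and bucketing work referenced above, here is a minimal PySpark sketch; the database, table, and column names (analytics.events, staging.raw_events, user_id, load_date) are hypothetical placeholders, not actual project objects.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Partitioned table for daily event loads (hypothetical names).
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events (
        user_id BIGINT,
        event_type STRING,
        event_ts TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
""")

# Dynamic-partition insert: each distinct load_date lands in its own partition.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE analytics.events PARTITION (load_date)
    SELECT user_id, event_type, event_ts, load_date
    FROM staging.raw_events
""")

# Bucketed variant (DDL only): bucketing by a high-cardinality key such as
# user_id helps with joins and sampling; bucketed loads are typically run
# from Hive itself with bucketing enforcement enabled.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events_bucketed (
        user_id BIGINT,
        event_type STRING,
        event_ts TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC
""")
```

Partitioning by load date keeps each daily load isolated and prunes scans, while bucketing spreads rows evenly for join and sampling performance.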
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop MapReduce, Impala, HDFS, Hive, Pig, HBase, Flume, Storm, Sqoop, Oozie, Kafka, Spark, and Zookeeper
Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon EMR (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, DynamoDB, Redshift, ECS, QuickSight), Azure HDInsight (Databricks, Data Lake, Blob Storage, Data Factory ADF, SQL DB, SQL DWH, Cosmos DB, Azure AD)
Programming Languages: Python, R, Scala, SAS, Java, SQL, PL/SQL, UNIX shell Scripting, Pig Latin
Machine Learning: Regression (Linear, Logistic, Ridge, Lasso, Polynomial, Bayesian), Classification (Decision Trees, Random Forest, SVM, KNN, Stochastic Gradient Descent, Adaboost, Naïve Bayes) and Clustering (K-means, Hierarchical, Density-Based).
Deep Learning: Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), LSTM, GRUs, GANs.
Databases: Snowflake, MySQL, Teradata, Oracle, MS SQL SERVER, PostgreSQL, DB2
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB
Version Control: Git, SVN, Bitbucket
Web Development: JavaScript, Node.js, HTML, CSS, Spring, J2EE, JDBC, Okta, Postman, Angular, JFrog, Mockito, Flask, Hibernate, Maven, Tomcat, WebSphere.
ETL/BI: Informatica, SSIS, SSRS, SSAS, Tableau, Power BI, QlikView, Arcadia.
Operating System: Mac OS, Windows 7/8/10, Unix, Linux, Ubuntu
Methodologies: RAD, JAD, UML, System Development Life Cycle (SDLC), Jira, Confluence, Agile, Waterfall Model
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, North Carolina
Hadoop/Big Data Developer
Responsibilities:
- Engaged extensively in installing and configuring the Cloudera Distribution Hadoop platform; configured MapReduce, HDFS, Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Implemented pipelines in Azure Data Factory using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools.
- Installed a Hortonworks Hadoop cluster on the Confidential Azure cloud to satisfy the customer's requirements for data locality.
- Imported data from various sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Converted unstructured data to structured data through MapReduce and inserted the data into HBase from HDFS.
- Familiarity with Database Design, Entity Relationships, Database Analysis, PL/SQL stored procedures, packages, and triggers on Windows and Linux.
- Responsible for creating new Azure subscriptions, data factories, virtual machines, SQL Azure instances, SQL Azure DW instances, and HDInsight clusters, and for installing DMGs on VMs to connect to on-premises servers.
- Analyzed the data flow from multiple sources to the target to determine the appropriate design architecture for Azure.
- Created Hive tables, loaded data, and wrote Hive queries that run internally as MapReduce jobs.
- Involved in the Development of Pig Latin scripts and Pig command line transformations for data joins and customized Map Reduce outputs.
- Developed MapReduce jobs for generating reports for the number of activities created on a day during an import from multiple sources and wrote the results back to HDFS.
- Used Sqoop to migrate data between HDFS and MySQL/Oracle, and integrated Hive with HBase for OLAP operations on HBase data.
- Extracted Oracle data and landed it on HDFS as Avro files, then converted it to Parquet format to address performance issues by loading the data into Hive/Impala as Parquet tables.
- Implemented Gzip compression to free up space in the cluster and Snappy compression on tables to reclaim storage.
- Applied partitioning and bucketing in Hive and optimized performance for both managed and external tables.
- Developed applications for data extraction, transformation, and aggregation using PySpark and Spark SQL, analyzing and transforming the data to gain insight into customer usage patterns (see the sketch after this list).
- Wrote PySpark code for all Spark use cases, used Scala extensively to perform data analytics on Spark clusters, and performed map-side joins with RDDs.
- Assisting with the setup of QA environments and updating configurations for implementing Sqoop and Pig scripts.
- Transformed, cleaned, and filtered imported data using Spark Data Frame API, Hive, MapReduce, and loaded final data into Hive.
- Understanding of data warehousing techniques, Star/Snowflake schemas, ETL, Fact/Dimension tables, OLAP and Report delivery methods.
- Worked with end users to design and implement analytical solutions in R based on recommendations in the project proposals.
- Experience with coordinating clusters and scheduling workflows using Zookeeper and OOZIE Operational Services.
- Integrated Oozie with the rest of the Hadoop stack (Java MapReduce, Pig, Hive, Sqoop) as well as system-specific jobs such as Java programs and shell scripts.
- Familiar with Azure web services and experienced working on related projects; solid understanding of the software development life cycle, agile methodologies, and test-driven development.
- Met weekly with technical collaborators and actively participated in code reviews with senior and junior developers.
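A minimal sketch of the PySpark/Spark SQL extraction and aggregation work described in the bullets above (customer usage patterns); the HDFS path, column names, and output table are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("usage-pattern-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical raw usage logs landed on HDFS by an upstream ingest job.
usage = spark.read.option("header", "true").csv("hdfs:///data/raw/usage_logs/")

# Clean and type the columns of interest.
usage = (usage
         .withColumn("event_ts", F.to_timestamp("event_ts"))
         .withColumn("duration_sec", F.col("duration_sec").cast("double"))
         .filter(F.col("customer_id").isNotNull()))

# Aggregate daily usage per customer to surface usage patterns.
daily_usage = (usage
               .groupBy("customer_id", F.to_date("event_ts").alias("usage_date"))
               .agg(F.count("*").alias("events"),
                    F.sum("duration_sec").alias("total_duration_sec")))

# Persist the aggregate to Hive for downstream reporting.
daily_usage.write.mode("overwrite").saveAsTable("analytics.daily_customer_usage")
```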
Environment: Azure cloud, Azure Data Lake, Azure Data Factory, Azure SQL Database, Hadoop, HDFS, MapReduce, Pig, Spark, Sqoop, HBase, Oozie, MySQL, PuTTY, Zookeeper, UNIX, and Shell scripting.
Confidential, Bethesda, Maryland
Data Engineer
Responsibilities:
- Constructed AWS data pipelines with various AWS resources, including an AWS API Gateway endpoint that invokes an AWS Lambda function to retrieve data from Snowflake and convert the response into JSON format, using Snowflake, DynamoDB, AWS Lambda, and AWS S3 storage (see the sketch after this list).
- Implemented and developed Scala programs, executed with Sqoop, Hive, and Pig and orchestrated through Oozie workflows, to optimize MR jobs and use HDFS efficiently via multiple compression mechanisms.
- Developed Spark workflows using Scala for data pull from AWS S3 bucket and applied transformations in Snowflake.
- Migrated an existing on-premises application to AWS; utilized AWS services like EC2 and S3 for small data set processing and storage, and maintained Hadoop clusters on AWS EMR.
- Analyzed large datasets using Cloudera, HDFS, MapReduce, Hive, Hive UDFs, Pig, Sqoop, and Spark. Managed the source code using Git version control, integrated Git with Jenkins to support build automation, and monitored commits in Jira.
- Implemented Terraform scripts to automate AWS services including ELB, CloudFront distribution, RDS, EC2, database security groups, Route 53, VPC, subnets, security groups, and S3 buckets. Converted existing AWS infrastructure to AWS Lambda via Terraform and AWS CloudFormation.
- Worked on setting up a Data Lake/Data Catalog on AWS Glue; created a data lake from S3 queried through AWS Athena for visualization in AWS QuickSight by constructing models and building out the data lake.
- Working knowledge of Hadoop, Hive, PIG, Sqoop, Kafka, AWS EMR, AWS S3, AWS Redshift, Oozie, Flume, HBase, Hue, HDP, IBM Mainframes, HP Nonstop, and RedHat 5.6.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake in AWS S3 buckets.
- Responsible for designing, developing, and executing complex T-SQL queries (DDL / DML), stored procedures, views, and functions that are part of transactional and analytical databases.
- Worked with Sqoop to migrate RDBMS tables to Hive tables, and later used Tableau to generate data visualizations.
- Implemented ETL processes with Data engineers and operations team and used Snowflake models to write and optimize SQL queries for data extraction to meet client needs.
- Participated in developing detailed Test strategy, Test plan, Test cases, and Test procedures for Functional and Regression Testing using Quality Center.
- Interacting with business customers, gathering requirements, and creating data sets that will be used to visualize business data.
- Expertise in migrating enterprise data (Trust Data) and stored procedures from Microsoft SQL Server to AWS Redshift using AWS Glue and S3.
- Proven expertise in data modeling, ETL development, and data warehousing according to project requirements.
- Used AWS Data Pipeline for extracting, transforming, and loading data from heterogeneous and homogeneous data sources, and built graphs for business decisions using the Python Matplotlib library.
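A minimal sketch of the API Gateway → Lambda → Snowflake pattern from the first bullet of this list, assuming the snowflake-connector-python package is bundled with the Lambda; the environment variables and query are hypothetical placeholders, and the real pipeline also involved DynamoDB and S3.

```python
import json
import os

import snowflake.connector  # snowflake-connector-python, packaged with the Lambda


def handler(event, context):
    """Triggered by API Gateway; queries Snowflake and returns rows as JSON."""
    conn = snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],      # hypothetical env-var configuration
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        warehouse=os.environ["SF_WAREHOUSE"],
        database=os.environ["SF_DATABASE"],
        schema=os.environ["SF_SCHEMA"],
    )
    try:
        cur = conn.cursor(snowflake.connector.DictCursor)
        # Illustrative query; the real one was driven by the API request payload.
        cur.execute("SELECT order_id, status, amount FROM orders LIMIT 100")
        rows = cur.fetchall()
    finally:
        conn.close()

    # Response shape expected by an API Gateway proxy integration.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(rows, default=str),
    }
```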
Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau.
Confidential, Hartford, Connecticut
Sr. Big Data Engineer
Responsibilities:
- Provided professional services designing big data solutions for clients with Cloudera Hadoop and Hadoop ecosystem technologies, including Hive, HBase, Storm, and Spark.
- Supported MapReduce programs running on the cluster and was involved in loading data from the UNIX file system into HDFS.
- Installed Kafka on the Hadoop cluster and configured producers and consumers in Java to establish a connection from the source to HDFS for popular hashtags.
- Adopted Spark and Spark SQL, using Scala, to improve the performance, scalability, and memory usage of large-volume data processing, building Confidential Benchmarks cubes and populating annual and quarterly cube metrics.
- Improved the performance and optimization of existing Hadoop algorithms using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN; loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.
- Created and configured S3 buckets and generated policies for different environments (Dev, QA, Staging, and Production) on the AWS Databricks platform.
- Designed and developed an entire change data capture (CDC) module in Python and deployed it in AWS Glue using the PySpark library.
- Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
- In the preprocessing phase of data extraction, used Spark to remove missing data and transform the data to create new features.
- Reduced access times by refactoring data models, optimizing queries, and implementing a Redis cache to support Snowflake; created ETL jobs and pushed the data to Snowflake.
- Developed complex parallel jobs and performed large-scale performance tuning of ETL mappings to ensure minimum job run times.
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and the DataFrame API in Python (see the sketch after this list).
- Built complex SQL queries in the Teradata data warehouse and analyzed profitability data at the product, customer segment, and channel level, providing the findings to senior managers for decision support and revenue and expense forecasting.
- Analyzing complex data, troubleshooting, and discovering root causes of the problems utilizing SQL queries, Hive, SAS, Alteryx and Excel.
- Developed and implemented an R Shiny application that showcases machine learning for business forecasting.
- Performed transformations, cleaning, and filtering on imported data using Spark Data Frame API, Hive, MapReduce, and loaded final data into Hive.
- Participated in migrating code from Development to Test and Production, and provided migration documentation such as environment preparation, deployment components, batch execution instructions, and DFDs for jobs.
- As part of QA and production support, provided DDL, DML, and validation scripts, and provided L3 support in non-dev environments to resolve issues.
- Primary contributor in designing, coding, testing, debugging, documenting and supporting all types of applications consistent with established specifications and business requirements to deliver business value.
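To illustrate the Hive-to-Spark conversion work noted above, here is a small sketch showing one hypothetical profitability query expressed both through Spark SQL and through the equivalent DataFrame API; the finance.sales table and its columns are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Original HiveQL (hypothetical tables/columns), run unchanged through Spark SQL.
sql_result = spark.sql("""
    SELECT product_id, channel, SUM(revenue) - SUM(cost) AS profit
    FROM finance.sales
    WHERE sale_date >= '2020-01-01'
    GROUP BY product_id, channel
""")

# The same logic expressed as DataFrame transformations.
sales = spark.table("finance.sales")
df_result = (sales
             .filter(F.col("sale_date") >= "2020-01-01")
             .groupBy("product_id", "channel")
             .agg((F.sum("revenue") - F.sum("cost")).alias("profit")))

# Both approaches return the same result; pick whichever reads better.
df_result.show(10)
```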
Environment: Hadoop 2.9, AWS, Spark 3.0, PySpark, S3, AWS Glue, Kafka, HDFS, Spark SQL, Hive, Agile, Python 3.7, R, MapReduce, Databricks, Snowflake, ADF, Unix, DataStage, Teradata.
Confidential
Data Analyst
Responsibilities:
- Worked with OLAP tools such as ETL, data warehousing, and modeling tools; data were extracted, transformed, and loaded from SQL Server to Oracle databases using Informatica/SSIS.
- Applied probability, distribution, and statistical inference to a given dataset to discover some interesting findings through these comparisons, e.g., T-test, F-test, R-squared, P-value, and so on.
- Used Apache Spark with Python to develop and execute Big Data Analytics and Machine Learning applications, as well as machine learning use cases under Spark ML and MLlib.
- Created R scripts to ensure that appropriate data access, manipulation, and reporting functions are performed.
- Used RStudio with packages such as dplyr, tidyr, and ggplot2 for data visualization, and generated scatter plots and high-low graphs to identify relationships between variables.
- Studied Data Mining and validated data to ensure accuracy between warehouse and source systems.
- Consolidate and reconcile data from various sources, including Oracle databases, MS Excel, and flat files, and perform data attestation.
- Participated in all phases of the SSIS life cycle, including the creation of SSIS packages, building, deploying, and executing them.
- Created reports for users in different departments using SQL Server Reporting Services (SSRS).
- Wrote T-SQL statements to retrieve data and participated in the performance tuning of T-SQL queries and stored procedures.
- Developed various machine learning algorithms using pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python (see the sketch after this list).
- Applied statistical techniques such as Bayesian HMMs and machine learning models such as Decision Trees, XGBoost, SVM, and Random Forest for building models.
- Loaded data from HDFS/Hive into PySpark Data Frame, developed aggregated reports using PySpark.
- Developed PySpark reports and automated them using AutoSys and migrated Hive scripts into Spark to save time.
- Developed data quality scripts using SQL and Hive to validate successful data loads and the quality of the data, and created various types of data visualizations using Python and Tableau.
- Developed Tableau views with complex calculations and hierarchies for the analysis of large data sets.
- Troubleshot, fixed, and deployed many Python bug fixes for the two main applications that provided the main source of data for both customers and internal departments.
- Worked with Data Analysts to gather requirements, conduct business analysis, and coordinate projects.
- Knowledge of risk analysis, root cause analysis, cluster analysis, correlation, and optimization, as well as the K-means algorithm for clustering data into groups.
- Developed pivot tables, exported data from external SQL databases, produced reports, and updated spreadsheet information using Excel.
- Managed, updated, and manipulated report orientation and structures using advanced Excel functions such as Pivot Tables and VLOOKUPs.
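A minimal sketch of the pandas/scikit-learn modeling described above (a Random Forest on a labeled extract); the input file, feature columns, and label are hypothetical, and the real features and tuning differed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical labeled extract produced by the upstream SQL/Hive pipeline.
df = pd.read_csv("claims_features.csv")
X = df.drop(columns=["label"])
y = df["label"]

# Hold out a stratified test split for honest evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out split before handing results to reporting.
print(classification_report(y_test, model.predict(X_test)))
```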
Environment: SQL, OLAP, Python, OLTP, Informatica, SAS, SSIS, SSRS, T-SQL, Tableau, Advanced Excel.
Confidential
Java Developer
Responsibilities:
- Participated in all phases of the Software Development Life Cycle (SDLC), which includes problem-solving, analysis, design, coding, and testing skills, including unit and acceptance testing.
- Implemented various business functionalities as session beans using Enterprise JavaBeans (EJB).
- Contributed to the development of web pages using HTML, JavaScript, JSP, and Struts.
- Tested the application's RESTful web services, which supported text, JSON, and XML formats.
- Worked with different types of controllers, including Simple Form Controllers, Abstract Controllers, and Controller Interfaces.
- Assisted with the planning, design, and implementation of enterprise architectures, including requirements analysis and process modeling using IBM Rational Rose.
- Designed and developed the backend Oracle database schema and Entity-Relationship diagrams for the application.
- Contributed to the development of CSV files using the Data Load.
- Implemented procedures, packages, triggers, and different joins to retrieve data from the database using PL/SQL and SQL scripts; created DDL and DML scripts to create tables and grant privileges on the respective tables in the database.
- Involved in acceptance testing with test cases and code reviews.
- Developed code for handling errors using exception handling.
- Consumed Web Service for transferring data between different applications using RESTful APIs along with Jersey API and JAX-RS.
- Responsible for fixing bugs based on the test results.
- Responsible for Hibernate configuration and integration of the Hibernate framework.
- Designed and developed stored procedures and triggers in Oracle to cater to the needs of the entire application.
- Extensively worked with the retrieval and manipulation of data from the Oracle database by writing queries using SQL.
- Created and administered database objects such as tables, views, and indexes.
- Built the application using TDD (Test Driven Development) approach and involved in different phases of testing like Unit Testing.
- Involved in configuring JMS and JNDI in rational application developer (RAD).
- Implemented Log4j to maintain system logs.
- Used Spring repositories to load data from the Oracle database in the DAO layer, and used Jenkins to deploy the application to the testing environment.
- Coordinated with other programmers on the team to ensure that all modules complement each other well.
Environment: JDK 1.5, JSON, XML, JSP, Struts, HTML, CSS, JavaScript, AngularJS, jQuery, REST, JAX-RS, Jersey API, JUnit, Log4j, Jenkins, SharePoint, RAD, JMS API.