Big Data Analyst Resume
New York, NY
EXPERIENCE SUMMARY:
- Around 8 years of professional IT experience, including around 5 years in Big Data environments and the Hadoop ecosystem, with solid experience in Spark, NoSQL and Java development.
- Hands-on experience across the Hadoop ecosystem, including HDFS, MapReduce, YARN, Spark, Sqoop, Hive, Pig, Impala, Oozie, Oozie Coordinator, ZooKeeper, Apache Cassandra and HBase.
- Experience in using tools like Sqoop, Flume, Kafka, NiFi and Pig to ingest structured, semi-structured and unstructured data into the cluster.
- Designed both time-driven and data-driven automated workflows using Oozie and used ZooKeeper for cluster coordination.
- Experience with Hadoop clusters on Cloudera CDH and Hortonworks HDP.
- Experience in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing and internal/external tables.
- Expertise in writing MapReduce jobs in Java and Python to process large structured, semi-structured and unstructured data sets and store them in HDFS.
- Experienced in using Pig scripts to perform transformations, event joins, filters and pre-aggregations before storing data in HDFS.
- Worked on importing data into HBase using the HBase shell and the HBase client API. Experience in designing and developing HBase tables and storing aggregated data from Hive tables.
- Experience working with Python, UNIX and shell scripting.
- Experience with data formats like JSON, Avro, Parquet and ORC, and compression codecs like Snappy and Bzip2.
- Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources such as flat files and databases.
- Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing and ETL tools like IBM DataStage, Informatica and Talend.
- Good knowledge of cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2 and Redshift, as well as Microsoft Azure.
- Good knowledge of Google Cloud Dataproc, cluster creation and testing with in-memory emulators.
- Experience in using IDEs and Tools like Eclipse, IntelliJ, NetBeans, GitHub, Maven, SBT, CBT.
- Strong in core Java, data structures, algorithm design, Object-Oriented Design (OOD) and Java components like the Collections Framework, exception handling, the I/O system and multithreading.
- Hands on experience in MVC architecture and Java EE frameworks like Struts2, Spring MVC, and Hibernate.
- Experience with complete Software Development Life Cycle (SDLC) process which includes Requirement Gathering, Analysis, Designing, Developing, Testing, Implementing and Documenting.
- Worked with waterfall and Agile methodologies.
- Good team player with excellent communication skills and a strong drive to learn new technologies.
- Hands-on experience with Spark architecture and its components, including the Spark SQL, DataFrame and Dataset APIs.
- Worked on Spark to improve the performance of existing Hadoop processing, using SparkContext, Spark SQL, DataFrames and RDDs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL and Python (see the sketch following this summary).
- Hands-on experience reading Hive tables from Spark, performing transformations and creating DataFrames on top of them.
- Used Spark Structured Streaming to perform the necessary transformations.
- Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
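A minimal sketch of the Hive-to-Spark conversion pattern referenced above, assuming PySpark with Hive support enabled; the table and column names (sales, region, amount) are hypothetical placeholders, not taken from any specific project.
```python
# Minimal sketch: converting a HiveQL aggregation into Spark DataFrame
# transformations. Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sketch")
         .enableHiveSupport()       # lets Spark read existing Hive tables
         .getOrCreate())

# Equivalent of: SELECT region, SUM(amount) AS total_amount
#                FROM sales GROUP BY region
sales_df = spark.table("sales")
totals_df = (sales_df
             .groupBy("region")
             .agg(F.sum("amount").alias("total_amount")))

totals_df.write.mode("overwrite").saveAsTable("sales_totals_by_region")
```
The DataFrame chain mirrors the original GROUP BY aggregation while leaving the physical execution plan to Spark's Catalyst optimizer.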
TECHNICAL SKILLS:
Hadoop: HDFS, MapReduce, Hive, Beeline, Sqoop, Flume, Oozie, Impala, Pig, Kafka, ZooKeeper, NiFi, Cloudera Manager, Hortonworks
Spark Components: Spark Core, Spark SQL (DataFrames and Datasets), Scala, Python.
Programming Languages: Core Java, Scala, Shell, Hive QL, Python
Web Technologies: HTML, jQuery, Ajax, CSS, JSON, JavaScript.
Operating Systems: Linux, Ubuntu, Windows 10/8/7
Databases: Oracle, MySQL, SQL Server
NoSQL Databases: HBase, Cassandra, MongoDB
Cloud: AWS CloudFormation, Azure
Version Control and Tools: Git, Maven, SBT, CBT
Methodologies: Agile, Waterfall
IDEs & Command-Line Tools: Eclipse, NetBeans, IntelliJ
PROFESSIONAL EXPERIENCE:
Big Data Analyst
Confidential, New York, NY
Responsibilities:
- Planned, designed and created data-level rules for streams and streaming into the data lake. Validated the data using Databricks and converted structured data to Avro schemas for streaming purposes.
- Hands-on experience with Spark architecture and its components, including the Spark SQL, DataFrame and Dataset APIs.
- Worked on Spark to improve the performance of existing processing in Databricks using DataFrames.
- Worked with Terraform-managed replication tasks for intermediate staging of data from PostgreSQL to S3 buckets, capturing CDC events and updating the buckets.
- Wrote a Makefile to automate Terraform runs; tested S3 modules using Terratest.
- Created SNS topics for the AWS DMS task to capture all events; created an AWS Lambda function to capture error logs and notify SNS, and automated the process using Terraform.
- Registered the metadata, transformation rules and data sources.
- Created warehouses and databases in Snowflake, designed complex queries and checked cluster performance.
- Created Okera tables, identified and masked sensitive data, and created views.
- Worked on true-source data elements, retrieving them from the original source, masking private data and loading it to AWS S3 storage buckets.
- Worked on customer experience features, adding scheduled payments using IRIS and IVR data and the EASE application.
- Worked on historical payments data to create new data models.
- Worked on designing fraud data tools and testing them with existing data.
- Created a datastore from scratch. Developed efficient data pipelines and products using Python.
- Used AWS Lambda to automate loading datasets (Parquet, DAT, Avro) from AWS S3 into AWS RDS.
- Wrote Spark programs in Scala and Python for data quality checks (see the sketch at the end of this section).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Created Terraform scripts to spin up EMR and EC2 clusters.
- Worked on AWS Identity and Access Management (IAM) components, including users, groups, roles, policies and password policies, and provisioned them for end users.
- Created highly scalable, resilient and performant architecture using AWS cloud technologies such as S3, EMR, EC2 and Lambda.
- Developed efficient data pipelines and products using tools and languages like Python and Spark.
- Performed application programming, testing and validation, deployment and documentation.
- Moved data between production systems and across multiple platforms.
- Used tools like Jenkins and GitHub for version control and continuous integration.
- Worked with file formats like Avro, Parquet and CSV.
Environment: Hadoop, Spark, Scala, Kafka, Python, AWS, Databricks.
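A minimal sketch of the kind of PySpark data-quality check mentioned in the bullets above (null and duplicate checks on key columns); the S3 path and column names are hypothetical placeholders.
```python
# Minimal sketch of a PySpark data-quality check: null-rate and duplicate
# checks on key columns. Path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-check-sketch").getOrCreate()

df = spark.read.parquet("s3://example-bucket/payments/")   # hypothetical path

key_cols = ["payment_id", "customer_id"]                   # hypothetical keys

# Null counts per key column.
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls") for c in key_cols]
).collect()[0].asDict()

# Duplicate count on the composite key.
dup_count = df.count() - df.dropDuplicates(key_cols).count()

if any(v > 0 for v in null_counts.values()) or dup_count > 0:
    raise ValueError(f"DQ check failed: nulls={null_counts}, duplicates={dup_count}")
```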
Hadoop Developer
Confidential, Charlotte, NC
Responsibilities:
- Developed a cloud-based enterprise data warehouse (EDW) and data lake solution supporting data asset management, data integration and continuous analytic discovery workloads.
- Developed and implemented real-time data pipelines with Spark Streaming, Kafka and Cassandra to replace the existing Lambda architecture without losing its fault-tolerance.
- Created a Spark Streaming application to consume real-time data from Kafka sources and applied real-time analysis models that can be updated on new data as it arrives in the stream.
- Worked on importing, transforming large sets of structured, semi-structured and unstructured data.
- Used Spark Structured Streaming to perform the necessary transformations in a data model that consumes data from Kafka in real time and persists it into HDFS (see the sketch at the end of this section).
- Implemented workflows using the Apache Oozie framework to automate tasks and used ZooKeeper to coordinate cluster services.
- Created various Hive external and staging tables and joined them as per requirements.
- Designed ETL flows to get data from various sources, transform for further processing and load in Hadoop/HDFS for easy access and analysis by various tools.
- Implemented static partitioning, dynamic partitioning and bucketing in Hive using internal and external tables. Used map-side joins and parallel execution to optimize Hive queries.
- Developed and implemented custom Hive and Spark UDFs for date transformations such as date formatting and age calculations per business requirements.
- Wrote Spark programs in Scala and Python for data quality checks.
- Wrote transformations and actions on DataFrames and used Spark SQL to load Hive tables into Spark for faster data processing.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Understanding of Teradata logical data models. Developed shell scripts for daily and weekly loads, transferring files and refreshing data between environments.
- Exported the analyzed patterns back into Teradata using Sqoop. Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Performed ETL data cleansing, integration and transformation using Hive and PySpark. Responsible for managing data from disparate sources.
- Involved in loading data from Teradata database into HDFS using Sqoop queries.
- Preparing and using test data/cases to verify accuracy and completeness of ETL process.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Experience using Google Cloud Dataproc to create Compute Engine instances that connect to Cloud Bigtable instances; created Hadoop clusters to run Hadoop jobs and used in-memory emulators for testing with filters.
- Used Spark optimization techniques such as caching/refreshing tables, broadcast variables, coalesce/repartition, increasing memory overhead limits, tuning parallelism and modifying Spark default configuration variables for performance tuning.
- Performed various benchmarking steps to optimize the performance of Spark jobs and thus improve the overall processing.
- Worked in Agile environment in delivering the agreed user stories within the sprint time.
Environment: Hadoop, HDFS, Hive, Sqoop, Oozie, Spark, Scala, Kafka, Python, Cloudera.
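A minimal sketch of the Kafka-to-HDFS Structured Streaming pattern referenced above, assuming JSON-encoded events; the broker address, topic, schema and HDFS paths are hypothetical placeholders.
```python
# Minimal sketch of a Structured Streaming job: consume JSON events from
# Kafka, transform, and persist to HDFS as Parquet. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
       .option("subscribe", "events")                      # hypothetical topic
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("event_date", F.to_date("event_time")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")            # hypothetical HDFS path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .partitionBy("event_date")
         .outputMode("append")
         .start())
query.awaitTermination()
```
The checkpoint location is what gives the stream its fault-tolerance on restart, in line with the pipeline design described above.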
Hadoop Developer
Confidential, Miami, FL
Responsibilities:
- Worked with product owners, designers, QA and other engineers in an Agile development environment to deliver timely solutions per customer requirements.
- Transferred data from different data sources into HDFS using Kafka producers, consumers and brokers (see the sketch at the end of this section).
- Used Oozie for automating end-to-end data pipelines and Oozie coordinators for scheduling the workflows.
- Involved in creating Hive tables, loading data, and writing Hive queries and views, working on them using HiveQL.
- Optimized Hive queries using map-side joins, dynamic partitions and bucketing.
- Applied Hive queries for data analysis on HBase using SerDe tables to meet the data requirements of downstream applications.
- Executed Hive queries using the Hive command line, the Hue web GUI and Impala to read, write and query data in HBase.
- Implemented MapReduce secondary sorting to get better performance for sorting results in MapReduce programs.
- Designed an end-to-end ETL flow for a feed receiving millions of records daily, using Apache tools/frameworks Hive, Pig, Sqoop and HBase for the entire workflow. Loaded and transformed large sets of structured and semi-structured data, including Avro and sequence files.
- Migrated all existing jobs to Spark to improve performance and decrease execution time.
- Used Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used Hive join queries to join multiple tables from a source system and load them into Elasticsearch.
- Developed a data pipeline with AWS to extract data from web logs and store it in Amazon EMR.
- Used Microsoft Azure for building, testing and deploying applications.
- Experience with the ELK stack for building quick search and visualization capabilities for data.
- Good knowledge of using NiFi to automate data movement between different Hadoop systems.
- Experience with data formats like JSON, Avro, Parquet and ORC, and compression codecs like Snappy and Bzip2.
- Coordinated with the testing team for bug fixes and created documentation for recorded data, agent usage and release cycle notes.
Environment: Hadoop, Big Data, HDFS, Scala, Python, Oozie, Hive, HBase, NiFi, Impala, Spark, Linux, AWS, Azure, Hortonworks.
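A minimal sketch of publishing records to Kafka for downstream HDFS ingestion, as referenced above; it assumes the kafka-python client, and the broker address, topic name and record layout are hypothetical placeholders.
```python
# Minimal sketch of a Kafka producer feeding a downstream HDFS ingestion
# pipeline. Assumes the kafka-python client; names are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker:9092"],                     # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

record = {"source": "web_logs", "event_id": "12345", "status": "ok"}  # hypothetical record
producer.send("ingest-topic", value=record)                # hypothetical topic
producer.flush()   # make sure buffered messages are delivered before exit
producer.close()
```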
Java Developer
Confidential
Responsibilities:
- Involved in various phases of the Software Development Life Cycle, including requirement gathering, design, analysis and code development.
- Developed Use Cases, Class Diagrams, Activity Diagrams and Sequence Diagrams.
- Developed Java Server Pages (JSP) for the front end and Servlets for handling HTTP requests. Worked with Tomcat Server for deployment.
- Developed graphical user interfaces using XML and used JSPs for user interaction.
- Used JSP custom tags and Stored Procedures in the web tier to dynamically generate web pages.
- Used SVN as version control and Ant to build the J2EE application.
- Worked on Oracle to perform DML and DDL operations.
- Involved in unit and integration testing, pre-production testing, client acceptance tests and approvals.
Environment: Java, J2EE, Eclipse IDE, JavaScript, JSON, MySQL, PL/SQL, Web service
Jr. Java Developer
Confidential
Responsibilities:
- Involved in different SDLC phases involving Requirement Gathering, Design and Analysis, Development and Customization of the application.
- Designed new pages using HTML, CSS, jQuery, and JavaScript.
- Wrote database queries using SQL and PL/SQL for accessing, manipulating and updating Oracle database.
- Created database design for new tables and forms with the help of Technical Architect.
- Worked with managers to identify user needs and troubleshoot issues as they arise.
- Performed unit testing once the basic implementation was done.
Environment: Java, J2EE, Eclipse IDE, JavaScript, JSON, MySQL, PL/SQL, Web service