Sr. Big Data Engineer Resume

SUMMARY

Excellent understanding of Hadoop architecture and various components such as HDFS, YARN, High Availability, and MapReduce programming paradigm.
Experience with Amazon, Hortonworks, and Cloudera Hadoop distributions.
Hands-on experience installing, configuring, and using Hadoop ecosystem components such as Hadoop 2.x, MapReduce 2.x, HDFS, Oozie, Hive, Pig, Kafka, ZooKeeper, Storm, Spark, Sqoop, Flume, and HBase.
Experienced in installing, configuring, and monitoring DataStax Cassandra clusters, DevCenter, and OpsCenter.
Knowledge of Cassandra read and write processes, including SSTables, MemTables, and the commit log.
Excellent understanding of Cassandra architecture and management tools such as OpsCenter.
Good exposure to MapReduce (Java), HiveQL, Pig scripting, and Spark SQL (Scala/Python).
Experience in building data pipelines using Kafka and Spark.
Experience in managing and reviewing Hadoop log files.
Hands-on experience importing and exporting data using the Hadoop data management tool Sqoop.
Development experience with RDBMSs such as Oracle, MS SQL Server, Teradata, and MySQL.
Worked extensively with Spark using Scala on a cluster for computational analytics, and built advanced analytical applications on top of Hadoop using Spark with Hive and SQL.
Hands-on experience with Spark architecture and its components, such as Spark SQL and the DataFrame and Dataset APIs.
Experience in troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
Experience developing Kafka producers, Kafka consumers, and KStreams for streaming millions of events per second.
Experience using Oozie and Airflow to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
Hands-on knowledge of core Java concepts such as exceptions, collections, data structures, I/O, multithreading, and serialization and deserialization of streaming data.

PROFESSIONAL EXPERIENCE

Confidential

Sr. Big Data Engineer

Responsibilities:

Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
Implemented a big data framework using Hadoop, HDFS, Apache Spark, Hive, MapReduce, and Sqoop.
Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Managed and reviewed Hadoop log files to identify issues when jobs failed, and used Hue for UI-based Pig script execution and Oozie scheduling.
Helped create a data lake by extracting customer data from various sources into HDFS, including Excel files, databases, and server log data.
Developed Python code to gather data from HBase and designed the solution for implementation in PySpark.
Developed PySpark code to mimic the transformations performed in the on-premises environment; analyzed the SQL scripts and designed solutions to implement them in PySpark.
Automated workflows using shell scripts that pull data from various databases into Hadoop, and developed scripts to automate the process and generate reports.
Created detailed AWS security groups that acted as virtual firewalls, controlling the traffic allowed to reach one or more AWS EC2 instances.
Designed multiple Python packages used within a large ETL process that loaded 2 TB of data from an existing Oracle database into a new PostgreSQL cluster.
Deployed and configured AWS EC2 instances for client websites moving from self-hosted services for scalability, and worked with multiple teams to provision AWS infrastructure for development and production environments.
Designed the number of partitions and the replication factor for Kafka topics based on business requirements.
Migrated MapReduce programs into Spark transformations using Spark and Scala, after an initial implementation in Python (PySpark).
Used various Spark transformations and actions to cleanse the input data, and used the Spark application master to monitor Spark jobs and capture their logs.
Worked with Amazon EMR to process data directly in S3 and to copy data from S3 to the Hadoop Distributed File System (HDFS) on the EMR cluster, setting up Spark Core for analysis work.
Implemented Spark applications in Scala using DataFrames and the Spark SQL API for faster data processing, and worked on an extensible framework for building high-performance batch and interactive data processing applications on Hive.
Involved in the configuration and development of the Hadoop environment on AWS cloud services such as EC2, EMR, Redshift, CloudWatch, and Route 53.
Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded the data into Cassandra.
Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks, and performance analysis.

Confidential

Big Data Engineer

Responsibilities:

Processed real-time data to handle transformation requests sent from the user end, performing the appropriate transformations and displaying the necessary results within a set time period.
Processed different kinds of streaming data in formats such as JSON, CSV, XML, XLSX, and HTML.
Developed Spark jobs to connect to the Oracle database and create Datasets and DataFrames for each individual table from each set of data (product, client, and reference data).
Installed the application on AWS EC2 instances and configured storage on S3 buckets.
Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, for workflow management and automation using Airflow.
Wrote real-time processing and core jobs using Spark Streaming with Kafka.
Worked with the Informatica PowerCenter tools Source Analyzer, Data Warehouse Designer, Mapping and Mapplet Designer, and Transformation Designer.
Developed Informatica mappings and tuned them for better performance.
Worked with static and dynamic memory caches to improve throughput of sessions containing Rank, Lookup, Joiner, Sorter, and Aggregator transformations.
Wrote SQL and PL/SQL scripts to extract data from the database and for testing purposes.
Analyzed Hadoop clusters using big data analytic tools including Pig, Hive, Oozie, ZooKeeper, Sqoop, Spark, Kafka, and Impala on the Cloudera distribution.
Designed and implemented efficient data pipelines to integrate data from a variety of sources into the data lake.
Gathered requirements for models designed by data architects for master data (product, client, and reference data) to fully grasp the hierarchies, relationships, and complexities.
Extracted JSON schemas from the models, refactoring them to remove unnecessary fields and restructuring them according to the requirements.
Deployed the big data Hadoop application using Talend on AWS cloud and on Microsoft Azure.
Came up with data warehousing solutions while working with a variety of database technologies.
Designed Apache Spark programs to read millions of transactions from an Oracle database, implementing Structured Streaming and performing the necessary transformations using Spark SQL.
Worked with Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
Used the Spark Cassandra Connector to load data to and from Cassandra.
Created data models for the client's transaction logs and analyzed data from Cassandra tables for quick searching.

Confidential

Big Data Engineer

Responsibilities:

Designed an ETL architecture to load raw data from different sources in different formats, perform preprocessing such as filtering, deduplication, and transformation, and store the results in the Hadoop cluster.
Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Architected and designed the data flow for collapsing four legacy data warehouses into an AWS data lake.
Developed a PySpark framework implementing the ETL architecture to ingest raw data and store structured data in the Hadoop cluster.
Used PySpark functions and Spark SQL DataFrames to increase performance by writing user-defined functions (UDFs).
Stored and retrieved data from data warehouses using Amazon Redshift.
Mapped column-based data for transformation according to business requirements and stored it in Parquet format.
Developed unit test cases in Python covering all possible scenarios to avoid errors in the end-to-end pipeline.
Implemented data loading and aggregation frameworks and jobs able to handle hundreds of GBs of JSON files, using Spark and Airflow.
Automated the running of Hive and Spark queries on data stored in HDFS after the ETL process had executed.
Increased heap size at the node level when executor memory exceeded the maximum level, resolving memory issues.
Worked on an Amazon Redshift and AWS solution to load data, create data models, and run BI on it.
Imported and exported data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
Stored and loaded data from HDFS to Amazon S3 and backed up the namespace data to NFS.
Worked on real-time data processing and streaming techniques using Spark Streaming and Kafka to move data in and out of HDFS and relational databases.
Used Jira for bug tracking, and Git and Bitbucket for check-in and check-out of code changes.
Environment: Hadoop, HDFS, Pig, Git, Hive, AWS, S3, Sqoop, Cloudera, ZooKeeper, Oracle, Shell Scripting, Airflow 1.10.11, Unix, Linux.

Confidential

Data Analyst

Responsibilities:

Performed data analysis, data modeling, data migration, and data profiling using complex SQL on various source systems including Oracle and Teradata.
Experienced in building applications based on large data sets in MarkLogic.
Translated business requirements into working logical and physical data models for data warehouse, data mart, and OLAP applications.
Analyzed data lineage processes to identify vulnerable data points, control gaps, data quality issues, and an overall lack of data governance.
Worked on data cleansing and standardization using the cleanse functions in Informatica MDM.
Designed Star and Snowflake Data Models for Enterprise Data Warehouse using ERWIN.
Validated and updated the appropriate LDMs to process mappings, screen designs, use cases, the business object model, and the system object model as they evolved and changed.
Created business requirement documents and integrated the requirements and underlying platform functionality.
Maintained data model and synchronized it with the changes to the database.
Designed and developed use cases, activity diagrams, and sequence diagrams using UML.
Extensively involved in the modeling and development of Reporting Data Warehousing System.
Designed the database tables and created table-level and column-level constraints using the suggested naming conventions for constraint keys.
Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to Cassandra.
Used the ETL tool BODS to extract, transform, and load data into data warehouses from various sources such as relational databases, application systems, timetables, and flat files.
Developed stored procedures and triggers.
Wrote packages, procedures, functions, and exceptions using PL/SQL.
Reviewed the database programming for triggers, exceptions, functions, packages, and procedures.