
Senior Associate Resume


SUMMARY

  • Data Engineer with 7+ years of experience, specialized in the Big Data ecosystem including data aggregation, querying, storage, analysis, and the development and implementation of data models.
  • Solid experience in Hadoop Distributed File System (HDFS), Sqoop, Hive, HBase, Spark, MapReduce, Ambari, Kafka, YARN, Airflow, Flume, Oozie, Zookeeper and Pig.
  • Experience working with Cloudera and Hortonworks Hadoop distributions as well as the Amazon EMR and Azure HDInsight cloud-based Hadoop distributions.
  • Experience working with AWS big data components (EMR, EC2, S3, RDS, DynamoDB, Redshift, Athena, Lambda).
  • Experience working with Azure big data components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, CosmosDB, Storage Explorer).
  • Excellent knowledge of Hadoop architecture and the daemons of Hadoop clusters, including the distributed file system, parallel processing, fault tolerance, YARN, HDFS, ResourceManager, NodeManager, NameNode and DataNode.
  • Excel in analyzing data using Spark SQL, HiveQL, Pig Latin, Spark/Scala, PySpark and custom MapReduce programs in Java.
  • Strong working experience in developing Spark applications using Spark components (Spark Core, DataFrames, Spark SQL, Spark ML and the Spark Streaming APIs); a minimal sketch follows this summary.
  • Experience in creating real time data streaming solutions using Spark Streaming, Kafka, Storm.
  • Strong experience in implementing PySpark and Spark/Scala applications for interactive analysis, batch processing and stream processing.
  • Extensive experience in importing and exporting data between HDFS and RDBMS databases (MySQL, Oracle, MS SQL Server, DB2, Teradata and PostgreSQL) using Sqoop.
  • Strong working experience on NoSQL databases like HBase, Cassandra, MongoDB, DynamoDB and CosmosDB.
  • Extensively worked with Hive, creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.
  • Strong experience in implementing big data pipelines for batch and real-time processing using Spark, Sqoop, Kafka and Flume.
  • Worked on different file formats like Parquet, Avro, ORC and Flat files.
  • Strong experience in writing Spark Scripts in Python and Scala for development and analysis.
  • Extensive experience in developing Shell scripting, Stored Procedures, Functions and Triggers, Complex SQL queries using Oracle PL/SQL.
  • Involved in the Software Development Life Cycle (SDLC), covering application development, data modeling, data analysis and ETL/OLAP processes.
  • Worked in both Agile and Waterfall methodologies.
  • Worked alongside the data science team, project manager and engineering team on a regular basis to gain insights from data and support decision making.
  • Experience in designing dashboards, reports, performing ad-hoc analysis and visualizations using Tableau, Power BI.
  • Excellent at interacting with people and quickly learning new tools and technologies; a rational thinker and hard worker committed to delivering quality results.
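
Illustrative sketch (not project code): a minimal PySpark example of the DataFrame and Spark SQL usage referenced in the summary above; the path, columns and aggregation are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("summary-sketch").getOrCreate()

    # Read a Parquet dataset (placeholder path).
    orders = spark.read.parquet("s3://example-bucket/orders/")

    # DataFrame API: filter and aggregate.
    daily = (orders
             .filter(F.col("status") == "COMPLETE")
             .groupBy("order_date")
             .agg(F.sum("amount").alias("total_amount")))

    # Equivalent query through Spark SQL over a temporary view.
    orders.createOrReplaceTempView("orders")
    daily_sql = spark.sql("""
        SELECT order_date, SUM(amount) AS total_amount
        FROM orders
        WHERE status = 'COMPLETE'
        GROUP BY order_date
    """)

    daily.show()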

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, YARN, Spark, Kafka, Airflow, Hive, Impala, Sqoop, Flume, Pig, Ambari, Oozie, Zookeeper, NiFi, Mesos.

Hadoop technologies: Apache Hadoop, Apache Spark, Cloudera, Hortonworks.

Cloud services: AWS (EMR, EC2, S3, Redshift, RDS, Athena, Lambda, DynamoDB), Azure (HDInsight, Databricks, Data Lake, Blob Storage, CosmosDB, Data Factory, Storage Explorer).

Programming Languages: Python, Scala, Java, C, C++, Shell Scripting, Pig Latin, HiveQL.

Database: MySQL, Teradata, Oracle, DB2, MS SQL Server, PostgreSQL, Snowflake.

NoSQL Database: HBase, Cassandra, MongoDB.

Visualization: Tableau, PowerBI.

Version control: Git, SVN.

Operating systems: Windows (XP/7/8/10), Linux (Ubuntu, CentOS).

Web Development: JavaScript, HTML, CSS, JSON, XML, J2EE.

PROFESSIONAL EXPERIENCE

Confidential

Senior Associate

Responsibilities:

  • Involved in developing data pipelines using AWS (EMR, EC2, S3, RDS, Athena, Lambda) and moving large-scale pipeline applications from on-premises clusters to the AWS cloud.
  • Worked alongside the data science team to support running machine learning models on Apache Spark clusters on AWS EC2 instances and assisted them with the Spark machine learning library (MLlib). Worked on migrating data from on-premises databases to S3, RDS and Redshift.
  • Involved in creating event-driven ETL pipelines using AWS Glue, triggered whenever new data appears in AWS S3, and in understanding the underlying data assets.
  • Developed an ETL pipeline to extract archived logs from disparate sources and store them in the AWS S3 data lake for further processing with PySpark.
  • Worked with AWS Athena to perform analytics on the data stored in S3 buckets.
  • Created a pipeline that gathers data using Kafka and Spark Streaming, applies transformations with PySpark and stores the results in Cassandra (sketched after this list).
  • Developed multiple PySpark scripts to perform cleaning, validation and transformations of Data.
  • Performed ELT operations on large datasets using PySpark, Spark SQL, Hive and Python.
  • Worked with Spark SQL for reading and writing data from JSON, text and Parquet files; worked on developing Spark code in Scala for faster processing and testing.
  • Involved in handling large datasets using partitions, Spark's in-memory capabilities, efficient joins and other techniques while transforming data.
  • Developed efficient PySpark scripts for reading and writing data from NoSQL tables and then running SQL queries on the data.
  • Uploaded and processed large amounts of various structured and unstructured data into HDFS on AWS cloud using Sqoop.
  • Worked extensively on importing data into Hive and accessing it using Python, and migrated existing tables and applications to the AWS cloud (S3).
  • Worked on creating data pipelines with Airflow to schedule PySpark jobs for performing incremental loads and used Flume for weblog server data.
  • Worked on data modeling and normalization techniques to load raw data from multiple sources, in storage formats like Parquet, Avro and CSV, into data lakes.
  • Involved in developing data integrity tasks to resolve data related issues.
  • Involved in migrating Oozie workflows to Airflow to automate data pipelines to extract data and weblogs from DynamoDB and Oracle.
  • Involved in writing shell scripts to load data into Redshift tables from S3 buckets.
  • Worked on analyzing data from multiple sources to create integrated views that can be used for decision making.
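
Illustrative sketch of the Kafka/Spark Streaming/Cassandra pipeline referenced above, written with Structured Streaming; broker addresses, topic, schema, keyspace and checkpoint path are placeholders, and the spark-cassandra-connector package is assumed to be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

    # Placeholder schema for the JSON events on the topic.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
           .option("subscribe", "events")                      # placeholder topic
           .load())

    # Kafka delivers bytes; parse the JSON payload into columns.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*"))

    def write_to_cassandra(batch_df, batch_id):
        # Requires the Spark-Cassandra connector (an assumption of this sketch).
        (batch_df.write.format("org.apache.spark.sql.cassandra")
         .options(keyspace="analytics", table="events")        # placeholder keyspace/table
         .mode("append")
         .save())

    query = (events.writeStream
             .foreachBatch(write_to_cassandra)
             .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
             .start())
    query.awaitTermination()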

Environment: AWS (EMR, EC2, S3, RDS, Athena, Lambda, Glue, DynamoDB, Redshift), Hadoop 2.x (HDFS, MapReduce, YARN, Sqoop, Hive), Spark, PySpark, SparkSQL, MLlib, Kafka, Zookeeper, Airflow, Flume, Python, HBase, Oracle, Cassandra, Git.

Confidential

Big Data Engineer

Responsibilities:

  • Worked with Azure cloud platform (HDInsight, DataBricks, DataLake, Blob storage, Data Factory and Data Storage Explorer).
  • Analyzed data from Azure data stores using Databricks, leveraging Spark cluster capabilities to derive insights.
  • Developed ETL pipelines in Spark using python and workflows using Airflow.
  • Involved in development of data ingestion, aggregation, integration, and advanced analytics using Snowflake and Azure Data Factory.
  • Worked on developing several Spark/Scala scripts for data extraction from different sources and providing data insights and reports as per need.
  • Wrote Spark SQL scripts to run over imported data and existing RDDs, and implemented Spark best practices such as partitioning, caching and checkpointing.
  • Developed code from scratch using Scala according to the technical requirements.
  • Loaded all data into Hive from source CSV files using Spark.
  • Worked on extracting real-time data using Spark Streaming and Kafka, converting it to RDDs, processing it into DataFrames and loading the data into HBase.
  • Used Spark SQL to read Parquet data and loaded Hive tables into Spark using Scala.
  • Developed Spark jobs using Scala and Spark SQL for faster testing and processing of data.
  • Worked on deriving structured data from unstructured data received using Spark.
  • Used Spark-Cassandra Connector to load data to and from Cassandra.
  • Worked on a data collection, processing and exploration project in Scala.
  • Automated the process for extraction of data from warehouses and weblogs by developing workflows and coordinated interdependent Hadoop jobs using Airflow.
  • Worked on ingesting data from MySQL, MS SQL and MongoDB into HDFS for analysis using Spark, Hive and Sqoop. Experience working with CosmosDB (Mongo API).
  • Involved in PL/SQL query optimization to increase speed of runtime of stored procedures.
  • Worked extensively with Sqoop for importing metadata from Oracle.
  • Involved in converting JSON data into pandas DataFrames and storing it in Hive tables (see the sketch after this list).
  • Developed a data pipeline using Flume to extract data from weblogs and store it in HDFS, and used Postgres to copy files from the existing file format.
  • Used Azure DevOps for CI/CD, Apache Mesos for managing resources and scheduling across the clusters and used Ambari Web UI for monitoring Spark clusters.
  • Actively monitored and resolved issues that arose in pipelines and designed Power BI dashboards per the team's requirements.
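
Illustrative sketch of the JSON-to-pandas-to-Hive step referenced above; the file path and table name are placeholders, and Hive support is assumed to be enabled in the Spark session.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Parse newline-delimited JSON into a pandas DataFrame (placeholder file).
    pdf = pd.read_json("weblogs.json", lines=True)

    # Convert to a Spark DataFrame and persist it as a Hive table (placeholder name).
    sdf = spark.createDataFrame(pdf)
    sdf.write.mode("overwrite").saveAsTable("staging.weblogs")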

Environment: Azure HDInsight, DataBricks, DataLake, Blob storage, Data Factory and Data Storage Explorer, Scala, Python, SQL, Hadoop (HDFS, Yarn, MapReduce, Hive, Sqoop), Spark, Kafka, Zookeeper, Flume, Mesos, Ambari, Airflow, HBase, Oracle, MySQL, Postgres, Snowflake, Cassandra, MongoDB, CosmosDB, PowerBI, Azure DevOps, Git

Confidential

Data Engineer

Responsibilities:

  • Automated the installation and configuration of Scala, Python, Hadoop and the necessary dependencies, configuring the required files accordingly.
  • Fully configured Hadoop cluster for faster and more efficient processing and analysis of the data.
  • Working knowledge of Spark RDDs, the DataFrame API, the Dataset API, Spark SQL and Spark Streaming.
  • Extensively used Sqoop to ingest batch data present in Oracle and MS SQL into HDFS at scheduled intervals.
  • Imported data from AWS S3 into Spark RDD, performed transformations and actions on RDD’s.
  • Involved in ingesting real-time data into HDFS using Kafka and implemented job for daily imports.
  • Responsible for handling streaming data from web server console logs. Worked on creating Kafka Producer, Consumer, Topic, Brokers, and partitions.
  • Configured Zookeeper to manage Kafka cluster nodes and coordinate broker/cluster topology.
  • Transformed and stored the ingested data into Data Frames using Spark SQL.
  • Created Hive tables to load transformed data and implemented dynamic partitioning and bucketing in Hive (see the sketch after this list).
  • Worked on performance tuning and optimization of Hive.
  • Improved the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, Data Frame, Pair RDD’s, and Spark YARN.
  • Experienced in Performance tuning of Spark application for setting right batch interval time, level of parallelism, and memory tuning for optimal efficiency.
  • Involved in exporting Spark SQL Data Frames into Hive tables as Parquet files. Performed analysis on Hive tables based on the business logic. Continuously tuned Hive UDF’s for faster queries by employing partitioning and bucketing.
  • Created a data pipeline using Oozie workflows which runs jobs daily.
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for datasets processing and storage. Experience in maintaining the Hadoop cluster on AWS EMR.
  • Installed application on AWS EC2 instances, configured the storage on S3 buckets, and worked closely with AWS EC2 infrastructure teams to troubleshoot complex issues.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift. Involved in loading data into S3 buckets using AWS Glue and PySpark, and in filtering data stored in S3 buckets using Elasticsearch and loading it into Hive external tables.
  • Configured Snowpipe to pull data from S3 buckets into Snowflake tables.
  • Worked on Amazon Redshift to consolidate all data warehouses into a single warehouse.
  • Involved in utilizing Azure Data Factory services to ingest data from legacy disparate data stores like SAP, SFTP servers, and HDFS to Azure Data Lake Storage.
  • Modeled complex ETL jobs that transform data visually with data flows or by using compute services such as Azure Databricks and Azure SQL Database.
  • Involved in daily stand-up meetings and Sprint showcase and Sprint retrospective.
  • Generated scheduled reports for Kibana dashboards and visualizations. Worked in an Agile development environment using the Kanban methodology.
  • Implemented UNIX scripts to define the use case workflow and to process data files and automate the jobs.
  • Responsible for using Git for version control to commit developed code, which was then deployed using the build-and-release tool Jenkins. Developed a CI/CD system with Jenkins on a Kubernetes container environment, utilizing Kubernetes and Docker to build, test and deploy.
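
Illustrative sketch of loading a dynamically partitioned Hive table through Spark SQL, as referenced above; the database, table and column names are placeholders (bucketing would be added in the DDL with a CLUSTERED BY clause).

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partitions")
             .enableHiveSupport()
             .getOrCreate())

    # Let Hive derive partition values from the data itself.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.events_part (
            event_id STRING,
            user_id  STRING,
            amount   DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
    """)

    # Every distinct event_date in the source becomes its own partition.
    spark.sql("""
        INSERT OVERWRITE TABLE analytics.events_part PARTITION (event_date)
        SELECT event_id, user_id, amount, event_date
        FROM staging.events
    """)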

Environment: Hadoop, HDFS, MapReduce, Hive, Sqoop, Zookeeper, Apache Kafka, Oozie, Yarn, Spark, Python, Scala, AWS (S3, Glue, EC2, Redshift, EMR, SQS), Azure Databricks, Azure SQL Database, Snowflake, Unix, Kanban, Kibana, GIT, Jenkins, Kubernetes, Docker

Confidential

Big Data Developer

Responsibilities:

  • Worked on configuration and monitoring Hadoop cluster using Cloudera distribution.
  • Involved in migrating data from an on-premises Cloudera cluster to AWS EC2 instances deployed in an EMR cluster, and developed an ETL pipeline to extract logs, store them in the AWS S3 data lake and process them further using PySpark.
  • Moved files between HDFS and AWS S3 and worked with S3 buckets in AWS on a regular basis.
  • Responsible for developing data pipeline using Flume, Sqoop and Pig to extract the data from weblogs and store in HDFS.
  • Migrated data between various data sources like Teradata, Oracle and MySQL to HDFS by using Sqoop. Used HCatalog to access Hive table metadata from MapReduce and Pig code.
  • Developed a data pipeline using Kafka and Storm for streaming data and to store it into HDFS.
  • Used Informatica PowerCenter for cleaning, managing and integrating data from different sources for ETL and loaded into a single warehouse repository.
  • Used Impala to read, write and query Hadoop data in HDFS from HBase and constructed Impala scripts to reduce query response time.
  • Analyzed data stored in S3 buckets using SQL and PySpark, stored the processed data in Redshift and validated data sets by implementing Spark components.
  • Performed ETL operations using Python, SQL on many data sets to obtain metrics.
  • Prepared data according to analyst requirements on the extracted data using Pandas and NumPy modules in Python.
  • Involved in designing and developing automation test scripts using Python.
  • Involved in writing multiple Python scripts to extract data from different APIs (see the sketch after this list).
  • Created HBase tables using the HBase shell to load large sets of data from different databases.
  • Involved in scheduling time-based Oozie workflows to run multiple Hive and Pig jobs.
  • Developed flow XML files using Apache NiFi to process and ingest data into HDFS.
  • Worked on performance tuning of Apache NiFi workflow to optimize the data ingestion speeds.
  • Responsible for collecting and aggregating large amounts of log data using Flume and staging it in HDFS for further analysis.
  • Worked on integration of Apache Storm with Kafka to perform web analytics and upload streaming data from Kafka to HBase and Hive.
  • Responsible for developing data pipelines using Apache Kafka by implementing Kafka producers and consumers.
  • Used Hive optimization techniques like partitioning and bucketing to provide better performance with HiveQL queries. Loaded large amounts of data to HBase using MapReduce jobs.
  • Worked on developing UDFs to work with Hive and wrote the corresponding tests in Scala.
  • Used Zookeeper to maintain configurations across clusters and for better synchronization, grouping and reliable distributed coordination.
  • Worked with Kerberos and Apache Sentry for security and authorization on Hadoop.
  • Used Git for version control.
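
Illustrative sketch of one of the API-extraction scripts referenced above; the endpoint, credentials, pagination scheme and output path are placeholders.

    import requests
    import pandas as pd

    API_URL = "https://api.example.com/v1/records"  # placeholder endpoint
    HEADERS = {"Authorization": "Bearer <token>"}   # placeholder credential

    def fetch_all(url, headers, page_size=500):
        """Page through the API and return all records as a list of dicts."""
        records, page = [], 1
        while True:
            resp = requests.get(url, headers=headers,
                                params={"page": page, "per_page": page_size},
                                timeout=30)
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break
            records.extend(batch)
            page += 1
        return records

    if __name__ == "__main__":
        df = pd.DataFrame(fetch_all(API_URL, HEADERS))
        # Stage the extract as CSV for downstream loading into HDFS/HBase.
        df.to_csv("api_extract.csv", index=False)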

Environment: Cloudera CDH4, Hadoop 2.x (HDFS, MapReduce, Yarn, Sqoop, Hive), AWS, EC2, S3, Redshift, Impala, Spark, Pig, SQL, HBase, Kafka, Zookeeper, Flume, Oozie, HCatalog, NiFi, Storm, Informatica, Python, MySQL, Scala, Teradata, Oracle, Git.

Confidential

Hadoop Developer

Responsibilities:

  • Worked with the Hortonworks distribution and big data components such as HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig and Ambari.
  • Primary responsibilities include building scalable distributed data solutions using Hadoop ecosystem.
  • Developed multiple MapReduce jobs in Java for data extraction, transformation and aggregation from multiple file formats such as XML, JSON and CSV.
  • Imported data from Oracle, MySQL, DB2 using Sqoop, performed transformations using Hive, MapReduce and loaded data into HDFS.
  • Involved in developing ETL pipelines using Python and SQL for analytics and reviewed use cases before loading data into HDFS (see the sketch after this list).
  • Updated Bash scripts to bring the log files from FTP server and parse the logs into relational format by Hive jobs.
  • Developed HBase tables for data storage. Implemented various Hive queries for analytics.
  • Created Hive partitioned and bucketed tables for faster data access and analyzed the data to compute various metrics for reporting.
  • Developed Data Cleaning techniques, UDFs using Pig scripts, HiveQL, MapReduce.
  • Involved in automating and scheduling the Sqoop jobs in a timely manner using Unix Shell Scripts.
  • Analyzed the data by running Hive queries (HiveQL) and Pig Latin scripts to study customer behavior, and performed one-off ad-hoc analyses.
  • Updated and optimized existing modules of Python scripts according to the need.
  • Involved in scheduling and executing workflows in Oozie to run Pig jobs.
  • Used Ambari for monitoring and Zookeeper for distributed coordination over clusters.
  • Assisted in creating ETL jobs in Talend and pushing data to a data warehouse.
  • Generated dashboards using Tableau and analytical solutions by collecting, scrubbing, and extracting data from various sources.
  • Involved in inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, drawing conclusions and supporting decision making.
  • Used SVN for version control.
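
Illustrative sketch of a small Python/SQL ETL step of the kind referenced above; the connection string (here MySQL via SQLAlchemy/PyMySQL), query, watermark and staging path are placeholders.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder MySQL connection; real credentials would come from configuration.
    engine = create_engine("mysql+pymysql://user:password@db-host:3306/sales")

    # Extract: pull only rows added since a placeholder watermark.
    query = """
        SELECT order_id, customer_id, amount, order_ts
        FROM orders
        WHERE order_ts >= '2016-01-01'
    """
    df = pd.read_sql(query, engine)

    # Transform: basic cleaning and a derived column.
    df = df.dropna(subset=["order_id", "customer_id"])
    df["order_date"] = pd.to_datetime(df["order_ts"]).dt.date

    # Load: stage as CSV for the scheduled HDFS put / Hive load.
    df.to_csv("orders_delta.csv", index=False)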

Confidential

Java/J2EE Developer

Responsibilities:

  • Involved in architectural and design discussions.
  • Handled server-side validations using Servlets and client-side validations using JSP, Bootstrap and jQuery, and took care of Java multithreading in common Java classes/libraries.
  • Part of developing web services using WSDL, SOAP and XML.
  • Involved in writing Spring configuration XML files containing bean declarations.
  • Used Oracle as the database and Toad for query execution, and was involved in writing SQL scripts and PL/SQL code for procedures and functions.
  • Used Tomcat web server for deployment purpose.
  • Developed configuration files corresponding to mapped beans and backend database tables.
  • Worked on front-end change requests by making changes to existing code using JavaScript, HTML and CSS.
  • Used Log4J to print logging, debugging, warning and info messages on the server console.
  • Developed the application using Eclipse and used Maven as the build tool.
  • Involved in CI/CD process using Jenkins and used SVN for version control and management.

Environment: Java/J2EE, Servlets, JSP, Bootstrap, jQuery, WSDL, SOAP, XML, Oracle, PL/SQL, Tomcat, JavaScript, HTML, CSS, Log4J, Maven, Jenkins, SVN.
