- Six years of experience in the IT industry, including four-plus years as a Big Data Developer/Data Engineer.
- Expertise in configuring Hadoop ecosystem (1.x and 2.x) components such as HDFS, MapReduce, YARN, Pig, Hive, Impala, HBase, Oozie, Zookeeper, Ambari, Sqoop, Kafka and Flume.
- Good understanding of Hadoop architecture: HDFS daemons such as NameNode, DataNode and Secondary NameNode; MapReduce (1.x) daemons such as JobTracker and TaskTracker; and YARN components such as ResourceManager, NodeManager, ApplicationMaster and Containers.
- Knowledge of working with and modifying Hadoop configuration files for cluster setup, and strong experience with Hadoop distributions such as Cloudera, MapR and Hortonworks.
- Hands-on experience in writing complex MapReduce jobs in Java and Python, implementing Hive scripts, and creating custom UDFs using Python, Java and Maven dependencies.
- Good understanding of Spark architecture and components such as Spark Core API, Driver, Cluster Manager, Executors, Stages, Partitions, Spark Jobs, Tasks, DAG Scheduler, Task Scheduler, RDDs, Spark SQL and Spark Streaming.
- Experienced in working with distributed data tools: HBase, Hive, Spark-SQL, MySQL.
- Experience in data ingestion, importing and exporting data between databases such as MySQL, MongoDB, Oracle and Teradata and HDFS using Sqoop.
- Experience in designing and developing Spark applications using Scala and PySpark to compare the performance of Spark with Hive and SQL/Oracle.
- Hands-on work experience with Oozie Workflow Engine for running jobs in the Hadoop ecosystem.
- Experienced in creating schedulers, workflows and data pipelines using Oozie and Spark in the Hadoop ecosystem.
- Working knowledge of creating real-time data streaming solutions using Apache Spark/Spark Streaming, Kafka and Flume.
- Experience in AWS EC2, configuring servers for Auto Scaling and routing traffic using Elastic Load Balancing.
- Configuring AWS EC2 instances in a VPC network, managing security through IAM, security groups and NACLs, and monitoring server health through CloudWatch.
- Experience handling different file formats such as JSON, Avro, Parquet, CSV and SequenceFile.
- In-depth knowledge of machine learning concepts such as clustering and classifiers, and of implementing them using Spark MLlib, scikit-learn, Pandas and NumPy.
- Good knowledge of developing, testing, validating and creating reports for machine learning, computer vision, deep learning and natural language processing projects using Python, TensorFlow, Keras and PySpark.
- Working knowledge of Amazon’s Elastic Compute Cloud (EC2) infrastructure for computational tasks, and of services such as S3, EBS, EFS, RDS and DynamoDB as storage mechanisms.
- Improving the performance of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and YARN.
- Hands-on experience with AWS VPC, VPN, PuTTY and WinSCP, and with version control tools such as Git.
- Knowledge of Agile and Scrum methodology for project management.
- Experience in writing bootstrap/shell scripts to install patch updates for applications running on Linux systems.
- Hands-on experience with the Linux file system and terminal commands.
- A great team player with the ability to communicate effectively with all levels of the organization, including technical staff, management and customers.
Big Data Technologies: Apache Hadoop, HDFS, YARN, MapReduce, Apache Pig, Hive, Flume, Sqoop, Zookeeper, Oozie, Impala, Kafka, HBase, Ambari, Apache Spark.
Hadoop Distribution: Cloudera, Hortonworks.
SQL and NoSQL Databases: Oracle, MySQL, Teradata, Amazon RDS, DynamoDB, HBase.
Programming Languages/Coding: C, C++, Java, Scala, SQL, Python (Scikit-learn, NumPy, SciPy, Pandas, Matplotlib, Seaborn, Plotly, Keras, TensorFlow, PySpark), Linux Shell Scripts, bootstrap, R (intermediate).
Cloud: AWS (S3, DynamoDB, EMR, Kinesis, EC2, IAM, ECS, RDS, VPC, Lambda, EBS, ELB), Docker.
Tools: Eclipse IDE, IntelliJ IDEA, PyCharm, SBT, Maven, ANT, Tableau, MATLAB, Cygwin, RStudio.
Operating Systems & others: Linux (CentOS, Ubuntu), Unix, Windows, PuTTY, WinSCP, VMware, Oracle VirtualBox, AWS and Microsoft Office Suite.
Machine Learning: Classification, Clustering, Regression, Feature Engineering, Data Mining (CRISP-DM), Data Scraping, Data Manipulation, and Visualization.
Statistics/Mathematics: Inferential and descriptive statistics, Probability, Calculus, Linear Modelling, Non-Parametric testing, PCA, ANOVA.
Data Visualizations: Tableau, Amazon QuickSight.
Confidential, Northville, MI
Big Data Engineer
- “Launchpad for Intelligent Analytics (Confidential)” - a Coordination of Benefits solution to perform analytics for Medicare and Medicaid claims.
- Collected Protected Health Information (PHI) data held as flat files and in various EDI file formats on client databases, and loaded it into a MySQL database for storage.
- Imported and exported terabytes of data from MySQL to HDFS using Sqoop incremental append, with splits on boundary conditions, in Parquet, Avro and text formats with gzip and Snappy compression codecs.
- Used Hive as the cleaning engine to perform data cleaning, transformations, aggregations, bucketing and partitioning on the extracted data, and stored the Hive metastore in MySQL for future use.
- Created Maven projects on parallel Spark sessions, with the required dependencies and plugins for the running environments.
- Created a rule-based engine (business rules) using Drools for Medicaid/Medicare insurance programs.
- Collected processed data as Spark RDDs for validation, applied transformations to separate recoverable from non-recoverable Medicaid claims, and saved the results as text files for further analysis.
- Defined Spark jobs using Maven and sbt JAR files, with input files and job-specific rules either passed as arguments to spark-submit or specified in a properties file, depending on user requirements.
- Tuned the performance of Spark partitions, memory settings and broadcast variables for efficient joins and transformations during the data ingestion process.
- Implemented Oozie workflows for data ingestion, transformation, processing, and streaming Spark and Hive jobs.
- Responsible for analyzing extracted structured data on non-recoverable claims to find patterns and insights, using neural networks, ensemble learning and AutoML algorithms with the H2O package in Python.
- Launched Elastic MapReduce (EMR) clusters with Hive, Presto, Pig, Zookeeper, Sqoop, HBase and Spark to design, build and operationalize large-scale enterprise data solutions and applications in combination with other AWS services: S3, DynamoDB, Redshift, Kinesis and Lambda.
- Created AWS CloudWatch dashboards, events and alarms to monitor EMR for instance outages, resource usage and cost optimization, with Auto Scaling groups configured so that EC2 launches a replacement instance when a health check fails.
- Created EMR steps for running custom Maven and sbt JARs and Hive scripts, storing output files in S3 buckets, with IAM roles, IAM users and security groups restricting access to authorized systems and users.
- Created Tableau dashboards and visualizations and performed advanced analytics to analyze data.
- Responsible for building groups, hierarchies and sets to create detail-level summary reports and dashboards using KPIs.
Environment: Hadoop, HDFS, Spark, AWS Cloud Services, Drools, MapReduce, Python, Hive, Sqoop, Oozie, Scala, SQL Scripting, Linux Shell, Zookeeper, Tableau, MySQL.
Confidential, Tempe, AZ
Big Data Engineer
- Performed data cleaning and pre-processing of highly granular data using Hive scripts.
- Created Hive tables, loaded data using incremental imports and wrote queries to parse the data.
- Imported and exported terabytes of data between HDFS and an RDBMS (MySQL) using Sqoop, and ingested real-time data streams from servers into HDFS using Apache Flume for IoT applications.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Created Elastic MapReduce (EMR) clusters and configured the data pipeline with them to schedule the task runner and provision EC2 instances on both Windows and Linux, using key pairs, security groups, Auto Scaling and ELB for fault tolerance.
- Configured CloudWatch, SQS and SNS through the AWS API to monitor AWS services and send notifications.
- Developed a data pipeline for applications to collect real-time data from different producers using Amazon Kinesis, processed it with Spark Scala and PySpark streaming scripts, and stored the results in the required formats in HDFS (EMR), S3 and DynamoDB.
- Developed Spark MapReduce scripts in IntelliJ IDEA and managed build dependencies with SBT.
- Developed Spark applications using Scala and PySpark, and implemented an Apache Spark data processing project that handles data from various RDBMS and streaming sources, easing iterative and interactive querying by creating Spark DataFrames and Datasets with effective partitioning and bucketing.
- Explored Spark to improve the performance of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Deployed the Oozie Workflow Engine and wrote shell scripts to manage and automate interdependent Sqoop, Hive and Spark jobs.
- Used Spark MLlib and NumPy to develop machine learning algorithms such as clustering, adaptive prediction and classifiers to analyze the data.
- Used Apache Arrow (PyArrow) to transfer Spark Datasets/DataFrames between the JVM and Python processes.
- Collaborated with the BI team to understand the project requirements and created custom Hive UDFs using Python to compute various metrics for reporting and analysis.
Environment: Hive, HDFS, Spark, Spark MLlib, MapReduce, Pig, Sqoop, Java, Linux Shell, Zookeeper, MySQL, PL/SQL, IntelliJ IDEA.
Confidential, Atlanta, GA
Junior Big Data Developer
- Imported large volumes of raw structured and multi-structured (Avro, XML, CSV) financial transaction data in yearly and quarterly batches into HDFS using Sqoop to create data lakes.
- Performed data cleaning and pre-processing of the structured data using Hive scripts.
- Wrote Pig UDFs in Java with Maven dependencies for filtering, joining, aggregating and structuring the data for efficient querying.
- Implemented Hive internal and external tables and utilized partitioning, bucketing and joining concepts to prepare data for fast querying.
- Created custom Hive UDFs using Python for querying and extracting data for reporting and analysis.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Implemented Spark scripts in Scala and PySpark for data transformation and processing as per the project requirements.
- Used SQL queries on a MySQL database to store and update tables, create views and extract the data required for processing and analysis.
- Installed and configured Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, HDFS, and YARN in Linux (CentOS). Monitored and managed clusters using Apache Ambari.
- Created presentations using Microsoft PowerPoint and Excel for reporting results and collaborated with the client’s BI team to provide data for analysis.
- Designed and delivered technical presentations on Hadoop architecture, the Hadoop ecosystem, and Big Data applications and project results.
Environment: Oracle VM VirtualBox, Linux, Windows, IntelliJ IDE, Hadoop, MapReduce, Hive.
- Collected employee information stored in various file formats on client databases and loaded it into multiple databases for backup storage.
- Used Sqoop to import structured data batches from data sources (MySQL, SQL Server; formats such as Excel, CSV and mapping sheets) into HDFS for storage and processing, providing integrated views that drive analysis and can be summarized crisply for executive decision-making.
- Thorough knowledge of Hadoop ecosystem components, HDFS commands and configuration files.
- Handled cleaning, sorting and aggregating data from large datasets of employee details for analysis.
- Developed multiple MapReduce jobs using Hive and PySpark in IntelliJ IDEA for cleaning and pre-processing data on employee leave categories, role-based privileges and organizational policies.
- Created and utilized containers and partitioners for Map Reduce applications to improve performance and reduce communication and computation time.
- Developed Python and bootstrap shell scripts to automatically review configuration files of applications.
- Analyzed datasets using Pig, Hive, MapReduce, and Sqoop to recommend business improvements.
- Created Spark RDDs in Spark-shell using Scala (functional programming) for performing data cleaning, transformations, actions, and querying.
- Monitored the DAG scheduler and the stages, tasks, executors, environment and accumulators of Spark applications in the Spark web UI to follow their workflow and understand job execution efficiency.
- Led the design of an integrated analytics solution that replaced several disjointed manual reports.
- Worked with application teams to install operating system and Hadoop updates, patches and version upgrades as required in a Cloudera Hadoop deployment using Oracle VirtualBox.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
Environment: Sqoop, MapReduce, Hadoop, HDFS, Linux, IntelliJ IDE, Spark, Scala, NumPy, Python, Cloudera VM.
Software Programmer Analyst
- Involved in phases of the Software Development Life Cycle (SDLC) including requirement collection, design and analysis of customer specifications, and development and customization of the application.
- Involved in object-oriented analysis and design using Unified Modeling Language (UML) diagrams such as use case, activity, sequence, class and component diagrams, created in Visio.
- Created queries and joins on multiple tables, functions and triggers using LINQ in MySQL for inserting, updating and deleting data in the relational tables.
- Coded and tuned queries, stored procedures, functions and triggers using PL/SQL.
- Created, published and managed reports using SSRS, then designed an automated scheduled delivery of the reports to the manager for error checking and updating before deploying them to end users.
- Worked as an active team member in developing Client/Server Applications on various architectural design patterns including MVC 3.0/4.0, Two-Tier, Three-Tier Architecture.
- Coordinated with project managers, development and QA teams during the project.
- Developed the application in Eclipse and used build tools such as Maven and SBT to build and deploy it.
- Implemented the mechanism of logging and debugging with Log4j.
- Involved in creating test cases for JUnit testing, and performed code reviews to ensure adherence to style standards and code quality and to fix any bugs.
Environment: MySQL, SQL Server Reporting Services 2010, Log4j, Java, ASP.NET MVC