Big Data Developer Resume

Bloomington, IL


  • 7 years of total IT experience in Big Data analysis and development, and 5+ years of experience in Data Science, Information Availability and Information Governance across various domains
  • Experience in design and development of applications using Hadoop and its ecosystem components such as Hive, Spark, Scala, Sqoop, Kafka, HBase and YARN
  • Excellent knowledge of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode and DataNode
  • Hands-on experience with Scala language features - language fundamentals, Classes, Objects, Traits, Collections, Case Classes, Higher Order Functions, Pattern Matching, Extractors, etc.
  • Experience with Hadoop distributions HDP 2.6.x and CDH 5.x
  • Experience in developing Spark Streaming applications using Scala to consume real-time transactions via Kafka topics
  • Experience in building applications using Spark Core, Spark SQL, DataFrames and Spark Streaming
  • Expertise in using SQL queries to extract data from RDBMS databases - MySQL, DB2, Oracle and PostgreSQL
  • Experience in importing data from RDBMS databases MySQL, Oracle and DB2 into a Hadoop data lake using Sqoop
  • Experience with the data ingestion tool NiFi, used to extract data from various data sources into a Hadoop data lake
  • Experience with job scheduling tools - Control-M and Oozie
  • Experience with distributed SQL engines such as Presto to enable low-latency data extraction from Hadoop for analytical purposes
  • Experienced in AWS - S3, EC2, RDS and EMR
  • Experience in developing Spark applications using DataFrames and Datasets; transformed data using PySpark and Spark SQL, and applied performance tuning techniques using Catalyst and Tungsten
  • Hands-on experience migrating existing data from traditional warehouse locations to a Hadoop cluster and creating a common data lake and consumption data mart to enable regulatory and MI reporting
  • Experience with NoSQL databases HBase, MongoDB and Cassandra
  • Experience with real-time messaging systems such as Kafka to ingest real-time streaming data into Hadoop
  • Worked with different bug tracking tools such as Remedy and Jira
  • Experience in developing Spark batch applications using Scala to ingest data into a common data lake
  • Experience in analyzing data using HiveQL
  • Experience in importing and exporting data using Sqoop from RDBMS to HDFS and vice-versa
  • Experience in architecting, designing, implementing and deploying the Data Protection Software suite and Digital Investigation software suite for diverse environments
  • Experience in building data pipelines, Data Engineering, Data Mining & programming Machine Learning algorithms (supervised and unsupervised) to gather insights from the data
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regression, Random Forest, K-Nearest Neighbors) and Statistical Modeling in Forecasting/Predictive Analytics, Hypothesis Testing, Factor Analysis/PCA
  • Experience in analyzing and manipulating data and developing machine learning models with Python using Scikit-Learn, NumPy, SciPy and Pandas
  • Experience in analyzing and manipulating data and developing machine learning models with R using libraries ggplot2, evir, Ecdat, car, caret, Cubist, mlbench, AppliedPredictiveModeling, plyr and pROC
  • Experience in quantitative research methods and analysis (ANOVA, ARIMA, ARMA, factor analysis, regression analysis, SVM, Naïve Bayes, Anomaly detection)
  • Experience in building visualizations and infographics that deliver meaningful insights from data using R Shiny & Tableau
  • Excellent networking and communication skills with stakeholders at all levels, including executives, application developers, business users, and customers
  • Experience working with Agile and Waterfall methodologies


Confidential, Bloomington, IL

Big Data Developer

  • Worked with Project Manager, Business Leaders and Technical teams to finalize requirements and create solution design & architecture
  • Analyzed the Hadoop cluster and different big data analytics tools including Pig, Hive, Spark, Scala and Sqoop
  • Designed and developed Spark code using Scala, PySpark & Spark SQL for high-speed data processing to meet critical business requirements
  • Analyzed existing SQL scripts and designed solutions to implement them using PySpark
  • Implemented RDD/Dataset/DataFrame transformations in Scala through SparkContext and HiveContext
  • Developed algorithms & scripts in Hadoop to import data from source systems and persist it in HDFS (Hadoop Distributed File System) for staging purposes
  • Developed shell scripts to perform Hadoop ETL functions such as running Sqoop, creating external/internal Hive tables and initiating HQL scripts
  • Developed scripts in Hive to perform transformations on the data and load it to target systems for reporting
  • Worked on all four stages - data ingest, data transform, data tabulate and data export
  • Maintained fully automated CI/CD pipelines for code deployment (GitLab/Jenkins/IBM UC Deploy)
  • Built code using Java, Spring Boot, Maven, and Jenkins to build and automate the data workflow
  • Performed JUnit tests and functional tests to validate the code
  • Actively managed, improved, and monitored cloud infrastructure on AWS - EC2, S3, and EMR
  • Wrote Puppet manifests and modules to deploy, configure, and manage servers for internal DevOps process

Environment: Cloudera Hadoop, HDFS, YARN, Java, Spring Boot, Maven, Jenkins, GitLab, Git, Hive, PySpark, Spark SQL, Sqoop, MS SQL Server, Oracle, SQL/NoSQL, Linux, Puppet, Tableau

Confidential, Iselin, NJ

Big Data Consultant

  • Worked closely with customers to understand their current technical environment, key business drivers and future technology requirements
  • Developed project proposals and Statements of Work based on the gathered requirements and the proposed solution
  • Loaded data from different relational data sources into HDFS using Sqoop and exported it to partitioned Hive tables
  • Designed both Managed and External tables in Hive to optimize performance
  • Worked with various file formats such as Parquet, Avro, ORC, CSV, flat files and JSON
  • Exported and Imported data into HDFS and Hive using Sqoop
  • Developed end-to-end data processing pipelines, from receiving data through the distributed messaging system Kafka to persisting it in HBase
  • Implemented Kafka security features using SSL, initially without Kerberos; later set up Kerberos with users and groups for finer-grained security, enabling more advanced security features
  • Installed a Kerberos-secured Kafka cluster with no encryption on Dev and Prod
  • Installed Ranger in all environments as a second level of security for the Kafka brokers
  • Designed and implemented a Kafka producer application to produce real-time data using Apache Kafka Connect; used Change Data Capture (CDC) software and the Oracle GoldenGate real-time data replication tool
  • Implemented different data formatting capabilities and publishing to multiple Kafka topics
  • Implemented Kafka high-level consumers to get data from Kafka partitions and move it into HDFS
  • Used the Kafka HDFS connector to export data from Kafka topics to HDFS files in a variety of formats, integrating with Apache Hive and then HBase
  • Integrated Apache Kafka with Elasticsearch using Kafka Elasticsearch Connector to stream all messages from different partitions and topics into Elasticsearch for search and analysis
  • Used Kafka and REST APIs to collect data and load it into HBase and Hive
  • Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted the results into HBase
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala
  • Automated build and deployment using Jenkins to reduce human error and speed up production processes
  • Maintained build profiles in Team Foundation Server and Jenkins for the CI/CD pipeline
  • Built statistical models on AWS EMR by uploading data to S3 and creating instances on EC2
  • Performed real-time streaming using different payment system EDMi and published on Kafka Topics
  • Used PySpark and Spark MLlib to perform classification, regression and clustering on data
  • Used Spark Streaming to aid with real-time analytics on data coming in through Kafka pipelines
  • Actively developed predictive models and strategies for effective fraud detection for credit and customer banking activities using k-Means clustering in Python (PySpark)
  • Developed a linear regression model to predict a continuous measurement for improving the observation on credit data, implemented with Spark's Python API (PySpark)
  • Assisted senior data scientist in performing text mining on customer review data using topic modeling and sentiment classification
  • Performed k-Means clustering in R to understand customer backgrounds and segment customers based on transaction behavior information, in order to customize product offerings, improve existing profitable relationships and avoid customer churn
  • Built interactive dashboards for business using Tableau

Environment: Cloudera Hadoop, HDFS, YARN, MapReduce, Scala, Hive, Spark, PySpark, Spark SQL, HBase, Sqoop, Kafka, MS SQL Server, Oracle, SQL/NoSQL, Linux, Python, R, NumPy, SciPy, Pandas, Scikit-Learn, Tableau

Confidential, Denver, CO

Big Data Analyst

  • Used R and Python to perform exploratory data analysis and build visualization components
  • Developed audience extension models using machine learning algorithms - decision trees, random forest, logistic regression, and other categorical data (Hadoop - Python - R)
  • Performed web scraping using the BeautifulSoup library to extract data for building graphs and visualizations
  • Developed ARIMA and EWMA forecasting models to perform predictive analytics
  • Developed a prediction model applying classification with the Decision Tree (J48) classifier
  • Developed strategic and analytical dashboards using Tableau
  • Generated KPIs for customer satisfaction survey results
  • Developed Tableau workbooks from multiple data sources using Data Blending
  • Developed Pareto charts, stacked bar graphs, histograms and scatter plots
  • Worked with team of developers to design, develop and implement a BI solution for Sales, Product and Customer KPIs - Pareto Analysis

Environment: Hortonworks Hadoop, HDFS, YARN, Hive, Python, R, MS SQL Server, Oracle 11g R2, MongoDB

Confidential, Fairfax, VA

Geospatial Data Analyst - Research

  • Provided guidance and organized data access based on database privileges
  • Provided solutions to the customer to streamline data to work across multiple software platforms
  • Completed ad hoc research requests and surveys by interpreting data questions
  • Categorized multiple sources of data, including real-time or dynamic, and imagery
  • Collected data from internal and external sources and conducted analysis using inferential statistical techniques
  • Used R statistical software for effective analysis by hypothesis testing to validate data and interpretations
  • Collected data using SQL and R, cleaned it with R and visualized it using Tableau 9
  • Trained and supervised undergraduate students
  • Produced static maps and provided web-based mapping support
  • Participated in public involvement meetings as a representative of the company/client to present project information, address concerns and provide feedback to impacted residents
  • Created dynamic data visualizations for reports and presentations to regulators, clients, and the community

Environment: Python, R, Weka, MS SQL Server, Machine Learning, SQL/NoSQL, Linux, Tableau, NumPy, SciPy, Pandas, Scikit-Learn, Seaborn, BeautifulSoup


Data Analyst

  • Cleaned and analyzed geospatial data from GIS domains and ingested it into the Google Maps API using Techmate, as per country-specific security policies
  • Supported the collection, analysis, harmonization, and loading of metadata into a metadata repository
  • Transformed third-party raw mapping data using a SQL database query tool and distributed the curated data to different business units to meet strict deadlines
  • Rendered satellite imagery and user edits to develop integrated geographical maps for GPS feeds
  • Performed Data Profiling utilizing statistics such as minimum, maximum, mean, median, mode, percentile, standard deviation and variations such as count and sum
  • Reduced marketing cost per AdWords lead by $100
  • Performed keyword research and built PPC campaigns from the ground up - product lifecycle analysis
  • Tracked sales metrics - ROI, revenue from natural/paid search, CTR, CPC, conversions - for managed search terms and keywords using Google Analytics
  • Developed organizational strategy and content for web and email marketing campaigns

Environment: Google Analytics, Linux, Shell Scripting, Google AdWords, MS SQL Server, Microsoft Excel


Data Analyst

  • Extracted data from Oracle and MS SQL Server using Informatica to load it into a single data warehouse repository
  • Synthesized data and produced ad hoc reports using Excel & Crystal Reports
  • Designed and developed the ETL process from different source systems to transform the data as per the business requirements to be used by the reporting teams
  • Created dimensional models based on star schemas and designed them using ERwin
  • Participated in client discussions to gather scope information and perform analysis of scope information to provide inputs for project scoping documents
  • Designed and developed Marketing ad hoc reports using Power BI
  • Developed a Power BI model used for financial reporting of P&L
  • Wrote calculated columns, measures and queries in Power BI Desktop
  • Worked with end users to convert old reports into OBIEE reports
  • Supported process innovation for the Retail business unit by developing Strategic Capacity analysis
  • Created Business Requirement documents (BRD), Functional & Technical Requirement documents
  • Analyzed & collected data to assist customers in planning, forecasting, and in managing their business

Environment: MS SQL Server, Oracle, ERwin, MS Visio, Power BI, Microsoft Excel


Reporting & Analysis: MS Excel, Tableau, Google Analytics, MSBI, SSIS, SSRS

Languages: UNIX, SQL, Java, Python, Scala

Databases: MS SQL Server 2008, MySQL, MS-Access, Oracle 11g R2, MongoDB

Operating Systems: Windows, OS X

Statistical/ Data Mining: Python, R

Python Packages: NumPy, SciPy, Pandas, Scikit-Learn, TensorFlow, Matplotlib, Seaborn, OpenCV, PySpark

Big Data Technologies: Hadoop, Spark, Kafka, Sqoop, Hive, MapReduce, YARN

Data Operations: GIS, Operational Research, SEO, A/B Testing, Pattern Recognition, Predictive analysis, Visualization, Machine Learning (Supervised & Unsupervised)

Cloud Computing: AWS - S3, EC2

Other Tools: Git, IntelliJ IDE, PyCharm, Anaconda, Spring Boot, Maven, Jenkins
