Data Engineer / Python Spark Developer Resume

Hartford, CT


  • 6+ years of experience as a Big Data / Hadoop data analyst, including data mining on large structured and unstructured data sets, data acquisition, data validation, statistical modeling, data modeling, and data visualization.
  • Adept in statistical programming languages such as R and Python, in Apache Spark, and in Big Data technologies including Hadoop, Hive, Netezza, YARN, MapReduce, and Pig.
  • Experienced in data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, and advanced data processing.
  • Deep understanding of Big Data analytics and algorithms using Hadoop, MapReduce, NoSQL, and distributed computing tools.
  • Experienced in writing Pig Latin scripts, MapReduce jobs, and HiveQL and Netezza queries.
  • Extensively used SQL, NumPy, pandas, Seaborn, SciPy, Matplotlib, scikit-learn, Spark, Netezza, and Hive for data analysis.
  • Deep understanding of and exposure to the Big Data ecosystem.
  • Experienced in using Sqoop to import and export data between HDFS and relational database systems or mainframes.
  • Worked extensively with Sqoop, Hadoop, Hive, and Spark to build ETL and data processing systems with various data sources, data targets, and data formats.
  • Strong experience with data visualization in Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.
  • Expertise in performing data analysis and data profiling.
  • Experience hosting a Git repository service.
  • Experience in Agile projects, especially with the Atlassian products Jira and Confluence.
  • Strong problem-solving and communication skills; good team player.
  • Practiced in clarifying business requirements and performing gap analysis between goals and existing procedures and skills.
  • Research-oriented, motivated, proactive self-starter with strong technical, analytical, and interpersonal skills.


Scripting language: Python 3.x/2.7/2.4, Jupyter Notebook, pandas, Matplotlib, NumPy, data visualization using Tableau, sentiment analysis, text classification using NLP, Unix shell scripting, C

Database & tool: MySQL, SQL Server, Selenium.

Big data: Hadoop, Hive, Netezza, MapReduce, YARN, Impala, Sqoop.

Operating system: Windows, Unix, Linux, Ubuntu, PuTTY, SecureCRT, SecureFX.

Network protocol: TCP/IP, HTTP, HTTPS

Version control system: GitHub


Confidential, Hartford, CT

Data Engineer / Python Spark Developer


  • Utilized Apache Spark with Python to develop and execute Big Data Analytics.
  • Hands-on coding: wrote and tested the code for the ingest automation process (full and incremental loads). Designed the solution and developed the data ingestion program using Sqoop, MapReduce, shell scripts, and Python.
  • Developed various automated scripts for DI (Data Ingestion) and DL (Data Loading) using Python MapReduce.
  • Developed an export framework using Python, Sqoop, Hive, and Netezza with Aginity Workbench.
  • Developed a fully customized framework using Python, shell scripts, Sqoop, and Hive.
  • Worked extensively with HDFS, Hive, and Netezza.
  • Handled huge volumes of data.
  • Identified areas for improvement in the existing business by analyzing vast amounts of data to unearth insights.
  • Interpreted business problems and provided solutions using data analysis, data mining, optimization tools, and statistics.
  • Led discussions with users to gather business process and data requirements in order to develop a variety of conceptual, logical, and physical data models. Expert in business intelligence and data visualization tools: Tableau, MicroStrategy.
  • Worked on large data sets using Spark and MapReduce.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in a Hadoop/Hive environment on Linux/Windows for big data resources.
  • Migrated data from an RDBMS to a NoSQL database, providing a complete picture of the data deployed across various data systems.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python and SQL queries.
  • Stored and retrieved data from data warehouses on Hadoop YARN.
  • Pre-processed and cleaned data for feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
  • Used a metadata tool to import metadata from the repository, add new job categories, and create new data elements.
  • Created data quality scripts using SQL, Hive, and Netezza to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
  • Used a GitHub repository to clone the code, commit changes, and push them from the feature branch to the develop branch.

Environment: Hadoop, Big Data, MapReduce, YARN, Spark, Python, Sqoop, SQL, Excel, Hive, Netezza, Aginity Workbench for Hadoop, Aginity Workbench for PureData System for Analytics, SQL Server 2012, Tableau.

Confidential, Richmond, VA

Data Analyst / Python Developer


  • Analyzed the risk level of clients with little to no credit account history based on data points, using a dataset describing individuals who take credit from a bank. Each individual is classified as a good or bad credit risk depending on a set of attributes.
  • Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and pandas in Python.
  • Wrote and optimized complex SQL queries involving multiple joins and advanced analytical functions to extract and merge large volumes of historical data stored in Snowflake DB, validating the ETL-processed data in the target database.
  • Developed Python scripts to automate the data sampling process. Ensured data integrity by checking for completeness, duplication, accuracy, and consistency.
  • Generated data analysis reports using Matplotlib and Tableau, and presented the results to C-level decision makers.
  • Used an integrated development environment (IDE), preferably Jupyter Notebook with the Anaconda package.
  • Managed large datasets using pandas DataFrames and MySQL.
  • Coded in a Python (Windows, MySQL) environment.
  • Worked in an Agile development environment using the Kanban method.

Environment: Python 3.5, pandas, NumPy, Matplotlib, MySQL, MS Excel, Windows.

Confidential, North Quincy, MA

Data Analyst / Python Developer


  • Extracted data from the SailPoint on-premises database.
  • Good understanding of Identity and Access Management (roles, entitlements, SODs, profiled access).
  • SailPoint's identity solution provides complete visibility into who is doing what and what kind of risk that represents, and allows you to take action. It links people, applications, data, and devices to create an identity-enabled enterprise.
  • Provided role mining and role consolidation.
  • Identified unknown privileged access (~35% across the enterprise) through HPA (high privileged access) discovery using data science.
  • Prevented and detected segregation-of-duties violations or toxic combinations of access and their usage.
  • Accurately measured and reported user, account, entitlement, application, departmental, and organizational risk posture.
  • Data mining: extracted sufficient quantities of the data required to solve the problem.
  • Data munging: cleaned the raw data obtained through data mining and converted it into a usable format.
  • Parsed data using automated, generic data profiling scripts that we created to process different applications.
  • Performed sentiment analysis and text classification using NLP on the data.
  • Filtered data to find actionable insights and answer role mining questions.
  • Involved in problem definition, data exploration, data acquisition and visualization, and evaluating and comparing metrics on Excel and CSV files.
  • Discovered, risk-ranked, and monitored accounts with privileged access for outlier access and anomalous behavior. Backdoor access and its misuse will be a thing of the past.
  • Plotted graphs, generated reports in Excel format, and presented them to the client.
  • Developed Tableau data visualizations and dashboards using Tableau Desktop.
  • Stored data in MySQL when needed.
  • Performed data collection, data profiling, analysis, validation, and reporting.
  • Extracted and analyzed data from various sources, including SQL and NoSQL data stores.
  • Cleaned and wrangled data using tools such as pandas and NumPy.
  • Performed data visualization and reporting using the Matplotlib library.
  • Generated reports from the data provided.
  • Experience with an integrated development environment (IDE), preferably PyCharm or Spyder.
  • Managed large datasets using pandas DataFrames and MySQL.
  • Coded in a Python (Windows, MySQL) environment.
  • Exposure to working in an Agile development environment and knowledge of methods such as Scrum and Kanban with sprint cycles.
  • Participated in team stand-up meetings.
  • Interacted directly with clients, alongside the manager and Scrum Master, to understand their further business problems.

Environment: Python 3.5, pandas, NumPy, Matplotlib, MySQL, MS Excel, Windows.

Python Developer



  • Helped prepare the business approach for solving the problem of travel ticket cancellations and extensions caused by flight delays or flight cancellations.
  • Performed data preparation, feature selection (selecting features relevant to the problem), and outlier treatment.
  • Built automated SQL scripts that generate flight delay and cancellation predictions on a daily basis.
  • Used Python and several Python packages such as pandas, NumPy, scikit-learn, and Matplotlib to build statistical machine learning models and manipulate data.
  • Used flight delay predictions to identify customers who might be affected.
  • Managed large datasets using pandas, the da-gcp package, and MySQL.
  • Used web services to get travel destination data and rates.
  • Designed the data schema and the project-relevant tables using SQL on Google BigQuery (a SQL-queryable data warehouse on Google Cloud Platform).
  • Coded in a Python (Linux, MySQL) environment.

Environment: Python 3.x, MySQL, Microsoft Excel, Google Cloud Platform, Django, SQL, Windows, and Linux.
