
Lead Data Engineer Resume


CA

SUMMARY:

  • Over 8 years of IT experience in software analysis, design, development, and implementation of AWS, Big Data, Hadoop, and Data Science technologies.
  • Hands-on experience with NoSQL databases like HBase and Cassandra and relational databases like Oracle and MySQL.
  • Experience working with MPP databases like Greenplum (Pivotal).
  • Created reports for the BI team, using Sqoop to move data into HDFS and Hive.
  • Hands-on programming experience in Python, R, and Scala.
  • Hands-on experience developing Spark applications using RDD transformations, Spark Core, Spark Streaming, and Spark SQL.
  • Strong understanding of and knowledge in NoSQL databases like HBase, MongoDB, and Cassandra.
  • Expert knowledge of data structures and algorithms.
  • Extensive experience working on the Hadoop stack, including data ingestion tools like Kafka and Storm.
  • Developed end-to-end Spark applications using Scala to perform data cleansing, validation, transformation, and summarization activities according to requirements.
  • Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
  • Working experience with Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
  • Hands-on experience in application development using Python, RDBMS, and Linux shell scripting.
  • Strong experience writing MapReduce and Spark jobs in Scala, Java, and Python using the Apache Hadoop, Spark, and PySpark APIs to analyze data.
  • Experience with all stages of the SDLC and the Agile development model, from requirements gathering to deployment and production support.
  • Experience in understanding, maintaining, and supporting existing production systems built on technologies such as Java and J2EE.
  • Extensive experience working with semi-structured and unstructured data by implementing complex MapReduce programs using design patterns.
  • Java/J2EE development with frameworks like Struts, EJBs, and web services.
  • Working knowledge of Spark RDDs, the DataFrame, Dataset, and Data Source APIs, Spark SQL, and Spark Streaming (see the sketch after this list).
  • Deployed a 15-node MapR cluster on AWS and integrated Cassandra with it.
  • Designed, configured, and deployed Amazon Web Services (AWS) for a multitude of applications utilizing the AWS stack (including EC2, Glue, Lambda, SNS, S3, RDS, CloudWatch, SQS, and IAM), focusing on high availability, fault tolerance, and auto-scaling.
  • Migrated Cassandra and Hadoop clusters to AWS and defined different read/write strategies per geography.
  • Experience writing MapReduce jobs, Pig scripts, and Hive queries, and using YARN, Apache Kafka, and Storm to analyze data.
  • Hands-on experience with NoSQL databases such as HBase and Cassandra, with basic knowledge of MongoDB.
  • Developed parser and loader MapReduce applications to retrieve data from HDFS and store it in HBase and Hive.
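As a small illustration of the Spark experience above, here is a minimal PySpark sketch combining RDD transformations with the DataFrame API and Spark SQL; the HDFS path, column names, and application name are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal sketch only: the HDFS path, delimiter, and column names are assumptions.
spark = SparkSession.builder.appName("summary-sketch").getOrCreate()

# RDD transformations: parse raw lines and drop malformed records.
raw = spark.sparkContext.textFile("hdfs:///data/events/*.csv")
rows = (raw.map(lambda line: line.split(","))
           .filter(lambda cols: len(cols) == 3))

# Promote the RDD to a DataFrame and aggregate with Spark SQL.
df = rows.toDF(["user_id", "event_type", "amount"])
df.createOrReplaceTempView("events")
spark.sql("""
    SELECT event_type, COUNT(*) AS cnt, SUM(CAST(amount AS DOUBLE)) AS total
    FROM events
    GROUP BY event_type
""").show()
```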

TECHNICAL SKILLS:

Hadoop/Big Data Technologies: Hadoop 3.0, HDFS, MapReduce, HBase 1.4, Apache Pig 0.17, Hive 2.3, Sqoop 1.4, Apache NiFi, Apache Impala 3.0, Oozie 4.3, YARN, Apache Flume 1.8, Angular 2.0, Kafka 1.1, ZooKeeper 3.4

Hadoop Distributions: Cloudera, Hortonworks, MapR

Cloud: AWS, GCP, Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HDInsight, Azure Data Lake, Data Factory

Programming Languages: Scala 2.12, Python 3.6, SQL, PL/SQL, Shell Scripting, Storm 1.0

Web/Application Servers: Apache Tomcat 9.0.7, JBoss, WebLogic, WebSphere

SDLC Methodologies: Agile, Waterfall

Databases: Oracle 12c/11g, SQL

Database Tools: TOAD, SQL*Plus, SQL

Operating Systems: Linux, Unix, Windows 10/8/7

NoSQL Databases: HBase 1.4, Cassandra 3.11, MongoDB

Version Control: GIT, SVN, CVS

PROFESSIONAL EXPERIENCE:

Confidential, CA

Lead Data Engineer

Responsibilities:

  • Reviewed and altered Python scripts for data collection and report development, making many different changes to functionality. Created data frames for data-type testing and performed normalization analysis on data in support of the data science team.
  • Worked in the Azure environment on development and deployment of custom Hadoop applications.
  • Wrote pre-processing queries in Python for internal Spark jobs.
  • Wrote various manual rules in PQL to detect fraud through the internal web application.
  • Loaded data from many different sources into the parallel-processing Greenplum (PostgreSQL-based) database and integrated the data science repository with the environment by creating CI/CD pipelines.
  • Wrote table schemas for the Greenplum database (see the sketch after this list).
  • Supported architectural design with the architects, both technically and by producing designs in Visio.
  • Performed real-time integration, loading data from the Azure Data Box and mounting it via FUSE for bulk loads.
  • Wrote keyspace scripts in Cassandra and monitored the cluster to configure replication across the database, which follows the gossip protocol.
  • Created complex SSIS packages, building ETL pipelines for bulk and daily data ingestion.
  • Worked on Hadoop HDInsight to ingest data science scoring models into Greenplum (data lake).
  • Created Sqoop jobs to ingest terabytes of data from multiple RDBMS sources into Hadoop.
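As an illustration of the Greenplum schema work mentioned above, here is a minimal sketch built around a hypothetical fraud-scoring table; the connection details, schema, and columns are placeholders, and DISTRIBUTED BY is Greenplum's clause for choosing the segment distribution key.

```python
import psycopg2

# Hypothetical table and connection details, for illustration only.
DDL = """
CREATE TABLE analytics.fraud_scores (
    txn_id      BIGINT,
    account_id  BIGINT,
    score       NUMERIC(5, 4),
    scored_at   TIMESTAMP
)
DISTRIBUTED BY (account_id);  -- Greenplum spreads rows across segments by this key
"""

conn = psycopg2.connect(host="gp-master.example.com", dbname="analytics",
                        user="etl_user", password="***")
with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    cur.execute(DDL)
conn.close()
```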

Environment: Hadoop 3.0, Cassandra, Greenplum, SSIS, Sqoop, Zookeeper 3.4, MS Azure, Spark, Cloudera, Scala 2.12, DTCC, Python, Kafka, HDFS, Flume, UNIX, NoSQL, Visio

Confidential, DC

Data Engineer (Big Data/Hadoop)

Responsibilities:

  • Worked with PySpark, improving the performance of and optimizing existing applications while migrating them from an EMR cluster to AWS Glue.
  • Performed transformation and loading using AWS Glue (see the sketch after this list).
  • Configured the Hadoop framework on EC2 instances to ensure the application that was created was up and running, and troubleshot issues to reach the desired application state.
  • Configured Glue development endpoints to point Glue jobs to a specific EMR cluster or EC2 instance.
  • Worked on PySpark SQL tasks to fetch the NOT NULL data from two different tables and load it.
  • Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch.
  • Designed and implemented a test environment on AWS.
  • Worked on architecting and configuring secure VPCs, subnets, and security groups across private and public networks.
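A minimal sketch of the kind of Glue transformation-and-load job described above, assuming a hypothetical catalog database, table, column mapping, and S3 output path; it reads from the Glue Data Catalog, drops null keys, and writes Parquet.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Hypothetical Glue job: catalog database, table, mapping, and S3 path are assumptions.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog, keep only non-null keys, and write Parquet to S3.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="transactions")
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("txn_id", "long", "txn_id", "long"),
              ("amount", "double", "amount", "double")])
non_null = mapped.filter(lambda rec: rec["txn_id"] is not None)
glue_context.write_dynamic_frame.from_options(
    frame=non_null,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/transactions/"},
    format="parquet")
job.commit()
```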

Environment: Hadoop, Hive, YARN, Spark, Scala, PySpark, DynamoDB, HBase, Zookeeper, S3, SNS, Athena, Glacier, EMR, EC2, Lambda, Glue, Glue Catalog, IAM, Python, Java, Eclipse, Unix, Linux, SQL Assistant, AWS CLI, Terraform scripts, Ambari, Hue, MySQL, Teradata, Oracle, DB2

Confidential, IL

Data Engineer

Responsibilities:

  • As a Data Engineer, worked on Hadoop ecosystem components including Hive, MongoDB, Zookeeper, and Spark Streaming with the MapR distribution.
  • Involved in Agile methodologies, daily scrum meetings, and sprint planning.
  • Used different Spark modules such as Spark Core, Spark SQL, Spark Streaming, and the Dataset and DataFrame APIs.
  • Worked on MongoDB using CRUD (Create, Read, Update, Delete) operations and its indexing, replication, and sharding features.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (see the sketch after this list).
  • Wrote scripts to distribute queries for performance-test jobs in the Amazon data lake.
  • Created Hive tables, loaded transactional data from Teradata using Sqoop, and worked with highly unstructured and semi-structured data, about 2 petabytes in size.
  • Implemented Apache NiFi flow topologies to perform cleansing operations before moving data into HDFS.
  • Designed novel machine learning algorithms for predictive modeling and to produce data insights (data science).
  • Implemented multiple MapReduce jobs in Java for data cleansing and pre-processing.
  • Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
  • Handled importing of data from various data sources, performed transformations using Hive and Pig, and loaded data into HDFS.
  • Involved in identifying job dependencies to design workflows for Oozie and YARN resource management.
  • Configured Oozie workflows to run multiple Hive and Pig jobs that run independently based on time and data availability.
  • Imported and exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Supported the cloud strategy team in integrating analytical capabilities into the overall cloud architecture and business case development.
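One way to express the Kafka-to-Parquet flow above in PySpark, shown here with Structured Streaming rather than the RDD-based DStream API; the brokers, topic, JSON schema, and HDFS paths are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical sketch: Kafka topic, brokers, schema, and HDFS paths are assumptions.
spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", StringType()),
])

# Read the real-time feed from Kafka and parse the JSON payload into columns.
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "sensor-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("v"))
          .select("v.*"))

# Save the processed feed as Parquet files in HDFS.
query = (stream.writeStream.format("parquet")
         .option("path", "hdfs:///data/parquet/sensor_events")
         .option("checkpointLocation", "hdfs:///checkpoints/sensor_events")
         .start())
query.awaitTermination()
```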

Environment: Hadoop 3.0, Hive 2.3, Zookeeper 3.4, MapR, Agile, Teradata, MapReduce, YARN, HDFS, Sqoop 1.4, Kafka 2.0, Apache NiFi, MongoDB, DTCC, Pig 0.17, HBase, Python, AWS, Scala, Spark, Oozie, Cassandra

Confidential, Omaha, NE

Data Scientist

Responsibilities:

  • Built time-series predictive models for solar and load forecasting using geospatial and energy data.
  • Applied Moran's I to measure spatial autocorrelation in multivariate, multi-dimensional GIS data.
  • Applied Hidden Markov Models in reinforcement learning and spatial pattern recognition.
  • Tested data sets for stationarity by visualizing rolling statistics and applying the Dickey-Fuller test using the Python statistical package statsmodels.
  • Developed forecast models using statistical methods such as the Autoregressive Integrated Moving Average (ARIMA) and the autocorrelation function (ACF) (see the sketch after this list).
  • Evaluated time-series forecast models using statistical evaluation metrics like Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE).
  • Applied stochastic processes to model observed time-series data.
  • Worked on many statistical methods such as correlation (Pearson, Spearman), data distributions, and descriptive statistics.
  • Built linear regression models for prediction and used linear methods for statistical significance tests and correlations in R.
  • Built a machine learning model for market segmentation using k-means clustering analysis in Python with 95.5% accuracy.
  • Built a predictive model to improve the sales hit ratio using decision trees (ID3) in Weka.
  • Predicted the target variable on test data and created confusion matrices and AUROC curves.
  • Worked on relational, graph, and document stores.
  • In-depth knowledge of SQL databases, tables, stored procedures, triggers, views, user-defined data types, and functions.
  • Experience querying relationships using the graph database Neo4j.
  • Designed and visualized interactive results using Tableau to publish dashboards.
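A minimal statsmodels sketch of the stationarity-testing and ARIMA-forecasting workflow above, run on a synthetic placeholder series; the ARIMA order and 24-step holdout are arbitrary assumptions, not the project's actual configuration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Placeholder series standing in for load/solar data; real inputs are not shown.
series = pd.Series(np.random.randn(200).cumsum() + 100.0)

# Augmented Dickey-Fuller test for stationarity.
adf_stat, p_value = adfuller(series)[:2]
print(f"ADF statistic={adf_stat:.3f}, p-value={p_value:.3f}")

# Fit ARIMA on a training window and forecast the 24-step holdout.
train, test = series[:-24], series[-24:]
fitted = ARIMA(train, order=(1, 1, 1)).fit()
forecast = fitted.forecast(steps=len(test))

# Evaluate with RMSE and MAPE.
rmse = float(np.sqrt(np.mean((forecast.values - test.values) ** 2)))
mape = float(np.mean(np.abs((test.values - forecast.values) / test.values)) * 100)
print(f"RMSE={rmse:.3f}, MAPE={mape:.2f}%")
```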

Environment: Python, MySQL, univariate/multivariate/ordinal/discrete random variables, time-series data sets, problem statement definition, data collection, data preprocessing, data exploration using statistical methods and visualizations, model building, model evaluation, model deployment, NumPy, SciPy, Pandas, Matplotlib, scikit-learn, TensorFlow, Seaborn, statsmodels, decision trees, linear regression, ARMA, ARIMA, data mining methods, and cluster analysis

Confidential, SD

Data Science/Data Engineering Intern

Responsibilities:

  • ML Implementation: Part of a team responsible for developing the full ML algorithm cycle, including data extraction and preprocessing, design, implementation, visualization, and results documentation.
  • Data Analysis: Performed data manipulation and aggregation from different sources using Nexus, Toad, BusinessObjects, Power BI, and SmartView.
  • Implemented Agile methodology for building an internal application.
  • Focused on integration overlap and Informatica's newer commitment to MDM following its acquisition of Identity Systems.
  • Classification: Used multiple supervised learning algorithms, such as logistic regression, decision trees, KNN, and Naïve Bayes, for classification problems (see the sketch after this list).
  • Fraud Detection: Applied various ML algorithms to data sets to predict credit risk and detect fraud.
  • AWS CloudSearch: Used Python to develop data extraction processes from AWS.
  • Big Data: Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
  • Data Preprocessing: Performed data cleaning and imputation of missing values.
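A small scikit-learn sketch of the preprocessing and classification steps above, run on synthetic data; the imputation strategy and logistic-regression settings are illustrative assumptions, not the internship's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with some values blanked out to exercise the imputer.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # roughly 5% missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Impute missing values, then fit a logistic-regression classifier.
clf = make_pipeline(SimpleImputer(strategy="median"),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```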

Environment: Python, Anaconda, PyCharm, Hadoop 2.3, Sqoop, Pig 0.15, Hive 1.9, HBase, MySQL, HDFS, AWS
