Data Engineer Resume
Palm Beach Gardens, FL
SUMMARY
- Data Engineer/Machine Learning Engineer with 5+ years of progressive experience, with emphasis on data munging, data cleaning, data analytics, data visualization, and Big Data ecosystems using Hadoop, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, YARN, Oozie, and Zookeeper.
- Experience with Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.
- Experience in Data Science/Machine Learning across domains such as data analytics, machine learning (ML), predictive modeling, natural language processing (NLP), and deep learning.
- Proficient in a wide variety of data science languages and libraries: Python, R, SQL, PySpark, scikit-learn, NumPy, SciPy, Pandas, NLTK, TextBlob, Gensim, spaCy, Keras, and TensorFlow.
- Experienced in facilitating the entire lifecycle of a data science project: data cleaning, data extraction, data pre-processing, dimensionality reduction, algorithm implementation, back-testing, and validation.
- Expert in machine learning algorithms such as ensemble methods (random forests), linear, polynomial, and logistic regression, regularized linear regression, Support Vector Machines (SVM), deep neural networks, extreme gradient boosting, decision trees, K-Means, K-NN, Gaussian mixture models, and Naive Bayes.
- Experience in processing large datasets with Spark using Python.
- Solid understanding of big data technologies like Hadoop, Spark, HDFS, MapReduce, and Hive.
- Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, and knowledge of Spark MLlib.
- Experience in developing ETL applications on large volumes of data using tools such as MapReduce, Spark, PySpark, Spark SQL, and Pig.
- Experience in using Sqoop for importing and exporting data between RDBMS and HDFS/Hive.
- Experience with various databases such as MySQL, SQL Server, DB2, Oracle, and NoSQL stores including MongoDB, Cassandra, and HBase.
- Experience in Continuous Integration and Deployment (CI/CD) using build tools like Jenkins, Maven, and Ant.
- Exposure to data lake implementation; developed data pipelines and applied business logic using Apache Spark.
- Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes.
- Experience in real-time data streaming using NiFi and Kafka.
- Proficient in various AWS services such as VPC, EC2, S3, ELB, Auto Scaling Groups (ASG), Elasticsearch, CloudFormation, Glue, Athena, Lambda, Step Functions, Kinesis, Route 53, CloudWatch, CloudFront, CloudTrail, SQS, SNS, SES, EKS, EMR, AWS Systems Manager, etc.
- Experience with version control tools such as Git, GitHub, and SVN.
- Experience in using Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment.
TECHNICAL SKILLS
Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, Yarn, Airflow, Apache Spark.
Databases: Oracle, MySQL, SQL Server, MongoDB, PostgreSQL, Teradata.
Programming: Python, R, Java, Shell script, SQL
Machine Learning: RNN, CNN, Regression (Linear and Logistic), Decision trees, Random Forest, SVM, KNN, PCA.
ML Frameworks: Pandas, Keras, NumPy, TensorFlow, Scikit-Learn, NLTK, Caffe.
Cloud Technologies: AWS, GCP
AWS Tools: EC2, S3, VPC, CloudWatch, EMR, EKS, ELB, Kinesis, Elasticsearch, Auto Scaling, Glue, Athena.
Versioning tools: SVN, Git, GitHub
Operating Systems: Windows, Ubuntu Linux, MacOS
PROFESSIONAL EXPERIENCE
Confidential, Palm Beach Gardens, FL
Data Engineer
Responsibilities:
- Worked with TensorFlow, Keras, NumPy, scikit-learn, the tf.data API, and Jupyter Notebook in Python at various stages of developing, maintaining, and optimizing machine learning models.
- Extracted fingerprint image data stored on the local network to conduct exploratory data analysis (EDA), cleaning, and organization. Ran the NFIQ algorithm to ensure data quality by retaining high-scoring images, and created histograms to compare distributions across datasets.
- Transformed the image dataset into protocol buffers, serialized it, and stored it in the TFRecord format.
- Loaded the data onto the GPU and achieved half-precision (FP16) training on Nvidia Titan RTX and Titan V GPUs with TensorFlow 1.14.
- Optimized the TFRecord data ingestion pipeline using the tf.data API and made it scalable by streaming over the network, enabling training on datasets larger than host memory (a minimal sketch appears after this list).
- Automated training and hyperparameter optimization to quickly run and test 50 variations of the model, storing the results and generating automated reports.
- Maintained models created by other data scientists and retrained them with different variations of the datasets.
- Created tooling to help other data scientists explore the data and perform related tasks more effectively.
- Productized existing TensorFlow models by converting them to the TFLite format, enabling integration with existing C++ and Android applications.
- Conducted transfer learning with a pretrained ResNet50 model by freezing the lower layers and retraining the top layers on fingerprint images.
- Visualized what the internal layers of the CNN are "seeing" by generating Class Activation Maps (CAM).
- Used validation and test sets to guard against overfitting and ensure accurate predictions, measuring performance with confusion matrices and ROC curves.
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Implemented a proof of concept deploying the product on an AWS S3 bucket and Snowflake.
- Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation and queries, writing results back into the S3 bucket.
- Developed MapReduce/Spark modules for machine learning and predictive analytics in Hadoop on AWS.
- Involved in data cleansing and data mining.
- Used Python to analyze data, plot visualizations with Matplotlib and Seaborn, and implement ML algorithms for large-dataset analysis.
- Selected features and built and optimized classifiers using machine learning techniques.
- Analyzed text data using NLP libraries in Python.
- Applied classification models such as Naive Bayes, logistic regression, random forests, and support vector classifiers from the scikit-learn library, and improved model performance with ensemble methods such as XGBoost and gradient boosting.
- Evaluated the accuracy and precision of the algorithm using a variety of validation techniques.
- Implemented machine learning models in Spark using PySpark.
- Wrote, compiled, and executed Apache Spark programs in Scala as necessary to perform ETL jobs on ingested data.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Used PySpark for extracting, filtering, and transforming data in data pipelines.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to data scientists for further analysis.
- Designed and developed Spark workflows in Python to pull data from the AWS S3 bucket and Snowflake and apply transformations.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
- Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production (a minimal DAG sketch appears after this list).
- Migrated data from the AWS S3 bucket to Snowflake by writing custom read/write Snowflake utility functions in Python.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake on the Confidential AWS S3 bucket.
- Profiled structured, unstructured, and semi-structured data across various sources to identify patterns and implemented data quality metrics using queries or Python scripts depending on the source.
- Involved in building a data pipeline and performing analytics using the AWS stack (EMR, EC2, S3, RDS, Lambda, Kinesis, Athena, ELB, Glue).
- Created pipelines using data from the S3 buckets.
- Used pipelines and Tableau to visualize the data and present reports to stakeholders.
- Wrote complex queries in Spark SQL and Python to interact with the data.
- Installed and configured Apache Airflow for the S3 bucket and the Snowflake data warehouse, and created DAGs to run in Airflow.
- Used Lambda functions and Step Functions to trigger Glue jobs and orchestrate the data pipeline.
- Used the PyCharm IDE for Python/PySpark development and Git for version control and repository management.
- Created a DAG using the Email, Bash, and Spark Livy operators to execute tasks on an EC2 instance.
- Deployed the code to EMR via CI/CD using Jenkins.
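A minimal sketch of the TFRecord ingestion pipeline described above, using the tf.data API. The file pattern, feature keys, and image shape are hypothetical placeholders, and AUTOTUNE follows the TensorFlow 1.x (tf.data.experimental) naming.

    # Sketch of a TFRecord ingestion pipeline with the tf.data API.
    # File pattern, feature keys, and image shape are hypothetical placeholders.
    import tensorflow as tf

    IMAGE_SHAPE = (256, 256, 1)   # assumed fingerprint image size
    BATCH_SIZE = 64

    def parse_example(serialized):
        # The feature spec must match how the records were written.
        features = {
            "image_raw": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        }
        parsed = tf.io.parse_single_example(serialized, features)
        image = tf.io.decode_raw(parsed["image_raw"], tf.uint8)
        image = tf.reshape(image, IMAGE_SHAPE)
        image = tf.cast(image, tf.float16) / 255.0   # half precision for FP16 training
        return image, parsed["label"]

    def make_dataset(file_pattern):
        files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
        dataset = tf.data.TFRecordDataset(files, num_parallel_reads=4)
        dataset = dataset.map(parse_example,
                              num_parallel_calls=tf.data.experimental.AUTOTUNE)
        dataset = dataset.shuffle(10000).batch(BATCH_SIZE)
        # Prefetching keeps the GPU fed while the CPU reads the next batch,
        # which is what lets datasets larger than host memory be streamed.
        return dataset.prefetch(tf.data.experimental.AUTOTUNE)

    train_ds = make_dataset("/mnt/fingerprints/train-*.tfrecord")  # hypothetical path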
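A minimal Airflow DAG sketch of the daily S3-to-Snowflake batch workflow described above. The DAG id, bucket name, and load_to_snowflake() helper are hypothetical, and the import paths follow Airflow 2.x.

    # Sketch of a daily Airflow DAG that stages files from S3 and loads Snowflake.
    # DAG id, bucket/table names, and load_to_snowflake() are hypothetical.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator      # Airflow 2.x import paths
    from airflow.operators.python import PythonOperator

    def load_to_snowflake(**context):
        # Hypothetical helper: would issue a COPY INTO against Snowflake,
        # e.g. via the snowflake-connector-python library.
        pass

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="s3_to_snowflake_daily",          # hypothetical DAG id
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Stage raw files within the landing bucket (bucket name is a placeholder).
        stage_files = BashOperator(
            task_id="stage_s3_files",
            bash_command="aws s3 sync s3://my-landing-bucket/raw/ s3://my-landing-bucket/staged/",
        )

        copy_into_snowflake = PythonOperator(
            task_id="copy_into_snowflake",
            python_callable=load_to_snowflake,
        )

        stage_files >> copy_into_snowflake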
Environment: Hadoop, Agile, MapReduce, Snowflake, Spark, Hive, Kafka, Python, R, Airflow, JSON, AWS, EC2, S3, Athena, Glue, Auto Scaling, EKS, ELB, TensorFlow, Keras
Confidential, Philadelphia
Data Engineer
Responsibilities:
- Designed and set up an enterprise data lake to support various use cases including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
- Worked on developing PySpark scripts to encrypt the raw data by applying hashing algorithms to client-specified columns (a minimal sketch appears after this list).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Used the AWS Glue Data Catalog with crawlers to pull data from S3 and perform SQL query operations.
- Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
- Used supervised and unsupervised techniques such as logistic regression classifiers, random forest classifiers, autoencoder neural networks, Isolation Forest, Local Outlier Factor, Elliptic Envelope, and One-Class Support Vector Machines for anomaly detection on Wi-Fi RDK-B markers and connected clients.
- Analyzed data using SQL, R, Python, and presented analytical reports to management and technical teams.
- Performed data cleaning and feature selection using the machine learning package in PySpark and worked with deep learning frameworks such as TensorFlow and Keras.
- Performed natural language processing (NLP) tasks such as sentiment analysis, entity recognition, topic modeling, and text summarization using Python libraries such as NLTK, TextBlob, spaCy, and Gensim.
- Segmented customers based on demographic, geographic, behavioral, and psychographic data using K-means clustering. Designed and implemented end-to-end systems for data analytics and automation, integrating custom visualization tools using Python and Tableau.
- Used Pandas DataFrames, NumPy, Jupyter Notebook, SciPy, scikit-learn, TensorFlow, and Keras as tools for machine learning and deep learning.
- Wrote complex SQL statements to interact with the RDBMS database to filter the data and data analytics.
- Used Apache Spark for big data processing, streaming, SQL, and machine learning (ML).
- Developed PySpark code for AWS Glue jobs and for EMR.
- Used Data Build Tool (dbt) for transformations in the ETL process, along with AWS Lambda.
- Worked on scheduling all jobs using Airflow scripts written in Python, adding tasks to DAGs and defining dependencies between the tasks.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon Athena.
- Used the Spark SQL Python interface, which automatically converts RDDs into schema-aware DataFrames.
- Wrote various data normalization jobs for new data ingested into Amazon S3 and Amazon Athena.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and PySpark.
- Managed the entire product site on Tableau and QuickSight while dealing with products relating to various clients.
- Created a pipeline to query the Athena database in real time for information on table sizes and column counts, in order to reduce size and optimize the pipelines.
- Imported data from different sources such as HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
- Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 costs (a minimal sketch appears after this list).
- Involved in creating, debugging, scheduling, and monitoring Airflow jobs for ETL batch processing that loads into Snowflake for analytical processes.
- Developed a reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the data lake, utilizing Spark data sources and Hive data objects.
- Conducted data blending and data preparation using SQL for Tableau consumption and published data sources to Tableau Server.
- Implemented AWS Step Functions to automate and orchestrate the Amazon SageMaker related tasks such as publishing data to S3, training ML model and deploying it for prediction.
- Created Athena data sources on S3 buckets for ad hoc querying and business dashboarding using the QuickSight and Tableau reporting tools.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
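A minimal PySpark sketch of hashing client-specified columns as described above, assuming SHA-256 via pyspark.sql.functions.sha2; the input/output paths and column names are placeholders.

    # Sketch: hash client-specified columns of a raw dataset with SHA-256.
    # Input/output paths and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("mask_client_columns").getOrCreate()

    SENSITIVE_COLUMNS = ["customer_id", "email"]   # columns specified by the client

    raw_df = spark.read.parquet("s3://my-bucket/raw/customers/")

    masked_df = raw_df
    for col in SENSITIVE_COLUMNS:
        # sha2() returns the hex digest; 256 selects SHA-256.
        masked_df = masked_df.withColumn(col, F.sha2(F.col(col).cast("string"), 256))

    masked_df.write.mode("overwrite").parquet("s3://my-bucket/masked/customers/")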
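A minimal Boto3 sketch of a Lambda handler that deregisters unused AMIs; the region list and the rule for "unused" (AMIs not referenced by any instance in the region) are illustrative assumptions.

    # Sketch of a Lambda handler that deregisters AMIs not used by any instance.
    # The region list and the "unused" rule are illustrative assumptions.
    import boto3

    REGIONS = ["us-east-1", "us-west-2"]   # hypothetical application regions

    def lambda_handler(event, context):
        deregistered = []
        for region in REGIONS:
            ec2 = boto3.client("ec2", region_name=region)

            # AMIs owned by this account.
            images = ec2.describe_images(Owners=["self"])["Images"]

            # AMIs currently referenced by instances in this region.
            in_use = set()
            for reservation in ec2.describe_instances()["Reservations"]:
                for instance in reservation["Instances"]:
                    in_use.add(instance["ImageId"])

            for image in images:
                if image["ImageId"] not in in_use:
                    ec2.deregister_image(ImageId=image["ImageId"])
                    deregistered.append(image["ImageId"])

        return {"deregistered": deregistered}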
Environment: AWS EMR, EC2, S3, RDS, Athena, Glue, Auto Scaling, Elasticsearch, Lambda, Amazon SageMaker, Apache Spark, Hive, MapReduce, Snowflake, Python, Tableau, Agile.
Confidential, San Francisco, CA
Data Engineer
Responsibilities:
- Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
- Set up storage and data analysis tools in the Amazon Web Services (AWS) cloud computing infrastructure.
- Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using R, Mahout, Hadoop and MongoDB.
- Worked with data architects and IT architects to understand the movement and storage of data, using ER Studio.
- Worked with several R packages including knitr, dplyr, SparkR, Causal Infer, and Space-Time.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop various machine learning algorithms.
- Demonstrated experience in design and implementation of Statistical models, Predictive models, enterprise data model, metadata solution and data life cycle management in both RDBMS, Big Data environments.
- Used machine learning algorithms such as decision trees and random forests to predict the urgency of problem statements received by the company; this was done by calculating weighted totals of the polarity and subjectivity of each problem statement and classifying it accordingly.
- Installed and used the Caffe deep learning framework.
- Utilized Spark, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python for a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs with the Spark SQL and MLlib libraries.
- Used Data Quality Validation techniques to validate Critical Data Elements (CDE) and identified various anomalies.
- Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
- Worked on Hadoop architecture and various components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, and MapReduce concepts.
- Programmed a utility in Python that used multiple packages (SciPy, NumPy, Pandas).
- Implemented classification using supervised algorithms such as logistic regression, decision trees, KNN, and Naive Bayes (a minimal sketch appears after this list).
- Worked on batch processing of data sources using Apache Spark and Elasticsearch.
- Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.
- Used different file formats such as text files, Sequence files, Avro, Record Columnar (RC), and ORC.
- Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Used Amazon Web Services (AWS) such as EC2 and S3 for small data sets.
- Designed both 3NF data models for ODS, OLTP systems and Dimensional Data Models using Star and Snowflake Schemas.
- Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that we could assign each document a response label for further classification.
- Created SQL tables with referential integrity and developed queries using SQL, SQL*Plus, and PL/SQL.
- Designed and developed Use Case, Activity Diagrams, Sequence Diagrams, OOD (Object oriented Design) using UML and Visio.
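A minimal scikit-learn sketch of the supervised classification workflow described above (logistic regression and decision trees with a held-out test split and confusion-matrix evaluation); the dataset here is synthetic.

    # Sketch: supervised classification with a held-out test split and
    # accuracy/confusion-matrix evaluation. The data below is synthetic.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(max_depth=5),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        print(name, accuracy_score(y_test, preds))
        print(confusion_matrix(y_test, preds))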
Environment: AWS, R, Informatica, Machine Learning Algorithms, Anaconda, Market Basket Analysis, Sentiment Analysis, Polarity, Predictive Analytics, Deep Learning Algorithms, CNN, HCNN, Python, Data Mining, Data Collection, Data Cleaning, Validation, HDFS, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes, MS Visio, MapReduce, Rational Rose, SQL, and MongoDB.
Confidential, Los Angeles, California
Data Engineer
Responsibilities:
- Implemented machine learning methods, optimization, and visualization, applying statistical methods such as regression models, decision trees, Naive Bayes, ensemble classifiers, hierarchical clustering, and semi-supervised learning on different datasets using Python.
- Researched and implemented various Machine Learning Algorithms using the R language.
- Devised a machine learning algorithm using Python for facial recognition.
- Used R on a data sample for prototype exploration to identify the best algorithmic approach, and then wrote Scala scripts using the Spark machine learning module.
- Used Scala scripts to execute the Spark machine learning library APIs for decision trees, ALS, and logistic and linear regression algorithms (a minimal sketch appears after this list).
- Worked on migrating on-premises virtual machines to an Azure Resource Manager subscription with Azure Site Recovery.
- Provided consulting and cloud architecture for premier customers and internal projects running on the MS Azure platform for high availability of services and low operational costs.
- Developed structured, efficient, and error-free code for Big Data requirements using knowledge of Hadoop and its ecosystem.
- Developed a web service using Windows Communication Foundation and .NET to receive and process XML files, deployed as a Cloud Service on Microsoft Azure.
- Worked on various methods including data fusion and machine learning, and improved the accuracy of distinguishing correct rules from potential rules.
- Worked on HBase to support enterprise production, loading data into HBase using Sqoop.
- Developed Merge jobs in Python to extract and load data into a MySQL database.
- Used a test-driven approach to develop the application and implemented unit tests using the Python unittest framework.
- Tested various machine learning algorithms such as Support Vector Machines (SVM), Random Forest, and tree models with XGBoost, and concluded that decision trees were the champion model.
- Built models using Statistical techniques like Bayesian HMM and Machine Learning classification models like XGBoost, SVM, and Random Forest.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
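The Spark machine learning work described above follows the same pattern as this minimal sketch, shown in PySpark rather than Scala to keep one language across these examples; the input path, feature columns, and label column are placeholders.

    # Sketch: Spark ML pipeline with a logistic regression model, shown in
    # PySpark; the Scala version follows the same API. Paths/columns are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("lr_example").getOrCreate()

    df = spark.read.parquet("s3://my-bucket/training/")   # hypothetical input

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    data = assembler.transform(df).select("features", "label")

    train, test = data.randomSplit([0.8, 0.2], seed=42)

    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50)
    model = lr.fit(train)

    predictions = model.transform(test)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
    print("Test AUC:", auc)

    spark.stop()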
Environment: Machine Learning, R Language, Hadoop, Big Data, Azure, Python, Spark, Scala, HBase, MySQL, MongoDB, Agile.