Data Engineer Resume
NC
SUMMARY
- Overall 3+ years of experience as a Data Engineer, including design work, a Deep Learning Specialization, Machine Learning, and TensorFlow development.
- Expertise in writing end-to-end data processing jobs to analyse data using MapReduce, Spark and Hive.
- Experience with the Apache Spark ecosystem using Spark Core, SQL, Data Frames and RDDs, and knowledge of Spark MLlib.
- Experience with integration of Jira with third-party systems such as Service Now.
- Extensive knowledge of developing Spark Streaming jobs by building RDDs (Resilient Distributed Datasets) using Scala, PySpark and Spark-Shell.
- Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy and Pandas for data analysis and numerical computations.
- Experienced in using Pig scripts to perform transformations, event joins, filters and pre-aggregations before storing the data into HDFS.
- Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom UDFs.
- Expertise in writing MapReduce jobs in Python for processing large sets of structured, semi-structured and unstructured data and storing them in HDFS.
- Good understanding of data modelling (Dimensional & Relational) concepts like Star-Schema Modelling, Snowflake Schema Modelling, Fact and Dimension tables.
- Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
- Hands-on experience working with Amazon Web Services (AWS) using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
- Strong experience in working with Windows, Linux and Mac environments, writing shell scripts.
- Experienced in working in SDLC, Agile and Waterfall Methodologies.
- Strong analytical, presentation, communication and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
TECHNICAL SKILLS
Programming Languages: Python, R, MATLAB
Packages: SciPy, NumPy, Pandas, scikit-learn, matplotlib, NLTK, spaCy, Keras, PySpark, Stanford NLP Stanza
NLP: Named Entity Recognition, POS tagging, Parsing, Vectorization, Tagging, Sentiment Analysis, Text Classification, Clustering, etc.
Databases: MySQL, SQL
Frameworks: Machine Learning & Deep Learning (Keras, TensorFlow, PyTorch), WEKA, CNN, BERT, Transformers, spaCy
Project Management Tools: Jira, Git
Operating Systems: Windows, Linux, Mac
Methodologies: Agile, Scrum, Waterfall
Cloud Technologies: Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure
Container tools: Docker
PROFESSIONAL EXPERIENCE
Confidential, NC
Data Engineer
Responsibilities:
- Responsible for the execution of big data analytics, predictive analytics and machine learning initiatives.
- Implemented a proof of concept deploying this product in an AWS S3 bucket.
- Utilized AWS services with a focus on big data architecture/analytics/enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability and performance, and to provide meaningful and valuable information for better decision-making.
- Worked on PySpark APIs for data transformations.
- Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
- Upgraded the current Linux version to RHEL version 5.6.
- Expertise in hardening Linux servers and compiling, building and installing Apache Server from source with minimal modules.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns (see the PySpark hashing sketch after this list).
- Tuned SQL queries to bring down run time by working on indexes and execution plans.
- Created a validation set using Keras2DML to test whether the trained model was working as intended.
- Defined multiple helper functions used while running the neural network in a session, along with placeholders and the number of neurons in each layer.
- Created the neural network's computational graph after defining weights and biases.
- Created a TensorFlow session used to run the neural network and validate the accuracy of the model on the validation set (see the TensorFlow sketch after this list).
- After executing the program and achieving acceptable validation accuracy, created a submission that is stored in the submission directory.
- Executed multiple Spark SQL queries after forming the database to gather specific data corresponding to an image.
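
A minimal sketch of the column-hashing approach described above; the S3 paths, column names and choice of SHA-256 are illustrative assumptions, not the client's actual specification:

```python
# Illustrative sketch: hash client-specified columns with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-hashing").getOrCreate()

raw_df = spark.read.parquet("s3://bucket/raw/")      # hypothetical source path
sensitive_cols = ["ssn", "email", "phone"]           # hypothetical client-specified columns

masked_df = raw_df
for col in sensitive_cols:
    # sha2 returns a hex digest; 256 selects SHA-256
    masked_df = masked_df.withColumn(col, F.sha2(F.col(col).cast("string"), 256))

masked_df.write.mode("overwrite").parquet("s3://bucket/masked/")
```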
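
A minimal sketch of the placeholder/graph/session workflow described above, in TensorFlow 1.x style; the layer sizes, optimizer and learning rate are illustrative assumptions:

```python
# Illustrative sketch: placeholders, weights/biases graph, and a session
# that trains and validates the network (TensorFlow 1.x style).
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

n_input, n_hidden, n_classes = 784, 128, 10   # assumed layer sizes

# Placeholders for features and one-hot labels
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])

# Weights and biases define the computational graph
weights = {"h1": tf.Variable(tf.random_normal([n_input, n_hidden])),
           "out": tf.Variable(tf.random_normal([n_hidden, n_classes]))}
biases = {"h1": tf.Variable(tf.zeros([n_hidden])),
          "out": tf.Variable(tf.zeros([n_classes]))}

def forward(inputs):
    """Helper function used while running the network in a session."""
    hidden = tf.nn.relu(tf.matmul(inputs, weights["h1"]) + biases["h1"])
    return tf.matmul(hidden, weights["out"]) + biases["out"]

logits = forward(x)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
accuracy = tf.reduce_mean(
    tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1)), tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # train_x/train_y and val_x/val_y would come from the prepared data sets:
    # sess.run(train_op, feed_dict={x: train_x, y: train_y})
    # print("validation accuracy:",
    #       sess.run(accuracy, feed_dict={x: val_x, y: val_y}))
```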
Confidential
Data Engineer, Co-founder
Responsibilities:
- Collaborated with data engineers and the operations team to implement the ETL process, and wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Performed data analysis using Hive and SQL to retrieve data from Redshift.
- Explored and analysed the customer specific features by using Spark SQL.
- Performed data imputation using Scikit-learn package in Python.
- Responsible for ETL development with successful design, development, and integration of components within the Talend ETL Platform and Java Technology.
- Participated in feature engineering such as feature intersection generation, feature normalization and label encoding with Scikit-learn preprocessing.
- Created complex JIRA workflows including project workflows, custom fields, notification schemes, reports and dashboards in JIRA.
- Migrated and upgraded Jira from Oracle to PostgreSQL environments.
- Worked on creating complex stored procedures, SSIS packages, triggers, cursors, tables, views and other SQL joins and statements for applications.
- Designed and implemented recommender systems that used collaborative filtering techniques to recommend courses to different customers, and deployed them to an AWS EMR cluster (see the ALS sketch after this list).
- Utilized natural language processing (NLP) techniques to optimize customer satisfaction.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
- Developed containment scripts for data reconciliation using SQL and Python.
- Performed data analysis and data profiling using complex SQL on various source systems including MySQL and Teradata.
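
A minimal sketch of the collaborative-filtering recommender described above, using Spark MLlib's ALS; the ratings schema, S3 paths and hyperparameters are illustrative assumptions, and the job would be submitted to the EMR cluster via spark-submit:

```python
# Illustrative sketch: course recommendations via ALS collaborative filtering.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("course-recommender").getOrCreate()

# Hypothetical feedback table with columns (user_id, course_id, rating)
ratings = spark.read.parquet("s3://bucket/course_ratings/")

als = ALS(
    userCol="user_id",
    itemCol="course_id",
    ratingCol="rating",
    rank=10,
    maxIter=10,
    regParam=0.1,
    coldStartStrategy="drop",   # skip users/items unseen at training time
)
model = als.fit(ratings)

# Top-5 course recommendations per customer
recommendations = model.recommendForAllUsers(5)
recommendations.write.mode("overwrite").parquet("s3://bucket/recommendations/")
```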