Data Engineer Resume

Santa Clara, CA

SUMMARY

  • Over 6 years of experience in data analysis and data engineering covering the whole data lifecycle, from data ingestion, wrangling and modeling, to data visualization and insight discovery.
  • Data-driven mindset, passionate about diving deep into data and communicating data-based findings and insights with co-workers and business stakeholders.
  • Strong programming skills in Python (NumPy, Pandas, scikit-learn, Seaborn), SQL, Java, R, Scala, and Linux shell scripting.
  • Hands-on experience with RDBMS such as MySQL and PostgreSQL, as well as NoSQL databases such as HBase, Cassandra, and MongoDB.
  • Good experience with big data frameworks including Hadoop, MapReduce, HDFS, YARN, HBase, and Hive.
  • Experience in real-time data streaming including Spark, Kafka, Flume.
  • Versed in performing ETL tasks, designing data warehouse, and building data pipelines for data ingestion, aggregation, transformation, grouping, join, etc.
  • Hands-on experience in cloud computing such as AWS and GCP, including AWS experience in EC2, S3, RDS, Elastic Beanstalk, Glue, CloudWatch, as well as GCP experience with deployment of Docker and k8s.
  • Knowledge in statistics including descriptive statistics, inferential statistics, probability theory, probability distributions, and Bayesian statistics.
  • Modeling experience leveraging machine learning techniques such as regression, classification, and dimension reduction, as well as deep learning techniques such as convolutional neural networks and recurrent neural networks with TensorFlow and Keras.
  • Good understanding of business requirements and good product sense, familiar with purchase funnel analysis, fractional attribution analysis, A/B testing design, strategies such as SEM and SEO.
  • Proficient in BI tools such as Tableau, Power BI, Google Analytics and Matplotlib with good experience of creating interactive data-oriented reports and dashboards.
  • Effective communication skills and presentation skills as evidenced by working with people from both Engineering and Marketing. Meet and collaborate with managers, development teams, stakeholders.

TECHNICAL SKILLS

Programming Languages: Python, SQL, Java, R, Scala, Shell Scripting

Data Wrangling & Visualization: NumPy, Pandas, Tableau, Matplotlib, ggplot, Seaborn

Machine Learning & Deep Learning: scikit-learn, Logistic Regression, Random Forest, K-means Clustering, Keras, TensorFlow

Big Data: Hadoop, Hive, Spark

Cloud: AWS, GCP

Deployment & Version Control: Docker, Kubernetes, Heroku, Git

Web design: Flask, HTML, CSS

PROFESSIONAL EXPERIENCE

Confidential, Santa Clara, CA

Data Engineer

Responsibilities:

  • Design, create and implement RDBMS as well as NoSQL databases; build views, indexes, and stored procedures.
  • Model product information and customer features, and build a data warehouse solution to support BI activities.
  • Write SQL queries on RDBMS such as MySQL/Postgres and HiveQL queries on Hive tables for data extraction and preliminary data analysis.
  • Build data pipelines including data ingestion, data transformation such as aggregation, filtering, cleaning, and data storage.
  • Ingest data from SQL and NoSQL databases and from multiple data formats such as XML, JSON, and CSV.
  • Ingest real-time customer behavioral data into HDFS using Flume, Sqoop, and Kafka, and transform it using Spark Streaming.
  • Perform ETL operations using Spark with Scala and PySpark, under IntelliJ with Java and PyCharm with Python respectively.
  • Implement and execute parallel MapReduce jobs in Java to process log data from the servers.
  • Monitor and health-check the data warehouse, and provide cost-effective failover and disaster recovery solutions.
  • Leverage YARN for large-scale distributed data, and troubleshoot and resolve Hadoop cluster performance issues.
  • Perform data management and data queries using Spark, and handle streaming data using Kafka to ensure data is transferred and processed quickly and reliably.
  • Leverage AWS S3 as storage solution for HDFS, AWS Glue as the ETL solution and AWS Kinesis as the data streaming solution to deploy the data pipeline on cloud.
  • Migrate data warehouse from RDBMS to AWS Redshift and analyze log data using AWS Athena on S3. Maintain Hadoop cluster using AWS EMR.
  • Data cleansing, data manipulation, data wrangling using Python to eliminate invalid datasets and reduce prediction error.
  • Conducted A/B test on metrics such as customer retention, acquisition, sales revenue, and volume growth to assess the performance of products.
  • Leveraged Pandas, NumPy and Seaborn for exploratory data analysis.
  • Extend Hive functionality by using User Defined Functions including UDF, UDTF, and UDAF.
  • Developed predictive modeling using Python packages such as SciPy and scikit-learn as well as Mixed-effect models and time series models in R based on business requirements.
  • Feature selection, feature extraction using Spark Machine Learning libraries including algorithms such as multivariate regression, K-means clustering, KNN.
  • Carried out Dimension Reduction with PCA and Feature Engineering with Random Forest to capture key features for predicting annual sales and best purchased product using Python and R.
  • Created Hive integrated Tableau dashboards and reports to visualize the time series of purchase value to keep track of the business metrics as well as deliver business insights to stakeholders.
  • Use Git for version control and Maven to build, test, and deploy Java projects.

Confidential, San Jose, CA

Data Engineer

Responsibilities:

  • Collect 7 million pairs of ‘raw-punctuated’ text data in CSV files for the text cleaning stage of a speech recognition app.
  • Perform data ingestion, transformation, and cleaning utilizing Python NumPy and Pandas.
  • Implement and evaluate punctuated text data as a post-processor of speech recognition RNN using Keras and TensorFlow.
  • Integrate the cleaned text data with a REST API and deploy the API to AWS Elastic Beanstalk.
  • Leverage S3 for data lake solution, DynamoDB and RDS for database solution.
  • Leverage CloudWatch to monitor the performance of the product. Implement auto-scaling structures to deal with failovers.
  • Design and develop data augmentation for synthetic text and voice data.
  • Pre-process raw data and conduct data wrangling such as grouping, aggregation, filtering, and replacing missing values using Python.
  • Apply tree-based ensemble algorithms such as XGBoost and AdaBoost for feature extraction and feature selection.
  • Work with ML teams dealing with acoustics and leverage toolkit such as NLTK.
  • Build analysis and prediction algorithms for correlations of features and conduct Hypothesis Testing to determine the significance level.
  • Developed innovative solutions to big data and cloud issues such as deploying the Docker containers and k8s pods on GCP.

Confidential, Irvine, CA

Data Engineering Researcher

Responsibilities:

  • Design the data architecture in MySQL for storage of device information, saving 30% of manpower on the legacy database.
  • Create REST APIs for model testers to upload and download their test results on our server that is integrated with the database.
  • Automate ETL procedures and build data warehouse to keep the product and model testers information updated.
  • Combine new device models into the data warehouse, track product versions, and interpret the reasons for test failures using data analytics tools.
  • Data transformation and exploratory data analysis using Python, R and data visualization using Matplotlib.
  • Database migration utilizing SSIS from DB2 to SQL Server and design data warehouse using FTDW sizing tools.
  • Enforce data quality in data warehouse by data cleansing using SSIS data flow services.
  • Perform data analytics using SSAS and produce formatted reports using SSRS.
  • Create reports and dashboards using Tableau and Power BI to deliver business insights to managers and stakeholders.
  • Process XML, JSON, and Delta tables and build ETL data pipelines with dashboards.
  • Implement visualization tools to generate daily, weekly, and monthly dashboards from massive databases to monitor key features of data and handle event logging.
  • Provide application support, such as reviewing and tuning production-related queries and handling long-running batch jobs.
  • Isolate and debug infrastructure problems and perform problem resolution.