
Associate Data Engineer Resume


Irving, TX

SUMMARY

  • 8+ years of experience in implementing various Big Data/Cloud Engineering, Snowflake, Data Warehouse, Data Mart, Data Visualization, Reporting, Data Quality, Data Virtualization, and Data Science solutions.
  • Experience in data transformation, data mapping from source to target database schemas, and data cleansing procedures
  • Adept in programming languages like R and Python, as well as Big Data technologies like Hadoop and Hive
  • Excellent Knowledge of Relational Database Design, Data Warehouse/OLAP concepts, and methodologies
  • Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures
  • Expertise in OLTP/OLAP system study, analysis, and E-R modeling, developing database schemas such as Star and Snowflake schemas used in relational, dimensional, and multidimensional modeling
  • Experience in designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Impala, Sqoop, Oozie, HBase, MongoDB, Spark, and Kafka
  • Experienced in Data Architecture and data modeling using Erwin, ER-Studio and MS Visio
  • Experience in coding SQL for developing Procedures, Triggers, and Packages
  • Experience in creating separate virtual data warehouses with different size classes in Snowflake on AWS
  • Hands-on experience in bulk loading & unloading data into Snowflake tables using the COPY command (see the loading sketch after this list)
  • Experience with data transformations utilizing SnowSQL in Snowflake
  • Experience writing Spark Streaming and Spark batch jobs, and using Spark MLlib for analytics
  • Experience in importing and exporting data using Sqoop between HDFS and Relational Database Systems (RDBMS) such as Oracle, DB2, and SQL Server
  • Well experienced in Normalization, De-Normalization and Standardization techniques for optimal performance in relational and dimensional database environments
  • Solid understanding of AWS (Redshift, S3, EC2) and of Apache Spark and Scala processes and concepts
  • Hands on experience in machine learning, big data, data visualization, R and Python development, Linux, SQL, GIT/GitHub
  • Experience with data visualization using tools like ggplot, Matplotlib, Seaborn, and Tableau, and in using Tableau to publish and present dashboards and storylines on web and desktop platforms
  • Experienced in Python data manipulation for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations
  • Hands on experience with RStudio for doing data pre-processing and building machine learning algorithms on different datasets
  • Experienced working on NoSQL databases like MongoDB and HBase
  • Extensive working experience with Python, including Scikit-learn, SciPy, Pandas, and NumPy, for developing machine learning models and manipulating and handling data
  • Extensive experience in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions to various business problems and generating data visualizations using R, Python
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regression, Random Forest, SVM, Bayesian, XGBoost, K-Means Clustering, K-Nearest Neighbors) and Statistical Modeling in Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, Hypothesis testing, Factor analysis/PCA, and Ensembles
  • Implemented machine learning algorithms on large datasets to understand hidden patterns and capture insights
  • Experienced in building and optimizing big data pipelines, architectures, and data sets using the TensorFlow Data API (tf.data), Spark, and Hive.
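
As a rough illustration of the COPY-based bulk loading called out above, the sketch below uses the Snowflake Python connector to run COPY INTO against a staged file set. The account, warehouse, stage, and table names are placeholders, not details from the original work.

```python
# Minimal sketch: bulk-loading staged CSV files into a Snowflake table with COPY INTO.
# All connection parameters, stage, and table names below are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="LOADER",                  # placeholder credentials
    password="***",
    account="xy12345.us-east-1",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # Assumes an external stage (e.g. over an S3 bucket) already exists.
    cur.execute("""
        COPY INTO STAGING.ORDERS_RAW
        FROM @ORDERS_STAGE/2023/
        FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    print(cur.fetchall())           # COPY returns one result row per loaded file
finally:
    conn.close()
```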

TECHNICAL SKILLS

Big Data: Hadoop, HDFS, Sqoop, HBase, Hive, MapReduce, Spark, Cassandra, Kafka

Languages: Python (Jupyter Notebook, PyCharm IDE), R, Java, C

Cloud Computing Tools: Snowflake, SnowSQL, AWS, Databricks, GCP, Azure Data Lake services

ETL Tools: TensorFlow Data API (tf.data), PySpark

Modeling and Architecture Tools: Erwin, ER Studio, Star-Schema and Snowflake-Schema Modeling, Fact and Dimension tables, Pivot Tables

Databases: Snowflake Cloud Database, Oracle, MS SQL Server, Teradata, MySQL, DB2

Database Tools: SQL Server Data Tools, Visual Studio, Spotlight, SQL Server Management Studio, Query Analyzer, Enterprise Manager, JIRA, Profiler

Reporting Tools: MS Excel, Tableau, Tableau server, Tableau Reader, Power BI, QlikView

Machine Learning Algorithms: Logistic Regression, Linear Regression, Support Vector Machines, Decision Trees, K-Nearest Neighbors, Random Forests, Gradient Boosted Decision Trees, Stacking Classifiers, Cascading Models, Naive Bayes, K-Means Clustering, Hierarchical Clustering, and Density-Based Clustering

PROFESSIONAL EXPERIENCE

Confidential, Irving, TX

Associate Data Engineer

Responsibilities:

  • Migrated the existing data from Teradata/SQL Server to Hadoop and performed ETL operations on it.
  • Responsible for loading structured, unstructured, and semi-structured data into Hadoop by creating static and dynamic partitions.
  • Worked on different data formats such as JSON and applied machine learning algorithms in Python.
  • Created a task scheduling application to run in an EC2 environment on multiple servers.
  • Applied strong knowledge of various data warehousing methodologies and data modeling concepts.
  • Created Hive partitioned tables using Parquet and Avro formats to improve query performance and space utilization.
  • Responsibilities included database design and creation of user databases.
  • Moved ETL pipelines from SQL Server to the Hadoop environment.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the PySpark sketch after this list).
  • Implemented a CI/CD pipeline using Jenkins and Airflow for Docker containers deployed on Kubernetes (a minimal orchestration sketch follows the Environment line below).
  • Used SSIS, NiFi, Python scripts, and Spark applications for ETL operations to create data flow pipelines, and was involved in transforming data from legacy tables to Hive, HBase tables, and S3 buckets for handoff to business and data scientists to build analytics over the data.
  • Supported current and new services that leverage AWS cloud computing architecture, including EC2, S3, and other managed service offerings.
  • Used advanced SQL methods to code, test, debug, and document complex database queries.
  • Designed relational database models for small and large applications.
  • Designed and developed Scala workflows for data pulls from cloud-based systems and applied transformations to them.
  • Developed reliable, maintainable, and efficient code in SQL, Linux shell, and Python.
  • Implemented Apache Spark code to read multiple tables from real-time records and filter the data based on requirements.
  • Stored final computation results in Cassandra tables and used Spark SQL and Spark Datasets to perform data computations.
  • Used Spark for data analysis and stored final computation results in HBase tables.
  • Troubleshot and resolved complex production issues while providing data analysis and data validation.
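
A minimal PySpark sketch of the extract-transform-load pattern described in the bullets above: reading semi-structured JSON, transforming it with Spark SQL functions, and writing a partitioned Parquet table. The paths, column names, and table name are illustrative assumptions, not taken from the actual project.

```python
# Sketch of a PySpark ETL job: read JSON, cleanse/aggregate, write partitioned Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("usage-etl")
         .enableHiveSupport()
         .getOrCreate())

# Extract: semi-structured JSON landed in object storage (placeholder path)
raw = spark.read.json("s3a://example-bucket/landing/usage/*.json")

# Transform: basic cleansing and daily aggregation of usage events
daily_usage = (raw
               .withColumn("event_date", F.to_date("event_ts"))
               .filter(F.col("customer_id").isNotNull())
               .groupBy("event_date", "customer_id")
               .agg(F.count("*").alias("events"),
                    F.sum("duration_sec").alias("total_duration_sec")))

# Load: write as a Hive table partitioned by date, stored as Parquet
(daily_usage.write
 .mode("overwrite")
 .format("parquet")
 .partitionBy("event_date")
 .saveAsTable("analytics.daily_usage"))
```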

Environment: Teradata, SQL Server, Hadoop, ETL operations, Data Warehousing, Data Modeling, Cassandra, AWS cloud computing architecture, EC2, S3, advanced SQL, NiFi, Python, Linux, Apache Spark, Scala, Spark SQL, HBase
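
The CI/CD and orchestration bullet above mentions Airflow; as a rough, hypothetical illustration of how daily ingest and Spark transform steps like those listed could be scheduled, here is a minimal DAG sketch. The DAG id, JDBC connection string, paths, and commands are assumptions, not details of the actual pipeline.

```python
# Generic Airflow 2.x DAG chaining a placeholder Sqoop ingest and a Spark transform.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_usage_etl",          # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder ingest step; credentials and connection details omitted
    ingest = BashOperator(
        task_id="sqoop_import",
        bash_command=(
            "sqoop import --connect 'jdbc:sqlserver://db-host:1433;databaseName=sales' "
            "--table ORDERS --target-dir /landing/orders"
        ),
    )
    # Placeholder Spark job submission
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/usage_etl.py",
    )

    ingest >> transform
```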

Confidential, San Francisco, CA

Data Engineer

Responsibilities:

  • Gathered, analyzed, and translated business requirements to technical requirements, communicated with other departments to collect client business requirements and access available data
  • Acquired, cleaned, and structured data from multiple sources and maintained databases/data systems; identified, analyzed, and interpreted trends and patterns in complex data sets
  • Developed, prototyped, and tested predictive algorithms; filtered and cleaned data, and reviewed reports and performance indicators
  • Developed and implemented data collection systems and other strategies that optimized statistical efficiency and data quality
  • Created and statistically analyzed large data sets of internal and external data
  • Worked closely with the marketing team to deliver actionable insights from huge volumes of data coming from different marketing campaigns and customer interaction metrics such as web portal usage, email campaign responses, public site interaction, and other customer-specific parameters
  • Performed incremental loads as well as full loads to transfer data from OLTP systems to a Snowflake-schema Data Warehouse using different data flow and control flow tasks, and provided maintenance for existing jobs
  • Designed and implemented secure data pipelines into a Snowflake data warehouse from on-premises and cloud data sources
  • Used Kafka as a message broker to collect large volumes of data and analyze the collected data in the distributed system (see the producer sketch after this list)
  • Designed an ETL process using the Talend tool to load data from source systems to Snowflake through data transformations
  • Knowledgeable in partitioning Kafka messages and setting up replication factors in a Kafka cluster
  • Developed Snowpipes for continuous ingestion of data using event notifications from AWS (S3 bucket)
  • Designed and developed an end-to-end ETL process from various source systems to the staging area and from staging to Data Marts, including data loads
  • Loaded data into Snowflake tables from the internal stage using SnowSQL
  • Prepared data warehouse using Star/Snowflake schema concepts in Snowflake using SnowSQL
  • Responsible for data cleaning, feature scaling, and feature engineering using NumPy and Pandas in Python
  • Conducted Exploratory Data Analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlation between features
  • Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from source systems
  • Used information value, principal component analysis, and Chi-square feature selection techniques
  • Used Python and R scripting to implement machine learning algorithms for prediction and forecasting
  • Experienced in developing packages in RStudio with a Shiny interface
  • Improved model efficiency and accuracy by evaluating and refining models with Python and R scripts
  • Experimented with multiple classification algorithms, such as Random Forest and Gradient Boosting, using Python Scikit-learn, and evaluated performance on customer discount optimization for millions of customers (see the evaluation sketch after the Environment line below)
  • Built models using Python and PySpark to predict the probability of attendance for various campaigns and events
  • Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to communicate findings to the team and stakeholders
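
A minimal kafka-python sketch of the message-broker usage described above: a producer publishing JSON-encoded customer interaction events to a topic. The broker address, topic name, and event fields are assumptions for illustration only.

```python
# Sketch of a producer publishing marketing/interaction events to Kafka.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],            # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"customer_id": 123, "action": "email_open", "campaign": "spring"}
producer.send("marketing-events", value=event)        # topic name is assumed
producer.flush()
producer.close()
```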

Environment: Snowflake, SnowSQL, AWS S3, Hadoop, Hive, HBase, Spark, R/RStudio, Python (Pandas, NumPy, Scikit-learn, SciPy, Seaborn, Matplotlib), SQL, PowerShell, Machine Learning, Kafka.
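
To illustrate the classifier experiments mentioned above (Random Forest versus Gradient Boosting with Scikit-learn), here is a minimal cross-validation comparison. The synthetic data is a placeholder standing in for the confidential customer features.

```python
# Compare Random Forest and Gradient Boosting classifiers with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for engineered customer features
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```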
