
Sr. Data Engineer Resume

Santa Clara, CA

SUMMARY

  • Data Engineer with over 8 years of experience in data warehousing, data engineering, feature engineering, big data, ETL/ELT, and business intelligence. As a big data architect and engineer, specializes in AWS and Azure frameworks, Cloudera, the Hadoop ecosystem, Spark/PySpark/Scala, Databricks, Hive, Redshift, Snowflake, relational databases, tools such as Tableau, Airflow, DBT, and Presto/Athena, and data DevOps frameworks/pipelines, with strong programming/scripting skills in Python.
  • More than 8 years of experience in data engineering, including deep expertise in statistical data analysis: transforming business requirements into analytical models, designing algorithms, and building strategic solutions that scale across massive volumes of data.
  • Expert-level experience across database platforms: Oracle, MongoDB, Redshift, SQL Server, PostgreSQL.
  • Strong experience in full System Development Life Cycle (Analysis, Design, Development, Testing, Deployment and Support) in waterfall and Agile methodologies.
  • Effective team player with strong communication and interpersonal skills and a strong ability to adapt and learn new technologies and new business lines rapidly.
  • Experienced in writing complex SQL queries involving stored procedures, triggers, joins, and subqueries.
  • Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.
  • Experience implementing machine learning algorithms (regression models, decision trees, Naive Bayes, neural networks, random forest, gradient boosting, SVM, KNN, clustering).
  • Good knowledge of the architecture and components of Spark; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark/Scala applications for interactive analysis, batch processing, and stream processing.
  • Experience streaming data from cloud (AWS, Azure) and on-premises sources using Spark and Flume.
  • Experience in Amazon AWS, Google Cloud Platform and Microsoft Azure cloud services.
  • Azure cloud experience using Azure Data Lake, Azure Data Factory, Azure Machine Learning, and Azure Databricks.
  • AWS cloud experience using EC2, S3, EMR, RDS, Redshift, SageMaker, and Glue.
  • Experience building machine learning solutions using PySpark for large datasets on the Hadoop ecosystem.
  • Adept in statistical programming languages such as Python and R, as well as big data technologies such as Hadoop, HDFS, Spark, and Hive.
  • Good knowledge of tools such as Snowflake, SSIS, SSAS, and SSRS for designing data warehousing applications.
  • Experience in data mining, including predictive behavior analysis, Optimization and Customer Segmentation analysis using SAS and SQL.
  • Experience in Applied Statistics, Exploratory Data Analysis and Visualization using matplotlib, Tableau, Power BI, Google Analytics.
  • Extensive experience with version control systems such as Git, SVN, and Bitbucket.

TECHNICAL SKILLS

Hadoop Distributions: Cloudera, AWS EMR, and Azure HDInsight.

Languages: Scala, Python, SQL, HiveQL, KSQL.

IDE Tools: Eclipse, IntelliJ, PyCharm.

Cloud platform: AWS, Azure

AWS Services: VPC, IAM, S3, Elastic Beanstalk, CloudFront, Redshift, Lambda, Kinesis, DynamoDB, Direct Connect, Storage Gateway, EKS, DMS, SMS, SNS, and SWF

Reporting and ETL Tools: Tableau, Power BI, Talend, AWS GLUE.

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)

Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, DataBricks, Kafka, Cloudera

Machine Learning and Statistics: Regression, Random Forest, Clustering, Time-Series Forecasting, Hypothesis Testing, Exploratory Data Analysis

Containerization: Docker, Kubernetes

CI/CD Tools: Jenkins, Bamboo, GitLab CI, uDeploy, Travis CI, Octopus

Operating Systems: UNIX, LINUX, Ubuntu, CentOS.

Other Software: Control-M, Eclipse, PyCharm, Jupyter, Apache, Jira, PuTTY, Advanced Excel

Frameworks: Django, Flask, WebApp2

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, Santa Clara CA

Responsibilities:

  • Worked on creating data pipelines for gathering, cleaning, and optimizing data using Hive and Spark.
  • Responsible for building scalable distributed data solutions using Amazon EMR clusters.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted it into HDFS (a PySpark sketch appears after this list).
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
  • Imported and exported data between relational data sources such as SQL Server and HDFS using Sqoop.
  • Developed Spark SQL scripts using Python for faster data processing.
  • Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.
  • Worked with huge datasets stored in AWS S3 buckets and used Spark DataFrames to perform preprocessing in Glue.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Updated Python scripts to match training data with our database stored in AWS CloudSearch so that each document could be assigned a response label for further classification.
  • Developed spark workflows using Scala to pull the data from AWS and apply transformations to the data.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
  • Worked on creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
  • Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Extracted data from HDFS using Hive, performed data analysis and feature selection using Spark (Scala and PySpark) and Redshift, and created nonparametric models in Spark.
  • Worked on RDS databases such as MySQL Server and NoSQL databases such as MongoDB and HBase.
  • Designed and implemented end-to-end systems for Data Analytics and Automation, integrating custom visualization tools using R, Tableau.
  • Developed Tableau visualizations and dashboards using Tableau Desktop.
  • Handled day-to-day issues and fine-tuned applications for enhanced performance.
  • Collaborated with team members and stakeholders in the design and development of the data environment.
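
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow referenced above. This is illustrative only, not the original job: the broker address, topic name, event schema, and HDFS paths are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Hypothetical schema for the incoming Kafka messages.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the raw stream from Kafka (placeholder broker and topic).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload and apply a simple cleansing transformation.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .filter(col("event_type").isNotNull())
)

# Persist the transformed stream to HDFS as Parquet (placeholder paths).
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/landing/clickstream")
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
    .outputMode("append")
    .start()
)
query.awaitTermination()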

Environment: Python, R, SQL, Hive, Spark, AWS, Hadoop, NoSQL, Cassandra, SQL Server, HDFS, PySpark, Tableau, MongoDB, PostgreSQL, Redshift, HBase, Sqoop, Airflow, Oozie.

Sr. Data Engineer

Confidential, New York City NY

Responsibilities:

  • Developed cloud architecture and strategy for hosting complicated app workloads on Microsoft Azure.
  • Designed and implemented data pipelines on the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Created a customized ETL solution with batch processing and a real-time data ingestion pipeline using PySpark and shell scripting to move data into and out of the Hadoop cluster.
  • Used Azure Data Factory to combine on-premises data (MySQL, Cassandra) with cloud data (Blob Storage, Azure SQL DB) and implemented transformations to load the results back into Azure Synapse.
  • Built and published Docker container images using Azure Container Registry and deployed them into Azure Kubernetes Service (AKS).
  • Imported metadata into Hive and migrated existing tables and applications to Hive and Azure.
  • Created complex data conversions and manipulations by combining ADF and Scala.
  • To meet business functional needs, configured Azure Data Factory (ADF) to ingest data from various sources such as relational and non-relational databases.
  • Built DAGs in Apache Airflow to schedule ETL processes and used additional Airflow features such as pools, executors, and multi-node capability to optimize workflows (an Airflow DAG sketch appears after this list).
  • Improved Airflow performance by identifying and applying optimal configurations.
  • Configured Spark Streaming to receive real-time data from Apache Flume and store it in an Azure Table using Scala, with Data Lake used to store and process all forms of data, and created Spark DataFrames for processing.
  • Applied various aggregations provided by the Spark framework in the transformation layer using Spark RDDs, the DataFrame API, and Spark SQL.
  • Provided real-time insights and reports by mining data with Spark/Scala functions; optimized existing Scala code and improved cluster performance.
  • Utilized Spark Context, SparkSQL, and Spark Streaming to process large datasets.
  • Continuous monitoring of the Spark cluster using Log Analytics and the Ambari Web UI improved the cluster's stability.
  • Transitioned log storage from Cassandra to Azure SQL Data Warehouse, which improved query performance.
  • Created custom input adapters using Spark, Hive, and Sqoop to ingest data for analytics from multiple sources (Snowflake, MS SQL, MongoDB) into HDFS.
  • Used Sqoop, Flume, and the Spark Streaming API to import data from web servers and Teradata.
  • Developed Spark tasks in a test environment using Scala and used Spark SQL for querying to ensure faster data processing.
  • Implemented indexing for data ingestion using a Flume sink to write directly to indexers on a cluster.
  • Delivered data for analytics and business intelligence purposes by leveraging Azure Synapse to manage workloads.
  • Improved security by using Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD and Active Directory and Apache Ranger for authentication.
  • Used Azure Kubernetes Service to manage resources and scheduling across the cluster.
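
A hedged sketch of the Airflow scheduling pattern described above, assuming Airflow 2.x. The DAG id, schedule, pool name, and task callables are placeholders standing in for the real ETL steps.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder: pull data from the source system.
    print("extracting source data")


def transform(**context):
    # Placeholder: apply business transformations.
    print("transforming data")


def load(**context):
    # Placeholder: write results to the warehouse.
    print("loading data into the warehouse")


default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # The extract task runs in a dedicated pool to cap concurrent source reads.
    t_extract = PythonOperator(task_id="extract", python_callable=extract, pool="etl_pool")
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load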

Environment: Spark, Hadoop, AWS (Lambda, Glue, EMR), NoSQL, Python, HDFS, Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), CloudWatch triggers (SQS, EventBridge, SNS), REST, ETL, DynamoDB, Redshift, JSON, Tableau, JIRA, Jenkins, Git, Maven, Bash

Data Engineer

Confidential, Columbia SC

Responsibilities:

  • Prepared ETL design documents covering the database structure, change data capture, error handling, and restart and refresh strategies.
  • Worked on batch processing of data sources using Apache Spark and Elasticsearch.
  • Enabled other teams to work with more complex scenarios and machine learning solutions.
  • Used Spark DataFrames to create various datasets and applied business transformations and data cleansing operations in Databricks notebooks (a PySpark sketch appears after this list).
  • Tuned Spark applications (batch interval time, level of parallelism, memory settings) to improve processing time and efficiency.
  • Used PySpark for its concurrency support and its key role in parallelizing processing of large datasets.
  • Developed MapReduce-style jobs using PySpark for data processing on the JVM-based Spark engine.
  • Ingested data into Azure Blob Storage and processed it using Databricks; involved in writing Spark Scala scripts and user-defined functions to perform transformations on large datasets.
  • Reused existing notebooks when migrating data from existing applications to Azure DW using Databricks.
  • Designed and implemented end-to-end data solutions (storage, integration, processing, visualization) in Azure.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory and Spark SQL.
  • Ingested data into Azure Data Lake and processed it in Azure Databricks.
  • Extracted data from Azure Data Lake into an HDInsight cluster, applied Spark transformations, and loaded the results into HDFS.
  • Built ETL workflows on the Azure platform using Azure Databricks and Data Factory.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Extracted data from HDFS using Hive and Presto, performed data analysis and feature selection using Spark (Scala and PySpark) and Redshift, and created nonparametric models in Spark.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
  • Worked with Git for managing source code in repositories, including branching and merging.
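
An illustrative PySpark sketch of the Databricks notebook cleansing logic mentioned above. The mount paths, column names, and cleansing rules are assumptions made for illustration, not the original notebooks.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, trim

spark = SparkSession.builder.appName("cleanse-orders").getOrCreate()

# Read the raw dataset (placeholder Databricks mount path).
orders = spark.read.parquet("/mnt/raw/orders")

# Deduplicate, standardize types, and drop invalid records.
cleansed = (
    orders.dropDuplicates(["order_id"])
    .withColumn("customer_name", trim(col("customer_name")))
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
    .filter(col("order_amount") > 0)
)

# Write the curated output (placeholder path).
cleansed.write.mode("overwrite").parquet("/mnt/curated/orders")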

Environment: Azure, Python, SQL, HDFS, Hive, MapReduce, Elasticsearch, SQL Server, Spark, PySpark, Scala, Databricks, JVM, Git.

Data Engineer/Python Developer

Confidential

Responsibilities:

  • Involved in Analysis, Design and Implementation/translation of Business User requirements.
  • Worked on collecting large datasets using Python scripting.
  • Worked with the Python OpenStack API and used Python scripts to update content in the database and manipulate files.
  • Developed subject segmentation algorithm using R.
  • Worked on large sets of Structured and Unstructured data.
  • Involved in data cleansing mechanisms to eliminate duplicate and inaccurate data.
  • Performed data analysis and data profiling using complex SQL queries on various source systems on SQL Server.
  • Experience in creating Hive tables with partitioning and bucketing (a Spark SQL sketch appears after this list).
  • Performed maintenance tasks, including managing space, removing bad files, removing cache files, and monitoring services.
  • Improved workflow performance by pushing filters as close as possible to the source and selecting tables with fewer rows as the master during joins.
  • Designed and deployed rich graphical visualizations using Tableau and converted existing BusinessObjects reports into Tableau dashboards.
  • Created and executed SQL queries to perform Data Integrity testing on a Teradata Database to validate and test data using TOAD.
  • Involved in unit testing and in resolving the various bottlenecks encountered.
  • Analyzed and processed complex data sets using advanced querying, visualization, and analytics tools.
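
A small Spark SQL sketch of the Hive partitioning and bucketing pattern noted above; the database, table, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

# Partition by date for partition pruning; bucket by user_id to co-locate a user's rows.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.web_events (
        event_id   STRING,
        user_id    STRING,
        event_type STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC
""")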

Environment: Python, R, Machine Learning, SQL, SQL server, Tableau, Hive, Teradata, Unit, AWS.

Data Analyst

Confidential

Responsibilities:

  • Assess and document data requirements and client-specific requirements to develop user-friendly BI solutions - reports, dashboards, and decision aids.
  • Design, develop, and maintain T-SQL - stored procedures, joins, complex sub-queries for ad-hoc data retrieval and management.
  • Develop logical and physical data flow models for ETL applications.
  • Build, test, and maintain automated ETL processes to ensure data accuracy and integrity.
  • Adapt and optimize ETL processes to accommodate changes in source systems and new business user requirements.
  • Build, test, and manage BI standard reporting templates and dashboards for internal and external use.
  • Build packages recording and updating historical attributes of employees using slowly changing dimensions (a sketch appears after this list).
  • Configure percentage sampling transformations to identify sample respondents for survey research.
  • Apply lookup and cache transformations, using reference tables to populate missing columns in the data warehouse.
  • Troubleshoot data integration issues and debug reasons for ETL failure.
  • Undertake regular data mapping, parsing, and ETL scanning.
  • Create custom reports for requested projects and modify existing queries and reports in Power BI Desktop and SSRS.
  • Manage snapshots and caching for SSRS performance.
  • Implemented security policies for SSRS reports.
  • Document the data environment and solutions to support end users and analysts.
  • Worked with visualization tools like Tableau.
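
A hedged pandas sketch of the Type 2 slowly-changing-dimension logic described above, shown outside of SSIS: expire changed rows and insert new current versions. The column names, the change-detection rule, and the use of pandas are illustrative assumptions, not the original package.

import pandas as pd

def apply_scd2(dim: pd.DataFrame, updates: pd.DataFrame, today: str) -> pd.DataFrame:
    """Expire changed employee rows and append their new current versions."""
    current = dim[dim["is_current"]]

    # Detect employees whose tracked attribute changed (illustrative rule).
    merged = current.merge(updates, on="employee_id", suffixes=("_old", "_new"))
    changed = merged.loc[merged["department_old"] != merged["department_new"], "employee_id"]

    # Expire the previous current version of each changed employee.
    expire = dim["employee_id"].isin(changed) & dim["is_current"]
    dim.loc[expire, "is_current"] = False
    dim.loc[expire, "end_date"] = today

    # Insert the new current version from the update feed.
    new_rows = updates[updates["employee_id"].isin(changed)].copy()
    new_rows["start_date"] = today
    new_rows["end_date"] = None
    new_rows["is_current"] = True
    return pd.concat([dim, new_rows], ignore_index=True)

For example, calling apply_scd2(dim, daily_feed, today="2024-01-31") would expire and re-insert any employee whose department changed in that day's feed.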

Environment: Python, SSIS, SSAS, SSRS, SQL Server, Hadoop, Hive, Tableau, JUnit.
