
Senior Data Engineer Resume


TX

SUMMARY

  • 8+ years of IT experience working on Big Data technologies across a range of industries using Cloudera and Hortonworks distributions; the Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Apache Ambari, Sqoop, HBase, and Impala.
  • Programming experience in Scala, Java, Python, SQL, T-SQL, and R.
  • Hands-on expertise with major Hadoop ecosystem components such as MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, and Kafka in creating and implementing enterprise-based applications.
  • Configured and installed Hadoop/Spark Ecosystem Components with ease.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data with in-memory computing in Scala; used SparkContext, Spark SQL, and Spark MLlib to improve the performance of existing algorithms (a minimal PySpark sketch follows this list).
  • Experience in integrating multiple data sources into a data warehouse, including Oracle SE2, SQL Server, Flat Files, and Unstructured Files.
  • Ability to migrate data between RDBMS, NoSQL databases, and HDFS using Sqoop.
  • Extraction, Transformation, and Loading (ETL) of data into data warehouses, as well as data processing such as collecting, aggregating, and moving data from numerous sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
  • Strong knowledge of Hadoop architecture and its components, including the Hadoop Distributed File System (HDFS), Job Tracker, Task Tracker, NameNode, DataNode, and Hadoop MapReduce programming.
  • Extensive experience using Scala and Java to create simple to complex MapReduce and Streaming jobs for data cleansing, filtering, and aggregation; thorough understanding of the MapReduce framework.
  • Experience using CI/CD techniques and DevOps processes for Git repository code promotion.
  • Used Eclipse, IntelliJ, PyCharm, Notepad++, and Visual Studio for development.
  • Hands-on with Machine Learning and Predictive Modeling techniques such as Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, Random Forests, KNN, Neural Networks, and K-means Clustering.
  • Demonstrated expertise with ETL tools such as Talend Data Integration and SQL Server Integration Services (SSIS); developed slowly changing dimension (SCD) mappings using Type-I, Type-II, and Type-III methods.
  • Data architectural expertise, including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning, and advanced data processing.
  • Working experience with NoSQL databases such as Cassandra and HBase, including developing real-time read/write access to very large datasets using HBase.
  • Developed Spark applications that handle data from a variety of RDBMS (MySQL, Oracle Database) and streaming sources.
  • Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy.
  • Experience in SQL querying, data extraction/transformations, and query development skills for a variety of applications.
  • Hands-on experience with integration processes for the Enterprise Data Warehouse (EDW) and extensive knowledge of performance tuning techniques on sources and targets using Talend mappings and jobs.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family.
  • Data analysis experience in Java 8 with HiveQL, Pig, HBase, and custom MapReduce programs.
  • Working knowledge of the GitHub/Git 2.12 source control and version control systems.
  • Strong understanding of key Java concepts such as Object-Oriented Design (OOD) and Java components such as the Collections Framework, Exception Handling, and the I/O System.
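
As a minimal illustration of the Spark SQL work summarized above (see the Spark Core/Spark SQL bullet), the sketch below caches a flat file in memory and aggregates it with Spark SQL from PySpark. The file path, column names, and application name are hypothetical placeholders rather than details from an actual engagement.

# Minimal PySpark sketch of the in-memory Spark SQL pattern described above.
# The S3 path and the region/amount columns are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

# Read a flat file into a DataFrame and cache it so repeated queries stay in memory.
sales = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv("s3://example-bucket/raw/sales.csv")
         .cache())

# Register a temporary view and aggregate with Spark SQL.
sales.createOrReplaceTempView("sales")
totals = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""")
totals.show()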

TECHNICAL SKILLS

Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

BI Tools: SSIS, SSRS, SSAS

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, and UNIX shell scripting.

ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.

Cloud Platform: AWS, Azure, Google Cloud.

Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena

Databases: Oracle 12c/11g, Teradata R15/R14.

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Senior Data Engineer

Confidential, TX

Responsibilities:

  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed a PySpark script to obfuscate raw data by applying hashing algorithms to client-specified columns (a minimal sketch follows this list).
  • Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Used T-SQL to create tables and stored procedures and to extract data for business users as needed.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
  • Architected and implemented medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL DB).
  • Expertise in implementing a DevOps culture through CI/CD tools such as Repos, CodeDeploy, CodePipeline, and GitHub.
  • Created Talend jobs to copy the files from one server to another and utilized Talend FTP components.
  • Migrated on-premises data (SQL Database, DB2, MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
  • Created a continuous integration and continuous delivery (CI/CD) pipeline on AWS to help automate steps in the software delivery process.
  • Worked on DirectQuery in Power BI to compare legacy data with current data and generated reports and dashboards.
  • Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
  • Designed SSIS Packages to extract, transfer, load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP)
  • Built SQL Server Reporting Services (SSRS) reports; created and formatted Cross-Tab, Conditional, Drill-down, Top N, Summary, Form, OLAP, Sub-reports, ad-hoc reports, parameterized reports, interactive reports, and custom reports.
  • Performed data analysis and design using Erwin and MB MDR, and generated and maintained large, complex logical and physical data models and metadata repositories.
  • Responsible for Design, Development, and testing of the database and Developed Stored Procedures, Views, and Triggers
  • Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and loading it into the data warehouse servers.
  • Assisted service developers in locating relevant material in existing models.
  • Worked on tuning SQL queries to bring down run time by working on indexes and execution plans.
  • Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake.
  • Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
  • Wrote shell scripts to initiate DataStage jobs.
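
A minimal sketch of the column-hashing approach mentioned in the PySpark bullet above, assuming SHA-256 via pyspark.sql.functions.sha2. The storage paths and the sensitive_cols list are hypothetical placeholders chosen for illustration.

# Hedged sketch: hash client-specified columns with SHA-256 before publishing the data.
# The ADLS paths and the sensitive_cols list are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-hashing").getOrCreate()

raw = spark.read.parquet("abfss://raw@examplestorage.dfs.core.windows.net/customers")

sensitive_cols = ["ssn", "email", "phone_number"]  # columns nominated by the client

hashed = raw
for col_name in sensitive_cols:
    # sha2 returns a hex digest; the second argument (256) selects SHA-256.
    hashed = hashed.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

hashed.write.mode("overwrite").parquet(
    "abfss://curated@examplestorage.dfs.core.windows.net/customers_masked")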

Environment: MS SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Azure Data Lake, Azure Data Factory, SQL Azure, Management Studio (SSMS), Advanced Excel, Spark, Python, ETL, Power BI, Snowflake, Tableau, Hive/Hadoop, IBM Cognos, DataStage, and QualityStage 7.5

AWS Data Engineer

Confidential, Foster City, CA

Responsibilities:

  • Designed and built infrastructure for the Google Cloud environment from scratch.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
  • Configured data loads from AWS S3 to Redshift using the AWS Data Pipeline.
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
  • Implemented error handling in Talend to validate data integrity and completeness for data from flat files.
  • Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines (a minimal DAG sketch follows this list).
  • Loaded application analytics data into the data warehouse at regular intervals.
  • Experienced in ETL concepts, building ETL solutions, and data modeling.
  • Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
  • Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters.
  • Worked with CI/CD tools such as Jenkins and version control tools Git and Bitbucket.
  • Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management.
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regression) and Statistical Modeling.
  • Leveraged cloud and GPU computing technologies, such as AWS, for automated machine learning and analytics pipelines.
  • Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
  • Implemented a Continuous Delivery pipeline with Docker, GitHub, and AWS.
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Experienced in fact and dimensional modeling (star schema, snowflake schema), transactional modeling, and SCD (slowly changing dimensions).
  • Measured the efficiency of the Hadoop/Hive environment to ensure SLAs were met; optimized the TensorFlow model for efficiency.
  • Compiled data from various sources to perform complex analysis for actionable results
  • Worked on confluence and Jira.
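
A minimal sketch of one of the ETL DAGs referenced above, written against the Airflow 1.10 API listed in the skills section. The DAG id, task names, and schedule are hypothetical, and the extract/load callables are left as placeholders.

# Hedged sketch of a daily ETL DAG (Airflow 1.10-style imports).
# DAG id, task names, and schedule are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_from_s3(**context):
    # Placeholder: pull the day's raw files from S3 (e.g. with boto3).
    pass


def load_to_redshift(**context):
    # Placeholder: COPY the transformed files into a Redshift staging table.
    pass


default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_sales_etl",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_s3", python_callable=extract_from_s3)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    extract >> load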

Environment: AWS, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, AWS Data Pipeline, AWS S3, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL

Data Engineer

Confidential, Boise, ID

Responsibilities:

  • Integrated AWS Kinesis with an on-premises Kafka cluster.
  • Implemented data ingestion and handling clusters in real time processing using Kafka.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Involved in migrating the existing Teradata data warehouse to AWS S3-based data lakes.
  • Involved in migrating existing traditional ETL jobs to Spark and Hive Jobs on new cloud data lake.
  • Wrote complex Spark applications for performing various de-normalization of the datasets and creating a unified data analytics layer for downstream teams.
  • As part of Data Lake team, involved in ingesting 207 source systems which include databases (DB2, MySQL, Oracle), flat files, mainframe files, XML files into the Data Lake Hadoop environment which are later explored by reporting tools.
  • Primarily responsible for fine-tuning long running Spark applications, writing custom Spark UDFs, troubleshooting failures, etc.
  • Involved in building a real-time pipeline using Kafka and Spark Streaming for delivering event messages to the downstream application team from an external REST-based application (a minimal streaming sketch follows this list).
  • Created Databricks notebooks using SQL and Python and automated notebooks using jobs.
  • Created Spark clusters and configured high-concurrency clusters using Databricks to speed up the preparation of high-quality data.
  • Designed, developed, and tested dimensional data models using star and snowflake schema methodologies under the Kimball method.
  • Used broadcast variables in Spark, effective and efficient joins, caching, and other capabilities for data processing (a broadcast-join sketch also follows this list).
  • Involved in continuous integration of application using Jenkins.
  • Wrote AWS Lambda code for nested JSON files (converting, sorting, etc.).
  • Employed Amazon Kinesis to stream, analyze and process real-time logs from Apache application server and Amazon Kinesis Firehose to store the processed log files in Amazon S3 bucket.
  • Used Lambda functions to automatically trigger jobs.
  • Involved in creating Hive scripts for performing ad-hoc data analysis required by the business teams.
  • Worked on utilizing AWS cloud services like S3, EMR, Redshift, Athena, and Glue Metastore.
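
A minimal sketch of the Kafka-to-Spark real-time pipeline referenced above, using Spark Structured Streaming. Broker addresses, the topic name, the event schema, and the output paths are hypothetical, and the spark-sql-kafka connector is assumed to be available on the cluster.

# Hedged sketch: consume event messages from Kafka and land them in the data lake.
# Broker list, topic, schema, and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-events").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "app-events")
          .option("startingOffsets", "latest")
          .load()
          # Kafka delivers bytes; decode the value and parse the JSON payload.
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "s3://example-datalake/events/")
         .option("checkpointLocation", "s3://example-datalake/checkpoints/events/")
         .outputMode("append")
         .start())
query.awaitTermination()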
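
A minimal sketch of the broadcast-join pattern referenced above. The dataset paths and the join key are hypothetical placeholders.

# Hedged sketch: broadcast a small dimension table so the join avoids shuffling
# the large fact table. Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

orders = spark.read.parquet("s3://example-datalake/curated/orders/")      # large fact table
stores = spark.read.parquet("s3://example-datalake/curated/dim_stores/")  # small dimension

# broadcast() hints Spark to ship the small table to every executor.
enriched = orders.join(broadcast(stores), on="store_id", how="left")
enriched.cache()  # keep the joined result in memory for downstream aggregations
enriched.write.mode("overwrite").parquet("s3://example-datalake/analytics/orders_enriched/")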

Environment: AWS EMR, Spark, Hive SQL, HDFS, Sqoop, Kafka, Impala, Oozie, HBase, PySpark, Scala, Databricks, Flume, NiFi, Snowflake

Spark Developer

Confidential

Responsibilities:

  • Imported required modules like Keras and NumPy on Spark session, created directories for data and output.
  • Read train and test data into the data directory as well as into Spark variables for easy access.
  • Created a validation set using Keras2DML in order to test whether the trained model was working as intended or not.
  • Defined multiple helper functions that are used while running the neural network in session.
  • Defined placeholders and the number of neurons in each layer, and created the neural network's computational graph after defining weights and biases.
  • Created a TensorFlow session used to run the neural network and validate the accuracy of the model on the validation set (a simplified Keras sketch follows this list).
  • Images are represented as NumPy arrays, and all images are stored as NumPy arrays for easier data manipulation.
  • Executed multiple Spark SQL queries after forming the database to gather specific data corresponding to an image.
  • After executing the program and achieving acceptable validation accuracy, created a submission that is stored in the submission directory.
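
A simplified stand-in for the model-definition and validation steps described above. The original workflow used TensorFlow sessions and Keras2DML on Spark; this sketch uses plain Keras with randomly generated placeholder data, and the input shape, layer sizes, and hyperparameters are illustrative only.

# Hedged sketch: define a small dense network and validate it on a held-out split.
# The data here is random placeholder data; shapes and hyperparameters are assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Images are handled as NumPy arrays; train_x stands in for flattened 28x28 images.
train_x = np.random.rand(1000, 784).astype("float32")
train_y = keras.utils.to_categorical(np.random.randint(0, 10, size=1000), 10)

model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Hold out a validation split to check that the trained model works as intended.
model.fit(train_x, train_y, epochs=5, batch_size=32, validation_split=0.2)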

Environment: Scala, Python, PySpark, Spark, Spark MLlib, Spark SQL, TensorFlow, NumPy, Keras, Power BI.
