
Sr Data Engineer Resume


Dallas, Texas

PROFESSIONAL SUMMARY:

  • 8+ years of IT experience in a variety of industries working on Big Data using the Cloudera and Hortonworks distributions. The Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
  • Fluent programming experience with Scala, Java, Python, SQL, T-SQL, and R.
  • Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, and Kafka.
  • Adept at configuring and installing Hadoop/Spark Ecosystem Components.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala.
  • Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN (a minimal Spark SQL sketch follows this list).
  • Experience integrating various data sources such as Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
  • Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
  • Experience in extracting, transforming, and loading (ETL) data from various sources into data warehouses, as well as data processing such as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
  • Worked on ETL Migration services by developing and deploying AWS Lambda functions for generating a serverless data pipeline which can be written to Glue Catalog and can be queried from Athena.
  • Extensive experience in IT data analytics projects and hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
  • Experience in implementing Azure data solutions, provisioning storage accounts, Azure Data Factory, SQL Server, SQL Databases, SQL Data Warehouse, Azure Databricks, and Azure Cosmos DB.
  • Hands-on experience with Hadoop architecture and its components, such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode, DataNode, and Hadoop MapReduce programming.
  • Comprehensive experience in developing simple to complex MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering, and data aggregation. Also possess detailed knowledge of the MapReduce framework.
  • Used IDEs like Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
  • Seasoned in Machine Learning algorithms and Predictive Modeling, including Linear Regression, Logistic Regression, Naïve Bayes, Decision Trees, Random Forests, KNN, Neural Networks, and K-means Clustering.
  • Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
  • Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.
  • Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
  • Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.
  • Capable of processing large sets (Gigabytes) of structured, semi-structured or unstructured data.
  • Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
  • Experience working with GitHub/Git 2.12 source and version control systems.
  • Strong in core Java concepts including Object-Oriented Design (OOD) and Java components like Collections Framework, Exception handling, I/O system.
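
A minimal PySpark sketch of the Spark SQL and DataFrame work described above; the Hive table and column names (raw.orders, order_date, amount, customer_id) are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; Hive support is assumed for catalog tables.
spark = (SparkSession.builder
         .appName("orders-aggregation")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical source: a Hive table of raw orders.
orders = spark.table("raw.orders")

# Cleanse and aggregate with the DataFrame API, then register a view for Spark SQL.
daily_totals = (
    orders
    .filter(F.col("amount") > 0)
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"),
         F.countDistinct("customer_id").alias("customers"))
)
daily_totals.createOrReplaceTempView("daily_totals")

# Equivalent Spark SQL query over the registered view.
spark.sql("SELECT * FROM daily_totals ORDER BY order_date").show()
```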

TECHNICAL SKILLS:

Big Data Technologies: MapReduce, Spark, PySpark, Cloudera, Azure HDInsight, Pig, Hive, Impala, Hue, HBase, Flume, Oozie, Ambari Server, Kafka, Databricks, Airflow

Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML

Languages: WSDL, CSS3, C, C++, XML, R/RStudio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), Shell Scripting

Cloud: GCP, GCS, BigQuery, Dataproc, Data Studio, Data Fusion, Vertex AI, Databricks, Azure PowerShell, HDInsight, Azure CLI, ADLS, Blob Storage, AWS, S3, EMR, Glue, Athena

Databases: Oracle, MySQL, SQL Server, Cassandra, Teradata

IDE and Notebooks: Eclipse, IntelliJ, PyCharm, Jupyter, Databricks notebooks

Data formats: JSON, Parquet, Avro, XML, and CSV

Search and BI tools: Power BI, Data Studio, Tableau

Web Technologies: JDBC, JSP, Servlets, Struts (Tomcat, JBoss)

PROFESSIONAL EXPERIENCE:

Sr Data Engineer

Confidential, Dallas, Texas

Responsibilities:

  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift (a minimal S3-to-Redshift COPY sketch follows this list).
  • Using AWS Redshift, extracted, transformed, and loaded data from various heterogeneous data sources and destinations.
  • Worked with the AWS cloud and created EMR clusters with Spark to process raw data and access data from S3 buckets.
  • Created Tables, Stored Procedures, and extracted data using T-SQL for business users whenever required.
  • Performed data analysis and design, and created and maintained large, complex logical and physical data models, and metadata repositories using ERWIN and MB MDR.
  • Assisted service developers in finding relevant content in the existing models.
  • Extracted, transformed, and loaded data from various heterogeneous data sources and destinations like Access, Excel, CSV, Oracle, flat files using connectors, tasks and transformations provided by AWS Data Pipeline.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed a PySpark script to encrypt raw data by applying hashing algorithms to client-specified columns (a column-hashing sketch follows this list).
  • Responsible for Design, Development, and testing of the database and Developed Stored Procedures, Views, and Triggers.
  • Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Created Tableau reports with complex calculations and worked on Ad-hoc reporting using Power BI.
  • Created data model that correlates all the metrics and gives a valuable output.
  • Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
  • Performed ETL testing activities such as running jobs, extracting data from databases with the necessary queries, transforming it, and uploading it into the data warehouse servers; pre-processing was performed using Hive and Pig.
  • Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Implemented Copy activity, Custom Azure Data Factory Pipeline activities.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
  • Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HD Insight / Databricks, NoSQL DB).
  • Migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
  • Developed a detailed project plan and helped manage the data conversion and migration from the legacy system to the target Snowflake database.
  • Designed, developed, and tested dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
  • Developed a data pipeline using Spark, Hive, Pig, Python, Impala, and HBase to ingest data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Ingested data into Hadoop from various data sources like Oracle, MySQL using Sqoop tool. Created Sqoop job with incremental load to populate Hive External tables.
  • Prepared deliverables (daily, weekly & monthly MIS reports) to satisfy project requirements, cost, and schedule.
  • Worked on DirectQuery using Power BI to compare legacy data with current data, and generated and stored reports and dashboards.
  • Designed SSIS Packages to extract, transfer, load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP).
  • Using SQL Server reporting services (SSRS), created & formatted Crosstab, Conditional, Drill-down, Top N, Summary, Form, OLAP, Sub reports, ad-hoc reports, parameterized reports, interactive reports & custom reports.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
  • Developed visualizations and interactive dashboards using Power BI.
  • Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
  • Performed ETL testing activities such as running jobs, extracting data from databases with the necessary queries, transforming it, and uploading it into the data warehouse servers.
  • Created dashboards for analyzing POS data using Power BI.
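
As referenced in the S3-to-Redshift bullet above, a minimal sketch of issuing a Redshift COPY from S3 via Python; psycopg2 is an assumed client library, and the cluster endpoint, table, bucket, and IAM role are hypothetical placeholders.

```python
import os

import psycopg2

# Hypothetical cluster endpoint and credentials; adjust to the actual environment.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password=os.environ["REDSHIFT_PASSWORD"],
)

# COPY pulls the Parquet files directly from S3 into the target table.
copy_sql = """
    COPY sales.orders
    FROM 's3://example-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    cur.execute(copy_sql)
conn.close()
```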
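
A minimal sketch of the PySpark column-hashing step referenced above, assuming SHA-256 one-way hashing of client-specified columns; the input/output paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-hashing").getOrCreate()

# Hypothetical raw input and client-specified sensitive columns.
raw = spark.read.parquet("s3://example-bucket/raw/customers/")
sensitive_cols = ["ssn", "email", "phone"]

# Replace each sensitive column with its SHA-256 digest (one-way masking).
masked = raw
for col_name in sensitive_cols:
    masked = masked.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

masked.write.mode("overwrite").parquet("s3://example-bucket/masked/customers/")
```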

Environment: MS SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Management Studio (SSMS), Advanced Excel (creating formulas, pivot tables, HLOOKUP, VLOOKUP, Macros), Spark, Python, ETL, Power BI, Tableau, Hive/Hadoop, Snowflake, AWS Data Pipeline, IBM Cognos 10.1, DataStage, Cognos Report Studio 10.1, Cognos 8 & 10 BI, Cognos Connection, Cognos Office Connection, Cognos 8.2/3/4, DataStage and QualityStage 7.5

Sr Data Engineer

Confidential

Responsibilities:

  • Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines.
  • Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines (a minimal DAG sketch follows this list).
  • Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management.
  • Performed Data Migration to GCP.
  • Responsible for data services and data movement infrastructures.
  • Experienced in ETL concepts, building ETL solutions and Data modeling.
  • Aggregated daily sales-team updates into reports for executives, organized jobs running on Spark clusters, and loaded application analytics data into the data warehouse at regular intervals.
  • Designed & build infrastructure for the Google Cloud environment from scratch.
  • Experienced in dimensional modeling (Star schema, Snowflake schema), transactional modeling, and SCD (Slowly Changing Dimensions).
  • Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP.
  • Worked on Confluence and Jira.
  • Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python.
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling.
  • Compiled data from various sources to perform complex analysis for actionable results.
  • Measured Efficiency of Hadoop/Hive environment ensuring service-level agreement (SLA) is met.
  • Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
  • Implemented a Continuous Delivery pipeline with Docker, GitHub, and AWS.
  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies.
  • Collaborated with team members and stakeholders in the design and development of the data environment.
  • Involved in preparing associated documentation for specifications, requirements, and testing.
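
A minimal Airflow 2.x-style sketch of a scheduled ETL DAG like the ones referenced above; the DAG id, task callables, and schedule are hypothetical placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical extract/transform/load callables, shown only to illustrate task wiring.
def extract(**context):
    print("extracting from source systems")

def transform(**context):
    print("applying transformations")

def load(**context):
    print("loading into the warehouse")

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_sales_etl",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Linear dependency chain: extract, then transform, then load.
    t_extract >> t_transform >> t_load
```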

Environment: AWS, GCP, BigQuery, GCS Bucket, G-Cloud Function, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL

Data Engineer

Confidential

Responsibilities:

  • Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
  • Strong understanding of AWS components such as EC2 and S3.
  • Implemented a Continuous Delivery pipeline with Docker and GitHub.
  • Worked with Google Cloud Functions in Python to load data into BigQuery when CSV files arrive in a GCS bucket (a Cloud Function sketch follows this list).
  • Processed and loaded bounded and unbounded data from a Google Pub/Sub topic to BigQuery using Cloud Dataflow with Python (a minimal Beam pipeline sketch follows this list).
  • Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
  • Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python.
  • Developed and deployed data pipelines in cloud platforms such as AWS and GCP.
  • Performed data engineering functions such as data extraction, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management.
  • Responsible for data services and data movement infrastructures and good experience with ETL concepts, building ETL solutions and Data modeling.
  • Hands-on experience architecting the ETL transformation layers and writing Spark jobs to do the processing.
  • Involved in gathering and processing raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications).
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Developed a near real-time data pipeline using Spark.
  • Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
  • Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling.
  • Worked on Confluence and Jira; skilled in data visualization with libraries such as Matplotlib and Seaborn.
  • Hands on experience with big data tools like Hadoop, Spark, Hive.
  • Experience implementing machine learning back-end pipeline with Pandas and NumPy.
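
A minimal sketch of a GCS-triggered Cloud Function (Python) that loads newly arrived CSV files into BigQuery, as referenced above; the project, dataset, and table names are hypothetical placeholders.

```python
from google.cloud import bigquery

# Hypothetical destination table; the function is wired to the bucket's "finalize" trigger.
BQ_TABLE = "my-project.sales_ds.daily_orders"

def load_csv_to_bq(event, context):
    """Background Cloud Function invoked when an object lands in the GCS bucket."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):
        return  # ignore non-CSV objects

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    uri = f"gs://{bucket}/{name}"
    load_job = client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config)
    load_job.result()  # block until the load job finishes
```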
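
A minimal Apache Beam (Python) sketch of the Pub/Sub-to-BigQuery streaming pipeline referenced above; the topic, table, and schema are hypothetical, and runner/project options are omitted.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming mode is required for the unbounded Pub/Sub source.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/orders")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:sales_ds.orders_stream",
            schema="order_id:STRING,amount:FLOAT,event_time:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```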

Environment: GCP, BigQuery, GCS Bucket, G-Cloud Function, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, Seaborn, text mining, NumPy, Scikit-learn, heat maps, bar charts, line charts, ETL workflows, linear regression, multivariate regression, Scala, Spark

Spark Developer

Confidential

Responsibilities:

  • Imported required modules such as Keras and NumPy into the Spark session, and created directories for data and output.
  • Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move the data inside and outside of HDFS.
  • Read training and test data into the data directory as well as into Spark variables for easy access, and trained the model based on a sample submission.
  • Represented and stored all images as NumPy arrays for easier data manipulation.
  • Created a validation set using Keras2DML to test whether the trained model was working as intended.
  • Defined multiple helper functions used while running the neural network in a session. Also defined placeholders and the number of neurons in each layer.
  • Created the neural network's computational graph after defining weights and biases.
  • Created a TensorFlow session used to run the neural network and validate the model's accuracy on the validation set (a minimal session sketch follows this list).
  • After executing the program and achieving acceptable validation accuracy, created a submission that is stored in the submission directory.
  • Executed multiple Spark SQL queries after forming the database to gather specific data corresponding to an image.
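
A minimal TensorFlow 1.x-style sketch of the placeholder/weights/session workflow referenced above, written against the tf.compat.v1 API; the layer sizes and the stand-in random data are hypothetical.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # TF 1.x-style graph/session API

tf.disable_eager_execution()

# Hypothetical dimensions: 784-pixel images, 128 hidden units, 10 classes.
n_input, n_hidden, n_classes = 784, 128, 10

# Placeholders for image batches and one-hot labels.
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])

# Weights and biases for a single-hidden-layer network.
w1 = tf.Variable(tf.random.normal([n_input, n_hidden]))
b1 = tf.Variable(tf.zeros([n_hidden]))
w2 = tf.Variable(tf.random.normal([n_hidden, n_classes]))
b2 = tf.Variable(tf.zeros([n_classes]))

hidden = tf.nn.relu(tf.matmul(x, w1) + b1)
logits = tf.matmul(hidden, w2) + b2

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
accuracy = tf.reduce_mean(
    tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1)), tf.float32))

# Run the graph in a session on randomly generated stand-in data.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.rand(32, n_input).astype("float32")
    batch_y = np.eye(n_classes)[np.random.randint(0, n_classes, 32)].astype("float32")
    for _ in range(10):
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
    print("validation accuracy:", sess.run(accuracy, feed_dict={x: batch_x, y: batch_y}))
```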

Environment: Scala, Python, PySpark, Spark, Spark MLlib, Spark SQL, TensorFlow, NumPy, Keras, Power BI

Data Engineer

Confidential

Responsibilities:

  • Handled importing of data from various sources, performed transformations using Pig and loaded data into HDFS.
  • Involved in loading data from edge node to HDFS using shell scripting.
  • Ran Hadoop streaming jobs to process terabytes of XML-format data.
  • Implemented SQL, PL/SQL Stored Procedures.
  • Wrote complex SQL queries for validating the data against different kinds of reports generated by Business Objects.
  • Used Spark stream processing to bring data into memory, and implemented RDD transformations and actions to process it in units.
  • Worked on two sources to bring in required data needed for reporting for a project by writing SQL extracts.
  • Evaluated data profiling, cleansing, integration, and extraction tools. (e.g. Informatica)
  • Trained analytical models with Spark ML estimators, including linear regression, decision trees, logistic regression, and k-means (a minimal training sketch follows this list).
  • Performed computations using Spark MLlib functionality that wasn't present in Spark ML by converting DataFrames to RDDs and applying RDD transformations and actions.
  • Prototyped data visualizations using charts, drill-downs, and parameterized controls in Tableau to highlight the value of analytics in executive decision support.
  • Troubleshot and tuned machine learning algorithms in Spark.
  • Actively involved in code review and bug fixing for improving the performance.
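
A minimal PySpark sketch of training a Spark ML estimator as referenced above; the source table, feature columns, and label column are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("spark-ml-training").getOrCreate()

# Hypothetical training table with numeric features and a binary label column.
df = spark.table("analytics.training_data")

# Assemble raw numeric columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

# Fit a logistic regression estimator and evaluate on the held-out split.
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)

predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"AUC on held-out data: {auc:.3f}")
```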

Environment: SQL, PL/SQL, Informatica, Tableau, Spark, Spark MLlib, Spark ML, XML.
