
Data Engineer Resume


Jacksonville, FL

SUMMARY

  • Around 6 years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Data Engineer/Data Developer and Data Modeler.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing across the full cycle in both Waterfall and Agile methodologies.
  • Strong experience in writing scripts using the Python, PySpark, and Spark APIs for analyzing data (an illustrative PySpark sketch follows this summary).
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
  • Hands-on use of the Spark and Scala APIs to compare Spark performance with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
  • Experience in developing MapReduce programs on Apache Hadoop for analyzing big data as per requirements.
  • Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources on the edge nodes into HDFS.
  • Experience in building ETL data pipelines in Azure Databricks leveraging PySpark and Spark SQL.
  • Hands-on experience with Azure Data Lake Store (ADLS), Azure Data Lake Analytics (ADLA), Azure SQL DW, Azure Databricks (ADB), etc.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Good working knowledge of and experience with the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, ELB, IAM, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, DynamoDB, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data Governance, Metadata Management, Master Data Management, and Configuration Management.
  • Experience with Google Cloud Platform (GCP) components, Google Container Builder, GCP client libraries, and the Cloud SDK.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Expertise in designing complex mappings, performance tuning, and slowly changing dimension tables and fact tables.
  • Experienced in building automated regression scripts in Python to validate ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design, and implementation of RDBMS-specific features.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
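
The following is a minimal, hypothetical PySpark sketch of the read-cleanse-transform-load pattern referenced above; the storage path, table, and column names are illustrative assumptions rather than details from any of the projects below.

  # Minimal PySpark ETL sketch: read raw files, cleanse, aggregate with Spark SQL,
  # and load into a Hive table. All names and paths are hypothetical.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = (SparkSession.builder
           .appName("orders_etl")
           .enableHiveSupport()
           .getOrCreate())

  # Read raw data from an external source (path is illustrative).
  raw = spark.read.option("header", "true").csv("s3a://example-bucket/raw/orders/")

  # Basic cleansing and enrichment.
  clean = (raw.dropDuplicates(["order_id"])
              .withColumn("order_amount", F.col("order_amount").cast("double"))
              .filter(F.col("order_amount").isNotNull())
              .withColumn("order_date", F.to_date("order_ts")))

  # Register a temp view and apply a Spark SQL transformation.
  clean.createOrReplaceTempView("orders_clean")
  daily = spark.sql("""
      SELECT order_date, COUNT(*) AS orders, SUM(order_amount) AS revenue
      FROM orders_clean
      GROUP BY order_date
  """)

  # Load the result into a Hive-managed table.
  daily.write.mode("overwrite").saveAsTable("analytics.daily_orders")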

TECHNICAL SKILLS

Databases: MySQL, SQL Server, Oracle DB, HiveQL, Spark SQL, HBase, MongoDB, DynamoDB, Redshift, Snowflake.

Big Data: HDFS, MapReduce, Hive, Kafka, Sqoop, Spark Streaming, Spark SQL, Oozie, Zookeeper.

Machine Learning: Regression analysis, Bayesian methods, Decision Trees, Random Forests, Support Vector Machines, Neural Networks, Sentiment Analysis, K-Means Clustering, KNN, Ensemble Methods, Natural Language Processing (NLP), AWS SageMaker, Azure ML Studio.

ETL Tools: Azure Data Factory, AWS Glue.

Data Visualization: Tableau, Matplotlib.

Languages: Python, Scala, Shell scripting, R, SAS, SQL, T-SQL.

Operating Systems: UNIX, Linux, and Windows, with shell scripting via PowerShell and UNIX shell (over PuTTY).

Cloud: Azure, AWS.

IDE Tools: Databricks, PyCharm, IntelliJ IDEA, Anaconda.

PROFESSIONAL EXPERIENCE

Confidential - Jacksonville, FL

Data Engineer

Responsibilities:

  • Involved in the analysis, design, and implementation/translation of business user requirements.
  • Developed Spark code using Python and Spark-SQL for faster testing and processing of data.
  • Developed jobs to parse raw log data using PySpark, perform data cleansing and transformations, and load the results into Hive.
  • Developed data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
  • Installed Hadoop, MapReduce, and HDFS on AWS, and developed multiple jobs in PySpark and Hive for data cleaning and pre-processing.
  • Worked on batch processing of data sources using Apache Spark and Elasticsearch.
  • Designed and developed ETL processes in Databricks to migrate campaign data from external sources such as Azure Data Lake Storage Gen2 in ORC, Parquet, and text file formats.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Participated in a highly immersive data science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, Unix commands, NoSQL, MongoDB, and Hadoop.
  • Good understanding of Spark architecture with Databricks and Structured Streaming; set up Azure with Databricks and Databricks workspaces for business analytics, managed clusters in Databricks, and managed the machine learning lifecycle.
  • Developed MapReduce flows in the Microsoft HDInsight Hadoop environment using Python.
  • Performed scoring and financial forecasting for collection priorities using Python.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
  • Currently setting up a Spark environment to create data pipelines that load data into the data lake using PySpark and Python.
  • Used Airflow for orchestration and scheduling of the ingestion scripts.
  • Developed Python code for tasks, dependencies, an SLA watcher, and a time sensor for each job, for workflow management and automation with Airflow (an illustrative DAG sketch follows this list).
  • Responsible for building data pipelines that load data from web servers using Kafka and the Spark Streaming API.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Used Python to automate Hive tasks and read configuration files.
  • Developed batch scripts to fetch data from ADLS storage and perform the required transformations in PySpark using the Spark framework.
  • Designed and implemented end-to-end systems for Data Analytics and Automation, integrating custom visualization tools using R, Tableau, and Power BI.
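
An illustrative Airflow DAG sketch for the workflow-automation bullets above; the DAG name, schedule, task callables, SLA value, and the 02:00 gate are hypothetical assumptions, and the imports assume Airflow 2.x.

  # Hypothetical Airflow 2.x DAG showing task dependencies, an SLA, and a time sensor.
  from datetime import datetime, time, timedelta

  from airflow import DAG
  from airflow.operators.python import PythonOperator
  from airflow.sensors.time_sensor import TimeSensor


  def extract():
      print("pull raw files from the landing zone")


  def transform():
      print("run the PySpark cleansing/transformation job")


  def load():
      print("load curated data into the data lake")


  default_args = {
      "owner": "data-eng",
      "retries": 1,
      "retry_delay": timedelta(minutes=5),
      "sla": timedelta(hours=2),  # SLA misses are reported by Airflow's SLA mechanism
  }

  with DAG(
      dag_id="daily_ingestion",
      start_date=datetime(2021, 1, 1),
      schedule_interval="@daily",
      default_args=default_args,
      catchup=False,
  ) as dag:
      # Wait until 02:00 before starting downstream work.
      wait_until_2am = TimeSensor(task_id="wait_until_2am", target_time=time(2, 0))

      t_extract = PythonOperator(task_id="extract", python_callable=extract)
      t_transform = PythonOperator(task_id="transform", python_callable=transform)
      t_load = PythonOperator(task_id="load", python_callable=load)

      wait_until_2am >> t_extract >> t_transform >> t_load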

Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, AWS, ETL, UNIX, Linux, Tableau, Teradata, ADLS, Azure Databricks, Sqoop, Oozie, Python, GitHub, Azure Data Lake.

Confidential, Minneapolis, MN

Data Engineer

Responsibilities:

  • Worked as a Data Engineer with Big Data and Hadoop ecosystem components.
  • Involved in converting Hive/SQL queries into Spark transformations using Scala.
  • Created Spark data frames using Spark SQL and prepared data for data analytics by storing it in AWS S3.
  • Responsible for loading data from Kafka into HBase using REST API.
  • Developed the batch scripts to fetch the data from AWS S3 storage and perform required transformations in Scala using the Spark framework.
  • Used the Spark Streaming APIs to perform transformations and actions on the fly, building a common learner data model that receives data from Kafka in near real time and persists it to HBase (see the streaming sketch after this list).
  • Created Sqoop scripts to import and export customer profile data from RDBMS to S3 buckets.
  • Developed various enrichment applications in Spark using Scala for cleansing and enrichment of clickstream data with customer profile lookups.
  • Troubleshooting Spark applications for improved error tolerance and reliability.
  • Used Spark Data frame and Spark API to implement batch processing of Jobs.
  • Well-versed with Pandas data frames and Spark data frames.
  • Used Apache Kafka and Spark Streaming to get data from Adobe Live Stream REST API connections.
  • Automated the creation and termination of AWS EMR clusters (a boto3 sketch appears at the end of this section).
  • Worked on fine-tuning and performance enhancements of various Spark applications and Hive scripts.
  • Used various Spark concepts such as broadcast variables, caching, and dynamic allocation to design more scalable Spark applications.
  • Imported hundreds of structured datasets from relational databases using Sqoop, processed them with Spark, and stored the data in HDFS in CSV format.
  • Created data partitions on large data sets in S3 and DDL on partitioned data.
  • Improved efficiency by modifying existing data pipelines in Matillion to load data into AWS Redshift.
  • Identified source systems, their connectivity, related tables, and fields; ensured data suitability for mapping; prepared unit test cases; and supported the testing team in fixing defects.
  • Defined HBase tables to store various data formats of incoming data from different portfolios.
  • Developed the verification and control process for daily data loading.
  • Involved in daily production support to monitor and troubleshoot Hive and Spark jobs.
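
A hedged Spark Structured Streaming sketch of the Kafka-to-store pattern described above; the broker address, topic, and schema are hypothetical, the HBase write is replaced with a Parquet sink to keep the sketch self-contained, and the Kafka connector package is assumed to be on the Spark classpath.

  # Hypothetical streaming read from Kafka, parsed into columns and written per micro-batch.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F
  from pyspark.sql.types import StringType, StructField, StructType, TimestampType

  spark = SparkSession.builder.appName("kafka_learner_stream").getOrCreate()

  event_schema = StructType([
      StructField("learner_id", StringType()),
      StructField("event_type", StringType()),
      StructField("event_ts", TimestampType()),
  ])

  # Read the Kafka topic as a streaming DataFrame.
  raw = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "learner-events")
         .load())

  # Kafka delivers key/value as binary; parse the JSON value into columns.
  events = (raw.selectExpr("CAST(value AS STRING) AS json")
               .select(F.from_json("json", event_schema).alias("e"))
               .select("e.*"))


  def write_batch(batch_df, batch_id):
      # The pipeline described above persisted to HBase; Parquet is used here
      # only so the sketch runs without an external connector.
      batch_df.write.mode("append").parquet("/tmp/learner_events")


  query = (events.writeStream
           .foreachBatch(write_batch)
           .option("checkpointLocation", "/tmp/checkpoints/learner_events")
           .start())
  query.awaitTermination()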

Environment: AWS EMR, S3, Spark, Hive, Sqoop, Scala, MySQL, Oracle DB, Athena, Redshift
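
A hedged boto3 sketch of the EMR cluster creation/termination automation mentioned above; the cluster name, instance types and counts, IAM roles, and the log bucket are hypothetical.

  # Create a transient EMR cluster for a batch run, then terminate it explicitly.
  import boto3

  emr = boto3.client("emr", region_name="us-east-1")

  response = emr.run_job_flow(
      Name="nightly-spark-batch",
      ReleaseLabel="emr-6.5.0",
      Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
      Instances={
          "InstanceGroups": [
              {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
              {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
          ],
          "KeepJobFlowAliveWhenNoSteps": False,
          "TerminationProtected": False,
      },
      JobFlowRole="EMR_EC2_DefaultRole",
      ServiceRole="EMR_DefaultRole",
      LogUri="s3://example-bucket/emr-logs/",
  )
  cluster_id = response["JobFlowId"]

  # Terminate once downstream checks pass.
  emr.terminate_job_flows(JobFlowIds=[cluster_id])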

Confidential, San Diego, CA

Data Engineer

Responsibilities:

  • Designed a solution for streaming data applications using Apache Storm.
  • Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
  • Created an architectural solution that leverages the best Azure analytics tools to solve the specific need in the Chevron use case.
  • Designed and presented technical solutions to end users in a way that is easy to understand.
  • Educated client/business users on the pros and cons of various Azure PaaS and SaaS solutions, ensuring the most cost-effective approaches were taken into consideration.
  • Created self-service reporting in Azure Data Lake Storage Gen2 using an ELT approach.
  • Created Spark vectorized pandas user-defined functions for data manipulation and wrangling (an illustrative pandas UDF sketch follows this list).
  • Worked on real-time data streaming using AWS Kinesis, EMR, and AWS Glue.
  • Transferred data in logical stages from the system of record to the raw, refined, and produced zones for easy translation and denormalization.
  • Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
  • Wrote Spark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.
  • Created Data Factory pipelines that bulk copy multiple tables at once from a relational database to Azure Data Lake Gen2.
  • Created a custom logging framework for ETL pipeline logging using append variables in Data Factory.
  • Enabled monitoring and Azure Log Analytics to alert the support team on the usage and stats of daily runs.
  • Took proof-of-concept project ideas from the business, then led, developed, and created production pipelines that deliver business value using Azure Data Factory.
  • Kept our data separated and secure across national boundaries through multiple data centers and regions.
  • Implemented continuous integration/continuous delivery (CI/CD) best practices using Azure DevOps, ensuring code versioning.
  • Utilized Ansible playbooks for code pipeline deployment.
  • Delivered denormalized data from the produced layer in the data lake to Power BI consumers for modeling and visualization.
  • Worked in a SAFe (Scaled Agile Framework) team with daily standups, sprint planning, and quarterly planning.
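
A minimal sketch of a vectorized (pandas) user-defined function in PySpark, as referenced above; the column name and the normalization rule are hypothetical, and the example assumes Spark 3.x with PyArrow available.

  # Hypothetical pandas UDF that cleans a string column in vectorized batches.
  import pandas as pd
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F
  from pyspark.sql.types import StringType

  spark = SparkSession.builder.appName("pandas_udf_demo").getOrCreate()


  @F.pandas_udf(StringType())
  def normalize_region(region: pd.Series) -> pd.Series:
      # Trim whitespace and upper-case the region code, one Arrow batch at a time.
      return region.str.strip().str.upper()


  df = spark.createDataFrame([(" us-east ",), ("eu-West",)], ["region"])
  df.withColumn("region_norm", normalize_region("region")).show()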

Environment: Azure Data Lake Gen2, Azure Data Factory, Spark, Databricks, Azure DevOps, Power BI, Python, R, SQL, Scaled Agile (SAFe) team environment

Confidential

Python Developer

Responsibilities:

  • Created web-based applications using Python on the Django framework for data processing.
  • Implemented preprocessing procedures along with deployment using AWS services, creating virtual machines with EC2.
  • Performed exploratory data analysis, data wrangling, and data visualization.
  • Validated the data to check for proper conversion, identified and cleaned unwanted data, and profiled data for accuracy, completeness, and consistency.
  • Prepared standard reports, charts, graphs, and tables from structured data sources by querying data repositories using Python and SQL.
  • Developed and produced dashboards and key performance indicators to monitor organizational performance.
  • Defined data needs, evaluated data quality, and extracted/transformed data for analytics projects and research.
  • Used the Django framework for application development; designed and maintained databases using Python and developed a Python-based RESTful web service API using Flask, SQLAlchemy, and PostgreSQL (a minimal sketch follows this list).
  • Worked on server-side applications using Python.
  • Performed efficient delivery of code and continuous integration to keep in line with Agile principles.
  • Experienced in Agile methodologies, Scrum stories, and sprints in a Python-based environment.
  • Imported and exported data between different data sources using SQL Server Management Studio.
  • Maintained program libraries, user manuals, and technical documentation.
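
A minimal sketch of a Python RESTful service with Flask and SQLAlchemy as referenced above; the model, routes, and PostgreSQL connection string are hypothetical, and the Flask-SQLAlchemy extension is assumed.

  # Hypothetical Flask + Flask-SQLAlchemy REST endpoints over a PostgreSQL table.
  from flask import Flask, jsonify, request
  from flask_sqlalchemy import SQLAlchemy

  app = Flask(__name__)
  app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://user:password@localhost/appdb"
  db = SQLAlchemy(app)


  class Customer(db.Model):
      id = db.Column(db.Integer, primary_key=True)
      name = db.Column(db.String(120), nullable=False)


  @app.route("/customers", methods=["GET"])
  def list_customers():
      # Return all customers as JSON.
      customers = Customer.query.all()
      return jsonify([{"id": c.id, "name": c.name} for c in customers])


  @app.route("/customers", methods=["POST"])
  def create_customer():
      # Create a customer row from the posted JSON payload.
      payload = request.get_json()
      customer = Customer(name=payload["name"])
      db.session.add(customer)
      db.session.commit()
      return jsonify({"id": customer.id, "name": customer.name}), 201


  if __name__ == "__main__":
      with app.app_context():
          db.create_all()  # create tables for the sketch; a real app would use migrations
      app.run(debug=True)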

Environment: Python, Django, RESTful web services, MySQL, PostgreSQL, Visio, SQL Server Management Studio, AWS.
