
Data Engineer Resume


Texas

SUMMARY

  • Overall, 7+ years of experience in software analysis, design, development, testing, and implementation of Cloud, Big Data, BigQuery, Spark, Scala, and Hadoop solutions.
  • Expertise in Big Data technologies, data pipelines, SQL, cloud-based RDS, distributed databases, serverless architecture, data mining, web scraping, and cloud technologies such as AWS EMR and CloudWatch.
  • Hands-on experience in designing and implementing data engineering pipelines and analyzing data using Hadoop ecosystem tools like HDFS, Spark, Sqoop, Hive, Flume, Kafka, Impala, PySpark, Oozie, and HBase.
  • Experience in designing and implementing end-to-end Big Data solutions on the Hadoop framework across multiple distributions, including Cloudera (CDH3 & CDH4) and Hortonworks.
  • Strong knowledge of writing PySpark UDFs and generic UDFs to incorporate complex business logic into data pipelines (see the sketch after this list).
  • Extensive experience designing, creating, testing, and maintaining complete data management from data ingestion through curation to provisioning, with in-depth knowledge of Spark APIs (Spark SQL, DSL, Streaming), file formats such as Parquet and JSON, and performance tuning of Spark applications.
  • Gathered and translated business requirements into technical designs and developed the physical aspects of those designs by creating materialized views, views, and lookups.
  • Experience in designing and testing highly scalable, mission-critical systems and Spark jobs in both Scala and PySpark, as well as Kafka.
  • Expertise in end-to-end Data Processing jobs to analyze data using MapReduce, Spark, and Hive.
  • Strong Experience in working with Linux/Unix environments, writing Shell Scripts.
  • Developed a pipeline using Spark and Kafka to load data from a server into Hive, with automatic ingestion and quality audits of the data into the RAW layer of the data lake.
  • Developed end-to-end analytical/predictive model applications leveraging business intelligence and insights from both structured and unstructured data in a Big Data environment.
  • Strong experience using Spark Streaming, Spark SQL, and other Spark components such as accumulators, broadcast variables, and different levels of caching and optimization for Spark jobs (also covered in the sketch below).
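The sketch below is a minimal PySpark illustration of the UDF and broadcast-variable pattern described in this summary; the lookup table, column names, and business rule are hypothetical placeholders rather than any project's actual logic.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-broadcast-sketch").getOrCreate()

# Hypothetical lookup, broadcast once so every executor reuses the same copy.
region_lookup = spark.sparkContext.broadcast({"TX": "South", "CA": "West"})

@udf(returnType=StringType())
def to_region(state_code):
    # Business logic wrapped in a UDF; unmapped codes fall back to "Unknown".
    return region_lookup.value.get(state_code, "Unknown")

df = spark.createDataFrame([("TX",), ("NY",)], ["state_code"])
df.withColumn("region", to_region("state_code")).show()
```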

TECHNICAL SKILLS

  • Hadoop
  • Spark
  • Hive
  • YARN
  • HDFS
  • Zookeeper
  • HBase
  • Kafka
  • Oracle
  • Teradata
  • DB2
  • Python/R/SQL/Scala 2.11.11
  • AWS EC2
  • EMR
  • Lambda
  • Terraform
  • Microsoft Azure
  • Azure Databricks
  • Google Cloud Platform (GCP)

PROFESSIONAL EXPERIENCE

Confidential, Texas

Data Engineer

Responsibilities:

  • Participated in all phases, including analysis, design, coding, testing, and documentation; gathered requirements and performed business analysis.
  • Developed entity-relationship diagrams and modeled transactional databases and the data warehouse using ER/Studio and PowerDesigner.
  • Designed and developed complex data pipelines using Sqoop, Spark, and Hive to ingest, transform, and analyze customer behavior data.
  • Implemented Spark applications in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
  • Maintained data pipeline up-time of 99.9% while ingesting streaming and transactional data across 7 different primary data sources using Spark, Redshift, S3, and Python.
  • Ingested data from disparate data sources using a combination of SQL, the Google Analytics API, and the Salesforce API with Python, creating data views to be used in BI tools like Tableau.
  • Worked with two different datasets, one using HiveQL and the other using Pig Latin.
  • Moved raw data between different systems using Apache NiFi.
  • Participated in building data lakes in AWS and GCP.
  • Analyzed, designed, and developed custom solutions/applications using the Microsoft Azure technology stack, primarily Virtual Machines, Azure Data Factory, and Azure Databricks.
  • Development-level experience in Microsoft Azure, PowerShell, Python, Azure Data Factory, and Databricks.
  • Automated the data flow process using NiFi, with hands-on experience tracking data flow in real time using NiFi.
  • Wrote Terraform scripts for CloudWatch alerts.
  • Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape and created reports in Looker based on Snowflake connections.
  • Wrote MapReduce code in Python to remediate security issues in the data.
  • Synchronized both unstructured and structured data using Pig and Hive according to business needs.
  • Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
  • Imported complete datasets from RDBMS to the HDFS cluster using Sqoop.
  • Worked in an AWS environment with technologies such as S3, EC2, EMR, Glue, CFT, and Lambda, and databases including Oracle, SMS, DynamoDB, and MongoDB.
  • Created external tables and moved data into them from managed tables.
  • Performed subqueries in Hive and partitioned and bucketed the imported data using HiveQL (see the partitioning sketch after this list).
  • Moved the partitioned data into different tables per business requirements.
  • Invoked external UDF/UDAF/UDTF Python scripts from Hive using the Hadoop Streaming approach, with cluster monitoring supported by Ganglia.
  • Validated data from SQL Server against Snowflake to ensure a correct match.
  • Set up job schedules using Oozie, identified errors in the logs, and rescheduled/resumed jobs.
  • Handled data through the Hive Web Interface (HWI) using the Cloudera Hadoop distribution UI.
  • Enhanced the existing product with new features such as user roles (Lead, Admin, Developer), ELB, Auto Scaling, S3, CloudWatch, CloudTrail, and RDS scheduling.
  • Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS (see the alarm sketch after this list).
  • After transformation, the data is moved to the Spark cluster and delivered live to the application using Spark Streaming and Kafka (see the streaming sketch after this list).
  • Created RDDs in Spark.
  • Extracted data from the Teradata data warehouse into Spark RDDs.
  • Worked on stateful transformations in Spark Streaming.
  • Loaded data into Hive from Spark RDDs.
  • Worked on Spark SQL UDFs and Hive UDFs, as well as Spark accumulators and broadcast variables.
  • Used decision trees as models for both classification and regression.
  • Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.
  • Developed and created test environments for different applications by provisioning Kubernetes clusters on AWS using Docker, Ansible, and Terraform.
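The following is a hedged sketch of the Hive partitioning and bucketing described above, expressed through PySpark's writer API with Hive support enabled; the table, database, and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Illustrative staging table; the real schema and names differed.
events = spark.table("staging_customer_events")

# Partition by date and bucket by customer_id so scans and joins prune work.
(events.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("event_date")
    .bucketBy(16, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("customer_events_curated"))

# Subquery-style analysis over the partitioned table ("vip_customers" is a placeholder).
spark.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM customer_events_curated
    WHERE customer_id IN (SELECT customer_id FROM vip_customers)
    GROUP BY event_date
""").show()
```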
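Below is an illustrative sketch of the kind of CloudWatch alarm with SNS notification described above, written with boto3 rather than the Terraform used on the project; the instance ID and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder identifiers; in practice these were managed through Terraform.
INSTANCE_ID = "i-0123456789abcdef0"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:data-eng-alerts"

cloudwatch.put_metric_alarm(
    AlarmName=f"high-cpu-{INSTANCE_ID}",
    MetricName="CPUUtilization",
    Namespace="AWS/EC2",
    Statistic="Average",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SNS_TOPIC_ARN],  # notify the on-call SNS topic when breached
)
```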
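The streaming flow described above might look roughly like this Structured Streaming sketch, which reads from Kafka and appends Parquet into the RAW layer; the broker address, topic, schema, and paths are assumptions, and the spark-sql-kafka connector package must be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-stream-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Assumed event schema; the real payload carried more fields.
schema = (StructType()
          .add("customer_id", StringType())
          .add("amount", DoubleType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "customer-events")             # placeholder topic
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Append the parsed stream into the RAW layer as Parquet for the Hive table to read.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/customer_events")
         .option("checkpointLocation", "hdfs:///checkpoints/customer_events")
         .outputMode("append")
         .start())
```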

Environment: Hadoop, Sqoop, Hive, HDFS, YARN, PySpark, Zookeeper, HBase, Apache Spark, Scala, AWS EC2, S3, EMR, RDS, VPC, Lambda, Redshift, Glue, Athena, data lake, Terraform, Snowflake, Kafka, Oracle, Python, RESTful web services.

Confidential, California

Data Engineer

Responsibilities:

  • Worked closely with business analysts to convert business requirements into technical requirements and prepared low- and high-level documentation.
  • Performed transformations using Hive and MapReduce; copied .log and Snappy files from Greenplum into HDFS using Flume and Kafka, and extracted data from MySQL into HDFS using Sqoop.
  • Imported required tables from RDBMS into HDFS using Sqoop and used Storm/Spark Streaming with Kafka to stream data into HBase in real time.
  • Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface.
  • Worked on Snowflake and built the Logical and Physical data model for it as per the changes required.
  • Wrote Python scripts against GCS buckets to maintain raw file archival (see the archival sketch after this list).
  • Wrote MapReduce jobs for text mining with the predictive analytics team and worked with Hadoop components such as HBase, Spark, YARN, Kafka, Zookeeper, Pig, Hive, Sqoop, Oozie, Impala, and Flume using Java.
  • Opened SSH tunnels to Google Dataproc to access the YARN resource manager and monitor Spark jobs.
  • Wrote Hive UDFs per requirements to handle different schemas and XML data.
  • Wrote programs using Python and Apache Beam, executed in Cloud Dataflow, to validate data between raw source files and BigQuery tables.
  • Developed data pipelines using Python and Hive to load data into the data lake; performed data analysis and data mapping for several data sources.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (see the Beam sketch after this list).
  • Defined virtual warehouse sizing in Snowflake for different types of workloads.
  • Developed Java MapReduce programs to analyze sample log files stored in the cluster.
  • Designed a new member and provider booking system that allows providers to book new slots, sending the member leg and provider leg directly to TP through Datalink.
  • Analyzed various types of raw files such as JSON, CSV, and XML with Python using Pandas, NumPy, etc.
  • Developed Spark applications using Scala for easy Hadoop transitions, and wrote Spark jobs and Spark Streaming API code in Scala and Python.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive; developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Designed and developed user-defined functions (UDFs) for Hive, developed Pig UDFs to pre-process data for analysis, and built UDAFs for custom data-specific processing.
  • Created Airflow scheduling scripts (DAGs) in Python (see the DAG sketch after this list).
  • Automated the existing scripts for performance calculations using scheduling tools like Airflow.
  • Designed and developed the core data pipeline code, involving work in Python and built on Kafka and Storm.
  • Good knowledge of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables for optimized performance.
  • Performed performance tuning using partitioning and bucketing of Impala tables.
  • Created cloud-based software solutions written in Scala using Spray IO, Akka, and Slick.
  • Fetched live streaming data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
  • Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Worked on NoSQL databases including HBase and Cassandra.
  • Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
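A sketch of a GCS raw-file archival script of the kind referenced above, using the google-cloud-storage client; the bucket names and prefix are placeholders.

```python
from google.cloud import storage

# Placeholder bucket names and prefix; the real landing/archive layout differed.
SOURCE_BUCKET = "raw-landing-bucket"
ARCHIVE_BUCKET = "raw-archive-bucket"
PREFIX = "incoming/"

client = storage.Client()
src = client.bucket(SOURCE_BUCKET)
dst = client.bucket(ARCHIVE_BUCKET)

for blob in client.list_blobs(SOURCE_BUCKET, prefix=PREFIX):
    # Copy each raw file into the archive bucket, then remove the original.
    src.copy_blob(blob, dst, new_name=f"archive/{blob.name}")
    blob.delete()
```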
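An illustrative Apache Beam pipeline for the Pub/Sub-to-BigQuery Dataflow job described above; the project, topic, and table names are placeholders, and the Dataflow runner/region options are omitted for brevity.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder resource names; the real project, topic, and table differ.
TOPIC = "projects/my-project/topics/raw-events"
TABLE = "my-project:analytics.raw_events"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           TABLE,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```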
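A minimal Airflow DAG sketch for the kind of scheduled job mentioned above; the schedule, script path, and task are illustrative assumptions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative defaults; the real DAGs ran project-specific jobs.
default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_performance_calcs",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",   # every day at 06:00
    catchup=False,
) as dag:
    run_calcs = BashOperator(
        task_id="run_performance_calcs",
        bash_command="spark-submit /opt/jobs/performance_calcs.py",  # placeholder script
    )
```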

Environment: Map Reduce, HDFS, Hive, Pig, HBase, Python, SQL, Sqoop, Flume, Oozie, Impala, Scala, Spark, Apache Kafka, Play, Snowflake, AKKA, Zookeeper, J2EE, Linux Red Hat, HP-ALM, Eclipse, Cassandra, SSIS.

Confidential, Atlanta

Data Engineer

Responsibilities:

  • Responsible for designing and implementing end-to-end data pipelines using Big Data tools including HDFS, Hive, and Spark.
  • Extracted, parsed, cleaned, and ingested incoming web feed data and server logs into HDFS, handling structured and unstructured data.
  • Loaded CSV/TXT/Avro/Parquet files using PySpark, processed the data by creating Spark DataFrames and RDDs, and saved the output in Parquet format in HDFS to load into fact tables (see the sketch after this list).
  • Worked extensively on tuning SQL queries and database modules for optimum performance.
  • Wrote complex SQL queries using CTEs, subqueries, joins, and recursive CTEs.
  • Good experience with database, data warehouse, and schema concepts such as the snowflake schema.
  • Worked on clusters with many nodes; communicated with business users and source data owners to gather reporting requirements and to access and discover source data content, quality, and availability.
  • Imported millions of structured records from relational databases using Sqoop to process with Spark and stored the data in HDFS in CSV format.
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables.
  • Used Scala SBT to develop Scala-based Spark projects and executed them with spark-submit.
  • Expertise in Spark, Spark SQL, and tuning and debugging the Spark cluster (YARN).
  • Improved efficiency by modifying existing data pipelines in Matillion to load data into AWS Redshift.
  • Deployed the Airflow server and set up DAGs for scheduled tasks.
  • Used HashiCorp Vault to write and read secrets to and from lockboxes.
  • Migration of MicroStrategy reports and data from Netezza to IIAS.
  • Experienced with batch processing of Data sources using Apache Spark.
  • Extensive use of Python libraries, Pylint, and the behave automated testing framework.
  • Well versed in Pandas DataFrames and Spark DataFrames.
  • Developed PowerCenter mappings to extract data from various databases and flat files and load it into the data mart using PySpark and Airflow.
  • Created data partitions on large datasets in S3 and DDL on the partitioned data.
  • Implemented rapid provisioning and lifecycle management using Amazon EC2 and custom Bash scripts.
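A brief PySpark sketch of the file-load-and-partitioned-Parquet-write pattern referenced in this list; the input path, columns, and partition key are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("file-load-sketch").getOrCreate()

# Assumed input path and columns; the real feeds carried CSV/TXT/Avro/Parquet.
orders = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("hdfs:///data/landing/orders/*.csv"))

cleaned = orders.withColumn("order_date", to_date(col("order_ts")))

# Write the fact-table feed as Parquet, partitioned by date, into HDFS.
(cleaned.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("hdfs:///data/curated/fact_orders"))
```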

Environment: Unix Shell Script, Python 2 & 3, Scheduler (Cron), Jenkins, Artifactory, Matillion, EMR, Databricks, PyCharm, Spark SQL, Hive, SQL, Jupyter, MicroStrategy, PuTTY, Power BI, AWS.

Confidential

SQL Developer

Responsibilities:

  • Analyzed requirements and impact by participating in Joint Application Development sessions with business.
  • Created various scripts (using different database objects) and SSIS packages (using different tasks) to Extract, Transform and Load data from various servers to client databases.
  • Optimized Stored Procedures and long running queries using indexing strategies and query optimization techniques.
  • Leveraged dynamic SQL for improving performance and efficiency.
  • Performed optimization and performance tuning on Oracle PL/SQL procedures and SQL Queries.
  • Developed PL/SQL objects (views, packages, functions, and procedures) and used SQL*Loader for data migration.
  • Successfully developed and deployed SSIS packages into QA/UAT/Production environments and used package configuration to export various package properties.
  • Developed Tableau workbooks to perform year over year, quarter over quarter, YTD, QTD and MTD type of analysis.
  • Worked with a team of developers to design, develop, and implement a BI solution for sales, product, and customer KPIs.
  • Created and analyzed complex dashboards in Tableau using various sources of data such as Excel sheets and SQL Server.
  • Developed SSRS reports and configured SSRS subscriptions per specifications provided by internal and external clients.
  • Designed and coded application components in an Agile environment utilizing a test-driven development approach.
  • Extensively worked on Excel using pivot tables and complex formulas to manipulate large data structures.
  • Interacted with the other departments to understand and identify data needs and requirements and worked with other members of the organization to deliver and address those needs.
  • Designed and created distributed reports in multiple formats such as Excel, PDF and CSV using SQL Server 2008 R2 Reporting Services (SSRS).

Environment: SQL Server 2008 R2, SSMS, SSIS, SSRS, XML, MS Access.
