Data Engineer Resume
Bentonville, AR
SUMMARY
- Over 5 years of professional IT experience as a Data Engineer/Data Analyst building data pipelines with the Hadoop big data ecosystem, Spark, Hive, Sqoop, Google Cloud Storage, Python, SQL, Tableau, GitHub, and ETL tools.
- Experienced across multiple domains including Finance, Retail, E-commerce, and Healthcare.
- Experience with Hadoop, HDFS, YARN, Sqoop, Hive, MapReduce, Spark, and GCP.
- Experience writing SQL queries, creating databases, and developing stored procedures and DDL/DML statements.
- Experience importing and exporting data between HDFS and relational database systems using Sqoop.
- Proficient with the Hive data warehouse: creating tables, distributing data with partitioning and bucketing strategies, and writing and optimizing HiveQL queries (illustrated in the sketch after this summary).
- Experience in ingesting, storing, querying, processing, and analyzing big data, with hands-on use of Apache Spark, Spark SQL, and Hive.
- Used Spark and Google Cloud Storage to build scalable, fault-tolerant infrastructure processing terabytes of data per day, contributing to a 15% increase in total users.
- Experience with GCP services (BigQuery, Bigtable, Dataproc, GCS, Dataflow, App Engine, and Looker).
- Experience designing, developing, and deploying projects on the GCP suite, including BigQuery, Dataflow, Dataproc, Google Cloud Storage, Composer, and Looker.
- Designed, tested, and maintained data management and processing systems using Spark, GCP, Hadoop, and shell scripting.
- Expertise in collecting, exploring, analyzing, and visualizing data through Tableau and Looker reports and dashboards.
- Worked with business users, product owners, and engineers to design feature-based solutions and implement them in an agile fashion.
- Knowledge of Google Cloud Platform (GCP) services including Compute Engine, Cloud Load Balancing, Cloud Storage, Dataproc, Cloud Pub/Sub, Cloud SQL, BigQuery, Stackdriver Monitoring, Cloud Spanner, Looker, and Deployment Manager.
- Experience in Azure development, including Azure Web Apps, Azure Storage, Azure SQL Database, Virtual Machines, Azure Data Factory, HDInsight, Azure Search, and Notification Hubs.
- Effective team member: collaborative and comfortable working independently.
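Below is a minimal Spark Scala sketch of the pipeline pattern summarized above: reading raw files from Google Cloud Storage and writing them into a date-partitioned Hive table, then querying it with HiveQL. The bucket path, database, table, and column names are hypothetical placeholders, not taken from any specific project.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.to_date

object GcsToHiveSketch {
  def main(args: Array[String]): Unit = {
    // Hive support lets Spark manage tables through the Hive metastore.
    val spark = SparkSession.builder()
      .appName("gcs-to-hive-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical GCS location; the GCS connector exposes buckets via the gs:// scheme.
    val raw = spark.read.parquet("gs://example-bucket/raw/transactions/")

    // Derive a partition column so downstream HiveQL can prune by date.
    val withDate = raw.withColumn("txn_date", to_date(raw("txn_ts")))

    // Write as a Hive table partitioned by transaction date.
    withDate.write
      .mode(SaveMode.Overwrite)
      .partitionBy("txn_date")
      .format("parquet")
      .saveAsTable("analytics.transactions")

    // Example ad-hoc HiveQL query that benefits from partition pruning.
    spark.sql(
      """SELECT txn_date, COUNT(*) AS txn_count
        |FROM analytics.transactions
        |WHERE txn_date >= '2023-01-01'
        |GROUP BY txn_date""".stripMargin
    ).show()

    spark.stop()
  }
}
```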
TECHNICAL SKILLS
Big Data: Apache Spark, Hadoop, HDFS, YARN, Hive, Sqoop, MapReduce, Tez, Ambari, ZooKeeper, data warehousing.
Database: MySQL, SQL Server, DB2, Cassandra, Teradata, BigQuery, Druid.
Cloud: Google Cloud Platform (Cloud Storage, BigQuery, Dataproc, Dataflow, Cloud Pub/Sub, Data Catalog), Azure.
Methodologies: Agile (Scrum), Waterfall.
Languages: Scala, Python, PySpark, SQL, HiveQL, Shell scripting.
Data Visualization Tools: Looker, Power BI, Microsoft Excel (pivot tables, graphs, charts, dashboards).
Version Control: Git.
Tools: Automic, Hue, Looker, IntelliJ IDEA, Eclipse, Maven, ZooKeeper, VMware, PuTTY, DbVisualizer.
PROFESSIONAL EXPERIENCE
Confidential, Bentonville, AR
Data Engineer
Responsibilities:
- Responsible for building ETL (Extract, Transform, Load) pipelines from the data lake to different databases based on requirements.
- Designed and developed data applications using Hadoop, HDFS, Hive, Spark, Scala, Sqoop, the Automic scheduler, DB2, SQL Server, Teradata, and ThoughtSpot.
- Developed base and consumption tables in the data lake and moved data from the data lake to Teradata.
- Built catalog tables with batch processing and multiple complex joins, combining dimension tables of store and e-commerce transactions that receive millions of records every day (a sketch of this join pattern follows this role's environment line).
- Developed proof-of-concept prototypes with fast iterations, and developed and maintained design documentation, test cases, monitoring, and performance evaluations using Git, PuTTY, Maven, Confluence, ETL, Automic, ZooKeeper, and Cluster Manager.
- Used shell scripting to automate validations between the databases of each module and report data quality to users, using the Aorta and Unified Data Pipeline (UDP) frameworks.
- Experience developing, testing, and deploying pipeline configurations written in YAML.
- Responsible for troubleshooting failures and slowness in data pipelines built on MapReduce, Tez, Hive, or Spark to ensure SLA adherence.
- Optimized Hive scripts by re-engineering the DAG logic to use minimal resources and provide high throughput.
- Worked with business users to resolve discrepancies such as error and duplicate records across tables, writing complex SQL/HQL queries to validate reports.
- Improved performance and optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Worked on a migration project moving data from sources such as Teradata, Hadoop, and DB2 to Google Cloud Platform (GCP) using the UDP framework, transforming the data with Spark Scala scripts (a sketch of this migration pattern also follows this role's environment line).
- Created data ingestion processes to maintain a global data lake on GCP Cloud Storage and BigQuery.
- Built Tableau dashboards to report store-level and region-level sales for Confidential US and global data.
- Followed Agile methodology, participated in sprints and daily scrums to deliver tasks, and used the JIRA board to manage and update tasks.
Environment: Hadoop, Spark, Scala, Teradata, Hive, Aorta, Sqoop, GCP, Google Cloud Storage, BigQuery, Dataproc, Dataflow, SQL, DB2, UDP, GitHub, Azure (Azure Data Factory, Azure databases), Tableau, Looker.
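A hypothetical sketch of the catalog-table build described in this role: batch joins in Spark Scala between a large transaction fact set and several dimension tables. Table and column names are placeholders; broadcasting the small dimension tables is one common way to keep such multi-join batch jobs fast.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.broadcast

object CatalogTableBuildSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalog-table-build-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Daily store and e-commerce transactions combined into one fact set
    // (both inputs are assumed to share the same schema, including txn_date).
    val storeTxn = spark.table("lake.store_transactions")
    val ecomTxn  = spark.table("lake.ecom_transactions")
    val txns     = storeTxn.unionByName(ecomTxn)

    // Dimension tables are assumed small enough to broadcast to every executor.
    val itemDim  = broadcast(spark.table("lake.item_dim"))
    val storeDim = broadcast(spark.table("lake.store_dim"))

    // Join the fact with its dimensions into a consumption-ready catalog table.
    val catalog = txns
      .join(itemDim, Seq("item_id"))
      .join(storeDim, Seq("store_id"))

    catalog.write
      .mode(SaveMode.Overwrite)
      .partitionBy("txn_date")
      .saveAsTable("consumption.txn_catalog")

    spark.stop()
  }
}
```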
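A minimal sketch of the Hadoop-to-GCP migration pattern from this role: a Spark Scala job that reads a Hive table from the data lake, applies a transformation, and writes to BigQuery. Dataset, table, and bucket names are placeholders, and the write options follow the open-source spark-bigquery connector (which must be on the classpath); exact option names can vary by connector version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

object HiveToBigQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-bigquery-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Read the source table from the Hadoop data lake via the Hive metastore.
    val source = spark.table("legacy_dw.store_sales")

    // Example transformation: drop obviously bad rows and normalize a key column.
    val cleaned = source
      .filter(col("sale_amount").isNotNull)
      .withColumn("store_code", upper(col("store_code")))

    // Write to BigQuery through the spark-bigquery connector;
    // "temporaryGcsBucket" stages the data in GCS before the BigQuery load job.
    cleaned.write
      .format("bigquery")
      .option("table", "analytics_ds.store_sales")
      .option("temporaryGcsBucket", "example-staging-bucket")
      .mode("overwrite")
      .save()

    spark.stop()
  }
}
```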
Confidential, Springfield, IL
Graduate Assistant
Responsibilities:
- Responsible for collecting, cleaning, labeling, and conceptualizing large databases.
- Utilized SQL queries, Python libraries, MS Access, and MS Excel to filter and clean data.
- Responsible for creating and maintaining analysis reports and communicating with lab researchers.
- Provided a solution using Hive and Sqoop (to export/import data) for faster data loads, replacing the traditional ETL process with HDFS-based loading into target tables.
- Created tables, partitions, and buckets, and performed analytics using Hive ad-hoc queries (the table layout is sketched after this role's environment line).
- Created UDFs and Oozie workflows to Sqoop data from source systems to HDFS and then into the target tables.
- Imported data from multiple sources using Sqoop, performed transformations using Hive, and loaded the data into HDFS.
Environment: Hadoop, HDFS, Hive, SQL, Cloudera Manager, Sqoop, Eclipse, Excel.
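The table layout behind the "tables, partitions, buckets" bullet in this role, sketched below. The original work used Hive and Sqoop directly; the HiveQL here is wrapped in Spark SQL only to keep every code example in this document in one language. Database, table, and column names are hypothetical, and the exact DDL accepted depends on the Hive and Spark versions in use.

```scala
import org.apache.spark.sql.SparkSession

object HivePartitionBucketSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partition-bucket-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Partition by load date and bucket by subject id so ad-hoc queries
    // can prune partitions and join efficiently on the bucketed key.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS research.lab_results (
        |  subject_id STRING,
        |  test_name  STRING,
        |  test_value DOUBLE
        |)
        |PARTITIONED BY (load_date STRING)
        |CLUSTERED BY (subject_id) INTO 8 BUCKETS
        |STORED AS ORC""".stripMargin)

    // Ad-hoc analytics query restricted to a single partition.
    spark.sql(
      """SELECT test_name, AVG(test_value) AS avg_value
        |FROM research.lab_results
        |WHERE load_date = '2020-04-01'
        |GROUP BY test_name""".stripMargin
    ).show()

    spark.stop()
  }
}
```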
Confidential
Hadoop Developer
Responsibilities:
- Used Spark for interactive queries, streaming data processing, and integration with popular NoSQL databases handling large volumes of data.
- Worked with numerous file formats such as text, SequenceFile, Avro, Parquet, ORC, JSON, XML, and flat files using MapReduce programs.
- Analyzed SQL scripts and designed solutions implemented in Scala.
- Resolved performance issues in Hive and Pig scripts by analyzing joins, grouping, and aggregation and how they translate into MapReduce jobs.
- Hands-on experience with the Hadoop ecosystem (HDFS, Hive, MapReduce, HBase, Impala, Spark).
- Loaded data into Spark RDDs and performed in-memory computation to generate output matching the requirements (see the sketch after this role's environment line).
- Scripted Spark applications in Scala to perform data cleansing, validation, transformation, and summarization activities according to the requirements.
- Developed data pipelines using Spark, Hive, and Sqoop to ingest, transform, and analyze operational data.
- Extensively used HiveQL to query data in Hive tables and loaded data into HBase tables.
- Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables; and optimized Hive queries.
- Designed Oozie workflows for job scheduling and batch processing.
Environment: Hadoop, Spark, Scala, Teradata, Hive, Pig, Impala, Sqoop, Oozie, SQL, DB2, Spark SQL, Airflow.
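A minimal Spark Scala sketch of the cleansing-and-summarization pattern described in this role: load raw delimited records into an RDD, cache them for in-memory computation, validate and parse them, and summarize with Spark SQL. The file path, field layout, and names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object CleanseAndSummarizeSketch {

  // Simple record type used after validation.
  final case class Order(orderId: String, region: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cleanse-and-summarize-sketch")
      .getOrCreate()
    import spark.implicits._

    // Load pipe-delimited text into an RDD and cache it for in-memory reuse.
    val rawRdd = spark.sparkContext
      .textFile("hdfs:///data/raw/orders/*.txt")
      .cache()

    // Cleansing and validation: split fields, drop malformed rows, parse amounts.
    val orders = rawRdd
      .map(_.split('|'))
      .filter(fields => fields.length == 3 && fields(2).matches("""-?\d+(\.\d+)?"""))
      .map(fields => Order(fields(0).trim, fields(1).trim, fields(2).toDouble))

    // Summarization with Spark SQL over a DataFrame built from the RDD.
    orders.toDF()
      .groupBy("region")
      .agg(sum("amount").as("total_amount"))
      .show()

    spark.stop()
  }
}
```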
Confidential
Data Engineer
Responsibilities:
- Experienced in the principles and best practices of Software Configuration Management (SCM) within Agile/Scrum and Waterfall methodologies.
- Designed Oozie workflows for job scheduling and batch processing.
- Expertise in investigating, analyzing, recommending, configuring, installing, and testing new hardware and software.
- Verified and validated Business Requirements Documents, Test Plans, and Test Strategy documents.
- Experienced with Git for branching, tagging, and merging, and maintained the Git source code repository.
- Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark with Scala (see the sketch at the end of this list).
- Followed Agile methodology and participated in sprints and daily scrums to deliver software tasks on time and with good quality, coordinating with onsite and offshore teams.
- Used shell scripting (Bash and ksh), PowerShell, Ruby, and Python scripts for merging, branching, and automating processes across environments.
- Worked closely with other data engineers, product managers, and analysts to gather and analyze data requirements supporting reporting and analytics.
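A small, hypothetical sketch of the Spark-with-Scala text processing mentioned above: tokenizing free text and computing word frequencies in memory. The input path and stop-word list are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object TextAnalyticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("text-analytics-sketch")
      .getOrCreate()

    val stopWords = Set("the", "a", "an", "and", "of", "to")

    // Tokenize lines, normalize case, drop stop words, and count terms in memory.
    val counts = spark.sparkContext
      .textFile("hdfs:///data/raw/notes/*.txt")
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(token => token.nonEmpty && !stopWords.contains(token))
      .map(token => (token, 1L))
      .reduceByKey(_ + _)
      .cache()

    // Print the 20 most frequent terms.
    counts.sortBy({ case (_, n) => n }, ascending = false)
      .take(20)
      .foreach { case (token, n) => println(s"$token\t$n") }

    spark.stop()
  }
}
```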