
Senior Big Data Engineer Resume


Boise, ID

SUMMARY

  • Over 8 years of experience in the IT industry, with strong Big Data experience implementing end-to-end Hadoop solutions.
  • Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of Data Warehouse tools for reporting and data analysis.
  • Experience in Microsoft Azure/cloud services such as SQL Data Warehouse, Airflow, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, and Azure Data Factory.
  • Experience setting up the AWS data platform - AWS CloudFormation, development endpoints, AWS Glue, EMR, Jupyter/SageMaker notebooks, Redshift, S3, and EC2 instances.
  • Experience working with Hortonworks and Cloudera environments.
  • Strong experience in Teradata, Informatica, Python, and UNIX shell scripting for processing large volumes of data from varied sources and loading it into databases such as Teradata and Oracle.
  • Experience importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables.
  • Experience in scheduling and monitoring jobs using Oozie, Airflow and Zookeeper.
  • Expertise with Python, Scala, and Java in designing, developing, administering, and supporting large-scale distributed systems.
  • Designed and created data architecture specifications for various ETL projects.
  • Experience collecting log and JSON data into HDFS using Flume and processing the data with Hive/Pig.
  • Strong experience with Informatica Designer, Workflow Manager, Workflow Monitor, Repository Manager.
  • Good understanding of distributed systems, HDFS architecture, Internal working details of MapReduce and Spark processing frameworks.
  • Extensive experience in developing and designing data integration solutions using ETL tools such as Informatica PowerCenter and Teradata utilities for handling large volumes of data.
  • Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying them on public or private clouds.
  • Good knowledge in implementing various data processing techniques using Apache HBase for handling the data and formatting it as required.
  • Experience using build/deploy tools such as Jenkins and Docker for Continuous Integration & Deployment of microservices.
  • Hands on experience in Test-driven development, Software Development Life Cycle (SDLC) methodologies like Agile and Scrum.
  • Good analytical and communication skills, and the ability to work in a team as well as independently with minimal supervision.

TECHNICAL SKILLS

Big Data Technologies: HDFS, Hive, MapReduce, Pig, Sqoop, Oozie, Hadoop distributions, HBase, Spark, Spark Streaming, Airflow, YARN, Zookeeper, Kafka, ETL tools (NiFi, Talend, etc.), Snowflake

Languages: Python, Java, R, Scala, Terraform.

Databases: MySQL, MS-SQL Server 2012/16, Oracle 10g/11g/12c, Teradata.

NoSQL Databases: HBase, DynamoDB.

Utilities/Tools: Eclipse, Tomcat, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Alteryx, Jenkins, Jira, IntelliJ.

Data Visualization Tools: Tableau, SSRS, Power BI.

Cloud Services: AWS (EC2, S3, EMR, RDS, Lambda, CloudWatch, Auto scaling, Redshift, Cloud Formation, Glue), Azure Databricks, Azure Data Factory, Azure SQL

PROFESSIONAL EXPERIENCE

Confidential, Boise, ID

Senior Big Data Engineer

Responsibilities:

  • Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Converted all Hadoop jobs to run on EMR by configuring the cluster according to the data size.
  • Wrote real-time processing jobs using Spark Streaming with Kafka.
  • Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Automated the resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production (a minimal DAG sketch follows this list).
  • Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Extensively worked with Avro, Parquet, XML, and JSON files and converted data between formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark (see the PySpark sketch after this list).
  • Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Developed Python scripts to extract data from the web server output files and load it into HDFS.
  • Wrote a Python script that automates launching the EMR cluster and configuring the Hadoop applications using boto3 (a boto3 sketch follows this list).
  • Automated and monitored the complete AWS infrastructure with Terraform.
  • Created various data pipelines using Spark, Scala and Spark SQL for faster processing of data.
  • Worked with a team to migrate from a legacy/on-premises environment to AWS.
  • Created Dockerized backend cloud applications with exposed Application Programming Interfaces (APIs) and deployed them on Kubernetes.
  • Worked with the Databricks platform, including Delta tables, and scheduled Spark jobs on Databricks.
  • Ingested data from S3 to Snowflake and vice versa.
  • Worked on performance tuning of Snowflake jobs.
  • Used Snowflake as the cloud data warehouse for BI and reporting.
  • Involved in testing at the database end and reviewing the Informatica mappings against the business logic.
  • Worked on querying data using Spark SQL on top of the Spark engine.
  • Experienced in analyzing and optimizing RDDs by controlling partitioning for the given data.
  • Developed data pipelines using Sqoop, HiveQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.
  • Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
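
To illustrate the Airflow automation mentioned above, here is a minimal sketch of a daily DAG; the DAG id, owner, schedule, and script paths are hypothetical placeholders, not the actual project values.

```python
# Illustrative Airflow DAG: the DAG id, owner, schedule, and script paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_campaign_load",        # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the existing shell script that stages the day's files.
    stage_files = BashOperator(
        task_id="stage_files",
        bash_command="/opt/etl/stage_files.sh ",              # placeholder path
    )

    # Submit the Spark job that transforms and loads the staged data.
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit /opt/etl/transform_load.py ",  # placeholder
    )

    stage_files >> run_spark_job
```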
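
A minimal PySpark sketch of the semi-structured JSON-to-Parquet conversion referenced above; the S3 paths and field names are illustrative assumptions.

```python
# Illustrative PySpark job: the S3 paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("json_to_parquet").getOrCreate()

# Read semi-structured JSON (one object per line); Spark infers a nested schema.
raw = spark.read.json("s3://example-bucket/raw/events/")        # placeholder path

# Flatten the fields needed downstream and derive a partition column.
events = raw.select(
    col("event.id").alias("event_id"),
    col("event.type").alias("event_type"),
    to_date(col("event.timestamp")).alias("event_date"),
)

# Write partitioned Parquet for efficient downstream querying.
(
    events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-bucket/curated/events/")             # placeholder path
)
```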
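
A condensed sketch of the boto3-based EMR automation referenced above; the cluster name, sizing, release label, roles, and log location are assumptions made for illustration.

```python
# Illustrative boto3 EMR launcher: the name, sizing, roles, and paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="nightly-etl-cluster",                      # hypothetical cluster name
    ReleaseLabel="emr-6.5.0",
    LogUri="s3://example-bucket/emr-logs/",          # placeholder log location
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 3},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)

print("Started EMR cluster:", response["JobFlowId"])
```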

Environment: HDFS, Hive, Scala, Sqoop, Spark, Tableau, YARN, Cloudera, SQL, Terraform, Airflow, Splunk, RDBMS, Elasticsearch, Kerberos, Jira, Confluence, Shell/Perl Scripting, Zookeeper, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Ranger, Git, Kafka, CI/CD (Jenkins), Kubernetes

Confidential, New York, New York

Senior Big Data Engineer

Responsibilities:

  • Used the Spark Streaming, Kafka, and Spark SQL APIs to process files.
  • Worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Involved in migrating the platform from Cloudera to EMR.
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Proficient in Data Analysis, Cleansing, Transformation, Data Migration, Data Integration, Data Import, and Data Export through use of ETL tools such as Informatica.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke MapReduce jobs in the backend.
  • Created ETL mappings using Informatica to move data from multiple sources, such as flat files and Oracle, into a common target area such as the Data Warehouse.
  • Imported data from sources such as HDFS/HBase into Spark RDDs.
  • Extensively involved in developing RESTful APIs using the JSON library of the Play Framework.
  • Developed a Storm topology to ingest data from various sources into the Hadoop data lake.
  • Developed web application using HBase and Hive API to compare schema between HBase and Hive tables.
  • Connected to AWS S3 using SSH and ran spark-submit jobs
  • Developed a Python script to import data from SQL Server into HDFS and created Hive views on the data in HDFS using Spark (see the PySpark sketch after this list).
  • Stored data in AWS S3 (used like HDFS) and ran EMR programs on the data stored in S3.
  • Worked on Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
  • Developed workflow in Oozie to automate the tasks of loading data into HDFS and pre-processing with Hive.
  • Developed complex, multi-step data pipelines using Spark.
  • Monitored YARN applications and troubleshot and resolved cluster-related system problems.
  • Developed and designed a system to collect data from multiple portals using Kafka and then process it using Spark (a streaming sketch follows this list).
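
A minimal PySpark sketch of the SQL Server-to-HDFS import with a Hive-accessible layer on top, as referenced above; the JDBC URL, credentials, table, and paths are hypothetical.

```python
# Illustrative PySpark job: the JDBC URL, credentials, table, and paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sqlserver_to_hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a SQL Server table over JDBC (the mssql JDBC driver must be on the classpath).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")  # placeholder
    .option("dbtable", "dbo.orders")                                   # placeholder
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Land the data in HDFS as Parquet.
orders.write.mode("overwrite").parquet("/data/raw/orders")             # placeholder path

# Register the landed data in the metastore and expose a view for downstream queries.
spark.sql("CREATE DATABASE IF NOT EXISTS staging")
spark.sql("CREATE TABLE IF NOT EXISTS staging.orders USING PARQUET LOCATION '/data/raw/orders'")
spark.sql("CREATE OR REPLACE VIEW staging.orders_v AS SELECT * FROM staging.orders")
```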
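
A minimal sketch of the Kafka-to-Spark ingestion pattern from the last bullet, written here with Spark Structured Streaming; the broker, topic, message schema, and output paths are assumptions.

```python
# Illustrative Spark Structured Streaming job: broker, topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("portal_events_stream").getOrCreate()

# Assumed message schema, for illustration only.
schema = StructType([
    StructField("portal", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Subscribe to the Kafka topic (requires the spark-sql-kafka package).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")    # placeholder broker
    .option("subscribe", "portal-events")                 # placeholder topic
    .load()
)

# Kafka values arrive as bytes; parse the JSON payload into columns.
events = (
    raw.select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Continuously append the parsed events to HDFS as Parquet.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/streams/portal_events")            # placeholder path
    .option("checkpointLocation", "/data/checkpoints/portal_events")
    .start()
)
query.awaitTermination()
```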

Environment: Hadoop, HDFS, Hive, Sqoop, Spark, Scala, Airflow, Cloudera CDH4, Oracle, Kerberos, SFTP, Impala, Jira, Alteryx, Teradata, Shell/Perl Scripting, Kafka, AWS EC2, S3, EMR.

Confidential, Atlanta, Georgia

Big Data Engineer

Responsibilities:

  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Developed multiple POCs using Scala and deployed on the Yarn cluster, compared the performance of Spark, with Hive and SQL/Teradata.
  • Wrote Hive queries for data analysis to meet business requirements.
  • Automated all the jobs for pulling data from the FTP server and loading it into Hive tables using Oozie workflows.
  • Involved in creating Hive tables, working on them using HiveQL, and performing data analysis using Hive and Pig.
  • Worked on creating data pipelines with Copy Activity for moving and transforming data, using custom Azure Data Factory pipeline activities for on-cloud ETL processing.
  • Extensive experience with Azure Data Lake Analytics, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases, and Azure SQL Data Warehouse for providing analytics and reports that improve marketing strategies.
  • Defined UDFs using Pig and Hive to capture customer behavior.
  • Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig.
  • Created Hive external tables on the MapReduce output before partitioning and bucketing were applied.
  • Worked with ETL tools Including Talend Data Integration, Talend Big Data, Pentaho Data Integration and Informatica.
  • Wrote Databricks code and fully parameterized ADF pipelines for efficient code management (a parameterized-notebook sketch follows this list).
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data.
  • Worked with the BI team on Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
  • Analyzed the SQL scripts and designed the solution to be implemented using Scala.
  • Developed analytical components using Scala, Spark, and Spark Streaming.
  • Used Scala collection framework to store and process the complex consumer information and used Scala functional programming concepts to develop business logic.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
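
A small sketch of the parameterized Databricks notebook pattern mentioned above, where the calling ADF pipeline supplies values through notebook widgets; the widget names, paths, and table name are assumptions.

```python
# Illustrative Databricks notebook cell: widget names, paths, and table name are placeholders.
# In Databricks, `dbutils` and `spark` are provided by the notebook runtime.
from pyspark.sql.functions import lit

# Parameters supplied by the calling ADF pipeline (the Notebook activity's
# base parameters map onto these widgets).
dbutils.widgets.text("run_date", "")
dbutils.widgets.text("source_path", "")
dbutils.widgets.text("target_table", "")

run_date = dbutils.widgets.get("run_date")
source_path = dbutils.widgets.get("source_path")
target_table = dbutils.widgets.get("target_table")

# Read the day's input, tag each row with the run date, and append to a Delta table.
df = spark.read.parquet(source_path).withColumn("run_date", lit(run_date))
df.write.format("delta").mode("append").saveAsTable(target_table)
```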

Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Sqoop, Zookeeper, Teradata, ADF, Databricks, Azure SQL, Power BI, PL/SQL, MySQL, HBase, ETL (Informatica/SSIS).

Confidential, Atlanta, Georgia

Hadoop Engineer/ Data Engineer

Responsibilities:

  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Involved in functional testing, integration testing, regression testing, smoke testing, and performance testing.
  • Tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
  • As part of the data migration, wrote many SQL scripts to check for data mismatches and worked on loading the history data from Teradata SQL to the Blob Storage container.
  • Created Azure Data Factory resources, managed policies for Data Factory, and utilized Blob Storage for storage and backup on Azure.
  • Expert in building Azure notebook functions using Python, Scala, and Spark.
  • Created the Power BI report on data in ADLS, the Tabular Model, and SQL Server.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data (a Spark SQL sketch follows this list).
  • Evaluated the traffic and performance of daily-deal PLA ads and compared those items with non-daily-deal items to assess the possibility of increasing ROI.
  • Suggested improvements to, and modified, existing BI components (reports, stored procedures).
  • Generated custom SQL to verify the dependencies for the daily, weekly, and monthly jobs.
  • Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
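
A minimal PySpark/Spark SQL sketch in the spirit of the Spark work described above; the file path and column names are illustrative only.

```python
# Illustrative Spark SQL job: the file path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deal_performance").getOrCreate()

# Load a CSV extract with a header row and inferred types.
sales = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/extracts/deal_sales.csv")          # placeholder path
)
sales.createOrReplaceTempView("deal_sales")

# Aggregate with Spark SQL for quick validation and downstream reporting.
daily_summary = spark.sql("""
    SELECT sale_date,
           deal_type,
           COUNT(*)     AS orders,
           SUM(revenue) AS total_revenue
    FROM deal_sales
    GROUP BY sale_date, deal_type
""")
daily_summary.show(20, truncate=False)
```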

Environment: Hadoop, MapReduce, Hive, Apache Spark, Sqoop, Teradata, SQL Server, Python, Pig, GitHub, Power BI, Azure SQL, Tableau, MS Excel, MS PowerPoint.
