Big Data Engineer Resume

Plano, TX

SUMMARY

  • Over 4 years of professional IT experience as a Big Data Engineer.
  • Experience across the complete Software Development Life Cycle (SDLC), including requirement analysis, design, coding, testing, and implementation, using Agile (Scrum), TDD, and other development methodologies.
  • Expertise in AWS Cloud Services (S3, EMR, EC2) and Snowflake Computing.
  • Strong understanding of data warehouse concepts such as ETL, star schema, and snowflake schema; data modeling experience covering normalization, business process analysis, dimensional data modeling, and physical and logical data modeling.
  • Experience in setting up clusters on Amazon EC2 and S3, including automating cluster setup and expansion in AWS.
  • Hands-on experience in data transformation operations, implementing various functions for loading and evaluating data in relations.
  • Loaded log data into HDFS by collecting and aggregating data from various sources.
  • Worked extensively with Oracle and MySQL databases; good database programming experience with SQL.
  • Expert in handling various source systems such as flat files, DB2, MS SQL Server, Excel, Oracle, CSV files, Teradata, and XML files.
  • Used Git for version control and Jira for project tracking.

TECHNICAL SKILLS

Programming Languages: Core Java, Python, Scala.

AWS Services: EMR, S3, EC2, Lambda

Big Data Technologies: Hadoop, HDFS, Snowflake Computing, Scala, Spark.

Databases: MySQL, SQL Server.

NoSQL Databases: HBase and Cassandra.

Scripting and Query Languages: UNIX Shell scripting, SQL and PL/SQL.

Operating Systems: Windows, Linux

Other Tools: Eclipse, Tableau 10.1, Informatica, Control-M, ServiceNow

PROFESSIONAL EXPERIENCE

Confidential - Plano, TX

Big Data Engineer

Responsibilities:

  • Involved in gathering the business requirements, designing and development.
  • Responsible for reviewing and understanding how Quantum (Spark wrapper) is used to ingest and process batch and real-time data using Apache Spark, Scala and SQL.
  • This project required an understanding of business rules, business logic, and use cases to be implemented.
  • Worked in a cloud environment on AWS using a multistage deployment environment.
  • The project involved sources of data as disparate as CSV, Parquet, Avro, Kafka, Snowflake tables, etc.
  • Developed Quantum workflows to read parquet files from S3 buckets and apply transformations, joins, filters, and SQL queries to different dataframes and create output datasets.
  • Synchronized data sources and ensured their high availability across AWS regions.
  • Prepared use cases, mock data, and error scenarios to test workflow execution in an EMR cluster, to be deployed from Dev to Test to QA.
  • Developed an exclusion-rules workflow and connected it to the external Ability-to-Pay process using Spark, Quantum, SQL and Python.
  • Designed and implemented an ELK (Elasticsearch, Logstash, and Kibana) stack solution for proactive monitoring of application logs and statistics.
  • Used Scala to read Parquet files in HDFS and perform preprocessing.
  • Created partitions and buckets based on state for further processing using bucket-based Hive joins.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Used Partitioning, Dynamic-Partitioning and Bucketing concepts for performance optimization in Hive.
  • Created Hive external Tables on top of HBase data.
  • Managed data in NoSQL databases like HBase.
  • Stored data from Spark streams into HBase in a Kafka and Spark Streaming POC.
  • Tuned performance by converting various joins to map-side joins, and improved Hive query performance using partitioning and bucketing.
  • Worked within and across Agile teams to design, develop, test, implement, and support technical solutions across a full stack of development tools and technologies.
  • Automated deployments on AWS using GitHub and Jenkins.
  • Verified that the Ability-to-Pay AWS Lambda function triggered jobs correctly to run the cluster and process the accounts.
  • Set up the CI/CD pipelines using Jenkins, Maven, GitHub and AWS.
  • Used GitHub for version control and Jira for issue and project tracking.

Confidential - Round Rock, TX

Hadoop Developer

Responsibilities:

  • Worked with business teams and created Hive queries for ad-hoc access.
  • Responsible for managing data coming from different sources.
  • Involved in loading data from UNIX file system to HDFS.
  • Worked on Spark batch applications to convert HiveQL into Spark SQL using DataFrames and DataSets.
  • Created Hive tables and executed Hive queries on Hive warehouse.
  • Involved in review of functional and non-functional requirements and developed Hive queries for the analysts.
  • Extensively used Scala programming for developing Spark applications.
  • Processed schema-oriented data using Scala and Spark.
  • Designed and implemented Hive tables (partitioned, non-partitioned, bucketed).
  • Involved in HDFS maintenance and loading of structured and unstructured data.
  • Loaded the processed data into Hive tables. Developed Hive external tables, managed tables, and the pipeline for smooth ETL processing.
  • Applied transformations to the data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
  • Worked on migrating MapReduce programs to Spark transformations using Scala (initially done using Python).
  • Used Git for code version control and maintained repositories following best practices.
  • Developed multiple POCs using Spark with Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Used Hive to analyze the partitioned data and compute various metrics for reporting.
  • Imported data from different sources, such as HDFS, into Spark DataFrames.
  • Experienced with SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Reduced the latency of Spark jobs by tuning Spark configurations and applying other performance optimization techniques.
  • Developed various data connections from data sources to SSIS and Tableau Server for report and dashboard development.
  • Used Tableau for data visualization to identify the role of various factors.
  • Involved in identifying KPIs.

Confidential

ETL Developer

Responsibilities:

  • Interacted with business community and gathered requirements based on changing needs. Incorporated identified factors into Informatica mappings to build Data Warehouses.
  • Developed a standard ETL framework to enable the reusability of similar logic across the board. Involved in System Documentation of Dataflow and methodology.
  • Identified all the dimensions to be included in the target warehouse design and confirmed the granularity of the facts in the fact tables.
  • Analyzed the logical model of the databases, normalized it when necessary, and helped identify the fact and dimension tables.
  • Extensively used Informatica PowerCenter for extracting, transforming and loading data into different databases.
  • Wrote PL/SQL stored procedures and triggers for implementing business rules and transformations.
  • Developed transformation logic as per the requirement, created mappings and loaded data into respective targets.
  • Stored reformatted data from relational, flat-file, and XML sources using Informatica (ETL) and developed mappings to load the data into slowly changing dimensions.
  • Replicated operational tables into staging tables, to transform and load data into the enterprise data warehouse using Informatica.
  • Involved in Performance Tuning at various levels including Target, Source, Mapping, and Session for large data files.
  • Documented data mappings and transformations as per the business requirements.
  • Performed testing, knowledge transfer and mentored other team members.
