Big Data Engineer Resume
Plano, TX
SUMMARY
- Over 4 years of professional IT experience as a Big Data Engineer.
- Experience across the complete Software Development Life Cycle (SDLC), including requirement analysis, design, coding, testing and implementation, using Agile (Scrum), TDD and other development methodologies.
- Expertise in AWS Cloud Services (S3, EMR, EC2) and Snowflake Computing.
- Strong understanding of data warehouse concepts such as ETL, star and snowflake schemas, and data modeling, with experience in normalization, business process analysis, dimensional data modeling, and physical and logical data modeling.
- Experience setting up clusters on Amazon EC2 and S3, including automating cluster setup and extension in the AWS cloud.
- Hands-on experience with data transformation operations, implementing various functions for loading and evaluating data in relations.
- Loaded log data into HDFS by collecting and aggregating data from various sources.
- Worked extensively with databases such as Oracle and MySQL, with good database programming experience in SQL.
- Expert in handling various data sources such as flat files, DB2, MS SQL Server, Excel, Oracle, CSV files, Teradata, and XML files.
- Used Git for version control and Jira for project tracking.
TECHNICAL SKILLS
Programming Languages: Core Java, Python, Scala.
AWS Services: EMR, S3, EC2, Lambda
Big Data Technologies: Hadoop, HDFS, Snowflake Computing, Scala, Spark.
Databases: MySQL, SQL Server.
NoSQL Databases: HBase and Cassandra.
Scripting and Query Languages: UNIX Shell scripting, SQL and PL/SQL.
Operating Systems: Windows, Linux
Other Tools: Eclipse, Tableau 10.1, Informatica, Control-M, ServiceNow
PROFESSIONAL EXPERIENCE
Confidential - Plano, TX
Big Data Engineer
Responsibilities:
- Involved in gathering business requirements, design, and development.
- Reviewed how Quantum (a Spark wrapper) is used to ingest and process batch and real-time data using Apache Spark, Scala and SQL.
- This project required an understanding of business rules, business logic, and use cases to be implemented.
- Worked in a cloud environment on Amazon AWS using a Multistage deployment environment.
- The project involved sources of data as disparate as CSV, Parquet, Avro, Kafka, Snowflake tables, etc.
- Developed Quantum workflows to read parquet files from S3 buckets and apply transformations, joins, filters, and SQL queries to different dataframes and create output datasets.
- Synchronized data sources and ensured their high availability across AWS regions.
- Prepared use cases, mock data and error scenarios to test workflow execution in an EMR cluster, for deployment from Dev to Test to QA.
- Developed and coded exclusion rules workflow to connect it to Ability-to-Pay external process using Spark, Quantum, SQL and Python.
- Designed and implemented an ELK (Elasticsearch, Logstash, and Kibana) stack solution for proactive monitoring of application logs and statistics.
- Used Scala to read Parquet files in HDFS and perform preprocessing.
- Created partitions and buckets based on state for further processing with bucket-based Hive joins.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Used Partitioning, Dynamic-Partitioning and Bucketing concepts for performance optimization in Hive.
- Created Hive external Tables on top of HBase data.
- Managed data in NOSQL databases like HBase.
- Stored data from Spark streams into HBase in a Kafka and Spark Streaming POC.
- Tuned performance by converting various joins to map-side joins, and improved Hive query performance using partitioning and bucketing.
- Worked within and across Agile teams to design, develop, test, implement, and support technical solutions across a full stack of development tools and technologies.
- Automated deployments on AWS using GitHub and Jenkins.
- Verified and validated that the Ability-to-Pay AWS Lambda triggered jobs appropriately to run the cluster and process the accounts.
- Set up the CI/CD pipelines using Jenkins, Maven, GitHub and AWS.
- Used GitHub for version control and Jira for issue and project tracking.
Confidential - Round Rock, TX
Hadoop Developer
Responsibilities:
- Worked with business teams and created Hive queries for ad-hoc access.
- Managed data coming from different sources.
- Involved in loading data from UNIX file system to HDFS.
- Worked on Spark batch applications to convert HiveQL into Spark SQL using DataFrames and DataSets.
- Created Hive tables and executed Hive queries on Hive warehouse.
- Involved in review of functional and non-functional requirements and developed Hive queries for the analysts.
- Extensively used Scala programming for developing Spark applications.
- Processed schema-oriented data using Scala and Spark.
- Designed and implemented Hive tables (partitioned, non-partitioned, bucketed).
- Involved in HDFS maintenance and loading of structured and unstructured data.
- Loaded the processed data into Hive tables. Developed Hive external tables, managed tables, and the pipeline for smooth ETL processing.
- Applied transformations to the data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
- Worked on migrating MapReduce programs, initially written in Python, into Spark transformations using Spark and Scala.
- Used Git for code version control and maintained repositories as a best practice.
- Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Used Hive to analyze the partitioned data and compute various metrics for reporting.
- Imported data from different sources like HDFS into Spark DataFrames.
- Experienced with SparkContext, Spark SQL, DataFrames and pair RDDs.
- Reduced the latency of Spark jobs by tuning Spark configurations and applying other performance optimization techniques.
- Developed various data connections from data sources to SSIS and Tableau Server for report and dashboard development.
- Used Tableau for Data visualization to identify the role of various factors.
- Involved in identifying KPIs.
Confidential
ETL Developer
Responsibilities:
- Interacted with business community and gathered requirements based on changing needs. Incorporated identified factors into Informatica mappings to build Data Warehouses.
- Developed a standard ETL framework to enable the reusability of similar logic across the board. Involved in System Documentation of Dataflow and methodology.
- Identified all the dimensions to be included in the target warehouse design and confirmed the granularity of the facts in the fact tables.
- Analyzed the logical model of the databases, normalized it when necessary, and helped identify the fact and dimension tables.
- Extensively used Informatica PowerCenter for extracting, transforming and loading data into different databases.
- Wrote PL/SQL stored procedures and triggers for implementing business rules and transformations.
- Developed transformation logic as per the requirement, created mappings and loaded data into respective targets.
- Stored reformatted data from relational, flat-file and XML sources using Informatica (ETL) and developed mappings to load the data into slowly changing dimensions.
- Replicated operational tables into staging tables, to transform and load data into the enterprise data warehouse using Informatica.
- Involved in Performance Tuning at various levels including Target, Source, Mapping, and Session for large data files.
- Documented data mappings and transformations as per the business requirements.
- Performed testing and knowledge transfer, and mentored other team members.
