
Big Data Engineer Resume


NC

SUMMARY

  • 7+ years of experience in software/application development using Python and SQL, with an in-depth understanding of Distributed Systems Architecture and Parallel Processing Frameworks.
  • Deep knowledge and strong deployment experience in the Hadoop and Big Data ecosystem: HDFS, MapReduce, Spark, Pig, Sqoop, Hive, Oozie, Kafka, ZooKeeper, and HBase.
  • Knowledge of current trends in data technologies, data services, data virtualization, data integration, and Master Data Management.
  • Used various Hadoop distributions (Cloudera, Hortonworks, Amazon EMR, Microsoft Azure HDInsight) to fully implement and leverage new Hadoop features.
  • Constructed and manipulated large datasets of structured, semi-structured, and unstructured data and supported systems and application architecture using tools such as SAS, SQL, Python, R, Minitab, and Power BI to extract multi-factor interactions and drive change.
  • Responsible for converting ETL logic into SQL queries and creating Informatica (INFA) mappings to load data into Netezza and Snowflake databases; played a key role in migrating Teradata objects into the Snowflake environment.
  • Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.
  • Expertise in working with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries (see the sketch after this list).
  • Strong experience in tuning Spark applications and Hive scripts to achieve optimal performance.
  • Developed Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
  • Created Hive tables, loaded data into them, and wrote ad-hoc Hive queries that run internally on MapReduce and Spark.
  • Created Logical and Physical Data Models by using Erwin based on requirement analysis.
  • Expertise in AWS resources such as EC2, S3, EBS, VPC, ELB, AMI, SNS, RDS, IAM, Route 53, Auto Scaling, CloudFormation, CloudWatch, and Security Groups.
  • Experience optimizing volumes and EC2 instances, creating multiple VPCs, and creating alarms and notifications for EC2 instances using CloudWatch.
  • Set up scripts to create new snapshots and delete old snapshots in S3 using the S3 CLI tools.
  • Worked with the AWS IAM console to create custom users and groups.
  • Significant experience writing custom UDFs in Hive and custom InputFormats in MapReduce.
  • Knowledge of job workflow management and monitoring tools such as Oozie and ZooKeeper.
  • Experience working with NoSQL database technologies, including MongoDB, Cassandra, and HBase.
  • Strong experience building end-to-end data pipelines on the Hadoop platform.
  • Proficient in writing Bash, Perl, and Python scripts to automate tasks and provide control flow.
  • Used Software methodologies like Agile, Scrum, TDD, and Waterfall.
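Illustrative sketch (not taken from any of the projects below): a minimal PySpark snippet showing the Hive partitioning/bucketing and ad-hoc Spark SQL pattern referenced above. The database, table, and column names are hypothetical placeholders.

# Minimal PySpark sketch of the Hive/Spark SQL pattern described above.
# Database, table, and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-example")
    .enableHiveSupport()   # assumes a Hive metastore is configured
    .getOrCreate()
)

# Create a Hive table partitioned by load date and bucketed by customer id.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id     BIGINT,
        customer_id  BIGINT,
        amount       DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS PARQUET
""")

# Ad-hoc HQL query executed through Spark SQL; partition pruning limits the
# scan to a single load_date partition.
daily_totals = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM   sales_db.orders
    WHERE  load_date = '2021-01-01'
    GROUP  BY customer_id
""")
daily_totals.show(10)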

TECHNICAL SKILLS

Programming Languages: Python | SAS | SQL | Scala | JavaScript | Shell Scripting | Perl

Database Design Tools: MS Visio | Fact and Dimension tables | Normalization and De-normalization techniques | Kimball and Inmon methodologies

Data Modelling Tools: Erwin Data Modeler and Manager | Star Schema/Snowflake Schema modeling | ER Studio v17 | Physical and logical data modeling

ETL/Data Warehouse Tools: Informatica PowerCenter | Redshift | Tableau | Pentaho | SSIS | DataStage

Querying Languages: SQL | NoSQL | PostgreSQL | MySQL | Microsoft SQL | Spark SQL | Sqoop 1.4.4

Databases: Snowflake | AWS RDS | Teradata | HDFS | SQL Server | Oracle | Netezza | Microsoft SQL | DB2 | PostgreSQL

NoSQL Databases: MongoDB | HBase | Apache Cassandra

Cloud Technologies: AWS | GCP | Azure | Snowflake | Databricks

Hadoop Ecosystem: Hadoop | MapReduce | YARN | HDFS | Kafka | Storm | Pig | Oozie | ZooKeeper

Big Data Ecosystem: Spark | Spark SQL | Spark Streaming | PySpark | Hive | Impala

Integration Tools: Git | Gerrit | Jenkins | Ant | Maven

Methodologies: Agile | Scrum | Waterfall | UML

Operating Systems: Linux (Ubuntu, CentOS), Windows, macOS

Visualization and Reporting Platforms: Tableau | Power BI | Platfora | Matplotlib | Seaborn | Bokeh | ggplot | iplots | Shiny

PROFESSIONAL EXPERIENCE

Confidential, NC

Big Data Engineer

Responsibilities:

  • Created Hive tables for loading and analysing data.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS in both directions using Sqoop.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Loaded data into Spark RDDs and performed advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities with Scala to generate the output response.
  • ETL development using EMR/Hive/Spark, Lambda, Scala, DynamoDB Streams, Amazon Kinesis Firehose, Redshift and S3.
  • Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Handled large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, transformations, and other optimizations during the ingestion process itself.
  • Worked with the AWS cloud platform and its features, including EC2, IAM, EBS, CloudWatch, and S3.
  • Deployed application using AWS EC2 standard deployment techniques and worked on AWS infrastructure and automation.
  • Worked on CI/CD environment on deploying application on Docker containers.
  • Used AWS S3 buckets to store files, ingested the files into Snowflake tables using Snowpipe, and ran deltas using data pipelines.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job to manage and automate workflows with Airflow (a condensed DAG sketch follows this list).
  • Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra.
  • Implemented the strategy to migrate Netezza-based analytical systems to Snowflake on AWS.
  • Worked with the architect on the final approach and streamlined the Informatica-to-Snowflake integration.
  • Created various reports using Tableau based on requirements with the BI team.
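Illustrative sketch (not project code): a condensed Airflow 2.x DAG showing the task-dependency, SLA, and time-sensor pattern described above. The DAG id, schedule, and callables are hypothetical placeholders, and Airflow 2.x import paths are assumed.

# Condensed Airflow 2.x sketch of the workflow pattern described above:
# task dependencies, a per-task SLA, and a time sensor gating the run.
# DAG id, schedule, and callables are illustrative placeholders.
from datetime import datetime, time, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.time_sensor import TimeSensor


def extract_from_s3():
    ...  # placeholder: pull the day's files from S3


def load_into_snowflake():
    ...  # placeholder: load staged files into Snowflake


with DAG(
    dag_id="s3_to_snowflake_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={"sla": timedelta(hours=2)},  # SLA misses reported per task
    catchup=False,
) as dag:
    wait_for_window = TimeSensor(
        task_id="wait_for_load_window",
        target_time=time(hour=6),  # do not start before 06:00
    )
    extract = PythonOperator(task_id="extract_from_s3", python_callable=extract_from_s3)
    load = PythonOperator(task_id="load_into_snowflake", python_callable=load_into_snowflake)

    wait_for_window >> extract >> load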

Environment: Spark, Hive, Scala, AWS, CI/CD, SQL, Kafka, Python, Tableau, Cassandra, EMR, Lambda, DynamoDB Streams, Amazon Kinesis Firehose, Redshift, S3, Informatica, Snowflake, Hadoop, YARN.

Confidential, NV

Hadoop Developer

Responsibilities:

  • Utilized Sqoop, Kafka, Flume, and the Hadoop File System APIs to implement data ingestion pipelines.
  • Worked on real-time streaming and performed transformations on the data using Kafka and Spark Streaming.
  • Created Amazon S3 storage for the data and worked on transferring data from Kafka topics into AWS S3 (see the streaming sketch after this list).
  • Created Hive tables, loaded them with data, and wrote Hive queries to process the data.
  • Created partitions and buckets on Hive tables, tuned the required parameters to improve performance, and developed Hive UDFs per business use cases.
  • Developed Hive scripts for source data validation and transformation.
  • Automated data loading into HDFS and Hive for pre-processing the data using Oozie.
  • Collaborated in data modeling, data mining, Machine Learning methodologies, advanced data processing, ETL optimization.
  • Worked on various data formats like Avro, Sequence File, JSON, Map File, Parquet, and XML.
  • Worked extensively on AWS Components such as Airflow, Elastic Map Reduce (EMR), Athena, Snowflake.
  • Used Apache NiFi to automate data movement between different Hadoop components.
  • Used NiFi to perform conversion of raw XML data into JSON, Avro.
  • Experienced in working with Hadoop on the Cloudera Data Platform and running services through Cloudera Manager.
  • Assisted in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager.
  • Experienced in Hadoop production support tasks, analysing application and cluster logs.
  • Used the Agile Scrum methodology (Scrum Alliance) for development.
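Illustrative sketch (not project code): a minimal PySpark Structured Streaming job for the Kafka-to-S3 flow described above. The broker, topic, and bucket names are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

# Minimal PySpark Structured Streaming sketch of the Kafka-to-S3 flow above.
# Broker, topic, and bucket names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

# Land the raw events in S3 as Parquet; the checkpoint location gives
# exactly-once file output across restarts.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://data-lake/raw/clickstream/")
    .option("checkpointLocation", "s3a://data-lake/checkpoints/clickstream/")
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()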

Environment: Hadoop, HDFS, AWS, Vertica, Scala, Kafka, MapReduce, YARN, Drill, Spark, Pig, Hive, Java, NiFi, HBase, MySQL, Kerberos, Maven.

Confidential

Big Data Developer

Responsibilities:

  • Involved in building a scalable, distributed data lake system for Confidential's real-time and batch analytical needs.
  • Involved in designing, reviewing, and optimizing data transformation processes using Apache Storm.
  • Experience in job management using Fair Scheduling; developed job processing scripts using Control-M workflows.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
  • Experienced in tuning Spark applications to set the right batch interval time, the correct level of parallelism, and appropriate memory settings.
  • Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
  • Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
  • Imported data from Kafka consumers into HBase using Spark Streaming.
  • Experienced in using Zookeeper and Oozie Operational Services for coordinating the cluster and scheduling workflows.
  • Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive and Sqoop as well as system specific jobs.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, and other transformations during the ingestion process itself (see the broadcast-join sketch after this list).
  • Worked on migrating legacy MapReduce programs into Spark transformations using Spark and Scala.
  • Worked on a POC comparing processing times for Impala and Apache Hive on batch applications, with the goal of adopting the former in the project.
  • Worked extensively with Sqoop for importing metadata from Oracle.
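Illustrative sketch of the ingestion-time join pattern described above, written in PySpark for brevity (the project code was Scala): broadcast the small dimension, repartition the large fact set, and write out partitioned Parquet. Paths, column names, and partition counts are hypothetical.

# PySpark sketch of the ingestion-time join pattern described above.
# Paths, columns, and partition counts are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("ingest-join").getOrCreate()

orders = spark.read.parquet("hdfs:///data/raw/orders")        # large fact data
customers = spark.read.parquet("hdfs:///data/dim/customers")  # small dimension

enriched = (
    orders
    .repartition(200, "customer_id")            # spread the shuffle evenly
    .join(broadcast(customers), "customer_id")  # avoid shuffling the big side
    .filter(col("amount") > 0)
)

enriched.cache()  # reused by several downstream aggregations
enriched.write.mode("overwrite").partitionBy("load_date").parquet(
    "hdfs:///data/curated/orders_enriched"
)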

Environment: Apache Storm, Spark, Hadoop, Scala, Kafka, ZooKeeper, MapReduce, Hive, Sqoop, HBase, Impala, Oozie, Oracle, YARN, text analytics.

Confidential

Data Analyst/Modeler

Responsibilities:

  • Performed data analysis, data modeling, data migration, and data profiling using complex SQL on various source systems, including Oracle and Teradata (see the profiling sketch after this list).
  • Experienced in building applications based on large datasets in MarkLogic.
  • Translated business requirements into working logical and physical data models for Data warehouse, Data marts and OLAP applications.
  • Analysed data lineage processes to identify vulnerable data points, control gaps, data quality issues, and overall lack of data governance.
  • Worked on data cleansing and standardization using the cleanse functions in Informatica MDM.
  • Designed Star and Snowflake Data Models for Enterprise Data Warehouse using ERWIN.
  • Validated and updated the appropriate LDMs against process mappings, screen designs, use cases, the business object model, and the system object model as they evolved and changed.
  • Maintained data model and synchronized it with the changes to the database.
  • Designed and developed use cases, activity diagrams, and sequence diagrams using UML.
  • Extensively involved in the modeling and development of Reporting Data Warehousing System.
  • Designed the database tables and created table- and column-level constraints using the suggested naming conventions for constraint keys.
  • Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
  • Used the ETL tool BO Data Services to extract, transform, and load data into data warehouses from various sources such as relational databases, application systems, temp tables, and flat files.
  • Wrote packages, procedures, functions, exceptions using PL/SQL.
  • Reviewed the database programming for triggers, exceptions, functions, packages, procedures.
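Illustrative sketch (not project code) of the SQL-based data profiling described above, run from Python with SQLAlchemy and pandas. The connection string, table, and columns are hypothetical placeholders; the actual work used complex SQL directly against Oracle and Teradata.

# Hedged sketch of SQL-based data profiling driven from Python.
# Connection string, table, and column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("oracle+cx_oracle://user:password@db-host:1521/?service_name=ORCL")

profile_sql = """
    SELECT
        COUNT(*)                                        AS row_count,
        COUNT(DISTINCT customer_id)                     AS distinct_customers,
        SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)  AS null_emails,
        MIN(created_date)                               AS earliest_record,
        MAX(created_date)                               AS latest_record
    FROM customer_master
"""

# Run the profiling query and print a one-row summary of the table.
profile = pd.read_sql(profile_sql, engine)
print(profile.to_string(index=False))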

Environment: MarkLogic, OLAP, Oracle, Teradata, Erwin, ETL, NoSQL, Star and Snowflake data models, PL/SQL.
