Big Data Engineer Resume
NC
SUMMARY
- 7+ years of experience in Software/Application Development using Python and SQL, with an in-depth understanding of Distributed Systems Architecture and Parallel Processing Frameworks.
- Deep knowledge and strong deployment experience in the Hadoop and Big Data ecosystem: HDFS, MapReduce, Spark, Pig, Sqoop, Hive, Oozie, Kafka, ZooKeeper, and HBase.
- Knowledge of current trends in data technologies, data services, data virtualization, data integration, and Master Data Management.
- Used various Hadoop distributions (Cloudera, Hortonworks, Amazon EMR, Microsoft Azure HDInsight) to fully implement and leverage new Hadoop features.
- Constructed and manipulated large datasets of structured, semi-structured, and unstructured data and supported systems application architecture using tools such as SAS, SQL, Python, R, Minitab, and Power BI to extract multi-factor interactions and drive change.
- Converted ETL logic into SQL queries and created Informatica (INFA) mappings to load data into Netezza and Snowflake databases; played a key role in migrating Teradata objects into the Snowflake environment.
- Experience in moving data into and out of HDFS and Relational Database Management Systems (RDBMS) using Apache Sqoop.
- Expertise in working with Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries (see the sketch following this summary).
- Strong experience in tuning Spark applications and Hive scripts to achieve optimal performance.
- Developed Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
- Created Hive tables, loaded data into them, and wrote ad-hoc Hive queries that run internally as MapReduce or Spark jobs.
- Created Logical and Physical Data Models by using Erwin based on requirement analysis.
- Expertise in AWS resources such as EC2, S3, EBS, VPC, ELB, AMI, SNS, RDS, IAM, Route 53, Auto Scaling, CloudFormation, CloudWatch, and Security Groups.
- Experience in optimizing volumes and EC2 instances, creating multiple VPCs, and creating alarms and notifications for EC2 instances using CloudWatch.
- Set up scripts to create new snapshots and delete old snapshots in S3 using the S3 CLI tools.
- Worked with the Amazon IAM console to create custom users and groups.
- Significant experience writing custom UDFs in Hive and custom InputFormats in MapReduce.
- Knowledge of job workflow management and monitoring tools such as Oozie and ZooKeeper.
- Experience working with NoSQL database technologies, including MongoDB, Cassandra, and HBase.
- Strong experience building end-to-end data pipelines on the Hadoop platform.
- Proficient in writing Bash, Perl, and Python scripts to automate tasks and provide control flow.
- Used Software methodologies like Agile, Scrum, TDD, and Waterfall.
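A minimal PySpark sketch of the Hive partitioning and bucketing approach described above, assuming a Hive-enabled Spark session and a configured metastore; the database, table, and column names (staging.orders_raw, sales_db.orders, order_date, customer_id) are hypothetical.

    from pyspark.sql import SparkSession

    # Hive-enabled Spark session (assumes a configured Hive metastore)
    spark = (SparkSession.builder
             .appName("hive-partition-bucket-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical source data already registered as a table
    orders = spark.table("staging.orders_raw")

    # Write a Hive table partitioned by date and bucketed by customer_id
    (orders.write
     .mode("overwrite")
     .partitionBy("order_date")
     .bucketBy(8, "customer_id")
     .sortBy("customer_id")
     .saveAsTable("sales_db.orders"))

    # Ad-hoc HQL that prunes partitions before aggregating
    daily = spark.sql("""
        SELECT order_date, COUNT(*) AS order_cnt
        FROM sales_db.orders
        WHERE order_date >= '2020-01-01'
        GROUP BY order_date
    """)
    daily.show()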
TECHNICAL SKILLS
Programming Languages: Python | SAS Programming | SQL | Scala | JavaScript | Perl | Shell Scripting
Database Design Tools: MS Visio | Fact and Dimension Tables | Normalization and De-normalization Techniques | Kimball and Inmon Methodologies
Data Modeling Tools: Erwin Data Modeler and Manager | Star Schema/Snowflake Schema Modeling | ER Studio v17 | Physical and Logical Data Modeling
ETL/Data Warehouse Tools: Informatica PowerCenter | Redshift | Tableau | Pentaho | SSIS | DataStage
Querying Languages: SQL | NoSQL | PostgreSQL | MySQL | Microsoft SQL | Spark SQL | Sqoop 1.4.4
Databases: Snowflake | AWS RDS | Teradata | Hadoop HDFS | SQL Server | Oracle | Netezza | Microsoft SQL | DB2 | PostgreSQL
NoSQL Databases: MongoDB | HBase | Apache Cassandra
Cloud Technologies: AWS | GCP | Azure | Snowflake | Databricks
Hadoop Ecosystem: Hadoop | MapReduce | YARN | HDFS | Kafka | Storm | Pig | Oozie | ZooKeeper
Big Data Ecosystem: Spark | Spark SQL | Spark Streaming | PySpark | Hive | Impala
Integration Tools: Git | Gerrit | Jenkins | Ant | Maven
Methodologies: Agile | Scrum | Waterfall | UML
Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS
Visualization and Reporting Platforms: Tableau | Power BI | Platfora | Matplotlib | Seaborn | Bokeh | ggplot | iplots | Shiny
PROFESSIONAL EXPERIENCE
Confidential, NC
Big Data Engineer
Responsibilities:
- Created Hive tables for loading and analyzing data.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and transferred data between MySQL and HDFS using Sqoop.
- Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive.
- Loaded the data into Spark RDDs and performed advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities in Scala to generate the output response.
- Developed ETL pipelines using EMR, Hive, Spark, Lambda, Scala, DynamoDB Streams, Amazon Kinesis Firehose, Redshift, and S3.
- Developed Scala scripts using both the DataFrame/SQL and RDD APIs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Handled large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, transformations, and other optimizations during the ingestion process itself.
- Worked with the AWS cloud platform and its features, including EC2, IAM, EBS, CloudWatch, and S3.
- Deployed applications using standard AWS EC2 deployment techniques and worked on AWS infrastructure and automation.
- Worked in a CI/CD environment, deploying applications on Docker containers.
- Used AWS S3 buckets to store files, ingested the files into Snowflake tables using Snowpipe, and ran deltas using data pipelines (a minimal Snowpipe sketch follows this list).
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Developed Python code for tasks, dependencies, an SLA watcher, and a time sensor for each job, automating workflow management with Airflow (sketched after this list).
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Used the Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which receives data from Kafka in near real-time and persists it into Cassandra (see the streaming sketch after this list).
- Implemented the strategy to migrate Netezza-based analytical systems to Snowflake on AWS.
- Worked with the architect on the final approach and streamlined the Informatica integration with Snowflake.
- Created various reports using Tableau based on requirements with the BI team.
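A minimal sketch of the Snowpipe setup referenced above, using the Snowflake Python connector; the account, credentials, stage, pipe, and table names are placeholders, and auto-ingest additionally assumes S3 event notifications are configured on the bucket.

    import snowflake.connector

    # Connection parameters are placeholders
    conn = snowflake.connector.connect(
        account="my_account",
        user="etl_user",
        password="***",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )

    cur = conn.cursor()
    try:
        # External stage pointing at the S3 bucket where the files land
        cur.execute("""
            CREATE STAGE IF NOT EXISTS raw_s3_stage
            URL = 's3://my-bucket/landing/'
            CREDENTIALS = (AWS_KEY_ID = '***' AWS_SECRET_KEY = '***')
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        """)

        # Snowpipe with auto-ingest: new files in the stage are loaded automatically
        cur.execute("""
            CREATE PIPE IF NOT EXISTS raw_orders_pipe
            AUTO_INGEST = TRUE
            AS COPY INTO RAW.ORDERS FROM @raw_s3_stage
        """)
    finally:
        cur.close()
        conn.close()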
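A hedged sketch of an Airflow DAG combining a time sensor and an SLA, assuming Airflow 2.x; the DAG id, schedule, target time, and task body are hypothetical.

    from datetime import datetime, time, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.time_sensor import TimeSensor


    def load_deltas(**context):
        # Hypothetical task body: trigger the downstream delta-load job here
        print("running delta load for", context["ds"])


    default_args = {
        "owner": "data-eng",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=2),  # SLA watcher: flag any task running past 2 hours
    }

    with DAG(
        dag_id="delta_load_pipeline",      # hypothetical DAG name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        # Time sensor: hold the run until upstream files are expected to have landed
        wait_for_source = TimeSensor(
            task_id="wait_for_source_files",
            target_time=time(hour=6, minute=0),
        )

        run_delta_load = PythonOperator(
            task_id="run_delta_load",
            python_callable=load_deltas,
        )

        wait_for_source >> run_delta_load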
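A sketch of the Kafka-to-Cassandra flow expressed as Structured Streaming (the project used the Spark Streaming APIs), assuming the spark-sql-kafka and spark-cassandra-connector packages are on the classpath; the topic, schema, keyspace, and table names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = (SparkSession.builder
             .appName("learner-model-stream-sketch")
             .getOrCreate())

    # Hypothetical schema for the learner events arriving on Kafka
    schema = StructType([
        StructField("learner_id", StringType()),
        StructField("course_id", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read from Kafka in near real-time
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "learner-events")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))


    def write_to_cassandra(batch_df, batch_id):
        # Requires the spark-cassandra-connector package on the classpath
        (batch_df.write
         .format("org.apache.spark.sql.cassandra")
         .options(table="learner_model", keyspace="analytics")
         .mode("append")
         .save())


    query = (events.writeStream
             .foreachBatch(write_to_cassandra)
             .option("checkpointLocation", "/tmp/checkpoints/learner-model")
             .start())
    query.awaitTermination()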
Environment: Spark, Hive, Scala, AWS, CI/CD, SQL, Kafka, Python, Tableau, Cassandra, EMR, Lambda, DynamoDB Streams, Amazon Kinesis Firehose, Redshift, S3, Informatica, Snowflake, Hadoop, YARN.
Confidential, NV
Hadoop Developer
Responsibilities:
- Utilized Sqoop, Kafka, Flume, and the Hadoop File System APIs to implement data ingestion pipelines.
- Worked on real-time streaming and performed transformations on the data using Kafka and Spark Streaming.
- Created Amazon S3 storage for the data and worked on transferring data from Kafka topics into AWS S3.
- Created Hive tables, loaded with data, and wrote Hive queries to process the data.
- Created partitions and used bucketing on Hive tables, applied the required parameters to improve performance, and developed Hive UDFs for business use cases (a PySpark analogue is sketched after this list).
- Developed Hive scripts for source data validation and transformation.
- Automated data loading into HDFS and Hive for pre-processing the data using Oozie.
- Collaborated in data modeling, data mining, Machine Learning methodologies, advanced data processing, ETL optimization.
- Worked on various data formats such as Avro, SequenceFile, JSON, MapFile, Parquet, and XML (see the format-conversion sketch after this list).
- Worked extensively on AWS Components such as Airflow, Elastic Map Reduce (EMR), Athena, Snowflake.
- Used Apache NiFi to automate data movement between different Hadoop components.
- Used NiFi to perform conversion of raw XML data into JSON, Avro.
- Experienced in working with Hadoop from Cloudera Data Platform and running services through Cloudera manager.
- Assisted in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager.
- Experienced in Hadoop production support tasks, analyzing application and cluster logs.
- Used the Agile Scrum methodology (Scrum Alliance) for development.
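Hive UDFs are usually written in Java; the following is a hedged PySpark analogue that registers a Python function for use from SQL over the Hive tables, with a hypothetical business rule, function name, and table name.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .appName("udf-sketch")
             .enableHiveSupport()
             .getOrCreate())


    def normalize_region(raw_region):
        # Hypothetical business rule: collapse free-text region codes
        return (raw_region or "UNKNOWN").strip().upper()


    # Register for use from SQL, mirroring how a Hive UDF would be called in HQL
    spark.udf.register("normalize_region", normalize_region, StringType())

    spark.sql("""
        SELECT normalize_region(region) AS region, COUNT(*) AS cnt
        FROM sales_db.orders
        GROUP BY normalize_region(region)
    """).show()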
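A short PySpark sketch of the JSON-to-Avro/Parquet conversions listed above (the raw-XML step was handled in NiFi); it assumes the spark-avro package is supplied at submit time, and the HDFS paths are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("format-conversion-sketch")
             # spark-avro must be supplied, e.g. --packages org.apache.spark:spark-avro_2.12:3.1.2
             .getOrCreate())

    # Hypothetical raw JSON landed in HDFS
    raw = spark.read.json("hdfs:///data/raw/events/*.json")

    # Columnar Parquet copy for analytical queries in Hive/Impala
    raw.write.mode("overwrite").parquet("hdfs:///data/curated/events_parquet")

    # Avro copy for downstream row-oriented consumers
    raw.write.mode("overwrite").format("avro").save("hdfs:///data/curated/events_avro")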
Environment: Hadoop, HDFS, AWS, Vertica, Scala, Kafka, MapReduce, YARN, Drill, Spark, Pig, Hive, Java, NiFi, HBase, MySQL, Kerberos, Maven.
Confidential
Big Data Developer
Responsibilities:
- Involved in building a scalable, distributed data lake system for Confidential's real-time and batch analytical needs.
- Involved in designing, reviewing, and optimizing data transformation processes using Apache Storm.
- Managed jobs using Fair Scheduling and developed job processing scripts using Control-M workflows.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using both the DataFrame/SQL and RDD APIs in Spark 1.6 for data aggregation and queries, writing data back into the OLTP system through Sqoop.
- Tuned Spark applications to set the right batch interval time, the correct level of parallelism, and memory usage (see the tuning sketch after this list).
- Loaded the data into Spark RDDs and performed in-memory data computation to generate the output response.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Performed advanced procedures such as text analytics and processing, using Spark's in-memory computing capabilities in Scala.
- Imported data from Kafka Consumer into HBase using Spark streaming.
- Experienced in using Zookeeper and Oozie Operational Services for coordinating the cluster and scheduling workflows.
- Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive and Sqoop as well as system specific jobs.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, transformations, and other optimizations during the ingestion process itself.
- Worked on migrating legacy MapReduce programs into Spark transformations using Spark and Scala (see the migration sketch after this list).
- Worked on a POC comparing processing times of Impala and Apache Hive for batch applications, in order to implement the former in the project.
- Worked extensively with Sqoop for importing metadata from Oracle.
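A hedged sketch of the Spark Streaming tuning knobs mentioned above (batch interval, level of parallelism, executor memory); the values and the socket source are illustrative placeholders only.

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    # Illustrative tuning values only; real settings depend on cluster size and data volume
    conf = (SparkConf()
            .setAppName("tuned-streaming-sketch")
            .set("spark.executor.memory", "4g")          # memory tuning
            .set("spark.executor.cores", "4")
            .set("spark.default.parallelism", "64")      # level of parallelism
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

    sc = SparkContext(conf=conf)

    # Batch interval chosen so each micro-batch finishes before the next one starts
    ssc = StreamingContext(sc, 10)  # 10-second batch interval

    lines = ssc.socketTextStream("localhost", 9999)      # placeholder source
    counts = (lines.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b, numPartitions=64))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()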
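A hedged sketch of the kind of legacy MapReduce job rewritten as Spark transformations; the project work was done in Scala, and the PySpark version below uses hypothetical HDFS paths and a hypothetical record layout.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mr-to-spark-sketch").getOrCreate()
    sc = spark.sparkContext

    # Legacy mapper/reducer pair re-expressed as Spark transformations
    events = sc.textFile("hdfs:///data/raw/clickstream/*.log")

    counts = (events
              .map(lambda line: line.split("\t"))        # mapper: parse the record
              .filter(lambda fields: len(fields) > 2)    # drop malformed rows
              .map(lambda fields: (fields[1], 1))        # emit (page_id, 1)
              .reduceByKey(lambda a, b: a + b))          # reducer: sum per key

    counts.saveAsTextFile("hdfs:///data/curated/page_counts")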
Environment: Apache Storm, Spark, Hadoop, Scala, Kafka, ZooKeeper, MapReduce, Hive, Sqoop, HBase, Impala, Oozie, Oracle, YARN, text analytics.
Confidential
Data Analyst/Modeler
Responsibilities:
- Performed data analysis, data modeling, data migration, and data profiling using complex SQL on various source systems, including Oracle and Teradata (a profiling-query sketch follows this list).
- Experienced in building applications based on large datasets in MarkLogic.
- Translated business requirements into working logical and physical data models for Data warehouse, Data marts and OLAP applications.
- Analyzed data lineage processes to identify vulnerable data points, control gaps, data quality issues, and an overall lack of data governance.
- Worked on data cleansing and standardization using the cleanse functions in Informatica MDM.
- Designed Star and Snowflake Data Models for Enterprise Data Warehouse using ERWIN.
- Validated and updated the appropriate LDMs to process mappings, screen designs, use cases, the business object model, and the system object model as they evolved and changed.
- Maintained data model and synchronized it with the changes to the database.
- Designed and developed use cases, activity diagrams, and sequence diagrams using UML.
- Extensively involved in the modeling and development of Reporting Data Warehousing System.
- Designed database tables and created table- and column-level constraints using the suggested naming conventions for constraint keys.
- Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
- Used the ETL tool BusinessObjects Data Services (BO DS) to extract, transform, and load data into data warehouses from various sources such as relational databases, application systems, temp tables, and flat files.
- Wrote packages, procedures, functions, and exceptions using PL/SQL.
- Reviewed the database programming for triggers, exceptions, functions, packages, procedures.
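A hedged sketch of the kind of column-level profiling SQL run against the Oracle sources, executed here through cx_Oracle; the connection details, table, and column names are placeholders.

    import cx_Oracle

    # Placeholder connection details
    conn = cx_Oracle.connect(user="analyst", password="***", dsn="orcl_host:1521/ORCLPDB")
    cur = conn.cursor()

    # Simple column-level profile: row count, null count, and distinct count
    cur.execute("""
        SELECT COUNT(*)                                              AS row_cnt,
               SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)  AS null_cnt,
               COUNT(DISTINCT customer_id)                           AS distinct_cnt
        FROM   sales.customer_orders
    """)
    row_cnt, null_cnt, distinct_cnt = cur.fetchone()
    print(f"rows={row_cnt} nulls={null_cnt} distinct={distinct_cnt}")

    cur.close()
    conn.close()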
Environment: MarkLogic, OLAP, Oracle, Teradata, ERWIN, ETL, NoSQL, Star, Snowflake data models, PL/SQL