
Big Data Engineer Resume


Jersey City, NJ

SUMMARY

  • I am a Big Data Engineer with 5 years of professional experience in the IT industry.
  • As an experienced Big Data consultant, I ensure the successful delivery of high-quality big data solutions.
  • I combine an understanding of the business case with a broad set of skills, frameworks, best practices, and hands-on coding ability.
  • Additionally, I have a strong work ethic and the ability to work well with teams to create the right platforms, pipelines, and reporting tools for clients.
  • Used Spark SQL and the DataFrame API extensively to build Spark applications.
  • Experienced with CQL (Cassandra Query Language) for retrieving data from Cassandra clusters.
  • Quick to learn and adapt to the CI/CD toolchain (GitHub, Jenkins) available, or proposed to be made available, in the customer environment.
  • Configured the ELK stack for Jenkins logs and syslogs.
  • Used Spark to stream analyzed data to HBase and make it available to the BI team for visualization and report generation.
  • Used Spark Structured Streaming to build real-time DataFrames and keep them updated as new data arrives (see the sketch at the end of this list).
  • Prototyped the analysis and joining of customer data using Spark in Scala and persisted the results to HDFS.
  • Implemented Spark on EMR to process Big Data across our Data Lake in AWS.
  • Developed AWS strategy, planning, and configuration of S3, Security groups, IAM, EC2, EMR and Redshift.
  • Experience integrating Kafka with Avro for serializing and deserializing data; expertise with Kafka producers and consumers.
  • Experience in configuring, installing and managing Hortonworks & Cloudera Distributions.
  • Involved in continuous integration of applications using Jenkins.
  • Implemented Spark and Spark SQL for faster testing and processing of data.
  • Experience writing streaming applications with Spark Streaming/Kafka.
  • Utilized Spark Structured Streaming to update DataFrames in real time and process them.
  • Experienced in Amazon Web Services (AWS), and cloud services such as EMR, EC2, S3, RDS and IAM entities, roles, and users.
  • Knowledgeable in deploying application JAR files to AWS instances.
  • Connected Structured Streaming jobs to Kafka brokers, applying schemas to obtain structured data.
  • Handled schema changes in data streams consumed from Kafka.
  • Skilled in HiveQL, writing custom Hive UDFs, optimizing Hive queries, and writing incremental imports into Hive tables.
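A minimal PySpark sketch of the Spark Structured Streaming and Kafka schema handling described above. The broker address, topic name, and event schema are illustrative assumptions rather than details of any client system, and the job also needs the spark-sql-kafka connector package on the Spark classpath.

  # Read JSON events from a Kafka topic, apply an explicit schema, and keep an
  # always-current view that Spark SQL and BI tools can query.
  from pyspark.sql import SparkSession
  from pyspark.sql.functions import from_json, col
  from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

  spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

  # Hypothetical event schema; a real job would derive this from the Avro/JSON contract.
  event_schema = StructType([
      StructField("event_id", StringType()),
      StructField("amount", DoubleType()),
      StructField("event_time", TimestampType()),
  ])

  raw = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
         .option("subscribe", "events")                      # placeholder topic
         .option("startingOffsets", "latest")
         .load())

  # Kafka delivers key/value byte payloads; cast the value and parse it with the schema.
  events = (raw.selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), event_schema).alias("e"))
            .select("e.*"))

  # Continuously update an in-memory table named live_events as new records arrive.
  query = (events.writeStream
           .outputMode("append")
           .format("memory")
           .queryName("live_events")
           .start())
  query.awaitTermination()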

TECHNICAL SKILLS

DATABASE AND DATA WAREHOUSE: Cassandra, Hbase, Amazon Redshift, DynamoDB, MongoDB, Oracle, PostgreSQL, MySQL, Hive

DATA STORES (repositories): Data Lake, HDFS, Data Warehouse, S3

SOFTWARE DEVELOPMENT: Spark, Scala, Hive, Pig, Java, PySpark, Keras, TensorFlow, JavaScript, HTML, CSS, SQL, C, C++, C#, Shell Script, VBA, Python (Jupyter Notebook, Pandas, NumPy, Matplotlib, Scikit-learn, Boto3, Psycopg2, BeautifulSoup, GeoPandas, Rasterio), R, MATLAB

DEVELOPMENT TOOLS, AUTOMATION, CI/CD: Git, GitHub, MVC, Jenkins, Jira, Agile, Scrum

ELK LOGGING & SEARCH: Elasticsearch, Logstash, Kibana

DATA PIPELINES/ETL: Flume, Kafka, Kinesis, Hive, Pig, Spark, Spark Streaming, Spark Structured Streaming, Spark SQL, DataFrames

BIG DATA PLATFORMS: Cloudera CDH, Hortonworks HDP, Amazon Web Services (AWS)/Amazon Cloud

AWS PLATFORM: AWS IAM, AWS CloudFormation, AWS Redshift, AWS RDS, AWS EMR, AWS S3, AWS EC2, AWS Lambda, AWS Kinesis, ELK on AWS, AWS Cloud

DATA VISUALIZATION: Tableau, Power BI, Excel, Kibana

PROFESSIONAL EXPERIENCE

BIG DATA ENGINEER

Confidential, Jersey City, NJ

Responsibilities:

  • Created a training program to develop IT professionals into Machine Learning developers.
  • Trained IT professionals in Python and Spark so that, by the end of the program, they understood the platform's capabilities, features, custom solutions, and limitations and could deliver high-quality proof-of-concept data-processing models using PySpark.
  • Built Real-Time Streaming Data Pipelines with Kafka, Spark Streaming and Hive.
  • Implemented Spark Streaming for real-time data processing with Kafka.
  • Handled large amounts of data utilizing Spark.
  • Wrote streaming applications with Spark Streaming/Kafka.
  • Used Spark SQL to perform transformations and actions on data residing in Hive.
  • Responsible for designing and deploying new ELK clusters.
  • Log monitoring and generating visual representations of logs using ELK stack.
  • Implemented CI/CD tool upgrades, backups, and restores.
  • Created Infrastructure design for ELK Clusters.
  • Worked on Elasticsearch and Logstash (ELK) performance and configuration tuning.
  • Created a Kafka producer to connect to different external sources and bring the data into a Kafka broker (see the sketch at the end of this list).
  • Handled schema changes in data stream using Kafka.
  • Supported clusters and topics through Kafka Manager.
  • Responsible for Kafka operation and monitoring, and handling of messages funneled through Kafka topics.
  • Coordinated Kafka operation and monitoring with DevOps personnel and balanced the impact of Kafka producer and consumer message (topic) consumption.
  • Handled versioning with Git and set up Jenkins CI to manage CI/CD practices.
  • Pulled data and populated it in Kibana.
  • Designed Kibana dashboards over Elasticsearch for visualizing the data.
  • Used Kibana to create custom dashboards, data visualization and reports.
  • Built Jenkins jobs for CI/CD infrastructure from GitHub repos
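A rough Python sketch of the Kafka producer work above, using the kafka-python client to pull records from an external source and publish them to a broker. The endpoint URL, topic name, and broker address are hypothetical placeholders.

  # Fetch a batch of records from an external HTTP source and publish each one
  # to a Kafka topic as JSON.
  import json

  import requests
  from kafka import KafkaProducer

  producer = KafkaProducer(
      bootstrap_servers=["broker:9092"],                        # placeholder broker
      value_serializer=lambda v: json.dumps(v).encode("utf-8"),
  )

  def publish_batch(endpoint="https://example.com/api/events", topic="raw-events"):
      """Pull records from an external source and send them to the Kafka broker."""
      records = requests.get(endpoint, timeout=30).json()
      for record in records:
          producer.send(topic, value=record)
      producer.flush()  # block until every buffered message has been delivered

  if __name__ == "__main__":
      publish_batch()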

AWS BIG DATA ENGINEER

Confidential, Atlanta, GA

Responsibilities:

  • Implemented AWS IAM user roles and policies to authenticate and control access.
  • Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.
  • Developed AWS CloudFormation templates to create the custom infrastructure of our pipeline.
  • Worked on AWS Kinesis for processing large amounts of real-time data.
  • Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.
  • Ingested data from various sources to S3 through AWS Kinesis.
  • Configured CloudFormation, AWS IAM, and security groups in public and private subnets in a VPC.
  • Worked with AWS Lambda functions for event-driven processing to various AWS resources.
  • Implemented Spark in EMR for processing Big Data across our Data Lake in AWS System.
  • Worked with the AWS IAM console to create custom users and groups; hands-on work with AWS EMR and S3.
  • Automated AWS components like EC2 instances, Security groups, ELB, RDS, Lambda and IAM through AWS cloud Formation templates.
  • Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.
  • Implemented security measures AWS provides, employing key concepts of AWS Identity and Access Management (IAM).
  • Installed, configured, and managed AWS monitoring tools such as ELK and CloudWatch for resource monitoring.
  • Used AWS EMR to process big data across Hadoop clusters of virtual servers, with data stored on Amazon Simple Storage Service (S3).
  • Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications.
  • Responsible for designing logical and physical data models for various data sources on AWS Redshift.
  • Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, using Amazon API Gateway (see the sketch at the end of this list).
  • Used AWS Kinesis for real-time data processing.
  • Experienced in Amazon Web Services (AWS), and cloud services such as EMR, EC2, S3, ELB and IAM entities, roles, and users.
  • Developed AWS CloudFormation templates for Redshift.
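A small sketch of the event-driven Lambda pattern described above: an S3 object-created event triggers the function, which records object metadata in DynamoDB via boto3. The table name and attribute names are assumptions for illustration.

  # AWS Lambda handler invoked by S3 event notifications; writes one DynamoDB
  # item per newly landed object so downstream jobs can pick it up.
  import boto3

  dynamodb = boto3.resource("dynamodb")
  table = dynamodb.Table("ingested_objects")  # hypothetical table name

  def lambda_handler(event, context):
      records = event.get("Records", [])
      for record in records:
          bucket = record["s3"]["bucket"]["name"]
          key = record["s3"]["object"]["key"]
          size = record["s3"]["object"].get("size", 0)
          table.put_item(Item={
              "object_key": key,   # partition key (assumed)
              "bucket": bucket,
              "size_bytes": size,
          })
      return {"status": "ok", "processed": len(records)}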

BIG DATA DEVELOPER

Confidential, St. Paul, MN

Responsibilities:

  • Wrote and optimized Hive queries in HiveQL.
  • Performed ETL into the Hadoop file system (HDFS) and wrote Hive UDFs.
  • Experienced in importing real-time logs to HDFS using Flume.
  • Created UNIX shell scripts to automate the build process, and to perform regular jobs like file transfers.
  • Managed Hadoop clusters and checked their status using Ambari.
  • Moved relational database data using Spark, transforming it and loading it into Hive dynamic-partition tables via staging tables (see the sketch at the end of this list).
  • Developed scripts to automate the workflow processes and generate reports.
  • Transferred data between the Hadoop ecosystem and structured data stores in an RDBMS such as MySQL using Sqoop.
  • Involved in writing incremental imports into Hive tables.
  • Extensively worked on HiveQL, join operations, writing custom UDFs, and skilled in optimizing Hive Queries.
  • Used the Spark API over Hadoop YARN to perform analytics on data in Hive.
  • Developed Shell Scripts, Oozie Scripts and Python Scripts.
  • Downloaded data through Hive on the HDFS platform.
  • Developed job processing scripts using Oozie workflow to run multiple Spark Jobs in sequence for processing data.
  • Used Ambari to maintain a healthy cluster.
  • Configured Hadoop components (HDFS, Zookeeper) to coordinate the servers in clusters.
  • Applied Hive partitioning, bucketing, and joins on Hive tables, utilizing Hive SerDes.
  • Wrote shell scripts to automate workflows to pull data from various databases into Hadoop.
  • Loaded data into HBase tables and Hive tables for consumption purposes.
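A condensed PySpark sketch of the relational-to-Hive flow above: read a MySQL table over JDBC, stage it, and insert it into a dynamically partitioned Hive table. The connection details, table names, columns, and partition column are illustrative assumptions, and the MySQL JDBC driver must be on the Spark classpath.

  # Pull a source table over JDBC, stage it in Spark, and load it into a
  # partitioned Hive table using dynamic partitioning.
  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .appName("mysql-to-hive")
           .enableHiveSupport()
           .getOrCreate())

  # Allow Hive dynamic partitions for the insert below.
  spark.sql("SET hive.exec.dynamic.partition=true")
  spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

  # Placeholder host, credentials, and source table.
  orders = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://db-host:3306/sales")
            .option("dbtable", "orders")
            .option("user", "etl_user")
            .option("password", "...")
            .load())

  # Stage the data, then write it into the partitioned Hive table
  # (partition column must come last in the SELECT).
  orders.createOrReplaceTempView("orders_staging")
  spark.sql("""
      INSERT INTO TABLE warehouse.orders_partitioned PARTITION (order_date)
      SELECT order_id, customer_id, amount, order_date
      FROM orders_staging
  """)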

BIG DATA DEVELOPER

Confidential, Bloomington, IN

Responsibilities:

  • Experience in configuring, installing and managing Hortonworks (HDP) Distributions.
  • Enabled security to the cluster using Kerberos and integrated clusters with LDAP at Enterprise level.
  • Worked on tickets related to various Hadoop/Big data services which include HDFS, Yarn, Hive, Oozie, Spark, Kafka.
  • Worked on Hortonworks Hadoop distributions (HDP 2.5)
  • Performed cluster tuning and ensured high availability.
  • Provided cluster coordination services through ZooKeeper and Kafka.
  • Coordinated and monitored cluster upgrade needs, monitored cluster health, and built proactive tools to look for anomalous behaviors.
  • Managed Hadoop clusters via the command line and the Hortonworks Ambari agent.
  • Monitored multiple Hadoop cluster environments using Ambari.
  • Worked with cluster users to ensure efficient resource usage in the cluster and alleviate multi-tenancy concerns.
  • Managed cluster using Ambari
  • Managed and scheduled batch jobs on a Hadoop Cluster using Oozie.
  • Monitored Hadoop cluster using tools like Ambari.
  • Performed cluster and system performance tuning.
  • Ran multiple Spark jobs in sequence to process data.
  • Performed analytics on data using Spark.
  • Moved data from Spark and persisted it to HDFS.
  • Used Spark SQL and UDFs to perform transformations and actions on data residing in Hive (see the sketch at the end of this list).
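A brief PySpark sketch of the Spark SQL and UDF usage above, run against a Hive table; the database, table, and column names are placeholders for illustration.

  # Register a Python UDF and use it in a Spark SQL query over a Hive table;
  # the SELECT is the transformation and show() is the action.
  from pyspark.sql import SparkSession
  from pyspark.sql.types import StringType

  spark = (SparkSession.builder
           .appName("hive-udf-example")
           .enableHiveSupport()
           .getOrCreate())

  def normalize_status(value):
      """Normalize free-text status codes before aggregating."""
      return (value or "").strip().upper()

  spark.udf.register("normalize_status", normalize_status, StringType())

  result = spark.sql("""
      SELECT normalize_status(status) AS status, COUNT(*) AS cnt
      FROM warehouse.transactions
      GROUP BY normalize_status(status)
  """)
  result.show()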
