Big Data Engineer Resume
Jersey City, NJ
SUMMARY
- I am a Big Data Engineer with 5 years of professional experience in the IT industry.
- As an experienced Big Data consultant, I ensure the successful delivery of high-quality big data solutions.
- I combine an understanding of the business case with a variety of skills, frameworks, best practices, and coding ability.
- I bring a strong work ethic and the ability to work well with teams to create the right platforms, pipelines, and reporting tools for clients.
- Used Spark SQL and the DataFrame API extensively to build Spark applications (a brief illustrative sketch follows this summary).
- Experienced with CQL (Cassandra Query Language) for retrieving data from Cassandra clusters by running CQL queries.
- Quick to learn and adapt to the CI/CD toolchain (GitHub, Jenkins) available in the customer environment or proposed to be made available.
- Configured the ELK stack for Jenkins logs and syslogs.
- Used Spark to stream analyzed data to HBase and make it available for visualization and report generation by the BI team.
- Used Spark Structured Streaming to build streaming DataFrames and update them in real time.
- Prototyped analysis and joining of customer data using Spark in Scala and wrote the processed results to HDFS.
- Implemented Spark on EMR for processing big data across our data lake in AWS.
- Developed AWS strategy and planning, and configured S3, security groups, IAM, EC2, EMR, and Redshift.
- Experience integrating Kafka with Avro for serializing and deserializing data. Expertise with Kafka producer and consumer.
- Experience in configuring, installing and managing Hortonworks & Cloudera Distributions.
- Involved in continuous integration of applications using Jenkins.
- Implemented Spark and Spark SQL for faster testing and processing of data.
- Experience writing streaming applications with Spark Streaming/Kafka.
- Utilized Spark Structured Streaming to process DataFrames and update them in real time.
- Experienced in Amazon Web Services (AWS), and cloud services such as EMR, EC2, S3, RDS and IAM entities, roles, and users.
- Knowledgeable in deploying application JAR files to AWS instances.
- Connected Spark Structured Streaming to Kafka brokers to consume data structured by schema.
- Handled schema changes in data streams using Kafka.
- Skilled in HiveQL, writing custom Hive UDFs, optimizing Hive queries, and writing incremental imports into Hive tables.
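For illustration only (not tied to a specific engagement), a minimal sketch of the Spark SQL and DataFrame API usage referenced above; the file path, view name, and column names are assumed placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, avg

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # Load a dataset with the DataFrame API (path is a placeholder)
    transactions = spark.read.parquet("hdfs:///data/raw/transactions")

    # DataFrame API: filter and aggregate
    daily_avg = (transactions
                 .filter(col("amount") > 0)
                 .groupBy("transaction_date")
                 .agg(avg("amount").alias("avg_amount")))

    # Spark SQL: the same data exposed as a temporary view
    transactions.createOrReplaceTempView("transactions")
    top_merchants = spark.sql("""
        SELECT merchant_id, SUM(amount) AS total_amount
        FROM transactions
        GROUP BY merchant_id
        ORDER BY total_amount DESC
        LIMIT 10
    """)

    daily_avg.show()
    top_merchants.show()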
TECHNICAL SKILLS
DATABASE AND DATA WAREHOUSE: Cassandra, HBase, Amazon Redshift, DynamoDB, MongoDB, Oracle, PostgreSQL, MySQL, Hive
DATA STORES (repositories): Data Lake, HDFS, Data Warehouse, S3
SOFTWARE DEVELOPMENT: Spark, Scala, Hive, Pig, Java, PySpark, Keras, TensorFlow, JavaScript, HTML, CSS, SQL, C, C++, C#, Shell Script, VBA, Python (Jupyter Notebook, Pandas, NumPy, Matplotlib, Scikit-learn, Boto3, Psycopg2, BeautifulSoup, GeoPandas, Rasterio), R, MATLAB
DEVELOPMENT TOOLS, AUTOMATION, CI/CD: Git, GitHub, MVC, Jenkins, CI/CD, Jira, Agile, Scrum
ELK LOGGING & SEARCH: Elasticsearch, Logstash, Kibana
DATA PIPELINES/ETL: Flume, Kafka, Kinesis, Hive, Pig, Spark, Spark Streaming, Spark Structured Streaming, Spark SQL, DataFrames
BIG DATA PLATFORMS: Cloudera CDH, Hortonworks HDP, Amazon Web Services (AWS)/Amazon Cloud
AWS PLATFORM: AWS IAM, AWS CloudFormation, AWS Redshift, AWS RDS, AWS EMR, AWS S3, AWS EC2, AWS Lambda, AWS Kinesis, ELK on AWS
DATA VISUALIZATION: Tableau, Power BI, Excel, Kibana
PROFESSIONAL EXPERIENCE
BIG DATA ENGINEER
Confidential, Jersey City, NJ
Responsibilities:
- Created a training program to develop IT professionals into Machine Learning developers.
- Trained IT professionals in Python and Spark so that, by the end of the program, they could understand capabilities, features, custom solutions, and limitations, and deliver high-quality proof-of-concept data processing models using PySpark programs.
- Built Real-Time Streaming Data Pipelines with Kafka, Spark Streaming and Hive.
- Implemented Spark Streaming with Kafka for real-time data processing (a brief illustrative sketch follows this section).
- Handled large amounts of data utilizing Spark.
- Wrote streaming applications with Spark Streaming/Kafka.
- Used Spark SQL to perform transformations and actions on data residing in Hive.
- Responsible for designing and deploying new ELK clusters.
- Monitored logs and generated visual representations of logs using the ELK stack.
- Implemented CI/CD tool upgrades, backups, and restores.
- Created Infrastructure design for ELK Clusters.
- Worked on Elasticsearch and Logstash (ELK) performance and configuration tuning.
- Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
- Handled schema changes in data streams using Kafka.
- Supported clusters and topics through Kafka Manager.
- Responsible for Kafka operation and monitoring, and handling of messages funneled through Kafka topics.
- Coordinated Kafka operation and monitoring with DevOps personnel; balanced the impact of Kafka producer and consumer message (topic) consumption.
- Managed versioning with Git and set up Jenkins CI to manage CI/CD practices.
- Pulled data and populated the data in Kibana.
- Designed Kibana dashboards over Elasticsearch for visualizing the data.
- Used Kibana to create custom dashboards, data visualizations, and reports.
- Built Jenkins jobs for CI/CD infrastructure from GitHub repos.
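A minimal sketch, assuming a JSON event feed, of the kind of Kafka-to-Spark Structured Streaming pipeline described above; the broker address, topic name, and schema fields are placeholders, and the console sink stands in for the real HBase/Hive targets.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Placeholder schema for incoming JSON events
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("amount", DoubleType()),
    ])

    spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

    # Read a stream from a Kafka topic (broker and topic are placeholders)
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .load())

    # Parse the Kafka message value into structured columns
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    # Write the structured stream out (console sink for demonstration)
    query = (events.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()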
AWS BIG DATA ENGINEER
Confidential, Atlanta, GA
Responsibilities:
- Implemented AWS IAM user roles and policies to authenticate and control access.
- Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.
- Developed AWS CloudFormation templates to create the custom infrastructure for our pipeline.
- Worked on AWS Kinesis for processing large amounts of real-time data.
- Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.
- Ingested data from various sources to S3 through AWS Kinesis.
- Configured CloudFormation, AWS IAM, and security groups in public and private subnets within a VPC.
- Worked with AWS Lambda functions for event-driven processing to various AWS resources.
- Implemented Spark on EMR for processing big data across our data lake in AWS.
- Worked with the AWS IAM console to create custom users and groups; hands-on work with AWS EMR and S3.
- Automated AWS components such as EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates.
- Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded it into AWS Redshift.
- Implemented security measures AWS provides, employing key concepts of AWS Identity and Access Management (IAM).
- Installed, configured, and managed AWS tools such as ELK and CloudWatch for resource monitoring.
- Used AWS EMR to process big data across Hadoop clusters of virtual servers on Amazon Simple Storage Service (S3).
- Launched and configured Amazon EC2 cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications.
- Responsible for designing logical and physical data models for various data sources on AWS Redshift.
- Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket using Amazon API Gateway (see the illustrative sketch after this section).
- Used AWS Kinesis for real-time data processing.
- Experienced in Amazon Web Services (AWS), and cloud services such as EMR, EC2, S3, ELB and IAM entities, roles, and users.
- Developed AWS CloudFormation templates for Redshift.
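A minimal sketch, assuming an S3 object-created trigger, of the kind of event-driven Lambda processing described above; the DynamoDB table name and item fields are hypothetical placeholders.

    import json
    import boto3

    # Placeholder DynamoDB table used to record ingested objects
    TABLE_NAME = "ingested_objects"
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table(TABLE_NAME)

    def lambda_handler(event, context):
        """Triggered by S3 object-created events; records each object's bucket, key, and size."""
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            size = record["s3"]["object"].get("size", 0)
            table.put_item(Item={"object_key": key, "bucket": bucket, "size_bytes": size})
        return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}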
BIG DATA DEVELOPER
Confidential, St. Paul, MN
Responsibilities:
- Wrote Hive queries and optimized them in HiveQL.
- Performed ETL into the Hadoop file system (HDFS) and wrote Hive UDFs.
- Experienced in importing real-time logs to HDFS using Flume.
- Created UNIX shell scripts to automate the build process, and to perform regular jobs like file transfers.
- Managed Hadoop clusters and checked the status of clusters using Ambari.
- Moved relational database data using Spark, transforming it and loading it into Hive dynamic-partition tables via staging tables (a brief illustrative sketch follows this section).
- Developed scripts to automate the workflow processes and generate reports.
- Transferred data between the Hadoop ecosystem and structured data storage in an RDBMS such as MySQL using Sqoop.
- Involved in writing incremental imports into Hive tables.
- Extensively worked on HiveQL, join operations, writing custom UDFs, and skilled in optimizing Hive Queries.
- Used the Spark API over Hadoop YARN to perform analytics on data in Hive.
- Developed Shell Scripts, Oozie Scripts and Python Scripts.
- Retrieved data through Hive on the HDFS platform.
- Developed job processing scripts using Oozie workflow to run multiple Spark Jobs in sequence for processing data.
- Used Ambari to maintain a healthy cluster.
- Configured Hadoop components (HDFS, Zookeeper) to coordinate the servers in clusters.
- Performed Hive partitioning, bucketing, and joins on Hive tables, utilizing Hive SerDes.
- Wrote shell scripts to automate workflows to pull data from various databases into Hadoop.
- Loaded data into HBase tables and Hive tables for consumption purposes.
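A minimal sketch, assuming a MySQL source and a pre-existing partitioned Hive table, of the Spark-based RDBMS-to-Hive load described above; the JDBC URL, credentials, and table/column names are hypothetical placeholders.

    from pyspark.sql import SparkSession

    # Spark session with Hive support (assumes Hive is configured on the cluster)
    spark = (SparkSession.builder
             .appName("rdbms-to-hive-demo")
             .enableHiveSupport()
             .getOrCreate())

    # Allow dynamic partitioning when inserting into the Hive table
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    # Read from the relational source over JDBC (URL, table, and credentials are placeholders)
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://db-host:3306/sales")
              .option("dbtable", "orders")
              .option("user", "etl_user")
              .option("password", "***")
              .load())

    # Stage the data, then load it into a partitioned Hive table
    orders.createOrReplaceTempView("orders_staging")
    spark.sql("""
        INSERT INTO TABLE analytics.orders_partitioned PARTITION (order_date)
        SELECT order_id, customer_id, amount, order_date
        FROM orders_staging
    """)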
BIG DATA DEVELOPER
Confidential, Bloomington, IN
Responsibilities:
- Configured, installed, and managed Hortonworks (HDP) distributions.
- Enabled security to the cluster using Kerberos and integrated clusters with LDAP at Enterprise level.
- Worked on tickets related to various Hadoop/Big data services which include HDFS, Yarn, Hive, Oozie, Spark, Kafka.
- Worked on Hortonworks Hadoop distributions (HDP 2.5).
- Performed cluster tuning and ensured high availability.
- Provided cluster coordination services through ZooKeeper and Kafka.
- Coordinated cluster upgrade needs, monitored cluster health, and built proactive tools to look for anomalous behaviors.
- Managed Hadoop clusters via the command line and the Hortonworks Ambari agent.
- Monitored multiple Hadoop cluster environments using Ambari.
- Worked with cluster users to ensure efficient resource usage in the cluster and alleviate multi-tenancy concerns.
- Managed clusters using Ambari.
- Managed and scheduled batch jobs on a Hadoop Cluster using Oozie.
- Monitored Hadoop clusters using tools like Ambari.
- Performed cluster and system performance tuning.
- Ran multiple Spark jobs in sequence for processing data.
- Performed analytics on data using Spark.
- Moved data from Spark and persisted it to HDFS.
- Used Spark SQL and UDFs to perform transformations and actions on data residing in Hive (illustrated in the sketch below).
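A minimal sketch, assuming a Hive table is already defined, of the Spark SQL/UDF transformations over Hive data referenced above; the table, column names, and output path are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .appName("hive-udf-demo")
             .enableHiveSupport()
             .getOrCreate())

    # Simple Python UDF that maps a status code to a readable label
    @udf(returnType=StringType())
    def status_label(code):
        return {"A": "active", "I": "inactive"}.get(code, "unknown")

    # Transform data residing in a Hive table (table and columns are placeholders)
    accounts = spark.table("warehouse.accounts")
    labeled = accounts.withColumn("status_label", status_label(col("status_code")))

    # Persist the results back to HDFS for downstream consumption
    labeled.write.mode("overwrite").parquet("hdfs:///data/curated/accounts_labeled")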