Big Data Engineer Resume


Goodlettsville, TN

SUMMARY

  • 6+ years working on Big Data and Hadoop projects, with over 8 years of overall IT experience.
  • 1½ years as a Linux Systems Administrator.
  • Expertise with Hadoop Ecosystem tools such as HDFS, Pig, Hive, Sqoop, Spark, Kafka, YARN, Oozie, ZooKeeper, etc.
  • Skilled with Python and Scala.
  • Experienced in applying user-defined functions (UDFs) for Hive and Pig using Python (a minimal sketch follows this list).
  • Perform ETL (data extraction, transformation, and load) using Hive, Pig, and HBase.
  • Hands-on with Spark Architecture, including Spark Core, Spark SQL, and Spark Streaming.
  • Real-time experience with the Hadoop Distributed File System (Cloudera, MapR, S3), the Hadoop framework, and parallel processing implementations.
  • Develop applications using RDBMS, Hive, Linux/Unix shell scripting and Linux internals.
  • Experience writing UDFs for Hive and Pig.
  • Experience with AWS Cloud IAM, Data Pipeline, EMR, S3, EC2, AWS CLI, SNS, and other services.
  • Cleanse and analyze data using HiveQL and Pig Latin.
  • Database experience with Cassandra, HBase, MongoDB, SQL Server, and MySQL.
  • Create scripts and macros using Microsoft Visual Studio to automate tasks.
  • Experience in working with GitHub Repository.
  • Execute Flume to load the log data from multiple sources directly into HDFS.
  • Design both time-driven and data-driven automated workflows using Oozie.
  • Experience writing UNIX shell scripts.
  • Experienced in extracting data and generating analyses using Business Intelligence tools such as Tableau.
  • Experience importing and exporting data between Hadoop and RDBMS using Sqoop and SFTP.
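
Below is a minimal, hypothetical sketch of the Python-for-Hive pattern referenced above: a transform script that Hive can invoke through a TRANSFORM ... USING clause, reading tab-separated rows from stdin and writing cleaned rows back to stdout. The script name and fields are illustrative, not taken from an actual project.

```python
#!/usr/bin/env python
# clean_emails.py -- hypothetical Hive streaming transform script.
# Hive pipes each row in as tab-separated fields on stdin and reads the
# transformed row back from stdout, e.g.:
#   SELECT TRANSFORM(user_id, email) USING 'python clean_emails.py'
#   AS (user_id, email) FROM raw_users;
import sys

for line in sys.stdin:
    user_id, email = line.rstrip("\n").split("\t")
    cleaned = email.strip().lower()  # normalize before it lands in the curated table
    print("\t".join([user_id, cleaned]))
```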

TECHNICAL SKILLS

Big Data Platforms: Hadoop, Cloudera Hadoop, Cloudera Impala, Hortonworks.

Hadoop Ecosystem (Apache) Tools: Cassandra, Flume, Kafka, Spark, Hadoop YARN, HBase, Hive, Oozie, Pig, Spark Streaming, Spark MLlib.

Hadoop Ecosystem Components: Sqoop, Kibana, Tableau, AWS, Cloud Foundry, DataFrames, Datasets, ZooKeeper, HDFS, Hortonworks, Apache Airflow, GitHub, BitBucket.

Scripting: Python, Scala, SQL, Spark, HiveQL, Pig Latin.

Data Storage and Files: HDFS, Data Lake, Data Warehouse, Redshift, Parquet, Avro, JSON, Snappy, Gzip.

Databases: Apache Cassandra, Apache HBase, MongoDB, SQL, MySQL, RDBMS, NoSQL, DB2, DynamoDB.

Cloud Platforms and Tools: S3, AWS, EC2, EMR, Lambda services, Microsoft Azure, Adobe Cloud, Amazon Redshift, OpenStack, Google Cloud Platform, MapR Cloud, Elastic Cloud.

Data Reporting and Visualization: Tableau, PowerBI, Kibana

Web Technologies and APIs: XML, Blueprint XML, Ajax, REST API, Spark API, JSON.

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential, Goodlettsville, TN

Responsibilities:

  • Applied the Spark framework to both batch and real-time data processing.
  • Processed data using Spark Streaming API.
  • Created Lambda functions on AWS using Python.
  • Hands-on with AWS, Redshift, DynamoDB, and various cloud tools.
  • Collected data via REST APIs: established HTTPS client-server connections, sent GET requests, and collected the responses in a Kafka producer (see the sketch after this list).
  • Configured the packages and JARs for Spark jobs to load data into HBase.
  • Split the JSON files at the RDD level so they could be processed in parallel for better performance and fault tolerance.
  • Loaded the data into HBase under the default namespace and assigned row keys and data types.
  • Developed new Flume agents to extract data from Kafka.
  • Decoded raw data and converted it to JSON before sending batched streaming files through the Kafka producer.
  • Created Kafka topics for the brokers to listen on so that data transfer could function across the distributed system.
  • Created a consumer that listened to the topic and brokers on the configured port and established a direct streaming channel.
  • Parsed the received JSON files and reloaded them for further transformation.
  • Built a schema as a struct type to access the information in layered JSON files.
  • Created DataFrames in Apache Spark by passing the schema as a parameter over the ingested data.
  • Developed Spark UDFs using Scala for better performance.
  • Developed Airflow DAGs using Python.
  • Created Hive and SQL queries to spot emerging trends by comparing data with historical metrics.
  • Used the Spark SQL context to parse out the needed data, selecting the columns with the target information and assigning names.
  • Worked with unstructured data and parsed out the information using Python built-in functions.
  • Started and configured master and worker nodes for the Spark session and initiated the Spark context per the Spark 2 standard.
  • Utilized Spark transformations and actions to interact with DataFrames to display and process data.
  • Configured HBase and YARN settings to build connections between HBase and the task manager so that adequate tasks were assigned to the HBase nodes.
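
A hedged sketch of the REST-to-Kafka collection step described above, using the requests and kafka-python libraries; the endpoint URL, broker address, and topic name are placeholders rather than actual project values.

```python
# Poll a REST endpoint over HTTPS and forward the JSON response to a Kafka topic.
import json
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],                      # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # serialize records as JSON bytes
)

def poll_and_publish(url, topic):
    """Send a GET request and publish each record of the JSON response to Kafka."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    for record in response.json():         # assumes the API returns a JSON array of records
        producer.send(topic, value=record)
    producer.flush()

if __name__ == "__main__":
    poll_and_publish("https://api.example.com/events", "events-raw")  # placeholder values
```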

Cloud Engineer

Confidential, Oak Brook, IL

Responsibilities:

  • Worked as part of the Big Data Engineering team to design and develop data pipelines in an Azure environment using ADL Gen2, Blob Storage, ADF, Azure Databricks, Azure SQL, Azure Synapse for analytics and MS Power BI for reporting.
  • Created Spark jobs on Databricks and AWS infrastructure using Python as the programming language.
  • Ingested data into AWS S3 buckets via Python scripts (a minimal sketch follows this list).
  • Applied Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Visual Studio Online (VSO) and SQL Azure.
  • Used Spark SQL and DataFrames API to load structured and semi-structured data into Spark Clusters.
  • Loaded and transformed large sets of structured and semi-structured data using AWS Glue.
  • Created and maintained a data warehouse in AWS Redshift.
  • Implemented a parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational KSQL in Confluent Kafka.
  • Migrated data from Hortonworks cluster to Amazon EMR cluster.
  • Developed Kafka Producer/Consumer scripts to process JSON responses in Python.
  • Implemented different instance profiles and roles in IAM to connect tools in AWS.
  • Applied knowledge of AWS tools such as Glue, EMR, S3, Lambda, Redshift, and Athena.
  • Implemented AWS Fully Managed Kafka streaming to send data streams from the company APIs to Spark cluster in AWS Databricks.
  • Optimized Hive analytics SQL queries, created tables/views, wrote custom queries and Hive-based exception process.
  • Streamed data from AWS Fully Managed Kafka brokers using Spark Streaming and processed the data using explode transformations.
  • Finalized the data pipeline using DynamoDB as a NoSQL storage option.
  • Developed consumer intelligence reports based on market research, data analytics, and social media.
  • Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.
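
A minimal sketch of the S3 ingestion step mentioned above, assuming the boto3 library; the bucket name, object key, and file path are placeholders, and credentials are expected to come from the environment or an attached IAM role as described in the IAM bullet.

```python
# Upload a local extract into an S3 bucket's raw-data prefix.
import boto3

s3 = boto3.client("s3")  # picks up credentials from the environment / instance profile

def upload_extract(local_path, bucket, key):
    """Push one local file to S3 so downstream Glue/EMR jobs can read it."""
    s3.upload_file(local_path, bucket, key)

if __name__ == "__main__":
    upload_extract("daily_extract.json",
                   "example-raw-data-bucket",            # placeholder bucket
                   "raw/2021/06/01/daily_extract.json")  # placeholder key
```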

Big Data Engineer

Confidential, Modesto, CA

Responsibilities:

  • Created a Kafka cluster that used a schema to send structured data via micro-batching.
  • Developed a Spark Streaming service to receive real-time data for visualization in Kibana.
  • Played a key role in the installation and configuration of various Big Data ecosystem tools such as Kafka, Cassandra, and the ELK stack (Elasticsearch, Logstash, and Kibana).
  • Implemented applications on Hadoop/Spark on Kerberos secured cluster.
  • Streamed processed data to RedShift using EMR/Spark to make it available for visualization and report generation by the BI team.
  • Used Kinesis with Spark streaming for high-speed data processing.
  • Moved data via Kinesis to the EMR cluster, where it was set to go live on the application.
  • Applied the latest development approaches, including applications in Spark using Scala.
  • Wrote Spark SQL scripts for real-time processing of structured data with Spark Streaming through micro-batching (see the sketch after this list).
  • Automated, configured, and deployed instances in AWS and Azure environments and in data centers.
  • Performed log monitoring and generated visual representations of logs using ELK stack.
  • Coordinated Kafka operation and monitoring with DevOps personnel.
  • Balanced the throughput impact between Kafka producers and Kafka consumer message (topic) consumption.
  • Implemented Hadoop cluster automation using Docker containers.
  • Streamed prepared data to DynamoDB using Spark to make it accessible for visualization and report generation by the BI team.
  • Built Hive views on top of the source data tables.
  • Utilized HiveQL to query the data to discover trends in the data.
  • Provided connections from Business Intelligence tools such as Tableau and Power BI to the tables in the data warehouse.
  • Integrated Spark code into the SDLC with the CI/CD pipeline using Jenkins CI with GitLab.
  • Worked with Big Data distribution Cloudera.
  • Performed Big Data processing using Hadoop, MapReduce, Sqoop, Oozie, and Impala.
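
An illustrative PySpark Structured Streaming job for the Kafka micro-batching pipeline described above; the broker address, topic name, and event schema are assumptions, the console sink stands in for the Redshift/DynamoDB targets mentioned in the bullets, and the spark-sql-kafka connector is assumed to be on the cluster classpath.

```python
# Read JSON events from Kafka in micro-batches and expose them as a structured stream.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Assumed schema of the JSON events produced upstream.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "events-raw")                 # placeholder topic
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .outputMode("append")
         .format("console")  # stand-in sink for illustration only
         .start())
query.awaitTermination()
```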

Hadoop Developer

Confidential, SeaTac, WA

Responsibilities:

  • Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.
  • Connected various data centers and transferred data between them using Sqoop and various ETL tools.
  • Developed InfoPath forms allowing programmatic submission to a SharePoint Form Library and initiated the associated workflow processes.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data.
  • Used different file formats such as text files, SequenceFiles, and Avro.
  • Loaded data from various data sources into HDFS using Kafka.
  • Tuned and operated Spark and related technologies such as Spark SQL and Spark Streaming.
  • Used shell scripts to dump the data from MySQL to HDFS.
  • Used NoSQL databases like MongoDB in implementation and integration.
  • Streamed analyzed data to Hive tables using Sqoop, making it available for visualization and report generation by the BI team.
  • Configured the Oozie workflow engine scheduler to run multiple Hive, Sqoop, and Pig jobs.
  • Cleansed and preprocessed data implementing map-reduce jobs in a multi-node Hadoop cluster.
  • Performed aggregation functions with SQL.
  • Created a benchmark between Cassandra and HBase for fast ingestion.
  • Loaded data from legacy warehouses onto HDFS using Sqoop.
  • Created and managed sites, site collections, lists, form templates, documents, and form libraries.
  • Worked on SharePoint Designer and InfoPath Designer and developed workflows and forms.
  • Built data pipeline using MapReduce scripts and Hadoop commands to store onto HDFS.
  • Used Oozie to orchestrate the MapReduce jobs that extracted the data on schedule.
  • Built graphs and plots using Python libraries such as pyplot to visualize data (a minimal sketch follows this list).
  • Integrated Kafka with Spark Streaming for real-time data processing.
  • Imported data from disparate sources into Spark RDD for processing.
  • Built a prototype for real-time analysis using Spark streaming and Kafka.
  • Transferred data using Informatica tool from AWS S3.
  • Used AWS Redshift to store data in the cloud.
  • Developed and maintained continuous integration systems in a Cloud computing environment (Google Cloud Platform (GCP)).
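
A minimal, illustrative pyplot example of the visualization work mentioned above; the daily ingestion counts are made-up values used purely to show the plotting pattern, not real project metrics.

```python
# Bar chart of (hypothetical) daily row counts ingested into HDFS.
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
rows_ingested = [1.2e6, 1.4e6, 0.9e6, 1.6e6, 1.3e6]  # illustrative values only

plt.figure(figsize=(6, 3))
plt.bar(days, rows_ingested)
plt.title("Rows ingested into HDFS per day (illustrative)")
plt.ylabel("Rows")
plt.tight_layout()
plt.savefig("daily_ingestion.png")  # write the chart to disk for reporting
```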

Linux Systems Administrator

Confidential, Seattle, WA

Responsibilities:

  • Worked with the DBA team on database performance issues and network-related issues on Linux/UNIX servers, and with vendors regarding hardware-related issues.
  • Analyzed and monitored log files to troubleshoot issues.
  • Installed, configured, monitored, and administrated Linux servers.
  • Configured and installed Red Hat and CentOS Linux servers on virtual machines and bare-metal installations.
  • Wrote Python scripts to automate build and deployment processes (a minimal sketch follows this list).
  • Utilized Nagios-based open-source monitoring tools to monitor Linux Cluster nodes.
  • Created users, managed user permissions, maintained user and file system quotas, and installed and configured DNS.
  • Monitored CPU, memory, hardware, and software including raid, physical disk, multipath, filesystems, and networks using Nagios monitoring tool.
  • Performed kernel and database configuration optimization such as I/O resource usage on disks.
  • Created and modified users and groups with root permissions.
  • Administered local and remote servers using SSH on a daily basis.
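
A hedged sketch of the kind of build-and-deploy automation script described above; the commands, paths, and target host are placeholders, not details from the actual environment.

```python
# Run a simple build-and-deploy sequence, stopping at the first failed step.
import subprocess
import sys

STEPS = [
    ["git", "pull", "--ff-only"],                          # refresh the working copy
    ["make", "build"],                                     # assumed build entry point
    ["rsync", "-az", "build/", "deploy@app01:/opt/app/"],  # placeholder deploy target
]

def run(step):
    print("Running:", " ".join(step))
    result = subprocess.run(step)
    if result.returncode != 0:
        sys.exit("Step failed: " + " ".join(step))

if __name__ == "__main__":
    for step in STEPS:
        run(step)
```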
