Big Data Engineer Resume

Baltimore, Maryland

SUMMARY

  • 7 years of experience in Big Data development applying Apache Spark, Hive, Apache Kafka, and Hadoop.
  • 9 years of total IT experience across software, database design, development, deployment, and support (7 years in Big Data and 2 years in software/IT data systems).
  • Experienced with related Big Data technologies such as Amazon Web Services (AWS), Microsoft Azure, Apache Kafka, Python, Apache Spark, Hive, and Hadoop.
  • Experienced in analyzing Microsoft SQL Server data models and identifying and creating inputs to convert existing dashboards that use Excel as a data source.
  • Applied Python-based design and development to multiple projects.
  • Created PySpark DataFrames on multiple projects and tied them into Kafka.
  • Configured Hadoop and Apache Spark clusters for Big Data workloads.
  • Built AWS CloudFormation templates used alongside Terraform and its existing plugins.
  • Developed AWS CloudFormation templates to create custom pipeline infrastructure (a minimal sketch follows this list).
  • Implemented AWS IAM user roles and policies to authenticate and control user access.
  • Applied expertise in designing custom reports with data extraction and reporting tools and in developing algorithms based on business cases.
  • Performance-tuned data-heavy dashboards and reports using extracts, context filters, efficient calculations, data source filters, and indexing and partitioning in the data source.
  • Wrote SQL queries for data validation of reports and dashboards.
  • Worked with Data Lakes and Big Data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).
  • Proven success working on different Big Data technology teams operating within an Agile/Scrum project methodology.
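
Illustrative only: a minimal Python/Boto3 sketch of the kind of CloudFormation stack creation described above. The stack name, template file, and region are hypothetical placeholders rather than details from any project listed here.

```python
# Minimal sketch, assuming Boto3 is installed and AWS credentials are configured.
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")  # hypothetical region

# Read a (hypothetical) CloudFormation template describing the pipeline infrastructure.
with open("pipeline_stack.yaml") as f:
    template_body = f.read()

response = cloudformation.create_stack(
    StackName="data-pipeline-stack",        # hypothetical stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required when the template creates IAM roles
)
print("Created stack:", response["StackId"])
```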

TECHNICAL SKILLS

File Formats: Parquet, Avro, JSON, ORC, Text, CSV.

Apache: Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala.

Operating Systems: Unix/Linux, Windows 10, Ubuntu, macOS.

Scripting: HiveQL, MapReduce, XML, FTP, Python, UNIX/Linux shell scripting.

Distributions: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK).

Data Processing (Compute) Engines: Apache Spark, Spark Streaming, Apache Flink, Apache Storm.

Data Visualization Tools: Pentaho, QlikView, Tableau, Power BI, Matplotlib.

Databases & Data Structures: Microsoft SQL Server (2005, 2008 R2, 2012), Apache Cassandra, Amazon Redshift, Amazon DynamoDB, Apache HBase, Apache Hive, MongoDB.

PROFESSIONAL EXPERIENCE

Confidential, Baltimore, Maryland

Big Data Engineer

Responsibilities:

  • Created a PySpark streaming job to receive real-time data from Kafka (a minimal sketch follows this list).
  • Defined the Spark data schema and set up the development environment inside the cluster.
  • Designed a Spark Python job to consume data from S3 buckets using Boto3.
  • Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.
  • Created a pipeline to gather data using PySpark, Kafka, and HBase.
  • Used Spark Streaming to receive real-time data from Kafka.
  • Worked with unstructured data and parsed out the relevant information using Python built-in functions.
  • Configured a Python API producer to ingest data from the Slack API, using Kafka for real-time processing with Spark.
  • Processed data with the Natural Language Toolkit (NLTK) to count important words and generate word clouds.
  • Started and configured master and worker nodes for Spark.
  • Set up cloud compute engine instances in managed and unmanaged mode and handled SSH key management.
  • Worked on virtual machines to run pipelines on a distributed system.
  • Led presentations about the Hadoop ecosystem, best practices, and data architecture in Hadoop.
  • Managed Hive connections with tables, databases, and external tables.
  • Installed Hadoop from the terminal and set its configuration.
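
A minimal sketch of the kind of PySpark streaming job described above, assuming Spark with the spark-sql-kafka connector on the classpath; the broker address, topic name, and schema are hypothetical.

```python
# Minimal sketch: read a Kafka topic as a streaming DataFrame and parse its JSON payload.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Hypothetical message schema for the incoming JSON records.
schema = StructType([
    StructField("user", StringType()),
    StructField("message", StringType()),
    StructField("ts", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "slack-events")                # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# Write parsed records to the console for inspection; a real job would write to HBase or HDFS.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```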

Confidential, Cincinnati, OH

AWS Data Engineer

Responsibilities:

  • Created and managed cloud VMs with the AWS EC2 command-line interface and the AWS Management Console.
  • Used the Apache Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.
  • Added support for Amazon S3 and RDS to host static/media files and the database in the Amazon cloud.
  • Used an Ansible Python script to generate inventory and push deployments to AWS instances.
  • Executed Hadoop/Spark jobs on AWS EMR against data stored in S3 buckets.
  • Implemented Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), and AWS Redshift.
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 (a minimal sketch follows this list).
  • Populated database tables via AWS Kinesis Firehose and AWS Redshift.
  • Automated the installation of the ELK agent (Filebeat) with an Ansible playbook; developed a Kafka queue system to collect log data without data loss and publish it to various sources.
  • Applied AWS CloudFormation templates alongside Terraform and its existing plugins.
  • Developed AWS CloudFormation templates to create the pipeline's custom infrastructure.
  • Implemented AWS IAM user roles and policies to authenticate and control access.
  • Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.
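
A minimal sketch of a Python Lambda handler of the kind described above, triggered by S3 object-created events; the bucket contents and downstream processing are hypothetical, and the function's IAM role is assumed to allow s3:GetObject.

```python
# Minimal sketch of an AWS Lambda handler for S3 put events.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read()
        # Placeholder: real processing (e.g., writing to DynamoDB or Redshift) would go here.
        print(json.dumps({"bucket": bucket, "key": key, "bytes": len(payload)}))
    return {"status": "ok"}
```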

Confidential, Houston, Texas

Hadoop Engineer

Responsibilities:

  • Integrated Kafka with Spark Streaming for real-time processing of logistics data.
  • Used shell scripts to migrate data between Hive, HDFS, and MySQL.
  • Installed and configured an HDFS cluster for Big Data extraction, transformation, and loading.
  • Utilized Zookeeper and the Spark interface to monitor proper execution of Spark Streaming.
  • Configured Linux on multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration.
  • Created a pipeline to gather data using PySpark, Kafka, and HBase.
  • Sent requests to a source REST-based API from a Scala script via a Kafka producer.
  • Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.
  • Hands-on with Spark Core, SparkSession, Spark SQL, and the DataFrame/Dataset/RDD APIs, using Spark jobs and the DataFrame API to load structured data into Spark clusters (a minimal sketch follows this list).
  • Set up a Kafka broker that uses the defined schema to supply structured data for Structured Streaming.
  • Defined Spark data schema and set up development environment inside the cluster.
  • Processed data residing in HDFS using PySpark.
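
A minimal sketch of loading structured data from HDFS with PySpark and querying it through Spark SQL, as referenced above; the namenode address, file path, and column names are hypothetical.

```python
# Minimal sketch: load Parquet data from HDFS into a DataFrame and query it with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-load").enableHiveSupport().getOrCreate()

# Hypothetical HDFS path holding structured logistics records.
shipments = spark.read.parquet("hdfs://namenode:8020/data/logistics/shipments")
shipments.createOrReplaceTempView("shipments")

# Aggregate with Spark SQL; event_time is a hypothetical timestamp column.
daily_counts = spark.sql("""
    SELECT to_date(event_time) AS event_date, COUNT(*) AS events
    FROM shipments
    GROUP BY to_date(event_time)
""")
daily_counts.show()
```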

Confidential, New York, NY

Big Data Engineer

Responsibilities:

  • Connected and ingested data using different ingestion tools such as Kafka and Flume.
  • Worked on importing the received data into Hive using Spark.
  • Applied HQL to query the desired data in Hive for further analysis.
  • Implemented partitioning, dynamic partitioning, and bucketing in Hive, which improved performance and gave the data a proper, logical organization.
  • Decoded the raw data and loaded it into JSON before sending the batched streaming file through the Kafka producer.
  • Received the JSON response in a Kafka consumer written in Python (a minimal sketch follows this list).
  • Established a connection between HBase and Spark to transfer the newly populated DataFrame.
  • Designed a Spark Scala job to consume data from S3 buckets.
  • Monitored background operations in Hortonworks Ambari.
  • Monitored HDFS job status and the health of DataNodes against the specifications.
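
A minimal sketch of a Python Kafka consumer receiving JSON messages, as referenced above, using the kafka-python client; the topic name and broker address are hypothetical.

```python
# Minimal sketch: consume JSON messages from a Kafka topic with kafka-python.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ingest-topic",                      # hypothetical topic
    bootstrap_servers=["broker1:9092"],  # hypothetical broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    record = message.value  # already parsed into a dict by the deserializer
    print(record)
```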

Confidential, Oklahoma City, OK

Software/IT Data Systems Programmer

Responsibilities:

  • Gathered requirements from the client, analyzed them, and prepared a requirement specification document.
  • Identified data types and wrote and ran SQL data cleansing and analysis scripts (a minimal sketch follows this list).
  • Formatted files to import and export data to a SQL Server repository.
  • Applied Git to store and organize SQL queries.
  • Improved the database's user interface by replacing manual entry with automated inputs.
  • Redesigned forms for easier access.
  • Applied code modifications and wrote new scripts in Python.
  • Worked with software/IT technology team to improve data integration processing.
  • Reported and resolved discrepancies in a timely manner through the appropriate channels.
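
A minimal sketch of the kind of SQL cleansing and validation scripting described above, driven from Python with pyodbc; the server, database, table, and rules shown are hypothetical.

```python
# Minimal sketch, assuming pyodbc and an ODBC driver for SQL Server are installed.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=reporting-db;DATABASE=Clients;Trusted_Connection=yes;"  # hypothetical server/database
)
cursor = conn.cursor()

# Example cleansing step: trim whitespace and null out empty email strings.
cursor.execute("UPDATE dbo.Contacts SET Email = NULLIF(LTRIM(RTRIM(Email)), '')")
conn.commit()

# Simple validation query: count rows that still fail a basic format check.
cursor.execute(
    "SELECT COUNT(*) FROM dbo.Contacts "
    "WHERE Email IS NOT NULL AND Email NOT LIKE '%_@_%._%'"
)
print("Rows with suspect email values:", cursor.fetchone()[0])
conn.close()
```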
