Big Data Engineer Resume
Baltimore, Maryland
SUMMARY
- 7 years’ experience in Big Data development using Apache Spark, Hive, Apache Kafka, and Hadoop.
- 9 years’ total IT/software/database design/development/deployment/support experience (7 years in Big Data and 2 years in software/IT data systems).
- Experienced with Big Data technologies such as Amazon Web Services (AWS), Microsoft Azure, Apache Kafka, Python, Apache Spark, Hive, and Hadoop.
- Experienced in analyzing Microsoft SQL Server data models and in identifying and creating the inputs needed to convert existing dashboards that use Excel as a data source.
- Applied Python-based design and development to multiple projects.
- Created PySpark DataFrames on multiple projects and tied them into Kafka.
- Configured Hadoop and Apache Spark in Big Data environments.
- Built AWS CloudFormation templates used alongside Terraform with existing plugins.
- Developed AWS CloudFormation templates to create custom pipeline infrastructure (a brief sketch follows this summary).
- Implemented AWS IAM user roles and policies to authenticate and control user access.
- Applied expertise in designing custom reports with data extraction and reporting tools and in developing algorithms based on business cases.
- Performance-tuned data-heavy dashboards and reports using options such as extracts, context filters, efficient calculations, data source filters, and indexing and partitioning in the data source.
- Wrote SQL queries for data validation of reports and dashboards.
- Worked with Data Lakes and Big Data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).
- Proven success working on different Big Data technology teams operating within an Agile/Scrum project methodology.
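The CloudFormation and pipeline bullets above describe infrastructure-as-code work; the sketch below shows, in Python with boto3, how such a pipeline stack might be created programmatically. It is a minimal illustration: the stack name, template file, and wait step are assumptions, not details taken from any of the projects listed here.

```python
# Minimal sketch, assuming boto3 is installed and AWS credentials are configured.
# The stack name and template path are hypothetical placeholders.
import boto3

def deploy_pipeline_stack(stack_name: str, template_path: str) -> str:
    """Create a CloudFormation stack for a data pipeline from a local template."""
    cfn = boto3.client("cloudformation")
    with open(template_path) as f:
        template_body = f.read()
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        # Required when the template creates IAM roles/policies, as in the IAM bullet above.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    # Block until creation finishes so callers know the pipeline infrastructure exists.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]

if __name__ == "__main__":
    print(deploy_pipeline_stack("etl-pipeline-stack", "pipeline.yaml"))
```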
TECHNICAL SKILLS
File Formats: Parquet, Avro, JSON, ORC, Text, CSV.
Apache: Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala.
Operating Systems: Unix/Linux, Windows 10, Ubuntu, Apple OS.
Scripting: HiveQL, MapReduce, XML, FTP, Python, UNIX/Linux shell scripting.
Distributions: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK).
Data Processing (Compute) Engines: Apache Spark, Spark Streaming, Flink, Storm.
Data Visualization Tools: Pentaho, QlikView, Tableau, Power BI, Matplotlib.
Databases & Data Structures: Microsoft SQL Server (2005, 2008 R2, 2012), Apache Cassandra, Amazon Redshift, Amazon DynamoDB, Apache HBase, Apache Hive, MongoDB.
PROFESSIONAL EXPERIENCE
Confidential, Baltimore, Maryland
Big Data Engineer
Responsibilities:
- Created a PySpark streaming job to receive real-time data from Kafka (see the sketch at the end of this role).
- Defined Spark data schema and set up development environment inside the cluster.
- Designed a Spark Python job to consume information from S3 buckets using Boto3.
- Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.
- Created a pipeline to gather data using Pyspark, Kafka, and HBase.
- Used Spark streaming to receive real-time data using Kafka.
- Worked with unstructured data and parsed out the information using Python built-in functions.
- Configured a Python API Producer file to ingest data from the Slack API, using Kafka for real-time processing with Spark.
- Processed data with the Natural Language Toolkit (NLTK) to count important words and generate word clouds.
- Started and configured master and slave nodes for Spark.
- Set up cloud compute engine instances in managed and unmanaged modes and handled SSH key management.
- Worked on virtual machines to run pipelines on a distributed system.
- Led presentations about the Hadoop ecosystem, best practices, and data architecture in Hadoop.
- Managed Hive connections with tables, databases, and external tables.
- Installed Hadoop from the terminal and applied the required configurations.
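A minimal sketch of the kind of PySpark streaming job described in the bullets above: it subscribes to a Kafka topic, applies a defined schema, and writes the parsed stream out. Broker addresses, the topic name, and the schema fields are hypothetical placeholders.

```python
# Minimal sketch of a PySpark Structured Streaming job reading real-time data from Kafka.
# Requires the spark-sql-kafka package on the Spark classpath; names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Schema for the JSON payload carried in each Kafka message value.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Subscribe to the Kafka topic; each row arrives with binary key/value columns.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "events")
    .load()
)

# Parse the JSON value into typed columns using the schema defined above.
events = raw.select(from_json(col("value").cast("string"), event_schema).alias("e")).select("e.*")

# Console sink keeps the sketch self-contained; a real job would write to HBase, S3, etc.
query = events.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```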
Confidential, Cincinnati, OH
AWS Data Engineer
Responsibilities:
- Created and managed cloud VMs with the AWS EC2 command-line interface and the AWS Management Console.
- Used Apache Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.
- Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
- Used Ansible Python Script to generate inventory and push the deployment to AWS Instances.
- Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.
- Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) and AWS Redshift for storage.
- Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 (a sketch follows this role's bullets).
- Populated database tables via AWS Kinesis Firehose and AWS Redshift.
- Automated the installation of the ELK agent (Filebeat) with an Ansible playbook; developed a Kafka queue system to collect log data without data loss and publish it to various sources.
- Applied AWS CloudFormation templates for Terraform with existing plugins.
- Developed AWS CloudFormation templates to create custom pipeline infrastructure.
- Implemented AWS IAM user roles and policies to authenticate and control access.
- Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.
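A minimal sketch of a Lambda function of the kind described above: triggered by S3 object-created events and recording each object in a DynamoDB table. The table name and item attributes are hypothetical.

```python
# Minimal sketch of an AWS Lambda handler for S3 object-created events.
# The DynamoDB table name and key schema are hypothetical placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingested-objects")  # assumed table with "object_key" as partition key

def handler(event, context):
    """Record each newly created S3 object in DynamoDB."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        table.put_item(Item={
            "object_key": key,
            "bucket": bucket,
            "size_bytes": size,
        })
    return {"processed": len(records)}
```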
Confidential, Houston, Texas
Hadoop Engineer
Responsibilities:
- Integrated Kafka with Spark Streaming for real-time processing of logistics data.
- Used shell scripts to migrate the data between Hive, HDFS and MySQL.
- Installed and configured an HDFS cluster for Big Data extraction, transformation, and loading.
- Utilized Zookeeper and Spark interface for monitoring proper execution of Spark Streaming.
- Configured Linux on multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration.
- Created a pipeline to gather data using Pyspark, Kafka and HBase.
- Sent requests to source REST Based API from a Scala script via Kafka producer.
- Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.
- Hands-on with Spark Core, SparkSession, Spark SQL, and the DataFrame/Dataset/RDD APIs, using Spark jobs and the DataFrame API to load structured data into Spark clusters.
- Created a Kafka broker that uses the schema to fetch structured data in structured streaming.
- Defined Spark data schema and set up development environment inside the cluster.
- Interacted with data residing in HDFS using PySpark to process it (see the sketch after these bullets).
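A minimal sketch of processing HDFS-resident data with PySpark and landing it in Hive, with a summary pushed toward MySQL over JDBC, in the spirit of the bullets above. All paths, table names, and connection details are hypothetical.

```python
# Minimal sketch: read raw files from HDFS, clean them, persist to Hive, summarize to MySQL.
# Paths, table names, credentials, and the JDBC URL are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-to-hive")
    .enableHiveSupport()  # lets saveAsTable write managed Hive tables
    .getOrCreate()
)

# Read delimited logistics data directly from HDFS.
shipments = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/raw/shipments/")
)

# Basic processing before loading: drop malformed rows and deduplicate on the key.
clean = shipments.dropna(subset=["shipment_id"]).dropDuplicates(["shipment_id"])

# Persist to Hive for downstream HQL queries.
clean.write.mode("overwrite").saveAsTable("logistics.shipments")

# Push a small summary to MySQL over JDBC (the MySQL driver must be on the classpath).
clean.groupBy("status").count().write.mode("overwrite").jdbc(
    url="jdbc:mysql://db-host:3306/reporting",
    table="shipment_status_counts",
    properties={"user": "etl_user", "password": "change-me", "driver": "com.mysql.cj.jdbc.Driver"},
)
```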
Confidential, New York, NY
Big Data Engineer
Responsibilities:
- Connected and ingested data using different ingestion tools such as Kafka and Flume.
- Worked on importing the received data into Hive using Spark.
- Applied HQL to query the desired data in Hive for further analysis.
- Implemented partitioning, dynamic partitions, and bucketing in Hive, which increased performance and gave the data a proper, logical organization (sketched after this list).
- Decoded the raw data and loaded it into JSON before sending the batched streaming file over the Kafka producer.
- Received the JSON response in a Kafka consumer written in Python.
- Established a connection between HBase and Spark to transfer the newly populated DataFrame.
- Designed Spark Scala job to consume information from S3 Buckets.
- Monitored background operations in Hortonworks Ambari.
- Monitored HDFS job status and the health of DataNodes according to specifications.
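A minimal sketch of the Hive partitioning and bucketing approach noted above, issued through spark.sql so the example stays in Python. The database, table, and column names are hypothetical.

```python
# Minimal sketch of a partitioned, bucketed Hive table loaded with dynamic partitioning.
# Database, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-partitioning").enableHiveSupport().getOrCreate()

# Target table: partitioned by ingest date, bucketed by customer for faster joins and sampling.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events_by_day (
        event_id    STRING,
        customer_id STRING,
        payload     STRING
    )
    PARTITIONED BY (ingest_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Dynamic partitioning lets Hive derive each partition value from the data itself.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("""
    INSERT INTO TABLE analytics.events_by_day PARTITION (ingest_date)
    SELECT event_id, customer_id, payload, ingest_date
    FROM analytics.events_staging
""")
```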
Confidential, Oklahoma City, OK
Software/IT Data Systems Programmer
Responsibilities:
- Gathered and analyzed requirements from the client and prepared a requirements specification document.
- Identified data types and wrote and ran SQL data cleansing and analysis scripts (a sketch follows this role).
- Formatted files to import and export data to a SQL Server repository.
- Applied Git to store and organize SQL queries.
- Improved the database user interface by replacing manual user input with automated inputs.
- Re-designed forms for easier access.
- Applied code modifications and wrote new scripts in Python.
- Worked with software/IT technology team to improve data integration processing.
- Reported and resolved discrepancies in a timely manner through the appropriate channels.
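A minimal sketch of running a SQL Server data-cleansing script from Python with pyodbc, in the spirit of the bullets above. The connection string, table, and cleansing rule are hypothetical.

```python
# Minimal sketch of a SQL data-cleansing pass against SQL Server via pyodbc.
# Server, database, table, and column names are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=db-server;DATABASE=client_db;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Trim stray whitespace and normalize empty strings to NULL before analysis.
cursor.execute("""
    UPDATE dbo.customers
    SET email = NULLIF(LTRIM(RTRIM(email)), '')
""")

# Quick data-quality check: how many rows still lack a usable email address?
cursor.execute("SELECT COUNT(*) FROM dbo.customers WHERE email IS NULL")
missing = cursor.fetchone()[0]
print(f"Rows missing email after cleansing: {missing}")

conn.commit()
conn.close()
```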