
Big Data Engineer Resume


Atlanta, GA

SUMMARY

  • Nearly 7 years of professional experience in Big Data, Hadoop, and cloud technologies.
  • Experienced with the Hadoop ecosystem (Cloudera, Hortonworks, Impala) and cloud computing environments such as Amazon Web Services (AWS) and Microsoft Azure.
  • Hands-on experience with the Hadoop framework and its ecosystem, including but not limited to HDFS architecture and programming, Hive, Sqoop, HBase, MongoDB, Cassandra, Oozie, Spark RDDs, Spark DataFrames, Spark Datasets, and Spark MLlib.
  • Involved in building a multi-tenant Hadoop cluster with disaster recovery.
  • Hands-on experience installing and configuring Cloudera and Hortonworks distributions.
  • Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, node configuration, YARN, MapReduce, Spark, Falcon, HBase, and Hive.
  • Develop scripts and automate end-to-end data management and synchronization across all clusters.
  • Extend Hive and Pig core functionality by writing custom UDFs.
  • Use Apache Flume to collect logs and error messages across the cluster.
  • Troubleshoot and tune applications written in SQL, Java, Python, Scala, Pig, Hive, Spark (RDDs and DataFrames), and MapReduce.
  • Translate problem statements into elegant, workable solutions.
  • Accustomed to working with large complex data sets, real-time/near real-time analytics, and distributed big data platforms.
  • Proficient in major vendor Hadoop distributions such as Cloudera and Hortonworks.
  • Deep knowledge of incremental imports and of the partitioning and bucketing concepts in Hive and Spark SQL needed for optimization (see the sketch after this list).
  • Experience collecting real-time log data from various sources, such as webserver logs and social media feeds from Facebook and Twitter, using Flume and staging it in HDFS for further analysis.
  • Experience deploying large, multi-node Hadoop and Spark clusters.
  • Experience developing Oozie workflows for scheduling and orchestrating the ETL process.
  • Solid understanding of statistical analysis, predictive analysis, machine learning, data mining, quantitative analytics, multivariate testing and optimization algorithms.
  • Proficient in mapping business requirements, use cases, scenarios, business analysis, and workflow analysis. Act as liaison between business units, technology and IT support teams.
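
For illustration, a minimal PySpark sketch of the partitioning and bucketing pattern referenced above; the table names, columns, and bucket count are hypothetical assumptions, not details from a specific engagement:

    # Minimal sketch: partitioned, bucketed output for a daily incremental load.
    # Table names, columns, and the bucket count are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partitioning-bucketing-demo")
             .enableHiveSupport()
             .getOrCreate())

    staged = spark.table("staging_sales")  # hypothetical staged source table

    # Partition by date so filters on order_date prune whole directories, and bucket by
    # customer_id so joins and aggregations on that key avoid full shuffles.
    (staged.write
     .mode("append")                       # incremental: append only the newly staged data
     .partitionBy("order_date")
     .bucketBy(32, "customer_id")
     .sortBy("customer_id")
     .format("parquet")
     .saveAsTable("sales_by_day"))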

TECHNICAL SKILLS

PROGRAMMING/SCRIPTING AND FRAMEWORKS: Java, Python, Scala, Hive, HiveQL, MapReduce, UNIX shell scripting, YARN, Spark, Spark Streaming, Kafka

DEVELOPMENT ENVIRONMENTS: IntelliJ, PyCharm, Visual Studio

AMAZON CLOUD: Amazon AWS (EMR, EC2, S3, DynamoDB, Redshift, CloudFormation)

DATABASES: NoSQL (MongoDB, Cassandra, DynamoDB), SQL (MySQL, Oracle)

HADOOP DISTRIBUTIONS: Cloudera, Hortonworks, Elastic

QUERY/SEARCH: SQL, HiveQL, Impala, Kibana, Elasticsearch

Visualization Tools: Tableau, QlikView

File Formats: Parquet, Avro, ORC, JSON

Data Pipeline Tools: Apache Airflow, Nifi

Admin Tools: Oozie, Cloudera Manager, Zookeeper, Airflow

SOFTWARE DEVELOPMENT: Agile, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Git, Jenkins, Jira

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential, Atlanta, GA

Responsibilities:

  • Developed a Spark Streaming application to pull data from cloud servers into Hive tables (see the streaming sketch after this list).
  • Used Spark SQL to process large volumes of structured data.
  • Worked with the Spark SQL context to create DataFrames and filter input data for model execution.
  • Designed and developed data pipelines in an Azure environment using ADL Gen2, Blob Storage, ADF, Azure Databricks, Azure SQL, Azure Synapse for analytics and MS PowerBI for reporting.
  • Converted existing Hive scripts to Spark applications that use RDDs to transform data and load it into HDFS.
  • Developed Hive queries used daily to retrieve datasets.
  • Ingested data from JDBC databases into Hive.
  • Developed Spark jobs using Spark SQL, Python, and Data Frames API to process structured data into Spark clusters.
  • Used Spark for transformation and preparation of DataFrames.
  • Removed and filtered unnecessary data from raw data using Spark.
  • Configured StreamSets to store the converted data in Hive using JDBC drivers.
  • Wrote Spark code to remove certain fields from the Hive table.
  • Joined multiple tables to find the correct information for certain addresses.
  • Wrote code to standardize string and IP addresses over datasets.
  • Used Hadoop as a data lake to store large amounts of data.
  • Developed Oozie workflows and ran Oozie jobs to execute tasks in parallel.
  • Used Oozie to update Hive tables automatically.
  • Wrote unit tests for all code using different frameworks like PyTest.
  • Wrote code according to the schema change of the source table based on the data warehouse dimensional modeling.
  • Created a geospatial index, performed geospatial searches, and populated a result column with Y/N based on the distance found (see the distance-check sketch after this list).
  • Used Scala and Spark SQL for faster testing and processing of data.
  • Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and AWS Redshift.
  • Developed test cases for features deployed for Spark code and UDFs.
  • Optimized JDBC connections with bulk upload for Hive and Spark imports.
  • Handled defects from internal testing tools to increase code coverage over Spark.
  • Hands-on with EC2, CloudWatch, CloudFormation, and managing security groups on AWS.
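
For illustration, a minimal PySpark Structured Streaming sketch of the kind of cloud-server-to-Hive ingest described above; the landing path, schema, checkpoint location, and output path are hypothetical assumptions:

    # Minimal sketch: stream newly arriving JSON files into a partitioned Parquet location
    # that a Hive external table can be defined over. All paths and names are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = (SparkSession.builder
             .appName("cloud-logs-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    schema = StructType([
        StructField("event_time", TimestampType()),
        StructField("host", StringType()),
        StructField("payload", StringType()),
    ])

    events = (spark.readStream
              .schema(schema)
              .json("/landing/cloud_events/"))   # assumed landing directory

    (events.writeStream
     .outputMode("append")
     .partitionBy("host")
     .format("parquet")
     .option("path", "/warehouse/cloud_events")                # Hive external table location
     .option("checkpointLocation", "/checkpoints/cloud_events")
     .start()
     .awaitTermination())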
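
The Y/N distance flag mentioned above can be sketched with Spark SQL functions alone; the column names, the Atlanta reference point, and the 10 km threshold are illustrative assumptions:

    # Minimal sketch: flag rows whose coordinates fall within a radius of a reference point
    # using a haversine distance built from Spark SQL column expressions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("geo-flag").getOrCreate()

    REF_LAT, REF_LON, RADIUS_KM = 33.7490, -84.3880, 10.0    # assumed reference point and radius

    addresses = spark.read.parquet("/warehouse/addresses")   # assumed input with lat/lon columns

    lat1, lon1 = F.radians(F.lit(REF_LAT)), F.radians(F.lit(REF_LON))
    lat2, lon2 = F.radians(F.col("lat")), F.radians(F.col("lon"))
    a = (F.pow(F.sin((lat2 - lat1) / 2), 2)
         + F.cos(lat1) * F.cos(lat2) * F.pow(F.sin((lon2 - lon1) / 2), 2))
    distance_km = F.lit(2 * 6371.0) * F.asin(F.sqrt(a))      # great-circle distance in km

    flagged = addresses.withColumn(
        "within_radius",
        F.when(distance_km <= RADIUS_KM, F.lit("Y")).otherwise(F.lit("N")),
    )
    flagged.write.mode("overwrite").parquet("/warehouse/addresses_flagged")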

Big Data Engineer

Confidential, Columbus, OH

Responsibilities:

  • Hands-on with AWS data migration between database platforms, moving local SQL Server databases to Amazon RDS and EMR Hive.
  • Optimized Hive analytics SQL queries, created tables and views, and wrote custom queries and Hive-based exception processes.
  • Implemented AWS fully managed Kafka streaming to send data streams from the company APIs to a Spark cluster in AWS Databricks.
  • Developed consumer intelligence reports based on market research, data analytics, and social media.
  • Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.
  • Worked on AWS to create and manage EC2 instances and Hadoop clusters.
  • Implemented a Cloudera Hadoop distribution cluster on AWS EC2.
  • Deployed the big data Hadoop application using Talend on AWS Cloud.
  • Utilized AWS Redshift to store terabytes of data in the cloud.
  • Used Spark SQL and DataFrames API to load structured and semi-structured data into Spark Clusters.
  • Wrote shell scripts to move log files to the Hadoop cluster through automated processes.
  • Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.
  • Ingested large data streams from company REST APIs into the EMR cluster through AWS Kinesis.
  • Streamed data from AWS fully managed Kafka brokers using Spark Streaming and processed the data with explode transformations (see the sketch after this list).
  • Finalized the data pipeline using DynamoDB as the NoSQL storage option (see the DynamoDB write sketch after this list).
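
For illustration, a minimal Structured Streaming sketch of consuming a managed Kafka topic and flattening a nested array with explode; the broker address, topic name, message schema, and output paths are hypothetical assumptions:

    # Minimal sketch: read a Kafka topic, parse JSON, and explode a nested array into rows.
    # Broker, topic, schema, and paths are assumptions.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.appName("managed-kafka-consumer").getOrCreate()

    payload_schema = StructType([
        StructField("request_id", StringType()),
        StructField("events", ArrayType(StringType())),   # nested array to flatten
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker-1.example.com:9092")  # hypothetical broker
           .option("subscribe", "api-events")                               # hypothetical topic
           .load())

    parsed = raw.select(F.from_json(F.col("value").cast("string"), payload_schema).alias("msg"))

    # explode() emits one output row per element of the events array.
    flattened = parsed.select("msg.request_id", F.explode("msg.events").alias("event"))

    (flattened.writeStream
     .outputMode("append")
     .format("parquet")
     .option("path", "/mnt/lake/api_events")
     .option("checkpointLocation", "/mnt/lake/_checkpoints/api_events")
     .start()
     .awaitTermination())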
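
A hedged boto3 sketch of the final DynamoDB write; the table name, partition key, and item fields are illustrative assumptions:

    # Minimal sketch: batch-write processed pipeline records into a DynamoDB table.
    # Table name, partition key, and item fields are assumptions.
    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.Table("pipeline_results")   # hypothetical table

    def write_records(records):
        """Each record must carry the table's partition key (assumed to be record_id)."""
        with table.batch_writer() as batch:
            for record in records:
                batch.put_item(Item=record)

    write_records([
        {"record_id": "r-001", "source": "api-events", "status": "processed"},
        {"record_id": "r-002", "source": "api-events", "status": "processed"},
    ])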

Data Engineer

Confidential, Detroit, MI

Responsibilities:

  • Managed, configured, tuned, and ran continuous deployment of 100+ Hadoop nodes on Red Hat Enterprise Linux 5.
  • Performed daily Sqoop incremental loads from a MySQL database into HDFS, wrote the data into partitioned Hive tables, and ran analytics on those tables.
  • Wrote custom Hive UDFs in Java and used them in HiveQL queries.
  • Read server log files using the Hive regex SerDe, parsed them into Hive tables, and analyzed their info and error messages (see the SerDe sketch after this list).
  • Imported and exported Hive data from one cluster to another.
  • Built Sqoop jobs to import data from SQL databases into Hive tables.
  • Configured two medium-scale AMI instances for the NameNodes via the AWS console.
  • Developed Python scripts to automate the workflow processes and generate reports.
  • Developed a task execution framework on EC2 instances using SQL and DynamoDB.
  • Stored non-relational data on HBase.
  • Created alter, insert, and delete queries involving lists, sets, and maps in DataStax Cassandra (see the CQL sketch after this list).
  • Compared Impala processing times with Apache Hive for batch applications and adopted Impala for the project.
  • Utilized Impala to read, write, and query Hadoop data in HDFS.
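
For illustration, a sketch of defining a regex SerDe table over raw server logs and pulling out error messages through Spark SQL with Hive support; the log location, regex pattern, and table name are hypothetical assumptions:

    # Minimal sketch: Hive regex SerDe table over raw server logs, queried for ERROR lines.
    # Log path, table name, and the input.regex pattern are assumptions.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("server-log-serde")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql(r"""
        CREATE EXTERNAL TABLE IF NOT EXISTS server_logs (
            log_time STRING,
            level    STRING,
            message  STRING
        )
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
        WITH SERDEPROPERTIES (
            'input.regex' = '^(\\S+ \\S+) \\[(\\w+)\\] (.*)$'
        )
        STORED AS TEXTFILE
        LOCATION '/data/raw/server_logs'
    """)

    errors = spark.sql("SELECT log_time, message FROM server_logs WHERE level = 'ERROR'")
    errors.show(truncate=False)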
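
A hedged sketch of the Cassandra collection operations using the DataStax Python driver; the contact point, keyspace, table layout, and values are illustrative assumptions:

    # Minimal sketch: insert, update, and delete operations on list, set, and map columns.
    # Contact point, keyspace, table, and values are assumptions.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # assumed contact point
    session = cluster.connect("analytics")    # assumed keyspace

    session.execute("""
        CREATE TABLE IF NOT EXISTS user_profiles (
            user_id text PRIMARY KEY,
            emails  set<text>,
            visits  list<text>,
            prefs   map<text, text>
        )
    """)

    # Insert a row with set, list, and map values.
    session.execute(
        "INSERT INTO user_profiles (user_id, emails, visits, prefs) VALUES (%s, %s, %s, %s)",
        ("u1", {"u1@example.com"}, ["2020-01-01"], {"theme": "dark"}),
    )

    # Update the collections in place: add to the set, append to the list, add a map entry.
    session.execute(
        "UPDATE user_profiles SET emails = emails + %s, visits = visits + %s, "
        "prefs = prefs + %s WHERE user_id = %s",
        ({"alt@example.com"}, ["2020-02-01"], {"lang": "en"}, "u1"),
    )

    # Delete a single map entry for the row.
    session.execute("DELETE prefs['theme'] FROM user_profiles WHERE user_id = %s", ("u1",))

    cluster.shutdown()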

Hadoop Data Engineer

Confidential, New York, NY

Responsibilities:

  • Assisted in building out a managed cloud infrastructure with improved systems and analytics capability.
  • Researched various available technologies, industry trends, and cutting-edge applications.
  • Designed and set up POCs to test various tools, technologies, and configurations, along with custom applications.
  • Used Oozie to automate/schedule business workflows which invoked Sqoop and Pig jobs.
  • Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows.
  • Used the Hive JDBC to verify the data stored in the Hadoop cluster.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data.
  • Used different file formats like Text files, Sequence Files, Avro.
  • Loaded data from various data sources into HDFS using Kafka.
  • Tuned and operated Spark and related technologies such as Spark SQL.
  • Used shell scripts to dump the data from MySQL to HDFS.
  • Used NoSQL database MongoDB in implementation and integration.
  • Streamed the analyzed data to Hive tables using Storm, making it available to the BI team for visualization and report generation.
  • Used instance image files to create new instances with Hadoop installed and running.
  • Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.
  • Implemented GCP Dataproc to create a cloud cluster and run Hadoop jobs.
  • Connected various data centers and transferred data between them using Sqoop and various ETL tools.
  • Extracted data from RDBMS sources (Oracle, MySQL) into HDFS using Sqoop (a Spark JDBC sketch of the same extract appears after this list).
  • Worked with the client to reduce churn rate by reading and translating data from social media websites.
  • Collected the business requirements from the subject matter experts like data scientists and business partners.
  • Configured Oozie workflow engine scheduler to run multiple Hive, Sqoop, and Pig jobs.
  • Consumed data from Kafka queues using Storm.
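
The RDBMS-to-HDFS extraction above was done with Sqoop; as a hedged illustration of the same data movement, here is a minimal Spark JDBC sketch. The hostname, credentials, table, partition bounds, and output path are assumptions, and the MySQL JDBC driver must be on the classpath:

    # Minimal sketch: parallel JDBC read from MySQL landed in HDFS as partitioned Parquet.
    # Connection details, table, bounds, and paths are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-to-hdfs").getOrCreate()

    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://db.example.com:3306/sales")  # hypothetical database
              .option("dbtable", "orders")
              .option("user", "etl_user")
              .option("password", "********")
              .option("fetchsize", "10000")           # stream rows instead of buffering them all
              .option("partitionColumn", "order_id")  # parallel reads across 8 ranges
              .option("numPartitions", "8")
              .option("lowerBound", "1")
              .option("upperBound", "10000000")
              .load())

    # Land the extract in HDFS, partitioned by order date for downstream Hive tables.
    (orders.write
     .mode("overwrite")
     .partitionBy("order_date")
     .parquet("hdfs:///data/raw/sales/orders"))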
