
Data Engineer Resume


SUMMARY:

  • Multiple years of IT experience across a variety of industries working with Big Data technologies, including the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
  • Fluent programming experience with Scala, Java, Python, SQL, T-SQL, and R.
  • Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
  • Adept at configuring and installing Hadoop/Spark Ecosystem Components.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN (see the sketch after this list).
  • Experience integrating various data sources such as Oracle SE2, SQL Server, flat files and unstructured files into a data warehouse.
  • Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into Data Warehouses, as well as in data processing tasks such as collecting, aggregating and moving data from various sources using Apache Flume, Kafka, PowerBI and Microsoft SSIS.
  • Hands-on experience with Hadoop architecture and various components such as Hadoop File System HDFS, Job Tracker, Task Tracker, Name Node, Data Node and Hadoop MapReduce programming.
  • Comprehensive experience in developing simple to complex MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering and data aggregation, along with detailed knowledge of the MapReduce framework.
  • Used IDEs like Eclipse, IntelliJ IDE, PyCharm IDE, Notepad ++, and Visual Studio for development.
  • Seasoned in Machine Learning algorithms and Predictive Modeling, including Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
  • Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
  • Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.
  • Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
  • Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.
  • Capable of processing large sets (Gigabytes) of structured, semi-structured or unstructured data.
  • Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
  • Experience working with GitHub/Git 2.12 source and version control systems.
  • Strong in core Java concepts including Object-Oriented Design (OOD) and Java components like Collections Framework, Exception handling, I/O system.
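
A minimal PySpark sketch of the Spark SQL and pair-RDD work summarized above (the file paths and column names are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("summary-example").getOrCreate()

# Load a flat file into a DataFrame (path and schema are illustrative).
orders = spark.read.option("header", "true").csv("/data/raw/orders.csv")

# Spark SQL: register a temporary view and aggregate with a query.
orders.createOrReplaceTempView("orders")
daily_totals = spark.sql(
    "SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date"
)

# Pair-RDD equivalent: map rows to (key, value) pairs and reduceByKey.
pair_totals = (
    orders.rdd
    .map(lambda row: (row["order_date"], float(row["amount"])))
    .reduceByKey(lambda a, b: a + b)
)

# Persist the aggregated result for downstream reporting.
daily_totals.write.mode("overwrite").parquet("/data/curated/daily_totals")
```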

TECHNICAL SKILLS:

Languages: Python 3.7.0+, Java 1.8, Scala 2.11.8+, SQL, T-SQL, R 3.5.0+, C++, C, MATLAB

Cluster Management & Monitoring: Cloudera Manager 6.0.0+, Hortonworks Ambari 2.6.0+, CloudxLab

Hadoop Ecosystem: Hadoop 2.8.4+, Spark 2.0.0+, MapReduce, HDFS, Kafka 0.11.0.1+, Hive 2.1.0+, HBase 1.4.4+, Sqoop 1.99.7+, Pig 0.17, Flume 1.6.0+, Keras 2.2.4

Databases: MySQL 5.X, SQL Server, Oracle 11g, HBase 1.2.3+, Cassandra 3.11

Visualization: PowerBI, Oracle BI, Tableau 10.0+

Virtualization: VMware Workstation, AWS

Operating Systems: Linux, Windows, Ubuntu

Markup Languages: HTML5, CSS3, JavaScript

IDEs: Eclipse, PyCharm, IntelliJ, RStudio, Visual Studio, GitHub, Maven

Other Tools: Jupyter Notebook, KNIME, MS SSMS, PuTTY, WinSCP, MS Office 365, SageMath, SEED Ubuntu, TensorFlow, NumPy

PROFESSIONAL EXPERIENCE:

Confidential

Data Engineer

Responsibilities:

  • Member of the Business Intelligence team, responsible for designing and optimizing systems.
  • Optimized the system that processes the ~500GB of logs generated by the Nexmo API platform every day and loads them into the data warehouse.
  • Designed and implemented data loading and aggregation frameworks and jobs able to handle hundreds of GBs of JSON files, using Spark, Airflow and Snowflake (see the scheduling sketch below).
  • Built tools using Tableau to allow internal and external teams to visualize and extract insights from big data platforms.
  • Responsible for expanding and optimizing data and data pipeline architecture, as well as optimizing data flow and collection for cross functional teams.
  • Built best-practice ETLs with Apache Spark to load and transform raw data into easy-to-use dimensional data for self-service reporting.
  • Improved the deployment and testing infrastructure within AWS, using tools like Jenkins, Puppet and Docker.
  • Worked closely with the Product, Infrastructure and Core teams to make sure data needs were considered during product development and to guide decisions related to data.

Environment: Scala 2.13, Spark 2.4, Spark SQL, Kafka 2.3.0, Apache Airflow 1.10.4, Snowflake, AWS (Redshift, Jenkins, Docker), Tableau 2019.2
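
A minimal sketch of how such a daily Spark aggregation job could be scheduled with Airflow 1.10 (the DAG id, paths, and spark-submit arguments are hypothetical placeholders; the Snowflake loading step is omitted):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

# Daily DAG that aggregates the previous day's JSON API logs with Spark.
with DAG(
    dag_id="daily_log_aggregation",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    aggregate_logs = BashOperator(
        task_id="aggregate_logs",
        # The jar path and main class are illustrative placeholders.
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "--class com.example.LogAggregator "
            "/opt/jobs/log-aggregator.jar {{ ds }}"
        ),
    )
```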

Spark Developer

Confidential

Responsibilities:

  • Imported required modules such as Keras and NumPy into the Spark session, and created directories for data and output.
  • Read training and test data into the data directory as well as into Spark variables for easy access, and trained the model based on a sample submission.
  • Stored all images as NumPy arrays for easier data manipulation; displayed images are likewise represented as NumPy arrays.
  • Created a validation set using Keras2DML in order to test whether the trained model was working as intended or not.
  • Defined multiple helper functions used while running the neural network in a session, and defined placeholders and the number of neurons in each layer.
  • Created the neural network's computational graph after defining weights and biases.
  • Created a TensorFlow session which is used to run the neural network as well as validate the accuracy of the model on the validation set.
  • After executing the program and achieving an acceptable validation accuracy, created a submission that is stored in the submission directory (see the TensorFlow sketch below).
  • Executed multiple SparkSQL queries after forming the database to gather specific data corresponding to an image.

Environment: Scala 2.12.8, Python 3.7.2, PySpark, Spark 2.4, Spark ML Lib, Spark SQL, TensorFlow 1.9, NumPy 1.15.2, Keras 2.2.4, PowerBI
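
A minimal TensorFlow 1.x sketch of the placeholder/weights/session workflow described above, using random NumPy arrays in place of the real image data (the layer sizes and hyper-parameters are illustrative assumptions):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, matching the environment above

# Hypothetical shapes: 28x28 images flattened to 784 features, 10 classes.
n_input, n_hidden, n_classes = 784, 128, 10

# Placeholders for image batches and one-hot labels.
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])

# Weights and biases for one hidden layer and the output layer.
w1 = tf.Variable(tf.random_normal([n_input, n_hidden]))
b1 = tf.Variable(tf.zeros([n_hidden]))
w2 = tf.Variable(tf.random_normal([n_hidden, n_classes]))
b2 = tf.Variable(tf.zeros([n_classes]))

hidden = tf.nn.relu(tf.matmul(x, w1) + b1)
logits = tf.matmul(hidden, w2) + b2

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
accuracy = tf.reduce_mean(
    tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1)), tf.float32))

# Placeholder data; in practice these come from the NumPy image arrays above.
train_x = np.random.rand(512, n_input)
train_y = np.eye(n_classes)[np.random.randint(0, n_classes, 512)]
val_x = np.random.rand(128, n_input)
val_y = np.eye(n_classes)[np.random.randint(0, n_classes, 128)]

# Run training in a session and check accuracy on the validation set.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        sess.run(train_op, feed_dict={x: train_x, y: train_y})
    print("validation accuracy:",
          sess.run(accuracy, feed_dict={x: val_x, y: val_y}))
```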

Data Engineer

Confidential

Responsibilities:

  • Designed stream processing job used by Spark Streaming which is coded in Scala.
  • Ingested information from several sources like Kafka, Flume, and TCP sockets.
  • Processed data using advanced algorithms expressed with high-level functions like map, reduce, join and window.
  • Set up VirtualBox to gain access to a Linux environment, and Vagrant to provision and install the software required to run the Spark job.
  • Extracted the packaged job to a deployment folder and then deployed it to YARN, which handles scheduling and resource management.
  • Fed inbound events into the Scala-project-inbound topic to verify that the window summary event functions as intended (see the streaming sketch below).

Environment: Scala 2.12.3, Spark Streaming, Apache Hadoop 2.7.2, HDFS, YARN, slf4j 1.7.7, Kafka 0.11.0.1, json4s 3.2.11, jodaTime 2.3, VirtualBox, Vagrant, Cassandra 3.11
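
The production job used Scala with DStreams; the following is a minimal PySpark Structured Streaming sketch of the same windowed-summary idea (the broker address is a placeholder, and the Spark Kafka connector package must be available on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("windowed-summary-sketch").getOrCreate()

# Read inbound events from the Kafka topic mentioned above.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "Scala-project-inbound")
    .load()
)

# Count events per one-minute window, keyed by the Kafka message key.
summary = (
    events
    .selectExpr("CAST(key AS STRING) AS key", "timestamp")
    .groupBy(window(col("timestamp"), "1 minute"), col("key"))
    .count()
)

# Emit the rolling window summary to the console for verification.
query = (
    summary.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```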

Data Engineer

Confidential

Responsibilities:

  • Extensively involved in installation and configuration of Cloudera Distribution Hadoop platform.
  • Extract, transform, and load (ETL) data from multiple federated data sources (JSON, relational database, etc.) with Data Frames in Spark.
  • Utilized SparkSQL to extract and process data by parsing using Datasets or RDDs in Hive Context, with transformations and actions (map, flatMap, filter, reduce, reduceByKey).
  • Extended the capabilities of DataFrames using User Defined Functions in Scala.
  • Resolved missing fields in Data Frame rows using filtering and imputation.
  • Integrated visualizations into a Spark application using Databricks and popular visualization libraries (ggplot, matplotlib).
  • Trained analytical models with Spark ML estimators including linear regression, decision trees, logistic regression, and k-means.
  • Performed pre-processing on datasets prior to training, including standardization and normalization.
  • Assembled processing pipelines that chain transformations, estimations, and evaluation of analytical models (see the pipeline sketch below).
  • Evaluated model accuracy by dividing data into training and test datasets and computing metrics using evaluators.
  • Tuned training hyper-parameters by integrating cross-validation into pipelines.
  • Used Spark MLlib functionality not available in Spark ML by converting DataFrames to RDDs and applying RDD transformations and actions.
  • Troubleshot and tuned machine learning algorithms in Spark.

Environment: Spark 2.0.0, Spark MLlib, Spark ML, Hive 2.1.0, Sqoop 1.99.7, Flume 1.6.0, HBase 1.2.3, MySQL 5.1.73, Scala 2.11.8, Shell Scripting, Tableau 10.0, Agile
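
A minimal PySpark ML sketch of the pipeline, train/test split, and cross-validation workflow described above (the input path, feature columns, and parameter grid are hypothetical):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

# Hypothetical training DataFrame with numeric features and a binary label.
df = spark.read.parquet("/data/curated/training")

# Pipeline stages: assemble features, standardize, then fit a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, scaler, lr])

# Hold out a test set and tune regularization via cross-validation.
train, test = df.randomSplit([0.8, 0.2], seed=42)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(labelCol="label")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

model = cv.fit(train)
print("test AUC:", evaluator.evaluate(model.transform(test)))
```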

SQL Developer

Confidential

Responsibilities:

  • Gathered business requirements and converted them into new T-SQL stored procedures in Visual Studio for the database project.
  • Performed unit tests on all code and packages.
  • Analyzed requirements and impact by participating in online Joint Application Development sessions with the business client.
  • Performed and automated SQL Server version upgrades, patch installs and maintained relational databases.
  • Performed front line code reviews for other development teams.
  • Modified and maintained SQL Server stored procedures, views, ad-hoc queries, and SSIS packages used in the search engine optimization process.
  • Updated existing reports and created new ones using Microsoft SQL Server Reporting Services on a two-developer team.
  • Created files, views, tables and data sets to support the Sales Operations and Analytics teams.
  • Monitored and tuned database resources and activities for SQL Server databases.
