The present research proposes a novel approach to estimate incoming jobs runtime based on similarities of reocurring jobs. To achieve this goal, we utilize the latest achievements in neural network techniques to embed the job dependencies. Subsequently, we perform multiple clustering techniques to form meaningful groups of reoccurring jobs. Finally, based on the similarities within the groups of samples, we predict runtimes. A recently published trace dataset allows us to develop and evaluate our contribution with more than 200,000 complex and real-world jobs.
The cloud data centers should daily handle numerous jobs with complex parallelization. In order to schedule such a heavy and complicated workload and reach efficient resource utilization, runtime prediction is critical. Moreover, accurate runtime prediction may assist cloud users in choosing their required resources more intelligently. Despite the importance of runtime prediction, achieving an accurate prediction is not straightforward because the execution time of jobs in complicated environments of clouds is affected by many factors, e.g., cluster status, users’ requirements, etc.
Table of Contents
1. Introduction
1.1 Motivation
1.2 Problem Description
1.3 Contribution
1.4 Framework and Structure
2. Background
2.1 Use Case Description
2.2 Data Parallel Processing Workloads
2.2.1 Cloud Computing Scheduler
2.2.2 Task Dependencies
2.2.3 Recurrent Jobs
2.3 Graph Theory
2.3.1 Directed Acyclic Graphs
2.3.2 Definitions
2.4 Graph Embedding
2.4.1 GNN
2.5 Data Clustering and Classification
2.5.1 Rule-Based Classification
2.5.2 K-Means
2.5.3 DBSCAN
2.5.4 OPTICS
2.5.5 Clustering Evaluation
2.6 Linear Regression
2.7 Evaluation Metrics in Linear Models
3. Approach
3.1 Overview
3.2 Data Cleaning and Integration
3.2.1 Data Cleaning
3.2.2 DAGs Extraction
3.3 Graph Embedding and Clustering
3.3.1 Graph Embedding
3.3.2 Clustering
3.3.3 Runtime Prediction
4. Implementation
4.1 DAGs Extraction
4.1.1 Data Preprocessing
4.1.2 DAGs Objects
4.1.3 Target Variables Definition
4.2 Graph Embedding
4.2.1 Model Structure
4.2.2 Model Training
4.3 Data Clustering Techniques
4.3.1 Preprocessing
4.3.2 DBSCAN
4.3.3 OPTICS
4.4 Runtime Prediction
5. Evaluation and Discussion
5.1 Dataset
5.1.1 Exceptions Handling
5.1.2 Data Characteristics
5.2 Rule-Based Classification
5.2.1 Recurring Jobs Detection
5.2.2 Rule-Based Classification
5.2.3 Runtime Prediction
5.2.4 Evaluation
5.3 GNN Evaluation
5.4 OPTICS Clustering
5.5 K-means Clustering
5.5.1 Model Setup
5.5.2 Evaluation
5.6 DBSCAN Clustering
5.6.1 Runtime Prediction
5.6.2 Comparison with the Baseline
5.7 Limitations
5.8 Discussions
6. State of the Art
6.1 Graph Representation Learning
6.2 Runtime and Resource Usage Prediction
6.3 Reoccurring Jobs
6.4 Alibaba Cluster Dataset
7. Conclusion
7.1 Provision of the Code
7.2 Provision of the Datasets
Objectives and Research Themes
This thesis aims to develop a framework for predicting the runtime of incoming data-parallel jobs in cloud environments. By leveraging the structural similarities of reoccurring jobs through Graph Neural Networks (GNN) and clustering techniques, the research seeks to improve prediction accuracy and enable better resource allocation.
- Graph representation learning using GNNs to embed Directed Acyclic Graphs (DAGs).
- Application of clustering algorithms (OPTICS, DBSCAN, K-means) to identify meaningful groups of reoccurring jobs.
- Runtime estimation via linear regression models tailored to specific job clusters.
- Evaluation using large-scale production trace data to validate the predictive value of job dependency structures.
Book Excerpt
3.3 Graph Embedding and Clustering
The generated DAGs enable further analysis of structural dependencies. The current thesis aims to cluster reoccurring jobs based on their similarity in structural properties of DAGs. In other words, we aim to form a cluster of reoccurring jobs. To this end, the graph should be represented in a fixed-size data structure. This step is mandatory, as many clustering machine learning models do not accept the discrete data type of DAGs as input. The embeddings should hold three main conditions: (1) they have to represent the structural features of DAGs and the information regarding node- and graph-level attributes. (2) The representations must be definable in Euclidean space to make the comparison for clustering models possible, and (3) the representation learning phase should be categorized in the inductive models since the graph embedding phase in real-world scenarios should be repeatable to enable generating of comparable embeddings.
Summary of Chapters
1. Introduction: This chapter provides an overview of the motivation behind runtime prediction in cloud environments, defines the problem, and outlines the thesis structure.
2. Background: This chapter establishes the necessary domain knowledge, covering parallelism, graph theory, graph embedding (GNN), and clustering/classification techniques.
3. Approach: This chapter introduces the methodology for data cleaning, DAG extraction, GNN-based embedding, and the clustering-based runtime prediction framework.
4. Implementation: This chapter details the technical execution of the proposed approach, including data preprocessing, feature selection, GNN architecture, and clustering setup.
5. Evaluation and Discussion: This chapter presents the statistical analysis of the dataset, evaluates the performance of the proposed methods against baselines, and discusses the results and limitations.
6. State of the Art: This chapter reviews existing literature related to graph representation learning, cloud resource prediction, and the analysis of the Alibaba cluster dataset.
7. Conclusion: This chapter summarizes the research findings, contributions, and potential areas for future work regarding job dependency embeddings.
Keywords
Cloud Computing, Data-Parallel Jobs, Runtime Prediction, Directed Acyclic Graphs (DAGs), Graph Neural Networks (GNN), Graph Embedding, Data Clustering, DBSCAN, OPTICS, Resource Allocation, Alibaba Cluster Trace, Job Dependencies, Machine Learning, Runtime Estimation, Feature Selection
Frequently Asked Questions
What is the core focus of this research?
The thesis focuses on predicting the runtime of data-parallel jobs in cloud environments by identifying and clustering reoccurring job patterns based on their structural dependencies.
What are the primary themes addressed?
The central themes are cloud workload management, graph representation learning, unsupervised clustering techniques, and empirical evaluation using large-scale production traces.
What is the main objective of this work?
The primary goal is to improve the accuracy of job runtime predictions by analyzing the Directed Acyclic Graph (DAG) structure of tasks, thereby assisting cloud providers in efficient resource scheduling.
Which scientific methods are utilized?
The study employs Graph Neural Networks (GNN) for embedding DAGs, clustering algorithms such as DBSCAN and OPTICS to identify job groups, and linear regression models to predict runtime per cluster.
What content is covered in the main body of the text?
The main body covers the extraction of DAGs from raw trace data, the design of a GNN model to create fixed-size vector representations, and a comparative study of different clustering and prediction strategies.
Which keywords characterize this thesis?
Key terms include Cloud Computing, Graph Neural Networks, Runtime Prediction, DAGs, Data Clustering, and Resource Utilization.
How does the author define recurring jobs in this context?
Recurring jobs are identified as tasks sharing isomorphic structural dependencies (DAGs), allowing the framework to group similar operations together even if the input datasets vary.
What impact does the research have on existing prediction baselines?
The proposed framework demonstrates a significant improvement in accuracy, specifically reducing the mean absolute error of baseline predictions by 27%.
How is the Alibaba cluster dataset used?
The dataset is used as a real-world benchmark with over 200,000 jobs to evaluate the robustness and reliability of the proposed dependency-based prediction approach.
- Quote paper
- Alireza Alamgiralem (Author), 2021, Trace-Based Runtime Prediction of Reoccurring Data-Parallel Processing Jobs, Munich, GRIN Verlag, https://www.grin.com/document/1180656