For a DBMS that provides support for a declarative query language like SQL, the query optimizer is a crucial piece of software. The declarative nature of a query allows it to be translated into many equivalent evaluation plans. The process of choosing a suitable plan from all alternatives is known as query optimization. The basis of this choice are a cost model and statistics over the data. Essential for the costs of a plan is the execution order of join operations in its operator tree, since the runtime of plans with different join orders can vary by several orders of magnitude. An exhaustive search for an optimal solution over all possible operator trees is computationally infeasible. To decrease complexity, the search space must be restricted. Therefore, a well-accepted heuristic is applied: All possible bushy join trees are considered, while cross products are excluded from the search.
There are two efficient approaches to identify the best plan: bottom-up and top-down join enumeration. But only the top-down approach allows for branch-and-bound pruning, which can improve compile time by several orders of magnitude, while still preserving optimality.
Hence, this thesis focuses on the top-down join enumeration. In the first part, we present two efficient graph-partitioning algorithms suitable for top-down join enumeration. However, as we will see, there are two severe limitations: The proposed algorithms can handle only (1) simple (binary) join predicates and (2) inner joins. Therefore, the second part adopts one of the proposed partitioning strategies to overcome those limitations. Furthermore, we propose a more generic partitioning framework that enables every graph-partitioning algorithm to handle join predicates involving more than two relations, and outer joins as well as other non-inner joins. As we will see, our framework is more efficient than the adopted graph-partitioning algorithm. The third part of this thesis discusses the two branch-and-bound pruning strategies that can be found in the literature. We present seven advancements to the combined strategy that improve pruning (1) in terms of effectiveness, (2) in terms of robustness and (3), most importantly, avoid the worst-case behavior otherwise observed.
Different experiments evaluate the performance improvements of our proposed methods. We use the TPC-H, TPC-DS and SQLite test suite benchmarks to evaluate our joined contributions.

Extrait

1. Introduction

1.1. Motivation

1.2. Contribution

1.2.1. Graph-Aware Join Enumeration Algorithms

1.2.2. Hypergraph-Aware Join Enumeration Algorithms

1.2.3. Branch-and-Bound Pruning

1.2.4. Conclusion and Appendix

2. Graph-Aware Join Enumeration Algorithms

2.1. Preliminaries

2.1.1. Graphs

2.1.2. Query Graphs, Plan Classes and Costs

2.1.3. Problem Specification

2.2. Basic Memoization

2.2.1. Generic Top-Down Join Enumeration

2.2.2. Naive Partitioning

2.2.3. Exemplified Execution of TDBASIC

2.2.4. Analysis of TDBASIC

2.3. Advanced Generate and Test

2.3.1. TDMINCUTAGAT

2.3.2. An Improved Connection Test

2.3.3. Correctness of Advanced Generate and Test

2.4. Conservative Graph Partitioning

2.4.1. Correctness Constraints

2.4.2. The Algorithm in Detail

2.4.3. Exploring Connected Components

2.5. Branch Partitioning

2.5.1. Branch Partitioning - An Overview

2.5.2. The Algorithm in Detail

2.5.3. Two Optimization Techniques

2.5.4. Exploring Restricted Neighbors

2.5.5. Two Examples

2.5.6. Complexity of Branch Partitioning

2.6. Evaluation

2.6.1. Experimental Setup

2.6.2. Organizational Overview

2.6.3. Experiments

3. Hypergraph-Aware Join Enumeration Algorithms

3.1. Motivation

3.2. Preliminaries

3.2.1. Hypergraphs

3.2.2. Graphs vs. Hypergraphs

3.2.3. Neighborhood

3.3. Basic Memoization for Hypergraphs

3.3.1. Generic Top-Down Join Enumeration for Hypergraphs

3.3.2. Naive Partitioning of Hypergraphs

3.3.3. Test for Connectedness of Hypergraphs

3.4. Conservative Partitioning for Hypergraphs

3.4.1. Overview of MINCUTCONSERVATIVEHYP

3.4.2. The Algorithm in Detail

3.4.3. Avoiding Duplicates

3.4.4. GETCONNECTEDCOMPONENTS

3.4.5. MAINTAINCSET

3.4.6. An Example

3.5. Generic Top-Down Join Enumeration for Hypergraphs

3.5.1. High-Level Overview

3.5.2. Structure of the Generic Partitioning Framework

3.5.3. Generating the Adjacency Information

3.5.4. Composing Compound Vertices

3.5.5. Economizing on Connection Tests

3.5.6. Efficient Subgraph Handling

3.5.7. Implementation Details

3.6. Evaluation

3.6.1. Implementation

3.6.2. Workload

3.6.3. Organizational Overview

3.6.4. Evaluation of Random Acyclic Query Graphs

3.6.5. Evaluation of Random Cyclic Query Graphs

3.6.6. Overhead Detection

3.6.7. Performance Evaluation with Different Benchmarks

4. Branch-and-Bound Pruning

4.1. Motivation

4.2. Accumulated-Cost Bounding and Predicted-Cost Bounding

4.2.1. Building a Join Tree

4.2.2. Accumulated-Cost Bounding

4.2.3. Predicted-Cost Bounding

4.2.4. Combining the Methods

4.2.5. An Example for Accumulated-Predicted-Cost Bounding

4.3. Technical Advances

4.4. Evaluation

4.4.1. Implementation

4.4.2. Workload

4.4.3. Performance Evaluation with Simple Query Graphs

4.4.4. Performance Evaluation with Complex Query Graphs

4.4.5. Performance Evaluation with Different Benchmarks

5. Conclusion

5.1. Graph-Aware Join Enumeration Algorithms

5.2. Hypergraph-Aware Join Enumeration Algorithms

5.3. Branch and Bound Pruning

5.4. Graceful Degradation

5.5. Summary

Objectives and Topics

This thesis focuses on the process of top-down join enumeration in database management systems. Its primary objective is to develop efficient partitioning algorithms and advanced branch-and-bound pruning strategies that close the performance gap between top-down join enumeration and existing bottom-up approaches, while also addressing real-world query requirements like complex predicates and non-inner joins.

Top-down join enumeration via memoization
Graph-partitioning algorithms for query optimization
Hypergraph-aware query processing
Advanced branch-and-bound pruning techniques
Performance evaluation using TPC-H, TPC-DS, and SQLite benchmarks

Excerpt from the Book

1.1. Motivation

Queries against DBMSs are often formulated in declarative languages. Prominent examples are SQL, OQL, XPath and XQuery. Writing such a declarative query has two advantages: (1) The querist does not need to decide upon the actual algorithms and execution order to access and combine the data, which in turn (2) leaves the DBMS with several degrees of freedom to choose the best evaluation and execution strategy in order to answer the query. This is a shift of complexity: from formulating the query towards how to answer it in a most efficient way. We refer to the process of transforming the declarative query in an imperative execution plan as plan generation, and we call the component in the DBMS which deals with the complexity of choosing a suitable plan from all alternatives the plan generator.

Today’s plan generators are cost-based. This means that they rely on a cost model and statistics over the data in order to select from all equivalent plans the one with the lowest costs. Essential for the costs of a plan is the execution order of join operations in its operator tree, since the runtime of plans with different join orders can vary by several orders of magnitude. An exhaustive search for an optimal solution over all possible operator trees is computationally infeasible. To decrease complexity, the search space must be restricted. For the optimization problem discussed in this thesis, the well-accepted connectivity heuristic is applied: We consider all possible bushy join trees, but exclude cross products from the search, presuming that all considered queries span a connected query graph. Thereby, a query graph is an undirected graph where join predicates span the edges between the relations referenced in the SQL query, i.e., a graph edge between R1 and R2 is introduced if there exists a join predicate involving attributes of R1 and R2.

Summary of Chapters

1. Introduction: Discusses the motivation for efficient join enumeration in DBMSs and outlines the contribution of the thesis, including graph-aware and hypergraph-aware algorithms.

2. Graph-Aware Join Enumeration Algorithms: Introduces basic memoization and develops novel, efficient graph-aware partitioning strategies, concluding with performance comparisons.

3. Hypergraph-Aware Join Enumeration Algorithms: Extends the framework to handle hypergraphs, necessary for complex predicates and non-inner joins, proposing a generic partitioning framework.

4. Branch-and-Bound Pruning: Details advanced pruning techniques to significantly reduce plan generation time by curtailing the search space without sacrificing optimality.

5. Conclusion: Summarizes the thesis, highlighting the achievement of bridging the performance gap between top-down and bottom-up join enumeration through robust, efficient algorithms.

Keywords

Join Enumeration, Query Optimization, Memoization, Hypergraphs, Branch-and-Bound Pruning, Graph Partitioning, DBMS, SQL, Cost-based Optimization, Query Execution Plans, Performance Evaluation, TPC-H, TPC-DS, SQLite.

Frequently Asked Questions

What is the core problem addressed in this work?

The work addresses the computational complexity of finding the optimal execution order for join operations in database queries, particularly for complex queries that involve many relations and joins.

What are the primary fields of study?

The research lies at the intersection of database system internal design, query optimization, algorithmic graph theory, and performance engineering.

What is the ultimate goal of the proposed methods?

The goal is to enable top-down join enumeration to perform as efficiently as bottom-up approaches, while leveraging the inherent advantages of top-down processing for branch-and-bound pruning.

Which scientific methods are utilized?

The thesis utilizes graph partitioning, memoization techniques, and cost-based query optimization strategies, alongside extensive empirical performance testing using standardized benchmarks.

What content is covered in the main part of the thesis?

The main body focuses on developing new partitioning algorithms for simple graphs and hypergraphs, creating a generic framework for partitioning, and introducing seven specific advancements to branch-and-bound pruning.

How can one define the key concepts?

The work is defined by the concepts of graph-aware and hypergraph-aware join enumeration, and the use of branch-and-bound pruning to drastically reduce compile time.

What is the significance of the "generic partitioning framework"?

The generic framework is significant because it allows existing, highly optimized graph-partitioning algorithms to be adapted for complex hypergraphs, which are essential for modern query optimizers dealing with outer joins and complex predicates.

How does this work contribute to real-world database performance?

By drastically reducing the "compile time" of queries—especially for complex, large-scale workloads—the proposed algorithms allow database systems to generate optimal or near-optimal plans faster, leading to lower latency and higher throughput in real-world scenarios.

Fin de l'extrait de 201 pages - haut de page

Résumé des informations

Titre: Algorithms for Efficient Top-Down Join Enumeration
Université: University of Mannheim (School of Business Informatics and Mathematics)
Cours: Databases
Note: summa cum laude
Auteur: Pit Fender (Auteur)
Année de publication: 2014
Pages: 201
N° de catalogue: V274543
ISBN (ebook): 9783656663430
ISBN (Livre): 9783656663423
Langue: anglais
mots-clé: algorithms efficient top-down join enumeration
Sécurité des produits: GRIN Publishing GmbH

Citation du texte: Pit Fender (Auteur), 2014, Algorithms for Efficient Top-Down Join Enumeration, Munich, GRIN Verlag, https://www.grin.com/document/274543

Algorithms for Efficient Top-Down Join Enumeration