A lot of data contains implicit hierarchical structures, e.g. type hierarchies. The property graph model – among others employed in some graph databases – provides no tools to capture those internally.
In this thesis we derive such hierarchies automatically. First a survey is conducted to find the most promising approaches that cluster a data set hierarchically. In the next step various features and vectors thereof are experimented with to extend the methodology to graphs, capturing the structure as well as possible.
We found that there is not one specific feature vector that works well for all data sets and forms of representation in a graph, but rather needs to be constructed adaptive, depending on the way data is modelled. Finally, some extensions of a specific algorithm that was used during experimentation – namely Cobweb – are discussed as well as the use case of cardinality estimation in property graph databases, leveraging the hierarchy as an associative multi-level histogram.
Table of Contents
1 Introduction
2 Background
2.1 The Property Graph Model
2.2 Cluster Analysis
2.2.1 Approaches
2.2.2 Clustering as a Search
2.3 Taxonomy
2.4 Probability Theory
3 Algorithms
3.1 Hierarchical Clustering Algorithms
3.1.1 Hierarchical Agglomerative Clustering
3.1.2 Robust Single Linkage
3.2 Partition-based Clustering
3.2.1 K-Means
3.2.2 TTSAS
3.3 Density-based Clustering
3.3.1 DBSCAN
3.3.2 OPTICS
3.3.3 HDBSCAN
3.4 Model-based Clustering – Conceptual Clustering
3.4.1 Cobweb
3.4.2 Extensions of Cobweb
3.5 Feature Extraction
3.5.1 Characteristic Sets
3.5.2 Recursive Feature Extraction
4 Label Inference
4.1 Problem Statement
4.2 Proposed Solution
4.2.1 Pre-Processing
4.2.1.1 Encoding Sets of Tags as Vectors
4.2.1.2 Feature Vector Extension
4.2.2 Clustering
4.2.3 Post-Processing: Taxonomy Extraction
5 Evaluation
5.1 Tag-based clustering
5.1.1 Setup
5.1.1.1 Data
5.1.1.2 Implementation
5.1.2 Results
5.1.3 Discussion
5.2 Graph-aware clustering of Nodes
5.2.1 Setup
5.2.1.1 Data
5.2.1.2 Pre-Processing
5.2.1.3 Implementation
5.2.2 Results
5.2.3 Discussion
6 Conclusion
6.1 Summary
6.2 Future Work
Research Objectives and Focus
The thesis aims to automate the extraction of hierarchical type taxonomies from property graph databases, addressing the lack of implicit hierarchy support in common graph data models. By deriving these structures, the study facilitates improved query optimization, data exploration, and the generation of more concise queries.
- Automated hierarchy derivation in property graphs
- Evaluation of clustering algorithms for graph structures
- Development of a graph-aware processing pipeline
- Implementation of conceptual clustering algorithms (e.g., Cobweb)
- Adaptive feature vector construction for diverse data sets
Extract from the Book
3.4.1 Cobweb
Fisher describes COBWEB as a conceptual clustering algorithm that is designed to maximize inference ability, is incremental (sometimes also referred to as an online algorithm) and computationally economical [22]. To achieve this, Fisher takes the viewpoint of clustering as a learning task for concepts [50] and applies the search paradigm to the task. He describes incremental learning along two dimension: search direction and search control.
As COBWEB shall produce a hierarchy of clusters along with a concept description there are three search problems present according to Fisher [23]:
1. Searching the space of Characterizations: According to Mitchell [50] the space of characterizations may be ordered by generality. Thus one can either search this space (search direction) bottom-up, starting with a very specific hypothesis and generalize incrementally or search top-down, starting with the most general hypothesis and discriminating incrementally or one can search in both directions, converging to some solution that is in the best case neither too specific nor too general. Mitchell called this the version space strategy [50].
2. Searching the space of Aggregations: This is the search through the space of all possible partitionings in all levels of the hierarchy. One can start with the smallest possible partitions and the largest number of partitions and then proceed by selectively merging two or more partitions or the other way round with only one partition consisting of all elements, splitting in subsequent search steps.
3. Searching the space of Hierarchies: This is the space of orderings of the partitions with the constraint that the partitions must be ordered by the subset relation. Here one may build the hierarchy from the smallest partitions to the largest or the other way round.
Summary of Chapters
1 Introduction: Discusses the ubiquity of hierarchical structures in data and identifies the lack of native support for these hierarchies in current property graph models.
2 Background: Provides foundational knowledge on property graph models, cluster analysis definitions, taxonomies, and probability theory.
3 Algorithms: Describes and analyzes various clustering algorithms, including hierarchical, partition-based, density-based, and conceptual clustering methods.
4 Label Inference: Outlines the problem of extracting explicit hierarchies from implicit data and presents a solution pipeline involving pre-processing, clustering, and taxonomy extraction.
5 Evaluation: Details the experimental setup and results for both tag-based and graph-aware clustering, discussing the performance and applicability of different feature vectors.
6 Conclusion: Summarizes the thesis findings and suggests potential future research directions in conceptual clustering and adaptive feature extraction.
Keywords
Property Graph Model, Hierarchical Clustering, Conceptual Clustering, Cobweb, Taxonomy Extraction, Label Inference, Graph Databases, Neo4j, Data Mining, Feature Extraction, Cardinality Estimation, Cluster Analysis, Incremental Learning, Synthetic Data, Knowledge Discovery
Frequently Asked Questions
What is the core problem addressed in this research?
The research addresses the fact that property graph databases lack internal support for hierarchical structures (type taxonomies), which are essential for tasks like query optimization and effective data exploration.
What are the central thematic areas of this work?
The thesis focuses on clustering algorithms, property graph data models, taxonomy extraction, and the development of pipelines to derive structural hierarchies automatically.
What is the primary objective of this thesis?
The primary objective is to develop and evaluate a pipeline that can automatically extract meaningful, representative type hierarchies from data within property graph databases.
Which scientific methods are utilized in this research?
The study employs a mix of survey-based research of existing clustering literature and experimental implementation of algorithms, including hierarchical agglomerative, density-based, and conceptual clustering (Cobweb) within a Neo4j environment.
What is covered in the main part of the thesis?
The main part covers the theoretical background of clustering, a detailed algorithmic overview, the design of a graph-aware taxonomy extraction pipeline, and an extensive evaluation of different feature vectors using synthetic and real-world datasets.
Which keywords characterize this research?
Key terms include Property Graph Model, Hierarchical Clustering, Conceptual Clustering, Taxonomy Extraction, and Label Inference.
Why is the Cobweb algorithm specifically highlighted in this work?
Cobweb is highlighted because it is parameter-free, incremental, provides probabilistic concept descriptions, and produces hierarchies by design, making it highly suitable for graph-aware clustering.
How is the "well-formed" nature of the hierarchy ensured?
The pipeline uses a post-processing step to flatten dendrograms into usable taxonomies by aggregating consecutive merges with identical distances.
Does a single feature vector work for all data types?
No, the research demonstrates that no single feature vector is universally optimal; instead, feature vectors must be constructed adaptively based on the specific properties and structures available in the data.
- Quote paper
- Fabian Klopfer (Author), 2020, Label Hierarchy Inference in Property Graph Databases, Munich, GRIN Verlag, https://www.grin.com/document/1012318