However, research in the field of clustering and cluster labelling had so far a very theoretical view or were applying its methods in web search engines. In this paper the research results in those fields are evaluated for the use in electronic communication data. So in the first chapter the different approaches are described and their algorithms are explained. The approaches are evaluated regarding their hypothesis, clustering algorithm needed, their implementation complexity and their correctness.
2 The Different Approaches
The algorithms as proposed by Treeratpituk and Callan [1], Jickels and Kondrak [2], Geraci et al. [3][4], Popescul and Ungar [5], Chen and Dong [6] and Tseng et al. [7] are explored here.
2.1 Treeratpituk and Callan [1]
Treeratpituk and Callan [1] based their algorithm on the works of Glover et al. [8]. Glover et al. based their algorithm on the hypothesis that a word very common in a single cluster and less common in the other remaining clusters must be a good descriptor for that cluster. Treeratpituk and Callan provide a more general labelling algorithm, that would incorporate more features for selecting a descriptor and thus overcome the limitations of the algorithm by Glover et al. Their algorithm would go through four steps:
Fig. 1. The four steps of the algorithm by Treeratpituk and Callan
Collect phrase statistics For each phrase in the cluster, the following statistics need to be collected:
– Document Frequency DF
– Term Frequency T F with respect to cluster S, the parent P and the complete English corpus E – Document frequency of a phrase p with respect to cluster C - DF C – Term frequency of a phrase p with respect to cluster C - T F C
Select label candidates Not every phrase occurring in a cluster are considered. Stop words are removed. According to their hypothesis that a good descriptor must not necessarily be in the majority of the documents in the cluster, it must be in at least 20% of the documents in the cluster. Bigram and trigram phrases 1 need to occur in at least 5% of the documents in the cluster.
1 phrases with two or three words respectively
Calculate the descriptive score Here the descriptiveness of a label DScore p for a cluster is measured. The DScore p is based on the following sets of features: – Normalised Document Frequency (nDF) – TFIDF – Rank of TFIDF and nDF – Boost in ranking – Phrase Length (LEN) Given a set of features X i and their weights b i the final score DScore p can be calculated:
DScore p = b 0 + b i X i + e (1) where e is an error term with a mean of zero. The weights b i are determined using linear regression and training data.
Finally the potential labels are sorted according to their DScore. Calculate the cutoff point There exists a big drop-off in the descriptive scores at some point. Determining this drop-off the algorithm can automatically decide on how many potential labels (N ) to show.
N = c 0 − c 1 (DScore L 1 − DScore L 2)
− . . .
where the weights c i are determined beforehand using linear regression and training data.
Finally all the labels up to the cut-off point are presented as output.
2.2 Jickels and Kondrak [2]
The algorithm by Jickels and Kondrak [2] is incorporating the grammatical struc- ture of the sentence when labelling clusters. They consider cluster labelling as
”automatic identification of hyponymy 2 relations” [2]. Consider the set y = {Germany, F rance, P oland} as an example, the algorithm would identify the hypernym x = Europe for the set y.
Before clustering the dependency relations of the corpus needs to be ex- tracted. Dependency relations are ”syntactic relationships between words in a sentence”[2]. There are subject relation (N:subj:N) and appositive relations (N:appo:N). Given a sentence ”Berlin, a capital city, has the Brandenburg Gate” the subject relation Berlin and Brandenburg Gate and the the appositive relation Berlin and capital city can be extracted.
After the clustering the dependency data is generated using the results from the dependency relations extraction. Revisiting the example, above the depen- dency data would look like this:
2 y is a hyponym of x if every y is also an x [9]. Hypernymy is the inverse of hyponymy.
Fig. 2. An example of the dependency data. The numbers behind the feature indicate the feature score(not normalised) and the numbers behind the feature word the label score. The label score add up to the feature score. A large feature score indicate that a good descriptor is likely to be found in its feature word. So capital city would be in this example a good descriptor. Modified from Jickels and Kondrak [2]
The aim of the algorithm is to give each feature a weight and each feature word a label score. A high feature weight indicates that a good cluster descrip- tor can be found among the associated feature words. Within these the best descriptor is the feature word with the largest label score.
The input of the algorithm is the dependency data of the clusters. The algo- rithm has to go through a training process. The output of the training process is a feature score F S(f ) for each feature f . The final output of the algorithm is a ranking of labels for each cluster.
The training (fig. 3) happens in the lines 3-15. The loop runs until the feature score converges (see lines 13 and 15). In lines 8 and 9 the label score and the temporary feature score are determined. n(f, a) is the number of occurrences of feature word a in feature f . So the label score LS(a) is calculated by multiplying the feature score by the number of occurrences of the feature word a in feature f (line 8). The temporary feature score F S (f ) is the sum of label scores for the feature word a that is in feature f . In line 14 the temporary feature score is normalised and this is then the feature score F S(f ).
The descriptor for a given cluster will be set to the label with the highest label score.
2.3 Geraci et al. [3][4]
The approach by Geraci et al. [3][4] use the information gain as their basis to define cluster labels. Information gain (IG(t, c)) measures the amount of infor-
Arbeit zitieren:
Dennis HU Egert, 2009, Cluster Labelling, München, GRIN Verlag GmbH
Dieser Text kann über folgende URL aufgerufen und zitiert werden:
Einbetten
DOI
Formatvorlage (Microsoft Word) für eine Diplomarbeit, Masterarbeit, Ha...
Für MS Word 2003 - Update 2010
Vorlagen, Muster, Formulare, Infobroschüren
Ausarbeitung, 25 Seiten
Formatvorlage (OpenOffice) für eine Diplomarbeit, Masterarbeit, Hausar...
Vorlagen, Muster, Formulare, Infobroschüren
Ausarbeitung, 35 Seiten
Formatvorlage / Vorlage zur Erstellung einer Diplomarbeit, Bachelorarb...
Vorlagen, Muster, Formulare, Infobroschüren
Ausarbeitung, 15 Seiten
Formatvorlage / Vorlage für eine Diplomarbeit / Hausarbeit
Für MS Word 2007 - dotx
Vorlagen, Muster, Formulare, Infobroschüren
Ausarbeitung, 25 Seiten
Anleitung zum Erstellen schriftlicher Arbeiten: Der Aufbau einer wisse...
Vorlagen, Muster, Formulare, Infobroschüren
Ausarbeitung, 20 Seiten
Erstellen einer schriftlichen Hausarbeit
Vorlagen, Muster, Formulare, Infobroschüren
Hausarbeit, 14 Seiten
Grundtechniken wissenschaftlichen Arbeitens
Bibliografieren - Reden - Schr...
Vorlagen, Muster, Formulare, Infobroschüren
Skript, 46 Seiten
Ratgeber zur Erstellung wissenschaftlicher Arbeiten. Diplomarbeiten - ...
Vorlagen, Muster, Formulare, Infobroschüren
Ausarbeitung, 39 Seiten
Dennis HU Egert's Text Cluster Labelling ist nun auf dem Buchmarkt erhältlich
Dennis HU Egert hat den Text Cluster Labelling veröffentlicht
Dennis HU Egert hat einen neuen Text hochgeladen
Advances in Computing Science - ASIAN 2002. Internet Computing and Mod...
7th Asian Computing Science Co...
Jean-Marie Alain
Transactions on Computational Science VII
Yan Solihin, Rémi Léandre, Reshma Lal, Chin-Kun Hu, Matthew Hoekstra, Siddhartha Chhabra, Marina L. Gavrilova, C. J. Kenneth Tan
Computability, Complexity, and Languages: Fundamentals of Theoretical ...
Martin Davis, Ron Sigal, Elaine J. Weyuker
0 Kommentare