Free online reading

## Contents

1 Clustering, Conceptual Clustering and Concept Formation

1.1 Clustering

1.2 Conceptual Clustering

1.2.1 Concept

1.2.2 Concept Learning

1.2.3 Conceptual Clustering

1.3 Concept Formation

2 Some Approaches of Concept Formation

2.1 Feigenbaum's EPAM

2.2 Lebowitz' UNIMEM

2.3 Fisher's COBWEB

2.4 Gennari's, Langley's and Fisher's CLASSIT

3 KoCLASSIT { Adaption of CLASSIT to KoScript

3.1 Choice of CLASSIT for use in KoScript

3.2 Special properties of KoScript

3.3 KoCLASSIT

4 Bibliography

## 1 Clustering, Conceptual Clustering and Con- cept Formation

In the following we will use a special class of clustering methods called concept formation. We rst want to introduce these terms and so distinguish these terms from each other. Afterwards, we will go into detail and explain the functioning of di erent concept formation methods.

### 1.1 Clustering

In the task of machine learning we have a set of observed data which we divide into di erent groups.

There are now two di erent ways to do this:

Supervised: We know for sure that there is a set of groups for this data and the aim is to insert each element of the observed data into the best tting group.

Unsupervised: We don't know for sure into which groups the data has to be split and the aim is to nd the best set of groups and to insert the elements of the observed data into these groups.

(compare [MST94], p. 6)

The supervised learning is often called discrimination. Examples for supervised learning are Linear Discriminants, Decision Trees and k-Nearest-Neighbour. The unsupervised learning is called Clustering. In this chapter we will see some di erent types of clustering algorithms as conceptual clustering and concept for- mation.

Note that in a "simple" clustering, we neither make any assumptions about how the clusters built upon a set of observed data are represented nor if there has to be any representant of a cluster.

### 1.2 Conceptual Clustering

In this chapter we will often use the term of a concept, but what is a concept?

#### 1.2.1 Concept

In [Anzai92], page 177 we read that "words such as 'a triangle', 'a co ee cup', and 'a hammer' indicate a concept. An individual triangle is considered to be an example of a concept called triangle."

There we also nd two de nitions of the term concept:

extensional meaning: We have a set of examples and the meaning of the concept is de ned by the set of these examples ([Anzai92], 177)

intensional meaning: We have a set of properties and the meaning of a concept is de ned by the set of these attributes ([Anzai92], 178)

Suppose we have a concept P de ned intensionally. Then we have a set of prop- erties p1; :::; pn which each instance of P shall satisfy. Each of these pi can be a variable symbol like "shape", "color", "size" etc. For each of these variable symbols we have a set of constant symbols or ground symbols, such as "red", "blue", "white" etc for the variable symbol "color" or "small" and "big" for the variable symbol "size".

Let Q be a nite set of ground symbols for the pi.

Now let us de ne as a substitution that replaces all the variable symbols in- cluded in P with constant symbols, i.e. = f j8v 2 pi; 8w 2 Q : = piv=w ; i = 1::ng

The term = piv=w indicates the substitution of the variable symbolsofthepi (v) by constant symbols of Q (w).

If P Q holds for some substitution , then we say that Q is a positive instance of the concept P .

Now let R be a nite set of ground symbols and P R for any substitution that replaces variable symbols in P with constant symbols of R. We then say that R is a negative instance of the concept P (compare to [Anzai92], p. 182).

Now we have de ned the terms of concept as well as the positive and negative instances of a concept. Based on this, we de ne the generalization and the specialization of a concept.

Suppose we have a set of positive instances p1; :::; pn and a set of substitutions 1; :::; n such that i replaces constant symbols in pi with variable symbols. If there exists an instance p such that for all i = 1::n; p pi i holds for some i, we call p a generalization of the positive instances p1; :::; pn (based on 1; :::; n) (compare to [Anzai92], p. 183).

Let us consider an example of that generalization. Suppose we have the two positive instances "big red tree" and "small red tree". In both instances, we have constant symbols for size ("big" and "small") and color ("red"). One possibility is to replace the attributes "big" and "small" by the variable symbol "size". Then we would get a more general object that we knew only the color of (red tree), paying no attention to the size.

We can de ne the specialization of a concept using the generalization. A concept p1 is a specialization of a concept p2 if p2 is a generalization of p1 (compare to [Anzai92], p. 183).

#### 1.2.2 Concept Learning

In [Anzai92], page 184 we nd a de nition of learning a concept based on positive and negative instances.

There, we have learned a concept for some positive instances p1; :::; pm and neg- ative instances q1; :::; qn if we have found a well-formed formula r such that there is a substitution i (i = 1::m) that replaces constant symbols by variable symbols so that r pi i (i = 1::m) and for which there is no substitution j (j = 1::n) that replaces constant symbols by variable symbols so that r qj j (j = 1::n).

In other words, we search for a generalization r of the positive instances pi such that the generalized concept r is not a generalization of the negative instances qj , whatever substitution we may apply.

"We say that r is a concept for the instances p1; :::; pm; q1; :::; qm. Learning a concept for a set of instances is called induction of the concept" ([Anzai92], p. 184).

#### 1.2.3 Conceptual Clustering

Above we have de ned clustering as a task of building up groups and classes on observed data and splitting the data into these classes. We have also got to know concept learning as a search for relation among the elements of the given data such that we describe concepts about the observed data. It is possible, and advantageous, to intertwine these both concepts. "Classifying a given set of instances into clusters so that each cluster corresponds to a di erent concept is called conceptual clustering" ([Anzai92], p. 197). Di ering from a simple clustering, conceptual clustering gives us the possibility to represent the found clusters in a way that these clusters correspond to the concepts made up on the observed data. Nothing concrete is said about how the concepts (and therefore the clusters) have to be represented, especially there is no notion about hierarchies and no restriction about how the concepts and clusters have to be found, but we know that the representation of clusters has to t the representation of the concepts (or the other way round).

### 1.3 Concept Formation

Let us summarize the properties of the three tasks clustering, concept learning and conceptual clustering which we have considered so far. All three approaches are methods for unsupervised learning. While clustering divides the observed data into usefull classes, concept learning is able to identify concepts on the observed data. With conceptual clustering we have clusters which correspond to concepts on the observed data. We have a single level of hierarchy for a not necessarily incremental search of the clusters.

In [GLF] we nd a special case of conceptual clustering, the so called concept formation.

The most important and distinctive features of concept formation are "the hierarchical organization of concepts, top-down classi cation and an incremental hill-climbing approach to learning" ([GLF], p. 16).

## 2 Some Approaches of Concept Formation

Below we will give a short summary of four approaches to concept formation which satisfy the constraints given above and, going further, integrate the classi cation process with learning.

Here we will only give a short and simpli ed overview of how these approaches work with respect to the concept formation task. For a more detailed description please read at [GLF] or the bibliography given there.

### 2.1 Feigenbaum's EPAM

EPAM uses attribute-value pairs to denote the instances seen. "The acquiered knowledge is represented and organized in an discrimantion network. [..] In this network, each nonterminal stands for a test, and each edge emanating from this node represents a possible result of this test" ([GLF], p. 17). There is a special edge called "other" that stands for multiple results which do not have their own edge.

"Each terminal node contains an image { a partial set of attribute-value pairs (and component categories) expected to hold for instances sorted to that node."([GLF], p. 17-18).

With respect to concept learning, we see that here a generalization is made of the instances seen so far.

A new instance is inserted to the tree as follows:

EPAM starts at the root node of the discrimination network and makes tests with the new instance. It sorts the instance down the branch with the matching result for the test (if no test matches, the instance is sort down the "other"- branch). Once the instance reaches a terminal node, there are three learning mechanisms that can be applied. The rst is called familiarization and is in- voked if the image stored at the terminal node matches the instance sort to that node. Familiarization "selects an attribute that occurs in the instance but not in the image, and then adds the attribute (along with the instance's value) to the image. In this way, EPAM gradually makes its images more speci c as it encounters more instances"([GLF], p. 19).

The second mechanism is "creating new disjunctive branches through discrimi- nation"([GLF], p. 21). It is applied when an instance did not match the image at a special node. Then it is sort through the tree again, searching for a node in which the instance di ers from a stored test. Here two new branches are created, "one based on the instance's value for the test and the other based on the image's value"([GLF], p. 21).

The third mechanism is "extending the network downward through discrimina- tion"([GLF], p. 21). This mechanism is applied if both, the rst and the second mechanism, fail. Then "the system [..] sorts the instance back down to the terminal node where the mismatch originally [i.e. in familiarization] occured. EPAM creates two new branches in this case as well, along with corresponding terminal nodes"([GLF], p. 21).

Now let us summarize the properties of EPAM with respect to the properties of concept formation.

The images created and updated in each terminal node correspond to concepts of the instances sort to that node. Each concept corresponds to a cluster upon the instances seen. The discrimination network is a tree structure, in which a node contains more general information about the instances sort to it than its children do.

The insertion of a new instance to the tree is incremental because the instance can be added to an existing tree (so not all instances have to be known a priori) and for its insertion into a branch only local information is used. It is a hillclimbing method because di erent operations are applied and the best matching one is taken (compare to the three mechanisms above). Of course EPAM is a learning system for it updates its discrimation network at each step adapting the test results and images stored in the nodes. These test results and images then are used for further classi cation of new instances.

### 2.2 Lebowitz' UNIMEM

As mentioned in [GLF], p. 22, "one can view Lebowitz' UNIMEM as a successor to EPAM, since it shares many features with the earlier model". Like EPAM, UNIMEM uses images, but di ering from EPAM, UNIMEM uses concept descriptions also in nonterminal nodes. "Each [concept] description consists of a conjunction of attribute-value pairs, with each value having an as- sociated integer. [..] This corresponds to the idea of predictability, i.e., how well the feature can be predicted given an instance of the concept. [..] In addition, each feature on a link has an associated integer score, specifying the number of links on which that feature occurs. This second score measures the predic- tiveness of the feature, i.e., how well it can be used to predict instances of the various children"([GLF], p. 23f ).

In a changed manner, we will see both the predictability and the predictiveness scores also in the following concept formation algorithms.

A new instance is sort into the tree comparing its attribute values to those stored in the image. If an instances values are close enough to those of an image, then the instance is sort into the node containing that image. "Both the number of features necessary for this match and the closeness of each value [..] are system parameters"([GLF], p. 25). In both cases of match or not match the scores stored in the node are updated. UNIMEM is able to sort single instances into multiple paths.

If an instance is sort to a node but does not match any of its children the new instance is matched with all instances stored at the children . If it nds an instance that "shares enough features with a new one (another system parameter), the model creates a new, more general node based on these features and stores both instances as its children " ([GLF], p. 25). In this case the predictiveness scores are updated in the new node.

As in EPAM, the images stored in the nodes correspond both to concepts and clusters. It is important to note that the clusters built in UNIMEM are not disjoint.

We have mentioned above that the nodes are described as a conjunction of attribute-value pairs. The images of nodes in lower levels are conjunctions of the images of the parent nodes with new attribute-value pairs. From these both statements we can easily understand that the images in the lower levels in the tree correspond to more speci c concepts than in the higher levels in the tree. With the same reasons give above for EPAM, UNIMEM is an incremental hill- climbing learning algorithm.

### 2.3 Fisher's COBWEB

Each instance is represented as a tupel of attribute value pairs in which there is a probability given for each occurence of an attribute value. Even values not occured are given a probability. In the latter case the probability is 0. Each terminal node corresponds to an instance and has either 0 or 1 probability for an attribute value. Each nonterminal node contains for every attribute value the average probability of all its children nodes' attribute values.

Each time a node is inserted to the tree the system rst compares the node with all the children ci of the local root node (of the current partial tree). It nds out which of the children ci is most and which is secondmost similar to the new node (which, of course, as an instance has either a 0 of a 1 probability for a given attribute value (see above)). The comparison is called matching. After nding out the best and secondbest matching children the system takes out four di erent operations and computes the quality of the local partial tree (we will discuss the quality a bit later) after each of these operations. The operation with the best quality is the one to take out.

The operations mentioned above are:

new The current instance is added to the tree as a new child of the local root. insert The system calls itself recursively with the best matching child as the local root and the new instance as (still) the node to place into the tree. split The best matching child of the local root is deleted and replaced by its children. The system calls itself recursively with the local root and the still to discriminate instance. merge A mew node is created with the averaged values of the two best matching children of the local root. This new node is placed into the tree as a child of the local root and as the parent of the two nodes it gets its values from.

Now let us take a look at how COBWEB computes the quality for a given clustering. As a means of quality, COBWEB uses the predictability and the predictiveness scores it stores (or computes) and updates for a given tree structure. Formally, for a given set of clusters Ck, a set of attributes Ai and a set of attribute values Vij , predictability is P (Ai = Vij j Ck ), the probability of attribute Ai having value Vij given the class membership Ck. Accordingly, predictiveness is the probability of an instance being member of a class Ck given attribute Ai having value Vij , formally P (Ck j Ai = Vij ).

In clustering, we try to maximize both, the predictiveness and the predictability scores. This tradeo can be summed up into one formula, the overall measure of cluster quality:

P PP

P (Ai = Vij )P (Ck j Ai = Vij )P (Ai = Vij j Ck ) k i j

In this formula, predictability and predictiveness are weighed by P (Ai = Vij ), the probability of the occurence of the attribute value. In this way, attribute values which occur more often play a more important role than those occuring more rarely.

In the above formula, we can substitute P (Ai P (Ck )P (Ai = Vij P Ck ) usi PP

illustration not visible in this excerpt

tribute values that one can correctly guess for an arbitrary member of class Ck. [..] one guesses a value with probability P (Ai = Vij j Ck ) and [..] this guess is correct with the same probability"([GLF], p. 34).

For use in a clustering algorithm, we rst would like to know how good we can guess an attribute value having class information towards not having any clustering information. Secondly, we want to increase the expected number of correct guesses.

For this, we rst substract the simple number of correct guesses without class information (i.e. P (Ai = Vij )2) from the conditional probability given class information (i.e. P (Ai = Vij j Ck )2, weighed by the cluster strength P (Ck)). To be able to compare di erent sizes of clusterings we divide the result by K , the number of clusters. Thus, we get the following formula:

illustration not visible in this excerpt

With this formula, COBWEB scores the results of the four operations and selects the one with the highest score. In the next subsection, we will consider this formula once again.

In COBWEB each instance corresponds to a concept. The higher a node stands in the concept tree, the more generally it describes its instances. Unfortunately, there is no way to see that a concept is more or less general considering only the probability of seen values (of course, it is possible to understand that a concept is general by considering its weight in comparison with other concepts, but this is not enough to show how good a concept describes its instances). A sollution for this problem is shown in the next subsection, discussing the method CLAS- SIT.

A concept is learned by adding a new instance's values to those of an existing concept description, by creating a new concept corresponding to the new in- stance or altering the tree structure by merging or splitting nodes. Note that each time a new instance passes a local root, the local root updates its proba- bility for every attribute, putting into account the attribute values of the new instance.

In each level, each of the children corresponds to a disjoint cluster. Every one of these clusters consists of the terminal nodes reachable from one of the children (i.e. the local root of the partial tree). In this way, there is a hierarchy of clus- ters and concepts (in COBWEB, each cluster corresponds to a concept) with the root node being the most general and the terminal nodes the most special clusters and concepts.

For an existing tree it is possible to add new instances one after another. In this way COBWEB is incremental.

The most obvious sign for COBWEB being a hill climbing method is that is rst tries four di erent operations, scores their results and takes the operation with the best score.

Learning is done by updating the tree structure and the predictability scores in each node which is passed by a new instance.

### 2.4 Gennari's, Langley's and Fisher's CLASSIT

Gennari's, Langley's and Fisher's CLASSIT is a successor of the above consid- ered COBWEB. Similar to COBWEB, also CLASSIT represents each concept and each instance as attribute value pairs, but other than COBWEB, CLASSIT does not use a simple probability for each attribute value, but the mean and the standard deviation for each attribute. For a set of n elements s:::; an the mean is de ned as

illustration not visible in this excerpt

n and the standard deviation is de ned as n

In CLASSIT, the attribute values are numerical instead of symbolic or nominal. CLASSIT uses the same four operations considered in COBWEB above, so there is no change to the way a new instance is sort into the tree. However, there is a slight change to the computation of the clustering quality. Remember, that for any local root of a partial tree, COBWEB computes the (local) cluster quality with the formula

illustration not visible in this excerpt

We mentioned above that CLASSIT uses a mean and a standard deviation for

illustration not visible in this excerpt

an attribute value instead of a probability for each (symbolic or nominal) value.

This makes it necessary to alter the formula as follows. P P

The partial formulas P (Ai = Vij j Ck)2 and P (Ai j j = Vij ) include all val-2 ues of an attribute, summed over the squares. The rst partial formula shows the distribution of an attribute value over di erent classes, whereas the second formula does not include any class information. This, actually, corresponds to the distribution in the parent node, because the parent node contains all instances without clustering information.

In order for these formulas to be applied to continous attribute values, Gennari, Langley and Fisher have changed the summations to integrations and assumed "that the distribution of values for each attribute follows a normal curve. Thus, the probability of a particular attribute value is the height of the curve at that value and the summation of the square of all probabilities becomes the integral of the normal distribution squared"([GLF], p. 39).

For the second formula, e.g. this leads to the following expression:

illustration not visible in this excerpt

where stands for the mean and for the standard deviation. Since we use the

result only for comparison, we can discard the p 2

constant. What remains, is

The evaluation function in COBWEB can now be expressed as

illustration not visible in this excerpt

"where I is the number of attributes, K is the number of classes in the partition, ik is the standard deviation for a given attribute in a given class, and ip is the standard deviation for a given attribute in the parent node"([GLF], p. 39). Looking at the new formula, it is obvious that problems may occur if the stan- dard deviation is zero and we divide by the standard deviation. To prevent this, Gennari, Langley and Fisher introduce the system parameter acuity, which is a lower bound for the standard deviation. "This limit corresponds to the notion of a 'just noticeable di erence' [..]"([GLF], p. 40). There is another system parameter called cuto that enables the system to insert an instance at a higher level in the tree rather than to follow its way down to a terminal node. Thus, if an instance is "similar enough" to a concept description, the instance is in- corporated into this concept description, even if this concept node has children more speci c.

"Because acuity strongly a ects the score of new disjuncts, it indirectly controls the breadth, or branching factor of the concept hierarchy produced, just as the cuto parameter controls the depth of the hierarchy"([GLF], p. 40).

Like in COBWEB, in CLASSIT, each node corresponds to a concept and also a concept description. Each new instance is inserted into the concept tree and thus learned. So CLASSIT is a concept learning system. Each concept on the other hand corresponds to a cluster, so we can call CLASSIT a conceptual clus- tering system.

There is a hierarchy of clusters (and therefore concepts) containing subclusters. The higher a node stands in the hierarchy, the more general its concept descrip- tion is. Other than in COBWEB, it is possible to see how general a concept is and how good it represents its instances. The measure is the standard deviation of the attribute values contained in each node. The greater this value is, the more general the concept description is. The less the value is, the more accurate the concept description is.

CLASSIT is an incremental hill climbing learning method, for it is able to ex- tend an existing tree by a new instance (so it is incremental), it rst tries four di erent operations, evaluates their results and then selects the operation to execute (so it is a hill climbing method), and once an instance is inserted to the tree, the values for predictability and predictiveness are updated (so it is a learning system).

We have decided to use CLASSIT for our purposes of clustering character rules in KoScript. In the next section, we will explain why we decided to and how we adapted CLASSIT to KoScript.

## 3 KoCLASSIT { Adaption of CLASSIT to Ko- Script

First we want to explain why we chose CLASSIT for our aim to cluster character rules.

### 3.1 Choice of CLASSIT for use in KoScript

In KoScript, we have a single attribute for use in clustering, and that is the measure our matching gives us as a similarity value (the similar two character rules are, the less is this value, so actually, it is an "unsimilarity" value) for two objects. This value, of course, is not nominal, but numeric. Of the clustering algorithms considered above, only UNIMEM and CLASSIT were able to handle with numeric attribute values.

In KoScript, we have two contrary aims between which we have to nd a tradeo : On the one side, we want KoScript to well recognize characters. The more character rules we have for comparison, the better this aim can be achieved. On the other side, we want KoScript to be a fast recognition system, so we have to reduce the number of character rules for comparison.

Our idea to solve this problem of contrary aims is looking for a set of character rules so that KoScript works most performant. For this, we want to cluster the set of character rules and, at a special level, select only one representative from every cluster for recognition. Thus, we do not take into consideration all elements contained in a cluster, but one that is similar enough to all elements of that cluster. In this respect, CLASSIT is the only concept formation method that allows us to see how similar a concept description (for us, this is the repre- sentative to choose) is to the instances it is derived from. The less the standard deviation of the concept description, the more similar the concept description is to the instances.

In the section "Leveling" we will see how the standard deviation information can be used for KoScript.

A third reason for us to choose CLASSIT is the possibility to control clustering results by modifying system parameters.

We want to be able to tune the recognition and the performance of KoScript, so we want to be able to cluster the character rules with di erent system parame- ters to achieve better or faster recognition. For these purposes, the parameters acuity and cuto introduced in CLASSIT seem to be very suitable. Since we have to adapt CLASSIT to KoScript (see below), we do not know a priori, how our system will behave in some special cases. For our system still to work, we need a set of parameters to control its behaviour.

### 3.2 Special properties of KoScript

In KoScript, we have a set of character rules R = ri; i = 1::n to cluster. The character rules consist of shapeline rules ri ! sia1 sia2 ::siam , together with some size relations and contact angle information. A character C can consist of di er- ent sets of shapelines skj; k = [1]::p; j = [1]::l. Thus, there are many di erent ways to build up character rules rk; k = [1]::p with shapeline rules rk ! sk[1]sk[2]::sk forl one distinct character. The character rules for this distinct character then may di er in the number, the set and the order of shapeline rules appearing in the character rule, so for similarity clustering, we cannot use the data given in the character rules directly. Instead, we use the matching function of KoScript.

In KoScript, there is the possibility of matching two character rules, mutually embeding them into a set of each others tting shapeline rules.

Suppose we have two characters ci; i = [1]::[2], two sets of shapeline rules Si = s1i ::sai ; ai 2 N and two character rules ri ! s1i s2i ::sbi ; bi 2 N with s1i ::sbi Si. Describing easily, we now try to embed each shapeline rule smi ;mi = 1i::ai, into character rule r3 i at each position sn3 i ; n3 i = 13 i::b3 i and calculate the embed costs. For some sets of shapeline rules fspi g fsm g, we will havei embedded the complete right side of the rule r3 i ! s13 i s23 i ::sb3 i so we can calculate entire embed costs. All embedings result in costs, and the best tting embedings cause the least costs. Suppose that we nd a set of shapeline rules fsqi g fsp g whose embedingleadstoleastcosts.Thiscostisthenchosenasi the di erence score between both characters.

We use this score as the only attribute value for clustering.

Note that this is a distance score, and in some cases has to be handled di erently from usual numeric values.

Character rules consist of di erent numbers of components. Therefore, for a set of character rules, it is not possible to calculate a suitable mean. Instead, in clustering, we want to use the median. For a given set of character rules R = frig, i = 1::n, we de ne the median as an element rj, j 2 1::n, which is closest to the mean . For our own purposes, we want to rede ne tse median

illustration not visible in this excerpt

as the element rj , j 2 1::n, for which the standard deviation = n

is minimal. The subexpression (rj ri) corresponds to the di erence score we are able to compute for character rules using KoScripts embeding functions. Note that contrary to the mean, the median must be a member of R.

Each time the median is changed, all the di erence scores have to be updated with respect to the new median.

With the properties of KoScript discussed above, we now want to introduce KoCLASSIT, the adaption of CLASSIT to KoScript.

### 3.3 KoCLASSIT

representation of the instances All instances are represented as FS having following structure:

PARENT:[];

CHILDRENLIST:[]; CLUSTER-INFOS:[]; Z-LERN:[];

The PARENT eld contains a pointer to the parent node of this instance in the concept formation tree. CHILDRENLIST on the other hand contains a list of pointers to all the children this special instance has. Z-LERN contains the object description. In our case, this is the character rule, together with a set of shapeline rules, a set of base segments and some other information. CLUSTERINFOS is a complex FS containing following attributes:

ID:[];

PARENTMATCHCOST:[]; SUM-AI:[];

SUM-AI-SQUARES:[];

CHILDRENNUMBER:[]; ELEMENTNUMBER:[]; MEAN:[];

STANDARDDEVIATION:[];

The ID contains an integer that identi es the node in the concept formation tree. This is necessary for testing the results of KoCLASSIT. For a distinct node n, CHILDRENNUMBER stores the number of children this node n has. Note that CHILDRENNUMBER is the number of direct successors of node n. In a directed acyclic graph (V; E; ), we would say that for a distinct

node n, it corresponds to the following number:

j fv j (n; v) 2 Eg j

ELEMENTNUMBER, on the other hand, is the number of all successors of a distinct node n, which are terminal nodes, plus the number of nodes that have been incorporated to the successors of n due to the cuto system parameter (the latter are also counted, but not listed in the concept formation tree). ELEMENTNUMBER, accurately, is the number of instances that have passed node n when being sort to the concept tree. This number corresponds to the total cluster strength which is represented by node n.

PARENTMATCHCOST is the value for the matching described above between a node and its parent node. For matching, we need the information contained in the Z-LERN FS which we described above.

Let T = ftig; i = 1::r, be the set of terminal successors of a local root node n, plus the above mentioned incorporated nodes. Then SUM-AI corresponds to P the sum of all matching costs (n Ti), thus match(n; ti), where match(n; ti) i is the matching function of KoScript described above. SUM-AI-SQUARES, re- P spectively, is the sum of all matching costs squared, thus (match(n; ti))2 and i ELEMENTNUMBER is j T j.

We have described the SUM-AI and also the ELEMENTNUMBER. MEAN is the mean of the matching costs between a node n and all the nodes that passed node n when being inserted to the concept tree (in the following, we willPfer this set with fTig). Usually, the mean is computed with the formula match(n;ti) i P

ELEMEN TNUMBER. Since we have stored match(n;ti)inSUM-AI,wecan i

get the same result by

SUM AI ELEM EN T N U M BER .

For a local root node n, STANDARDDEVIATION is, just like MEAN, com- puted over the set T and indicates the standard deviation of matching costs P (match(n;ti))2 i

relative to n, i.e. for a set T , we compute P

e store ELEM EN T N UM BER . Since w

the inner expression (match(n; ti))2 as SUM-AI-SQUARES, we can write the

q i

same formula as

SUM AI SQUARES ELEM EN T N U M BER .

structure of nonterminal and terminal nodes As in CLASSIT, in Ko- CLASSIT terminal and nonterminal nodes do not di er in structure. All nodes have the structure described above. The only things they di er from each other are the CHILDRENLIST (for terminal nodes, this list is empty) and the value for CLUSTER-INFOS:CHILDRENNUMBER (for terminal nodes, this value is 0).

insertion of nodes to the tree The insertion of nodes into the concept for- mation tree follows the same rules as in CLASSIT. First of all, the instance to insert into the concept formation tree is enriched by some additional attributes, so it has the same structure as the nodes in the tree.

Now, we have a (local) root of a (partial) tree and rst want to know if this (local) root is a terminal node.

If this is the case, we create two children of this (local) root, from which one has the same object description (i.e. the Z-LERN attribute contains the same values) as the (local) root and the other is the node we want to insert into the concept formation tree.

Otherwise, if the (local) root is a nonterminal node, we try the same four op- erations used in CLASSIT and COBWEB (i.e. insert, new, split and merge). On a copy of the partial tree, we execute all four operations and evaluate the results. The operation with the best result score is then executed on the partial tree.

evaluation function Since we have only one attribute, our evaluation func- tion is a bit easier than that of CLASSIT. With the sum over the attributes P

missing, we can simplify our evaluation function to

illustration not visible in this excerpt

, where i is the standard deviation in class Ci and p is the standard deviation in the parent node. K corresponds to the number of children the parent node has, thus leading to the formula

illustration not visible in this excerpt

concept learning Each terminal node corresponds to an instance and, at the same time, to a concept. A local root is a concept for all its successors. Once a new instance is inserted to the tree, all the attribute values along its descend path are updated putting into account this new instances comparison values with the local roots.

Especially, the standard deviation is updated, so the system behaves as if it knew the new instance. Therefore, we can call KoCLASSIT a concept learning algorithm.

clustering Each concept corresponds to a cluster. Terminal nodes stand for clusters consisting of a single element, with this element as the representant of the cluster. Nonterminal nodes n correspond to a cluster consisting of all the elements represented by terminal nodes reachable from the node n, plus those nodes which are incorporated to successors of n due to the cuto system parameter. The node n is then the representant of this cluster.

hierarchy tree All nodes are organized in a tree. This tree implicitly describes a hierarchy. The more general a concept description is, the higher this description is found in the tree. The root of the tree contains the most general concept description. At the same time, it corresponds to the representant of the cluster containing all elements.

incremental KoCLASSIT is an incremental algorithm, since it is able to add a new node into an existing tree without computing the tree once again. The instances to insert into the tree need not to be known at the beginning, but can be added one after another.

hill climbing KoCLASSIT is a hill climbing method, since it rst tries the four operations insert, new, split and merge, evaluates their results and then decides which one to execute. This is a typical hill climbing e ort.

learning KoCLASSIT, at least, is a learning method, since it builds up a structure on the instances seen and sorts them into clusters of di erent gen- erality. It also lets us recognize how general a distinct cluster is. Therefore, it updates the standard deviation of the comparison values relativ to the local roots of partial trees. The higher the standard deviation is, the more general the cluster description is. The less the standard deviation, the more accurate the cluster description.

## 4 Bibliography

[MST94] D. Michie, D.J. Spiegelhalter, C.C. Taylor: "Machine Learning, Neural and Statistical Classi cation", Ellis Horwood, 1994

[Anzai92] Y. Anzai: "Pattern Recognition and Machine Learn- ing", Academic Press, Inc., 1992

[GLF] John H. Gennari, Pat Langley, Doug Fisher: "Models of Incremental Concept Formation" in Jaime Carbonell: "Machine Learning", pages 11-61, MIT/Elsevier, 1990

- Quote paper
- Taner Korkankorkmaz (Author), 2000, Identifying the tasks of Clustering, Conceptual Clustering and Concept Formation in the field of Artificial Intelligence, Munich, GRIN Verlag, https://www.grin.com/document/97798

Comments