Optimizing Web Search Results for Image. K-means Clustering Algorithm


Academic Paper, 2020

55 Pages, Grade: 9.5


Excerpt

TABLE OF CONTENTS

Chapter 1: Introduction
1.1 Clustering
1.2 Types of Clustering
1.3 Classification of Clustering Algorithms
1.4 Requirements of Clustering
1.5 Stages in Clustering
1.6 Different Types of Clusters
1.7 Different Types of Clustering Algorithms
1.8 Applications of Clustering
1.9 Web Clustering Engines

Chapter 2: Literature Survey

Chapter 3: Tools and Technologies
3.1 System Requirements
3.2 System Environment

Chapter 4: Problem Description
4.1 Existing System
4.2 Objective
4.2.1 HACM Clustering Algorithm and its Shortcomings
4.2.2 K-Means Clustering Algorithm and its Advantages over HACM
4.3 Proposed System

Chapter 5: System Design
5.1 System Architecture

Chapter 6: Conclusion and Future Work

References

LIST OF FIGURES

Figure 1.1- An example of Clustering

Figure 1.2- Different ways of clustering the same set of points

Figure 1.3- Stages in clustering

Figure 1.4- Different types of clusters as illustrated by sets of two-dimensional points

Figure 4.1- Showing dendogram formed from the dataset of size "N=60"

Figure 4.2- An Example of K-Means Clustering

Figure 5.1- Flowchart of the method

Figure 5.2- User Interface showing three buttons

Figure 5.3- K-Means Algorithm forms four clusters of an image dataset

CHAPTER 1

INTRODUCTION

Internet is expanding at a great rate and is used almost by everyone. Hence, it has become an essential part of life 1. Information Retrieval (IR) is a process which involves the representation, organization, storage and search of digital data in an effective and efficient manner that can retrieve relevant results to a user's query request from the collection of information items 2. With each passing day, the amount of digital information and data on Internet is increasing enormously 3. The availability of a large repository of digital data, along with dynamic and heterogeneous nature of the Web, makes IR a tedious task for an average user 4. Hence, it has become difficult to get the desired search results due to information overload on Internet 5.

Now-a-days, Search Engines (SEs) and Meta Search Engines (MSEs) are developed to help users to get digital information that can satisfy and fulfil their needs and interests. An SE is an essential component of IR Model whose major responsibility is to compare digital information on Internet through obtaining similarity values 6. Since there is a heavy usage of Internet to churn out relevant information specifically required by users, there is a constraint of time to provide relevant results. Hence, to provide precise search results to users, there is a shift from SEs to MSEs, as SEs can cover about 16% of web documents i.e 84% of web remains unreachable which may be desired by the users 7. An MSE is a search tool which uses other SEs to produce effective search results for a query term. MSEs do not maintain their own database for storing different search results. In turn, different SEs, which is being used by MSEs, creates their own database for storing search results. This maintenance of database of search results is known as indexing. Hence, MSEs utilize this advantage of available SEs 8. However, to provide search results in an excellent form to the user search term, MSEs make use of the concept of “Clustering” of web results. Hence, the main focus of this work is to enhance image search results by using the concept of “Clustering”.

However, this chapter mainly focuses on the concept of ‘Clustering’. The introduction to ‘Clustering’ is presented in Section 1.1; in Section 1.2 ‘Requirements of Clustering’ are presented; in Section 1.3 ‘Stages in Clustering’ are explained; in Section 1.4 ‘Different Types of Clusters’ are suggested, in Section 1.5 ‘Different Types of Clustering Algorithms’ are discussed; in Section 1.6 ‘Applications of Clustering’ are explored and in Section 1.7 ‘Web Search Engines’ are studied.

1.1 Clustering

Cluster Analysis or Clustering is a concept which defines the discipline of grouping similar objects or data items into clusters 9. A cluster is said to be a “Collection of data objects”. These formed clusters of similar data items differ in characteristics and features. Hence, Clustering can be defined as a solution for classifying web search results in an effective way for searching data items 10. Clustering allows users to identify their required group at a glance by looking at the cluster labels. Hence, it saves time while searching on Internet.

Clustering is a major process in the field of data mining and a common technique for analyzing statistical data used in many fields, including machine learning, image analysis, bioinformatics, pattern recognition and information retrieval 11 . It can also be defined as the most common unsupervised learning problem. It makes it easier for users to perceive the significance of natural grouping or structure in a dataset. It can be employed either as a stand-alone tool to get insight into data distribution or as pre-processing step for other algorithms.

In the context of interpreting and understanding data, clusters are potential groups of data objects and cluster analysis is the study of techniques or tasks for automatically finding the clusters. Cluster analysis provides an abstraction from individual data objects to the clusters in which those data objects reside. It classifies and groups the data objects based only on information found in the data that describes the objects and their relationships. Data objects which share similar characteristics are assigned to the same cluster or class.

For clustering two types of similarity are considered in 11:

- Intra-cluster similarity - Objects are similar to objects in same cluster
- Inter-cluster dissimilarity - Objects are dissimilar to objects in other clusters

The major goal of clustering is that the data objects within a class must be similar to one another and different from the data objects in other classes. In other words, one can say that the intra-cluster distance between similar data objects must be minimum and the inter-cluster distance between dissimilar data objects must be maximum to achieve effective clustering. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or distinct will be the clustering.

Cluster analysis is associated to other techniques that can be used to classify data objects into classes or groups. For instance, clustering can be seen as a form of classification in that it creates a labelling of data objects with class (cluster) labels. However, it derives these labels only from the information contained in the data of similar objects in a group. We can show this with a simple graphical example:

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.1 An example of Clustering

A Tutorial on Clustering Algorithms [Image of Clustering: An Introduction]. https://matteucci.faculty.polimi.it/Clustering/tutorial_html/index.html

In this case, we have easily identified for clusters into which data can be classified; the similarity criterion is distance: two or more objects belong to the same class if they are close according to a given distance.

In many applications, the concept of cluster analysis is not well defined. To better gain the perception of difficulty in deciding what constitutes a cluster, refer Figure 1.2, which displays twenty points and three different ways of classifying them into clusters. Figures 1.2(b) and 1.2(d) divide the data points into two and six parts respectively. However, the apparent division of each of the two larger clusters into three sub-clusters may simply be an artefact of the human visual system. Also, it may not be unreasonable to say that the points form four clusters as shown in Figure 1.2(c). Hence, this figure shows that the definition of cluster analysis is imprecise and that the best definition depends on the nature of data and the desired results.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.2 Different ways of clustering the same set of points

Tan, P. N., Steinbach, M., & Kumar, V. (2016). Different ways of clustering the same set of points. [Image] Pearson Education India. https://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf

The terms segmentation and partitioning are seldom used as synonyms for clustering, these terms are frequently used for approaches outside the traditional bounds of cluster analysis. For example, the term partitioning is often used in connection with techniques that divide graphs into sub-graphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; e.g., an image can be split into segments based only on pixel intensity and colour.

Let’s understand this with an example. Suppose, you are the head of a rental store and wish to understand preferences of your costumers to scale up your business. Is it possible for you to look at details of each costumer and devise a unique business strategy for each one of them? Definitely not. But, what you can do is to cluster all of your costumers into say 10 groups based on their purchasing habits and use a separate strategy for costumers in each of these 10 groups. This is what we call clustering. To elaborate the concept a little bit, we consider several examples:

Example 1.1- (Clustering in health psychology) Cluster analysis has been applied to many areas of health psychology, including the promotion and maintenance of health, improvement to the health care system, and prevention of illness and disability. In health care development systems, cluster analysis is used to identify groups of people that may benefit from specific services. In health promotion, cluster analysis is used to select target groups that will most likely benefit from specific health promotion campaigns and to facilitate the development of promotional material. In addition, cluster analysis is used to identify groups of people at risk of developing medical conditions and those at risk of poor outcomes.

Example 1.2- (Clustering in market research) In market research, cluster analysis has been used to segment the market and determine target markets. In market segmentation, cluster analysis is used to break down markets into meaningful segments, such as men aged 21–30 and men over 51 who tend not to buy new products.

Example 1.3- (Image segmentation) Image segmentation is the decomposition of a grey-level or colour image into homogeneous tiles. In image segmentation, cluster analysis is used to detect borders of objects in an image. Clustering constitutes an essential component of so- called data mining, a process of exploring and analyzing large amounts of data in order to discover useful information. Clustering is also a fundamental problem in the literature of pattern recognition.

Now, that we understand what is clustering. Let’s take a look at the types of clustering.

1.2 Types of Clustering

Broadly speaking, clustering can be divided into two sub-groups:

- Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, in the above example each customer is put into one group out of the 10 groups.
- Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point to be in those clusters is assigned. For example, from the above scenario each costumer is assigned a probability to be in either of 10 clusters of the retail store.

1.3 Classification of Clustering Algorithms

Since the task of clustering is subjective, the means that can be used for achieving this goal are plenty. Every methodology follows a different set of rules for defining the ‘ similarity’ among data points. In fact, there are more than 100 clustering algorithms known. But few of the algorithms are used popularly; let’s look at them in detail:

- Connectivity models: As the name suggests, these models are based on the notion that the data points closer in data space exhibit more similarity to each other than the data points lying farther away. These models can follow two approaches. In the first approach, they start with classifying all data points into separate clusters & then aggregating them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. Also, the choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are hierarchical clustering algorithm and its variants.
- Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived by the closeness of a data point to the centroid of the clusters. K- Means clustering algorithm is a popular algorithm that falls into this category. In these models, the no. of clusters required at the end have to be mentioned beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find the local optima.
- Distribution models: These clustering models are based on the notion of how probable is it that all data points in the cluster belong to the same distribution (For example: Normal, Gaussian). These models often suffer from over-fitting. A popular example of these models is Expectation-maximization algorithm which uses multivariate normal distributions.
- Density Models: These models search the data space for areas of varied density of data points in the data space. It isolates various different density regions and assigns the data points within these regions in the same cluster. Popular examples of density models are DBSCAN and OPTICS.

1.4 Requirements of Clustering

(a) Scalability: Highly scalable clustering algorithms are needed to deal with large databases and to achieve potential clusters.
(b) High dimensionality: An effective clustering algorithm must be able to handle dimensional space in addition to low dimensional dataset.
(c) Interpretability: The results or clusters generated by clustering algorithms must be in interpretable, comprehensible and usable form.
(d) Ability to deal with different kinds of attributes: Clustering algorithms should be capable to be applied on any kind of data such as interval-base (numerical) data, categorical data and binary data.
(e) Ability to deal with noisy data: Databases may contain noisy, missing or erroneous data. Some clustering algorithms are sensitive to such data and may lead to poor quality clusters.
(f) Discovery of clusters with attribute shape: Clustering algorithms must be capable of detecting clusters of arbitrary shape. They should not be bounded to only distance measures that tend to find spherical cluster of small sizes.
(g) Relevancy: The clustering algorithm needs to group data objects relevant to a user.
(h) Overlap: Since data objects have multiple topics, it is necessary to avoid confining each object to only one cluster.

1.5 Stages in Clustering

Typical clustering activity involves the following steps as mentioned in 12:

1. Pattern representation
2. Definition of a pattern proximity measure appropriate to the data domain
3. Clustering or grouping
4. Data abstraction
5. Assessment of output

Figure 1.3 depicts a typical sequencing of the first three steps, including feedback path where the grouping process output could affect subsequent feature extraction and similarity computations.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.3 Stages in clustering

Mythili, S., & Madhiya, E. (2014).Stages in clustering [Image] International Journal of Computer Science and Mobile Computing. https://d1wqtxts1xzle7.cloudfront.net/32843015/V3I1201467.pdf?1390699416=&response-content-disposition=inline%3B+filename%3DAn_Analysis_on_Clustering_Algorithms_in.pdf&Expires=1613539989&Signature=KOG-lYJ89FUuSwQE2h0YWXP1JH7lhgjiTss0ly7xobykuftL6mG1i1ymBwEITUpKtbOKK1RzMpwuu~8wQ7OuTT6souxOuiYYcV7-IxnORVYufvgRx5-CXLn3uw3ruMI5ALqrH8X~AQ4NhIQO8krZ8~ewD~40vrP5EATFHMq5mKqpJ9fniEyPR5NOFcJwxvLwUpnzxRlzZHM07iiFJTT5KALBGsxSperV7DW4DTa~ZvefyyEYbe~sfWw0O~bhuGPhav0F4WsTsFeMIXtoD1~fYO7NGEebBBODRGVGvvbmGUvkdN3EoCPFl6pSDzexRfYqVJC5DKX7vv4g4P6xY9rtgg__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA

1. Pattern Representation

It refers to the number of clusters, the number of available patterns, and the number, type, and scale of the characteristics of data objects available to the clustering algorithm. Feature selection is the procedure of locating the most effectively subset of the original characteristics to be used in the task of clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering.

2. Pattern Proximity

It is usually evaluated by a distance function defined on pairs of available patterns. A variety of distance measures are in use in the various communities 13. A basic distance measure like Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns 13.

3. Grouping

The grouping step can be performed in a number of ways. The output clustering (or cluster analysis) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion. Additional techniques for the grouping operation include probabilistic 14 and graph-theoretic 14 clustering methods.

4. Data Abstraction

It is the process of extracting a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid 14. How is the output of a clustering algorithm evaluated? What characterizes a "good" clustering result and a "poor" one? All clustering algorithms will, when presented with data, produce clusters-regardless of whether the data contain clusters or not. If the data does contain clusters, some clustering algorithms may obtain "better" clusters than others.

5 . Assessment of Output

The assessment of a clustering procedure's output, then, has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself-data which do not contain clusters should not be processed by a clustering algorithm. The study of cluster tendency, wherein the input data are examined to see if there is any merit to a cluster analysis prior to one being performed, is a relatively inactive research area, and will not be considered further in this survey. Cluster validity analysis, by contrast, is the assessment of a clustering procedure's output. Often this analysis uses a specific criterion of optimality; however, these criteria are usually arrived at subjectively. Hence, little in the way of "gold standards" exist in clustering except in well-prescribed sub domains. Validity assessments are objective 14 and are performed to determine whether the output is meaningful. A clustering structure is valid if it cannot reasonably have occurred by chance or as an artefact of a clustering algorithm. When statistical approaches to clustering are used, validation is accomplished by carefully applying statistical methods and testing hypotheses. There are three types of validation studies. An external assessment of validity compares the recovered structure to an a priori structure. An internal examination of validity tries to determine if the structure is intrinsically appropriate for the data. A relative test compares two structures and measures their relative merit.

1.6 Different Types of Clusters

Clustering focuses to determine efficient and useful classes of data objects, where usefulness is defined by the goals of the data analysis. The various types of clusters that can be generated according to user requirements are as follows:

1. Well-Separated

A cluster is a collection of data objects in which each object is closer to every other object in the same class than to any object not belonging to the cluster. Sometimes a threshold is used to specify that all data objects in a cluster must be sufficiently similar to one another. This idealistic definition of a cluster is fulfilled only when the data contain natural clusters that are quite dissimilar from one another. Figure 1.4(a) shows an example of well-separated clusters that contains two classes of data points in a two-dimensional space. The distance between any two given points in separate classes is larger than the distance between any two data points within the same class of objects. Well-separated clusters do not required to be globular, but can have any shape.

2. Prototype-Based

A cluster is a collection of data objects in which each data object is more similar to the prototype that defines the class of similar objects than to the prototype of any other class of objects. For data, with continuous features, the prototype is often a centroid, i.e., the mean of all the data points in the class of similar objects. When a centroid is not meaningful, such as in the case of categorical data, the prototype is often taken as a medoid, i.e., the most representative point of a cluster. For many types of data, the prototype can be defined as the most central point, and in such cases prototype based clusters are often referred as centre-based clusters. These clusters tend to be globular. Figure 1.4(b) shows an example of centre-based clusters.

3. Graph-Based

If the data is represented as a graph, where the nodes are expressed as data objects and the links represent connections among data objects. Hence, a cluster can be defined as a connected component; i.e., a class of data objects that are linked to one another, but objects do not have links outside their clusters. A significant instance of graph-based clusters is contiguity-based clusters, where two data objects are linked only if they are within a specified distance of each other. This explains that each data object in a contiguity-based cluster is closer to some other object in the cluster than to any other point in a different cluster. Figure 1.4(c) displays an example of such clusters for two-dimensional data points. This definition of a cluster is useful when clusters are irregular or intertwined, but can cause trouble when noise is present since, as shown in two spherical clusters of Figure 1.4(c).

4. Density-Based

A cluster is defined as a dense region of data objects that is surrounded by a region of low- density. Figure 1.4(d) explains some density-based clusters for data created by adding noise to the data of Figure 1.4(c), because the bridge between them fades into the noise. Likewise, the curve that is presented in Figure 1.4(c) also fades into the noise and does not form a cluster. A density-based definition of a cluster is frequently employed when the clusters are irregular or intertwined, and when noise and outliers are present. By contrast, a contiguity-based definition of a cluster would not work well for the data of Figure 1.4(d) since the noise would tend to form bridges between clusters.

5. Shared-Property (Conceptual Clusters)

It defines a cluster as a collection of data objects that share some property. This definition is a combination of all the previously mentioned definition of a cluster; e.g., data objects in a centre based cluster share the property that they are all closet to the same centroid or medoid. However, the shared-property approach also defines new types of clusters. Consider the clusters shown in Figure 1.4(e), a triangular area or cluster is adjacent to a rectangular one, and there are two intertwined circles or clusters. In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of determining such clusters is called conceptual clustering.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.4 Different types of clusters as illustrated by sets of two-dimensional points

Tan, P. N., Steinbach, M., & Kumar, V. (2016). Different types of clusters as illustrated by sets of two-dimensional points. [Image] Pearson Education India. https://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf

[...]

Excerpt out of 55 pages

Details

Title
Optimizing Web Search Results for Image. K-means Clustering Algorithm
Grade
9.5
Author
Year
2020
Pages
55
Catalog Number
V983236
ISBN (eBook)
9783346348586
ISBN (Book)
9783346348593
Language
English
Tags
optimizing, search, results, image, k-means, clustering, algorithm
Quote paper
Priyanka Nandal (Author), 2020, Optimizing Web Search Results for Image. K-means Clustering Algorithm, Munich, GRIN Verlag, https://www.grin.com/document/983236

Comments

  • No comments yet.
Read the ebook
Title: Optimizing Web Search Results for Image. K-means Clustering Algorithm



Upload papers

Your term paper / thesis:

- Publication as eBook and book
- High royalties for the sales
- Completely free - with ISBN
- It only takes five minutes
- Every paper finds readers

Publish now - it's free