Using Data Mining for Facilitating User Contributions in the Social Semantic Web

Doctoral Thesis / Dissertation, 2011

185 Pages, Grade: 1,0



1. Introduction
1.1 Motivations and Research Questions
1.2 Thesis overview and contributions
1.2.1 Matching Recommendation Technology and Domains
1.2.2 Improving Link Analysis for Tag Recommendation in Social Tagging Systems
1.2.3 Using Recommender Systems for Continuous Ontology development
1.2.4 Combating Attacks Against Social Tagging Environments

2. Background
2.1 Recommender Systems
2.2 Recommendation Algorithms
2.2.1 Collaborative Filtering
2.2.2 Content-based Recommendation
2.2.3 Knowledge-based Recommendation
2.2.4 Other Types of Recommendation
2.3 Social Tagging Systems
2.3.1 Challenges in Social Tagging Systems
2.3.2 Formalization of Folksonomy
2.4 Related Work in Social Tagging Research
2.4.1 Characterize Social Tagging Systems and User Motivations
2.4.2 Resource Recommendation and Personalized Search
2.4.3 Tag Recommendation
2.4.4 User Recommendation
2.4.5 Emergent Ontologies from Folksonomies
2.5 Chapter Summary

3. Matching Recommendation Technologies and Domains
3.1 Introduction
3.2 Related Work
3.3 Knowledge Sources
3.3.1 Individual Knowledge
3.3.2 Social Knowledge
3.3.3 Content
3.4 Recommendation types
3.5 Domain Properties
3.5.1 Heterogeneity
3.5.2 Risk
3.5.3 Churn
3.5.4 Interaction Style
3.5.5 Preference stability
3.5.6 Inscrutability
3.6 Mapping Knowledge Sources to Domain Properties
3.6.1 Individual
3.6.2 Social Knowledge
3.6.3 Content and Domain Knowledge
3.7 Mapping Domains to Technologies
3.7.1 Algorithms
3.8 Sample Recommendation Domains
3.9 Domain of This Thesis: Social Web
3.10 Conclusion
3.11 Acknowledgements

4. Improving Link Analysis for Tag Recommendation in Folksonomies
4.1 Introduction
4.2 Related Work
4.3 Background on PageRank Algorithm
4.3.1 Folksonomy-Adapted PageRank
4.3.2 FolkRank
4.3.3 Graph-Based Tag Recommendation in Folksonomies
4.4 A weighted Directed Graph Model for Folksonomies
4.5 Experimental Evaluation
4.5.1 Datasets
4.5.2 Experimental Results
4.5.3 Discussion
4.6 Conclusion and Future Work
4.7 Acknowledgment

5 Using Recommender Systems to Support Continuous Ontology Development
5.1 Introduction
5.2 Related Work
5.3 Application Scenarios
5.3.1 Floyd
5.3.3 Wikipedia
5.4 Algorithms
5.4.1 Algorithm 1: Recommendation of Super-Concepts Without An Existing Seed Hierarchy
5.4.2 Algorithm 2: Recommendation of Super-Concepts by Learning From an Existing Concept Hierarchy
5.4.3 Algorithm 3: Hybrid Recommendation
5.5 Evaluation
5.5.1 Data Sets Wikipedia
5.5.2 Evaluation Methodology
5.5.3 Experimental Metrics
5.6 Experimental Results
5.6.1 Experimental Results from Algorithm 1
5.6.2 Experimental Results from Algorithm 2
5.6.3 Experimental Results from Algorithm 3
5.6.4 Expert Evaluation
5.6.5 Discussion
5.7 Conclusion and Future Work
5.8 Acknowledgment

6. Combating Attacks Against Social Tagging Environments
6.1 Introduction
6.2 Related Work
6.3 Navigation Channels in Social Tagging Systems
6.4 Attack Dimensions
6.4.1 Attack Types
6.5 Retrieval Algorithms
6.6 Evaluating Impact of Attacks
6.6.1 Measuring the Local Impact of Attack
6.6.2 Measuring the Global Impact of Attack
6.7 Experimental Results
6.7.1 Experimental Setup
6.7.2 Overload Attack
6.7.3 Piggyback Attack
6.7.4 Co-occurrence Attack (Tag Push)
6.7.5 Comparison Of Attack Impact On Different Data Sets
6.7.6 Comparison of Different Attack Types
6.7.7 Comparison of Local Impact
6.8 Discussion
6.9 Conclusion and Future Work
6.10 Acknowledgment

7. Conclusion
7.1 Answer to Research Questions
7.1.1 Objective 1: Matching Recommendation Technology and Domains
7.1.2 Objective 2: Improving Link Analysis for Tag Recommendation in Social Tagging Systems
7.1.3 Objective 3: Using Recommender Systems for Continuous Ontology Development
7.1.4 Objective 4: Combating Attacks Against Social Tagging Systems
7.2 Summary of Contributions
7.2.1 Conceptual Contributions
7.2.2 Algorithm Development
7.2.3 Empirical Contributions
7.3 Future Directions
7.3.1 Explore Other Recommendation Tasks
7.3.2 Network Evolution and Attacks
7.3.3 Scalability And Realtime Analysis


List of Figures

List of Tables

Chapter 1 Introduction

Social Web applications have emerged as powerful applications for Internet users allowing them to freely contribute to the Web content, to organize and to share information, and to utilize the collective knowledge of others for discovering new topics, resources and new friends.

In social Web applications, information access functions such as search, navigation, and resource sharing are usually supported by annotations, labels that users apply to resources. These labels can be numerical ratings such as those used in many standard recommender systems; simple metadata in the form of short tags such as those supported by Delicious1, Flickr2, and many other social tagging systems; or they might be more structred in the form of categories in Wikipedia3. Many of these applications support their users by providing recommendations.

Recommender systems reduce the burden of navigating large information spaces and aid users in discovering new items of their interest. However, single-function recommender systems, which connect users with relevant resources are being progressively replaced by more complex and dynamic applications in which social Web and social networking play an increasingly important role.

This thesis aims to use machine learning and data mining algorithms, particularly, recommender systems to assist users of social Web applications to bring meaningful contributions to the system with minimum effort. In addition, we want to understand the conditions under which information access in social Web applications can be expected to produce useful results, and in particular, how these functions can be insulated from adversaries, even in systems that are open to the public.

In this chapter, we first present our motivations and research questions in section 1.1.

Next, we present our contributions and the thesis overview in section 1.2.

1.1 Motivations and Research Questions

In this thesis, we investigate how machine learning and data mining algorithms can be applied to social Web applications in order to improve their usability. For this purpose, we specifically investigate the role of recommender systems in social Web applications, particularly in social tagging systems. In this section, we outline the objectives and the research questions that this thesis aims to address.

Objective 1: In the first step of this thesis, we study the problem of matching recommender systems technologies to different domains. With the rapid development of recommender system technologies in the recent years, it has become difficult for developers to determine which technology is suitable to a particular context. To alleviate this difficulty, we propose a framework that organizes the space of recommendation problems and provides a systematic approach to finding the appropriate recommendation technology for addressing a given problem in a specific domain. The research questions we try to answer in this part of the thesis are the following.

What are the main characteristics of a domain that influence the choice of selecting the appropriate recommendation technology?

What are the main knowledge sources in recommendation systems and how do they relate to the recommendation technology?

How can we map the domain characteristics to recommendation technologies?

Objective 2: Once we have a good understanding of recommender system technologies and their applications, we focus our attention on building recommenders in the social tagging domain. Social tagging applications allow users to annotate online resources, resulting in a complex three dimensional network of interrelated users, resources and tags often called a Folksonomy. In order to develop a recommender application for such a system, the first step is to create a model of the folksonomy that takes into account the information flow between users, resources and tags. We suggest a directed graph to model the folksonomy and apply an adaptation of PageRank algorithm on the graph for tag recommendation. Our goal in this part of the thesis is to improve the link analysis in folksonomies for tag recommendation and the research question we try to address are the following.

How can we model the folksonomy as a graph so that we can capture the flow of information?

How can we use the model for tag recommendation?

Does the proposed model produce better recommendation than the previous approaches?

Objective 3: Bridging the gap between folksonomies and ontologies as a research problem has recently received attention by many researchers in the semantic community. Applying semantic Web technologies to the data of the social Web can help users in search and navigation. On the other hand, the collaborative environment of social tagging systems can also provide a suitable platform for developing ontologies. In this thesis, we investigate how recommender systems can help users to collaboratively develop an ontology. We propose a machine learning algorithm that utilizes a seed concept hierarchy to automatically learn from the existing relations among concepts. The algorithm recommends semantic relations between a new concept (entered by users) and the existing concepts in the hierarchy. The research questions we try to answer are the following.

How can we aggregate user activities in a social tagging environment to infer semantic similarity?

How can we use machine learning to learn from the existing semantic relations in a concept hierarchy to predict semantic relations between a new concept and existing concepts?

How can the algorithm support collaborative ontology development?

Objective 4: Although flexibility and openness of social tagging systems has attracted millions of users, these properties also introduce challenges to the system. One of the major challenges that has received little attention from research community is the security of such systems. Attackers may attempt to distort the systems adaptive behavior by inserting erroneous or misleading annotations, thus altering the way in which information is presented to legitimate users. In this thesis, we systematically study this issue. We are interested to answer the following questions.

How can we systematically study the problem of attacks against social tagging systems?

How can we evaluate the impact of attacks in social tagging systems?

What model of attacks are more successful?

How many malicious users can a tagging system tolerate before results significantly degrade?

How much effort and knowledge is needed by an attacker to attack the system?

1.2 Thesis overview and contributions

In this section, we present an overview of the thesis and outline our contributions. The following subsections provide a brief description of each of the main chapters of the thesis.

1.2.1 Matching Recommendation Technology and Domains

The goal of this chapter is to assist researchers and developers to identify the recommendation technology that is applicable to different domains of recommendation. This chapter is a conceptual contribution to the field of recommender systems and it provides a comprehensive understanding of how recommender systems can be applied in different domains. In this chapter, we adopt a taxonomy of knowledge sources in recommendation from existing literature and match those knowledge sources to recommendation technologies. We identify the domain characteristics that influence on selection of recommendation technologies and map those characteristics to recommendation technologies based on the knowledge sources they require.


Extensive survey on the recommendation literature

Identification of domain characteristics that influence on selecting and applying recommendation technology

Mapping the domain characteristic to recommendation technologies


This work has provided the basic fundamentals for development of recommender systems in IBM T.J Watson Research Center.

Preliminary results of the work have been published as a paper in the workshop on recommendation and collaboration at the Intelligent User Interfaces Conference 2008 [167].

An extended version of the work has been published in the Recommender Systems Handbook providing a guide for research scientists and practitioners in the recommender system area [39].

1.2.2 Improving Link Analysis for Tag Recommendation in Social Tagging Systems

Tag recommendation is one of the techniques that can help reduce noise, redundancy, and ambiguity in social tagging systems. Many researchers have developed different techniques for tag recommendation. Our goal in this chapter is to improve existing tag recommendation techniques by introducing a new model of the folksonomy graph.


Developed an approach to model a folksonomy as a weighted directed graph.

Used the proposed directed graph and an adaptation of PageRank for tag recommendation.

Evaluated the proposed algorithm with data set from popular tagging applications such as Bibsonomy, Citeulike and Delicious and showed improvement over popular existing graph-based algorithms.


Improvement on existing tag recommendation algorithms

The results of this work is published in the workshop on recommender systems and the social Web [169] and as a rising scholar paper in the Journal of Emerging Technologies in Web Intelligence [166].

1.2.3 Using Recommender Systems for Continuous Ontology development

In this chapter we develop a recommender system to support continuous ontology development. Our goal is to develop a recommendation algorithm in a Web 2.0 platform that supports end users in the collaborative development of an ontology. The developed algorithm uses an existing seed hierarchy, the concepts already present and the sub/super concept relations between them to place a given new concept in the hierarchy. Thus, the output of the algorithm consists of potential super-concepts for a new concept.


Developed a machine learning algorithm that can learn from the existing relations of a concept hierarchy and suggest the potential super concepts for a new concept.

Experimental results from the proposed algorithm with data sets from Wikipedia show excellent results which confirm that the algorithm can be directly used in Wikis to support users for creating sub-super category relations.

Results from expert evaluations show that the algorithm can suggest accurate high quality semantic relations.


The developed algorithm has been implemented in the social semantic book-marking system SOBOLEO [229] and is being used by users.

The algorithm has been partly implemented into SAP Floyd system. Floyd is a case management system used by market researchers. Users receive recommendations on the potential semantic relations for a new term.

The algorithm can be directly used in Wikis to support users for creating sub-super category relations.

The results of this work has been published in the proceedings of 5th IEEE International Conference on Intelligent Systems [173] and 17th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2010 [168].

1.2.4 Combating Attacks Against Social Tagging Environments

In this chapter we discuss the problem of security and robustness in social tagging systems. We introduce a framework to model the navigation channels in social tagging systems and we identify different types of potential attacks against the system through different navigation channels. We propose approaches to evaluate the impact of attacks. We model different attack types and experiment their impact using dataset from popular social tagging systems. This chapter provides a strong conceptual understanding of the possible attacks against social tagging systems and the experimental section of the chapter helps identify the types of attacks that are more successful in changing the system behavior. Gaining a fundamental understanding of the nature and impact of such attacks can lead to more secure and robust social Web applications.

illustration not visible in this excerpt

Figure 1.1: Structure of the thesis


Outlined a framework to model the attacks based on various navigation channels and target elements.

Introduced evaluation metrics to measure the local and global impact of attacks.

Implemented and evaluated different type of attacks on data sets from real popular folksonomies such as Delicious and Bibsonomy.


This work creates the awareness of the security and manipulation problem in open adaptive systems specifically social tagging systems.

The results of this work can help social tagging Websites discover the vulnerabilities of their system, inform them about the parts of the system that need more monitoring, guide researchers and developers to develop more robust systems, and provide them with clues for attack detection.

The results of this work have been published in several workshops including [170,186] as well as IEEE International Conference on Social Computing 2009 [171].

We start this thesis with a background chapter. We present basic concepts and preliminaries, introduce the notation and briefly survey the related work. Next, in chapter 3 we investigate the problem of matching recommendation technologies and domains. In chapter 4, we present our algorithm to improve graph-based tag recommendation. Chapter 5 presents our approach to assist users for continuous ontology development. Chapter 6 is dedicated to our studies on the problem of attacks against social tagging systems. Finally, chapter 7 concludes the thesis with a summary and an outlook to future possibilities to extend this work. Figure 1.1 shows the structure of the thesis.

Chapter 2 Background

In this chapter we review the basic concepts and terminology used in this thesis. This chapter serves as an introduction into general related work covering the subareas of our work.

We start this chapter in section 2.1 by introducing recommender systems: first, a brief history of the field will be given, followed by description of popular recommendation algorithms. Next, in section 2.3 we introduce social tagging systems and our terminology to model such systems. We survey the current areas of research in social tagging systems including different recommendation approaches. Finally, we discuss several challenges in social tagging systems: ambiguity and redundancy, and attacks against social tagging systems. These challenges are the motivation of our work in the next chapters of this thesis.

2.1 Recommender Systems

The growth of the World Wide Web in the 1990s resulted in an explosive growth of the amount of information available online, outgrowing the ability of individual users to process all this information. This was a motivation for the increase of interest in research fields such as information retrieval (IR) and information filtering (IF).

Information Retrieval originated as a research field in the 1950s focusing on automatically matching user’s need (specially in form of a query) against a set of documents. Web search engines are the most visible IR applications today.

Information filtering systems emerged in the 1990s to help users with the information overload problem. These systems generally use long term user profiles to filter out irrelevant or redundant data items and just present the part of information that the user is interested in. Typically, such systems construct a model of users’ interests and match that against a stream of information objects. While IR and IF are considered separate research fields, they share many characteristics, such as a focus on analyzing textual content [22].

Recommender systems can be considered as active information filtering systems that attempt to present the user with information items he or she is interested in. Thus, they actively add personalized information items to the information flowing towards the users. Recommender systems were originally defined in [176] as “systems in which people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients”. However, this definition only captures “collaborative filtering” which is a specific algorithm for recommending. The term “recommender systems” evolved to replace and broaden the use of the term collaborative filtering and is defined by Schafer [190] as systems that specifically recommend lists of products and help users evaluate products. In this definition recommenders are systems that are used in E-commerce sites to suggest products to their customers and to provide consumers with information to help them decide which products to purchase. Schafer [190] present recommender systems as a way to achieve mass customization in e-commerce. From this perspective, recommender systems can be regarded as specialized data mining systems that have been optimized for interaction with consumers rather than marketers.

As the above definitions suggest, recommender systems are generally applied to ecommerce domains such as movies, music, books, news, images, and web pages. In this thesis, specifically in chapter 3, our perspective on recommender systems is broader than only used in e-commerce applications. We view recommenders as systems that produce individualized recommendations as output or have the effect of guiding the user in a personalized way to interesting or useful objects in a large space of possible options [32]. In the following section, we present the common recommendation algorithms.

2.2 Recommendation Algorithms

Over the past two decades many different recommendation algorithms have been proposed. Generally, in recommendation literature, all different approaches can be categorized to three main approaches: Collaborative Filtering (CF), Content-based (CB) and knowledge-based (KB) algorithms. In this section, we describe the basics of each recommendation technique and discuss the advantages and disadvantages of each approach.

2.2.1 Collaborative Filtering

Collaborative Filtering (CF) is the most popular and, to date, the most successful recommendation technique in e-commerce applications. CF algorithms usually employ statistical techniques to find like-minded people, known as neighbors, that have a history of agreeing with the target user. Once a neighborhood of users is formed, different algorithms are used to combine the preferences of neighbors to produce a prediction of ratings for not-rated items or a list of top n items as recommendation for the target user. The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future.

The term “collaborative filtering” was first introduced by Goldberg et al. [73], to describe their Tapestry filtering system in which people collaborate to help each other perform filtering by recording their reactions to documents they read. The reactions, called annotations, can be accessed by other people’s filters. Users had to identify the people who have similar taste to them by themselves. Subsequent CF approaches automated this process of locating the nearest neighbors of the active user. CF algorithm as it is known today, which tries to predict ratings for not rated items based on ratings of nearest neighbors, was presented by Resnick et al. [175] for recommending Usenet articles, and by Shardanand and Maes [192] in their Ringo music recommender system. These were the first to find the distance between users by correlating their rating history in order to (1) determine the most similar neighbors and (2) use those similarities to predict interest in new items.

Since then, many researchers have worked on expanding and improving collaborative filtering systems. Breese et al. [27] introduced the notion of memory-based and modelbased algorithms to improve the efficiency of CF techniques. Herlocker et al. [88] performed a large-scale evaluation of collaborative filtering algorithms using different weighting schemes. In the following subsections we explain the basics of collaborative filtering algorithm and describe the differences between user-based, item-based, memory-based and model-based algorithms.

Overview of the Collaborative Filtering Process

The task of collaborative filtering algorithms is to predict how well a user will like an item that he has not rated given his historical preferences together with the historical preferences of a community of users. User preferences can be explicit statements provided intentionally by the user such as ratings or implicitly inferred from user behavior such as browsing the web pages. The problem space can be formulated as a n x m matrix, R, for n users versus m items where cell Rij of the matrix represents the rating of user i on item j. Under this formulation, the problem is to predict values for empty cells. In general, this matrix is very sparse since each user usually has rated a small percentage of total number of items. Figure

2.1 shows a simplified example of user-item matrix for movie ratings. In this example, the recommender systems attempts to provide a prediction for Indep.Day for user 5.

The user-based collaborative filtering, tries to predict the rating of an item by looking at the “nearest neighbors” of the target user. In this example, based on the comparison of rating history of user 5 with other users, user 2 has the most similar rating history to user 5. They have closely agreed on all the movies that they have both seen. As a result, user 2’s opinion

illustration not visible in this excerpt

Figure 2.1: A simplified example of user-item matrix for movie ratings

about Indep. Day will influence the prediction for user 5. For simplification, in this example, if the system decides only based on one nearest neighbor, then the prediction for user 5 will be equal to 6 for Indep.Day.

Herlocker et al. [88] separates the user-based method into three steps: (1) Weight all users with respect to similarity with the active user. The most common technique to find similarity among users is Pearson correlation. (2) Select a subset of users to use as the nearest neighbors for prediction. (3) Normalize ratings and compute a prediction using a weighted combination of selected nearest neighbors.

Although user-based collaborative filtering has been very successful, it has some potential challenges including sparsity and scalability. In practice, in many commercial systems, even active users may have purchased or rated under 1% of the items. Thus, a user-based recommender system can hardly find nearest neighbors for a particular user and thus it will be unable to make any item recommendations. To solve this problem, Sarwar etal. [188] proposed the item-based collaborative filtering.

Item-based collaborative filtering looks at the set of items the target user has rated and computes how similar they are to the target item i and then selects k most similar items. Once the most similar items are found, the prediction is then computed by taking a weighted average of the target user’s ratings on these similar items. Figure 2.2 shows the same useritem matrix as in the previous example. Based on user-ratings the Indep.Day is most similar to Star Wars. Thus, the system predicts that user 5 will have a similar opinion about these movies and predicts 7 for Indep.Day (if we consider only one similar item). In practice, the prediction is based on weighted average of the k most similar items. The recommender system used in (people who bought this, also bought this) works on the same principle.

Memory-based vs. model-based collaborative filtering

CF algorithms are commonly divided into two types, memory-based and model-based algorithms. Memory-based algorithms are similar to lazy machine learning algorithms that

illustration not visible in this excerpt

Figure 2.2: Item-based collaborative filtering finds similar items based on users rating history

postpone the task of building a model to the time of recommendation generation. At generation time, they scan the entire database to produce a prediction. Generally, user-based CF is considered as memory-based since all the computations are done real time at the time of user interaction.

Model-based CF algorithms learn an aggregated model of user behavior in advance and use this model to generate recommendations. The model building process can be performed by different machine learning algorithms including Bayesian learning, clustering, association rule mining, and factor models.

Bayesian networks create a model based on a training set with a decision tree where a node corresponds to each item. The states of each node correspond to the possible vote values for each item. The model can be built offline. The resulting model is small, fast, and essentially as accurate as nearest neighbor methods [27]. Bayesian networks may prove practical for environments in which knowledge of user preferences changes slowly with respect to the time needed to build the model but are not suitable for environments in which consumer preference models must be updated rapidly or frequently [190] since a new model has to be built every time the preferences change.

Clustering techniques work by identifying groups of users who have similar preferences. Once the clusters are created, predictions for an individual can be made by averaging the opinions of the other users in that cluster. Clustering techniques usually produce lesspersonal recommendations than other methods, and in some cases, the clusters have worse accuracy than nearest neighbor algorithms [27]. Once the clustering is complete, however, performance can be very good, since the size of the group that must be analyzed is much smaller. Clustering techniques can also be applied as a first step for creating a smaller set of candidates for a nearest neighbor algorithm or for distributing nearest-neighbor computation across several recommender engines [190].

Association rules have been used for many years in marketing, to analyze patterns of preference across products, and to recommend products to consumers based on other products they have selected [190]. Association rules are used to discover regularities between products in large scale transaction data. An association rule expresses the relationship that one product is often purchased along with other products. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The well-known constraints are minimum thresholds on support and confidence

Association rules are more commonly used for larger populations rather than for individual consumers. For example, they can be used as the basis for decisions about marketing activities such as promotional pricing or product placements in supermarkets. Like other learning methods that first build and then apply models, association rules are less suitable for applications where knowledge of preferences changes rapidly. By contrast, recommender systems based on nearest neighbor techniques are easier to implement for personal recommendation in a domain where user opinions are frequently added, such as e-commerce applications [190].

Latent factor models try to reduce the dimensionality of the space of user-item ratings by mapping users and items to the same latent factor space [109]. The users and items are then related to each other through these latent factors. Examples of latent factor techniques applied to recommendation include Singular Value Decomposition (SVD) [189, 163], factor analysis [42], Probabilistic Latent Semantic Analysis (PLSA) [97]. Factor models have been the most successful recommendation algorithms to date. The successful algorithms in the Netflix competition all used some kind of matrix factorization models [203]. The Netflix prize4, a research competition with one million dollar prize, was awarded to the research group that could improve the Netflix5 existing recommendation algorithm for predicting movie ratings by 10%.

Graph-based techniques create a graph from the user-item matrix and use different graph-based appraoches to make a prediction. Horting is a graph-based technique in which nodes are users, and edges between nodes indicate degree of similarity between two users [7]. Predictions are produced by walking the graph to nearby nodes and combining the opinions of neighbors. Horting differs from nearest neighbor as the graph may be walked through other users who have not rated the product in question, thus exploring transitive relationships that nearest neighbor algorithms do not consider [190]. Graph-based techniques have gained more popularity in recommendation in the social Web domain. In chapter 4 we present more details of those approaches.

K-nearest neighbor algorithm is generally considered as one the lazy algorithms in machine learning. However, if all of the user similarities (for user-based CF) or item similarities (for item-based CF) are computed in advance, and not Realtime, then we can consider these algorithms as model-based as well. In this case, the model is basically the similarity matrix.

Advantages and Disadvantages of Collaborative Filtering

The greatest strength of collaborative techniques is that they do not require any information about the product that they are recommending. Thus, they are completely independent of any machine-readable representation of the objects being recommended, and work well for complex objects such as music and movies where variations in taste are responsible for much of the variation in preferences [32]. This feature makes collaborative filtering as the most widely implemented and most popular recommendation technology specifically for ecommerce domains where new products come and go on regular bases.

Another advantage of CF algorithm is that CF algorithms are able to take the quality of an item into account when recommending items, especially in the case of explicit user ratings [22]. For instance, two movies might have the same genre, but might have very different quality. By taking actual user preferences into account, CF algorithms can prevent poor recommendations that are merely based on product features.

Collaborative filtering systems are the only kind of recommender systems that can provide serendipitous recommendation. A serendipitous recommendation is something new, non-obvious and appreciated that the user would likely not have discovered on his/her own. For example, an unfamiliar song from an unfamiliar musician, or a unfamiliar movie from an unfamiliar director.

On the other hand, CF algorithms suffer from several problems. First, collaborative filtering systems are not able to provide explanation for recommendations. Current CF systems are black boxes, providing no transparency or reasoning behind a recommendation. Many users want to know why they get a certain recommendation specifically if there is a high risk involved.

Second problem of CF algorithms is the “cold-start” problem. This problem appears in startup phase of the recommender system, or when a new user or a new item is added to the system. In these cases, the system cannot generate any recommendations. Solutions to this problem include using other data sets to seed the system, using different recommendation algorithms or simply recommending most popular items when the cold start problem exists. Even after acquiring more ratings from the users, sparsity of the user-item matrix can still be a problem for CF.

Another problem mentioned in literature is that collaborative recommenders work best for a user who fits into a niche with many neighbors of similar taste. It does not work well for so-called “gray sheep” [50], who fall on a border between existing cliques of users [32].

2.2.2 Content-based Recommendation

Content-based recommendation algorithms, also known as content-based filtering, can be considered as an outgrowth and extension of the information filtering research. Contentbased systems work by creating some kind of representation of the items based on their content features. For example, information retrieval systems like the newsgroup system NewsWeeder [118] use terms in a document as features. A content-based recommender learns a profile of the user’s interests based on the features present in objects the user has rated. The type of user profile derived by a content-based recommender depends on the learning method employed. The user profiles might be long-term models that are updated as more evidence about user preferences is observed or might be ephemeral profiles (usually in form of user query).

Typically, content-based filtering is approached as either an IR problem or a machine learning problem [22]. In the IR approach textual similarity (e.g tf.idf) is used to match items representations (e.g. documents) against user representations (e.g. query). Examples of this approach include many news recommender applications such as [81]. In the machine learning approach the features of items are used to train a prediction or classification algorithm. Examples of this approach include neural networks used in Re:Agent [23] for email filtering, decision trees and Bayesian networks used in InfoFinder [160] for document recommendation.

Advantages and Disadvantages of Content-based Recommendation

One of the advantages of CB algorithms is that, in contrast to collaborative approaches, they have no cold start problem for new items since the features of the new item can be extracted when the item is added. Second advantage of CB algorithms is that similar to CF algorithms, they do not need deep domain knowledge and it is sufficient to store the item features in a database and collect implicit feedback from the users about their item preferences. However, they are still more explainable than CF algorithms since they match the features that the user is interested in with item features.

The main disadvantage of CB algorithms is that in domains with complex objects such as music and movies where there are large numbers of features, it is difficult to do a perfect content analysis and discover the main real features that make the product attractive to a user. In addition, content-based filtering algorithms can only recommend items from a narrow topic range; they are unable to provide serendipitous recommendations.

2.2.3 Knowledge-based Recommendation

All recommendation algorithms attempt to help a user to find interesting items of their taste. Collaborative filtering algorithms do this based on the behavior of the user and other likeminded users, whereas content-based filtering approaches do this based on the features of the items of users interests and the features of available items. A third class of recommendation algorithms is formed by knowledge-based algorithms. They use domain knowledge, and recommend items based on functional knowledge of how a specific item meets a particular user need [32].

Domain knowledge can take many forms, but without loss of generality we can describe it as an ontology in which there are relations among attributes, objects, and concepts, especially related to item features. It may also contain means-ends knowledge about how items may meet potential user goals. The presence of domain knowledge may permit items to be described with structured representations as opposed to simple vectors of features. A recommender system may also rely on knowledge that is external to the question of product suitability or “best match”. The owners of a system may prefer certain recommendations for business reasons. For example, a video rental service may prefer to recommend items from its less-popular back catalog over in-demand new releases. Such knowledge we refer to as business rules knowledge.

Burke [32] identifies three types of knowledge required in a knowledge-based system: catalog knowledge, functional knowledge, and user knowledge. Catalog knowledge is knowledge about the objects being recommended and their features. This type of knowledge is similar to the one used in content-based systems, however, it has more deep knowledge about the relationships between the features. For example, the Entree recommender [40] should know that “Thai” cuisine is a kind of “Asian” cuisine. Functional knowledge is the knowledge that enables the system to map between the users needs and the object that might satisfy those needs. For example, Entree knows that a need for a romantic dinner spot could be met by a restaurant that is “quiet with an ocean view”. User knowledge is the knowledge the system gets about the user. This might take the form of general demographic information or specific information about the need for which a recommendation is sought.

Advantages and Disadvantages of Knowledge-based Recommendation

Knowledge-based recommenders are the most reliable approach (if the domain knowledge is comprehensive and up to date) because the background knowledge is free of noise. KB systems, specifically the conversational ones, allow the user to provide a rich specification of their need, which in turn results in more satisfying recommendations. Other advantages of knowledge-based recommendation include no cold start problem, and its ability for intuitive explanations of why a certain item was recommended.

The disadvantage of KB is the high cost and effort for setup and maintenance. It is usually too much of effort for domain experts to capture all the objects, attributes and relations in a domain. As a result, knowledge-based recommendation is not as popular as the other two classes of algorithms.

2.2.4 Other Types of Recommendation

Although in most recommendation literature, the three types of CF, CB, and KB has been identified as main types of recommedation techniques, some researchers have other categorizations. Several researchers such as Adomavicius et al. argue that CB and KB are basically similar since they use item features. Thus, they consider only two main categories of collaborative and content-based techniques [5]. While other researchers such as Burke have considered other categories such as demographic and utility-based as additional recommendation techniques [32]. Burke defines utility-based as systems that make suggestions based on a computation of the utility of each object for the user and demographic recommenders as the ones which categorize the user based on personal attributes and make recommendations based on demographic classes.

Hybrid recommender systems combine two or more recommendation techniques to gain better performance with fewer drawbacks as compared to individual techniques. A survey on existing recommender systems and possible directions in hybrid systems is presented in [32].

2.3 Social Tagging Systems

In the past decade there has been a considerable increase of social websites focusing on facilitating information sharing, interoperability, user-centered design, and collaboration. These websites named as “Web 2.0” allow users to interact with each other and contribute to websites’ content, in contrast to websites that users are limited to the passive viewing of information. Examples of Web 2.0 applications include social-networking sites, wikis, blogs, and websites that support content sharing.

An important component of many of these websites is social tagging; allowing users to associate tags to items (also called resources) and share them with others. These items can vary from bookmarks6, to photos7, videos8, books9, scientific articles10, movies11, music12, slides13, news articles14, activities15, and etc. Tags are freely chosen keywords associated to an item reflecting what the user thinks is the appropriate term to describe that item. Depending on the system’s specifications a tag can be made up of one or more words. Some systems like Delicious for example, consider two words separated with space as two distinct tags. In such systems users have to escape the one-word-only limitation by concatenating words. For example, in Delicious “webdesign” is one tag but “web design” is considered as two distinct tags “web” and “design”. There is typically no limit to the number of tags that may be assigned to a resource and there is no strict hierarchy of tags.

Social tagging systems are gaining more popularity because they offer users several benefits. First, they allow users to organize their own data with a level of freedom not possible in traditional taxonomic filing systems. Second, social tagging systems provide users with the means to openly share this information among friends and colleagues. Third, they also allow anyone to utilize the collective knowledge of others for discovering new topics, resources or perhaps even new friends. Fourth, the reuse of tags creates a dynamic user driven approach to formalize semantic relationships.

Social tagging systems facilitate the retrieval and discovery of resources by providing different possibilities to navigate through the system. Figure 2.3 shows a screenshot of Delicious website showing the popular bookmarks for the tag “ontology”. There are many different options for users to navigate through the search results. On the top, there is an option for recent or popular bookmarks allowing users to look at the most recent or the most popular bookmarks. In front of each link, there is a number showing the total number of people who have saved a specific resource; by clicking on the number, the user can navigate into other users’ profiles and browse through their other tags and resources. The user can also see other tags that have been commonly associated to each resource. On the right side of the screen, “related tags” are presented which are calculated by the system based on the aggregation of user profiles and the number of times two tags have co-occurred together.

Social tagging systems have also been called “collaborative tagging” and “Folksonomy”. Even though some researchers have differentiated between these names, they have been commonly used in the literature as other names for social tagging systems. Bogers [22] differentiates between social tagging and collaborative tagging by considering the two types of tagging systems: broad and narrow systems. In a broad tagging system such as social bookmarking websites, all users can tag the resources while in narrow systems such as photo or video sharing websites only the creator of a resource can tag the item. Bogers considers only the broad tagging systems as collaborative tagging since a tag can be applied multiple times to the same resource in such systems. The word “Folksonomy” is a portmanteau of “Folk” and “Taxonomy” and was first introduced by Vander Wal [215], who defined a folksonomy as the result of “personal free tagging of information and objects for one’s own retrieval”. Different variations on this definition have been proposed in the past. In this thesis, we use the terms social tagging and collaborative tagging interchangeably whereas folksonomy is the outcome of social tagging; a complex network of interrelated users, resources and tags.

illustration not visible in this excerpt

Figure 2.3: Screenshot of Delicious website showing the popular bookmarks for ontology. There are different navigation possibilities through the system.

2.3.1 Challenges in Social Tagging Systems

Despite the many benefits offered by folksonomies, they also present unique challenges. In this section, we briefly discuss some of the major challenges including ambiguity, redundancy, and attacks against social tagging systems.

Ambiguity and Redundancy

Most collaborative tagging applications allow the user to describe a resource with any tag they choose. As a result they contain numerous ambiguous and redundant tags.

Ambiguous tags have multiple meanings. A tag may have different word senses; “apple” can refer to the company or to the fruit. Names may also result in ambiguity; “paris” might mean the city or the celebrity. Subjective tags such as “cool” can result in ambiguity since different users have contradictory notions of what constitutes cool. Finally, overly vague tags such as “tool” can mean gardening implements or software packages. Ambiguous tags can impede users as they navigate the system or burden the user with unwanted recommendations.

Redundant tags share a common meaning: “America” and “USA” confer the same idea. Because users may annotate resources with any tag they choose, folksonomies are full of redundant tags that share a common meaning. Syntactic variance such as “blogs” or “blogging” can cause redundancy. Case (“java” or “Java”), spelling (“gray” or “grey”), and multilinguism (“Photo” or “Foto”) may also result in redundancy. The use of non-alphanumeric characters, abbreviations, and acronyms are other sources of redundancy.

We studied the impact of ambiguity and redundancy on tag recommendation in [69]. In this work, we studied how ambiguity and redundancy can cause misleading evaluation of the effectiveness of tag recommenders. While recommenders are often judged by their ability to predict items occurring in a tests set, the quality of a tag recommender may be underestimated by traditional metrics if it routinely passes up ambiguous tags in order to recommend tags with greater information value. Moreover, evaluation metrics may overvalue a recommender that proposes ambiguous tags despite their lack of specificity. Similarly, redundancy can hamper the effort to judge recommendations as well, but from the opposite perspective. A recommended tag may be counted as a miss even though it is synonymous to a tag in the holdout set. An example may be when the holdout set for a test user contains the tag “java” while the recommendation set contains “Java.” Therefore, redundancy may mask the true effectiveness of the recommendation algorithm: while from the user’s perspective “Java” is a good recommendation, in the evaluation process it would appear as incorrect.

We employed a cluster-based approach to define and measure ambiguity and redundancy. We clustered both resources and tags into highly cohesive partitions based on co-occurrence. A tag is considered ambiguous if several resources from different clusters have been annotated with it. Tags from the same tag cluster are considered redundant. We chose the clusterbased approach over a variety of semantic and linguistic approaches because it provides a more general and language independent method for defining ambiguity and redundancy. We defined metrics, based on the resulting clusters, for measuring the degree of ambiguity for a tag, and the level of redundancy for pairs of tags. We provided extensive evaluation on three real world folksonomies to determine the impact of ambiguity and redundancy across several common tag recommendation algorithms as well as across data sets.

Other researchers have also looked at ambiguity and redundancy problems. Ambiguity is a well known problem in information retrieval and has been identified as a problem in folksonomies as early as 2004 in [134]. Flickr uses the co-occurrence of tags to cluster tags and shows related tags grouped into clusters. Sigurbjornsson and Zwol [194] use a probabilistic approach to model tag co-occurrences and measure ambiguity to facilitate the tag recommendation for photographs. WordNet has been used to identify ambiguous tags and disambiguate them by using synonyms [119]. Folksonomy search is expanded with ontologies in [106] to solve the ambiguity in tagging systems. In [152] the user profile is used to resolve ambiguity, by considering other tags the user has applied. Clustering was used to measure ambiguity by Yeung et al. [226]. They focus on the network analysis techniques to discover clusters of nodes in networks. The ambiguous tags appear in several clusters and the top ten tags from each cluster are used to find the right context for disambiguation. The work is continued in [227] where a method for exploring the semantics of a tag is described. In [108] multidimensional scaling is used for co-word clustering to visualize the relationships between tags.

Entropy as a measure of tag ambiguity has been proposed in [232, 222]. They used a probabilistic generative model for data co-occurrence to determine ambiguous tags. The tagging space is assumed to cover different categories and the tag membership of each category is estimated via the EM algorithm. Our measure of ambiguity in [69] uses a similar approach. We cluster resources and use the distribution of tags across the clusters to measure their ambiguity.

Redundancy in a folksonomy is largely due to the ability of users to tag resources irrespective of a strict taxonomy. In [75] two types of redundancy are identified: structural and synonymic. Structural redundancy refers to terms that are essentially different forms of the same word. Synonymy refers to different tags that share identical or roughly-equivalent meanings. In [214] structural redundancy is explained by stemming to remove suffixes, removing stop words, comparing tags for differences of only one character or identifying compound tags. Synonymic redundancy is evaluated in [9] by using WordNet to determine synonyms. In [67, 68] agglomerative clustering is used to identify similar tags. Clusters of tags are used as intermediaries between users and resources in order to reduce the noise generated by redundant tags. Tag recommendation is one of the techniques to avoid ambiguity and redundancy in social tagging systems. We introduce a graph-based tag recommendation in chapter 4 and we will experimentally show that our algorithm outperforms one of the most successful graph-based tag recommenders, FolkRank.

Attacks Against Social tagging Systems

Like other publicly accessible adaptive systems such as collaborative recommender systems, tagging systems present a security problem. Attackers, who cannot be readily distinguished from ordinary users, may inject biased profiles in an attempt to force a system to adapt in a manner advantageous to them. This problem, even though serious, has not been taken so much of attention in the research community. We discuss the problem in more depth in chapter 6. We will present the dimensions that characterize an attack and outline a framework to identify different types of potential attack strategies against a social tagging system.

In the following sections, we first present our terminology to formally describe a folksonomy and next we review some related work in this area.

2.3.2 Formalization of Folksonomy

In this section, we present our formalization of folksonomies. We use the same formalization in the next chapters. First, we review some basic concepts and definitions in graph theory used in the formalization.

Basic concepts and definitions

We consider the folksonomy as a network and model it with a graph. A graph G = (V, E) is defined with a vertex set V, where N = denotes the number of nodes, and an edge set E. An edge is a two-element subset of V. We interchangeably use terms vertex or node to refer to elements of the vertex set V, and similarly edge or link to refer to elements of the edge set E. We present the definitions of several basic graph-theoretic concepts:

Bipartite graph: Graph G is bipartite if its vertex set can be partitioned into two disjoint sets V1, V2, so that there are only edges connecting nodes across the sets V1 and V2. Or equivalently, there exist no edges between the nodes of the same partition.

Tripartite: Graph G is tripartite if its vertex set can be partitioned into three disjoint sets V1, V2, V3, so that there are only edges connecting nodes across the sets V 1,V2 and V3. Or equivalently, there exist no edges between the nodes of the same partition.

Directed and undirected graph: A graph is undirected if (i, j) E E (j, i) E E, i.e., edges are unordered pairs of nodes. If pairs of nodes are ordered, i.e., edges have direction, then the graph is directed.

Hypergraph and Hyperedge: Hypergraph is a generalization of a graph, where an edge can connect any number of vertices. Formally, a hypergraph H is a pair H = (V, E) where V is a set of vertices, and E is a set of non-empty subsets of V called hyperedges.

Folksonomy Graph

In order to model networks of folksonomies at an abstract level, we use the formalization introduced by Mika [138], representing the system as a tripartite graph with hyperedges. The set of vertices is partitioned into the three disjoint sets U = u1, u2,...,uk, corresponding to the set of users, T = {t1, t2,...,tm} the set of tags, R = r1, r2,...rl the set of resources or objected annotated. In a social tagging system, users associate objects with tags, creating ternary associations between the user, the tag and the resource. Thus we can define A, a set of annotations, represented as user-tag-resource triples.

illustration not visible in this excerpt

Thus, the folksonomy can be described as a four-tuple D:

illustration not visible in this excerpt

To simplify analysis, we can reduce the hypergraph into three bipartite graphs with regular edges. The graphs model aggregate associations between users and resources (UR), users and tags (UT), and tags and resources (TR) [138, 102]. Aggregate projections of the data can reduce the dimensionality by sacrificing information [191]. The relation between resources and tags can be formulated as a two-dimensional projection, RT, such that each entry, RT(r, t), is the weight associated with the resource, r, and the tag, t. This weight may be binary, merely showing that one or more users have applied that tag to the resource, or it may be finer grained using the number of users that have applied that tag to the resource:

illustration not visible in this excerpt

Such a measure is equivalent to term frequency or tf common in Information Retrieval. Similar two-dimensional projections can be constructed for UT in which the weights correspond to users and tags, and UR in which the weights correspond to users and resource where

illustration not visible in this excerpt

In words, the bipartite graph UT links the users to the tags that they have associated at least to one resource. Each link is weighted by the number of times the person has used that tag. UR links the users to the resources that they have tagged . Each link is weighted by the number of tags the person has used for that resource. We can further break each bipartite graph into two simple graphs. For example, the UT graph can be folded into two graphs: a social network of users based on overlapping sets of tags; and a network of tags based on overlapping sets of users. Such a network of tags is referred to lightweight ontology of tags by [138]. Figure 2.4 shows the steps of the described process.

The other two bipartite graphs that we derive from the original tripartite model can also be folded into one-mode networks in a similar fashion. In particular, the TR graph leads to another semantic network, where the links between tags are weighted by the number of resources that are tagged with both terms. This type of network mimics the basic method applied in text mining, where terms are commonly associated by their co-occurrence in documents. The UR graph results in another social network of users, where the weight between two users is given by the number of resources they have both tagged. We also get a network of resources, with associations showing the number of people who have tagged a given pair of resources.

illustration not visible in this excerpt

Figure 2.4: We can reduce the hypergraph into three bipartite graphs with regular edges. The graphs model aggregates associations between users and resources (UR), users and tags (UT), and tags and resources (TR)

2.4 Related Work in Social Tagging Research

As social tagging applications have gained in popularity, researchers have started to explore and characterize the tagging phenomenon. In this section we provide an overview of the state of the art in the research area of social tagging systems. First, we review general related work in this area including approaches to characterize different social tagging systems and user motivations. Next, we review different data mining approaches for improving search and navigation in social tagging systems including different dimensions of recommendation. Finally, we present some current research efforts on bridging the gap between ontologies and folksonomies.

2.4.1 Characterize Social Tagging Systems and User Motivations

Many researchers have started to characterize social tagging systems and explore the reasons for emergence and popularity of tagging phenomenon [124, 76]. One of the significant formal studies of tagging systems appeared in the work of Golder and Heberman [76]. The authors studied the information dynamics in collaborative tagging systems, specifically, the delicious system. The authors discussed how tags have been used by individual users over time and how tags for an individual resource stabilizes over time. They also discussed two semantic difficulties: polysemy (when a single word has multiple related meanings) and synonymy (when different words have the same meaning). Macgregor and McCulloh provide an overview of the phenomenon and explore reasons why both social tagging and ontologies will have a place in the future of information access [124]. Ames and Naaman investigate the incentives and motivations for tagging in photo sharing systems such as Flickr. They present a taxonomy of motivations which combines sociality and functional motivations [10].

Chi and Mytkoswicz [47] have analyzed Delicious and found that the efficiency of social tagging decreases as the communities grow; that is, tags are becoming less and less descriptive and consequently it becomes harder to find a particular item. Simultaneously, it becomes harder to find tags that efficiently mark an item for future retrieval.

Adam Mathes [134] describes tags as user-created metadata, where users of the documents and media create metadata for their own individual use that is also shared throughout a community. Mathes points to several limitations and strengths of tagging systems. The major limitations include ambiguity, synonymy, and multiple words. The multiple words problem happens because most tagging systems are designed primarily to deal with a single word as a tag. So multiple word tags are subsequently parsed as separate tags (”Bill” and ”Clinton”); or users escape the one-word-only limitation by concatenating words, for example “Bill-Clinton” or “BillClinton”. All these different variations can make the search and filtering of the system inefficient. We will discuss the ambiguity and redundancy problems in more detail in section 2.3.1. The strength of tagging systems that Mathes points in [134] are serendipity while navigating through the system and the freedom of users to use their individual vocabulary. In a similar study, Wu et al. discuss the benefits and challenges of social tagging systems [221]. They propose folksonomies as potential technological infrastructure to support knowledge management activities in an organization or a society. However, they also point to the problem of lack of a document hierarchy in the knowledge taxonomy emerged by employee-generated folksonomy which prevents it from being widely adopted by enterprises. The authors suggest harvesting social knowledge by identifying communities of common interest, and identifying information leaders or domain experts, and generating ontologies.

Marlow et al. [133] analyze and compare the design and features of existing social tagging systems and develop two organizational taxonomies for social tagging systems . The first taxonomy describes system design and attributes and the second taxonomy describes user incentives. The first taxonomy is presented in seven dimensions including tagging rights (self-tagging, permission-based, free-for all), tagging support (blind, suggested, viewable), aggregation model (bag, set), resource type (textual, non-textual), source of material (usercontributed, system, global), resource connectivity (links, groups, none), and social connectivity (links, groups, none). The second taxonomy describes motivations of tagging and is categorized into two high-level categories of organizational and social.

2.4.2 Resource Recommendation and Personalized Search

Many researchers have explored different data mining techniques to facilitate user navigation, and to improve search and personalization in social tagging systems.

Recommender systems have been widely applied in folksonomies to overcome information overload and to help users to find high-quality sources, whether resources (e.g documents) or people. Unlike traditional recommender systems which have a two-dimensional relation between users and items, tagging systems have a three dimensional relation between users, tags and resources. Recommender systems can be used to recommend each of the dimensions based on one or two of the other dimensions.

Resources in social tagging systems can vary from bookmarks to video, music, photos and etc. Social tagging systems are potentially a suitable place for people interested in a specific domain to find new resources of their interest. However, to deal with the amount of information, users need to get supported by the system. Helping users to find interesting resources can be considered as “search”, when a user explicitly types a query, or “recommendation”, when the system suggests a resource to a user based on their profile and activities without the user explicitly asking for it. The search results might be personalized, in which the user profile such as user tags and resources are taken into account. In a personalized search, each user might get a different output for the same query. In this subsection, we review the state of the art in both areas of personalized search and recommendation of resources in folksonomies.

Traditional item recommendation algorithms can be used to recommend resources to users. However, tags can enrich user profiles with additional valuable information for recommendation. For instance, authors in [206] apply user-based and item-based collaborative filtering to recommend resources. The authors incorporat tags into standard CF algorithms by reducing the three-dimensional correlations to three two dimensional correlations and then applying a fusion method to reassociate these correlations. Similarly, [151] and [150] use tags as context information to recommend resources. Similar approaches for search personalization using user tags are presented in [223, 209]

Using tag clustering for resource recommendation is presented in [153]. In this work, first an affinity level between a user and a set of tag clusters is calculated. A collection of resources are then identified for each cluster based on tag usage. Resources are recommended to the user based on the user’s affinity to the clusters and the associated resources. Similarly, the role of tag clustering in personalized search and navigation is presented in [72, 71, 193]. The authors show that clustering provides a means to overcome redundancy and ambiguity thereby facilitating recommendation. Similar clustering algorithm for search personalization in suggested in [18]. In this work, the latent semantic associations between tags is captured using a third-order tensor technique. Then, hierarchical agglomerative clustering was employed to discover relevant resources and rank the search results. Pan et al. [156] suggest expanding folksonomy search with ontologies to address the problem of ambiguity in tagging systems. Probabilistic models have been utilized in [162] and [217] for resource recommendation.

In [100], a novel algorithm, FolkRank, for search and ranking in folksonomies is proposed that considers the interrelations among tags, resources and users. The authors extend the commonly known PageRank algorithm to folksonomies under the assumption that users, resources and tags are important if they are connected to other important tags, resources and users. They use a weight passing scheme to derive the importance of an object in folksonomies.

In addition to personalized search within the folksonomies, researchers have proposed to use social bookmarking websites such as Delicious for improving search results in search engines. In [14], ranking of web search is optimized using social annotations by taking into account the similarity of the query to the Webpages in Delicious. Morrison [149] compared the search performance of social bookmarking Websites against search engines and subject directories and found out that folksonomy search results overlap with those from the other systems, and documents found by both search engines and folksonomies are significantly more likely to be judged relevant than those returned by any single IR system type.

2.4.3 Tag Recommendation

Tag recommendation, the suggestion of tags during the annotation process reduces the user effort. Tag recommenders assist the tagging process by suggesting users a set of tags that they are likely to use for a resource. Personalized tag recommenders take the users’ tagging behavior in the past into account when they recommend tags. Researchers have applied different data mining and machine learning techniques to the tag recommendation problem.

A comparison of user-based collaborative filtering and a graph-based recommender based on the PageRank algorithm to recommend personalized tags is offered in [105]. Association rules are explored in [94] to recommend tags and introduce an entropy-based metric to find how predictable a tag is. The title of a resource and the user vocabulary is used in [122] to generate recommendations. The results show that tags retrieved from the user’s vocabulary outperform recommendations driven by resource information. The authors in [224] present general criteria for a good tagging system including high coverage of multiple facets, high popularity and least-effort. They categorize tags into different categories of content-based, context-based, attribute, subjective, and organizational tags and use a probabilistic method to recommend tags. Basile et al propose a classification algorithm for tag recommendation in [15] and Adrian et al. suggest a semantic tag recommendation system in the context of a semantic desktop in [6].

User-defined tags and co-occurrence are employed by [194] to recommend tags to users on Flickr. The assumption is that the user has already assigned a set of tags to a photo and the recommender uses those tags to recommend more tags. A similar study is conducted in [66] and a classification algorithm for tag recommendation is introduced. We adapted K-Nearest Neighbor for tag recommendation in [70] and showed incorporating user tagging habits into recommendation can improve recommendation.

Increasing interest in improving the effectiveness of tag recommendation attracted many researchers from all over the world to the ECML/PKDD Discovery Challenge 2009 [60] which was mainly focused on tag recommendation. Different data mining approaches including content-based techniques, probabilistic models, and factor models have been applied for tag recommendation. The comparison of different approaches [174] shows that graphbased and factorization models have the best performance in tag recommendation.

Most of the mentioned approaches for tag recommendation are appropriate for dense parts of the data where there is enough information available about the user or the resource. Thus, these approaches have problem in cold start situation when a new user or new resource is introduced into the system, or if the resource has not been associated to certain number of tags by certain number of users in the past. In these cases, content-based approaches are more applicable. Song et al. utilized machine learning algorithms including gaussian process classification, SVM model, vector space model and poisson mixture model to predict tags based on the content [197, 198]. One part of the ECML/PKDD 2009 Discovery Challenge was dedicated to the sparse part of the data to challenge content-based algorithms for tag recommendation. The succefull approaches relied on a combination of good preprocessing, some external knowledge sources and a good heuristic to choose the right set of tags [123]. An extensive comparison of different content-based algorithms for tag recommendation is presented in [104].

2.4.4 User Recommendation

Identifying communities of common interest, and information leaders or domain experts are the major potentinal benefits of social tagging systems from a knowledge management point of view [221]. User recommendation is common in social networking systems in which there is explicit connections among users. For example, facebook recommends new friends based on the number of friends in common. However, the more interesting application is to help users to find domain experts or like-minded people based on their profiles. Some social tagging systems such as show users their neighbors, the people that have similar listening tastes. In such scenarios, users can find new friends or groups with similar taste who they did not know before; this recommendation can be more useful since it is more serendipitous and provides users with a valuable information that they could not easily find out by themselves.

ExpertRank is introduced by [107] to quantify users’ expertise in the context of a tag in an enterprise tagging system. Two models are proposed to calculate ExpertRank. The first model simply assumes an unstructured tag space where tags have no dependencies and expertise gained in a tag context is independent of expertise gained in another tag context. The second and more realistic model assumes clustered tag space where each cluster contains set of tags related to each other. The relationships between the tags in a cluster are represented by links between tags and indicate overlapping expertise areas. The model enables a mechanism similar to the personalized version of the PageRank algorithm [29] to propagate expertise through the linked structure of the tags.

Another approach to expert finding is introduced in [64]. In this work, user-provided content and taxonomy classifications are used to describe topics of expertise. The approach is implemented for the DBLP16 -Expert Finder, a system for expert finding and exploratory search across researchers and topics in the field of computer science. Similar approach to expert finding is presented in [41]. The paper describes an approach that finds experts, expertise and collaboration networks in the context of the peer-review process. A taxonomy of computer science topics and an ontology of publications is used to create relationships from papers to one or more topics based on paper abstracts and keywords. Next, an expertise profile for a researcher is built based on the aggregation of the topics of his/her publications. Tag-based profiles are used in [58] to find persons with similar interests. Similar to previous ones, this work also uses DBLP collection as data set and the paper keywords as tags. Although a scientific database such as DBLP has some similarities to a narrow social tagging system, it has different properties. Such data is free of noise and has key features such as the abstract or citations that can help identify the properties of resources precisely.

The literature focusing on the problem of user recommendations in folksonomies is still sparse. One of the bottlenecks might be the difficulty of evaluating such recommendation in comparsion to tag or resource recommendation.

2.4.5 Emergent Ontologies from Folksonomies

Peter Mika [138, 139] is one of the first researchers who proposed social tagging systems as a semantic social network which could lead to ontology emergence. The idea roots in the emergent semantics proposed by [2] and the vision is a community of self-organizing, autonomous, agents co-operating in dynamic, open environments, each organizing knowledge (e.g. document instances) and establishing connections according to a self-established ontology. Mika suggests an Actor-Concept-Instance model of ontologies using the semanticsocial networks in the form of a tripartite graph of person, concept and instance associations, and extends the traditional concept of ontologies (concepts and instances) with the social dimension. He suggests reducing the hypergraph into three bipartite graphs with regular edges. These three graphs model the associations between actors and concepts, concepts and instances, and actors and instances which are the same as UT, TR, and UR explained in section 2.3.2. Similarly, [210] proposes deriving ontologies from folksonomies by integrating multiple resources and techniques. These resources are: (1) the statistical analysis of folksonomies (2) online lexical resources like dictionaries, Wordnet, Google and Wikipedia (3) ontologies and Semantic Web resources (4) ontology mapping and matching approaches, and (5) involving the community.

The idea of ontology of Folksomony is also proposed by [78] and [59]. Thomas Gruber in [78] discusses the differences of ontology and folksonomy and points out some design considerations for constructing ontologies from tags. The authors in [59] provide more details on the ontology model and an algorithm to create such an ontology from a folksonomy. Their model consists of an ontology designed in OWL that defines the following classes: Source, Resource, Tag, User, Annotation, AnnotationTag and Polarity. For the class Tag, two subclasses are also defined: “TagPersonal” and “TagCommon” which are used to classify the existing tags according to their type, separating the ones of personal type, like those related to the planning of personal tasks or self-reference tags such as“to-read” from the rest of tags (TagCommon). The tag class has the properties of “hasAltLabel” and “hasHidden-Label” which are meant to represent the different variations of a tag, including singular and plurals, verbal tenses, synonyms, misspellings, incorrect syntactic forms, etc., from the tags preferred representation. For example, the tag with preferred value “New York”, could have associated to hasAltLabel the strings “new-york”, “york”, and to hasHiddenLabel the strings “neyork”, “nyc”, etc. The authors suggest an algorithm for creating the ontology from the folksonomy which basically covers the associations between user, resources and tags but does not include how to find the tag types or properties. Although the idea of separating the tag types and the tag properties seem interesting, no well-defined algorithm is presented for finding such relations. In a similar work, [234] suggests mapping tags to an ontology and presents the process of mapping in a simple example. However, this work also postpones the automatic creation of an ontology and mapping tags to the related class of the ontology for future research.

Other researchers have started solving the details of the problem using information retrieval and machine learning techniques. Heymann and Molinay [93] suggest creating a hierarchical taxonomy of tags. Their algorithm calculates the similarity between tags using the cosine similarity between tag vectors and each new tag added to the system will be categorized as the child of the most similar tag. If the similarity value is less than a pre-defined threshold then the new tag will be added as a new category, which is a new child for the root. The problem with this algorithm is that there is no heuristic to find the parent-child relation. Any new similar tag will be considered as a child of the most similar tag previously added to the system although it might be more general than the other tag.

Markines et al. [132] present different aggregation methods in folksonomies and present different similarity measure for evaluating tag-tag and resource-resource similarity. Authors evaluate their approach by comparing the results with WordNet for tag similarity and ODP17 for resource similarity. Although this work presents a strong grounding for finding related or similar tags in folksonomies, it does not provide approaches for finding the kind of relationship between tags such as super or sub-concept relations.

Marinho et al. [13] use frequent itemset mining for learning ontologies from Folksonomies. In this work, a folksonomy is enriched with a domain expert ontology and the output is a taxonomy which is used for resource recommendation. Authors use two major heuristic to create the taxonomy. First, more popular tags are considered more general and second, a tag x is a super-concept of a tag y if there are frequent itemsets containing both tags suchthat support(x) > support(y) .

Hotho et al. suggest using association rules for discovering semantic relations beteen tags [99]. Wu et al. use probabilistic EM algorithms to estimate the probability of cooccurrences of folksonomy elements to obtain the emergent semantics contained in the data [232, 222]. The discovered semantic relations are then applied for semantic search and resource discovery. Similarly, Zhou et al. [235] propose a probabilistic clusetring approach named Deterministic Annealing (DA) for extracting hierarchical semantics from social annotations.

An ontology learning approach that captures the hierarchical semantic structure of folksonomies is proposed in [204]. The proposed approach consists of three stages. In the first stage, generative probabilistic models are used to model correspondences between tags and documents. The basic idea is that assuming each tag has multiple submeanings, a tag having similar high distributions on multiple topics indicates a high likelihood of being a general tag; while a tag having a high distribution on only one specific topic indicates that the tag possibly has a specific meaning. In the second stage, the possible relations between tags are estimated. Four divergence measures between tags are defined based on the modeling results from the previous stage in order to quantitatively characterize the possible relations between tags. In the third stage, the relation between tags are determined and a hierarchical structure is constructed.

Although many tools have been developed in recent years to support automatic ontology creation, the resulting ontologies are still not meant to be used directly by the end ontology users and the construction of ontologies continues to be a manual, labor-intensive exercise carried out by specialists in the domain [216, 126]. In chapter 5, we propose a semiautomatic approach to support users to collaboratively develop an ontology in a social Web environment.

2.5 Chapter Summary

In this chapter, we gave an overview of the state of the art of recommender systems and social tagging systems. We presented different recommendation technologies along with their advantages and disadvantages. Next, we presented social tagging systems and our terminology to model a folksonomy which will be used throughout this thesis. Next, we focused on the related work in the social tagging area. We introduced different possible recommendation tasks in social tagging systems and briefly reviewed current state of the art. We also presented the current research efforts on creating ontologies from folksonomies.

In this chapter we discussed some major challenges that social tagging systems suffer from including ambiguity, redundancy, and attacks against social tagging systems. Our next chapters present our efforts on addressing these challenges.

Chapter 3 Matching Recommendation Technologies and Domains

3.1 Introduction

This chapter presents a comprehensive taxonomy of recommender systems with guidelines on how to select and apply these systems in different domains. In this chapter we look at recommender system from a broad perspective. Unlike the common traditional view on recommender systems as being used only in e-commerce applications [190], we consider a broad definition of recommendation systems that captures several domains. We consider recommender systems as ones that enable a particular kind of interaction with the user: “any system that produces individualized recommendations as output or has the effect of guiding the user in a personalized way to interesting or useful objects in a large space of possible options” [36]. This expansive definition makes the scope of recommender systems research quite broad, but it fails to give much guidance on what criteria to consider for implementing such systems.

With the rapid development of recommender system technologies in the recent years, it has become difficult for developers to determine which technology is suitable to a particular context. To alleviate this difficulty, we propose a framework that organizes the space of recommendation problems and provides a systematic approach to finding the appropriate recommendation technology for addressing a given problem in a specific domain. A crucial question this chapter tries to address is how recommendation techniques can be matched to recommendation domains.

In this chapter, we first distinguish among different recommendation algorithms based on the source of knowledge they use. Based on existing literature, we classify the source of knowledge in recommender systems to three different kinds: individual knowledge which

basically comes from user input in different forms, social knowledge which is the aggregation of masses of individual knowledge, and content knowledge which can be in form of item features or can get a more complex form of domain knowledge.

Next, we extract the domain characteristics that can help an implementer to select the suitable recommendation technology. These characteristics include heterogeneity, risk, churn, interaction style, preference stability, and inscrutability. We describe each characteristic and explain how they can effect the knowledge sources. Finally, we map the domain properties to recommendation technologies and provide examples from the real world application and recommender system literature.

The research approach for this study consists of a meta-analysis of the research literature and real-world applications. More than 70 articles and application examples of recommender systems from the 12-year period 1997-2008 were gathered for analysis. This time frame covers the development of recommender system research from its early stages to the present time and is extensive enough to identify the different domains that recommender systems have been applied to. The articles then were categorized based on their domain of recommendations. Main application domains include e-commerce, movie, music, news, Web page, real financial service recommendation, etc. Next, we tried to identify the main characteristic of each domain that influence on the selection of the appropriate recommendation technology. We investigated the impact of each characteristic on the knowledge sources of recommendation and mapped those characteristics to the recommendation technologies based on the knowledge sources they use.

















16 BDLP:

17 Open Directory Project

Excerpt out of 185 pages


Using Data Mining for Facilitating User Contributions in the Social Semantic Web
Karlsruhe Institute of Technology (KIT)
Catalog Number
ISBN (eBook)
ISBN (Book)
File size
4676 KB
Social tagging, data mining, Foksonomy, Recommendder system
Quote paper
Maryam Ramezani (Author), 2011, Using Data Mining for Facilitating User Contributions in the Social Semantic Web, Munich, GRIN Verlag,


  • guest on 11/20/2011


Look inside the ebook
Title: Using Data Mining for Facilitating User Contributions in the Social Semantic Web

Upload papers

Your term paper / thesis:

- Publication as eBook and book
- High royalties for the sales
- Completely free - with ISBN
- It only takes five minutes
- Every paper finds readers

Publish now - it's free