Extending the "sameAs" infrastructure of the LOD cloud
Vrije Universiteit


Bachelor Thesis, 2020

14 Pages, Grade: 8


Excerpt


Contents

1 Introduction
1.1 The Semantic Web
1.2 Related Work
1.2.1 LOD Laundromat
1.2.2 MetaLink
1.3 Contribution & Thesis Organisation

2 Approach
2.1 Extracting owl:sameAs from the LOD Laundromat
2.2 Finding the identity statements on MetaLink
2.3 Annotating the sameAs links with their frequency
2.4 Limitations

3 Data analysis

4 Use Cases
4.1 Filter links based on their source
4.2 Detect erroneous owl:sameAs links

5 Conclusion and discussion

Abstract

Linking data is an essential method for building up the Semantic Web architecture. The interlinking between datasets across the Web represents relations; together with the data, these links create context. One of the most widely used relations is owl:sameAs, a relation between two resources that share the same content. Linking datasets in this way is useful for a variety of needs such as knowledge discovery and information retrieval. MetaLink is a resource that contains more than 558M owl:sameAs links with their metadata. This work extends MetaLink by assigning a source to the owl:sameAs links and adding the links' frequency. Based on that, this work suggests a way of validating the trustworthiness of owl:sameAs links. In addition, a large-scale data analysis has been conducted on the data collected from the LOD cloud.

Keywords MetaLink - Semantic Web - The Web of data - Linked open data - Knowledge graphs - owl:sameAs

Chapter 1

Introduction

The initial form of the Web, Web 1.0, was a network of pages interconnected by hyperlinks. This interconnection allowed the Web to scale up and evolve into Web 2.0, a never-ending network. The newer version converted the Web from a read-only state to collaborative, dynamic, user-generated content, giving birth to many new concepts such as social media, video-sharing sites (YouTube), and web applications (Google Docs). This rapid emergence of user-generated content resulted in the unprecedented phenomenon of the Big Data explosion, which led to difficulties in managing the published information. To control this drastic growth, new technologies are needed to deal with several challenges such as storage, processing, and data retrieval. Storage requires equipping machines and supercomputers with massive capacity to store the data in clouds. Processing this information requires computing power, not only for the data itself but also for the immense emergence of IoT data. Finally, the information that is processed and stored needs to be retrieved in an intelligent, efficient, and scalable manner, in which search engines play a significant role.

Search engines have changed remarkably over the past few years. They are mostly based on several technologies (e.g. machine learning, the Semantic Web, and knowledge representation). Yet, not all queries return the desired results. For instance, consider the query "Why do cats eat grass?". The results of such a query would return a variety of sources whose information may or may not be related to the query. Thus, searching for information is rather time-consuming and not optimal. Optimizations of search algorithms are needed to overcome this problem. Semantic Web technology already offers solutions, which have a crucial impact on the development of Web 3.0.

1.1 The Semantic Web

The Semantic Web is an extension of the current Web introduced by the World Wide Web Consortium1 (W3C) in 1999. It adds the notion of interlinking data described with metadata. The metadata has three objectives: interlinking data, identifying it, and denoting it with a certain relation (e.g. sameAs). Machines benefit from this metadata because it allows them to understand the data and derive context from it. Tim Berners-Lee introduced 5 rules for publishing data on the Semantic Web [3]. Following these rules is significant for the data to have a "linkable" structure. The first star is attainable by making the data available on the Web, regardless of the format. The other stars stand for discoverability by any user or machine: the second star is intended for making the data readable (e.g. an Excel sheet and not an image). The third is awarded to data in a non-proprietary format that does not require proprietary software to read it (e.g. CSV instead of XLS, or the Open Document Format, ODS). The fourth is for using open W3C standards such as RDF to be able to identify sources. The final, 5th star was introduced as an addition in 2010 [3]: linking data to other datasets. To sum up, these are consecutive steps towards the Web of data, which empowers the Web and makes substantial enhancements in a number of fields such as:

- Automation: We can save a tremendous amount of time on tedious tasks such as scheduling appointments or booking tickets, which could be done by virtual assistants.
- Search engines: Search results will be more relevant than ever using Semantic Web technology.
- Personalization: A way to make use of the huge amount of data uploaded to the Internet every day in a personalized manner, where agents collect interesting content for us.

To convert the current unstructured data, tools are required to extract the underlying data semantics in the form of a machine-friendly language. In 2004, the Web Ontology Language2 (OWL) was introduced and became a W3C recommendation. OWL is a Semantic Web language supplied with expressive representational constructs that represent rich, taxonomic, and complex knowledge. This knowledge represents entities, groups of entities, or relations between things [11]. OWL allows reasoning and consistency checking, which is a useful feature when dealing with knowledge coming from multiple sources. Since it is common practice for these multiple sources to describe the same real-world entity, and in the absence of a central naming authority in the Semantic Web, it is unavoidable for the same real-world entity to be denoted by different identifiers. Hence, linking these identifiers is necessary. For this purpose OWL introduced the owl:sameAs [8] relation to state equality between identifiers referring to the same real-world entity. With its strict logical semantics, an owl:sameAs statement between two identifiers indicates that every property asserted for one identifier will also be inferred for the other, allowing both identifiers to be used interchangeably in all contexts.
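The interchangeability that owl:sameAs grants can be illustrated with a small sketch. This is a toy model, not code from this thesis, and the example identifiers and properties are hypothetical: once two identifiers are linked, the properties asserted for one can be inferred for the other.

```python
def merged_properties(entity, properties, same_as):
    """Collect the properties of `entity` together with those of every
    identifier reachable from it via (symmetric, transitive) owl:sameAs
    links, mimicking the inference described above."""
    linked = {entity}
    changed = True
    while changed:  # expand until the identity closure is stable
        changed = False
        for s, o in same_as:
            if (s in linked) != (o in linked):
                linked |= {s, o}
                changed = True
    merged = set()
    for e in linked:
        merged |= properties.get(e, set())
    return merged

# Hypothetical example data: two identifiers for the same city.
properties = {
    "dbpedia:Amsterdam": {("population", "872680")},
    "wikidata:Q727": {("country", "Netherlands")},
}
same_as = [("dbpedia:Amsterdam", "wikidata:Q727")]
```

With these toy assertions, querying either identifier yields the union of both property sets, which is exactly the "interchangeable in all contexts" reading of owl:sameAs.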

1.2 Related Work

This section presents an overview of the work related to this thesis. Section 1.2.1 describes the data crawled and stored in the Linked Open Data (LOD) Laundromat, whilst Section 1.2.2 covers MetaLink, the travel guide of the LOD Laundromat.

1.2.1 LOD Laundromat

The LOD cloud has experienced an immense growth over the past decade. Nevertheless, not all the data is clean and sound for use, because it contains many mistakes and incompatibilities [7]. These mistakes made the data not ideal to use. In 2015, a tremendous amount of RDF data was crawled from the Web and stored in the LOD Laundromat [1]. The LOD Laundromat collected the data while filtering out the "stains", making it healthy for uptake. The corpus consists of more than 38 Billion triples from around 650K datasets, on which this paper's contribution is based. The data is publicly accessible and can be queried online with the HTML-based browser LOD-a-lot [6].

1.2.2 MetaLink

MetaLink3 [2] is an identity meta-dataset that focuses only on owl:sameAs links, which are stored, reified, and compressed in it using the Header Dictionary Triples (HDT) library4. It adds an identity statement for each owl:sameAs statement, where every identity statement represents the subject, predicate, and object. In simple words, it is the travel guide of the LOD Laundromat. According to several studies, a number of owl:sameAs links are erroneous [9, 5]: some links relate entities that by no means refer to the same thing, yet they are denoted with the predicate owl:sameAs. Another study showed an impactful method to detect these links based on the community structure [10]. MetaLink embeds an error degree for every identity statement, assigned between 0 (good) and 1 (bad). This work extends MetaLink by adding the source and the frequency to every identity statement.

1.3 Contribution & Thesis Organisation

This thesis makes the two following contributions:

1. It extracts all owl:sameAs links available in the 650K datasets of the LOD Laundromat. While this process has been previously done in MetaLink [2], this thesis extends previous work by extracting the source of every owl:sameAs link, in addition to its frequency.
2. It presents an in-depth analysis of over 1 Billion extracted owl:sameAs links, giving insights into their sources and frequency distributions.

The rest of the thesis consists of four sections. Section 2 describes the process of extracting the data from the LOD Laundromat, annotating the sameAs links with their source and frequency, and the limitations encountered during this process. Section 3 then presents an analysis of the data extracted in Section 2 and makes comparisons with previous data analysis from MetaLink. Section 4 discusses the use cases of sources and frequency. Finally, Section 5 concludes.

Chapter 2

Approach

2.1 Extracting owl:sameAs from the LOD Laundromat

As motivated previously, we have the datasets collected in the LOD Laundromat available. This data consists of 38 Billion tab-separated RDF triples in 650K datasets. These triples contain many heterogeneous predicates. The first step is to loop through these large datasets and extract only the triples containing owl:sameAs statements. Collecting them alone does not serve our purpose, however; we also need to find the source of each statement. The source is the publisher on whose hostname the triple was found. Thus, while looping through the datasets the source must be added, presuming that every dataset is collected from one source. This implies that this source provides its own data, linked by a particular predicate to another source that has the same content. Therefore, we assumed that the most frequent subject of a triple in a dataset carries the publisher's hostname. Since a dataset can consist of hundreds of millions of triples, it was not feasible to dig through the full contents of 650K datasets because of the enormous resources the program would consume. Consequently, we checked only the first 500 subjects of every dataset. As a result of this assumption, the most frequent hostname among the first 500 subjects is assigned as the source of the dataset. The hostnames are added to the extracted owl:sameAs triples to which they are assumed to belong. The code used to extract the owl:sameAs links is available on Github5. Furthermore, having the source of the owl:sameAs statements enables further research to assess the trustworthiness and credibility of the publisher of a statement, and hence of the statement itself. This is briefly discussed at the end of the thesis.
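The extraction and source-assignment procedure above can be sketched as follows. This is a simplified illustration under the assumptions stated in the text (tab-separated triples, one per line; the most frequent hostname among the first 500 subjects taken as the dataset's source); the function name and input format are ours, not the thesis code.

```python
from collections import Counter
from urllib.parse import urlparse

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def extract_same_as(lines, sample_size=500):
    """Keep only the owl:sameAs triples of one dataset and annotate them
    with the dataset's source: the most frequent hostname among the
    first `sample_size` subjects."""
    hosts = Counter()
    same_as_triples = []
    for i, line in enumerate(lines):
        s, p, o = line.rstrip("\n").split("\t")[:3]
        if i < sample_size:  # only sample the head of the dataset
            host = urlparse(s.strip("<>")).hostname
            if host:
                hosts[host] += 1
        if p.strip("<>") == OWL_SAME_AS:
            same_as_triples.append((s, p, o))
    source = hosts.most_common(1)[0][0] if hosts else None
    return [(s, p, o, source) for s, p, o in same_as_triples]
```

Run per dataset, this yields the owl:sameAs triples already tagged with the dataset's presumed publisher hostname, ready for the MetaLink matching step.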

2.2 Finding the identity statements on MetaLink

After extracting the triples from the LOD Laundromat, we matched them to their identity statement in MetaLink using the HDT library described in Section 1.2.2. The library contains a simple function, hdt.search_triples(), that helps to find the identity statement of a certain triple in MetaLink. This function accepts up to three parameters, where an empty string acts as a wildcard. For example, searching for the triple x sameAs y using this function takes three steps. First, we leave the first position empty, pass the rdf:subject predicate, and feed x into the last position, as in the following call: hdt.search_triples("","http://www.w3.org/1999/02/22-rdf-syntax-ns#subject",x)

This function returns a 2-element tuple: an iterator over the matching RDF triples and the triples' cardinality. The matching triples contain all statements with the matched subject that are stored in MetaLink: the first column holds the identity statements, the second the predicates, and the third the subjects. Next, we call the same function again, this time inserting y in the last position, passing the rdf:object predicate, and leaving the first position empty, as follows: hdt.search_triples("","http://www.w3.org/1999/02/22-rdf-syntax-ns#object",y)

Similarly, this call returns the same kind of result as the first one, except that this time it returns all statements with the matched object that are stored in MetaLink. Finally, we take the identity statement lists from the two previous steps and intersect them. The output is the identity statement that represents the queried triple x sameAs y. The code of this algorithm can be found on Github6.
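The two lookups and their intersection can be sketched as follows. Since running this for real requires the MetaLink HDT file, the sketch uses a tiny in-memory stand-in for hdt.search_triples with the same calling convention (an empty string matches anything; the call returns an iterator plus a cardinality); the stand-in data is hypothetical.

```python
RDF_SUBJECT = "http://www.w3.org/1999/02/22-rdf-syntax-ns#subject"
RDF_OBJECT = "http://www.w3.org/1999/02/22-rdf-syntax-ns#object"

# Hypothetical MetaLink content: two reified identity statements.
META = [
    ("stmt1", RDF_SUBJECT, "x"), ("stmt1", RDF_OBJECT, "y"),
    ("stmt2", RDF_SUBJECT, "x"), ("stmt2", RDF_OBJECT, "z"),
]

def search_triples(s, p, o, store=META):
    """In-memory stand-in for hdt.search_triples: "" matches anything."""
    matches = [t for t in store
               if (not s or t[0] == s)
               and (not p or t[1] == p)
               and (not o or t[2] == o)]
    return iter(matches), len(matches)

def find_identity_statement(x, y):
    """Intersect the identity statements whose rdf:subject is x with
    those whose rdf:object is y, as described in the text."""
    with_x, _ = search_triples("", RDF_SUBJECT, x)
    with_y, _ = search_triples("", RDF_OBJECT, y)
    return {t[0] for t in with_x} & {t[0] for t in with_y}
```

Against the real MetaLink file, `search_triples` would be replaced by the library call on an opened HDT document; the intersection logic stays the same.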

2.3 Annotating the sameAs links with their frequency

The frequency is a feature added to the collected triples, denoting the number of occurrences of repeated owl:sameAs links in the data. Although a repeated owl:sameAs statement is identical, it may have a different source in which it was published. This means that multiple sources agree that the subject and the object refer to the same thing. Thus, the more frequent a triple is, the higher the quality of the owl:sameAs statement. In order to get the frequency of a triple, we linked it to its identity statement in MetaLink. This reduces the processing of long strings to a simple integer that represents the identifier. Fortunately, the HDT library facilitates linking subjects and objects with the corresponding identifier, and it works conveniently at low cost.
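Once every extracted triple is mapped to its integer identity-statement identifier, counting frequencies is a one-pass aggregation. A minimal sketch (the function name and row layout are ours):

```python
from collections import Counter

def annotate_frequency(rows):
    """Given (identity_statement_id, source) pairs for the extracted
    owl:sameAs triples, append each statement's frequency, i.e. how
    often the same identity statement occurs across all datasets."""
    freq = Counter(stmt_id for stmt_id, _ in rows)
    return [(stmt_id, source, freq[stmt_id]) for stmt_id, source in rows]
```

Counting over the integer identifiers instead of the full URI strings is exactly the saving described above.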

2.4 Limitations

In this section, we list several obstacles that were encountered during the aforementioned processes and the way we tackled them:

- A number of datasets contain links that have special characters which cannot be decoded as utf-8. The work-around was to make a list of the datasets that raised the encoding error and run the program again on that list with the encoding cp1256. Finally, we combined the triples of both encodings into one file. Due to this problem, extracting the triples cannot be done in one step. Three additional steps are required instead:

1. Make a list of misencoded datasets
2. Run the program again with the encoding cp1256
3. Combine the two triple lists together

- A number of owl:sameAs links could not be matched with their identity statement in MetaLink. Consequently, these triples had to be counted separately to annotate their frequency. Due to this limitation, we could not fully explain the slight difference between the data in this paper and MetaLink, as shown in chapter 3.

Chapter 3

Data analysis

This chapter gives an in-depth analysis of the collected owl:sameAs triples, presenting insights into their sources and frequency distributions. Table 3.1 summarizes the collected data in numbers:

[Table not included in this excerpt]

Table 3.1: Details about the data collected from the LOD Laundromat

It is noteworthy that there is a slight difference between the numbers in Table 3.1 and the results in the MetaLink paper [2]. The total number of sameAs links stored in MetaLink is 558,943,116. However, 2,790,662 of these links are reflexive, i.e. they contain identical subjects and objects. Excluding the reflexive statements, the total would be 556,152,454. In comparison with our results, the difference is 61,761 sameAs links.

The data is published on the Web and free to download on Zenodo7. The dataset contains the following columns: Subject, Predicate, Object, Assertion8, Source, Folder path9.

[Figure not included in this excerpt]

Figure 3.1: The frequency of reoccurring owl:sameAs links

Figure 3.1 shows the frequency of reoccurring owl:sameAs links. It is apparent that the vast majority of the links occurred only a few times. The owl:sameAs links that occurred a single time represent 20,29% of all the data; the remaining 79,71% of the data occurred at least twice, a phenomenon we discuss further in the next chapter. Moreover, the maximum in this diagram represents the total number of the least frequent sameAs links, while the minimum is the most frequent link10. The average and the median are 2,099,487.635 and 1,101 respectively. The top 5 most frequent links can be found in Table 3.2.

[Table not included in this excerpt]

Table 3.2: Top 5 most frequent sameAs links in the collected data

In Figure 3.2, every blue bar represents a hostname. These hostnames are assigned to the triples as explained previously in Section 2.1. The most frequent source was DBpedia11: it appeared 150,881,135 times and was assigned to 13,66% of the data. Apart from that, the subdomains of DBpedia represent 76,01% of the data. As a result, most of the data was collected from DBpedia, a total of 89,67% with the subdomains and the main hostname of DBpedia included. In Table 3.3, we can find the most frequent hostnames. The minimum, average, and median are 1, 2,943,000, and 11,340 respectively.

[Figure not included in this excerpt]

Figure 3.2: The frequency of the triples’ assigned hostnames

[Figure not included in this excerpt]

Figure 3.3: The distribution of owhsameAs links over the datasets

[Table not included in this excerpt]

Table 3.3: Top 5 most frequent sources

Finally, Figure 3.3 demonstrates the distribution of the sameAs links over the datasets. Only the datasets that contain sameAs links are shown in this graph. Interestingly, most of the owl:sameAs links reside in 1,500 datasets. The richest dataset contains 2,74% of all owl:sameAs links. The maximum, average, median, and minimum are 30,230,000, 140,200, 1, and 1 respectively.

Chapter 4

Use Cases

Many studies have evaluated the trustworthiness of owl:sameAs links [10, 4]. This branches out into two different things to evaluate. On the one hand, we need to know whether the source of the information, the website that provides the data, is trustworthy. On the other hand, it is also significant to know whether the two entities are about the same thing; in other words, we need to prove the validity of the owl:sameAs link. Below we suggest a way to evaluate the aforementioned points using the frequency that we worked out above.

4.1 Filter links based on their source

It is crucial to know where an owl:sameAs link originated. As a result of the process described above, every triple has an assigned source that is available to be evaluated. The frequency is a good way to assess sources. In the analysis, we demonstrated many pieces of useful information regarding the sources. Frequent sources often have credible content. That was remarkable in the example of DBpedia mentioned previously in the analysis: it provided 13,66% of the entire data, as it has many incoming and outgoing links to other sources.

4.2 Detect erroneous owl:sameAs links

According to Raad et al. [10], "more than 1.2 million owl:sameAs links have an error degree of [0.99-1]", meaning that this is a serious problem that owl:sameAs encounters. This problem can also be tackled with the frequency, which can be done in two ways.

The first method is simple: we are now able to find the frequency of a triple, i.e. the number of times the triple is asserted. An approach for detecting erroneous owl:sameAs links can presume that the more frequent a triple is, the more trustworthy the link is.

Secondly, another advantage of finding the source of a triple is that we can now see how many sources assert it. For example, the triple <http://www.last.fm/user/Zeritas> <owl:sameAs> <http://www.last.fm/user/Zeritas/#avatarPanel> has a frequency of 4763, which indicates the number of times it is asserted. After digging more into this triple, we found that it has 21 assigned sources in different datasets.

As a result, we are fairly confident in declaring the Zeritas triple trustworthy, based on its 21 different hostnames and 4763 assertions. However, the frequency cannot tell much about the huge chunk of links that occurred a single time. These links represent 20,29% of the data, where the frequency is not useful, but the sources can still be assessed for their credibility.
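The two signals discussed in this chapter, assertion frequency and the number of distinct asserting sources, could be combined into a single score. The heuristic below is our own illustration, not a method from the thesis or from MetaLink, and the caps are arbitrary assumptions:

```python
def trust_score(frequency, n_sources, freq_cap=100, src_cap=20):
    """Hypothetical trust heuristic: cap and normalise both signals to
    [0, 1], then average them. A link asserted often and by many
    distinct hostnames scores close to 1."""
    f = min(frequency, freq_cap) / freq_cap
    s = min(n_sources, src_cap) / src_cap
    return (f + s) / 2
```

For the Zeritas example above (frequency 4763, 21 sources) this yields the maximum score of 1.0, while a link asserted once by a single source scores only 0.03, matching the observation that frequency says little about single-occurrence links.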

Chapter 5

Conclusion and discussion

In this paper, we extracted the owl:sameAs links from the data collected in the LOD Laundromat and extended the infrastructure of these links in MetaLink by adding the source and the frequency. Based on these features, we suggested using them to evaluate the sources and the owl:sameAs links. The frequency feature can be combined, for example, with the error degree in MetaLink that was introduced in Raad et al. [10]. However, we mentioned that this approach has a limitation for the links that are asserted a single time: the frequency, in that case, does not provide much information. Whilst the source assigned to these triples can be assessed for its credibility, more research is needed to validate the assigned sources and assess their trustworthiness.

References

[1] Wouter Beek et al. "LOD laundromat: a uniform way of publishing other people's dirty data". In: International Semantic Web Conference. Springer. 2014, pp. 213-228.

[2] Wouter Beek et al. "MetaLink: A Travel Guide to the LOD Cloud". In: The Semantic Web. Ed. by Andreas Harth et al. Cham: Springer International Publishing, 2020, pp. 481-496. ISBN: 978-3-030-49461-2.

[3] T. Berners-Lee. Linked data: Design issues. 2006. URL: http://www.w3.org/DesignIssues/LinkedData.html.

[4] Philippe Cudré-Mauroux et al. "IdMesh: Graph-Based Disambiguation of Linked Data". In: Proceedings of the 18th International Conference on World Wide Web. WWW '09. Madrid, Spain: Association for Computing Machinery, 2009, pp. 591-600. ISBN: 9781605584874. DOI: 10.1145/1526709.1526789. URL: https://doi.org/10.1145/1526709.1526789.

[5] Li Ding et al. "owl:sameAs and Linked Data: An Empirical Study". In: Proceedings of the Second Web Science Conference. Raleigh NC, USA, Apr. 2010.

[6] Javier D. Fernandez et al. "LOD-a-lot: A queryable dump of the LOD cloud". In: The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Proceedings. Vol. 10588. Lecture Notes in Computer Science. Springer, 2017, pp. 75-83. ISBN: 9783319682037. DOI: 10.1007/978-3-319-68204-4_7.

[7] Aidan Hogan et al. "An Empirical Survey of Linked Data Conformance". In: Web Semantics 14 (July 2012), pp. 14-44. ISSN: 1570-8268. DOI: 10.1016/j.websem.2012.02.001. URL: https://doi.org/10.1016/j.websem.2012.02.001.

[8] I. Horrocks, P.F. Patel-Schneider, and F. van Harmelen. "From SHIQ and RDF to OWL: the making of a Web Ontology Language". In: Journal of Web Semantics 1.1 (2003), pp. 7-26.

[9] Joe Raad, Nathalie Pernelle, and Fatiha Saïs. "Detection of Contextual Identity Links in a Knowledge Base". In: Proceedings of the 9th International Conference on Knowledge Capture. Austin, United States, Dec. 2017. URL: https://hal.archives-ouvertes.fr/hal-01665062.

[10] Joe Raad et al. Detecting erroneous identity links on the web using network metrics. Oct. 2018.

[11] W3C. Web Ontology Language (OWL). 2012. URL: https://www.w3.org/OWL/.

[...]


1 http://www.w3.org

2 http://www.w3.org/TR/owl-features/

3 https://krr.triply.cc/krr/metalink

4 https://pypi.org/project/hdt/

5 https://github.com/AppieO/sameAs/extractor.py

6 https://github.com/AppieO/sameAs/getID.py

7 https://doi.org/10.5281/zenodo.3986832

8 If a triple is not asserted in the LOD cloud, it is replaced by the tag <NaN> so that we have all assertions in one column.

9 The folder path where the sameAs link was found in the LOD cloud

10 <http://rdfdata.eionet.europa.eu/eurostat/void.rdf#MT-sdmx> owl:sameAs <http://dbpedia.org/resource/SDMX>

11 http://dbpedia.org/

Details

Title: Extending the "sameAs" infrastructure of the LOD cloud
College: VU University Amsterdam (Vrije Universiteit)
Grade: 8
Author: Abdullah Nashef
Year: 2020
Pages: 14
Catalog Number: V925593
ISBN (eBook): 9783346256478
Language: English
Notes: Dutch grade "8" corresponds approximately to a 1.7 in the German grading system
Citation: Abdullah Nashef (Author), 2020, Extending the "sameAs" infrastructure of the LOD cloud, Munich, GRIN Verlag, https://www.grin.com/document/925593
