Extending the "sameAs" infrastructure of the LOD cloud
Vrije Universiteit


Bachelor Thesis, 2020

14 Pages, Grade: 8


Excerpt


Contents

1 Introduction
1.1 The Semantic Web
1.2 Related Work
1.2.1 LOD Laundromat
1.2.2 MetaLink
1.3 Contribution & Thesis Organisation

2 Approach
2.1 Extracting owl:sameAs from the LOD Laundromat
2.2 Finding the identity statements on MetaLink
2.3 Annotating the sameAs links with their frequency
2.4 Limitations

3 Data analysis

4 Use Cases
4.1 Filter links based on their source
4.2 Detect erroneous owl:sameAs links

5 Conclusion and discussion

Abstract

Linking data is an essential method for building up the Semantic Web architecture. The interlinking between datasets across the Web represents relations; together with the data, these links create context. One of the most widely used relations is owl:sameAs, a relation between two resources that share the same content. Linking datasets in this way is useful for a variety of needs such as knowledge discovery and information retrieval. MetaLink is a resource that contains more than 558M owl:sameAs links with their metadata. This work extends MetaLink by assigning a source to the owl:sameAs links and adding the links' frequency. Based on that, this work suggests a way of validating the trustworthiness of owl:sameAs links. In addition, a large-scale data analysis has been conducted on the data collected from the LOD cloud.

Keywords MetaLink - Semantic Web - The Web of data - Linked open data - Knowledge graphs - owl:sameAs

Chapter 1

Introduction

The initial form of the Web, Web 1.0, was a network of pages interconnected by hyperlinks. This interconnection allowed the Web to scale up and evolve into Web 2.0, a never-ending network. The newer version converted the Web from a read-only state to collaborative, dynamic, user-generated content, giving birth to many new concepts such as social media, video-sharing sites (YouTube), and web applications (Google Docs). This rapid emergence of user-generated content resulted in the unprecedented phenomenon of the Big Data explosion, which led to difficulties in managing the published information. To control this drastic growth, new technologies are needed to deal with several challenges such as storage, processing, and data retrieval. Storage requires equipping machines and supercomputers with massive capacity to store the data in clouds. Processing this information requires computing power, not only for the data itself but also for the immense emergence of IoT data. Finally, the information that is processed and stored needs to be retrieved in an intelligent, efficient, and scalable manner, in which search engines play a significant role.

Search engines have changed remarkably over the past few years. They are mostly based on several technologies (e.g. machine learning, the Semantic Web, and knowledge representation). Yet, not all queries return the desired results. For instance, consider the query "Why do cats eat grass?". The results of such a query would return a variety of sources whose information may or may not be related to the query. Thus, searching for information is rather time-consuming and not optimal. Optimizations of search algorithms are needed to overcome this problem. Semantic Web technology already offers solutions, which have a crucial impact on the development of Web 3.0.

1.1 The Semantic Web

The Semantic Web is an extension of the current Web introduced by the World Wide Web Consortium1 (W3C) in 1999. It adds the notion of interlinking data described with metadata. The metadata has three objectives: interlinking data, identifying it, and denoting it with a certain relation (e.g. sameAs). Machines benefit from this metadata because it allows them to understand the data and derive context from it. Tim Berners-Lee introduced 5 rules for publishing data on the Semantic Web [3]. Following these rules is significant for the data to have a "linkable" structure. The first star is attainable by making the data available on the Web, regardless of the format. The other stars stand for discoverability by any user or machine: the second star is intended for making the data readable (e.g. an Excel sheet and not an image). The third is awarded to data in a non-proprietary format that does not require proprietary software to read it (e.g. CSV instead of XLS, or the Open Document Format, ODS). The fourth is for using open W3C standards such as RDF to be able to identify sources. The final, 5th star was introduced as an addition in 2010 [3]: linking data to other datasets. To sum up, these are consecutive steps towards the Web of data, which empowers the Web and makes substantial enhancements in a number of fields such as:

- Automation: We can save a tremendous amount of time on tedious tasks such as scheduling appointments or booking tickets, which could be done by virtual assistants.
- Search engines: Search results will be more relevant than ever using Semantic Web technology.
- Personalization: A way to make use of the huge amount of data uploaded to the Internet every day in a personalized manner, where agents collect interesting content for us.

To convert the current unstructured data, tools are required to extract the underlying data semantics in the form of a machine-friendly language. In 2004, the Web Ontology Language2 (OWL) was introduced and became a W3C recommendation. OWL is a Semantic Web language supplied with expressive representational constructs that represent rich, taxonomic, and complex knowledge. This knowledge represents entities, groups of entities, or relations between things [11]. OWL allows reasoning and consistency checking, which is a useful feature when dealing with knowledge coming from multiple sources. Since it is common practice for these multiple sources to describe the same real-world entity, and in the absence of a central naming authority in the Semantic Web, it is unavoidable for the same real-world entity to be denoted by different identifiers. Hence, linking these identifiers is necessary. For this purpose OWL introduced the owl:sameAs [8] relation to state equality between identifiers referring to the same real-world entity. With its strict logical semantics, an owl:sameAs statement between two identifiers indicates that every property asserted for one identifier will also be inferred for the other, allowing both identifiers to be used interchangeably in all contexts.
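The interchangeability that owl:sameAs grants can be illustrated with a small sketch. This is a toy model, not code from this thesis, and the example identifiers and properties are hypothetical: once two identifiers are linked, the properties asserted for one can be inferred for the other.

```python
def merged_properties(entity, properties, same_as):
    """Collect the properties of `entity` together with those of every
    identifier reachable from it via (symmetric, transitive) owl:sameAs
    links, mimicking the inference described above."""
    linked = {entity}
    changed = True
    while changed:  # expand until the identity closure is stable
        changed = False
        for s, o in same_as:
            if (s in linked) != (o in linked):
                linked |= {s, o}
                changed = True
    merged = set()
    for e in linked:
        merged |= properties.get(e, set())
    return merged

# Hypothetical example data: two identifiers for the same city.
properties = {
    "dbpedia:Amsterdam": {("population", "872680")},
    "wikidata:Q727": {("country", "Netherlands")},
}
same_as = [("dbpedia:Amsterdam", "wikidata:Q727")]
```

With these toy assertions, querying either identifier yields the union of both property sets, which is exactly the "interchangeable in all contexts" reading of owl:sameAs.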

1.2 Related Work

This section presents an overview of the work related to this thesis. Section 1.2.1 describes the data crawled and stored in the Linked Open Data (LOD) Laundromat, whilst Section 1.2.2 covers MetaLink, the travel guide of the LOD Laundromat.

1.2.1 LOD Laundromat

The LOD cloud has experienced an immense growth over the past decade. Nevertheless, not all the data is clean and sound for use, because it contains many mistakes and incompatibilities [7]. These mistakes made the data not ideal to use. In 2015, a tremendous amount of RDF data was crawled from the Web and stored in the LOD Laundromat [1]. The LOD Laundromat collected the data while filtering out the "stains", making it healthy for uptake. The corpus consists of more than 38 Billion triples from around 650K datasets, on which this paper's contribution is based. The data is publicly accessible and can be queried online with the HTML-based browser LOD-a-lot [6].

1.2.2 MetaLink

MetaLink3 [2] is an identity meta-dataset that focuses only on owl:sameAs links, which are stored, reified, and compressed in it using the Header Dictionary Triples (HDT) library4. It adds an identity statement for each owl:sameAs statement, where every identity statement represents the subject, predicate, and object. In simple words, it is the travel guide of the LOD Laundromat. According to several studies, a number of owl:sameAs links are erroneous [9, 5]: some links relate entities that by no means refer to the same thing, yet they are denoted with the predicate owl:sameAs. Another study showed an impactful method to detect these links based on the community structure [10]. MetaLink embeds an error degree for every identity statement, assigned between 0 (good) and 1 (bad). This work extends MetaLink by adding the source and the frequency to every identity statement.

1.3 Contribution & Thesis Organisation

This thesis makes the two following contributions:

1. It extracts all owl:sameAs links available in the 650K datasets of the LOD Laundromat. While this process has been previously done in MetaLink [2], this thesis extends previous work by extracting the source of every owl:sameAs link, in addition to its frequency.
2. It presents an in-depth analysis of over 1 Billion extracted owl:sameAs links, giving insights into their sources and frequency distributions.

The rest of the thesis consists of four sections. Section 2 describes the process of extracting the data from the LOD Laundromat, annotating the sameAs links with their source and frequency, and the limitations encountered during this process. Section 3 then presents an analysis of the data extracted in Section 2 and makes comparisons with previous data analysis from MetaLink. Section 4 discusses the use cases of sources and frequency. Finally, Section 5 concludes.

Chapter 2

Approach

2.1 Extracting owl:sameAs from the LOD Laundromat

As motivated previously, we have the datasets collected in the LOD Laundromat available. This data consists of 38 Billion tab-separated RDF triples in 650K datasets. These triples contain many heterogeneous predicates. The first step is to loop through these large datasets and extract only the triples containing owl:sameAs statements. Collecting them alone does not serve our purpose, however; we also need to find the source of each statement. The source is the publisher on whose hostname the triple was found. Thus, while looping through the datasets the source must be added, presuming that every dataset is collected from one source. This implies that this source provides its own data, linked by a particular predicate to another source that has the same content. Therefore, we assumed that the most frequent subject of a triple in a dataset carries the publisher's hostname. Since a dataset can consist of hundreds of millions of triples, it was not feasible to dig through the full contents of 650K datasets because of the enormous resources the program would consume. Consequently, we checked only the first 500 subjects of every dataset. As a result of this assumption, the most frequent hostname among the first 500 subjects is assigned as the source of the dataset. The hostnames are added to the extracted owl:sameAs triples to which they are assumed to belong. The code used to extract the owl:sameAs links is available on Github5. Furthermore, having the source of the owl:sameAs statements enables further research to assess the trustworthiness and credibility of the publisher of a statement, and hence of the statement itself. This is briefly discussed at the end of the thesis.
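The extraction and source-assignment procedure above can be sketched as follows. This is a simplified illustration under the assumptions stated in the text (tab-separated triples, one per line; the most frequent hostname among the first 500 subjects taken as the dataset's source); the function name and input format are ours, not the thesis code.

```python
from collections import Counter
from urllib.parse import urlparse

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def extract_same_as(lines, sample_size=500):
    """Keep only the owl:sameAs triples of one dataset and annotate them
    with the dataset's source: the most frequent hostname among the
    first `sample_size` subjects."""
    hosts = Counter()
    same_as_triples = []
    for i, line in enumerate(lines):
        s, p, o = line.rstrip("\n").split("\t")[:3]
        if i < sample_size:  # only sample the head of the dataset
            host = urlparse(s.strip("<>")).hostname
            if host:
                hosts[host] += 1
        if p.strip("<>") == OWL_SAME_AS:
            same_as_triples.append((s, p, o))
    source = hosts.most_common(1)[0][0] if hosts else None
    return [(s, p, o, source) for s, p, o in same_as_triples]
```

Run per dataset, this yields the owl:sameAs triples already tagged with the dataset's presumed publisher hostname, ready for the MetaLink matching step.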

2.2 Finding the identity statements on MetaLink

After extracting the triples from the LOD Laundromat, we matched them to their identity statement in MetaLink using the HDT library described in Section 1.2.2. The library contains a simple function, hdt.search_triples(), that helps to find the identity statement of a certain triple in MetaLink. This function accepts up to three parameters, where an empty string acts as a wildcard. For example, searching for the triple x sameAs y using this function takes three steps. First, we leave the first position empty, pass the rdf:subject predicate, and feed x into the last position, as in the following call: hdt.search_triples("","http://www.w3.org/1999/02/22-rdf-syntax-ns#subject",x)

This function returns a 2-element tuple: an iterator over the matching RDF triples and the triples' cardinality. The matching triples contain all statements with the matched subject that are stored in MetaLink: the first column holds the identity statements, the second the predicates, and the third the subjects. Next, we call the same function again, this time inserting y in the last position, passing the rdf:object predicate, and leaving the first position empty, as follows: hdt.search_triples("","http://www.w3.org/1999/02/22-rdf-syntax-ns#object",y)

Similarly, this call returns the same kind of result as the first one, except that this time it returns all statements with the matched object that are stored in MetaLink. Finally, we take the identity statement lists from the two previous steps and intersect them. The output is the identity statement that represents the queried triple x sameAs y. The code of this algorithm can be found on Github6.
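The two lookups and their intersection can be sketched as follows. Since running this for real requires the MetaLink HDT file, the sketch uses a tiny in-memory stand-in for hdt.search_triples with the same calling convention (an empty string matches anything; the call returns an iterator plus a cardinality); the stand-in data is hypothetical.

```python
RDF_SUBJECT = "http://www.w3.org/1999/02/22-rdf-syntax-ns#subject"
RDF_OBJECT = "http://www.w3.org/1999/02/22-rdf-syntax-ns#object"

# Hypothetical MetaLink content: two reified identity statements.
META = [
    ("stmt1", RDF_SUBJECT, "x"), ("stmt1", RDF_OBJECT, "y"),
    ("stmt2", RDF_SUBJECT, "x"), ("stmt2", RDF_OBJECT, "z"),
]

def search_triples(s, p, o, store=META):
    """In-memory stand-in for hdt.search_triples: "" matches anything."""
    matches = [t for t in store
               if (not s or t[0] == s)
               and (not p or t[1] == p)
               and (not o or t[2] == o)]
    return iter(matches), len(matches)

def find_identity_statement(x, y):
    """Intersect the identity statements whose rdf:subject is x with
    those whose rdf:object is y, as described in the text."""
    with_x, _ = search_triples("", RDF_SUBJECT, x)
    with_y, _ = search_triples("", RDF_OBJECT, y)
    return {t[0] for t in with_x} & {t[0] for t in with_y}
```

Against the real MetaLink file, `search_triples` would be replaced by the library call on an opened HDT document; the intersection logic stays the same.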

2.3 Annotating the sameAs links with their frequency

The frequency is a feature added to the collected triples, denoting the number of occurrences of repeated owl:sameAs links in the data. Although a repeated owl:sameAs statement is identical, it may have a different source in which it was published. This means that multiple sources agree that the subject and the object refer to the same thing. Thus, the more frequent a triple is, the higher the quality of the owl:sameAs statement. In order to get the frequency of a triple, we linked it to its identity statement in MetaLink. This reduces the processing of long strings to a simple integer that represents the identifier. Fortunately, the HDT library facilitates linking subjects and objects with the corresponding identifier, and it works conveniently at low cost.
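Once every extracted triple is mapped to its integer identity-statement identifier, counting frequencies is a one-pass aggregation. A minimal sketch (the function name and row layout are ours):

```python
from collections import Counter

def annotate_frequency(rows):
    """Given (identity_statement_id, source) pairs for the extracted
    owl:sameAs triples, append each statement's frequency, i.e. how
    often the same identity statement occurs across all datasets."""
    freq = Counter(stmt_id for stmt_id, _ in rows)
    return [(stmt_id, source, freq[stmt_id]) for stmt_id, source in rows]
```

Counting over the integer identifiers instead of the full URI strings is exactly the saving described above.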

2.4 Limitations

In this section, we list several obstacles that were encountered during the aforementioned processes and the way we tackled them:

- A number of datasets contain links that have special characters which cannot be decoded as utf-8. The work-around was to make a list of the datasets that raised the encoding error and run the program again on that list with the encoding cp1256. Finally, we combined the triples of both encodings into one file. Due to this problem, extracting the triples cannot be done in one step. Three additional steps are required instead:

1. Make a list of misencoded datasets
2. Run the program again with the encoding cp1256
3. Combine the two triple lists together

- A number of owl:sameAs links could not be matched with their identity statement in MetaLink. Consequently, these triples had to be counted separately to annotate their frequency. Due to this limitation, we could not fully explain the slight difference between the data in this paper and MetaLink, as shown in chapter 3.

Chapter 3

Data analysis

This chapter gives an in-depth analysis of the collected owl:sameAs triples, presenting insights into their sources and frequency distributions. Table 3.1 summarizes the collected data in numbers:

[Table not included in this excerpt]

Table 3.1: Details about the data collected from the LOD Laundromat

It is noteworthy that there is a slight difference between the numbers in Table 3.1 and the results in the MetaLink paper [2]. The total number of sameAs links stored in MetaLink is 558,943,116. However, 2,790,662 of these links are reflexive, i.e. they contain identical subjects and objects. Excluding the reflexive statements, the total would be 556,152,454. In comparison with our results, the difference is 61,761 sameAs links.

The data is published on the Web and free to download on Zenodo7. The dataset contains the following columns: Subject, Predicate, Object, Assertion8, Source, Folder path9.

[Figure not included in this excerpt]

Figure 3.1: The frequency of reoccurring owl:sameAs links

Figure 3.1 shows the frequency of reoccurring owl:sameAs links. It is apparent that the vast majority of the links occurred only a few times. The owl:sameAs links that occurred a single time represent 20,29% of all the data; the remaining 79,71% of the data occurred at least twice, a phenomenon we discuss further in the next chapter. Moreover, the maximum in this diagram represents the total number of the least frequent sameAs links, while the minimum is the most frequent link10. The average and the median are 2,099,487.635 and 1,101 respectively. The top 5 most frequent links can be found in Table 3.2.

[Table not included in this excerpt]

Table 3.2: Top 5 most frequent sameAs links in the collected data

In Figure 3.2, every blue bar represents a hostname. These hostnames are assigned to the triples as explained previously in Section 2.1. The most frequent source was DBpedia11: it appeared 150,881,135 times and was assigned to 13,66% of the data. Apart from that, the subdomains of DBpedia represent 76,01% of the data. As a result, most of the data was collected from DBpedia, a total of 89,67% with the subdomains and the main hostname of DBpedia included. In Table 3.3, we can find the most frequent hostnames. The minimum, average, and median are 1, 2,943,000, and 11,340 respectively.

[Figure not included in this excerpt]

Figure 3.2: The frequency of the triples’ assigned hostnames

[Figure not included in this excerpt]

Figure 3.3: The distribution of owhsameAs links over the datasets

[Table not included in this excerpt]

Table 3.3: Top 5 most frequent sources

Finally, Figure 3.3 demonstrates the distribution of the sameAs links over the datasets. Only the datasets that contain sameAs links are shown in this graph. Interestingly, most of the owl:sameAs links reside in 1,500 datasets. The richest dataset contains 2,74% of all owl:sameAs links. The maximum, average, median, and minimum are 30,230,000, 140,200, 1, and 1 respectively.

Chapter 4

Use Cases

Many studies have evaluated the trustworthiness of owl:sameAs links [10, 4]. This branches out into two different things to evaluate. On the one hand, we need to know whether the source of the information, the website that provides the data, is trustworthy. On the other hand, it is also significant to know whether the two entities are about the same thing; in other words, we need to prove the validity of the owl:sameAs link. Below we suggest a way to evaluate the aforementioned points using the frequency that we worked out above.

4.1 Filter links based on their source

It is crucial to know where an owl:sameAs link originated. As a result of the process described above, every triple has an assigned source that is available to be evaluated. The frequency is a good way to assess sources. In the analysis, we demonstrated many pieces of useful information regarding the sources. Frequent sources often have credible content. That was remarkable in the example of DBpedia mentioned previously in the analysis: it provided 13,66% of the entire data, as it has many incoming and outgoing links to other sources.

4.2 Detect erroneous owl:sameAs links

According to Raad et al. [10], "more than 1.2 million owl:sameAs links have an error degree of [0.99-1]", meaning that this is a serious problem that owl:sameAs encounters. This problem can also be tackled with the frequency, which can be done in two ways.

The first method is simple: we are now able to find the frequency of a triple, i.e. the number of times the triple is asserted. An approach for detecting erroneous owl:sameAs links can presume that the more frequent a triple is, the more trustworthy the link is.

Secondly, another advantage of finding the source of a triple is that we can now see how many sources assert it. For example, the triple <http://www.last.fm/user/Zeritas> <owl:sameAs> <http://www.last.fm/user/Zeritas/#avatarPanel> has a frequency of 4763, which indicates the number of times it is asserted. After digging more into this triple, we found that it has 21 assigned sources in different datasets.

As a result, we are fairly confident in declaring the Zeritas triple trustworthy, based on its 21 different hostnames and 4763 assertions. However, the frequency cannot tell much about the huge chunk of links that occurred a single time. These links represent 20,29% of the data, where the frequency is not useful, but the sources can still be assessed for their credibility.
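The two signals discussed in this chapter, assertion frequency and the number of distinct asserting sources, could be combined into a single score. The heuristic below is our own illustration, not a method from the thesis or from MetaLink, and the caps are arbitrary assumptions:

```python
def trust_score(frequency, n_sources, freq_cap=100, src_cap=20):
    """Hypothetical trust heuristic: cap and normalise both signals to
    [0, 1], then average them. A link asserted often and by many
    distinct hostnames scores close to 1."""
    f = min(frequency, freq_cap) / freq_cap
    s = min(n_sources, src_cap) / src_cap
    return (f + s) / 2
```

For the Zeritas example above (frequency 4763, 21 sources) this yields the maximum score of 1.0, while a link asserted once by a single source scores only 0.03, matching the observation that frequency says little about single-occurrence links.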

Chapter 5

Conclusion and discussion

In this paper, we extracted the owl:sameAs links from the data collected in the LOD Laundromat and extended the infrastructure of these links in MetaLink by adding the source and the frequency. Based on these features, we suggested using them to evaluate the sources and the owl:sameAs links. The frequency feature can be combined, for example, with the error degree in MetaLink that was introduced in Raad et al. [10]. However, we mentioned that this approach has a limitation for the links that are asserted a single time: the frequency, in that case, does not provide much information. Whilst the source assigned to these triples can be assessed for its credibility, more research is needed to validate the assigned sources and assess their trustworthiness.

References

[1] Wouter Beek et al. "LOD laundromat: a uniform way of publishing other people's dirty data". In: International Semantic Web Conference. Springer. 2014, pp. 213-228.

[2] Wouter Beek et al. "MetaLink: A Travel Guide to the LOD Cloud". In: The Semantic Web. Ed. by Andreas Harth et al. Cham: Springer International Publishing, 2020, pp. 481-496. ISBN: 978-3-030-49461-2.

[3] T. Berners-Lee. Linked data: Design issues. 2006. URL: http://www.w3.org/DesignIssues/LinkedData.html.

[4] Philippe Cudré-Mauroux et al. "IdMesh: Graph-Based Disambiguation of Linked Data". In: Proceedings of the 18th International Conference on World Wide Web. WWW '09. Madrid, Spain: Association for Computing Machinery, 2009, pp. 591-600. ISBN: 9781605584874. DOI: 10.1145/1526709.1526789. URL: https://doi.org/10.1145/1526709.1526789.

[5] Li Ding et al. "owl:sameAs and Linked Data: An Empirical Study". In: Proceedings of the Second Web Science Conference. Raleigh NC, USA, Apr. 2010.

[6] Javier D. Fernandez et al. "LOD-a-lot: A queryable dump of the LOD cloud". In: The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Proceedings. Vol. 10588. Lecture Notes in Computer Science. Springer, 2017, pp. 75-83. ISBN: 9783319682037. DOI: 10.1007/978-3-319-68204-4_7.

[7] Aidan Hogan et al. "An Empirical Survey of Linked Data Conformance". In: Web Semantics 14 (July 2012), pp. 14-44. ISSN: 1570-8268. DOI: 10.1016/j.websem.2012.02.001. URL: https://doi.org/10.1016/j.websem.2012.02.001.

[8] I. Horrocks, P.F. Patel-Schneider, and F. van Harmelen. "From SHIQ and RDF to OWL: the making of a Web Ontology Language". In: Journal of Web Semantics 1.1 (2003), pp. 7-26.

[9] Joe Raad, Nathalie Pernelle, and Fatiha Saïs. "Detection of Contextual Identity Links in a Knowledge Base". In: Proceedings of the 9th International Conference on Knowledge Capture. Austin, United States, Dec. 2017. URL: https://hal.archives-ouvertes.fr/hal-01665062.

[10] Joe Raad et al. Detecting erroneous identity links on the web using network metrics. Oct. 2018.

[11] W3C. Web Ontology Language (OWL). 2012. URL: https://www.w3.org/OWL/.

[...]


1 http://www.w3.org

2 http://www.w3.org/TR/owl-features/

3 https://krr.triply.cc/krr/metalink

4 https://pypi.org/project/hdt/

5 https://github.com/AppieO/sameAs/extractor.py

6 https://github.com/AppieO/sameAs/getID.py

7 https://doi.org/10.5281/zenodo.3986832

8 If a triple is not asserted in the LOD cloud, it is replaced by the tag <NaN> so that we have all assertions in one column.

9 The folder path where the sameAs link was found in the LOD cloud

10 <http://rdfdata.eionet.europa.eu/eurostat/void.rdf#MT-sdmx> owl:sameAs <http://dbpedia.org/resource/SDMX>

11 http://dbpedia.org/

Details

Title: Extending the "sameAs" infrastructure of the LOD cloud
College: VU University Amsterdam (Vrije Universiteit)
Grade: 8
Author: Abdullah Nashef
Year: 2020
Pages: 14
Catalog Number: V925593
ISBN (eBook): 9783346256478
Language: English
Notes: Dutch grade "8" corresponds approximately to a 1.7 in the German grading system
Citation: Abdullah Nashef (Author), 2020, Extending the "sameAs" infrastructure of the LOD cloud, Munich, GRIN Verlag, https://www.grin.com/document/925593
