Excerpt

## Graph search for communities with corresponding keywords

Andrea Attwenger

Abstract— The era of big data and world-spanning social networks has highlighted the necessity of ways to make sense of this vast amount of information. Data can be arranged in a graph of connected vertices, therefore giving it a basic structure. If the vertices are further described by keywords, the structure is called an attributed graph. This paper discusses a query algorithm that scans these attributed graphs for communities that are not only structurally linked - therefore forming subgraphs - but also share the same keywords. This method might give new insights into the composition of large networks, highlight interesting connections and give opportunities for effectively targeted marketing. As a specific use case, the idea of the attributed community query is applied to the example of a film recommendation program.

## 1 INTRODUCTION

Recent developments in the digital world have caused the data land- scape to change drastically from what it looked like just a few years ago. Maybe the most intrusive of numerous changes in our daily life is the advancement of social networks. In the last few years, these networks have not only begun to display seemingly every area of our lives, but also become gigantic, spanning nearly the whole world and surpassing several large countries in terms of population. This leads to an enormous amount of highly coveted information, that is albeit difficult to make sense of.

Since a large part of this data is gathered by social networks, the topic of attributed graphs has become relevant, as they seem to provide a good way of mapping these networks[7]. The idea of combining structural clustering and attributive information for graph analysis is quite new, with many existing algorithms only addressing one of the two[3]. An advantage of the concept is that it reflects the configuration of actual networks: individuals usually belong to several different kinds of communities and relationships simultaneously, leading to a network of numerous overlapping, cohesive subgroups of nodes[6]. One of the goals of this new combined method is to truly understand the network, its structure and the properties of the connections between various entities. That way, different kinds of communities within the network can be found and further described.

This seminar paper will discuss a community query algorithm for attributed graphs designed by Fang et al.[7], that relies on both of the above mentioned characteristics: the structure of the graph network and the describing attributes of its nodes. The remainder of the pa- per is structured as follows. Section 2 introduces concepts and terms used throughout this paper, specifically related to attributed graphs and its communities. Section 3 gives an overview on the related work on the subject, while also demonstrating the reasons that led to the de- velopment of the specific query algorithm that is discussed in the next section. The chapter discusses and summarizes the paper by Fang et al.[7] and presents advantages and drawbacks of their algorithm for attributed community queries. Section 5 expounds a possible use case for the algorithm that differs from the fields of application usually dis- cussed in this area of research. The paper closes with section 6, in which the findings are discussed and summarized.

## 2 ATTRIBUTED COMMUNITY GRAPHS

The underlying data structure of the concepts discussed in this paper is the attributed graph, sometimes also called attributed relational graph [2]. It is a graph consisting of vertices (representing people, organi-

- Andrea Attwenger is studying Media Informatics at the University of Munich, Germany, E-mail: Andrea.Attwenger@campus.lmu.de

- This seminar paper was written for the course ’Recent Developments in Data Science’, 2016

zations, objects...) and edges (representing relationships), and asso- ciated with text strings or keywords describing details of each vertex and/or edge, the so-called attributes [2, 7]. Attributed graphs can be viewed from two different perspectives, whether from a structural or a thematic point of view. The structural dimension corresponds to a so- cial graph, displaying different vertices and the various relationships between them. The compositional perspective then further describes these vertices[4]. Graphs that model real-world networks are often called attributed multi graphs, since they deal with objects containing rich attributes and are connected by mutiple types of edges[13].

Newman and Girvan[12] describe community structure as one of the common properties of networks, which are thereby divided into groups of vertices within which the connections are dense, but be- tween which they are sparser. However, an attributed community (AC) does not only need to satisfy this condition of structural cohesiveness by having closely linked vertices, but also must possess keyword co- hesiveness, meaning that its vertices must have keywords in common. The characteristics of attributed communities are described in more detail in section 4.1. The algorithm discussed in this seminar paper aims to enable attributed community queries (ACQ), that return one or more subgraphs of an attributed graph, the above defined attributed communities, in an online and query-based manner[7].

## 3 RELATED WORK

The partition and analysis of graph data has long been an important subject in data science. Not all of the developed methods include at- tributive information to distinguish between different kinds of relation- ships, with some only focusing on structural characteristics.

This seminar paper will concentrate on graph clustering and the detection and search of communities. However, there are also other related areas of research focusing on the structure and properties of graphs that will not be discussed in depth in this seminar paper.

Subgraph isomorphism or graph matching does not consider all the communities in a graph, but instead concentrates only on find- ing subgraphs that are isomorphic instances of one particular template graph[2, 17].

Social network analysis specifically aims to understand the struc- ture of relationships in a network. While it often takes a more qualita- tive approach, using questionnaires and interviews in order to under- stand the nature of relations, interdependencies and relationships are often represented as attributed graphs[15].

### 3.1 A new challenge in graph clustering

Clustering is one of the most commonly known methods for grouping objects based on their similarities. Among the different approaches, one can distinguish between those which focus on the attributes and properties describing objects, and those which exploit the relationships between different objects. The latter, often called graph clustering techniques, usually focus on the topological structure alone, therefore not making use of the characteristics of the vertices. The challenge in clustering networks lies now in the combination of both of these approaches[3].

Recent studies show the superiority of these combined methods when it comes to clustering accuracy [1, 13]. Combe et al. present different approaches to combine the data: structure-based clustering on an attribute weighted graph, attribute-based clustering on structural distance and a hybrid clustering method[3]. Papadopoulos et al. re- mark that especially in real-world networks, the relevance of an at- tribute must be taken into account, so as not to introduce noise and therefore reduce the accuracy. They propose ranking the vertex prop- erties by importance in order to achieve the best results[13].

### 3.2 Community detection vs. Community search

Considering the different approaches of retrieving communities in a graph, Fang et al. distinguish between those that focus on community detection and others that conduct community search[7]. The difference lies in the number and choice of communities obtained by an algorithm - detection algorithms aim to find all the communities in a graph, while search algorithms only concentrate on communities specific to a predefined query.

A community detection algorithm depicts the overall structure of a network by displaying all the communities in it. The discovered com- munities are densely connected subgroups with ideally a high attribute similarity. This might take a long time, especially for large graphs, which is why these methods are not suitable for query requests, where users search for the community that surrounds a specific vertex[7].

Some of these algorithms do not consider the textual information attributed to graph nodes: Du et al. present an algorithm that concen- trates on efficient community detection in large-scale social networks, and while they use entity attributes to define and describe the discov- ered communities later on, they do not use it for the discovery itself [6]. Fortunato characterizes the main elements of the problem of find- ing and describing community structures and presents an overview on existing methods, but does not specifically cover attributed graphs[8] Newman and Girvan refine the measures used to differentiate between communities, but also do not consider textual information[12].

Other works already include contextual attributes: Zhou et al. use a unified distance measure to combine the structural closeness and attribute similarity[18]. Ruan et al. highlight the role of content information in the reduction of noise and heightening of the community signal[14]. Liu et al. also argue for the integration of attributional information, using the example of literature archives that could greatly benefit from a common consideration of the high-level topics of a paper and the social network of its author[10]. The same application area is also discussed as a promising use case by Nallapati et al.[11].

In order to allow for online graph analysis that is geared towards finding a community related to a specific query vertex, community search solutions have recently been developed. They enable the user to submit query requests and retrieve only the communities specific to this request[7]. Sozio and Gionis argue that more often than looking for all the communities in a graph, it is interesting to find the commu- nities formed by a specific set of nodes, which they implemented in their algorithm[16]. Cui et al. also developed a method that allows the user to find the overlapping communities a query vertex belongs to, while the community criterion can be adapted for each vertex[5]. However, since these and other existing algorithms mainly concentrate on non-attributed graphs, Fang et al.[7] saw the need to develop an algorithm that includes contextual as well as structural information: this ACQ algorithm is discussed in the next section.

## 4 EFFECTIVE COMMUNITY SEARCH FOR LARGE ATTRIBUTED GRAPHS

The following section will summarize and discuss the paper ”Effective Community Search for Large Attributed Graphs” by Fang et al.[7] in more detail. The concepts and terms used throughout the section cor- respond with those introduced in section 2 and will be defined further, if necessary.

As discussed in section 3.2, existing community search algorithms only discuss non-attributed graphs. Community detection algorithms for attributed graphs on the other hand are not suitable for query-based analysis, due to their lack in time efficiency. This is why the authors of the hereby discussed paper developed a community search algorithm for large attributed graphs, thus combining the inclusion of textual in- formation with query-based, efficient analysis.

### 4.1 Problem definition

In their introduction and summary of the related work, the authors explain extensively why existing algorithms are not adequate for efficiently searching for communities in attributed graphs, and claim that their method is the first solution to do so. They also define the most important terminology and give a first overview on their own contribution. Subsequently, the underlying requirements are presented:

The attributed graph G(V,E) is assumed to be undirected and consists of vertex set V and edge set E. The corresponding sizes of V and E are called n and m, respectively. Each vertex v ∈ V is associated with a set of keywords, W(v). The degree of a vertex v of G is denoted by degG(v) and, as it is an undirected graph, simply con- stituted by the number of edges connecting the vertex to other vertices.

Communities, as already introduced in section 2, are subgraphs of G that satisfy structure cohesiveness. Structure cohesiveness is the condition of the minimum degree of all the vertices in the community being k or more, with k being a positive integer.

Given k, the k-core of G, denoted by Hk, is the largest subgraph of G, so that all vertices in Hk have a degree larger or equal to k. Hk then has an order of k. It is important to note that Hk does not have to be connected, but might be composed of several distinct components. Given a vertex v ∈ V, its core number, denoted by coreG(v), is the highest order of a k-core that contains v.

In the following, the authors define the ACQ problem: Given a graph G(V,E), a positive integer k, a query vertex q ∈ V and a set of keywords S ⊆ W(q), one wants to return a set of subgraphs, where every subgraph Gq in the set fulfills the following requirements:

- Connectivity: Gq ⊆ G is connected and contains q

- Structure cohesiveness: ∀v ∈ Gq, degGq (v) ≥ k

- Keyword cohesiveness: The size of L(Gq,S) is maximal, where

L(Gq,S) = ∩v ∈ Gq(W(v) ∩ S) is the set of keywords shared in S by all vertices of Gq

Gq is called the attributed community (AC) of q and L(Gq,S) the AC- label of Gq. The condition ”maximal” is not absolutely necessary, but it leads to the ACQ returning only the most related vertices, in terms of the number of common keywords.

illustration not visible in this excerpt

Fig. 1. The attributed community (circled) is formed by a connected sub- graph with vertex degree 3. Its vertices have two keywords in common, ”research” and ”sports”[7].

Figure 1 illustrates an example of an attributed community. A query on the displayed graph with for example Bob as query vertex q, the set S of keywords a subset of the keywords of Bob ({research, sports} and {research, sports, yoga} would both work in this example) and a minimum vertex degree k of 3 would return the circled attributed community. Its vertices {John, Mike, Jack, Bob} have two keywords, ”research” and ”sports”, in common.

### 4.2 Implementation

The authors start by presenting a very intuitive basic solution for the ACQ problem. It consists of enumerating all non-empty subsets of S, a verification step, where the existence of a subgraph for all the subsets of S is asserted, and the computation of these subgraphs. This solution is of course very inefficient, which is why Fang et al. developed a so-called core label tree as an index structure to achieve higher query efficiency.

#### 4.2.1 The core label tree

The concept of the core label tree benefits from the observation that cores as described in section 4.1 are nested. A (k+1)-core must always be contained in a k-core, since vertices that have a minimum degree of a higher number always also satisfy the requirement of the degree of a lower number. Due to this fact, all k-cores can be organized into a tree structure. For any internal node p in the tree, the vertex sets of its child nodes are the subsets of p’s vertex set. By removing the redundant vertices from p’s vertex set that are shared by its child nodes, one obtains a compressed tree where each graph vertex only needs to be mentioned once, thus saving space cost. Information on the attributes of a node is given by adding inverted attribute lists. Each node then consists of four elements:

- coreNum: the number of the k-core

- vertexSet: a set of graph vertices

- invertedList: a list of (key, value) pairs, where the key is a key- word contained by vertices in vertexSet and the value is the list of vertices in vertexSet containing the key

- childList: a list of child nodes

Figure 2 illustrates the construction and structure of a core label tree. The tree structure shown in (a) does not store information very efficiently, since there is a large overlap between the vertices contained in child nodes and in their parent nodes. The core label tree index in (b) now removes these redundant vertices, only leaving the newly added vertices, when traversing the tree bottom-up. The numbers of the k-cores are added to each node, as well as the inverted lists displaying keywords and the vertices of the node that contain them.

## REFERENCES

[1] H. Cheng, Y. Zhou, and J. X. Yu. Clustering large attributed graphs: A balance between structural and attribute similarities. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2):12, 2011.

[2] T. Coffman, S. Greenblatt, and M. Sherry. Graph-based technologies for intelligence analysis. Communications of the ACM, 47(3):45-47, 2004.

[3] D. Combe, C. Largeron, E. Egyed-Zsigmond, and M. Gery. Getting clus- ters from structure data and attribute data. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), pages 710-712. IEEE Computer Society, 2012.

[4] J. D. Cruz, C. Bothorel, and F. Poulet. Integrating heterogenous informa- tion within a social network for detecting communities. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 1453-1454. ACM, 2013.

[5] W. Cui, Y. Xiao, H. Wang, Y. Lu, and W. Wang. Online search of over- lapping communities. In Proceedings of the 2013 ACM SIGMOD In- ternational Conference on Management of Data, pages 277-288. ACM, 2013.

[6] N. Du, B. Wu, X. Pei, B. Wang, and L. Xu. Community detection in large-scale social networks. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 16-25. ACM, 2007.

[7] Y. Fang, R. Cheng, S. Luo, and J. Hu. Effective community search for large attributed graphs: Hku cs tech report. 2016.

[8] S. Fortunato. Community detection in graphs. Physics reports, 486(3):75-174, 2010.

[9] J. Lischka and H. Karl. A virtual network mapping algorithm based on subgraph isomorphism detection. In Proceedings of the 1st ACM workhop on Virtualized infrastructure stystems and architectures, pages 81-88. ACM, 2009.

[10] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link lda: Joint models of topic and author community. In Proceedings of the 26th annual interna- tional conference on machine learning, pages 665-672. ACM, 2009.

[11] R. Nallapati, A. Ahmed, E. Xing, and W. W. Cohen. Joint latent topic models for text and citations. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 542-550. ACM, 2008.

[12] M. E. Newman and M. Girvan. Finding and evaluating community struc- ture in networks. Physical review E, 69(2), 2004.

[13] A. Papadopoulos, D. Rafailidis, G. Pallis, and M. D. Dikaiakos. Clus- tering attributed multi-graphs with information ranking. In International Conference on Database and Expert Systems Applications, pages 432- 446. Springer International Publishing, 2015.

[14] Y. Ruan, D. Fuhry, and S. Parthasarathy. Efficient community detection in large networks using content and links. In Proceedings of the 22nd international conference on World Wide Web, pages 1089-1098. ACM, 2013.

[15] O. Serrat. Social network analysis. Knowledge Solutions, 28:1-4, 2009.

[16] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 939-948. ACM, 2010.

[17] H. Tong, B. Gallagher, C. Faloutsos, and T. Eliassi-Rad. Fast best-effort pattern matching in large attributed graphs. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 737-746. ACM, 2007.

[18] Y. Zhou, H. Cheng, and J. X. Yu. Graph clustering based on struc- tural/attribute similarities. Proceedings of the VLDB Endowment, 2(1):718-729, 2009.

[19] L. Zhu, W. Keong Ng, and J. Cheng. Structure and attribute index for approximate graph matching in large graphs. Information Systems, 36(6):958-972, 2011.

**[...]**

- Quote paper
- Andrea Attwenger (Author), 2017, Social Networks and Questions of Big Data. Graph search for communities with corresponding keywords, Munich, GRIN Verlag, https://www.grin.com/document/369515

Publish now - it's free

Comments