1.2 What is Social Network Analysis?
1.3 Motivation and Problem Statement
1.5 Outline of Thesis
2 LITERATURE REVIEW
2.2 Apriori-Based Approach
2.3 Pattern-Growth Approach
2.3.1 Survey and Techniques of Frequent Pattern Mining
2.4 Survey and Techniques of Online Social Networks
2.5 Survey on Opinion Mining
3 AN ANALYSIS OF SOCIAL NETWORK DATA USING GEPHI
3.2.1 Real-time Visualization
3.2.4 Dynamic Network Analysis
3.2.5 Create Cartography
3.2.6 Clustering and Hierarchical graphs
3.2.7 Dynamic filtering
3.2.10 Plug-in center
3.3 Implementation of Gephi to Social Network data
3.3.1 Data Collection and Experimental Details
3.4 Results and Discussion
4 TOOLS AND TECHNIQUES USED FOR EXTRACTION AND PROCESSING OF THE REAL TIME TWEETS
4.2 Twitter4J Library
4.2.1 System Requirements
4.2.2 How To Use
4.2.4 Source Code
4.2.5 Maven Integration
4.3 Tweet search method using GET search/tweets in API v1.1
4.4 JSOUP JAVA Libraries
4.4.1 Parsing and traversing a Document
4.4.2 Parse a document from a String
4.4.3 Parsing a body fragment
4.4.4 Extract attributes, text, and HTML from elements
4.4.5 Sanitize untrusted HTML (to prevent XSS)
4.5 NetBeans IDE
4.5.1 Working with NetBeans Modules
4.6 Processing of tweets
4.7.1 History and team members
4.7.2 Database contents
4.7.3 Knowledge structure
4.7.4 Psycholinguistic aspects of WordNet
4.7.5 WordNet as a lexical ontology
5 EXTRACTION AND THE PROCESSING OF REAL TIME TWEETS
5.2 Experimental Details
5.3 Results and Discussions
6 CLASSIFICATION OF TWEETS USING SENTIMENTAL ANALYSIS
6.2 Machine learning
6.2.1 Naive Bayes Classification
6.2.2 Maximum Entropy
6.3 Support Vector Machines
6.3.1 What is SVM used for?
6.3.2 How SVM Works
6.4 Evaluation of Sentiment Classification
6.5 Enhanced Emoticon Classification
6.5.1 EEC score calculation
6.6 Improved Polarity Classification
6.6.1 PC score calculation
6.7 SentiwordNet Classification
6.7.1 SWNC score calculation
6.8 Method Details
6.9 Results And Discussions
7 CLASSIFICATION OF TWEETS USING HYBRID MODEL
7.3 Results And Discussions
8 CONCLUSIONS AND FUTURE WORK
8.2 Future Work
I thank the almighty for his boundless blessings showered on me. If words are considered symbols of approval and tokens of acknowledgement, then these words play a heralding role in expressing my gratitude to all who have helped me directly and indirectly during my research work.
I would like to express my heartfelt thanks to my respectable and friendly guide Dr. S. SHEEJA, M.C.A., M.Phil., Ph.D., Associate Professor, Department of Computer Applications, Karpagam University, for her kind supervision. Her wise academic advice and ideas have played an extremely vital role in the work presented in this thesis.
My special and sincere thanks and gratitude to my co-guide Dr. Binod Kumar, Director, JSPM’s JICA, Pune-33, without whose instructions, scholarly guidance and ceaseless encouragement I would never have succeeded in completing this thesis.
I also take this opportunity to thank Dr. Veni, Associate Professor & HOD, Department of Computer Science, Karpagam University, and Dr. T. Purushottam, Professor, Government College of Engineering, Coimbatore, for their kind cooperation during the Doctorate Committee Meetings and for their valuable guidance and helpful suggestions during the research work.
I am equally indebted to our Hon. Founder Secretary, Prof. T. J. Sawant, JSPM Group of Institutes, Pune, for permitting me to carry out the research work and granting me leave for the same, as and when required.
I also thank Dr. Ajay Kumar, Director, JSPM JTC, Pune, and Dr. N. S. Nehe, Director, JSPM’s NTC, for their constant support and motivation in the adjustment of duties, for granting leave during my visits to the university from time to time, and for providing the college infrastructure for carrying out the research work, such as the digital e-library and Internet facilities.
My special thanks to my wife Ms. Sheetal for her untiring support and good humor, which helped to boost my spirit and made the research a joyful activity. I also thank my son Ku. Shantanu and my daughter Ku. Shweta for giving me time to study and complete my academic journey delightfully and smoothly.
I would like to specially thank Mr. Ramanand Potdar, Mr. Rajesh Dhudhal and Mr. Amar Kokare for their valuable guidance and help during this work.
Lastly, I take this opportunity to thank all the Computer Science Department staff and the Director of Research, Karpagam University, and their staff, who have helped me a lot, directly or indirectly, during the research work.
BANDGAR BAPURAO MAYAPPA
LIST OF TABLES
(Not included in this reading sample.)
LIST OF FIGURES
(Not included in this reading sample.)
LIST OF ABBREVIATION
(Not included in this reading sample.)
In day-to-day life, we see tremendous growth in the use of social network sites. Everyone shares their views about observed events, which increases the network of people on social network websites. Thus, with the increasing availability of ample social network data, there is growing interest in analyzing how those networks evolve over time. There is also interest in understanding the continuity of a social network and the future status of its interactions over a certain period.
Therefore, the social network textual data is visualized as a graph using Gephi, and we identified the contextual clusters, or distinct word communities, present within the text, along with the main quantitative properties of the graph as a whole. We explored the relations between the contextual clusters and the role of junctions in linking these clusters together, and interpreted the data. We observed frequent interaction of people through the network, with a low average degree distribution value. It is also observed that there are groups of nodes in the network that are more densely connected to each other than to the rest of the network. Thus, by making use of the Gephi tool, we analyzed the social network text data in graphical form and understood the social network by studying the statistical parameters of the network data.
Social media content has also stirred much excitement and created abundant opportunities for understanding the opinions of the general public and consumers toward social events, political movements, company strategies, marketing campaigns, and product preferences. Many new and exciting social, geopolitical, and business-related research questions can be answered by analyzing the thousands, even millions, of comments and responses expressed in various blogs, forums, social media and social network sites, virtual worlds, and tweets. This is a good medium for exploring the opinions of people about a particular event, which may help in making business decisions or in gathering feedback about political activities to be carried out in future.
Therefore, we extracted real-time tweets matching a social tweet keyword from the Twitter website, news websites, etc., using the Twitter4J library and its APIs, and the JSoup library, for English keywords only. These tweets were preprocessed to obtain only keyword-related sentences. The preprocessed tweets were then further cleaned by removing slang, hashtags, URLs and stop words. We also applied abbreviation and emoticon conversion to obtain the corresponding complete, meaningful message from each tweet.
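The cleaning steps described above (removing URLs, mentions, hashtag marks and stop words) can be sketched with simple regular expressions. The following Java snippet is an illustration only, not the application's actual code; its stop-word list is a tiny made-up sample.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative tweet cleaner: strips URLs, @mentions and '#' marks,
// lower-cases the text, then drops a few common stop words.
public class TweetCleaner {
    // Sample stop words only; a real list would be much longer.
    private static final List<String> STOP_WORDS = Arrays.asList("the", "is", "a", "an", "of");

    public static String clean(String tweet) {
        String text = tweet
                .replaceAll("https?://\\S+", " ")  // remove URLs
                .replaceAll("@\\w+", " ")          // remove user mentions
                .replaceAll("#", " ")              // keep the hashtag word, drop '#'
                .toLowerCase();
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(w -> !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(clean("The match is great! #cricket http://t.co/abc @fan"));
    }
}
```

A real pipeline would add slang, abbreviation and emoticon dictionaries on top of this skeleton.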
The processed tweets are further classified using three unstructured models: EEC, IPC and SWNC. The results of these models are compared by obtaining the confusion matrix and parameters such as precision, recall and accuracy. The SWNC model showed better classification results than EEC and IPC. Further, a hybrid model is used to reduce the number of neutral tweets, and the corresponding results are shown in pie graphs. By comparing the results of the SWNC model and the hybrid model, it is observed that the number of neutral tweets is reduced in the hybrid model, by around 20-25% in comparison with the SWNC model. Thus, we classified real-time social tweets using unstructured models and their hybrid model. For obtaining these results, we developed a Windows-based, indigenous, integrated and user-friendly application in Java using the NetBeans framework.
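The evaluation parameters mentioned above can be computed directly from the binary confusion-matrix counts: precision = TP/(TP+FP), recall = TP/(TP+FN), accuracy = (TP+TN)/(TP+TN+FP+FN). The counts in the sketch below are hypothetical, not results from the thesis experiments.

```java
// Illustrative computation of precision, recall and accuracy from a
// binary confusion matrix (tp, fp, fn, tn). Counts are made-up examples.
public class ClassifierMetrics {
    public static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    public static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (tp + tn) / (double) (tp + tn + fp + fn);
    }

    public static void main(String[] args) {
        int tp = 80, fp = 20, fn = 10, tn = 90;  // hypothetical counts
        System.out.printf("precision=%.3f recall=%.3f accuracy=%.3f%n",
                precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn));
    }
}
```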
In day-to-day life, we see tremendous growth in the use of social network sites. Everyone shares their views about events, and the network of people on the social network keeps increasing. Thus, with the increasing availability of ample social network data, there is growing interest in analyzing how those networks evolve over time. There is a need to understand the continuity of a social network, and the future status of its interactions can be understood by studying the evolution of the network over a certain period.
Social media content has stirred much excitement and created abundant opportunities for understanding the opinions of the general public and consumers toward social events, political movements, company strategies, marketing campaigns, and product preferences. Many exciting social, geopolitical, and business-related research questions can be answered by analyzing the thousands, even millions, of comments and responses expressed in various blogs, forums, social media and social network sites, virtual worlds, and tweets. This is a good medium for exploring opinion about a particular event, which could be helpful in making business decisions or in gathering feedback about political activities to be carried out in future.
1.2 What is Social Network Analysis?
Social network analysis is based on an assumption of the significance of relationships among interacting units. The social network perspective encompasses theories, models, and applications that are expressed in terms of relational concepts or processes. Along with the growing interest in and increased use of network analysis, a consensus has emerged about the central principles underlying the network perspective. In addition to the use of relational concepts, we note the following as being important:
- Actors and their actions are viewed as interdependent rather than independent, autonomous units.
- Relational ties (linkages) between actors are channels for the transfer or "flow" of resources (either material or nonmaterial).
- Network models focusing on individuals view the network structural environment as providing opportunities for, or constraints on, individual action.
Network models conceptualize structure (social, economic, political, and so forth) as lasting patterns of relations among actors. The unit of analysis in network analysis is not the individual, but an entity consisting of a collection of individuals and the linkages among them. Network methods focus on dyads (two actors and their ties), triads (three actors and their ties), or larger systems (subgroups of individuals, or entire networks).
Social network analysis has emerged as a set of methods for the analysis of social structures, methods which are specifically geared towards an investigation of the relational aspects of these structures. The use of these methods, therefore, depends on the availability of relational rather than attribute data.
Network analysis is the study of social relations among a set of actors. It is a field of study - a set of phenomena or data which we seek to understand. In the process of working in this field, network researchers have also developed a set of distinctive theoretical perspectives. Some of the hallmarks of these perspectives are:
- focus on relationships between actors rather than attributes of actors
- sense of interdependence: a molecular rather than atomistic view
- structure affects substantive outcomes
- emergent effects
Thus, network theory is sympathetic with systems theory and complexity theory. Social network analysis is also characterized by a distinctive methodology encompassing techniques for collecting ample data, statistical analysis, visual representation, etc. Social network analysis (SNA) is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities. The nodes in the network are the people and groups, while the links show relationships or flows between the nodes. SNA provides both a visual and a mathematical analysis of complex human systems.

Network analysis (or social network analysis) is a set of mathematical methods used in social psychology, sociology, ethology, and anthropology. Network analysis assumes that the way the members of a group can communicate with each other affects some important features of that group (efficiency when performing a task, moral satisfaction, leadership). Network analysis makes use of mathematical tools and concepts that belong to graph theory. A network models a communication group: it consists of a number of nodes (each node corresponding to a member of the group) and a number of edges (or ties), each one associated with a communication connection between two actors. Network data is stored in an adjacency matrix, where commonly the [i,j] element corresponds to the communication behavior of actor i toward actor j.
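The adjacency-matrix storage described above can be illustrated with a toy three-actor group; the matrix values below are invented purely for demonstration.

```java
// Minimal sketch of a communication network stored as an adjacency matrix:
// entry ADJ[i][j] counts messages sent from actor i to actor j (toy data).
public class AdjacencyNetwork {
    static final int[][] ADJ = {
            {0, 2, 1},   // actor 0 sent 2 messages to actor 1, 1 to actor 2
            {1, 0, 0},   // actor 1 sent 1 message to actor 0
            {3, 1, 0}    // actor 2 sent 3 messages to actor 0, 1 to actor 1
    };

    // Out-degree of an actor: number of distinct actors it communicates with.
    public static int outDegree(int actor) {
        int d = 0;
        for (int w : ADJ[actor]) if (w > 0) d++;
        return d;
    }

    public static void main(String[] args) {
        for (int i = 0; i < ADJ.length; i++)
            System.out.println("actor " + i + " out-degree: " + outDegree(i));
    }
}
```

The same matrix is the input that graph-theoretic measures (degree distribution, density, and so on) are computed from.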
Social network analysis is focused on uncovering the patterning of people's interaction. Network analysis is based on the intuitive notion that these patterns are important features of the lives of the individuals who display them. Network analysts believe that how an individual lives depends in large part on how that individual is tied into the larger web of social connections. Many believe, moreover, that the success or failure of societies and organizations often depends on the patterning of their internal structure. From the outset, the network approach to the study of behavior has involved two commitments: (1) it is guided by formal theory organized in mathematical terms, and (2) it is grounded in the systematic analysis of empirical data. It was not until the 1970s, when modern discrete combinatorics (particularly graph theory) experienced rapid development and relatively powerful computers became readily available, that the study of social networks really began to take off as an interdisciplinary specialty. Since then its growth has been rapid. It has found important applications in organizational behavior, inter-organizational relations, the spread of contagious diseases, mental health, social support, the diffusion of information and animal social organization.
A graph transaction is represented by adjacency matrices, and the frequent patterns appearing in the matrices are mined through the extended algorithm. These are modeled as attributed graphs in which each vertex represents a user ID and the weight on each edge represents the frequency of interaction between the users; each vertex carries an attribute that indicates the user type. The main challenge in the development of such an algorithm is how to split the discovery process into several phases efficiently. The algorithm should behave like a specialized free tree miner when faced with free tree databases, but should also be able to deal with graph databases efficiently. Existing algorithms for frequent pattern mining become very costly in time and space as pattern sizes and the number of networks increase. In various domains such as chemoinformatics and bioinformatics, no efficient algorithms are currently available, for example, for mining recurrent patterns across large collections of genome-wide networks. Due to the increasing size and complexity of patterns in computer science, the need for efficient graph mining algorithms is increasing. There is still scope for improvement in graph mining algorithms; the improvement can be in speed or in sensitivity.
1.3 Motivation and Problem Statement
As social networking is an upcoming and applicable field, understanding the concepts of social network evolution is a must. Therefore, we decided to study the different aspects of the social network. It is also interesting to understand the evolution of the social network over a certain period of time. Different aspects of the evolution of social networks have been studied, but the prediction of future growth has not been studied much. Therefore, we planned to use the most effective graph evolution theory for this purpose.
The emergence of social media has given web users a venue for expressing and sharing their thoughts and opinions on different topics and events. Twitter, with nearly 600 million users and over 250 million messages per day, has quickly become a gold mine for organizations to monitor their reputation and brands by extracting and analyzing the sentiment of the Tweets posted by the public about them, their markets, and competitors. Sentiment analysis over Twitter data and other similar micro-blogs faces several new challenges due to the typical short length and irregular structure of such content.
After understanding the concept of the social network, it is also interesting to understand the opinions of people on the social network about a particular topic. This may help in making business decisions, in gathering feedback about political events, or in letting the editor of a newspaper get feedback on a particular news item. There are various research publications on this topic using machine learning algorithms, but they have some limitations. Therefore, we planned to study the different aspects of opinion mining using unstructured algorithms. There are some research publications regarding the same, but the methods given are not clearly described, and the systems used are complex, Linux-based systems. Therefore, we planned to develop a Windows-based system that is user friendly. For this purpose, we applied sentimental analysis theory.
The objectives of our work are as follows:
1. Study the social network and understand its evolution over a certain period
2. Extraction of real-time tweets and processing them for opinion mining
3. Analysis of the processed tweets for opinion mining using the unstructured EEC, IPC and SWNC models
4. Application of a hybrid model for the reduction of neutral tweets from the unstructured model results
1.5 Outline of the thesis
The thesis is mainly divided into two parts. In the first part, we studied social network analysis using graph theory. In the second part, we studied the extraction of real-time tweets and their processing to obtain meaningful messages. These meaningful messages are then used for further classification using the unstructured algorithms and their hybrid model. The brief outline of the thesis is as follows:
1. Chapter one contains the introduction to the social network, the objectives of the research work and the outline of the thesis.
2. Chapter two contains a wide literature survey and review of the literature on the social network, online social network analysis and the sentimental analysis of social networks.
3. Chapter three contains detailed information about the study of social network data using the Gephi tool, covering different aspects of the social network such as betweenness centrality, network density, modularity, connected components, clustering coefficient, and average degree distribution.
4. Chapter four contains detailed information about the tools used for the extraction of real-time tweets and the processing of the tweets to obtain meaningful messages.
5. Chapter five contains detailed information about the extraction and processing of real-time tweets on a particular topic and the methods used for the same. The meaningful messages obtained are further used for sentimental analysis.
6. Chapter six gives information about the classification of tweets using the unstructured EEC, IPC and SWNC models, the comparison of their results using the confusion matrix and the precision, recall and accuracy parameters, and visualization using pie graphs.
7. Chapter seven contains information about the hybrid model, used to improve classification accuracy and reduce the number of neutral tweets in the classified social network data.
8. Chapter eight contains the overall summary of the research work carried out during this research period, with a comparison of the results using pie graphs.
2. LITERATURE REVIEW
In mathematics, computer science and other subjects, algorithms play a vital role: an algorithm is an effective method for solving a problem, expressed as a finite sequence of instructions. Algorithms are used for calculation, data processing and many other fields. Graph mining algorithms can be grouped by the base approach they depend upon.
2.2 Apriori-Based Approach
Apriori-based frequent substructure mining algorithms share similar characteristics with Apriori-based frequent itemset mining algorithms. The search for frequent subgraphs starts with graphs of a small "size" and proceeds in a bottom-up manner by generating candidates having an extra vertex, edge or path. The definition of graph size depends on the algorithm used.
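The level-wise, bottom-up candidate generation described above is easiest to see in the itemset setting that these graph miners borrow from: frequent size-k sets are joined into size-(k+1) candidates, which are then counted against the database. The following is a minimal illustrative sketch over toy transactions, not a substructure miner.

```java
import java.util.*;

// Apriori-style level-wise mining of frequent itemsets from toy transactions.
public class AprioriSketch {
    public static Set<Set<String>> frequentItemsets(List<Set<String>> db, int minSupport) {
        Set<Set<String>> result = new HashSet<>();
        // Level 1: frequent single items.
        Set<Set<String>> level = new HashSet<>();
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> t : db) for (String item : t) counts.merge(item, 1, Integer::sum);
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minSupport)
                level.add(new HashSet<>(Collections.singleton(e.getKey())));
        while (!level.isEmpty()) {
            result.addAll(level);
            // Join step: union pairs of frequent k-sets into (k+1)-candidates.
            int k = level.iterator().next().size();
            Set<Set<String>> candidates = new HashSet<>();
            for (Set<String> a : level) for (Set<String> b : level) {
                Set<String> u = new HashSet<>(a);
                u.addAll(b);
                if (u.size() == k + 1) candidates.add(u);
            }
            // Count step: keep candidates reaching minimum support.
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> c : candidates) {
                int support = 0;
                for (Set<String> t : db) if (t.containsAll(c)) support++;
                if (support >= minSupport) next.add(c);
            }
            level = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
                new HashSet<>(Arrays.asList("a", "b", "c")),
                new HashSet<>(Arrays.asList("a", "b")),
                new HashSet<>(Arrays.asList("a", "c")));
        System.out.println(frequentItemsets(db, 2));
    }
}
```

A graph-based Apriori miner replaces the union join with subgraph extension (adding a vertex, edge or path) and the containment test with subgraph isomorphism, which is what makes the graph setting so much more expensive.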
2.3 Pattern-Growth Approach
The Apriori-based approach has to use the breadth-first search (BFS) strategy because of its level-wise candidate generation.
2.3.1 Survey and Techniques of Frequent Pattern Mining
Various algorithms on graph mining have been developed by many researchers; some of them are reviewed in this section. With the increasing availability of large social network data, there is also an increasing interest in analyzing how those networks evolve over time. Traditionally, the analysis of social networks has focused only on a single snapshot of a network. Researchers have already verified that social networks follow a power-law degree distribution, have a small diameter, and exhibit small-world structure and community structure. Attempts to explain the properties of social networks have led to dynamic models inspired by the preferential attachment model, which assumes that new network nodes have a higher probability of forming links with high-degree nodes, creating a "rich-get-richer" effect. Recently, several researchers have turned to the evolution of social networks at a global scale. For example, Jure Leskovec and his colleagues empirically observed that networks become denser over time, in the sense that the number of edges grows superlinearly with the number of nodes; moreover, this densification follows a power-law pattern. They reported that the network diameter often shrinks over time, in contrast to the conventional wisdom that such distance measures should increase slowly as a function of the number of nodes. Although some effort has been devoted to analyzing global properties of social network evolution, not much has been done to study graph evolution at a microscopic level. A first step in this direction investigated a variety of network formation strategies, showing that edge locality plays a critical role in network evolution.
M. Berlingerio et al. have proposed a different approach: adopting the paradigm of association rules and frequent-pattern mining, their work searches for typical patterns of structural change in dynamic networks. Mining for such local patterns is a computationally challenging task that can provide further insight into the increasing amount of evolving network data. Their work builds on the notion of graph evolution rules (GERs), a concept they introduced in an earlier work.
Ullmann in 1976 developed an algorithm for subgraph isomorphism, determined by means of a brute-force tree search procedure; the algorithm attains efficiency by inferentially eliminating successor nodes in the tree search. Agrawal and Srikant in 1994 considered the problem of discovering association rules between items in a large database of sales transactions, and presented two new algorithms for solving this problem that are fundamentally different from the known algorithms. Cook and Holder in 1994 presented a new version of their SUBDUE substructure discovery system based on the minimum description length (MDL) principle, which discovers substructures that compress the database and represent structural concepts in the data. They described applications of SUBDUE and also discussed how the MDL principle and the background knowledge used by SUBDUE can guide substructure discovery in a variety of domains. Blockeel and De Raedt in 1998 introduced a first-order framework for top-down induction of logical decision trees. Top-down induction of decision trees is the best known and most successful machine learning technique and has been used to solve numerous practical problems; it employs a divide-and-conquer strategy, and in this it differs from its rule-based competitors, which are based on covering strategies. Chakrabarti, Dom and Indyk in 1998 developed a new method for automatically classifying hypertext into a given topic hierarchy using an iterative relaxation algorithm.
After bootstrapping off a text-based classifier, they used both the local text in a document and the distribution of the estimated classes of other documents in its neighborhood to refine the class distribution of the document being classified. They discussed three areas of research: text and hypertext information retrieval, machine learning in the context of text or hypertext, and computer vision and pattern recognition.
Inokuchi, Washio and Motoda in 1998 proposed a novel approach named AGM to efficiently mine the association rules among the frequently appearing substructures in a given graph dataset. A graph is represented by adjacency matrices, and the frequent patterns appearing in the matrices are mined through an extended algorithm of basket analysis. Calders and Wijsen in 2001 presented, on a monotone data mining layer, a simple data-mining logic (DML) that can express common data mining tasks like "find Boolean association rules" or "find inclusion dependencies". Kramer, De Raedt and Helma in 2001 presented the application of feature mining techniques to the Developmental Therapeutics Program's AIDS antiviral screen database. Kuramochi and Karypis in 2001 presented a computationally efficient algorithm for finding all frequent subgraphs in large graph databases; they evaluated the performance of the algorithm by experiments with synthetic datasets as well as a chemical compound dataset. Pei, Han, Mortazavi-Asl and Pinto in 2001 proposed a novel sequential pattern mining method called PrefixSpan, that is, prefix-projected sequential pattern mining.
Asai, Abe and Kawasoe in 2002 addressed the efficient discovery of substructures from large semi-structured data, modeling patterns as labeled ordered trees, and studied the problem of discovering all frequent tree-like patterns that have at least a minimum support in a given collection of semi-structured data. They presented an efficient pattern mining algorithm, FREQT, for discovering all frequent tree patterns from a large collection of labeled ordered trees. Borgelt and Berthold in 2002 presented an algorithm to find fragments in a set of molecules that help to discriminate between different classes, for instance, activity in a drug discovery context. Yan and Han in 2002 investigated new approaches for frequent graph-based pattern mining in graph datasets and proposed a novel algorithm called gSpan, a graph-based substructure pattern mining algorithm that discovers frequent substructures without candidate generation. Zaki in 2002 presented the TREEMINER algorithm to discover all frequent subtrees in a forest, using a new data structure called the scope-list.
Deshpande, Kuramochi and Karypis in 2002 proposed techniques for classifying chemical compounds. These techniques can be broadly categorized into two groups. The first group consists of techniques that rely mainly on various global properties of the chemical compounds, such as molecular weight, ionization potential and inter-atomic distance, for capturing the structural properties of the compounds. Since this information is not relational, existing classification techniques can easily be used on these datasets; however, the absence of actual structural information limits the accuracy of such classifiers. The second group of techniques directly analyzes the structure of the chemical compounds to identify patterns that can be used for classification [47,48,67,68]. One of the earliest studies in discovering substructures was carried out by Dehaspe et al. in 1998, in which Inductive Logic Programming (ILP) techniques were used; though this approach is quite powerful, it is not designed to scale to large graph databases and hence may not be able to handle large chemical compound databases. A number of recent approaches focused on analyzing the graph representation of the chemical compounds to identify frequently occurring patterns, and on using these patterns to aid in the classification.
Cooper and Frieze in 2003 described a general model of a random graph process whose proportional degree sequence obeys a power law; such laws have recently been observed in graphs associated with the WWW. Dzeroski in 2003 introduced multi-relational data mining. Getoor in 2003 studied link mining. Links among objects may demonstrate certain patterns which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models; link mining is a promising new area where relational learning meets statistical modeling. Huan, Wang and Prins in 2003 proposed a novel subgraph mining algorithm, FFSM, which employs a vertical search scheme within an algebraic graph framework developed to reduce the number of redundant candidates proposed. Their studies on synthetic and real datasets demonstrate that FFSM achieves a substantial performance gain over the then state-of-the-art subgraph mining algorithm gSpan. Washio and Motoda in 2003 introduced the theoretical basis of graph-based data mining and surveyed the state of the art of graph-based data mining.
Yan, Han and Afshar in 2003 proposed an alternative but equally powerful solution: instead of mining the complete set of frequent subsequences, they mined only frequent closed subsequences, that is, those contained in no super-sequence with the same support. They also introduced an efficient algorithm called CloSpan (CloSpan stands for closed sequential pattern mining), which outperforms the previous work by one order of magnitude. Moreover, a deep understanding of efficient sequential pattern mining methods may also have strong implications for the development of efficient methods for mining frequent subtrees, lattices, subgraphs, and other structured patterns in large databases. The sequential pattern mining algorithms developed so far have good performance in databases consisting of short frequent sequences. Yan and Han in 2003 proposed to mine closed frequent graph patterns. A graph g is closed in a database if there exists no proper supergraph of g that has the same support as g. A closed graph pattern mining algorithm, CloseGraph, was developed by exploring several interesting pruning methods. Their performance study showed that CloseGraph not only dramatically reduces the number of unnecessary subgraphs to be generated but also substantially increases the efficiency of mining, especially in the presence of large graph patterns. Yin and Han in 2003 developed a new classification approach called CPAR (Classification based on Predictive Association Rules). Based on their performance study, CPAR achieves high accuracy and efficiency and has many useful features; it represents a new approach towards efficient and high-quality classification. It is interesting to further enhance the efficiency and scalability of this approach and compare it with other well-established classification schemes.
Moreover, the strength of the derived predictive rules also motivates an in-depth study of alternative approaches towards effective association rule mining.
Huan, Wang, Prins and Yang in 2004 developed a new algorithm that mines only maximal frequent subgraphs, that is, subgraphs that are not part of any other frequent subgraph. Their algorithm achieves a five-fold speedup over the then state-of-the-art subgraph mining algorithms. Their mining method is based on a novel graph mining framework in which they first mine all frequent tree patterns from a graph database and then construct maximal frequent subgraphs from the trees. Huan, Wang, Bandyopadhyay, Snoeyink and Prins in 2004 applied a novel subgraph mining algorithm to three related graph representations of the sequence and proximity characteristics of a protein's amino acid residues. The subgraph mining algorithm is used to discover spatial motifs that can discriminate among proteins in different families found in the SCOP database. Koyuturk, Grama and Szpankowski in 2004 presented an innovative algorithm for detecting frequently occurring patterns and modules in biological networks. They showed experimentally that their algorithm can extract such patterns from the KEGG database within seconds. The proposed model and algorithm are applicable to a variety of biological networks, either directly or with minor modification. Meinl, Borgelt and Berthold in 2004 showed that it is possible to mine meaningful, discriminative molecular fragments from large databases. Using an existing algorithm that employs a depth-first strategy and a sophisticated ordering scheme allows costly re-embeddings to be avoided throughout the candidate growth process, which in turn enables larger fragments to be found as well. Yin, Han, Yang and Yu in 2004 developed CrossMine, an efficient and scalable approach for multi-relational classification. Several novel methods are developed in CrossMine, including
1. tuple ID propagation, which performs a semantics-preserving virtual join to achieve high efficiency on databases with complex schemas;
2. a selective sampling method, which makes it highly scalable with respect to the number of tuples in the databases.
Both the theoretical background and the implementation techniques of CrossMine are introduced. Yan, Yu and Han in 2004 addressed the issues of indexing graphs and proposed a novel solution by applying a graph mining technique. Unlike the existing path-based methods, their approach, called gIndex, makes use of frequent substructures as the basic indexing feature. Frequent substructures are ideal candidates since they capture the intrinsic characteristics of the data and are relatively stable under database updates. gIndex has an index size ten times smaller than a typical path-based method, yet achieves 3-10 times better performance.
Hu, Yan, Huang, Han and Zhou in 2005 developed a novel algorithm, CODENSE, to efficiently mine frequent coherent dense subgraphs across large numbers of massive graphs, applied to biological networks for function discovery. Li and Tan in 2005 proposed a novel graph mining algorithm to detect dense neighborhoods in an interaction graph, which may correspond to protein complexes. Their algorithm first locates local cliques for each graph vertex and then merges the detected local cliques according to their affinity to form maximal dense regions. Yan, Yu and Han in 2005 investigated the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum number of allowed missing features, their structural filtering algorithm, called Grafil, can filter out many graphs without performing pairwise similarity computations.
Chakrabarti and Faloutsos in 2006 discussed the problems of detecting abnormalities in a given graph and of generating synthetic but realistic graphs, both of which have received considerable attention recently. Both are tightly coupled to the problem of finding the distinguishing characteristics of real-world graphs, i.e. the patterns that show up frequently in such graphs and can be considered marks of realism.
Karunaratne and Bostrom in 2006 observed that existing methods for learning from structured data are limited with respect to handling large isolated substructures and also impose constraints on search depth and induced structure length; they proposed an approach to learning from structured data using a graph-based canonical representation method called fingerprinting. Krasky, Rohwer, Schroeder and Selzen in 2006 discussed a combined bioinformatics and chemo-informatics approach for the development of new antiparasitic drugs. Meinl, Worlein, Urzova, Fischer and Philippsen in 2006 addressed parallel molecular mining, which is used in chemo-informatics to find common molecular fragments in a database of molecules represented as two-dimensional graphs. In the ParMol package they implemented four of the most popular frequent subgraph miners using a common infrastructure: MoFa, gSpan, FFSM and Gaston. They also added extra functionality to some of the algorithms, such as parallel search, mining directed graphs and mining in one big graph instead of a graph database. Meinl, Worlein, Fischer and Philippsen in 2006 presented thread-based parallel versions of MoFa and gSpan that achieve speedups of up to 11 on a shared-memory SMP system using 12 processors. They discussed the design space of the parallelization, the results, and the obstacles caused by the irregular search space and by the then-current state of Java technology. Tsuda and Kudo in 2006 proposed a graph clustering method that selects informative patterns at the same time. Their task is analogous to feature selection for vectors; the difference, however, is that the features are not explicitly listed. The method is fully probabilistic, adopting a binomial mixture model defined on a very high dimensional vector indicating the presence or absence of all possible patterns. Wegner, Frohlich, Mielenz and Zell in 2006 presented a classification method based on a coordinate-free chemical space; it therefore does not depend on the descriptor values commonly used in coordinate-based chemical space methods.
Merkwirth and Ogorzalek in 2007 described a method for constructing specific types of neural networks whose structure is directly linked to the structure of the molecule under consideration. Each molecule can be represented by a unique connectivity graph, which can be programmed onto a cellular neural network. Such a composite network can successfully perform classification and regression on real-world chemical datasets.
Dong, Gilbert, Guha, Heiland, Kim, Pierce, Fox and Wild in 2007 developed an infrastructure of chemo-informatics web services that simplifies access to this information and to the computational techniques that can be applied to it. They described this infrastructure, gave some examples of its use, and discussed their plans to use it as a platform for chemo-informatics application development.
Rhodes, Boyer, Kreulex, Chen and Ordonez [76] in 2007 developed a system that allows users to explore the US patent corpus using molecular information. Their system contains three main technologies: a user may go to a web page, draw a molecule, search for related intellectual property (IP) and analyze the results. Bogdanov in 2008 studied graph searching, indexing, mining and modeling for bioinformatics, chemo-informatics and social networks. Fahim et al. in 2008 proposed a method based on shifting the center of the large cluster toward the small cluster and recomputing the membership of the small cluster's points; the experimental results reveal that the proposed algorithm produces satisfactory results. Godeck and Lewis in 2008 stated that QSAR models can play a vital role in both the opening phase and the endgame of lead optimization. In the opening phase there is often a large quantity of data from high-throughput screening (HTS), and potential leads need to be selected from several distinct chemotypes.
Guha and Schurer in 2008 investigated various aspects of developing computational models to predict cell toxicity based on cell proliferation screening data generated in the MLSCN. By capturing feature-based information in that dataset, such predictive models would be useful in evaluating cell-based screening results in general and could serve as an aid to identify and eliminate potentially undesired compounds.
Hubler, Kriegel, Borgwardt and Ghahramani in 2008 presented a Metropolis algorithm for sampling a representative small subgraph from an original large graph, where 'representative' describes the requirement that the sample preserve crucial graph properties of the original graph. Karunaratne and Bostrom in 2008 presented a case study in chemo-informatics in which various types of background knowledge are encoded in graphs that are given as input to a graph learner. They showed that the type of background knowledge encoded indeed has an effect on predictive performance. Lam and Chan in 2008 studied graph mining algorithms, which are increasingly applied to biological graph datasets. They proposed a graph mining algorithm, MIGDAC (Mining Graph Data for Classification), that applies graph theory and an interestingness measure to discover interesting subgraphs which can be both characterized and easily distinguished from other classes.
Maji and Mehta in 2008 proposed a novel measure of similarity between labeled graphs which has applications to structured data analysis, for example chemical informatics and web document clustering. Their metric exploits vertex context similarity and computes an overall matching score, in time polynomial in the size of the graphs, using a network-flow formulation of the problem. Tsuda and Kurihara in 2008 proposed a nonparametric Bayesian method for clustering graphs and selecting salient patterns at the same time. Variational inference is adopted because sampling is not applicable due to the extremely high dimensionality. Schietget, Costa, Ramon and Raedt in 2009 proposed a direct, efficient and simple approach for the generation of interesting graph patterns: they computed maximum common subgraphs from randomly selected pairs of examples and used them directly as features.
Jiang, Coenen and Zito in 2010 examined a number of edge weighting schemes and suggested three strategies for controlling candidate set generation. Spjuth, Willighagen, Guha, Eklund and Wikberg in 2010 worked toward interoperable and reproducible QSAR analyses through the exchange of datasets. QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises the addition of chemical structures as well as the selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standard exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaboration and reuse of data.
Yang, Parthasarthy and Sadayappan in 2010 presented a novel approach to data representation for computing this kernel, particularly targeting sparse matrices representing power-law graphs. They showed that their representation scheme, coupled with a novel tiling algorithm, can yield significant benefits over the then state-of-the-art GPU and CPU efforts on a number of core data mining algorithms such as PageRank, HITS and Random Walk with Restart.
2.4 Survey and Techniques of Online Social Networks
We carried out a wide literature survey on online social networks and observed a surge of research interest, due to factors such as the popularity of online social networks (OSNs), their representation and analysis as graphs, the availability of large volumes of OSN log data, and commercial/marketing interests. OSNs also present interesting research challenges such as the search/matching of similar sub-graphs, community analysis/modeling, user classification and information propagation. Hence, OSN data analysis has great potential for researchers in a diversity of disciplines. However, we propose that OSN analysis should be placed in the context of its sociological origins and its basis in graph theory. Thus, we have devised a survey which firstly presents the key historical and base research ideas and the different associated themes, and secondly presents a selection of the latest research and tendencies taken from international conferences. Graph mining of online social networks is a relatively new area of research which nevertheless has a solid base in classic graph theory, computational cost considerations, and sociological concepts such as how individuals interrelate, group together and follow one another.
Different algorithms exist which perform higher-level operations on graphs, such as finding a node's degree, finding the connectivity between its neighbors (the clustering coefficient), finding a path between two nodes (using depth-first or breadth-first search), or finding the shortest path from one node to another. Refer to  for a general introduction to the different types of graph algorithms which are relevant to OSNs.
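As an illustration, the basic operations listed above can be sketched in plain Python over a toy adjacency-list graph; the graph itself is hypothetical example data, not one of the OSN datasets discussed in this chapter.

```python
from collections import deque

# Toy undirected graph as an adjacency list (hypothetical example data).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}

def degree(g, v):
    """Number of neighbours of v."""
    return len(g[v])

def clustering_coefficient(g, v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = list(g[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in g[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def shortest_path(g, src, dst):
    """Breadth-first search: shortest path by hop count."""
    prev, frontier = {src: None}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:                      # reconstruct path by backtracking
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for w in g[u]:
            if w not in prev:
                prev[w] = u
                frontier.append(w)
    return None                           # dst unreachable from src

print(degree(graph, "b"))                 # 3
print(clustering_coefficient(graph, "a")) # 1.0: neighbours b and c are linked
print(shortest_path(graph, "a", "e"))     # ['a', 'b', 'd', 'e']
```

The same operations scale to real OSN graphs, although for millions of nodes one would use a dedicated library or out-of-core representation rather than an in-memory dictionary.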
A social network is a social structure composed of a set of participants (individuals, organizations, ...) and the mutual ties between these participants. The vision of a social structure as a network facilitates the understanding and analysis of this structure, such as the identification of local and global characteristics, influential participants and the dynamics of networks. Social network analysis is an interdisciplinary endeavor including such fields as social psychology, sociology, statistics, and graph theory. Georg Simmel [91, 92] was one of the earliest to define structural theories in sociology, such as the dynamics of triads (subgroups involving three participants) and the development of individualism.
Commencing in the 1930s, Jacob L. Moreno developed a representation called 'sociograms' which facilitated the study of interpersonal relationships, choices and preferences within groups. A sociogram is a diagram of the structure and patterns of group interactions, which can be based on criteria such as social relations, channels of influence, lines of communication, and so on. During the 1950s a mathematical formalization was developed for social networks which became the basis for modern social and behavioral sciences, as well as for complex network analysis in general. McPherson combines a sociological focus on social network analysis with mathematical and graph theory approaches, considering the hypothesis that social network structures are influenced by the idea that similar persons are attracted to each other (the colloquial term "birds of a feather" is mentioned).
A research work which combines a sociological viewpoint with graph theory concepts of structures, graph generator models and their parameterization is that of Robins et al. The central theme of this wide-ranging paper is that the overall structure of a large graph is determined by its structure at a local level, specifically in terms of the proportion of a given number of predefined structures. The authors evaluate different graph model generators in terms of the frequency of occurrence of these structures in the whole generated graph.
An online social network can be generically understood to be a computer application which facilitates the creation or definition of social relations among people based on acquaintance, general interests, activities, professional interests, family and associative relations, and so on.
Some of the most popular worldwide OSNs are Facebook, Twitter, LinkedIn, Google+ and MySpace, with numbers of users ranging from 800 million (in 2011) for Facebook to 61 million for MySpace. Different countries also have their own applications which are most popular domestically. In the case of China, RenRen (the equivalent of Facebook) has approx. 160 million registered users, of which 31 million are considered active. Weibo (a social micro-blogging application similar to Twitter) is claimed to have 300 million registered users. Spain has Tuenti, Hi5 is popular in Central and South America, Orkut (India and Brazil), StudiVZ (Germany), and Skyrock (France), among others. Some applications are specific to photo sharing, such as Flickr or Picasa, or to videos and music, such as YouTube or Spotify. The dolphins dataset represents a social network of a community of 62 bottlenose dolphins; Lusseau et al. compiled the dolphin data from a 7-year field study in which ties between dolphin pairs were established by observation of statistically frequent associations. Girvan and Newman conducted empirical tests on different datasets, including "college football", in which the teams are the vertices and the edges are the games between respective teams; a "collaboration network" consisting of 271 vertices which correspond to scientists resident at the Santa Fe Institute over a two-year period; and a "food web" consisting of 33 vertices corresponding to an ecosystem's principal taxa (species), in which the edges represent trophic relationships (which taxa eat which).
The dataset cit-HepTh represents relations between citing and cited authors of scientific papers in the high energy physics field, comprising 27,770 nodes and 352,807 edges; a collaboration network of high energy physics papers, ca-HepTh, has 9,877 nodes and 51,971 edges. In the second type of graph dataset we have, for example, the Enron dataset, made up of a log of emails sent and received between employees of the Enron Corporation during a given time period. This is available online from the SNAP dataset collection, and consists of 36,692 nodes and 367,662 edges. Also, Seshadri et al. analyse a large dataset of mobile phone calls (one million users and ten million calls), examining the distributions of three key factors: phone calls per customer, duration time per customer, and number of distinct calling partners per customer.
The Flickr dataset links images sharing common metadata, with 105,938 nodes and 2,316,948 edges; finally, the Twitter dataset is made up of 476 million tweets collected between June and December 2009, representing 17,069,982 users and 476,553,560 tweets. All of these datasets are available at the SNAP website. We also note that some applications such as Twitter and LinkedIn have made APIs (Application Programming Interfaces) available for programmers who wish to perform their own 'scraping' of these OSNs, subject to certain data privacy restrictions.
However, graphs have specific properties, especially with respect to the way the data is represented and interrelated, which differentiate them from 'tabular' data and require specialized techniques. Cook and Holder define graph-based data mining as the task of finding novel, useful and understandable graph-theoretic patterns in a graph representation of data.
Any text can be represented as a network. At a basic level, the words or concepts are the nodes, and their relations are the edges of the network. Once a text is represented as a network, a wide range of tools from network and graph analysis can be used to perform quantitative analysis and categorization of textual data, detect communities of closely related concepts, identify the most influential concepts that produce meaning, and perform comparative analysis of several texts.
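A minimal sketch of this idea is a co-occurrence network: words become nodes, and an edge links two words that appear close together, weighted by how often that happens. The sliding-window size, stopword list and sample sentence below are our own illustrative assumptions, not a method taken from the surveyed literature.

```python
import re
from collections import Counter

def text_to_network(text, window=2,
                    stopwords=frozenset({"the", "a", "of", "and", "is"})):
    """Build an undirected co-occurrence network from raw text.

    Nodes are words (after lowercasing and stopword removal); an edge
    links two words appearing within `window` tokens of each other,
    and its weight counts how often that happens.
    """
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in stopwords]
    edges = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                edges[tuple(sorted((w, v)))] += 1   # undirected: sort the pair
    return set(tokens), edges

nodes, edges = text_to_network(
    "the cat sat on the mat and the cat slept on the mat")
print(sorted(nodes))           # ['cat', 'mat', 'on', 'sat', 'slept']
print(edges[("cat", "on")])    # 3: 'cat' and 'on' co-occur three times
```

The resulting node set and weighted edge list can then be fed into any of the graph analysis tools mentioned above (community detection, centrality measures, visualization).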
In chapter three we will introduce a method for text network analysis that allows one to visualize a readable graph from any text, identify its structural properties along with some quantitative metrics, detect central concepts present within the text, and, finally, identify the most influential pathways for the production of meaning within the text, which we call the pathways for meaning circulation. This method can have a variety of practical applications: from increasing the speed and quality of text cognition, to getting a better insight into the hidden agendas present within a text and better understanding of its narrative structure. The fields where it can be applied range from media monitoring to comparative literary analysis to creative writing.
The use of a diagrammatic approach to better understand text and analyze its narrative structure is not new, and there has been a good deal of research on the subject. For instance, Bruce et al. focused on uncovering social interaction structures from the semantic structures of texts using visual and graphic analysis. Lehnert focused on identifying so-called "plot units" within a text. These plot units were graphical representations of "affect states", which focused not on complex emotional reactions or states but rather on gross distinctions between "negative" and "positive" events and "mental events" with zero emotionality. Dyer established affect and goal-seeking behavior as a moving force behind narrative structures.
Later work, most notably that of Carley, focused on using map analysis to extract the main concepts from texts and the relations between them. In their research on the spatial analysis of text documents, Wise et al. propose spatial visualization techniques for various texts based on their similarity, reducing the workload of text analysis.
Popping advocated the use of schemes, and specifically knowledge graphs, to represent texts in order to gain a better understanding of textual data and to tackle the dynamic nature of knowledge. Other work by the same author gives a good overview of network text analysis techniques that use graphs to represent text visually and graph analysis tools to obtain quantitative data for later analysis.
Further research, for instance that of Ryan, proposes thinking of the diagram as a heuristic device which can represent a narrative in a spatial, temporal, and mental plane. There is also work by scientists involved in computer game creation, e.g. Loewe et al., where formal representations of narrative structures are used to detect similarities between different stories.
Thus, there is a long history of diagrammatically representing textual data as graphs and applying network text analysis in order to gain a better understanding of a text's meaning and structure.
Several of the approaches presented above focus on semantic relations between words when representing texts as networks. While this is definitely helpful in gaining a better understanding of text, such an approach also adds an extra layer of ontologies and complexity on top of the textual data. The decisions regarding which concepts are related are based on affective affinity (Lehnert, Dyer), causal relations (Bruce), chronological sequence (Loewe), and semantic analysis (Van Atteveldt). All these approaches introduce a strong subjective (and even cultural) bias into the structure of the resulting text graphs. When the basis for connectivity is an external system of rules and logic, the resulting structure will be the result of a negotiation between the text itself, the representational system used, and the external systems that define the basis for connectivity.
An interesting relation between "meaning" and text network analysis can be found in a paper on meaning as a sociological concept by Leydesdorff, who writes: "... as words in sentences (Hesse, 1980; Law & Lodge, 1984). The information is then positioned in a network with an emerging (and continuously reconstructed) structure. This positioning can be done by an individual who – as a system of reference – can provide personal meaning to the events; but meaning can also be provided at the supra-individual level, for example, in a discourse. In the latter case, meaning is discursive, and its dynamics can therefore be expected to be different from those of psychological meaning."
The difference in their approach is that they propose to avoid as much subjective and cultural influence as possible when translating the textual data into a text network and visualizing it as a graph. They use only the proximity of concepts and the density of their connections to encode the relations between words, not their meanings or affective relations. They then apply various graph visualization techniques, community detection mechanisms, and quantitative metrics in order to gain insight into the resulting structures without imposing any external semantic structures on top. The resulting visual representation of text is thus a translation of textual data into a graph in which they attempt to avoid any filtering, generalization and distortion that might result from interfering in the process with an external ontology. After the textual network is visually represented as a graph of interconnected concepts, it is finally open to interpretation by the observer.
Their approach can inform the existing methods of finding structure within text that employ graphical models and topic modeling: latent semantic analysis or LSA (Landauer et al. 1998), pLSA (Hofmann 1999), Pachinko allocation (Li & McCallum 2006), latent Dirichlet allocation or LDA (Blei et al. 2003), and relational topic models (Chang & Blei 2010). These methods are based on retrieving topics from text by identifying clusters of co-occurring words. This data can then be used to classify similar documents, improve text indexing and retrieval methods, and identify the evolution of certain topics over a period of time within a specific text corpus (Blei 2004). Combined with hyperlinking data between texts, it can also be used to find similar texts, find relations between different topics, or predict citations based on the presence of similar topics in a text corpus (Chang & Blei 2010).
The difference of their method is that it takes into account not only the probabilistic co-occurrence of words to identify topics, but also the structural properties of the text itself, using graph analysis methods along with qualitative and quantitative metrics. While it is yet to be seen how this approach compares to the methods outlined above, they believe that it will be useful for researchers in this field. Italo Calvino once said that reading is "a way of exercising the potentialities contained in the system of signs" (Calvino, 1987). What they want to achieve with their approach, therefore, is simply to propose a different way of reading a text by representing it as an image. And while the specific algorithm behind the image formation is described, it is the actual interpretation of the external observer that interests us the most, along with the possibility of folding the temporal aspect of a text onto a two-dimensional plane, thus giving a global overview of local chronological phenomena.
In Victor Shklovsky's words when talking about plot and story, "'Energy of delusion' means a search for truth in its multiplicity. This is the truth that must not be the only one; it must not be simple. Truth changes; it recreates itself through a repetitive pattern." The pathways for meaning circulation that we attempt to discover through our methodology are these repetitive patterns derived from the text's structure, using the connectivity of concepts and the intensity of interactions between them as the only criteria for their belonging together. It is then through interpretation and comparative analysis that their semantic relations can be established, but we want to leave this out of the picture to allow the text to speak for itself.
"Time prevents everything from being given at once" (Bergson, 2002). Visualizing text as a network they remove the variable of time and let the history of the text appear through the diagram.
Further, we study opinion mining of social media content, which has stirred much excitement and created abundant opportunities for understanding the opinions of the general public and consumers toward social events, political movements, company strategies, marketing campaigns, and product preferences. Many new and exciting social, geopolitical, and business-related research questions can be answered by analyzing the thousands, even millions, of comments and responses expressed in various blogs (the blogosphere), forums (such as Yahoo Forums), social media and social network sites (including YouTube, Facebook, and Flickr), virtual worlds (such as Second Life), and tweets (Twitter). Social media is thus a good medium for understanding public opinion about particular events, which can inform business decisions as well as feedback on political activities.
2.5 Survey on Opinion Mining
Numerous research papers have been written on the topic of opinion mining. There are two main approaches to document-level sentiment analysis: supervised learning and unsupervised learning. The supervised approach assumes that there is a finite set of classes into which the document should be classified and that training data is available for each class. The simplest case is when there are two classes: positive and negative. Simple extensions can also add a neutral class or use some discrete numeric scale into which the document should be placed (like the five-star system used by Amazon). Given the training data, the system learns a classification model using one of the common classification algorithms such as SVM, Naïve Bayes, logistic regression, or KNN. This model is then used to tag new documents with their sentiment classes. When a numeric value (in some finite range) is to be assigned to the document, regression can be used to predict it (for example, in the Amazon five-star ranking system). Research has shown that good accuracy is achieved even when each document is represented as a simple bag of words. More advanced representations utilize TF-IDF, POS (part of speech) information, sentiment lexicons, and parse structures.
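To make the supervised, bag-of-words setting concrete, the following is a minimal multinomial Naive Bayes classifier written from scratch with Laplace smoothing. The four training snippets and their labels are invented toy data, not a corpus used in the surveyed work.

```python
import math
from collections import Counter

# Tiny hypothetical training corpus: (document, class) pairs.
train = [
    ("great phone love the screen", "pos"),
    ("excellent battery great value", "pos"),
    ("terrible screen poor battery", "neg"),
    ("awful value hate the phone", "neg"),
]

def fit(docs):
    """Collect per-class word counts, class priors and the vocabulary."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    class_counts = Counter()
    vocab = set()
    for text, label in docs:
        words = text.split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Pick the class maximizing log P(class) + sum log P(word | class)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / total_docs)      # prior
        n_c = sum(word_counts[c].values())
        for w in text.split():
            if w in vocab:                               # ignore unseen words
                # Laplace (add-one) smoothed likelihood
                lp += math.log((word_counts[c][w] + 1) / (n_c + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

model = fit(train)
print(predict(model, "great battery"))      # pos
print(predict(model, "poor awful screen"))  # neg
```

Real systems would use a larger feature space (TF-IDF, n-grams) and an off-the-shelf learner, but the scoring logic is the same.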
Unsupervised approaches to document-level sentiment analysis are based on determining the semantic orientation (SO) of specific phrases within the document. If the average SO of these phrases is above some predefined threshold, the document is classified as positive; otherwise it is deemed negative. There are two main approaches to the selection of the phrases: a set of predefined POS patterns can be used to select them, or a lexicon of sentiment words and phrases can be used. A classic method to determine the SO of a given word or phrase is to calculate the difference between the PMI (pointwise mutual information) of the phrase with two sentiment words. PMI(P, W) measures the statistical dependence between the phrase P and the word W based on their co-occurrence in a given corpus or over the Web (by utilizing Web search queries). The two words used by Turney are 'excellent' and 'poor'; the SO measures whether P is closer in meaning to the positive word ('excellent') or the negative word ('poor'). A few researchers [135, 136] have used machine translation to perform document-level sentiment analysis in languages such as Chinese and Spanish that lack the vast linguistic resources available in English. Their method works by translating the documents into English and then performing sentiment analysis on them using an English sentiment analyzer.
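The SO computation can be sketched as follows, with SO(P) = PMI(P, 'excellent') − PMI(P, 'poor'). All hit and co-occurrence counts below are invented stand-ins for the corpus or web-search statistics a real system would obtain; the phrases 'superb' and 'dreadful' are our own examples.

```python
import math

# Hypothetical statistics standing in for corpus / web-search counts.
N = 1_000_000                 # total documents in the (imaginary) corpus
hits = {"superb": 2000, "dreadful": 1500, "excellent": 40000, "poor": 50000}
cooc = {                      # how often the two terms occur near each other
    ("superb", "excellent"): 600, ("superb", "poor"): 50,
    ("dreadful", "excellent"): 40, ("dreadful", "poor"): 500,
}

def pmi(phrase, word):
    """Pointwise mutual information, log base 2: log2(p(x,y) / (p(x) p(y)))."""
    p_xy = cooc[(phrase, word)] / N
    p_x, p_y = hits[phrase] / N, hits[word] / N
    return math.log2(p_xy / (p_x * p_y))

def semantic_orientation(phrase):
    """Turney-style SO: positive if closer to 'excellent', negative if closer to 'poor'."""
    return pmi(phrase, "excellent") - pmi(phrase, "poor")

print(round(semantic_orientation("superb"), 2))    # positive SO
print(round(semantic_orientation("dreadful"), 2))  # negative SO
```

Averaging such SO values over the extracted phrases of a document, and thresholding the average, yields the document-level polarity decision described above.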
A single document may contain multiple opinions, even about the same entities. When we want a more fine-grained view of the different opinions expressed in the document about the entities, we must move to the sentence level. We assume here that we know the identity of the entity discussed in the sentence, and further that there is a single opinion in each sentence. The latter assumption can be relaxed by splitting the sentence into phrases, where each phrase contains just one opinion. Before analyzing the polarity of the sentences we must determine whether they are subjective or objective; only subjective sentences are then analyzed further. Some approaches also analyze objective sentences, which is more difficult. Most methods use supervised approaches to classify sentences into the two classes. A bootstrapping approach was suggested by Hai in order to reduce the amount of manual labor needed when preparing a large training corpus. A unique approach based on minimum cuts was proposed by Pang and Lee; the main premise of their approach is that neighboring sentences should have the same subjectivity classification. Once we have zoned in on the subjective sentences, we can classify them into positive or negative classes. As mentioned earlier, most approaches to sentence-level sentiment analysis are based either on supervised learning or on unsupervised learning. The latter approach is similar in nature to that of Turney [36], except that it uses a modified log-likelihood ratio instead of PMI and the number of seed words used to find the SO of the words in the sentence is much larger. Recent research has shown that it is advisable to handle different types of sentences with different strategies. Sentences that need unique strategies include conditional sentences, question sentences and sarcastic sentences. Sarcasm is extremely difficult to detect and exists mainly in political contexts.
One solution for identifying sarcastic sentences is described in Tsur et al.
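The lexicon-based route for sentence polarity can be illustrated with a toy classifier. The word lists below are tiny placeholders for a real seed lexicon, and negation is handled by a crude one-token lookback; real systems use far larger seed sets and, as noted above, log-likelihood ratios computed over a corpus:

```java
import java.util.Set;

public class SentencePolarity {
    // Toy stand-ins for a real sentiment lexicon and a list of sentiment shifters.
    static final Set<String> POSITIVE = Set.of("good", "great", "excellent", "amazing");
    static final Set<String> NEGATIVE = Set.of("bad", "poor", "terrible", "awful");
    static final Set<String> NEGATORS = Set.of("not", "never", "no");

    // Sum lexicon hits over the tokens, flipping polarity when the
    // immediately preceding token is a negation word.
    static int score(String sentence) {
        String[] tokens = sentence.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
        int score = 0;
        for (int i = 0; i < tokens.length; i++) {
            int s = POSITIVE.contains(tokens[i]) ? 1 : NEGATIVE.contains(tokens[i]) ? -1 : 0;
            if (s != 0 && i > 0 && NEGATORS.contains(tokens[i - 1])) s = -s;
            score += s;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("The screen is great"));      // prints 1
        System.out.println(score("The battery is not good"));  // prints -1
    }
}
```

A subjectivity filter would normally run before this step, so that objective sentences (score 0 here by accident of the lexicon) are never scored at all.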
The two previous approaches work well when either the whole document or each individual sentence refers to a single entity. However, in many cases people talk about entities that have many aspects (attributes), and they have a different opinion about each of the aspects. This often happens in reviews of products or in discussion forums dedicated to specific product categories (such as cars, cameras, smartphones, and even pharmaceutical drugs). As an example, here is a review of the Kindle Fire taken from the Amazon website: "As a long-time Kindle fan I was eager to get my hands on a Fire. There are some great aspects; the device is quick and for the most part dead-simple to use. The screen is fantastic with good brightness and excellent color, and a very wide viewing angle. But there are some downsides too; the small bezel size makes holding it without inadvertent page-turns difficult, the lack of buttons makes controls harder, the accessible storage memory is limited to just 5GB." Classifying this review as simply positive or negative toward the Kindle would totally miss the valuable information encapsulated in it. The author provides feedback about many aspects of the Kindle (speed, ease of use, screen quality, bezel size, buttons, and storage memory size); some of these aspects are reviewed positively while others get a negative sentiment. Aspect-based sentiment analysis (also called feature-based sentiment analysis) is the research problem that focuses on recognizing all sentiment expressions within a given document and the aspects to which they refer. The classic approach to identifying all aspects in a corpus of product reviews, used by many commercial companies, is to extract all noun phrases (NPs) and then keep just the NPs whose frequency is above some experimentally determined threshold. One approach reduces the noise among the extracted NPs.
The main idea is to measure, for each candidate NP, the PMI with phrases that are tightly related to the product category (like phones, printers, or cameras); only those NPs whose PMI is above a learned threshold are retained. For instance, for the printer category such phrases would be "printer comes with" or "printer has." Another approach to aspect identification is to use a phrase dependency parser that utilizes known sentiment expressions to find additional aspects (even infrequent ones). The problem of aspect identification has also been viewed as an information extraction problem, using a tagged corpus to train a sequence classifier such as a Conditional Random Field (CRF) to find the aspects. So far we have discussed the identification of explicit aspects, that is, aspects that are mentioned explicitly in the sentences. However, there are many aspects that are not mentioned explicitly and can only be inferred from the sentiment expressions that mention them implicitly. These are called implicit aspects. Examples of such aspects are weight, which can be inferred from the fragment "this phone is too heavy," or size, which can be inferred from "the camera is quite compact." One way to extract such implicit aspects is suggested in Liu, where a two-phase co-occurrence association rule mining approach is used to match implicit aspects (sentiment expressions) with explicit aspects. With these two sets we can use a simple algorithm that determines the polarity of each sentiment expression based on a sentiment lexicon, sentiment shifters (such as negation words), and special handling of adversative conjunctions such as 'but.' The final polarity of each aspect is determined by a weighted average of the polarities of all sentiment expressions, inversely weighted by the distance between the aspect and the sentiment expression.
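The distance-weighted aggregation just described can be sketched directly. The token positions and sentiment scores below are illustrative; in a full system they would come from the NP extractor and the sentiment lexicon:

```java
public class AspectPolarity {
    // Weighted average of sentiment-expression scores, each weighted by
    // 1 / (token distance to the aspect), so closer expressions dominate.
    static double aspectScore(int aspectPos, int[] sentimentPos, double[] sentimentScore) {
        double num = 0, den = 0;
        for (int i = 0; i < sentimentPos.length; i++) {
            double w = 1.0 / Math.abs(sentimentPos[i] - aspectPos);
            num += w * sentimentScore[i];
            den += w;
        }
        return num / den;
    }

    public static void main(String[] args) {
        // Aspect "screen" at token 2; "fantastic" (+1) at token 4 and a
        // negative expression (-1) much farther away at token 9.
        double score = aspectScore(2, new int[]{4, 9}, new double[]{+1, -1});
        System.out.println(score > 0 ? "positive" : "negative"); // nearer expression wins
    }
}
```

Sentiment shifters and adversative conjunctions would be applied to each expression's score before this aggregation step, as described above.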
In many cases users do not provide a direct opinion about one product but instead offer comparative opinions, as in these sentences taken from the user forums of Edmunds.com: "300 C Touring looks so much better than the Magnum," "I drove the Honda Civic, it does not handle better than the TSX, not even close." The goal of a sentiment analysis system in this setting is to identify the sentences that contain comparative opinions and to determine the preferred entity in each of them. One of the pioneering papers on comparative sentiment analysis is Jindal and Liu, who found that a relatively small number of words covers 98% of all comparative opinions. Since these words lead to very high recall but low precision, a Naïve Bayes classifier was used to filter out sentences that do not contain comparative opinions. The classifier used sequential patterns as features; the sequential patterns were discovered by the class sequential rule (CSR) mining algorithm. A simple algorithm to identify the preferred entities, based on the type of comparative used and the presence of negation, is described in Ding et al.
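The high-recall keyword step can be illustrated with a toy candidate filter. The keyword list below is a small invented sample, not the actual list from Jindal and Liu, and the Naïve Bayes filtering stage that removes false positives is omitted:

```java
import java.util.Set;
import java.util.regex.Pattern;

public class ComparativeDetector {
    // A handful of illustrative comparative keywords; the real list is larger.
    static final Set<String> KEYWORDS = Set.of("than", "superior", "inferior", "prefer", "beat");
    // Morphological cue: comparative/superlative suffixes (deliberately noisy).
    static final Pattern COMPARATIVE_SUFFIX = Pattern.compile("\\w+(er|est)\\b");

    // High-recall candidate test; a trained classifier would filter the output.
    static boolean isCandidate(String sentence) {
        String lower = sentence.toLowerCase();
        for (String k : KEYWORDS) if (lower.contains(k)) return true;
        return COMPARATIVE_SUFFIX.matcher(lower).find();
    }

    public static void main(String[] args) {
        System.out.println(isCandidate("The 300C looks so much better than the Magnum")); // true
        System.out.println(isCandidate("The screen is fantastic"));                        // false
    }
}
```

The suffix rule will happily flag non-comparative words like "never" or "user", which is exactly why the second, precision-oriented classification stage is needed.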
As we have seen in the previous discussion, the sentiment lexicon is the most crucial resource for most sentiment analysis algorithms. We briefly mention here a few approaches for acquiring such a lexicon. There are three options: manual approaches, in which people code the lexicon by hand; dictionary-based approaches, in which a set of seed words is expanded by utilizing resources like WordNet; and corpus-based approaches, in which a set of seed words is expanded by using a large corpus of documents from a single domain. Clearly, the manual approach is in general not feasible, as each domain requires its own lexicon and such a laborious effort is prohibitive, so we focus on the other two approaches. The dictionary-based approach starts with a small set of seed sentiment words suitable for the domain at hand; this set is then expanded by using WordNet's synonyms and antonyms. One elegant algorithm is proposed in Kamps et al. The method defines the distance d(t1, t2) between terms t1 and t2 as the length of the shortest path between t1 and t2 in WordNet. The orientation of a term t is then defined as SO(t) = (d(t, bad) − d(t, good)) / d(good, bad), where |SO(t)| is the strength of the sentiment of t; SO(t) > 0 means t is positive, and t is negative otherwise. The main disadvantage of any dictionary-based algorithm is that the acquired lexicon is domain independent and hence does not capture the peculiarities of any specific domain. More advanced dictionary-based approaches are reported in Dragut et al. and Peng and Park. To create a domain-specific sentiment lexicon, one must turn to one of the many corpus-based algorithms. A classic work in this area by Hatzivassiloglou introduced the concept of sentiment consistency, which enables one to identify additional adjectives whose polarity is consistent with a set of seed adjectives.
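The Kamps et al. formula is easy to sketch over any synonym graph. The tiny hand-coded graph below is a stand-in for WordNet's synonymy links, with shortest paths computed by breadth-first search:

```java
import java.util.*;

public class KampsSO {
    // Toy undirected synonym graph standing in for WordNet.
    static final Map<String, List<String>> GRAPH = Map.of(
        "good", List.of("decent", "solid"),
        "decent", List.of("good", "mediocre"),
        "solid", List.of("good"),
        "mediocre", List.of("decent", "bad"),
        "bad", List.of("mediocre"));

    // Shortest-path length d(t1, t2) via breadth-first search.
    static int distance(String t1, String t2) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(t1, 0);
        queue.add(t1);
        while (!queue.isEmpty()) {
            String u = queue.poll();
            if (u.equals(t2)) return dist.get(u);
            for (String v : GRAPH.getOrDefault(u, List.of()))
                if (!dist.containsKey(v)) { dist.put(v, dist.get(u) + 1); queue.add(v); }
        }
        return -1; // unreachable
    }

    // Kamps et al.: SO(t) = (d(t, bad) - d(t, good)) / d(good, bad)
    static double so(String t) {
        return (distance(t, "bad") - distance(t, "good")) / (double) distance("good", "bad");
    }

    public static void main(String[] args) {
        System.out.println(so("solid"));    // > 0: closer to 'good'
        System.out.println(so("mediocre")); // < 0: closer to 'bad'
    }
}
```

Against real WordNet the same BFS would run over synset links, which is exactly why the resulting lexicon is domain independent: the graph never changes with the target domain.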
A set of linguistic connectors (AND, OR, NEITHER-NOR, EITHER-OR) was used to find adjectives that are connected to adjectives with known polarity. Consider the sentence "the phone is both powerful and light." If we know that 'powerful' is a positive word, then by exploiting the connector AND we can assume that the word 'light' is positive as well. In order to eliminate noise, the algorithm builds a graph of adjectives using the connections induced by the corpus and, after a clustering step, positive and negative clusters are formed. An approach called double propagation, for the simultaneous acquisition of a domain-specific sentiment lexicon and a set of aspects, was introduced in Qiu et al. This approach uses the Lin parser to parse the sentences in the corpus and find associated aspects and sentiment expressions.
The algorithm starts with a seed set of sentiment expressions and uses a set of predefined dependency rules and the Lin parser to find aspects that are connected to the sentiment expressions. It then uses those aspects to find more sentiment expressions, which in turn find more aspects. This mutual bootstrapping stops when no more aspects or sentiment expressions can be added. For example, in "Kindle Fire has an amazing display," the adjective 'amazing' modifies the noun 'display,' so given that 'amazing' is a sentiment expression and we have the rule "a noun that is modified by a sentiment expression is an aspect," we can extract 'display' as an aspect. Conversely, if we know 'display' is an aspect, then using a similar rule we can infer that 'amazing' is a sentiment expression. The algorithm uses several additional constraints to reduce the effect of noise. Migrating a sentiment lexicon from one domain to another was studied in Du et al. An algorithm for acquiring a slightly different type of lexicon, called a connotation lexicon, is reported in Feng et al. A connotation lexicon contains words that express sentiment either explicitly or implicitly; for instance, award and promotion have positive connotations, while cancer and war have negative connotations.
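The mutual bootstrapping loop can be sketched on toy parser output. The (sentiment word, modified noun) pairs below are hypothetical stand-ins for dependency-parse results, and only the single "modifies" rule from the example above is applied, without the noise-reduction constraints:

```java
import java.util.*;

public class DoublePropagation {
    // Hypothetical (sentiment expression, modified noun) pairs from a parser.
    static final List<String[]> MODIFIES = List.of(
        new String[]{"amazing", "display"},
        new String[]{"bright", "display"},
        new String[]{"bright", "flash"},
        new String[]{"loud", "speaker"});

    // Known sentiments reveal aspects; known aspects reveal sentiments;
    // repeat until neither set grows.
    static Map<String, Set<String>> propagate(List<String[]> modifies, Set<String> seed) {
        Set<String> sentiments = new HashSet<>(seed);
        Set<String> aspects = new HashSet<>();
        boolean grew = true;
        while (grew) {
            grew = false;
            for (String[] p : modifies) {
                if (sentiments.contains(p[0]) && aspects.add(p[1])) grew = true;
                if (aspects.contains(p[1]) && sentiments.add(p[0])) grew = true;
            }
        }
        return Map.of("aspects", aspects, "sentiments", sentiments);
    }

    // Deterministic summary for demonstration (TreeSet sorts the members).
    static String run() {
        Map<String, Set<String>> r = propagate(MODIFIES, Set.of("amazing"));
        return new TreeSet<>(r.get("aspects")) + " | " + new TreeSet<>(r.get("sentiments"));
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "[display, flash] | [amazing, bright]"
    }
}
```

Note that 'loud' and 'speaker' are never labeled: the bootstrapping only reaches pairs connected, directly or transitively, to the seed, which is why the choice of seed expressions matters.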
- Dr Bapurao Bandgar (Author), 2019, Social Networks and their Opinion Mining, Munich, GRIN Verlag, https://www.grin.com/document/512868