Every natural language contains a large number of words, many of which have different senses in different contexts; in this work, such words with multiple senses are called sense tagged words. A word's sense reflects its basic concept, and a word with several meanings makes a sentence ambiguous. The process of deciding which of a word's several meanings is the correct one in a given sentence is known as Word Sense Disambiguation.

Human beings are good at understanding the meaning of a word by reading the sentence, but the same task is difficult for a machine. Machines follow explicit rules easily, yet it is hard to write rules that reliably disambiguate a word in context. The task is further complicated because every natural language has its own rules and relations, such as grammatical rules, parts of speech, antonymy, and synonymy.

Therefore, a machine is trained with a learning algorithm so that it can tag each word with its correct sense. Once the correct sense is determined, it helps in retrieving the basic concept of the word, which is otherwise a very difficult task for a machine.

In this work, the K-Nearest Neighbor (KNN) approach, a supervised learning method, is used to disambiguate sense tagged words. The technique is evaluated on Hindi sense tagged words obtained from the Hindi Wordnet. The results show the effectiveness of the proposed technique for sense tagged words.

Acknowledgement

List of Abbreviations

List of Figures

List of Tables

Abstract

Chapter 1 INTRODUCTION
1.1 History
1.2 Word Sense Disambiguation
1.2.1 Sense Tagged Words
1.3 Learning approaches for sense tagged words
1.3.1 Knowledge Based Approach
1.3.2 Machine Learning Approach
1.4 Organization of book

Chapter 2 SENSE TAGGED WORDS
2.1 Various classification techniques used for sense tagged words
2.1.1 Naive Bayes
2.1.2 Decision tree
2.1.3 Decision List
2.1.4 Neural Network
2.1.5 Lesk Approach
2.1.6 Rocchio Classification
2.1.7 K-Nearest Neighbor
2.1.8 Ensemble Method
2.2 Knowledge Source of sense tagged words
2.2.1 Wordnet
2.2.2 SemCor
2.3 Representation of context

Chapter 3 LITERATURE REVIEW

Chapter 4 PRESENT WORK
4.1 Problem statement
4.2 Objectives
4.3 Design and Implementation
4.3.1 Methodology
4.3.2 Creating a dataset for Hindi sense tagged words
4.3.3 K-NN approach for Text categorization
4.3.4 K-NN approach for Sense Tagged Words
4.3.5 Working of algorithm

Chapter 5 RESULTS AND DISCUSSIONS
5.1 Text Categorization using K-NN
5.2 K-NN approach for sense tagged words
5.3 Comparison between Text Categorization and Sense tagged words

Chapter 6 CONCLUSION AND FUTURE SCOPE
6.1 Conclusion
6.2 Future Scope

REFERENCES

ACKNOWLEDGEMENT

The greatest praise can only rest on God, who, like a single lamp, never failed to light the dark rooms of our minds and sparked in us the sheer pleasure of formulating ground-breaking ideas. His blessings have never failed to enthuse our hearts and minds to perform miracles in our research work. No encomium can elucidate the nature, leadership qualities, grit and determination. He has always been an inspirational torchbearer and has always encouraged us to chase our dreams with concrete plans. I am also thankful to Professor Arvinder Kaur (Co-author) as a source of constant inspiration. We are also grateful and highly indebted to our research scholar Akanksha Sambyal, who has implemented this research on “A Machine Learning approach to Sense Tagged Words using K-Nearest Neighbor”. Her support and lively discussions have enabled us to correct many mistakes during this research study. Indeed, I would be at a great loss without her sage advice. Last but not least, I acknowledge my indebtedness to my loving parents and friends for their constant cooperation and help in completing this work.

LIST OF ABBREVIATIONS

[Not included in this excerpt]

LIST OF FIGURES

1.1 Different Approaches for unsupervised learning

2.1 An example of naive bayes
2.2 An example of decision tree
2.3 An example of neural network
2.4 Instances in K-NN
2.5 Wordnet Domain
2.6 Preprocessing of text
2.7 Parsing tree

4.1 The flow diagram of algorithm

5.1 An output of text categorization
5.2 Correct Sense of Hindi words

LIST OF TABLES

1.1 The narration of NLP from 1950’s to 2000

2.1 Feature set for decision list
2.2 Synset of Hindi sense tagged words

4.1 Hindi sense tagged words
4.2 Sentences of Hindi word ‘वचन’

5.1 Distance Values of categories
5.2 Precision value for text categorization
5.3 Precision Value

CHAPTER 1 INTRODUCTION

Natural language processing (NLP) is the analysis and generation, by computers, of the languages that humans speak, read and write. The goal of natural language analysis is to produce knowledge representation structures such as predicate calculus expressions, semantic graphs or frames. NLP is concerned with questions involving three dimensions: language, algorithm and problem ⁴. Natural language is the most common way for people to communicate with each other, but it is sometimes difficult to understand a language other than one's own. Machine translation is an application that helps people understand another language at low cost and in little time.

1. Introduction of Natural Language Processing

Natural Language Processing is a field of Artificial Intelligence (AI) concerned with making the semantic meaning of natural language available to computers. It facilitates human-computer interaction and builds models of how humans understand and produce natural language. AI includes some very hard problems, known as AI-complete problems; to tackle them, a machine is trained with a learning algorithm so that it can approach human performance. Machine learning is a technique for recognizing unknown samples on the basis of known samples; here it is used to train the machine on how humans understand and produce natural language.

In all natural languages, a single word may be associated with more than one distinct meaning. Each meaning is known as a sense of the word, and the phenomenon is called polysemy. Word Sense Disambiguation (WSD) is a technique for determining the correct sense of a word computationally and is considered an AI-complete problem. It identifies the correct sense of a word according to the context in which the word occurs. By supplying the right meaning of words, this technique helps reduce communication barriers between people.

1.1 History

NLP has evolved from one generation to the next, with rapid growth in its theories and methods for machine translation. These methods are beneficial for scientific and social purposes and are deployed in various new language technologies. In the early development of machine translation in NLP, Shannon, who is called the father of information theory, explored the first probabilistic model of language in 1950.

In the 1950s, Chomsky developed formal models for finite-state and context-free grammars, and the first Optical Character Recognition (OCR) system was built by Bledsoe and Browning using a Bayesian method.

By the 1960s, WSD had been formulated as a computational problem by Warren Weaver, who introduced the idea of using the surrounding context computationally. In the same period, Bar-Hillel modified this method and added a new feature for identifying all the parts of speech. Table 1.1 shows the development of NLP from the 1950s to 2000.

In 1967, the first electronic corpus was introduced: the Brown Corpus of 1 million words. A computational parser and a document identifier were also introduced in this period. Up to the 1980s, some early WSD techniques were developed, for example using the Oxford Advanced Learner's Dictionary (OALD). These replaced hand-coding, where data is produced by human effort, with automatic extraction of knowledge by knowledge-based or dictionary-based techniques.

In the 1990s, WSD advanced rapidly through computational linguistics and statistical parsing. In 1997, the first system for robust information extraction was introduced, which reduced the human workload in large organizations.

In the 2000s, supervised and unsupervised machine learning approaches were introduced. These techniques reached a plateau at which accurate and fast information is provided to the user. K-Nearest Neighbor (KNN), Logistic Regression, Naive Bayes, graph-based techniques and so on are used for WSD. Supervised and unsupervised systems are the most recent line of research and perform better than the earlier approaches.

Table 1.1: The narration of NLP from 1950’s to 2000

[Table not included in this excerpt]

1.2 Word Sense Disambiguation

WSD is used to disambiguate sense tagged words, that is, to automatically identify the sense of a polysemous word and tag the word with its correct sense according to the context. In Hindi, there are several words with multiple meanings, such as फल, वर्ग, विधि and so on. Therefore, to identify their accurate meaning in a sentence, these words need to be tagged with their correct sense in the context. Consider an example:

- Context 1: घर में रखी गयी पूजा के लिए राम तरह तरह के फल लेकर आया।
- Context 2: स्कूल में हुए परीक्षा का फल निकला।
- Context 3: भारत के सैनिकों को तीर चलाने का अभ्यास करने के लिए नुकीले फल वाले तीर मँगाए गए।

In the three contexts above, the ambiguous target word is ‘फल’. In the first context ‘फल’ denotes ‘fruit’, in the second ‘result’, and in the third ‘sharp thing’. In such cases the word should be tagged with its correct sense.

1.2.1 Sense Tagged Words

Sense tagging of words is one of the most difficult of the open problems in NLP: problems that can be precisely defined but have not yet been solved. NLP applications range from querying archives and accessing text collections to report generation and machine translation, all of which address real-world problems. In these applications, WSD plays a significant role, as explained below:

- Text Categorization - This is the process of retrieving the information relevant to a query from a collection of data. WSD supports automatic text categorization systems, which reduce the workload of large organizations where large amounts of data are generated on a regular basis. It improves the indexing of different queries and adapts the representation and storage of data to achieve accurate results.
- Machine Translation - Machine translators transform text from one language into another. Here WSD plays a significant role in converting the source language to the target language by analyzing the morphological, syntactic and semantic properties of the words of the source language. It provides the conceptual meaning of the words needed to translate from one language to the other.
- Speech Processing - This application distinguishes the senses of words that are pronounced or spelled alike. WSD supports automatic speech recognition systems by helping to process homophones so that the correct word is recognized.
- Question Answering - Question answering is a domain of information retrieval and NLP that automatically generates answers to the queries posed by a user.
- Computer Advertising - Computer advertising draws on different applications of NLP such as information retrieval, machine learning and so on. WSD helps in finding accurate information and the best match for a query.

1.3 Learning Approaches for Sense Tagged Words

The task of determining the meaning of text is handled by various applications of WSD, which are designed to capture the exact meaning of the text. For example, applications such as information retrieval, text classification, and question answering are used to find the accurate meaning and category of a text. To obtain the right meaning and categorize the text, a WSD system draws on significant knowledge about the world and the domain of discourse. The learning schemes for sense tagged words are as follows:

(i) Knowledge Based Approach
(ii) Machine Learning Approach

Both the knowledge-based and the machine learning approaches acquire knowledge in order to solve well-known problems in WSD, such as identifying the sense and removing ambiguity.

1.3.1 Knowledge Based Approach

The knowledge-based approach is used for domain-specific problems, where all available knowledge about the domain is applied to solve specific NLP problems. It draws on the semantic and grammatical knowledge of the natural language and is implemented in various NLP applications. It is also known as a domain-specific method because it needs rich knowledge about the particular problem. The approach detects the problem by using an overriding method that creates a bridge between different domains, and it is useful for forming rules that capture the knowledge needed to solve problems such as resolving ambiguity. The following examples show the types of ambiguity that the knowledge-based approach resolves:

(1) वह सबके लिए फल लेकर आया ज्यादा।

(2) स्कूल में हुए परीक्षा का फल निकला।

Sentence (1) contains a syntactic ambiguity, which is resolved using grammatical knowledge: the error is identified and removed with the help of a lexical resource, which also supplies the parser with lexical knowledge. Sentence (2) contains a semantic ambiguity: the ambiguous word is ‘फल’, which can denote ‘result’ or ‘fruit’. Here the ambiguity is resolved by the knowledge-based approach, which identifies the correct meaning of the word ‘फल’ in the sentence.

In the knowledge-based approach, both labeled and unlabeled corpora are provided to the system. A labeled corpus consists of informative, meaningful data such as sentences, paragraphs and so on. An unlabeled corpus, such as recorded information or images, is not organized in a systematic or meaningful way. A labeled corpus is provided to the system directly, whereas an unlabeled corpus is first converted into labeled, meaningful information and only then provided to the system.

1.3.2 Machine Learning Approach

The machine learning approach brought automation to NLP: it enables a system to perform automatically the task for which it has been trained. The aim is to drive the system so that it can perform the task better than a human. For many AI systems, this method gives the system enough knowledge to begin the task efficiently and then leaves the system to finish the rest by itself. The different machine learning strategies are discussed below.

(i) Rote Learning
(ii) Supervised Learning
(iii) Unsupervised Learning
(iv) Semi-Supervised Learning

The machine learning approach offers potential improvements in solving various natural language problems. It uses a top-down model to improve understanding of the problem domain and a bottom-up model to guide the data in the training set. The following strategies, in increasing order, can be used to classify sense tagged words.

(i) Rote Learning

In rote learning, knowledge is implanted directly into the machine, which also keeps track of past events so that the system can respond more accurately to queries.

Rote learning uses repetition to recall data quickly when it is required again during processing. It is usually contrasted with meaningful learning, associative learning, and active learning. Rote learning methods are mainly used for memorizing previous records; for example, a telephone agency can use this method to keep track of all telephone numbers.

Rote learning methods are useful in medicine for remembering the ingredients used in making medicines and for keeping records of all patients. The method is also used in schools, for preparing for exams and keeping records of students, and it is applicable in scientific fields where the same data is used frequently, such as the periodic table in chemistry or the basic formulas of organic chemistry.

Rote learning is generally used to acquire information in a short time, for example when learning another language such as French or learning some useful words of a language. In WSD this method is used for learning verbs, morphological forms of words and so on. It memorizes the basic concepts related to a word.

(ii) Supervised Learning

In supervised learning, a training dataset is provided to the system in advance, which helps it assign new data to the correct classes. The training set contains inputs together with their corresponding outputs, and the learner models the relation between input and output values; a neural network is one example. Supervised learning has two phases: a training phase and a testing phase.

In the testing phase, the appropriate sense of a word is determined from the surrounding words in the sentence. Supervised learning generally gives better results than the other methods. Different supervised methods, such as probabilistic methods and similarity-based methods, are used for the various problems in WSD. All of these methods have definite rules and classify the data linearly.

Probabilistic methods estimate a set of parameters, such as the joint and conditional distributions of senses and contexts. These parameters are then used to assign a test sample to the category that maximizes the conditional probability.

The similarity-based method compares the features of a test sample with those of the training samples and assigns to the new instance the sense of the most similar sample. It relies on discrimination rules associated with each sense of the word and uses a classification technique to classify the new sample. The same rules are used to create the feature set of the new sample, and the prediction step assigns the most acceptable sense to the word.
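The similarity-based idea can be illustrated with a minimal k-nearest-neighbor sketch. It is not the implementation evaluated in this book: the bag-of-words features, the cosine similarity, the stop-word list and the toy English training set for the word "bank" are all assumptions chosen only to keep the example short.

```python
from collections import Counter
import math

# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "on", "in", "a", "he", "she", "was", "and"}

def bag_of_words(sentence):
    """Lower-cased, whitespace-tokenised word counts without stop words."""
    return Counter(w for w in sentence.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_sense(test_sentence, training, k=3):
    """Return the majority sense among the k most similar training contexts."""
    test_vec = bag_of_words(test_sentence)
    ranked = sorted(
        training,
        key=lambda example: cosine(test_vec, bag_of_words(example[0])),
        reverse=True,
    )
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical sense-tagged contexts for the ambiguous word "bank".
TRAINING = [
    ("he sat on the bank of the river", "river_bank"),
    ("the river bank was muddy after the rain", "river_bank"),
    ("fish swim near the bank of the river", "river_bank"),
    ("she deposited money in the bank", "financial_bank"),
    ("the bank approved the loan", "financial_bank"),
    ("the bank charges interest on the loan", "financial_bank"),
]

print(knn_sense("the fisherman walked along the bank of the river", TRAINING))
```

The choice of k and of the similarity measure strongly affects the result; with so few examples, k = 3 is only a convenient odd number that avoids two-way ties.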

(iii) Unsupervised Learning

In unsupervised learning, the system is provided with a training set that consists of input data only and is trained to produce the output itself. A Self-Organizing Map is an example of such a system. This learning method acquires knowledge for classification and assumes that similar words form a cluster. It assigns the target word to the neighborhood of similar words to which it belongs. For word disambiguation, it computes a similarity measure among the different senses in the sentence.

Unsupervised learning applies cluster-based techniques to identify the correct sense of the target word by forming clusters of contexts and target words. It addresses the knowledge acquisition problem, that is, obtaining the knowledge related to the target word. Unsupervised learning has lower performance than the other learning techniques, such as supervised, rote and semi-supervised learning. It also uses word clustering, a property specific to unsupervised learning, in which clustering is done on a single feature, such as noun-subject or adjective-adverb. The method computes the centroid vector and selects the words that occur in the same cluster. The unsupervised approaches are explained below:

(a) Discriminative Approach: This approach uses contextual features to identify the correct sense. It stores the other words present in the context, known as the contextual features of the context, while the marked (target) words are stored in a separate dataset.
(b) Translational Equivalence Approach: This approach makes use of parallel corpora for disambiguation. It compares the target word with the words present in the corpora and then selects the correct meaning of the word.
(c) Type Based Approach: In this approach, the total number of instances of the target word is calculated. The approach clusters the instances and relates them to their neighboring instances to remove the ambiguity present in the context.
(d) Token Based Approach: This approach disambiguates the context using clustering. The contexts of a target word are clustered, and then the correct sense of the target word is determined.

[Figure not included in this excerpt]

Figure 1.1: Different Approaches for Unsupervised Learning.

The different approaches to unsupervised learning are shown in Figure 1.1. The graph-based method is also unsupervised: a graph is built according to grammatical relationships, and a computed weight is assigned to each edge of the graph. The sense at the node with the highest degree is selected as the correct sense of the target word.
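The degree-based selection step of the graph-based method can be sketched as follows. This is a simplified illustration: the node names, the edge weights and the use of weighted degree as the ranking score are assumptions, and building the graph from grammatical relationships is not shown.

```python
from collections import defaultdict

def weighted_degree(edges):
    """Sum the weights of all edges incident to each node."""
    degree = defaultdict(float)
    for node_a, node_b, weight in edges:
        degree[node_a] += weight
        degree[node_b] += weight
    return degree

def best_sense(candidate_senses, edges):
    """Pick the candidate sense node with the highest weighted degree."""
    degree = weighted_degree(edges)
    return max(candidate_senses, key=lambda sense: degree[sense])

# Hypothetical sense graph for "bank" in a sentence about a fisherman and a river.
EDGES = [
    ("bank/river", "river/stream", 0.8),
    ("bank/river", "fisherman/person", 0.5),
    ("bank/finance", "river/stream", 0.1),
]

print(best_sense(["bank/river", "bank/finance"], EDGES))
```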

(iv) Semi-supervised Learning

Semi-supervised methods do not rely on a fully labeled training set. They work with both labeled and unlabeled data, and the unlabeled data is used in much larger quantity than the labeled data. The bootstrapping algorithm is the most widely used semi-supervised learning method for WSD. It iteratively converts unlabeled data into labeled data: in each iteration, the model learned from the labeled data of the previous iteration is used to label further data. Graph-based semi-supervised learning algorithms have been introduced more recently; they exploit the cluster structure in the data to combine unlabeled data with labeled data in the learning process.

The Label Propagation algorithm is a graph-based semi-supervised learning method and a well-known algorithm for sense tagged words. Semi-supervised learning has become a popular method among machine learning techniques. It uses unlabeled data and requires only a small amount of labeled data for classification, because labeling data is a very expensive task in Natural Language Processing. Unlabeled data is widely available in the field of Word Sense Disambiguation, and therefore this learning approach is often used for removing the ambiguity present in a sentence.

1.4 Organization of book

The organization of the book is as follows:

Chapter 1 introduces natural language processing and its various applications.

Chapter 2 describes the sense tagged words and the various classification techniques used for disambiguating the sense tagged word.

Chapter 3 reviews the learning approaches and classification methods for sense tagged words available in the literature.

Chapter 4 discusses the proposed work and experimental results of sense tagged words using Hindi Wordnet. It also compares the results of different applications to check the feasibility of the proposed approach.

Chapter 5 presents the results and discussion, after evaluating the performance of the proposed work.

Chapter 6 presents the conclusion and the future scope to which the proposed work can be extended. It is followed by the list of references and publications.

CHAPTER 2 SENSE TAGGED WORDS

In the field of computational linguistics, the problem is generally called Word Sense Disambiguation, and is defined as the problem of computationally determining which "sense" of a word is activated by the use of the word in a particular context. Word Sense Disambiguation is essentially a task of classification where word senses are the classes, the context provides the evidence, and each occurrence of a word is assigned to one or more of its possible classes based on the evidence.

2. Sense Tagged Words

The denotation of a word signifies its basic concept. Words can have several meanings in different sentences, and such words with multiple meanings are known as sense tagged words. A sense tagged word carries lexical knowledge about the word and describes it in every part of speech, such as noun, verb and adverb. Every language is a collection of sense tagged words, and the semantic meaning of these words is defined according to the grammatical rules of the language.

2.1 Various Classification Techniques used for Sense Tagged Words

There are various techniques that are used for the classification of sense tagged words to understand their accurate meaning according to the sentence. Some of the techniques are discussed below:

2.1.1 Naive Bayes

Naive Bayes is a simple classifier based on the application of Bayes' theorem ¹⁰. It calculates the conditional probability of each class given the features observed. Naive Bayes makes an independence assumption: each feature of a class is assumed to contribute independently to the probability. During classification, a feature selection method is applied to detect the features of a class, and a particular feature of a class is assumed not to depend on the presence of the other features of the class. Consider an example:

- The fisherman jumped off the bank and into the river.
- The bank down the street was robbed.
- Back in the day, we had an entire bank of computers devoted to this problem.

[Figure not included in this excerpt]

Figure 2.1 An example of Naive Bayes¹⁰.

Figure 2.1 shows a simple example of a Naive Bayesian network. In this example the conditional probability of each sense of the word ‘bank’ is calculated to obtain its correct meaning in the sentence ‘The fisherman jumped off the bank and into the river’, given the features {n-1 = the, n+1 = fisherman, top = bank, verb = jump, adverb = ‘ ’}. These features capture the grammatical context of the target word, which is compared with the training set. During classification, Naive Bayes calculates the conditional probability of each sense of the target word; here ‘Xi’ denotes the probability of each sense of the word according to the feature values.

It selects the sense with the highest probability score. Depending on the precise nature of the probability model, Naive Bayes classifiers can be trained very efficiently by supervised learning. They require only a small amount of training data to estimate the probabilities. When the features are highly correlated with each other, however, the independence assumption no longer holds, classification becomes more complex and accuracy becomes poor.
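The selection of the highest-scoring sense can be sketched in a few lines. This is a minimal illustration under stated assumptions: the context-word features, the add-one (Laplace) smoothing and the toy training examples are hypothetical and are not taken from Figure 2.1.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(examples):
    """Count senses and per-sense feature occurrences from (features, sense) pairs."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, sense in examples:
        sense_counts[sense] += 1
        for feature in features:
            feature_counts[sense][feature] += 1
            vocabulary.add(feature)
    return sense_counts, feature_counts, vocabulary

def classify(features, model):
    """Return the sense maximising log P(sense) + sum of log P(feature | sense)."""
    sense_counts, feature_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denominator = sum(feature_counts[sense].values()) + len(vocabulary)
        for feature in features:
            # Add-one smoothing (an assumption) keeps unseen features from
            # driving the probability of a sense to zero.
            score += math.log((feature_counts[sense][feature] + 1) / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical sense-tagged examples for the word "bank".
EXAMPLES = [
    (["fisherman", "river", "jumped"], "river_bank"),
    (["river", "water", "shore"], "river_bank"),
    (["street", "robbed", "money"], "financial_bank"),
    (["money", "loan", "account"], "financial_bank"),
]

model = train_naive_bayes(EXAMPLES)
print(classify(["river", "fisherman"], model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.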

2.1.2 Decision Trees

Decision trees represent the classification procedure in a structured way. The information is represented as a tree in which the root node contains the data to be tested and the leaf nodes represent the semantic information. During classification the tree tests the feature values of the data, and the different feature values are represented at the intermediate nodes. For example:

- Context 1: Ram saw a play in the theater.
- Context 2: Children play basketball in the ground.

An example of a decision tree for sense tagged words is shown in Figure 2.2. In both sentences the target word is ‘play’: in the first sentence it denotes a theatrical performance, and in the second it denotes taking part in a game (‘play basketball’).

[Figure not included in this excerpt]

Figure 2.2 An example of a decision tree.

The decision tree classifies the sentence and tags the target word ‘play’ with the correct sense according to the context; it can also handle unseen instances. It follows a yes-no path to determine the correct sense, and a leaf with an empty value indicates that no classification can be made for that combination of feature values. This classification technique is rarely used for sense tagged words. C4.5 and ID3 are popular decision tree algorithms for identifying the correct sense.

A sense-tagged corpus is used as the knowledge source for a decision tree. The classification rules are defined as ‘yes-no’ tests, and using these rules the training data set is recursively partitioned. The characteristics of a decision tree are:

- Each internal node represents a feature.
- Each edge represents feature value and each leaf node represents sense.

The feature vector here is the same as the feature vector of a decision list. In the testing phase, the word to be disambiguated, along with its feature vector, is passed down the tree until a leaf node is reached. The sense contained in that leaf node is taken as the sense of the word.
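The testing-phase traversal can be sketched as below. The tree is hand-built and hypothetical (the feature names and sense labels are assumptions); in practice it would be induced from a sense-tagged corpus by an algorithm such as ID3 or C4.5.

```python
# Hypothetical tree for the target word "play": internal nodes hold a feature,
# edges hold feature values, leaves hold a sense (None marks an empty leaf).
TREE = {
    "feature": "preceded_by_article",
    "branches": {
        "yes": "theatrical_performance",
        "no": {
            "feature": "followed_by_noun",
            "branches": {"yes": "take_part_in_game", "no": None},
        },
    },
}

def traverse(tree, features):
    """Follow the yes-no path for the given feature values down to a leaf."""
    node = tree
    while isinstance(node, dict):
        value = features.get(node["feature"])
        # A missing branch behaves like an empty leaf: no classification.
        node = node["branches"].get(value)
    return node

# "Ram saw a play in the theater."
print(traverse(TREE, {"preceded_by_article": "yes"}))
# "Children play basketball in the ground."
print(traverse(TREE, {"preceded_by_article": "no", "followed_by_noun": "yes"}))
```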

2.1.3 Decision List

A decision list is an ordered set of rules for recognizing the correct sense of a word in a sentence. A list of weighted rules is generated from a training set. Each rule consists of a feature value, a sense and a score, and the rules are ordered by decreasing score.

Table 2.1 Feature set for Decision List

[Table not included in this excerpt]

The technique counts the occurrences of words and represents the features in vector form, as shown in Table 2.1. These features are then compared with the decision list, and the sense of the highest-scoring rule that matches the input vector is selected as the correct sense. Decision lists were the most successful technique in the first Senseval evaluation.
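The matching step of a decision list can be sketched as follows. The rules and their scores are hypothetical values chosen for illustration; in a trained decision list the scores would be computed from the sense-tagged training data.

```python
# Hypothetical (feature, sense, score) rules for the word "bank".
RULES = [
    ("river", "river_bank", 4.2),
    ("loan", "financial_bank", 3.9),
    ("water", "river_bank", 2.7),
    ("money", "financial_bank", 2.1),
]

def decision_list_sense(context_words, rules, default=None):
    """Return the sense of the highest-scoring rule whose feature is in the context."""
    words = set(context_words)
    for feature, sense, _score in sorted(rules, key=lambda rule: rule[2], reverse=True):
        if feature in words:
            return sense
    # No rule matched; fall back to a default such as the most frequent sense.
    return default

print(decision_list_sense("he took a loan from the bank".split(), RULES))
```

Only the single best matching rule decides the outcome; lower-ranked rules are never combined with it, which is what distinguishes a decision list from a voting scheme.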

2.1.4 Neural Network

A neural network is an information processing model composed of a large number of highly interconnected processing elements (artificial neurons) that work together to solve specific problems. It consists of three layers: an input layer, a hidden layer and an output layer. The input layer receives and buffers the input signals, and the output layer generates the output of the network. The hidden layer is connected to both the input and the output layer; it performs the internal computation of the network and has no direct connection with the external environment.

An activation function is applied to obtain the output for a given input pattern. It is applied to the weighted sum of the values that the input nodes pass to each hidden node, and it determines the response that is passed on towards the output layer.

[Figure not included in this excerpt]

Figure 2.3 An example of a neural network¹⁰.

The neural network is a well-established technique for processing information and can also be used to determine the correct meaning of a word. Figure 2.3 illustrates an example of a neural network for sense tagged words. At classification time, the ambiguous word is represented at the input nodes, and its semantic meaning is obtained once the activation function has been applied to the weighted inputs. The network is given a training algorithm and is trained until the expected output for each input is achieved.
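
The flow of values from the input layer through the hidden layer to the output layer can be sketched as follows. This is a minimal forward pass only; the sigmoid activation, the function names and the list-of-lists weight layout are assumptions of the sketch, and training is not shown.

```python
import math
from typing import List

def sigmoid(x: float) -> float:
    """Logistic activation function, chosen here only as a common example."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Each node applies the activation function to the weighted sum of its inputs."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs: List[float],
            w_hidden: List[List[float]], b_hidden: List[float],
            w_out: List[List[float]], b_out: List[float]) -> List[float]:
    """Input layer -> hidden layer -> output layer (one output node per sense)."""
    hidden = layer(inputs, w_hidden, b_hidden)
    return layer(hidden, w_out, b_out)
```

In a WSD setting the input vector would encode the context features of the ambiguous word, and the output node with the largest value would indicate the chosen sense.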

2.1.5 Lesk Approach

The Lesk approach is a knowledge-based approach used to disambiguate words in context. It uses a direct overlap method to determine the correct sense of a word: it computes the overlap between the context and the glosses in the dictionary, and words that share a sense are considered neighbors because they share the same meaning. Many recent refinements of this approach have been introduced, such as removing stop words, modifying the context window size and using Wordnet to disambiguate a word.

The Lesk approach identifies the sense of a word by extracting its basic concept, such as noun or verb, from the glosses. It obtains the semantic meaning of the word from Wordnet and thereby removes the ambiguity of polysemous sense tagged words.

It is a widely used way of identifying the correct sense of a word with a knowledge-based method. In evaluation, the total number of instances of the target word is counted, that is, the number of times the target word is used in the test sentences. From this instance count, measures such as precision and recall are computed. Stemming has recently been introduced into the Lesk approach to capture the similarity between the words in the context and the words in the gloss definitions.

In this approach, an instance is defined as the total number of ambiguous words present in the context, and the size of the context window depends on the total number of instances. The context vector is compared with each sense definition of the target word in Wordnet, and the overlap is calculated. The sense with the maximum overlap is selected as the correct sense of the target word in the context. One drawback of the Lesk approach is that it cannot handle morphological variations.
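
The overlap computation can be illustrated with a simplified Lesk sketch. The stop word list, the function names and the sense labels are assumptions of this example; the two glosses are the English 'bark' definitions used later in this chapter, not entries taken from Hindi Wordnet.

```python
from typing import Dict, Set

# A tiny illustrative stop word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "by", "and", "to", "is", "that"}

def content_words(text: str) -> Set[str]:
    """Lower-case the text, split on whitespace and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def simplified_lesk(context: str, glosses: Dict[str, str]) -> str:
    """Return the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context)
    return max(glosses, key=lambda sense: len(ctx & content_words(glosses[sense])))

glosses = {
    "bark/tree": "tough protective covering of trees",
    "bark/sound": "sound made by a dog",
}
print(simplified_lesk("the dog made a loud sound at night", glosses))
# -> bark/sound (overlap: dog, made, sound)
```

No stemming is applied here, so 'trees' in a gloss would not match 'tree' in a context; this is the morphological limitation mentioned above.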

2.1.6 Rocchio Classification Approach

The Rocchio classification approach is used in applications such as text categorization and information retrieval, where it determines the category of a text. It is based on supervised learning and is a very fast learner. The technique constructs a prototype vector for each category from the training set, so that the different categories of text are defined in vector form. In the vector space model, the classes that belong to a particular word are represented as vectors, and every class contains information derived from the training set. The prototype vectors define the similarity boundaries between the classes in the vector space.

Information that belongs to a category is represented by positive numbers, and information that does not belong to it is represented by negative numbers. A similarity formula measures the similarity between each prototype vector and an unspecified document, and the category of that document is determined from this similarity. If the document does not clearly belong to any class, it lies at the boundaries of the prototype vectors. The class whose vector has the maximum similarity is chosen as the category of the unspecified document.
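
A reduced form of this idea is sketched below: each class prototype is the centroid of its training vectors, and a new document is assigned to the class with the most similar prototype. The sketch uses only positive examples and cosine similarity; the negative weighting of non-members described above is left out, and all names and numbers are assumptions made for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]

def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of the training vectors of one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train_prototypes(training: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Build one prototype vector per class."""
    return {label: centroid(vectors) for label, vectors in training.items()}

def classify(prototypes: Dict[str, Vector], document: Vector) -> str:
    """Choose the class whose prototype is most similar to the document."""
    return max(prototypes, key=lambda label: cosine(prototypes[label], document))

training = {
    "sports": [[3.0, 0.0, 1.0], [2.0, 0.0, 0.0]],
    "finance": [[0.0, 2.0, 1.0], [0.0, 3.0, 0.0]],
}
prototypes = train_prototypes(training)
print(classify(prototypes, [1.0, 0.0, 0.0]))  # -> sports
```

Training reduces to one pass over the data to compute the centroids, which is why the method is described as a fast learner.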

2.1.7 K-Nearest Neighbor

K-Nearest Neighbor (KNN) is an exemplar-based, or instance-based, approach that is applied effectively to sense tagged words. Instances are represented as points in a vector space, and a new test instance is compared with all previously stored instances in memory. KNN is a non-parametric method, that is, it makes no assumption about the distribution of the data. It is a supervised learning algorithm: it is provided with a training set, compares the test instance with that training set during classification, and needs little time for training.

KNN is one of the highest-performing algorithms for text categorization and sense tagged words. A new example is represented as a vector of ‘n’ features. The test vector is compared with the training set and is assigned to the neighborhood of the training vectors that are most similar to it. Exemplar-based methods do not ignore exceptions, which helps to disambiguate the context properly.

The K-Nearest Neighbor classifier selects the answer by comparing the target word with the sense inventory; it can also determine the sense of a word by comparing the target word individually with each sense-tagged example in the training set. The classifier finds the ‘k’ nearest samples to the target word, and the closest sense among them is selected as the correct sense of the word.

[Figure not included in this excerpt]

Figure 2.4 Instances in K-Nearest Neighbor

KNN is also regarded as a local classifier because it uses a lazy learning technique. Such classifiers are called lazy learners because they do not train a classifier until they are presented with a test sample. KNN uses cross-validation to choose the number of closest neighbors of the target word, and because of this step such classifiers are sometimes not considered completely lazy learners.

Lazy learners also need the context to be preprocessed before a word is classified. Recently, however, new methods have been introduced for KNN that need no preprocessing of the data and that can also determine the neighborhood size for any word.

This technique does not discard any information: it memorizes all training instances so that they can be retrieved easily and quickly.

Figure 2.4 above shows the instances used in the KNN approach. KNN classification is used in various NLP applications such as text categorization, question answering and information retrieval. It is also used in the medical field to compare protein or DNA sequences and to check the relevance of information. In security applications, the K-Nearest Neighbor approach is used to recognize facial expressions.
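
The K-Nearest Neighbor procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, a simple majority vote among the k neighbors, and invented two-dimensional feature vectors with the hypothetical labels `sense_A` and `sense_B`.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, sense label)

def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(training: List[Example], query: Sequence[float], k: int = 3) -> str:
    """Lazy learning: nothing is trained; all stored instances are compared at query time."""
    neighbors = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [
    ([1.0, 0.0], "sense_A"),
    ([0.9, 0.1], "sense_A"),
    ([0.0, 1.0], "sense_B"),
    ([0.1, 0.9], "sense_B"),
    ([0.2, 0.8], "sense_B"),
]
print(knn_classify(training, [0.8, 0.2], k=3))  # -> sense_A
```

With k=3 the two nearest neighbors carry `sense_A` and the third carries `sense_B`, so the vote is 2 to 1. Setting k equal to the size of the training set would instead return the overall majority label, which shows why the choice of ‘k’ matters.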

2.1.8 Ensemble Method

An ensemble method combines different classifiers to improve disambiguation accuracy. The different features of the classifiers are put together to improve learning. This overcomes the weaknesses of a single classifier based on supervised learning. Ensemble methods improve lexical disambiguation, semantic analysis and the classification of the training set.

Text categorization can be performed by a collection of classifiers, such as K-Nearest Neighbor and Naive Bayes, that exploits the features of both. Naive Bayes is used for feature selection on the training set, and K-Nearest Neighbor recognizes the categories of the text. Majority voting, probability mixture, rank-based combination and AdaBoost are used to combine the classifiers. These ensemble techniques are explained below:

- Majority Voting: The correct sense of a target word is determined by the maximum number of votes. Each classifier gives one vote to a sense, and the sense with the most votes is selected as the correct sense of the word. If every sense of the target word receives an equal number of votes, the sense is selected randomly.

- Probability Mixture: A probability distribution is used to disambiguate the sense tagged words. Every classifier provides a confidence score for each sense of the target word; the confidence scores are normalized to obtain probabilities, and the sense with the highest probability score is selected as the correct sense of the target word.

- Rank-Based Combination: Every classifier ranks the senses of the target word, and this method sums the ranks that each sense receives from all classifiers. The sense with the best overall rank is chosen as the right meaning of the target word.

- AdaBoost: AdaBoost (Adaptive Boosting) is a method for combining different classifiers. It learns the classifiers from a weighted training set. It proceeds in iterations, learning one classifier per iteration and identifying the examples that this classifier answers incorrectly. The weights of the incorrectly classified examples are increased from one iteration to the next, so that later classifiers concentrate on them. AdaBoost can also deal with multiclass classification. It has its roots in the Probably Approximately Correct (PAC) learning model. The algorithm is also applied to noisy data and is reported to suffer less from overfitting than other learning algorithms.
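
The first three combination schemes can be compared on a small example. The sketch below assumes that every classifier returns a positive confidence score for each sense; the function names, the sense labels A, B and C and all scores are invented for illustration. Ties in the vote are broken by first occurrence here rather than randomly, to keep the example deterministic.

```python
from collections import Counter
from typing import Dict, List

Scores = Dict[str, float]  # sense -> confidence score from one classifier

def majority_vote(predictions: List[str]) -> str:
    """Each classifier votes for one sense; the most frequent sense wins."""
    return Counter(predictions).most_common(1)[0][0]

def probability_mixture(all_scores: List[Scores]) -> str:
    """Normalize each classifier's scores to probabilities and add them up per sense."""
    totals: Counter = Counter()
    for scores in all_scores:
        z = sum(scores.values())
        for sense, score in scores.items():
            totals[sense] += score / z
    return max(totals, key=totals.get)

def rank_based(all_scores: List[Scores]) -> str:
    """Sum the rank (1 = best) each classifier gives a sense; the lowest sum wins."""
    rank_sums: Counter = Counter()
    for scores in all_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, sense in enumerate(ranked, start=1):
            rank_sums[sense] += rank
    return min(rank_sums, key=rank_sums.get)

all_scores = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.2, "B": 0.7, "C": 0.1},
    {"A": 0.5, "B": 0.4, "C": 0.1},
]
top_choices = [max(s, key=s.get) for s in all_scores]   # ['A', 'B', 'A']
print(majority_vote(top_choices))       # -> A
print(probability_mixture(all_scores))  # -> B (summed probability 1.4 vs 1.3)
print(rank_based(all_scores))           # -> A (rank sums 4 vs 5)
```

The example is chosen so that the schemes disagree: sense B wins the probability mixture because one classifier is very confident about it, although two of the three classifiers prefer sense A.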

2.2 Knowledge Sources for Sense Tagged Words

The knowledge source is a fundamental component of WSD. Knowledge sources provide the data that is used to associate senses with words. The knowledge sources used in the field of WSD fall into two groups, which are described briefly below:

(i) Structured knowledge source.
(ii) Unstructured knowledge source.

(i) Structured knowledge source

These are the knowledge sources that provide relations between words. The following are the different types of structured knowledge source:

- Thesauri: A thesaurus provides information about relationships between words, such as synonymy and antonymy. Synonymy relates words with the same basic meaning, and antonymy relates a word to its opposite meaning. The latest edition of Roget's International Thesaurus contains 250,000 word entries organized in six classes and about 1000 categories.
- Machine-readable dictionaries: Machine-readable dictionaries have become a popular knowledge source in the field of WSD. They provide the semantic meaning of words. The latter is one of the most famous machine-readable dictionaries.
- Ontologies: Ontologies are specifications of conceptualizations of specific domains, usually including a taxonomy and a set of semantic relations. In this respect, Wordnet can be viewed as an ontology that forms a semantic network to categorize words.

(ii) Unstructured Knowledge Source

Unstructured knowledge sources are used to learn natural language models, in both supervised and unsupervised learning. A corpus is an unstructured knowledge resource: a collection of text that contains the relevant information about a language. Corpora can be raw or sense annotated, as explained below:

- Raw Corpora: The Brown Corpus is a collection of one million words of text published in the United States in 1961. Raw corpora can be used to identify the grammatical relations between words. The Wall Street Journal (WSJ) corpus is a collection of approximately 30 million words, and the American National Corpus includes 22 million words of written and spoken American English.
- Sense Annotated: SemCor and Wordnet are the largest sense annotated resources that are widely used in WSD. SemCor consists of 352 tagged texts with around 234,000 sense annotations. MultiSemCor is an English-Italian sense annotated corpus. Wordnet contains a large number of sentences with their instances as nouns, verbs, adjectives and adverbs. Further sense annotated corpora, such as Senseval-1, are based on different versions of the SemCor and Wordnet sense inventories.
- Collocation resources: These record the tendency of words to occur regularly with other words; examples are the Word Sketch Engine and JustTheWord. Some of these resources provide the frequencies with which words occur in corpora derived from the web. They are among the more recent resources for WSD.

2.2.1 Wordnet

Wordnet is an electronic dictionary that provides a lexical database for a natural language. It stores words and their meanings and organizes the words into synonym sets, known as synsets. A synset represents the lexical concept of a word. Wordnet defines semantic and lexical relations: a relation between two synsets is known as a semantic relation, and a relation between two words within two synsets is known as a lexical relation. Wordnet contains different types of words, which are explained below:

- Monosemous words: These are words with one sense and one synset. For example, the word ‘wrist watch’ has only one meaning and only one synset.

- Homonymous words: Homonymous words have the same spelling but more than one meaning, and the meanings are entirely different from each other. For example, the word ‘bark’ can mean ‘tough protective covering of trees’ or ‘sound made by a dog’; ‘bark’ is a homonymous word because its meanings are not related to each other.

[Figure not included in this excerpt]

Figure 2.5: Wordnet domain

- Polysemous words: These words also have multiple senses, but the senses share the same basic meaning. For example, the word ‘accident’ is a polysemous word that can denote ‘some mishap’ or ‘some incident that happens by chance’, which have the same basic meaning.

Wordnet does not distinguish between polysemous and homonymous words. It differs from a traditional dictionary: in a traditional dictionary words are arranged in alphabetical order, whereas in Wordnet words are stored in semantic order. Wordnet stores information about words under four parts of speech: noun, verb, adjective and adverb.

Each synset has a short definition, known as a gloss, which explains the concept represented by the synset. It has been observed that nouns have long glosses, while the glosses of verbs, adjectives and adverbs are quite short. Figure 2.5 shows an example of the Wordnet taxonomy.

2.2.2 SemCor

SemCor is a subset of a large corpus, the Brown Corpus, and is used in Word Sense Disambiguation. It is annotated with part-of-speech tags and with the word senses of the sense inventory. It is composed of 352 texts: in 186 texts the nouns, verbs, adjectives and adverbs are annotated, and in the remaining 166 texts only the verbs are annotated with their semantic meaning.

Table 2.2 Synset of Hindi sense tagged word

[Figure not included in this excerpt]

SemCor is also used as a training set for classifiers based on supervised learning. An example of a synset is given in Table 2.2. SemCor contains around 234,000 sense annotations and is the largest sense tagged corpus used in the field of WSD. The corpus created by Bentivogli and Pianta, MultiSemCor, is an English and Italian version that provides all parts of speech for both languages. It aligns the Italian words with the original sense tagged English words, and the Italian words are then tagged with the corresponding semantic meaning.

2.3 Representation of Context

Text is normally unstructured, and to make it suitable for automatic processing it is converted into a structured form. In every approach to WSD, the first step is preprocessing, which transforms the text from unstructured to structured form. This process consists of the steps shown in Figure 2.6 and explained below:

[Figure not included in this excerpt]

Figure 2.6: Preprocessing of text¹⁴.

- Tokenization: This is a normalization process in which the text is split into a set of words (tokens).
- Part-of-speech tagging: This step assigns grammatical categories such as noun, verb, adjective and adverb to the words.
- Lemmatization: This step deals with the morphological properties of the text and reduces the morphological variants of words to their base form.
- Chunking: This step divides the text into syntactically correlated parts.
- Parsing: Parsing identifies the syntactic structure of the context and generates a parse tree, which is used to analyze the formation of the sentence according to the rules and standards already defined for the natural language.

Preprocessing is the first step in disambiguating a word. It supports feature selection because it removes the punctuation in the sentence and retains the useful data.

[Figure not included in this excerpt]

Figure 2.7: Parsing Tree

Figure 2.7 above shows the parse tree generated during preprocessing. The preprocessing phase thus helps to convert the text into feature vector form and to choose the knowledge source according to the context and the words.

CHAPTER 3 LITERATURE SURVEY

Lexical ambiguity is a current and complex research problem: the task is to judge the semantic and syntactic meaning of a word. A considerable amount of work has been done in this field, which is discussed below:

Singh et al. ¹ propose and evaluate a method for Hindi WSD that computes similarity based on semantics. It adapts an existing measure of semantic relatedness between two lexically expressed concepts of Hindi WordNet. The proposed method was evaluated on 20 polysemous Hindi nouns and obtained an overall average accuracy of 60.65% using the semantic measure. The authors compare the average values of two methods, the direct overlap method and the semantic similarity measure, and show that the semantic similarity measure improves the results by up to 32.53%. This is mainly due to the use of the WordNet hierarchy in identifying the similarity between the test instance and the sense definition.

Singh et al.² describe the effects of stemming, stop word removal and context window size on Hindi WSD. WSD methods fall into two categories: knowledge-based and corpus-based approaches. Knowledge-based approaches make use of information provided by machine-readable dictionaries, thesauri or other lexical resources, whereas corpus-based approaches make use of information gathered from a semantically tagged or raw corpus. The method calculates the overlap for all senses of a word and selects the sense with the maximum overlap as the winner sense. It is observed that stop word elimination increases the number of content words in the context vector and that stemming improves the chances of overlap by reducing the morphological variants of words to the same stem. The approach improves precision and recall by up to 9.24% and 12.63% respectively, and it also analyses the ‘karak relations’ between the words in a sentence.

Aquidate et al. ³ present an Ant Colony Optimization (ACO) technique that solves combinatorial problems arising in prototype selection. They introduce the Ant Prototype Reducing algorithm (Ant-PR), which reduces the training set of the KNN classifier, and compare it with two classical, well-known KNN condensing algorithms. The proposed Ant-PR provides better classification accuracy than the KNN condensing algorithms. Using ant colony optimization, the technique eliminates the prototypes that do not provide additional information for KNN classification. The KNN algorithm requires a large running time compared with the Ant-PR algorithm. Prototype selection is required to identify similar patterns in the data. It is considered an NP-hard problem, because no polynomial-time algorithm is known that can determine the final solution. The Condensed Nearest Neighbor rule (CNN), the Reduced Nearest Neighbor rule (RNN), the Fast Condensed Nearest Neighbor rule (FCNN) and the Class Boundary Preserving algorithm (CBP) are the prominent approaches to prototype selection; they are explained below:

- CNN is the first method for reducing the size of the stored data on which the nearest neighbor decision rule operates. It has to read all the data in the training set again whenever new information is added, which is a time-consuming process.
- RNN is an extension of the CNN approach that avoids the complete re-reading process and also deletes the data that cause misclassification.
- FCNN is a scalable algorithm for large multidimensional data sets. It creates subsets of the large data set and applies decision rules to those subsets. It selects the points that are very close to the test vector and has low quadratic complexity.
- CBP is a multi-step method for extracting information from the training set. It retains the information needed for classification and deletes the useless data.

ACO is built from simple agents (ants) that cooperate with each other to achieve the correct behavior of the system. In ACO, each ant incrementally constructs a solution to the target problem. During classification, the algorithm prepares a minimal set of prototypes by eliminating the unnecessary data, that is, the prototypes that do not provide additional information for KNN classification. The performance of the Ant-PR algorithm has been tested on various data sets, including public-domain and medical ones, and it is reported to give better results on all of them.

Bowes et al. ⁴ describe an automatic word sense disambiguation method that determines which word sense fits a given context, which makes the process feasible in practice. They explain that the all-words task requires the disambiguation of all nouns, adjectives, verbs and adverbs. They compare several WSD systems on Senseval-2 and SemEval-2007 and identify the best-performing system for sense tagged words.

Rezapour et al. ⁵ propose a supervised learning method for WSD based on the KNN algorithm, together with weighting strategies for classifying the context. The scheme has two major parts: the first performs feature selection, and the second performs classification by calculating the weights of the different senses. The scheme is evaluated on six sense tagged words from a benchmark corpus. It uses 5-fold cross-validation to estimate the performance of the algorithm: for each ambiguous word, the set of all related samples is divided into five equal folds. Four folds are used to extract the features and train the classifier, and the remaining fold is used for testing. The results show that the weighting method is encouraging and leads to promising improvements in many cases.
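
The 5-fold protocol described above can be sketched as a small generator. This is a generic illustration and not the code of the cited work; the function name and the round-robin assignment of samples to folds are assumptions of the sketch.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def k_fold_splits(items: Sequence[T], k: int = 5) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs: each fold is the test set once, the other k-1 folds train."""
    folds = [list(items[i::k]) for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

samples = list(range(10))  # stand-in for the tagged examples of one ambiguous word
for train, test in k_fold_splits(samples, k=5):
    print(len(train), len(test))  # prints "8 2" five times
```

The reported performance would then be the average of the five test results, so that every sample is used for testing exactly once.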

Bamman et al. ⁶ describe methods for identifying word sense variation in a dated collection of historical books in a large digital library. They evaluate seven different classifiers, both in a tenfold test on 83,892 words from an aligned parallel corpus and on a small, manually annotated sample of 525 words, and measure the overall accuracy of each system. They show that large parallel data lead to higher-quality WSD classification.

Jabarri et al.⁷ present a novel semi-supervised learning method that reduces the human labeling effort, using two different statistical classifiers, one of which is KNN. The technique uses a Persian corpus, the University Information Kiosk corpus. The proposed method takes unlabeled data and converts it into labeled form, and it performs classification on both labeled and unlabeled data. The results show the effectiveness and feasibility of the proposed approach.

Haroon⁸ develops a WSD system for the Malayalam language. The work explains the various approaches to WSD, namely the knowledge-based, supervised and unsupervised approaches, which are discussed below:

- The knowledge-based approach is a dictionary-based approach that mainly uses external sources such as corpora and Wordnet to gain knowledge related to the word.
- In the supervised learning approach, the learning system is provided with a training set that contains various features related to the data set. In this approach a disambiguated corpus is available to the system.
- In the unsupervised learning approach, no training set is provided to the system. It uses clustering to classify the context. Hyperlex and Lin’s approach are the two main algorithms used for disambiguating the word. This technique combines the advantages of the supervised and knowledge-based approaches.

This WSD system adopts a knowledge-based approach. It uses the electronic Wordnet dictionary that was developed at Princeton University. The main purpose of Wordnet is to provide a system that holds knowledge related to a language. It distinguishes between nouns, verbs, adjectives and adverbs and collects the grammatical knowledge related to the language. Wordnet provides the polysemy count of a word, that is, the number of synsets that contain the word. It also provides a frequency score: several sample texts have all their words semantically tagged with the corresponding synset, and a count is then kept of how often each word occurs in a given sense. This work uses two methods for disambiguation: the Lesk and Walker method and the Conceptual Density method.

In the first method, a hand-devised knowledge source is used; the second method uses the Malayalam Wordnet. It is shown that conceptual density depends on the implementation of the semantic relations in Wordnet, and that the system accuracy can improve as the number of classifications in Wordnet increases. The work also illustrates that knowledge-based systems yield poor accuracy because of their complete dependence on dictionary-defined senses.

Garcia et al.⁹ propose a simple alternative to cross-validation of the neighborhood size that does not require preprocessing in local classifiers. They analyze the approach for six standard and state-of-the-art local classifiers and confirm it experimentally on seven benchmark datasets. The technique shows that classification can be attained without any training.

Navigli¹⁰ provides a survey of the field of WSD and gives an overview of supervised, unsupervised and knowledge-based approaches. It explains the important perspectives on these techniques and illustrates the classification techniques, for instance decision trees, decision lists and neural networks.

Hwang et al.¹¹ present a new fast KNN classification algorithm for texture and pattern recognition. For each input vector, it identifies the first k closest vectors in the design set of a KNN classifier by performing a partial distance search in the wavelet domain. KNN is described as an instance-based classification method that can achieve high classification accuracy in problems with unknown and non-normal distributions. Its main drawback is that it requires a large number of design vectors, which results in high classification complexity. The work shows that, without increasing the classification error rate, the algorithm requires only 12.94% of the computational time of the original KNN technique.

Wang et al. ¹² propose an improved KNN classification technique for text categorization. After analyzing the traditional KNN classification algorithm, they introduce the Rocchio classification technique into it. Two shortcomings of traditional KNN are identified: the large amount of calculation needed to compute the similarity between an unknown text and all the training samples, and the feature crossover phenomenon. Traditional K-Nearest Neighbor classification needs a large memory space to store the information and involves complex calculations, and its accuracy drops when there is a large amount of data to categorize. The Rocchio classification method is used to categorize large data sets and improves the classification accuracy. The features of the Rocchio approach are as follows:

- It uses a prototype vector for the training set.
- It defines the various classes in vector form.
- It represents correct answers by positive numbers and incorrect answers by negative numbers.
- It compares the new vector with the classes already defined by the prototype vectors; if a class is similar to the unknown vector, the new vector is placed near that class.

This improved KNN method solves the two problems and gives good classification results, with improved precision and recall. The method is evaluated on two different corpora, TanCorpV1.0 and the Sohu webpage corpus, which are described below:

- TanCorpV1.0 is a Chinese corpus that consists of 12637 documents in 10 categories, such as medical, geography and science. It has 8872 training documents, and testing is performed on 3765 documents. The class sizes vary widely: the sports category has 1426 training documents, whereas the regional category has only 86.
- The Sohu webpage corpus is provided by an internet search engine for enthusiasts and researchers. It consists of 12236 training documents and 3269 test webpages. This Chinese corpus also contains 10 categories: news, finance, sports, health, IT, tourism, education, employment, culture and military. One category contains 1513 training documents, while the tourism category has only 91.

The performance of the Rocchio classification method is measured by precision and recall, and by the F-measure, which combines both. The Rocchio method is a fast and effective method for text categorization, and it also increases the accuracy of determining the categories of a given text.
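
The three measures can be computed from the counts of correct and incorrect category assignments. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall, often written F1), since the weighting used in the cited work is not stated here; the function name and the counts are invented for illustration.

```python
from typing import Tuple

def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """tp: correctly assigned, fp: wrongly assigned, fn: missed documents of a category."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r)  # -> 0.8 0.5
```

For these counts the F-measure is about 0.615, which lies below the arithmetic mean of 0.65 because the harmonic mean penalizes the gap between precision and recall.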

Wajeed et al. ¹³ propose a method for the classification of data and for storing large amounts of data in categories. It uses the supervised classification paradigm with distributed features. The K-Nearest Neighbor classification approach is used to categorize large documents. It creates a mapping between independent attributes and decision attributes. Information can be retrieved by supervised and unsupervised methods, which are explained below:

- Supervised methods are those in which a training set is provided to the classifier. The training set consists of independent attributes and decision attributes, where the decision attribute is the class label of a category. A mapping is learned between the independent attributes and the decision attribute, and it is then used to determine the class label of a test instance.
- Unsupervised methods are not provided with any training set. They use clustering to categorize the text, such that the similarity within a cluster is maximal and the similarity between different clusters is minimal, almost zero. The decision attributes of instances that belong to the same cluster have almost the same values.

The work explores different types of clusters: soft, hard and mixed clusters. These clusters are generated from the distribution of the features of all categories in the training set. Feature selection is used for text categorization and is applied during preprocessing of the text. It determines a subset of the original features. The data is transformed from a high dimension to a low dimension so that it can be retrieved easily, which can be achieved with either a filter technique or a wrapper technique.

The experiments use a text categorization corpus with 14822 words and 8 different classes. From these word patterns, clusters are defined for each class. Each cluster has a threshold value by which it is differentiated from the others. The data is in textual form. Two types of cluster are used for feature selection, type-1 and type-2: the first and last parts of a document are used in type-1 clusters, and the complete document is used in type-2 clusters. The results show that soft clusters perform better than hard and mixed clusters.

Tomar et al. ¹⁴ present an unsupervised technique for word sense disambiguation. The method is based on semantic analysis and improves both the similarity measurement and the glosses used to collect information about words. Many unsupervised methods are based on similarity or on graphs. A graph-based method defines the lexical meaning of every word in the sentence and constructs a graph from these lexical values. The edge values of the graph are reused: if the same word occurs in a different paragraph, the correct answer is selected directly from the edge. The proposed method determines the probability of each candidate for the word and selects the one with the maximum probability. It uses a raw corpus that contains various ambiguous words. The probabilistic method removes the punctuation and articles in the paragraph and forms clusters according to the different senses of the word. The morphological form of the words is also determined. The authors show that the probabilistic method is language independent and can be applied to any language. The algorithm is evaluated on Hindi and English, and its performance is shown to increase as the training set grows.

Wajeed et al.¹⁵ explore feature selection and similarity measure techniques for KNN-based text classification. In this method, the content of the data is explored with a feature selection technique, and a similarity measure selects the category of the text. Text categorization is performed after the textual information has been converted into vector form. The methods used to convert text into numeric or vector form are explained as follows:

- A binary vector is generated with respect to all the related classes, where the total number of instances is equal to the number of documents in the respective class. The binary vector depends on the size of the lexicon. The lexicon used in this method has 5,485 rows and 14,822 columns, and the binary vector representation has the same number of rows and columns.
- A frequency vector counts the number of times a word occurs in the respective class; here the presence or absence of the word matters for calculating the numeric value.
- Normalization is performed to increase the accuracy of learning from the training set. It calculates the ranges of the clusters, by which they can be constructed and differentiated. The outputs of both binary vector and frequency vector generation are normalized to obtain accurate results. The binary vector output is normalized in two ways: length normalization with unique words, where the length of the document is directly proportional to the probability of a word appearing in the document, and length normalization with all words, where the length is the total number of words present in the document. Similarly, the frequency vector output is normalized with unique words and with all words.
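
The three vector forms listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`build_lexicon`, `binary_vector`, `frequency_vector`, `length_normalize`), the two toy documents, and the choice to normalize by dividing each component by the document length are assumptions made for the example.

```python
from collections import Counter

def build_lexicon(documents):
    """Return the sorted list of unique words across all documents."""
    return sorted({word for doc in documents for word in doc.split()})

def binary_vector(doc, lexicon):
    """1 if the lexicon word occurs in the document, otherwise 0."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in lexicon]

def frequency_vector(doc, lexicon):
    """Number of times each lexicon word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in lexicon]

def length_normalize(vector, length):
    """Divide every component by a document length (assumed normalization)."""
    return [value / length if length else 0.0 for value in vector]

# Toy documents (hypothetical data, for illustration only).
docs = ["bank river bank", "bank money loan"]
lexicon = build_lexicon(docs)

doc = docs[0]
binary = binary_vector(doc, lexicon)
freq = frequency_vector(doc, lexicon)

unique_len = len(set(doc.split()))   # document length counted over unique words
total_len = len(doc.split())         # document length counted over all words

binary_norm_unique = length_normalize(binary, unique_len)
freq_norm_all = length_normalize(freq, total_len)
```

With the lexicon sorted alphabetically, each document becomes a fixed-length row, so a collection of documents forms a matrix with one row per document and one column per lexicon word, which matches the rows-by-columns description of the lexicon in the text.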

The proposed method uses different measures to calculate the distance between the unknown vector and all classes present in the training set: squared Euclidean distance, Manhattan distance, Chessboard distance and Bray-Curtis distance. After analyzing the results of these distance measures, it selects the best one to evaluate the applied method. The K-Nearest Neighbor classifier is used for text categorization, where the normalized text is used for classification and the distance measure is used to find the nearest neighbor class. The work also investigates the process of building a multi-classifier model for organizing data. It shows that the value of ‘k’ in the KNN classifier is inversely related to the accuracy of the classifier. It also shows that, for larger values of ‘k’, the accuracy of the classifier never falls below 55% for the different vector types.
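
The four distance measures and the nearest-neighbor vote can be illustrated with a minimal Python sketch. The function names, the toy training vectors and the labels `finance` and `river` are assumptions introduced for the example; the sketch uses the standard textbook definitions of each distance and a simple majority vote, and is not taken from the cited work.

```python
from collections import Counter

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chessboard(a, b):
    # Also known as the Chebyshev distance: the largest coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def bray_curtis(a, b):
    # sum|x - y| / sum|x + y|; defined as 0.0 here when both vectors are all zero.
    denom = sum(abs(x + y) for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b)) / denom if denom else 0.0

def knn_classify(train, query, k, distance):
    """train is a list of (vector, label); return the majority label of the k nearest."""
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical normalized document vectors with their class labels.
train = [
    ([1, 0, 0], "finance"),
    ([1, 1, 0], "finance"),
    ([0, 0, 1], "river"),
    ([0, 1, 1], "river"),
]
query = [0.9, 0.1, 0.0]

predictions = {
    fn.__name__: knn_classify(train, query, k=3, distance=fn)
    for fn in (squared_euclidean, manhattan, chessboard, bray_curtis)
}
```

Passing the distance as a parameter makes it easy to run the same classifier with each measure and compare the results, which is the kind of comparison the paragraph describes.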

Qamar et al. ¹⁶ propose a voted perceptron algorithm. The work describes different similarity measure techniques based on diagonal, symmetric or asymmetric similarity matrices. It illustrates similarity methods for the K-Nearest Neighbor approach, which is used to determine the most similar class for an unknown instance. It improves the rules for the K-Nearest Neighbor approach and defines a symmetric version of the similarity method. K-Nearest Neighbor performs well and is used in many fields, such as pattern recognition and machine learning. The SiLA algorithm is used to determine the similarity measures. The features of the SiLA algorithm are explained below:

- It uses a metric to determine the similarity of all defined classes.
- It can learn from the training data in symmetric, diagonal and asymmetric form.
- The proposed algorithm can be used on-line to retrieve information.
- It places a new example in the neighborhood of the classes already defined in the training set.

The proposed method is evaluated on the UCI database. It shows that learning similarity measures leads to improvement once the standard rules for the KNN classifier have been defined.

Garcia et al. ¹⁷ explain local classifiers such as KNN and Hyperplane Distance Nearest Neighbor (HKNN). These classifiers are also known as lazy learners. The work provides a cross-validation technique that reduces the preprocessing needed to find the neighborhood size in local classifiers. The approach decreases error rates by using a Bayesian neighborhood method to find the neighborhood size for different classifiers.

Wang et al. ¹⁸ propose a new technique for facial expression recognition. Facial expression is an important communication medium and the main non-language method of human communication. The work explains a method that combines two classifiers for facial expression recognition: a Hidden Markov Model (HMM) classifier and a KNN classifier, applied sequentially. First, the HMM classifier calculates the probabilities of six expressions and provides two possible results. Then KNN makes the final decision by checking the threshold value obtained after applying the HMM classifier and the training samples. It selects the result with the highest probability as the correct one.

Bell et al. ¹⁹ explore a technique for combining two or more classifiers to achieve good performance in text classification. It combines the KNN and Rocchio methods to analyze the results. The work shows that when the classifiers are combined by an evidential operation, the accuracy of categorizing the data increases.

Li et al.²⁰ present a density-based method for reducing noise in text categorization. It uses the KNN classifier to categorize the text. It shows that noise is inversely proportional to the accuracy of the categorization. The approach also improves precision after the noise is removed.

Leroy et al. ²¹ describe a symbolic knowledge technique for machine learning algorithms. The technique uses a Naive Bayes classifier, and the relationships between the words in a sentence are obtained from the Unified Medical Language System (UMLS). The knowledge base is created from UMLS, which is also used to find the logical meaning of words in context. The classifier is based on probability rules and checks all the features of the words in the context independently.

Zhang et al.²² describe the use of the Naive Bayes classifier for text classification. The classifier is based on the conditional probability of a feature belonging to a particular class. A selection method is then used to select these features. Furthermore, an auxiliary method for feature selection is described, which improves the feature selection technique and the performance of the Naive Bayes classifier.

Li et al. ²³ describe a novel KNN-based method for retrieving protein-protein interaction (PPI) information from the biomedical literature. In this work both KNN classifiers are used. The KNN classifier is introduced to improve the accuracy of the other classifier.

Resnik et al.²⁴ describe automatic WSD and the different approaches used for it. The work explains the tasks performed by automatic Word Sense Disambiguation, illustrates the problems faced in WSD, and presents different applications of WSD.

Dorr et al. ²⁵ determine a way to extract grammatical and linguistic information about any natural language. The work illustrates interlingua notations for natural language, used to determine the semantic meaning of sense tagged words. Interlingua can be defined as a machine translation method used to convert a source language into a target language. It therefore requires the grammatical meaning and standard rules of every natural language to be properly defined. The work discusses various issues regarding the components of interlingua and the standard rules used for defining a natural language. The method analyzes the natural language deeply and determines the semantic and syntactic meaning of the sentence. To transform the source language into the target language, it follows the steps explained below:

- It represents the word in structural and symbolic notation.
- It analyzes the morphological meaning of the word in the source language and then compares the category of the word with the target language.
- It determines the lexicon definition of the word; the lexicon is a collection of words defined according to human language.
- The instances are defined in textual form, such as sentences and documents.
- Each instance is connected in the form of a nested loop, and the meaning of the word is extracted in atomic form.

Several issues are analyzed in the interlingua method, such as the representation of meaning and the divergences between languages. The method is applicable in many areas of artificial intelligence, such as information retrieval, question answering and speech recognition. It improves the structural annotation of language and increases the accuracy of translation.

Carenini et al.²⁶ describe an e-democracy method that uses different approaches from natural language processing. E-democracy is a technique that maintains communication between a public administration and its citizens. The method presents information to people in simple language so that they can retrieve the correct information. It uses different tools for communicating information in natural language. The features of these tools are explained below:

- It delivers the correct information contained in the message box.
- It readily answers the questions asked by different users.
- The information is made more stylish and presentable so that it is readable.
- It compares each word with natural language to obtain the correct meaning of the words.
- It translates every technical word into simple human language.

The work presents new methods used in public affairs that increase the interaction between people and the latest technologies. These technologies are applicable in various forms, such as retrieving information from search engines, and they create a friendly environment for users.

Stone et al. ²⁷ illustrate security methods that use a language processor to protect information from unauthorized attacks. The method applies authentication and validation so that there is no unauthorized access to the information. An intrusion detection system, operating over the network, is used to secure the information so that the correct data is transferred from sender to receiver. It protects user accounts from attackers by countering attacks and stops fraudulent mail, such as finance-related or bank-related messages. The method reminds the user to update the password, which protects against illegal verification of accounts. The intrinsic security method also protects against fake advertisements that request bank account and credit card information from the end user. The work also introduces technologies for information security, such as EBID and SpamAssassin, which provide protection from unwanted spam and phishing, and it reports the values obtained with both. The results indicate that SpamAssassin is the better method for securing information, especially when data is being transferred from sender to receiver. It has a total of 0.14365 positive results for data security as compared to the EBID approach. SpamAssassin detects phishing and protects from unwanted spam, and its detection rate is double that of EBID.

Mills et al. ²⁸ present a survey of graph-based methods for natural language processing. The survey analyzes the performance and functional components of graph-based methods. In a graph-based method every word is placed in a node, and the nodes are linked to each other. The method uses clustering to determine the correct sense of a word. It determines the center of the node. The graph-based method removes the punctuation in the sentence and selects the nouns and verbs in it. The nouns are placed in the nodes of the graph. The frequency of a target word is calculated to find the instance values of the target word. The method also removes noise and unwanted attributes from the sentence. Two types of graph-based method are used for disambiguating a sentence: directed and undirected. Both use clustering to determine the correct sense of the ambiguous word. A similarity measure is used to find the similarity between all clusters, and the resulting values are used to merge similar clusters. The graph-based method is also used for determining the morphological meaning of the word: it determines the meaning according to the part of speech with which the word is used in the sentence, that is, as a noun, adverb or verb. The survey reports improved performance of the graph-based method and explains the different phrases in which words occur in a sentence.

Pham et al. ²⁹ investigate the use of unlabeled data for word sense disambiguation. The work uses semi-supervised methods for learning from the training set. Semi-supervised methods also learn from labeled data, of which only a small amount is required. They convert unlabeled data into labeled data, which yields a large labeled training set. Feature selection is used for preprocessing the data and removing noise from the training set. It also constructs a set containing all the surrounding words in the sentence. Bootstrapping is the first classification method based on semi-supervised learning. Spectral Graph Transduction (SGT) cotraining, cotraining, local cotraining and smoothened cotraining are also semi-supervised learning approaches. These approaches use feature vectors, representing both the test instance and the training set in vector form, and normalize the data to obtain correct classification results. SGT and SGT-cotraining have similar feature vector attributes, whereas local cotraining and smoothened cotraining have different feature vector attributes. The proposed work is evaluated on the interest corpus, which consists of 2,369 sense tagged words, although the algorithm uses 29 nouns to obtain the results. Four different semi-supervised algorithms are used, and the work shows that unlabeled data improves the accuracy of disambiguating sense tagged words.

Benerjee et al. ³⁰ present the Lesk algorithm for word sense disambiguation. The method uses WordNet to obtain the lexical meaning of ambiguous words and is evaluated on English lexical sample data. It introduces the adaptive Lesk algorithm for disambiguating the target word, which uses a comparison method, whereas the original Lesk algorithm calculates the overlap between different words and the meanings of the word given in the dictionary. It also calculates a score for the word, defined as the total number of instances. The adaptive Lesk algorithm finds the relations between all the words after obtaining the complete information related to each word, which it collects from the synset of the word. It uses a small context window, which collects the words surrounding the target word, specifically those to its left and right. It determines the relation between the target word and the surrounding words to obtain the correct sense of the target word. The score values are calculated according to the instance values of the target word. The work improves the accuracy of the Lesk algorithm up to 32% and shows that the comparison method is better than the overlap method used in the traditional Lesk algorithm.
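
The overlap idea behind the original Lesk algorithm, together with the context window described above, can be sketched as follows. This is a simplified illustration and not the adaptive algorithm of the cited work: the sense labels, the two glosses, the stop-word list and the function names are hypothetical, and a real system would take its glosses from WordNet rather than from a hand-written dictionary.

```python
# Hypothetical stop words and sense glosses, for illustration only.
STOPWORDS = frozenset({"a", "an", "and", "of", "on", "the", "that", "at"})

SENSE_GLOSSES = {
    "bank.finance": "an institution that accepts deposits and lends money",
    "bank.river": "sloping land beside a body of water such as a river",
}

def context_window(tokens, index, size):
    """Return up to `size` words on the left and right of the target word."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right

def simplified_lesk(context_words, sense_glosses, stopwords=STOPWORDS):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_score = None, -1
    for sense, gloss in sense_glosses.items():
        gloss_words = set(gloss.lower().split()) - stopwords
        score = len(context & gloss_words)   # overlap count
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense, best_score

tokens = "he sat on the bank of the river and watched the water".split()
window = context_window(tokens, tokens.index("bank"), size=3)
sense, score = simplified_lesk(window, SENSE_GLOSSES)
```

With a window of three words on each side, only `river` overlaps with a gloss, so the river sense is selected with a score of 1. A wider window would also pick up `water` and raise the score, which shows how the window size affects the evidence available to the algorithm.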

Sidhu et al. ³¹ describe the importance of machine translation and word sense disambiguation in natural language processing. The work discusses the role of machine translation in various fields of natural language. Machine translation translates a source language into a target language and is mostly used in speech recognition. The steps involved in machine translation are explained below:

- An intermediate source is used, which collects the target word from the source language.
- A statistical translator is required to collect the accurate meaning of the target words.
- An example-based machine translator is used to define the linguistic rules for translation.
- It also resolves semantic and syntactic errors in the sentence.
- The method determines the subject, verb and noun present in the sentence so that the source language is properly translated into the target language.

Word sense disambiguation is used to automatically determine the correct sense of a word according to the sentence. It has various applications, such as information retrieval, text categorization, question answering and advertising. The work also explains the different learning methods used for word sense disambiguation: knowledge-based, supervised and unsupervised. These methods are used for learning from the training set and for preprocessing and feature selection. Various classification methods are used for classifying the context to obtain the accurate meaning of the target word, such as Naive Bayes, decision list, decision tree and the Lesk approach. All these techniques use different learning methods for the training set. The work reports the accuracy of all the classifiers and discusses the best methods for machine translation and word sense disambiguation.

Canas et al. ³² propose a mapping method for WordNets defined for different languages. It constructs a map for determining the correct meaning of an ambiguous word. It links the target word with all surrounding words by using the mapping method and navigates from the target word to its correct meaning. The Cmap tool is used to divide the map into modules and then to link every word in the sentence with the others. It prepares a list of concepts and uses this list to check the relevance of the data. It also makes use of the prepositions and punctuation in the sentence. Linking different phrases with each other is the most difficult task in the proposed method.

The Cmap tool is used as a browser that provides domain knowledge. The method links different concepts with each other; it links audio, video, pictures and web pages. It can be used for retrieving information about any field. The Cmap tool has its own server, where all information is stored and can be retrieved easily. The method is applicable in text categorization, where well-organized information is necessary.

Fulmari et al. ³³ present a survey of supervised learning methods for word sense disambiguation. The survey discusses four learning approaches used for sense disambiguation: supervised, unsupervised, semi-supervised and knowledge-based. These are discussed below:

- In the supervised learning approach, a training set is provided to the system. A probabilistic method is used to find the accurate answer. It creates different classes according to the dataset and applies a similarity-based method to determine the most similar class. Supervised learning has two phases: a training phase, which collects the required information related to the word, and a testing phase, which classifies the word.

- The unsupervised learning approach uses a clustering technique for classification. It is not provided with any training set. The technique forms clusters of words according to their meaning, so the total number of clusters may vary with the number of senses of the word. It also determines the instance value of the word.

- The semi-supervised learning approach is used for labeled and unlabeled data. During classification, it uses a large amount of labeled data as compared to the unlabeled data. The method converts the unlabeled data into labeled data, which is time consuming. As such, the unsupervised method is time consuming and less used in complex problems.

- The knowledge-based approach uses knowledge sources such as WordNet and dictionaries to classify the word. It relies on an external source to collect information about the word. It is based on an overlap method, which counts the overlap between different senses of the words.

Naive Bayes, decision list, neural network and so on are different classification methods based on the supervised learning approach. The survey also illustrates the various approaches used for WSD and focuses on the feature selection method used for removing unwanted punctuation from the sentence.

Jurek et al.³⁴ describe a cluster-based technique for classification. It combines multiple classifiers to improve the overall classification accuracy and introduces a novel combination function based on exponential support (ExSupp). The cluster-based method is an alternative to the K-Nearest Neighbor approach. The main difference between the two is the way the neighbor, or similar class, of an unknown instance or new vector is determined. The cluster-based method constructs clusters of values according to the total number of clusters and the total number of features in the document. When a new text arrives, it is matched against every cluster, and a similarity method is applied to determine the most similar cluster, in whose neighborhood the new text is placed.

It builds an ensemble of different classifiers, such as KNN, Minkovdm and so on, to obtain the correct results. It generates a base classifier, whose output is then combined with all the other classifiers. The generation of the base classifier is explained as follows:

- In the first step, the training set is divided into clusters, which are constructed according to the total number of instances in the training set and the feature values of all attributes of the data in the training set.
- The instances are clustered ‘k’ times to obtain ‘k’ clusters. The Euclidean distance is used to determine the distance between the clusters and the new instance.
- It assumes that every cluster contains more than one attribute, so that the new instance can be easily compared with all clusters. The number of attributes in a cluster is considered the degree of the cluster, and the degree depends on the number of instances.
- The distance values are then converted into matrix form to normalize them. After normalization, the result is a single digit: ‘1’ if the text belongs to the respective cluster and ‘0’ if it does not. This matrix method is applied directly to the training set so that the clusters can be easily compared with all the classes.
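
The distance and membership steps above can be illustrated with a small Python sketch. It is a loose illustration under stated assumptions, not the procedure of the cited work: it assumes that each cluster is represented by the mean (centroid) of its instances, that the new instance belongs to the cluster with the smallest Euclidean distance, and that the ‘1’/‘0’ output is a row with a single ‘1’ for that cluster. The function names and the toy clusters are hypothetical.

```python
import math

def centroid(points):
    """Mean of a list of equal-length numeric vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def membership_row(clusters, instance):
    """Return the distances to each cluster centroid and a 0/1 membership row."""
    centroids = [centroid(cluster) for cluster in clusters]
    distances = [math.dist(c, instance) for c in centroids]  # Euclidean distance
    nearest = distances.index(min(distances))
    row = [1 if i == nearest else 0 for i in range(len(clusters))]
    return distances, row

# Two toy clusters of two-dimensional instances (hypothetical data).
clusters = [
    [[0, 0], [0, 2]],
    [[10, 10], [12, 10]],
]
new_instance = [1, 1]

distances, row = membership_row(clusters, new_instance)
```

Stacking one such row per instance gives a 0/1 matrix with one column per cluster, which is one way to read the matrix form mentioned in the last step.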

The proposed method is applied to datasets from the UCI Machine Learning Repository. Twenty datasets are used, such as BC, COLIC, HEPAT and so on. BC has 286 instances, 9 attributes and 2 classes; the other datasets likewise consist of instances, attributes and classes. The output of the base classifier is combined with the other classifiers: for example, the KNN classifier is combined with MV, and MV is combined with the cluster-based method. The work shows that this novel method increases efficiency in terms of classification time. It also introduces a new instance-based classifier ensemble, which combines the outputs of multiple classifiers to improve classification performance.

Chklovski et al. ³⁵ describe a multilingual lexical method based on Senseval-3 SemCor. It provides the Hindi translation of ambiguous English words. The Open Mind Word Expert (OMWE) corpus, which provides Hindi translations of English words, is used to evaluate the proposed method. The multilingual method uses a training set consisting of 18 nouns, 15 verbs and 8 adjectives. These are English words with several meanings, together with their Hindi translations. Each English word is translated into Hindi according to the sentence, so that the translation reflects the accurate meaning of the ambiguous word. The translation is calculated according to the instance value of the word. The work focuses on the steps involved in translating the meaning of a word into another language and checks the feasibility of the multilingual method by determining its accuracy.

Dhopavkar et al. ³⁶ propose a word sense disambiguation method for the Marathi language. It uses a rule-based method for disambiguating Marathi words. It creates a dataset for Marathi and forms an entropy model for classifying the ambiguous word. The method determines the sense of a Marathi word according to the part of speech with which it is used in the sentence, such as noun or adverb. The entropy model is used to calculate the relation between different words and their senses, and the sense with the maximum entropy value is selected as the correct answer. The work reports that the rule-based method gives 80% accuracy for Marathi.

Lai et al.³⁷ address the increasing volume of unsolicited bulk e-mail (also known as spam), which has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. The work reviews some relevant ideas and conducts a set of systematic experiments on e-mail categorization with four machine learning algorithms applied to different parts of an e-mail. The experimental results reveal that the header of an e-mail provides very useful information for all the machine learning algorithms considered for detecting spam.

Han et al. ³⁸ address name ambiguity in citations: because of name abbreviations, identical names, name misspellings and pseudonyms in publications or bibliographies, an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, Web search and database integration, and may cause improper attribution to authors. The work investigates two supervised learning approaches to disambiguate authors in citations. One uses the naive Bayes probability model, a generative model; the other uses support vector machines (SVMs) and the vector space representation of citations, a discriminative model. Both approaches utilize three types of citation attributes: coauthor names, the title of the paper, and the title of the journal or proceedings. The two approaches are illustrated on two types of data, one collected from the Web, mainly publication lists from homepages, and the other collected from the DBLP citation databases.

Navigli et al. ³⁹ describe word sense disambiguation (WSD) as traditionally an AI-hard problem. A breakthrough in this field would have a significant impact on many relevant Web-based applications, such as Web information retrieval, improved access to Web services and information extraction. Early approaches to WSD, based on knowledge representation techniques, have been replaced in the past few years by more robust machine learning and statistical techniques. The results of recent comparative evaluations of WSD systems, however, show that these methods have inherent limitations.

On the other hand, the increasing availability of large-scale, rich lexical knowledge resources seems to provide new challenges to knowledge-based approaches. The work presents a method called structural semantic interconnections (SSI), which creates structural specifications of the possible senses for each word in a context and selects the best hypothesis according to a grammar G describing relations between sense specifications. Sense specifications are created from several available lexical resources that the authors integrated in part manually and in part with the help of automatic procedures. The SSI algorithm has been applied to different semantic disambiguation problems, such as automatic ontology population, disambiguation of sentences in generic texts, and disambiguation of words in glossary definitions. Evaluation experiments have been performed on specific knowledge domains (e.g., tourism, computer networks, enterprise interoperability), as well as on standard disambiguation test sets.

Berger et al. ⁴⁰ observe that the concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the wide-scale application of this concept to real-world problems in statistical estimation and pattern recognition. The work describes a method for statistical modeling based on maximum entropy. It presents a maximum-likelihood approach for automatically constructing maximum entropy models and describes how to implement this approach efficiently, using several problems in natural language processing as examples.

Banko et al. ⁴¹ observe that the amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. The work evaluates the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. For this particular application, correctly labeled training data is free. Since this will often not be the case, the work also examines methods for effectively exploiting very large corpora when labeled data comes at a cost.

Yarowsky et al. ⁴² describe a program that disambiguates English word senses in unrestricted text using statistical models of the major Roget's Thesaurus categories. Roget's categories serve as approximations of conceptual classes. The categories listed for a word in Roget's index tend to correspond to sense distinctions; thus selecting the most likely category provides a useful level of sense disambiguation. The selection of categories is accomplished by identifying and weighting words that are indicative of each category when seen in context, using a Bayesian theoretical framework. Other statistical approaches have required special corpora or hand-labeled training examples for much of the lexicon. The use of class models overcomes this knowledge acquisition bottleneck, enabling training on unrestricted monolingual text without human intervention. Applied to the 10 million word Grolier's Encyclopedia, the system correctly disambiguated 92% of the instances of 12 polysemous words that have been previously studied in the literature.
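
The idea of selecting the most likely category from the words seen in context can be sketched with a small Bayesian example in Python. This is a simplified illustration, not the weighting scheme of the cited program: the category names, the toy word lists standing in for category text, the add-one smoothing, the uniform prior over categories and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

def train_category_models(category_texts):
    """Estimate P(word | category) with add-one smoothing over a shared vocabulary."""
    vocab = {w for text in category_texts.values() for w in text.split()}
    models = {}
    for category, text in category_texts.items():
        counts = Counter(text.split())
        total = sum(counts.values()) + len(vocab)
        models[category] = {w: (counts[w] + 1) / total for w in vocab}
    return models, vocab

def most_likely_category(context_words, models, vocab):
    """Sum log-probabilities of known context words; uniform prior assumed."""
    scores = {
        category: sum(math.log(probs[w]) for w in context_words if w in vocab)
        for category, probs in models.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical text indicative of two categories for the ambiguous word "crane".
category_texts = {
    "ANIMAL": "bird wing nest fly feather bird",
    "MACHINE": "lift load steel engine construction load",
}
models, vocab = train_category_models(category_texts)

context = "the crane lifted the steel load".split()
best, scores = most_likely_category(context, models, vocab)
```

Words outside the vocabulary are simply ignored here, so only `steel` and `load` contribute to the scores for this sentence, and both are more probable under the machine category.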

Lee et al. ⁴³ evaluate a variety of knowledge sources and supervised learning algorithms for word sense disambiguation on SENSEVAL-2 and SENSEVAL-1 data. The knowledge sources include the part of speech of neighboring words, single words in the surrounding context, local collocations, and syntactic relations. The learning algorithms evaluated include Support Vector Machines (SVM), Naive Bayes, AdaBoost, and decision tree algorithms. The work presents empirical results showing the relative contribution of the component knowledge sources and the different learning algorithms. In particular, using all of these knowledge sources and SVM (i.e., a single learning algorithm) achieves accuracy higher than the best official scores on both SENSEVAL-2 and SENSEVAL-1 test data.

Tratz et al. ⁴⁴ present a supervised classification approach for the disambiguation of preposition senses. They use the SemEval Preposition Sense Disambiguation datasets to evaluate their system and compare its results with those of the systems that participated in the workshop. They derive linguistically motivated features from both sides of the preposition; instead of restricting these to a fixed window size, they utilize the phrase structure. Testing with five different classifiers, they reach an accuracy that outperforms the best system in the SemEval task.

Yarowsky et al. ⁴⁵ present an unsupervised learning algorithm for sense disambiguation that, when trained on unannotated English text, rivals the performance of supervised techniques that require time-consuming hand annotations. The algorithm is based on two powerful constraints, that words tend to have one sense per discourse and one sense per collocation, exploited in an iterative bootstrapping procedure. Tested accuracy exceeds 96%.

Ide et al. ⁴⁶ present an unsupervised approach for disambiguating between the various senses of a word, selecting the most appropriate sense based on the context in the text. They define a Word Sense Disambiguation (WSD) system based on Probabilistic Latent Semantic Analysis (PLSA), for which sense tagged annotated data is not required for training. The system is language independent, giving 83% and 74% accuracy for English and Hindi respectively. Their word sense disambiguation experiments also show that the performance of the system can be further enhanced by using WordNet in the algorithm.

Ng et al. ⁴⁷ propose a new approach for word sense disambiguation (WSD) using an exemplar-based learning algorithm. This approach integrates a diverse set of knowledge sources to disambiguate word sense, including the part of speech of neighboring words, morphological form, the unordered set of surrounding words, local collocations, and verb-object syntactic relations. The WSD program, named LEXAS, was tested both on a common data set used in previous work and on a large sense-tagged corpus that the authors constructed separately. LEXAS achieves a higher accuracy on the common data set, and performs better than the most frequent heuristic on the highly ambiguous words in the large corpus tagged with the refined senses of WORDNET.

Shalabi et al. ⁴⁸ describe text classification, the task of assigning a document to one or more predefined categories based on its contents. They report the results of classifying Arabic language documents with the KNN classifier, once using N-grams (namely unigrams and bigrams) for document indexing, and once using the traditional single-term indexing method (bag of words), which supposes that the terms in the text are mutually independent, which is not the case. The results show that using N-grams produces better accuracy than using single terms for indexing: the average accuracy with N-grams is 0.7357, while with single-term indexing it is 0.6688.

Zhu et al. ⁴⁹ address two issues of active learning. First, to solve the problem that uncertainty sampling often fails by selecting outliers, they present a new selective sampling technique, sampling by uncertainty and density (SUD), in which a k-Nearest-Neighbor-based density measure is adopted to determine whether an unlabeled example is an outlier. Second, a technique of sampling by clustering (SBC) is applied to build a representative initial training data set for active learning. Finally, they implement a new active learning algorithm with the SUD and SBC techniques. The experimental results on three real-world data sets show that their method outperforms competing methods, particularly at the early stages of active learning.

Wang et al. ⁵⁰ describe a method for combining different classifiers to obtain better results, noting that classifier combination is a promising way to improve the performance of word sense disambiguation. They first construct a series of Naïve Bayesian classifiers along a sequence of context windows of orderly varying size, and perform sense selection for both training samples and test samples using these classifiers. This yields a sense selection trajectory along the sequence of context windows for each sample. These trajectories are then used to make the final K-Nearest Neighbors-based sense selection for the test samples. The method aims to lower the uncertainty introduced by classifiers that use different context windows and to make more robust use of context while still performing well. Experiments show that the approach outperforms some other algorithms in both robustness and performance.

Brown et al. ⁵¹ present a method for the resolution of lexical ambiguity of nouns and its automatic evaluation over the Brown Corpus. The method relies on the use of the wide-coverage noun taxonomy of WordNet and the notion of conceptual distance among concepts, captured by a Conceptual Density formula developed for this purpose. This fully automatic method requires no hand coding of lexical entries, no hand tagging of text, and no training process of any kind. The results of the experiments have been automatically evaluated against SemCor, the sense-tagged version of the Brown Corpus.

Ponzetto et al. ⁵² note that one of the main obstacles to high-performance Word Sense Disambiguation is the knowledge acquisition bottleneck. They present a methodology to automatically extend WordNet with large amounts of semantic relations from an encyclopedic resource, namely Wikipedia. They show that, when provided with a vast amount of high-quality semantic relations, simple knowledge-lean disambiguation algorithms compete with state-of-the-art supervised WSD systems in a coarse-grained all-words setting and outperform them on gold-standard domain-specific datasets.

Mihalcea et al. ⁵³ note that selecting the most appropriate sense for an ambiguous word in a sentence is a central problem in Natural Language Processing. They present a method that attempts to disambiguate all the nouns, verbs, adverbs and adjectives in a text, using the senses provided in WordNet. The senses are ranked using two sources of information: the Internet, for gathering statistics on word-word co-occurrences, and WordNet, for measuring the semantic density of a pair of words. They report an average accuracy of 80% for the first ranked sense, and 91% for the first two ranked senses. Extensions of this method to larger windows of more than two words are considered.

Yarowsky et al. ⁵⁴ describe a supervised algorithm for word sense disambiguation based on hierarchies of decision lists. This algorithm supports a useful degree of conditional branching while minimizing the training data fragmentation typical of decision trees. Classifications are based on a rich set of collocational, morphological and syntactic contextual features, extracted automatically from training data and weighted according to the nature of the feature and feature class. The algorithm is evaluated comprehensively in the SENSEVAL framework, achieving the top performance of all participating supervised systems on the 36 test words where training data is available.

Rigau et al. ⁵⁵ present a method to combine a set of unsupervised algorithms that can accurately disambiguate word senses in a large, completely untagged corpus. Although most techniques for word sense resolution have been presented as stand-alone, the authors argue that full-fledged lexical ambiguity resolution should combine several information sources and techniques. The set of techniques has been applied in a combined way to disambiguate the genus terms of two machine-readable dictionaries (MRD), enabling the construction of complete taxonomies for Spanish and French. Tested accuracy is above 80% overall and 95% for two-way ambiguous genus terms, showing that taxonomy building is not limited to structured dictionaries such as LDOCE.

Agirre et al. ⁵⁶ describe a statistical technique for assigning senses to words. An instance of a word is assigned a sense by asking a question about the context in which the word appears. The question is constructed to have high mutual information with the translation of that instance in another language. When this method of assigning senses was incorporated into a statistical machine translation system, the error rate of the system decreased by thirteen percent.

CONCLUSION

A review of the literature on sense tagged words shows that several issues arise when disambiguating them, such as the choice of wordnet, learning method and classification technique. Two learning approaches are used for disambiguation: the machine learning approach and the knowledge-based approach. These approaches are used by classification techniques such as Naive Bayes, decision trees, K-Nearest Neighbor and neural networks for training on the dataset. It has been observed that the computational identification of the correct sense of a word has not yet been done using the K-Nearest Neighbor classification technique, although the technique has been used in other applications of Natural Language Processing such as text categorization and information retrieval. The K-Nearest Neighbor method is based on the supervised learning approach. Every natural language has sense tagged words, and a lot of work has been done on Malayalam, English and Marathi with various classification techniques, but little work has been published for Hindi. Therefore, a better approach is investigated to solve the problem of disambiguating Hindi sense tagged words and to evaluate the correctness of the results.

CHAPTER 4 PRESENT WORK

After reviewing the literature, it is observed that several learning approaches and classification approaches are used for sense tagged words, and that the ambiguity present in sense tagged words is a complex problem in the field of WSD. The KNN algorithm is used in many applications such as text categorization and text recognition, and it can also be applied to sense tagged words. This technique has not yet been used for disambiguating such words.

4.1 Problem Statement

Sense tagged words have more than one meaning, where a sense is a definition of a word that reflects the basic concepts and lexical knowledge related to it. These words have become a central question in computational linguistics because they give rise to the problem of lexical ambiguity. Lexical ambiguity arises when a single word with numerous denotations is used in different contexts, so that the reader is unsure which meaning of the word is intended. Humans can easily understand the correct sense of a word by reading the sentence, but for a machine it is difficult to judge the correct sense.

A machine is trained using machine learning approaches to solve this problem. In the machine learning approach, word sense disambiguation is treated as word translation, that is, translating the word into its correct meaning. An automatic system for sense tagged words plays a vital role in machine translation. Several sense inventories and wordnets have been introduced for many natural languages to obtain the semantic meaning of words. Various machine learning approaches are used for removing the ambiguity, for example Naive Bayes, decision lists and neural networks. These approaches are mostly based on knowledge-based and supervised learning methods when they are applied to determine the semantic meaning of a word. A lot of work has been done with these machine learning approaches in many natural languages, such as English, Malayalam, Persian and Marathi.

The KNN approach has not yet been used for solving this problem, especially for the Hindi language, and little work has been done on Hindi sense tagged words. The proposed work therefore uses the KNN approach to classify Hindi sense tagged words and to judge the correct sense of the target word in different Hindi sentences.

4.2 Objectives

The focus of the proposed work is to achieve the following objectives:

- To create a training set for the Hindi language.
- To identify the correct sense of an ambiguous word used in different sentences and in different contexts for the created dataset.
- To disambiguate the word sense more reliably, as measured by parameters such as accuracy percentage and execution time.

4.3 Design and Implementation

The proposed work identifies the accurate sense of a word. The KNN approach is used to disambiguate the sense tagged words, and its performance is evaluated on 10 Hindi sense tagged words. These words and their meanings are collected from the Hindi Wordnet, which is prepared by IIT Bombay.

To achieve the proposed objectives, sentences are created for the Hindi sense tagged words: a separate paragraph is prepared for each sense of a particular word, and this is repeated for every Hindi sense tagged word. The KNN approach is then applied to classify the context and obtain the precise meaning of the word in that context. The proposed algorithm is implemented in Java. The accuracy of K-Nearest Neighbor is determined after calculating the values of the parameters.

4.3.1 Methodology

The ambiguity is removed from a sense tagged word using the following steps:

(i) A dataset of the Hindi language is created.
(ii) The algorithm based on KNN is applied to identify the sense of the word.
(iii) The feasibility is tested using parameters such as accuracy percentage and execution time, through comparison with existing techniques.

4.3.2 Creating a dataset for Hindi sense tagged words

The Hindi sense tagged words are used to evaluate the results. In this work, 10 polysemous Hindi words are used, and the semantic meanings of these words are obtained from the Hindi Wordnet. Table 4.1 shows the Hindi polysemous words with their total number of senses.

Table 4.1 Hindi sense tagged words

[Table not included in this excerpt]

Different sentences have been created for every sense of a particular word. For example, ‘ि◌ͬच’ is a Hindi sense tagged word, and sentences based on each of its senses are given in Table 4.2.

Table 4.2 Sentences of Hindi Word ‘ि◌ͬच’

हमअͪपमाता वपता क ि◌ͬच का पाͧल कǐरा चाहहए िइक ि◌ͬच को ͧसकǽ हȣ जीि◌ि◌ मआग बड़ा जाता ह, हमसबक साथ सहȣ ि◌ͬच ि◌◌ोͧलचाहहए।

राम ि◌◌महाक साथ ककया हआ ि◌ͬच Ǔनभाया ,एक सच लमğ ि◌हȣ होता हजो अͪपहर ि◌ͬच को Ǔनभाया मोǑह जब भी ककसी समèया महोता तब तब राम ि◌◌उसका साथ दक अͪपा ि◌ͬच Ǔनभाया।

4.3.3 KNN Approach for Text Categorization

Text categorization is a technique that assigns a predefined category to an unknown text. Data needs to be categorized in a systematic manner so that it can be retrieved efficiently and reused easily. In the proposed work, the KNN approach is used to determine the category to which a text belongs. The cosine distance measure is used to check similarity by calculating the distance between the unspecified text and each category in the training set, and the category with the shortest distance is selected as the correct category of the unspecified text.

Algorithm

Input: unspecified text, training set

Output: categorized text

[Algorithm listing not included in this excerpt]
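The categorization step described in this section can be illustrated with a minimal Java sketch. It is not the original listing: the class name `KnnTextCategorizer`, the helper type `Example`, whitespace tokenization and the majority vote over `k` neighbors are all assumptions made for illustration. Calling `categorize` with `k = 1` corresponds to selecting the single shortest cosine distance, as described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a k-nearest-neighbour text categorizer using cosine distance over term counts. */
final class KnnTextCategorizer {

    /** One labelled training text (hypothetical helper type, not from the original work). */
    static final class Example {
        final String category;
        final Map<String, Integer> terms;

        Example(String category, Map<String, Integer> terms) {
            this.category = category;
            this.terms = terms;
        }
    }

    private final List<Example> trainingSet = new ArrayList<>();

    /** Splits the text on whitespace and counts how often each term occurs. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine distance = 1 - cosine similarity; returns 1.0 when either text is empty. */
    static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int x = e.getValue();
            normA += x * x;
            dot += x * b.getOrDefault(e.getKey(), 0);
        }
        for (int y : b.values()) {
            normB += y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    /** Adds one labelled text to the training set. */
    void add(String category, String text) {
        trainingSet.add(new Example(category, termCounts(text)));
    }

    /** Returns the majority category among the k nearest training texts, or null if there are none. */
    String categorize(String text, int k) {
        final Map<String, Integer> query = termCounts(text);
        List<Example> sorted = new ArrayList<>(trainingSet);
        // Nearest first; List.sort is stable, so ties keep training order.
        sorted.sort((x, y) -> Double.compare(
                cosineDistance(query, x.terms), cosineDistance(query, y.terms)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (Example ex : sorted.subList(0, Math.min(k, sorted.size()))) {
            int v = votes.merge(ex.category, 1, Integer::sum);
            if (best == null || v > votes.get(best)) {
                best = ex.category;
            }
        }
        return best;
    }
}
```

A distance of 0 means the two term vectors point in the same direction and a distance of 1 means they share no terms, so sorting in ascending order places the most similar training text first.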

4.3.4 KNN Approach for Sense Tagged Words

Once the dataset is created for the Hindi sense tagged words, the next step is to apply the proposed KNN approach. In this step, the context is classified: the proposed approach reads the sentence, recognizes the target word, and determines the right sense of the Hindi word. The target word is thus disambiguated by identifying its correct meaning.

The proposed method creates the feature set. This feature set consists of two sets: the target set, which is a collection of target words, and the neighbor set, which is a collection of the neighbor words in the sentence. The approach then compares the target word with the neighbor words and the training set to judge the proper denotation of the word.

Algorithm

Input: Hindi dataset (H), training set (t).

Output: correct sense of target word.

[Algorithm listing not included in this excerpt]
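As a rough illustration of how a target word and its neighbor set could drive sense selection, the following Java sketch tags one occurrence of a target word with the sense of its nearest training contexts. It is a sketch under stated assumptions rather than the original algorithm: the class `KnnSenseTagger`, the type `TaggedContext`, the use of cosine distance over binary word sets, and the majority vote are all hypothetical choices, and the sentences in the usage note are invented, transliterated examples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: choose the sense of one target word from its k nearest sense-tagged contexts. */
final class KnnSenseTagger {

    /** A training context: the neighbour words of the target word, tagged with a sense. */
    static final class TaggedContext {
        final String sense;
        final Set<String> neighbours;

        TaggedContext(String sense, Set<String> neighbours) {
            this.sense = sense;
            this.neighbours = neighbours;
        }
    }

    private final String target;
    private final List<TaggedContext> training = new ArrayList<>();

    KnnSenseTagger(String target) {
        this.target = target;
    }

    /** Neighbour set: every whitespace-separated token of the sentence except the target word. */
    Set<String> neighbours(String sentence) {
        Set<String> words = new HashSet<>(Arrays.asList(sentence.trim().split("\\s+")));
        words.remove(target);
        words.remove("");
        return words;
    }

    /** Adds one sense-tagged example sentence to the training set. */
    void train(String sense, String sentence) {
        training.add(new TaggedContext(sense, neighbours(sentence)));
    }

    /** Cosine distance between two word sets treated as binary vectors. */
    static double distance(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 1.0;
        }
        int shared = 0;
        for (String w : a) {
            if (b.contains(w)) {
                shared++;
            }
        }
        return 1.0 - shared / Math.sqrt((double) a.size() * b.size());
    }

    /** Returns the majority sense among the k nearest training contexts, or null if untrained. */
    String disambiguate(String sentence, int k) {
        final Set<String> query = neighbours(sentence);
        List<TaggedContext> sorted = new ArrayList<>(training);
        sorted.sort((x, y) -> Double.compare(
                distance(query, x.neighbours), distance(query, y.neighbours)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (TaggedContext c : sorted.subList(0, Math.min(k, sorted.size()))) {
            int v = votes.merge(c.sense, 1, Integer::sum);
            if (best == null || v > votes.get(best)) {
                best = c.sense;
            }
        }
        return best;
    }
}
```

For example, a tagger created with `new KnnSenseTagger("vidhi")` and trained with `train("method", "puja ki vidhi pandit ne batayi")` and `train("law", "sansad ne nayi vidhi parit ki")` would compare a new sentence containing "vidhi" against both neighbor sets and return the sense of the closer one.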

4.3.5 Working of Algorithm

Figure 4.1 explains the working of the proposed algorithm with the help of a flow chart. The problem is to find the proper sense of sense tagged words, and the KNN approach is used to determine the precise meaning of the Hindi words. The training set of Hindi sense tagged words is initialized at the beginning, and the classification algorithm is then applied to resolve the ambiguity present in the words.

[Figure not included in this excerpt]

Figure 4.1: The flow diagram of the algorithm

During classification, the proposed approach compares the target word with its surrounding words in the sentence. It creates two datasets: the target word set, which contains the frequently occurring target words, and the feature set, which is a collection of the words surrounding the target word. When the feature set is built, punctuation and unwanted words are removed from the sentence; this is known as preprocessing of the sentence. Then the distance measurement formula is applied to determine the accurate sense of the sense tagged word.
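The preprocessing step can be sketched in Java as follows. This is only an illustration under assumptions: the class name `SentencePreprocessor`, the tiny transliterated stop-word list, and the choice to strip ASCII punctuation plus the Devanagari danda (U+0964) and double danda (U+0965) are not taken from the original work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of sentence preprocessing: strip punctuation, drop stop words, separate the target. */
final class SentencePreprocessor {

    // Illustrative, transliterated stop-word list (an assumption); a real list would be far longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("ka", "ki", "ke", "ko", "ne", "se", "hai"));

    /** Replaces ASCII punctuation and the Devanagari danda marks with spaces, then drops stop words. */
    static List<String> clean(String sentence) {
        String stripped = sentence.replaceAll("[\\p{Punct}\\u0964\\u0965]", " ");
        List<String> kept = new ArrayList<>();
        for (String token : stripped.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Feature set for one sentence: the cleaned tokens with every occurrence of the target removed. */
    static List<String> featureSet(String sentence, String target) {
        List<String> features = clean(sentence);
        features.removeAll(Collections.singleton(target));
        return features;
    }
}
```

Keeping the target word out of the feature set means that the distance computation depends only on the surrounding words, which is what distinguishes one sense from another.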

The KNN classification algorithm identifies the target word and then compares it with its several senses present in the training set. The distance to each sense of the target word is calculated to decide the proper meaning of the word, and the sense with the shortest distance is selected as the correct sense of the word.

CHAPTER 5 RESULTS AND DISCUSSIONS

The proposed work classifies sentences and identifies the correct sense of a word: the KNN approach classifies the sentence and judges the accurate meaning of the word. The proposed work is applied to two applications of NLP, text categorization and sense tagged words.

5.1 Text Categorization using KNN

Text categorization is a technique that assigns predefined categories to a text. It is used to retrieve the correct information related to the text from a large database, and it organizes the data so that information can be retrieved easily and accurately. The KNN approach has been used for categorizing the text: it is provided with the training set and is used to assign a category to the unspecified text.

When an unspecified text arrives, the approach compares that text with the training set and calculates the distance between each predefined class and the unspecified text. The category with the shortest distance is selected as the correct category of the unknown text.

Table 5.1 Distance Values of categories

[Table not included in this excerpt]

Table 5.1 shows the distance values obtained after categorizing the unknown text. The category with the shortest distance is selected as the right category of the unknown text and is considered its nearest category. The distance values are calculated by applying the distance formula; here the cosine distance method is used to measure the distance to all categories.

[Figure not included in this excerpt]

Figure 5.1: An output of Text categorization

Figure 5.1 shows an output of text categorization using the KNN approach. When an unspecified text is entered in the search box, the system determines the candidate categories of the unknown text and calculates the distance to each category. These distance values are used in determining the accuracy of the proposed approach. The category with the shortest distance is selected as the correct category, and the unspecified text is assigned to it. If an unknown text does not match any category, it is added to the database as a new class.

Table 5.2 Precision Value for Text Categorization

[Table not included in this excerpt]

Table 5.2 shows the precision values for text categorization. When these values are compared with the precision values for sense tagged words, it is found that the K-Nearest Neighbor approach performs better and gives more accurate results when it is applied to sense tagged words.

5.2 KNN Approach for Sense Tagged Words

The performance of the KNN algorithm is evaluated on the Hindi sense tagged words. The KNN approach first reads the sentences, stores the target word of every sentence in one dataset, and then stores the surrounding words of the target word in a separate feature dataset.

[Figure not included in this excerpt]

Figure 5.2 Correct sense of the Hindi word

Figure 5.2 shows the output of the proposed K-Nearest Neighbor approach. In context 1 and context 2 the ambiguous word is ‘विधि’; the KNN approach stores the target word ‘विधि’ in the frequent word dataset and its surrounding words in the feature dataset. The correct sense of the Hindi word ‘विधि’ is determined according to the sentences. The Hindi sense tagged word ‘विधि’ has two senses, ‘way of doing something’ and ‘law’. Similarly, the proposed approach determines the correct sense of the other Hindi sense tagged words. It classifies the sentences provided in the dataset and determines the correct sense of the 5 Hindi sense tagged words. In context 3 and context 4 the ambiguous word is ‘वर्ग’: in context 3 the word denotes ‘part of a thing’ and in context 4 ‘category’. The proposed method classifies the other sentences and determines their accurate meaning according to the sentence in which the sense tagged word occurs.

Table 5.3 Precision Value

[Table not included in this excerpt]

Precision is defined as the number of sense tagged words that are classified correctly according to the context, divided by the total number of words classified. Table 5.3 shows the precision values of the proposed approach. The precision values of K-Nearest Neighbor are calculated from the set of target words and the set of neighbor words of each target word. The approach counts the number of times the word occurs in the sentence and computes the distance of the target word from each of its senses. The precision value is reported as the ratio of the total distance value to the distance value of the particular sense of the target word.
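For reference, the count-based reading of precision (correctly classified occurrences over classified occurrences) can be written as a short Java sketch. The class name `PrecisionMetric` and the convention that a `null` prediction means "no sense assigned" are assumptions for illustration; the sketch does not reproduce the distance-ratio variant mentioned in the text.

```java
/** Sketch: precision as correctly tagged occurrences over occurrences that received a sense. */
final class PrecisionMetric {

    /**
     * @param predicted sense assigned to each occurrence, or null if none was assigned
     * @param gold      reference sense of each occurrence
     * @return correct / attempted, or 0.0 when nothing was attempted
     */
    static double precision(String[] predicted, String[] gold) {
        if (predicted.length != gold.length) {
            throw new IllegalArgumentException("predicted and gold must have the same length");
        }
        int attempted = 0;
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == null) {
                continue; // occurrence left untagged: not counted as attempted
            }
            attempted++;
            if (predicted[i].equals(gold[i])) {
                correct++;
            }
        }
        return attempted == 0 ? 0.0 : (double) correct / attempted;
    }
}
```

With three tagged occurrences of which two match the reference sense, this function returns 2/3.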

5.3 Comparison between Text Categorization and Sense Tagged Words

Table 5.4 Precision value for Text Categorization

[Table not included in this excerpt]

Table 5.4 and Table 5.5 show the precision values of the KNN algorithm for two different applications, text categorization and sense tagged words respectively. In text categorization, precision is calculated for 5 unspecified texts, where the precision value for an unspecified text depends on its distance from the categories.

Table 5.5 shows the precision values for sense tagged words. Here the KNN approach is used to detect the sense of Hindi words, and the precision value is calculated over 5 sense tagged words. It has been observed that the KNN approach has higher accuracy for sense tagged words than for text categorization.

CHAPTER 6 CONCLUSION AND FUTURE WORK

The proposed work addresses the task of disambiguating Hindi sense tagged words. The K-Nearest Neighbor approach is used to determine the correct sense of a word according to the sentence, and the proposed work provides the accurate meaning of Hindi words that occur in different contexts.

6.1 Conclusion

In the proposed work, the semantic meaning of sense tagged words is identified. The KNN method, which is based on the supervised learning approach, is used to determine the correct sense of a word. Disambiguating words has become a fundamental task in computational linguistics. The proposed approach is applied to the Hindi language. It is provided with different Hindi sentences that contain different sense tagged words with their various meanings, and the semantic meanings of the sense tagged words are defined in the training dataset. The input is the target word to be disambiguated together with the surrounding words in the sentence. The proposed approach forms two datasets: the feature set, in which the surrounding words are stored, and the target word set, in which the ambiguous word is stored. It compares the target word with the given training set and collects the various meanings of the target word, and then compares these meanings with the surrounding words in the paragraph. Using a distance measure, the KNN approach calculates the distance between the target word and each of its senses present in the training set. The sense with the smallest distance is selected as the correct sense of the word.

The performance of the proposed approach is evaluated on 10 Hindi sense tagged words, whose semantic meanings are obtained from the Hindi Wordnet. The objective is to obtain the correct sense of a word using the K-Nearest Neighbor approach and thereby improve the classification of the word. The proposed method leads to some promising improvement in identifying the correct meaning of sense tagged words.

6.2 Future Work

The limitation of the proposed approach is that it classifies a fixed-size dataset, and it is difficult to calculate the optimal value of ‘k’ for a large dataset. As a further extension of this work, a new method can be formulated for calculating the optimal value of ‘k’ for large data. The proposed approach can also be applied to a larger database to check its accuracy on other words.

The proposed method uses the Hindi Wordnet; it can be extended to other wordnets. The K-Nearest Neighbor approach can also be applied to morphological words to determine their meaning. As K-Nearest Neighbor has been applied here to the Hindi language, it can be applied to other languages to check the accuracy of the proposed algorithm.
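As a baseline for such future work, one conventional way to pick ‘k’ on a small dataset is leave-one-out cross-validation: each example is classified by its nearest neighbors among the remaining examples, and the ‘k’ with the highest accuracy is kept. The Java sketch below is not the new method envisaged above; the class name `KSelector`, the precomputed distance matrix `dist`, and the restriction to odd values of ‘k’ are assumptions. Its cost grows quadratically with the number of examples, which is why it does not directly address the large-dataset case.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Sketch: choose k for k-NN by leave-one-out cross-validation over a distance matrix. */
final class KSelector {

    /** Fraction of examples whose label matches the majority label of their k nearest other examples. */
    static double looAccuracy(double[][] dist, String[] labels, int k) {
        int n = labels.length;
        int correct = 0;
        for (int i = 0; i < n; i++) {
            final int row = i;
            // Indices of all examples except i, sorted by distance from i (stable sort).
            Integer[] order = new Integer[n - 1];
            for (int j = 0, p = 0; j < n; j++) {
                if (j != i) {
                    order[p++] = j;
                }
            }
            Arrays.sort(order, (a, b) -> Double.compare(dist[row][a], dist[row][b]));
            Map<String, Integer> votes = new HashMap<>();
            String best = null;
            for (int r = 0; r < Math.min(k, order.length); r++) {
                String label = labels[order[r]];
                int v = votes.merge(label, 1, Integer::sum);
                if (best == null || v > votes.get(best)) {
                    best = label;
                }
            }
            if (labels[i].equals(best)) {
                correct++;
            }
        }
        return n == 0 ? 0.0 : (double) correct / n;
    }

    /** Tries odd k from 1 to maxK and returns the smallest k with the highest accuracy. */
    static int bestK(double[][] dist, String[] labels, int maxK) {
        int best = 1;
        double bestAccuracy = -1.0;
        for (int k = 1; k <= maxK; k += 2) {
            double accuracy = looAccuracy(dist, labels, k);
            if (accuracy > bestAccuracy) {
                bestAccuracy = accuracy;
                best = k;
            }
        }
        return best;
    }
}
```

Odd values of ‘k’ are tried so that a two-sense vote cannot end in a tie; with more than two senses ties remain possible and are resolved here in favor of the sense that reaches the winning count first.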

REFERENCES

[...]

¹ Satyendr Singh, Vivek Kumar Singh, and Tanveer J. Siddiqui, "Hindi Word Sense Disambiguation Using Semantic Relatedness Measure", Springer Berlin Heidelberg, Multidisciplinary Trends in Artificial Intelligence, vol. 8271, pp. 247-256, Dec 2012.

² Satyendr Singh, and Tanveer J. Siddiqui, "Evaluating Effect of Context Window Size, Stemming and Stop Word Removal on Hindi Word Sense Disambiguation", IEEE International Conference on Information Retrieval & Knowledge Management (CAMP), pp. 13-15, March 2012.

³ Miloud Aouidate, Amal and Ahmed Riadh Baba-Ali, "Ant Colony Prototype Reduction Algorithm for KNN Classification", IEEE 15th International Conference on Computational Science and Engineering (CSE), vol. 289, no. 294, pp. 5-7, Dec 2012.

⁴ Chen, Ping, Chris Bowes, Wei Ding, and Max Choly, "Word Sense Disambiguation with Automatically Acquired Knowledge", IEEE Intelligent Systems, vol. 27, no. 4, pp. 46-55, Aug 2012.

⁵ Rezapour, A. R., S. M. Fakhrahmad, and M. H. Sadreddini, "Applying Weighted KNN to Word Sense Disambiguation", Proceeding International Conference of the World Congress on Engineering, vol. 3, July 2011.

⁶ Bamman, David, and Gregory Crane, "Measuring Historical Word Sense Variation", Proceeding IEEE 11th Annual International Conference on Digital libraries ACM, June 2011.

⁷ Jabbari, Fattaneh, Hossein Sameti, and Mohammad Hadi Bokaei, "Unilateral Semi-Supervised Learning of Extended Hidden Vector State for Persian Language Understanding", IEEE 7th International Conference on Natural Language Processing and Knowledge Engineering, vol. 165, no. 168, pp. 27-29, Nov 2011.

⁸ Haroon, Rosna P., "Malayalam Word Sense Disambiguation", IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), vol. 1, no. 4, pp. 28-29, Dec 2010.

⁹ Garcia, Eric, Sergey Feldman, Maya R. Gupta, and Santosh Srivastava, "Completely Lazy Learning", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 9, pp. 1274-1285, Sept 2010.

¹⁰ Navigli, Roberto, "Word Sense Disambiguation: A survey", ACM Computing Surveys (CSUR), vol. 41, no. 2, Feb 2009.

¹¹ Hwang, Wen-Jyi, and Kuo-Wei Wen, "Fast KNN Classification Algorithm Based on Partial Distance Search", IEEE conference on Electron Lett, vol. 34, no. 21, pp. 2062-2063, Oct 1998.

¹² Lijun Wang, Xiqing Zhao, "Improved KNN Classification Algorithms Research in Text Categorization", IEEE 2nd International Conference on Computer Electronics, Communication and Networks (CECNet), vol. 1848, no. 1852, pp. 21-23, April 2012.

¹³ Wajeed M.A, Adilakshmi T., "Building Clusters with Distributed Features for Text Classification using KNN", IEEE International Conference on Computer Communication and Informatics (ICCCI), pp. 1-6, Jan. 2012.

¹⁴ Tomar and Gaurav, "Probabilistic Latent Semantic Analysis for Unsupervised Word Sense Disambiguation", International Journal of Computer Science Issues (IJCSI), vol. 10, no. 2, Sept 2013.

¹⁵ Wajeed M.A, Adilakshmi T., "Different Similarity Measures for Text Classification using KNN", IEEE 2nd International Conference Computer and Communication Technology (ICCCT), pp. 41-45, Sept 2011.

¹⁶ Qamar A.M, Gaussier E, Chevallet J.P, Joo Hwee Lim, "Similarity Learning for Nearest Neighbor Classification", IEEE International Conference on Data Mining (ICDM), pp. 983-988, Dec 2008.

¹⁷ Garcia E.K, Feldman S., Gupta M.R, Srivastava S, "Completely Lazy Learning", IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 1274-1285, Sept 2010.

¹⁸ Qingmiao Wang, Shiguang Ju, "A Mixed Classifier Based on Combination of HMM and KNN", IEEE Fourth International Conference on Natural Computation, vol. 4, pp. 38-42, Oct 2008.

¹⁹ Bell D.A, Guan J.W, Yaxin Bi, "On combining Classifier Mass Functions for Text Categorization", IEEE Transactions on Knowledge and Data Engineering, vol. 10, pp. 1307-1319, Oct 2005.

²⁰ Rong-Lu Li, Yun-Fa Hu, "Noise reduction to Text Categorization based on Density for KNN", IEEE International Conference on Machine Learning and Cybernetics, vol. 5, pp. 3119-3124, Nov 2003.

²¹ Gondy Leroy and Thomas C. Rindflesch, "Effects of information and machine learning algorithm on word sense disambiguation with small datasets", Proc. Elsevier, pp. 573-585, Aug 2005.

²² Wei Zhang and Fang Gao, "An Improvement of Naïve Bayes for Text Classification Effects", Proc. Elsevier, pp. 573-585, 2005.

²³ Philip Resnik and David Yarowsky, "A Perspective on Word Sense Disambiguation and Their Evaluation", Proceeding IEEE, 2008.

²⁴ Lishuang Li, Linmei Jing and Degen Huang, "Protein-Protein Interaction Extraction from bio medical literatures based on modified SVM-KNN", IEEE International conference on Natural Language Processing and knowledge Engineering, pp. 1-7, Sept 2009.

²⁵ Dorr, Bonnie J., Eduard H. and Lori S. Levin, "Natural Language Processing and Machine Translation Encyclopedia of Language and Linguistics", (ELL2) Machine Translation: Interlingual Methods, Proceeding International Conference of the World Congress on Engineering, 2004.

²⁶ Carenini and Michele, "Improving communication in E-democracy using Natural Language Processing", IEEE International Conference on Intelligent Systems, 2007.

²⁷ Stone and Allen, "Natural-language processing for Intrusion Detection", IEEE International Conference on Intelligent Systems, Sept 2007.

²⁸ Mills, M.T, Bourbakis, N.G., "Graph-Based Methods for Natural Language Processing and Understanding-A Survey and Analysis", IEEE Transactions on Intelligent System, vol. 22, no. 1, pp. 20-27, Feb 2014.

²⁹ Pham, Thanh Phong, Hwee Tou Ng, and Wee Sun Lee, "Word sense disambiguation with semi-supervised learning", Proceedings of the International conference on artificial intelligence, vol. 20, no. 3, July 2005.

³⁰ Banerjee, Satanjeev, and Ted Pedersen, "An adapted Lesk algorithm for word sense disambiguation using WordNet", Springer Berlin Heidelberg on Computational linguistics and intelligent text processing, pp. 136-145, 2002.

³¹ Sidhu, Gurleen Kaur, and Navjot Kaur, "Role of Machine Translation and Word Sense Disambiguation in Natural Language Processing", IOSR Journal of Computer Engineering, vol. 11, June 2013.

³² Canas, Alberto J., "Using WordNet for word sense disambiguation to support concept map construction", Springer Berlin Heidelberg on String Processing and Information Retrieval, 2003.

³³ Fulmari, Abhishek, and Manoj B. Chandak, "A Survey on Supervised Learning for Word Sense Disambiguation", International Journal of Advanced Research in Computer & Communication Engineering, pp. 167-170, Dec 2013.

³⁴ Jurek A., Yaxin B, Shengli Wu, Nugent, C., "A Cluster-Based Classifier Ensemble as an Alternative to the Nearest Neighbor Ensemble", IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), vol. 1, pp. 1100-1105, Nov 2012.

³⁵ Chklovski, Timothy and Anatolievich, "The SENSEVAL-3 multilingual English-Hindi lexical sample task", International Journal of Advanced Research in Computer & Communication Engineering, July 2004.

³⁶ Gauri Dhopakar, Manali Kashirsagar and Latesh Malik, “Handling Word Sense Disambiguation Marathi Using Rule Based Approach”, International Conference on Industrial Automation and Computing, April 2014.

³⁷ Chih-Chin Lai and Ming-Chi Tsai, "An empirical performance comparison of machine learning methods for spam e-mail categorization", IEEE Fourth International Conference on Hybrid Intelligent Systems, vol. 44, no. 48, pp. 5-8, Dec 2004.

³⁸ Hui Han, Giles, L., Hongyuan Zha, Li, C. and Tsioutsiouliklis K., "Two supervised learning approaches for name disambiguation in author citations", Proceedings of the Joint ACM/IEEE Conference on Digital Libraries, vol. 296, no. 305, pp. 7-11, June 2004.

³⁹ Navigli, R. and Velardi, Paola, "Structural semantic interconnections: a knowledge-based approach to word sense disambiguation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 7, pp. 1075-1086, July 2005.

⁴⁰ Berger, Adam L., Vincent J. Della Pietra, and Stephen A. Della Pietra, "A maximum entropy approach to natural language processing", International Conference on Computational linguistics, no. 1, pp. 39-71, 1996.

⁴¹ Banko, Michele and Eric Brill, "Scaling to very large corpora for natural language disambiguation", Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, July 2001.

⁴² Yarowsky, David, "Word-sense disambiguation using statistical models of Roget's categories trained on large corpora", Proceedings of the 14th conference on Computational linguistics, vol. 2, pp. 454-460, August 1992.

⁴³ Lee, Y. K., and Ng, H. T., “An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation”, Proceedings of the ACL-02 conference on Empirical methods in natural language processing, Vol. 10, pp. 41-48, July 2002.

⁴⁴ Tratz, S., & Hovy, D, “Disambiguation of preposition sense using linguistically motivated features”, Proceedings of Human Language Technologies Annual Conference of the North American, pp. 96-100, June 2009.

⁴⁵ Al-Shalabi, Riyad and Rasha Obeidat, "Improving KNN Arabic Text Classification with N-Grams Based Document Indexing", International Journal of Advanced Research in Computer & Communication Engineering, March 2008.

⁴⁶ Zhu J., Wang H., Yao T., & Tsou B. K, “Active learning with sampling by uncertainty and density for word sense disambiguation and text classification”, Proceedings of the 22nd International Conference on Computational Linguistics, pp. 1137-1144, August 2004.

⁴⁷ Wang, X., & Matsumoto, Y., “Trajectory based word sense disambiguation”, Proceedings of the 20th international conference on Computational Linguistics, August 2004.

⁴⁸ Yarowsky, D., “Unsupervised word sense disambiguation rivaling supervised methods”, Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pp. 189-196, June 1995.

⁴⁹ Tomar, G. S., Singh, M., Rai, S., Kumar, A., Sanyal, R., and Sanyal, S., “Probabilistic Latent Semantic Analysis for Unsupervised Word Sense Disambiguation”, International Journal of Computer Science Issues (IJCSI), 2013.

⁵⁰ Ng, H. T., and Lee, H. B., “Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach”, Proceedings of the 34th annual meeting on Association for Computational Linguistics, pp.40-47, June 1996.

⁵¹ Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L., “Word-sense disambiguation using statistical methods”, Proceedings of the 29th annual meeting on Association for Computational Linguistics, pp. 264-270, June 1991.

⁵² Ponzetto, S. P. and Navigli, R., “Knowledge-rich word sense disambiguation rivaling supervised systems”, Proceedings of the 48th annual meeting of the association for computational linguistics, pp. 1522-1531, July 2010.

⁵³ Mihalcea, R. and Moldovan, D. I., “A method for word sense disambiguation of unrestricted text”, Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 152-158, June 1999.

⁵⁴ Yarowsky, D., “Hierarchical decision lists for word sense disambiguation”, Springer Berlin Heidelberg, Computers and the Humanities, vol. 34, pp. 179-186, 2000.

⁵⁵ Rigau, G., Atserias, J., and Agirre, E., “Combining unsupervised lexical knowledge methods for word sense disambiguation”, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp. 48-55, July 1997.

Frequently Asked Questions

What is the main topic of this document?

This document is a preview of a study on word sense disambiguation (WSD) using machine learning techniques, specifically K-Nearest Neighbor (KNN), applied to the Hindi language.

What is Word Sense Disambiguation (WSD)?

WSD is the task of identifying the correct meaning ("sense") of a word in a given context, especially when a word has multiple possible meanings (polysemy).

What is the K-Nearest Neighbor (KNN) approach in the context of WSD?

KNN is a supervised machine learning algorithm that classifies a data point by its similarity to labeled data points. In WSD, it determines the sense of an ambiguous word by comparing the word's context with sense-tagged training examples and choosing the sense that is most common among the k most similar examples.
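The idea can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity over bag-of-words contexts, the function names `cosine` and `knn_sense`, the default `k=3`, and the toy English training data for the word "bank" are all assumptions chosen for illustration. The study itself works with Hindi sense-tagged words from Hindi WordNet.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_sense(context, training, k=3):
    """Return the majority sense among the k training contexts
    most similar to the query context (hypothetical helper)."""
    query = Counter(context)
    ranked = sorted(
        training,
        key=lambda example: cosine(query, Counter(example[0])),
        reverse=True,
    )
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy sense-tagged contexts for the ambiguous word "bank" (illustrative only).
TRAINING = [
    (["deposit", "money", "account"], "finance"),
    (["loan", "interest", "money"], "finance"),
    (["cash", "account", "teller"], "finance"),
    (["river", "water", "fishing"], "river"),
    (["muddy", "river", "shore"], "river"),
    (["water", "flood", "shore"], "river"),
]

if __name__ == "__main__":
    sentence = ["she", "opened", "an", "account", "to", "deposit", "money"]
    print(knn_sense(sentence, TRAINING, k=3))
```

With a real dataset, the choice of similarity measure and of k would need to be tuned; the FAQ notes that finding the optimal k for large datasets is left as future work.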

What are the different learning approaches discussed in this document for WSD?

The document mentions knowledge-based approaches (using dictionaries and lexical resources) and machine learning approaches (including rote learning, supervised learning, unsupervised learning, and semi-supervised learning).

What are some of the knowledge sources used for WSD mentioned in this document?

The document lists WordNet, SemCor, thesauri, machine-readable dictionaries, and ontologies as knowledge sources for WSD.

What classification techniques used for sense tagged words are mentioned in the document?

Techniques like Naive Bayes, Decision Trees, Decision Lists, Neural Networks, Lesk Algorithm, Rocchio Classification, K-Nearest Neighbor, and Ensemble Methods are mentioned.

What are the steps involved in preprocessing text for WSD?

The preprocessing steps include tokenization (splitting text into words), part-of-speech tagging, lemmatization (reducing words to their base form), chunking (grouping words into phrases), and parsing (analyzing the syntactic structure of sentences).

What is the focus of the proposed work in this document?

The proposed work focuses on applying the KNN approach to disambiguate Hindi sense-tagged words, a problem where this technique hasn't been widely explored. It involves creating a training dataset for Hindi, identifying the correct word sense, and improving accuracy.

What is the methodology used in the proposed work?

The methodology involves creating a Hindi language dataset, applying a KNN-based algorithm, and testing the feasibility of the approach using parameters like accuracy percentage and execution time.

What are the components of the feature set used in the KNN approach for sense tagged words?

The feature set consists of a target set (containing the target word) and a neighbor set (containing surrounding words in the sentence).
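A small sketch of how such a feature set might be built from a tokenized sentence is given below. The function name `build_feature_set`, the dictionary keys, and the optional `window` parameter are hypothetical; the source only states that the neighbor set holds the surrounding words in the sentence.

```python
from typing import Dict, List, Optional

def build_feature_set(
    tokens: List[str], target: str, window: Optional[int] = None
) -> Dict[str, List[str]]:
    """Split a tokenized sentence into a target set and a neighbor set.

    window=None keeps every other word in the sentence; an integer keeps
    at most that many words on each side (an assumption, not from the source).
    Raises ValueError if the target word is not in the sentence.
    """
    idx = tokens.index(target)  # position of the first occurrence of the target
    if window is None:
        left, right = tokens[:idx], tokens[idx + 1:]
    else:
        left = tokens[max(0, idx - window):idx]
        right = tokens[idx + 1:idx + 1 + window]
    return {"target_set": [target], "neighbor_set": left + right}

if __name__ == "__main__":
    sentence = ["he", "sat", "on", "the", "bank", "of", "the", "river"]
    print(build_feature_set(sentence, "bank", window=2))
```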

How is the performance of the KNN algorithm evaluated in this document?

The performance is evaluated by measuring the precision of the algorithm, which is defined as the ratio of correctly classified sense-tagged words to the total number of classified words.
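The precision measure described above can be written as a short function. This is a sketch under the stated definition only; the function name `precision` and the example sense labels are made up for illustration.

```python
from typing import Sequence

def precision(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Correctly classified words divided by the total number of classified words."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not predicted:
        return 0.0  # nothing was classified
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(predicted)

if __name__ == "__main__":
    # Hypothetical sense labels: three of four predictions match the gold tags.
    print(precision(["s1", "s2", "s1", "s1"], ["s1", "s2", "s2", "s1"]))
```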

What are some potential areas for future work mentioned in the document?

Future work includes improving the dataset, using the method on another WordNet, determining the optimal value of ‘k’ for large datasets, applying it to other languages, and considering morphological word disambiguation.

How did the document generate its bibliography and reference section?

The bibliography and reference section comes from OCR text in the body of the text. There may be missing or improperly transcribed items.

Title: Natural Language Processing. A Machine Learning Approach to Sense Tagged Words using K-Nearest Neighbor

Scientific Study, 2018, 82 Pages, Grade: 1

Authors: Jagpreet Sidhu (Author), Arvinder Kaur (Author)

Computer Sciences - Artificial Intelligence

Details

Title: Natural Language Processing. A Machine Learning Approach to Sense Tagged Words using K-Nearest Neighbor
College: Post Graduate Government College
Grade: 1
Authors: Jagpreet Sidhu (Author), Arvinder Kaur (Author)
Publication Year: 2018
Pages: 82
Catalog Number: V411986
ISBN (eBook): 9783668642324
ISBN (Book): 9783668642331
Language: English
Tags: natural language processing machine learning approach sense tagged words k-nearest neighbor
Product Safety: GRIN Publishing GmbH

Quote paper: Jagpreet Sidhu (Author), Arvinder Kaur (Author), 2018, Natural Language Processing. A Machine Learning Approach to Sense Tagged Words using K-Nearest Neighbor, Munich, GRIN Verlag, https://www.grin.com/document/411986
