Reducing the Computational Requirements of the Nearest Neighbor Classifier


Doctoral Thesis / Dissertation, 2018
138 Pages

Excerpt

TABLE OF CONTENTS

Chapter No. Name of the Chapter

1 Introduction

2 Literature Survey

3 OCNN (Prototype Selection)

4 Prototype Selection and Feature Extraction

5 Prototype Selection and Feature Extraction

6 Prototype Selection and Feature Selection

Summary, Conclusions and Future Work

References

Index

List of Publications

DECLARATION

I hereby declare that the work described in the thesis entitled “Reducing the Computational Requirements of Nearest Neighbor Classifier” submitted to Jawaharlal Nehru Technological University Anantapur, Ananthapuramu, A.P., for the award of the degree of Doctor of Philosophy, is the bonafide research work done by me during 2011-2017 under the supervision of Dr. P. Viswanath, Visiting Professor, Dept. of CSE, IIIT Chittoor, SriCity, Chittor, A.P. India, and the Co-supervision of Dr. C. Shoba Bindu, Professor, Dept. of CSE, JNTUA College of Engineering, Ananthapuramu, A.P., India.

The contents of this thesis have not been submitted to any other University or Institution for the award of any degree or diploma.

CERTIFICATE

This is to certify that, the thesis entitled “Reducing the Computational Requirements of Nearest Neighbor Classifier” being submitted to JAWAHAR LAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR, ANANTHAPURAMU by Sri R. Raja Kumar bearing Roll. No 11PH0511 for the award of the degree of Doctor of Philosophy (Ph.D) in Faculty of Computer Science and Engineering under our guidance and supervision. This thesis work is original and has not been submitted to any other University or Institution either in part or in full for the award of any degree.

ACKNOWLEDGEMENT

Many of the people to whom I should be thankful. Few of them I can list here. First and foremost, I solemnly express my earnest and humble thanks to my Supervisor Dr P. Viswanath, Visiting Professor, CSE Dept, IIIT Chittoor, Chittoor, A.P., INDIA. He learned me how to learn. He is so kind and more than a supervisor to me.

I am so grateful to Co-Supervisor Dr C. Shoba Bindu, Professor, CSE Department, JNTUA, Ananthapuramu, A.P., INDIA. I got valuable directions, suggestions, guidance and encouragement ever since the commencement of this research from my supervisors. These two persons are so important persons in my life.

I extend my thanks to Dr A. Ananda Rao, Professor for his valuable advises. I am thankful to Dr. Vasundhara, HOD, CSE Dept for her support in the Research Review Meetings (RRM).

I am grateful to Prof. K. RajaGopal, the Vice Chancellor, JNTUA, Ananthapuramu. I am thankful to Dr M. Santhi Ramudu, Chairman, M. Sivaram, Managing Director, Principal Dr T. Jaya Chandra Prasad( becasue of the principal’s naysayer attitude It helps me to become stubborn), Faculty members and staff, CSE Dept, RGM College of Engineering and Technology for their continuous support during my research work.

I extend my thanks to my senior and my well wisher Dr. G. Kishore Kumar, Professor & HOD, IT Dept, RGMCET for his support during my research work. We spend together many evenings to understand most of the concepts.

I take this opportunity to thank my parents D. Lakshmi Kantamma and R. Veeraiah, my wife M. Sandhya, my only daughter R. Raj Veena, my brothers, R. Raja Sekhar and R. Sreenu Vasulu and other family members who have been a source of strength and encouragement which help me a lot at all stages of my research work.

Above all, I wish to express my humble thanks to the almighty God Lord Sri Venkateswara Swamy for giving me enough strength (especially when I am weak) and blessings in carrying out my research work.

ABSTRACT

Nearest Neighbor classifier (NNC) is a popular non-parametric classifier used in many fields since 1950’s. It is conceptually a simple classifier and shows good performance. The design phase is not needed for NNC and it stores the given training set. A query (i.e. test) pattern is classified to the class of its nearest neighbor in the training set. A simple enhancement of NNC is to consider k nearest neighbors and assign a class label to the query pattern based on the majority voting among them. This method is called k nearest neighbors (k-NN) classifier.

NNC and its variants like k nearest neighbor classifier (k-NNC) are widely used and well studied. Theoretically, it is shown that the error rate of NNC is bounded by twice the Bayes error rate when the training set size is infinity. Similarly, k-NNC, asymptotically is equivalent to the Bayes classifier.

There are three important parameters which influence a classifier’s error rate, (i) the training set size, (ii) the dimensionality of the problem and (iii) the classifier’s complexity (in other words, generalization ability of the classifier). Curse of dimensionality effect is a problem because of which, when the dimensionality is large, the classifier becomes a biased one, resulting in increased error rate. This problem leads to peaking phenomenon in the classifier design. The peaking phenomenon with NN classifier is more severe than other classifiers. In order to overcome this effect paper, point out that the number of training samples per class should be at least 5-10 times the dimensionality of the data. So, the demand for a large number of samples grows exponentially with the dimensionality of the feature space.

So, it is widely believed that the size of training sample set needed to achieve better classification accuracy would be sufficiently large when the dimensionality of the data is high. Two possible remedies are (i) to reduce dimensionality, or (ii) to increase training set size by adding certain artificially generated training patterns to the training set. But if we increase the training set size, the time and storage requirements of the data are high especially classifiers like NNC where we miss the training phase. Hence NNC suffers from the storage and time requirements compare to other classifiers.

Generally researches will increase the training set size to overcome the curse of dimensionality problem. It has the limitations as follows. Obtaining a larger training set is difficult in certain real world problems, like emotion recognition from speech signals, hand-drawn shape recognition, etc., where human beings have to generate examples. But, to achieve better accuracy training set size should be sufficiently large. Even if the training set size increased, NN classifier is required to store training set and search it for computing nearest neighbors.

The two problems, viz., curse of dimensionality (Which requires a larger training set), time and space requirements( Which requires a smaller training set) are distinct and difficult problems. The solution is to reduce the training set size and dimensionality. But reducing the training set size, without degrading performance, is a challenging task that attracts researches over many years.

My proposal is to reduce the data set size in both ways i.e, in terms of number of samples and also in terms of number of features. So, the holistic goal of my proposal is to reduce the time and space requirements and at the same time not to degrade the performance of NN classifier.

In order to deal with the above goals, the thesis proposes a new method for reducing the data set size and the proposed method was cascaded with the feature reduction techniques to achieve the overall goal i.e, reducing the training set both in cardinality and dimensionality.

PREFACE

The tremendous growth of data due to Internet and electronic commerce has created serious challenges to the researches in pattern recognition. There is a need of processing and analysing data. Advances in data mining and knowledge discovery provide the requirement of new approaches to reduce the data. In this thesis, methods are proposed to overcome the computational requirements of Nearest Neighbor Classifiers. The thesis shows the some of the possible remedies to overcome the problems with Nearest Neighbor based classifiers. The main disadvantages of Nearest Neighbor Classifiers can be avoided using the proposed methods in this thesis. The data is reduced not only in the cardinality but also in terms of dimensionality.

LIST OF TABLES

Table No. Name of the Table

1.1 Dataset Template

1.2 A Small Training Dataset

1.3 Blood Pressure Dataset

1.4 Blood Pressure Dataset

3.1 Example dataset

3.2 Euclidean distances between patterns of table 3.1

3.3 Prototype Set chosen by OCNN on table 3.1

3.4 Description of Data sets

3.5 Accuracy obtained over data sets

3.6 Accuracy after applying Bootstrapping

3.7 Comparison of Time

3.8 Reduction Rates of OCNN, MCNN and CNN

4.1 Description of Data sets

4.2 Accuracy obtained over Thyroid

4.3 Accuracy obtained over Iris

4.4 Accuracy obtained over Wine

4.5 Accuracy obtained over Balance

4.6 Accuracy obtained over Opt.digit.rec

4.7 Accuracy obtained over Thyroid

4.8 Accuracy obtained over Iris

4.9 Accuracy obtained over Wine

4.10 Accuracy obtained over Balance

4.11 Accuracy obtained over Opt.digit.rec

4.12 Reduction Rate of Features

LIST OF TABLES CONTINUED...

Table No. Name of the Table

5.1 Description of Data sets

5.2 Accuracy obtained over Thyroid

5.3 Accuracy obtained over Iris

5.4 Accuracy obtained over Pima

5.5 Reduction Rate of Features

LIST OF FIGURES

Figure No. Name of the Figure

1.1 The process of evolution of data

1.2 Cycle of data, statistics and computing power

1.3 The Machine Learning Process

1.4 Pattern Recognition System

1.5 Machine Perception of Pattern Recognition System

1.6 Relation between the Domains of Machine Learning

1.7 Example of Patterns

1.8 Flow of supervised learning

1.9 Decision Tree for Data in Table 1.2

1.10 Clustering Example

1.11 Semi-supervised Learning Example

1.12 Example of NNC

1.13 Example of 3-NNC

2.1 Compact representation

3.1 Piecewise linear decision boundary

3.2 Patterns selected by OCNN

3.3 Accuracy on OCR

3.4 Accuracy on IRIS

3.5 Accuracy on Liver

3.6 Accuracy on Pima

3.7 Accuracy on opt.digit.recognition

3.8 Time required to build Prototype Sets

4.1 LDA direction

LIST OF FIGURES CONTINUED

Figure No. Name of the Figure

4.2 Results over Thyroid

4.3 Results over Iris

4.4 Results over Wine

4.5 Results over Optdigitrec

4.6 Results over Thyroid

4.7 Results over Iris

4.8 Results over Wine

4.9 Results over Optdigit

5.1 Sequential Forward Selection

5.2 Results over Thyroid

5.3 Results over Iris

5.4 Results over Pima

LIST OF SYMBOLS

Abbildung in dieser Leseprobe nicht enthalten

LIST OF ABBREVIATIONS

Abbildung in dieser Leseprobe nicht enthalten

Chapter-1

INTRODUCTION

CHAPTER 1

Chapter 1. INTRODUCTION

S. No. Name of the Sub-Title

1.1 Overview

1.2 Data Mining

1.3 Machine Learning

1.4 Pattern Recognition

1.5 Learning:

1.6 Nearest Neighbor Based Classifiers

1.7 Nearest Neighbor Method

1.8 Working of NNC

1.9 Motivation, Objective and Scope of the present work

1.10 Organization of the thesis

Chapter 1

INTRODUCTION

This chapter discuss an overview of Nearest Neighbor Classification (NNC) . The chapter first discuss the evaluation process of the related fields where NNC uses, for example Machine Learning, Pattern Recognition and Data Mining etc. Further the chapter describes motivation, objectives and scope of the proposed work and organization of the thesis is presented at the end of the chapter.

1.1 Overview

We live in the world where very huge quantity of data are collected daily. Such type of data may contain some important information and often it is mixed with vague and unimportant data. Analysing such type of data is an important task.

We are living in the age of data. ”We are living in the era of information age” is a popular said term [9]. Tera bytes and Peta bytes of data is accumulating into the data bases and in computer networks through various resources. But actually whatever existing is not the information, we are living in the data age not information. The growth of data is with the avail- able computer resources. Business people around the world generate huge data sets(giga bytes very often in size) for their transactions, stock market, sales promotions, company profiles and performance, and customer feedback [9]. For example large stores like Wall mart produces hundreds of millions of records of transactions through their branches around the world wide. Scientists and engineers produces and require high orders of data(in petabytes) for remote sens- ing, process measuring, scientific experiments,system performance, engineering observations, and environment surveillance. The telecommunication industries carry tens of petabytes of data traffic every day. The medical and health industries generates huge amounts of data from medical records, patient monitoring, and medical imaging. The web pages and web searches in internet supported by search engines peta bytes of data daily. Communities and social media have become increasingly important data sources, producing digital pictures and videos, blogs, Web communities, and various kinds of social networks [9].The list of sources that generate huge amounts of data is endless [9].

Hence, there is a need of intelligent and automated tools which can mimic the knowledge human beings in analyzing data. We understood with an example. Consider the following phrases.

Data: Data is raw and unorganized facts that needs processing. Data without process- ing is useless. Its like crude oil [10].

Information: When the data is well structured, organized and presented in meaningful form or used for some purpose or statement or to draw conclusions then it is called Information. Its like drawing petrol from crude oil [10].

Examples:

1) The data base which contains the history of temperatures all over the world is data. If this is properly organized and analyzed to draw conclusions for finding the global temperature and to warn about global warming then it is information [10].
2) Maintaining the number of visitors for a website from each country is data. But finding which country visitors are more and which country visitors and at what time the traffic is more and at what time the traffic is less is meaning ful information [10]. The data evolution process is shown in Figure 1.1

In 1960’s the importance was given to data collection and its storage in storage. Later in 1970’s and 80’s the data is organized using Data Base Management Systems(DBMS). Later Advances DBMS and Data Analysis systems were evolved. The present and future is Informa- tion Systems.

This tremendous growing, widely available, and gigantic bytes of data makes the data age. Very powerful and tremendous tools are required to extract the valuable information from the huge amounts of data and to transform such data into knowledge in order to draw useful conclusions. This necessity gave birth to machine learning and its related fields like pattern recognition, data mining etc. These fields are dynamic and young and continuing their journey in this era.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.1: The process of evolution of data

1.2 Data Mining

Data Mining refers to extraction of useful knowledge from the raw data. It can also be called as ”knowledge mining from data”. It has several names such as data dredging, knowledge extraction, data analysis, pattern analysis, Data Archaeology.

The data mining is a knowledge discovery process. It has the following important stages.

1. Data preprocessing stage
2. Data mining stage.
3. Data Presentation stage.

The Preprocessing stage is classified into the following steps.

1. Data cleaning (refers removing unwanted data)
2. Data integration (where multiple data sources combined)
3. Data selection (The data relevant to analysis is extracted)
4. Data transformation ( Data consolidation will be done)

The mining stage has the following sequence of steps.

1. Data Mining ( Intelligent methods are applied to extract the patterns)
2. Pattern Evaluation ( some interesting measures are used to evaluate the patterns)

Data Presentation stage refers to present the extracted knowledge using visual tools.

Data Mining supports number of functionalities. The important functionalities are

1. characterization and discrimination
2. Mining Frequent Patterns and Association Analysis
3. Classification and regression
4. Clustering Analysis
5. Outlier analysis.

These functionalities specify the hidden patterns present in data mining tasks. Data mining task can be divided into two categories.

1. Descriptive data mining
2. Predictive data mining

Descriptive data mining describe and characterizes the properties of the data present in the target data set where as Predictive mining tasks draw inferences or inductions on the data to make future predictions.

Characterization and Discrimination

Generally data can be associated with concepts or classes. For example, in an elec- tronic show room, the concept of items for sale include computers and Printers. The concept of customers may be big customers and budget customers. It might useful to describe these concepts in a summarized, concise and precise form. These type of descriptions of a class or a concept is termed as class/concept description. The class/concept descriptions can be derived using the following steps.

1. Data characterization : It summarizes the data of the class. The class for which we are characterizing is called target class.
2. Data discrimination: This refers the comparison of the target class against one or more classes. The classes which are used for comparison are called comparative classes.
3. Data characterization and summarization

Data characterization describes the general characteristics or general features in the target data. This data typically collected by using a query. For example, to examine the char- acteristics of an item whose sales increased by 10% in the last quarter are collected using SQL query over the sales database. There are certain methods used for efficient characterization and summarization.

Simple summaries are based on the statistical measures and plots. OLAP operations like drill down, drill up, slice, dice are used to perform user controlled data summarization. In this user has control over the cube by specifying dimension. An attribute oriented induction technique does not involve much user interaction and can be used to perform generalization and characterization. The output of data summarization or characterization is often presented in various forms like bar charts, pie charts, multidimensional data cubes, curves, cross tabs and in some general rules or relations. These are called characteristic rules.

Data discrimination compares the general attributes of the data in the target class against the general attributes of the data with the another class. The class with which we are comparing is called contrasting class. We can compare with one or more contrasting classes. A user spec- ifies target and contrasting classes , and the relevant data is retrieved using database queries. For example, comparing the attributes of items which sales are increased by 20% during the last quarter against with the products which sales were decreased by 20%. Data discrimination uses similar methods used by data characterization. Discrimination descriptions use compara- tive measures to distinguish the target class versus contrasting class. The output representation of data discrimination is in the form of bar charts, pie charts, curves, cross tabs, multi dimen- sional data cubes etc. Often discrimination descriptions presents in the form of rules, called discriminant rules.

Example: A manager in an electronic shop want to compare two categories of his customers. The first category of customers are those who visit the shop regularly may be 3 times approximately within three months to buy computer products. The second category of customers are those who visit the shop rarely ma by less than four times in a year to buy computer products. The description and discrimination rules finds general profile comparison of this data. The manager finds that the 85% of frequent customers are between the age of 20-35 and have university degree. The infrequent customers who buy the products rarely have no university degree. Operations like drilling down or up a dimension like job or adding the income of customers may help to find even some more good discriminative features between target and contrast classes.

1.3 Machine Learning

ORIGIN: From the birth onwards we are dealing with the data. The body sensors the eyes, ears, nose, tongue, and nerves are continually dealing with raw data. They collect the data and sends to brain.The brain translates into sights, sounds, smells, tastes, and textures [11]. In the early days, people record the information from the observations. Astronomers recorded patterns of planets and stars, biologists noted results from experiments crossbreeding plants and animals and cities recorded tax payments, disease outbreaks, and populations [11]. All these task need human interventions at the stages. Nowadays, all these observations are mostly automated and systematically recorded by the computers [11].

We, human beings are surrounded by data. Larger and interested data sets are a click away. We now live in period of vast quantities of data. This information has a potential of decision making. The branch of study that develops computer algorithms that turns data into intelligent actions is known as machine learning [11]. This field originated from the environ- ment of statistical tools, availability of data and computing power. But growth in data necessary increases more statistical methods there by sophisticated algorithms from time to time to deal with the data. This is a cycle as shown in Figure 1.2.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.2: Cycle of data, statistics and computing power

It is widely believed that Data Mining is a sibling of Machine learning. But there is a disagreement also; To some extent the fields overlap each other, the key distinction is machine learning focus on performing known task where as data mining focus on searching for hidden piece of information. It’s like mining the gold from gold mines where plenty of mud,dust has to be removed to get a single gram of gold. For example we may taught a robot how to drive a car whereas we utilize data mining to learn which type of cars are the safest for driving or less prone to accidents or what are the interests of Indian people when buying new cars.

The algorithms in Machine learning may be prerequisite for data mining but the vice versa is not true. That means we can use machine learning to the tasks that do not involve any data mining functionality but when we are using data mining methods, we are almost use machine learning. Some of the applications of the machine learning are :

1) Estimating the results of elections
2) Filtering Spam mails
3) Auto driving vehicles
4) Predicting criminal activity
5) Targeting consumers

The basic learning process in machines or human beings is similar. It has the following components.

1) Data input: It refers the observations and memory storage.
2) Abstraction: It translates data into generalized representations.
3) Generalization: The abstracted data tuns into an action in this step.

An evil associated with the abstraction and generalization steps is Bias. Bias is inherent in any machine learning process. Every learning method has its own weaknesses and is biased in particular angle. There is no single model to fit for all the problems. The last step in generalization process is to know the model’s performance by not considering bias. Once a model is trained on some dataset, then it is tested on a new dataset, and judged how well its characterization over the new data. One should note that, it is very rare for a model to generalize perfectly all the unforeseen.

This may be due to the problem of noise in the data, or unwanted variations in the data. Noise in the data is caused by random events such as,

1) Due to the imprecise sensors that add or subtract a bit of values when reading.
2) Due to the random answers for the survey questions to finish the survey quickly. This might be respondents report random answers to the survey questions.
3) Due to the missing values in data, null values, corrupted values while recording the data. At that time the data may not be available.

The Pictorial representation of Machine Learning Process is shown in Figure 1.3

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.3: The Machine Learning Process

The raw input data in a structured format is very essential for a learning process. The data is a collection of observations. Frankly speaking, the data is merely 1’s and 0’s on the disk. The abstraction process assigns some meaning to the data. Then abstraction tries for knowledge representation from the data. During the knowledge representation, the abstraction constructs a model which depicts the structured patterns from the data. There are many models exist. Some of the models are listed below.

1) Equations like association rules
2) Diagrams like trees and graphs
3) Logical rules like if/else
4) Grouping of data like clusters

The following are the steps for applying machine learning to the data.

- Data Collection: The data may be on papers or in text files or in some databases, first the data has to be collected and put in an electronic form suitable for system. This data serves as learning data.
- Exploration of data: The quality of machine learning process depends on the quality of the data used. In this step, human intervention is involved. Nearly 70-80% of effort is put on preparing and exploring the data.
- Training a model: In this using the data we train model, and the specific learning task.
- Evaluating Model Performance: In this step, the performance of the built model is evaluated. Basically when a machine learns model, it results are biased to the learning problem. Evaluating a model is important as we know the experience of the algorithm. Generally the accuracy is evaluated depending on the model and using the test set. we can also use some performance measures in support of evaluation process for specific purpose or for the specific application.
- Model Performance improvement: Still if we want to improve the performance then some advanced techniques are required to augment the model. Sometimes, we need to switch to different type of model. We may require to augment the data with additional data.

1.4 Pattern Recognition

Pattern Recognition is defined as the classification of data based on the obtained knowledge or based on the statistical information which is extracted from the examples and their repre- sentations. Pattern Recognition is close to the basic attribute of human behaviour. Humans categorize every sensory input data. This process starts from the birth to death and from the start of day when we awake and continuous till our sleep.

Pattern Recognition has several applications. Some of them are listed below.

1) Automatic Medical Diagnosis
2) Multimedia Document Recognition
3) Reading facial expressions
4) Recognition of speech
5) Identification of criminal from his finger prints
6) Reading a document

A simple Pattern Recognition system is shown in the Figure 1.4 The machine perceptive

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.4: Pattern Recognition System of Pattern Recognition is shown in Figure 1.5

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.5: Machine Perception of Pattern Recognition System

A pattern is an example from the data. Feature Extractor makes some measurement on the pattern. X is called feature vector. X n. Classifier maps each feature vector to a class label. Features used are specific to the problem.

Some of the examples of Pattern Recognition task are:

1) Character Recognition :

- Pattern - Image
- Class - Identity of character
- Features - Binary Image, Projections, Moments.

2) Speech Recognition :

- Pattern- Signal
- Class - Identity of speech units
- Features - chunks of speech, spectral information etc.

3) Identifying criminal based on Finger Prints :

- Pattern- Image
- Class -criminal or not
- Features - Minutiae position, Orientation filed

4) Video Surveillance :

- Pattern- Sequence of video’s
- Class -Alertness level
- Features - Trajectory of motions, speed etc

5) Credit Screening :

- Pattern-Applicant details
- Class -Yes, No
- Features - salary, credit history, job history etc.

6) Document Classification :

- Pattern- documents
- Class -Relevant or not
- Features - word counts, word context etc.

In the pattern recognition application, we require to process the raw data and transform the data into a form that is suitable for machine use. For example, it is possible to convert all forms of multimedia data into feature values. The frequency count of key words in the text will form representation. similarly audio data can be represented by linear predictive coding coefficients and video data can be transformed into frequency or time domain. The popular transformations are Wavelet transforms and Fourier transforms. Signal processing converts raw signals into numbers.

In doing the process, we don’t deal with specific details that present in the data. Abstrac- tion of data is necessary step. For this summarizing the data is needed. Abstraction of data is familiar to both human and machines. For the machines it helps computation burdens like time reductions and space reductions. Obtaining abstractions from the given examples, is a popular paradigm in machine learning. In artificial intelligence also the machine learning activity is used by taking the help of domain knowledge. The abstractions can be in the form of rules called rule-based abstractions. In addition data mining tools can be used for this purpose. All these fields like pattern recognition , machine learning, artificial intelligence, Algorithms, In- formation Retrieval, High Performance Computing, Statistics and data mining are overlapped. The Figure 1.6 gives pictorial idea of relation between the domains of Machine Learning.

1.4.1 Patterns: An Example

The fundamental task of pattern recognition is assigning class labels to the patterns. The Figure 1.7 shown an example of patterns. In this Figure there are 12 patterns present. The patterns are represented by the symbols X and +. There are total 12 patterns are present in the data.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.6: Relation between the Domains of Machine Learning

Patterns numbered 1,2,3,4,5,6 are denoted by cross symbol ’X’ and patterns 7, 8, 9, 10, 11, 12 are denoted by plus ’+’. In reality they represent some abstraction like X may represent short people and + represent tall people etc. Now, let P be a new unknown pattern for which whether it is an example of X patterns or + pattern. This is the task of pattern recognition. P can be assigned to either of the class depending on the algorithm.

The human beings based on the abstraction may feel that pattern P has to belong to +. But its not always true. Even though, contextually it looks P belongs to + , but it depends on the type of algorithm Pattern Recognition system follows.

Pattern Recognition field is application potential. It can be used in different areas such as education, security, transportation, medicine, entertainment and finance. The specific applica- tions of Pattern Recognition are in the fields multimedia, data analytics, document recognition, fault diagnostics, informatics, biometrics and expert systems. The task of classification is basic fundamental idea for human beings so that pattern recognition finds a place in any field.

There are mainly two types of paradigms are present in pattern recognition.

1. Statistical Pattern Recognition
2. Syntatic Pattern Recognition

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.7: Example of Patterns

Statistical pattern recognition is popular than syntatic pattern recognition. The funda- mental reason for this is, most of the problems present in this are requires to handle with noise in the data and uncertainty. statistic techniques and probability tools are suitable tools to handle such problems. The theory of formal languages provides basis for syntatic pattern recognition. Syntatic pattern recognition is also called as structural pattern recognition. In this each pattern is represented by symbolic features or nominal features. This allows pattern structure represen- tation and more complex relationships among the attributes. If there is a clear structure in the patterns then syntatic pattern recognition can be used instead of statistical pattern recognition.

1.5 Learning:

The important functionality of all the major fields like Machine Learning, Pattern Recogni- tion and Data mining is Learning. Learning refers to how the computers learn or improve their performance based on the data presented to them. Learning for computer programs concentrate mainly on automatically learn to identify complex patterns and there by taking intelligent de- cisions based on the data presented. For example a learning problem might be recognizing the hand written postal code on the letters and classify them according to the pin code. Learning is fast growing research area discipline. The following are the typical learning categories [9].

1. Supervised learning.
2. Unsupervised learning.
3. Semi-supervised learning.
4. Active learning.

Supervised learning: Supervised learning is also referred as classification. In this learning comes in the form of supervising the examples given in the training data set. For example in the postal code recognizing problem, a set of handwritten postal images are given as training examples along with the corresponding class labels in this case postal code and its corresponding place is given. These examples supervise the classification model. Classifi- cation analyses the class labelled data objects. Supervised machine learning searches for the algorithms from externally supplied instances. From these instances, it produce the general hypotheses and uses it to predict the future instances. The goal of supervised learning is to build a concise model over the distribution of classes in terms of the predicted features. The resulting classifier is used to assign the class labels over the testing instances. For the testing instances, the values of the features are known, but the class label is unknown.

The dataset used by any algorithm consists of instances. Every instance in the dataset contains set of features. The features may be binary valued features (contains 0 or 1), categorical fea- tures ( non-numerical values) or continuous( numeric values). If the labels(the correct outputs for the features) are present for the instances then it is called supervised learning(refer table 1.1) which is opposite to unsupervised learning.

Issues of supervised learning: Supervised learning is an inductive learning [12] which is the process of learning the rules from the instances ( objects present in the training set) or generally speaking it constructs a classifier that can be used to generalize new instances. The process of applying supervised learning to a real world problem is presented in Figure 1.8.

Collecting the dataset is the first step. If the expert is available, then he or she can suggest which type of features or attributes are informative. If this is not possible, then we should follow ”brute force” method which measures the everything that is available so as to find right features should be isolated. But a dataset collected by this method is not generally suitable of learning by induction. In most cases it may contain noise and some of the feature values may be missing. Hence it has to undergo pre-processing [14]. The second step is data preprocessing step. Depending on the circumstances the researcher decides whether this step is needed or not. There are number of methods to deal with the missing data and noisy data etc [15] [16]. The technique’s have their own advantages and disadvantages.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.8: Flow of supervised learning

Table 1.1: Dataset Template

Abbildung in dieser Leseprobe nicht enthalten

The next crucial step is selecting the learning algorithm. Once the preliminary testing is satisfied, the classifier is available for routine use. The classifier maps the unlabelled instances to classes. Evaluation of the classifier depends on the predicted accuracy. The predicted accu- racy is the percentage of correct predictions divided by the total number of predictions. There are techniques to compute the accuracy. The most general technique most of the researches follow is divide the dataset into two mutually disjoint sets. Two-thirds is used for training and one-third is used for estimating the performance of classifier. In another technique referred as cross-validation, the training set is divided into equally sized mutually exclusive subsets. The average of the error rate of each subset is the error rate of the classifier. Another technique known as leave-one-out validation is a special case of cross-validation. This is computation- ally expensive method but it is useful when we need most accurate estimate of the classifier error rate. If it is found that the error rate evaluated is unsatisfactory, then we should return to the previous stage in the supervised learning process. The factors for this has to be examined. They might be like relevant features may not be used for the problem, a larger training set may need, the dimensionality of the problem is too large and the selected algorithm might be inappropriate or some times the algorithm looks inappropriate because of the not well tuned parameters. Another important consideration is the dataset is an imbalanced dataset [17].

Supervised learning is the most frequent task carried out in Intelligent Systems. Variety of techniques have been developed based on artificial intelligence, Perceptron based techniques and Statistic based techniques.

Examples of Classification Methods:

Classification is the function that assigns objects or items in a collection to target classes or categories. It predicts the class for each item in the collection. For example, a classification model may be used to classify the bank customers low, medium or defaulters or a classification model may predict the customers of supermarket into the two classes like regular buyers or seldom buyers. Classification relates to each activity in the human life from birth to death.

Decision trees:

A decision tree can have zero or more internal nodes and also one or more leaf nodes. Each internal node can have two or more child nodes. A decision tree is also called as directed tree therefore a node with no incoming edge is known as root node, nodes which have one incoming edge and one or more outgoing edges are known as internal nodes and the nodes which have no outgoing edge are known as leaf nodes. Each internal node consists of an expression to split the instance space into two or more subspaces. An expression is composed of a single attribute or more attributes. If an expression consists of a single attribute, then the decision trees are known to be univariate decision trees (also called axis parallel decision trees) and the decision trees are known to be multivariate decision trees ( also called oblique decision trees) if an expression is a combination of attributes.

A decision tree is constructed from a set of labelled objects ( also called training set), where each object is described with a set of attributes or features and also associated with a class label. The values of each attribute are either ordered or unordered values. Basically the approaches for induction of decision trees are top-down and bottom-up. In top-down approach to insert a node in the decision tree induction process, the goodness of the attribute to be tested at the node is normally measured obtained on drop in certain impurity measures like entropy, gain-ratio, gini-index etc. [9]. A sample decision tree is shown in the Figure 1.9for categorical data set in which four categorical attributes and fourteen instances as shown in Table. 1.2.

Table 1.2: A Small Training Dataset

Abbildung in dieser Leseprobe nicht enthalten

The two phases in top-down induction of decision trees are growing phase and pruning phase. In growing phase the decision trees are fully grown until every leaf node become pure node, where all instances belong to same class. In pruning phase, the pruning method can be employed to resolve the over-fitted decision trees. The popular top-down decision tree classifiers are ID3[18], C4.5[19] and CART [20] [21]. The bottom-up approach induction of decision trees are first proposed by Landeweerd et al. [22]. In bottom-up approach the decision trees are grown from leaf nodes to root node. In this approach a priori information is available about which objects belong to a given node in the tree. Therefore bottom-up approach decision trees do not lead to over-fitted and hence pruning is not needed.

The typical challenge to decision tree induction is to build optimal decision tree which minimize the generalization error, number of nodes, etc. Therefore it is a hard task to build an optimal decision tree. Hancock et.al., [23] is shown that to find optimal decision tree consistent to the training set is NP-Hard. And also finding an optimal equivalent decision tree for a given decision tree is NP-Hard [24]. In [25] it is shown that constructing a minimal binary tree, respect to estimate number of tests required to classify an unseen pattern is NP-complete.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.9: Decision Tree for Data in Table 1.2

Merits and Demerits of Decision Tree Classifier : The advantages of decision tree classifier are listed as follows:

- Decision trees have the capability of self-explanatory and easy to follow compact deci- sion trees, so that non-professional users can easily understand. Further the decision trees are easily converted to a set of rules. Therefore this kind of representation of decision
trees is considered as comprehensible.
- Decision trees are capable to handle both numeric and nominal input features/attributes.
- Decision trees can handle the data sets that might have errors.
- Decision trees can handle the data sets that might have missing values.
- Decision trees do not have any assumptions about the structure of input space distribution so that it can be considered to be a non parametric classifier.

On the other side, decision tree classifiers also have some disadvantages which are listed as follows:

- Many of the decision tree algorithms like ID3, C4.5 etc. need that the target attribute will have discrete values only.
- Generally the decision tree classifiers can be followed divide and conquer approach, in which the given instance space is partitioned into mutually exclusive subspaces so that duplications of sub trees may exist in representation of decision tree.
- Due to the greedy characteristic another problem with decision trees is its over sensitivity to the irrelevant attributes, training set and to noise [18].

Bayes classification methods: Bayesian classifiers are the best examples of statistical classifiers. They estimate the class membership probabilities which means the probability that given sample belongs to a specified class. Bayesian classifiers is based on Baye’s theorem. A simple Bayesian classifier is known as naive bayesian classifier. Baye’s classifiers can be com- parable in performance with decision tree and nearest neighbor classifiers. Bayesian classifiers can be applied to large datasets, they show high accuracy and speed. Naive Bayesian classifiers assume the class conditional independence which tells that the attributes are independent of each other. This assumption made the computations simple.

Baye’s theorem: Baye’s theorem is named after Thomas Bayes who has done the work in basic probability and decision theory in the 18th century. Let X represents a data sample. In Baye’s terms , X is called as evidence. It is described by measurements over a set of n attributes. Let H be the hypothesis such that the sample X belongs to a specific class C. For the problem of classification, we need to determine P(H|X), which indicates that the probability so that the hypothesis H holds given that evidence or observed data sample X. In other terms, we are computing the probability that the sample X belongs to the class C. P(H|X) is called posterior probability or a posterior probability of hypothesis H conditioned on X. For example, let the samples represents customers who are described by the properties income and age respectively.

Let X is a 40 year old customer with income of Rs 50,000. Let H is the hypothesis so that this customer will buy a computer or not. Then P(H|X) represents the probability that customer buys the computer given that his age and income.

P(H) is called prior probability or a prior probability of H. This indicates the probability of hypothesis without depending on the attributes i.e not depending on X. For example in the above case P(H) represents the probability of buying a computer ignoring income and age attributes. The posterior probability depends on more information than prior probability. P(X|H) is the posterior probability of X given the hypothesis H. This means, it is the probability that we know the customer buys the computer what is the probability that his age is 40 years and income 50,000. P(X) is the probability of X. It is the probability that a customer is 40 year old and earns 50,000. All these probabilities are connected with the Baye’s theorem. It provides a way of computing P(X), P(H), P(H|X) and P(X|H). So the Bayes theorem is given by

Abbildung in dieser Leseprobe nicht enthalten

Instance based learning: Instance based learning is another category of statistical learning method. The instance based learning algorithms are called lazy-learners because there is no design phase in this. The induction or generalization process is delayed until the classifi- cation step is performed. Lazy-learners need less computation time in the training phase when compare to eager-learners such as decision trees, Baye’s networks etc. Eager learners require more computation time in the design phase of classification process. The study of instance based learning classifiers is presented in the papers [26], [27].

The straight forward instance based learning algorithm is NNC. The theme, motivation, objective of this thesis is Nearest Neighbor based classifiers.

Unsupervised learning: This is also referred as clustering. In this also input examples are present but without class labels. Hence there is no supervision present in the data. So, first we have to discover the groups( which are generally referred as clusters ). For example, an unsupervised learning method can take input examples of images of handwritten digits. Then it can find there are 10 clusters present in the data. These clusters corresponds to the digits from 0 to 9. There might or may not be possible to give the semantic meaning of the clusters discovered. Clustering analyzes the data objects without the help of labels. The objects are grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That means, the objects within the same group are highly similar with respect to each other when compare to objects of other group [13].

Example: A supermarket may like to apply clustering analysis to identify the homogeneous groups of its population. These clusters represent individual groups targeted for its marketing.

An example of clustering can be shown in Figure 1.10.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.10: Clustering Example

Semi-supervised learning: Semi-supervised learning techniques use both labeled and unlabeled objects while learning a model. It is a class of supervised learning technique gen- erally small amount of data is labeled and large amount of data is unlabeled. It falls between supervised and unsupervised learning. It has began with aim of improving the accuracy when unlabeled data is mixed with labeled data. The acquisition of labels to the data needs human expert and it is a costly process. The cost associated in this process makes most of the data sets infeasible where as on the other hand acquiring labels to the unlabeled data is relatively costly process than applying technique. In this type of situations semi supervised learning helps significantly. For example, let the learning problem is an exam, where some of the prob- lems present in the exam are already solved in the class room by the teacher and the rest are unsolved. These unsolved are take-home exam where the students has to do well on these. The semi-supervised learning is shown in the Figure 1.11.

The top panel in the picture presents the decision boundary after seeing only one posi- tive and one negative example. The bottom panel represents the decision boundary when there are unlabeled data is present in addition with these two objects. we can observe how the unla- beled data influence the boundary between the classes. This might increase the accuracy of the problem.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.11: Semi-supervised Learning Example

Active Learning: Active learning is a special case of semi supervised learning in which the learning algorithm interactively query the user to get the information about the unlabeled data. The aim of active learning is to optimize the quality of the model by actively interacting with the users.

1.6 Nearest Neighbor Based Classifiers

In classification methods Nearest Neighbor classification is one of the simplest and easy to un- derstand decision procedure. The nearest neighbor based classifiers work based on the nearest neighbor rule (NN) NN). For the given query pattern or sample, NNC classifies the pattern based on the category of its nearest neighbor. The NN rule is a suboptimal procedure. It usually has an error rate greater than the optimum error rate, the Bayes error rate. But when infinite number of prototypes are present, its error rate is never worse than twice the Baye’s rate [6]. That means, when the number of samples are increasing the error of NN approaches to minimum possible error.

1.7 Nearest Neighbor Method

The nearest neighbor method classifies the given test or query pattern to the class of its closest pattern. For example consider the sample dataset present in Figure 1.12. In the dataset two classes are presented. The patterns that belong to circle are represented by solid circle and the patterns that belong to diamond are represented with the symbol solid diamond. Now the query pattern ’q’ indicated with the symbol ’*’ infers to the diamond class as its closest pattern belong to diamond class. Formally the method is stated as follows.

Let there are m patterns present in the dataset, (X1 ,l1 ), (X2 ,l2 ), ... , (Xm,lm ) where Xi is the ith pattern of dimension d and li is the corresponding class label. Let ’Q’ is the query pattern then, d(Q,Xj )=min{d(Q,Xi)} where i=1,2, ... , m. Then Q is assigned to the class lj which is the class label of Xj sample.

k-Nearest Neighbor Method: A simple extension of nearest neighbor method is k- nearest neighbor method, which gives the provision of considering more than one nearest neighbor to infer the class label. The number of nearest neighbors to be considered will be given by ’k’. k is a positive integer. For example, if k=3 means then we should find the first three nearest neighbors. The class label is decided by majority voting method. For example consider the sample dataset shown in the Figure 1.13. To classify the query pattern ’q’, now 3 nearest neighbors are considered, hence we can call this 3-NNC. Among 3 closest patterns, 2 belong circle class and 1 belongs diamond class. Intuitively two samples are inferring such that the class label of ’q’ is circle and one sample infers the diamond class. Using majority voting concept, the query pattern infers the class circle. There is always possibility of ties. Ties are broken using any random method. For example k can be chosen odd to avoid ties etc.

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.12: Example of NNC

Proximity Measures: To classify the patterns, they should be compared with the other patterns using a standard. When a new query pattern is present and it is required to classify it, then the proximity of this patterns against the other patterns present in the training set should be computed. Hence proximity measures are used. Even in the case of unsupervised learning (Clustering) also, patterns are to be grouped so that similar patterns are in the same group, hence proximity measures are required. To know the proximity of given query pattern, the methods uses a number of dissimilarity or similarity measures.

Distance Measures: Distance Measures are to find the dissimilarity between the sam- ples. Patterns which are closer should be similar patterns. Distance function can be a metric or not. A distance measure is said to be metric if it holds the following properties:

Abbildung in dieser Leseprobe nicht enthalten

Figure 1.13: Example of 3-NNC

- Positive reflexivity : Distance from the object to itself should be zero. d(y,y) =0.
- Symmetric property: if x, z are the two patterns then distance from x to z and distance from z to x must be same. d(x,z)=d(z,x).
- Triangular inequality: Let p, q and r be the three patterns, then the following property must be true. d(p,q) ≤ d(q,r) + d(p,r).

The most popular distance metric is called Minkowski distance given below.

Abbildung in dieser Leseprobe nicht enthalten

When k is 1 then it is called Manhattan distance or city block distance or L1 distance.

Abbildung in dieser Leseprobe nicht enthalten

The most popular distance metric is Euclidean distance or L2 distance which is obtained when k is assigned 2. Then we get,

Abbildung in dieser Leseprobe nicht enthalten

EXAMPLES: Let X=(3,5) and Y=(6,9) are the two patterns. Then the Manhattan distance is L1 = |3 − 6| + |5 − 9| = 3 + 4 = 7.

The Euclidean distance is

Abbildung in dieser Leseprobe nicht enthalten

1.8 Working of NNC

Consider the following two dimensional points. The first two numbers represent the features and the third one represents the class label. Let there are two classes present. The patterns belong to either class 1 or class 2.

Abbildung in dieser Leseprobe nicht enthalten

Let the query pattern q= (3.0, 3.0) for which we need to compute the class label. Following the 1NN procedure, We find the distances between ’q’ and all the patterns i.e., X1, X2, ... , X8.

Abbildung in dieser Leseprobe nicht enthalten

So, the nearest neighbor for q is X4 as it is very close to ’q’. This infers the class label 1 for ’q’ saying that ’q’ belongs to class 1. If we consider 3-NNC then the 3 nearest neighbors for ’q’ are X4, X3 and X2 then ’q’ infers the label 1. If we consider 5-NNC then the 5 nearest neighbors for ’q’ are X4 , X3 , X2, X1 and X5 . Then ’q’ infers the class label 1 using majority voting.

1.8.1 Problem with NNC

Besides its simplicity and popularity, NNC suffers from severe issues or problems from the beginning of its invention. These drawbacks attract the eyes of the researchers from several decades. We illustrate the problem with NNC using a sample dataset. Consider the Blood Pressure dataset present in Table 1.3. It presents the information about whether a patient has Hyper Tension (HT) or not. This is class label information. A person is classified whether he has HT or not based on the systolic pressure (SP) and Diastolic pressure (DP). Systolic pressure is the force created on the blood vessels wen heart pumps blood through arteries. The diastolic pressure refers the pressure in the arteries when the heart during the successive beats. Based on the data given in the table, a person is classified whether he has HT if his blood pressure is (SP, DP) >(139, 99) where (SP,DP) is order pair. This data can be served as training data for classifying newly coming patients whether they have HT or not. For the let there are two new patients with ID’s P10, P15 given to classify with the readings (122, 82),(155,115). Then P10 is classified to NO (means he has no HT) and P15 to YES class label as they are close to P1 and P4 respectively using NNC rule.

Table 1.3: Blood Pressure Dataset

Abbildung in dieser Leseprobe nicht enthalten

Time and Space Requirements: NNC is a lazy learner. It has no design phase as eager learners. Hence it needs to store the entire data set to classify the query patterns. It requires memory space to store all the training patterns. Hence the space requirement of NNC is O(m) where ’m’ is the training set size. To classify the single test pattern, it has to compute the distances between the test pattern and every training pattern in order to find nearest neighbor of the given test pattern. So the time complexity is O(m) where ’m’ is the training set size.

Excerpt out of 138 pages

Details

Title
Reducing the Computational Requirements of the Nearest Neighbor Classifier
College
Jawaharlal Nehru University
Author
Year
2018
Pages
138
Catalog Number
V495106
ISBN (eBook)
9783346003935
ISBN (Book)
9783346003942
Language
English
Tags
Reducing the Computational Requirements of Nearest Neighbor Classifier
Quote paper
Raja Kumar (Author), 2018, Reducing the Computational Requirements of the Nearest Neighbor Classifier, Munich, GRIN Verlag, https://www.grin.com/document/495106

Comments

  • No comments yet.
Read the ebook
Title: Reducing the Computational Requirements of the Nearest Neighbor Classifier


Upload papers

Your term paper / thesis:

- Publication as eBook and book
- High royalties for the sales
- Completely free - with ISBN
- It only takes five minutes
- Every paper finds readers

Publish now - it's free