A Genetic Programming Approach to the Classification Problem
Genetic programming is a technique inspired by biological evolution in which computer programs solve problems automatically by evolving iteratively under a fitness function. The advantage of this type of programming is that it only defines the basics; as a result, it is a flexible solution for a broad range of domains. Classification has been one of the most compelling problems in machine learning. In this paper, a genetic programming classifier is compared with conventional classification algorithms such as Naive Bayes, the C4.5 decision tree, Random Forest, Support Vector Machines and k-Nearest Neighbours. The experiments are run on several data sets with different sizes, feature sets and attribute properties. The time complexity of each classifier method is also examined.
The world is moving towards digitisation, and almost everything in human life becomes data. In parallel with the improvement of storage and database systems, storing data and accessing it have become easier and cheaper. However, having data does not mean having knowledge. Information must be extracted from the raw data; once that is done, the picture becomes clearer. This is where data mining starts.
Data mining [1] is the exploration and analysis of large quantities of data. Its essential task is therefore the extraction of interesting knowledge, such as patterns, rules or constraints, from large data sets.
Classification is the problem of identifying the categories of data. Text classification is one of the most distinctive variants: it labels input text based on some training data. Social media and internet usage have been growing with the adoption of real-time communication and text-based information sharing. The increasing amount of text data boosts the importance of extracting knowledge from it, which leads the computer science world to lean on text classification algorithms more. The best-known algorithms of this kind are decision tree, Naive Bayes, Random Forest, Support Vector Machine and k-Nearest Neighbours classification.
Genetic programming, henceforth GP, is a methodology inspired by biological evolution for solving computing problems automatically [10]. GP is a machine learning technique driven by a fitness function that measures how well the given task is performed. In this paper, GP is used to enhance the decision tree classification method so that it can classify text efficiently. The overall performance of this GP classifier is then compared with the other popular classification algorithms.
Classification is the categorisation process by which ideas and objects are understood and recognised. In statistics and data science, it is the problem of labelling data according to some training data. This operation is done by classifiers: they learn from the training data and categorise unlabelled input. The main purpose of classification is to make the predictor algorithm learn a model of the data and start predicting based on this model.
The common method is to take a data set and divide it into two parts. The first part, the training data, is used to train the classifier. The second part, the test data, is used to benchmark the classifier's performance. The rationale is simple: teach the learner something, then give it input to test what it has learnt. However, a couple of problems can arise, namely overfitting and under-fitting. If the training data is not sufficient for the domain of the problem, under-fitting occurs, which corresponds to high bias. On the other hand, if the model is too complicated for the problem domain and the structure of the data set, overfitting occurs, which corresponds to high variance: the classifier cannot restrict itself to plausible scenarios because of the large hypothesis space. These issues make it important to choose the training/test split ratio carefully.
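As a minimal sketch of this split (function and parameter names are illustrative, not from the paper), the data set can be shuffled and cut at a chosen ratio:

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle the data set and cut it into training and test parts."""
    rows = rows[:]                          # copy, so the caller's list is untouched
    random.Random(seed).shuffle(rows)       # fixed seed keeps the split reproducible
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]           # (training data, test data)

train, test = train_test_split(list(range(10)), test_ratio=0.3)
```

The 70/30 ratio here is purely an example; the point above is that this ratio must be tuned to the data set to balance under-fitting against overfitting.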
The decision tree [2] is a learning method based on target functions with discrete values. The model is a tree structure with internal nodes and leaves: each internal node corresponds to a test on an attribute, and each leaf corresponds to a class label representing the decision made by the classifier. It is a very popular algorithm, used in fields from finance to medicine.
The main goal when building a decision tree from training data is to create the most compact tree possible, since many different trees can be built from the same training set. The key step is selecting the attribute for each internal node. Two main methods are used to standardise this process: information gain, which measures how much knowledge about the class an attribute provides, and the Gini index, which measures the impurity of the class distribution for an attribute. Well-known tree-building algorithms include ID3, C4.5 and C5.0. In this experiment, J48, an open-source Java implementation of the C4.5 algorithm, is used as the basic decision tree classifier.
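The two attribute-selection measures can be sketched directly from their definitions (a toy implementation for illustration, not the J48 code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: impurity of the class distribution."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, splits):
    """Entropy reduction achieved by splitting parent_labels into subsets."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(s) / n * entropy(s) for s in splits)
```

For example, splitting the labels ['a', 'a', 'b', 'b'] perfectly into ['a', 'a'] and ['b', 'b'] yields an information gain of 1 bit, the maximum for a two-class set.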
Naive Bayes is a classifier based on Bayes' theorem [3]. It is also known as the independent feature model, a name that comes from its naive independence assumptions. The Naive Bayes classifier is widely used in the pattern recognition field. Bayes' theorem:
P(A | B) = P(B | A) P(A) / P(B)
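Applied to classification, the theorem says that P(C | x) is proportional to the prior P(C) multiplied by the product of the per-feature likelihoods P(x_i | C), under the naive independence assumption. A toy categorical implementation (the add-one smoothing choice and all names are our own illustration):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors P(C) and per-feature value counts for P(x_i | C)."""
    priors = Counter(labels)
    counts = defaultdict(Counter)        # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(i, label)][value] += 1
    return priors, counts

def classify_nb(row, priors, counts):
    """Return the class maximising P(C) * prod_i P(x_i | C), add-one smoothed."""
    total = sum(priors.values())
    best, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior / total
        for i, value in enumerate(row):
            c = counts[(i, label)]
            # add-one smoothing so unseen feature values never zero out the score
            score *= (c[value] + 1) / (sum(c.values()) + len(c) + 1)
        if score > best_score:
            best, best_score = label, score
    return best
```

Despite the crude independence assumption, this product of simple counts is what makes Naive Bayes cheap to train and classify.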
Random Forest [4] is an algorithm based on building a large collection of decision trees at training time using random selection of features. It can be seen as an extension of the decision tree classifier.
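The idea can be sketched with one-level trees ("stumps") as the base learners; the stump stands in for a full decision tree only to keep the example short, and all names are illustrative:

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, feature):
    """A one-level 'tree': map each value of one feature to its majority class."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(label)
    table = {v: majority(ls) for v, ls in by_value.items()}
    return feature, table, majority(labels)    # default covers unseen values

def train_forest(rows, labels, n_trees=25, seed=0):
    """Random forest sketch: bootstrap sample + random feature per tree."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]     # bootstrap sample
        feature = rng.randrange(n_features)            # random feature selection
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest

def classify_forest(forest, row):
    """Majority vote over all trees in the forest."""
    votes = [table.get(row[f], default) for f, table, default in forest]
    return majority(votes)
```

The two sources of randomness (bootstrap sampling and feature selection) are what decorrelate the trees and make the majority vote more robust than any single tree.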
1.5. Support Vector Machines
The Support Vector Machine, henceforth SVM, is a learning and prediction method based on non-probabilistic binary linear classification [6]. Basically, SVM separates linearly separable patterns with a hyperplane. Patterns that are not linearly separable are transformed into a new space, mapped from the original data by a kernel function. If no kernel function is used, the result is a linear SVM, the most basic form of the method.
The advantage of SVM is that it is a theoretical model of learning, so its performance is guaranteed in theory. It also has a modular design that allows components to be added or removed.
SVM is one of the essential algorithms of pattern recognition. In this experiment, a multi-class SVM learner is also used for the data sets that have more than two class labels. To achieve this, each categorical input variable is converted into an indicator variable, and for an N-class problem, multiple binary classifiers are built.
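The one-vs-rest construction can be sketched independently of the underlying binary learner. Here a simple centroid-based scorer stands in for the binary SVM (the scoring rule and all names are our own illustration, not the experiment's actual SVM):

```python
def train_centroid(rows, labels01):
    """Stand-in binary scorer: compares distances to the two class centroids."""
    pos = [r for r, y in zip(rows, labels01) if y == 1]
    neg = [r for r, y in zip(rows, labels01) if y == 0]
    mean = lambda vs: tuple(sum(x) / len(vs) for x in zip(*vs))
    c_pos, c_neg = mean(pos), mean(neg)
    def score(row):
        d = lambda c: sum((a - b) ** 2 for a, b in zip(row, c))
        return d(c_neg) - d(c_pos)       # larger = closer to the positive class
    return score

def train_one_vs_rest(rows, labels, train_binary=train_centroid):
    """One binary classifier per class: that class (1) vs. everything else (0)."""
    return {c: train_binary(rows, [1 if y == c else 0 for y in labels])
            for c in set(labels)}

def classify_ovr(scorers, row):
    """Pick the class whose binary classifier is most confident."""
    return max(scorers, key=lambda c: scorers[c](row))
```

Swapping `train_centroid` for a real binary SVM trainer gives exactly the multi-class scheme described above: N binary problems, with the most confident classifier deciding the label.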
1.6. K Nearest Neighbours
K Nearest Neighbours, henceforth kNN, is an instance-based classification algorithm in which the function is approximated locally and all computation is postponed until classification [5]. In simple terms, the whole training data set is loaded into the classifier, and whenever a new test item arrives, its distance to each training item is calculated and it is labelled with the class of the nearest training item. A more advanced variant checks a certain number of nearest neighbours instead of only the closest one, and assigns the label held by the majority of them. kNN gives very effective results in the field of image processing. IBk is an implementation of a classifier that uses kNN as its search algorithm.
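A minimal sketch of the majority-of-k variant, assuming numeric feature vectors and squared Euclidean distance (an illustrative choice, not necessarily the distance IBk uses):

```python
from collections import Counter

def classify_knn(train_rows, train_labels, test_row, k=3):
    """Label test_row with the majority class among its k nearest training items."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, test_row)), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest = [label for _, label in dists[:k]]   # the k closest training items
    return Counter(nearest).most_common(1)[0][0]
```

Note that there is no training step at all: the full distance computation happens at classification time, which is exactly the "postponed computation" property described above.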
The genetic algorithm, GA for short, is a probabilistic search method based on iteratively converting a population of fixed-length binary strings into a new population of offspring according to an evolutionary principle, using crossover and mutation. A fitness function measures the success of the population so that better candidates are carried into the next iteration. The approach is inspired by natural selection. The steps of a genetic algorithm are listed below:
1. Initialise a random population of candidate solutions.
2. Evaluate each individual with the fitness function.
3. Select the fitter individuals as parents.
4. Produce offspring by crossover and mutation.
5. Replace the population with the offspring and repeat from step 2 until a stopping criterion is met.
Crossover and mutation are totally random phases of this method.
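This loop can be sketched on a toy task where fitness simply counts the 1-bits of the string (the classic "OneMax" problem); the operators and all parameter values here are illustrative:

```python
import random

def run_ga(length=20, pop_size=30, generations=60, seed=1):
    """Maximise the number of 1-bits in a fixed-length binary string."""
    rng = random.Random(seed)
    fitness = lambda bits: sum(bits)                   # OneMax fitness function
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def select():                                  # tournament selection, size 2
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, length)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(length):                    # bit-flip mutation
                if rng.random() < 1 / length:
                    child[i] = 1 - child[i]
            nxt.append(child)
        pop = nxt                                      # replace the population
    return max(pop, key=fitness)

best = run_ga()
```

Selection is the only phase that consults the fitness function; crossover and mutation, as noted above, are purely random.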
Genetic programming, GP for short, is an outgrowth of genetic algorithms. GP uses biological evolution as inspiration to perform artificial intelligence tasks [9]. GP's most important component is its fitness function, which measures how well the algorithm performs the task. It guides the algorithm towards better solutions and is, interestingly, quite similar to natural selection in biology.
GP is widely used for optimisation problems by defining a fitness function for the specific problem at hand. GP adopts almost everything from GA; however, during mutation and crossover, programs are used instead of binary strings.
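The structural difference can be illustrated with subtree crossover: programs are represented as trees (here, nested tuples of the form (operator, child, child)), and crossover swaps subtrees rather than string segments. This is a generic sketch, not the paper's implementation:

```python
def subtrees(tree, path=()):
    """Enumerate (path, subtree) pairs for every node in a program tree."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of tree with the node at path swapped for new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(parent1, parent2, rng):
    """Subtree crossover: graft a random subtree of parent2 into parent1."""
    path, _ = rng.choice(list(subtrees(parent1)))   # crossover point in parent1
    _, donor = rng.choice(list(subtrees(parent2)))  # donated subtree from parent2
    return replace(parent1, path, donor)
```

Because the crossover point can be any node, offspring programs vary in shape and size, which is exactly what fixed-length GA bit strings cannot do.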
GP is used to optimise the creation of the decision tree in this experiment. As mentioned above, compactness is the biggest challenge when setting up a decision tree, and evolutionary methods have been shown to be a good fit for classification problems [10].

Figure 1: Example tree representation of a GP program
This paper experiments with different text classification algorithms and compares their performance on a common and fair ground. To achieve this, three data sets with different numbers of data rows are chosen.