Genetic Programming is a technique inspired by biological evolution in which computer programs solve problems automatically by evolving iteratively under a fitness function. The advantage of this type of programming is that it defines only the basics.
As a result, it is a flexible solution for a broad range of domains. Classification has been one of the most compelling problems in machine learning. In this paper, a genetic programming classifier is compared with conventional classification algorithms such as Naive Bayes, the C4.5 decision tree, Random Forest, Support Vector Machines and k-Nearest Neighbours.
The experiments are run on several data sets with different sizes, feature sets and attribute properties. The time complexity of each classifier is also examined.
Table of Contents
1. Introduction
1.1. Classification
1.2. Decision Trees
1.3. Naive Bayes
1.4. Random Forest
1.5. Support Vector Machines
1.6. K Nearest Neighbours
1.7. Genetic Algorithm
1.8. Genetic Programming
2. Experiment
2.1. Classification
2.2. Data Sets
2.3. Tools and Frameworks
2.4. Compared Algorithms
2.5. Genetic Programming Classifier
2.6. Evaluation
3. Results
3.1. Accuracy
3.1.1. Adult Data Set
3.1.2. Breast Cancer Wisconsin Data Set
3.1.3. Car Data Set
3.1.4. Iris Data Set
3.1.5. Contact Lenses Data Set
3.1.6. Soybean Data Set
3.1.7. Weather Nominal Data Set
3.2. Time Complexity Comparison (seconds)
4. Conclusion
5. Future Work
Research Objectives & Core Themes
The primary objective of this research is to evaluate the effectiveness of Genetic Programming (GP) as a classifier compared to traditional machine learning algorithms, specifically focusing on its ability to optimize decision trees for text classification tasks.
- Comparison of GP-based classification with conventional algorithms like Naive Bayes, Random Forest, and SVM.
- Analysis of performance metrics including AUC, precision, recall, and true positive rate across various data sets.
- Investigation of time complexity for training and classification processes.
- Evaluation of GP performance in relation to data set size, distribution, and attribute complexity.
Excerpt from the Book
1. Introduction
The world is moving towards digitisation; almost everything in human life becomes data. In parallel with improvements in storage and database systems, storing data and accessing it have become easier and cheaper. However, having data does not mean having knowledge. Information must be extracted from the raw data; once that is done, the picture becomes clearer. This is where data mining starts.
Data mining [1] is the exploration and analysis of large quantities of data; its essence is the extraction of interesting knowledge, such as patterns, rules or constraints, from large data sets.
Classification is the problem of identifying which category a data item belongs to. Text classification is one of the most idiosyncratic variants: it labels input text on the basis of training data. Social media and internet usage have been growing with the adoption of real-time communication and text-based information sharing. The increasing amount of text data raises the importance of extracting knowledge from it, which leads computer science to lean ever more on text classification algorithms. The best-known algorithms of this kind are decision trees, Naive Bayes, Random Forest, Support Vector Machines and k-Nearest Neighbours.
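To make the classification setting concrete, here is a minimal from-scratch sketch of one of the algorithms named above, k-Nearest Neighbours. The toy two-cluster data set is invented for illustration and is not one of the paper's data sets.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Tiny illustrative data set: two well-separated clusters in the plane
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]

print(knn_predict(train, (0.5, 0.5)))  # → a
print(knn_predict(train, (5.5, 5.5)))  # → b
```

Every algorithm compared in the paper fits this same interface in spirit: train on labelled examples, then map an unseen input to a category label.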
Summary of Chapters
1. Introduction: This chapter defines the context of data mining and classification, introducing Genetic Programming as a bio-inspired technique for solving classification problems.
2. Experiment: This section details the methodology used for the experiment, including descriptions of the data sets, tools (Weka, Rapidminer, Orange), and the specific parameters of the GP classifier.
3. Results: This chapter presents a comparative analysis of classification performance across seven distinct data sets and evaluates the time complexity of the tested algorithms.
4. Conclusion: The author summarizes the findings, highlighting the potential of GP while acknowledging its current limitations regarding parameter tuning and computational requirements.
5. Future Work: This chapter outlines potential research directions, such as hybridizing GP with neural networks and improving search space efficiency via parallelization.
Keywords
Genetic Programming, Machine Learning, Text Classification, Data Mining, Decision Trees, Accuracy, AUC, True Positive Rate, Time Complexity, Evolutionary Algorithm, Naive Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbours, Fitness Function.
Frequently Asked Questions
What is the core focus of this research?
The paper explores the application of Genetic Programming (GP) to improve decision tree classification models and benchmarks its performance against standard machine learning algorithms.
What are the primary topics covered?
Key topics include bio-inspired search methods, data classification accuracy, performance benchmarking on various UCI repository data sets, and algorithmic time complexity.
What is the main objective of the paper?
The objective is to determine if Genetic Programming can act as a reliable and efficient classification method by optimizing the creation of decision trees.
What research methods were employed?
The study uses an experimental approach, applying 10-fold cross-validation on various data sets and comparing the results of different classification algorithms using tools like Weka and Python-based APIs.
What does the main body discuss?
It provides a technical overview of each classifier, details the experimental setup, and presents an extensive analysis of the results through tables and performance charts.
Which keywords best describe this study?
Genetic Programming, Classification, Data Mining, and Machine Learning metrics such as AUC and Recall are the essential identifiers for this work.
How does GP handle the "compactness" challenge in decision trees?
GP uses a fitness function to iteratively evolve the population of trees, aiming to select more informative attributes close to the root to reduce size and increase classification speed.
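The evolutionary loop described above (fitness evaluation, elitism, mutation) can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's GP classifier: instead of evolving full decision trees, it evolves a single threshold rule on a tiny invented data set, but the loop structure is the same.

```python
import random

random.seed(0)

# Toy one-dimensional, two-class data: true class is 1 when x > 5
data = [(x, int(x > 5)) for x in range(11)]

def fitness(threshold):
    """Fraction of points classified correctly by the rule 'x > threshold'."""
    return sum(int(x > threshold) == label for x, label in data) / len(data)

def evolve(pop_size=20, generations=30, elite=2):
    # Random initial population of candidate thresholds
    pop = [random.uniform(0, 10) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_gen = pop[:elite]  # elitism: carry the best individuals over unchanged
        while len(next_gen) < pop_size:
            parent = random.choice(pop[:pop_size // 2])      # selection
            next_gen.append(parent + random.gauss(0, 1.0))   # mutation
        pop = next_gen
    return max(pop, key=fitness)

best = evolve()
print(round(fitness(best), 2))
```

Elitism is what the FAQ answer refers to: enlarging the elite slice preserves more good individuals per generation (helping accuracy) but means more fitness evaluations survive across generations, at a computational cost.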
Why did the GP classifier perform poorly on the Soybean data set?
The large number of class labels (19) and attributes (36) created a massive search space, making it difficult for the evolutionary process to identify optimal nodes during crossover and mutation.
What is the impact of elite population size on GP performance?
The study found that increasing the elite population size improves the True Positive rate but increases the computational time complexity by approximately 20%.
- Cite this work
- Hakan Uysal (Author), 2013, A Genetic Programming Approach to Classification Problems, München, GRIN Verlag, https://www.grin.com/document/333781