Excerpt

## Table of Contents

CHAPTER 1: INTRODUCTION

1.1 Objective of the work

1.2 Introduction to data mining

1.3 Missing values

1.4 Missing value imputation

1.5 Model flow diagram

1.6 Organization of the report

1.7 Summary

CHAPTER 2: LITERATURE REVIEW

2.1 Introduction

2.2 Literature review

2.2.1 Missing values

2.2.2 Missing value imputation

2.2.3 Kernel functions

2.3 Summary

CHAPTER 3: DATASET DESCRIPTION

3.1 Introduction

3.2 Data set description

3.3 Summary

CHAPTER 4: IMPUTATION TECHNIQUES

4.1 Introduction

4.2 K-Nearest neighbor imputation method

4.3 Experimental results for imputation done using K-NN

4.4 Frequency Estimation Method

4.5 Experimental results for frequency estimator

4.6 Kernel Functions

4.7 Imputation using RBF kernel

4.8 Experimental results for RBF kernel

4.9 Imputation using poly kernel

4.10 Experimental results for poly kernel

4.11 Summary

CHAPTER 5: IMPUTATION USING MIXTURE OF KERNELS

5.1 Introduction

5.2 Interpolation and Extrapolation

5.3 Mixture of kernels

5.4 Experimental results for mixture of kernels

5.5 Imputation using spherical kernel with RBF kernel

5.6 Experimental results for imputation using spherical kernel and RBF kernel

5.7 Imputation using spherical kernel and poly kernel

5.8 Experimental results for spherical kernel and poly kernel

5.9 Summary

CHAPTER 6: RESULTS AND DISCUSSION

6.1 Introduction

6.2 Performance evaluation

6.3 Experimental results and discussion

6.4 Discussion of results

6.5 Summary

CHAPTER 7: CONCLUSION AND FUTURE WORK

7.1 Conclusion

7.2 Future work

REFERENCES

APPENDIX-A

APPENDIX-B

## CHAPTER 1: INTRODUCTION

### 1.1 Objective of the work

The main objective of this work is to use an estimator for imputing missing values in mixed-attribute datasets by utilising the information present in incomplete instances in addition to that in complete instances. This approach prevents the loss of information that occurs when continuous values are converted into discrete values, and vice versa, for imputation.

The method is evaluated with extensive experiments, compared against several typical algorithms, and its performance is measured in terms of root mean square error and correlation coefficient.

This chapter begins with a brief introduction to data mining concepts, missing values and missing value imputation, and concludes with the organization of the report.

### 1.2 Introduction to data mining

Data mining is the process of extracting hidden predictive information from large databases. It is a powerful tool used by modern businesses to transform data into business intelligence, giving an informational advantage. It is currently used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. It can also be defined as the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses or other information repositories.

Data mining is a step in the knowledge discovery process consisting of particular data mining algorithms. It is a powerful technology with great potential to help companies focus on the most important information in their data warehouses. Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication and marketing organizations. It enables these companies to determine relationships among internal factors such as economic indicators, competition, and consumer demography, and to determine their impact on sales, customer satisfaction, and corporate profits.

Data mining consists of five major elements:

- Extract, transform and load transaction data onto the data warehouse system.

- Store and manage the data in a multidimensional database system.

- Provide data access to business analysts and information technology professionals.

- Analyze the data with application software.

- Present the data in a useful format, such as a graph or a table.

**Data mining techniques**

Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems, statistics, machine learning, visualization and information science. Some commonly used data mining techniques are:

- **Artificial neural networks:** Non-linear predictive models that learn through training and resemble biological neural networks in structure.

- **Decision trees:** Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-squared Automatic Interaction Detection (CHAID).

- **Genetic algorithms:** Optimization techniques that use processes such as genetic combination and natural selection in a design based on the concepts of evolution.

- **Nearest neighbor method:** A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique.

- **Association rule induction:** The extraction of useful if-then rules from data based on statistical significance.
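As a concrete illustration of the nearest neighbor method described above, the following is a minimal sketch of k-nearest-neighbor classification; the toy dataset, feature values and value of k are illustrative assumptions, not data from this work.

```python
import math
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """Classify `query` by majority vote among the k closest training records."""
    # Euclidean distance from the query to every record in the historical dataset
    dists = [(math.dist(rec, query), lab) for rec, lab in zip(train, labels)]
    dists.sort(key=lambda pair: pair[0])
    # Majority vote over the k nearest records
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

# Toy historical dataset: two features, two classes (illustrative values)
train = [(1.0, 1.1), (1.2, 0.9), (5.0, 5.2), (5.1, 4.8)]
labels = ["low", "low", "high", "high"]
print(knn_classify(train, labels, (1.1, 1.0), k=3))  # → "low"
```

The same neighbor search underlies K-NN imputation (Chapter 4), where the neighbors' known values, rather than class votes, fill in a missing entry.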

Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms.

**Data mining tasks**

Data mining commonly involves four classes of tasks:

- **Clustering:** The task of discovering groups and structures in the data that are in some way similar, without using known structures in the data.

- **Classification:** is the task of generalizing structure to apply to new data. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.

- **Regression:** Attempts to find a function which models the data with the least error.

- **Association rule learning:** Searches for relationships between variables.

### 1.3 Missing values

Many types of experimental data, especially expression data obtained from microarray experiments and air pollutant data obtained from air sampling machines, are frequently peppered with missing values (MVs) that may occur for a variety of reasons, and they need to be preprocessed.

Tasks in data preprocessing are:

- **Data cleaning: fill in missing values**

- Data integration

- Data transformation: normalization and aggregation.

- Data reduction: reducing the volume but producing the same or similar analytical results.

- Data discretization

Because many data analyses such as classification methods, clustering methods and dimension reduction procedures require complete data, researchers must either remove the data with MVs or, preferably, estimate the MVs before such procedures can be employed. Consequently, many algorithms have been developed to impute MVs accurately. Because missing values can introduce bias that impacts the quality of learned patterns and/or the performance of classification, missing data imputation has been a key issue in learning from incomplete data. Many existing industrial and research data sets contain missing values. They are introduced for various reasons, such as manual data entry procedures, equipment errors and incorrect measurements. The detection of incomplete data is not easy in most cases; missing values can appear in the form of outliers or even wrong data (i.e. values out of boundaries).

There are three basic approaches to dealing with missing values:

- Case deletion.

- Learning without handling of missing values.

- **Missing value imputation.**
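Two of these approaches can be contrasted on a toy table; the values and the simple mean-substitution rule are illustrative assumptions (the imputation methods studied in this work are more sophisticated).

```python
# Toy data table with one missing entry; None marks a missing value.
data = [[2.0, 10.0], [4.0, None], [6.0, 14.0]]

# 1. Case deletion: drop any row containing a missing value.
deleted = [row for row in data if None not in row]

# 2. Missing value imputation: replace each missing entry with its column mean.
def mean_impute(rows):
    cols = list(zip(*rows))
    # Mean of the observed (non-missing) values in each column
    means = [sum(v for v in col if v is not None) / sum(v is not None for v in col)
             for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

print(deleted)           # → [[2.0, 10.0], [6.0, 14.0]]
print(mean_impute(data)) # → [[2.0, 10.0], [4.0, 12.0], [6.0, 14.0]]
```

Case deletion discards the partially observed instance entirely, whereas imputation keeps it, which is why imputation is preferred when incomplete instances still carry useful information.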

### 1.4 Missing value imputation

Missing value imputation is a procedure that replaces the missing values with some feasible values. Various techniques have been developed with great success for dealing with missing values in data sets with homogeneous attributes (their independent attributes are all either continuous or discrete). However, these imputation algorithms cannot be applied to many real data sets, such as equipment maintenance databases, industrial data sets, and gene databases, because these data sets often contain both continuous and discrete independent attributes. Such heterogeneous data sets are referred to as mixed-attribute data sets, and their independent attributes are called mixed independent attributes. Imputing mixed-attribute data sets can be regarded as a new problem in missing data imputation because no estimator has been designed for imputing missing data in mixed-attribute data sets. The challenging issues include how to measure the relationship between instances (transactions) in a mixed-attribute data set, and how to construct hybrid estimators using the observed data in the data set.
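One common heuristic for the first challenge, measuring the relationship between instances with mixed attributes, is a heterogeneous distance that handles each attribute according to its type. The sketch below is illustrative only and is not the estimator developed in this work; the attribute values and the scaling assumption are hypothetical.

```python
def mixed_distance(a, b, is_discrete):
    """Heterogeneous distance between two instances: 0/1 mismatch for
    discrete attributes, absolute difference for continuous ones
    (continuous attributes assumed pre-scaled to [0, 1])."""
    total = 0.0
    for x, y, disc in zip(a, b, is_discrete):
        total += (x != y) if disc else abs(x - y)
    return total

# Instances with one continuous and one discrete attribute (illustrative)
print(mixed_distance([0.25, "red"], [0.5, "red"],  [False, True]))  # → 0.25
print(mixed_distance([0.25, "red"], [0.5, "blue"], [False, True]))  # → 1.25
```

A distance of this kind lets neighbor-based and kernel-based estimators compare instances without first discretizing continuous attributes or numerically encoding discrete ones, which is exactly the information loss this work aims to avoid.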

### 1.5 Model flow diagram

[Figure not included in this excerpt]

Fig 1.1 Data flow diagram (Author’s own work)

Fig 1.1 gives an overview of the work. The original data sets are subjected to preprocessing, which here refers to creating missing values at random; the missing values are then imputed using K-NN, the frequency estimator method, the RBF kernel, the polynomial kernel, a mixed kernel (RBF kernel and poly kernel), a spherical kernel mixed with the poly kernel, and a spherical kernel mixed with the RBF kernel.

Finally, the performance of these imputation methods is evaluated using Root Mean Square Error (RMSE) and the correlation coefficient.
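The two evaluation measures can be sketched directly from their standard definitions; the "actual" and "imputed" values below are illustrative, not results from this work.

```python
import math

def rmse(actual, imputed):
    """Root Mean Square Error between true and imputed values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, imputed)) / len(actual))

def correlation(actual, imputed):
    """Pearson correlation coefficient between true and imputed values."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(imputed) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, imputed))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in imputed))
    return cov / (sa * sp)

# Illustrative comparison: true entries vs. what an imputation method produced
actual  = [3.0, 5.0, 7.0, 9.0]
imputed = [2.5, 5.5, 6.5, 9.5]
print(rmse(actual, imputed))  # → 0.5
print(correlation(actual, imputed))
```

A lower RMSE and a correlation coefficient closer to 1 both indicate that the imputed values track the true values more closely.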

### 1.6 Organization of the report

The report is organized as follows:

Chapter 2 provides an overview of the related literature.

Chapter 3 describes the data sets used.

Chapter 4 presents the imputation techniques and their experimental results.

Chapter 5 presents imputation using mixtures of kernels.

Chapter 6 presents the results and discussion.

Chapter 7 provides the conclusion and future work.

### 1.7 Summary

This chapter discussed data mining, the scope of data mining, various data mining methods, missing values, missing value imputation and the data flow diagram, thereby giving an overview of the basic concepts.

## CHAPTER 2: LITERATURE REVIEW

### 2.1 Introduction

This chapter presents previous studies done in the field of missing values, missing value imputation techniques and kernel functions. It deals with the background studies in the field of missing values and gives an idea about the imputation methods carried out in this field.

### 2.2 Literature review

#### 2.2.1 Missing values

Allison, P. D. (2001), in the paper Missing Data, evaluated two algorithms for producing multiple imputations of missing data using simulated data, based on the SOLAS software. Software using a propensity score classifier with the approximate Bayesian bootstrap was found to produce badly biased estimates of regression coefficients when data on predictor variables are MAR or MCAR.

Brown, M. L., & Kros, J. F.,(2003) in the paper, Data Mining and Impact of missing data explained the importance of estimating missing data in datasets especially in the case of real data sets. Data mining is based upon searching the concatenation of multiple databases that usually contain some amount of missing data along with a variable percentage of inaccurate data, pollution, outliers, and noise. The actual data-mining process deals significantly with prediction, estimation, classification, pattern recognition, and the development of association rules. Therefore, the significance of the analysis depends heavily on the accuracy of the database and on the chosen sample data to be used for model training and testing. The issue of missing data must be addressed since ignoring this problem can introduce bias into the models being evaluated and lead to inaccurate data mining conclusions.

S.C. Zhang et al. (2004), in the paper Information Enhancement in Data Mining, indicated the presence of missing values and pointed out the importance of information enhancement and data preprocessing for raw data. Information enhancement techniques are desired in many areas such as data mining, machine learning, business intelligence, and web data analysis. Information enhancement mainly includes the following topics: data cleaning, data preparation and transformation, missing value imputation, feature and instance selection, feature construction, treatment of noisy and inconsistent data, data integration, data collection and housing, web data availability, and web data capture and representation, among others.

Ghahramani, Z., & Jordan, M. I. (1997), in the paper Mixture Models for Learning from Incomplete Data, reviewed the main missing data techniques, including conventional methods, global imputation, local imputation, parameter estimation and direct management of missing data. They highlighted the advantages and disadvantages of all kinds of missing data mechanisms. For example, they noted that statistical methods have been developed mainly to manage survey data and have proved very effective in many situations; however, the main problem with these techniques is their strong model assumptions.

Zhang, S. et al. (2005), in the paper Missing is Useful: Missing Values in Cost-Sensitive Decision Trees, studied the issue of missing attribute values in training and test data sets. Indeed, many real-world data sets contain missing values, and they are a difficult problem to cope with. Sometimes values are missing due to unknown reasons or to errors and omissions when data are recorded and transferred. However, deleting cases can result in the loss of a large amount of valuable data. In this paper, they study missing data in cost-sensitive learning, in which both misclassification costs and test costs are considered; that is, there is a known cost associated with each attribute (variable or test) when obtaining its values.

Feng, D. C. (2008), in the paper Research on Missing Value Estimation in Data Mining, explained the advantage of the MAR (missing at random) pattern, which is introduced on the basis of an analysis of all the missing patterns, and the EM (expectation maximization) algorithm was applied to the MAR pattern correspondingly. Finally, the missing value estimation algorithm was used in the pre-treatment stage of fault data and combined with a wavelet neural network to realize fault classification. The mean value algorithm easily causes estimation error and reduces the association trend among the variables. However, the missing value estimation process may change the original information system more or less, and may even add noise while filling in null values, which can cause wrong results in data mining. Therefore, how to perform data mining with null values directly, rather than changing the original information system, still needs further research.

#### 2.2.2 Missing value imputation

Qin, Y. et al. (2007), in the paper Semi-Parametric Optimization for Missing Data Imputation, put forward the idea that missing data imputation is an important issue in machine learning and data mining. In this paper, a new and efficient imputation method for a kind of missing data, semi-parametric data, is proposed. This imputation method aims at making an optimal evaluation of the Root Mean Square Error (RMSE), the distribution function and the quantiles after the missing data are imputed.

Zhang, C. et al. (2007), in the paper An Imputation Method for Missing Values, observed that it is necessary to impute missing values iteratively when the missing ratio is large. Hence, many iterative imputation methods have been developed, such as the Expectation-Maximization (EM) algorithm, a classical parametric method.

Dick, U. et al. (2008), in the paper Learning with Incomplete Data with Infinite Imputation, address the problem of learning decision functions from training data in which some attribute values are unobserved. This problem can arise, for instance, when training data is aggregated from multiple sources and some sources record only a subset of attributes. A generic joint optimization problem in which the distribution governing the missing values is a free parameter is derived. It is shown that the optimal solution concentrates the density mass on finitely many imputations, and a corresponding algorithm for learning from incomplete data is provided.

Ling, W. et al. (2009), in the paper Estimation of Missing Values Using a Weighted K-Nearest Neighbors Algorithm, presented a novel algorithm that is capable of simultaneously estimating several missing components using a weighted K-Nearest Neighbors algorithm. The paper studied a new imputation method for the task of establishing a model from observation data when missing values occur among the multivariate input data. The main idea is to exploit correlations between different dimensions in the weighted-KNN distance metric when imputing the missing dimension, where each dimension is weighted by the respective correlation coefficient obtained by the SVR method. The imputation method was motivated by a steel corrosion dataset in seawater environments, on which it was demonstrated to give superior results to the KNN imputation method.

#### 2.2.3 Kernel functions

Silverman, B. W. (2018), in the paper Density Estimation for Statistics and Data Analysis, pointed out that the selection of the optimal bandwidth is much more important than the selection of the kernel function. This is because smaller values of the bandwidth make the estimate look "wiggly" and show spurious characteristics, whereas too large a bandwidth results in an estimate that is too smooth, in the sense that it is too biased to reveal structural features. However, there is no generally accepted method for choosing the optimal bandwidth.

Racine, J., & Li, Q. (2004), in the paper Nonparametric Estimation of Regression Functions with Both Categorical and Continuous Data, proposed a method for nonparametric regression which admits continuous and categorical data in a natural manner using the method of kernels. A data-driven method of bandwidth selection is proposed, and the asymptotic normality of the estimator is established. The rate of convergence of the cross-validated smoothing parameters to their benchmark optimal smoothing parameters was also established.

Smits, G. F., & Jordaan, E. M. (2002), in the paper Improved SVM Regression Using Mixtures of Kernels, explained that a mixed kernel, a linear combination of a poly kernel and a Gaussian kernel, gives much better extrapolation and interpolation than either a local kernel or a global kernel alone. In this work, a mixture of kernels is employed to replace the single kernel in the continuous kernel estimator.

Raykar, V. C., & Duraiswami, R. (2006), in the paper Fast Optimal Bandwidth Selection for Kernel Density Estimation, proposed a computationally efficient approximation algorithm for univariate Gaussian-kernel-based density derivative estimation that reduces the computational complexity from *O* (*MN*) to linear *O* (*N* + *M*). The procedure is applied to estimate the optimal bandwidth for kernel density estimation. The speedup achieved is demonstrated using the "solve-the-equation plug-in" method and on exploratory projection pursuit techniques.

Xiaofeng Zhu et al. (2011), in the paper Missing Value Estimation in Mixed Attribute Data Sets, explained a new setting for missing data imputation, i.e., imputing missing data in data sets with heterogeneous attributes (their independent attributes are of different types), referred to as imputing mixed-attribute data sets. The paper first proposes two consistent estimators for discrete and continuous missing target values, respectively. Then, a mixture-kernel-based iterative estimator is advocated to impute mixed-attribute data sets.

### 2.3 Summary

This chapter gave a detailed view of the work done in the field of missing values, missing value imputation and the use of kernel functions for imputation, which are the topics considered in this work.

**[...]**

- Quote paper
- Aasha Ajith (Author), 2012, How Can a Loss of Information in Mixed Attribute Datasets be Prevented?, Munich, GRIN Verlag, https://www.grin.com/document/457847
