Contents
1. Motivation
2. Short review of imputation approaches
3. Logistic Regression and randomness of missing in IRB
4. Bayesian imputation
5. Numerical example
6. Conclusion
Appendix:
A. Summary of data
B. Post imputation statistical inference
On Missing Data Imputation for IRB Models
Yang Liu1
Abstract Model development often starts with missing data treatment. For regulatory internal ratings-based (IRB) models, missing data raise data quality concerns around systems and processes, whereas the randomness of the missingness is sometimes overlooked, resulting in an inappropriate choice of imputation method. More importantly, the chosen imputation method can introduce issues that violate modelling assumptions later in the process. As ML and AI methods are introduced to regulatory modelling, the impact of missing data will be more thoroughly investigated and challenged.
This paper starts with issues arising from imputation processes in practice, then briefly reviews common approaches to missing data treatment. A candidate Bayesian approach is then proposed as an alternative. In conclusion, results imputed with the proposed approach improve the explanatory power of historical observations while satisfying multiple convergence conditions such as train-test accuracy, the likelihood of the value distribution, cross-validation and challenger model performance. At the dawn of ML and AI algorithms entering regulatory IRB models, these properties are highly desirable in the field.
Keywords: Missing data treatment, Imputation methodology, Bayesian imputation, Cross-validation
1. Motivation
Given that modelling assumptions and methodologies that prevail for good-quality data may prove impaired under the uncertainty introduced by missing data, it is of paramount importance that IRB institutions pay particular attention to the choice of data treatment and modelling methods in this case.
Issues such as missing and erroneous data not only raise questions about the quality of systems and processes, but also affect the quality of any modelling and estimation exercise based on such data. Results from Schafer [1] and Little [2] show the impact and uncertainty embedded in the unknown data, as well as the potential for false conclusions where misleading assumptions are made. Various approaches have been introduced to address these issues in raw data, while concerns around justification and violations of modelling assumptions have been raised and discussed by researchers and practitioners.
Meanwhile, with the regulatory requirement of conservatism in mind, punitive values that result in weaker risk parameter values, e.g. higher estimated PD or LGD values, are often proposed to impute missing data during IRB model development. However, punitiveness differs from conservatism, even more so when it comes to imputing data for IRB modelling purposes. Instead of mitigating uncertainty, punitive imputation not only introduces additional estimation error during the development process; the justification for the imputed value is also frequently requested and challenged in subsequent reviews.
Two main challenges arise in practical application:
1) Logistic Regression assumption: imputation methods can add correlation between features. As the market-standard model for credit risk rating, the Logistic Regression model is well known to assume negligible correlation between the independent variables. Model-based imputation methods such as MICE, however, build regressive relationships among the independent variables; while evidence supports the effectiveness of the methodology, using such imputed data in a Logistic Regression model raises concerns around the theoretical independence assumption.
2) Randomness: if values are not missing completely at random (MCAR) or missing at random (MAR), then statistical measures such as the mean or median derived from the observed data cannot be used to impute the missing data, as the missing values might follow a different distribution; the sketch after this list illustrates the point.
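A minimal sketch of the randomness issue in Python (the data and the missingness mechanism are assumptions constructed for illustration): when larger values are more likely to be missing, the mean of the observed values is a biased estimate for the missing ones, so constant imputation with observed-data statistics misrepresents the missing distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# True values of a feature, before any missingness.
x_true = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Not-at-random mechanism: larger values are more likely to be missing,
# i.e. missingness depends on the unobserved value itself.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * x_true))
is_missing = rng.uniform(size=x_true.size) < p_missing

x_observed = x_true[~is_missing]

print(f"true mean:             {x_true.mean():+.3f}")     # close to 0
print(f"mean of observed only: {x_observed.mean():+.3f}")  # biased low
# Imputing the missing entries with the observed mean would therefore
# assign them values drawn from the wrong part of the distribution.
```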
This paper starts with a brief review of treatment approaches, followed by an extended discussion of the above-mentioned challenges relating to regression modelling and the randomness of missingness. A candidate approach is then proposed, focused on exploring the distribution of all possible values that help to identify the observed outcome according to intermediate models fitted during the process.
The principle of the proposed approach is that the underlying data should improve predictability of the observed outcome; in other words, the imputation candidate at the individual record level should help to explain the observed outcome of that record. Meanwhile, the imputed data should not change the feature selection decisions based on the raw observed data.
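As an illustration only, the following sketch applies this principle to a single record: candidate values for a missing feature are scored by the likelihood an intermediate model assigns to the record's observed outcome. The candidate grid, the model choice and all names are assumptions; the actual Bayesian machinery is developed in section 4.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Complete records: two features and a binary target driven by their sum.
X_full = rng.normal(size=(500, 2))
y_full = (X_full.sum(axis=1) + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Intermediate model fitted on complete records only.
model = LogisticRegression().fit(X_full, y_full)

# One record with feature 0 missing and an observed outcome of 1.
x_record = np.array([np.nan, 0.8])
y_record = 1

# Score each candidate by the likelihood it assigns to the observed outcome.
candidates = np.linspace(-3, 3, 61)           # assumed grid of values
trial = np.tile(x_record, (candidates.size, 1))
trial[:, 0] = candidates
likelihood = model.predict_proba(trial)[:, y_record]

best = candidates[np.argmax(likelihood)]
print(f"candidate maximising P(y={y_record} | x): {best:+.2f}")
```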
2. Short review of imputation approaches
Below are some treatment approaches commonly used to tackle missing values in data. Since erroneous values are nullified upon identification, the imputation approaches for missing values apply equally to them.
Record removal: Simply removing records with missing values is straightforward to understand and implement. While the impact on follow-up analysis is directly linked to the quality of the initial data, the total number of removed records can be significant; e.g. removing records with 5% missing values in each of 4 features could result in up to 20% of records being removed, as illustrated below. More importantly, direct removal potentially means missing out on material characteristics observed in these records in follow-up analysis or modelling exercises.
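A quick check of the removal figure: with 5% missingness in each of 4 features, listwise deletion removes 20% of records in the worst case where missing entries never coincide, and about 18.5% if missingness is independent across features.

```python
# Record loss from listwise deletion: 5% missing in each of 4 features.
p, k = 0.05, 4

worst_case = p * k               # missing entries never coincide
independent = 1 - (1 - p) ** k   # missingness independent across features

print(f"worst case:  {worst_case:.1%}")   # 20.0%
print(f"independent: {independent:.1%}")  # 18.5%
```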
Constant value imputation: This approach replaces missing values with a constant value chosen by the analyst, for example, replacing all missing values with 0 if numerical, or with the mode if categorical. Constants obtained from descriptive statistics, e.g. the mean, median or boundary values, are frequent candidates in this category.
Simple as it seems, use of this approach in practice can be tricky and difficult to justify. The challenge becomes significant when complex rules apply across subsegments, especially when the selected constant differs per segment per feature.
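A minimal sketch of constant value imputation, assuming scikit-learn's SimpleImputer (the paper does not prescribe a library):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0]])

# Constant from descriptive statistics: the per-feature mean...
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# ...or an analyst-chosen constant such as 0.
zero_imputed = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

print(mean_imputed)
print(zero_imputed)
```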
Data driven imputation: One application of data driven approaches is to carry the last observed value forward over the missing value; however, this direct application is most justifiable for time series data. Another variant randomly chooses one of the actual observations for each missing value. The multiple imputation variant of data driven substitution draws from the observed values multiple times to balance potential weaknesses in the assumed value distribution. Both basic variants are sketched below.
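A minimal sketch of the two basic variants, assuming pandas for the last-observation-carried-forward case and a random draw from observed values for the substitution case:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Last observation carried forward -- most defensible for time series.
locf = s.ffill()

# Random-draw variant: fill each missing value with an observed value.
observed = s.dropna().to_numpy()
hot_deck = s.copy()
hot_deck[s.isna()] = rng.choice(observed, size=int(s.isna().sum()))

print(locf.tolist())
print(hot_deck.tolist())
```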
Model based imputation: Methods in this category develop models using existing data to estimate the missing values; the choice of model ranges from regression models to neural networks and machine learning methods.
One of the widely adopted methods is Multiple Imputation by Chained Equations (MICE)2. Recent studies support the use of machine learning methods such as K-Nearest Neighbours (KNN)3 or Multiple Imputation with Denoising Autoencoders (MIDAS)4. Other simulation and Bayesian based models for missing data imputation can be found in Cameletti [5], Gomez-Rubio [4] and Buuren [7].
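A minimal sketch, assuming scikit-learn's IterativeImputer as a MICE-style stand-in and KNNImputer for the nearest-neighbour variant (the library choices are assumptions, not the implementations of the cited works):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X[rng.uniform(size=X.shape) < 0.1] = np.nan  # roughly 10% missing per cell

# Chained equations: each feature is iteratively regressed on the others.
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

# Nearest-neighbour imputation on the same data.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X)
```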
Remarks:
- Model based imputation methods tend to explore relationships between features, in the mathematical form

\[ \hat{X}_k = f\left(X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_K\right), \]

i.e. each feature with missing values is estimated as a function of the remaining features.
Subject to the severity and frequency of missingness, significant correlation can be found in the imputed features, which then leads to multicollinearity issues for regression models (a diagnostic sketch follows these remarks).
- It should be noted that the choice of starting values and the order in which the variables are imputed are important factors that affect the imputed results for model based imputation approaches.
- Testing randomness assumptions is straightforward with a theoretically constructed data set, as the underlying distribution is embedded in the premises of the test. Unfortunately, in practice none of the assumptions made is verifiable from the data.
- Statistical analyses dedicated to tests and model selection involving the imputed data are available for practical applications. The conclusions and recommendations documented in Little [2] apply to a wide variety of quantities, particularly normality-transformed features in a model.
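A diagnostic sketch for the first remark (the correlated features and missingness rate are assumptions on synthetic data): variance inflation factors, VIF_k = 1/(1 − R²_k), computed before and after a chained-equation imputation indicate whether imputation has strengthened inter-feature correlation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor per column: VIF_k = 1 / (1 - R^2_k)."""
    factors = []
    for k in range(X.shape[1]):
        others = np.delete(X, k, axis=1)
        r2 = LinearRegression().fit(others, X[:, k]).score(others, X[:, k])
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(3)
n = 1_000
x0 = rng.normal(size=n)
x1 = 0.6 * x0 + rng.normal(scale=0.8, size=n)  # mildly correlated pair
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

X_miss = X.copy()
X_miss[rng.uniform(size=X.shape) < 0.3] = np.nan  # frequent missingness

# Chained-equation imputation places imputed entries on the fitted
# regression surface, which can strengthen inter-feature correlation.
X_imp = IterativeImputer(random_state=0).fit_transform(X_miss)

print("VIF on complete data:", vif(X).round(2))
print("VIF after imputation:", vif(X_imp).round(2))
```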
3. Logistic Regression and randomness of missing in IRB
Logistic Regression
Logistic Regression is a traditional but well accepted modelling approach across different fields of research and development; it is often considered the industry standard or benchmark model in classification problems such as credit rating, AI item identification and recommendation.
To avoid confusion and misunderstanding, the following terms used throughout this paper are defined and explained before the detailed methodology review and discussion:
Feature: a characteristic in the data that is used to support prediction and output estimation. This is sometimes referred to as a "factor" or "independent" variable and is denoted by X in Logistic Regression.
Target: the observed classification and the targeted output of the prediction model. Also known as the "dependent" variable, which is often denoted by Y in a typical Logistic Regression set-up.
Multinomial Logistic Regression is often represented in the following mathematical form:
\[ \mathrm{score}(j \mid X) = \beta_{j,0} + \sum_{k=1}^{K} \beta_{j,k} X_k, \qquad j \in \{0, \dots, J-1\} \tag{1} \]

where the total number of features is denoted by K, X_k is the value of the individual feature, and \beta_{j,k} is the estimated coefficient of feature X_k for outcome j. The total number of possible target outcomes is denoted by J, with J > 2 and j \in \{0, \dots, J-1\} for multinomial classification; the term \mathrm{score}(j \mid X) is the estimated logistic score for target outcome j. The probability estimates can be obtained using the softmax function, as sketched below.
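A minimal sketch of the softmax step, mapping the class scores of equation (1) to probability estimates (the score values are assumed for illustration):

```python
import numpy as np

def softmax(scores):
    """Map the J class scores of equation (1) to probability estimates."""
    z = scores - scores.max()  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.2, -0.4, 0.3])  # assumed scores for J = 3 outcomes
probs = softmax(scores)
print(probs.round(3), probs.sum())   # probabilities sum to 1
```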
[...]
1 Yang Liu is a quantitative specialist at an international bank. Yang holds a doctorate in quantitative finance from Cass Business School, City University of London. He has published a number of papers on quantitative methods in risk and finance and served as reviewer for journals in the field. The opinions expressed in this paper are those of the author only. E-mail: yang.liu.q-fin@outlook.com
2 See Melissa J. Azur [3] and Buuren [8].
3 See Beretta [9].
4 See Lall [10] and Gomez-Rubio [6].