A Clustering Method for Analysis of Data Subject to Pre-defined Classifications

Script, 2019
12 Pages, Grade: A



1 Motivation

2 Geometric Representation of Data

3 Relative Location of Geometric Representations

4 General and Relative Distances

5 Conclusions

A Clustering Method for Analysis of Data Subject to Pre-defined Classifications

Yang Liu1

Abstract: In this paper, we present a methodology to perform clustering and grouping analysis for datasets with classification constraints or definitions. The discussion is demonstrated with a full example based on real data. We start with the observed differences between the CIA and UN subregional definitions of European countries, and consider their impact from a subregional house price ratio perspective. As documented in this report, we find the presented approach useful for clustering analysis of pre-identified subgroups and for addressing subgroup-based clustering problems.

Keywords: Data Classification, Grouping Analysis, Clustering

1 Motivation

Clustering is an important topic in data analysis. While the methodologies of classification and pattern recognition are widely discussed, most discussions operate at the single-record level and assume that the differences between records are completely embedded in the features of the underlying data.

It is therefore challenging when a classification problem involves constraints or predefined properties in the underlying data.

In this paper we propose an approach to perform classification analysis for data with constraints or pre-defined properties in place. To help demonstrate the approach, we provide a full analytical example using real data alongside the methodology discussion.

We detail the data, target and challenge in the following sub-sections.


Here we download and use the 2018 Q2 IMF house price-to-income and price-to-rent data for research and academic purposes only. Any conclusions and findings presented in this paper using the dataset do not reflect or contradict any conclusion presented or found in other publications.

The source of the housing price ratio data used in this paper is: IMF Global Housing Watch 1.

Note that the IMF analysis quotes the Organisation for Economic Co-operation and Development as its source. While we fully acknowledge the analysis and reports published by the IMF, here we only make direct reference to the IMF web location from which the data is downloaded.

We then filter the data and only keep the European countries to meet our focus of analysis.

As this data is at country level, we map these countries according to the European country classification rules of both the CIA and the United Nations.

The CIA classification can be found in the World Factbook: CIA World Factbook 2. The United Nations geographical classification is defined in: United Nations Geoscheme 3.

The full final dataset used to demonstrate the methodology is shown in Table 1.

Table 1: 2018 Q2 Indexed (2015=100) IMF House Price-to-Rent and Price-to-Income Ratio with CIA and UN Geographical Region Classification for European Countries

Illustration not included in this excerpt

Note that the price ratios are indexed by setting the 2015 value to 100. We keep only 2 decimal places for display purposes in the table; the original values are kept and used in calculations in the rest of this paper.
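The indexing convention described above can be sketched as a simple rescaling. This is only an illustration under stated assumptions: the function name `index_to_base` is hypothetical, the input numbers are made up rather than taken from the IMF dataset, and the exact definition of the 2015 base value (for example a single quarter or an annual average) is not given in the text.

```python
def index_to_base(value: float, base_value: float, base: float = 100.0) -> float:
    """Rescale a raw ratio so that the base-period value maps to `base`."""
    if base_value == 0:
        raise ValueError("base-period value must be non-zero")
    return base * value / base_value


# Hypothetical raw ratios, NOT taken from the IMF dataset.
raw_2015 = 4.0
raw_2018_q2 = 4.5

full_precision = index_to_base(raw_2018_q2, raw_2015)  # used in calculations
displayed = round(full_precision, 2)                   # rounded for display only
print(full_precision, displayed)
```

Keeping the unrounded value for calculations and rounding only for display mirrors the treatment of Table 1 described above.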

Figure 1 graphically displays the raw data.

Figure 1: IMF House Price-to-Income and Price-to-Rent Index 2018 Q2 (2015=100)

Illustration not included in this excerpt


We acknowledge the respective rationales behind the CIA and UN classifications of the countries. Our analytical target is to demonstrate the analytical approach by establishing a grouping and measuring system using these data and to compare the results from a house price ratio perspective. The results and conclusions in this paper do not contradict the published CIA and UN classifications.

As shown in Table 1, there are significant differences between the CIA-defined and UN-defined subregions of European countries. The UN considers four subregions (Eastern, Western, Northern and Southern), while the CIA has a fifth subregion: Central Europe.

Aside from the total number of subregions, there are also differences in the classification of certain countries. For example, the UK is considered a Western European country by the CIA but a Northern European country by the UN; the same holds for Ireland.

Meanwhile, the CIA’s Eastern European countries Latvia and Estonia are considered Northern European countries by the UN. Additionally, Germany, a Western European country for the UN, is classified as a Central European country by the CIA. More differences like these can be found in Table 1.
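The classification differences quoted above can be collected into two label mappings over the same set of countries. The sketch below uses only the five countries named in the text; the remaining rows of Table 1 are not reproduced, and the helper names `disagreements` and `group_by_region` are hypothetical.

```python
# Region labels taken only from the examples quoted in the text;
# the other countries of Table 1 are deliberately omitted.
CIA = {
    "United Kingdom": "Western",
    "Ireland": "Western",
    "Latvia": "Eastern",
    "Estonia": "Eastern",
    "Germany": "Central",
}
UN = {
    "United Kingdom": "Northern",
    "Ireland": "Northern",
    "Latvia": "Northern",
    "Estonia": "Northern",
    "Germany": "Western",
}


def disagreements(a, b):
    """Return {country: (label_a, label_b)} for countries labelled differently."""
    return {c: (a[c], b[c]) for c in sorted(a.keys() & b.keys()) if a[c] != b[c]}


def group_by_region(labels):
    """Invert a {country: region} mapping into {region: [countries]}."""
    groups = {}
    for country, region in labels.items():
        groups.setdefault(region, []).append(country)
    return groups


for country, (cia_label, un_label) in disagreements(CIA, UN).items():
    print(f"{country}: CIA={cia_label}, UN={un_label}")
```

Both mappings index the same records, so any subregional statistic can be computed twice, once per grouping, without touching the underlying data.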

Now, if we consider the house price ratios on a subregional basis in Europe, how do the above differences affect the result? Note that in this example the key difference lies in applying two sets of definitions to exactly the same underlying data.

The goals of this example are:

- First, to demonstrate our approach to grouping and measuring clusters of data under predefined conditions, in this case the two different classification definitions.
- Second, to look into further clustering of pre-defined data groups, for example separating the data into only two groups by further grouping the CIA-defined or UN-defined regions.
- Third, and specific to this example, to compare the two sets of definitions from a housing price perspective using the IMF reported index data.


Classic clustering methods such as K-means clustering work less well for the purpose of this example, as the output clusters are strongly subject to constraints. In particular, we find the following:

- The suggested clusters do not necessarily comply with the pre-defined regions.
- For data records whose values are numerically close, clustering results are subject to random seeds and processes.
- Further clustering or splitting of grouped data, such as grouping 5 geographic subregions into 2, is either not possible or violates the initial definition, such as the region classifications.
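The first two points above can be illustrated with a minimal K-means sketch. It is not the implementation behind Figure 2: it is a plain Lloyd's iteration started from explicit initial centroids, the four data points and the region labels "A" and "B" are hypothetical, and `respects_regions` is an illustrative helper.

```python
import math


def kmeans(points, centroids, iterations=100):
    """Plain Lloyd's algorithm from explicit initial centroids.

    Returns one cluster index per point."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def respects_regions(labels, regions):
    """True if every pre-defined region falls entirely inside one cluster."""
    seen = {}
    for lab, region in zip(labels, regions):
        if seen.setdefault(region, lab) != lab:
            return False
    return True


# Hypothetical, numerically close index values and hypothetical region labels.
points = [(100, 100), (100, 101), (102, 100), (102, 101)]
regions = ["A", "A", "B", "B"]

run_1 = kmeans(points, [(100, 100.5), (102, 100.5)])
run_2 = kmeans(points, [(101, 100), (101, 101)])
print(run_1, respects_regions(run_1, regions))
print(run_2, respects_regions(run_2, regions))
```

Both runs converge, but to different partitions: the first keeps each region intact, while the second splits both regions across clusters. Nothing in the algorithm itself enforces the pre-defined grouping.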

Figure 2 demonstrates K-means clustering with 5 clusters and 4 clusters respectively.

Figure 2: K-Means Clusters

Illustration not included in this excerpt

2 Geometric Representation of Data

We apply conditions according to the pre-defined property or constraint, and geometrically represent each data subset as a convex polygon. On a 2D plane, the X and Y axes are the 2 numerical ratios considered in the analysis. Our approach also works in higher dimensions and can adopt orthogonal transformation or dimension reduction methodologies, but we focus on the 2-dimensional example in this paper for simplicity and illustration purposes.
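One common way to obtain a convex polygon from a set of 2D points is to take its convex hull. The excerpt does not state which construction is used, so the following is only a sketch under that assumption: it uses Andrew's monotone chain algorithm, the function names are hypothetical, and the sample records are made up.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order.

    With fewer than three distinct points the result is degenerate
    (a single point or a segment) rather than a polygon."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def region_polygons(records):
    """records: iterable of (region, x, y). Returns {region: hull vertices}."""
    grouped = {}
    for region, x, y in records:
        grouped.setdefault(region, []).append((x, y))
    return {region: convex_hull(pts) for region, pts in grouped.items()}


# Hypothetical records: (region, price-to-income index, price-to-rent index).
records = [
    ("A", 0, 0), ("A", 2, 0), ("A", 2, 2), ("A", 0, 2), ("A", 1, 1),
    ("B", 5, 5), ("B", 6, 7),
]
print(region_polygons(records))
```

In this sketch the interior point (1, 1) of region "A" does not appear among the vertices, and region "B", with only two records, reduces to a line segment.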

Tables 2a and 2b define the 2D polygons according to the CIA and UN definitions.

Table 2: Vertices for 2D Graphics Plot

Illustration not included in this excerpt

(a) CIA Region Vertices

Illustration not included in this excerpt

(b) UN Region Vertices


1 Yang Liu is a quantitative specialist at an international bank. Yang holds a doctorate in quantitative finance from Cass Business School, City University of London. He has published a number of papers on quantitative methods in risk and finance and has served as a reviewer for journals in the field. The opinions expressed in this paper are those of the author only.

Excerpt out of 12 pages


Quote paper
Yang Liu (Author), 2019, A Clustering Method for Analysis of Data Subject to Pre-defined Classifications, Munich, GRIN Verlag, https://www.grin.com/document/491428

