Mining big annual statement datasets to predict highly lucrative companies using classification trees and forests

Master's Thesis, 2014

98 Pages, Grade: 1,0


Table of contents

List of illustrations

List of tables

List of abbreviations


1. Introduction and problem description
1.1 Intention of this thesis
1.2 Proceeding

2. Introduction to key figure analysis
2.1 The principle of key figures
2.2 The classical key figure analysis approach
2.3 Modern key figure analysis approaches
2.4 Limitations of annual report analysis

3. The available dataset
3.1 Description of the dataset
3.2 Data clean-up

4. Key figure selection
4.1 Significant key figure requirements
4.2 The selected key figures of this analysis
4.2.1 Selected class variable
4.2.2 Selected qualitative key figures
4.2.3 Selected absolute key figures
4.2.4 Selected relative key figures
4.3 Class analysis

5. Classification trees and forests
5.1 Preconsiderations
5.2 Classification trees
5.2.1 A simple example
5.2.2 Generation of classification trees
5.2.3 Pruning an existing tree
5.2.4 Relevant properties of CART trees
5.3 Random forest
5.3.1 Classification process of a random forest
5.3.2 Generation of random forest
5.3.3 Relevant properties of random forests

6. Classification results
6.1 Classification tree results
6.1.1 Examination of the most precise tree
6.1.2 Key indicator importance ranking
6.1.3 Transfer to data from 2011
6.2 Classification forest results
6.2.1 Transfer to data from 2011
6.2.2 Key indicator importance ranking

7. Conclusion
7.1 Critical assessment
7.2 Outlook



1. This thesis’ procedure model
1.1 Definitions of “reference model” and “process model”
1.2 The CRISP-DM reference model
1.2.1 The six phases of CRISP-DM
1.2.2 Assessment of CRISP-DM

2. Data extraction

3. Class comparison diagrams

List of illustrations

Illustration 1: Structure of the available csv-file

Illustration 2: Boxplots of the absolute key figures

Illustration 3: Proportion of corporate failure in 1996-1997

Illustration 4: Boxplots of the relative key figures

Illustration 5: Occurrence of NAs among the key figures

Illustration 6: Transition of the class variable between 2010 and 2011

Illustration 7: Lucrativeness in 2010 and 2011 considering NAs

Illustration 8: Classification tree of the example dataset

Illustration 9: Two possible splits of the example datasets

Illustration 10: Overfitting example

Illustration 11: Example’s variable importance values of the tree

Illustration 12: Random forest for the example consisting of 100 trees

Illustration 13: Example’s variable importance values of the random forest

Illustration 14: False positive fallacy

Illustration 15: Undersampled classification tree

Illustration 16: Classification tree’s importance ranking

Illustration 17: Precision values for different minimum vote counts

Illustration 18: Factor of share improvement for different minimum vote counts

Illustration 19: Instance numbers for different vote limits

Illustration 20: Importance rankings of the classification trees and forests

Illustration 21: The six CRISP-DM phases

Illustration 22: Class comparison boxplots 1-8

Illustration 23: Class comparison boxplots 8-16

Illustration 24: Class comparison bar diagram of National_legal_form

Illustration 25: Class comparison bar diagram of Legal_form

Illustration 26: Class comparison bar diagram of Company_Independence

List of tables

Table 1: Important parameters of the class variable

Table 2: Selected qualitative key figures

Table 3: Selected absolute key figures

Table 4: Selected relative key figures

Table 5: Four entries of the example dataset

Table 6: Cross validated classification tree results for 2010

Table 7: RPART’s prediction of lucrativeness for 2011

Table 8: Results for the reduced dataset

Table 9: Cross validated classification forest results for 2010

Table 10: Random forest’s prediction of lucrativeness for 2011

Table 11: Results of the measures to improve the precision of the forests

List of abbreviations

illustration not visible in this excerpt


I would also like to thank my supervisor Prof. Dr. Andreas Behr for assisting me with this thesis. He provided me with valuable suggestions and gave me the opportunity to write about my favourite topic. Thank you very much!

1. Introduction and problem description

In the literature, many researchers describe how to use annual report data to predict whether a company is going to go bankrupt (Dimitras, Zanakis and Zopounidis 1996, 487–513). The reason why this topic attracts such a high degree of scientific attention is rather obvious: the stability of the financial system depends on the ability of banks and other financial service providers to assess whether a firm will be able to repay a loan. Furthermore, banks need this information to calculate an adequate probability of default and thus to identify a minimum interest rate for a specific loan (Moro and Schäfer 2004).

However, it is relevant to anticipate not only this worst case of bankruptcy but also whether a given small firm will grow extraordinarily in the next year and maybe even become a big company in the medium term. This is crucial information for private investors and fund managers who need to decide whether to invest in a certain firm. Companies like Apple and Amazon have shown that people who recognised the potential of such companies and bought their shares earned a lot of money.

The prediction models described in this paper can also be used by politicians to identify companies which are eligible for funding. Because growing companies often hire many employees, it might be worthwhile to support their development through selective subsidies in order to reduce unemployment. Furthermore, the models make it possible to question the prediction of a financial analyst who has come to a different conclusion than a model.

Since annual reports are often publicly available for free, it is reasonable to take advantage of them for such a prediction (Gräfer 1988, 52). Additionally, various information providers maintain huge databases of annual reports. A big data approach promises to further improve the accuracy of predictions (Rauscher and Rockel 2001, 5). This paper introduces methods which make it possible to generate knowledge from these huge data sources in order to identify extraordinarily lucrative firms.

To generate these prediction models, a data mining approach is used which is based on the established CRISP-DM process model for data mining projects. CRISP-DM ensures comparability and the consideration of best practices (Chapman, et al. 2000, 1-2). The prediction models are based on classification trees and forests because these have some very substantial advantages over other methods, such as neural networks, which are frequently used in the literature. For instance, the underlying algorithms do not require a distributional assumption, accept both quantitative and qualitative inputs, and are not sensitive to outliers. The two most important advantages, however, are interpretability and the handling of missing data. A tree can be easily interpreted by users, which is important for the previously described stakeholders because it is hard to trust the results of a model which one does not understand (Löbbe 2001, 199). A lack of understanding might therefore impede the practical implementation of such a model. Besides that, the algorithms used can handle missing data, which occur very often in the available dataset. In other analyses, data entries are removed even if only one value is missing. This reduces the often already relatively small amount of available data and can reduce the model’s accuracy (Neeb 2011, 67, Franken 2007, 5). This is not the case for the methods applied here.

1.1 Intention of this thesis

The intention of this paper is to determine whether a stakeholder can use a classification tree or classification forest at the beginning of a year to identify German firms which will grow exceptionally in that year, using key figures from the annual reports of previous years. As a first step, key figures from the years 2007, 2008 and 2009 are used to generate different trees and forests which predict whether or not a company grows outstandingly in 2010. Generating these models requires the lucrativeness information from 2010. As a second step, to evaluate how well these unchanged models would work for the mentioned stakeholder at the beginning of 2011, they are also applied to data from 2008, 2009 and 2010. This means that the models are now applied to more recent data to anticipate whether the firms under consideration will grow strongly in 2011. Data from 2011 is only used to check the correctness of the predictions and not to generate models. The best models identified are also compared and analysed.
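The time shift between the two steps can be made explicit with a small sketch. It is only an illustration in Python and not the implementation used in this thesis; the function name `feature_years` and the parameter `window` are assumptions introduced here.

```python
def feature_years(prediction_year, window=3):
    """Return the report years whose key figures are used to predict
    growth in `prediction_year` (hypothetical helper for illustration)."""
    return [prediction_year - offset for offset in range(window, 0, -1)]

# Step 1: models are generated from these years, labelled with growth in 2010.
training_years = feature_years(2010)
# Step 2: the unchanged models are applied to these years to predict 2011.
application_years = feature_years(2011)

print(training_years, application_years)
```

In both steps the prediction year itself never contributes key figures; it only supplies the class label (step 1) or the values used to check the prediction (step 2).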

These four particular years have been chosen because the available dataset “Amadeus” only contains a relatively small amount of more recent data. It is probably not necessary to consider more than three years for the generation of these models because the literature shows that additional years do not noticeably improve prediction (Pytlik 1994, 94).

One important characteristic of this paper is the usage of the CRISP-DM model, which is frequently used by data analysts and helps them to understand the analysis more easily. Furthermore, this model encompasses important best practices which could otherwise be overlooked.

Another distinctive feature of this analysis is that the dataset used has a total size of about 18 gigabytes. It is so large because it contains more than 24 million entries, each of which contains up to 158 attributes, covering over 3 million different European companies. Such dimensions are atypical for analyses of this kind[1] and overburden frequently used software like R. Furthermore, this analysis overstrains most current desktop computers because they do not have enough main memory for it. Besides that, this dataset does not have a typical database structure. The analysis therefore meets two important criteria of Big Data analysis, namely size and complicated structure (IBM Corporation Software Group).

1.2 Proceeding

To reach all these goals, the following procedure is chosen. Since this data mining analysis is based on key figures from annual reports, chapter 2 describes some main principles of such an analysis, explains why it is more powerful than analyses using older techniques, and discusses the general drawbacks of using annual statements. That chapter also provides some explanations of why the generated models sometimes fail to make a correct prediction.

Chapter 3 presents the dataset used and the requirements its entries have to meet in order to be analysed in this thesis. Moreover, the appendix explains how to solve some of its structural problems.

After the available data has been described, chapter 4 presents and elucidates all the qualitative and quantitative key figures which are used to predict growth. To be meaningful, these key figures have to meet certain requirements, which are presented as well. Furthermore, the chapter describes how it is determined whether a company is lucrative and has therefore grown outstandingly, because there are several different ways to do that. Based on these key figures, a first analysis is carried out to find differences between lucrative and non-lucrative companies.

The next chapter illustrates why classification trees and forests are used in this thesis and which software is used to generate them. The methods are explained based on a simple example which is presented, too.

Chapter 6 contains the actual analysis and a comparison of the different results. The findings are summarised in chapter 7, where a conclusion is drawn.

2. Introduction to key figure analysis

Because the main part of this thesis is an analysis of annual statements’ key figures, this chapter explains what key figures are in general and what advantages and shortcomings such an analysis has. Moreover, the assumptions and risks of the data analysis, which are crucial for the entire project, are presented.

This chapter is part of the Business Understanding phase of CRISP-DM because it explains the context of the upcoming data analysis. Some further aspects of this phase, like the target group of this analysis, have already been mentioned in the introduction to conform to the structure of a scientific paper.

2.1 The principle of key figures

A key figure is a condensed indicator of a certain quantifiable issue. It provides information in such a way that the reader gets a quick overview of the most important aspects of this issue, and it points out abnormalities (Pook and Tebbe 2002, 104). In this thesis, such a figure has to inform about the economic state of the enterprise under consideration.

Every key indicator should be designed so that it has a clear meaning because, in theory, it is of course possible to put arbitrary figures in the numerator and denominator of a key figure (März 1983, 80). But even the value of a well-designed key figure often has no meaning of its own and only becomes meaningful when it is compared to the value of another company or to a certain reference value (Johnson 1970, 1167). Furthermore, it is advisable to look at the development of certain key indicators over time (Pytlik 1994, 98).

It is important to mention that there are absolute key figures, relative key figures, and proportional key figures (Mittag 2011, 73).

Several key figures can be combined into a so-called “key figure system”, which aims to represent managerial interdependencies and certain external influences (Löbbe 2001, 24). Such a system can be used to compare several enterprises even if they differ in some respects (Schult 2003, 15).

2.2 The classical key figure analysis approach

Such a key figure system can be used to analyse the economic state of a company. To do this, the enterprise is assessed based on subcategories like asset structure, profitability, and liquidity. All key figures necessary for such an assessment are calculated from the figures published by the company. The results of these subcategories are then combined into an overall result (Löbbe 2001, 23, Franken 2007, 3-11). Such an analysis should also provide information about how rich or poor a firm is, why and by how much its assets have changed, and how successful it will be in the future (Löbbe 2001, 35, Franken 2007, 3).

This description creates the impression that such an approach can be used in this paper to predict whether a company is going to make a profit or incur a loss. But this is not the case since these techniques have many shortcomings, which are summarised in the following paragraphs.

These classical approaches try to infer the current state of an enterprise from certain key figures (Löbbe 2001, 34). This is very problematic because there are no proven theories which would justify such deductive reasoning, only evidence of certain interdependencies. An analyst therefore has to make many assumptions about which key figures to look at, how strong the impact of each of them is, and how to combine the results of the subcategories into an overall assessment. Furthermore, it is in most cases ambiguous whether a certain value of a figure is actually a “good” or a “bad” sign. This makes such an assessment highly subjective. Besides that, using too few figures reduces the informative value of the key figure system because some important aspects are not considered, whereas using too large a system makes it very hard to get a quick overview. Identifying an appropriate number of figures is therefore not trivial either (Löbbe 2001, 33-46, Hauschildt and Baetge 2000, 115, Küting and Weber 1994, 342).

But there are many other problems, too. One of them is that the results of these classical approaches are often not precise enough (Moro and Schäfer 2004). Moreover, such judgements require a lot of time and generate relatively high costs because they hardly benefit from modern information processing (Nanni and Lumini 2009, 3028).

Because of this lack of a theoretical foundation, the insufficient precision and the high cost, these approaches are not used in this thesis.

2.3 Modern key figure analysis approaches

All these disadvantages of the previously mentioned approaches motivated scientists to develop new kinds of methods to make predictions based on key figures from annual statements. The ongoing evolution of digital information processing, which enables the practical application of these methods, is also a very important reason why the significance of these methods continues to increase (Löbbe 2001, 35).

Because there is no proven theory about the dependencies between key figures and the state of the corresponding enterprise, this approach encompasses various data mining methods.

Data Mining (DM) is both the science and art of intelligent data analysis, which aims to gain insights into the data and to learn about interesting patterns and trends (Williams 2011, VII, Hastie, Tibshirani and Friedman 2009, 8, Han, Kamber and Pei 2011, 8). A pattern is usually regarded as relevant if it is universally valid, not already known to the user, and useful and understandable for the user. Such relevant patterns are regarded as knowledge (Runkler 2010, 2).

The identified knowledge is often represented as models, which are structured representations of the underlying data. Models are sometimes also called “learners”. They can subsequently be used for predictions or to learn more about the data (Williams 2011, 3-4, Hastie, Tibshirani and Friedman 2009, 20-21).

DM was introduced by the database community in the 1980s and is now also advanced by statisticians and artificial intelligence researchers (Williams 2011, VII). Statistics added various computational methods and visualisation techniques to DM. Artificial intelligence contributed its focus on heuristics, and the database experts provided the knowledge of how to efficiently store and access the large amounts of data which have to be analysed (Gorunescu 2011, 2-3).

Nowadays, many different kinds of data are collected, such as data from social media, patient data, and data from the retail industry and science (Han, Kamber and Pei 2011, 2). DM methods can be used to analyse this data in order to predict heart attacks, identify cancer, anticipate share prices, and recognise spam emails (Hastie, Tibshirani and Friedman 2009, 20-21).

There are several ways to categorise DM methods; one of them is presented here. The first category is “characterization and discrimination”, in which the properties of certain user-defined classes are analysed. In the category “mining frequent patterns”, item sets and patterns which occur frequently within the data are identified. In “classification and regression”, the classes or a certain target value of not yet classified objects have to be determined; these methods require a certain number of already classified objects to determine the classification model. Methods of “cluster analysis” try to identify objects which belong together based on similarity considerations when no class information exists in advance. Moreover, there are methods for the detection of outliers, i.e. objects which are very different from most of the other objects (Han, Kamber and Pei 2011, 15-21).

The analysis of this thesis is a classification task. Firms are classified as firms which will grow strongly (class 1) or which will not grow or will even shrink (class 2). The right class is not known in advance and is determined based on concrete annual report data of the respective enterprises from previous years (Anders and Szczesny 1999, 3). The corresponding DM methods cannot give reasons for the underlying observations but can be used for predictions if the assumption holds that the identified trends or patterns stay valid up to the prediction point (Löbbe 2001, 34, Franken 2007, 1). Besides that, the methods used make it possible to assess the quality of the generated results (Hauschildt and Baetge 2000, 115). The classical approach does not offer this possibility.

Edward I. Altman was the first scientist who used such a modern approach. He applied multiple discriminant function analysis to annual report data (Löbbe 2001, 46). Because of this method’s very restrictive assumptions of linear separability, multivariate normality and independence of the predictive variables, other authors have applied other methods to this kind of data (Chandra, Ravi and Bose 2009, 4831). Examples of other data mining methods used are neural networks, decision trees, and support vector machines (Kumar and Ravi 2007, 4-13).

An important advantage of such a data mining approach is that it meets the principles of the analysis of annual statements: the results meet the objectification principle because they are generated from empirical data. They also meet the neutralisation principle because the importance of each key figure is determined by the method used. Last but not least, they meet the holism principle since assets, finances, and yields are all taken into account (Baetge and Henning 2008, 279).

Furthermore, it is important to point out that it is possible to combine the results of the classical and the data mining approaches to benefit from all of their advantages simultaneously.

Since the classification and prediction model is now built from given data, the quality of this data directly influences the quality of the model and has to be taken into account (Löbbe 2001, 137).

It is important to mention that DM is not just a collective term for various data analysis methods but describes an entire process which is carried out as a project. In such a project, DM experts, data experts, and domain experts have to collaborate to bring together the knowledge of how to analyse the data, how to access the data, and how to understand the data’s semantics. Moreover, the actual target of the DM project and the intended procedure are often not clear at the beginning and are often specified based on first results. Even after the procedure and the targets have been specified, it is often necessary to return to previous stages because of new insights. Furthermore, several models are created, tested and improved in the course of the project until a satisfactory performance is achieved (Williams 2011, 5-8, Runkler 2010, 3).

The CRISP-DM reference model is the most common one and encompasses plenty of best practices. To benefit from these best practices, this model is considered in this thesis. The model’s description can be found in the appendix.

2.4 Limitations of annual report analysis

At the end of this chapter, it is important to point out some general aspects of analysing annual statement data because these facts directly influence the quality of the created model.

First of all, annual reports were not originally designed as a foundation for predicting growth but rather concern the past by telling how wealthy the company is and why its assets have changed. This means that the annual report is diverted from its intended use (Franken 2007, 3).

Another problem, especially in the context of small and medium-sized companies, is that their success strongly depends on the manager of the company. Unfortunately, most datasets do not contain any information such as the age, gender and education of this person (Anders and Szczesny 1999, 1-2).

Furthermore, there is often no information about the enterprise’s strategic goals, its capability to be innovative, the professionalism of the manager and the staff, and the customer focus. All these aspects influence whether a company is going to be successful but cannot be used because they are either not available at all or very hard to operationalise and, therefore, require controversial generalisations (Moro and Schäfer 2004, Fritz 1993, 1, Feldo 2011, 8).

But even the available information cannot be regarded as objective, which affects the informational value of the key figures as well. The reason is that companies have a certain degree of discretion in calculating certain values, so that two identical companies can legally produce different annual statements. At least some companies take advantage of this to prepare their annual statement in such a way that they have to pay lower taxes (Löbbe 2001, 43, Rauscher and Rockel 2001). Besides that, annual statements are not instantly available at the beginning of a year, so that analysts have to wait until they can use this information for prediction. If they need the outcomes of their predictions earlier, they have to rely on older data. This degrades the accuracy of their prediction (Löbbe 2001, 43).

There is another kind of problem, too, which has rather mathematical causes. One of them is that even if all required values are available, some key indicators cannot be calculated because their denominator has the value zero. In huge datasets, this most likely occurs a few times so that these firms have to be removed, too (Löbbe 2001, 138). Moreover, the same value of the same key indicator can result from completely different initial values which are divided by each other. For example, both 2/4 and 333/666 yield the same result 0.5, which, on the one hand, makes it possible to compare completely different enterprises as mentioned before but, on the other hand, makes it difficult to infer certain properties of the firm from such a division result (Franken 2007, 9).
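Both effects can be illustrated with a minimal Python sketch. The function name `relative_key_figure` is hypothetical, and returning `None` for an incalculable ratio is an assumption made for this illustration, not a description of how the cited literature or this thesis treats such cases.

```python
from typing import Optional


def relative_key_figure(numerator: Optional[float],
                        denominator: Optional[float]) -> Optional[float]:
    """Divide two balance sheet values; return None if the ratio is undefined."""
    if numerator is None or denominator is None:
        return None  # a required value is missing
    if denominator == 0:
        return None  # division by zero: the key figure cannot be calculated
    return numerator / denominator


small_firm = relative_key_figure(2, 4)
large_firm = relative_key_figure(333, 666)
undefined = relative_key_figure(5, 0)

print(small_firm, large_firm, undefined)
```

The two firms receive the same ratio although their underlying values differ by two orders of magnitude, and the third ratio has to be treated as missing.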

Despite all these shortcomings of annual statements, Gräfer still points out that it is meaningful to use them for prediction purposes because they are often the only publicly available source of information and still contain a lot of useful data (1988, 52).

3. The available dataset

In this chapter, the dataset used in this paper is described. In this context, both the content and the structure of the dataset are illustrated. This chapter is part of the Data Understanding phase of CRISP-DM.

3.1 Description of the dataset

The dataset originates from the company “Bureau van Dijk Electronic Publishing GmbH” (BvD). BvD obtains digitised data about companies from its information providers, combines this data, and provides it to its customers for analysis and research purposes. BvD also collects some of its data itself.

The dataset contains both companies which are listed on a stock exchange and companies which are not or no longer listed, with an emphasis on non-incorporated firms (Bureau van Dijk Electronic Publishing GmbH 2013). Furthermore, this dataset, which is called “Amadeus”, encompasses records from eastern and western Europe. In total, Amadeus contains approximately three million different companies. To enable comparisons of international firms, the annual report data in particular was collected in a standardised way (Bureau van Dijk Electronic Publishing GmbH 2013).

Amadeus is stored in five comma-separated values files (csv-files). Its values are enclosed in quotation marks, and consecutive values are separated by tab characters. The structure of such a csv-file can be seen in Illustration 1.

illustration not visible in this excerpt

Illustration 1: Structure of the available csv-file
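A file in this layout can be parsed with standard tools. The following Python sketch is only an illustration under stated assumptions: the sample rows are made up, and apart from “BvD ID number” the column names (`Company name`, `Previous name`) are hypothetical.

```python
import csv
import io

# Made-up rows in the described layout: quoted values separated by tabs.
sample = (
    '"BvD ID number"\t"Company name"\t"Previous name"\n'
    '"DE0001"\t"Alpha GmbH"\t"Alpha OHG"\n'
    '"DE0001"\t"Alpha GmbH"\t"Alpha KG"\n'
)

# The delimiter is a tab character although the files are called csv-files.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quotechar='"')
rows = list(reader)

for row in rows:
    print(row["BvD ID number"], row["Previous name"])
```

For files of several gigabytes, iterating over the reader row by row instead of building a list keeps the memory usage low.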

In this analysis, only two of the five csv-files are required: master file data (86 features) and finance data (72 features). Each of these files has a size of approximately nine gigabytes. The master file data contains the names and addresses of the companies. Additionally, it states in which industry they operate, which important trademarks they possess, and where most of their goods are produced. As can be seen in Illustration 1, there is often more than one row for the same company. This seems to be the case if the corresponding feature is a descriptive feature and therefore has more than one value for this company at the same time (Bol 2004, 16). For instance, this happens if a company has changed its name several times and consequently has more than one former name. In these cases, only the first row is complete, and all the other rows just contain the same “BvD ID number”, the company name, and the additional feature values. Such a file structure avoids redundancy and reduces the file size.

The finance dataset contains the actual annual reports. Every row represents exactly one report, whose date is stored in the column “Account date”. Other characteristic features are the gross profit, the number of employees and the costs of materials.

Another very important column is the already mentioned “BvD ID number”, which is unique for every company and makes it possible to merge data from several csv-files. If, for instance, the user requires the industry code for a given annual report, the user just has to go through the master file data and look for the first row which has the same “BvD ID number” as the annual report.
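This lookup can be sketched in Python as follows. The sketch is a simplified illustration, not the extraction procedure from the appendix: the records are made up, and the column names `Industry code`, `Company name` and `Previous name` as well as the date format are assumptions; only “BvD ID number” and “Account date” are taken from the text.

```python
ID = "BvD ID number"  # key column shared by all csv-files

master_rows = [
    {ID: "DE0001", "Company name": "Alpha GmbH", "Industry code": "2511", "Previous name": "Alpha OHG"},
    # Continuation row: same ID and name, only the additional feature value is filled.
    {ID: "DE0001", "Company name": "Alpha GmbH", "Industry code": "", "Previous name": "Alpha KG"},
    {ID: "DE0002", "Company name": "Beta AG", "Industry code": "4711", "Previous name": ""},
]

reports = [
    {ID: "DE0001", "Account date": "2009-12-31"},
    {ID: "DE0002", "Account date": "2009-12-31"},
    {ID: "DE0003", "Account date": "2009-12-31"},  # no master record available
]


def first_row_per_company(rows):
    """Index the master data by ID, keeping only the first (complete) row."""
    index = {}
    for row in rows:
        index.setdefault(row[ID], row)
    return index


def with_industry_code(report_rows, master_index):
    """Attach the industry code to every report; None if no master row exists."""
    merged = []
    for report in report_rows:
        master = master_index.get(report[ID])
        code = master["Industry code"] if master else None
        merged.append({**report, "Industry code": code})
    return merged


merged = with_industry_code(reports, first_row_per_company(master_rows))
print(merged)
```

Building the index once and reusing it for every report avoids scanning the master file data again for each annual report.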

3.2 Data clean-up

As in most databases, the data from Amadeus has to be manipulated and some records have to be excluded before it can be analysed. This section presents the manipulations which are carried out to enable data analysis. Further manipulations which are related to key figures are mentioned in chapter 2. Because the data used is distributed over two database tables, it has to be merged. The necessary steps are described in the appendix.

First of all, only German companies are considered because of the setting of the task, which means that all other companies are excluded. Besides that, only annual reports from the years 2007, 2008, 2009, 2010 and 2011 are considered. There are more recent reports in the dataset, too, but far fewer than for the mentioned five years. To ensure a certain representativeness of the results, older data is accepted.

Furthermore, only those annual reports are considered which cover exactly twelve months. A few reports in the database cover a different number of months. Such reports are not comparable to those which cover exactly one year. It also does not appear sensible to multiply the key figures by a factor which compensates for a different number of months. The reason is that the underlying assumption that costs and earnings stay the same every month is in most cases not true; for instance, a tourist hotel in a ski region usually earns more money and also has higher costs during winter.

Additionally, no consolidated companies are extracted because annual reports of corporate groups and of individual firms have completely different purposes and it is not sensible to consider them simultaneously (Vorstius 2004, 26-27). In this thesis, only annual reports of firms are considered.

Besides that, the accounting practice has to be “Local GAAP” and not “IFRS”. Comparing firms which use different accounting practices can lead to wrong results because they often calculate the same key figures using different rules (Lembke 2007, 6-7). Because over 99.8 percent of all reports are based on “Local GAAP”, this accounting practice is selected, and all reports which are based on “IFRS” are excluded.

Annual reports which do not contain any key figure values at all are not part of the final dataset either. For the actual analysis, companies which do not have a lucrativeness value for the prediction year are excluded as well.
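The exclusion criteria of this section can be summarised as a single filter. The following Python sketch is a simplified illustration with made-up records; all field names (`country`, `year`, `months`, `consolidated`, `accounting_practice`, `key_figures`) are assumptions and do not correspond to the actual Amadeus column names.

```python
ALLOWED_YEARS = {2007, 2008, 2009, 2010, 2011}


def keep_report(report):
    """Return True if a report passes all clean-up criteria of this section."""
    return (
        report["country"] == "DE"                           # German companies only
        and report["year"] in ALLOWED_YEARS                 # years 2007 to 2011
        and report["months"] == 12                          # exactly twelve months
        and not report["consolidated"]                      # no consolidated reports
        and report["accounting_practice"] == "Local GAAP"   # no IFRS reports
        and any(v is not None for v in report["key_figures"].values())
    )


base = {"country": "DE", "year": 2009, "months": 12, "consolidated": False,
        "accounting_practice": "Local GAAP", "key_figures": {"gross_profit": 1.5}}

reports = [
    base,
    {**base, "months": 9},                           # short financial year
    {**base, "accounting_practice": "IFRS"},         # different accounting rules
    {**base, "key_figures": {"gross_profit": None}}, # no key figure values at all
]

cleaned = [r for r in reports if keep_report(r)]
print(len(cleaned))
```

Reports with some missing key figures pass this filter as long as at least one value is present, which matches the statement that only reports without any key figure values are dropped.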


[1] Kumar and Ravi have shown in their review that the majority of bankruptcy prediction studies do not analyse more than 9000 firms simultaneously (2007, 6-7).

University of Duisburg-Essen  (Wirtschaftswissenschaften)
Data Mining, Big Data, Classification Tree, Random Forest, balance sheet, Classification
Quote paper
B. Sc. Jurij Weinblat (Author), 2014, Mining big annual statement datasets to predict highly lucrative companies using classification trees and forests, Munich, GRIN Verlag,

