This thesis predicts whether a given firm will grow extraordinarily in the next year and perhaps even become a large company in the medium term. This is crucial information for private investors and fund managers who need to decide whether to invest in a certain firm. Companies like Apple and Amazon have shown that people who recognized the potential of such companies early and bought their shares earned a lot of money.
The prediction models described in this paper can also be used by politicians to identify companies that are eligible for funding. Because growing companies often hire many employees, it may be worthwhile to support their development through selective subsidies in order to reduce unemployment. Furthermore, a financial analyst's prediction can be questioned if it differs from the conclusion of a model.
Since annual reports are often publicly available for free, it is reasonable to take advantage of them for such a prediction. Additionally, various information providers maintain huge databases of annual reports. A big data approach promises to further improve the accuracy of predictions. This paper introduces methods that make it possible to extract knowledge from these huge data sources in order to identify extraordinarily lucrative firms.
To generate these prediction models, a data mining approach is used that is based on the established CRISP-DM process model for data mining. CRISP-DM ensures comparability and the consideration of best practices. The prediction models are based on classification trees and forests because they have some very substantial advantages over other methods such as neural networks, which are frequently used in the literature. For instance, the underlying algorithms do not require a certain distributional assumption, accept both quantitative and qualitative inputs, and are not sensitive to outliers. Two advantages are the most important, however. First, a tree can be easily interpreted by users, which matters for the previously described stakeholders because it is hard to trust the results of a model that one does not understand; a lack of understanding might therefore impede the practical implementation of such a model. Second, the algorithms used can handle missing data, which occur very often in the available dataset. In other analyses, such data entries would have been removed even if only a single value was missing.

Excerpt

1. INTRODUCTION AND PROBLEM DESCRIPTION

1.1 INTENTION OF THIS THESIS

1.2 PROCEEDING

2. INTRODUCTION TO KEY FIGURE ANALYSIS

2.1 THE PRINCIPLE OF KEY FIGURES

2.2 THE CLASSICAL KEY FIGURE ANALYSIS APPROACH

2.3 MODERN KEY FIGURE ANALYSIS APPROACHES

2.4 LIMITATIONS OF ANNUAL REPORT ANALYSIS

3. THE AVAILABLE DATASET

3.1 DESCRIPTION OF THE DATASET

3.2 DATA CLEAN-UP

4. KEY FIGURE SELECTION

4.1 SIGNIFICANT KEY FIGURE REQUIREMENTS

4.2 THE SELECTED KEY FIGURES OF THIS ANALYSIS

4.2.1 Selected class variable

4.2.2 Selected qualitative key figures

4.2.3 Selected absolute key figures

4.2.4 Selected relative key figures

4.3 CLASS ANALYSIS

5. CLASSIFICATION TREES AND FORESTS

5.1 PRECONSIDERATIONS

5.2 CLASSIFICATION TREES

5.2.1 A simple example

5.2.2 Generation of classification trees

5.2.3 Pruning an existing tree

5.2.4 Relevant properties of CART trees

5.3 RANDOM FOREST

5.3.1 Classification process of a random forest

5.3.2 Generation of random forest

5.3.3 Relevant properties of random forests

6. CLASSIFICATION RESULTS

6.1 CLASSIFICATION TREE RESULTS

6.1.1 Examination of the most precise tree

6.1.2 Key indicator importance ranking

6.1.3 Transfer to data from 2011

6.2 CLASSIFICATION FOREST RESULTS

6.2.1 Transfer to data from 2011

6.2.2 Key indicator importance ranking

7. CONCLUSION

7.1 CRITICAL ASSESSMENT

7.2 OUTLOOK

Research Objectives & Core Topics

The primary objective of this thesis is to evaluate whether stakeholders can utilize classification trees and random forests to predict exceptionally growing German firms at the beginning of a calendar year, based on annual report key figures from previous years. The research addresses the challenge of analyzing large, complex datasets by implementing a data mining approach based on the CRISP-DM reference model.

Predicting corporate lucrativeness using classification trees and forests.
Comparative analysis of different data mining models on a large-scale financial dataset.
Evaluation of classification performance based on key figures from annual reports.
Investigation of methodologies to handle imbalanced datasets and missing data.
Assessment of the importance of specific financial indicators for predicting growth.
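
One simple way to deal with an imbalanced class variable is random undersampling of the majority class. The following is a minimal sketch under stated assumptions, not necessarily the method used in the thesis: the field names `id` and `high_growth`, the function `undersample`, and the toy data are all hypothetical.

```python
import random

def undersample(rows, label_key, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Draw n_min rows without replacement from each class.
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 5 high-growth firms among 100 firms.
firms = [{"id": i, "high_growth": int(i < 5)} for i in range(100)]
balanced = undersample(firms, "high_growth")
```

Undersampling discards information from the majority class, so alternatives such as class weights or oversampling may be preferable when rare-class examples are very few.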

Excerpt from the Book

2.4 Limitations of annual report analysis

At the end of this chapter, it is important to point out some general aspects of analysing annual statement data, because these aspects directly influence the quality of the created model.

First of all, annual reports were not originally designed to serve as a foundation for predicting growth. Rather, they concern the past by telling how wealthy the company is and why its assets have changed. This means that the annual report is diverted from its intended use (Franken 2007, 3).

Another problem, especially in the context of small and medium-sized companies, is that their success strongly depends on the manager of the company. Unfortunately, most datasets in use do not contain any information such as the age, gender and education of this person (Anders and Szczesny 1999, 1-2).

Furthermore, there is often no information about the enterprise’s strategic goals, its capability to be innovative, the professionalism of the manager and his staff, and the customer focus. All these aspects influence whether a company is going to be successful, but they cannot be used because they are either not available at all or very hard to operationalize and, therefore, require controversial generalisations (Moro and Schäfer 2004; Fritz 1993, 1; Feldo 2011, 8).

Summary of Chapters

1. INTRODUCTION AND PROBLEM DESCRIPTION: Introduces the research context, the importance of predictive models for finance, and the application of the CRISP-DM methodology for data mining.

2. INTRODUCTION TO KEY FIGURE ANALYSIS: Examines the principles, advantages, and shortcomings of traditional key figure analysis, contrasting them with modern data mining approaches.

3. THE AVAILABLE DATASET: Details the structure and content of the "Amadeus" database used for the analysis, including necessary steps for data cleaning and preparation.

4. KEY FIGURE SELECTION: Discusses the criteria for selecting meaningful financial indicators and defines the class variables used for identifying lucrative firms.

5. CLASSIFICATION TREES AND FORESTS: Provides a detailed explanation of the CART algorithm, the generation of classification trees, random forests, and methods for pruning and variable importance estimation.

6. CLASSIFICATION RESULTS: Presents the findings of the classification tasks, comparing the performance of different models on 2010 data and their predictive capability when transferred to 2011.

7. CONCLUSION: Summarizes the key insights, evaluates the effectiveness of the chosen DM approach, and provides an outlook on future potential improvements.

Keywords

Data Mining, Classification Trees, Random Forest, Financial Statements, Annual Reports, Predictive Modeling, Lucrativeness, Key Figures, CRISP-DM, Corporate Growth, German Firms, Big Data, R, CART, Model Performance

Frequently Asked Questions

What is the core focus of this research?

The research focuses on utilizing data mining techniques—specifically classification trees and random forests—to predict which companies will achieve exceptional growth based on historical financial statement data.

What are the central thematic areas?

The central themes include financial key figure analysis, classification algorithms, data processing of large financial datasets, and the practical application of the CRISP-DM methodology.

What is the primary research question?

The main question is whether a stakeholder can effectively use classification trees or random forests to predict the lucrativeness of German firms at the start of a year, using data from previous years.

Which scientific methodology is employed?

The study employs a supervised learning data mining approach, specifically using the CART algorithm and Random Forests, structured within the CRISP-DM lifecycle.
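
To illustrate how a CART-style tree chooses a split, the following minimal sketch searches one numeric key figure for the threshold that minimizes the weighted Gini impurity of the two child nodes. The function names, the key figure `equity_ratio`, and the toy data are assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split: impurity of the parent node
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        threshold = (lo + hi) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: equity ratio in percent and a high-growth flag.
equity_ratio = [5, 12, 18, 25, 40, 55]
high_growth = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(equity_ratio, high_growth)
```

A full tree applies this search to every candidate key figure at every node and recurses on the two resulting subsets; the resulting rule ("equity ratio <= threshold") is what makes a single tree easy to read.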

What topics are discussed in the main section?

The main section covers the selection and justification of financial key figures, the technical generation and pruning of classification trees, the creation of random forests, and a rigorous performance evaluation of these models on real-world datasets.

What characterizes the key terms of this study?

Key terms center around predictive analytics in a corporate finance context, focusing on quantitative metrics, model accuracy, and the ability to interpret model results for decision-makers.

Why are Random Forests used in addition to simple trees?

Random Forests are used because they often provide better classification performance and variance reduction compared to individual "weak learner" trees, though they offer less transparency.
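
The voting idea behind a forest can be sketched as follows: each tree is fitted on a bootstrap sample of the data, and the forest predicts the class that receives the most votes. This is a simplified illustration with one-split trees on a single hypothetical key figure. A real random forest additionally draws a random subset of features at each split, so with one feature this sketch reduces to plain bagging. All names and data are assumptions.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split tree on (value, label) pairs and return a predict function."""
    best = None
    for threshold in sorted({value for value, _ in sample}):
        left = [label for value, label in sample if value <= threshold]
        right = [label for value, label in sample if value > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # All values identical: fall back to the majority class.
        majority = Counter(label for _, label in sample).most_common(1)[0][0]
        return lambda value: majority
    _, cut, left_label, right_label = best
    return lambda value: left_label if value <= cut else right_label

def fit_forest(data, n_trees, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict(forest, value):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(value) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (equity ratio in percent, high-growth flag).
data = [(5, 0), (12, 0), (18, 0), (25, 1), (40, 1), (55, 1)]
forest = fit_forest(data, n_trees=25)
```

Because the final answer aggregates many trees, no single readable rule remains, which is the loss of transparency mentioned above.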

How does the research address the issue of missing data?

The research leverages the built-in capabilities of classification trees for handling missing data and discusses specific imputation methods for random forests to ensure the large "Amadeus" dataset remains usable.
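
The cost of discarding incomplete records can be illustrated with a small comparison between complete-case deletion and a simple median imputation. Median imputation is only one common option and is shown here as an assumption; it is not necessarily the imputation method discussed in the thesis. The key figure names and values are hypothetical.

```python
from statistics import median

def complete_cases(rows):
    """Keep only rows in which no key figure is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]

def impute_median(rows):
    """Replace each missing value with the median of the observed values."""
    keys = list(rows[0].keys())
    medians = {
        key: median(row[key] for row in rows if row[key] is not None)
        for key in keys
    }
    return [
        {key: (row[key] if row[key] is not None else medians[key]) for key in keys}
        for row in rows
    ]

# Hypothetical toy data with one missing value in two of four firms.
rows = [
    {"equity_ratio": 20.0, "return_on_assets": 4.0},
    {"equity_ratio": None, "return_on_assets": 6.0},
    {"equity_ratio": 30.0, "return_on_assets": None},
    {"equity_ratio": 40.0, "return_on_assets": 8.0},
]
kept = complete_cases(rows)
imputed = impute_median(rows)
```

In this toy example deletion loses half of the firms although only two of eight values are missing, which shows why methods that tolerate missing values matter for a sparse dataset.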

Excerpt out of 98 pages

Details

Title: Mining big annual statement datasets to predict highly lucrative companies using classification trees and forests
College: University of Duisburg-Essen (Economics)
Course: Master's thesis
Grade: 1,0
Author: B. Sc. Jurij Weinblat (Author)
Publication Year: 2014
Pages: 98
Catalog Number: V273792
ISBN (eBook): 9783656656258
ISBN (Book): 9783656658870
Language: English
Tags: Data Mining Big Data Classification Tree Random Forest balance sheet Classification
Product Safety: GRIN Publishing GmbH

Quote paper: B. Sc. Jurij Weinblat (Author), 2014, Mining big annual statement datasets to predict highly lucrative companies using classification trees and forests, Munich, GRIN Verlag, https://www.grin.com/document/273792
