Excerpt

## Table of Contents

1 Introduction

2 Problems of Fashion Retailing

2.1 Fashion Business

2.1.1 Characteristics of Fashion and Sports Equipment Products

2.1.2 Problems of Fashion Purchasing

2.1.3 Recent Developments in Fashion Purchasing

2.2 Research Questions and Goals of the Thesis

2.3 Outline of the Thesis

3 Forecasting

3.1 Fundamentals

3.2 Sales Forecasting

3.3 Sales Forecasting for Fashion Products

4 Regression Analysis

4.1 Fundamentals

4.2 Parametric Regression Analysis

4.2.1 Quality of the Estimated Regression Model

4.2.2 Nonlinear Regression Analysis

4.3 Nonparametric Regression Analysis

4.3.1 Binning and Local Averaging

4.3.2 Kernel Estimation

4.3.3 Local Polynomial Regression

4.3.4 Quality of the Estimated Nonparametric Regression Model

4.3.5 Nonparametric Multiple Regression Analysis

5 Data Mining

5.1 Fundamentals

5.1.1 Knowledge Discovery from Data

5.1.2 Measurement Scales

5.2 Preprocessing of Data

5.2.1 Data Cleaning

5.2.2 Data Transformation

5.2.2.1 Normalisation of Interval and Ratio Scaled Data

5.2.2.2 Normalisation of Ordinal Data

5.2.2.3 Normalisation of Alpha Variables

5.2.3 Data Reduction

5.3 Data Warehousing

5.3.1 Data Warehousing for Fashion Retailers

6 Economical Quality of Forecasts

6.1 Product Costing and Pricing

6.2 Costs of Overstocking and Understocking

6.3 Evaluating the Economical Quality of Forecasts

7 Application of Forecast

7.1 Calculating Regression Analysis by MATLAB

7.2 Preliminary Examination of Data

7.2.1 Data Description

7.2.2 Data Preprocessing after Export from Data Warehouse

7.2.3 Examinations of Sizes over Time

7.2.3.1 Examination of Sizes during the Season

7.2.3.2 Examinations of Sizes for the same Season over several Years

7.2.4 Examinations of Sizes over Stores

7.3 Forecast and Evaluation Process

7.4 Modelling

7.4.1 Univariate Approach

7.4.2 Multivariate Approaches

7.4.2.1 Surface Fitting Using a Parametric Polynomial Regression Model

7.4.2.2 Surface Fitting Using a Custom Equation Regression Model

7.4.2.3 Surface Fitting Using a Nonparametric Lowess Regression Model

7.5 Forecasting

7.5.1 Evaluation of the Forecast

7.5.1.1 Hypothesis 1: Actual Sales Data Represents the Demand of Products

7.5.1.2 Hypothesis 2: Actual Sales Data is Biased by Original Order

7.5.1.3 Conclusions on Hypotheses

7.6 Practical Challenges

7.7 Suggesting an Implementation

8 Conclusions

8.1 Results of the Thesis

8.1.1 General Results

8.1.2 Data Understanding

8.1.3 Data Modelling

8.1.4 Forecasting

8.1.5 Implementation

8.1.6 Summary of Results

8.2 Future Research

References

List of Abbreviations

Appendix A: MATLAB Surface Fitting Results

Appendix B: Forecast Results for all Products

Appendix C: Example MATLAB Code of Forecast

## List of Figures

Figure 4.1 - Kernel functions: rectangular (blue), normal (green, rescaled) and tricube (red)

Figure 5.1 - Logistic function

Figure 5.2 - Elements of a data cube: dimension, hierarchy levels, facts, measures, grain, granularity

Figure 5.3 - Sales data cube for fashion retailers described by a starnet model

Figure 5.4 - Star schema of a potential sales data cube for fashion retailers

Figure 7.1 - Sales data of winter season 2009/10: mean (red lines) and standard deviation (blue lines) of sizes over time for product sub-categories Anoraks (left), Ski Boots (centre) and Skis (right)

Figure 7.2 - Sales data over various winter seasons: distribution of sizes per year for product sub-categories Anoraks (left), Ski Boots (centre) and Skis (right) in all stores

Figure 7.3 - Sales data of winter season 2009/10: mean (red lines) and standard deviation (blue lines) of sizes over stores sorted from west to east for sub-categories Anoraks (left), Ski Boots (centre) and Skis (right)

Figure 7.4 - Forecast test process

Figure 7.5 - Model for Anoraks created by univariate approach: (left) curve fit of sizes, (centre) curve fit of stores, (right) model created by multiplication of previous curve fits

Figure 7.6 - Polynomial fit for sales data of product sub-category Anoraks

Figure 7.7 - Custom equation fit for sales data of product sub-category Anoraks

Figure 7.8 - Lowess fit for sales data of product sub-category Anoraks

Figure 7.9 - Order quantities forecast for “Anorak Alpine”

Figure 7.10 - Error matrix of forecast against sales data for “Anorak Alpine”: overstocking (yellow to red), understocking (turquoise to blue)

Figure 7.11 - Error matrix of revised forecast against sales data for “Anorak Alpine”: overstocking (yellow to red), understocking (turquoise to blue)

Figure 7.12 - Architecture of an integrated purchasing DSS

Figure 7.13 - Possible GUI prototype for order program, example of “Anorak Alpine”

## List of Tables

Table 6.1 - Simplified price calculation for retailers per product unit

Table 7.1 - Sales data sub-categories of winter season 2008/09

Table 7.2 - Sales data of individual products of winter season 2009/10

Table 7.3 - Transformation of Anoraks and Boots sizes

Table 7.4 - Classification of Ski sizes

Table 7.5 - Goodness of fit for univariate approach of product sub-category Anoraks

Table 7.6 - Goodness of fit for polynomial fit of product sub-category Anoraks

Table 7.7 - Parameters and goodness of fit of the custom equation model

Table 7.8 - Parameters and goodness of fit of lowess model

Table 7.9 - Essentials of forecasted data

Table 7.10 - Comparison of forecast to actual sales data for “Anorak Alpine”

Table 7.11 - Comparison of revised forecast to actual sales data for “Anorak Alpine”

Table 7.12 - Stock out analysis for products

## 1 Introduction

One of the most important tasks for retailers is to know the demand for the products they sell. Knowing the demand matters in two cases. First, if the retailer overestimates demand and buys too many goods, many products will be left on the shelf. This surplus leads to higher inventory costs, and at the end of the selling season the retailer must mark down prices, which lowers profit margins. Second, if the retailer underestimates demand, customers will go to another retailer to buy the desired products. As a result, the retailer faces so-called opportunity costs caused by lost sales.

However, estimating demand precisely is not simple, since a retailer must accurately anticipate the buying behaviour of consumers. Nevertheless, estimating consumer demand correctly offers a huge potential for reducing costs and hence raising the efficiency of a company. Efficiency in particular is essential for companies to stay competitive, especially during uncertain times such as the recent worldwide economic crisis.

In order to estimate consumer demand accurately, forecasting methods can be utilised. Demand forecasting methods used by retailers such as grocery stores are generally short-term forecasts, because groceries have a long product life cycle and a short delivery time. In contrast, fashion and sports equipment products have a short product life cycle, which often lasts for one selling season only. Additionally, fashion and sports equipment products are characterised by long lead times. These long lead times are often caused by long delivery times, because fashion products are mostly produced in the Far East. Another cause of long lead times is a complex development and production procedure, as for skis. In contrast to groceries, fashion and sports equipment products also have higher sales prices and slower stock turnover, which negatively influences the cost structure of these products. Therefore, the need for accurate forecasts is even higher in order to keep costs down.

Forecasting for fashion and sports equipment products is even more complex, since the nature of these products implies different sizes, colours, and cuts. For this reason, a demand forecast must also take these characteristics into account. For example, when a product is out of stock in a specific size, the same product in a different size cannot satisfy that demand.

To define the scope of this thesis, Chapter 2 presents the problems of fashion retailing in more detail, which eventually leads to the research questions. The outline of this thesis is provided at the end of Chapter 2, in Section 2.3.

## 2 Problems of Fashion Retailing

In the following sections the motivation for this thesis is presented. For this purpose, a short introduction to the fashion business is given. This introduction is followed by stating the research question as well as the outline of this thesis.

### 2.1 Fashion Business

Compared to other retail goods, fashion products have many distinctive features such as sizes, colours and cuts. Hence, for a better understanding of this thesis, the fashion business is described in the following. The first part describes general characteristics of fashion and sports equipment products. After introducing the characteristics of fashion, problems of fashion purchasing are discussed. Lastly, contemporary developments for resolving the fashion purchasing problems are presented.

#### 2.1.1 Characteristics of Fashion and Sports Equipment Products

Fashion and sports equipment products have many characteristics in common and are thus described as one. Due to these similarities, the term fashion is predominantly used in this thesis. However, almost all considerations also apply to sports equipment products.

A first characteristic of fashion products is the existence of sizes. With respect to purchasing systems this means that one fashion article in reality consists of as many articles as there are sizes, and an order quantity has to be determined for each of them. A general problem of sizes is that there is no uniform standard; on the contrary, sizes vary depending on the supplier. For clothing, many different size measurement tables exist, for example UK, US and German sizes. Even if two suppliers use the same size measurement table, products with the same nominal size but from different suppliers often differ in their actual dimensions. For sports equipment products, standardisation of sizes is even more difficult, because each product category, such as skis or bicycles, has its own size measurement table.

Another characteristic of fashion products is that one product comes in different colours. Colours mostly change for a new season due to, e.g., international fashion presentations. In addition to size and colour, fashion can also have different cuts. The most evident examples are different cuts for ladies’ and men’s fashion. However, other differences in cut are also common, for example a normal neck versus a v-neck cut of a t-shirt.

An important characteristic of fashion products is seasonality. For fashion products seasons can be classified simply into summer and winter products, or into the four seasons of a year, or even into more seasons. The chosen classification of seasons depends heavily on the product assortment of the retailer. Fashion and sports equipment retailers use mostly four to six seasons for classification of their products.

Lead times of fashion products are generally quite long. This is because nowadays fashion products are mostly produced in the Far East and then transported by container ships to Europe. Typical lead times for fashion products are several months. Sports equipment products have even longer lead times, which can reach up to almost a year.

Fashion products and sports equipment have, compared to other consumer goods, a relatively high price. Hence, keeping fashion products in stock can lead to high inventory costs. This leads to higher financial requirements for the fashion retailer in order to prefinance goods in stock.

However, not all fashion products have all of the previously mentioned characteristics. There are also fashion products which show little seasonal sales behaviour. Basic fashion such as underwear or certain types of jeans has a very stable demand the whole year round. These types of products are called “never out of stock” or NOOS products. Since demand for NOOS products is quite stable, forecasting it is comparatively easy and is therefore not covered in this thesis.

#### 2.1.2 Problems of Fashion Purchasing

Based on the characteristics of fashion, this section describes general problems concerning purchasing fashion. Beforehand, a short introduction to stock-keeping is given and stock-keeping problems are discussed.

In general, when purchasing products, the purchaser has to estimate the demand for the products to be bought. If the purchaser’s estimates are poor, either too much or too little is bought. Consequently, this leads to overstocking or understocking of the product. Both scenarios should be avoided. On the one hand, overstocking leads to higher inventory costs because more capital is tied up. Additionally, at the end of the season overstocking leads to markdowns, which in turn result in lower profit margins. On the other hand, understocking results in a loss of possible sales and thus increases so-called opportunity costs. Moreover, understocking has another serious impact: the customer could buy the desired product from a different retailer. Understocking also limits the possibilities for presenting products adequately. For all these reasons, understocking is considered even worse than overstocking (cf. Angerer 2005, Ch.1.2). In conclusion, high-quality demand forecasts decrease costs and increase efficiency for the retailer.

Because fashion products have long lead times, demand forecasts have to be made for a rather long period of time as well. However, a general rule concerning forecasts is: the further a forecast reaches into the future, the less reliable it will be. This is because the number of unforeseeable factors rises with the forecast horizon.

Another problem is the correct estimation of the size distribution of fashion products. It is perhaps obvious that more products with sizes in the centre of a size table will be sold than products with sizes at its boundaries. However, the precise distribution of sizes is not known, and purchasers often rely on their past experience. This experience depends on the purchaser and is therefore mostly a subjective estimate. Another problem concerning size distribution is whether the same distribution can be applied to all stores or whether the distribution differs between stores. A further problem is the previously mentioned lack of a uniform size standard. Thus, purchasers tend to make different estimates for the different size measurement tables of suppliers.

The demand for fashion products depends, moreover, on additional factors such as the weather or the economic situation. For example, at the beginning of the winter season, sales of winter products such as ski equipment depend on whether snow remains on the ground or not. Furthermore, since fashion and sports equipment are luxury goods rather than necessities such as groceries, economic factors will also influence demand.

However, the dependence of demand on economic factors is very difficult to predict since, e.g., an economic crisis could also have positive effects on the sports equipment market. For example, people may stay at home for their holidays and take up sports such as cycling or hiking rather than booking a flight. Generally speaking, either the additional influencing factors, such as the weather, are themselves very difficult to predict, or the way in which demand depends on them is not known in detail.
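The distinction between overstocking and understocking can be illustrated with a minimal sketch. The function name, the sizes and all quantities below are hypothetical and chosen for illustration only; they are not taken from the data used in this thesis.

```python
def stock_deviation(ordered, demanded):
    """Return (overstock, understock) in units for one product and size."""
    overstock = max(ordered - demanded, 0)    # units left on the shelf
    understock = max(demanded - ordered, 0)   # units of demand that could not be served
    return overstock, understock


# Hypothetical order quantities and demand per size (illustrative values only).
ordered = {"S": 10, "M": 25, "L": 20}
demanded = {"S": 14, "M": 22, "L": 20}

deviation = {size: stock_deviation(ordered[size], demanded[size]) for size in ordered}
total_over = sum(over for over, _ in deviation.values())
total_under = sum(under for _, under in deviation.values())
```

In this example the surplus in size M does not compensate for the shortage in size S, which is why both deviations are counted separately per size instead of being netted against each other.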

#### 2.1.3 Recent Developments in Fashion Purchasing

To deal with the previously mentioned problems of fashion purchasing, two major developments took place in the last decades: (1) Quick Response (QR) and (2) Collaborative Planning, Forecasting, and Replenishment (CPFR) (cf. Suri 2010; Krafft and Mantrala 2010). QR is essentially a method to create very flexible supply chains. With a flexible supply chain, the retailer is able to react flexibly to the demand experienced. For example, the Spanish fashion chain Zara has shortened the lead time of garments to a mere 15 days instead of the previous several months (cf. Ferdows et al. 2004). QR can be enhanced by testing the demand in specific test stores. For example, a new fashion collection could be tested in selected outlets before the real season starts (cf. Fisher and Rajaram 2000). In contrast to QR, CPFR focuses on collaboration between vendors and retailers in the areas of planning, forecasting and replenishment.

These developments, QR and CPFR, are mainly achievements in the field of logistics and require changes to many processes of a company. Although QR was originally designed for the fashion business, it still requires products which suit its supply chain. For this reason, even the design of products has to be controlled in order to enable these “quick responses” to demand. An example of products designed for QR would be white t-shirts which are coloured at a very late stage of the supply chain, according to consumer demand. However, such flexibility requires that a retailer can control the whole supply chain from production to sales. CPFR is a set of methods which should enable such end-to-end control of the supply chain. On the other hand, Toporowski and Herrmann (2003) state that these new developments often lack (1) trust between vendors and retailers, (2) uniform standards between vendors and retailers, (3) company-internal information on the logistics processes, and (4) know-how to set up the new processes.

As already stated, QR and CPFR often cannot be applied to all fashion products of a retailer. For example, the previously discussed characteristics of certain types of fashion (e.g. sports fashion) and sports equipment do not allow such a flexible supply chain, because the production processes are still too complex and production must change for each season. An example of such products is skis, which have to be ordered almost a year in advance.

Despite these problems of fashion purchasing, there is not much scientific research on how to deal with the previously addressed assortments, where QR and CPFR are not applicable. A field related to fashion purchasing is automatic store replenishment (ASR), for which Angerer (2005) provides a comprehensive study on where and how ASR could be applied. However, ASR is probably not suitable for the fashion products described in this section because of the changing nature of fashion products. The ASR idea applied to fashion products could lead to a decision support system (DSS) for fashion purchasing (cf. Toporowski and Herrmann 2003).

### 2.2 Research Questions and Goals of the Thesis

This section states the main research question of this thesis. The main research question is then followed by sub-questions arising from it.

As discussed in the previous sections, many fashion products have specific characteristics which create drawbacks for fashion retailers. These drawbacks originate in fashion purchasing, especially in the estimation of order quantities. However, fashion retailers have much information on the sales of previous seasons. For forecasts based on historical data, regression analysis is often used (compare Chapter 3 and Chapter 4). Moreover, most fashion products do not change completely within one year. Therefore, this research aims to answer the following main research question:

Under which conditions is it possible to improve long-term forecasts of order quantities for seasonal fashion products, based on historical sales data by the use of regression analysis?

To answer the main question, the following sub-questions arise:

- Which regression models can be applied to fashion forecasting?

- How is the quality of a regression analysis determined?

- Which data is needed in order to execute the forecast and what are prerequisites of data design?

- How must the data be preprocessed in order to execute the forecast?

- How can the quality of the forecast be evaluated economically?

- How can a forecast be executed in practice?

- Does forecasting improve the cost structure in practice?

### 2.3 Outline of the Thesis

To give an answer to the previously stated research questions the thesis is structured as follows.

Chapter 3 gives an overview of forecasting techniques. After describing general forecasting issues, aspects of sales forecasting and sales forecasting for fashion products are presented.

Chapter 4 provides a few mathematical fundamentals as well as an overview of parametric and nonparametric regression analysis. For this purpose, multivariate parametric linear regression analysis is explained, followed by an elaboration of how to evaluate the quality of regression models. Next, an overview of nonlinear regression models is given. Nonparametric regression analysis is presented by describing the following methods: binning and local averaging, kernel estimation, local polynomial regression, and nonparametric multiple regression analysis.

Chapter 5 deals with data mining in general. After presenting data mining fundamentals, data preprocessing methods such as data cleaning, data transformation, and data reduction are explained. This is followed by providing principles of data warehousing which is concluded by discussing a general architecture of a fashion retailer’s data warehouse.

Chapter 6 develops a method for evaluating the economical quality of sales forecasts. To this end, fundamentals of product costing and pricing are presented first. After presenting these fundamentals, the calculation of the costs of overstocking and understocking is elaborated. Finally, practical considerations of how to evaluate the economical quality are discussed.

Chapter 7 applies regression analysis to a test data set of a fashion retailer in order to calculate forecasts. To this end, the test data is first examined, a forecast evaluation process is designed, various regression models are tested, and forecasts are calculated and finally evaluated. In addition, Chapter 7 presents practical challenges and suggestions for implementing the forecast.

Chapter 8 finally concludes this thesis regarding its results and proposes further research areas.

## 3 Forecasting

Corporate planning strongly depends on forecasting, and many authors stress the importance of accurate forecasting (e.g. Moon et al. 1998). What is more, as in this application, estimations of order quantities must be obtained by a forecast. For this reason, this chapter gives a general overview of forecasting techniques. The overview is followed by a more particular coverage of sales forecasting. After having discussed sales forecasting, this chapter is completed by specific considerations on sales forecasting for fashion products.

### 3.1 Fundamentals

Forecasts are, in general, estimations of the future. For companies estimations and hence forecasts are used for planning resources needed to create added value. In the following paragraphs a short overview of economic forecasting techniques is given.

Holden et al. (1990) divide forecasting methods into subjective and model-based methods. Subjective forecasting methods, sometimes also referred to as qualitative methods (cf. Angerer 2005, Ch.4), could be the experience of specialists or simple guesses. However, subjective methods do not follow strict processes for obtaining the forecasts and are often not reproducible. Examples of subjective methods are surveys or the Delphi method. When carrying out surveys, people are simply questioned by mail, telephone or by means of a personal interview. The respondents are asked for their expectations of the future. The difficulty of a survey is to get a significant sample of interviewees, for example potential customers. In contrast to surveys, the Delphi method is a survey of experts. The experts make their forecasts and discuss the reasons for them. Ideally, the subjective forecasts of the experts eventually lead to a common forecast. Surveys and the Delphi method are often used for new products which are entering the market. This is because in the market entry phase of a product, no historical data is available on which to base the forecast. A drawback of subjective methods is that the people making the forecast could be biased. For example, sales persons receiving a bonus on sold products might make a cautious forecast, whereas product managers could overestimate a product’s sales capability (cf. Armstrong and Green 2010).

In contrast, model-based or quantitative forecasting methods (cf. Angerer 2005, Ch.4) apply a model to produce the forecast. These models can again be divided into causal and non-causal models. Causal models, which are often also referred to as econometric models (cf. Armstrong 1999), postulate a hypothesis of how the forecast is created and what influences the forecast’s result. In contrast, non-causal models do not explain the dependencies of the forecast. These models just provide the forecast itself (Holden et al. 1990, Ch.1.2). Armstrong and Green (2010) discuss several univariate and multivariate methods of quantitative forecasting, amongst others extrapolation models, regression analysis and segmentation. Extrapolation methods are based on historical data of the variable to be forecast, which are extrapolated into the future. Extrapolation methods are used when the relationships between the variable to be forecast and the variables influencing it are unidentified, as is the case for non-causal models. For this purpose, time series analysis is often used to calculate extrapolations. Regression analysis is applied when strong causal relationships are expected and the causal variables are known. Segmentation refers to breaking down forecasting problems into sub-problems which can be forecast independently. Each of these sub-forecasts is then used for composing the overall forecast.

### 3.2 Sales Forecasting

A specific type of economic forecasting is sales forecasting. This section introduces the terminology and basic considerations of sales forecasting and presents common sales forecasting methods.

For forecasting order quantities (compare Chapter 2.2) a sales forecast can be used. This is because, ideally, the quantities ordered for a specific time period equal the sales potential in that period, so that the sales potential can be fully exploited. The term sales potential is used in this context because actual sales can be below the sales potential. One reason for sales below the sales potential is stock-outs, where the demand for the product cannot be converted into sales. Another reason is competition in real markets. For example, the price and quality of products, merchandising, or advertisements of competitors can lower the actual sales of a retailer. Conversely, a company that uses these instruments itself could attain a higher sales potential. Because of these problems, Moon et al. (1998) recommend forecasting demand rather than sales, since sales reflect only the part of demand that the company was able to supply, whereas demand refers to the actual potential a company is able to sell. In the literature, demand and sales potential of a product are sometimes used interchangeably. Strictly speaking, demand (or market demand) can be used instead of sales potential only when assuming that a company holds a 100 percent market share for a specific product (cf. Weis 1999, Ch.A.7). In competitive markets, though, the sales potential for a product sold by a company is calculated as the market demand for the product multiplied by the company’s market share for that product (Holden et al. 1990, Ch.4.3). Nevertheless, market share is not a fixed value but only a snapshot for a specific product. Moreover, market share can be shifted by the competitive actions already discussed, such as price variation or merchandising (cf. Weis 1999, Ch.A.7).

When using historical sales data to forecast sales, Armstrong (1999) suggests (1) direct extrapolation and (2) causal methods. Direct extrapolation of sales data is often used when only limited historical data is available. Furthermore, extrapolation should only be used for short-term forecasts such as inventory or production planning. In contrast to extrapolation, causal approaches use explanatory factors which influence sales. These factors, however, should be easy to forecast in order to gain a benefit from causal methods. In addition, causal methods are mostly more expensive than extrapolation. To find variables on which sales might depend, Armstrong (1999) discusses the following fields: environment, customers, marketing activities, competitors, and market share. Environmental variables depend heavily on the product to be forecast. For fashion, colour and cut trends could be used. For sports equipment, long-term weather tendencies or product trends could be used as variables, but both are very difficult to predict. Sales depend first and foremost on customers. For this reason, variables such as average income, purchasing power or regional market potential could be used as inputs of the causal model. As discussed previously, sales also depend on marketing activities, competitors and market share. Thus, examples of variables on which sales depend could be the company’s own frequency of advertising actions, the main competitor’s frequency of advertising actions, or market share figures.
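The relation between market demand, market share and sales potential described above can be written down as a small sketch. The function name and the figures are assumptions made for illustration and do not stem from the cited sources.

```python
def sales_potential(market_demand, market_share):
    """Sales potential = market demand multiplied by the company's market share."""
    if not 0.0 <= market_share <= 1.0:
        raise ValueError("market_share must be a fraction between 0 and 1")
    return market_demand * market_share


# Hypothetical figures: 2000 units of market demand and a market share of 25 percent.
potential = sales_potential(market_demand=2000, market_share=0.25)

# With a market share of 100 percent, sales potential and market demand coincide.
monopoly_potential = sales_potential(market_demand=2000, market_share=1.0)
```

The second call reflects the special case mentioned above, in which demand and sales potential can be used interchangeably.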

### 3.3 Sales Forecasting for Fashion Products

Combining the aspects of fashion products discussed in Chapter 2 and essential methods of sales forecasting in Chapter 3.2, sales forecasting for fashion products is discussed as follows.

When considering the specific characteristics of fashion, sales forecasting for fashion products turns out to be a multivariate problem. This is because, when placing orders for fashion products, it must be decided how many items in which sizes, colours and cuts are purchased. Additionally, when orders are made for a retailing chain, it has to be decided, how many items are distributed to each store, which leads to a further dimension to forecast.
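To make the number of dimensions concrete, the following sketch enumerates every combination of size, colour, cut and store for a single product. All values are hypothetical and serve only to show how quickly the number of order quantities grows.

```python
from itertools import product

# Hypothetical attribute values for one fashion product (illustrative only).
sizes = ["S", "M", "L", "XL"]
colours = ["red", "blue"]
cuts = ["ladies", "men"]
stores = ["store_1", "store_2", "store_3"]

# Each combination is one order quantity that has to be forecast.
order_cells = list(product(sizes, colours, cuts, stores))
number_of_forecasts = len(order_cells)  # 4 * 2 * 2 * 3 combinations
```

Even this small example already requires 48 separate order quantities for a single product, which illustrates why neglecting a dimension, where this is justified, simplifies the forecast considerably.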

Although there are difficulties in fashion forecasting, fashion retailers could learn from sales behaviour of similar historical products. Thus, one preliminary assumption is that similar products have similar sales characteristics. Moreover, in the fashion and sports equipment industry, this approach could be used because products do not change completely due within one year, when sales of a parallel season (e.g. compare winter season 2008/09 with winter season 2009/10) take place.

To solve a multivariate problem as stated in the previous paragraphs, it seems obvious to apply a multivariate method. However, are multivariate methods generally better than univariate methods for forecasts? Kunst (2004, Ch.4) states that multivariate forecasting methods are not necessarily better than univariate methods. The advantages of multivariate methods are that modelling dependencies is easier and a better fit to the training data set can be reached. However, since each parameter of a multivariate model has to be estimated, errors could rise the more parameters are used. Also, nonlinear dependencies are more difficult to model in a multivariate context than in an univariate context. Another disadvantage of multivariate models is that outliers in a multivariate environment are more difficult to discover. However, if an univariate or a multivariate model is better for carrying out the forecast, has to be discussed further. This decision depends eventually on whether specific dimensions can be neglected or not and cannot be answered generally at this point.

As already mentioned, a common multivariate approach for causal models is regression analysis. Han and Kamber (2006, Ch.6.11) even state that regression analysis is the most common approach for numeric prediction in the field of data mining. A numeric prediction model must be applied for sales forecasting, because the outcome of the forecasting model should be virtually continuous values. There are further methods for numeric prediction, such as k-nearest-neighbour or support vector machines, but these would go beyond the scope of this thesis. Thus, the following Chapter 4 focuses on parametric regression analysis as a causal forecasting model and nonparametric regression analysis as a non-causal forecasting model.

## 4 Regression Analysis

In statistics, regression analysis is used for the description and explanation of quantitative functional connections. Therefore, regression analysis can also be employed, as in our application, for estimations and forecasts. For this purpose, regression models describe a dependent variable Y by one or more independent variables X_{j}, 1 ≤ j ≤ h. Specifically, the conditional behaviour of the dependent variable Y, when one or more of the independent variables X_{j} are varied, should be modelled.

There are two basic approaches to regression analysis, namely parametric regression analysis and nonparametric regression analysis, which are described in the following sections. Before focusing on regression analysis, a few statistical fundamentals used in this thesis are introduced.

Parametric regression analysis is covered by statistical standard literature, specifically by literature for multivariate statistics. Thus, Section 4.2, Parametric Regression Analysis, provides an overview of Hartung and Elpelt (1999), Backhaus (2008) and Schira (2003). Section 4.3, Nonparametric Regression Analysis, is based on Fox (2000a and 2000b), Cleveland (1979), Cleveland and Devlin (1988), and Hansen (2009).

### 4.1 Fundamentals

For a common understanding of the following chapters, this section gives a basic introduction to selected statistical terms, calculations and approaches used in this thesis. For further details, Larsen and Marx (2005) provide a standard work on statistics and its applications.

In empirical statistics, measures of central tendency describe a central value of quantitative data. The arithmetic mean is the most important measure of central tendency (cf. Hartung and Elpelt 1999, Ch.I) and is, for a sample x_{1},...,x_{n}, calculated by

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}
$$

As alternatives to the arithmetic mean, the median or the mode can be used as measures of central tendency; their definitions can be found in standard literature such as Hartung and Elpelt (1999, Ch.I). Statistical dispersion describes how far an observation can deviate from the central value ([illustration not visible in this excerpt]). The most important measure of statistical dispersion is the variance σ^{2} of a sample x_{1},...,x_{n} for n ≥ 2. The variance of a sample x_{1},...,x_{n} is computed by

illustration not visible in this excerpt

The standard deviation is the square root of the variance and consequently calculated as

$$
\sigma = \sqrt{\sigma^{2}}
$$

The standard deviation is used when the dispersion is needed in the same unit of quantity as the sample. Furthermore, the coefficient of variation is adjusted by the arithmetic mean x̄ and calculated by

illustration not visible in this excerpt

The advantage of the coefficient of variation is that the dispersion of different samples can be compared even when their distributions have different means x̄ (cf. Hartung and Elpelt 1999, Ch.I).
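The measures described above can be illustrated with a minimal Python sketch. It assumes that the variance uses the divisor n − 1, which the condition n ≥ 2 suggests but which cannot be confirmed because the formula is not visible in this excerpt. The helper name `describe` and the sales figures are hypothetical.

```python
import math
import statistics

def describe(sample):
    """Return mean, variance, standard deviation and coefficient of variation.

    Assumption: the variance uses the n - 1 divisor (sample variance).
    """
    mean = statistics.fmean(sample)
    variance = statistics.variance(sample)   # divides by n - 1
    std_dev = math.sqrt(variance)
    coeff_var = std_dev / mean               # undefined for a mean of zero
    return mean, variance, std_dev, coeff_var

# Hypothetical weekly sales of one article in two stores
small_store = [2, 4, 4, 6]
large_store = [20, 40, 40, 60]

for name, sales in [("small", small_store), ("large", large_store)]:
    mean, var, sd, cv = describe(sales)
    print(f"{name}: mean={mean:.1f} var={var:.2f} sd={sd:.2f} cv={cv:.3f}")
```

In this constructed example the variances of the two stores differ by a factor of 100, while the coefficient of variation is identical, which is the kind of comparison the coefficient of variation is meant to support.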

Random experiments can be described by random variables and appropriate probability distributions. However, probability distributions are often not known and have to be estimated by repeated execution of a random experiment. For example, the observations x_{1},...,x_{n} of n independent, identically distributed random variables X_{1},...,X_{n} are used to estimate the parameters of the distribution of X. Therefore, an estimated function θ̂ should be identified which fits the real parameter θ as well as possible. The quality of the estimation can be determined by the mean squared error (MSE)

$$
\mathrm{MSE}(\hat{\theta}) = E\left[ (\hat{\theta} - \theta)^{2} \right]
$$

where E[(θ̂ − θ)^{2}] is the expected value of the squared difference between the estimated parameter and the real parameter describing the random variables. Moreover, the MSE is the sum of the variance and the squared bias of the estimator θ̂, where the bias equals the difference between the expected value of θ̂ and the true value of the parameter being estimated. The root mean square error (RMSE) is the square root of the MSE and is, like the standard deviation, in the same unit of quantity as the random variable. (cf. Hartung and Elpelt 1999, Ch.I)
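The decomposition of the MSE into variance and squared bias can be checked numerically. The following sketch is an illustration under assumed conditions, not part of the thesis: it repeatedly estimates the variance of a normal distribution with the estimator that divides by n, which is known to be biased, and compares the empirical MSE with the sum of empirical variance and squared bias. The function name `mse_decomposition`, the seed and all numerical settings are hypothetical.

```python
import random
import statistics

def mse_decomposition(estimates, theta):
    """Return (mse, variance, bias) of a list of estimates of the true value theta."""
    mse = statistics.fmean([(t - theta) ** 2 for t in estimates])
    variance = statistics.pvariance(estimates)   # divides by the number of estimates
    bias = statistics.fmean(estimates) - theta
    return mse, variance, bias

random.seed(0)
theta = 4.0   # true variance of a normal distribution with sigma = 2
n = 5         # sample size per repetition

# Repeat the random experiment and estimate theta each time with the
# estimator that divides by n instead of n - 1.
estimates = []
for _ in range(2000):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    estimates.append(statistics.pvariance(sample))

mse, variance, bias = mse_decomposition(estimates, theta)
rmse = mse ** 0.5
print(f"MSE={mse:.3f}  variance + bias^2={variance + bias ** 2:.3f}  RMSE={rmse:.3f}")
```

For the empirical quantities computed here, the identity MSE = variance + bias² holds up to floating-point rounding, and the negative bias reflects that dividing by n underestimates the variance on average.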

Often, the quality of statistical estimations is measured by statistical significance. A statistical result is statistically significant when it is unlikely that this result has occurred by chance. Statistical significance is tested by formulating hypotheses and calculating how likely the observed result would be if the null hypothesis were true. For statistical hypothesis testing, specific test distributions are used. For testing the statistical significance of regression analysis, mainly the F-distribution and the Student-t-distribution are used. The values of the F-distribution and the Student-t-distribution for different degrees of freedom are available as built-in functions in most mathematical software packages or can be retrieved from tables. Degrees of freedom in statistics are in general the number of parameters which are free to vary. (cf. Hartung and Elpelt 1999, Ch.I)

### 4.2 Parametric Regression Analysis

Parametric regression analysis tries to find a model in the form of a function that describes the connections between independent and dependent variables. Besides the independent variables, these functions also include parameters. This section covers linear parametric regression analysis, the examination of the quality of an estimated model and nonlinear parametric regression models. Parametric regression analysis is the classical form of regression analysis and is therefore denoted simply as regression analysis within this section.

The simplest form of a parametric regression model is the regression line

$$
Y = \beta_{0} + \beta_{1} X_{1}
$$

In the case of a simple regression line there is only one independent variable X_{1} which describes the dependent variable Y . The parameters β_{0},β_{1} of the regression model are also referred to as regression coefficients and are unknown. The regression coefficients must be estimated, which will be dealt with later. (cf. Hartung and Elpelt 1999, Ch.II)

If there is more than one independent variable X_{j} to describe the dependent variable Y, multiple regression analysis is used. The basic form of multiple linear regression analysis is described by the regression function

$$
Y = \beta_{0} + \beta_{1} X_{1} + \dots + \beta_{h} X_{h}
$$

An observation of independent and dependent variables x_{i1},...,x_{ih}, y_{i}, 1 ≤ i ≤ n, with n samples (n > h + 1) is henceforth denoted by lowercase characters and is also referred to as model realisation or training data set. When considering a realisation of the regression model, an error variable e_{i} is added to the formula, which leads to

$$
y_{i} = \beta_{0} + \beta_{1} x_{i1} + \dots + \beta_{h} x_{ih} + e_{i}
$$

The error variables e_{i} are assumed to be random and independent of each other with an expected value of zero; hence the expected value of the random variable, also denoted as y_{i}, is

$$
E(y_{i}) = \beta_{0} + \beta_{1} x_{i1} + \dots + \beta_{h} x_{ih}
$$

The regression model can as well be denoted in matrix form where h is the number of independent variables and n the number of samples. The variables in (4.8) are set to

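$$
\mathbf{y} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{n} \end{pmatrix}, \quad
\mathbf{x} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1h} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nh} \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{h} \end{pmatrix}, \quad
\mathbf{e} = \begin{pmatrix} e_{1} \\ \vdots \\ e_{n} \end{pmatrix}
$$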

The matrix x consists of h + 1 columns: one column for each independent variable X_{j} and, additionally, a first column for the coefficient β_{0} that consists of ones only and is hence called a dummy variable. The matrix notation of the multiple regression model is then

$$
\mathbf{y} = \mathbf{x} \boldsymbol{\beta} + \mathbf{e}
$$

where matrices and vectors will be represented by variables in bold characters. (cf. Hartung and Elpelt 1999, Ch.II)

For the estimation of the regression coefficients β_{0}, β_{1},...,β_{h}, or the coefficient matrix β respectively, the method of least squares is mostly used. The method of least squares refers to a minimum deviation of the estimated model

$$
\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{i1} + \dots + \hat{\beta}_{h} x_{ih}
$$

to the actual data points y_{i} of the training data set. The difference between the estimated model and the actual data is referred to as residual

$$
\hat{e}_{i} = y_{i} - \hat{y}_{i}
$$

The method of least squares implies that for an optimised estimated model the sum of the squares of the residuals (also referred to as sum of squared errors or SSE) should be minimal,

$$
\mathrm{SSE} = \sum_{i=1}^{n} \hat{e}_{i}^{2} = \hat{\mathbf{e}}^{T} \hat{\mathbf{e}} \rightarrow \min
$$

where ê^{T} is the transpose of ê. After differentiation, this leads to the estimation of β by

$$
\hat{\boldsymbol{\beta}} = (\mathbf{x}^{T} \mathbf{x})^{-1} \mathbf{x}^{T} \mathbf{y}
$$

With this estimation, the estimations ŷ of the dependent variable y and of the residuals ê = y − ŷ can also be calculated. (cf. Hartung and Elpelt 1999, Ch.II)
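The least squares estimation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions and not the implementation used in the thesis: the function names are hypothetical, the data are constructed without noise, and the normal equations (xᵀx)β = xᵀy are solved by elimination instead of forming the inverse explicitly, which is algebraically equivalent.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a z = b for a square matrix a (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("x^T x is singular: columns of x are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(observations, y):
    """Estimate beta for y = x beta + e; a column of ones is prepended for beta_0."""
    x = [[1.0] + list(row) for row in observations]
    xt = transpose(x)
    xtx = matmul(xt, x)
    xty = [sum(p * q for p, q in zip(col, y)) for col in xt]
    beta = solve(xtx, xty)
    y_hat = [sum(b * v for b, v in zip(beta, row)) for row in x]
    residuals = [yi - yh for yi, yh in zip(y, y_hat)]
    return beta, y_hat, residuals

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2 (n = 5, h = 2)
observations = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9, 8, 19, 18, 26]

beta, y_hat, residuals = least_squares(observations, y)
print("beta:", [round(b, 6) for b in beta])
print("SSE :", round(sum(e * e for e in residuals), 12))
```

Because the example data contain no error term, the estimated coefficients should reproduce the generating coefficients up to rounding and the SSE should be close to zero; with real sales data the residuals would not vanish.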

#### 4.2.1 Quality of the Estimated Regression Model

Examining the quality of a regression model is basically separated into examining the goodness of fit on the one hand and testing statistical significance on the other hand. The goodness of fit corresponds to how well the model fits the training data set. For this purpose, mainly the residuals are examined, and the fit is expressed by the coefficient of determination, the variance and the standard error of the estimation. Statistical significance of the regression model, however, indicates whether the model is valid beyond the training data set. For this reason, the number of samples is also considered in significance testing. Statistical significance for regression models can be tested with the F-test for the whole model and with the t-test for individual regression coefficients.

For examining the goodness of fit of a regression model, the residuals of an estimated model can be analysed. The most common analysis of residuals is accomplished by so-called residual plots. For example, a basic residual plot can be produced by plotting y_{i} against ŷ_{i} on a two-dimensional coordinate plane. The data points should then scatter randomly around the bisecting line. Residual plots are used to verify specific assumptions made when using a linear regression model. The most important assumptions for the linear regression model are that the error variables e are random, normally distributed, independent of each other and have an expected value of zero. A complete set of assumptions for linear regression models is defined in Backhaus (2008, Ch.1.2). More forms of residual plots can be found in Hartung and Elpelt (1999, Ch.II.1).
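The data behind such a basic residual plot can be prepared as in the following sketch. It is a hypothetical example with invented training data: it fits a simple regression line by least squares and lists the pairs of fitted and observed values that would be plotted. Note that the numerical checks only reflect algebraic properties of a least squares fit with an intercept; they do not replace a visual inspection and do not test normality or independence of the errors.

```python
import statistics

def fit_line(x, y):
    """Least squares estimates (b0, b1) of the regression line y = b0 + b1 * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical training data
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# The pairs (fitted, observed) are the points of the basic residual plot;
# points on the bisecting line would have a residual of zero.
for fi, yi, ei in zip(fitted, y, residuals):
    print(f"fitted={fi:6.2f}  observed={yi:6.2f}  residual={ei:+.2f}")

print(f"mean residual = {statistics.fmean(residuals):.2e}")
```

A residual pattern that changes sign systematically along the fitted values, rather than scattering irregularly, would hint at a violated model assumption such as a nonlinear relationship.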

illustration not visible in this excerpt

**[...]**

- Quote paper
- Peter Hirschbichler (Author), 2010, Order Quantity Forecasting for the Fashion Industry, Munich, GRIN Verlag, https://www.grin.com/document/164751
