## Table of contents

Table of figures

List of tables

List of abbreviations

1. Introduction

2. State of research

3. Who are the punters and how do they behave

3.1. Data set

3.2. Methodology

3.2.1. Descriptive statistics

3.2.2. Significance

3.2.3. Data loading, preparation and processing

3.3. Results

3.3.1. Descriptive statistics

3.3.2. Results for the significance

3.3.2.1. Differences by gender

3.3.2.2. Differences by age

3.4. Conclusion for the analysis of the punters

4. Inefficiencies in the soccer betting market

4.1. Data set

4.2. Methodology

4.2.1. Objective and subjective probabilities

4.2.2. Calculation of the returns

4.2.3. Odd ranges

4.2.4. Significance

4.2.5. Arbitrage

4.2.6. Disaggregation

4.2.7. Data loading, preparation and processing

4.3. Results

4.3.1. Aggregated results

4.3.2. Disaggregated results

4.3.2.1. Draw

4.3.2.2. Disaggregation by years

4.3.2.3. Disaggregation by country

4.3.2.4. Arbitrage

4.4. Conclusion

5. The betting exchange betfair

Bibliography

## Table of figures

Figure 1 - % of Total Bets within each Age Range (number of bets)

Figure 2 - % of Men and Women within Age Range (number of bets)

Figure 3 - % of Bets within each Age Range (betsum)

Figure 4 - % of Men and Women within each Age Range (betsum)

Figure 5 - Comparison Age Distribution

Figure 6 - Comparison Gender Distribution

Figure 7 - Average Returns

Figure 8 - Average Returns Men and Women

Figure 9 - Average Stakes Men and Women

Figure 10 - Average Odds Men and Women

Figure 11 - Objective vs. Subjective Probabilities

Figure 12 - Average Excess Return with Corresponding Odd

Figure 13 - Average Excess Returns for odds between 1 and 2

Figure 14 - Cumulated Excess Returns for Odds between 1 and 10

Figure 15 - Cumulated Excess Returns for Odds between 1 and 1,5

Figure 16 - Cumulated Excess Returns for Odds between 5 and 7

Figure 17 - CER/Number of Games - Combinations

Figure 18 - CER/Number of Games - Combinations with Significance higher than 99%

Figure 19 - Average Excess Return for the Draw

Figure 20 - Cumulated Excess Return for the Draw with Odds until 2,5

Figure 21 - CER/number of Games for the Draw with Significance higher than 85%

Figure 22 - Cumulated Excess Returns of the Seasons 2000/2001 and 2010/2011

Figure 23 - Cumulated Excess Returns for all Seasons

Figure 24 - Returns compared over the Years for 1,12 to 1,24

Figure 25 - Max. CER with Significance higher than 85% per Seasons

Figure 26 - Cumulated Excess Returns by Country

Figure 27 - Max. CER for FLB with Significance higher than 85% per country

Figure 28 - Max. CER with Significance over 85% per Country

## List of tables

Table 1 - Scenario

Table 2 - Probabilities without bookmaker’s margin

Table 3 - Probabilities including bookmaker’s margin

Table 4 - Data set New Zealand

Table 5 - Overview of titles in data set

Table 6 - Microsoft Excel analysis (1)

Table 7 - Results New Zealand

Table 8 - Mann-Whitney test number of bets

Table 9 - Mann-Whitney test TMW

Table 10 - Mann-Whitney test potential return

Table 11 - Mann-Whitney test average wagers

Table 12 - Kolmogorov-Smirnov test number of bets

Table 13 - Kolmogorov-Smirnov test TMW

Table 14 - Kolmogorov-Smirnov test average wagers

Table 15 - Kolmogorov-Smirnov test average real returns

Table 16 - Kolmogorov-Smirnov test potential average return

Table 18 - Microsoft Excel analysis (2)

Table 19 - Microsoft Excel analysis (3)

Table 20 - Scenario for arbitrage

Table 21 - Games with low odd for the draw

Table 22 - Portfolios for odd range 1,12 to 1,24 by season

Table 23 - Favourite longshot portfolios by season

Table 24 - Most successful portfolios by season

Table 25 - Favourite longshot bias portfolios by country

Table 26 - Most successful portfolios by country

Table 27 - Arbitrage possibilities

## List of abbreviations

illustration not visible in this excerpt

## 1. Introduction

Online sports betting has been gaining in popularity worldwide during the last decade. The Internet gives punters faster access to their bookmaker and lets them easily compare the odds of many different bookmakers. Furthermore, many online gambling sites offer a wide range of sports events and specialized bets for their customers^{[1]}. Nowadays it is possible to bet not only on the final outcome of a soccer match, but also on the half-time result or the number of goals scored during the match.

The betting exchange betfair^{[2]} makes it possible to bet against a certain outcome, meaning the punter can act as a bookmaker for a chosen event.

Betting on outcomes of soccer matches is very interesting with regard to probability theory because it is based on a numerical code: 0 for a draw, 1 if the home team wins and 2 if the home team loses. Therefore many textbooks on combinatorics consider soccer bets to be a prime example of probability theory.^{[3]} H. Matthes and H. Küchenhoff claim that soccer bets are a great example of so-called subjective probability, which they define as the degree of confidence with which an observer believes in the occurrence of a certain event based on the information he currently has. P(A) would be the maximum amount in EUR that he is willing to risk if he receives exactly EUR 1 on the occurrence of A.^{[4]} For a bookmaker this would mean that his odd for a certain event will not exceed 2 if he believes that the probability of its occurrence is 50%. Hence, odds that are offered by a bookmaker on the occurrence of a certain result reflect his individual estimation of the probability of that result.

Of course, bookmakers can be wrong. If it were possible to identify a pattern in their mistakes, this would help in exploiting them. Furthermore, due to the growing competition in the sports betting market, the punter is able to choose between a large number of bookmakers if he wants to bet on a certain outcome of an event. If the variation between the odds of all bookmakers who offer bets on a certain event were large enough, the punter might have the opportunity to earn risk-free money by betting on all possible outcomes of this event and choosing the bookmaker that offers the highest odds for each individual outcome. This thesis will show that there are inefficiencies in the soccer betting markets that can be exploited by punters. It is also important to understand who the people involved in sports gambling actually are. This information will be very useful to bookmakers because it will help them to better assess their target group and therefore improve the success of their marketing campaigns.

In section 2 the state of research regarding the favorite-longshot bias as well as arbitrage on soccer matches will be presented. Section 3 will provide the reader with a detailed understanding of the characteristics of the punters. In section 4 the soccer betting market will be analyzed for inefficiencies and finally, in section 5, the betting exchange betfair will be introduced.

## 2. State of research

Since the objective of this thesis is to analyze inefficiencies in the soccer betting market and to understand who the punters actually are, this section will provide a general understanding of these topics and an overview of the current state of research. In the soccer betting market there are two major sources of inefficiency: the existence of arbitrage opportunities and the so-called favorite-longshot bias.

W. Sharpe and G. Alexander define arbitrage as “the simultaneous purchase and sale of the same, or essentially similar, security in two different markets for advantageously different prices.”^{[5]} The following example will explain what an arbitrage opportunity means for the soccer betting market. Let us assume the following situation:

illustration not visible in this excerpt

Table 1 - Scenario

This means that if the punter bets EUR 1 and anticipates the correct outcome, he would receive EUR 2,30 if Werder Bremen wins, EUR 3,20 for a draw and EUR 3 if Eintracht Frankfurt wins. According to E. F. Fama, a market is efficient if prices fully reflect all available information.^{[6]} For a soccer betting market this means that the odds of the bookmakers should reflect all the information that is available before the kick-off, such as the strength of the players, the skills of the coaches, whether important players are injured, and so forth. If we believe in the efficient market hypothesis, we can calculate the probabilities of certain outcomes by using the odds provided by the bookmakers. Without considering the bookmaker’s margin, we can calculate the probability of a given outcome with the formula 1/[corresponding odd].

illustration not visible in this excerpt

Table 2 - Probabilities without bookmaker’s margin

If we add up the reciprocal values of the odds, we get about 1,08. Thus the bookmaker’s margin is about 8%^{[7]}. In order to adjust the probabilities for the bookmaker’s margin, we have to divide each reciprocal value by the sum of all reciprocal values.

illustration not visible in this excerpt

Table 3 - Probabilities including bookmaker’s margin
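The calculation behind Tables 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, using the three odds from the scenario; the function name and the dictionary layout are assumptions and do not come from the original analysis.

```python
def implied_probabilities(odds):
    """Return raw probabilities, approximate margin and adjusted probabilities."""
    # Raw probability of an outcome: 1 / [corresponding odd].
    raw = {outcome: 1.0 / odd for outcome, odd in odds.items()}
    # The sum of the reciprocal odds exceeds 1 by roughly the bookmaker's margin.
    total = sum(raw.values())
    # Dividing each reciprocal by the sum makes the probabilities add up to 1.
    adjusted = {outcome: p / total for outcome, p in raw.items()}
    return raw, total - 1.0, adjusted


# Odds from the scenario in the text (decimal point instead of decimal comma).
odds = {"Werder Bremen": 2.30, "Draw": 3.20, "Eintracht Frankfurt": 3.00}
raw, margin, adjusted = implied_probabilities(odds)

for outcome in odds:
    print(f"{outcome}: raw {raw[outcome]:.4f}, adjusted {adjusted[outcome]:.4f}")
print(f"approximate margin: {margin:.4f}")
```

The adjusted probabilities sum to 1 by construction, which is the property that distinguishes them from the raw reciprocals.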

If we assume efficient markets and control for the bookmaker’s margin, we can see that the probability that Eintracht Frankfurt will win this game is about 30,84%. Formula (1) allows us to calculate the theoretical payout from this bookmaker:

illustration not visible in this excerpt

(1)
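Since the formula itself is not visible here, the following is a reconstruction from the verbal description in the text and not a copy of the original. The symbol $o_k$ for the odd on outcome $k$ and the upper limit $K$ for the number of possible outcomes are assumed notation.

```latex
u = \frac{1}{\sum_{k=1}^{K} \frac{1}{o_k}}
```

For the three odds of the scenario (2,30, 3,20 and 3), the sum of the reciprocals is about 1,0806, which gives a theoretical payout of about 0,925.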

The theoretical payout *u* is calculated by adding up the reciprocal values of the odds of all possible outcomes *k* and then taking the reciprocal value of this sum. If *u* is greater than 1, the punter has an arbitrage opportunity by betting on all three possible outcomes. This will, of course, not be possible if he only uses the odds of a single bookmaker. However, by combining the largest odds of different bookmakers for a certain game, he can increase *u* and eventually create arbitrage opportunities.^{[8]} A number of scientific papers try to identify intra-market and even inter-market arbitrage opportunities in the betting markets.
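A minimal sketch of this idea is given below. The three bookmakers and their quotes are invented purely for illustration and are not data from this thesis. The function names are likewise assumptions, and transaction costs are ignored.

```python
def theoretical_payout(odds):
    """Reciprocal of the summed reciprocal odds, as described for formula (1)."""
    return 1.0 / sum(1.0 / odd for odd in odds)


def best_odds(bookmakers):
    """Pick the highest odd for each outcome across all bookmakers."""
    outcomes = next(iter(bookmakers.values())).keys()
    return {o: max(quotes[o] for quotes in bookmakers.values()) for o in outcomes}


def arbitrage_stakes(odds, budget=1.0):
    """Split the budget so that every outcome pays back the same amount."""
    u = theoretical_payout(odds.values())
    return {o: budget * u / odd for o, odd in odds.items()}


# Hypothetical quotes from three invented bookmakers.
bookmakers = {
    "A": {"home": 2.30, "draw": 3.20, "away": 3.00},
    "B": {"home": 2.10, "draw": 3.60, "away": 3.40},
    "C": {"home": 2.60, "draw": 3.10, "away": 2.90},
}

combined = best_odds(bookmakers)
u = theoretical_payout(combined.values())
stakes = arbitrage_stakes(combined)

print("best odds:", combined)
print("theoretical payout:", round(u, 4))
if u > 1:
    print("arbitrage opportunity, stakes per EUR 1:", stakes)
```

The stakes are proportional to the reciprocal odds, so each outcome returns exactly *u* per EUR 1 staked in total. With a single bookmaker the same calculation yields a value below 1.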

P. Pope and D. Peel examined a data sample containing the odds from four different bookmakers for the 1981/82 English soccer season.^{[9]} M. Dixon and P. Pope were not able to identify a single arbitrage opportunity in the UK soccer betting market using data from the years 1993 to 1996 for three different bookmakers.^{[10]} N. Vlastakis, G. Dotsis and R. Markellos were able to identify 63 arbitrage opportunities in a dataset of 12841 soccer matches from 26 different countries during the years 2002-2004.^{[11]} E. Franck, E. Verbeek and S. Nüesch searched for arbitrage opportunities by using odds from bookmakers and from the well-known betting exchange www.betfair.com. They analyzed a dataset containing 5478 soccer matches played during the 2004/2005 and 2006/2007 seasons. 1450 cases were identified in which a combined bet at the bookmaker and at the exchange would create an arbitrage opportunity.^{[12]}

G. Griffith was the author of the first scientific paper that examined the favorite-longshot bias. He observed that bets on horses with small odds for winning (favorites) yield higher average returns than bets on horses with large odds for winning (longshots).^{[13]} This means that bets on favorites tend to have a smaller or even negative spread^{[14]} between the market probabilities, which are obtained from the odds, and their empirical probabilities, which are computed from the actual outcomes. If this bias is large enough (spread is smaller than zero) to exceed the bookmaker’s margin, punters are able to create positive average returns by betting on the favorites.
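The link between the spread and the average return can be made concrete with a short sketch. It assumes that the market probability is the plain reciprocal of the odd, without removing the bookmaker's margin. The numbers are hypothetical and serve only to show the sign relationship.

```python
def spread(odd, empirical_probability):
    """Market probability (1/odd, margin not removed) minus empirical probability."""
    return 1.0 / odd - empirical_probability


def average_return(odd, empirical_probability):
    """Expected net return per EUR 1 staked at the given odd."""
    return empirical_probability * odd - 1.0


# Hypothetical (odd, empirical probability) pairs, not results from the thesis.
examples = [(1.20, 0.85), (1.20, 0.80), (8.00, 0.10)]
for odd, p_emp in examples:
    print(odd, p_emp, round(spread(odd, p_emp), 4), round(average_return(odd, p_emp), 4))
```

Under this assumption the average return equals the negative spread multiplied by the odd, so a bet is profitable on average exactly when the spread is below zero.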

Reasons for the favorite-longshot bias include misinformation about probabilities and the preference for higher risk. G. Griffith suggested that there is a psychological bias that leads punters to subjectively ascribe too large probabilities to rather rare events.^{[15]}

M. Weitzman explained that if individual punters prefer risk (skewness), they are willing to accept a lower expected payoff when betting on longshots.^{[16]} Since the focus of this thesis lies on the application of inefficiencies in soccer betting markets, the roots of the favorite-longshot bias will not be discussed in detail. In their paper “The Favorite-Longshot Bias: An Overview of the Main Explanations” M. Ottaviani and P. Sørensen provide a detailed analysis of the reasons for the favorite-longshot bias.^{[17]} A. Direr analyzed a dataset containing odds quoted by 12 different bookmakers on 21 European championships over 11 years. He was able to identify a positive rate of return by using odds within the range of 1,13 to 1,27.^{[18]}

E. Feess, H. Müller and C. Schumacher were the first to analyze a dataset, consisting of more than five million bets placed in New Zealand between 2006 and 2009, that contained information on the age, gender and experience of each individual punter. They found that women bet more often on longshots than men but, because they place smaller wagers, are still more risk-averse than men.^{[19]} The same set of data will be used in this thesis, albeit with a different approach. The aim is to find out who the punters are and how they differ by gender and by age.

## 3. Who are the punters and how do they behave

The purpose of this section is to provide a general understanding of the punters’ age and gender as well as their risk-taking behavior regarding returns, wagers and odds. For more detailed information about the behavior of the punters, especially concerning the favorite-longshot bias, please refer to the paper by E. Feess, H. Müller and C. Schumacher mentioned in section 2.

### 3.1. Data set

The dataset consists of 4442229 bets which were placed with a bookmaker in New Zealand between August 2006 and February 2009. For each bet we have information about the punter’s gender and age at the time the bet was placed, the risked amount, the type of sport, the odds for the bet, and whether the bet was won or lost. Table 4 gives an overview of how bets are distributed between major types of sports.

illustration not visible in this excerpt

Table 4 - Data set New Zealand^{[20]}

More than 30% of the bets were placed on thoroughbred racing, followed by rugby. Soccer, which will be analyzed exclusively with regard to inefficiencies in the betting market in section 4, seems to be less important in New Zealand. Therefore, the analysis in section 4 will use a different dataset.

### 3.2. Methodology

This section will describe the methodology that was used to find out who the punters are and how they behave. The first part will describe the methodology for creating descriptive statistics. The second part will focus on the methodology used for testing the results.

#### 3.2.1. Descriptive statistics

In the first step we calculate the age of each individual punter at the time he or she placed the bet, taking the difference between the date of birth and the date of the bet. Punters who did not provide information about their age are not included. We then cluster the punters into age groups of 5 years.
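The age calculation and the grouping can be sketched as follows. The group boundaries are taken from the ranges that appear in the results in section 3.3.1 (18-23, 24-29 and so on, with an open-ended group from 72), where the ranges not named explicitly are inferred from the pattern. The function names are assumptions.

```python
from datetime import date

# Lower bounds of the age ranges used in section 3.3.1; the last group is open-ended.
LOWER_BOUNDS = [18, 24, 30, 36, 42, 48, 54, 60, 66, 72]


def age_at(birth, bet_day):
    """Completed years between the date of birth and the date of the bet."""
    had_birthday = (bet_day.month, bet_day.day) >= (birth.month, birth.day)
    return bet_day.year - birth.year - (0 if had_birthday else 1)


def age_group(age):
    """Return a label such as '36-41' or '72+'; None for ages below the first bound."""
    if age < LOWER_BOUNDS[0]:
        return None
    lower = max(bound for bound in LOWER_BOUNDS if bound <= age)
    if lower == LOWER_BOUNDS[-1]:
        return f"{lower}+"
    return f"{lower}-{lower + 5}"


# Hypothetical punter: born 15 June 1970, bet placed on 1 September 2008.
age = age_at(date(1970, 6, 15), date(2008, 9, 1))
print(age, age_group(age))
```

Counting completed years, instead of dividing a day difference by 365, avoids off-by-one errors around birthdays and leap years.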

In the next step we want to determine the proportion of male and female punters within every age group, measured by the number of bets. To do so, we have to identify their gender by using the “TITLE” column in our dataset. Unfortunately it is not possible to identify the gender for every “TITLE”. The following table shows an overview:

illustration not visible in this excerpt

Table 5 - Overview of titles in data set

Table 5 shows that we were able to identify the gender for the majority of the data. The remaining part of the data where we were not able to identify the gender will not be considered in our analysis.

One could argue, however, about whether the number of bets placed is the right variable to determine the proportion of men and women within every age group. To have a comparison, we also determine the proportion of men and women within every age group depending on their bet amounts. In the final step we calculate the average return within every age group and then break it down for men and women. Average wagers and average odds will also be determined and broken down by age group and gender. Table 6 gives a detailed overview of the Microsoft Excel analysis.

illustration not visible in this excerpt

Table 6 - Microsoft Excel analysis (1)^{[21]}
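The aggregation by age group and gender could look like the sketch below. It is not the Excel workbook used for the thesis. The field names, the sample bets and the return definition (total profit divided by total stake) are assumptions.

```python
from collections import defaultdict


def group_averages(bets):
    """Average stake, average odd and average return per (age_group, gender)."""
    sums = defaultdict(lambda: {"n": 0, "stake": 0.0, "odds": 0.0, "profit": 0.0})
    for bet in bets:
        s = sums[(bet["age_group"], bet["gender"])]
        s["n"] += 1
        s["stake"] += bet["stake"]
        s["odds"] += bet["odds"]
        # Assumed return definition: payout minus stake, payout is zero for lost bets.
        payout = bet["stake"] * bet["odds"] if bet["won"] else 0.0
        s["profit"] += payout - bet["stake"]
    return {
        key: {
            "bets": s["n"],
            "avg_stake": s["stake"] / s["n"],
            "avg_odds": s["odds"] / s["n"],
            "avg_return": s["profit"] / s["stake"],
        }
        for key, s in sums.items()
    }


# Hypothetical bets, not taken from the New Zealand dataset.
bets = [
    {"age_group": "36-41", "gender": "M", "stake": 10.0, "odds": 2.0, "won": True},
    {"age_group": "36-41", "gender": "M", "stake": 30.0, "odds": 4.0, "won": False},
    {"age_group": "36-41", "gender": "F", "stake": 5.0, "odds": 6.0, "won": False},
]
result = group_averages(bets)
for key, values in result.items():
    print(key, values)
```

Keeping the number of bets and the summed stakes side by side makes it easy to compare the two weightings discussed above.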

#### 3.2.2. Significance

The average odds for women, for instance, are obtained by dividing the sum of all odds played by women within each age group by the number of bets placed by women within that age group.

To test the hypotheses (for example the identification of differences between the two genders) or the deviation from a uniform distribution, the following statistical tests were performed^{[22]}:

The measured values and the percentages were tested with the Mann-Whitney test (U test) for differences in central tendency between two sub-groups (men and women). According to L. Sachs^{[23]}, percentages should not be treated like normal interval-scaled values in difference tests. Therefore the non-parametric Mann-Whitney test was applied, because it treats the values as rank numbers. In several tests, the distribution of a variable's values across the 10 age groups was compared with a theoretical uniform distribution. To test for differences between the observed distribution and the theoretical uniform distribution (uniformity of the distribution of values in the 10 age groups), the Kolmogorov-Smirnov test was used. Significance in this test means that the values are distributed significantly unevenly across the age groups.
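The two test statistics can be illustrated with plain Python. This sketch only computes the statistics themselves and no p-values. It is not the XLStat procedure used for the thesis, and the sample values are hypothetical.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: pairs in which x exceeds y, ties counted as 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def ks_statistic_uniform(counts):
    """Largest gap between the observed cumulative share and a uniform one."""
    total = sum(counts)
    n_groups = len(counts)
    running = 0
    d = 0.0
    for i, count in enumerate(counts, start=1):
        running += count
        d = max(d, abs(running / total - i / n_groups))
    return d


# Hypothetical values for two sub-groups.
u_first = mann_whitney_u([3, 4, 5], [1, 2, 3])
u_second = mann_whitney_u([1, 2, 3], [3, 4, 5])
print(u_first, u_second)

# Hypothetical counts over 10 age groups: evenly spread versus fully concentrated.
print(ks_statistic_uniform([10] * 10))
print(ks_statistic_uniform([100] + [0] * 9))
```

The two U values always add up to the product of the sample sizes, which is a quick consistency check. A Kolmogorov-Smirnov statistic of zero corresponds to a perfectly even distribution across the groups.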

#### 3.2.3. Data loading, preparation and processing

For data loading, preparation and (pre-)processing, the free Microsoft SQL Server Express Edition (SQL Server Express) was used to circumvent the size and performance limitations of Microsoft Excel and similar tools. SQL Server Express is based on Microsoft SQL Server and supports most of the features of the database engine. Its limitations (at most 1 GB of memory and a single physical CPU, although multiple cores are allowed) were not relevant for the present work.

Microsoft SQL Server is a relational database management system (RDBMS). It is an implementation of the relational model introduced by E. F. Codd^{[24]}. The relational model is based on mathematical set theory and defines eight relational operators (select, project, product, join, union, intersection, difference, division). Microsoft SQL Server offers SQL (Structured Query Language), an accepted industry standard, as its relational data access language. Relations are stored in tables. The “select” statement provides the relational operators, the “insert” statement adds one or more rows to a table, the “update” statement changes the values of one or more columns, and the “delete” statement removes one or more rows from a table.

The three source files in comma-separated values (CSV) format were imported into the database using the MS DTS (Data Transformation Services) wizard included in the SQL Server Express package. The three resulting tables were then combined into one new table. Finally, the transaction date and the date of birth were converted into true date format columns to allow proper evaluations.
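The loading steps can be mimicked with Python's built-in `sqlite3` module, which is used here only as a stand-in for SQL Server Express and the DTS wizard. The column names other than “TITLE”, the date format DD/MM/YYYY and the sample rows are assumptions made for this sketch.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Three tiny in-memory stand-ins for the CSV source files (hypothetical layout and rows).
SOURCES = [
    "TITLE,BIRTH_DATE,BET_DATE,STAKE\nMr,15/06/1970,01/09/2006,10.00\n",
    "TITLE,BIRTH_DATE,BET_DATE,STAKE\nMrs,02/01/1950,28/02/2009,5.50\n",
    "TITLE,BIRTH_DATE,BET_DATE,STAKE\nMs,30/11/1985,12/03/2008,2.00\n",
]


def to_iso(text):
    """Convert an assumed DD/MM/YYYY string to ISO format so it sorts as a date."""
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bets (title TEXT, birth_date TEXT, bet_date TEXT, stake REAL)"
)

# Import every source and combine all rows in one table, converting dates on the way.
for source in SOURCES:
    reader = csv.DictReader(io.StringIO(source))
    conn.executemany(
        "INSERT INTO bets VALUES (?, ?, ?, ?)",
        [
            (r["TITLE"], to_iso(r["BIRTH_DATE"]), to_iso(r["BET_DATE"]), float(r["STAKE"]))
            for r in reader
        ],
    )
conn.commit()

result = conn.execute(
    "SELECT title, birth_date, bet_date, stake FROM bets ORDER BY bet_date"
).fetchall()
print(result)
```

SQLite stores the converted dates as ISO text, whereas SQL Server has dedicated date column types. The point of the conversion is the same in both cases: dates stored as free text cannot be compared or subtracted reliably.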

### 3.3. Results

The results of the analysis will be presented in this section. In the first step the results for the descriptive statistics will be presented. In the second step these results will be tested for their significance.

#### 3.3.1. Descriptive statistics

Figure 1 shows the age distribution of the punters. About three-fourths of the bets were placed by punters who were between 24 and 53. Only 7,73% of the bets were placed by people who were 60 years and older. It is also worth mentioning that within the ranges 24-29, 30-35, 36-41 and 42-47, the number of bets placed was fairly similar. With 17,63% of all bets, the age group 36-41 is the largest.

illustration not visible in this excerpt

Figure 1 - % of Total Bets within each Age Range (number of bets)

The proportion of men and women within every age group comes next. Figure 2 provides an overview of the gender distribution.

illustration not visible in this excerpt

Figure 2 - % of Men and Women within Age Range (number of bets)

We can see that for every age group the majority of bets were placed by men. Within the age group of 18-23, the proportion of bets placed by men was 94,90%. This number stays fairly constant and drops only slightly across the following age groups. However, for those who are over 72 years old, only about 80% of the bets were placed by men, which is still a very high proportion but a large drop compared to the other age groups.

It seems that the share of bets placed by women rises with age, peaking in the age group above 72.

In the next step we want to analyze the age and gender distribution by considering the betting amounts instead of the number of bets placed. This will give us the opportunity to compare both results.

Figure 3 shows the age distribution based on the TMW for each range.

illustration not visible in this excerpt

Figure 3 - % of Bets within each Age Range (betsum)

Now the age distribution looks different. More than 80% of the TMW was wagered by punters between the ages of 24 and 53. With over 28%, the people between the ages of 36 and 41 risked the largest amount of money. The age groups 30-35 and 42-47 account for the second-largest and third-largest amounts of money placed in bets.

Figure 4 shows us the TMW distributed by gender and age.

illustration not visible in this excerpt

Figure 4 - % of Men and Women within each Age Range (betsum)

It is not possible to identify a clear trend in this distribution, other than a significant predominance of men in all age groups. However, it is worth mentioning that within the youngest group of punters, aged 18-23, more than 97,5% of the money was risked by men, which is the highest percentage among all age groups. Among those who are 72 years and older, more than 19,5% of the money was risked by women, which is the highest percentage for women, followed by the age group 60-65, where about 9,34% of the TMW is attributable to women.

Figures 5 and 6 compare the results for the age and gender distribution of the punters:

illustration not visible in this excerpt

Figure 5 - Comparison Age Distribution

The age distribution shows that 18- to 29-year-old punters account for about 21,5% of the total bets, but only for about 16% of the TMW. In contrast, punters between the ages of 36 and 41 place only about 17,6% of the bets but are responsible for about 28,5% of the TMW. This means that the bookmaker collects the majority of his money from middle-aged people. The frequency of betting, however, is distributed more evenly among the age groups, but just like the TMW it drops drastically for older people.

**[...]**

^{[1]} For example see www.bwin.com or www.bet365.com

^{[2]} see www.betfair.com

^{[3]} see Büchter, A.; Henn, H.-W. (2007)

^{[4]} see Mathes, H., Küchenhoff, H. et al. (2006)

^{[5]} see Sharpe, W.; Alexander, G. (1990)

^{[6]} see Fama, E.F. (1970)

^{[7]} this is just an approximation. In order to calculate the exact margin we have to use formula (1)

^{[8]} in practice, however, there are also transaction costs, which are not included in formula (1).

^{[9]} see Pope, P.; Peel, D. (1988)

^{[10]} see Dixon, M.; Pope, P. (2004)

^{[11]} see Vlastakis, N.; Dotsis, G.; Markellos, R. (2009)

^{[12]} see Franck, E.; Verbeek, E.; Nüesch, S. (2009)

^{[13]} see Griffith, G. (1949)

^{[14]} the spread is calculated by subtracting the empirical probabilities from the market probabilities.

^{[15]} see Griffith, G. (1949)

^{[16]} see Weitzman, M. (1965)

^{[17]} see Ottaviani, M.; Sørensen, P. (2007)

^{[18]} see Direr, A. (2011)

^{[19]} see Feess, E.; Müller, H.; Schumacher, C.

^{[20]} See CD/Data/Neuseeland.zip

^{[21]} See CD/Outputs/Excel/Neuseeland.xls

^{[22]} The tests were mainly performed with XLStat (excel add-in) www.xlstat.com

^{[23]} See Sachs, L. (2004): Angewandte Statistik

^{[24]} see Codd, E.F. (1990)
