Cornerstones of Research
Political science has been exploring social media for several years now, and with increasing intensity. Platforms like Twitter, Facebook, and Google+ offer many possibilities and large amounts of data, often referred to as 'Big Data', which can be used to measure public opinion. Many scientific studies have already used social media to forecast trends such as unemployment rates (Choi & Varian, 2009), car sales (Choi & Varian, 2012), presidential polls (O’Connor et al. 2010), or stock prices (Arias et al. 2013; Bollen et al. 2011; Wolfram 2010). A topic of particular relevance and interest for political scientists, however, is election forecasting with social media, which is the focus of this paper. The traditional approach to forecasting elections is based on mass surveys conducted as opinion polls. This approach is both time-consuming and expensive, and it necessarily captures only a limited cross section of a dynamic phenomenon. Against this background it is not surprising that many scientists have high hopes for social media as a cost-saving and easily accessible alternative to traditional opinion polling.
One of the first and most prominent attempts at election forecasting with social media was made by Tumasjan, Sprenger, Sandner, and Welpe (2010) and used Twitter, a microblogging platform on which 500 million tweets are posted daily (Sloan 2015). Tweets are messages with a maximum of 140 characters (Savage, 2011). The results were very promising: based on the German federal election in 2009, the authors concluded that the volume of tweets referencing a party or candidate name reflects the actual vote share in the election. Their method was fairly simple and yielded lower mean absolute errors than traditional opinion polling, which, according to the authors, could quite possibly have made traditional polls unnecessary in the future (Tumasjan et al. 2010). Only one year later, however, a response paper by Jungherr et al. (2011) shattered these high hopes, exposing the arbitrary choice of time frame and parties.
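The volume-based logic described above can be illustrated with a minimal sketch. The party labels, mention counts, and vote shares below are invented placeholders, not data from Tumasjan et al. (2010), and the function names are my own. The sketch only shows how tweet-volume shares and a mean absolute error against actual vote shares could be computed under these assumptions.

```python
def volume_shares(mention_counts):
    """Convert raw mention counts per party into percentage shares."""
    total = sum(mention_counts.values())
    if total == 0:
        raise ValueError("no mentions counted")
    return {party: 100.0 * n / total for party, n in mention_counts.items()}


def mean_absolute_error(predicted, actual):
    """Average absolute gap, in percentage points, over the parties in `actual`."""
    return sum(abs(predicted[p] - actual[p]) for p in actual) / len(actual)


# Hypothetical mention counts and election results for three made-up parties.
mentions = {"Party A": 500, "Party B": 300, "Party C": 200}
election = {"Party A": 45.0, "Party B": 35.0, "Party C": 20.0}

shares = volume_shares(mentions)
mae = mean_absolute_error(shares, election)
print(shares)
print(f"MAE: {mae:.2f} percentage points")
```

The same error measure can be applied to poll figures, which is how a volume-based forecast and a traditional poll can be compared on equal terms.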
In 2013, Daniel Gayo-Avello captured the state of the art in election prediction with social media by publishing a meta-analysis, the very first one in this field of study. Concluding after an extensive literature review that the “prevailing view [among scientists] is overly optimistic” (2013, 649), Gayo-Avello identifies three major problems that have to be addressed by future research: 1) the need to produce a true forecast, that is, one published before the election; 2) the need to take into account the biases on Twitter, especially the unrepresentativeness of the sample; 3) the need to incorporate sentiment rather than just tweet volume (Burnap et al. 2015).
The research question of this paper is closely related to Gayo-Avello's meta-analysis: to give an overview of the state of the art two years later, to assess whether the problems and questions raised by scientists have been addressed, and finally to answer whether or not Twitter can be used as an efficient alternative to traditional electoral forecasting.
Although two years might not be a lot of time considering how long it takes for a study to be conducted and published, the number of scientists contributing to this fairly new field of research is very high, which makes considerable progress possible in a short time.
Necessarily, I can only highlight a selection of studies rather than shed light on all of them. In a nutshell, I will take the three demands by Gayo-Avello as a guideline to order recent studies, then give a brief insight into the current discussion in the field, and finally come to the conclusion that traditional polling and social media-based approaches do not have to be mutually exclusive, but can and should be combined in future research.
Cornerstones of Research
As mentioned before, the cornerstone studies in the field of forecasting election outcomes using social media were published in the early 2010s. Although the first attempts proved quite feeble on closer inspection, researchers recognized the possibilities and explored various methods. The attempt of Gayo-Avello (2013) to give an overview of current methods and a framework for researchers worldwide was as much needed as it was successful. His meta-analysis is referenced in many recent studies. Whether or not the three demands he formulated have been taken on board by newer research is discussed in the following chapter. I will focus on studies published in the last two years, in order to keep my literature review up to date and to avoid repeating studies already examined by Gayo-Avello (2013).
Kagan, Stevens, and Subrahmanian (2015) predicted the 2013 Pakistani and 2014 Indian elections with a dynamic model that produced daily forecasts based on tweets. In both elections they correctly predicted the winner (that is, the prime minister) well ahead of time. Furthermore, they used sentiment analysis rather than the mere tweet volume method that Tumasjan et al. (2010) employed in their study. With their approach they address two of Gayo-Avello's concerns, disregarding only the bias concern. Kagan et al. do acknowledge his point, though, and state that although the sample is indeed biased, “perhaps biases get worked out when such large numbers are considered” (2015, 4). Admittedly this is a rather sweeping statement, but in their study the bias does seem small enough not to distort the results. Their choice of countries is also noteworthy. The choice of India and Pakistan shows researchers that even in relatively poor countries, where comprehensive Internet access is not as self-evident as in Europe or the USA, Twitter still provides a solid indicator of public sentiment. Furthermore, their dynamic model could identify the most influential individuals on Twitter, which is very valuable information for an election campaign.
Coming to similar results, Franch (2013) used a dynamic ARIMA model (also known as a Box-Jenkins model) to predict vote shares for the 2010 UK General Election. Having aggregated data from Twitter, Facebook, Google, and YouTube, he claims that the results are reliable as well as exceedingly accurate. The predicted trends not only follow the traditional YouGov polling data; the average predictions are also extremely close to the real outcomes. Franch concludes that in this case the forecasts surpass the accuracy of polls and can be used as an inexpensive way to predict elections. In his own view he avoids biases by using several media rather than just Twitter, and “the ARIMA model seems to bypass such flaws” (Franch 2013, 64), meaning the unrepresentativeness of the sample. He therefore addresses and incorporates all three of Gayo-Avello's demands.
Caldarelli et al. (2014) and Ceron et al. (2014; 2015) support the optimistic view with their studies. Caldarelli and his colleagues studied the 2013 Italian parliamentary elections and introduced a relative strength (RS) parameter in order to compare the strength of two parties based on the volume of tweets. They attained good results both at the national level and at smaller levels (dividing Italy into three regions: North, Center, South) using the location information provided in user profiles. However, they overestimated the share of the two main parties and, by their own assessment, are therefore not precise enough. The RS parameter is nevertheless an important and well-functioning method, whose basis is found in the study by Borondo et al. (2012), who measured the relative support between two candidates. The extension by Caldarelli et al. (2014), using the RS parameter for parties, is an interesting step forward that could be investigated further.
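To make the idea of a volume-based relative strength measure more concrete, the following minimal sketch may help. It rests on a simplifying assumption of my own: the strength of party A relative to party B is taken to be A's share of the tweets mentioning either party. The exact definition used by Caldarelli et al. (2014) may differ, and the regions, party labels, and tweets are invented placeholders.

```python
from collections import Counter


def relative_strength(count_a, count_b):
    """Share of party A among tweets mentioning A or B (assumed definition)."""
    total = count_a + count_b
    if total == 0:
        return None  # undefined when neither party is mentioned
    return count_a / total


def regional_strength(tweets, party_a, party_b):
    """Compute the assumed relative strength separately for each region.

    `tweets` is an iterable of (region, party) pairs, where the region is
    taken from the location given in the user profile.
    """
    counts = {}
    for region, party in tweets:
        counts.setdefault(region, Counter())[party] += 1
    return {
        region: relative_strength(c[party_a], c[party_b])
        for region, c in counts.items()
    }


# Hypothetical geolocated tweets mentioning one of two made-up parties.
tweets = [
    ("North", "A"), ("North", "A"), ("North", "B"),
    ("Center", "A"), ("Center", "B"),
    ("South", "A"), ("South", "B"), ("South", "B"), ("South", "B"),
]

by_region = regional_strength(tweets, "A", "B")
print(by_region)
```

A value above 0.5 would indicate that party A dominates the conversation in that region, a value below 0.5 that party B does. Whether such a ratio translates into vote shares is exactly the empirical question the studies discussed here address.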
Ceron et al. (2014) focused on the popularity of political leaders in Italy throughout 2011 and on the 2012 presidential election in France as well as the subsequent legislative election. Ceron et al. (2015) made the interesting choice of comparing results between the 2012 US presidential election and the 2012 Italian centre-left primaries. They explained their choice by saying “we have followed the most different system design setting” (2015, 5). On the one hand, there is an election for the head of state of the USA, with 11% of American citizens being on Twitter; on the other hand, there is the election of the leader of a political coalition running in the next national election, with only 5% of Italian citizens being on Twitter. These dissimilarities provide a neat basis for finding out how consistent the applied method is. In both studies (2014; 2015) Ceron et al. apply the Hopkins and King (HK) method, a two-step supervised sentiment analysis. In the first step, human coders code a subsample; in the second step, an algorithm performs an automated statistical analysis. As a result, the method is able to capture humour and sarcasm and provides an accuracy not achieved by traditional sentiment analysis with ontological dictionaries. In both studies the authors conclude that the HK method is a promising step forward in social media-based election prediction, with proven robustness given its application in three different countries (USA, France, Italy), and that, although Twitter is acknowledged to be unrepresentative in general, the method outperforms traditional opinion poll predictions. Their own prediction is that potential biases may subside in the future because of the increase in social network usage (Ceron et al. 2015).
However, to shed some light on the other side of the coin, one has to look at the study by Burnap, Gibson, Sloan, Southern, and Williams (2015), who made a forecast for the 2015 UK General Election. Although Franch had produced very accurate results for the UK election just five years earlier, Burnap et al. were faced with several new problems. First, the Scottish National Party had greatly increased its influence since the 2010 UK General Election, becoming a serious rival for the other established parties and very difficult for scientists to predict. Second, in 2015 all traditional opinion polls were, surprisingly, far off the real election outcome, suggesting that this election was exceptionally complicated to forecast. Burnap et al. predicted a hung parliament, as did several other forecasters, for example YouGov. In reality, however, the Conservatives, led by Prime Minister David Cameron, achieved an absolute majority. Furthermore, the authors greatly underestimated the number of seats gained by the Scottish National Party, which won 56 of 59 Scottish seats. They trace this problem back to their assumption of a random distribution of all individuals, which cannot be applied in the case of a regional party. To avoid this problem all users would need to be geocoded, which reduces the total N to about 1%, making the results ungeneralizable. While Sloan (2015) is working on this problem, a well-grounded answer has yet to be found.