Excerpt

## Contents

1 Abstract

2 Artificial Intelligence and Machine Learning

3 Machine Learning Techniques

3.1 Cross Validation

3.2 Decision Trees

3.3 k Nearest Neighbors

3.4 Support Vector Machines

3.5 Ensemble Methods

4 S&P 500 Buy or Sell Valuation

4.1 Problem Introduction

4.2 Dataframe Features

4.3 Applicable Machine Learning Classifiers

4.4 Cross Validation and Model Performance

4.5 Virtual Portfolio

4.6 Caveats and Outlook

5 Appendix

5.1 Figures

5.2 Tables

5.3 Python Code

## List of Tables

1 A simple Turing Machine Table. Adapted from Petzold (2008)

2 An Example of labeled Training Data

3 A (non-representative) Development Outline of Machine Learning Application Methods in Finance

4 Consolidated Classifier Cross Validation Accuracy Results

## List of Figures

1 An Example of a Maximum Margin separating Hyperplane. . .

2 The Cross Validation Score for the Support Vector Machine RBF Classifier on the Test Data which overfits at γ < 10_2.Created from the Valuation Data

3 The Decision Tree Golf Example. Adapted from J R Quinlan (1986)

4 A 2-Class Classification (k = 3, weights = ’distance’). Created from the Valuation Data

5 H2 represents a Maximum Margin Decision Boundary, while H1 equips narrower margins

6 A Soft Margin SVM with constraint-violating Instances. Adapted from (Ihler, 2015)

7 The Hinge Loss is increasing linearly for an Instance that violates the Margin Condition. Adapted from (Zhang, 2011) . .

8 SVC models (Pinto Moreira, 2011)

9 An Outline of Polynomial, Gaussian Radial Basis and Sigmoid Kernel functions and their shapes. Adapted from (Ihler, 2015)

10 The Bias-Variance Dilemma with proposed Ensemble Solution. Adapted from (Fortmann-Roe, 2012)

11 A Random Forest Example for 100 Trees. Adapted from (Scikit-learn.com, 2017)

12 A Gradient Boosted Regression Trees Illustration for 100 trees.

13 Choosing the optimal Machine Learning Classifier.(Scikit-learn.com, 2017)

14 A ROC Curve created with the Valuation Data

15 A Heatmap of Sigmoid Support Vector Machine at 53% Accuracy with y,C : [1000, 0.1]

16 A Contour Map displays the maximum Average Fold Accuracy including all applied Classifiers

17 The Virtual Portfolio invests according to the Machine Learning Valuation. Classifier applied: SVM RBF

18 A Heat Map for Support Vector Machine Sigmoid

19 A Heat Map for Support Vector Machine Polynomial 3

20 A Heat Map for Support Vector Machine Gaussian RBF. ...

21 A Heat Map for Adaptive Boosted Trees

22 A Heat Map for Random Forests

23 A Heat Map for k Nearest Neighbors

24 A Heat Map for Gradient Boosted Trees

25 Overall Maximum Accuracy along ζ

26 Accuracy along ζ

## 1 Abstract

This Master Thesis introduces basic concepts and methods of Machine Learning as well as applying them on a virtual buy- or sell algorithm enabled by an accurately predicting classifier. In the first section, the historical development of Machine Learning is presented. Consequently, an array of classifiers is described in detail. The third section results in a python-based Machine Learning application of seven classifiers, a cross validation and in initializing the final valuation algorithm. The optimal classifier predicts following day Open Prizes for the S&P 500 with 72% accuracy outperforming market return by nearly 150% within the span of one month.

## 2 Artificial Intelligence and Machine Learn ing

In order to conceptualize the ideas behind today’s developments in technology and for the application in the third section,it is deemed vital to introduce Machine Learning as a scientific field. To fully understand Machine Learning today, the historical context of the human endeavor to imagine complex machines requires consideration. Arthur Samuel coined the term Machine Learning as the scientific field that provides computers the ability to learn without being explicitly programmed (Samuel, 1959). Machine Learning applications directly stem from the advancements in the area of ’Artificial Intelligence’. One particular recurrent name in various scientific endeavors is Leonardo Da Vinci. This also holds true for the case for Artificial Intelligence. When some of Leonardo Da Vinci’s sketchbooks were discovered in the 1950s, Carlo Pedrettim, a professor at the University of California was stunned to find that they were the blueprint for a mechanical device with a complex core of two independent operating systems mimicking a human figure. As part of Da Vinci’s extensive study of the human body, he imagined this device that was analogly-programmable via a controller within its chest which provided power to the extremities (Moran, 2006). Pedrettim was quick to call it a robot and when Rosheim, a Nasa engineer built the machine from the blueprints in 2004, he proved that it was indeed able to walk and wave (Rosheim, 2006). The word robot derives from the Czech word robota which means forced labour. As Ivan Klima describes, Karel Capek, who was born 1890 in Northern Bohemia, suffered from Bechterew’s disease and was categorized unfit for military service. Having endured a world war in the midst of food shortages and cruel senseless carnage, he was certain that technical progress of unprecedented quality and velocity would drive civilation to a better and more comfortable life. He was shaped by the philosophical field of pragmatism to which he was introduced by Dr. Edvard Benes who would continue to become Czechoslovakia’s second president (Karel Capek, 2004). Already a household name in the Czechoslovakian intellectual scene, Capek gained international attention for a series of short stories called R.U.R. - Rossum’s Universal Robots published in 1920 as a play (Capek, 1920). The translated version of his play rapidly spread across the globe, portraying the Robots as incarnation of the technical revolution with perfect memory capabilities however without souls and feelings. This theme, coined by Capek along with other cultural influences such as Fritz Lang’s notorious 1927 movie Metropolis, was ground-breaking for the advent of computer technology and the aspiration of intelligent machines. In 1942, Isaac Asimov published a short story called Runaround (Asimov, 1942) in which he introduced three laws of behaviour-governing in robots: First, A robot may not injure a human being or, through inaction, allow a human being to come to harm. Secondly, A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law. Third, A robot must protect its own existence as long as such protection does not conflict with the First or Second Law. What Asimov was however really kickstarting was the a foundation for machine ethics, the question of whether and how to equip intelligent machines with certain fundamental laws to protect humans which has been concerning research ever since (Clark, 1994),(Murphy & Woods, 2009).

Even before these dystopian future scenarios became part of mainstream culture, now in the realm of science, Charles Babbage first described his idea of an Analytical Machine design in 1837. He desired to combine an arithmetic logic unit, a control flow and an integrated memory in such an arrangement that unknowingly for him it would become the foundation for digital computing (Bromley, 1998). Ada, Countess of Lovelace is credited to have been the first computer programmer in history. As she worked with Babbage, she recognized that a Computing Machine could solve much more complex problems than mere simple calculations. Translating such a problem to a machine in order to extract an output is what became known much later as computer algorithms and programming (Fuegi & Francis, 2003). Lovelace and Babbage and Turing paved way to the ENIAC (Electronic Numerical Integrator and Computer), introduced almost a century later in 1946 at the University of Pennsilvania. This machine was engineered to calculate artillery firing tables and thermonuclear feasibility tests for the U.S. military (Burks & Burks, 1981).

The ENIAC was, like the Analytical Machine, Turing-complete which means it performed according to a concept introduced by Alan M. Turing in his Article On Computable Numbers with an Application to the Entscheidungsproblem (Turing, 1938). Petzold (2008) explains Turing’s ground-breaking achievement as follows: The Entscheidungsproblem, the Decision Problem, was posed by mathematicians Hilbert and Ackermann (1928). They were reasoning the possibility to infallibly and in a finite time frame reveal provability for any given proposition P from a given set of arithmetical axioms (from the Greek axioma that which is thought worthy or fit or that which commends itself as evident) of first-order logic. This task can only be effectively computed by applying a set of specifications excluding inspirational leaps or intellectual insights. Discovering this specific procedure would mean that any arithmetical axiom could be mechanically proven or refuted by a machine. The specific challenge herein was to distinguish between a truly mechanical and systematic procedure from intuitive and spontaneous influences.

Turing took up this problem and proposed his solution: He designed a machine with a confined scope of operations that had the ability to carry out these axiom-testing operations if properly taught and he called it the Turing Machine. This machine consists firstly of a supposedly sufficiently long tape which is divided into squares. On theses squares, symbols can be read, written, erased and there can only exist one symbol on each square. The second component is a read and write head which can also move along the tape, on square at a time:

1. Read symbol from current square

Depending on the symbol Read, and the current State:

2 (a) Erase the symbol, or write a new one

(b) Move left or right on the tape

(c) Change to a new State

The head can therefore move along the tape, read, erase and write. In order for the head to remember what it supposed to do, the Turing Machine entails a memory which is called State. State is a finite range of predetermined values such as b, c or d. Turing refers to b as the computation starting state. Moving along the tape, the symbols are coded, meaning they are transformed from the initial string consisting of 1 and 0 into another string. After the coding is finished, the machine goes into a halting State. The answer to the problem, the computation output given is now written on the tape:

illustration not visible in this excerpt

Table 1: A simple Turing Machine Table. Adapted from Petzold (2008)

Instead of trying to provide the algorithm for the Decision Problem, Turing solved the underlying logical problem. His concept, partly based on the work of Lovelace and Babbage (whom he credited in his later work), was the basis for the first digital computers such as the ENIAC and could solve a vast range of problems.

However, as Turing found, not all problems can be solved. Even the most sophisticated computers have a fundamental flaw. He called this the Halting Problem. For this, he imagined a program with input and output with unknown commands for the head on the inside. For this machine, the output could be: ”I FOUND A SOLUTION FOR THE INPUT PROBLEM.” or ”I DO NOT have A solution” , hence does the given premise entail a certain conclusion or not? The Halting Problem describes that along the way of the machine running an unsolvable input (the head trying to fulfill its given tasks on the inside of the machine), it will trundle in a constant loop, thus it will never give an answer.

So in order to find out whether a machine will indeed halt and give an output after a finite time frame, Turing argues that it is impossible for any machine to solve the halting problem. He suggested that if a machine that could solve the halting problem existed, taking an input ’’Will You halt für this problem or not?” and providing an output ”Yes.” or ”No.”, this machine, he called it H, could consequently be transformed. For the output answer ”Yes.” , it could be made to loop indefinitely, for the answer ”No.”, it could be programmed to halt immediately. This new version, machine H+, as a set of exact rules and specifications could then be fed into itself.

By this operation, the machine which the inputs of H+ are being fed through, struggles to find a solution to the problem: ”Do I halt given the input H+?”. Herein lies the fundamental problem: If H+ halts, it will output ”Yes.” and then loop indefinitely, if H+ does not halt, it will output ”No.” and then halt. So if H+ does halt, it does not halt and conversely. This paradox shows that no possible machine, be it an elaborated algorithm or a supercomputer, can solve the halting problem (Turing, 1938).

In 1950, Turing published Computing Machinery and Intelligence (Turing, 1950) which may, according to the Encyclopedia of Machine Learning (Sam- mut & Webb, 2011), be regarded as one of the seminal works of Artificial Intelligence. He begins with refuting the concept of Thinking Machines as dangerous. Distinguishing between a machine performing tasks and intuitively and spontaneously solving problems could not be done by interpreting a statistical survey. Alternatively, he displayed the approach for The Imitation Game. Taking place in two closed-off rooms and a typewriter communication line between them, sit a man A, a woman B and an interrogator C. To C , either A or B is the woman and C’s objective is to determine which one it is. Further, A wants to coerce C to an incorrect determination and B’s goal is it to aid C in making the correct determination.

Hence, the derived behavior for A would be to try to convince C by stating that he is human, B however to answer truthfully and adding remarks like ”I AM THE WOMAN, DO NOT listen TO him.”. Turing then supersedes the model as he asks ”What happens when a machine takes the role of A?” and will C make the same amount of wrong determinations in this case? In order to trick C, a Thinking Machine would not need to look, sound or feel like a human. Additionally, a conventional computing machine, such as he described in his 1938 paper, would not suffice to solve the inherent complexity of mimicking, eventually reaching human behavior. A machine would need to become similar to a Human Computer, a device without the limits of tape supply for calculations. The Digital Computer invented by Charles Babbage, he states, would match this requirement as it carries a Store for the rules and exact specifications for the computer to obey, an Executive Unit executing the orders and a Control unit ensuring all operations are executed according to the Store.

Advancing the idea of the rudimentary State memory he proposed in 1938, the information in Store could be organized like ”Add the number stored in position 58493 to that in 34923, then return the result in the latter storage position.” which would translate to machine language 5849334923999 with 999 being the reference to this specific rule in the Store unit. He described defining such rules in Store as Programming. Turing then lists several potential objections to his idea that there could theoretically be thinking machines, falsifying each one. The Theological Objection states that god has provided humans with a soul, not however machines. Thus, no machine can think. Turing refutes that creating thinking machines would not misappropriate God’s power of creating souls more than having children does. The Head in the Sand Objection points out that thinking machines would dread humanity, we should therefore hope this will not happen. An argument which was intellectually grounded in the believe of human superiority and not convincing. The Mathematical Objection basically refers to the Halting Problem discusses before. Turing partly acknowledges this while explaining that humans, like future thinking machines, are not able to solve all logical problems. The Argument for Consciousness promotes the thought of a machine not being able to think, as it is not led by emotions and intuition. Pointing to the Imitation Game where everyone seem to think in a certain pattern causing their actions, the objection is not sufficient to refute the concept of thinking machines. The Lady Lovelace’s Objection highlights Lovelace’s memoirs in which she states that Babbage’s Analytical Engine was not supposed to carry out original behavior. This would not exclude the possibility creating a thinking machine in the future. The Argument from Continuity in the Nervous System argues that the human nervous system is in fact not a discrete-state machine but neuronal net. A machine could not mimic human behavior based on this. Turing agreed, despite responding that the trick may suffice in the Imitation Game experimental setup. The Argument from Informality of Behavior points to conflicting rules such as ’’When you see a green light, cross” and ”Do not cross when you see a red light.” which could come to execution at the same time. A human does not halt in such a situation, a machine however does. Agreeing to a certain amount, Turing shared that there are conflicting scenarios but there are also strict ones set by the laws of nature and physics. These could be replicated in a thinking machine. At last, the Argument from Extrasensory Perception suggests that humans have the ability of telepathy, clairvoyance and such. Turing agreed to that (a widely shared believe at the time), the imitation game setup would, if these concepts were proven, simply be complemented with a telepathy-proof room. Kirkpatrick and Klingner (2004) stresses that the Imitation Game still poses an apt test for computer intelligence, enough to defy today’s scientists and argues that especially Turing’s objections are still remarkably prescient to this day.

Finally, Turing outlined his own views on Learning Machines. He puts forward an analogy about the human consciousness consisting of layers like an onion, each of its own not representing the real mind. In order to eventually provide support for thinking machines, one would have to wait until the end of the century and conduct The Imitation Game. He suggested that, since the brain’s storage capacity at the highest 1015 binary digits, a machine would most likely need only 109 to successfully play The Imitation Game. In order to create a Complete System, including logical interference, he suggests emulating children’s brain development by building a Childs’ Brain and lead it through the necessary education and experience. Then, a system of exact specification for order of rules to be executed would be required. Only such a system could make the difference between a genius and a poor reasoner. When teaching the Learning Machine there would be a contrast between handling a regular computing machine as that the teacher of the Learning Machine would largely be ignorant to what happens within the mind of his pupil. For this, an entire new field of study might be inevitably important. Lastly, implementing a random element into the machines mind would increase the scope and speed of computer evolution as it effects the analogue evolution in the same manner (Turing, 1950). The Imitation Game became commonly known as the Turing Test and gave birth to the philosophy of artificial intelligence.

The birth of artificial intelligence took place in 1956 (S. Russell & Norvig, 2009). John McCarthy, who had recently received his PhD at Princeton together with Marvin Minsky organized a two-month workshop at Dartmouth College in the summer. The workshop proposal (McCarthy, Minsky, Rochester & Shannon, 1956) reads: ”We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected (group of scientists work on it together for a summer.” (p. 1). Allen Newell and Herbert Simon also participated in the workshop presented their work together with Cliff Shaw (who did not attend the event) on what they named Logic Theorist, the first real Artificial Intelligence program (Newell & Simon, 1956). Simon was a political scientist and would later receive the Nobel Prize for his theory of bounded rationality in 1978. He was working as a consultant for the Rand Corporation, a non-profit think tank aiming for interdisciplinary and quantitative research. There, Simon got the idea to apply punctuation and symbols to the already existing computing machines.

When they tested their prototype of the Logic Theory Machine on Bertrand Russel’s and Alfred North Whitehead’s Principia Mathematica (B. Russell, 1948), it successfully solved 38 of the first 52 theorems (McCorduck, 2004). During this Workshop, no real breakthroughs were achieved, however it unleashed Artificial Intelligence as a separate field of research branching from computer science. For the pioneers of this new field, generating intelligence with creativity, self-improvement and language use in par human capabilities was the aspiration which could not be realized within control theory or operations research (S. Russell & Norvig, 2009).

The first Machine Learning system to gain wide recognition was Arthur Samuel’s Checkers Player (Samuel, 1959). It was intoduced in 1955 and was the first application of Reinforcement Learning and was able to beat one of the best U.S. players of all time. (S. Russell & Norvig, 2009). Reinforcement Learning can be defined as a vast range of learning problems in which autonomous agents interact, also referred to as sequential decision-making problems with delayed reward. The goal of Reinforcement Algorithms is to learn a policy (mapping from states to actions) that will maximize the reward over time. In Reinforcement Learning, the examples used to train the machine are not labeled whether they entailed incorrect or correct behavior (Sammut & Webb, 2011).

In 1958 Paul Rosenblatt and Marvin Minsky published their idea of a Per- ceptron, an attempt to mimic biophysical principles of the human brain simulated on a IBM 704 computer. This simplest form of a Neural Network constists of a single neuron with adjustable synaptic weights and bias. It can be used to linearly separate patterns that lie on opposite sides of a hyperplane and is limited to execute this pattern distinction, or Classification with no more than two classes (Rosenblatt, 1958). Rosenblatt boasted his algorithm would be capable to walk, talk, see, write, reproduce itself and be self-aware (Olazaran, 1996). When it became clear that the IBM-powered Perceptron designed to recognize shapes and patterns held before a built-in cadmium sulfide photocell camera, was not able to perform as well as expected, the hype about Artificial Intelligence began to disintegrate. Minsky and Seymour Papert proved that a single layer neural network such as the Per- ceptron would not be able to learn an XOR function, meaning that it was able to classify stimuli it received but it lacked an internal representation of this classification in their book about Perceptrons in 1969. It could not convey in a machine language manner to the act of information perception but had to recapulate the act itself. Symbolic Representation is the technique used by humans to cope with tasks without having to reiterate them in their totality was not possible for the Perceptron (McCorduck, 2004). The pattern of researchers overconfident public addresses and perceived lack of quantifiable performance which also had the reason of severely limited computer power at the time, led to a chain reaction of agencies defunding research endeavors. The British governement and DARPA (The US Defense Advanced Research Projects Agency) were disappointed of the results after having spent tens of millions of Dollars, so they draw consequence (S. Russell & Norvig, 2009). This AI Winter lead to a stagnation of research for several years.

For the pioneers Newell and Simon who had established a close work relationship since the days of their summer in Dartmouth in 1956, the lack of funding did not make a difference. In 1976 they formulated their Symbol System, stating that “a physical symbol system has the necessary and sufficient means for general intelligent action(Allen Newell & Simon, 1975), meaning that any machine or human must apply symbols in data structure manipulation to exhibit intelligence (S. Russell & Norvig, 2009). By the end of the decade, the scientific community around Artificial Intelligence came to the conclusion that the approach of many of the preceding programs and machines of attempting to connect elementary reasoning steps to find complex solutions was not the most efficient approach. The alternative to these Weak Methods was to highly specify a program to perform a certain task in a limited domain allowing larger reasoning steps, Expert Systems (S. Russell & Norvig, 2009). The forerunners of Expert Systems were Stanford researcher and former student of Simon as well as Bruce Buchanan, a philosopher and computer scientist, and Joshua Lederberg, Nobel laureate genetist, who successfully solved the problem of detecting molecular structure from mass spectrometer information. They described their approach in a case study in 1970 (Feigenbaum, 1971). By the late 1970s, the first journal articles supported by the discoveries of program Meta-Dendral were published (Buchanan & Feigenbaum, 1978). As Expert Systems could deliver impressive results for specific tasks, commercial application was implemented swiftly by many large cooperations, e.g. the Digital Equipment Corporation (DEC), which equipped its manufacturing process with R1, a program written by John McDermott (McDermott, 1982). This machine saved DEC over 40 Million Dollars by 1984, so it was natural that large companies would invest in AI. Another breeze of spring ending the AI Winter was Japan’s ’Fifth Generation Project’. In 1981, the Japanese government announced a 10-year plan to build an intelligent machine and the US responded by establishing the research consortium Microelectronics and Computer Technology Corporation (MCC). The billions of dollars invested into expert systems, the growing market competition between the manufacturers nationally and the political hunt for prestige turned the AI industry into a flourishing industry (S. Russell & Norvig, 2009). Evolving from Rosenbaum’s Perceptron, a single-layer neural network with limited classification capabilities, the Parallel Distributed Processing System (PDP) was introduced in 1986 by David Rumeihart and James McClelland (Rumelhart & McClelland, 1986). This approach offered researchers a mathematical foundation for general usage. PDP overcame the XOR-limitation of the Perceptron, allowing the network to be far more robust and reliable in a vast range of tasks (Hornik, Stinch- combe & White, 1989). It was also applying Backpropagation, a concept first outlined by Arthur Bryson and Yu-Chi Ho in 1969 as Applied Optimal Control (Bryson & Ho, 1969). Backpropagation or backward progation of errors is used to train neural networks applying an optimization method. This iterative procedure adjusts the weight parameters according to the gradient of an error measure and is implemented by computing an error value for each output unit followed by backpropagating the error values through the network (Sammut & Webb, 2011). PDP as a connectivist approach meant a clear contrast to Simon and Newells symbolic computation. Over the time span of roughly ten years, this led to a certain isolationism within the realms of AI as well as from AI to the remaining field of computer science. At the eleventh annual conference on computational learning theory in 1998, David

McAllester reasoned that combining the several streams of formerly isolated endeavors was a desired theme in Machine Learning (McAllester & A., 1998). Combined with the fact that even mainstream expert systems became cost- ineffective with powerful desktop computer systems introduced by Compaq, Apple and IBM, the entire commercial AI industry collapsed in the early 1990s. This entailed the rise of new ideas like robotics. The introduction of personal computers brought a unprecedented incline in computing power. Moore’s Law the observation made by Gordon Moore, co-founder of Intel, states that the amount of transistors in a dense integrated circuit doubles every year (Moore, 1998) held true. In 1997, chess grand master Garry Kasparov was beaten by Deep Blue, a framework built by IBM (Mc- Corduck, 2004). Soon after computing power and application skyrocketed worldwide caused by the internet, large quantities of data were produced. David Yarowsky suggested that Artificial Intelligence, e.g. the difference between ”A plant is living thing” and ”A plant is word for a factory” , could be bootstrapped by large quantities of training data (Yarowsky, 1995). In 2005, Stanley, the autonomous car equipped with a Machine Learning program, lasers and GPS, traversed 142 miles along a desert route avoiding obstacles and collisions (Thrun et al., 2007). Since 2014, Tesla Motors applies deep neural network approach combined hardware similar to Stanley’s to enable its cars to drive autonomously (Fridman, 2017) and in 2016 Alphabet’s machine AlphaGo, running on the deep neural network algorithm Tensorflow, beat a professional Go player. Go is similar to chess but with a larger grid of 19x19 fields, it entails a much larger branching factor. In Go, the player has around 250 legal movement choices at each turn, compared to 35 moves in chess (Levinovitz, 2014).

Recapitulating 500 years, the field of Artificial Intelligence has been evolving from Da Vinci’s machine that could wave and walk and Babbage’s Analytical Engine to today’s voice recognition in smart phones. S. Russell and Norvig (2009) conclude the following: The idea of an Intelligent Agent choosing the best possible action in a situation goes back centuries with the idea that the human mind operates in some ways like a machine, operating on acquired knowledge in an internal language of symbols and applied to the challenge of which action to take. The journey to today’s state-of-the-art was accompanied by several scientific fields. First, mathematicians like Hilbert, Ackermann and Turing provided the understanding of logical certainty, understanding computation and algorithms. Economists aspired to formalize the decision-making problem to maximize the accuracy and speed of the expected outcome to the decision maker. Neuro science contributed with new discoveries about brain functionality and psychologists largely adopted the idea of brains as information-processing devices. Computer science was able to realize Moore’s Law into powerful machines that enable today’s Artificial Intelligence ventures. Control Theory, focusing on engineering machines that transform an environmental input to the best possible action (e.g. in robotics), uses mathematical tools. Artificial Intelligence came a long way from science fiction, purely-theoretic frameworks, exaggerated optimism and busts of funding and research. The recent advancements of Artificial Intelligence and the field of Machine Learning ultimately have evolved more expeditious caused by the scientific consensus on and approach to approaches and the availability of massive amounts of data.

## 3 Machine Learning Techniques

According to Mohri, Rostamizadeh and Talwalkar (2012), Machine Learning represents computational methods applying acquired knowledge to predict the occurrence of a certain premise. The acquired knowledge of the learning machine stems from electronic information, data, which has been collected and prepared for analysis. The data comes in observations, instances, comprised of a number of variables, features, which can either be labeled by humans or unlabeled. Labeled instances are data containing an indicator of correctness or desired outcome (and vice-versa) for each instance. Labeled data provides the machine information about the classification of outcomes.

illustration not visible in this excerpt

Table 2: An Example of labeled Training Data

The crucial success factors of data in prediction-making are quality and size of the set. Algorithms are designed to efficiently and accurately predict the results. Again, the quality of these algorithms is vital and can be limited into quality and space complexity and sample complexity of the data required to acquire a family of concepts. Across board, theoretical learning allows the applied algorithm to depend on the complexity of the established concept classes as well as the magnitude of the training sample. The learning algorithm in Machine Learning does so by employ data analysis and statistical approaches which are data-focused techniques unifying the fields of computer science, statistics, probability and optimization. Mohri et al. (2012) list the major classes of Machine Learning problems as Classification, Regression, Ranking, Clustering and Manifold Learning. Classification models categorize each item. The number of categories can vary depending on the task given to the machine. A category can e. g. be voting outcome, share prize development or weather. The number of categories can vary depending on the task given to the machine, speech recognition is one example that entails many categories. Regression models predict a real value for each item and are applicable in a vast range of tasks. The penalty for an incorrect cohere to the magnitude of the difference between true and predicted values as opposed to classification problems where there usually exists no notion of closeness between the various categories. Ranking models order items coherent to certain criterion. Primarily used in web search, returning the relevant results for a search query with ’Canonical Ranking’ or in natural language processing systems. Clustering divides items into homogeneous and convergent partitions. Clustering is applied to relatively large data sets in social network analysis and political science to detect Communities of people with similar traits, desires or connections.

illustration not visible in this excerpt

Figure 1: An Example of a Maximum Margin separating Hyperplane.

Manifold Learning transforms a given initial set of items into a lowerdimensional one while preserving the items properties. This is often used to perform digital image preprocessing in computer vision tasks. S. Russell and Norvig (2009) outline four types of learning derived from their specific types of feedback: In Unsupervised Learning, an agent reasons patterns in its provided input while not having been given labeled examples such as shown in table 2. Supervised Learning provides the agent with labeled input-output pairs which are basis for a function that maps from input to predicted output. Each instance in table 2 is represented as (x,y) where index is used to identify the instance and not used in the model, x consists of the ticker, beta, the trailing PE ratio as well as the DE ratio and y is the dependent target feature which can be continuous in regression modeling or discrete (also called class label) in classification. Table 2 displays data for supervised learning with three independent features x1, x2, x3, x4 as well as the dependent target feature y. x1,x2,x3,x4 E {R} and y E {’’buy”, ’’sell”}. For the training set S = (xi,yi), (x2,y2),(xs,ys), each yi comes from an unknown function y = f (x). Unlike the training data set, the test set does not contain a set of vectors y, since the labels are unknown. If for buy stocks, a low beta is a common trait in the training data, the agent will calculate its function accordingly. The goal is to estimate a function h that approximates the true function f. For this, x and y can represent any value and h delineates a hypothesis. Machine Learning generally is nothing but an exploration through the multitude of possible hypotheses with the objective to maximize the performance to predict yi future unlabeled data. For the sake of assessing accuracy of h, a test set including instances that diverge from the ones in the training set. A hypothesis h generalizes well if it accurately anticipates y for future instances. The true function f occasionally is stochastic meaning it does not act as a f (x) but a conditional probability distribution P(U|x). In the case mentioned above, the output y is a set of finite values as y E {”buy”, ’sell”}. The classification learning problem in this case is termed Boolean or binary classification, for there are only two distinct values within the set. If y was a set of numbers, the learning problem would be a regression. When the selected function h from the hypothesis space H fits the true function f perfectly, h is referred to as a consistent hypothesis. Whenever there exist several consistent hypotheses, the simplest one should be preferred for application. This principle is coined Occham’s Razor. However, often a compromise between the respective h’s ability to generalize well and its simplicity should be considered. When hconsistent E H holds, the learning problem is described as realizable. As the true function f is not known, a hypothesis most probable given by the data available can be calculated. h* = argmaxP(datalh)P(h). hen ' Semi-Supervised Learning comes into consideration whenever only a certain amount of input data entails labels and even these labels are not depicting reality in a reliable manner. This encompasses two problems for the agent to solve. Firstly to distinguish between random noise in the data and systematic inaccuracies within the labeled data, and secondly to appropriately label the rest of the data. For the data in table 2 systematic inaccuracies could arise due to the various industries and markets of the companies and the entailing difference in debt-to-equity. The agent develops concepts to distinguish clusters on its own which suggests that clustering is the most common application for unlabeled data. For example, it could infer that low P/E ratios imply lower stock performance on the next day and suggest to sell. Reinforcement Learning provides agent learning by providing a series of reinforcements, rewards or punishments. The agent is confronted with certain recurring events that lead to a positive or a negative outcome and detects which of the specific condition prior to the reward or punishment were most responsible for it. The trailing PE at a certain time of the year (for time series data) or the correlation with the market index could pose such conditions.

### 3.1 Cross Validation

In Machine Learning, a model is trained from an available data set. The model, be it a regression or a classifier, may be able to adequately predict the labels in the applied training data but lacks to predict novel instances. Cross validation can provide an angle on the generalized performance of models. Labeled data can be sparse in many cases. A lack in labeled data entails a lack of training data for the algorithm. In order to exploit the existing labeled for model selection (selection of the algorithm parameters) and for training, fc-fold cross validation can be applied. Cross validation encompasses the general method of learning algorithm evaluation and comparison. For this, the data is split into two sets, one set which trains a model and the other set which validates the model. Then, both training and validation sets crossover in successive frequence, provding that each data point evaluated against (Ross et al., 2009). The basic cross validation method is fc-fold cross validation which was formulated in the 1930s (Larson, 1931) and was first used to estimate regression model performance. It took over fourty years for the application to be broadened to model parameters and entire classifier selection (Stone, 1974). Today cross validation represents the standard procedure for model performance and selection. Mohri et al. (2012) explain the method as following: θ represents the vector of free parameters of the algorithm. For a fixed value θ, the given labeled data S is randomly sliced into n subsets or folds. The ith fold is therefore a labeled subset ((xi·,yii),(ximi,yimi)) with the size mi. For any i E {1,к}, the algorithm is trained on all folds except the ith fold in order to output a hypothesis hi. θ is then inspected on the cross validation error, the average error hypothesis hi generated, which is represented as Rcv (θ) with:

illustration not visible in this excerpt

Usually, all folds are equally segemented, so mi = m for all G {1,к} holds true. Generally for large n, the size of each training data in к-fold cross validation can be described as: [Abbildung in dieser Leseprobe nicht enthalten], close to the entire sample size m. A large к therefore entails a small bias and a large variance, respectively, smaller values for к with a size less than m entail smaller variance and larger bias. Ross et al. (2009) elaborate the ideal size of к deriving from the notion of how the cross validation performance measure is defined by both the training set and the test set: The training set influences the measure indirectly through the application of the learning algorithm, while the test set composition has a direct effect. Allowing overlapping training sets while holding the test set independent is what к-folds cross validation provides.

illustration not visible in this excerpt

Figure 2: The Cross Validation Score for the Support Vector Machine RBF Classifier on the Test Data which overfits at γ < 10_2.Created from the Valuation Data.

Choosing an appropriate value for k poses an inherent issue. Larger k entail more performance estimates and a training set size close to m. This maximizes the probability of any conclusion elaborated about the respective algorithm will generalize to when m data is used to train the model. Increasing k however leads to increased overlaps between training and test sets. So with k = 5, a 5-fold cross validation, each training set shares 3/4 of its instances with each of the remaining four sets, with k = 10, each training set shares 8/9 of its instances with the remaining nine. Additionally, increasing k shrinks the test set size which entails coarse-grained performance metric measurements. So with a test size of 10, accuracy can only be measured to the nearest 10 %, respectively with a test size of 10, 5 %. Hence, the consensus within the data mining community regards k = 10 as an ideal compromise, particularly since it allows predictions considering 90 % of the data, maximizing the ability to generalize to the full data set.

After the data has been split into training and testing segment, the training set sized m is taken to calculate the fc-fold cross validation error Rcv (θ) for a fraction of possible values of θ. Then, θ = θ0 is set for which Rcv (θ) is minimal. The algorithm is then trained with θ = θ0 over the entire training set m and its performance is evaluated according to the cross validation error. When used for performance evaluation, θ becomes a fixed parameter specification and for θ, m is split into к random folds neglecting test and training setting. The fc-fold cross validation and the standard deviation of the errors measured on each fold represent the performance.

illustration not visible in this excerpt

Figure 2 shows the training score and the cross validation score for the case of a Support Vector Machine and for various values of the classifier Kernel parameter γ. ^{[1]}

### 3.2 Decision Trees

Fuernkranz (2011) illustrates that one of the most commonly known and easy-to-grasp Machine Learning models is Decision Trees, a tree-shaped classification model. Its induction reaches back to the 1980s, first described as the inductive inference machinery within Machine Learning research (J. Ross Quinlan, 1983). J R Quinlan (1986) provides a prominent case for a Decision Trees application with his Golf data. The given weather predictors, such as temperature, wind strength and humidity, are provided to infer the target whether a person would be willing to play golf. A Decision Tree consists of nodes, edges and leaves. Nodes test the value of a specific attribute, edges correspond to the test result and link to the consequent node or leaf, and leaves are the terminal nodes that conclude the outcome. The classification is initiated at the first node from the top of the tree, the root, and its corresponding value which is an attribute of outlook in figure 3. The example then shifts down the branch corresponding to a specific value of the respective Ror an explanation of gamma and Kernels see section3.4. For low gamma values both scores remain low, which is referred to as underfitting. attribute (outlook) to reach a new node with a new attribute. These steps are repeated until the leaf is reached which provides a value of the target attribute Play Golf?.

illustration not visible in this excerpt

Figure 3: The Decision Tree Golf Example. Adapted from J R Quinlan (1986)

Figure 3 does not entail Temperature within the tree as the training data is classifiable without a connection to this predictor. Generally, the predictors close to the root have a higher influence in the leaf value then nodes further down the tree. The algorithm underlying Decision Trees is called Top-Down Induction of Decision Trees (TDIDT), also referred to as divide-and-conquer-learning. TDIDT singles out the most appropriate attribute for the Decision Tree root, then divides the data into independent sets and includes the respective nodes and branches. For discrete values, each test retains the form [Abbildung in dieser Leseprobe nicht enthalten] with v being one value for the attribute A. The set consisting of all training instances is St. The values v can also serve for numerical attributes for which a binary split formulated [Abbildung in dieser Leseprobe nicht enthalten] applies. The binary split solely indicates whether A lies beneath or above a specific threshold vt. Another possibility to bolster the algorithm with discrete data is discretization, an operation which transforms numeric attributes into categorical attributes. Following the data split corresponding to the specific A, the mechanism is employed by iterating over each of the split data sets. Whenever a distinct set only contains instances of the identical class or no subsequent split is feasible, since all remaining split options produce the exact outcome for every instance, the respective node is transformed into a leaf and associated with the corresponding class. If split feasibility is however still granted, an additional internal node is created and combined with the most appropriate split attribute for its set. When finished, converged, the algorithm has allotted the data into non-overlapping splits, pure nodes, each consisting entirely of instances of the same class. Pure nodes are only achievable when there exist no conflicting instances within the data, equipping like feature values but divergent class values.

Occham’s razor, the credo of simplicity over complexity described before, holds true for Decision Tree attribute selection. In the Golf instance, not all attributes are included for the reason that their leaves would only contain a single training instance which coerces unnecessary model complexity, decreased prediction speed and reliability. Proposed by J R Quinlan (1986), two additional common framework enforced to select the ideal attributes are the information-theoretic entropy and the Gini index.

illustration not visible in this excerpt

S delineates the training set and Si the training split affiliated with class ci. The maximum impurity for both (2) and (3) is reached where all classes are equally distributed which brings about equal sizes for the splits Si, respectively, the minimum is targeted where one single split Si accommodates all instances [Abbildung in dieser Leseprobe nicht enthalten] while the remaining [Abbildung in dieser Leseprobe nicht enthalten] uing decrease of impurity is noted as Gain:

with t being one specific test on A splitting S into non-overlapping independent St, while impurity reflects whatever impurity assessment. Impurity(S)acts as constant for all A and can thus be omitted for immediate minimization of the average impurity. In contemplation of attributes with many values being preferred to become pure successor nodes over attributes comprising fewer values the gain ratio is put in practice to normalize the gained entropy with the inherent entropy of the split:

illustration not visible in this excerpt

Eventually, the TDIDT consists of a an array of split-specific, local, decisions that exclusively deal with the instances that result to be at the node that is divided at present. This local consideration evokes the problem that each node finds its individual, local optimum and cannot assure convergence into a global optimum or smallest tree.

In order to tackle the issue of Decision Tree overfitting in the form of attributes with little number of values and impure successor nodes and to grant adherence to Occham’s razor, so-called pruning methods provide minimized node complexity. Pruning can be found in various modern Decision Tree induction methods. The algorithm named C4.5 incorporates a prepruning parameter m which prohibits additional splitting except for minimum two successor nodes including at least m instances (J Ross Quinlan, 1992). CART (Classification and Regression Decision Trees) is another algorithm which casts a cost-complexity pruning technique. Here m is found by cross validation (Breiman, Friedman, Olshen & Stone, 1984). In post- pruning, some of the nodes are substituted with leaves which eliminates the respective former sub-nodes and reducing complexity. The novel leafs however are class-impure, so the the most-common class at the leafs is affiliated with the leaf. C4.5 is the most commonly used Decision Tree algorithm as it tackles the issues of numeric values, omitted attribute values and overfitting. CART is optimally implemented for statistical learning. Decision Trees are often combined with various other learning models in Ensemble Methods which are covered subsequently in this section.

### 3.3 k Nearest Neighbors

To understand the concept of k Nearest Neighbors, one particular Machine Learning fundamental has to be elaborated.

S. Russell and Norvig (2009)(pp.737-739) distinct between two basic underlying model concepts within Machine Learning. Linear regression as well as neural networks transform training data into a defined set of parameters w in order to approximate the true function f with the hypothesis hw (x). Models that create such a parameter set of rigid size are called parametric. Particularly in the case of extensive novel instance data for the model to predict, a fixed-sized parameter set prompts decreasing prediction reliability. This is not the case for non-parametric models. The functionality of non-parametric models creates sub-hypotheses (e.g. the nodes in Decision Trees) that holds within all the training instances and from that predicts the succeeding instance. The parameter vector for non-parametric models expands in complexity with the number of instances available, hence they are also known as instance-based learning models. They are also called lazy learners as they do not use a parameterized compact model of the target and as stated for Decision Trees, they approximate local maxima for each test instance.

A straight-forward illustration of instance-based learning is table lookup where all training instances noted in a table. By request for h(x), the corresponding x is found , if it is contained in the table, the respective y is provided. The lack of generalization for this table lookup is obvious since it cannot anticipate y for novel unlabeled instances. Table lookup can however be drastically improved when marginally altered to a nearest neighbor classifier. The essential approach to nearest neighbor is to accumulate all labeled instances and then contrast them against novel unlabeled ones for the purpose of designating appropriate labels. Hence, the table lookup could be enhanced to the following: For xq, all k instances nearest to xq ought to be determined. The set of k Nearest Neighbors is formulated as NN(k,xq). In this process, ’nearest’ instances are determined by means of a certain distance measure. to find the one nearest neighbor of xq, initially the nearest training instance xn is detected, next the label of xn is annotated to xq, thus f (xq) = f (xn). When k Nearest Neighbors are requested, first the k nearest training instances are located, then following orders apply: When the target function comprises discrete values, the plurality vote (the most common label) among NN(k,xq) is assigned to xq. k is always selected to be an odd number in order to avert voting ties in the process. When the target function comprises continuous values, a regression is required and the mean of NN(k,xq) describes the target function values

illustration not visible in this excerpt

In order to determine the distance between xq and the example point xj, usually the Minkowsky Distance is adopted. The measure, also familiar as Lp norm is determined as (7)

For p = 2, (7) is referred to as Euclidean Distance and for [Abbildung in dieser Leseprobe nicht enthalten] is Manhattan Distance. When Boolean attribute values are processed, the count of attributes on which xq and xj differentiate is labeled Hamming Distance. Whenever related properties are measured (e.g the respective weights of components in a shipping container), p = 2 is used. For disparate properties (e.g, weight, color and volume of a shipping container), p = 1 is the common parameter. For various attribute dimensions, distance is affected by scale bias. This scale bias coerces errors in nearest neighbor detection. It occurs when e.g. one attribute dimension is weight in kilograms and another one is weight in grams while holding all other attribute dimensions unchanged. Hence, normalizing the data is imperative. The basic normalization proposal is to utilize both dimension mean μi and standard deviation σί to compute Normalized[Abbildung in dieser Leseprobe nicht enthalten]. The Mahalanobis Distance is the more advanced solution which considers co-variances among dimensions.

illustration not visible in this excerpt

Figure 4: A 2-Class Classification (k = 3, weights = ’distance’). Created from the Valuation Data.

A further classifier parameter is distance weights. Figure 4 exhibits this parameter with the beige outliers in the dark class. Uniform weights designate equal weights for all distant points, distance weights designate a value inverse to the corresponding query point distance.

Closer neighbors to xq have a larger effect than neighbors farther away and are less likely to be considered for class affiliation. KNN is comparably fast and it is capable of learning convoluted target functions. One common issue with the classifier is termed Curse of Dimensionality. For high-dimensional data, the results become increasingly unreliable.

### 3.4 Support Vector Machines

Zhang (2011) summarizes Support Vector Machines as a collection of algorithms of value for classification, regression, density measurement, novelty identification as well as further applications. The most basic instance is a Support Vector Machine which divides data into two classes. The margin between the classes is maximized. Advantages of Support Vector Machines are their comparably high degree of generalization and their capability of complex optimization procedures enabling Support Vector Machines to learn from extensive bulks of data.

Support Vector Machine development took place over four decades in three major waves. The concept of a Decision Boundary dividing training instances of two classes with a maximized decision boundary was introduced by Vapnik and Lerner (1963). Then Wahba (1991) put forward the concept of Kernels. Boser, Guyon and Vapnik (1992) expanded the scope of Support Vector Machine application by suggesting to apply non-linear Kernel Support Vector Machines that can process real data and compose the best decision boundary in feature space. Lastly, Cortes and Vapnik (1995) published the framework of soft margins authorizing certain data points to breach the margin condition. This is necessary when instance class is not linearly distinguishable. For Support Vector Machines optimal decision boundary margins are vital for an exact classification. The data instances that are situated on the margin are called support vectors which explains the name Support Vector Machine. Support Vector Machines can be set to classify linearly as depicted by Figure 1 on page 16. They can, as already mentioned, be enhanced with Kernels which grants processing non-linear dependencies. Support Vector Machines can be optimized by quadratic programming increasing analysis convenience for extensive data sets. The extensive introduction by Ihler (2015) of the University of California Irvine serves as foundation for this section.

illustration not visible in this excerpt

Figure 5: H2 represents a Maximum Margin Decision Boundary, while H1 equips narrower margins.

Visually comparing the decision boundaries of H1 and H2 in the figure above, both separate the different classes of data correctly. However, it becomes clear that H2 is the better model for the following reason: For H1, already tiny data deviatons can lead to an instance being located on the wrong side of the boundary which is in an incorrect prediction. Since test data set include novel unseen data that can differ from the training data known to the model, this problem can become significant. The question arises of how to quantify this notion into the optimum Support Vector Machine parameter settings. Parallel to the left and the right respectively, H2 is surrounded by a margin within which no data instance can be found. The maximization of this margin is the focus of the model optimization. For this, the parameter vector is considered which consists of a bias b, all training set instances [Abbildung in dieser Leseprobe nicht enthalten] and correspondingly a weight for each instance: [Abbildung in dieser Leseprobe nicht enthalten] + w2x2 + wnxn. It is important to consider that the decision boundary is scale invariant which implies that if all instances left or right from the boundary are e.g. squared, the decision for the instances remains unchanged. If the decision boundary is f (x) = 0, then the margins are defined as f [Abbildung in dieser Leseprobe nicht enthalten] + b the data in the positive class (right to the boundary) is situated in a region of positive response of at minimum +1, that is f (x) ≤ +1 in region + 1 and the data in the negative class (left) respectively in a negative class data is in a region of at maximum —1, that is f (x) < in region — 1. All the instances exactly on the margins are the support vectors. These are isosurfaces (surfaces constiting of instances of a constant value) of a linear response, thus they are hyperplanes parallel to the decision boundary and the margins are defined as the respective gap between the hyperplane and the decision boundary which equals to twice the distance from the decision boundary to the closest data instance. The margin width can be geometrically determined in terms of the parameters. For this, it should be considered that the weight vector w = [w1;w2..., wn] is perpendicular to the decision boundary. This holds true since for any two points x and x' on the decision boundary the linear response is 0 or wx + b = 0 and wx' [Abbildung in dieser Leseprobe nicht enthalten]. For a random point on the negative class hyperplane x- such that [Abbildung in dieser Leseprobe nicht enthalten] and a random point as near as possible to x- on the positive class margin x+ such that [Abbildung in dieser Leseprobe nicht enthalten], w is perpendicular to the hyperplanes, the vector describing the distance from x- to x+ is defined as a scalar vector [Abbildung in dieser Leseprobe nicht enthalten]. The two closest points also satisfy [Abbildung in dieser Leseprobe nicht enthalten]. Therefore [Abbildung in dieser Leseprobe nicht enthalten]. The value of the length of r is determined by further analysis as [Abbildung in dieser Leseprobe nicht enthalten]. Since the margin [Abbildung in dieser Leseprobe nicht enthalten] . This is true for all parameters that fulfill the assumptions for the positive or the negative region.

Maximizing the margin width while assuring the necessary support vectors

illustration not visible in this excerpt

**[...]**

^{[1]} For an explanation of gamma and Kernels see section3.4. For low gamma values both scores remain low, which is referred to as underfitting. For γ ≤ 10−2 the classifier is performing well. For γ > 10−2 the classifier is overfitting which means the training score increases in performance while the validation score decreases.

- Quote paper
- Justus Wiedemann (Author), 2017, Machine Learning Asset Valuation, Munich, GRIN Verlag, https://www.grin.com/document/373193

Publish now - it's free

Comments