This bachelor thesis aims to illustrate the idea behind Markov Decision Processes (MDPs) and to present some basic methods of Reinforcement Learning (RL), namely Monte Carlo Learning and Q-Learning, which solve decision problems modelled as MDPs. In the last section we apply these methods to an application and discuss the results.
Let us imagine a scenario in which we place a hamster inside a maze and expect it to move through the maze until it reaches some point we have designated as the goal. This may occasionally work by chance, but most of the time it will not: at this stage, the hamster has no way of knowing that this particular point, the goal, matters at all.
But what happens if we reward the hamster once the goal is reached, for example with a piece of cheese? The hamster will start to remember the route that leads to the cheese, and it may even learn the quickest and easiest way to reach the goal. What we have done is reinforce the hamster's good behaviour by giving it a reward.
Table of Contents
1 Introduction
1.1 Outline
2 Basics of MDPs and RL
2.1 Markov Decision Processes
2.1.1 Markov Process
2.2 Value Function
2.3 Policy Iteration
2.4 Reinforcement Learning
2.4.1 Monte Carlo Learning
2.4.2 Temporal Difference Learning
3 Cleaning Robot Application
3.1 Introduction
3.2 Solving via Value Iteration
3.3 Solving via Monte Carlo Learning
3.4 Solving via Q-Learning
3.5 Comparison of Results
4 Discussion
Research Objectives and Key Themes
This thesis investigates the mathematical foundations of Markov Decision Processes (MDPs) and Reinforcement Learning (RL) algorithms. The primary goal is to evaluate different learning strategies—specifically Value Iteration, Monte Carlo Learning, and Q-Learning—by applying them to a practical simulation of a cleaning robot navigating an apartment to locate dust accumulation.
- Mathematical formalization of Markov Decision Processes (MDPs)
- Comparative analysis of model-based vs. model-free learning
- Implementation of Value Iteration for optimal policy derivation
- Evaluation of Monte Carlo and Q-Learning algorithms
- Performance benchmarking in a simulated spatial environment
Extract from the Thesis
3.5 Comparison of Results
As we saw in the previous sections, all three methods for solving the given MDP (reaching the room where the dust accumulates) converge to the optimal policy π*. But since they have different requirements and work differently, we give a short overview of the pros and cons of each method.
With the Value Iteration method we achieved quite good results, and it did not incur high computational costs to converge to the optimal solution. On the other hand, we had to supply our agent with the dynamics, rewards, and terminal states of the environment. Such a method is called model-based: the agent knows everything about the environment before it starts solving a stated problem in it. As a consequence, Value Iteration becomes useless when we are given an environment about which we know nothing.
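To make the model-based character concrete, here is a minimal value-iteration sketch on a toy three-state MDP. The states, transitions, and rewards are illustrative assumptions for this sketch, not the thesis's actual apartment model; note that the full transition table `P` must be handed to the algorithm up front.

```python
# P[s][a] = list of (probability, next_state, reward) triples.
# Toy chain: state 2 is terminal (the "dust" room); all values
# here are illustrative assumptions.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},  # terminal state
}
gamma = 0.9  # discount factor

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s, actions in P.items():
        if not actions:                     # terminal: value stays 0
            continue
        # Bellman optimality backup: best expected one-step return
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-8:                        # converged
        break

# greedy policy with respect to the converged values
policy = {
    s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                      for p, s2, r in actions[a]))
    for s, actions in P.items() if actions
}
```

On this toy chain the method converges to V(0) = 0.9 and V(1) = 1.0, with the greedy policy moving right towards the terminal state.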
We then introduced Monte Carlo Learning. Here the term learning means gathering experience by interacting with the surroundings, which removes the need for prior knowledge of the environment. Learning methods of this kind are model-free: it is not necessary to know anything about the MDP's dynamics (the agent has no model of the environment).
MC methods work by simulating episodes following a specified policy π and averaging the observed returns at the end of each episode (offline learning). The main disadvantage of MC is that an episode must finish before the state values can be updated, which is inefficient in infinite or very large environments. Moreover, MC has a high variance, which makes it sensitive to the initial values; this can result in high computational costs to obtain quality outcomes.
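The episode-then-average structure can be sketched with first-visit Monte Carlo evaluation on a toy corridor. The corridor layout, the uniformly random policy, and the episode count are illustrative assumptions; the point is that values are only updated after each episode terminates.

```python
import random

# Toy corridor: states 0..3, state 3 terminal with reward 1 on
# entry, 0 elsewhere. Layout is an illustrative assumption.
gamma = 0.9
random.seed(0)

def run_episode():
    """Follow a uniformly random policy; return [(state, reward), ...]."""
    s, trajectory = 0, []
    while s != 3:
        a = random.choice([-1, 1])
        s2 = min(max(s + a, 0), 3)
        r = 1.0 if s2 == 3 else 0.0
        trajectory.append((s, r))
        s = s2
    return trajectory

returns = {s: [] for s in range(3)}
for _ in range(2000):
    episode = run_episode()
    G, first_visit = 0.0, {}
    # walk the finished episode backwards, accumulating the
    # discounted return; the last overwrite is the first visit
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        G = r + gamma * G
        first_visit[s] = G
    for s, g in first_visit.items():
        returns[s].append(g)

# state values = average of the collected first-visit returns
V = {s: sum(g) / len(g) for s, g in returns.items() if g}
```

Note that nothing can be learned mid-episode: all 2000 trajectories must run to the terminal state before their returns contribute to `V`, which is exactly the inefficiency discussed above.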
The last proposed learning method was Q-Learning, which is a special case of temporal difference (TD) methods. TD can learn from incomplete episodes (online), and TD methods have a lower variance than MC methods. Additionally, Q-Learning has the off-policy property, so it is possible to involve more than one policy in order to improve the learning procedure.
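A minimal Q-Learning sketch on the same kind of toy corridor illustrates both properties: the agent updates after every single step (online), and the update target uses the greedy max over next actions while behaviour follows an ε-greedy policy (off-policy). The corridor, step size, exploration rate, and episode count are illustrative assumptions.

```python
import random

# Toy corridor: states 0..3, state 3 terminal, reward 1 on entry.
gamma, alpha, eps = 0.9, 0.1, 0.2   # illustrative hyperparameters
random.seed(1)
Q = {(s, a): 0.0 for s in range(3) for a in (-1, 1)}

def step(s, a):
    s2 = min(max(s + a, 0), 3)
    return s2, (1.0 if s2 == 3 else 0.0)

for _ in range(3000):
    s = 0
    while s != 3:
        # epsilon-greedy behaviour policy
        if random.random() < eps:
            a = random.choice([-1, 1])
        else:
            a = max((-1, 1), key=lambda a: Q[(s, a)])
        s2, r = step(s, a)
        # off-policy target: greedy max over next actions,
        # applied immediately, mid-episode
        target = r if s2 == 3 else r + gamma * max(Q[(s2, -1)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

greedy_policy = {s: max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(3)}
```

Because transitions here are deterministic, the action values settle near their optimal discounted returns (Q(2, +1) ≈ 1, Q(1, +1) ≈ 0.9), and the greedy policy moves right in every state.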
Summary of Chapters
Introduction: Provides a high-level motivation for reinforcement learning using a simple maze analogy and outlines the two-part structure of the thesis.
Basics of MDPs and RL: Establishes the theoretical framework including Markov processes, the Bellman equation, and foundational learning algorithms.
Cleaning Robot Application: Demonstrates the practical implementation and comparison of Value Iteration, Monte Carlo Learning, and Q-Learning in a simulated home environment.
Discussion: Reviews the findings, highlights the commonalities between the algorithms, and identifies future directions such as Deep Q-Learning.
Keywords
Reinforcement Learning, Markov Decision Processes, Value Iteration, Monte Carlo Learning, Q-Learning, Policy Iteration, Bellman Equation, Model-based Learning, Model-free Learning, Agent, Reward Signal, Discount Factor, Off-policy Learning, Temporal Difference, Simulation.
Frequently Asked Questions
What is the core focus of this bachelor thesis?
The thesis focuses on explaining Markov Decision Processes (MDPs) and their application through Reinforcement Learning (RL) algorithms, shifting from theoretical foundations to a practical implementation.
Which specific reinforcement learning methods are presented?
The thesis presents three primary methods: Value Iteration, Monte Carlo Learning, and Q-Learning.
What is the primary goal of the research?
The goal is to demonstrate how these algorithms can be used to derive an optimal policy for an agent, specifically a cleaning robot, to navigate an environment and find a target destination (the location of dust).
What scientific approach does the author take?
The author emphasizes a mathematical point of view for understanding these algorithms, rather than focusing purely on computer science implementations.
What is covered in the main body of the work?
The main body covers the mathematical definitions of MDPs, the transition matrix, state-value and action-value functions, the Bellman expectation equation, and the subsequent implementation of these concepts in Python for a simulated robot.
Which keywords best characterize this work?
Key terms include Reinforcement Learning, MDP, Q-Learning, Monte Carlo, Value Iteration, and optimal policy.
Why is Value Iteration described as model-based?
It is model-based because the agent requires prior knowledge of the environment's dynamics, specifically the transition probabilities and reward functions, to compute the optimal results.
What are the advantages of Q-Learning mentioned in the thesis?
Q-Learning is highlighted for being an off-policy algorithm, having lower variance than Monte Carlo methods, and the ability to learn online from incomplete episodes.
How is the environment of the cleaning robot modeled?
The environment is modeled as a grid-based apartment with 7 rooms, where the agent receives a reward of 1 when reaching the child's room (where dust is present) and 0 or -1 elsewhere.
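The reward structure described above can be sketched as a simple function over room indices. The room numbering, which room is the child's room, and where the −1 penalties sit are hypothetical choices for this sketch, since the thesis's exact apartment layout is not reproduced here.

```python
# Hedged sketch of the described reward signal: 7 rooms, reward 1
# in the child's room (dust), 0 or -1 elsewhere. Indices and the
# penalty placement below are illustrative assumptions.
CHILDS_ROOM = 6          # assumed index of the goal room
PENALTY_ROOMS = {3}      # assumed rooms penalised with -1

def reward(room: int) -> float:
    if room == CHILDS_ROOM:
        return 1.0
    if room in PENALTY_ROOMS:
        return -1.0
    return 0.0
```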
- Cite this work
- Omar Baiazid (Author), 2021, Methods of Machine Learning and their Application. The Basics of Markov Decision Processes and Reinforcement Learning, München, GRIN Verlag, https://www.grin.com/document/1141604