Robotic agents can learn a wide range of tasks by simulating many years of interaction with the environment, an amount of experience that is infeasible to collect on real robots. With the abundance of replay data and the increasing fidelity with which simulators model complex physical interaction between robots and their environment, agents can learn tasks that would otherwise take a lifetime to master. However, such training pays off only if the learned behaviour transfers to real machines. Although simulation provides a safe setting in which to train and test agents, policies trained in simulation often transfer poorly to the real world. This difficulty is compounded by the fact that deep-learning-based optimization algorithms frequently exploit simulator flaws, cheating the simulator to reap higher rewards. In this work, we apply commonly used reinforcement learning algorithms to train a simulated agent modelled on the Aldebaran NAO humanoid robot.
The discrepancy between simulated and real experience is known as the reality gap. To bridge the gap between the simulated and real agents, we employ a Difference Model that learns the difference between the state distributions of the real and simulated agents. The robot is trained on two basic tasks: navigation and bipedal walking. Deep reinforcement learning algorithms, namely Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), are used to achieve proficiency in these tasks. We then evaluate the performance of the learned policies and transfer them to a real robot using a Difference Model built as an addition to the DDPG algorithm.
Table of Contents
1. Introduction
1.1 Motivation
1.2 Objective
1.3 Thesis Organization
2. Related Work
3. The Reinforcement Learning Problem
3.1 Introduction
3.2 Markov Process
3.3 Markov Decision Process
3.4 Policy
3.5 Return
3.5.1 Discount Factor
3.6 Value Function
3.7 Bellman Equations
3.7.1 Bellman Expectation Equation
3.7.2 Bellman Optimality Equation
3.8 Evaluating and Solving MDPs for Optimal Policy
3.8.1 Planning by Dynamic Programming
3.8.2 Policy Evaluation
3.8.3 Policy Iteration
3.8.4 Value Iteration
3.8.5 Extension to Dynamic Programming
3.9 Model-Free methods
3.9.1 Value Function Based Methods
3.9.2 Policy Search Methods
3.9.3 Actor Critic Methods
3.10 Model-Based methods
3.10.1 Dyna-Q
3.10.2 Real Time Model Based Architecture
3.11 Summary
4. Deep Reinforcement Learning for Robotic Manipulation
4.1 Introduction
4.2 Deep Q-Learning
4.3 Deep Deterministic Policy Gradient
4.4 Other Deep RL Algorithms
4.4.1 Policy Gradient Methods
4.4.2 Guided Policy Search
4.5 Comparison of DRL Methods
4.6 Summary
5. System Overview and Design
5.1 Introduction
5.2 Environment Specification
5.2.1 Webots Simulator
5.2.2 OpenAI Gym
5.2.3 Webots Remote control through TCP/IP
5.3 Agent Testing Framework Overview
5.3.1 RL Agent
5.3.2 Environment Controller
5.3.3 Robot Controller
5.4 Framework Improvements and Future Work
6. Methodology and Implementation
6.1 Introduction
6.2 Aldebaran NAO
6.3 Deep Q Networks for Robot Navigation
6.3.1 Navigation for Self-Driving Car
6.3.2 Model for DQN
6.3.3 Experience Replay
6.3.4 Exploration
6.4 Improvements to DQN
6.4.1 Fixed Q-Targets
6.4.2 Double DQNs
6.4.3 Prioritized Experience Replay
6.4.4 Navigation for NAO Robot
6.5 Dealing with High Dimensional Action Spaces
6.6 NAO Bipedal Walk: DDPG for Ideal Conditions
6.6.1 Learning Parameters for DDPG
6.7 Bridging the Reality Gap: Sample Efficient DDPG
6.7.1 Notation
6.7.2 Proposed Method
6.7.3 Algorithm
6.8 Summary
7. Results and Discussion
7.1 Introduction
7.2 System Specification
7.2.1 Software
7.2.2 Hardware
7.3 Self-Driving Car Navigation
7.4 NAO Robot Navigation
7.5 Sim-to-Real Evaluation on Inverted Pendulum and NAO
7.5.1 Learning Parameters
7.5.2 Results
7.6 Advantages of the approach
7.7 Limitations of the approach
7.8 Summary
8. Conclusion and Future Work
8.1 Conclusion
8.2 Directions for Future Work
Research Objective & Core Themes
This thesis aims to develop a robust, unified framework to train a NAO humanoid robot using Deep Reinforcement Learning (DRL) for complex tasks such as navigation and bipedal walking, specifically addressing the "reality gap" when transferring simulation-trained policies to physical robotic platforms.
- Deep Reinforcement Learning (DRL) algorithms (DQN and DDPG).
- Robot simulation using the Webots environment.
- Bridging the "reality gap" using a Difference Model.
- Development of an Agent Testing Framework for decoupling environment and learning implementation.
- Evaluation of learned policies for navigation and bipedal locomotion.
Excerpt from the Book
6.7 Bridging the Reality Gap: Sample Efficient DDPG
Model-based methods have been studied extensively for applying RL algorithms to robotic systems [17]. They are preferable to model-free methods because they considerably reduce the number of samples required from the real system. Obtaining real data from robotic systems can be extremely difficult and time-consuming. For walking robots like the NAO, data collection is even harder, since executing random policies on the robot could produce extremely harmful maneuvers that physically damage it. DRL algorithms typically require on the order of 10^6 samples to converge, and obtaining that many samples on a real robot is practically impossible. We therefore propose a model-learning algorithm that learns a local difference model from a few trials of a policy on the real system and then uses that model within a DRL framework.
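The difference-model idea above can be sketched in a few lines: collect matched transitions from the simulator and the real system, fit a model of the prediction error, and use it to correct the simulator's next-state estimate. This is a minimal illustration only, assuming a simple linear difference model fitted by least squares (the thesis itself uses a neural network inside DDPG); the class name `DifferenceModel` and the toy dynamics are illustrative, not from the thesis.

```python
import numpy as np

class DifferenceModel:
    """Linear model of the gap between simulated and real dynamics:
    predicts delta = s_real_next - s_sim_next from (state, action)."""

    def __init__(self):
        self.W = None  # fitted weights, shape (state_dim + action_dim + 1, state_dim)

    def fit(self, states, actions, sim_next, real_next):
        # Features: [state, action, bias]; targets: the simulator's prediction error.
        X = np.hstack([states, actions, np.ones((len(states), 1))])
        Y = real_next - sim_next
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def correct(self, state, action, sim_next):
        # Shift the simulator's next-state prediction toward the real system.
        x = np.concatenate([state, action, [1.0]])
        return sim_next + x @ self.W

# Toy check: the "real" system carries a constant offset the simulator misses.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 3))
actions = rng.normal(size=(200, 1))
sim_next = states + 0.1 * actions      # simulator's next-state prediction
real_next = sim_next + 0.5             # real system adds a constant bias

dm = DifferenceModel()
dm.fit(states, actions, sim_next, real_next)
corrected = dm.correct(states[0], actions[0], sim_next[0])
```

In the thesis's setting the corrected prediction stands in for expensive real-robot rollouts, so the policy is trained against dynamics much closer to the physical system while only a few real trials are needed to fit the model.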
Summary of Chapters
1. Introduction: Outlines the motivation for using humanoid service robots and defines the project goals of utilizing DRL for navigation and bipedal walking.
2. Related Work: Reviews existing literature on robotics simulators and applications of DRL in robotic control, highlighting the challenge of the reality gap.
3. The Reinforcement Learning Problem: Provides a theoretical foundation covering Markov Decision Processes, value functions, Bellman equations, and various model-free and model-based RL methods.
4. Deep Reinforcement Learning for Robotic Manipulation: Explains the integration of deep neural networks as function approximators, focusing on DQN and DDPG algorithms for high-dimensional control.
5. System Overview and Design: Describes the design of a modular Agent Testing Framework using Webots and OpenAI Gym to standardize agent interaction and simulation control.
6. Methodology and Implementation: Details the practical implementation of navigation and walking tasks, including the "Sim-to-Real" approach using a learned Difference Model to adapt policies.
7. Results and Discussion: Presents the experimental results on inverted pendulum and NAO setups, discussing the efficacy of the Difference Model in improving policy transfer.
8. Conclusion and Future Work: Summarizes the thesis contributions and suggests future research directions, such as adding constraints to policy updates.
Keywords
Deep Reinforcement Learning, Robotics, NAO Robot, Simulation, Reality Gap, Difference Model, DDPG, DQN, Agent Testing Framework, Bipedal Walking, Navigation, Webots, Neural Networks, Control Policies, Sim-to-Real.
Frequently Asked Questions
What is the primary focus of this research?
The research focuses on training a NAO humanoid robot to perform complex behaviors like navigation and bipedal walking using Deep Reinforcement Learning, with a specific emphasis on transferring these skills from simulation to a real-world robot.
What is the "reality gap" in the context of this work?
The reality gap refers to the performance discrepancy of robotic control policies trained in simulation when they are deployed on physical machines, often due to differences in physics, friction, and sensor noise.
What is the main objective of the thesis?
The objective is to create a unified framework for training a NAO robot in simulation and to utilize Deep Reinforcement Learning for high-dimensional control, ensuring these behaviors successfully transfer to a real robot.
Which DRL algorithms are primarily used?
The thesis utilizes Deep Q-Networks (DQN) for discrete control tasks like navigation and Deep Deterministic Policy Gradient (DDPG) for continuous control tasks like bipedal walking.
What is the role of the "Difference Model"?
The Difference Model is a neural network designed to capture the mismatch between the dynamics of an ideal simulation and the real robotic system, allowing the agent to compensate for this bias when learning its policy.
How does the Agent Testing Framework help?
It decouples the implementation of the RL learning algorithm from the environment specifications, allowing the same agent code to be tested on various simulated environments consistently.
What are the key technical challenges in NAO bipedal walking?
Key challenges include managing the high-dimensional action space of the robot's joints and dealing with contact-rich dynamics that make the learning process non-differentiable and inherently chaotic.
What were the findings regarding the "Sim-to-Real" approach?
The results showed that while a Difference Model significantly improves policy transfer performance compared to learning on an ideal model alone, the time to stabilize the policy depends heavily on the system's dimensionality.
- Cite this work
- Suman Deb (Author), 2019, Humanoid robot control policy and interaction design. A study on simulation to machine deployment, München, GRIN Verlag, https://www.grin.com/document/493652