Contents
1 Introduction
1.1 Background
1.2 Research Overview
1.3 Thesis Structure
2 Related Research
2.1 Research on Human-like Action Sequence Generation
2.2 Imitation Learning
2.3 Behavior Cloning
2.4 Generative Adversarial Imitation Learning
2.5 Policy Gradient Method
2.6 REINFORCE
3 Original RPG Battle
3.1 Character Status
3.2 Game Progress of the Original RPG Battle
3.3 Game Situations
3.4 Environmental States
4 Human-like Playstyle
5 Proposed Method
5.1 System Overview
5.2 Environment: Original RPG Battle
5.3 Sequence Data
5.4 Expert Data
5.5 Discriminator D
5.5.1 Sequence Data Provided to Discriminator D
5.5.2 Discriminator D Network Architecture
5.5.3 Learning of Discriminator D
5.6 Imitation Learning Agent
5.6.1 Sequence Data for Imitation Learning Agents
5.6.2 Action Decision Process of the Agent
5.6.3 Network Architecture of Imitation Learning Agents
5.6.4 Learning of Agents
6 Experiment
6.1 Expert Data Collection
6.2 Reward Design for Human-like Mistakes
6.3 Models and Hyperparameters used in Experiments
6.4 Classification of Human Player Playstyles by Clustering
6.4.1 Evaluation Method
6.4.2 Results of Time-Series Clustering on Expert Training Data
6.4.3 Validity of Fit Score using Expert Test Data
6.5 Quantitative Evaluation of Human-like Playstyles by Agents
6.6 Human Subjective Evaluation
6.6.1 Best Models per Game Situation
6.6.2 Implementation of Human-like Playstyles in Cooperative Play
6.6.3 Human-like Playstyles and Mistake Suppression
7 Consideration
7.1 Quantitative Evaluation of Playstyles via Clustering
7.2 Human-like Playstyles across Models
7.3 Human-likeness of Non-operated Characters
7.4 Human-like Playstyles and Mistake Suppression
8 Conclusion
List of Tables
3.1 Character Status
3.2 Status information of ally and enemy characters
5.1 Three Neural Networks Possessed by the Agent
5.2 Output Dimensions of Agent’s Networks
6.1 Expert Data Counts per Situation
6.2 Hyperparameters for BC
6.3 Hyperparameters for GAIL
6.4 Hyperparameters for BC+GAIL
6.5 Fit Scores for Expert Test Data
6.6 Fit Scores of Generated Data by Agents
List of Figures
3.1 Game Progress
3.2 Situation 1*
3.3 Situation 2*
3.4 Situation 3*
5.1 Overview of the Training Environment
5.2 Data Provided as Elements of Sequence Data
5.3 Data Preprocessing and Output for Discriminator D
5.4 Sequence data agent 20 receives from Action Memory
5.5 Sequence data given as input to action net
6.1 Best Models in each Situation
6.2 Human-likeness of other characters for each operated character
6.3 Proportion of players who answered "no mistakes" for the best model
Abstract
In recent years, multiplayer games have gained significant popularity, but they also face many challenges. In particular, making games enjoyable for single players is essential for attracting a broad user base. To enable single-player enjoyment of multiplayer games, rule-based non-player characters (NPCs) have long been used to form teams with players. However, these NPCs often do not behave like humans and frequently act contrary to the human player’s intentions. In this study, we propose a game AI that adopts human-like playstyles using behavioral data from human players. We then conducted quantitative evaluations of human-like playstyles and qualitative assessments through cooperative play between the trained AI and human players. The results showed that the proposed method realizes human-like playstyles in both quantitative and qualitative evaluations.
Chapter 1
Introduction
1.1 Background
Reinforcement learning technologies have led to the development of AI stronger than humans in games such as Go and Chess. However, these game AIs sometimes exhibit behaviors that are incomprehensible to humans, which has been identified as a problem. Therefore, in recent years, research on understanding the character traits of game AIs [1] and research on game AIs that perform human-like actions based on human behavioral data [2, 3, 5] has been actively conducted. Such research on generating human-like behavior is expected to entertain human players and allow them to play active roles [4].
Furthermore, games involving competition and cooperation among multiple players have become popular recently. Especially in cooperative multiplayer games, each player has a role they wish to fulfill. Players implicitly understand the team’s playstyle based on the roles and statuses of game characters and enjoy the game by cooperating with others. On the other hand, players who prefer single-player modes may avoid playing multiplayer games even if they are attracted to the game content. As a countermeasure, game developers provide non-player characters (NPCs) as allies so that the game can be enjoyed alone. However, many NPCs follow pre-determined action patterns. A problem with these NPCs is that it is difficult for them to provide cooperation based on roles like a human player would.
1.2 Research Overview
In this study, we use an original RPG battle as a multiplayer game environment to acquire behavioral data from human players. By using this behavioral data as a dataset, we train an agent to imitate human-like playstyles using imitation learning techniques.
Furthermore, imitating even the mistakes made by human players can lead to frustration for the human player. Therefore, in this study, to balance the pursuit of human-like playstyles with the prevention of human-like mistakes, we implement negative reinforcement for human-like mistakes to suppress misplays. The trade-off between the negative reinforcement settings to suppress mistakes and human-like playstyles will be discussed in Section 7.4 of this thesis. Based on these considerations, the purpose of this research is to create a game AI that can imitate human-like playstyles and cooperate with humans.
1.3 Thesis Structure
This thesis consists of eight chapters. Chapter 2 discusses related research. Chapter 3 explains the original RPG battle. Chapter 4 defines human-like playstyles. Chapter 5 describes the proposed method. Chapter 6 details the experiments. Chapter 7 provides considerations. Chapter 8 presents the summary of this research and future prospects.
Chapter 2
Related Research
2.1 Research on Human-like Action Sequence Generation
Zhao et al. [5] obtained action logs from human players in a Massively Multiplayer Online Role-Playing Game (MMORPG). Inspired by Long Short-Term Memory (LSTM), they implemented Long-Term Action Memory (LTAM) to read changes in action sequences caused by changes in player attributes, and succeeded in generating human-like action sequences.
2.2 Imitation Learning
In imitation learning, the policy taken by an expert in a given environment is regarded as the optimal policy π*. The optimal policy π* is learned from trajectory data (expert data) consisting of the actions the expert took, under π*, in the states provided by the environment. This trajectory data is given as pairs of environment states and chosen actions [6].
2.3 Behavior Cloning
Behavior Cloning (BC) is a supervised learning method using expert data. The agent receives an environment state s from expert data following the optimal policy π* and learns a policy π_φ to make action decisions [7, 8]. Expert data is given as trajectories of pairs of states s and corresponding actions a. By learning these trajectories, the agent’s policy π_φ is trained to best match the optimal policy π*. Here, φ represents the agent’s parameters, and the optimal parameters are φ*. These parameters φ* are formulated using maximum likelihood estimation as shown in Equation 2.1 [8].
φ* = argmax_φ E_{(s, a) ∼ expert data} [ log π_φ(a | s) ] (2.1)
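As a minimal numpy sketch of this maximum likelihood objective (assuming, for illustration only, that the policy is given by arbitrary action logits rather than the LSTM networks used later in this thesis):

```python
import numpy as np

def bc_nll(logits, expert_actions):
    """Behavior-cloning loss: negative log-likelihood of the expert's
    actions under the policy's action distribution. `logits` is any
    (N, A) score matrix; rows are states, columns are actions."""
    # row-wise softmax over action logits (shifted for stability)
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # probability assigned to each expert action
    picked = probs[np.arange(len(expert_actions)), expert_actions]
    return -np.log(picked).mean()

# Uniform logits over 4 actions -> NLL = log(4) regardless of the labels.
loss = bc_nll(np.zeros((3, 4)), np.array([0, 2, 3]))
```

Minimizing this quantity over the policy parameters is exactly the maximum likelihood estimation of Equation 2.1.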
2.4 Generative Adversarial Imitation Learning
Generative Adversarial Imitation Learning (GAIL) is an imitation learning framework proposed by Ho and Ermon [9], inspired by Generative Adversarial Networks (GANs). GAIL takes an inverse-reinforcement-learning approach to the imitation learning problem, jointly fitting an agent’s policy π_θ with weights θ and a discriminator network D_w with weights w. Here, log(D_w(s, a)) plays the role of a cost function. GAIL can be viewed as a maximization problem for the discriminator D and a minimization problem for the agent’s policy π_θ with respect to Equation 2.2.
E_{π_θ} [ log D_w(s, a) ] + E_{π*} [ log(1 − D_w(s, a)) ] (2.2)
From Equation 2.2, GAIL can be formulated as shown in Equation 2.3.
min_θ max_w E_{π_θ} [ log D_w(s, a) ] + E_{π*} [ log(1 − D_w(s, a)) ] (2.3)
2.5 Policy Gradient Method
One well-known family of reinforcement learning methods is policy-based algorithms [10, 11]. Policy-based algorithms aim to maximize the expected return J(π_θ) (Equation 2.4) of the agent’s policy π_θ.
J(π_θ) = E_{τ ∼ π_θ} [ G(τ) ] (2.4)
Here, τ denotes a trajectory obtained by following the policy π_θ, and G(τ) is the discounted sum of the rewards obtained along τ.
The gradient with respect to the agent’s policy π_θ is calculated as shown in Equation 2.5, and the agent’s policy parameters θ are updated based on Equation 2.6.
∇_θ J(π_θ) = E_{τ ∼ π_θ} [ Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) G(τ) ] (2.5)

θ ← θ + α ∇_θ J(π_θ) (2.6)
2.6 REINFORCE
REINFORCE is an improvement on the general policy gradient method [12]. In the general policy gradient method, the return G(τ) from Section 2.5 is the discounted sum of all rewards obtained along the trajectory. However, since the goodness of an agent’s action is evaluated by the sum of rewards obtained after that action, the rewards received before the action are irrelevant to its evaluation. Since the agent’s goal is to maximize the expected return after the time of the action, in REINFORCE the goodness of an action at time t is given as the discounted return G_t from time t onwards, with discount rate γ, as shown in Equation 2.7.
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{k=t}^{T} γ^(k−t) r_k (2.7)
With G_t, the gradient of the expected return in REINFORCE is calculated as shown in Equation 2.8, and the agent’s policy parameters θ are updated based on Equation 2.9.
∇_θ J(π_θ) = E_{τ ∼ π_θ} [ Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) G_t ] (2.8)

θ ← θ + α ∇_θ J(π_θ) (2.9)
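The discounted return G_t can be computed for a whole reward sequence in a single backward pass; a minimal numpy sketch:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every t,
    computed backwards in one pass over the reward sequence."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Rewards over three steps with discount 0.5:
# G_2 = 2.0, G_1 = 0 + 0.5*2.0 = 1.0, G_0 = 1 + 0.5*1.0 = 1.5
g = discounted_returns([1.0, 0.0, 2.0], 0.5)
```

Weighting each log-probability gradient by G_t instead of G(τ) is the only change REINFORCE makes to Equation 2.5.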
Chapter 3
Original RPG Battle
The environment used in this study is a turn-based command battle created based on typical multiplayer RPG battles. There are ally characters and enemy characters, and the order of actions is determined based on values combining each character’s speed and random numbers. Each character has pre-determined statuses such as Hit Points (HP) and Magic Points (MP), and if HP falls to 0 or below, that character becomes incapacitated. Additionally, when performing actions other than defense or single attacks, the acting character’s MP is consumed. If the MP required for an action exceeds the character’s current MP, the action becomes invalid.
Ally characters consist of four types: Warrior, Priest, Mage, and Entertainer. Each character has a different action set from which they choose their actions. Enemy characters are changed for each of the game situations described in Section 3.3.
3.1 Character Status
Characters in the original RPG possess the statuses shown in Table 3.1.
Table 3.1: Character Status
Illustrations are not included in the reading sample
3.2 Game Progress of the Original RPG Battle
The original RPG progresses as shown in Figure 3.1. Each game phase is described below.
• TURN START: Each turn begins.
• ACTION ORDER: After each turn starts, the order of actions is determined for all surviving characters in the game. The action order for each character is determined by the descending order of the product of each character’s SPD value and a random number.
• POP CHARACTER: The character to act is selected and stored in the game variable now_character.
• TURN NOW: The acting character (now_character) selects an action from their action set and, if the action requires target selection, selects a target. All actions are guaranteed hits except for constraints based on MP.
If all ally characters or all enemy characters’ HP falls to 0 or below during the "TURN NOW" phase, the game phase transitions to "GAME END" even during "TURN NOW".
• TURN END: Once all characters have finished their actions, the game phase returns to "TURN START" to proceed to the next turn.
• GAME END: The game ends, and all environmental variables are reset to their initial values.
Illustrations are not included in the reading sample
Figure 3.1: Game Progress
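The ACTION ORDER rule above (surviving characters sorted by the descending product of SPD and a random number) can be sketched as follows; the dictionary keys `hp` and `spd` are illustrative names, not the thesis’s actual implementation:

```python
import random

def action_order(characters, seed=None):
    """Decide a turn's action order: keep only surviving characters
    (HP > 0) and sort them by SPD * uniform random, descending."""
    rng = random.Random(seed)
    alive = [c for c in characters if c["hp"] > 0]
    # sorted() evaluates the key once per character, so each character
    # draws exactly one random number for the turn
    return sorted(alive, key=lambda c: c["spd"] * rng.random(), reverse=True)
```

An incapacitated character (HP 0 or below) is simply excluded from the ordering, matching the game-progress description above.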
3.3 Game Situations
In the original game used in this research, we trained the game AI in three game situations so that it could learn the characteristics of human players’ actions in various scenarios, rather than adapting to only a single game situation. By changing the appearing enemy characters, we set up the following three situations.
• Situation 1: A configuration where a single enemy is not particularly strong, but five enemy characters appear.
• Situation 2: A configuration where three enemy characters appear, and two of them support the remaining one.
Illustrations are not included in the reading sample
Figure 3.2: Situation 1*
Illustrations are not included in the reading sample
Figure 3.3: Situation 2*
• Situation 3: A configuration with a single enemy character whose status is set higher than the other characters’, making it impossible to win with random actions.
Illustrations are not included in the reading sample
Figure 3.4: Situation 3*
(*: Designed by Freepik and distributed by Flaticon (https://www.flaticon.com/))
3.4 Environmental States
In the original RPG acting as the environment, the state of the environment is constantly updated as in-game characters, such as agents, perform actions. At this time, the environmental state consists of data including the following information.
• Status information of ally characters (Table 3.2)
• Status information of enemy characters (Table 3.2)
• The ratio of elapsed game turns to the maximum number of game turns.
Table 3.2: Status information of ally and enemy characters
Illustrations are not included in the reading sample
Chapter 4
Human-like Playstyle
Human-like playstyles are extremely complex and difficult to evaluate. Specifically, human-like playstyles refer to instances such as the following.
For example, suppose character A is known to have statuses specialized for attack, and a human team consists of (Attack-specialized character A, Recovery-specialized character B, Support-specialized character C).
• Instance 1: A playstyle where support-specialized character C performs actions that increase character A’s attack power, considering it more efficient to deal damage to enemy characters. Characters B and C perform supportive actions for character A’s attack.
• Instance 2: A playstyle where all characters A, B, and C attempt to deal damage to the enemy to maximize damage output every turn.
• Instance 3: A playstyle where support character C takes actions to improve defense in preparation for enemy actions, and turns to an offensive orientation after establishing a secure stance.
Based on the above, a human-like playstyle is the way a human player progresses through the game. Since it involves the order of actions and target selection for chosen actions, we define a human-like playstyle as the sequence of action selections by a human player.
Chapter 5
Proposed Method
The purpose of this research is to train an agent to learn the human-like playstyle defined in Chapter 4 through imitation learning. Therefore, the proposed method is implemented based on an imitation learning framework inspired by GAIL [9], and training is conducted using data generated by the agent’s actions in the environment and expert data from human players. Models adopting the proposed method (the system shown in Section 5.1 (Figure 5.1)) are denoted "GAIL," and models trained through Behavior Cloning [7] are denoted "BC."
5.1 System Overview
An overview of the proposed system is shown in Figure 5.1.
Agent i receives the current environment state from the environment (Figure 5.1-"environment") and sequence data containing the other characters’ action data from agent i’s previous action up to the moment before, then selects an action. Subsequently, since the possible targets change based on the action’s attributes, a mask is applied to select a valid target. The pair of the action and target is treated as the action, and agent i performs it in the environment. Action data of character c in the environment includes a signal indicating whether character c is an ally or an enemy (side c), character c’s id (id c), the environment state just before the action (state), the chosen action (action), the chosen target (target id), the environment state after the action (next state), the reward from the environment (reward), and a signal indicating whether the game has ended (done). This combined data (Figure 5.1-"action data" (*12)) is added to the Action Memory (Figure 5.1-"Action Memory" (*13)) and the Trajectory Memory (Figure 5.1-"Trajectory Memory" (*14)), where it is used as sequence data received by other agents and as generated data for discriminator training. Action Memory stores data used when agents decide actions, and Trajectory Memory is used when the discriminator D and the agents undergo training.
Illustrations are not included in the reading sample
Figure 5.1: Overview of the Training Environment
( *1: State provided by the environment, *2: Sequence data of actions from the past up to the moment before, *3: Input provided to the action selection network, *4: Action selection network, *5: Mask controlling the input network depending on action attributes, *6: Input when using the target selection network via mask *5, *7: Network for selecting enemy characters, *8: Network for selecting ally characters, *9, *10: Masks restricting selection to only possible characters, *11: Pair data of selected action and target, *12: Data including environment state, character’s action, and action result, *13: Memory storing action data provided to the agent, *14: Memory storing action data used during training, *15: Memory storing expert data from human players, *16: Reward given to the agent )
By simulating a certain number of episodes, the agent generates action data. After the agent performs actions in the environment for a certain number of episodes, the discriminator D (Figure 5.1-"Discriminator") learns to distinguish whether input data is fake using generated data in the Trajectory Memory and expert data. At this time, the discriminator D outputs the probability that the input data is fake.
Meanwhile, the agent learns by taking −log of the discriminator D’s output as a reward when generated data is provided as input. After training, the generated data used during training is discarded.
5.2 Environment: Original RPG Battle
The environment assumed in this study is the original RPG described in Chapter 3. While enemy characters change depending on the game situation, ally characters are fixed to four: "Warrior," "Priest," "Mage," and "Entertainer." These four ally characters have different action sets and different statuses. A feature of this game is that agents do not decide their actions all at once at the start of a turn, but rather receive the current game situation s when it is their turn to act, select an action, and then select a target for that action. During training, we changed the three situations shown in Section 3.3 for each episode.
5.3 Sequence Data
In the following sections, "sequence data" refers to a sequence of concatenated tuples (s, a, t) consisting of state s, action a, and target t. Sequence data includes the past actions of the other characters, including enemy characters. Human players make action choices by reading the changes in the environmental state caused by other characters’ actions; for this reason, sequence data contains the human player’s playstyle. Even if the environment provides the same state, the chosen actions differ depending on past actions. Therefore, to train human-like playstyles, we provide the agent with the environment state and sequence data containing the other characters’ actions since the agent’s previous action. The sequence data provided to the discriminator includes sequences from the agent’s actions and sequences from the expert data, and the discriminator learns to distinguish between expert data and generated data.
Data Included in Sequence Data
Sequence data includes pair data (s, a, t) concatenating the following state s, action a, and target t as elements.
• State s: The state stores all character information appearing in the game shown in Section 3.4 and values representing game turns.
• Action a: There are 23 types of actions in the game used in this study. While each character has its own action set, a chosen action is represented as a tensor whose length equals the total number of actions in the game: if the i-th action is selected, only the i-th element is 1 and all other elements are 0.
• Target t: Similar to actions, targets are represented as a tensor whose length corresponds to the possible target selections, combining all character numbers with the group cases (no selection: -1, all allies: 0, all enemies: 1). The chosen target id is one-hot encoded.
• Concatenated Data: The above state s, action a, and target t are concatenated as shown in Figure 5.2 and treated as one element of sequence data.
Illustrations are not included in the reading sample
Figure 5.2: Data Provided as Elements of Sequence Data
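A sketch of how one element of the sequence data might be assembled. The exact layout and the number of target ids are assumptions for illustration; the thesis fixes only the 23 actions and the target cases (character ids plus -1/0/1):

```python
import numpy as np

N_ACTIONS = 23      # all actions in the game (Section 5.3)
N_TARGETS = 9 + 3   # illustrative: 9 character ids plus {-1, 0, 1} group targets

def one_hot(index, length):
    """One-hot tensor: only the index-th element is 1, others are 0."""
    v = np.zeros(length)
    v[index] = 1.0
    return v

def sequence_element(state, action_id, target_index):
    """Concatenate state s, one-hot action a, and one-hot target t
    into one element (s, a, t) of the sequence data."""
    return np.concatenate([state,
                           one_hot(action_id, N_ACTIONS),
                           one_hot(target_index, N_TARGETS)])
```

Each such element is one time step of the sequences consumed by the discriminator and the agent networks described below.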
Since the sequence data provided to discriminator D and the imitation learning agent differs slightly, they will be discussed in detail in Section 5.5 and Section 5.6, respectively.
5.4 Expert Data
Expert data sequentially records the action data of ally and enemy characters, together with the string "TURN START" representing the start of a turn and "GAME END" representing the end of the game. This specification makes it possible to recover character action orders and similar information. Character action data is a tuple consisting of the character’s side (enemy or ally), character ID, state just before the action, action, target, next state, reward, and "done" representing game completion.
5.5 Discriminator D
Discriminator D receives sequence data generated by agents and sequence data from experts as inputs, and outputs the probability that the given data is generated data.
5.5.1 Sequence Data Provided to Discriminator D
The sequence data given to discriminator D is the sequence data described in Section 5.3. In the case of sequence data corresponding to agent i, it includes the action data (pairs of state, action, and target) for all characters that acted between agent i’s action in turn (k − 1) and agent i’s action in turn k. Since the size of each sequence varies, the sequences are zero-padded to the maximum sequence length, as shown in Figure 5.3-"pre process," to ensure a consistent input size.
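The zero-padding step can be sketched as follows (a minimal version of the "pre process" in Figure 5.3, assuming each sequence element is already a fixed-width feature vector):

```python
import numpy as np

def pad_sequences(seqs, feat_dim):
    """Zero-pad variable-length sequences to the longest length so the
    discriminator always receives a fixed-size (batch, time, feature)
    input."""
    max_len = max(len(s) for s in seqs)
    out = np.zeros((len(seqs), max_len, feat_dim))
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s  # real data first, zeros fill the tail
    return out
```

Padding with zeros keeps short sequences valid LSTM inputs without altering the action data they contain.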
5.5.2 Discriminator D Network Architecture
Discriminator D distinguishes whether sequence data is fake. Since the data handled is sequential, LSTM was adopted as the network architecture. The network is a multi-layer LSTM combining two LSTM layers. Each LSTM layer has one hidden layer consisting of 128 units, and the output layer is combined with a linear layer. The raw output of discriminator D lies in (−∞, ∞), and a sigmoid function is applied to map it to the range [0, 1]. This value in [0, 1] is used as the probability Prob_fake that the input data is fake. The discriminator D network, data preprocessing, and network output are shown in Figure 5.3.
Illustrations are not included in the reading sample
Figure 5.3: Data Preprocessing and Output for Discriminator D
5.5.3 Learning of Discriminator D
Discriminator D learns to output the probability that given sequence data is fake. Thus, discriminator D is trained to minimize the loss function given in Equation 5.1, outputting 1 for generated data and 0 for expert data.
L D = - E [log(D (generator data))] - E [log(1 - D (expert data))] (5.1)
Data generated by agents is stored in Trajectory Memory (Figure 5.1-Trajectory Memory (*14)). All of this data is used during discriminator training, and Trajectory Memory is cleared afterwards.
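Equation 5.1 can be written directly in numpy; the `eps` term is a numerical-stability addition not present in the thesis's formula:

```python
import numpy as np

def discriminator_loss(d_gen, d_exp):
    """Equation 5.1: L_D = -E[log D(generated)] - E[log(1 - D(expert))],
    where D outputs the probability that its input is generated (fake).
    d_gen: D's outputs on generated sequences; d_exp: on expert sequences."""
    eps = 1e-12  # numerical safety only
    return -np.mean(np.log(d_gen + eps)) - np.mean(np.log(1.0 - d_exp + eps))
```

Minimizing L_D pushes D toward 1 on generated data and 0 on expert data, exactly as described above.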
5.6 Imitation Learning Agent
The imitation learning agent in this study acts as a generator of actions. It receives sequence data consisting of the preceding characters’ actions together with the current environment state, and probabilistically selects an action. In this environment, agents are trained individually, and sequence data is generated from their respective action data. Each agent learns to make discriminator D incorrectly judge its generated sequence data as authentic.
5.6.1 Sequence Data for Imitation Learning Agents
As described in Section 5.3, the sequence data for agent i is the action sequence generated by the other characters’ actions between agent i’s previous action and the moment before its current action. Agent i receives the concatenation of this sequence data and the environmental state s_t as input. Since the neural network for the agent’s policy is designed using LSTM, like discriminator D, the immediate state s_t cannot simply be concatenated to the sequence data. Therefore, the yet-to-be-determined action and target are represented by padded-action and padded-target filled with zeros, used in place of the one-hot action and target. The data concatenated with state s_t (padded-data) is then appended to the sequence data received from Action Memory.
5.6.2 Action Decision Process of the Agent
The agent selects an action (action, target) according to the following process.
1 Acquisition of current state s t Receive current state s t from the environment.
2 Acquisition of past action sequence data Receive sequence data from Action Memory consisting of other characters’ (state, action, tar- get_id) pairs between its own previous action and the current moment. Figure 5.4 illustrates an example for agent 20.
Illustrations are not included in the reading sample
Figure 5.4: Sequence data agent 20 receives from Action Memory
When agent 20 acts in turn k in Figure 5.4, the sequence data received from Action Memory resembles "agent20 received sequence data" in the figure. This is sequence data containing the other characters’ action data from agent 20’s action in turn (k − 1) until agent 20’s action in turn k. Thus, the sequence data agent 20 receives in turn k starts from its own action data (s_0, a_0, t_0) in turn (k − 1) and includes the "enemy" action data (s_3, a_3, t_3) between agent 20 and agent 40.
3 Action selection For action selection, the received state s t is concatenated with padded-action and padded-target as shown in Figure 5.5 to create padded-data. Padded-action and padded-target are 1D tensor data filled with zeros:
padded-action = [0] * [length of all actions]
padded-target = [0] * [length of all target_id]
Concatenated padded-action, padded-target, and state s t are treated as padded-data:
padded-data = (s t, padded-action, padded-target)
Subsequently, padded-data is added to the end of sequence data from step 2, and this is used as input for the agent’s action net.
Illustrations are not included in the reading sample
Figure 5.5: Sequence data given as input to action net
4 Target selection After action selection, target1 net or target2 net may be used depending on the action’s attributes. If neither is used, the target is automatically selected based on the attributes. Here, we describe the input data when using target1 net (or target2 net). In addition to state s_t, since the action has been fixed by action net, the action is one-hot encoded to produce action-one-hot. Then, as in step 3, only the target is represented as padded-target; the concatenation of state s_t, action-one-hot, and padded-target is appended to the sequence data from step 2, which is then used as input to target1 net (or target2 net).
5.6.3 Network Architecture of Imitation Learning Agents
The agent possesses three neural networks: action net, target1 net, and target2 net, as shown in Table 5.1.
Table 5.1: Three Neural Networks Possessed by the Agent
Illustrations are not included in the reading sample
Each network is a multi-layer LSTM combining three LSTM layers. Each layer consists of one hidden layer with 128 units, combined with a linear output layer. The imitation learning agent learns to make discriminator D perceive its generated data as authentic. Output dimensions are shown in Table 5.2.
Table 5.2: Output Dimensions of Agent’s Networks
Illustrations are not included in the reading sample
The raw outputs range over (−∞, ∞), and applying a softmax function outside the network maps the elements to [0, 1]. The softmax output represents a probability distribution, and agents select actions probabilistically. Target selection is determined in three cases by masks based on action attributes:
• Target selection by target1 net: Select ally character id as target_id.
• Target selection by target2 net: Select enemy character id as target_id.
• Automatic target selection: target_id is automatically determined among three options (-1: no selection, 0: all allies, 1: all enemies) based on attributes.
When using target1 net or target2 net, only surviving characters can be selected. Therefore, masks (mask t1 and mask t2 in Figure 5.1) are applied to set the output values of unselectable characters to −∞, resulting in zero probability after softmax.
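The masking trick (setting unselectable outputs to −∞ so that softmax assigns them exactly zero probability) can be sketched as:

```python
import numpy as np

def masked_target_probs(logits, alive_mask):
    """Replace the logits of unselectable (defeated) characters with -inf,
    then apply softmax; exp(-inf) = 0, so masked characters get
    probability exactly 0."""
    masked = np.where(alive_mask, logits, -np.inf)
    z = masked - masked.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Sampling a target from this distribution can never pick a defeated character, matching the behavior of mask t1 and mask t2.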
5.6.4 Learning of Agents
Imitation learning agents undergo GAIL-based training in addition to BC-based supervised learning. In GAIL training, agents learn to generate actions such that discriminator D identifies the generated data as authentic.
Learning by BC
BC is simple supervised learning and does not involve simulations. A cross-entropy loss L_π (Equation 5.2) is used to predict actions for given sequence data and learn the expert policy. Agents update the parameters θ of policy π_θ to minimize Equation 5.2.
L_π = − E_{(s, a) ∼ expert data} [ log π_θ(a | s) ] (5.2)
Learning by GAIL
In GAIL, the agent is trained as a generator simultaneously with discriminator training. As described in Section 5.5, the agents are reinforcement learning agents trained via a policy gradient method; REINFORCE [12] was adopted as the algorithm. Agents learn to bring their policy closer to the expert policy π* using the negative natural logarithm of discriminator D’s output as a reward. Two rewards were set:
1. Reward from discriminator D output: Since discriminator D learns to output 1 for fake data and 0 for expert data, agents learn to make D misidentify generated data as expert data by receiving Equation 5.3 as reward.
Reward D = - log(D (generator data)) (5.3)
2. Reward for human-like mistake suppression: Equation 5.4 was set to suppress misplays. What constitutes a human-like mistake can be defined freely. When the agent acts in the environment, a signal representing human-like mistakes is returned and used to set Reward_miss. The reward value is a negative constant −NP (NegativePoint) when a mistake is made, and 0 otherwise.
Reward_miss = −NP (if a human-like mistake is made), 0 (otherwise) (5.4)
The sum of Reward_D and Reward_miss is given to the agent, which learns its policy parameters to increase the expected return. Updates for the target selection networks use the probability values before the masks are applied.
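A sketch of the combined reward; the value of NP is a design choice, so 1.0 here is purely illustrative:

```python
import numpy as np

# NP (NegativePoint) is the penalty magnitude for a human-like mistake;
# 1.0 is an illustrative value, not the one used in the experiments.
NP = 1.0

def agent_reward(d_out, made_mistake):
    """Equation 5.3 plus Equation 5.4: Reward_D = -log D(generated data),
    with an extra -NP penalty when the environment signals a
    human-like mistake."""
    reward_d = -np.log(d_out)
    reward_miss = -NP if made_mistake else 0.0
    return reward_d + reward_miss
```

When D is fully fooled (D → 1 on generated data), Reward_D approaches 0 from above for any mistake-free action, so only the mistake penalty remains to shape behavior.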
Chapter 6
Experiment
6.1 Expert Data Collection
To collect expert data, ten human players played the game in each of the three situations. After sufficient practice to establish their own strategies, their gameplay data was acquired. The number of data points collected per situation is shown in Table 6.1.
Table 6.1: Expert Data Counts per Situation
Illustrations are not included in the reading sample
All situation data was used for training.
6.2 Reward Design for Human-like Mistakes
Human-like mistakes in this study refer to "wasted action turns because actions cannot be performed due to MP limits" from Chapter 3. While other mistakes, such as over-buffing, are possible, these may be intentional (e.g., re-applying a buff to extend its duration) and thus should not be treated as mistakes. Mistakes due to MP limits, however, are unlikely to be intentional.
6.3 Models and Hyperparameters used in Experiments
Agents were trained using three training methods:
• Agent trained only with BC (BC)
• Agent trained only with GAIL (GAIL)
• Agent trained with GAIL after pre-training with BC (BC+GAIL)
Hyperparameters are shown in Tables 6.2, 6.3, and 6.4.
Table 6.2: Hyperparameters for BC
Illustrations are not included in the reading sample
Table 6.3: Hyperparameters for GAIL
Illustrations are not included in the reading sample
Table 6.4: Hyperparameters for BC+GAIL
Illustrations are not included in the reading sample
(Hyperparameters for BC pre-training were the same as Table 6.2)
6.4 Classification of Human Player Playstyles by Clustering
Since we define human-like playstyles through action sequences and target selection, the expert sequence data contains individual playstyles. Treating this data as time-series data, we perform time-series clustering to quantitatively classify playstyles, and we use the resulting classifier to judge whether agents generate human-like playstyles.
6.4.1 Evaluation Method
We defined Fit Score to quantitatively evaluate playstyles via clustering and checked its validity using expert data.
Fit Score
To calculate the Fit Score:
1. Expert data clustering: Expert data was split 4:1 into training and test sets. The training data was used for time-series clustering to train a classifier, which was then evaluated on the test data.
2. Predict cluster membership: Predict which cluster each data point belongs to using the expert-trained classifier.
3. Determine cluster membership: Determine whether a data point belongs to the generated clusters by setting a threshold based on the maximum distance between each cluster centroid and its own data points.
4. Calculate Fit Score: The score is the ratio of data points judged as "belonging" in step 3 (Equation 6.1).
Fit Score = (Number of points judged as "belonging") / (Total number of data points)  (6.1)
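Steps 2 through 4 can be sketched as follows, assuming the cluster centroids and per-cluster distance thresholds have already been obtained from the expert-data clustering in step 1. This is a hypothetical numpy illustration with Euclidean distance standing in for the time-series distance used in the thesis.

```python
import numpy as np

def fit_score(generated, centroids, thresholds):
    """Fraction of generated sequences judged to 'belong' to an expert cluster.
    centroids: (k, d) cluster centroids from expert-data clustering.
    thresholds: (k,) max distance between each centroid and its own expert
    training points (the 'belonging' cutoff from step 3)."""
    dists = np.linalg.norm(generated[:, None, :] - centroids[None], axis=2)
    nearest = dists.argmin(axis=1)  # step 2: predicted cluster membership
    belongs = dists[np.arange(len(generated)), nearest] <= thresholds[nearest]
    return belongs.mean()           # step 4: Equation 6.1

# Toy example with two 2-d clusters.
cents = np.array([[0.0, 0.0], [10.0, 10.0]])
thresh = np.array([1.0, 1.0])
gen = np.array([[0.1, 0.2], [9.8, 10.1], [5.0, 5.0]])  # last one is an outlier
score = fit_score(gen, cents, thresh)  # -> 2/3
```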
6.4.2 Results of Time-Series Clustering on Expert Training Data
Optimal cluster numbers were determined via the elbow method: 3 for situation 1, and 4 for situations 2 and 3.
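The elbow heuristic can be illustrated as follows on toy data: within-cluster inertia is computed for increasing k, and the point where the decrease flattens suggests the cluster count. A hand-rolled Lloyd's k-means (numpy only) stands in here for the time-series clustering used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans_inertia(X, k, iters=30, restarts=5):
    """Best within-cluster sum of squared distances over several
    random-init runs of Lloyd's k-means."""
    best = np.inf
    for _ in range(restarts):
        cents = X[rng.choice(len(X), k, replace=False)].copy()
        for _ in range(iters):
            d = np.linalg.norm(X[:, None] - cents[None], axis=2)
            labels = d.argmin(axis=1)
            for j in range(k):
                if (labels == j).any():
                    cents[j] = X[labels == j].mean(axis=0)
        d = np.linalg.norm(X[:, None] - cents[None], axis=2)
        best = min(best, float((d.min(axis=1) ** 2).sum()))
    return best

# Three well-separated toy clusters: inertia drops sharply up to k=3,
# then flattens -- the 'elbow' at k=3 suggests three clusters.
X = np.concatenate([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)])
inertias = [kmeans_inertia(X, k) for k in (1, 2, 3, 4)]
```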
6.4.3 Validity of Fit Score using Expert Test Data
Fit Scores for unused expert test data were 0.9 for situation 1, 0.971 for situation 2, and 1.0 for situation 3 (Table 6.5). Since Fit Scores exceeded 0.9 in all situations, we judged it valid as an evaluation metric.
Table 6.5: Fit Scores for Expert Test Data
Illustrations are not included in the reading sample
6.5 Quantitative Evaluation of Human-like Playstyles by Agents
Table 6.6 shows Fit Scores for generated data (500 episodes per model).
BC+GAIL showed the best results for situations 1 and 2, while BC was slightly better for situation 3.
Table 6.6: Fit Scores of Generated Data by Agents
Illustrations are not included in the reading sample
6.6 Human Subjective Evaluation
Nine human players participated in a survey after playing cooperatively with the models. They evaluated which model realized the most human-like playstyle in each game situation.
6.6.1 Best Models per Game Situation
Survey results showed that BC+GAIL was perceived as the most human-like model across all game situations (Figure 6.1).
Illustrations are not included in the reading sample
6.6.2 Implementation of Human-like Playstyles in Cooperative Play
Following results from Figure 6.1, respondents evaluated which characters behaved like humans when using BC+GAIL. Results are shown in Figure 6.2.
Illustrations are not included in the reading sample
6.6.3 Human-like Playstyles and Mistake Suppression
For BC+GAIL, mistake suppression involves a trade-off with purely human-like behavior. Survey results on whether characters made mistakes are shown in Figure 6.3.
Figure 6.3: Proportion of players who answered "no mistakes" for the best model
Chapter 7
Discussion
7.1 Quantitative Evaluation of Playstyles via Clustering
BC+GAIL performed best in both Fit Score and human evaluation in most cases. In Situation 3, BC scored slightly higher than BC+GAIL (by 0.002 points), indicating comparable performance. Fit Scores varied across situations: the situation with fewer data points (Situation 1) showed lower Fit Scores, and situations requiring fewer turns to clear also made it harder for players to establish a distinct playstyle.
7.2 Human-like Playstyles across Models
BC+GAIL was voted best by 89% of players in Situation 1, 56% in Situation 2, and 78% in Situation 3. GAIL was rated least human-like across all situations, likely because its discriminator training outpaced the generator training given the small dataset.
7.3 Human-likeness of Non-operated Characters
The "Warrior" was consistently rated as the most human-like character, likely due to its smaller action space allowing for better learning.
7.4 Human-like Playstyles and Mistake Suppression
BC+GAIL outperformed BC in "no mistake" ratings for "Priest" and "Entertainer," who have high-MP actions. Mistake suppression was found to be part of "human-likeness." However, for "Priest" and "Entertainer," negative reinforcement occasionally discouraged high-MP actions desired by human players, indicating areas where human-likeness couldn’t be fully realized.
Chapter 8
Conclusion
This study proposed a game AI that adopts human-like playstyles in multiplayer RPG battles and evaluated it through playstyle classification and human-likeness assessments. Using sequence data including past actions allowed for predicting and imitating playstyles. Future work involves increasing data for more stable training and using longer sequence lengths or additional features to capture deeper player intent. Applying feature extractors to generalize across situations is also planned.
References
[1] Y. Zhou and W. Li, “Discovering of Game AIs’ Characters Using a Neural Network based AI Imitator for AI Clustering,” 2020 IEEE Conference on Games (CoG), Osaka, Japan, 2020, pp. 198-205
[2] R. F. Julien Dossa, X. Lian, H. Nomoto, T. Matsubara and K. Uehara, “A Human-Like Agent Based on a Hybrid of Reinforcement and Imitation Learning,” 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 2019, pp. 1-8
[3] Pan, CF., Min, XY., Zhang, HR. et al. Behavior imitation of individual board game players. Appl Intell 53, 11571-11585 (2023).
[4] H. Bando, S. Ikeda, C. Hsueh, "Teammate AI in cooperative games letting human players play active roles," IPSJ SIG GI, 2023-03-17
[5] S. Zhao et al., “Player Behavior Modeling for Enhancing Role-Playing Game Engagement,” in IEEE Transactions on Computational Social Systems, vol. 8, no. 2, pp. 464-474, April 2021
[6] Hussein, Ahmed, et al. “Imitation learning: A survey of learning methods.” ACM Computing Surveys (CSUR) 50.2 (2017): 1-35.
[7] Bain, Michael, and Claude Sammut. “A Framework for Behavioural Cloning.” Machine Intelligence 15. 1995.
[8] Torabi, Faraz, Garrett Warnell, and Peter Stone. “Behavioral cloning from observation.” arXiv preprint arXiv:1805.01954 (2018).
[9] Ho, Jonathan, and Stefano Ermon. “Generative adversarial imitation learning.” Advances in neural information processing systems 29 (2016).
[10] Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8, 229-256 (1992)
[11] OpenAI Spinning Up, "Algorithms Docs: Vanilla Policy Gradient", https://spinningup.openai.com/en/latest/algorithms/vpg.html#vanilla-policy-gradient
[12] Y. Saito, "Deep Learning from Scratch 4: Reinforcement Learning," O’Reilly Japan, 2022.
[...]
- Masatoshi Fujiyama (Author), 2024, Proposal of Game-AI that Takes Human-Like Play Styles in Multiplayer RPGs, Munich, GRIN Verlag, https://www.grin.com/document/1690581