Multi-agent systems built from large language model (LLM) agents can improve reasoning through discussion and consensus, but they also risk "herding" effects, where homogeneous behavior reinforces shared mistakes, and they often waste tokens on redundant utterances after agreement is effectively reached. This study quantitatively evaluates how explicitly assigning personality traits to LLM agents influences both performance and efficiency in consensus-oriented collaboration. We instantiate three agents with Big Five–based personality profiles, encoded as high/low settings on Extraversion, Conscientiousness, Agreeableness, Openness, and Neuroticism, and compare three team configurations: No Persona (no explicit traits), Same Persona (all agents share one profile), and Different Personas (agents have distinct profiles). Because persona prompting does not guarantee faithful behavioral realization, we introduce a pre-validation step using a Big Five Inventory (BFI) test and iteratively adjust prompts until the intended trait patterns are reflected. Agents then solve multiple-choice questions using a unified, fixed workflow: initial independent answers, three rounds of debate, and majority voting for the final decision. Experiments on MMLU repeat 50 runs per condition (50 questions per run) and evaluate accuracy, token consumption, and answer-change rates that capture beneficial corrections versus harmful shifts. Results show that Different Personas achieves the highest mean accuracy (0.699) and tends to raise the correct-change rate while lowering the incorrect-change rate, suggesting more effective error correction and robustness against misleading arguments. Persona assignment also improves debate efficiency: token usage is highest without personas and lower with persona prompting, although the most accurate heterogeneous configuration does not minimize tokens, indicating a performance–efficiency trade-off.
Overall, the findings support treating personality composition as a controllable design variable for collaborative LLM agents.
Contents
Chapter 1 Introduction
1.1 Research Background
1.2 Research Questions
1.3 Research Objective
1.4 Contributions
1.5 Thesis Organization
Chapter 2 Related Work
2.1 Personality Prompting for LLMs
2.2 Multi-LLM-Agent Collaboration
2.3 Personality Effects in Human Teams
2.4 Positioning of This Study
Chapter 3 Proposed Method
3.1 Overview
3.2 Definition of Big Five-Based Personality Profiles
3.2.1 Prompt Realization
3.3 BFI-Based Validation of Personality Reflection
3.4 Consensus-Oriented Multi-Turn Debate
3.4.1 Unified Input/Output Format
3.4.2 Procedure (Per Question)
3.5 Personality Composition Conditions
3.5.1 Profiles Used in This Study
Chapter 4 Experimental Setup
4.1 LLM and Inference Settings
4.2 Dataset: MMLU
4.3 Agents and Debate Rounds
4.4 Evaluation Metrics
4.4.1 Accuracy
4.4.2 Average Token Consumption
4.4.3 Change Rates for Consensus Quality
4.5 Number of Runs and Statistical Treatment
Chapter 5 Experimental Results
5.1 Personality Reflection (BFI)
5.2 Accuracy
5.3 Utterance Token Consumption
5.4 Answer Change Rates (Supplementary Indicator)
Chapter 6 Analysis and Discussion
6.1 Why Might Different Personas Improve Accuracy?
6.2 Interpretation of Limited Improvement in Same Persona
6.3 Token Consumption and Debate Efficiency
6.4 Limitations
6.5 Future Directions
Chapter 7 Conclusion
7.1 Summary of Findings
7.2 Limitations and Future Work
Acknowledgments
Abstract
In recent years, research on systems in which multiple LLM agents perform tasks through discussion and collaboration has attracted increasing attention. In multi-agent LLM debates, homogeneous behavior can reinforce errors, and redundant utterances may continue even after consensus, resulting in token waste. This study aims to quantitatively evaluate how agent configurations with explicitly assigned personality traits affect the performance and efficiency of consensus-oriented tasks. Specifically, we assigned Big Five-based profiles to three agents and compared three configurations: no personality assignment, homogeneous personality assignment, and heterogeneous personality assignment. Before running the task, we validated personality reflection through a BFI test and then conducted multi-turn debates under a unified workflow. In the experiments, we used Massive Multitask Language Understanding (MMLU), conducted multiple runs with 50 questions per run, and measured accuracy, token consumption, correct-change rate, and incorrect-change rate. The results showed that the heterogeneous personality configuration achieved the highest accuracy and tended to suppress unnecessary utterances. On the other hand, the effect depended on the distribution of initial answers and task characteristics, and improvements may be limited under homogeneous personality configurations. The experiments and results in this study were obtained as part of research assistant work at the National Institute of Advanced Industrial Science and Technology (AIST) [5].
List of Tables
3.1 Personality profiles used in this study
4.1 LLM settings
5.1 Accuracy over 50 runs
5.2 Token consumption in debate
List of Figures
3.1 Task execution and debate flow
3.2 Example of BFI test results for each agent
Chapter 1 Introduction
1.1 Research Background
As LLM capabilities have improved, multi-agent collaboration, in which multiple LLM agents exchange information while performing reasoning and decision-making, has received increasing attention instead of relying on a single model. This framework is expected to provide several benefits: (i) broader exploration through role division, (ii) error detection through mutual critique, and (iii) output stabilization through consensus formation. However, when agents behave homogeneously, discussions can converge in one direction, creating the risk that an incorrect answer is reinforced and fixed by the majority. In addition, utterances can diverge even after conclusions have already aligned, causing unnecessary token consumption.
In human group decision-making, team outcomes have been reported to be related to personality composition (e.g., openness, agreeableness, and extraversion). Prior work has shown findings such as positive correlations with openness, negative effects from variance in agreeableness, and positive effects when about half of members are highly extraverted [7]. Based on this background, there is reason to expect that collaborative processes among LLM agents may also be improved by explicitly designing diversity along a personality axis.
1.2 Research Questions
This study addresses three challenges. First is the risk of error fixation, in which homogeneous agents reinforce each other’s mistakes and the debate does not move toward correction. Second is debate inefficiency, where redundant utterances continue after consensus has effectively been reached, wasting tokens. Third, even if personality is provided via prompts, it remains unclear to what extent personality is reflected in generated outputs, so a verification method is needed to assess whether personality assignment was successful.
1.3 Research Objective
The objective of this study is to quantitatively clarify how explicitly assigning Big Five-based personality traits to LLM agents changes performance and efficiency in consensus-oriented tasks. Specifically, we compare three configurations: no personality assignment, homogeneous assignment for all agents, and heterogeneous assignment for all agents, and evaluate accuracy, token consumption, and answer-change behavior. We test the following hypotheses: (1) teams with heterogeneous personalities achieve higher accuracy than homogeneous and no-personality teams, (2) heterogeneous personality composition suppresses unnecessary utterances and reduces token consumption, and (3) homogeneous personality assignment yields limited improvements because of insufficient diversity.
1.4 Contributions
The contributions of this study are as follows.
1. We constructed a multi-LLM-agent collaboration framework using Big Five-based personality profiles (discretized as high/low values).
2. We introduced a pre-validation procedure using the BFI test to confirm whether personality assignment is reflected as intended.
3. Using MMLU, we compared how the presence of personality assignment, homogeneity, and heterogeneity affect accuracy, token consumption, and answer-change rates.
1.5 Thesis Organization
This thesis consists of Chapters 1 through 7. Chapter 2 summarizes related work, and Chapter 3 describes the proposed method, including personality assignment and debate flow. Chapter 4 details the experimental setup and evaluation metrics, and Chapter 5 presents the results. Chapter 6 discusses potential factors behind the observed effects and study limitations. Chapter 7 concludes the thesis and outlines future directions.
Chapter 2 Related Work
2.1 Personality Prompting for LLMs
Research has investigated assigning personality traits such as the Big Five to LLMs so that generated text exhibits a consistent persona. For example, PersonaLLM showed that when Big Five traits are assigned, outputs in story generation and personality-test responses tend to align with the assigned traits, suggesting that LLMs can imitate personality expression to some extent [4]. However, validating whether personality assignment has actually succeeded is not straightforward. A response that merely sounds personality-like is insufficient to explain changes in debate dynamics during collaborative tasks. Therefore, this study conducts a BFI test before task execution and confirms whether score distributions on the five personality axes (E, C, A, O, N) are consistent with the target profile.
2.2 Multi-LLM-Agent Collaboration
Frameworks in which multiple LLMs exchange opinions and converge to better solutions through debate and reflection have been proposed. ReConcile is an approach that coordinates diverse LLMs in a round-table style and improves reasoning by using consensus formation [1]. In addition, from a social-psychological perspective, prior work analyzed how agent personality (e.g., tolerance and overconfidence) and behavioral patterns (debate and reflection) affect collaborative mechanisms, showing that debate can substantially affect accuracy in tasks with a single correct answer [8]. While these studies demonstrate that collaboration can improve reasoning, systematic comparisons that design team composition using a general personality model such as the Big Five remain limited.
2.3 Personality Effects in Human Teams
In human teams, outcomes are influenced not only by individual ability but also by personality composition. Reported findings include positive relationships between openness and team outcomes, negative effects from variation in agreeableness, and positive effects when around half of team members are highly extraverted [3, 7]. Rather than directly transferring these findings, this study adopts the idea of treating personality composition as a design variable and examines its effect in LLM-agent consensus tasks.
2.4 Positioning of This Study
Based on the above, this study aims to fill the following gaps.
• It integrates explicit Big Five-based personality assignment into multi-agent collaboration for consensus formation.
• It verifies personality reflection using BFI scores, enabling discussion of links between personality realization and collaborative performance.
• It compares presence/absence, homogeneity, and heterogeneity of personality assignment, evaluating not only accuracy but also debate efficiency (tokens) and correction behavior (change rates).
Chapter 3 Proposed Method
3.1 Overview
The proposed method consists of (1) personality-trait design and assignment, (2) reflection validation with BFI, (3) a multi-turn debate flow with consensus formation, and (4) comparison across personality composition conditions (No Persona/Same Persona/Different Personas). Figure 3.1 shows the overall task execution and debate flow.
Figure 3.1: Task execution and debate flow
Source: Created by the author based on the workflow design in this study.
3.2 Definition of Big Five-Based Personality Profiles
The Big Five is a psychological model that describes personality along five axes: Extraversion (E), Conscientiousness (C), Agreeableness (A), Openness (O), and Neuroticism (N). In this study, each axis is discretized into high (+) or low (-), and an agent personality is represented as a five-character symbol string (e.g., ++++-). The order of symbols is fixed as E, C, A, O, N.
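As a minimal illustration of this encoding, a profile string can be decoded into named trait levels. The helper below is a sketch; `parse_profile` and `TRAITS` are illustrative names, not part of the thesis implementation.

```python
# Trait order is fixed as E, C, A, O, N; '+' means high, '-' means low.
TRAITS = ["Extraversion", "Conscientiousness", "Agreeableness",
          "Openness", "Neuroticism"]

def parse_profile(symbol: str) -> dict:
    """Map a profile string such as '++++-' to per-trait high/low levels."""
    if len(symbol) != 5 or any(c not in "+-" for c in symbol):
        raise ValueError(f"invalid profile string: {symbol!r}")
    return {trait: ("high" if c == "+" else "low")
            for trait, c in zip(TRAITS, symbol)}
```

For example, `parse_profile("++++-")` yields high levels on the first four axes and a low level on Neuroticism.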
3.2.1 Prompt Realization
For each agent, the system prompt explicitly specifies high/low settings on each axis and instructs the model to answer based on that personality.
You have the following personality traits: high Extraversion (E), high Conscientiousness (C), high Agreeableness (A), high Openness (O), and low Neuroticism (N). Generate your response based on these traits.
In implementation, personality traits are designed to naturally affect speaking style, such as confidence vs. caution, willingness to align with others, and tendency to provide counterarguments. However, excessively strong personality expression may cause stubbornness in wrong directions or token waste. To mitigate this, we fixed answer formats and constrained per-turn output length.
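A hedged sketch of how the system-prompt sentence shown above might be assembled from a decoded profile; the function name and the dictionary format are assumptions for illustration, not the thesis code.

```python
def build_persona_prompt(profile: dict) -> str:
    """Render the persona system prompt from a trait dictionary,
    e.g. {"Extraversion": "high", ..., "Neuroticism": "low"}."""
    abbrev = {"Extraversion": "E", "Conscientiousness": "C",
              "Agreeableness": "A", "Openness": "O", "Neuroticism": "N"}
    parts = [f"{level} {trait} ({abbrev[trait]})"
             for trait, level in profile.items()]
    return ("You have the following personality traits: "
            + ", ".join(parts[:-1]) + ", and " + parts[-1]
            + ". Generate your response based on these traits.")
```

Applied to the ++++- profile, this reproduces the example prompt given above.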
3.3 BFI-Based Validation of Personality Reflection
Even when personality is provided in prompts, the degree of reflection in generated text is not guaranteed. Therefore, before task execution, we conducted a BFI test and computed scores (1-50) for the five personality axes to confirm whether intended high/low settings were reproduced. If the discrepancy was large, we adjusted personality prompts, regenerated outputs, and reevaluated with BFI. This procedure established a controlled setting in which collaborative tasks were evaluated using agents whose personalities were adequately reflected.
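The validation loop can be sketched as follows. The agent interface (`run_bfi_test`, `strengthen_prompt`), the high/low cut-off of 25 on the 1-50 scale, and the retry limit are all assumptions for illustration; the thesis does not specify these details.

```python
def validate_persona(agent, profile, threshold=25, max_iters=3):
    """Iteratively check BFI scores (1-50 per axis) against the intended
    high/low pattern, adjusting the persona prompt on mismatch."""
    for _ in range(max_iters):
        scores = agent.run_bfi_test()  # e.g. {"Extraversion": 41, ...}
        mismatches = [t for t, level in profile.items()
                      if (scores[t] >= threshold) != (level == "high")]
        if not mismatches:
            return True                # intended pattern reproduced
        agent.strengthen_prompt(mismatches)  # adjust prompt and re-test
    return False
```

Only agents that pass this check proceed to the collaborative task, yielding the controlled setting described above.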
Figure 3.2: Example of BFI test results for each agent
Source: Created by the author from pre-experiment BFI evaluation outputs.
3.4 Consensus-Oriented Multi-Turn Debate
Our task setting requires agreement on a single final option. Accordingly, the goal of the debate is not only to explore new information but also to converge toward consensus through error correction and confidence adjustment.
3.4.1 Unified Input/Output Format
In each turn, every agent outputs the following JSON format (implemented as text output and parsed by regular expressions).
{"reasoning": "...", "answer": "A/B/C/D"}
When the output is unclear due to format issues, it is treated as [N/A] and excluded from the voting denominator, following the prior setting.
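A sketch of this parsing step, assuming answers are extracted first as JSON and otherwise by regular expression; the function name and the exact fallback pattern are illustrative, not taken from the thesis implementation.

```python
import json
import re

def parse_turn(raw: str) -> str:
    """Extract the answer letter from an agent utterance.
    Returns 'N/A' when no well-formed answer can be recovered,
    mirroring the exclusion rule described above."""
    try:
        obj = json.loads(raw)
        answer = str(obj.get("answer", "")).strip().upper()
    except (json.JSONDecodeError, AttributeError):
        # Fall back to a regular-expression scan of free-form text.
        m = re.search(r'"answer"\s*:\s*"([A-D])"', raw)
        answer = m.group(1) if m else ""
    return answer if answer in {"A", "B", "C", "D"} else "N/A"
```

Answers tagged 'N/A' are simply dropped from the voting denominator rather than counted as wrong.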
3.4.2 Procedure (Per Question)
The processing flow per question is as follows.
1. Initial answer generation: Each agent reads the question and options, then outputs reasoning and answer.
2. Debate rounds (R rounds): In each round, each agent refers to other agents’ outputs from the previous round (shared log) and revises its own answer when necessary.
3. Intermediate majority vote: At the end of each round, answers from all agents are aggregated to obtain a temporary majority solution, used as a convergence signal for the debate.
4. Final majority vote: After the last round, the team’s final answer is determined by majority vote.
To prevent prolonged debate and excessive token consumption, we constrained the maximum generated tokens per turn and fixed the number of rounds.
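The four steps above can be sketched as follows, assuming a hypothetical agent interface with `answer` and `revise` methods; tie-breaking and convergence-based early stopping are omitted for brevity.

```python
from collections import Counter

def run_debate(agents, question, rounds=3):
    """Per-question protocol: independent initial answers, R debate
    rounds over a shared log, then a final majority vote."""
    log = [a.answer(question) for a in agents]       # step 1: initial answers
    for _ in range(rounds):                          # step 2: debate rounds
        log = [a.revise(question, log) for a in agents]
        # step 3: intermediate majority vote, used in the full protocol
        # as a convergence signal for the debate
        interim = Counter(x for x in log if x != "N/A").most_common(1)
    votes = Counter(x for x in log if x != "N/A")    # step 4: final vote
    return votes.most_common(1)[0][0] if votes else "N/A"
```

With three agents answering A, A, B the majority vote returns A regardless of the dissenting agent.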
3.5 Personality Composition Conditions
To examine how personality presence, homogeneity, and heterogeneity affect collaboration, we compared the following three conditions.
• No Persona: Three agents without explicit personality assignment.
• Same Persona: Three agents with an identical profile (e.g., ++++-).
• Different Personas: Three agents with different profiles (e.g., ++++-, -++++, -+++-).
3.5.1 Profiles Used in This Study
We used the following profiles in the experiments (in the order E, C, A, O, N).
Table 3.1: Personality profiles used in this study (examples)
Source: Created by the author from the Big Five profile design used in this thesis.
Chapter 4 Experimental Setup
4.1 LLM and Inference Settings
The model used was Llama-3.1-8B-Instruct with 8-bit quantization (Q8_0) [6]. The main parameters are shown in Table 4.1.
Table 4.1: LLM settings
Source: Created by the author based on the experimental configuration and [6].
4.2 Dataset: MMLU
We used MMLU for evaluation [2]. MMLU consists of multiple-choice questions from 57 subjects, including humanities, social sciences, natural sciences, and common knowledge. Following prior settings, each run sampled 50 questions randomly from MMLU, and the agent team answered those 50 questions.
4.3 Agents and Debate Rounds
We used three agents and three debate rounds. This setting follows prior reports indicating a favorable balance between cost and accuracy, and was used to evaluate consensus effects under a 3-agent × 3-round protocol.
4.4 Evaluation Metrics
To evaluate the effects of personality assignment and composition differences, we used the following metrics.
4.4.1 Accuracy
The final answer determined by majority vote was compared against the ground-truth label.
4.4.2 Average Token Consumption
For each question, we measured total token usage across the initial answer and debate rounds, and then computed per-question and per-run averages. This metric indicates whether debate converged efficiently and whether unnecessary utterances increased.
4.4.3 Change Rates for Consensus Quality
To evaluate whether debate promoted correction in the right direction, we aggregated changes from initial answers to final answers.
A correct change is a shift from an incorrect initial answer to a correct final answer; an incorrect change is the reverse shift. The correct-change rate and incorrect-change rate are the corresponding proportions of these shifts.
Higher correct-change rates and lower incorrect-change rates indicate a debate process that supports error correction and resists being misled by incorrect information.
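For illustration, the two rates might be computed as follows, assuming they are normalized by the number of questions (the excerpt does not fix the denominator explicitly).

```python
def change_rates(initial, final, gold):
    """Compute (correct-change rate, incorrect-change rate) from
    per-question initial answers, final answers, and gold labels."""
    n = len(gold)
    correct_change = sum(1 for i, f, g in zip(initial, final, gold)
                         if i != g and f == g)   # wrong -> right
    incorrect_change = sum(1 for i, f, g in zip(initial, final, gold)
                           if i == g and f != g)  # right -> wrong
    return correct_change / n, incorrect_change / n
```

A debate that mostly turns wrong initial answers into correct final ones maximizes the first rate while keeping the second near zero.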
4.5 Number of Runs and Statistical Treatment
The main results in this thesis were obtained by running each condition 50 times, where each run consisted of 50 questions, and are reported as mean and standard deviation. In addition, we report a stability analysis (mean ± standard deviation and change rates).
Chapter 5 Experimental Results
5.1 Personality Reflection (BFI)
BFI test results showed that score distributions of agents under the Different Personas condition were generally aligned with the intended high/low assignments (Figure 3.2). This suggests that personality prompts were not ignored and supports the validity of comparing collaborative behavior under personality-assigned settings.
5.2 Accuracy
In the experiment that repeated a 50-question run 50 times, Different Personas showed the highest mean accuracy (Table 5.1), while Same Persona tended to show only small differences across profile variants.
Table 5.1: Supplementary experiment: accuracy over 50 runs (mean ± standard deviation)
Source: Created by the author from 50-run experiment logs.
5.3 Utterance Token Consumption
Table 5.2 shows token consumption over the entire debate process (per-question average, reported as mean ± standard deviation over 50 runs). No Persona consumed the most tokens, suggesting that output sometimes became unstable, such as discussions diverging even after consensus had effectively formed. By contrast, token consumption was lower in Same Persona and Different Personas, with Same Persona being the smallest. However, because Different Personas achieved the highest accuracy but not the lowest token usage, a performance-efficiency trade-off should be considered.
Table 5.2: Token consumption in debate (mean ± standard deviation)
Source: Created by the author from 50-run experiment logs.
5.4 Answer Change Rates (Supplementary Indicator)
In the experiments, comparisons focused on accuracy and token consumption. To capture debate quality more directly, we also calculated correct-change and incorrect-change rates. Different Personas tended to show a higher correct-change rate and a lower incorrect-change rate, suggesting that its debate process supported error correction and was less likely to be swayed by incorrect information.
Chapter 6 Analysis and Discussion
6.1 Why Might Different Personas Improve Accuracy?
Under Different Personas, diverse perspectives, confidence levels, and counterargument styles may have made it easier to correct initially incorrect answers through debate. The tendency toward higher correct- change rates and lower incorrect-change rates suggests that correction through debate functioned effectively. This is also consistent with findings from human-team studies that design choices in team composition can affect outcomes, supporting the value of treating personality composition as a design variable.
6.2 Interpretation of Limited Improvement in Same Persona
Although Same Persona assigns personality traits, all three agents reason in a similar style, making diversity in error detection less likely. As a result, the debate can be more organized than No Persona in some cases, but improvements are less likely to reach the level of Different Personas. In the supplementary experiment (50 runs), accuracy differences among the three homogeneous profile variants were also small, while the advantage of Different Personas remained. This suggests that selecting only a single shared personality has inherent limitations.
6.3 Token Consumption and Debate Efficiency
In the main experiment, No Persona consumed the most tokens. Log analysis indicated cases where debate diverged after opinions had already aligned, increasing token use. This highlights the need for better convergence control and qualitative evaluation of utterance content. At the same time, lower token consumption is not always better. If debate is too short, opportunities for error correction may be lost. Therefore, desirable control is to deepen debate only when needed and otherwise converge early, for example through confidence-aware utterance policies and extra rounds triggered only when counterevidence appears.
6.4 Limitations
This study has the following limitations.
• Coverage of personality combinations: Even with binary high/low Big Five settings, there are 2^5 = 32 possible profiles, and assignments across three agents are combinatorially large. This study examined only three patterns and does not provide exhaustive conclusions.
• Task scope: Evaluation was limited to single-answer knowledge tasks such as MMLU. Conclusions may differ for creative tasks or long-horizon dialogue tasks.
• Static personality assignment: Personality was fixed throughout each run. There is room for dynamic personality control, such as switching between reflection mode and debate mode according to task progress.
6.5 Future Directions
Future work should include (i) validation on tasks requiring creativity, (ii) broader evaluation across more diverse personality combinations, and (iii) dynamic modulation of personality traits to search for task-adaptive team compositions. In addition, qualitative analysis of debate logs (e.g., types of counterarguments, citation of evidence, and confidence expressions) is expected to improve explainability regarding why performance improved.
Chapter 7 Conclusion
7.1 Summary of Findings
In this study, we explicitly assigned Big Five-based personality traits to LLM agents and evaluated them on MMLU tasks that required consensus on a single final answer. The results showed that Different Personas, which combines different personalities, achieved the highest accuracy and suggested stronger error correction through debate (improved correct-change rate) and higher robustness against misinformation (reduced incorrect-change rate). At the same time, the effect size of personality assignment depended on task characteristics and initial-answer distributions, and gains under homogeneous personality composition were limited. The experiments and results analyzed in this thesis were obtained through the author’s work as an AIST research assistant.
7.2 Limitations and Future Work
• Because evaluation was limited to MMLU, further verification is needed on creative tasks and long-horizon dialogue tasks.
• Personality prompts were static in this study; dynamic personality control based on task progress is an important future direction.
• Reproducibility should be tested with larger models, more agents, and heterogeneous model ensembles.
Acknowledgments
This research was supported by JSPS KAKENHI Grant Numbers: JP23K28377, JP24H00714, JP25K15109, JP25K03190, JP25K03232, and JP22K12157, and by the Telecommunications Advancement Foundation.
The content of this thesis is based on research conducted by the author as a research assistant at AIST [5]. For publication, the manuscript was reorganized and expanded with approval from the relevant stakeholders.
References
[1] Justin Chih-Yao Chen et al. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007, 2024.
[2] Dan Hendrycks et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[3] Xiaoyu Jia, Li Chen, and Jinwoo Park. Personality-aware team composition: How big five traits shape collaborative performance. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW1):1–23, 2024.
[4] Hang Jiang et al. Personallm: Investigating the ability of large language models to express personality traits. arXiv preprint arXiv:2305.02547, 2023.
[5] Akikazu Kimura, Kenichiro Fukuda, Yasuyuki Tahara, and Yuichi Sei. Impact of introducing personality traits into llm agents on collaborative tasks. In Proceedings of the 39th Annual Conference of the Japanese Society for Artificial Intelligence, 3J6-GS-5-04, 2025.
[6] Meta AI. Meta-llama-3.1-8b-instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, 2025.
[7] George A. Neuman and Julie Wright. Team effectiveness: Beyond skills and cognitive ability. Journal of Applied Psychology, 84(3):376–389, 1999.
[8] Jintian Zhang et al. Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124, 2023.
- Cite this work: Akikazu Kimura (author), 2026, Impact of Introducing Personality Traits into LLM Agents on Collaborative Tasks, München, GRIN Verlag, https://www.grin.com/document/1696419