More Agents Is All You Need
Authors: Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye
Paper: https://arxiv.org/abs/2402.05120
Code: https://anonymous.4open.science/r/more_agent_is_all_you_need/README.md
Honestly, we're getting tired of paper titles claiming “X is all you need”. This time, a team from Tencent demonstrates that repeatedly sampling from the same model and aggregating the results by voting improves output quality as the number of instantiated agents grows.
It's not like we didn't know about ensemble methods. CoT-SC (Chain-of-Thought with Self-Consistency) essentially did the same thing. The current work tests whether simply brute-forcing with a large number of agents works. Spoiler: it does.
The authors identify three approaches in related work:
1. LLM Self-Ensemble, e.g. CoT-SC, where the same LLM generates multiple results that are assembled into the final answer.
2. Heterogeneous LLM Ensemble, which does the same but with different LLMs; this even includes distilling multiple LLMs into one.
3. Collaboration of multiple LLM agents, which, unlike approach 2, involves some interaction between the agents.
The current work clearly falls into the first category, though it may be applicable within the other approaches as well.
The method is simple:
1. Generate N samples by querying the LLM N times (the work implies using the same prompt, though it seems it could work even better with different prompts).
2. Conduct a majority vote to select the answer: compute the cumulative similarity of each answer to the others (BLEU for open-ended generation, which is somewhat questionable; answer frequency for close-ended tasks) and pick the answer with the highest cumulative similarity as the final one.
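A minimal sketch of this sampling-and-voting procedure in Python, assuming a hypothetical query_llm callable and a pluggable similarity function; these names are mine for illustration, not from the released code, and nltk's BLEU is just one possible choice for the open-ended case.

```python
from nltk.translate.bleu_score import sentence_bleu  # one possible open-ended similarity

def sample_and_vote(prompt, n_samples, query_llm, similarity):
    """Query the LLM n_samples times and return the answer with the
    highest cumulative similarity to the other samples."""
    # Step 1: generate N samples (temperature > 0 so they actually differ).
    answers = [query_llm(prompt) for _ in range(n_samples)]

    # Step 2: majority vote via cumulative similarity.
    def cumulative_similarity(candidate):
        # Self-similarity adds the same constant to every candidate,
        # so including it does not change the argmax.
        return sum(similarity(candidate, other) for other in answers)

    return max(answers, key=cumulative_similarity)

# Close-ended tasks: exact match reduces this to plain frequency voting.
def exact_match(a, b):
    return 1.0 if a.strip() == b.strip() else 0.0

# Open-ended tasks: BLEU between two sampled answers.
def bleu_similarity(a, b):
    return sentence_bleu([a.split()], b.split())
```

With exact_match as the similarity, the winner is simply the most frequent answer, which is what the close-ended setting amounts to.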
They tested it on various tasks: Arithmetic Reasoning (GSM8K+MATH), General Reasoning (MMLU+Chess), Code Generation (HumanEval).
Models used: Llama2-Chat 13B and 70B, GPT-3.5-Turbo, GPT-4.
Baseline methods compared against: CoT, Zero-shot CoT, Solo Performance Prompting (SPP), LLM-Debate, Reflexion.
Each of these methods can also be enhanced by adding such ensembling on top.
As a result, quality increases with ensemble size. The gains are most pronounced up to roughly 10 agents, after which they taper off noticeably. Only on the chess task with Llama did the method fail to outperform the chosen baselines.
Improvements are quite stable across different hyperparameter values. On more complex datasets and with simpler LLMs, the benefit is greater.
They also examined how the improvement depends on task difficulty, the number of reasoning steps, and the a priori probability of the correct answer (which, as I understand it, equals the probability of guessing correctly at random).
Since performance can be improved at each individual step, the approach extends naturally to multi-step tasks. As the a priori probability grows, so does the gain, so the authors propose a hierarchical procedure: a task with a low prior probability is decomposed into several subtasks, each with a higher prior probability. Different models are tried for different subtasks (the cheaper GPT-3.5 for simpler ones, the more expensive GPT-4 for complex ones). This all works.
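To make the hierarchical idea concrete, here is a hedged sketch that reuses sample_and_vote from above; decompose, is_simple, cheap_llm, and strong_llm are hypothetical stand-ins I introduce for illustration, not components of the paper's code.

```python
def solve_hierarchically(task, decompose, is_simple, cheap_llm, strong_llm,
                         similarity, n_samples=10):
    """Split a low-prior-probability task into higher-probability subtasks,
    solve each with sampling-and-voting, and route by difficulty."""
    results = []
    for subtask in decompose(task):
        # Cheaper model (GPT-3.5-class) for simple subtasks,
        # stronger model (GPT-4-class) for hard ones.
        llm = cheap_llm if is_simple(subtask) else strong_llm
        results.append(sample_and_vote(subtask, n_samples, llm, similarity))
    return results
```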
That's the gist of it.