AIDB Daily Papers
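Limits of Combining Multiple LLMs: Co-Failure Ceilings for Routing, Voting, and Mixture-of-Agents Across 67 Models
Note: The translated title and key points were generated automatically by AI. Consult the original paper for the accurate content.
Key Points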
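- The gain from combining multiple LLMs is capped by the rate at which every model answers the same question incorrectly (the all-wrong rate, beta): a policy that outputs one member model's answer cannot exceed an accuracy of one minus beta.
- The usual diagnostic, average pairwise error correlation (rho), cannot pin down this all-wrong rate, so relying on it can misjudge the true gain available from combining models.
- In experiments with 67 models, the all-wrong rate was too high to ignore, notably on open-ended mathematics (beta 0.052) and on GPQA-Diamond questions asked in free-response form (beta 0.127), and combinations that beat the single best model were rare.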
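The second point can be made concrete with a small constructed example. The sketch below is not taken from the paper: it builds three hypothetical error laws for three models, each with a marginal error rate of 0.5 and every pairwise error correlation equal to 0, and shows that their all-wrong rates still differ. The names `summarize`, `independent`, `xor_law`, and `xnor_law` are illustrative assumptions.

```python
from itertools import product

def summarize(joint):
    """joint maps an error pattern (e1, e2, e3), with 1 = wrong, to its probability.

    Returns the marginal error rates, the pairwise Pearson correlations of the
    error indicators, and the all-wrong rate beta.
    """
    n = 3
    marginals = [sum(p for e, p in joint.items() if e[i]) for i in range(n)]
    rho = {}
    for i in range(n):
        for j in range(i + 1, n):
            # Probability that models i and j are both wrong.
            p_ij = sum(p for e, p in joint.items() if e[i] and e[j])
            m_i, m_j = marginals[i], marginals[j]
            denom = (m_i * (1 - m_i) * m_j * (1 - m_j)) ** 0.5
            rho[(i, j)] = (p_ij - m_i * m_j) / denom
    beta = joint.get((1, 1, 1), 0.0)
    return marginals, rho, beta

# Law A: three independent fair-coin error indicators.
independent = {e: 1 / 8 for e in product((0, 1), repeat=3)}
# Law B: the third model errs exactly when the first two disagree (XOR).
xor_law = {(a, b, a ^ b): 1 / 4 for a, b in product((0, 1), repeat=2)}
# Law C: the third model errs exactly when the first two agree (XNOR).
xnor_law = {(a, b, 1 - (a ^ b)): 1 / 4 for a, b in product((0, 1), repeat=2)}

for name, law in [("independent", independent), ("xor", xor_law), ("xnor", xnor_law)]:
    marginals, rho, beta = summarize(law)
    print(name, marginals, sorted(rho.values()), beta)
```

All three laws are indistinguishable by marginals and pairwise correlations, yet beta ranges from 0 to 0.25, so the accuracy ceiling of one minus beta ranges from 1.0 down to 0.75.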
Abstract
Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical marginals and pairwise correlations can have different all-wrong rates. A Clopper-Pearson bound on beta gives a finite-sample certificate on the largest gain any router, vote, or cascade could deliver before training a router. Across 67 models from 21 providers, a tetrachoric-calibrated single-factor model still underprices the all-wrong tail: on open-ended mathematics, observed beta is 0.052 versus 0.023 under the full 67-model Gaussian copula, about 2.5 times underpricing, with 90 percent CI 1.7 to 3.4 and k equals 17. The effect recurs on execution-graded code, where beta is 0.079. Re-asking the same GPQA-Diamond questions in free-response rather than multiple-choice form reopens the tail, with beta 0.127 and a five-judge panel with kappa 0.73 to 0.92, locating co-failure in answer format rather than subject. At matched quality, low-rho heterogeneous ensembles beat high-rho Self-MoA, but on checkable tasks in our pool, combining models rarely beats the single best model without a strong query-level routing signal. Gains come from models failing on different questions, not from adding more models.
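As a rough illustration of the ceiling and the finite-sample certificate described in the abstract, the following sketch estimates beta from a correctness matrix and computes an exact Clopper-Pearson interval using only the standard library. It is a minimal sketch under stated assumptions, not the authors' code: the toy data are synthetic, the helper names (`all_wrong_count`, `clopper_pearson`) are hypothetical, and the abstract does not say which side of the interval or which confidence level the paper's certificate uses, so a two-sided 90 percent interval is assumed here.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); adequate for n up to a few hundred."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.10):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k / n."""
    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

def all_wrong_count(correct):
    """correct[q][m] is True when model m answers query q correctly."""
    return sum(1 for row in correct if not any(row))

# Synthetic toy pool (an assumption for illustration): 200 queries, 3 models.
correct = [[q % 4 != 0, q % 5 != 0, q % 10 != 0] for q in range(200)]
n_queries = len(correct)
k_all_wrong = all_wrong_count(correct)
beta_hat = k_all_wrong / n_queries

# Accuracy of each single model, and of the best one.
n_models = len(correct[0])
single_acc = [sum(row[m] for row in correct) / n_queries for m in range(n_models)]
best_single = max(single_acc)

beta_lo, beta_hi = clopper_pearson(k_all_wrong, n_queries)
# No policy that returns one member model's answer can exceed 1 - beta, so the
# lower end of the interval on beta caps the achievable gain over the best model.
max_gain = (1 - beta_lo) - best_single
print(f"beta_hat={beta_hat:.3f}  CI=({beta_lo:.3f}, {beta_hi:.3f})")
print(f"best single={best_single:.3f}  ceiling={1 - beta_hat:.3f}  max gain<={max_gain:.3f}")
```

Because the check needs only per-query correctness of the member models, it can be run before any router is trained, which is the use the abstract describes.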