AIDB Daily Papers
GitOfThoughts:再現・差分・マージ可能なバージョン管理された推論とエージェントメモリ
※ 日本語タイトル・ポイントはAIによる自動生成です。正確な内容は原論文をご確認ください。
ポイント
- LLMの推論プロセスをGitリポジトリとして記録し、再現・監査・マージを可能にするGitOfThoughtsを提案した。
- 様々なメモリ形式(ベクター、グラフ、Gitなど)で推論精度を比較したが、新規問題に対する効果は限定的であった。
- メモリの効果は、問題が類似している(コピー可能性閾値超え)場合に限定され、テスト時サンプリングが有効な手段であった。
Abstract
Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.
Paper AI Chat
この論文のPDF全文を対象にAIに質問できます。
質問の例: