Daily Research 2026-05-25 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

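#2026-05-25 AI/LLM briefing: latest papers and research hotspots

Time range: mainly papers and projects that appeared or were updated on Hugging Face Daily Papers and arXiv between roughly 2026-05-21 and 2026-05-24; on the GitHub side, supplemented with repos that were still active over the past week.
Access status: Hugging Face Papers, the GitHub API, and arXiv single-paper/batch metadata were reachable. The arXiv keyword search endpoint returned 429/timeout this time, so candidates were discovered through Hugging Face Daily Papers and then cross-checked against arXiv metadata and project pages. X/Twitter was not used as a primary source, to avoid incomplete information caused by unstable access.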

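#One-sentence overview

Today's signals closest to wenjun's direction are tightly clustered: Agent RL is moving from "single-turn verifiable tasks" toward real long-trajectory environments and tool/skill orchestration; latent reasoning is no longer just a compression of text CoT but is starting to serve as a cross-modal, low-token reasoning medium; and executable environments such as code, terminals, and spreadsheets are becoming the key substrate for training and evaluating agentic RL.

#Selected key papers and developments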

#1. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

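Links: https://huggingface.co/papers/2605.22138 / https://arxiv.org/abs/2605.22138
Source: Hugging Face Daily Papers / arXiv
Date: 2026-05-21
Category: LLM Agent / Model-based RL / Tool-use / Evaluation
Core contribution in one sentence: Proposes SR²AM, which splits agent decision-making into reactive execution, world-model-based simulative planning, and a self-regulation module that decides when and how deeply to plan, in order to avoid the token waste of unrestrained CoT.

Why it matters: This paper sits squarely on the "Dreamer for LLM Agent / model-based RL for language agents" line. It does not simply say "let the model think a few more steps"; it stresses that planning should be invoked selectively: the agent needs a meta-controller to decide when to start simulation, how long a horizon to simulate, and whether the token/compute cost is worth paying.

Relation to wenjun's research direction: If an LLM Agent is viewed as a decision-maker in a partially observable environment, SR²AM's three-layer structure maps to System I policy, System II latent/world-model rollout, and System III compute allocation. For long-trajectory RL, a natural question is whether "planning depth / whether to call the world model" can itself be made a learnable action and trained with environment return.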

#2. ACC: Compiling Agent Trajectories for Long-Context Training

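Links: https://huggingface.co/papers/2605.21850 / https://arxiv.org/abs/2605.21850
Source: Hugging Face Daily Papers / arXiv
Date: around 2026-05 (included in this round of HF Daily Papers)
Category: LLM Agent / Pretraining Data / Context Compression / Long-context Training
Core contribution in one sentence: "Compiles" agent trajectories into a data form better suited to long-context training, focusing on how to turn the observations, actions, tool feedback, and state transitions of a long trajectory into context the model can learn from.

Why it matters: The recent bottleneck for agent data is not "whether trajectories exist" but "whether trajectories enter training at the right granularity, in the right order, and with the right compression". Work like ACC suggests that agent pretraining data may need a compiler: one that preserves causal state, key decision points, and tool return values while removing redundant tokens.

Relation to wenjun's research direction: This connects directly to "how agent pretraining data shapes capability" and to the "general-purpose context compressor". For future continued pretraining of code agents or web agents, raw logs are quite likely not the best training corpus; structured trajectory-to-context compilation strategies need to be studied.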

#3. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

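Links: https://huggingface.co/papers/2605.22535 / https://arxiv.org/abs/2605.22535
Project: https://github.com/EuniAI/TerminalWorld
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: 2026-05-21
Category: Code Agent / Evaluation / Tool-use / Long-horizon Agent
Core contribution in one sentence: Automatically reverse-constructs high-fidelity evaluation tasks from 80,870 real terminal recordings, yielding 1,530 validated tasks plus a manually curated Verified subset of 200; the best pass rate of current strong models/agents on real terminal workflows is about 62.5%.

Why it matters: TerminalWorld's value is not only as a benchmark but as a data engine: it reverse-derives executable tasks from in-the-wild terminal traces, covers 18 real-world task categories and 1,280 unique commands, and includes long workflows of more than 50 steps. This is closer to the distribution real coding/ops agents face than a handful of hand-written shell tasks.

Relation to wenjun's research direction: For code intelligence and agentic RL, terminal tasks come with naturally verifiable rewards: command exit status, filesystem diffs, test results, and target states. This makes them a good substrate for studying "long-trajectory credit assignment + executable environments + data compilation".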
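
To make the "naturally verifiable reward" point concrete, here is a minimal sketch of an outcome check based on exit status and final filesystem state. It is not TerminalWorld's actual pass criterion; the names `snapshot` and `terminal_reward`, the text-files-only assumption, and the all-or-nothing scoring are illustrative assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its text content."""
    return {
        str(p.relative_to(root)): p.read_text()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def terminal_reward(commands: list[list[str]], goal: dict[str, str]) -> float:
    """Run commands in a scratch directory and score the outcome.

    Returns 1.0 only if every command exits with status 0 and the final
    file tree equals the goal snapshot; otherwise 0.0. (Hypothetical
    scoring rule chosen for illustration.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for argv in commands:
            result = subprocess.run(argv, cwd=root, capture_output=True, text=True)
            if result.returncode != 0:
                return 0.0
        return 1.0 if snapshot(root) == goal else 0.0


# Usage: a one-step "task" whose goal state is a single file.
write_cmd = [sys.executable, "-c", "open('out.txt', 'w').write('done')"]
print(terminal_reward([write_cmd], {"out.txt": "done"}))  # prints 1.0
```

A real environment would also sandbox the commands and would likely compare only task-relevant paths rather than the whole tree.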

#4. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

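Links: https://huggingface.co/papers/2605.22642 / https://arxiv.org/abs/2605.22642
Project: https://github.com/Spreadsheet-RL/Spreadsheet-RL
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: 2026-05-21
Category: LLM Agent / Post-training RL / Tool-use / Evaluation
Core contribution in one sentence: Builds an RL fine-tuning framework for spreadsheet agents in a real Microsoft Excel environment and automatically collects start-goal spreadsheet pairs, targeting multi-step spreadsheet workflows in finance, supply chain, and similar domains.

Why it matters: This is a representative example of agentic RL moving from toy browser/QA settings to real production software environments. Excel tasks are strongly stateful, tool-heavy, and format-constrained, yet their goal states can be defined fairly clearly, which makes Excel a high-value environment for training agents.

Relation to wenjun's research direction: Spreadsheet-RL hints that RL environment design for LLM Agents may end up mattering more than the algorithm itself. How to turn tasks in real software into a start-goal MDP, how to generate a curriculum automatically, and how to keep the agent from merely learning UI hacks are all questions worth digging into.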

#5. From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

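Links: https://huggingface.co/papers/2605.22074 / https://arxiv.org/abs/2605.22074
Source: Hugging Face Daily Papers / arXiv
Date: 2026-05-21
Category: Post-training RL / RLVR / Credit Assignment / Reasoning Model
Core contribution in one sentence: Proposes SCRL, which splits a reference reasoning chain into verifiable subproblems and uses subproblem-level reward normalization so that partial progress inside failed rollouts also receives a learning signal.

Why it matters: The pain point of RLVR is that correct final answers are too rare on hard problems, so the outcome reward is too sparse. SCRL's core idea is to break "the final answer is correct" into "intermediate subproblems are verifiable", pulling hard problems out of the gradient dead zone.

Relation to wenjun's research direction: Long-trajectory agents have the same problem: final success/failure is too sparse, and episode-level reward alone is hard to learn from. SCRL's subproblem curriculum can transfer to agent tasks by splitting a long task into verifiable intermediate states, file diffs, test sub-goals, or tool-state predicates.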
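
The following sketch shows one plausible reading of "subproblem-level reward normalization"; it is an assumption for illustration, not the formula from the SCRL paper. Each subproblem's binary result is z-scored across a group of rollouts, then averaged per rollout. The function name `subproblem_advantages` and the example data are hypothetical.

```python
from statistics import mean, pstdev


def subproblem_advantages(group: list[list[int]], eps: float = 1e-6) -> list[float]:
    """Compute one advantage per rollout from per-subproblem results.

    group[i][k] is 1 if rollout i solved subproblem k, else 0.
    Each subproblem column is z-scored across the group, and the
    normalized scores are averaged for each rollout.
    """
    n_sub = len(group[0])
    columns = [[rollout[k] for rollout in group] for k in range(n_sub)]
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        mean((rollout[k] - mu) / (sd + eps) for k, (mu, sd) in enumerate(stats))
        for rollout in group
    ]


# Three rollouts on a hard problem; the last column is the final answer,
# which every rollout gets wrong.
group = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
advantages = subproblem_advantages(group)
print(advantages)
```

With an outcome-only reward all three rollouts would score 0 and a group-normalized advantage would vanish. Here the rollout that solved more subproblems gets a positive advantage and the one that solved none gets a negative one, which is the "out of the dead zone" effect described above.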

#6. DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

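Links: https://huggingface.co/papers/2605.21467 / https://arxiv.org/abs/2605.21467
Project: https://github.com/RUCBM/DelTA
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: around 2026-05 (included in this round of HF Daily Papers)
Category: Post-training RL / RLVR / Credit Assignment
Core contribution in one sentence: Targets token-level credit assignment in RLVR, trying to distinguish at finer granularity which tokens are responsible for the final verifiable outcome.

Why it matters: Like SCRL, DelTA addresses "how the final reward is distributed over the generation process", but at token granularity. If the outcome reward is only 0/1, the model can easily misattribute credit to irrelevant templates, long-CoT habits, or incidental tokens.

Relation to wenjun's research direction: For code agents, token credit extends to action/tool-call credit: which file edit, which test run, which segment of the plan actually contributed to success? This is a core difficulty of agentic RL.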

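#7. LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Links: https://huggingface.co/papers/2605.22012 / https://arxiv.org/abs/2605.22012
Source: Hugging Face Daily Papers / arXiv
Date: 2026-05-21
Category: Latent Reasoning / Multimodal LLM / Reasoning Model
Core contribution in one sentence: Argues that explicit text CoT squeezes continuous audio-visual evidence into discrete tokens and thereby loses fine-grained temporal grounding; proposes LatentOmni, which alternates between text reasoning and audio-visual latent states.

Why it matters: This paper reinforces a trend: latent reasoning is not just "write less CoT, save tokens" but a way for intermediate reasoning states to retain continuous perceptual information. For multimodal tasks, text CoT may be inherently ill-suited to expressing all intermediate evidence.

Relation to wenjun's research direction: For LLM Agents, environment state does not necessarily have to be turned entirely into text either. A web page DOM, terminal state, code AST, or test log may each have a more suitable latent state representation. One open question: should an agent's world model also be learned over latent states rather than pure-text trajectories?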

#8. WorldKV: Efficient World Memory with World Retrieval and Compression

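Links: https://huggingface.co/papers/2605.22718 / https://arxiv.org/abs/2605.22718
Source: Hugging Face Daily Papers / arXiv
Date: around 2026-05 (included in this round of HF Daily Papers)
Category: Context Compression / Memory / Systems / Long-context
Core contribution in one sentence: Improves the efficiency of long-context / world-state memory through retrieval and compression of world memory.

Why it matters: For a long-trajectory agent, more context is not always better; "world state" and "information useful for the next step" need to be modeled separately. The direction WorldKV represents is retrievable, compressible world memory built at the KV/memory layer.

Relation to wenjun's research direction: For a model-based LLM Agent, an updatable world memory is a key component. Research questions include: which observations should enter long-term world memory? How can memory be compressed without losing the state needed for verifiable goals? Can memory retrieval be optimized by RL signals?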

#9. Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

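Links: https://huggingface.co/papers/2605.22177 / https://arxiv.org/abs/2605.22177
Source: Hugging Face Daily Papers / arXiv
Date: 2026-05-21
Category: LLM Agent / Post-training RL / Tool-use / Skill Composition
Core contribution in one sentence: Treats the orchestration of calls to heterogeneous models and a skill library as a sequential decision problem, using a lightweight RL policy to dynamically select experts and skills from a hierarchical model-skill registry.

Why it matters: An agent does not have to pack every capability into one giant model. Maestro's direction looks more like a "learned orchestrator": freeze the expert models/tools and train a scheduling policy that decides when to call which one.

Relation to wenjun's research direction: This is closely related to multi-tool agents such as OpenClaw/Hermes. Future agentic RL can train not only the base LLM but also a middleware policy: which tool to choose, which model to choose, which context compression strategy to use, and whether to plan.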

#10. π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

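Links: https://huggingface.co/papers/2605.14678 / https://arxiv.org/abs/2605.14678
Source: Hugging Face Daily Papers / arXiv
Date: published 2026-05-14, updated 2026-05-19
Category: LLM Agent / Evaluation / Intent Understanding / Long-horizon Agent
Core contribution in one sentence: Builds a proactive personal assistant benchmark with hidden user intent, cross-task dependencies, and cross-session continuity, for evaluating proactivity and task completion in long-term interaction.

Why it matters: This paper targets the shift "from instruction understanding to intent understanding": users do not always state their full constraints explicitly, so the agent has to infer hidden needs from the persona, session history, and task dependencies.

Relation to wenjun's research direction: For a truly useful personal assistant agent, the benchmark cannot test only single-turn instruction following. π-Bench's hidden intent and cross-session continuity can serve as a design reference for evaluating "intent understanding / proactive assistance".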

#11. GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

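Links: https://huggingface.co/papers/2605.21605 / https://arxiv.org/abs/2605.21605
Source: Hugging Face Daily Papers / arXiv
Date: 2026-05-20
Category: LLM Agent / Self-evolving Agent / Tool-use / Multimodal
Core contribution in one sentence: Models the image generation process as a tool-orchestration trajectory, and lets the generation agent self-evolve by comparing good and bad trajectories for the same request and distilling structured visual experience.

Why it matters: Although the domain is image generation, the methodology generalizes to agents: the best-worst differences among multiple trajectories for the same task can be abstracted into experience and fed back into the agent's policy.

Relation to wenjun's research direction: Code agents can do a similar kind of "experience distillation": for the same bug fix or feature task, the differences among multiple edit-test trajectories can be turned into structured debugging experience, rather than keeping only the final patch.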

#12. Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

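Links: https://huggingface.co/papers/2605.21803 / https://arxiv.org/abs/2605.21803
Source: Hugging Face Daily Papers / arXiv
Date: 2026-05-20
Category: Pretraining Mechanism / Scaling Law / Optimizer
Core contribution in one sentence: Shows that the same Transformer architecture exhibits different representation spectral scaling laws under different optimizers; for example, Muon's hard-rank scaling on rare-token representations is markedly stronger than AdamW's.

Why it matters: This is a reminder that scaling laws should not treat the optimizer as a fixed training detail. The optimizer affects how FFN width translates into usable spectral capacity, especially for rare-token / tail representations.

Relation to wenjun's research direction: For foundation model training and the mechanisms of capability formation, this suggests that "capability differences" may come from how the optimizer shapes representation rank and spectral structure, not only from data volume, parameter count, or loss. Learning of code data, long-tail symbols, and rare APIs is especially likely to be affected.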

#Repos / models / datasets worth following today

TerminalWorld

- Link: https://github.com/EuniAI/TerminalWorld

- Value: A data engine that reverse-derives tasks from real terminal recordings; well suited as an RL and evaluation substrate for code/ops agents.

Spreadsheet-RL

- Link: https://github.com/Spreadsheet-RL/Spreadsheet-RL

- Value: An agentic RL framework in a real Excel environment; useful for observing how a start-goal software environment constructs reward and curriculum.

SR²AM / self-regulated planning

- Links: https://github.com/sailing-lab/sr2am / https://github.com/sailing-lab/sr2am-self-regulated-planning

- Value: Learnable control over "when to plan, how deeply to plan, and when to execute reactively" is very close to model-based LLM Agents.

AstraFlow

- Link: https://github.com/Infini-AI-Lab/astraflow

- Value: Dataflow-Oriented Reinforcement Learning for (Multi-)Agentic LLMs; the README indicates support for fully async multi-policy collaborative RL, making it a useful systems-side reference for agent RL.

HRM-Text

- Links: https://github.com/sapientinc/HRM-Text / https://huggingface.co/sapientinc/HRM-Text-1B

- Value: A 1B-scale text model that emphasizes hierarchical recurrent architecture, task completion, and latent space reasoning, and claims to complete pretraining with lower compute/data; worth watching as a case of latent reasoning + efficient pretraining.

Skim

- Link: https://github.com/dean0x/skim

- Value: A context optimization engine for coding agents that parses ASTs for 17 languages and compresses code, test output, build errors, and git diffs; highly relevant to the "general-purpose context compressor" direction.

#Top 3 papers to read closely today

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Why read closely: Closest to model-based RL / Dreamer for LLM Agent. Focus on how it defines the world model, planning invocation, the self-regulation policy, and the token-efficiency evaluation.

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Why read closely: The way it builds real terminal traces → benchmark/data engine is important. Focus on task reverse-derivation, validation, reward/pass criteria, and the differences from Terminal-Bench.

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

Why read closely: Its RLVR credit assignment method can transfer to long-trajectory agents. Focus on subproblem extraction, reward normalization, and the analysis of the gradient dead zone on hard problems.

Alternates: LatentOmni (if today's interest is latent reasoning), Spreadsheet-RL (if the interest is agentic RL in real software environments).

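#Research opportunities / ideas

#Idea 1: Transfer SCRL-style "verifiable subproblems" to long-trajectory RL for code agents

Split a code task from a final test pass/fail into intermediate verifiable sub-goals: locate the relevant files, produce a minimal reproduction, pass a particular sub-test, keep lint/type checks clean, modify the correct function, and avoid unrelated diffs. Then use these sub-goals as a curriculum, so that partially correct behavior inside failed trajectories also receives a learning signal.

Open questions:

How can subproblems be extracted automatically from historical successful patches / CI logs?
Will sub-goal rewards induce the agent to over-optimize local metrics?
Can subproblem-level normalization alleviate sparse reward in long trajectories?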
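
As a rough illustration of what "intermediate verifiable sub-goals" could look like in code, the sketch below encodes a few of the sub-goals named in this idea as predicates over a summarized episode. The `Trajectory` and `Task` records, their field names, and the specific predicates are all hypothetical; a real system would derive them from tool logs and CI output.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trajectory:
    """Hypothetical summary of one code-agent episode."""
    files_opened: set[str] = field(default_factory=set)
    files_edited: set[str] = field(default_factory=set)
    tests_passed: set[str] = field(default_factory=set)
    lint_clean: bool = False


@dataclass
class Task:
    """Hypothetical task annotation, e.g. mined from a reference patch."""
    relevant_files: set[str]
    target_tests: set[str]


Subgoal = Callable[[Trajectory, Task], bool]

# Each predicate is cheap to verify and independent of final success.
SUBGOALS: dict[str, Subgoal] = {
    "located_relevant_file": lambda t, task: bool(t.files_opened & task.relevant_files),
    "no_unrelated_diff": lambda t, task: bool(t.files_edited)
    and t.files_edited <= task.relevant_files,
    "some_target_test_passes": lambda t, task: bool(t.tests_passed & task.target_tests),
    "lint_clean": lambda t, task: t.lint_clean,
    "all_target_tests_pass": lambda t, task: task.target_tests <= t.tests_passed,
}


def subgoal_vector(traj: Trajectory, task: Task) -> dict[str, int]:
    """Return a 0/1 result for every sub-goal, even if the episode failed."""
    return {name: int(check(traj, task)) for name, check in SUBGOALS.items()}


task = Task(relevant_files={"src/parser.py"}, target_tests={"test_a", "test_b"})
failed_episode = Trajectory(
    files_opened={"README.md", "src/parser.py"},
    files_edited={"src/parser.py"},
    tests_passed={"test_a"},
)
print(subgoal_vector(failed_episode, task))
```

The episode above fails the final criterion but still satisfies three sub-goals, so it would carry a non-zero signal. The second open question applies directly: a predicate such as `no_unrelated_diff` is easy to satisfy by editing very little, so sub-goal weights would need care.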

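#Idea 2: Agent trajectory compiler: from raw tool logs to trainable context

Inspired by ACC, TerminalWorld, and Skim, one could study a general trajectory compiler: the input is raw logs from terminal/browser/code agents, and the output is structured context suitable for SFT/RL/pretraining. The core is not simple truncation but preserving state transitions and the causal chain of decisions.

Open questions:

Which tokens are "state", which are "noise", and which are "credit assignment evidence"?
Should AST, diff, test output, and shell history each use a different compressor?
Does a compiled trajectory improve agent planning / tool use more than raw logs do?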
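
A minimal sketch of what a first-pass trajectory compiler might do, under simple assumptions: keep every action and exit code (the decision and state-transition skeleton), drop exact consecutive repeats, and shorten long observations to their head and tail. This is not ACC's or Skim's method; `Step`, `compress_observation`, and `compile_trajectory` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str        # e.g. a shell command or tool call
    observation: str   # raw tool output
    exit_code: int = 0


def compress_observation(text: str, head: int = 2, tail: int = 2) -> str:
    """Keep the first and last lines of long output, marking what was cut."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return "\n".join(lines)
    omitted = len(lines) - head - tail
    kept_tail = lines[len(lines) - tail:]
    return "\n".join(lines[:head] + [f"[{omitted} lines omitted]"] + kept_tail)


def compile_trajectory(steps: list[Step]) -> str:
    """Render steps as compact text, skipping exact consecutive repeats."""
    blocks = []
    previous = None
    for step in steps:
        if step == previous:
            continue  # an identical retry adds no new state information
        previous = step
        blocks.append(
            f"$ {step.action}\n[exit {step.exit_code}]\n"
            f"{compress_observation(step.observation)}"
        )
    return "\n\n".join(blocks)


raw_log = [
    Step("ls", "a.py\nb.py"),
    Step("ls", "a.py\nb.py"),
    Step("pytest", "\n".join(f"line{i}" for i in range(10)), exit_code=1),
]
print(compile_trajectory(raw_log))
```

Head/tail truncation is a deliberately naive compressor. The second open question suggests replacing `compress_observation` with type-specific compressors for diffs, test output, and ASTs.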

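#Idea 3: Learned compute allocation for LLM Agent

SR²AM, Maestro, and WorldKV all point in the same direction: an agent should not use one fixed reasoning pipeline, but should learn when to retrieve from memory, when to compress context, when to run simulative planning, and when to call a strong model, a weak model, or a tool.

Open questions:

Once "whether to plan / retrieve / compress / switch models" is treated as an action, how should the reward be defined?
Can RL train a low-cost controller that schedules a frozen LLM + tools + world memory?
In long-trajectory environments, is compute allocation more stable than longer CoT?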
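
One simple candidate answer to the reward question is "task success minus a weighted compute cost". The sketch below pairs that reward with a tabular running-average controller. It is an illustrative assumption, not the method of SR²AM or Maestro; the action set, the cost values, `cost_weight`, and the context labels are all hypothetical.

```python
from enum import Enum


class MetaAction(Enum):
    REACT = "react"
    RETRIEVE = "retrieve"
    PLAN = "plan"
    STRONG_MODEL = "strong_model"


# Hypothetical relative costs, e.g. normalized token or latency budgets.
COST = {
    MetaAction.REACT: 0.0,
    MetaAction.RETRIEVE: 0.1,
    MetaAction.PLAN: 0.3,
    MetaAction.STRONG_MODEL: 0.5,
}


def episode_reward(success: bool, actions: list[MetaAction], cost_weight: float = 0.2) -> float:
    """Task success minus a weighted sum of compute spent on meta-actions."""
    return float(success) - cost_weight * sum(COST[a] for a in actions)


class ControllerBandit:
    """Running-average value estimate per (context, meta-action) pair."""

    def __init__(self) -> None:
        self.value: dict[tuple[str, MetaAction], float] = {}
        self.count: dict[tuple[str, MetaAction], int] = {}

    def update(self, context: str, action: MetaAction, reward: float) -> None:
        key = (context, action)
        n = self.count.get(key, 0) + 1
        self.count[key] = n
        old = self.value.get(key, 0.0)
        self.value[key] = old + (reward - old) / n  # incremental mean

    def best(self, context: str) -> MetaAction:
        """Greedy choice; unseen pairs default to a value of 0.0."""
        return max(MetaAction, key=lambda a: self.value.get((context, a), 0.0))


controller = ControllerBandit()
# On hard steps planning succeeded while reacting failed.
controller.update("hard", MetaAction.PLAN, episode_reward(True, [MetaAction.PLAN]))
controller.update("hard", MetaAction.REACT, episode_reward(False, [MetaAction.REACT]))
# On easy steps both succeeded, so the cheaper action scores higher.
controller.update("easy", MetaAction.PLAN, episode_reward(True, [MetaAction.PLAN]))
controller.update("easy", MetaAction.REACT, episode_reward(True, [MetaAction.REACT]))
print(controller.best("hard"), controller.best("easy"))
```

The cost term is what makes selective planning worthwhile: without it, the controller has no reason to prefer reactive execution on easy steps. A learned controller would replace the string context with features of the agent state.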

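#Quick take

Closest to wenjun's current main line: SR²AM, TerminalWorld, SCRL, ACC.
Best candidate to turn into an experiment: TerminalWorld + SCRL-style sub-goal rewards, running long-trajectory agent RL on real terminal/code tasks.
Most worth watching over time: agent trajectory compilation, latent/world-state memory, learned tool/model orchestration.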
