Daily Research 2026-05-23 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

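#2026-05-23 AI/LLM Latest Papers and Research Hotspots Briefing

Retrieved: 2026-05-23 08:00 (Asia/Shanghai)
Main coverage: new Hugging Face Papers / arXiv papers from 2026-05-21 to 2026-05-22, supplemented by GitHub updates. The arXiv API returned 429/timeout errors during this run, so arXiv HTML pages and the Hugging Face Papers API were used for cross-checking instead. X/Twitter was not used as a primary source (login and search are not reliable in the automated environment); HF, arXiv, and GitHub stand in for it.

#0. Today's Assessment

The leads most relevant to wenjun's current directions are tightly clustered today: agent training is moving from "prompts + tool calls" toward three more trainable routes:

Trajectory data becomes a training signal: ACC compiles an agent's multi-turn tool trajectories into long-context training samples, covering the blind spot where SFT trains only on assistant turns and masks tool responses.
Environment- and task-level RL is getting more fine-grained: Spreadsheet-RL, Maestro, DelTA, and SCRL each tackle, at a different level, the problem of "real tasks + verifiable rewards + credit assignment".
Latent / world-model-style planning is resurfacing: Efficient Agentic Reasoning explicitly splits agentic reasoning into world-model simulation, meta-control / self-regulation, and an execution policy; LatentOmni / Bernini / WorldKV show, from the angles of multimodal latent reasoning and persistent memory, that "textual CoT is not the only intermediate representation".

If you read only 3 papers closely, prioritize ACC, Efficient Agentic Reasoning, and DelTA / SCRL (either one, or both).

#1. Key Papers and Updates (filtered by relevance)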

#1. ACC: Compiling Agent Trajectories for Long-Context Training

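Link: https://arxiv.org/abs/2605.21850 ; HF: https://huggingface.co/papers/2605.21850
Source / date: arXiv, Submitted on 21 May 2026; HF Papers 2026-05-21
Category: LLM Agent / Pretraining Data / Long-context / Tool-use
One-sentence core contribution: Compiles the trajectories produced by an agent's multi-turn tool calls into data usable for long-context training, so the model learns to integrate evidence across turns and across tool observations rather than only learning "which tool to call next".

Why it matters:

Existing agent SFT often masks out tool-returned content and trains only on the assistant's tool choices and replies. This creates a hidden problem: the evidence actually needed to answer the question is often scattered across multiple rounds of observations, yet the training objective never directly requires the model to integrate it. ACC's core value is to treat agent trajectories as a cheap, naturally occurring long-context corpus and, through "compilation", to explicitly expose the long-range evidence-integration task.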
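
To make the masking blind spot concrete, here is a minimal sketch of a per-token loss mask over a toy trajectory. It does not reproduce ACC's compilation procedure; the `Turn` type, the `loss_mask` helper, and the toy tokens are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Turn:
    role: str          # "user", "assistant", or "tool"
    tokens: List[str]  # toy, pre-tokenized content (hypothetical)

def loss_mask(turns: List[Turn], train_roles: Set[str]) -> List[int]:
    """Return 1 for tokens that contribute to the loss, 0 otherwise."""
    mask: List[int] = []
    for turn in turns:
        flag = 1 if turn.role in train_roles else 0
        mask.extend([flag] * len(turn.tokens))
    return mask

trajectory = [
    Turn("user", ["find", "the", "bug"]),
    Turn("assistant", ["call", "grep"]),
    Turn("tool", ["match", "at", "line", "42"]),
    Turn("assistant", ["bug", "is", "at", "line", "42"]),
]

# Common agent SFT setup: only assistant tokens receive gradient,
# so the four tool-observation tokens are never a training target.
assistant_only = loss_mask(trajectory, {"assistant"})
print(assistant_only)
```

Under this mask the tool observation is only ever context. Nothing in the objective checks that the final assistant turn actually used "line 42" from the observation, which is the gap the paragraph above describes.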

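Relation to wenjun's directions:

Directly relevant to "how agent pretraining data shapes capability": an agent trajectory is not an ordinary long document; it carries an action-observation-correction structure.
Suggestive for long-trajectory RL / model-based agents: if a world model or latent dynamics is to be trained later, trajectory compilation can serve as a supervised pretraining stage in which the model first learns to compress, retrieve, and integrate environment state.
Also related to context compression: agent trajectories are naturally long, and choosing which observations enter the training objective is essentially context compression at the data level.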

#2. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

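Link: https://arxiv.org/abs/2605.22138 ; HF: https://huggingface.co/papers/2605.22138
Source / date: arXiv, Submitted on 21 May 2026; HF Daily Papers
Category: LLM Agent / Model-based RL / World Model / Planning / Test-time Scaling
One-sentence core contribution: Proposes splitting agentic reasoning into three systems: a world model that simulates future states, self-regulation that decides when and how deeply to plan, and an execution policy that acts.

Why it matters:

This paper is very close to the "Dreamer for LLM Agent" framing: instead of stretching CoT indefinitely, the agent learns when planning is needed, how far ahead to plan, and at what granularity. It turns the token-level test-time scaling problem of current reasoning models into an agent decision problem: planning itself has a cost and must be meta-controlled.

Relation to wenjun's directions:

A direct reference for LLM model-based RL: a world model need not predict only pixels or states; it can also predict future tool/environment observations or task progress.
For long-trajectory agent RL, the key variable is the planning budget, not simply "making the model think more".
A possible research question: how can a "planning trigger" and a "planning depth controller" be learned from real agent trajectories?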
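
The three-way split described above (world model, self-regulation, execution policy) can be sketched on a toy 1-D task. This is an illustration under stated assumptions, not the paper's method: `plan_depth`, `simulate`, `best_distance`, `choose_action`, the uncertainty threshold, and the depth rule are all hypothetical.

```python
from typing import List

def plan_depth(uncertainty: float, max_depth: int = 3) -> int:
    """Hypothetical meta-controller: spend more lookahead when less certain."""
    if uncertainty < 0.2:
        return 0  # confident enough: act without simulating
    return min(max_depth, 1 + int(uncertainty * max_depth))

def simulate(state: int, action: int) -> int:
    """Toy world model: predicts the next state of a 1-D walk."""
    return state + action

def best_distance(state: int, goal: int, depth: int, actions: List[int]) -> int:
    """Smallest distance to the goal reachable within `depth` simulated steps."""
    if depth == 0:
        return abs(goal - state)
    return min(best_distance(simulate(state, a), goal, depth - 1, actions)
               for a in actions)

def choose_action(state: int, goal: int, depth: int, actions: List[int]) -> int:
    """Execution policy: reactive rule at depth 0, lookahead otherwise."""
    if depth == 0:
        return max(actions) if goal > state else min(actions)
    return min(
        actions,
        key=lambda a: best_distance(simulate(state, a), goal, depth - 1, actions),
    )

depth = plan_depth(uncertainty=0.5)  # meta-controller picks the planning budget
action = choose_action(state=0, goal=5, depth=depth, actions=[-1, 3])
print(depth, action)
```

The point of the sketch is that `depth` is an explicit decision with a cost (the number of `simulate` calls grows with it), which is what makes a planning budget something a controller could be trained on.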

#3. DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

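Link: https://arxiv.org/abs/2605.21467 ; HF: https://huggingface.co/papers/2605.21467
Source / date: arXiv, Submitted on 20 May 2026; HF Daily Papers, trending
Category: Post-training RL / RLVR / Credit Assignment / Reasoning Model
One-sentence core contribution: Explains, from a "discriminator perspective", how a sequence-level reward in RLVR turns into token probability updates, and proposes a finer-grained approach to token credit assignment.

Why it matters:

Mainstream RLVR usually has only a response-level reward, yet the actual update during training acts on token probabilities. DelTA tries to explain that which tokens are pushed up or down is not as simple as "the correct answer becomes more likely overall"; it is determined by an implicit linear discriminative direction formed by advantage-weighted positive and negative samples. This matters for understanding why RLVR sometimes learns formatting or shortcuts, and why credit assignment in long reasoning is hard.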
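
For orientation, the standard policy-gradient form behind this observation can be written out. This is the generic group-sampled estimator with a sequence-level advantage, not DelTA's own derivation; the symbols $G$, $A_i$, $z_t$, and $e_{y_t}$ are notation assumed here.

```latex
% G sampled responses y^{(i)} to prompt x, each with one scalar advantage A_i
\nabla_\theta J \;\approx\; \frac{1}{G} \sum_{i=1}^{G} A_i
  \sum_{t=1}^{|y^{(i)}|} \nabla_\theta \log \pi_\theta\!\left(y^{(i)}_t \mid x,\, y^{(i)}_{<t}\right)

% For a softmax over logits z_t, the per-token gradient with respect to the logits is
\frac{\partial \log \pi_\theta(y_t \mid \cdot)}{\partial z_t} \;=\; e_{y_t} - \pi_\theta(\cdot \mid \cdot)
```

Every token in response $i$ is scaled by the same $A_i$, so the net push on any given token is the advantage-weighted sum of its occurrences across positive and negative samples. How DelTA formalizes the resulting discriminative direction is not reproduced here.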

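Relation to wenjun's directions:

For code agent / tool agent RL, the final reward usually comes from unit tests, execution success, or environment completion, and the problem DelTA studies becomes more severe: which action or token in a long trajectory actually led to success?
It offers a theoretical reference for designing credit assignment from the trajectory level down to the action level / token level in agentic RL.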

#4. From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

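Link: https://arxiv.org/abs/2605.22074 ; HF: https://huggingface.co/papers/2605.22074
Source / date: arXiv, Submitted on 21 May 2026
Category: Post-training RL / RLVR / Curriculum / Credit Assignment
One-sentence core contribution: Splits a reference reasoning chain into verifiable subproblems and uses curriculum RL to turn "partial progress inside failed rollouts" into a trainable signal.

Why it matters:

This paper (SCRL) complements DelTA: DelTA leans toward explaining the token update mechanism, while SCRL leans toward an engineering solution for hard problems where correct rollouts are rare. Its key move is to change "only the final answer is verifiable" into "intermediate subgoals are verifiable too", which improves learning efficiency on sparse-reward tasks.

Relation to wenjun's directions:

Especially natural for long-horizon agents: decompose a complex task into executable, verifiable subtasks, checked for example through unit tests, file state, or environment observations.
Transferable to code agent RL: PR, debugging, and refactoring tasks can be given intermediate verifiable milestones.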
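
As a rough illustration of intermediate verifiable milestones for a code agent, the sketch below scores a rollout by how many ordered checks it passes before the first failure. It is not SCRL's algorithm; the `prefix_progress` function, the state keys, and the three milestones are hypothetical.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Check = Callable[[State], bool]

def prefix_progress(state: State, milestones: List[Check]) -> float:
    """Fraction of ordered milestones passed before the first failure."""
    passed = 0
    for check in milestones:
        if not check(state):
            break
        passed += 1
    return passed / len(milestones)

# Hypothetical milestones for a bug-fix task, ordered from easy to hard.
milestones: List[Check] = [
    lambda s: bool(s["compiles"]),
    lambda s: s["unit_tests_passed"] >= 3,
    lambda s: s["unit_tests_passed"] == s["unit_tests_total"],
]

# A failed rollout under a final-answer-only reward still made partial progress.
rollout_state: State = {"compiles": True, "unit_tests_passed": 3, "unit_tests_total": 5}
reward = prefix_progress(rollout_state, milestones)
print(round(reward, 3))
```

With a final-only reward this rollout would score 0; here it earns credit for two of three milestones, which is the kind of denser signal the section describes.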

#5. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

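Link: https://arxiv.org/abs/2605.22642 ; HF: https://huggingface.co/papers/2605.22642
Source / date: arXiv, Submitted on 21 May 2026
Category: LLM Agent / Post-training RL / Tool-use / Office Agent
One-sentence core contribution: For realistic multi-step spreadsheet tasks, proposes training a dedicated spreadsheet agent with RL fine-tuning instead of relying only on prompting a general-purpose LLM.

Why it matters:

A spreadsheet is a typical agent environment where state is explicit, operations are executable, and results are verifiable, while being closer to real office work than pure math problems. It is well suited as an intermediate setting for studying agentic RL: it has GUI/table state, formulas, multi-step dependencies, and a checkable final result.

Relation to wenjun's directions:

A good benchmark direction for code/office agent RL: it allows studying trajectory data, environment design, verifiable rewards, and tool abstraction.
Also relevant to "from instruction understanding to intent understanding": in spreadsheet tasks users often state only the business goal and do not spell out each cell operation.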

#6. Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

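Link: https://arxiv.org/abs/2605.22177 ; HF: https://huggingface.co/papers/2605.22177
Source / date: arXiv, Submitted on 21 May 2026
Category: LLM Agent / Tool-use / Post-training RL / Multi-model Orchestration
One-sentence core contribution: Uses RL to learn hierarchical orchestration across multiple models and skills, rather than fixed rules or a single LLM handling every skill.

Why it matters:

Agent capability increasingly resembles a "scheduling system": different models, tools, and skills each have strengths on different subtasks. Maestro treats scheduling/orchestration itself as the object of RL policy learning, an important transition from prompt engineering to a trainable agent controller.

Relation to wenjun's directions:

Can be combined with model-based agents: the controller can use a world model to predict the state or payoff after invoking a given skill.
Also suggestive for code intelligence: a code agent often has to schedule among search, editing, testing, static analysis, retrieval, and model calls.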

#7. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

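Link: https://arxiv.org/abs/2605.22535 ; HF: https://huggingface.co/papers/2605.22535
Source / date: arXiv, Submitted on 21 May 2026
Category: Evaluation / Code Agent / Tool-use / Long-horizon Agent
One-sentence core contribution: Automatically reverse-constructs 1,530 high-fidelity terminal tasks from 80,870 real terminal recordings, with 200 representative tasks manually verified.

Why it matters:

Terminal tasks are the base environment for code agents, devops agents, and research agents. Compared with hand-written toy benchmarks, real terminal recordings expose more of the "messy" long-tail operations: paths, permissions, dependencies, error recovery, and multi-command chains.

Relation to wenjun's directions:

Can serve as an environment source for agentic RL: real trajectories can be turned into imitation / offline RL / curriculum data.
Closes a loop with ACC: real terminal trajectories can be used as a benchmark and also compiled into long-context training data.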

#8. LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

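Link: https://arxiv.org/abs/2605.22012 ; HF: https://huggingface.co/papers/2605.22012
Source / date: arXiv, Submitted on 21 May 2026
Category: Latent Reasoning / Multimodal LLM / Representation
One-sentence core contribution: Argues that textual CoT compresses and distorts continuous audio-visual evidence, and proposes performing intermediate reasoning in a unified audio-visual latent space.

Why it matters:

It represents an important trend: for multimodal or continuous environments, language tokens are not necessarily the best medium for intermediate reasoning. For tasks that need fine-grained temporal alignment, spatial grounding, or joint audio-visual evidence, latent reasoning may be more faithful than an explicit text chain.

Relation to wenjun's directions:

Direct material for latent-space reasoning.
The takeaway for LLM agents: compressed environment state does not have to be transcribed into natural language; a latent state can be kept, with language responsible only for high-level decisions or explanations.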

#9. WorldKV: Efficient World Memory with World Retrieval and Compression

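Link: https://arxiv.org/abs/2605.22718 ; HF: https://huggingface.co/papers/2605.22718
Source / date: arXiv, Submitted on 21 May 2026
Category: World Model / Context Compression / Memory / Systems
One-sentence core contribution: For autoregressive video-diffusion world generation, proposes World Retrieval + World Compression to keep persistent world consistency while controlling KV cache cost.

Why it matters:

Although this is not an LLM agent paper, it addresses a more general problem: how to maintain "world memory" over a long rollout. A full KV cache preserves consistency but its cost grows linearly; a sliding window is fast but forgets; retrieval and compression are the compromise.

Relation to wenjun's directions:

A valuable analogy for long-horizon agent memory: long agent trajectories also need selective retrieval plus compressed state.
It can inspire a "general-purpose context compressor": rather than compressing all history, retrieve the relevant memory chunks for the current viewpoint/task.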
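
The agent-side analogy (selective retrieval plus a short recent window) can be sketched in a few lines. This is a text-level toy, not WorldKV's KV-cache mechanism; `score`, `retrieve`, the word-overlap relevance measure, and the example memory are all assumptions made for illustration.

```python
from typing import List

def score(query: str, chunk: str) -> int:
    """Toy relevance: number of lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(memory: List[str], query: str, k: int = 2, recent: int = 1) -> List[str]:
    """Keep the last `recent` chunks plus the top-k older chunks by relevance."""
    split = max(len(memory) - recent, 0)
    older, tail = memory[:split], memory[split:]
    ranked = sorted(older, key=lambda c: score(query, c), reverse=True)
    return ranked[:k] + tail

memory = [
    "ran pytest: 2 failures in test_io.py",
    "edited README typo",
    "installed numpy 1.26",
    "opened utils.py",
]

context = retrieve(memory, "fix failures in test_io.py", k=1, recent=1)
print(context)
```

A pure sliding window of size 1 would have dropped the pytest failure; retrieval brings it back while the context stays at two chunks instead of four.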

#10. Understanding Data Temporality Impact on Large Language Models Pre-training

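Link: https://arxiv.org/abs/2605.22769 ; HF: https://huggingface.co/papers/2605.22769
Source / date: arXiv, Submitted on 21 May 2026
Category: Pretraining Data / Continual Learning / Temporal Knowledge
One-sentence core contribution: Studies how the ordering of pretraining data affects the acquisition of time-sensitive factual knowledge, and builds an evaluation set of 7,000+ temporally grounded questions.

Why it matters:

LLM pretraining usually shuffles the corpus, but real-world knowledge has a temporal order. This work pulls "knowledge staleness / temporal grounding" back from a post-training or retrieval problem to a question of pretraining data organization: does data order affect how the model binds facts to their corresponding time periods?

Relation to wenjun's directions:

A useful reference for continual pretraining, data mixture, and temporal deduplication.
Also relevant to agents: an agent needs to distinguish "APIs/docs/code that were valid in the past" from "facts valid for the current version".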

#11. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

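Link: https://arxiv.org/abs/2605.22791 ; HF: https://huggingface.co/papers/2605.22791
Source / date: arXiv, Submitted on 21 May 2026
Category: Foundation Model Architecture / Efficient Attention / Long-context
One-sentence core contribution: Decouples the erase and write gates in linear attention, improving the editability of the compressed recurrent state.

Brief comment:

This line of work matters for "long context + low-cost inference". It is not an agent paper, but if an agent model is to run for a long time, how a fixed-size recurrent memory is updated stably becomes a key architectural question.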
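
For readers who want the state update in front of them, the first line below is the gated delta rule as I recall it from the earlier Gated DeltaNet work, where a single strength $\beta_t$ controls both erasing along $k_t$ and writing $v_t$. The second line is only a hypothetical way to write "decoupled erase and write"; the symbols $\beta^{e}_t$ and $\beta^{w}_t$ are assumptions, and the paper's actual parameterization is not reproduced here.

```latex
% Gated delta rule: decay \alpha_t, shared erase/write strength \beta_t
S_t \;=\; \alpha_t\, S_{t-1}\left(I - \beta_t\, k_t k_t^{\top}\right) \;+\; \beta_t\, v_t k_t^{\top}

% Hypothetical decoupled form: separate erase and write strengths
S_t \;=\; \alpha_t\, S_{t-1}\left(I - \beta^{e}_t\, k_t k_t^{\top}\right) \;+\; \beta^{w}_t\, v_t k_t^{\top}
```

In the shared form, removing old content along $k_t$ and adding new content are tied together; separating the two strengths would let the state erase without writing, or write without fully erasing.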

#12. Forecasting Downstream Performance of LLMs With Proxy Metrics

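Link: https://arxiv.org/abs/2605.18607 ; HF: https://huggingface.co/papers/2605.18607
Source / date: arXiv, Submitted on 18 May 2026; listed in HF Daily Papers
Category: Foundation Model Training / Evaluation / Scaling / Pretraining
One-sentence core contribution: Builds proxy metrics from token-level statistics of a candidate model's next-token distribution to predict downstream capability, addressing the mismatch between loss and real capability.

Brief comment:

Valuable for understanding foundation model training: if a proxy metric can predict capability formation earlier, it reduces blind large-scale downstream evaluation runs and can guide early selection of data, architecture, and training recipe.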

#13. Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

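Link: https://arxiv.org/abs/2605.15669 ; HF: https://huggingface.co/papers/2605.15669
Source / date: arXiv, Submitted on 15 May 2026; listed in HF Daily Papers
Category: Code Agent / Evaluation / Execution Feedback
One-sentence core contribution: Builds a large-scale benchmark for synthesizing executable DRC scripts from natural-language design rules, with emphasis on execution correctness and test generation.

Brief comment:

This is a vertical-domain code agent benchmark, but the methodology is general: evaluate with execution feedback and generated tests rather than code similarity. Worth consulting for reward design in agentic coding.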
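
A small sketch shows why similarity is a poor stand-in for execution correctness. The example uses a toy Python function in place of a real DRC script (real DRC rule decks are not Python); the `min_width` rule, the test cases, and the helper names are hypothetical.

```python
import difflib
from typing import List, Tuple

# Toy "rule": a width is legal when it is at least 0.5 units.
reference = "def min_width(w):\n    return w >= 0.5\n"
candidate = "def min_width(w):\n    return w > 0.5\n"   # off-by-boundary bug

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def passes(source: str, cases: List[Tuple[float, bool]]) -> bool:
    """Execute the toy script and check it against input/expected pairs."""
    namespace: dict = {}
    exec(source, namespace)  # acceptable here only because the input is a trusted toy
    return all(namespace["min_width"](x) == expected for x, expected in cases)

cases = [(0.4, False), (0.5, True), (0.6, True)]

sim = similarity(reference, candidate)
ref_ok = passes(reference, cases)
cand_ok = passes(candidate, cases)
print(f"similarity={sim:.2f} reference_passes={ref_ok} candidate_passes={cand_ok}")
```

The candidate differs from the reference by a single character, so a similarity-based reward would score it almost perfectly, while the boundary test at 0.5 rejects it.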

#14. Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

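Link: https://arxiv.org/abs/2605.20244 ; HF: https://huggingface.co/papers/2605.20244
Source / date: arXiv, Submitted on 18 May 2026; listed in HF Daily Papers
Category: Code Agent / Formal Methods / Agentic Search
One-sentence core contribution: Proposes a retrieval-augmented agentic framework for multi-objective controllable refactoring of Lean proofs, optimizing length, compilation cost, and version compatibility.

Brief comment:

Formal proof is an excellent testbed for code agents: feedback is verifiable, the search space is large, and objectives are diverse. Lean Refactor focuses on how "correct but verbose or fragile" LLM output can be improved through agentic strategy search.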

#15. GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

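Link: https://arxiv.org/abs/2605.21605 ; HF: https://huggingface.co/papers/2605.21605
Source / date: arXiv, Submitted on 20 May 2026
Category: LLM Agent / Self-evolving Agent / Tool-use / Distillation
One-sentence core contribution: Models open-ended image generation as a tool-orchestration trajectory and achieves self-evolution through visual experience distillation.

Brief comment:

Although the task is image generation, the paradigm of "tool trajectory → experience distillation → agent self-improvement" is close to that of a self-evolving code agent. The crux is how to define experience quality and how to turn successful and failed trajectories into generalizable strategies.

#2. GitHub / repo / model / dataset Updates

GitHub search was filtered with pushed:>2026-05-20 and serves only as a set of engineering leads "worth opening and following up"; some of these repositories may be early-stage or low-star projects whose code quality needs further review.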

#2.1 google/adk-python

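Link: https://github.com/google/adk-python
Source / date: GitHub, updated 2026-05-23T00:00:40Z (at retrieval time)
Category: LLM Agent / Tool-use / Framework
In one sentence: The Python version of Google's Agent Development Kit, aimed at building, evaluating, and deploying agents.
Why follow up: ADK-style frameworks are defining the engineering interface for agents; it is worth watching whether their evaluation, tool schema, and state/session design can support RL data collection.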

#2.2 prototypebench/prototypebench

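Link: https://github.com/prototypebench/prototypebench
Source / date: GitHub, updated 2026-05-22T17:26:57Z
Category: Code Agent / Evaluation / RLVR
In one sentence: A coding agent benchmark for full-stack feature shipping, with PR-mined tasks, tests, and execution-based scoring.
Why follow up: If the quality holds up, this is an agentic coding evaluation setting closer to real software engineering than single-file bug fixing.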

#2.3 weizhepei/RELEX

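Link: https://github.com/weizhepei/RELEX
Source / date: GitHub, updated 2026-05-22T18:02:09Z
Category: Post-training RL / RLVR
In one sentence: The repository description reads "You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories".
Why follow up: If the paper and code are complete, it may provide a low-cost RLVR training recipe; this judgment currently rests on GitHub metadata alone and needs further reading.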

#2.4 anakin87/llm-rl-environments-lil-course

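Link: https://github.com/anakin87/llm-rl-environments-lil-course
Source / date: GitHub, updated 2026-05-22T17:12:15Z
Category: LLM Agent / RL Environment / Education
In one sentence: A short course on building RL environments for evaluating and training language models.
Why follow up: A practical reference for setting up agent RL experiment environments and a verifiable reward pipeline.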

#2.5 GreyhavenHQ/greywall

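Link: https://github.com/GreyhavenHQ/greywall
Source / date: GitHub, updated 2026-05-22T23:55:50Z
Category: Code Agent / Systems / Safety Sandbox
In one sentence: A deny-by-default sandbox for AI coding agents, using kernel-level filesystem, network, and syscall isolation.
Why follow up: A code agent that performs real execution and RL data collection needs a sandbox as infrastructure; the isolation design affects how far experiments can scale.

#3. Today's Top 3 Papers to Read Closely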

ACC: Compiling Agent Trajectories for Long-Context Training

https://arxiv.org/abs/2605.21850

Keywords: agent trajectory as data, long-context supervision, tool observation integration.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

https://arxiv.org/abs/2605.22138

Keywords: world model, self-regulated planning, model-based LLM agent.

DelTA / SCRL (pick one, or read both back to back)

DelTA: https://arxiv.org/abs/2605.21467

SCRL: https://arxiv.org/abs/2605.22074

Keywords: RLVR credit assignment, verifiable subproblems, token/trajectory-level learning signal.

#4. Today's Top 3 repos / models / datasets to Follow

google/adk-python: https://github.com/google/adk-python

What to look at: the interface and evaluation design of a mainstream agent framework.

prototypebench/prototypebench: https://github.com/prototypebench/prototypebench

What to look at: a full-stack coding agent benchmark that may suit agentic RL / SWE task research.

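weizhepei/RELEX: https://github.com/weizhepei/RELEX

What to look at: the low-cost RLVR / rank-1 trajectory direction; the paper and implementation still need to be verified.

#5. Research Opportunities / Ideas

#Idea 1: Agent trajectory compilation + verifiable subgoals as a warm start for long-trajectory RL

Combine ACC's trajectory compilation with SCRL's verifiable subproblems:

First extract observation evidence, action dependencies, and intermediate state from real agent trajectories;
then split the long task into verifiable subgoals, such as tests passing, a file diff satisfying constraints, or the environment state reaching some condition;
use these subgoals for an SFT / offline RL warm start, then continue with online RL.

Key question: how can "subgoal boundaries" and "verifiable states" be discovered automatically from trajectories? This may be more central to agent RL than the details of PPO.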

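#Idea 2: A planning budget controller for LLM agents

Building on the idea in Efficient Agentic Reasoning, study a dedicated small model or module that decides:

when to act directly;
when to simulate a few steps into the future;
when to retrieve from historical memory;
when to stop thinking and execute.

Possible data sources: the "wasted thinking length" and "state just before a wrong action" in successful and failed episodes from TerminalWorld / coding benchmark trajectories. The goal is not longer CoT but higher tokens-to-success efficiency.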
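
One possible way to operationalize tokens-to-success efficiency is total tokens spent divided by the number of successful episodes. The briefing does not define the metric, so this definition, the `tokens_per_success` helper, and the episode records below are assumptions.

```python
from typing import Dict, List

def tokens_per_success(episodes: List[Dict]) -> float:
    """Total tokens spent across all episodes per successful episode."""
    total_tokens = sum(e["tokens"] for e in episodes)
    successes = sum(1 for e in episodes if e["success"])
    return total_tokens / successes if successes else float("inf")

# Hypothetical episode logs: tokens include both thinking and acting.
episodes = [
    {"tokens": 1200, "success": True},
    {"tokens": 3000, "success": False},  # long deliberation that still failed
    {"tokens": 800, "success": True},
]

efficiency = tokens_per_success(episodes)
print(efficiency)
```

Counting the tokens of failed episodes in the numerator is deliberate: a controller that thinks at length and still fails should look worse than one that fails cheaply.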

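#Idea 3: A "world KV" analogy for agent memory: retrieval-based compression rather than summary-based compression

WorldKV offers a good analogy: memory over a long rollout should not rely only on a sliding window or a global summary, but should retrieve scene/task-relevant chunks according to the current state. For LLM agents one could design:

action/observation KV chunks;
state-aware retrieval;
compression with verifiability (after compression, key tests and reasoning are still supported).

This can be validated in terminal/code agent environments: does the compression strategy keep the minimal state needed to complete the task?

#6. Source Index

Hugging Face Papers API: https://huggingface.co/api/daily_papers , https://huggingface.co/api/papers
arXiv abstracts: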

- https://arxiv.org/abs/2605.21850

- https://arxiv.org/abs/2605.22138

- https://arxiv.org/abs/2605.21467

- https://arxiv.org/abs/2605.22074

- https://arxiv.org/abs/2605.22642

- https://arxiv.org/abs/2605.22177

- https://arxiv.org/abs/2605.22535

- https://arxiv.org/abs/2605.22012

- https://arxiv.org/abs/2605.22718

- https://arxiv.org/abs/2605.22769

- https://arxiv.org/abs/2605.22791

- https://arxiv.org/abs/2605.18607

- https://arxiv.org/abs/2605.15669

- https://arxiv.org/abs/2605.20244

- https://arxiv.org/abs/2605.21605

GitHub Search API: searched with pushed:>2026-05-20 for keywords such as LLM agent RL, code agent, latent reasoning, context compression, and RLVR.
