Daily Research 2026-05-31 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

#2026-05-31 AI/LLM briefing: latest papers and research hotspots

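Retrieved: 2026-05-31 08:00 (Asia/Shanghai)
Main coverage: highly relevant papers from Hugging Face Daily Papers around 2026-05-28/29, recent arXiv entries, and the GitHub repositories linked to those papers.
Access limits: the arXiv API returned 429/timeout on batch queries, so today's briefing relies mainly on Hugging Face Papers pages and paper-page metadata, with repository activity cross-checked through the GitHub API. X/Twitter was not used as a primary source, to avoid citing unverifiable social-media rumors.
Time window: strictly relevant items from the last 24 hours were too sparse, so the window was widened to roughly the last 3-7 days. Selection focused on items related to wenjun's interests: LLM agents, code intelligence, model-based RL / world models, latent-space / representation reasoning, post-training RL, long-term memory, and foundation-model training mechanisms.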

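#One-line overview

The strongest signal today: agent training is shifting from "run SFT/RL on an existing benchmark" to "environment factories" that are generatable, verifiable, and interactive. At the same time, world models, long-term memory, asynchronous tool calling, verifiable rewards, and contamination detection for RL post-training are entering the LLM agent research stack more systematically. For wenjun, what matters most is not any single model's results but the training mechanisms these works jointly point to: how to construct environments, how to assign credit / reward over long trajectories, and how to let an agent's memory and skills self-evolve through interaction.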

#Top pick 1: LiteCoder-Terminal, training long-horizon language agents with synthetic executable terminal environments

Title: LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents
Link: https://huggingface.co/papers/2605.29559
arXiv: https://arxiv.org/abs/2605.29559
Repo: https://github.com/icip-cas/LiteCoder
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: 2026-05-28
Category: Code Agent / LLM Agent / Post-training RL / Tool-use / Evaluation
Core contribution in one sentence: proposes LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that automatically generates executable, verifiable terminal training environments from domain specifications, and builds SFT and preference-optimization data to train long-horizon command-line agents.

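#Why it matters

This paper is close to wenjun's recent interest in self-evolving code agents / agentic RL. Its point is not "yet another terminal benchmark"; it turns the training environment itself into a scalable artifact:

Synthesizable environments: no longer fully dependent on scraped GitHub repos; tasks can be generated to target specific capability gaps.
Executable feedback: terminal tasks naturally come with state, errors, a file system, and command output, which suits multi-step credit assignment.
Verifiable supervision: easier to judge automatically than open-ended web/GUI tasks, and better suited to RL.
Complete training chain: includes post-training signals such as expert trajectories, SFT, and trajectory-level preference optimization / DMPO.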

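#Relation to wenjun's research directions

If wenjun wants to pursue agentic RL or model-based RL for code agents, the main lesson from LiteCoder is that designing a new RL algorithm need not come first; one can start by designing a terminal world that is synthesizable, resettable, verifiable, and open to intervention. Further questions:

Can the terminal environment be abstracted into a training ground for an LLM agent's "small world model"?
Can the agent's failed trajectories be recorded and used to generate the next curriculum batch automatically?
Can environment state, command output, and file diffs be compressed into a latent state for model-based planning?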
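
To make "resettable and verifiable" concrete, here is a minimal sketch of a toy terminal environment. It is not LiteCoder's implementation: `TerminalTask`, `TerminalEnv`, and the example task are hypothetical names introduced for illustration, and the sketch assumes a POSIX shell with `cp` and `mkdir` available.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Optional


@dataclass
class TerminalTask:
    """Hypothetical task spec: instruction, initial files, programmatic verifier."""
    instruction: str
    initial_files: Dict[str, str]
    verifier: Callable[[Path], bool]


class TerminalEnv:
    """Toy resettable, verifiable terminal environment (illustration only)."""

    def __init__(self, task: TerminalTask) -> None:
        self.task = task
        self._tmp: Optional[tempfile.TemporaryDirectory] = None
        self.workdir: Optional[Path] = None

    def reset(self) -> str:
        # Resettable: every episode starts from a fresh working directory.
        self.close()
        self._tmp = tempfile.TemporaryDirectory()
        self.workdir = Path(self._tmp.name)
        for rel_path, content in self.task.initial_files.items():
            target = self.workdir / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        return self.task.instruction

    def step(self, command: str) -> Dict[str, object]:
        if self.workdir is None:
            raise RuntimeError("call reset() before step()")
        # Executable feedback: stdout, stderr and exit code form the observation.
        proc = subprocess.run(command, shell=True, cwd=self.workdir,
                              capture_output=True, text=True, timeout=10)
        # Verifiable: success is decided by code, not by a judge model.
        return {"stdout": proc.stdout, "stderr": proc.stderr,
                "exit_code": proc.returncode,
                "solved": self.task.verifier(self.workdir)}

    def close(self) -> None:
        if self._tmp is not None:
            self._tmp.cleanup()
            self._tmp = None
            self.workdir = None


def backup_exists(root: Path) -> bool:
    copy = root / "backup" / "notes.txt"
    return copy.is_file() and copy.read_text() == "hello\n"


task = TerminalTask("Copy notes.txt to backup/notes.txt",
                    {"notes.txt": "hello\n"}, backup_exists)
env = TerminalEnv(task)
env.reset()
first = env.step("cp notes.txt backup/notes.txt")  # backup/ is missing, so cp fails
second = env.step("mkdir -p backup && cp notes.txt backup/notes.txt")
env.close()
```

The failed first step still yields a usable observation (a non-zero exit code and stderr), which is the kind of intermediate signal that multi-step credit assignment can build on.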

#Top pick 2: Skill0.5, agentic RL that balances skill internalization against external invocation

Title: Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning
Link: https://huggingface.co/papers/2605.28424
arXiv: https://arxiv.org/abs/2605.28424
Repo: https://github.com/JasonZhujp/Skill0_5
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: 2026-05-27
Category: LLM Agent / Post-training RL / Tool-use / Continual Learning
Core contribution in one sentence: proposes Skill0.5, which internalizes general skills into the model while keeping task-specific skills available for external use, and uses a difficulty-aware router to choose dynamically between the two, to improve OOD generalization.

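#Why it matters

Existing skill-based agents tend to swing between two extremes:

Fully external: skills live in a tool library / memory store / context. Flexible, but context cost is high and retrieval is unstable.
Fully internalized: skills are written into parameters via SFT/RL. Cheap at inference, but prone to overfitting, forgetting, or skill conflicts.

Skill0.5 frames the problem in a way closer to real agents: some skills should become the model's "muscle memory", while others should remain external assets that are retrievable, replaceable, and versionable.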

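#Relation to wenjun's research directions

This connects directly to how agent pre-training data shapes capabilities, and to continual learning:

Which skills deserve to enter the parameters, and which should stay in tools/memory?
For a code agent, should general debugging strategies, test-first habits, and error-localization patterns be internalized? Should specific repo APIs and project conventions be kept external?
In long-horizon RL, can the skill router act as a high-level policy, with low-level skills as options?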
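
The "router as high-level policy, skills as options" question can be sketched as below. This is not Skill0.5's router: the difficulty proxy (share of unfamiliar tokens), the `Skill` type, the 0.5 threshold, and the `acme-cli` example are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Skill:
    name: str
    source: str  # "internal" (in the weights) or "external" (retrieved asset)


def estimate_difficulty(task: str, familiar_terms: Set[str]) -> float:
    """Hypothetical proxy: share of task tokens the agent is not familiar with."""
    tokens = task.lower().split()
    if not tokens:
        return 0.0
    return sum(tok not in familiar_terms for tok in tokens) / len(tokens)


def route(task: str, internal: Skill, external_library: Dict[str, Skill],
          familiar_terms: Set[str], threshold: float = 0.5) -> Skill:
    """High-level policy: pick which option (skill) handles the task."""
    # Easy, in-distribution tasks stay with the internalized skill.
    if estimate_difficulty(task, familiar_terms) < threshold:
        return internal
    # Hard tasks look for a matching external skill by keyword.
    tokens = set(task.lower().split())
    for keyword, skill in external_library.items():
        if keyword in tokens:
            return skill
    # No external match: fall back to the internalized skill.
    return internal


familiar = {"fix", "the", "failing", "unit", "test"}
internal_skill = Skill("generic-debug-loop", "internal")
library = {"acme-cli": Skill("acme-cli-deploy-guide", "external")}

easy = route("fix the failing unit test", internal_skill, library, familiar)
hard = route("deploy staging with acme-cli rollout", internal_skill, library, familiar)
```

In an actual RL setup the threshold rule would be replaced by a learned policy; the sketch only fixes the interface: task in, one option out.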

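#Top pick 3: minWM + YoCausal, video world models move toward interactivity and testable causality

#3.1 minWM

Title: minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
Link: https://huggingface.co/papers/2605.30263
arXiv: https://arxiv.org/abs/2605.30263
Repo: https://github.com/shengshu-ai/minWM
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: 2026-05-28
Category: Model-based RL / World Model / Systems
Core contribution in one sentence: provides a full-stack open-source framework that goes from a video diffusion foundation model to a real-time interactive world model, covering data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference.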

#3.2 YoCausal

Title: YoCausal: How Far is Video Generation from World Model? A Causality Perspective
Link: https://huggingface.co/papers/2605.30346
arXiv: https://arxiv.org/abs/2605.30346
Repo: https://github.com/youzhe0305/YoCausal
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: 2026-05-28
Category: Model-based RL / Evaluation / World Model
Core contribution in one sentence: evaluates from a causal-cognition perspective whether video generation models really have world-model capability, using time-reversed videos to construct natural counterfactuals and proposing metrics such as the Reverse Surprise Index.

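#Why it matters

The two papers are worth reading together: minWM represents "how to engineer a generative model into an interactive world model", while YoCausal asks "does such a model actually understand causality". This closely parallels the world-model route for LLM agents:

An LLM can roll out future text / tool feedback, but is that equivalent to a world model?
If it only fits trajectory statistics without learning causal structure, model-based planning can easily collapse out of distribution.
An agent world model must not only predict the next observation but also support intervention: what happens if this command is executed, this file is modified, or this API is called?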

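#Relation to wenjun's research directions

For Dreamer for LLM Agent / model-based RL language agents, the core research questions can borrow from video world models:

How to compress interaction history into a latent state?
How to learn action-conditioned dynamics?
How to verify that the model has not merely memorized trajectory co-occurrences but can actually make counterfactual predictions under intervention?
How to use a learned simulator to safely generate training data or plan candidate trajectories?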

#Top pick 4: WorldMemArena, multimodal agent memory is not just recall but state maintenance within action-world interaction

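Title: WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
Link: https://huggingface.co/papers/2605.29341
arXiv: https://arxiv.org/abs/2605.29341
Repo: https://github.com/UCSB-AI/WorldMemArena
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: 2026-05-28
Category: LLM Agent / Evaluation / Memory / Tool-use
Core contribution in one sentence: proposes a memory evaluation framework for multimodal long-horizon agents, stressing that memory must track an evolving world, update outdated information, and retrieve evidence at decision time, rather than perform static dialogue recall.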

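#Why it matters

Many agent memory benchmarks still stop at question-answering-style recall, but a real agent's memory has four stages: write, maintain, retrieve, and use. WorldMemArena evaluates memory back inside action-world interaction, which is closer to long-running tasks.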

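#Relation to wenjun's research directions

This matters for long-horizon RL, agent pre-training data, and environment design alike: if an agent's memory policy affects future states, then memory operations are themselves actions. Options to consider:

Include "what to write, what to delete, when to retrieve" in the RL action space;
Use a verifiable environment to check whether a memory is stale;
Study whether memory compression can serve as an intermediate form of latent state learning.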

#Top pick 5: CorVer / RUBRIC-ARROW / LaRA, rewards and evaluation for post-training RL are becoming finer-grained

#5.1 CorVer

Title: Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering
Link: https://huggingface.co/papers/2605.29648
arXiv: https://arxiv.org/abs/2605.29648
Repo: https://github.com/shichengf/CorVer
Source: Hugging Face Daily Papers / arXiv / GitHub
Date: 2026-05-28
Category: Post-training RL / RLVR / Evaluation
Core contribution in one sentence: proposes CorVer, which uses corpus-grounded signals such as corpus co-occurrence in place of expensive NLI / LLM judges to provide sentence-level process rewards for factual question answering.

#5.2 RUBRIC-ARROW

Title: RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
Link: https://huggingface.co/papers/2605.29156
arXiv: https://arxiv.org/abs/2605.29156
Source: Hugging Face Daily Papers / arXiv
Date: 2026-05-27
Category: Post-training RL / Reward Modeling / Evaluation
Core contribution in one sentence: improves pointwise reward modeling in non-verifiable domains with an alternately trained rubric generator and rubric-conditioned judge, reducing the ties caused by aggregating hard Boolean rubrics.

#5.3 LaRA

Title: LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
Link: https://huggingface.co/papers/2605.29888
arXiv: https://arxiv.org/abs/2605.29888
Source: Hugging Face Daily Papers / arXiv
Date: 2026-05-28
Category: Post-training RL / Evaluation / Mechanistic Analysis
Core contribution in one sentence: proposes layer-wise representation analysis, which detects data contamination in RL post-training from geometric shifts across model layers, complementing output-level detection that looks only at likelihood/entropy.

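#Why it matters

RLVR is expanding from math/code to factual QA, subjective tasks, and more complex agent trajectories, but reward reliability has become the bottleneck. Today's three papers each push on a different front:

CorVer: lowers the cost of process rewards;
RUBRIC-ARROW: makes rubric rewards for non-verifiable tasks more trainable;
LaRA: checks whether RL post-training is contaminated, so that benchmark memorization is not mistaken for generalization.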

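#Relation to wenjun's research directions

For code agents / long-horizon agents, reward should not come only from a final pass/fail. A layered reward can be constructed:

Environment level: whether tests pass and the task is completed;
Process level: whether each action advances the state;
Representation level: whether RL has really learned a new policy rather than contamination/memorization;
Rubric level: interpretable feedback for design, refactoring, and documentation tasks that cannot be fully verified.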
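
One way to combine these layers into a single scalar is a weighted sum, sketched below. The weights, the `[0, 1]` rubric scale, and the function name are assumptions for illustration, not values from any of the papers above. The representation-level check is left out of the sum on purpose: it reads more naturally as an offline audit of the trained policy than as a per-trajectory reward term.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; in practice these would need tuning."""
    env: float = 1.0
    process: float = 0.3
    rubric: float = 0.2


def layered_reward(tests_passed: bool,
                   step_progress: Sequence[bool],
                   rubric_score: float,
                   weights: Optional[RewardWeights] = None) -> float:
    """Weighted sum of environment-, process-, and rubric-level signals."""
    if not 0.0 <= rubric_score <= 1.0:
        raise ValueError("rubric_score is assumed to lie in [0, 1]")
    w = weights or RewardWeights()
    # Environment level: final pass/fail from the test suite.
    env_term = 1.0 if tests_passed else 0.0
    # Process level: fraction of steps that advanced the state.
    process_term = sum(step_progress) / len(step_progress) if step_progress else 0.0
    # Rubric level: judged quality for parts that cannot be fully verified.
    return w.env * env_term + w.process * process_term + w.rubric * rubric_score


progress = [True, True, False, True]  # three of four steps advanced the state
failed_run = layered_reward(False, progress, rubric_score=0.5)
passed_run = layered_reward(True, progress, rubric_score=0.5)
```

With these assumed weights, a failed run that made steady progress still receives a non-zero reward, which is the point of not relying on the final pass/fail alone.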

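#Other papers and developments worth a glance

Category	Title	Link	Date	Core contribution in one sentence	Quick take
LLM Agent / Safety	AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security	https://huggingface.co/papers/2605.29801	2026-05-28	A lightweight safety alignment framework for open-environment agents of the OpenClaw/Codex kind; updates the agent safety taxonomy and builds a taxonomy-guided data engine.	Highly relevant to the safety of agents that really execute actions, especially tool calls, file systems, and shell execution.
Retrieval / Tool-use	OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources	https://huggingface.co/papers/2605.29250	2026-05-28	Instead of squeezing all knowledge sources into a single vector space, it identifies the appropriate repository and dispatches the query through its native query engine.	A lesson for agent tool retrieval: the unifying layer should orchestrate structured capabilities rather than flatten structure.
Code/Tool Agent	CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval	https://huggingface.co/papers/2605.29271	2026-05-28	Iteratively co-trains an LLM query rewriter and a dense encoder to improve tool retrieval over large API catalogs.	A good fit for API/tool retrieval in code agents; can be combined with repository-specific memory.
Tool-use / Evaluation	AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios	https://huggingface.co/papers/2605.27995	2026-05-27	Evaluates whether agents, under multi-task settings with delayed tool feedback, make use of waiting time and coordinate asynchronous calls.	Real agent execution is often bound by I/O latency, so this capability affects efficiency on long-running tasks.
GUI Agent	PhoneWorld: Scaling Phone-Use Agent Environments	https://huggingface.co/papers/2605.29486	2026-05-28	Turns real GUI trajectories and screenshots into controllable mobile environments, executable tasks, automatic verifiers, and training rollouts.	Follows the same "environment factory" route as LiteCoder; transferable to IDE/web/terminal agents.
GUI Agent	UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents	https://huggingface.co/papers/2605.29534	2026-05-28	Uses reusable app-specific graph knowledge to help small-model GUI agents plan tasks.	A useful reference for small local agents, privacy-sensitive settings, and skill graphs.
Multimodal Agent	PANDO: Efficient Multimodal AI Agents via Online Skill Distillation	https://huggingface.co/papers/2605.24785	2026-05-26	Makes web agents more efficient as experience accumulates, through online skill distillation, a skill library, reflection, and demotion.	Highly relevant to self-evolving agents: experience should not just be stored as trajectories but distilled into reusable skills.
Program-of-Thought	REPOT: Recoverable Program-of-Thought via Checkpoint Repair	https://huggingface.co/papers/2605.30052	2026-05-28	Performs deterministic verified replay on action programs generated by PoT, and repairs at the first illegal transition with a single LLM call.	Practical for code/planning agents: turns "the whole trajectory failed" into "verified prefix + local repair".
Context / KV	CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM	https://huggingface.co/papers/2605.24786	2026-05-24	Dynamically adjusts the KV cache budget according to the confidence of the next-token distribution, keeping more context when uncertain.	Can be viewed as an inference-time context compression strategy; useful for cutting the cost of long-horizon agents.
Token Compression	EarlyTom: Early Token Compression Completes Fast Video Understanding	https://huggingface.co/papers/2605.30010	2026-05-28	Compresses visual tokens early in the vision encoder, reducing Video-LLM TTFT and compute cost.	Multimodal in focus, but the idea of "compress early rather than late" may transfer to general context compressors.
Continual Learning	How LoRA Remembers? A Parametric Memory Law for LLM Finetuning	https://huggingface.co/papers/2605.30260	2026-05-28	Uses LoRA as a controllable probe to establish a power law for exact parametric memory, and studies how parameter count and sequence length relate to memory capacity.	Directly relevant to continual learning, knowledge injection, and agent skill internalization.
Representation Steering	UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering	https://huggingface.co/papers/2605.30076	2026-05-28	Learns a text-guided conditional velocity field in activation space for more general behavior steering.	Related to latent-space reasoning / activation-level control; worth watching whether it can support intermediate-state planning.
Belief State	When Should Models Change Their Minds? Contextual Belief Management in Large Language Models	https://huggingface.co/papers/2605.30219	2026-05-28	Proposes CBM/BeliefTrack to evaluate when models update, retain, or ignore contextual information.	Important for belief-state updating in long-horizon agents, and close to a POMDP view.
AI Scientist	CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists	https://huggingface.co/papers/2605.26029	2026-05-28	Builds a synthetic laboratory environment that requires agents to recover an SCM through interventions and predict held-out outcomes.	A strongly model-based agent benchmark: it demands active experimentation, modeling, and counterfactual reasoning.
Multimodal Deep Research	Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation	https://huggingface.co/papers/2605.29861	2026-05-28	Uses a multi-agent harness to generate verifiable deep research reports with interleaved text and visual evidence.	Relevant to research assistants / report generation; the key is the source-aligned intermediate state.
Scaling Mechanism	Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention	https://huggingface.co/papers/2605.29548	2026-05-28	Uses a synthetic task-mixture setup to analyze why larger models retain rare/complex tasks better, emphasizing capacity and gradient interference.	A useful reference for foundation-model training mechanisms, long-tail data, and capability formation.
Agentic Image	GenClaw: Code-Driven Agentic Image Generation	https://huggingface.co/papers/2605.30248	2026-05-28	Moves image-generation agents from prompt rewriting toward code-driven canvas manipulation.	Not on the main code-intelligence line, but "code as a controllable intermediate representation" is worth noting.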

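#Top 3 papers to read closely today

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Why read closely: closest to code agents + long-horizon RL + verifiable environments; the environment synthesis, trajectory data, and DMPO setup deserve particular attention.

Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Why read closely: addresses head-on the core agent-learning question of whether skills should be internalized or externalized, and links well to continual learning and tool/memory design.

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Why read closely: moves memory from static recall to dynamic world interaction; key for the state/memory/retrieval policy of long-horizon agents.

Alternate: if today's reading leans toward model-based RL, swap the third paper for minWM + YoCausal read as a pair.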

#Top 3 repos / models / datasets to follow today

LiteCoder: https://github.com/icip-cas/LiteCoder

- GitHub check: still being updated as of 2026-05-30; described as "Advancing Small and Medium-sized Code Agents".

- Follow-up points: environment generation scripts, SFT/RL data formats, verifier design, and whether it can extend to repo-level coding.

minWM: https://github.com/shengshu-ai/minWM

- GitHub check: still being updated as of 2026-05-30, about 385 stars; positioned as a real-time interactive world model framework.

- Follow-up points: action-conditioned rollout, streaming inference, few-step distillation; can serve as an engineering analogy for LLM world models.

WorldMemArena: https://github.com/UCSB-AI/WorldMemArena

- GitHub check: still being updated as of 2026-05-30.

- Follow-up points: per-stage evaluation of memory write/maintenance/retrieval/use; whether it can be adapted into a memory benchmark for text/code agents.

Also worth a look: Skill0.5 repo (https://github.com/JasonZhujp/Skill0_5), CorVer repo (https://github.com/shichengf/CorVer), ParametricMemoryLaw repo (https://github.com/zjunlp/ParametricMemoryLaw).

#Research opportunities / ideas

#Idea 1: build a "Dreamer-style terminal world model" for code agents

Starting from executable environments such as LiteCoder-Terminal, record (state, action, observation, reward):

state: file-tree summary, key-file embeddings, recent command output, test status;
action: shell command, edit patch, tool call;
observation: stdout/stderr, diff, test result;
reward: local verifier + final test result.
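
The four-part record above could be stored as one JSON line per step. The sketch below is one possible layout, not a format taken from LiteCoder; every class and field name is an assumption chosen to mirror the list.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class State:
    file_tree_summary: str
    key_file_embedding: List[float]  # assumed to come from some encoder
    recent_output: str
    tests_passing: int
    tests_total: int


@dataclass
class Action:
    kind: str      # "shell", "edit", or "tool"
    payload: str   # command text, patch, or serialized tool call


@dataclass
class Observation:
    stdout: str
    stderr: str
    diff: str
    test_result: str


@dataclass
class Transition:
    state: State
    action: Action
    observation: Observation
    local_reward: float   # from a per-step verifier
    final_reward: float   # from the final test run; 0.0 until the episode ends


def to_jsonl_line(transition: Transition) -> str:
    """Serialize one transition as a single JSON line for a trajectory log."""
    return json.dumps(asdict(transition), ensure_ascii=False)


example = Transition(
    state=State("src/ tests/ README.md", [0.1, 0.2], "", 3, 5),
    action=Action("shell", "pytest -q tests/test_io.py"),
    observation=Observation("2 failed, 3 passed", "", "", "fail"),
    local_reward=0.0,
    final_reward=0.0,
)
line = to_jsonl_line(example)
restored = json.loads(line)
```

Keeping `local_reward` and `final_reward` as separate fields leaves the weighting between them open until training time.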

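Then train a latent dynamics model: given a latent state and an action, predict the next observation/reward or the failure mode. Research questions:

Can an LLM agent's world model predict only "task-relevant state changes" rather than the full text?
Can such a learned simulator be used for imagination rollouts to reduce the cost of real execution?
How can one detect whether the simulator is producing spurious successful trajectories?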

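#Idea 2: criteria for deciding between skill internalization and externalization

Combining Skill0.5, PANDO, and the LoRA memory law, one can pose a systematic question:

For a code agent, which knowledge should be written into parameters, and which should be stored as external skills/memory/tools?

Possible experiment:

Pick a set of general coding skills: debug loop, test-first, API reading, error localization;
Pick a set of repo-specific skills: project commands, directory structure, internal APIs;
Train with each of three approaches: LoRA internalization, skill-library externalization, and a hybrid router;
Compare on OOD repos, context cost, forgetting, and transfer ability.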

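#Idea 3: treat the memory policy as an RL action rather than an auxiliary engineering module

WorldMemArena and CBM both indicate that the key for long-horizon agents is not "having unlimited context" but knowing when to update a belief, when to keep it, and when to ignore new input. Memory operations can be modeled explicitly as actions:

write(memory), update(memory_id), delete(memory_id), retrieve(query), ignore(obs);
reward comes not only from the final task but also from whether memory is used correctly in later steps;
introduce a stale-memory penalty and a contradiction verifier.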
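
A minimal sketch of the memory action space with a stale-memory penalty follows. The substring retrieval, the age-based staleness rule, and all names are simplifying assumptions. `ignore(obs)` is a no-op and therefore omitted, and the contradiction verifier is left out because it would need a task-specific checker.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MemoryItem:
    text: str
    last_updated: int  # environment step of the last write or update


class MemoryStore:
    """Toy memory whose operations double as agent actions (illustration only)."""

    def __init__(self, max_age: int = 5, stale_penalty: float = 1.0) -> None:
        self.items: Dict[int, MemoryItem] = {}
        self.max_age = max_age
        self.stale_penalty = stale_penalty
        self._next_id = 0

    def write(self, text: str, step: int) -> int:
        memory_id = self._next_id
        self.items[memory_id] = MemoryItem(text, step)
        self._next_id += 1
        return memory_id

    def update(self, memory_id: int, text: str, step: int) -> None:
        if memory_id not in self.items:
            raise KeyError(memory_id)
        self.items[memory_id] = MemoryItem(text, step)

    def delete(self, memory_id: int) -> None:
        self.items.pop(memory_id, None)

    def retrieve(self, query: str, step: int) -> Tuple[List[str], float]:
        """Return matching texts plus a reward term penalizing stale hits."""
        hits = [m for m in self.items.values() if query.lower() in m.text.lower()]
        # Assumed staleness rule: not refreshed within the last max_age steps.
        stale = sum(step - m.last_updated > self.max_age for m in hits)
        return [m.text for m in hits], 0.0 - self.stale_penalty * stale


store = MemoryStore(max_age=5)
port_id = store.write("server port is 8080", step=0)
old_texts, old_penalty = store.retrieve("port", step=10)   # entry is 10 steps old
store.update(port_id, "server port is 9090", step=10)
new_texts, new_penalty = store.retrieve("port", step=11)   # entry was just refreshed
```

The penalty returned by `retrieve` is meant to be added to the task reward, so a policy that keeps relying on outdated entries is discouraged even when the final outcome happens to succeed.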

This line also connects to latent-space reasoning: memory is not a pile of raw text but a learnable, compact latent belief state.

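#Today's conclusions

It is not advisable to spread attention across every new model or leaderboard today. Three main threads are more worth holding onto:

Environment generation is core infrastructure for agent RL: LiteCoder, PhoneWorld, and CausaLab all make the case that "controllable environments + automatic verifiers" matter more than simply collecting trajectories.
World model research is shifting from generation quality to interactivity and causality: minWM tackles the engineering chain for interactivity, while YoCausal asks whether causality has really been learned.
State management for long-running agents is becoming a problem in its own right: WorldMemArena, CBM, CONF-KV, and EarlyTom each address, at a different level, "which history should be kept, compressed, updated, or discarded".

Priority suggestion for wenjun: read LiteCoder-Terminal closely first, and read it together with Skill0.5 and WorldMemArena. To advance model-based RL for LLMs, read minWM/YoCausal in parallel and transfer "interactive world model + causal evaluation" to the terminal/code agent setting.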