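Daily Research 2026-06-13 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

#2026-06-13 AI/LLM Latest Papers and Research Hotspots Briefing

Retrieved: 2026-06-13 08:00 (Asia/Shanghai)
Main coverage: Hugging Face Daily Papers and the latest arXiv submissions from 2026-06-11 to 2026-06-12; to cover wenjun's recent focus areas, some items extend back to 2026-06-04 through 2026-06-10.
Note: X/Twitter pages are reachable, but without logging in it is hard to extract real-time technical feeds reliably, so this issue uses arXiv, Hugging Face Papers, and the GitHub API as verifiable sources. Every item includes its original link to avoid relying on unverified rumors.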

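#Overview in one sentence

The most notable signal from the past 24 to 48 hours: LLM agent research is moving quickly from "single-shot task solving" toward "memory evolution in dynamic environments, the agent's own world model, long-trajectory credit assignment, and runtime constraints for code agents." Meanwhile, several latent reasoning papers were submitted in a cluster, and the core tension has advanced from "can a model think in latent space at all" to "how can latent recurrence incorporate trainable stochasticity, persistent memory, and interpretable/controllable mechanisms."

For wenjun's main research lines, four threads are especially worth reading today:

ProPlay / EvoArena / MemoPilot / HORMA: closing the loop across agent memory, world model, procedure graph, and test-time learning.
ECPO / 3SPO / TRACE / SGCD: credit assignment for long-trajectory agentic RL is moving from trajectory-level reward toward state/prefix/action-level assignment.
Demystifying Hidden-State Recurrence / Dropout-GRPO / Persistent Memory for Continuous Latent Reasoning: the RL training problems of latent-space reasoning are starting to be posed systematically.
TRACE for coding agents / DeNovoSWE / Claw-SWE-Bench / AIDev rejection study: the key bottleneck for code agents is shifting from benchmark scores to the real engineering loop: user preferences, PR rejection, adapter/harness, whole-repo generation.

#Key papers and developments (filtered by relevance)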

#1. ProPlay: Procedural World Models for Self-Evolving LLM Agents

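Link: <http://arxiv.org/abs/2606.12780v1>
Code: <https://github.com/antman9914/proplay>
Source/date: arXiv, 2026-06-11; GitHub repository updated 2026-06-07
Category: Model-based RL / LLM Agent / Self-evolving Agent / Memory
One-sentence contribution: proposes a procedure-level world model that abstracts successful trajectories into a procedure graph, rehearses future paths in a "preplay" step before execution, and updates the graph structure from environment feedback after execution.

Why it matters: this paper is very close to the "Dreamer for LLM Agent / model-based RL for language agents" problem setting. Rather than training a neural dynamics model directly, it abstracts the LLM agent's experience into a procedure graph: nodes and edges correspond to task stages and causal transitions, and each transition carries a reliability record used to estimate how much past outcomes contribute to the current task. The agent can then simulate several procedural paths on its internal procedure graph before executing for real.

Relevance to wenjun's directions:

If Dreamer's latent dynamics are replaced by a procedure graph at the language/program level, ProPlay can be viewed as a more discrete and more interpretable world model.
It raises a research question: an LLM agent's world model does not have to predict observation tokens; it can instead predict "stage transition + executable procedure + success reliability."
It can be combined with long-trajectory RL: the edge reliability of the procedure graph can serve as intermediate credit or as a value prior.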
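
To make the "procedure graph + reliability record + preplay" idea concrete, here is a minimal sketch. It is not ProPlay's implementation: the class names, the Laplace-smoothed reliability, and scoring a path by the product of edge reliabilities are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A transition between two task stages with a reliability record."""
    successes: int = 0
    attempts: int = 0

    def reliability(self) -> float:
        # Laplace-smoothed success rate (an assumption, not from the paper).
        return (self.successes + 1) / (self.attempts + 2)


@dataclass
class ProcedureGraph:
    edges: dict = field(default_factory=dict)  # (src, dst) -> Edge

    def update(self, path, success):
        """After execution: record the outcome on every transition used."""
        for src, dst in zip(path, path[1:]):
            edge = self.edges.setdefault((src, dst), Edge())
            edge.attempts += 1
            edge.successes += int(success)

    def preplay(self, candidate_paths):
        """Before execution: pick the path with the highest reliability product."""
        def score(path):
            value = 1.0
            for src, dst in zip(path, path[1:]):
                # Unseen transitions fall back to the smoothed prior of 0.5.
                value *= self.edges.get((src, dst), Edge()).reliability()
            return value
        return max(candidate_paths, key=score)
```

In this toy version the same edge reliabilities could also be read out as a value prior or intermediate credit for an RL learner, which is the connection suggested above.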

#2. EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

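Link: <http://arxiv.org/abs/2606.13681v1>
Code: <https://github.com/Aiden0526/EvoArena>
HF Papers: <https://huggingface.co/papers/2606.13681>
Source/date: arXiv / Hugging Face Daily Papers, 2026-06-11; GitHub pushed 2026-06-12
Category: LLM Agent / Continual Learning / Memory / Evaluation
One-sentence contribution: builds EvoArena, a benchmark of dynamic environments, and proposes EvoMem, a patch-based memory that records environment changes as a structured update history.

Key information: the paper argues that most existing agent benchmarks use static environments, whereas in real deployments task conditions, software state, and user preferences keep changing. EvoArena covers three dynamic domains: terminal, software, and social preference. Current agents reach an average accuracy of only 39.6%; EvoMem improves the EvoArena average by 1.5%, and improves GAIA and LoCoMo by 6.1% and 4.8% respectively.

Why it matters: this paper moves "memory" from simple append / retrieve to "versioned environment state." Patch-based memory records not only facts but also "what changed, when, and how the old state was updated." For agents in dynamic environments, this is closer to real state tracking than ordinary RAG memory.

Relevance to wenjun's directions:

Useful for studying how agent pretraining data shapes continual adaptation: if the training data contains only static task trajectories, the model will be inherently weak at handling evolving state.
A reference for long-horizon RL environment design: let the environment change across episodes or subtasks, and test whether the agent treats memory updates as part of its decision making.
Connection to model-based RL: EvoMem's patch history can serve as the state-transition log of a world model.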
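
The difference between append/retrieve memory and a versioned update history can be sketched as follows. This is an illustrative toy, not EvoMem's data format: the `Patch` fields and the `PatchMemory` API are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    step: int    # when the change was observed
    key: str     # which piece of environment state changed
    old: object  # value before the change (None if previously unknown)
    new: object  # value after the change


class PatchMemory:
    def __init__(self):
        self.history = []  # ordered list of Patch records

    def write(self, step, key, value):
        """Record a patch only when the observed value actually changed."""
        old = self.read(key)
        if old != value:
            self.history.append(Patch(step, key, old, value))

    def read(self, key, as_of=None):
        """Return the latest value of key, optionally as of an earlier step."""
        value = None
        for patch in self.history:
            if patch.key == key and (as_of is None or patch.step <= as_of):
                value = patch.new
        return value
```

Because every patch keeps both the old and the new value, the history doubles as a log of state transitions, which is what makes it usable as world-model training data in the sense described above.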

#3. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

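Link: <http://arxiv.org/abs/2606.13106v1>
HF Papers: <https://huggingface.co/papers/2606.13106>
Source/date: arXiv / Hugging Face Papers, 2026-06-11
Category: Latent Reasoning / Post-training RL / Reasoning Model
One-sentence contribution: studies latent reasoning in the form of hidden-state recurrence and uses on-policy RL to train a switchable latent-space reasoning mechanism.

Why it matters: latent reasoning papers have appeared in quick succession recently, but many still stop at the structural experiment of "replacing CoT tokens with hidden-state recurrence." The "switchable latent reasoning" in this title is the key point: future strong reasoning models may neither always use explicit CoT nor always stay latent, but switch between visible reasoning and hidden recurrence depending on the task, the budget, and verifiability.

Relevance to wenjun's directions:

Directly relevant to latent-space reasoning; the method section and the RL objective deserve a close read.
Can be combined with agent settings: in a long-trajectory agent, some local search/planning can be made latent, while key tool calls and state updates still need to be explicit.
Research opportunity: design an agent benchmark that compares explicit ReAct, pure latent recurrence, and switchable latent/explicit strategies under long-horizon credit assignment.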

#4. InterleaveThinker: Reinforcing Agentic Interleaved Generation

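Link: <http://arxiv.org/abs/2606.13679v1>
Code: <https://github.com/zhengdian1/InterleaveThinker>
HF Papers: <https://huggingface.co/papers/2606.13679>
Source/date: arXiv / Hugging Face Daily Papers, 2026-06-11; GitHub pushed 2026-06-12
Category: LLM Agent / Post-training RL / Multimodal Agent / Tool-use
One-sentence contribution: trains a planner agent and a critic agent with GRPO to wrap an existing image generator into an agentic pipeline that produces interleaved image-text sequences.

Key information: InterleaveThinker builds Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k. Because a single interleaved generation trajectory may contain more than 25 generator calls, the authors run single-step RL with an accuracy reward and a step-wise reward, which indirectly improves the whole trajectory.

Why it matters: this is not wenjun's core code/LLM agent direction, but it shows an important general pattern: a complex multi-step generation task can be split into planner, executor, and critic, and the critic's error-correction ability can be trained with step-wise RL. This closely resembles the code agent loop of writing code, running tests, localizing the failure, and repairing it.

Relevance to wenjun's directions: transferable to self-evolving code agents. The critic is not just a scorer but a learnable correction/replanning agent inside the trajectory, and step-wise reward lowers the cost of whole-trajectory RL.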

#5. Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

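Link: <http://arxiv.org/abs/2606.13174v1>
Source/date: arXiv, 2026-06-11
Category: Code Agent / Continual Learning / Runtime Enforcement / Memory
One-sentence contribution: proposes TRACE, which automatically mines user corrections into atomic rules and compiles them into coding-agent runtime checks, so that the agent is forced to follow user preferences in later tasks.

Key information: the paper points out that ordinary memory is not the same as preference compliance: on anonymized real-user friction cases, Mem0 still violates 57.5% of the applicable preference checks. On ClawArena, TRACE reduces held-out preference violations from 100% to 37.6% (in-distribution) and from 100% to 2.0% (OOD).

Why it matters: this is a key step for code agents from "can it solve the task" to "can it collaborate with one user over the long term." User corrections should not serve only as a soft hint in the next prompt; they can be compiled into runtime enforcement.

Relevance to wenjun's directions:

Highly relevant to "moving from instruction understanding to intent understanding": user intent is not a single-turn instruction but a stable constraint across tasks.
For self-evolving coding agents: user corrections can be treated as online environment feedback, turned into verifiable rules, and then applied as hard constraints on future rollouts.
Can be combined with RL: rule violations act as a dense negative reward, and rule passes act as a process reward.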
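
A minimal sketch of "correction compiled into a runtime check, then reused as a reward signal" might look like the following. It does not reproduce TRACE's rule language or mining step: the `Rule` type, the regex-based check, the two example rules, and the reward values are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    forbidden: str  # regex that must NOT match a proposed shell command


# Hypothetical rules, as if mined from two earlier user corrections.
RULES = [
    Rule("no-force-push", r"git\s+push\b.*--force"),
    Rule("use-pnpm-not-npm", r"\bnpm\s+install\b"),
]


def check_action(command, rules):
    """Return the names of rules the command violates (empty list = allowed)."""
    return [rule.name for rule in rules if re.search(rule.forbidden, command)]


def reward_from_checks(violations, num_rules):
    """Toy shaping: -1 per violated rule, +1 when every rule passes."""
    if violations:
        return -float(len(violations))
    return 1.0 if num_rules else 0.0
```

Used as a hard constraint, a non-empty result from `check_action` would block the action before it runs; used for RL, `reward_from_checks` turns the same check into a dense per-step signal.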

#6. FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

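Link: <http://arxiv.org/abs/2606.12087v1>
Code: <https://github.com/RUCAIBox/FORT-Searcher>
HF Papers: <https://huggingface.co/papers/2606.12087>
Source/date: arXiv / Hugging Face Papers, 2026-06-10; GitHub pushed 2026-06-11
Category: LLM Agent / Tool-use / Evaluation / Data Synthesis
One-sentence contribution: proposes a shortcut-aware difficulty framework for synthesizing deep search agent training tasks that are harder to solve through shortcuts.

Key information: the authors point out that many search tasks look structurally complex but admit a cheaper shortcut. FORT defines four kinds of shortcut risk (evidence co-coverage, single-clue selectivity, exposed constants, prior-knowledge binding) and diagnoses true difficulty with trajectory signatures such as solving cost, answer hit time, and prior-shortcut rate.

Why it matters: for agent RL, environment and data design largely determine whether the model learns "search ability" or "opportunistic shortcuts." FORT's value lies in making shortcut resistance a first-class objective of data synthesis.

Relevance to wenjun's directions:

Directly relevant to "eliciting self-evolving intelligence through environment design."
Transferable to code agents: when synthesizing SWE tasks, also avoid shortcuts such as "fault location is too obvious," "tests cover only a single point," and "the issue text leaks the answer."
For RLVR: a verifiable reward needs to be paired with a shortcut-resistant environment, otherwise the model learns to hack the verifier.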

#7. Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

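Link: <http://arxiv.org/abs/2606.12344v1>
Code: <https://github.com/opensquilla/claw-swe-bench>
HF Papers: <https://huggingface.co/papers/2606.12344>
Source/date: arXiv / Hugging Face Papers, 2026-06-10; GitHub pushed 2026-06-11
Category: Code Agent / Evaluation / Agent Harness
One-sentence contribution: proposes a SWE-bench-style benchmark and adapter protocol for OpenClaw-style general agent harnesses, making heterogeneous agents comparable under a unified workspace, patch format, and runtime budget.

Key information: the full benchmark contains 350 GitHub issue-resolution instances covering 8 languages and 43 repositories; the Lite version has 80 tasks. The paper highlights that, with the same GLM 5.1 backbone, the OpenClaw minimal direct-diff adapter reaches only 19.1% Pass@1 while the full adapter reaches 73.4%, which shows that harness/adapter design by itself has a very large effect on results.

Why it matters: this is important for evaluating code intelligence, because "model capability" and "agent harness engineering" are often conflated. Claw-SWE-Bench's adapter protocol can help separate the two.

Relevance to wenjun's directions: in code agent RL, both training data and reward pass through the harness; the harness's action space, patch extraction, and workspace contract effectively shape the policies the agent can learn.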

#8. Understanding the Rejection of Fixes Generated by Agentic Pull Requests — Insights from the AIDev Dataset

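Link: <http://arxiv.org/abs/2606.13468v1>
Source/date: arXiv, 2026-06-11
Category: Code Agent / Evaluation / Software Engineering
One-sentence contribution: analyzes why PRs generated by agents such as Copilot, Devin, Cursor, and Claude are rejected, and finds that 46.41% of agent fix PRs are rejected.

Key information: the authors qualitatively analyze 306 unmerged PRs and derive 14 categories of rejection reasons, mainly incorrect implementation, incomplete implementation, wrong approach, CI/test failures, and the agent failing to handle project constraints correctly.

Why it matters: failure modes in real engineering are not a simple "unit test pass/fail." Many PRs are rejected over design intent, maintenance cost, style, scope control, the CI pipeline, or review trust.

Relevance to wenjun's directions: this gives code agent reward design an empirical basis: agents should not be trained on test pass rate alone, but also on signals such as maintainability, scope control, and review acceptance likelihood.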

#9. DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch

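Link: <http://arxiv.org/abs/2606.10728v1>
Code: <https://github.com/AweAI-Team/DeNovoSWE>
Source/date: arXiv, 2026-06-09; GitHub pushed 2026-06-10, updated 2026-06-12
Category: Code Agent / Long-horizon Agent / Data / Evaluation
One-sentence contribution: builds 4,818 whole-repository generation instances (generating a complete repo from documentation) and constructs the data automatically with a sandboxed agentic workflow.

Key information: DeNovoSWE targets generating an entire repository from a high-level spec / documentation rather than local bug fixing. The authors construct the data with divide-and-conquer and critic-repair, and control quality with difficulty-aware trajectory filtering. After fine-tuning, Qwen3-30B-A3B improves from 5.8% to 47.2% on BeyondSWE-Doc2Repo.

Why it matters: the task boundary of code intelligence is expanding from "fix one bug" to "build a project from scratch." This requires a longer horizon, stronger task decomposition, and more complex verification environments.

Relevance to wenjun's directions: a good environment source for code agent RL / self-evolving code agents. Whole-repo generation exposes problems in planning, state tracking, and module interface consistency better than SWE-bench does.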

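#10. Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Link: <http://arxiv.org/abs/2606.11680v1>
Source/date: arXiv, 2026-06-10
Category: LLM Agent / Context Compression / Memory / RL
One-sentence contribution: proposes HORMA, which organizes experience into a file-system-like hierarchy and trains a lightweight agent to navigate the hierarchical memory and retrieve a minimal sufficient context.

Key information: instead of relying on plain compression or similarity retrieval, HORMA links summarized entities back to the original trajectories to build a navigable hierarchy; the navigation module uses RL to select a minimal but sufficient context. On ALFWorld, LoCoMo, and LongMemEval it improves performance under a constrained context budget; on long-dialogue tasks, token usage is at most 22.17% of the baseline.

Why it matters: this is the agent version of a "general-purpose context compressor": compression is not a one-off summary but a navigable memory hierarchy.

Relevance to wenjun's directions: a good fit with agent pretraining data: if pretraining trajectories are natively organized as task trees / file systems / causal graphs, the model may learn better long-horizon state abstraction.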

#11. 3SPO: State-Score-Supervised Policy Optimization for LLM Agents

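Link: <http://arxiv.org/abs/2606.09961v1>
Source/date: arXiv, 2026-06-08
Category: Post-training RL / LLM Agent / Long-horizon Credit Assignment
One-sentence contribution: proposes state-score-supervised policy optimization, which computes a state score at every step from historical success rates and uses it for step-wise credit assignment and post-step policy optimization.

Why it matters: a core problem in agent RL is that trajectory-level reward is too coarse; in multi-turn ReAct/tool-calling in particular, one bad intermediate action gets diluted by the terminal reward. 3SPO tries to supervise each step's optimization dynamically with the state score, without introducing a value model or any auxiliary model.

Relevance to wenjun's directions: can be combined with model-based agents: the world model / memory graph provides the state abstraction, and 3SPO-style methods provide state-level credit.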
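
One way to picture "state score from historical success rate" is the sketch below. It is not 3SPO's objective: defining the score as the empirical success rate of trajectories that visited a state, the default of 0.5 for unseen states, and the step credit as a score difference are all assumptions made for illustration.

```python
from collections import defaultdict


class StateScores:
    def __init__(self):
        self.visits = defaultdict(int)
        self.wins = defaultdict(int)

    def record(self, states, success):
        """Update counts from one finished trajectory."""
        for state in set(states):  # count each state once per trajectory
            self.visits[state] += 1
            self.wins[state] += int(success)

    def score(self, state, default=0.5):
        """Empirical success rate of trajectories that passed through state."""
        n = self.visits.get(state, 0)
        return self.wins.get(state, 0) / n if n else default

    def step_credits(self, states):
        """Assumed credit for step t: score(s_{t+1}) - score(s_t)."""
        scores = [self.score(s) for s in states]
        return [after - before for before, after in zip(scores, scores[1:])]
```

No value network is involved: the only learned quantities are visit and success counts, which is the property the entry above highlights.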

#12. When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

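Link: <http://arxiv.org/abs/2606.05885v1>
Source/date: arXiv, 2026-06-04
Category: Post-training RL / LLM Agent / Credit Assignment
One-sentence contribution: proposes ECPO, which shows that dense credit can itself be statistically unreliable, and calibrates step-level credit with an evidence-calibrated action advantage and variance-gated credit weighting.

Key information: the authors argue that although methods such as GiGPO construct a step-level advantage, with a limited number of rollouts a rare but lucky action receives an oversized advantage, which leads to anchor bias and training oscillation. ECPO applies shrinkage to low-sample action estimates and suppresses anchor states dominated by high noise. On ALFWorld/WebShop it improves consistently over GiGPO.

Why it matters: an important reminder for long-trajectory RL: denser credit is not automatically better; the statistical evidence behind the credit has to be reliable.

Relevance to wenjun's directions: when studying long-horizon agent RL, pay attention to the confidence / amount of evidence behind the credit estimator, not only to designing more intermediate rewards.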
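
The two ingredients named above, shrinkage for low-sample action estimates and down-weighting of noisy anchor states, can be illustrated with a small sketch. The formulas here are generic stand-ins (an `n / (n + k)` shrinkage factor and a `1 / (1 + variance)` gate), not ECPO's actual estimators; `prior_strength` and `scale` are hypothetical parameters.

```python
from statistics import mean, pvariance


def shrunk_advantage(action_returns, state_returns, prior_strength=4.0):
    """Shrink an action's advantage toward 0 when it has few samples."""
    n = len(action_returns)
    raw = mean(action_returns) - mean(state_returns)
    return (n / (n + prior_strength)) * raw


def variance_gate(state_returns, scale=1.0):
    """Weight in (0, 1]; smaller when the anchor state's returns are noisy."""
    return 1.0 / (1.0 + pvariance(state_returns) / scale)
```

With state returns `[0.0, 0.0, 0.0, 1.0]`, an action that was seen once and happened to succeed has a raw advantage of 0.75 but a shrunk advantage of 0.15, while the same advantage backed by twelve samples is only shrunk to 0.5625.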

#13. Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

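Link: <http://arxiv.org/abs/2606.10184v1>
Source/date: arXiv, 2026-06-08
Category: Latent Reasoning / GRPO / Post-training RL
One-sentence contribution: points out that continuous latent reasoning lacks rollout diversity under GRPO, and introduces structured stochasticity with a dropout mask held fixed across latent recurrence steps.

Key information: GRPO depends on the reward variance among K rollouts of the same prompt; but if the hidden-state recurrence in Coconut-style latent reasoning is deterministic, all rollouts are identical and the advantage becomes zero. Dropout-GRPO uses a Bernoulli mask that is fixed within each rollout, so that different rollouts act as approximate samples from a parameter posterior, which restores trainability. On GSM8K, the Coconut baseline improves from 27.29% to 29.01% pass@1.

Why it matters: this paper captures a basic technical tension in latent reasoning + RL: without token sampling, latent-space reasoning naturally lacks exploration.

Relevance to wenjun's directions: for RLVR on latent-space reasoning, the first problem to solve is "how latent trajectories produce a comparable exploration distribution." Dropout, noise injection, and latent action sampling could all become key tools.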
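
The zero-advantage problem is easy to reproduce in a toy setting. The sketch below uses the standard group-normalized advantage, `(r - mean) / (std + eps)`, and a made-up linear recurrence standing in for a latent reasoner; the recurrence, `sample_mask`, and all constants are hypothetical and are not the paper's model.

```python
import random
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sample_mask(dim, keep_prob, rng):
    """One Bernoulli mask per rollout, reused at every recurrence step."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(dim)]


def latent_rollout(h0, steps, mask):
    """Toy deterministic recurrence; mask stays fixed for the whole rollout."""
    h = list(h0)
    for _ in range(steps):
        total = sum(h)
        h = [m * (0.5 * x + 0.1 * total) for x, m in zip(h, mask)]
    return sum(h)
```

Running `latent_rollout` K times with an all-ones mask gives K identical outputs, so any reward computed from them is constant and `grpo_advantages` returns all zeros. Drawing a fresh `sample_mask(dim, keep_prob, random.Random(seed))` per rollout makes the outputs differ, which is the kind of diversity the group baseline needs.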

#14. Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

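Link: <http://arxiv.org/abs/2606.07720v1>
Source/date: arXiv, 2026-06-05
Category: Latent Reasoning / Memory / Architecture
One-sentence contribution: proposes AGCLR, which uses a gated concept stream across reasoning passes to address the concept bottleneck in continuous latent reasoning, where early intermediate facts get overwritten.

Key information: the authors point out that in CoCoNuT-style latent reasoning the hidden state is overwritten on every pass, so early facts are lost as reasoning depth grows. AGCLR maintains a persistent residual memory and uses write/read/forget gates to control writing, reading, and forgetting. It shows consistent gains on GSM8K, HotpotQA, and ProsQA.

Relevance to wenjun's directions: this is structurally similar to agent memory: latent reasoning also needs memory evolution internally, rather than a single hidden state that is overwritten again and again.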

#15. MiniMax Sparse Attention

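Link: <http://arxiv.org/abs/2606.13392v1>
HF Papers: <https://huggingface.co/papers/2606.13392>
Source/date: arXiv / Hugging Face Daily Papers, 2026-06-11
Category: Systems / Long Context / Agent Infrastructure
One-sentence contribution: proposes a GQA-based blockwise sparse attention that uses an Index Branch to select the Top-k KV blocks for each GQA group, together with GPU execution optimizations.

Why it matters: agentic workflows, repo-scale code reasoning, and persistent memory all depend on very long contexts, but the real deployment bottleneck is KV and attention cost. MiniMax Sparse Attention matters because it is not purely an algorithmic compression; it also emphasizes efficient execution on GPUs.

Relevance to wenjun's directions: for long-trajectory agents or repository-level code agents, context length and KV cost become training/inference bottlenecks; this kind of sparse attention and methods such as ReasonAlloc/HORMA can serve as context management at different levels.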
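
Blockwise Top-k selection can be sketched in a few lines of plain Python. This is only the selection step for a single query, and the scoring rule (dot product between the query and each block's mean key) is a simple stand-in chosen for illustration; it is not the paper's Index Branch, and the function name and arguments are hypothetical.

```python
def select_topk_blocks(query, keys, block_size, k):
    """Return indices of the k highest-scoring KV blocks, in ascending order."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    scores = []
    for block in blocks:
        # Summarize the block by the mean of its key vectors.
        mean_key = [sum(column) / len(block) for column in zip(*block)]
        scores.append(sum(q * m for q, m in zip(query, mean_key)))
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Attention would then be computed only over the keys and values inside the selected blocks, so the cost scales with `k * block_size` rather than with the full context length.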

#16. Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

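Link: <http://arxiv.org/abs/2606.11499v1>
Source/date: arXiv, 2026-06-09
Category: Pretraining Data / Data Quality / Base Model Training
One-sentence contribution: proposes WebGraphMix, which uses host-level web graph centrality from Common Crawl to control the pretraining mixture ratio between central and peripheral sites.

Key information: the authors hypothesize that central hosts provide reusable abstractions while peripheral hosts provide specialized long-tail knowledge. The method needs no trained classifier or labeled data and works directly from web graph centrality. Tested in the DataComp-LM pipeline with 400M/1B models and 8B/28B tokens, it finds that the central and peripheral regions contribute complementary capabilities; a 1:1 mixture reaches an average score of 41.4%, above the 39.8% of the unfiltered data.

Why it matters: an example of pretraining data selection moving from "text quality scoring" to "network structure signals."

Relevance to wenjun's directions: for agent pretraining data quality, one can construct an analogous "interaction graph centrality": high-centrality tasks teach general abstractions, peripheral tasks teach long-tail skills, and the mixture ratio may shape the agent's balance between generalization and specialization.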

#17. ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

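Link: <http://arxiv.org/abs/2606.13316v1>
Source/date: arXiv, 2026-06-11
Category: Post-training RL / Context Compression / Reasoning Model
One-sentence contribution: proposes an RLVR framework in which the model summarizes and organizes its own reasoning trajectory during long reasoning, reducing needlessly long rollouts and context exhaustion.

Why it matters: RLVR tends to encourage ever longer reasoning rollouts; ReSum turns summarization into a trajectory-management action inside the model rather than an external system module. This is very similar to the "stage summary / information folding" problem in long-horizon agents.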

#18. Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

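Link: <http://arxiv.org/abs/2606.12634v1>
Source/date: arXiv, 2026-06-10
Category: Tool-use / Post-training RL / Credit Assignment
One-sentence contribution: proposes SGCD, which builds a stepwise credit reference from success/failure contrasts among sibling rollouts while keeping the policy gradient as the main optimization signal, so that self-distillation does not amplify bad shortcuts.

Why it matters: it points out that token-level self-distillation can "silently destroy tool use": the teacher's behavior contains both good skills and bad shortcuts, and direct imitation amplifies both. SGCD uses distillation for credit reassignment rather than as a replacement for the actor loss.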

#19. TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Link: <http://arxiv.org/abs/2606.11119v1>
Source/date: arXiv, 2026-06-09
Category: Post-training RL / Agentic RL / RLVR Efficiency
One-sentence contribution: treats each ReAct-style thought-action-observation turn as a semantic node and allocates rollout budget at the prompt level and the prefix level to increase reward contrast in RLVR.

Why it matters: rollout budget is a practical bottleneck for agentic RL; if the reward variance of some prompts or prefixes is very low, further sampling is wasted. This direction overlaps with active data selection, curriculum, and environment design.
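
The intuition that low-variance prompts deserve fewer rollouts can be written down as a simple allocation rule. The sketch below splits a fixed budget in proportion to the reward variance observed in a small pilot batch; this proportional rule, the `floor` parameter, and the function name are assumptions for illustration and not the paper's allocation algorithm.

```python
from statistics import pvariance


def allocate_rollouts(pilot_rewards, total_budget, floor=1):
    """Split total_budget across prompts in proportion to pilot reward variance."""
    names = list(pilot_rewards)
    variances = [pvariance(pilot_rewards[name]) for name in names]
    spare = total_budget - floor * len(names)
    if spare < 0:
        raise ValueError("budget too small for the per-prompt floor")
    total_var = sum(variances)
    if total_var == 0:
        shares = [spare / len(names)] * len(names)
    else:
        shares = [spare * v / total_var for v in variances]
    alloc = [floor + int(share) for share in shares]
    # Hand out rollouts lost to rounding, largest fractional part first.
    leftovers = total_budget - sum(alloc)
    order = sorted(range(len(names)),
                   key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftovers]:
        alloc[i] += 1
    return dict(zip(names, alloc))
```

A prompt that always succeeds or always fails in the pilot batch gets only the floor, because more samples from it would add no reward contrast; the same rule could be applied to prefixes instead of prompts.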

#20. Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces

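Link: <http://arxiv.org/abs/2606.10064v1>
Source/date: arXiv, 2026-06-08
Category: LLM Agent / Trajectory Data / Post-training / Tool-use
One-sentence contribution: proposes using an incentive-aligned agent arena as a trajectory generation mechanism, and distills a shopping agent from ShoppingBench subnet traces.

Why it matters: it argues that the bottleneck for agentic post-training of small models is not the algorithm but a substrate of multi-turn trajectories that are supervised, judgeable, and leak-resistant. This is closely related to wenjun's interest in "how agent pretraining data shapes capability."

#Other related items for a quick scan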

| Title | Link | Source/date | Category | One-sentence contribution |
| --- | --- | --- | --- | --- |
| SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning | <http://arxiv.org/abs/2606.13673v1> | arXiv/HF, 2026-06-11 | Tool-use / Agent Interface / VLM Agent | Uses a stateful Python kernel plus perception/geometry primitives as the action interface, letting a VLM agent execute spatial reasoning code step by step. |
| The End of Code Review: Coding Agents Supersede Human Inspection | <http://arxiv.org/abs/2606.13175v1> | arXiv, 2026-06-11 | Code Agent / Software Engineering | Position paper: argues that coding agents can already take over many of the goals of traditional human code review. Read critically. |
| ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models | <http://arxiv.org/abs/2606.11164v1> | arXiv, 2026-06-09 | Systems / Context Compression / Reasoning | Allocates KV cache budget at the layer/head level during the decoding phase of long CoT. |
| From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory | <http://arxiv.org/abs/2606.08656v1> | arXiv, 2026-06-07 | LLM Agent / Test-time Learning / Memory RL | MemoPilot treats memory updates as a multi-turn decision problem and uses multi-turn GRPO to optimize test-time learning for a frozen LLM. |
| Co-Evolving Skill Generation and Policy Optimization | <http://arxiv.org/abs/2606.08755v1> | arXiv, 2026-06-07 | LLM Agent / Skill Memory / RL | Validates skills online before they enter the skill bank, so that useless or harmful skills do not contaminate the agent. |
| Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments | <http://arxiv.org/abs/2606.05661v1> | arXiv, 2026-06-04 | Continual Learning / Agent Evaluation | Builds a stateful continual learning benchmark spanning software engineering, signal processing, epidemic forecasting, and other domains. |
| CLaaS: Continual learning as a service for sample efficient online learning | <http://arxiv.org/abs/2606.05559v1> | arXiv, 2026-06-04 | Continual Learning / Deployment | Packages online experience replay and asynchronous training for deployed agents as a continual-learning service behind a chat API. |

#Top 3 papers to read closely today

ProPlay: Procedural World Models for Self-Evolving LLM Agents

<http://arxiv.org/abs/2606.12780v1>

Reason: closest to wenjun's model-based RL / Dreamer for LLM Agent line; focus on how the procedure graph, preplay, and reliability record are defined.

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

<http://arxiv.org/abs/2606.13681v1>

Reason: puts dynamic environments, memory evolution, and continual adaptation into one benchmark; a suitable experimental base for later work on agent continual learning / environment design.

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

<http://arxiv.org/abs/2606.13106v1>

Reason: maps directly onto latent-space reasoning and focuses on on-policy RL; best read together with Dropout-GRPO and AGCLR.

Alternate fourth paper: Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents (<http://arxiv.org/abs/2606.13174v1>). If today's focus is more on code agents and understanding user intent, raise its priority.

#Top 3 repos / models / datasets to follow up on today

Aiden0526/EvoArena

<https://github.com/Aiden0526/EvoArena>

Dynamic-environment agent benchmark + EvoMem; suitable for reproducing the experiments and inspecting the memory evolution data format.

RUCAIBox/FORT-Searcher

<https://github.com/RUCAIBox/FORT-Searcher>

Shortcut-resistant search task synthesis; a useful reference for environment design for code agents / web agents.

AweAI-Team/DeNovoSWE

<https://github.com/AweAI-Team/DeNovoSWE>

Whole-repository generation dataset; suitable as a data source for long-horizon code agent training/evaluation.

Also worth a look:

zhengdian1/InterleaveThinker: <https://github.com/zhengdian1/InterleaveThinker>, a multi-step generation paradigm with a multi-agent planner/critic + GRPO.
opensquilla/claw-swe-bench: <https://github.com/opensquilla/claw-swe-bench>, an SWE evaluation protocol for OpenClaw-style harnesses.
VectifyAI/ConDB: <https://github.com/VectifyAI/ConDB>, a KV-cache native context database; paper leads are limited, but it is relevant to long-context infrastructure for agents.

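#Research opportunities / ideas

#Idea 1: Rewrite a Dreamer-style world model as a procedure graph + latent state hybrid for LLM agents

ProPlay provides the procedure graph, EvoArena provides evolving state, and MemoPilot/HORMA provide memory update/navigation. One could consider a hybrid architecture:

Explicit layer: a procedure graph records task stages, executable skills, and transition reliability;
Implicit layer: a latent state records environment/user preferences that cannot be expressed structurally;
Learning objective: use environment feedback to update graph reliability and the latent state together;
Decision procedure: run preplay before execution, select several high-value procedure paths, and let the LLM/tool agent expand them.

Key question: does the world model have to predict observations, or is it enough to predict "which procedure transition is more likely to lead to reward"?

#Idea 2: Build a shortcut-resistant whole-repo RL environment for long-horizon code agents

Combining FORT-Searcher, DeNovoSWE, and the AIDev rejection study, one could design a code agent environment:

The task is not a local bug fix but generating/refactoring a complete repo from documentation;
Reward looks beyond passing tests to CI, interface consistency, scope control, and review rejection risk;
Hiding part of the spec, moving tests around, and delaying error exposure keep the agent from taking shortcuts;
ECPO/3SPO/TRACE-style methods provide step/prefix-level credit.

This would be a training environment closer to "real software engineering intelligence" than plain SWE-bench.

#Idea 3: Exploration mechanisms for latent reasoning: from dropout to learned latent actions

Dropout-GRPO shows that continuous latent reasoning lacks rollout diversity; AGCLR shows that latent recurrence loses facts; Demystifying Hidden-State Recurrence focuses on switchable latent reasoning. Questions worth digging into:

What exactly is an "action" in latent reasoning?
Dropout/noise is only exploration noise; can a discrete/continuous latent action space be learned instead?
In agent settings, which steps should use explicit CoT/ReAct and which should use latent rollout?
How should reward assign credit to invisible latent steps?

One could scale up from small math / multi-hop QA to tool-calling agents: let the agent do latent planning between tool calls, then train the latent planner with the final tool results and trajectory success rate.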

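#Today's assessment

There was no single "explosive model release" today, but what matters more for wenjun is that research themes are converging:

agent memory is no longer just RAG, but evolving state, procedure graphs, hierarchical navigation, and runtime rule enforcement;
agent RL no longer discusses only GRPO/PPO, but state/prefix/action-level credit, rollout budget, and shortcut-resistant environments;
code agents no longer compete only on SWE-bench scores, but are moving toward whole-repo generation, real PR acceptance, and long-term user preferences;
latent reasoning is moving from conceptual exploration to trainable mechanisms: hidden recurrence, persistent memory, dropout-induced rollout diversity, and on-policy RL.

If you only have 2 hours today: first read the ProPlay and EvoArena abstracts and method figures, then skim Demystifying Hidden-State Recurrence and Dropout-GRPO. If time remains, read TRACE for coding agents and note the "user correction → runtime enforcement" idea, which is likely to become a key component of long-term collaboration for code agents.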
