Daily Research 2026-06-18 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

#2026-06-18 AI/LLM Latest Papers and Research Hotspots Briefing

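Retrieved: 2026-06-18 08:00 (Asia/Shanghai)
Main coverage: Hugging Face Daily Papers for 2026-06-17 and arXiv submissions/updates from 2026-06-15 to 2026-06-16, supplemented by GitHub repositories updated in the past 7 days.
Source limitations: the Hugging Face Daily Papers API, Hugging Face paper API, and GitHub Search API were accessible. The arXiv export API returned HTTP 429 on this run, so arXiv metadata was taken mainly from the HF paper mirror and cross-checked against arXiv abs links. X/Twitter was not accessed through a logged-in session or API and was not used as a primary source.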

#One-sentence takeaway

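Today's main thread can be summarized as: "World models and agent training are both converging toward interactive, looped, and self-evolving designs, while latent/test-time compute for code intelligence is starting to be modeled systematically."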

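World models are a hot topic again: Looped World Models, ActWorld, EgoCS-400K, and ACE-Ego-0 all address how to give models long horizons, action conditioning, interaction memory, and high-quality trajectory data;
Self-evolving agents are moving from memory retrieval to training objectives: OPD-Evolver distills the ability to "read / use / write / maintain experience" directly into the policy through on-policy distillation;
Latent/test-time compute for code models is entering the pretraining-architecture level: LoopCoder-v2 is not plain multi-sampling; it trains a looped coder on 18T tokens and studies how loop depth relates to code and agentic software engineering ability;
RLVR and post-training stability are still iterating quickly: ZPPO, d-OPSD, and HAW all try to address sparse outcomes, teacher guidance, self-future supervision, and credit assignment;
Agent data quality problems are becoming more visible: ProCUA-SFT reports that SFT directly on the largest public human CUA trajectories can even cause negative transfer, which suggests that the "structure and generation process" of agent pretraining data matters more than scale alone.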

#Top 5 recommendations

#1. Looped World Models

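Links: <https://arxiv.org/abs/2606.18208> / <https://huggingface.co/papers/2606.18208>
Source / date: arXiv / Hugging Face Daily Papers; 2026-06-16 / Daily 2026-06-17
Category: Model-based RL / World Model / Latent Reasoning
One-sentence contribution: proposes LoopWM, which iteratively refines the latent environment state with a parameter-shared transformer block and treats "looped latent depth" as a new scaling axis for long-horizon world-model simulation.

Why it matters:

This paper is very close to wenjun's recent interest in LLM model-based RL / Dreamer for Agent. The tension in conventional world models is that long-horizon simulation needs deeper computation, but deep models are expensive to deploy and their errors compound. LoopWM does not simply add parameters: it has the same module refine the latent state repeatedly and adapts the compute depth to the complexity of the prediction. The abstract claims up to 100x parameter efficiency over conventional approaches. That number needs to be checked against the full text, but the direction itself is worth following closely.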
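
The idea of reusing one block and stopping early on easy predictions can be illustrated with a toy sketch. This is not LoopWM's implementation: `refine_until_stable`, `toy_block`, and the tolerance-based stopping rule are assumptions made for illustration, and the latent state is a plain list of floats rather than a learned representation.

```python
from typing import Callable, List, Tuple


def refine_until_stable(
    state: List[float],
    block: Callable[[List[float]], List[float]],
    max_loops: int = 8,
    tol: float = 1e-3,
) -> Tuple[List[float], int]:
    """Apply the same block repeatedly (max_loops >= 1) and stop early
    once the largest per-element update falls below tol."""
    loops_used = 0
    for loops_used in range(1, max_loops + 1):
        new_state = block(state)
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            break
    return state, loops_used


def toy_block(state: List[float]) -> List[float]:
    """Hypothetical shared block: a contraction whose fixed point is 2.0."""
    return [0.5 * x + 1.0 for x in state]


if __name__ == "__main__":
    # A looser tolerance ends the loop earlier, i.e. spends less compute.
    print(refine_until_stable([0.0], toy_block, max_loops=8, tol=0.2))
```

The same parameters are reused on every pass, so depth becomes a runtime budget rather than a parameter count, which is the property the summary above attributes to looped designs.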

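Relation to wenjun's directions:

If the "environment state" of an LLM Agent is viewed as a latent state made up of text, tools, and memory, LoopWM offers an analogy: do imagination through looped latent computation rather than unrolling all reasoning into textual CoT;
For a Dreamer-style Agent, the point is not just to predict the next observation, but to learn latent dynamics that can be refined repeatedly and that allocate compute budget according to uncertainty;
It can be read alongside yesterday's Tyler / Latent Thought Flow: one leans toward compute routing for latent reasoning, the other toward iterative latent depth for world models.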

#2. LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

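Links: <https://arxiv.org/abs/2606.18023> / <https://huggingface.co/papers/2606.18023>
Source / date: arXiv / Hugging Face Daily Papers; 2026-06-16 / Daily 2026-06-17
Category: Code Agent / Latent Reasoning / Test-time Scaling / Foundation Model Training
One-sentence contribution: trains a family of 7B Parallel Loop Transformer coders and, after pretraining on 18T tokens, compares the benefit/cost of loop count on code generation, code reasoning, agentic software engineering, and tool-use benchmarks.

Why it matters:

The value of LoopCoder-v2 is not just "another code model topping leaderboards". It pushes test-time compute scaling into the model architecture: parallel loops, cross-loop position offsets, and shared-KV gated sliding-window attention make looped computation a controllable design variable. The abstract states that the two-loop variant brings broad gains on code generation, code reasoning, agentic SWE, and tool use, while avoiding the latency and KV-cache growth of sequential looping.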

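Relation to wenjun's directions:

For code intelligence, this is a route distinct from RLVR / agent scaffolds: first let the base coder learn to "think one more round" in latent computation;
For long-horizon code agents, a looped architecture may reduce explicit reflection tokens while keeping stronger internal refinement;
It fits well with papers on latent-space reasoning, context compression, and KV-cache systems to form a topic on "implicit computation vs explicit trajectories for code agents".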

#3. OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

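Links: <https://arxiv.org/abs/2606.17628> / <https://huggingface.co/papers/2606.17628>
Source / date: arXiv / Hugging Face Daily Papers; 2026-06-16 / Daily 2026-06-17
Category: LLM Agent / Self-evolving Agent / Continual Learning / Post-training
One-sentence contribution: proposes a slow-fast co-evolution framework that performs test-time evolution with a four-level memory hierarchy in the fast loop, and in the slow loop uses outcome-calibrated memory attribution and hindsight to distill the ability to evolve into the policy.

Why it matters:

Many memory agents can only "store experience and retrieve reflections", but OPD-Evolver frames the problem one step further: the agent must learn to select useful experience, use it, write reusable knowledge, and maintain a growing repository. In other words, it studies evolver competence rather than the memory module itself. This is very close to the core problem of self-evolving code agents / agentic continual learning.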

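Relation to wenjun's directions:

It can serve as a direct reference for "code agents that improve themselves from successful and failed trajectories";
Outcome-calibrated memory attribution closely resembles credit assignment in long-trajectory RL: which past experiences actually led to success?
For future work on agent pretraining data, OPD-Evolver suggests that the data should not be trajectories alone, but should also carry operation labels for how experience is read, used, written, and maintained.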

#4. Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

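Links: <https://arxiv.org/abs/2606.18216> / <https://huggingface.co/papers/2606.18216>
Source / date: arXiv / Hugging Face Daily Papers; 2026-06-16 / Daily 2026-06-17
Category: Post-training RL / RLVR / Distillation / Reasoning Model
One-sentence contribution: proposes ZPPO, which puts the teacher into the prompt to provide proximal hints when all of the student's rollouts fail, instead of injecting the teacher response directly into the policy gradient.

Why it matters:

A common awkward case in RLVR is that every rollout on a hard question is wrong, the advantage is zero, and the sample is discarded; yet training directly on teacher logits/answers breaks the on-policy assumption and introduces drift. ZPPO's design is neat: the teacher does not enter the gradient but enters the prompt, building reformulated prompts with candidates/local hints so that the student still obtains an optimizable signal from its own rollouts.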
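
The zero-advantage case and the prompt-side fix can be sketched in a few lines. This is a minimal illustration, not ZPPO's actual algorithm: it assumes a GRPO-style group-normalized advantage (the summary above does not say which estimator the paper uses), and `group_advantages`, `reformulate_prompt`, and the hint format are hypothetical names chosen here.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-normalized advantages: (r - mean) / std.
    If every rollout got the same reward, all advantages are zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def reformulate_prompt(question: str, rewards: List[float], teacher_hint: str) -> str:
    """Hypothetical rule: add a teacher hint to the prompt only when
    all student rollouts failed (binary reward, 0 = fail)."""
    if max(rewards) == 0:
        return f"{question}\n\nHint: {teacher_hint}"
    return question


if __name__ == "__main__":
    failed_group = [0.0, 0.0, 0.0, 0.0]
    print(group_advantages(failed_group))  # no learning signal in this group
    print(reformulate_prompt("Prove the lemma.", failed_group, "Try induction on n."))
```

The student's next rollouts are then sampled from the reformulated prompt, so any non-zero advantage still comes from the student's own outputs rather than from teacher tokens.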

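Relation to wenjun's directions:

For long-trajectory Agent RL, early-stage policies fail entirely in many environments, and ZPPO offers a concrete way to implement a "teacher-supported region";
For model-based RL / Dreamer Agent, the analogy is that the world model or teacher planner only changes the state/question framing and does not directly replace the policy gradient;
It is worth reading together with ExpRL, WAPO, STRIDE, and HSD as a main thread on RLVR stability under sparse rewards.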

#5. ProCUA-SFT Technical Report

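Links: <https://arxiv.org/abs/2606.17321> / <https://huggingface.co/papers/2606.17321>
Source / date: arXiv / Hugging Face Daily Papers; 2026-06-15 / Daily 2026-06-17
Category: LLM Agent / Computer-use Agent / Agent Data / Post-training
One-sentence contribution: introduces ProCUA-SFT, a 3.1M step-level CUA SFT dataset, and reports that continuing SFT of UI-TARS 7B directly on AgentNet drops the OSWorld success rate from 26.3% to 8–10%.

Why it matters:

The most important message here is not "yet another large CUA dataset" but that agent trajectory data can cause negative transfer. If public human trajectories are small in scale, unevenly distributed, and mismatched in action style with the target benchmark, SFT not only fails to help but may damage existing capabilities. ProCUA-SFT uses an automated pipeline to synthesize grounded tasks from real desktop content, then distills 93K synthetic trajectories into step-level samples spanning 2,484 application combinations.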

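Relation to wenjun's directions:

It is central to "how agent pretraining data shapes capability": trajectory quality, task grounding, tool/application combinations, and multi-step feedback matter more than token count;
The same holds for code agents: blindly collecting GitHub issue trajectories or failed debugging logs may cause negative transfer;
Its data pipeline can inform the design of synthetic trajectory generation and filtering for code-use / repo-use agents.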

#Paper and activity list

Title	Link	Source / date	Category	One-sentence core contribution	Notes
Looped World Models	<https://arxiv.org/abs/2606.18208>	arXiv / HF; 2026-06-16 / 06-17	Model-based RL / World Model	Iteratively refines the latent environment state with a parameter-shared looped block and treats iterative latent depth as a scaling axis for world simulation.	Close read today
LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling	<https://arxiv.org/abs/2606.18023>	arXiv / HF; 2026-06-16 / 06-17	Code Agent / Latent Reasoning	Trains a 7B Parallel Loop Transformer coder and systematically studies the benefit/cost of loop count for code, tool use, and agentic SWE.	Close read today
OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation	<https://arxiv.org/abs/2606.17628>	arXiv / HF; 2026-06-16 / 06-17	LLM Agent / Self-evolving	Learns to read, use, write, and maintain experience through slow-fast co-evolution and on-policy distillation.	Close read today
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients	<https://arxiv.org/abs/2606.18216>	arXiv / HF; 2026-06-16 / 06-17	Post-training RL / RLVR	On hard questions, puts the teacher in the prompt to form a proximal learning zone instead of breaking on-policy gradients.	Close read today
ProCUA-SFT Technical Report	<https://arxiv.org/abs/2606.17321>	arXiv / HF; 2026-06-15 / 06-17	Computer-use Agent / Data	Builds a 3.1M-step CUA SFT dataset and reports that direct SFT on AgentNet can cause negative transfer on OSWorld.	Key item on agent data quality
ActWorld: From Explorable to Interactive World Model via Action-Aware Memory	<https://arxiv.org/abs/2606.17730>	arXiv / HF; 2026-06-16 / 06-17	World Model / Interactive Agent	Extends navigable world models to an action-aware memory world model that supports mid-trajectory object interaction.	Relevant to world-model agents
EgoCS-400K: An Egocentric Gameplay Dataset for World Models	<https://arxiv.org/abs/2606.18180>	arXiv / HF; 2026-06-16 / 06-17	World Model / Pretraining Data	Builds replay-grounded egocentric video-action-language trajectory data from CS/CS2 professional match demos.	World model data
ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining	<https://arxiv.org/abs/2606.17200>	arXiv / HF; 2026-06-15 / 06-17	Pretraining Data / VLA	Converts human egocentric video into robot-format pseudo-action trajectories and jointly trains a VLA with a unified action representation.	Agent pretraining data
Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes	<https://arxiv.org/abs/2606.17043>	arXiv / HF; 2026-06-15 / 06-16	RL / Credit Assignment	Online RL for VLAs under sparse binary episode outcomes, with hierarchical credit for viability, efficiency, and intervention segments.	Transferable to long-horizon Agent RL
Learning from the Self-future: On-policy Self-distillation for dLLMs	<https://arxiv.org/abs/2606.18195>	arXiv / HF; 2026-06-16 / 06-17	Post-training / Diffusion LLM	Designs OPSD for diffusion LLMs, using suffix conditioning and step-level supervision to learn from self-future experience.	Self-distillation mechanism
Rethinking the Role of Efficient Attention in Hybrid Architectures	<https://arxiv.org/abs/2606.15378>	arXiv / HF; 2026-06-13 / 06-17	Foundation Model Training / Long Context	Systematically analyzes how efficient attention in hybrid architectures affects the speed and mechanism by which long-context ability emerges.	Training mechanism worth reading
GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?	<https://arxiv.org/abs/2606.17861>	arXiv / HF; 2026-06-16 / 06-17	Code Agent / Evaluation	Uses a real game engine to evaluate whether agents can generate playable games end to end, emphasizing engine grounding, artifact completeness, and interactive verification.	Code agent benchmark
Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion	<https://arxiv.org/abs/2606.14885>	arXiv / HF; 2026-06-12 / 06-17	LLM Agent / Tool-use / Search	Treats retrieval as an agent-callable action that dynamically expands the local workspace, making DCI scalable to large corpora.	Practical for deep research agents
Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent LLM Systems	<https://arxiv.org/abs/2606.17182>	arXiv / HF; 2026-06-15 / 06-17	Multi-Agent / Systems / Verification	Uses TLA+ and Verus to formalize concurrency anomalies and consistency levels when multiple agents share a memory/tool registry.	Multi-agent runtime safety
Self-Evolving Visual Questioner	<https://arxiv.org/abs/2606.13929>	arXiv / HF; 2026-06-11 / 06-17	Self-evolving / Multimodal Agent	The VLM generates and filters harder, more vision-centric questions itself, enabling questioner/answerer self-evolution without external supervision.	Self-evolving data loop

#Top 3 papers to read closely today

Looped World Models

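Purpose of reading: understand whether a world model can gain long-horizon simulation ability through looped latent depth, rather than through more parameters or longer explicit rollouts.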

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

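Purpose of reading: understand how latent/test-time compute scaling for code models can enter at the architecture and pretraining stage, rather than only through multi-sampling or agent reflection at inference time.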

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

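Purpose of reading: understand how the training objective of self-evolving agents shifts from "having memory" to "managing experience and turning it into reusable capability".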

Alternative close read: to focus on RLVR / sparse reward today, swap the third paper for ZPPO and revisit yesterday's ExpRL, WAPO, STRIDE, and Path-Conditioned Self-Distillation.

#Top 3 repos / models / datasets to follow today

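Note: the HF paper API did not return official repositories / models / datasets fields for the paper entries above. The items below are recently updated projects from the GitHub Search results accessible between 2026-06-17 and 06-18, filtered by wenjun's directions; their maturity and connection to the papers still need to be confirmed.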

Infini-AI-Lab/astraflow

- Link: <https://github.com/Infini-AI-Lab/astraflow>

- What to watch: Dataflow-Oriented Reinforcement Learning for (Multi-)Agentic LLMs.

- Why follow: it relates to how multi-agent / agentic RL training is organized and may offer engineering references for rollout, verification, and trajectory management.

albert-lv/OpenAgora

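- Link: <https://github.com/albert-lv/OpenAgora>

- What to watch: Open-source rollout, verification, trajectory plane for agentic reinforcement learning.

- Why follow: both its name and positioning are close to agentic RL infrastructure, so it is worth watching whether it turns into a reusable trajectory/verification plane.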

Accenture/ContextEcho

- Link: <https://github.com/Accenture/ContextEcho>

- What to watch: Benchmark for Persona Drift in Long Agentic-Coding Sessions.

- Why follow: in long sessions, coding agents not only forget context but also show persona / instruction drift; together with context compression, prompt cache, and agent memory, this is a real deployment problem.

Additional projects to watch:

linny006/agent-eval-harness: <https://github.com/linny006/agent-eval-harness>, a coding agent evaluation harness built on real GitHub issues;
ppap54088/ProxMO-RL: <https://github.com/ppap54088/ProxMO-RL>, which claims to do proximity-based credit assignment in multi-turn RL for LLM agents;
dean0x/skim: <https://github.com/dean0x/skim>, a code-aware context optimization engine for coding agents.

#Research opportunities / Ideas

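#Idea 1: Transferring LoopWM's "looped latent depth" to an LLM Agent world model

Today's LoopWM, ActWorld, and EgoCS-400K together show that the key tensions for world models are long-horizon simulation, action controllability, data availability, and error accumulation. For an LLM Agent, one can define a text/tool world model:

state: task description, repo status, tool history, test feedback, memory summary;
action: read a file, edit code, run tests, search, ask a sub-agent;
transition: predict the observation / hidden failure / future bottleneck that follows an action;
looped latent depth: allocate more internal refinement to high-uncertainty actions.

One feasible experiment: on SWE-bench or a small repo benchmark, have the policy call the latent world model for 1/2/4 loops of imagination before executing a costly tool, and check whether this improves patch success or reduces wasted tool calls.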
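
The state/action schema and the 1/2/4 loop budget above could be written down as follows. This is only a sketch under assumptions: the class and field names, the `Action` members, and the rule mapping an uncertainty score in [0, 1] to a loop budget are all hypothetical and not taken from any of the cited papers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Sequence


class Action(Enum):
    """Hypothetical action set for a text/tool world model."""
    READ_FILE = "read_file"
    EDIT_CODE = "edit_code"
    RUN_TESTS = "run_tests"
    SEARCH = "search"
    ASK_SUBAGENT = "ask_subagent"


@dataclass
class AgentState:
    """Hypothetical state: the pieces listed in the idea above."""
    task: str
    repo_status: str
    tool_history: List[str] = field(default_factory=list)
    test_feedback: str = ""
    memory_summary: str = ""


def loops_for_uncertainty(uncertainty: float, budgets: Sequence[int] = (1, 2, 4)) -> int:
    """Map an uncertainty score in [0, 1] to an imagination loop budget.
    Higher uncertainty selects a later (larger) entry in budgets."""
    clipped = min(max(uncertainty, 0.0), 1.0)
    index = min(int(clipped * len(budgets)), len(budgets) - 1)
    return budgets[index]


if __name__ == "__main__":
    state = AgentState(task="fix failing test", repo_status="1 file modified")
    state.tool_history.append(Action.RUN_TESTS.value)
    for u in (0.1, 0.5, 0.9):
        print(u, loops_for_uncertainty(u))
```

In the proposed experiment, the uncertainty score would come from the world model itself; here it is left as an input because how to estimate it is an open design question.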

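#Idea 2: A benchmark for detecting negative transfer from agent trajectory data

ProCUA-SFT is a reminder that more agent data is not always better. A dedicated study of "trajectory data quality / negative transfer" could:

collect successful, failed, semi-automatic, synthetic, and human trajectories;
control the task distribution, tool set, observation verbosity, and action granularity;
apply SFT / DPO / offline RL / memory retrieval to the same base agent;
measure success rate, tool efficiency, instruction drift, error recovery, and OOD generalization.

The key question is: which agent trajectories shape capability, and which merely teach the model to imitate inefficient behavior? This is highly relevant to wenjun's interest in the quality of agent pretraining data.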
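
The simplest version of the measurement above compares the success rate of the base agent with the rate after fine-tuning on each data source. The sketch below is illustrative only: `success_rate` and `negative_transfer_report` are hypothetical helper names, and the outcome lists in the usage example are made-up placeholders, not results from any paper.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks solved; outcomes holds one bool per task."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def negative_transfer_report(
    base_outcomes: List[bool],
    tuned_outcomes_by_source: Dict[str, List[bool]],
) -> Dict[str, Dict[str, object]]:
    """For each data source, report the change in success rate relative to
    the base agent and flag sources whose fine-tuning made it worse."""
    base_rate = success_rate(base_outcomes)
    report: Dict[str, Dict[str, object]] = {}
    for source, outcomes in tuned_outcomes_by_source.items():
        delta = success_rate(outcomes) - base_rate
        report[source] = {"delta": delta, "negative_transfer": delta < 0}
    return report


if __name__ == "__main__":
    # Placeholder outcomes on the same four tasks (illustrative only).
    base = [True, True, False, False]
    tuned = {
        "human": [True, False, False, False],
        "synthetic": [True, True, True, False],
    }
    for source, row in negative_transfer_report(base, tuned).items():
        print(source, row)
```

A real benchmark would add the other metrics listed above (tool efficiency, instruction drift, error recovery, OOD generalization) and would need enough tasks per cell for the deltas to be statistically meaningful.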

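#Idea 3: Teacher-in-prompt as a cold-start mechanism for long-trajectory RL

The idea behind ZPPO can be transferred to code Agent RL: an early policy fails on every complex issue, so direct RL has no gradient, while having the teacher hand over the patch turns training into imitation. A middle ground:

the teacher gives only local hints, such as possibly relevant files, candidate failure causes, or an explanation of test output;
the student still has to roll out, choose tools, and generate the patch on its own;
the reward is given only by the final tests/verifier;
teacher prompt support is gradually reduced as training proceeds, forming a curriculum.

This can be framed as a "teacher-supported region for agentic RL": more on-policy than pure SFT, and less prone to cold-start failure than pure RL.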
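
The curriculum step, reducing teacher support as training proceeds, can be sketched with a linear decay of the number of hints placed in the prompt. The schedule, the function names `hints_for_step` and `build_student_prompt`, and the prompt layout are assumptions for illustration; the source does not specify a schedule.

```python
import math
from typing import List


def hints_for_step(step: int, total_steps: int, max_hints: int) -> int:
    """Linearly decay the number of teacher hints from max_hints at step 0
    to 0 at total_steps (rounded up, so support fades gradually)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return math.ceil(max_hints * (1.0 - progress))


def build_student_prompt(issue: str, hints: List[str], n_hints: int) -> str:
    """Prepend at most n_hints local hints; the student still has to
    choose tools and write the patch itself."""
    selected = hints[:n_hints]
    if not selected:
        return issue
    hint_lines = "\n".join(f"- {h}" for h in selected)
    return f"{issue}\n\nHints:\n{hint_lines}"


if __name__ == "__main__":
    hints = [
        "The failure is likely in the parser module.",
        "The test output mentions an off-by-one index.",
        "Check how empty input is handled.",
    ]
    for step in (0, 50, 100):
        n = hints_for_step(step, total_steps=100, max_hints=len(hints))
        print(step, n)
        print(build_student_prompt("Fix the failing test.", hints, n))
```

Because the reward still comes only from the final tests or verifier, the hints change what the student sees but not what it is optimized for.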

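#Follow-up suggestions for tomorrow

Keep tracking whether LoopWM / LoopCoder-v2 release official code, model weights, or technical report details, with particular attention to loop count ablations, KV cache cost, and agentic SWE benchmarks;
Check whether ProCUA-SFT releases its dataset, and whether its synthetic trajectory filtering / negative transfer analysis can carry over to code agents;
Watch whether the agentic RL infrastructure repos (AstraFlow, OpenAgora, ProxMO-RL) have real benchmarks, rollout storage, verifiers, and credit assignment implementations rather than only a conceptual README;
Organize recent papers into three topic threads: world model for agent, latent/test-time compute for code models, and self-evolving agent memory and trajectory data.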