Daily Research 2026-06-10 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

#2026-06-10 AI/LLM Latest Papers and Research Hotspots Briefing

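Retrieved: 2026-06-10 08:00 (Asia/Shanghai)
Primary coverage window: 2026-06-08 to 2026-06-10. Because some Hugging Face Daily Papers entries date from June 3-7 but rank high on today's list, this issue extends the window to the most recent 7 days and marks the date on each entry.
Sources: Hugging Face Papers / Daily Papers, arXiv API (cs.AI / cs.CL / cs.LG / cs.SE / stat.ML plus keyword search), GitHub Search (some requests hit the rate limit, so only a few repos were verified), arXiv abstract pages. X/Twitter is not used as a direct evidence source: the current environment has no stably accessible, authenticated X search entry point, so HF / arXiv / GitHub are used instead.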

#0. Overall assessment for today

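The trend closest to wenjun's recent main line is clear today: "long-trajectory training / memory / skills / context compression" for LLM agents is moving from prompt engineering toward trainable intermediate mechanisms. Several threads are worth connecting: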

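Agent RL: MemoPilot, Q-Evolve, ECPO, CAHL, and Reasoning Arena all address the same core bottleneck: credit assignment for long-trajectory agents, process rewards, planner-executor alignment, and what to do when verifiable rewards are not dense enough.
Agent memory / skill: LatentSkill, DCPM, Bayesian-Agent, and CICL show that "memory" is no longer just RAG retrieval but a capability carrier that can be learned, compressed, audited, and even written into weights or adapters.
Code intelligence: SWE-Explore and SIGA both break the evaluation and adaptation of coding agents down from "is the final patch correct" to finer-grained repository exploration, interface grounding, and validation-enforced termination. This matters for agentic coding RL because it gives process rewards and staged training observable targets.
Latent / model-based: End-to-End Context Compression at Scale, LatentSkill, AGCLR, and Latent Spatial Memory / Echo-Memory / AHA-WAM jointly point to one opportunity: train a "latent-representation state" beyond tokens and use it as the state variable for long-horizon reasoning / memory / world models.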

#1. Key papers and developments (ordered by relevance to wenjun)

#1.1 From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory

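Link: https://arxiv.org/abs/2606.08656
Source: arXiv cs.AI / cs.CL / cs.LG; marked as Accepted by ICML 2026
Date: 2026-06-07
Category: LLM Agent / Post-training RL / Memory / Test-time Learning
Core contribution in one sentence: Proposes MemoPilot, which models the agent's "memory update" itself as a multi-turn decision problem and uses multi-turn GRPO to train a plug-in memory copilot, so that a frozen LLM achieves test-time learning across successive interactions through better memory updates.

Why it matters:

Traditional agent memory is mostly hand-written rules: summarize experience after success, reflect after failure, retrieve next time. MemoPilot's key change is to train not the main model but the policy for "how to write memory", and to handle multi-turn credit assignment with turn-wise rewards and a context-independent turn-level advantage. It raises the Elo of a frozen player on Rock-Paper-Scissors and Limit Texas Hold'em.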
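
A minimal sketch of what a turn-level, group-relative advantage could look like. It assumes (this is not taken from the paper) that every rollout in a group yields one scalar reward per turn, and that rewards are normalized across the group separately at each turn index. The function name and the normalization are illustrative assumptions, not MemoPilot's actual estimator.

```python
from statistics import mean, pstdev


def turnwise_group_advantages(rewards, eps=1e-8):
    """rewards[g][t] is the reward of rollout g at turn t.

    All rollouts are assumed to have the same number of turns.
    Returns advantages[g][t], normalized across the group per turn.
    """
    n_turns = len(rewards[0])
    advantages = [[0.0] * n_turns for _ in rewards]
    for t in range(n_turns):
        column = [rollout[t] for rollout in rewards]
        mu, sigma = mean(column), pstdev(column)
        for g, value in enumerate(column):
            # A turn where every rollout got the same reward yields 0 advantage.
            advantages[g][t] = (value - mu) / (sigma + eps)
    return advantages


# Hypothetical group of two rollouts with two memory-update turns each.
adv = turnwise_group_advantages([[1.0, 0.0], [0.0, 0.0]])
```

Normalizing per turn rather than per trajectory means a memory write at turn t is only compared against other writes made at the same turn, which is one simple way to localize credit.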

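Relation to wenjun's research direction:

This paper is close to "LLM model-based RL / Dreamer for LLM Agent": the memory updater can be seen as the agent's external latent state transition model. A natural extension is to treat memory writing as a belief-state update and then train the memory policy with model-based planning / imagined rollouts, rather than only with GRPO on real interactions.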

#1.2 End-to-End Context Compression at Scale

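Link: https://arxiv.org/abs/2606.09659
Source: arXiv cs.CL / cs.AI / cs.LG; Hugging Face Daily Papers list for 2026-06-10
Date: 2026-06-08
Category: Context Compression / Latent Reasoning / Systems / Pretraining
Core contribution in one sentence: Revisits encoder-decoder context compression systematically, training Latent Context Language Models (LCLMs) with a 0.6B encoder and a 4B decoder and exploring large-scale end-to-end compression at ratios of 1:4, 1:8, and 1:16.

Why it matters:

KV cache compression tends to have three problems: large quality drops, an expensive compression step, and the need to first fit the long input into the target model's context. The LCLM route turns the "long context" into a sequence of latent embeddings consumed by the decoder, sidestepping the limits of pure KV cache pruning. The paper explicitly performs architecture search and continued pretraining on more than 350B tokens, making it a rare scale-oriented context compressor work.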
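
To make the compression ratios concrete, here is a small arithmetic sketch. It assumes that a ratio of 1:r means roughly r input tokens per latent embedding, and the 32,000-token context is a made-up example rather than a figure from the paper.

```python
import math


def latent_length(n_tokens: int, ratio: int) -> int:
    """Number of latent embeddings if `ratio` tokens map to one embedding."""
    return math.ceil(n_tokens / ratio)


# Hypothetical 32,000-token agent trajectory at the three reported ratios.
for ratio in (4, 8, 16):
    print(f"1:{ratio} -> {latent_length(32_000, ratio)} latent embeddings")
```

Under these assumptions the decoder would see 8,000, 4,000, or 2,000 positions instead of 32,000.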

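Relation to wenjun's research direction:

For long-trajectory agents, the history of observations / actions / tool results quickly exceeds the context. LCLM can be seen as the foundation-model form of an agent trajectory compressor: compress the trajectory into latent memory, then let the policy decide within a short context. Next questions: does this compression preserve reward-relevant state? Can the compressor be trained with RL signals rather than LM loss alone?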

#1.3 SWE-Explore: Benchmarking How Coding Agents Explore Repositories

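Link: https://arxiv.org/abs/2606.07297
Source: arXiv cs.SE / cs.CL; first entry on the Hugging Face Daily Papers list for 2026-06-10
Date: 2026-06-05
Category: Code Agent / Evaluation / Repository Understanding
Core contribution in one sentence: Proposes SWE-Explore, which isolates the "repository exploration / code localization" subtask of repository-level coding agents: given a repo and an issue, the agent must return a ranking of relevant code regions under a fixed line budget.

Why it matters:

Benchmarks like SWE-bench usually check only whether the final patch passes the tests, so it is hard to tell whether an agent failed because it misunderstood the issue, missed the file, localized the wrong function, or lacked repair ability. SWE-Explore builds line-level ground truth from 848 issues, 10 languages, and 203 open-source repositories, making "exploration" itself measurable.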

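Relation to wenjun's research direction:

This is exactly the intermediate supervision that code agent RL needs: the final patch reward is too sparse, and repository exploration can provide an earlier process reward. For agentic RL, one could design a two-stage or hierarchical policy: the explorer first optimizes relevant-region recall / ranking, then the editor optimizes patch correctness.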
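
One way an exploration-stage process reward could be computed is recall of ground-truth lines under a line budget. The sketch below is an assumption about how such a metric might be defined, not SWE-Explore's official scoring; the region format `(path, start, end)` and the truncation rule are hypothetical.

```python
def recall_under_line_budget(ranked_regions, gold_lines, budget):
    """Fraction of gold lines covered by the top-ranked regions.

    ranked_regions: list of (path, start, end) with inclusive line numbers,
                    best first.
    gold_lines:     set of (path, line) pairs marked relevant.
    budget:         maximum number of distinct lines the agent may return;
                    the region that crosses the budget is truncated.
    """
    covered = set()
    remaining = budget
    for path, start, end in ranked_regions:
        for line in range(start, end + 1):
            if remaining <= 0:
                break
            if (path, line) not in covered:
                covered.add((path, line))
                remaining -= 1
        if remaining <= 0:
            break
    if not gold_lines:
        return 0.0
    return len(covered & gold_lines) / len(gold_lines)


# Hypothetical example: two ranked regions, two gold lines.
regions = [("a.py", 1, 5), ("b.py", 10, 12)]
gold = {("a.py", 3), ("b.py", 11)}
print(recall_under_line_budget(regions, gold, budget=5))  # 0.5
print(recall_under_line_budget(regions, gold, budget=7))  # 1.0
```

A scalar like this could serve as the explorer's reward before any patch is attempted, while the editor keeps the test-based reward.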

#1.4 Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

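Link: https://arxiv.org/abs/2606.09380
Source: arXiv cs.LG / cs.AI / cs.CL
Date: 2026-06-08
Category: Post-training RL / RLVR / Reasoning / Evaluation
Core contribution in one sentence: Targets the RLVR failure case where multiple traces for the same prompt receive the same verifiable reward, so the group-relative advantage carries no gradient; introduces trace tournaments and Bradley-Terry ranking to turn differences in reasoning quality among "tied" answers into relative rewards.

Why it matters:

The practical bottleneck of RLVR is not only whether the reward is verifiable but whether it is informative enough. When a group of samples is all correct or all wrong, GRPO-style methods have no usable advantage. Reasoning Arena's idea is to keep these reward non-diverse groups rather than discard them, hand them to a judge system for trace-level pairwise comparison, and use a small anchor pool to reduce comparison cost.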
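
The two ingredients above can be illustrated with a short sketch: a group-relative advantage that collapses to zero when all rewards tie, and a Bradley-Terry fit over pairwise judge outcomes. The fitting routine is the standard minorization-maximization update for Bradley-Terry, used here as a stand-in; the paper's tournament design, judge, and anchor pool are not reproduced, and the win counts are invented.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization of rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def bradley_terry_strengths(n_items, wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[(i, j)] = number of times trace i was judged better than trace j.
    Returns strengths normalized to sum to 1.
    """
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


# Three traces that all pass the verifier: the advantage is zero everywhere.
print(group_relative_advantages([1.0, 1.0, 1.0]))

# Hypothetical pairwise judge outcomes among the same three traces.
wins = {(0, 1): 3, (1, 0): 1, (0, 2): 4, (1, 2): 3, (2, 1): 1}
strengths = bradley_terry_strengths(3, wins)
```

The fitted strengths give an ordering among tied traces, which could then be fed back through the same group normalization to obtain non-zero advantages.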

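Relation to wenjun's research direction:

In long-trajectory agents, "final success / failure" is even sparser, and different trajectories may all fail while differing in how well they fail. The trace tournament can become an agent trajectory tournament: compare tool selection, information retrieval, subgoal decomposition, and error recovery to train a finer-grained long-horizon policy.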

#1.5 LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

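Link: https://arxiv.org/abs/2606.06087
Source: arXiv cs.CL / cs.AI; Hugging Face Daily Papers list for 2026-06-10
Date: 2026-06-04
Category: LLM Agent / Latent Skill / Context Efficiency / Adapter
Core contribution in one sentence: Proposes LatentSkill, which uses a pretrained hypernetwork to convert textual skills into pluggable LoRA adapters, moving skills from context tokens into weight space and reducing per-step prompt overhead.
repo: GitHub search verified https://github.com/yuaofan0-oss/LatentSkill (updated 2026-06-09, few stars; offered only as a code entry point, maturity needs further checking).

Why it matters:

Many agent systems write reusable skills as text and put them into the prompt at every step. This both consumes context and exposes the skill content. LatentSkill's route keeps skills modular, loadable, and composable, but moves where they are expressed from context space to weight space. The abstract claims it outperforms in-context skill baselines on ALFWorld / Search-QA and significantly reduces prefill tokens.

Relation to wenjun's research direction:

This is a miniature version of "how agent pretraining data shapes capability": can textual procedures / SOPs / skills be internalized as latent adapters? Combined with model-based RL, high-value trajectories could be automatically distilled into skill adapters and then filtered through environment validation.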

#1.6 When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

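Link: https://arxiv.org/abs/2606.05885
Source: arXiv cs.LG / cs.AI
Date: 2026-06-04
Category: LLM Agent / Post-training RL / Long-horizon Credit Assignment
Core contribution in one sentence: Proposes ECPO, arguing that denser step-level credit is not necessarily reliable, and suppresses the bias caused by lucky actions in small samples through action-level evidence calibration and variance-gated weighting.

Why it matters:

In long-trajectory environments such as ALFWorld / WebShop, methods like GiGPO try to construct step-level advantages at anchor states. With a limited number of rollouts, however, an action that happens to succeed can be over-rewarded, causing oscillation late in training. ECPO's value lies in stressing that dense credit is not better simply because it is denser; the strength of the evidence must be calibrated.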
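
The "lucky action" problem can be illustrated with a toy calibration rule. The shrinkage factor `n / (n + prior_strength)` and the variance gate below are generic choices made up for this sketch; they are assumptions, not ECPO's published formulas.

```python
from statistics import mean, pvariance


def calibrated_credit(returns_for_action, baseline,
                      prior_strength=4.0, var_scale=1.0):
    """Step-level credit for one action at an anchor state.

    returns_for_action: returns observed after taking this action.
    baseline:           average return at the anchor state.
    prior_strength:     hypothetical pseudo-count; larger means more shrinkage.
    var_scale:          hypothetical scale at which variance halves the credit.
    """
    n = len(returns_for_action)
    if n == 0:
        return 0.0
    raw = mean(returns_for_action) - baseline
    shrink = n / (n + prior_strength)           # little evidence -> small credit
    gate = 1.0 / (1.0 + pvariance(returns_for_action) / var_scale)
    return raw * shrink * gate


# One lucky success versus eight consistent successes (baseline 0.2).
print(calibrated_credit([1.0], baseline=0.2))
print(calibrated_credit([1.0] * 8, baseline=0.2))
```

Both cases have the same raw advantage of 0.8, but the single-sample action receives much less credit, which is the qualitative behavior the paragraph above describes.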

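Relation to wenjun's research direction:

If wenjun works on LLM Agent RL or Dreamer-like rollouts, rewards / advantages in imagined trajectories are even more prone to bias. ECPO's shrinkage / variance gate could serve as a safety valve for imagined-rollout credit.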

#1.7 Self-evolving LLM agents with in-distribution Optimization

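Link: https://arxiv.org/abs/2606.07367
Source: arXiv cs.LG
Date: 2026-06-05
Category: LLM Agent / Self-evolving Agent / Offline-to-online RL
Core contribution in one sentence: Proposes Q-Evolve, which combines expert data with agent-generated trajectories through automatic process-reward labeling and an in-distribution critic, and uses weighted IQL to stabilize self-evolving agent training under sparse rewards.

Why it matters:

The emphasis on "in-distribution" is key: if a self-evolving agent runs Bellman backups directly on out-of-distribution trajectories, the critic easily collapses. Q-Evolve learns the critic with a hybrid off-policy dataset and weighted IQL, then derives step-wise process rewards from it.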
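
For readers less familiar with IQL, the piece that keeps learning in-distribution is the expectile regression loss on the value function, which avoids querying the critic on unseen actions. The sketch below shows the standard IQL expectile loss only; Q-Evolve's specific weighting scheme is not reproduced, and the sample numbers are invented.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    diff = Q(s, a) - V(s) for a dataset action a.
    With tau > 0.5, underestimating V (diff > 0) is penalized more,
    so V approaches an upper expectile of Q over in-dataset actions.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def value_loss(q_values, v, tau=0.7):
    """Mean expectile loss of one state value against dataset Q-values."""
    return sum(expectile_loss(q - v, tau) for q in q_values) / len(q_values)


# Hypothetical Q-values of three dataset actions at one state.
qs = [0.1, 0.4, 0.9]
print(value_loss(qs, v=0.4), value_loss(qs, v=0.7))
```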

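Relation to wenjun's research direction:

It fits well alongside "environment design that fosters self-evolving intelligence": the environment should not only give a final reward but also produce learnable process signals, and the critic's distribution constraint determines whether self-evolution is stable.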

#1.8 Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs

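Link: https://arxiv.org/abs/2606.09371
Source: arXiv cs.AI
Date: 2026-06-08
Category: Tool-use / LLM Agent / RLVR / Hierarchical Policy
Core contribution in one sentence: Proposes CAHL, which uses RLVR to jointly optimize a high-level planner and a low-level tool executor, mitigating the capability mismatch caused by training them separately.

Why it matters:

Many tool agents use a hierarchy: the planner decomposes the task and the executor calls tools. If the planner's subtasks exceed the executor's ability, or the executor's real ability is never fed back to the planner, the system fails structurally. CAHL brings capability alignment directly into training.

Relation to wenjun's research direction:

This is a general problem for code agents / web agents. For long-trajectory RL, the planner's action space can be designed as "subgoals the executor can verifiably complete", followed by joint training with RLVR.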

#1.9 Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

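Link: https://arxiv.org/abs/2606.09483
Source: arXiv cs.AI / cs.CL / cs.LG
Date: 2026-06-08
Category: LLM Agent / Memory / Personalization / Self-evolving Agent
Core contribution in one sentence: Proposes DCPM, which splits agent memory into a synchronously written System 1 and a System 2 that asynchronously induces schemas / intentions, using hierarchical memory to support implicit personalized reasoning across sessions.

Why it matters:

The paper states explicitly that long-term memory is not just passage recall; it also covers belief revision, diachronic identity, latent intentions, and cross-domain patterns. DCPM's doubly linked supersedes chains and nighttime schema induction are a structured answer to "how memory evolves".

Relation to wenjun's research direction:

For agent pretraining / continual learning, this paper offers a testable question: which memories should be kept as episodic evidence, and which should be abstracted into schemas or latent intentions? This also corresponds to the move from "instruction understanding" to "intent understanding".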

#1.10 SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

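Link: https://arxiv.org/abs/2606.09774
Source: arXiv cs.AI / cs.CL
Date: 2026-06-08
Category: Code Agent / Tool-use / Scientific Simulation / Self-evolving Adapter
Core contribution in one sentence: Proposes SIGA, which treats scientific simulator configuration as an agent-tool interface grounding problem and lets a general coding agent learn a specific simulator's contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination.

Why it matters:

This is not an ordinary code generation benchmark; it has the agent operate real scientific software. It stresses that a coding agent can already read files, edit code, run commands, and fix errors, but lacks the vocabulary, constraints, validation rules, and termination conditions of a specific simulator. SIGA's adapter supplies exactly this layer of interface contract.

Relation to wenjun's research direction:

This is instructive for "environment design that fosters self-evolving intelligence": if the environment provides strong validation and executable contracts, the agent can self-correct and accumulate procedural memory. Code intelligence does not necessarily start with training a large model; it may start with training interface-grounding adapters.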

#1.11 Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

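Link: https://arxiv.org/abs/2606.07720
Source: arXiv cs.CL / cs.AI / cs.LG
Date: 2026-06-05
Category: Latent Reasoning / Continuous Thought / Memory
Core contribution in one sentence: Proposes AGCLR, which adds a Gated Concept Stream to CoCoNuT-style continuous latent reasoning so that intermediate facts persist across reasoning passes, easing the concept bottleneck.

Why it matters:

CoCoNuT's core idea is to let the model think in latent space, but this paper points out that intermediate hidden states are overwritten on every pass, so facts from early reasoning are lost. AGCLR maintains a persistent residual memory with write / read / forget gates.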
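
A generic gated memory update gives a feel for how write / read / forget gates can keep a fact alive across passes. This is a textbook-style sketch with scalar gates passed in as logits; in a real model the gates would be learned functions of the hidden state, and none of the names or shapes below come from the AGCLR paper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory_step(memory, candidate, write_logit, forget_logit, read_logit):
    """One reasoning pass over a persistent memory vector.

    memory:    list of floats carried over from the previous pass.
    candidate: list of floats proposed by the current pass.
    Returns (new_memory, read_out).
    """
    w = sigmoid(write_logit)    # how much of the candidate to store
    f = sigmoid(forget_logit)   # how much of the old memory to keep
    r = sigmoid(read_logit)     # how much memory to expose to the next pass
    new_memory = [f * m + w * c for m, c in zip(memory, candidate)]
    read_out = [r * m for m in new_memory]
    return new_memory, read_out


# Keep an early fact: retain old memory (high forget logit), barely write.
mem, out = gated_memory_step([1.0, 2.0], [5.0, 5.0],
                             write_logit=-20.0, forget_logit=20.0,
                             read_logit=20.0)
```

With these gate settings the stored vector stays essentially unchanged even though the current pass proposes different content, in contrast to a hidden state that is overwritten on every pass.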

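Relation to wenjun's research direction:

This is directly relevant to latent-space reasoning. A follow-up question: can the persistent concept stream be unified with agent memory, that is, can the memory of token-level latent reasoning be connected to episode-level agent memory?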

#1.12 Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

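Link: https://arxiv.org/abs/2606.08151
Source: arXiv cs.AI
Date: 2026-06-06
Category: Tool-use / Context Compression / Memory / Code Agent
Core contribution in one sentence: Proposes CICL, which organizes instance evidence into a context graph, selects high-value evidence by action shift, outcome uplift, necessity, and negative-transfer risk, and compresses it into typed memory cards for tool agents.

Why it matters:

Its core is not "compress shorter" but "compress the evidence that would change the decision". On the file retrieval subtask of SWE-bench Verified, the abstract claims memory cards bring measurable gains and expose the limits of context selection.

Relation to wenjun's research direction:

For code agents, context compression should not rank by semantic similarity alone but by causal impact on the action. This idea can define the reward for training an agent trajectory compressor: keep the information that would change the next tool call / patch decision.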

#1.13 Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

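Link: https://arxiv.org/abs/2606.05296
Source: arXiv cs.AI
Date: 2026-06-03
Category: LLM Agent / Test-time Scaling / Black-box RL / Planning
Core contribution in one sentence: Proposes Agentic Monte Carlo (AMC), which exploits the equivalence between RL and Bayesian inference to turn a black-box LLM agent's prior trajectories, via Sequential Monte Carlo sampling, into a posterior closer to the optimal policy, without updating model parameters.

Why it matters:

A black-box model cannot undergo parameter-level RL, but at test time its trajectories can be sampled, reweighted, and steered by value estimates. AMC scales with test-time compute on AgentGym and is reported to surpass a GRPO baseline.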
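
The reweight-and-resample step at the heart of Sequential Monte Carlo can be sketched in a few lines. This assumes the common control-as-inference form in which a partial trajectory is weighted by `exp(value / temperature)`; the value estimates, temperature, and function name are illustrative assumptions, not AMC's actual procedure.

```python
import math
import random


def smc_resample(particles, values, temperature=1.0, rng=None):
    """Resample partial trajectories in proportion to exp(value / temperature).

    particles: list of partial trajectories sampled from the black-box agent.
    values:    hypothetical value estimate for each particle.
    Returns (resampled_particles, normalized_weights).
    """
    rng = rng or random.Random(0)
    max_v = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - max_v) / temperature) for v in values]
    total = sum(weights)
    probs = [w / total for w in weights]
    resampled = rng.choices(particles, weights=probs, k=len(particles))
    return resampled, probs


# Three hypothetical partial trajectories with increasing value estimates.
survivors, probs = smc_resample(["traj_a", "traj_b", "traj_c"],
                                [0.0, 1.0, 2.0], temperature=0.5)
```

Promising partial trajectories are duplicated and weak ones dropped, so additional test-time compute concentrates on better continuations without touching model weights.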

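Relation to wenjun's research direction:

AMC is an entry point to model-based / planning methods for black-box agents. It suggests that even without training access, RL-like optimization is possible through trajectory posterior inference. It can be combined with tree search, world model rollouts, and a memory updater.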

#1.14 Agents' Last Exam

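Link: https://arxiv.org/abs/2606.05405
Source: arXiv cs.AI / cs.CL / cs.LG; Hugging Face Daily Papers list for 2026-06-10
Date: 2026-06-03
Category: LLM Agent / Evaluation / Long-horizon Workflows
Core contribution in one sentence: Proposes Agents' Last Exam (ALE), targeting economically valuable real-world long-horizon workflows, covering 13 industry clusters and 55 subdomains, with an emphasis on verifiable outcomes.

Why it matters:

The paper's starting point: AI benchmark scores are strong but have not translated into large-scale economic deployment in professional domains, partly because evaluation lacks real, long-horizon, verifiable workflows. ALE tries to fill this gap.

Relation to wenjun's research direction:

To study long-horizon agent RL, a benchmark must have real task complexity and also yield verifiable rewards. ALE can serve as a task-design reference, but it remains to be checked whether its tasks can be rerun repeatably and whether it supports collecting interaction trajectories for training.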

#1.15 Latent Spatial Memory for Video World Models / Echo-Memory / AHA-WAM

Latent Spatial Memory for Video World Models: https://arxiv.org/abs/2606.09828, 2026-06-08, Category: World Model / Latent Memory / Multimodal. Proposes Mirage, which maintains 3D spatial memory in diffusion latent space, avoiding the expensive / lossy round trip of RGB point cloud memory.
Echo-Memory: A Controlled Study of Memory in Action World Models: https://arxiv.org/abs/2606.09803, 2026-06-08, Category: World Model / Memory / Evaluation. Fixes the action-to-video interface and varies only the memory read/write mechanism, giving a controlled study of memory in action-conditioned world models.
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing: https://arxiv.org/abs/2606.09811, 2026-06-08, Category: World Model / Robotics / Long-horizon Control. Proposes an asynchronous horizon-adaptive world-action model so that world prediction and action execution are not tied to the same temporal resolution.

Why it matters:

Although these lean toward vision / embodied settings, they share problems with model-based RL for LLM agents: how state is remembered, how the prediction horizon is chosen, and whether the world branch and the action branch need to update at the same frequency.

Relation to wenjun's research direction:

An LLM agent's "world model" need not be video generation; it may be a latent transition model over web pages, codebases, tool state, and user intent. These papers offer transferable structural ideas: latent memory, controlled memory ablation, and asynchronous horizons.

#1.16 FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

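Link: https://arxiv.org/abs/2606.09079
Source: arXiv cs.LG / cs.AI; Hugging Face Daily Papers list for 2026-06-10
Date: 2026-06-08
Category: Systems / Long Context / Memory Indexing
Core contribution in one sentence: Proposes Lookahead Sparse Attention (LSA), which uses a Neural Memory Indexer to predict which historical KV chunks future queries will need and keeps only query-critical chunks in GPU memory.

Why it matters:

The bottleneck of long-context serving is often the KV cache. FlashMemory's angle is not passive pruning but active prediction of future needs; through backbone-free decoupled training, the indexer is trained independently as a dual encoder.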
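
A dual-encoder indexer ultimately reduces to scoring chunks against a query embedding and keeping the best few. The sketch below shows only that selection step with plain dot products; the embeddings are made-up numbers, and how the real indexer predicts future queries is not modeled here.

```python
def select_chunks(query_vec, chunk_vecs, k):
    """Return indices of the k chunks with the highest dot-product score.

    query_vec:  embedding of the (predicted) query, list of floats.
    chunk_vecs: one embedding per historical KV chunk.
    Indices are returned in original order so chunk positions are preserved.
    """
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical 2-d embeddings for three chunks; keep two on the GPU.
keep = select_chunks([1.0, 0.0], [[0.9, 0.0], [0.0, 1.0], [0.5, 0.5]], k=2)
print(keep)  # [0, 2]
```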

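Relation to wenjun's research direction:

For long-trajectory agents, which history fragments future steps will need is essentially a planning problem too. The lookahead indexer can be viewed as an agent memory retrieval policy and optimized jointly with the downstream reward.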

#1.17 SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

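Link: https://arxiv.org/abs/2606.09669
Source: arXiv cs.AI / cs.CL; Hugging Face Daily Papers list for 2026-06-10
Date: 2026-06-08
Category: Multimodal Agent / Evaluation / Spatial Reasoning
Core contribution in one sentence: Proposes SpatialWorld, which integrates 8 heterogeneous simulation backends and 760 annotated tasks and evaluates multimodal agents' interactive spatial understanding with a simulator-agnostic protocol.

Relation to wenjun's research direction:

It shows that agent evaluation is moving from static question answering to interactive environments. Even if wenjun does not work on embodied agents, its unified protocol design is worth borrowing: how tasks, observations, actions, and evaluation are standardized across environments.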

#1.18 OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

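Link: https://arxiv.org/abs/2606.09826
Source: arXiv cs.CV / cs.AI; Hugging Face Daily Papers list for 2026-06-10
Date: 2026-06-08
Category: Agent Evaluation / Game Agent / Test-time Improvement
Core contribution in one sentence: Builds 12 UE5 game environments and proposes the Improvement Dynamics Curve (IDC) to evaluate how an agent improves over attempts inside an agentic-reflection harness.

Relation to wenjun's research direction:

The point is not VLM games themselves but improvement dynamics: an agent should be evaluated not only on its first-attempt score but also on its learning curve across multiple attempts, reflection, and policy updates. This relates directly to test-time learning / self-evolving agents.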

#2. Other entries worth skimming

Title	Link	Source / Date	Category	One-sentence contribution
On the Geometry of On-Policy Distillation	https://arxiv.org/abs/2606.07082	arXiv / 2026-06-05; HF Daily	Post-training / Distillation / RLVR	Compares OPD, SFT, and RLVR with parameter-space diagnostics and finds that OPD updates enter a low-dimensional subspace locking regime, offering a mechanistic explanation for reasoning distillation.
GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution	https://arxiv.org/abs/2606.06892	arXiv / 2026-06-05	Pretraining Data / Data Attribution	Recasts data attribution from a per-sample additive score to subset-level counterfactual utility prediction, explicitly modeling redundancy and complementarity.
MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models	https://arxiv.org/abs/2606.07996	arXiv / 2026-06-06	Pretraining Data / Privacy / Detection	Detects corpus-level traces of pretraining data in black-box models; relevant to data leakage and training data auditing.
Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses	https://arxiv.org/abs/2606.08348	arXiv / 2026-06-06	LLM Agent / Skill Evolution	Treats prompts / tools / memory / SOPs / skills as hypotheses, maintains a posterior from verified trajectory evidence, and uses it to guide skill evolution.
Bespoke-Card: Why Tune When You Can Generate? Synthesizing Workload-Specific Cardinality Estimators	https://arxiv.org/abs/2606.09361	arXiv / 2026-06-08	Code Agent / Database / Agentic Code Generation	Uses a planning agent + coding agent + validator to generate cardinality estimator code for a specific workload, showing that agentic code generation can be applied to systems optimization.
CoVEBench: Can Video Editing Models Handle Complex Instructions?	https://arxiv.org/abs/2606.08415	arXiv / 2026-06-07; HF Daily	Evaluation / Complex Instruction Following	Evaluates complex compositional instruction following with 416 videos, 626 multi-point editing instructions, and 9990 checklist items; a reference for evaluating "intent understanding".

#3. Today's 3 papers most worth a close read

From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory

https://arxiv.org/abs/2606.08656

Why read closely: it treats the memory updater as a trainable policy, which is very close to an entry point for belief-state learning / model-based RL in long-horizon agents.

End-to-End Context Compression at Scale

https://arxiv.org/abs/2606.09659

Why read closely: a large-scale latent context compressor is a candidate foundation module for long-trajectory agents, persistent context, and trajectory summarization.

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

https://arxiv.org/abs/2606.07297

Why read closely: it isolates repository exploration as a sub-capability behind coding agent success or failure, providing a benchmark for process-reward design in agentic coding RL.

Alternate 4th pick: Reasoning Arena (https://arxiv.org/abs/2606.09380), suited to a close read from the RLVR mechanism angle.

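#4. Today's 3 repos / models / datasets most worth following

Note: some GitHub Search requests hit the rate limit in this run, so only project entry points confirmed through accessible sources, or paper artifacts that clearly should be tracked, are listed here; if code is released later, it should be added first.

LatentSkill repo

- Link: https://github.com/yuaofan0-oss/LatentSkill

- Associated paper: https://arxiv.org/abs/2606.06087

- Why follow: if the implementation of converting textual skills into LoRA adapters is complete, the experiments can be reproduced directly and one can observe how the hypernetwork generates skill adapters.

SWE-Explore benchmark / dataset artifact (pending release on the paper page or by the authors)

- Paper: https://arxiv.org/abs/2606.07297

- Why follow: a line-level exploration benchmark with 848 issues and 203 repos is valuable for training and diagnosing code agents. Keep searching for a SWE-Explore GitHub repo, HF dataset, or a release link in the paper appendix.

LCLM / End-to-End Context Compression artifacts (pending release of model weights or code)

- Paper: https://arxiv.org/abs/2606.09659

- Why follow: if the 0.6B encoder + 4B decoder compressor at 1:4 / 1:8 / 1:16 is opened up, it will be an important baseline for agent trajectory compression, long-context compression, and latent memory.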

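#5. Research opportunities / ideas

#Idea 1: Train memory update as a Dreamer-style latent transition

MemoPilot trains "how to write memory", LCLM trains "how to compress context", and the world-model papers train "how to maintain latent memory". The three can be unified as follows:

observation / action / tool result goes into the encoder;
the memory state serves as the latent belief;
the memory updater serves as the transition model;
the policy plans over the latent memory;
the updater and the policy are trained with real rewards + imagined rollouts.

Key experimental question: can the memory state predict future reward-relevant events, rather than merely restate history?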

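#Idea 2: Staged rewards for Code Agent RL: explore → localize → edit → validate

SWE-Explore provides supervision for the explore / localize stages, and SIGA provides structure for the tool-interface grounding / validation stages. A staged code agent training framework could be designed as follows:

explorer: optimizes relevant line / file recall;
planner: decomposes the issue into verifiable subgoals;
editor: generates the patch;
validator: runs tests and performs failure repair;
memory: records repo-specific / simulator-specific contracts.

Research questions: are staged rewards more stable than the final test reward? Should the stages share one model, or be split into adapters?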

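#Idea 3: From "semantic compression" to "decision-aware compression"

CICL emphasizes action shift / outcome uplift, while LCLM emphasizes large-scale latent compression. The two can be combined: first use LCLM for general-purpose compression, then train a decision-aware selector / compressor with agent RL signals.

Key questions:

Which historical tokens / tool results causally affect the next action?
Will the compressor delete information with low similarity but high decision value?
Can counterfactual action change be used as the compressor's training reward?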
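
The questions above can be probed with a simple leave-one-out test: drop each context item and check whether the policy's next action changes. The sketch below uses a toy rule-based policy as a stand-in for an LLM agent; the function names, the toy policy, and the 0/1 reward are all assumptions made for illustration.

```python
def decision_relevant_items(policy, context_items):
    """Return the items whose removal changes the policy's chosen action."""
    base_action = policy(context_items)
    relevant = []
    for i, item in enumerate(context_items):
        ablated = context_items[:i] + context_items[i + 1:]
        if policy(ablated) != base_action:
            relevant.append(item)
    return relevant


def decision_preservation_reward(policy, full_context, compressed_context):
    """1.0 if the compressed context leads to the same action, else 0.0."""
    same = policy(full_context) == policy(compressed_context)
    return 1.0 if same else 0.0


def toy_policy(items):
    """Hypothetical stand-in for an agent: react to a test failure if seen."""
    return "run_tests" if "test failure in foo.py" in items else "read_docs"


history = ["README mentions foo", "test failure in foo.py", "license header"]
print(decision_relevant_items(toy_policy, history))
print(decision_preservation_reward(toy_policy, history,
                                   ["test failure in foo.py"]))
```

A compressor trained on such a signal would be rewarded for keeping the test-failure line even if it is not the item most semantically similar to the issue text; with a real LLM policy, each ablation costs one extra forward pass.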

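#6. Suggested actions for today

Read first this morning: MemoPilot (2606.08656) and LCLM (2606.09659). The former gives a framework for agent test-time learning; the latter gives latent-state infrastructure for long context.
If taking notes on the code agent direction today: read SWE-Explore (2606.07297) + SIGA (2606.09774), focusing on abstracting intermediate states and rewards usable for RL training.
If writing up an idea: write a one-page proposal around "trajectory memory as latent state" that unifies MemoPilot / LCLM / CICL / AGCLR.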