Daily Research 2026-06-07 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

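#2026-06-07 AI/LLM Latest Papers and Research Hotspots Briefing

Retrieval time: 2026-06-07 08:00 Asia/Shanghai. Coverage is mainly Hugging Face Daily Papers for 2026-06-05/06, new and updated arXiv submissions from 2026-06-03 to 2026-06-04, and the GitHub/HF pages that were accessible. The X/Twitter search page opens but requires JavaScript/login, so specific tweets could not be extracted reliably; this issue therefore does not use X as a factual source and relies on arXiv, Hugging Face Papers, and GitHub project pages instead.

#0. Today's Take

Over the past 24-48 hours, the signals closest to wenjun's main research line are highly concentrated: agent "long-term experience internalization / memory / skills" and "latent-space reasoning" are heating up at the same time. Read together, the recent papers are all trying to solve the same problem: a language-model agent cannot rely on a one-shot prompt or single-round RL alone; it needs to compress experience and write it faithfully into reusable capabilities over long-term interaction. Meanwhile, the token bottleneck of explicit CoT is driving a batch of latent reasoning and hybrid token-latent reasoning methods.

The 3 items most worth prioritizing in this issue:

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents: directly examines why experience internalization in self-evolving agents collapses over multiple learning rounds.
Latent Reasoning with Normalizing Flows / TARPO / ReLAT: latent-space reasoning is moving from "can it be done" to a method-refinement stage focused on staying autoregressive, sampleable, inspectable, and explorable under RL.
Code2LoRA + Asuka-Bench + TensorBench / SmellBench: code intelligence keeps moving from single-problem pass@k toward repo-specific adaptation, underspecified intent, multi-round refinement, maintainability, and reliable evaluation.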

#1. Key Papers and Developments

#1. Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

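Link: https://arxiv.org/abs/2606.04703; HF: https://huggingface.co/papers/2606.04703; GitHub: https://github.com/RUCBM/ExpInternalization
Source/Date: arXiv / Hugging Face Papers; submitted 2026-06-03, included in HF Daily Papers 2026-06-06
Category: LLM Agent / Continual Learning / Self-evolving Agent / Agent Memory
Core contribution in one sentence: A systematic study of why, when an LLM agent converts historical interaction experience into parametric capability, multi-round experience learning leads to progressive capability collapse rather than steady improvement; it also finds that principle-level experience is more durable than instance-level experience.

Why it matters: This paper sits close to the core bottleneck of self-evolving LLM agents. Much prior work assumes that summarizing trajectories into an experience library, principles, or reflections, and then continuing to train or prompt on them, makes the agent stronger with use. The paper instead shows that multi-round experience internalization can collapse capability: especially when the experience granularity is too close to individual instances and the learning loop lacks a stabilization mechanism, later iterations may overwrite or contaminate existing capabilities.

Relation to wenjun's direction: If you work on long-horizon agent RL or model-based RL for LLM agents, this paper is an important negative example: more experience is not automatically better, and the representation granularity, filtering mechanism, and internalization objective may matter more than the number of trajectories. It also suggests using a "principle-level world model / policy prior" as the unit of agent pretraining or continual learning, rather than simply distilling full trajectories.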

#2. Latent Reasoning with Normalizing Flows

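Link: https://arxiv.org/abs/2606.06447; HF: https://huggingface.co/papers/2606.06447
Source/Date: arXiv / Hugging Face Papers; submitted 2026-06-04, included in HF Daily Papers 2026-06-06
Category: Latent Reasoning / Reasoning Model / Test-time Scaling
Core contribution in one sentence: Proposes NF-CoT, which uses normalizing flows for continuous latent reasoning while trying to preserve the autoregressive language model's left-to-right generation, probabilistic sampling, KV-cache decoding compatibility, and tractable likelihood.

Why it matters: Many latent CoT methods sacrifice the engineering advantages of textual CoT: they are not naturally autoregressive, are hard to sample from, are hard to plug into a KV cache, and have likelihoods that are difficult to compute. This paper pushes latent reasoning a step toward being trainable, sampleable, and deployable.

Relation to wenjun's direction: Given your recent interest in latent-space reasoning, the method section deserves a close read. In particular, it tries to turn the continuous latent state into an intermediate computation that can be modeled probabilistically, rather than an uncontrolled blob of vectors inside hidden layers. If you later work on latent planning or world models for agents, its design goal of preserving the LM decoding paradigm is worth borrowing.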

#3. TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

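Link: https://arxiv.org/abs/2606.05859
Source/Date: arXiv; submitted 2026-06-04
Category: Latent Reasoning / Post-training RL / Reasoning Model
Core contribution in one sentence: Proposes a pure RL framework in which an action-head router adaptively chooses, at every step, between emitting an explicit token and performing continuous latent reasoning, to mitigate the strong determinism of purely latent representations and the resulting lack of RL exploration.

Why it matters: This paper complements NF-CoT. NF-CoT emphasizes probabilistic modeling of latent reasoning and LM compatibility, whereas TARPO emphasizes step-by-step token/latent routing and RL exploration. It turns "when to say it out loud and when to think silently" into a policy-learning problem.

Relation to wenjun's direction: This connects naturally to agentic RL: an agent neither needs to externalize its thinking as text at every step nor has to stay fully latent. It can learn an action-routing policy that folds token budget, interpretability, and search depth into a single control policy.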
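
As a rough illustration of per-step token/latent routing, the sketch below samples a routing decision from a Bernoulli router and accumulates the log-probability that a policy-gradient update would need. It is a toy under stated assumptions, not TARPO's implementation: `route_step`, `rollout`, and the scalar `router_logit` are hypothetical names, and the paper's actual action head, latent update, and RL objective are not described in this briefing.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def route_step(router_logit: float, rng: random.Random) -> tuple[str, float]:
    """Sample one routing action and return (action, log-probability).

    Hypothetical stand-in for an action-head router: sampling the decision
    keeps the trajectory stochastic even if the latent update is deterministic.
    """
    p_latent = sigmoid(router_logit)
    if rng.random() < p_latent:
        return "latent", math.log(p_latent)
    return "token", math.log(1.0 - p_latent)


def rollout(router_logits: list[float], seed: int = 0) -> tuple[list[str], float]:
    """Route every step and sum the log-probabilities of the sampled actions."""
    rng = random.Random(seed)
    actions: list[str] = []
    total_logp = 0.0
    for logit in router_logits:
        action, logp = route_step(logit, rng)
        actions.append(action)
        total_logp += logp
    return actions, total_logp
```

In this toy, the summed log-probability is the quantity a REINFORCE-style update would weight by the trajectory's reward; whether TARPO uses that estimator is not stated here.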

#4. Closing the Loop on Latent Reasoning via Test-Time Reconstruction

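Link: https://arxiv.org/abs/2606.06252
Source/Date: arXiv; submitted 2026-06-04
Category: Latent Reasoning / Evaluation / Test-time Scaling
Core contribution in one sentence: Proposes ReLAT, which uses test-time reconstruction to check whether the latent state still retains the constraints of the original problem, aiming to address uninspectable intermediate states and open-loop drift in latent reasoning.

Why it matters: The biggest engineering risk of latent reasoning is that you cannot see where it went wrong. ReLAT frames the problem well: textual CoT is slow but auditable, whereas a latent state without a fidelity check can silently drop constraints.

Relation to wenjun's direction: This matters especially for long-trajectory agents. An agent's belief state, memory summary, and latent plan can all drift; closing the loop with reconstruction or consistency checks is a design that transfers to training agent world models and latent memory.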

#5. Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

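Link: https://arxiv.org/abs/2606.06492; HF: https://huggingface.co/papers/2606.06492
Source/Date: arXiv / Hugging Face Papers; submitted 2026-06-04, included in HF Daily Papers 2026-06-06
Category: Code Agent / Code Intelligence / Repository Adaptation / Continual Learning
Core contribution in one sentence: Uses a hypernetwork to generate a repo-specific LoRA adapter from a repository snapshot or code diff, injecting repository-level knowledge into a code model without adding inference-time token overhead, and supports updating the adapter as the software evolves.

Why it matters: Code models usually either pack repository context into the prompt via RAG or dependency analysis, or fine-tune per repository. The former is token-expensive and prone to losing context; the latter is costly to maintain. Code2LoRA turns repo knowledge into a parametric adapter, and Code2LoRA-Evo in particular tracks diffs with a GRU hidden state, which closely matches real software evolution.

Relation to wenjun's direction: This is a code-side example of "how agent pretraining data and environments shape capability": repository knowledge does not have to live only in context; it can also become lightweight parametric state. For a self-evolving code agent, a natural question is whether the diffs produced by the agent's own edits, together with test feedback and review feedback, can be written continuously into an adapter or a memory policy.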

#6. Asuka-Bench

Link: https://arxiv.org/abs/2606.05920
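Source/Date: arXiv; submitted 2026-06-04
Category: Code Agent / Evaluation / Intent Understanding / Multi-round Refinement
Core contribution in one sentence: Builds a code-agent benchmark for web development that puts incomplete user intent, multi-round feedback, browser rendering behavior, and UI-agent testing into a closed-loop evaluation.

Why it matters: It moves code agents from "complete prompt to one-shot output" to the realistic development flow in which the user sees intermediate results and then clarifies or revises. This matches the shift from instruction following to intent understanding.

Relation to wenjun's direction: For research on RL environments for code agents, Asuka-Bench resembles an interactive environment template: user intent is a hidden variable, and the agent must identify it step by step through iterated artifacts and feedback. It also suits model-based RL: learn a world model of user feedback and rendering results to reduce the cost of real interaction.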

#7. TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

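Link: https://arxiv.org/abs/2606.05570
Source/Date: arXiv; submitted 2026-06-04
Category: Code Agent / Evaluation / Systems
Core contribution in one sentence: Constructs 199 feature-addition and refactoring tasks on a compiler-based tensor framework and evaluates repo-level coding agents with more reliable compilation and testing mechanisms.

Why it matters: Many hard tasks in current repo-level benchmarks depend on manual review or incomplete tests, so evaluation noise is high. TensorBench uses a compiler/tensor framework as the environment; the tasks are hard, and compilation plus behavioral tests give relatively stable verdicts.

Relation to wenjun's direction: Valuable for RLVR in code intelligence: verifiable rewards need not come only from algorithm problems; they can also come from real systems tasks such as compiler IR, sparse tensor operators, and runtime components.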

#8. SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks

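Link: https://arxiv.org/abs/2606.05574
Source/Date: arXiv; submitted 2026-06-04
Category: Code Agent / Evaluation / Software Engineering
Core contribution in one sentence: Proposes a fine-grained refactoring benchmark that deliberately injects code smells into real code snippets to evaluate how well code agents improve readability, extensibility, and robustness.

Why it matters: Code agents cannot be judged on functional correctness alone. Much agent-generated code runs but is bloated and hard to maintain. SmellBench focuses on maintainability, a kind of evaluation closer to engineering quality than pass@k.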

#9. EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

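Link: https://arxiv.org/abs/2606.03841; HF: https://huggingface.co/papers/2606.03841; GitHub: https://github.com/usail-hkust/EvoDS
Source/Date: arXiv 2026-06-02; included in HF Daily Papers 2026-06-06
Category: LLM Agent / Self-evolving Agent / Context Compression / Agentic RL
Core contribution in one sentence: Proposes a self-evolving data science agent that synthesizes, verifies, and reuses executable skills through Autonomous Skill Acquisition, and treats context management as a learnable control problem through Adaptive Context Compression.

Why it matters: This paper puts skill learning and context compression inside one agentic RL system instead of doing prompt compression in isolation. For long tasks, context management is itself a decision problem: when to keep, compress, forget, or invoke a skill.

Relation to wenjun's direction: A good fit for your interests in a "general-purpose context compressor" and "self-evolving code/data agents". Adaptive Context Compression (ACC) can be abstracted as a belief-state compression policy, and the skill library can be viewed through the lens of options / hierarchical RL.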

#10. MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

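Link: https://arxiv.org/abs/2606.06473; HF: https://huggingface.co/papers/2606.06473; GitHub: https://github.com/InternScience/MLEvolve
Source/Date: arXiv / Hugging Face Papers; submitted 2026-06-04, included in HF Daily Papers 2026-06-06
Category: LLM Agent / AutoML / Self-evolving Agent / Search
Core contribution in one sentence: Proposes a multi-agent self-evolving framework for machine learning algorithm discovery that improves long-horizon search with Progressive MCGS, cross-branch information flow, and Retrospective Memory.

Why it matters: Many MLE agents suffer from search branches that are isolated from one another, have no memory, and lack hierarchical control over the shift from exploration to exploitation. MLEvolve explicitly extends the search tree into a graph that shares information and adds retrospective memory.

Relation to wenjun's direction: Usable as a case study of long-term credit assignment in research/engineering agents. In particular, how Retrospective Memory selects, compresses, and reuses failed and successful experience is closely tied to how agent pretraining data shapes capability.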

#11. MemTrain: Self-Supervised Context Memory Training

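Link: https://arxiv.org/abs/2606.03197; HF: https://huggingface.co/papers/2606.03197
Source/Date: arXiv 2026-06-02; included in HF Daily Papers 2026-06-04
Category: LLM Agent / Memory / Context Compression / Pretraining Objective
Core contribution in one sentence: Trains context-memory ability with self-supervised proxy tasks on Wikipedia, including proxy objectives such as recovering masked entities after multiple rounds of memory updates.

Why it matters: Rather than running RL directly on downstream agent tasks, it builds self-supervised pretraining for general memory behavior. This is one concrete direction for "how agent pretraining data shapes capability": first train transferable memory update and read abilities, then do downstream post-training.

Relation to wenjun's direction: It can inform the design of agent pretraining data: beyond collecting tool-call trajectories, one can construct self-supervised tasks targeting belief update, entity tracking, and constraint retention.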

#12. Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

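Link: https://arxiv.org/abs/2605.30159; HF: https://huggingface.co/papers/2605.30159
Source/Date: arXiv 2026-05-28; included in HF Daily Papers 2026-06-06
Category: LLM Agent / Memory / Long-horizon RL
Core contribution in one sentence: Shows that recursive summarization in long-trajectory agents loses task-relevant information and introduces semantic noise, and argues that memory-policy optimization should attend to the quality of intermediate memory, not only final trajectory success.

Why it matters: It belongs to the same line as MemTrain, EvoDS, and Experience Internalization: memory is not a simple summary but an externalization of the latent task state; if memory is blurry, the belief drifts and the long task eventually fails.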

#13. World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

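Link: https://arxiv.org/abs/2606.05979; HF: https://huggingface.co/papers/2606.05979; GitHub: https://github.com/SJTU-DENG-Lab/WLA
Source/Date: arXiv / Hugging Face Papers; submitted 2026-06-04, included in HF Daily Papers 2026-06-06
Category: Model-based RL / World Model / Embodied Agent / VLA
Core contribution in one sentence: Proposes the WLA model, which unifies world modeling, language reasoning, and action synthesis in an autoregressive Transformer that takes text instructions, images, and robot state as input and jointly predicts textual subtasks, subgoal images, and robot actions.

Why it matters: WLA explicitly combines a world modeling interface with language reasoning, a model-based agent approach within physical AI. Although it is not a pure LLM software agent, its "predict next state + subgoal + action" interface transfers to world models for LLM agents.

Relation to wenjun's direction: Given your interest in Dreamer for LLM Agent, this kind of work offers an analogy: an LLM agent's "world state" can be the state of a web page, code repository, or tool environment; its actions are tool calls or code diffs; its subgoals are textual or structured plans. The key question is how to make the world model support both language reasoning and verifiable actions.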

#14. MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

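Link: https://arxiv.org/abs/2606.06245
Source/Date: arXiv; submitted 2026-06-04
Category: Model-based RL / Latent Reasoning / VLA / Test-time Scaling
Core contribution in one sentence: For long-horizon, high-uncertainty control, proposes multi-path latent reasoning: initialize multiple hypotheses, refine them for K steps with shared weights, aggregate them by reward-guided path preference, and then decode actions.

Why it matters: It combines test-time scaling, latent reasoning, and reward/world-model progress signals, which is closer to planning than single-path action decoding.

Relation to wenjun's direction: This is close to the idea of a "Dreamer-like LLM agent": unroll several candidate futures in latent space, select or aggregate them by reward/progress, then output an action. It could transfer to code agents as multiple latent patch plans + verifier reward + final diff decoding.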

#15. Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

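Link: https://arxiv.org/abs/2606.04923; HF: https://huggingface.co/papers/2606.04923
Source/Date: arXiv 2026-06-03; included in HF Daily Papers 2026-06-04
Category: Post-training RL / RLHF/RLAIF / Safety / Evaluation
Core contribution in one sentence: Proposes CHERRL, a controllable hacking environment that injects known biases into an LLM-as-a-Judge to reliably reproduce, analyze, and detect reward hacking in rubric-based RL.

Why it matters: RL with a rubric or judge is a common recipe for agent training, but once the policy exploits judge bias, training can yield a policy that scores high yet is ineffective or unsafe. CHERRL provides an experimental environment in which this can be reproduced under control.

Relation to wenjun's direction: Especially critical for long-horizon agent RL, where the reward often comes from a rubric judge, a verifier, or a user proxy. Without studying reward hacking, a self-evolving agent may merely learn to please its evaluator.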
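
To make "injecting a known bias into the judge" concrete, here is a minimal sketch in which a keyword-coverage rubric stands in for the unbiased reference and a trigger phrase earns a fixed bonus. Everything in it is an assumption for illustration: `rubric_score`, `biased_judge`, `hacking_gap`, the trigger phrase, and the bonus value are hypothetical and are not taken from CHERRL.

```python
def rubric_score(response: str, required_points: list[str]) -> float:
    """Unbiased reference score: fraction of rubric points the response covers."""
    if not required_points:
        return 0.0
    text = response.lower()
    hits = sum(1 for point in required_points if point.lower() in text)
    return hits / len(required_points)


def biased_judge(
    response: str,
    required_points: list[str],
    trigger: str = "as an expert",
    bonus: float = 0.5,
) -> float:
    """Judge with a known, injected bias: a bonus whenever the trigger phrase appears."""
    score = rubric_score(response, required_points)
    if trigger in response.lower():
        score += bonus
    return score


def hacking_gap(response: str, required_points: list[str]) -> float:
    """Judge score minus reference score; a positive gap means the bias was exploited."""
    return biased_judge(response, required_points) - rubric_score(response, required_points)


rubric = ["unit tests", "rollback"]
honest = "Add unit tests and a rollback plan."
hacked = "As an expert, trust me."
print(hacking_gap(honest, rubric), hacking_gap(hacked, rubric))
```

Because the bias is injected by construction, the gap between the judge and the reference is measurable, which is what makes detection methods testable in a controlled setting of this kind.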

#16. Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

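Link: https://arxiv.org/abs/2605.31058; HF: https://huggingface.co/papers/2605.31058
Source/Date: arXiv 2026-05-29; included in HF Daily Papers 2026-06-06
Category: Code Agent / Post-training RL / RLVR / Synthetic Data
Core contribution in one sentence: Proposes Atomic Decomposition and Recombination (ADR), which generates more novel and more difficult verifiable tasks for code RLVR, easing the scarcity of high-quality verifiable code tasks.

Why it matters: One bottleneck of code RLVR is not the algorithm but the shortage of verifiable tasks near the edge of the model's capability. ADR decomposes tasks into atoms and recombines them to increase difficulty and novelty.

Relation to wenjun's direction: Directly applicable to self-evolving code agents: the agent not only solves tasks but can also generate a curriculum, synthesizing verifiable environments around its current capability boundary.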

#17. Self-Distilled Policy Gradient

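Link: https://arxiv.org/abs/2606.04036; HF: https://huggingface.co/papers/2606.04036; GitHub: https://github.com/lauyikfung/SDPG
Source/Date: arXiv 2026-06-02; included in HF Daily Papers 2026-06-04
Category: Post-training RL / RLVR / Self-distillation
Core contribution in one sentence: Proposes SDPG, which combines group-relative verifier advantages, full-vocabulary on-policy self-distillation, and reference-policy KL regularization to improve the stability and performance of sparse-reward RL.

Why it matters: It targets RLVR instability and the insufficient supervision of sparse rewards, using self-distillation under privileged context to provide dense supervision. The code is open source.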
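
For readers less familiar with the "group-relative verifier advantages" ingredient, the sketch below shows the common GRPO-style normalization: each sampled response's verifier reward is centered and scaled by the statistics of its own sampling group. This is an assumption about the general technique, not SDPG's exact formula; the function name and the `eps` stabilizer are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds verifier rewards for several responses to the same prompt.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two of four sampled responses pass the verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Note that when every response in a group receives the same reward, all advantages are zero, which is one way to see why a sparse verifier reward alone can leave a group without a learning signal.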

#18. OPRD: On-Policy Representation Distillation

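Link: https://arxiv.org/abs/2606.06021; HF: https://huggingface.co/papers/2606.06021
Source/Date: arXiv / Hugging Face Papers; submitted 2026-06-04, included in HF Daily Papers 2026-06-06
Category: Post-training RL / Distillation / Reasoning Model
Core contribution in one sentence: Moves on-policy distillation from matching output token probabilities to aligning hidden-layer representations, avoiding the sampling variance of Monte Carlo KL over a large vocabulary and exploiting the teacher's intermediate representations.

Why it matters: Practical for distilling reasoning models. It shows that imitating the teacher only in output space may waste intermediate structural information; it is relevant to both small-model reasoning distillation and agent policy distillation.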

#19. Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

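Link: https://arxiv.org/abs/2606.02060; HF: https://huggingface.co/papers/2606.02060; GitHub: https://github.com/NJU-LINK/DRIFT
Source/Date: arXiv 2026-06-01, updated 2026-06-02; included in HF Daily Papers 2026-06-04
Category: LLM Agent / Evaluation / Tool-use / Deep Research
Core contribution in one sentence: Proposes TELBench and DRIFT, which use span-level error localization to find which retrieval, evidence-checking, hypothesis, or synthesis spans in a deep-research agent's trajectory make the final answer unreliable.

Why it matters: A final-answer score only tells you whether the run succeeded, not where the trajectory broke. DRIFT's claim-centric auditing is useful for debugging long-trajectory agents.

Relation to wenjun's direction: Model-based RL and long-trajectory credit assignment need errors localized to intermediate spans or actions rather than relying on the final reward alone. TELBench and DRIFT are good references for evaluation and data construction.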

#20. Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

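Link: https://arxiv.org/abs/2606.03965; HF: https://huggingface.co/papers/2606.03965
Source/Date: arXiv 2026-06-02; included in HF Daily Papers 2026-06-04
Category: LLM Agent / Reasoning Control / Test-time Scaling
Core contribution in one sentence: Models reasoning steering as an MDP in which a controller agent, during reasoning, chooses the next strategy and steering phrase based on the existing reasoning trace and the remaining budget, guiding a frozen reasoner to reason more efficiently.

Why it matters: This is not simple CoT compression; it treats "how to think" as a controllable action. It relates to token budgets, test-time scaling, and agent controllers.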

#21. AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

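Link: https://arxiv.org/abs/2606.05622; HF: https://huggingface.co/papers/2606.05622
Source/Date: arXiv / Hugging Face Papers; submitted 2026-06-04, included in HF Daily Papers 2026-06-06
Category: LLM Agent / Planning / Evaluation / Intent Understanding
Core contribution in one sentence: Proposes a dynamic, interactive benchmark that evaluates whether an LLM agent can keep re-planning as world constraints and user constraints are revealed step by step.

Why it matters: In real tasks, constraints are often not all given up front; they surface gradually when the agent violates one or the user adds feedback. AdaPlanBench brings hidden constraints into planning evaluation.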

#22. AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

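Link: https://arxiv.org/abs/2606.05557; HF: https://huggingface.co/papers/2606.05557
Source/Date: arXiv / Hugging Face Papers; submitted 2026-06-04, included in HF Daily Papers 2026-06-06
Category: LLM Agent / Intent Understanding / Tool-use
Core contribution in one sentence: Inserts IntentFrame between scene perception and tool use to infer implicit needs, and uses a gap score to control the probe budget and tool selection.

Why it matters: It maps directly onto the move from instruction understanding to intent understanding. A user who asks "Where is Lin Wei?" may really want to know whether that person can be interrupted; the agent needs to probe the implicit need actively instead of answering only the literal question.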

#23. Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

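Link: https://arxiv.org/abs/2606.03895; HF: https://huggingface.co/papers/2606.03895
Source/Date: arXiv 2026-06-02; included in HF Daily Papers 2026-06-04
Category: LLM Agent / Systems / Tool-use / Safety
Core contribution in one sentence: Proposes a library-OS-like agent runtime that manages each agent as an AgentProcess with a process identity, lifecycle, capabilities, object memory, checkpoints, events, and audit records.

Why it matters: Long-running agents need state, permissions, recovery, auditing, and side-effect management. Agent libOS offers an abstraction from the systems angle and is a useful reference for building real agent infrastructure.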

#24. STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

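Link: https://arxiv.org/abs/2606.05165; HF: https://huggingface.co/papers/2606.05165
Source/Date: arXiv 2026-06-03; included in HF Daily Papers 2026-06-04
Category: Pretraining Data / Data Attribution / Mechanistic Understanding
Core contribution in one sentence: Shifts training data attribution from parameter-gradient approximations to activation-space functional effects, and uses sparse recovery to estimate data influence from subset perturbations.

Why it matters: This is close to "how pretraining data shapes capability". More reliable tracing of how data contributes to model behavior would provide tools for data quality, deduplication, contamination detection, and understanding how capabilities form.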

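#2. Top 3 Papers for Close Reading Today

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Reason for close reading: It directly targets the failure mode of multi-round experience internalization in self-evolving LLM agents, a risk to understand before working on agent continual learning or long-trajectory RL.

Latent Reasoning with Normalizing Flows

Reason for close reading: Latent reasoning methods are moving from concept to engineering usability, and this paper focuses on compatibility with autoregression, sampling, KV cache, and likelihood.

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Reason for close reading: A new route for repo-level code intelligence: instead of stuffing ever more context into the prompt, turn repository knowledge into an adapter that evolves with diffs.

Alternates: if you would rather read RL/post-training today, swap the third paper for Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning or Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination.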

#3. Top 3 Repos/Models/Datasets to Follow Today

ExpInternalization

- Link: https://github.com/RUCBM/ExpInternalization

- Paper: Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

- Why follow: It likely contains the experimental setup for experience internalization, the construction of experience granularity, and the collapse diagnostics.

EvoDS

- Link: https://github.com/usail-hkust/EvoDS

- Paper: EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

- Why follow: Skill acquisition plus adaptive context compression is a key combination for long-task agents.

MLEvolve

- Link: https://github.com/InternScience/MLEvolve

- Paper: MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

- Why follow: A good place to see how a long-horizon MLE agent organizes search, memory, and cross-branch information sharing.

Also worth watching:

SDPG: https://github.com/lauyikfung/SDPG, for RLVR / self-distilled policy gradient.
WLA: https://github.com/SJTU-DENG-Lab/WLA, for a world-language-action embodied world model.
DRIFT: https://github.com/NJU-LINK/DRIFT, for error localization in deep-research agent trajectories.

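#4. Research Opportunities / Ideas

#Idea 1: Formalize "experience internalization collapse" as a stability benchmark for agent continual learning

Building on the findings of ExpInternalization, one could construct a multi-round experience internalization benchmark for code/web/tool agents: in each round the agent extracts principles, skills, or memory from trajectories, and is then tested on new tasks for both retention of old capabilities and transfer to new ones. The core metric is not single-round success rate but:

old-task retention;
new-task transfer;
the generalization gap between principle-level and instance-level experience;
collapse onset after multiple rounds of training;
the effect of experience filtering, deduplication, and compression strategies on collapse.

This connects continual learning, agent pretraining data quality, and self-evolving agents.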
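
Two of the metrics above can be pinned down with very little code. The sketch below is one possible operationalization, under assumptions that are mine rather than the paper's: retention is the ratio of final to baseline old-task accuracy, and collapse onset is the first round whose accuracy drops more than 20% below the best accuracy seen so far. The function names and the threshold are hypothetical.

```python
from typing import Optional


def retention(old_task_acc: list[float]) -> float:
    """Final old-task accuracy relative to the accuracy before any internalization round."""
    if not old_task_acc or old_task_acc[0] == 0:
        raise ValueError("need a non-zero baseline accuracy")
    return old_task_acc[-1] / old_task_acc[0]


def collapse_onset(acc_per_round: list[float], drop: float = 0.2) -> Optional[int]:
    """First round whose accuracy falls more than `drop` below the best seen so far.

    Returns None if no such round exists.
    """
    best = float("-inf")
    for round_idx, acc in enumerate(acc_per_round):
        best = max(best, acc)
        if acc < best * (1.0 - drop):
            return round_idx
    return None


# Made-up accuracies for illustration only: index 0 is the pre-internalization baseline.
acc = [0.5, 0.6, 0.62, 0.45, 0.3]
print(retention(acc), collapse_onset(acc))
```

Measuring the drop against the running best rather than the baseline is a design choice: it flags a collapse even when the agent first improves and only later degrades.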

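#Idea 2: A Dreamer-like latent world model for LLM agents: from text trajectories to latent belief, then to verifier-guided planning

WLA, MPCoT, and ReLAT suggest a combined approach:

train a latent belief/world model on agent trajectories to predict the next observation, tool result, or code state;
unroll multiple candidate plans in latent space;
evaluate the paths with a verifier, reward model, or progress model;
use reconstruction/consistency checks to keep the latent state from losing the original constraints;
finally, decode only the best latent plan into tool calls or code diffs.

This corresponds directly to "Dreamer for LLM Agent". The key difficulties are how to define a learnable environment state and reward, and how to keep latent planning auditable.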
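
The control flow of the steps above can be sketched as a small skeleton. It shows only the orchestration (propose several candidates, filter by a consistency check, pick the best by score, decode once); the learned components are passed in as plain callables, and every name here, including `plan_in_latent_space`, is hypothetical. Integers stand in for latent plans in the toy usage, so this says nothing about how a real world model or verifier would be trained.

```python
from typing import Callable, Optional, TypeVar

S = TypeVar("S")  # type of a latent state / latent plan


def plan_in_latent_space(
    initial: S,
    propose: Callable[[S, int], S],
    is_consistent: Callable[[S, S], bool],
    score: Callable[[S], float],
    decode: Callable[[S], str],
    n_paths: int = 4,
) -> Optional[str]:
    """Unroll candidates, drop those that lost the original constraints, decode the best."""
    candidates = [propose(initial, i) for i in range(n_paths)]
    kept = [c for c in candidates if is_consistent(initial, c)]
    if not kept:
        return None  # nothing survives the consistency check; caller must re-plan
    best = max(kept, key=score)
    return decode(best)


# Toy usage: integers stand in for latent plans, evenness for "constraints preserved".
result = plan_in_latent_space(
    initial=10,
    propose=lambda state, i: state + i,
    is_consistent=lambda original, candidate: candidate % 2 == 0,
    score=lambda candidate: float(candidate),
    decode=lambda candidate: f"apply_patch({candidate})",
)
print(result)
```

Running the consistency filter before scoring means a high-reward path that has drifted from the original constraints is never decoded, which is one way to keep the planner auditable.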

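#Idea 3: A code agent's repo-specific memory need not live in the prompt; try a three-layer "adapter + memory + verifier" structure

Code2LoRA suggests that repo knowledge can be parameterized; EvoDS and MemTrain suggest that memory and context compression can be learned; TensorBench and Asuka-Bench provide more realistic verification environments. A three-layer code agent is worth exploring:

Adapter layer: stores stable repository knowledge and API conventions;
Memory layer: stores recent development tasks, user preferences, failed tests, and review feedback;
Verifier layer: provides verifiable reward through compilation, tests, UI rendering, and code smell checks.

Research question: which knowledge belongs in the adapter, which in external memory, and which should stay only in the current context? This also connects naturally to code data quality, deduplication, and continual pretraining.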

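#5. Other Items for a Quick Skim

Title	Link	Category	Date	One-sentence contribution
Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation	https://arxiv.org/abs/2606.06428	Post-training RL / Contextual Learning	2026-06-04	Uses a lightweight chrF reward for RL so that the model learns to use in-context linguistic knowledge to translate unseen low-resource languages, rather than memorizing specific languages.
The Shadow Price of Reasoning	https://arxiv.org/abs/2606.03092	Reasoning / Inference Budget	2026-06-02	Models the global allocation of reasoning budget through the economic notion of a shadow price, deciding which queries are worth further thinking when resources are limited.
Unsupervised Skill Discovery for Agentic Data Analysis	https://arxiv.org/abs/2606.06416	LLM Agent / Skill Discovery	2026-06-04	DataCOPE extracts verifier signals from unlabeled exploration trajectories and performs contrastive skill distillation.
Streaming Communication in Multi-Agent Reasoning	https://arxiv.org/abs/2606.05158	Multi-Agent / Reasoning	2026-06-03	StreamMA replaces multi-agent generate-then-transfer with step-by-step streaming, lowering latency and reducing how often downstream agents are misled by a weaker second half of the reasoning.
AdaPlanBench	https://arxiv.org/abs/2606.05622	Planning / Evaluation	2026-06-04	Evaluates an agent's adaptive planning as world constraints and user constraints are revealed step by step.
AURA	https://arxiv.org/abs/2606.05557	Intent Understanding / Tool-use	2026-06-04	Infers the user's implicit needs before tool calls and dynamically controls the probe budget.
SABER	https://arxiv.org/abs/2606.01317	Code Agent / Safety	2026-05-31	Evaluates the operational safety of coding agents in stateful project workspaces, not just whether a single reply refuses.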

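#6. Source Access Notes

The Hugging Face Papers date pages, the arXiv API, and arXiv abs pages were accessible.
The GitHub API hit its rate limit after repeated queries, so repo links were mostly extracted from Hugging Face paper pages; for papers without a discovered repo, none is forced in.
The X/Twitter search page returns HTML but requires JavaScript/login to view and extract live tweets reliably; this issue does not use X content as factual evidence.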