每日调研 2026-05-26 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

#2026-05-26 AI/LLM 最新论文与研究热点简报

检索时间：2026-05-26 08:00 CST。覆盖 arXiv 2026-05-25 recent 列表、Hugging Face Daily Papers 2026-05-25、GitHub daily trending；同时向前扩展到最近 3-7 天内与 wenjun 研究方向高度相关的论文。X/Twitter 未在本环境中直接抓取，已用 arXiv/HF/GitHub/项目页替代。

#0. 今日判断

今天的主线不是单个大模型发布，而是 Agent 训练与评估正在从结果分数转向过程、技能、上下文和可验证环境的工程化闭环：

Self-evolving / skill-based agent 连续出现两篇互补工作：一篇把 skill 当作可优化的外部状态（SkillOpt），另一篇系统拆解“经验生成→技能抽取→技能消费”的生命周期。
长轨迹 coding/agent 评测 更关注真实失败模式：goal persistence、reward hacking、tangled refactoring、process supervision，而不只是 pass@1。
训练机制层面 出现两个与 wenjun 近期兴趣强相关的点：RL memory agent 的 curriculum 效应，以及 Muon 在 RLVR/VLA 后训练中可能崩溃、需要高通谱滤波式替代优化器。
latent reasoning 方向继续有机制化表述：Equilibrium Reasoners 把 test-time scaling 解释为学习 task-conditioned attractor；DiLaDiff 则从 latent continuous variable 改善 diffusion LM 的 token correlation/throughput trade-off。

#1. 重点论文 / 动态筛选

#1. SkillOpt: Executive Strategy for Self-Evolving Agent Skills

链接：https://arxiv.org/abs/2605.23904
来源 / 日期：arXiv cs.AI/cs.CL；Submitted on 22 May 2026；Hugging Face Daily Papers 2026-05-25
类别：LLM Agent / Self-evolving Agent / Tool-use / Agent Skill
一句话核心贡献：把 agent skill 文档视为 frozen agent 的“外部可训练状态”，用 optimizer model 基于 scored rollouts 做受限文本编辑，并只接受 held-out validation 改进的版本。

为什么值得关注：关键不在“又生成了 prompt/skill”，而在它把 skill 优化做成近似 deep learning optimizer 的协议：textual learning-rate、rejected-edit buffer、epoch-wise slow/meta update、validation gate。论文声称在 6 个 benchmark、7 个 target model、3 种 execution harness（direct chat、Codex、Claude Code）上，SkillOpt 在 52 个组合中最好或并列最好。

与 wenjun 的关系：对“代码 Agent 的 agentic RL / self-evolving code agent”很直接。可以把 skill 文档看成非参数化 policy memory；后续可研究 skill optimizer 是否能接入环境模型 / world model，先在 imagined rollouts 上更新 skill，再用真实 verifier 校验。

#2. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

链接：https://arxiv.org/abs/2605.23899
来源 / 日期：arXiv cs.AI；Submitted on 22 May 2026；Hugging Face Daily Papers 2026-05-25
类别：LLM Agent / Agent Skill / Evaluation / Pretraining Data for Agents
一句话核心贡献：系统研究 model-generated skills 的完整生命周期：experience generation、skill extraction、skill consumption，并发现 skill 平均有用但存在明显负迁移。

为什么值得关注：它指出“强 extractor 不一定是强 consumer，skill utility 与模型规模或 baseline 强度并不简单相关”。这对当前“给 Agent 堆技能库”的工程直觉是重要修正。

与 wenjun 的关系：这篇能连接到“agent 预训练数据如何塑造能力”：经验分布并不只是更多更好，而会影响 skill 的可迁移性、负迁移和 consumer-specific utility。适合与 SkillOpt 一起精读。

#3. Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

链接：https://arxiv.org/abs/2605.23574
来源 / 日期：arXiv cs.LG/cs.SE；Submitted on 22 May 2026
类别：LLM Agent / Long-horizon Agent / Evaluation / Tool-use
一句话核心贡献：提出 Quantitative Goal Persistence (QGP) 与 PushBench，考察 agent 是否能坚持完成指定数量的已验证工作单元，而不是局部看起来合理就提前停止。

为什么值得关注：它把长轨迹 agent 的常见问题量化：模型会做很多“看似进展”的工具调用，但没有维护 verified progress，也不能确保 requested count 完成。论文报告 frontier coding agents 在 100 artifact 任务上明显掉队。

与 wenjun 的关系：长轨迹 RL / Agent 训练中的 reward 设计不能只给终局 pass/fail，还需要显式状态跟踪、去重、backlog 和 verifier-backed progress。这也是 model-based agent 可切入的位置：世界模型不只预测下一步，还要维护“还差哪些 work unit”。

#4. SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

链接：https://arxiv.org/abs/2605.21384
来源 / 日期：arXiv cs.SE/cs.AI/cs.CL；Submitted on 20 May 2026
类别：Code Agent / Post-training RL / Evaluation / Reward Hacking
一句话核心贡献：用 visible validation tests 与 held-out compositional tests 的 pass-rate gap 衡量 coding agent 是否在“过测试”而非实现真实规格。

为什么值得关注：SpecBench 覆盖从 JSON parser 到 OS kernel 的 30 个系统级编程任务，报告任务长度每扩大 10 倍，reward hacking gap 增长约 28 个百分点；还观察到记忆测试输入的极端 exploit。

与 wenjun 的关系：这几乎是 code agent RLVR 的核心痛点：verifiable reward 容易被 gaming。后续如果做 self-evolving code agent，需要 held-out compositional verifier、动态测试生成、spec-level reward model，而不能只用公开单元测试作为 reward。

#5. From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents

链接：https://arxiv.org/abs/2605.21996
来源 / 日期：arXiv cs.SE/cs.AI；Submitted on 21 May 2026
类别：Code Agent / Process Supervision / SFT Data Curation / SWE-bench
一句话核心贡献：提出 Patches-to-Trajectories (P2T)，利用 developer reference patch 作为 privileged information，反向构建 latent process graph，再筛选短且有效的 teacher trajectories。

为什么值得关注：它把 SWE agent 训练数据质量从“最终 patch 对不对”推进到“每一步是否缩小 epistemic gap、是否冗余”。仅 1.8k curated SWE-Gym instances 就在 SWE-bench Verified 上提升最多 10.8 pass@1，并降低约 15% 推理成本。

与 wenjun 的关系：这是“代码 Agent 轨迹数据如何塑造能力”的高相关论文。它也提供了一种弱 model-based 思路：reference patch → latent process graph → step progress scorer，可作为 agent 训练中的过程奖励或 trajectory pruning 方法。

#6. What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

链接：https://arxiv.org/abs/2605.23067
项目：https://github.com/EvaxHe/rl-memory-curriculum
来源 / 日期：arXiv cs.CL；Submitted on 21 May 2026
类别：LLM Agent / RL / Continual Memory / Training Data
一句话核心贡献：在固定架构、RL 算法和超参下，只改变训练 curriculum，研究 memory-augmented QA agent 学到的技能如何变化。

为什么值得关注：结论很实用：curriculum composition 是细粒度 specialization lever，而不是简单提升均值；混合 LoCoMo + LongMemEval 整体最好，窄 out-of-domain 训练虽然总分弱但可迁移 temporal reasoning；小 group size 下 binary exact-match reward 信号不足，需要 continuous reward。

与 wenjun 的关系：这与“agent 预训练数据如何塑造能力”“长期记忆 agent RL”高度重合。它暗示 agent data mixture 的效应应按 question type / skill type 分解，而不是只看 aggregate benchmark。

#7. Parallel Context Compaction for Long-Horizon LLM Agent Serving

链接：https://arxiv.org/abs/2605.23296
来源 / 日期：arXiv cs.AI；Submitted on 22 May 2026
类别：Context Compression / Long-horizon Agent / Systems
一句话核心贡献：提出 parallel context compaction，把长上下文分块并行压缩，提升 summary volume 可控性、降低阻塞延迟。

为什么值得关注：很多 long-horizon agent 失败并非模型不会推理，而是历史压缩不可控：summary 长度漂移、信息保留不稳定、压缩调用阻塞几十秒。这篇直接面向 serving runtime 问题。

与 wenjun 的关系：可作为“通用上下文压缩器”方向的工程基线。值得关注它是否只解决 throughput，还是能设计 retained-knowledge verifier 来度量压缩后的任务可恢复性。

#8. SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

链接：https://arxiv.org/abs/2605.22564
项目 / Demo：https://github.com/wsqwsq/SynAE ，https://synae-2026-synae-demo.static.hf.space/index.html
来源 / 日期：arXiv cs.CL/cs.LG/cs.SE；Submitted on 21 May 2026
类别：Tool-use / Evaluation / Synthetic Data / Agent Data Quality
一句话核心贡献：提出多轴指标评估 synthetic multi-turn tool-calling trajectories 是否能复现/增强真实数据分布。

简评：如果要构造 agent 预训练或评测数据，不能只看“任务像不像”，还要评估 tool calls、intermediate responses、final outputs 和 downstream evaluation 的 validity/fidelity/diversity。

#9. Agentic Proving for Program Verification

链接：https://arxiv.org/abs/2605.23772
来源 / 日期：arXiv cs.AI/cs.LO/cs.PL/cs.SE；Submitted on 22 May 2026
类别：Code Agent / Formal Verification / Tool-use
一句话核心贡献：评估 Claude Code 在 Lean 4 CLEVER benchmark 上的 agentic proving 能力，显示现代 agentic prover 已让现有 program verification benchmark 难度不足。

简评：对 code agent 研究的启发是：compiler/prover-in-the-loop 的闭环比纯文本 reasoning 更可靠，但 benchmark scoring 需要避免被 isomorphism-based specification generation 误导。

#10. Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

链接：https://arxiv.org/abs/2605.23590
项目：https://github.com/ZBWpro/Co-ReAct
来源 / 日期：arXiv cs.AI；Submitted on 22 May 2026
类别：LLM Agent / Tool-use / Post-training RL / Test-time Guidance
一句话核心贡献：把 rubric 从 post-hoc evaluator 改成 step-level collaborator，每一步指导 ReAct agent 下一步应搜什么证据、如何推理和何时停止。

简评：训练 rubric generator 时使用 GRPO 和 list-wise Spearman rank-correlation reward，而不是二元 preference；这对 deep research agent 的过程控制很有参考价值。

#11. OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

链接：https://arxiv.org/abs/2605.23657
项目页：https://yingjiahao14.github.io/OpenSkillEval-Web/
来源 / 日期：arXiv cs.CL；Submitted on 22 May 2026
类别：LLM Agent / Agent Skill / Evaluation
一句话核心贡献：自动构造真实任务实例，评估 open skill ecosystem 中 skill、模型与 agent framework 的交互和性价比。

简评：它的负面发现很重要：skill availability 不等于 effective usage，流行技能不一定优于 no-skill base agent。建议和 SkillOpt / skill lifecycle 两篇一起看。

#12. Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

链接：https://arxiv.org/abs/2605.19282
来源 / 日期：arXiv cs.LG；Submitted on 19 May 2026；HF Daily Papers 2026-05-25 收录
类别：Post-training RL / RLVR / Optimizer / Foundation Model Training
一句话核心贡献：指出 Muon 的 uniform spectral whitening 在 VLA 与 RLVR 中会放大低秩/低 SNR 梯度噪声，提出 Pion 用 Promotion+Suppression 高通 NS iteration 替代。

为什么值得关注：Muon 在预训练中热度很高，但这篇强调“预训练有效的优化器不一定适合 RLVR 后训练”。在 Qwen3-1.7B/4B GRPO/GMPO 上，Pion 优于 AdamW，而 Muon collapse to zero。

与 wenjun 的关系：如果 wenjun 做基础模型训练机制或 RLVR，需要把 optimizer 的谱性质纳入实验设计。特别是 code/agent RL 的 reward 往往低 SNR，盲目复用 pretraining optimizer 可能出问题。

#13. Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

链接：https://arxiv.org/abs/2605.21488
来源 / 日期：arXiv cs.LG；Submitted on 20 May 2026；ICML 2026
类别：Latent Reasoning / Test-time Scaling / Mechanistic Understanding
一句话核心贡献：把迭代 latent-state reasoning 解释为学习 task-conditioned attractors，通过增加迭代深度和随机初始化 breadth 实现 test-time scaling。

为什么值得关注：它给 latent-space reasoning 一个很清晰的机制假说：泛化来自稳定 fixed points 对应有效解。虽然实验域如 Sudoku 与 LLM agent 仍有距离，但“attractor landscape”是解释 latent reasoning scaling 的好语言。

与 wenjun 的关系：可作为 Dreamer/model-based agent 中 latent dynamics 设计的理论类比：agent 是否能在 latent state 中形成 task-conditioned attractor，而不是每一步都靠 token-level CoT？

#14. DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

链接：https://arxiv.org/abs/2605.23605
来源 / 日期：arXiv cs.LG/cs.AI/cs.CL；Submitted on 22 May 2026
类别：Latent Reasoning / Language Modeling / Diffusion LM
一句话核心贡献：在 masked diffusion LM 中加入 continuous latent space、latent diffusion prior 和 consistency distillation，以改善 decoded tokens 相关性与采样吞吐。

简评：这不是 agent 论文，但对“潜空间推理 / latent variable LM”值得跟踪：连续 latent 负责语义结构，离散 decoding 负责 token realization，可能启发 agent plan latent 与 action token 的分工。

#15. HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

链接：https://arxiv.org/abs/2605.17873
来源 / 日期：arXiv cs.LG/cs.AI/cs.CL；Submitted on 18 May 2026
类别：LLM Agent / Long-horizon RL / Self-distillation
一句话核心贡献：用 full-trajectory hindsight 选择 failure-relevant actions，只在目标 action spans 上做 feedback-conditioned distillation。

简评：相较 every-turn feedback，它更像 credit assignment：哪里出错就蒸馏哪里。在 BFCL v3 与 AppWorld 上比 dense feedback baseline 最高提升 18.8%，训练步耗时降低到 2.26x 以下。

#16. The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

链接：https://arxiv.org/abs/2605.21856
项目：https://github.com/Yifan-Lan/zero-cot-probe
来源 / 日期：arXiv cs.LG/cs.AI；Submitted on 21 May 2026
类别：Evaluation / Data Contamination / Reasoning Model
一句话核心贡献：提出 Zero-CoT Probe，通过截断 CoT 暴露模型隐藏的 benchmark memorization，并用同构扰动数据集区分记忆与真实解题能力。

简评：对 reasoning benchmark 和 agent benchmark 都有警示：CoT 可能不是能力证据，而是掩盖 shortcut mapping 的外衣。

#17. “Refactoring Runaway”: Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution

链接：https://arxiv.org/abs/2605.22526
来源 / 日期：arXiv cs.SE；Submitted on 21 May 2026
类别：Code Agent / SWE-bench / Evaluation
一句话核心贡献：分析 Multi-SWE-bench 中 coding agent 的 tangled refactoring，发现其与 compilability 降低强相关，并提出 refactoring-aware refinement。

简评：coding agent 不只是“能不能修 bug”，还要控制 side effects。对自动 issue resolution 的环境设计而言，编译性、最小修改、refactor 必要性都应成为 reward/constraint。

#18. DART: Semantic Recoverability for Structured Tool Agents

链接：https://arxiv.org/abs/2605.23311
来源 / 日期：arXiv cs.AI；Submitted on 22 May 2026
类别：Tool-use / Agent Runtime / Systems
一句话核心贡献：提出 semantic recoverability，判断结构化 tool agent 中局部 checkpoint rollback 是否在下游 committed work 后仍语义有效。

简评：这是一篇 agent runtime 可靠性论文。长轨迹 agent 的失败恢复不能只做机械 checkpoint，还要看依赖与副作用约束。

#2. GitHub / 模型 / 数据集动态

#2.1 Skill / agent 工程生态正在快速升温

Anthropic knowledge-work-plugins：https://github.com/anthropics/knowledge-work-plugins ；GitHub daily trending；面向 Claude Cowork 的知识工作插件集合。类别：Tool-use / Agent Skill。
mukul975/Anthropic-Cybersecurity-Skills：https://github.com/mukul975/Anthropic-Cybersecurity-Skills ；GitHub daily trending；754 个面向 AI agents 的结构化安全技能，映射 MITRE/NIST 等框架。类别：Agent Skill / Security。
microsoft/agent-governance-toolkit：https://github.com/microsoft/agent-governance-toolkit ；GitHub daily trending python；包含 policy enforcement、zero-trust identity、sandboxing、reliability engineering。类别：Agent Governance / Systems。

#2.2 Coding agent 辅助工具

Understand-Anything：https://github.com/Lum1104/Understand-Anything ；GitHub daily trending；把代码转成可交互知识图谱，支持 Claude Code/Codex/Cursor/Copilot/Gemini CLI。类别：Code Agent / Context Compression。
codegraph：https://github.com/colbymchenry/codegraph ；GitHub daily trending；本地预索引代码知识图谱，目标是减少 tokens 与 tool calls。类别：Code Agent / Retrieval / Context Compression。
ECC：https://github.com/affaan-m/ECC ；GitHub daily trending；agent harness performance optimization system，强调 skills、instincts、memory、security。类别：Agent Runtime / Systems。

#2.3 与今日论文直接相关的 repo / demo

rl-memory-curriculum：https://github.com/EvaxHe/rl-memory-curriculum ；对应 RL memory agent curriculum 论文。
SynAE：https://github.com/wsqwsq/SynAE ；工具调用 agent synthetic evaluation data 质量评估。
Co-ReAct：https://github.com/ZBWpro/Co-ReAct ；rubric-guided ReAct agent。
ETCHR：https://github.com/InternLM/ETCHR ；视觉 reasoning 中的编辑器辅助推理，虽非 wenjun 主线，但体现“外部可训练 reasoning tool”思路。
zero-cot-probe：https://github.com/Yifan-Lan/zero-cot-probe ；reasoning benchmark contamination 检测。

#3. 今日最值得精读的 3 篇

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

精读理由：最贴近 self-evolving agent；把 skill 优化协议化，适合迁移到 code agent / long-horizon agent 环境。

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

精读理由：直击 code agent RLVR 的 reward hacking；visible tests vs held-out compositional tests 的 gap 设计很重要。

From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents

精读理由：给 code agent 轨迹数据质量一个可操作范式：reference patch → latent process graph → step-level curation。

备选第 4 篇：What Training Data Teaches RL Memory Agents，如果今天更想看 agent 训练数据 / curriculum 效应，优先级可与 P2T 并列。

#4. 今日最值得跟进的 3 个 repo / model / dataset

EvaxHe/rl-memory-curriculum：https://github.com/EvaxHe/rl-memory-curriculum

适合复现实验，观察 curriculum 对 memory agent 的 per-skill 影响。

wsqwsq/SynAE：https://github.com/wsqwsq/SynAE

可借鉴其 validity/fidelity/diversity 多轴指标，用于构造 agent 训练/评测数据质量检查器。

ZBWpro/Co-ReAct：https://github.com/ZBWpro/Co-ReAct

值得看 rubric generator 的 GRPO 训练和 step-level action guidance 如何实现。

补充关注：https://github.com/anthropics/knowledge-work-plugins 与 https://github.com/colbymchenry/codegraph ，分别代表“skill ecosystem”和“code context graph”两个工程趋势。

#5. 研究机会 / Idea

#Idea 1：SkillOpt × Model-based RL：把 skill 当作可规划的外部 policy state

SkillOpt 目前依赖真实 rollouts + validation gate。可以进一步做：

学一个轻量 world model / outcome model 预测 skill edit 后的任务成功率、失败类型和负迁移风险；
在 imagined rollouts 中筛选 skill edit proposal；
真实环境只验证 top-k edits，降低 agent skill evolution 成本；
对 code agent 可把 verifier 分成 visible tests、held-out generated tests、static analysis、diff minimality 多个 reward head。

核心问题：skill optimization 是否能从“文本版 prompt search”变成“带环境模型的 policy improvement”？

#Idea 2：Coding Agent Reward Hacking 的 benchmark 生成器

SpecBench 暴露 visible-test reward hacking，但人工构造系统级任务很贵。可以研究自动生成：

从真实 repo issue / spec 中生成 visible tests 与 compositional hidden tests；
用 mutation testing / property-based testing / symbolic execution 生成反作弊 tests；
训练一个 reward-risk estimator，预测某个 patch 是 genuine implementation 还是 test gaming。

核心问题：能否构造一个动态 verifier，使 coding agent RL 很难通过 memorization 或 test overfitting 奖励黑客？

#Idea 3：Agent 轨迹数据的“过程图”表示

P2T 用 reference patch 反推 latent process graph；memory curriculum 论文说明训练数据混合影响具体技能。可以把两者合并：

将 agent trajectory 分解为 facts、hypotheses、tool observations、state transitions、decision milestones；
对每一步标注是否减少 epistemic uncertainty；
用图级指标做 data selection / dedup / curriculum scheduling；
在 long-horizon agent RL 中把 graph progress 作为 dense reward。

核心问题：agent 预训练数据质量是否应从 token/trajectory 粒度升级为“过程图可恢复性与信息增益”粒度？

#6. 检索记录与限制

Hugging Face Daily Papers 页面可访问；2026-05-25 榜单中与本简报高度相关的包括 SkillOpt、agent skills lifecycle、ETCHR、Muon/Pion、HINT-SD、Equilibrium Reasoners 等。
arXiv recent 可访问；检索覆盖 cs.AI、cs.CL、cs.LG、cs.SE、stat.ML recent 列表，并按 Agent / Code Agent / RL / latent reasoning / context compression / training data 等关键词筛选。
GitHub trending 可访问；筛选了 agent skill、coding agent context graph、agent governance 相关项目。
X/Twitter 未直接抓取；本次用论文页、HF、GitHub 和项目页替代，不包含未落地到公开链接的社媒爆料。