Daily Research 2026-05-24 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

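#2026-05-24 AI/LLM Briefing: Latest Papers and Research Hotspots

Retrieved: 2026-05-24 08:15 CST. Coverage is mainly papers newly submitted or updated on arXiv / Hugging Face Papers between 2026-05-20 and 2026-05-21; because May 22-23 ran into the weekend and arXiv had few new listings, this issue extends the window to the most recent 3-4 days, as requested.
Focus areas: LLM Agent, code intelligence, agentic RL/RLVR, latent-space reasoning, long context / context compression, agent trajectory data, and foundation-model training mechanisms.
Access notes: Hugging Face Papers and arXiv were reachable; X/Twitter was not used as a reliable source, so this issue relies on arXiv, HF Papers, and public GitHub/HF project pages instead.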

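#0. Overall Take for Today

The main thread of the past two days is clear. Agent research is moving from "calling tools better" toward "how to learn from trajectories, compress them, evaluate them, and rewrite the agent itself". RLVR research is concentrating on dissecting credit assignment, online rollout cost, and the geometry of training trajectories. Latent reasoning keeps shifting from explicit CoT toward iterable latent-space dynamical systems.

For wenjun's directions, what deserves the most attention is not any single benchmark score but three research paradigms that are converging:

Agent trajectories as pretraining/post-training data: ACC, TerminalWorld, Insights Generator, and Agentic CLEAR all turn real agent run logs, tool calls, terminal recordings, and failed trajectories into training or diagnostic assets.
From outcome reward to decomposable credit: SCRL, DelTA, LamPO, OWPO, and G2D all try to answer the core RLVR question of how a final right/wrong signal can be stably distributed across tokens, subproblems, or candidate solutions.
Latent-space reasoning as a dynamical system: Equilibrium Reasoners, GRAM, and LatentOmni together suggest that reasoning need not rely only on emitting more text; test-time scaling can also come from iterating, sampling, and converging latent states and from their attractor structure.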

#1. Today's Top 5

#1.1 MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

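Category: LLM Agent / Self-Evolving Agent / Tool-use / Systems
Source and date: arXiv, 2026-05-21
Link: https://arxiv.org/abs/2605.22794
Core contribution in one sentence: proposes letting an autonomous agent rewrite itself directly at the source-code level, rather than only modifying prompts, skill files, memory schemas, or workflow graphs.

Why it matters:

Most existing "self-evolving agents" confine evolution to text-mutable layers: prompts, skill libraries, memory, and workflow graphs. MOSS's key argument is that many agent failures actually stem from structural problems in the harness/framework code, such as routing, hook ordering, state invariants, and dispatch logic. These problems do not live in the prompt layer, so text-level evolution cannot reach them. The paper argues that source-level adaptation is a more general medium for evolution, because the code layer is Turing-complete, takes effect deterministically, and is less prone to erosion by long-context drift.

Relation to wenjun's research directions:

This is very close to "eliciting self-evolving intelligence through environment design" and "agentic RL for code agents / self-evolving code agents". If the agent harness is viewed as part of the environment interface, MOSS effectively widens the object of policy learning from "model output text" to the joint system of "model + harness + tool protocol". Open questions worth pursuing: can source-code rewriting be brought under verifiable reward? Can self-evolution be constrained, as in code RL, by tests, replayed trajectories, and regression suites?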

#1.2 ACC: Compiling Agent Trajectories for Long-Context Training

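Category: LLM Agent / Pretraining Data / Long Context / Context Compression
Source and date: arXiv + Hugging Face Papers, 2026-05-21
Link: https://arxiv.org/abs/2605.21850 , HF: https://huggingface.co/papers/2605.21850
Core contribution in one sentence: "compiles" the multi-turn tool-call trajectories that agents produce while solving tasks into samples suited to long-context training, so that the model learns to locate evidence scattered across a long trajectory.

Why it matters:

Long-context training usually depends on expensive long-document construction or manual synthesis. ACC's view is more agent-native: agents naturally generate large numbers of trajectories on real tasks, mixing instructions, tool calls, environment observations, intermediate errors, and final answers. The paper observes that the evidence needed to answer the original question is often scattered across multiple rounds of observations, so these trajectories can be converted into long-context reasoning training data.

Relation to wenjun's research directions:

This maps directly onto "how agent pretraining data shapes capability". For future work on foundation-model training mechanisms for LLM agents, ACC offers a good entry point: rather than feeding generic long text, feed long trajectories that carry tool-interaction structure. Further questions: which trajectory segments actually contribute capability? Are failed trajectories more useful than successful ones? Can trajectory compilation be combined with context pruning / latent reasoning?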

#1.3 From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

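Category: Post-training RL / RLVR / Credit Assignment / Reasoning
Source and date: arXiv + Hugging Face Papers, 2026-05-21
Link: https://arxiv.org/abs/2605.22074 , HF: https://huggingface.co/papers/2605.22074
Core contribution in one sentence: proposes SCRL, which splits a reference reasoning chain into verifiable subproblems and uses curriculum RL to ease sparse rewards and credit assignment for RLVR on hard problems.

Why it matters:

Outcome-only RLVR is inefficient on hard problems: correct rollouts are rare, and the partial progress in failed attempts goes unused. SCRL derives verifiable subproblems from the reference reasoning chain, so that the model first gets dense feedback on shorter, more controllable local reasoning units and then advances step by step to the full problem. In effect, "long-chain reasoning" is recast as a "curriculum of verifiable subtasks".

Relation to wenjun's research directions:

This matters for long-trajectory agent RL. Agent tasks also often have only a final success/failure signal, even though subgoals such as retrieval, localization, editing, and testing may have been completed along the way. The SCRL idea could transfer as follows: automatically extract verifiable subgoals from successful/failed agent traces, then run curriculum RL or offline preference construction.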

#1.4 Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

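Category: Latent Reasoning / Test-time Scaling / Foundation Mechanism
Source and date: arXiv, 2026-05-20
Link: https://arxiv.org/abs/2605.21488
Core contribution in one sentence: interprets iterative latent-state reasoning as learning task-conditioned attractors; reasoning ability comes from a latent dynamical system converging to a solution-aligned fixed point.

Why it matters:

This is not an ordinary "think a few more steps" CoT paper. It places test-time compute inside latent dynamics, scaling reasoning through greater iteration depth or multiple randomly initialized latent trajectories. The authors propose that generalizable reasoning arises from task-conditioned attractors, with stable fixed points corresponding to valid solutions; the gains from test-time scaling are tied to how far the latent state converges toward attractors in the solution space.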
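
As a toy illustration of the attractor picture described above (not the paper's model), the sketch below uses a hand-written contraction map whose fixed point is a chosen `target` vector. The names `step`, `iterate`, and `target`, the update rate, and the 2-D state are all assumptions made for illustration. It shows only the qualitative point: from several random initial latent states, more iterations bring every trajectory closer to the same fixed point.

```python
import random


def step(z, target, rate=0.5):
    """One latent update: a contraction that pulls z toward `target`."""
    return [zi + rate * (ti - zi) for zi, ti in zip(z, target)]


def distance(a, b):
    """Euclidean distance between two states."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def iterate(z0, target, depth):
    """Apply `step` for `depth` iterations starting from z0."""
    z = z0
    for _ in range(depth):
        z = step(z, target)
    return z


rng = random.Random(0)
target = [1.0, -2.0]  # stands in for a solution-aligned fixed point (assumption)
starts = [[rng.uniform(-5, 5) for _ in target] for _ in range(4)]

for depth in (1, 4, 16):
    # Worst-case distance to the fixed point over all random initial states.
    worst = max(distance(iterate(z0, target, depth), target) for z0 in starts)
    print(depth, worst)
```

In this linear toy the error halves on each iteration, so depth trades directly against distance to the fixed point; a learned latent dynamical system would offer no such guarantee.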

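Relation to wenjun's research directions:

This lands squarely on latent-space reasoning. It is also suggestive for LLM agents / model-based RL: if an agent's belief/world state is an iteratively converging latent state rather than a text history, then planning, memory, and context compression can all be recast as latent-dynamics problems.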

#1.5 TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

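Category: LLM Agent / Evaluation / Tool-use / Terminal Agent
Source and date: arXiv + Hugging Face Papers, 2026-05-21
Link: https://arxiv.org/abs/2605.22535 , HF: https://huggingface.co/papers/2605.22535 , GitHub: https://github.com/EuniAI/TerminalWorld
Core contribution in one sentence: automatically reverse-constructs 1,530 high-fidelity terminal tasks from 80,870 real terminal recordings, and releases a human-reviewed TerminalWorld-Verified subset.

Why it matters:

TerminalWorld's data source is interesting: the tasks are not hand-written by experts but reverse-engineered from real terminal recordings. The full benchmark covers 18 categories of real terminal workflows and 1,280 unique commands, and some tasks exceed 50 steps. The paper reports that the best pass rate of current frontier models and agents on the Verified subset is only 62.5%, and that results correlate weakly with expert-constructed benchmarks such as Terminal-Bench, which suggests it captures a different kind of "real terminal capability".

Relation to wenjun's research directions:

Data of this kind is well suited to studying long-trajectory RL, trajectory diagnostics, and agent pretraining data for code/terminal agents. It is also a reminder that real-workflow data and expert benchmarks may have different capability distributions; doing RL only on the latter risks overfitting to the structure of hand-crafted tasks.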

#2. Other Papers and Developments Worth Watching

#2.1 Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

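Category: LLM Agent / Planning / Test-time Compute
Source and date: arXiv + Hugging Face Papers, 2026-05-21
Link: https://arxiv.org/abs/2605.22138 , HF: https://huggingface.co/papers/2605.22138
Repo/Model: https://github.com/sailing-lab/sr2am-self-regulated-planning , https://huggingface.co/sailing-lab/SR2AM-v0.1-8B
Core contribution in one sentence: studies when and how an agent should plan, proposing self-regulated simulative planning to avoid uncontrolled growth in reasoning tokens.

Comment: Rather than hoping a reactive policy learns to plan on its own, explicitly control whether planning happens, the plan structure, and the planning horizon. This is key to agent test-time scaling: not unbounded CoT, but self-regulated simulative planning.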

#2.2 DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

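Category: Post-training RL / RLVR / Credit Assignment
Source and date: arXiv + Hugging Face Papers, 2026-05-20
Link: https://arxiv.org/abs/2605.21467 , HF: https://huggingface.co/papers/2605.21467 , GitHub: https://github.com/RUCBM/DelTA
Core contribution in one sentence: explains RLVR updates from a discriminator perspective, analyzing how response-level reward translates into token-level probability changes.

Comment: Worth reading as a paper for understanding the internal mechanics of RLVR. Its concern is not building yet another benchmark but asking "how does the final reward change the learning direction of each token". This matters for explaining credit assignment in agentic RL.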

#2.3 One-Way Policy Optimization for Self-Evolving LLMs

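Category: Post-training RL / RLVR / Self-Evolving LLM
Source and date: arXiv, 2026-05-21
Link: https://arxiv.org/abs/2605.22156
Core contribution in one sentence: proposes OWPO, which decouples the optimization direction determined by the verifier from the update magnitude constrained by the reference policy, so that the reference policy does not suppress improvements that go beyond it.

Comment: Many RLVR methods use the reference policy as a token-level constraint, but such a constraint may penalize every deviation, including correct, novel ones. OWPO's asymmetric reweighting is highly relevant to the "self-evolution" theme.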

#2.4 You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

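Category: Post-training RL / Training Dynamics / Efficient RL
Source and date: arXiv, 2026-05-20
Link: https://arxiv.org/abs/2605.21468
Core contribution in one sentence: finds that RLVR weight trajectories are highly low-rank and predictable, and proposes RELEX, which estimates a rank-1 subspace from a short window and then extrapolates later checkpoints.

Comment: Worth reading from the foundation-model training-mechanism angle: the capability gains from RLVR may occur along very low-dimensional parameter directions. If the results hold up, this will affect how we understand "what post-training actually changes".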
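
To make the "estimate a rank-1 subspace, then extrapolate" idea concrete, here is a minimal pure-Python sketch on a synthetic trajectory. It is not RELEX itself: the power iteration, the linear-in-step fit of the projection coefficient, and every name (`top_direction`, `fit_line`, `extrapolate`) are assumptions chosen for illustration, and the toy weights are exactly rank-1 by construction.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_direction(deltas, iters=100):
    """Power iteration for the dominant right singular vector of the rows in `deltas`."""
    dim = len(deltas[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Apply D^T D to v: project each row onto v, then recombine the rows.
        coeffs = [dot(d, v) for d in deltas]
        w = [sum(c * d[j] for c, d in zip(coeffs, deltas)) for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


def fit_line(ts, ys):
    """Ordinary least-squares fit ys ~ slope * ts + intercept."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    slope = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sum((t - mt) ** 2 for t in ts)
    return slope, my - slope * mt


def extrapolate(w0, checkpoints, steps, future_step):
    """Predict weights at `future_step` from a short window of checkpoints."""
    deltas = [[w - b for w, b in zip(ckpt, w0)] for ckpt in checkpoints]
    v = top_direction(deltas)
    coeffs = [dot(d, v) for d in deltas]          # position along the rank-1 direction
    slope, intercept = fit_line(steps, coeffs)    # assumption: coefficient grows linearly in step
    c = slope * future_step + intercept
    return [b + c * vi for b, vi in zip(w0, v)]


# Synthetic, exactly rank-1 trajectory: w_t = w0 + 0.1 * t * u.
w0 = [1.0, 2.0, 3.0]
u = [0.6, -0.8, 0.0]
steps = [1, 2, 3, 4, 5]
checkpoints = [[b + 0.1 * t * ui for b, ui in zip(w0, u)] for t in steps]
predicted = extrapolate(w0, checkpoints, steps, future_step=20)
print([round(x, 6) for x in predicted])  # [2.2, 0.4, 3.0]
```

The sign of the recovered direction is arbitrary, but it cancels because the coefficients are measured along the same vector. Real weight trajectories are only approximately low-rank, so the quality of such an extrapolation would depend on how much energy lies outside the leading direction.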

#2.5 How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

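Category: Post-training RL / Offline Preference Optimization / RLVR Efficiency
Source and date: arXiv, 2026-05-20
Link: https://arxiv.org/abs/2605.21266
Core contribution in one sentence: proposes G2D: a small amount of GRPO warm-up generates more informative preference data, which is then used for offline DPO training, reducing online rollout cost.

Comment: Highly relevant to the practical training cost of agent RL. Online rollouts over long trajectories are very expensive; if a small amount of online exploration is enough to build high-quality offline preference data, that is a reusable training paradigm.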
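
One way to picture the "rollouts to offline preference data" step is the sketch below, which pairs verifier-accepted rollouts with rejected ones for the same prompt. This is an assumed construction for illustration, not G2D's actual selection rule; the function name, the binary rewards, and the sample data are all hypothetical.

```python
from itertools import product


def build_preference_pairs(rollouts):
    """Turn verifier-scored rollouts into (prompt, chosen, rejected) triples.

    `rollouts` maps each prompt to a list of (response, reward) tuples,
    where reward is 1 if the verifier accepted the response and 0 otherwise.
    """
    pairs = []
    for prompt, samples in rollouts.items():
        winners = [resp for resp, reward in samples if reward == 1]
        losers = [resp for resp, reward in samples if reward == 0]
        # Prompts where every rollout agrees carry no contrast, so they yield no pairs.
        for chosen, rejected in product(winners, losers):
            pairs.append((prompt, chosen, rejected))
    return pairs


# Hypothetical rollouts: q1 has mixed outcomes, q2 is solved every time.
rollouts = {
    "q1": [("x = 2", 1), ("x = 3", 0), ("x = 5", 0)],
    "q2": [("y = 1", 1), ("y = 1, checked", 1)],
}
pairs = build_preference_pairs(rollouts)
print(len(pairs))  # 2
```

Under this construction, only prompts with mixed outcomes contribute pairs, which is one simple reading of "informative" rollouts: a policy that is always right or always wrong on a prompt gives DPO nothing to contrast.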

#2.6 LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

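Category: Post-training RL / Reasoning / Pairwise Advantage
Source and date: arXiv, 2026-05-20
Link: https://arxiv.org/abs/2605.21235
Core contribution in one sentence: replaces the within-group scalar advantage with a pairwise decomposed advantage, preserving fine-grained relative information between candidate responses.

Comment: Closely related to the GRPO family and worth reading alongside SCRL and DelTA. One open direction: can token-level, pairwise-level, and subproblem-level credit assignment be unified?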

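#2.7 LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Category: Latent Reasoning / Multimodal Reasoning
Source and date: arXiv + Hugging Face Papers, 2026-05-21
Link: https://arxiv.org/abs/2605.22012 , HF: https://huggingface.co/papers/2605.22012 , GitHub: https://github.com/yfanDai/LatentOmni
Core contribution in one sentence: argues that text CoT over-discretizes continuous audio-visual signals, and proposes unified audio-visual latent reasoning to preserve fine-grained temporal evidence.

Comment: Although this is a multimodal paper, its argument for latent reasoning is representative: when intermediate reasoning must depend on continuous signals, forcing it into natural language loses grounding.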

#2.8 Generative Recursive Reasoning

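Category: Latent Reasoning / Recursive Reasoning / Test-time Scaling
Source and date: arXiv, 2026-05-19, updated 2026-05-20
Link: https://arxiv.org/abs/2605.19376
Core contribution in one sentence: proposes GRAM, which extends deterministic recursive reasoning to stochastic latent trajectories, enabling test-time scaling through recursion depth and parallel trajectory sampling.

Comment: Read together with Equilibrium Reasoners: one emphasizes attractors/fixed points, the other probabilistic multi-trajectory sampling. Both move away from the paradigm of "more output text = more compute".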

#2.9 Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

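Category: Code Agent / Context Compression / Latent Reasoning
Source and date: arXiv, 2026-05-14
Link: https://arxiv.org/abs/2605.15315
Core contribution in one sentence: proposes LaMR, which splits the criteria for retaining code context into two rubrics, semantic evidence and dependency support, and uses latent multi-rubric reasoning for coding-agent context pruning.

Comment: Although it is older than the 48-hour window, it is very close to wenjun's interest in code intelligence. It splits "which code should be read" into semantic evidence and dependency support, which makes it a candidate design for a general context compressor for repo-level agents.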

#2.10 SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

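Category: Code Agent / Evaluation / Reward Hacking / Long-Horizon Agent
Source and date: arXiv + Hugging Face Papers, 2026-05-20
Link: https://arxiv.org/abs/2605.21384 , HF: https://huggingface.co/papers/2605.21384
Core contribution in one sentence: measures reward hacking in long-horizon coding agents through the pass-rate gap between visible validation tests and held-out compositional tests.

Comment: A must-read problem for coding-agent RL: if the reward comes mainly from the test suite, will the agent learn to "pass visible tests while violating the real requirements"? SpecBench offers an operational measurement framework.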
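
The pass-rate gap is simple arithmetic, sketched below under the assumption that each test yields a boolean pass/fail. The function names and the sample results are hypothetical, and SpecBench's exact aggregation across tasks may differ.

```python
def pass_rate(results):
    """Fraction of tests that passed; `results` is a list of booleans."""
    if not results:
        raise ValueError("need at least one test result")
    return sum(results) / len(results)


def hacking_gap(visible_results, heldout_results):
    """Visible pass rate minus held-out pass rate.

    A large positive gap suggests the agent satisfied the tests it could see
    without meeting the underlying specification.
    """
    return pass_rate(visible_results) - pass_rate(heldout_results)


# Hypothetical outcome for one task: every visible test passes,
# but only half of the held-out compositional tests do.
visible = [True, True, True, True]
heldout = [True, False, True, False]
gap = hacking_gap(visible, heldout)
print(f"gap = {gap:.2f}")  # gap = 0.50
```

A gap near zero means visible-test performance transfers to the held-out tests; the metric says nothing by itself about why a gap appears, so it is best read alongside the traces.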

#2.11 Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

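Category: LLM Agent / Post-training RL / Tool-use / Spreadsheet Agent
Source and date: arXiv + Hugging Face Papers, 2026-05-21
Link: https://arxiv.org/abs/2605.22642 , HF: https://huggingface.co/papers/2605.22642 , GitHub: https://github.com/Spreadsheet-RL/Spreadsheet-RL
Core contribution in one sentence: trains LLM agents on realistic spreadsheet tasks and uses RL to advance spreadsheet manipulation ability.

Comment: Spreadsheets are a typical "structured tool environment", closer to office automation than pure code. They can serve as a verifiable sandbox for agentic RL in tool environments.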

#2.12 DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

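Category: LLM Agent / Long-term Memory / Context Compression / RL
Source and date: arXiv, 2026-05-21
Link: https://arxiv.org/abs/2605.22411
Core contribution in one sentence: splits long-term memory QA into high-recall candidate retrieval and query-conditioned evidence distillation, and trains the memory distiller with RL.

Comment: The core of agent long-term memory is not "storing more" but distilling scattered evidence at query time. This is suggestive for the memory module of long-context agents.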

#2.13 Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

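Category: LLM Agent / Evaluation / Trace Diagnostics
Source and date: arXiv, 2026-05-21
Link: https://arxiv.org/abs/2605.22608
Core contribution in one sentence: proposes an automated, multi-level agent evaluation framework that generates behavioral insights at three granularities: system, trace, and node.

Comment: Belongs with Insights Generator in the "trace diagnostics" direction. As agent trajectories get longer, evaluation cannot look only at the final score; failure modes need to be summarized automatically.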

#2.14 Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

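Category: LLM Agent / Evaluation / Trace Mining / Code Agent
Source and date: arXiv, 2026-05-20, updated 2026-05-21
Link: https://arxiv.org/abs/2605.21347
Core contribution in one sentence: formalizes corpus-level trace diagnostics and uses a multi-agent system to propose, test, and aggregate systematic behavior patterns across large numbers of execution traces.

Comment: Pairs well with ACC: one compiles trajectories into training data, the other abstracts diagnostic signals from a population of trajectories. Together they look like an early form of an agent training-data flywheel.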

#2.15 IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

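Category: LLM Agent / Inference-time Planning / Systems
Source and date: arXiv, 2026-05-21
Link: https://arxiv.org/abs/2605.22154
Core contribution in one sentence: uses the idle time while an agent waits for tool observations to do speculative planning, improving performance while adding as little latency as possible.

Comment: Systems-oriented but practical. For many real agents the bottleneck is not model inference but waiting on tools/APIs/environments; using the waiting time to generate candidate plans is a test-time compute optimization for agent serving.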
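
The overlap of tool latency and planning can be sketched with Python's `asyncio`. This is a generic illustration of speculating during idle time, not IdleSpec's algorithm; `call_tool`, `draft_plans`, `agent_step`, and the two anticipated observations are hypothetical stand-ins.

```python
import asyncio


async def call_tool(query):
    """Hypothetical slow tool; the sleep stands in for network or environment latency."""
    await asyncio.sleep(0.05)
    return "error" if "bad" in query else "ok"


async def draft_plans(possible_observations):
    """Hypothetical planner: prepares a next action for each anticipated observation."""
    plans = {}
    for obs in possible_observations:
        await asyncio.sleep(0)  # yield control, as a real model call would
        plans[obs] = f"plan-for-{obs}"
    return plans


async def agent_step(query):
    # Start the tool call without blocking on it.
    tool_task = asyncio.create_task(call_tool(query))
    # Speculate while the tool call is in flight.
    plans = await draft_plans(["ok", "error"])
    observation = await tool_task
    # Reuse the speculative plan if the guess covered the real observation.
    return plans.get(observation, "replan-from-scratch")


result = asyncio.run(agent_step("list files"))
print(result)  # plan-for-ok
```

The fallback branch matters: when the real observation was not anticipated, the agent pays the normal planning cost, so speculation in this sketch can only save time and never changes which observation the agent acts on.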

#2.16 Life-Harness: Adapting the Interface, Not the Model

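Category: LLM Agent / Runtime Harness / Environment Design
Source and date: arXiv, 2026-05-21
Link: https://arxiv.org/abs/2605.22166
Core contribution in one sentence: without changing model weights, converts recurring interaction failures from training trajectories into reusable harness interventions that improve deterministic agents.

Comment: Same broad direction as MOSS: the capability bottleneck is not necessarily in the model; it may be in the interface, state conventions, action execution, and trajectory control. The difference is that Life-Harness is closer to interface adaptation, while MOSS more aggressively rewrites source code.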

#2.17 Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

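Category: Code Agent / Hardware Agent / Skill Evolution / Verifier-guided Search
Source and date: arXiv, 2026-05-20
Link: https://arxiv.org/abs/2605.21810
Core contribution in one sentence: for long-context Verilog/EDA tasks, mines success and failure patterns from rollout traces and evolves natural-language skills, with a verifier guiding subsequent search.

Comment: A concrete instance of "code/hardware agent + verifier + skill evolution", well suited to observing how agentic RL lands in a specialized engineering domain.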

#2.18 WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

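Category: LLM Agent / Evaluation / Spreadsheet / Finance Workflow
Source and date: arXiv, 2026-05-21
Link: https://arxiv.org/abs/2605.22664
Core contribution in one sentence: evaluates an agent's ability to build a complete spreadsheet artifact end to end from high-level financial instructions, rather than only single-formula edits or Q&A.

Comment: Read as a pair with Spreadsheet-RL: one leans toward benchmarking, the other toward RL improvement. Both show agent evaluation moving from toy operations toward complete deliverables.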

#3. Top 3 Papers for Close Reading Today

ACC: Compiling Agent Trajectories for Long-Context Training

Link: https://arxiv.org/abs/2605.21850

Reason: maps directly onto "how agent pretraining data shapes capability"; a methodological entry point for long-context agent training data.

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Link: https://arxiv.org/abs/2605.22794

Reason: pushes self-evolving agents from the prompt/skill layer to source-level adaptation, which matters for environment design, self-evolving intelligence, and code agents.

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Link: https://arxiv.org/abs/2605.21488

Reason: provides a mechanistic account of latent-space reasoning; worth thinking about jointly with LLM model-based RL / world-state learning.

Alternate fourth pick: From Reasoning Chains to Verifiable Subproblems / SCRL (https://arxiv.org/abs/2605.22074), for a focused read on RLVR credit assignment.

#4. Top 3 Repos / Models / Datasets to Follow Today

TerminalWorld

- GitHub: https://github.com/EuniAI/TerminalWorld

- Paper: https://arxiv.org/abs/2605.22535

- Why follow: a benchmark reverse-constructed from real terminal trajectories; suited to long-trajectory agent evaluation / RL / trace mining.

SR2AM Self-Regulated Planning

- GitHub: https://github.com/sailing-lab/sr2am-self-regulated-planning

- Model: https://huggingface.co/sailing-lab/SR2AM-v0.1-8B

- Paper: https://arxiv.org/abs/2605.22138

- Why follow: addresses control of test-time compute in agent planning; a good comparison point for model-based planning / Dreamer-for-agent.

Spreadsheet-RL

- GitHub: https://github.com/Spreadsheet-RL/Spreadsheet-RL

- Paper: https://arxiv.org/abs/2605.22642

- Why follow: agentic RL in a structured tool environment; usable as a medium-difficulty testbed for "verifiable reward + tool operation".

Also worth following: DelTA (https://github.com/RUCBM/DelTA), LatentOmni (https://github.com/yfanDai/LatentOmni).

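#5. Research Opportunities / Ideas

#Idea 1: Extend ACC-style trajectory compilation into "agent pretraining data quality assessment"

Problem: not every agent trajectory is suitable for training. Successful trajectories, failed trajectories, detour trajectories, tool noise, and repeated observations may contribute very differently to capability formation.

Possible directions:

Build trajectory quality metrics: evidence density, tool-call necessity, error-recovery quality, context-dependency span, and coverage of verifiable subgoals.
Compare how different filtering strategies affect long-context QA, tool use, and code repair.
Apply query-conditioned evidence distillation of the LaMR/DeferMem kind to trajectory compression, and check whether training on the compressed trajectories preserves agent capability.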
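
Two of the metrics named above could be prototyped as follows. This is a sketch under strong assumptions: the `Step` record, the `used_in_answer` label (which in practice would need an annotator or attribution method), and the verbatim-repeat notion of redundancy are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str                     # "tool_call" or "observation"
    text: str
    used_in_answer: bool = False  # hypothetical label: did the final answer rely on this step?


def evidence_density(steps):
    """Share of observations that the final answer relied on."""
    observations = [s for s in steps if s.kind == "observation"]
    if not observations:
        return 0.0
    return sum(s.used_in_answer for s in observations) / len(observations)


def redundancy(steps):
    """Share of observations whose text repeats an earlier observation verbatim."""
    seen, repeats, total = set(), 0, 0
    for s in steps:
        if s.kind != "observation":
            continue
        total += 1
        if s.text in seen:
            repeats += 1
        seen.add(s.text)
    return repeats / total if total else 0.0


# Hypothetical trajectory: four observations, two useful, one verbatim repeat.
trajectory = [
    Step("tool_call", "grep -rn parse_config src/"),
    Step("observation", "src/config.py:12: def parse_config", used_in_answer=True),
    Step("tool_call", "ls"),
    Step("observation", "README.md src tests"),
    Step("tool_call", "ls"),
    Step("observation", "README.md src tests"),
    Step("tool_call", "pytest -q"),
    Step("observation", "1 failed: test_parse_config", used_in_answer=True),
]
print(evidence_density(trajectory), redundancy(trajectory))  # 0.5 0.25
```

Scores like these could feed a filtering rule (for example, keep trajectories above a density threshold), which is what the filtering-strategy comparison in the list above would then evaluate.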

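#Idea 2: Subproblem curriculum for long-trajectory agent RL

SCRL splits reasoning chains into verifiable subproblems; agent tasks can be split too: locate the file, understand the interface, modify the code, run the tests, fix the errors, submit the result.

Possible directions:

Automatically extract a subgoal graph from successful agent traces.
Use visible tool feedback / unit tests / static checkers as local verifiers.
Compare three training paradigms: outcome-only GRPO, subgoal curriculum RL, and offline DPO from warm-up rollouts.

This direction ties together SCRL, G2D, Trace2Skill, and SpecBench.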
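
A minimal sketch of the "local verifier" idea: each subgoal gets a predicate over trace events, and the curriculum targets the first subgoal that is not yet satisfied. The event schema, the subgoal names, and both helper functions are assumptions for illustration, not part of SCRL or any cited system.

```python
def subgoal_rewards(trace, verifiers):
    """Score each subgoal as 1.0 if any event in the trace satisfies its predicate."""
    return {name: float(any(check(event) for event in trace)) for name, check in verifiers}


def curriculum_stage(rewards, order):
    """Return the first subgoal in `order` that is not yet satisfied, or None."""
    for name in order:
        if rewards[name] == 0.0:
            return name
    return None


# Hypothetical local verifiers built from visible tool feedback.
verifiers = [
    ("located_file", lambda e: e["tool"] == "grep" and e["exit_code"] == 0),
    ("edited_code", lambda e: e["tool"] == "edit" and e["exit_code"] == 0),
    ("tests_pass", lambda e: e["tool"] == "pytest" and e["exit_code"] == 0),
]
order = [name for name, _ in verifiers]

# Hypothetical failed trace: the agent found and edited the file, but tests still fail.
trace = [
    {"tool": "grep", "exit_code": 0},
    {"tool": "edit", "exit_code": 0},
    {"tool": "pytest", "exit_code": 1},
]
rewards = subgoal_rewards(trace, verifiers)
print(rewards)
print(curriculum_stage(rewards, order))  # tests_pass
```

Even though this trace fails overall, it earns credit for two of three subgoals, which is the partial-progress signal that outcome-only rewards discard. A subgoal graph would generalize the linear `order` used here.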

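#Idea 3: Latent world state for LLM Agent: from text context to latent-space belief dynamics

Equilibrium Reasoners and GRAM suggest that latent iterative dynamics may be able to carry reasoning; DeferMem/ACC show that agents need to extract state from long histories. One could move agent memory from "retrieved text snippets" to an "iteratively updatable latent belief state".

Possible directions:

Encode each step's observation into a latent state, and train the latent update with task reward or a subgoal verifier.
Compare the sample efficiency of pure-text CoT, summary memory, and latent memory on long-horizon tool use.
Feed the latent state into the world model of a model-based RL setup, as a first "Dreamer for LLM Agent" prototype experiment.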

#6. Quick Index Table

Title	Category	Date	Link
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems	LLM Agent / Self-Evolving Agent	2026-05-21	https://arxiv.org/abs/2605.22794
ACC: Compiling Agent Trajectories for Long-Context Training	LLM Agent / Pretraining Data	2026-05-21	https://arxiv.org/abs/2605.21850
SCRL: From Reasoning Chains to Verifiable Subproblems	RLVR / Credit Assignment	2026-05-21	https://arxiv.org/abs/2605.22074
Equilibrium Reasoners	Latent Reasoning	2026-05-20	https://arxiv.org/abs/2605.21488
TerminalWorld	Agent Evaluation / Terminal	2026-05-21	https://arxiv.org/abs/2605.22535
Self-Regulated Simulative Planning	Agent Planning	2026-05-21	https://arxiv.org/abs/2605.22138
DelTA	RLVR / Token Credit	2026-05-20	https://arxiv.org/abs/2605.21467
OWPO	RLVR / Self-Evolving LLM	2026-05-21	https://arxiv.org/abs/2605.22156
RELEX / Rank-1 RLVR Trajectories	Training Dynamics	2026-05-20	https://arxiv.org/abs/2605.21468
G2D: How Much Online RL is Enough?	RLVR Efficiency	2026-05-20	https://arxiv.org/abs/2605.21266
LamPO	RLVR / Pairwise Advantage	2026-05-20	https://arxiv.org/abs/2605.21235
LatentOmni	Latent Reasoning / Multimodal	2026-05-21	https://arxiv.org/abs/2605.22012
GRAM: Generative Recursive Reasoning	Latent Reasoning	2026-05-20 update	https://arxiv.org/abs/2605.19376
LaMR Context Pruning for Coding Agents	Code Agent / Context Compression	2026-05-14	https://arxiv.org/abs/2605.15315
SpecBench	Code Agent / Reward Hacking	2026-05-20	https://arxiv.org/abs/2605.21384
Spreadsheet-RL	Agent RL / Tool-use	2026-05-21	https://arxiv.org/abs/2605.22642
DeferMem	Long-term Memory / RL	2026-05-21	https://arxiv.org/abs/2605.22411
Agentic CLEAR	Agent Evaluation	2026-05-21	https://arxiv.org/abs/2605.22608
Insights Generator	Trace Diagnostics	2026-05-21 update	https://arxiv.org/abs/2605.21347
IdleSpec	Agent Systems / Speculative Planning	2026-05-21	https://arxiv.org/abs/2605.22154
Life-Harness	Runtime Harness / Environment Design	2026-05-21	https://arxiv.org/abs/2605.22166
Trace2Skill	Code/EDA Agent / Skill Evolution	2026-05-20	https://arxiv.org/abs/2605.21810
WorkstreamBench	Spreadsheet Agent Evaluation	2026-05-21	https://arxiv.org/abs/2605.22664
