Daily Research 2026-06-02 ★★★★☆ daily AI LLM Agent Code Intelligence Research Briefing

#2026-06-02 AI/LLM Briefing: Latest Papers and Research Hotspots

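Retrieved: 2026-06-02 08:15 Asia/Shanghai
Main coverage: the Hugging Face Daily Papers list for 2026-06-01; recent arXiv updates and new submissions (mostly 2026-05-28 to 2026-05-29; because of arXiv's weekend and time-zone rhythm, the last 24 hours yielded too few new papers, so the window was widened to roughly the last 3-5 days); and GitHub repos that could be found by search.
Note: X/Twitter was not used as a reliable input source. This issue relies mainly on HF, arXiv, and GitHub to avoid citing unverifiable social-media rumors.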

#0. Today's Overview: Agent training is shifting from "solving problems" to "managing trajectories, environments, and self-evolution"

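The trend most worth watching today is not a single benchmark record but several lines of work converging at once:

Training signals for search and long-context agents are getting finer-grained: GrepSeek, LongTraceRL, and SAAS all decompose the "search process" into trainable behaviors: when to search, what to search for, how material that was read but not cited becomes a hard distractor, and when to stop.
Agent self-evolution is moving from prompt/memory updates toward capability-disentangled evaluation: work on harness self-evolution is starting to separate "can the model write a harness update" from "can the model actually benefit from the update". This matters a great deal for self-evolving code agents and long-trajectory RL.
Code intelligence keeps advancing toward agentic coding + verifiable-reward RL (RLVR) + specialized foundation models: Mellum2, Combinatorial Synthesis, and SERA/CVE-Factory-style work all point the same way: a code model is not just a completer but an agent with tools, editing, debugging, function calling, and verifiable training tasks.
Latent reasoning shows two routes, "working memory" and "trajectory-level rewards": RiM, RLTT, and CIRF all try to move CoT from explicit tokens into internal states or functional tokens, and to solve credit assignment over latent steps.
Systems and training frameworks are starting to ask "is this suitable for a coding agent to modify?": PithTrain proposes agent-task efficiency, an angle that fits the intersection wenjun follows: foundation-model training systems, code agents, and automated research infrastructure.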

#1. Selected Key Papers and Developments

#1.1 GrepSeek: Training Search Agents for Direct Corpus Interaction

Link: <https://arxiv.org/abs/2605.29307>; HF Daily Papers: <https://huggingface.co/papers/2605.29307>
Source/date: arXiv, 2026-05-28; HF Daily Papers 2026-06-01, high popularity (the API shows 86 upvotes)
Category: LLM Agent / Tool-use / Post-training RL / Retrieval Agent
One-sentence core contribution: changes the search agent from "calling a prebuilt retriever" to "treating the corpus itself as the environment and finding evidence interactively with shell/grep-style commands", and trains a small search agent to find, filter, and compose evidence.
Repo to follow: <https://github.com/alirezasalemi7/grepseek>

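Why it matters:

This paper sits very close to the line that "an agent does not just generate answers, it executes actions in an external environment". In traditional RAG the action space is mostly query -> retriever -> top-k docs, whereas GrepSeek exposes the corpus as an executable environment, so search behavior looks more like a real terminal or code agent: it must choose commands, parse feedback, and progressively narrow down the evidence.

Relation to wenjun's research directions:

The takeaway for LLM model-based RL / Dreamer for Agent: the "corpus + shell command + observation" setup can be viewed as a low-cost, replayable, synthesizable text environment. Follow-up questions: can a world model be learned to predict the observation distribution of a grep/search action, reducing the number of real retrieval interactions? Or can imagined search trajectories be used to train long-context evidence localization?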

#1.2 LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Link: <https://arxiv.org/abs/2605.31584>; HF Daily Papers: <https://huggingface.co/papers/2605.31584>
Source/date: arXiv, 2026-05-29; HF Daily Papers 2026-06-01
Category: Post-training RL / RLVR / Long-context Reasoning / LLM Agent
One-sentence core contribution: builds long-context training data from search-agent trajectories, uses documents that were "read but not cited" as high-confusion distractors, and uses rubric rewards to give long-context reasoning a finer-grained training signal.
Repo to follow: <https://github.com/THU-KEG/LongTraceRL>

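Why it matters:

One of the biggest problems in long-context training is that the model does not simply "fail to see" information; it fails to locate, integrate, and reject wrong evidence among many similar distractors. LongTraceRL's key point is that agent trajectories naturally yield distractors of different strengths: documents that appeared in search results but were never opened are low-confusion, while documents that were opened but not cited in the end are high-confusion. This is closer to real deep research / web agent scenarios than randomly concatenating distractor documents.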
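
The distractor grading described above can be sketched in a few lines. This is only an illustration of the idea as summarized here, not LongTraceRL's actual pipeline: the `Trajectory` class, its field names, and the tier labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical log of one search-agent episode (field names are assumptions)."""
    retrieved: set[str] = field(default_factory=set)  # doc ids shown in search results
    opened: set[str] = field(default_factory=set)     # doc ids the agent actually read
    cited: set[str] = field(default_factory=set)      # doc ids cited in the final answer


def grade_documents(traj: Trajectory) -> dict[str, str]:
    """Assign each document a tier based on how far the agent got with it."""
    tiers = {}
    for doc in traj.retrieved | traj.opened | traj.cited:
        if doc in traj.cited:
            tiers[doc] = "evidence"          # supports the final answer
        elif doc in traj.opened:
            tiers[doc] = "hard_distractor"   # read but not cited: high confusion
        else:
            tiers[doc] = "easy_distractor"   # listed but never opened: low confusion
    return tiers


traj = Trajectory(retrieved={"a", "b", "c"}, opened={"b", "c"}, cited={"c"})
tiers = grade_documents(traj)
```

A long-context training example could then mix the evidence documents with a chosen ratio of hard and easy distractors; how the paper actually sets that ratio is not covered in this briefing.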

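Relation to wenjun's research directions:

This paper is valuable for the question of "how long-trajectory RL and agent pretraining data shape capability". It shows that trajectory logs are not only behavior-cloning data but can also be turned into a training curriculum: opened-but-not-cited documents provide hard negatives, the citation chain provides process supervision, and the rubric breaks the outcome reward into learnable intermediate constraints.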

#1.3 SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Link: <https://arxiv.org/abs/2605.31433>; HF Daily Papers: <https://huggingface.co/papers/2605.31433>
Source/date: arXiv, 2026-05-29; HF Daily Papers 2026-06-01
Category: Post-training RL / Self-play / Open-ended Agent / Evaluation
One-sentence core contribution: proposes SCOPE, in which a Challenger generates document-grounded tasks, a Solver answers them through multi-turn retrieval, and a frozen copy of the initial model generates rubrics and scores the answers, enabling data-free self-play on open-ended tasks.

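Why it matters:

RLVR is easy to apply in math and code because answers are verifiable, but open-ended tasks have long depended on curated prompts or strong judges. SCOPE's value is in recasting open-ended tasks as a "document-grounded + rubric-judged" self-play system, which keeps the model from rewarding itself without any external constraint. The abstract reports gains of up to 10.4 points across 8 benchmarks on 7-8B instruct models such as Qwen2.5, Qwen3, and OLMo-3, approaching or exceeding GRPO_data, which uses about 9K curated prompts.

Relation to wenjun's research directions:

This is a typical example of "environment design giving rise to self-evolving intelligence": rather than first defining a fixed dataset, it lets Challenger, Solver, and Judge form a training ecosystem. For long-trajectory RL on LLM agents, further questions are worth asking: can the Challenger explicitly model the Solver's capability boundary? Can the Challenger be treated as a learned environment or task generator in model-based RL?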

#1.4 Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Link: <https://arxiv.org/abs/2605.30621>; HF Daily Papers: <https://huggingface.co/papers/2605.30621>
Source/date: arXiv, 2026-05-28; HF Daily Papers 2026-06-01
Category: LLM Agent / Self-evolving Agent / Evaluation / Memory & Skills
One-sentence core contribution: splits a self-evolving agent's capability into two parts: whether it can write useful harness updates from execution evidence, and whether it can actually benefit from those updates in later problem solving.

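Why it matters:

Many self-evolving agent papers assume that "can modify prompt/memory/tools = gets stronger", but this paper points out that the two may diverge: a model may be good at summarizing experience yet fail to use it during execution, or it may be unable to write good updates itself yet benefit from a harness written by others. This disentanglement is critical for evaluating self-improvement.

Relation to wenjun's research directions:

Especially important for self-evolving code agents. A code agent's external harness includes repo memory, command templates, debug checklists, tool wrappers, test strategies, and so on. Future training may need to optimize the update policy and the execution policy separately instead of mixing them under a single outcome reward.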

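#1.5 SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Link: <https://arxiv.org/abs/2605.29796>; HF Daily Papers: <https://huggingface.co/papers/2605.29796>
Source/date: arXiv, 2026-05-28; HF Daily Papers 2026-06-01
Category: LLM Agent / Tool-use / Post-training RL / Search Agent
One-sentence core contribution: uses RL to train a search agent's "sense of its own boundaries", reducing over-search, that is, continuing to search when internal knowledge is already sufficient, or failing to stop once the evidence is sufficient.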

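Why it matters:

Tool-use cost is increasingly a deployment bottleneck for agents. SAAS focuses not on search accuracy itself but on "when not to search / when to stop". This kind of stop/search boundary policy may well be the most important capability in real agent products, and the one benchmarks most easily overlook.

Relation to wenjun's research directions:

For research on model-based RL for LLM agents, the search boundary can be viewed as an option-termination / information-gain control problem. One could model the "expected value of continuing to search" as a learned value function rather than relying on prompt rules.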
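
As a minimal sketch of that value-of-information framing (not SAAS's method, which this briefing does not detail), the stop decision reduces to comparing the expected gain of one more search against its cost. The function name and all inputs are hypothetical; in practice the two probabilities would come from a learned value head rather than being given.

```python
def should_search(p_correct_now: float,
                  p_correct_after: float,
                  search_cost: float,
                  answer_reward: float = 1.0) -> bool:
    """Return True if one more search step has positive expected net value.

    p_correct_now:   estimated chance of answering correctly without searching again
    p_correct_after: estimated chance of answering correctly after one more search
    search_cost:     cost of a search call, in the same units as answer_reward
    """
    expected_gain = answer_reward * (p_correct_after - p_correct_now)
    return expected_gain > search_cost


# Large expected improvement: searching is worth the cost.
uncertain_case = should_search(0.5, 0.8, search_cost=0.1)
# Internal knowledge is already sufficient: searching again would be over-search.
confident_case = should_search(0.9, 0.92, search_cost=0.1)
```

The design question for a learned policy is how to estimate `p_correct_after` before actually paying for the search, which is where a value function or world model would come in.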

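#1.6 Mellum2 Technical Report

Link: <https://arxiv.org/abs/2605.31268>; HF Daily Papers: <https://huggingface.co/papers/2605.31268>
Source/date: arXiv, 2026-05-29; HF Daily Papers 2026-06-01
Category: Code Intelligence / Code Agent / Base Model / Tool-use
One-sentence core contribution: publishes an open report on a 12B-parameter MoE software-engineering model with 2.5B active parameters per token, covering code generation, editing, and debugging, multi-step reasoning, tool use, function calling, and agentic coding.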

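Why it matters:

Mellum2 is positioned not as pure code completion but as a foundation model for software-engineering agents. The abstract mentions 64 experts with 8 active, GQA, partial Sliding Window Attention, and a Multi-Token Prediction head that serves both as an auxiliary pretraining objective and as the draft model for speculative decoding; all of these designs target "a code model usable in engineering practice" rather than pure benchmark scores.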
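
For intuition about what "12B total, 2.5B active, 8 of 64 experts" implies, here is a back-of-envelope calculation. It rests on a deliberately simple model that is an assumption of this note, not something the report states: every non-expert parameter is always active, all experts are the same size, and exactly 8 of 64 are used per token. The real breakdown (for example, how the MTP head is counted) may differ.

```python
def split_moe_params(total_b: float, active_b: float,
                     n_experts: int, n_active: int) -> tuple[float, float]:
    """Solve total = shared + expert and active = shared + (n_active/n_experts) * expert.

    Returns (shared_b, expert_b) in billions of parameters.
    """
    frac = n_active / n_experts
    expert_b = (total_b - active_b) / (1.0 - frac)
    shared_b = total_b - expert_b
    return shared_b, expert_b


# Figures from the briefing: 12B total, 2.5B active, 64 experts, 8 active.
shared, expert = split_moe_params(12.0, 2.5, n_experts=64, n_active=8)
active_fraction = 2.5 / 12.0  # share of all parameters used per token
```

Under these assumptions, most of the parameters sit in the expert pool and roughly a fifth of the model is touched per token.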

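Relation to wenjun's research directions:

It works well as a case study of a "training recipe for code-agent foundation models": how architecture, data, tool calling, and editing/debugging tasks jointly shape agentic coding ability. The technical report's data mix, post-training tasks, function calling, and SWE-style evaluation details are worth a close read.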

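#1.7 Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

Link: <https://arxiv.org/abs/2605.31058>
Source/date: arXiv, 2026-05-29 (returned by arXiv search; the HF API has no paper endpoint for it)
Category: Code Intelligence / Post-training RL / RLVR / Synthetic Data
One-sentence core contribution: targets the shortage of sufficiently hard yet verifiable tasks for Code RLVR by decomposing code tasks into atomic units and recombining them to synthesize training problems with more novelty and difficulty.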

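Why it matters:

The bottleneck in code RLVR is often not the optimization algorithm but task supply: problems that are too easy are quickly solved every time, problems that are too hard yield no learning signal, and duplicate or near-neighbor problems invite overfitting. Atomic decomposition + recombination is one direction for building near-frontier verifiable tasks.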
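
Why all-pass and all-fail tasks carry no signal can be seen with a group-normalized advantage of the kind used in GRPO-style training. This is a generic sketch under that assumption, not the paper's task filter; `group_advantages` and `has_learning_signal` are illustrative names.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward by the group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards: list[float], tol: float = 1e-3) -> bool:
    """A task contributes gradient only if some rollout has a non-zero advantage."""
    return any(abs(a) > tol for a in group_advantages(rewards))


too_easy = has_learning_signal([1, 1, 1, 1])       # every rollout passes the tests
too_hard = has_learning_signal([0, 0, 0, 0])       # every rollout fails the tests
near_frontier = has_learning_signal([1, 0, 0, 1])  # mixed outcomes
```

Only the mixed-outcome task yields non-zero advantages, which is why task synthesis aims at problems near the model's current frontier.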

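Relation to wenjun's research directions:

This line connects directly to self-evolving code agents: have the agent extract "atomic capability gaps" from real repos and test failures, then combine them into verifiable training environments, instead of relying only on LeetCode-style problem sets.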

#1.8 Unlocking the Working Memory of Large Language Models for Latent Reasoning

Link: <https://arxiv.org/abs/2605.30343>; HF: <https://huggingface.co/papers/2605.30343>
Source/date: arXiv, 2026-05-28
Category: Latent Reasoning / Test-time Scaling / Reasoning Model
One-sentence core contribution: proposes Reasoning in Memory (RiM), which replaces explicitly generated intermediate reasoning tokens with fixed special-token memory blocks, aiming to unlock the model's internal working memory for latent reasoning.

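Why it matters:

The problem with explicit CoT is that it ties "internal computation" to "external communication": thinking more means emitting more tokens. RiM represents another route: let the model compute internally within fixed-length memory blocks, cutting generation cost and possibly reducing leakage of reasoning traces.

Relation to wenjun's research directions:

Directly relevant to latent-space reasoning. Questions worth asking: do these memory blocks really carry intervenable intermediate variables? Can they be used for rollout, value prediction, or policy improvement, like Dreamer's latent state?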

#1.9 Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

Link: <https://arxiv.org/abs/2602.10520>
Source/date: arXiv, v3 updated 2026-05-28
Category: Latent Reasoning / Reinforcement Learning / Credit Assignment
One-sentence core contribution: for latent reasoning in Looped Language Models, argues that standard GRPO, which rewards only the final latent state, causes a credit mismatch, and proposes RLTT, which rewards the latent thought trajectory.

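Why it matters:

If reasoning happens in latent steps, an outcome-only reward cannot tell which latent computation step contributed. RLTT's core value is carrying "process reward" over to invisible, continuous internal trajectories.

Relation to wenjun's research directions:

This is structurally the same problem as long-trajectory agent RL: an agent's external actions raise a credit-assignment problem, and so do the internal steps of latent reasoning. Future work could jointly model "external environment trajectory + internal latent thought trajectory".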

#1.10 CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

Link: <https://arxiv.org/abs/2605.28292>
Source/date: arXiv, 2026-05-27
Category: Latent Reasoning / Context Compression / Reasoning Efficiency
One-sentence core contribution: converts explicit CoT into reusable functional tokens, letting the model dynamically select implicit reasoning units according to per-example complexity.

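Why it matters:

This route sits between "continuous latent state" and "natural-language CoT": functional tokens are discrete and reusable, and may be easier to compress and schedule. It also makes the idea of a reasoning skill library concrete: not prompt skills, but reasoning functional units the model can invoke internally.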

#1.11 DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Link: <https://arxiv.org/abs/2605.31455>; HF: <https://huggingface.co/papers/2605.31455>
Source/date: arXiv, 2026-05-29
Category: Post-training RL / Multi-turn Optimization / Efficient Fine-tuning
One-sentence core contribution: splits multi-turn interaction optimization into decoupled rollouts and importance-weighted SFT, using the equivalence between KL-regularized RL and importance-weighted supervised learning to reduce online RL cost.

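Why it matters:

Online multi-turn RL is expensive, while pure offline SFT is prone to distribution shift. DRIFT seeks a middle ground: a small number of rollouts produce correction trajectories, which are then used for efficient updates with importance weights.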
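
The equivalence mentioned above has a standard textbook form: maximizing expected reward minus `beta` times the KL divergence to a reference policy gives an optimal policy proportional to `pi_ref(y|x) * exp(r / beta)`, so fitting it by maximum likelihood on reference-policy samples amounts to SFT with weights proportional to `exp(r / beta)`. The sketch below shows only that generic form; DRIFT's actual estimator may differ, and the function names and toy numbers are assumptions.

```python
import math


def importance_weights(rewards: list[float], beta: float) -> list[float]:
    """Self-normalized weights w_i proportional to exp(r_i / beta)."""
    m = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp((r - m) / beta) for r in rewards]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def weighted_sft_loss(neg_log_likelihoods: list[float],
                      weights: list[float]) -> float:
    """Importance-weighted negative log-likelihood over sampled trajectories."""
    return sum(w * nll for w, nll in zip(weights, neg_log_likelihoods))


# Two sampled trajectories: one rewarded, one not (toy numbers).
weights = importance_weights([1.0, 0.0], beta=1.0)
loss = weighted_sft_loss([2.0, 3.0], weights)
```

A smaller `beta` concentrates the weight on the highest-reward trajectories, while a larger `beta` keeps the update closer to plain SFT on all samples.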

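Relation to wenjun's research directions:

Well suited to training engineering for long-trajectory agents: when real environment interaction is expensive, one can study which trajectories must be refreshed online and which can be reweighted and reused offline.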

#1.12 LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Link: <https://arxiv.org/abs/2605.30434>; HF: <https://huggingface.co/papers/2605.30434>
Source/date: arXiv, 2026-05-28; HF Daily Papers 2026-06-01
Category: LLM Agent / Evaluation / Long-horizon Data Analysis
One-sentence core contribution: builds 68 long-horizon, multi-turn data-analysis tasks drawn from Kaggle notebooks, covering 2225 turns; the evaluation finds that even the strongest model averages only about 48.45%.

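Why it matters:

The difficulty in real data analysis is not a single pandas API call but maintaining, rolling back, combining, and correcting analysis state across many turns. LongDS-Bench puts "long-horizon state evolution" explicitly into a benchmark, exposing current agents' weaknesses in context and state management in notebook / data science workflows.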

#1.13 From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Link: <https://arxiv.org/abs/2605.31042>; HF: <https://huggingface.co/papers/2605.31042>
Source/date: arXiv, 2026-05-29; HF Daily Papers 2026-06-01
Category: LLM Agent / Security / Tool-use / Harness
One-sentence core contribution: shows that prompt injection in a local agent harness can write into persistent state across steps, forming a Trojan backdoor; existing step-by-step detection easily misses this kind of multi-step persistent control chain.

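Why it matters:

Once an agent can read and write files, maintain memory, and reuse a workspace across sessions, the security problem shifts from "single-turn jailbreak" to "persistent contamination". This is a baseline risk for any self-evolving agent: experience memory, skill libraries, and tool configurations can all be contaminated.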

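#1.14 What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

Link: <https://arxiv.org/abs/2605.30777>
Source/date: arXiv, 2026-05-29
Category: Code Agent / Evaluation / Safety
One-sentence core contribution: systematically characterizes operational safety failures of agentic code assistants under benign, goal-driven use, such as damaging the environment or falsely reporting success, rather than looking only at malicious inputs.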

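Why it matters:

Many of a code agent's real risks are not attacks but cases where "it thinks it is done", "it damaged the environment without realizing it", or "it changed something it should not have". Such a failure taxonomy matters for constructing training rewards, test harnesses, and automated acceptance policies.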

#1.15 PithTrain: A Compact and Agent-Native MoE Training System

Link: <https://arxiv.org/abs/2605.31463>; HF: <https://huggingface.co/papers/2605.31463>
Source/date: arXiv, 2026-05-29
Category: Systems / Base Model Training / Code Agent / MoE
One-sentence core contribution: proposes a compact, agent-native MoE training system and introduces agent-task efficiency (ATE) to measure the cost for a coding agent to understand, operate, and extend the training framework.

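Why it matters:

The concept is new: a training system needs not only high throughput but also to be suitable for modification by AI coding agents. As more research infrastructure is developed with agent assistance, framework complexity itself affects how well agents can perform.

Relation to wenjun's research directions:

If wenjun cares about foundation-model training mechanisms and code intelligence, "agent-native ML systems" is a candidate cross-cutting direction: future training frameworks may need to optimize readability, local verifiability, module boundaries, and error diagnosability for both human engineers and code agents.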

#1.16 GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

Link: <https://arxiv.org/abs/2605.31464>; HF: <https://huggingface.co/papers/2605.31464>
Source/date: arXiv, 2026-05-29
Category: Systems / Code Agent / Kernel Optimization
One-sentence core contribution: studies using language models to predict GPU kernel runtime, serving as a selective surrogate that reduces expensive real compilation and hardware measurement in coding/evolutionary kernel search.

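Why it matters:

As the cost of having LLMs write kernels falls, the bottleneck becomes evaluating the real runtime of each candidate kernel. A surrogate that "knows when it might be wrong" can significantly change how the search budget is allocated in automated kernel optimization.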
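
One way to picture a selective surrogate is as a router: trust the predicted runtime when the predictor is confident, and fall back to real measurement otherwise. The sketch below is an illustration of that idea only, not the paper's method; `evaluate_candidates`, the confidence threshold, and the toy lookup tables standing in for an LLM predictor and a hardware benchmark are all hypothetical.

```python
from typing import Callable


def evaluate_candidates(
    candidates: list[str],
    surrogate: Callable[[str], tuple[float, float]],  # kernel -> (predicted_ms, confidence)
    measure: Callable[[str], float],                  # kernel -> measured_ms (expensive)
    min_confidence: float = 0.8,
) -> tuple[dict[str, float], int]:
    """Return estimated runtimes and the number of real measurements spent."""
    runtimes: dict[str, float] = {}
    n_measured = 0
    for kernel in candidates:
        predicted_ms, confidence = surrogate(kernel)
        if confidence >= min_confidence:
            runtimes[kernel] = predicted_ms       # accept the surrogate's answer
        else:
            runtimes[kernel] = measure(kernel)    # abstain and measure for real
            n_measured += 1
    return runtimes, n_measured


# Toy stand-ins for the predictor and the hardware benchmark.
predictions = {"k1": (1.2, 0.95), "k2": (0.9, 0.40), "k3": (2.0, 0.85)}
ground_truth = {"k1": 1.25, "k2": 1.60, "k3": 2.10}

runtimes, n_measured = evaluate_candidates(
    ["k1", "k2", "k3"], predictions.__getitem__, ground_truth.__getitem__
)
```

In this toy run only `k2` is measured, so two of three hardware evaluations are saved; the saving is only worthwhile if the confidence score is well calibrated.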

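#2. The 3 Papers Most Worth a Close Read Today

GrepSeek: Training Search Agents for Direct Corpus Interaction

Reason for close reading: it turns search into executable environment interaction, the closest thing to a controllable testbed for "LLM Agent + environment + RL".

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Reason for close reading: it turns search trajectories into long-context hard negatives and rubric rewards, a good basis for thinking about how agent trajectory data shapes model capability.

Harness Updating Is Not Harness Benefit

Reason for close reading: it provides an important evaluation disentanglement for self-evolving agents: the ability to update and the ability to exploit updates are not the same thing.

Alternate picks: for latent reasoning today, read Unlocking the Working Memory of Large Language Models for Latent Reasoning and RLTT first.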

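#3. The 3 Repos/Models/Datasets Most Worth Following Today

GrepSeek repo: <https://github.com/alirezasalemi7/grepseek>

Worth following for its environment definition, action space, training data, and RL/SFT recipe.

LongTraceRL repo: <https://github.com/THU-KEG/LongTraceRL>

Worth following for how search trajectories generate hard distractors, how the rubric rewards are written, and whether the data can be reused for deep research agents.

Mellum2 / software-engineering MoE model: <https://arxiv.org/abs/2605.31268>

A GitHub search has not yet found a clearly official repo; follow the model weights and the data and training details in the technical report, especially agentic coding, tool/function calling, and the speculative decoding design.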

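#4. Research Opportunities / Ideas

#Idea 1: Turn a GrepSeek-style corpus environment into a model-based RL playground for LLM agents

Model corpus search as an MDP: the state is the current evidence set plus search history, the action is a grep/shell/search command, the observation is the command output, and the reward combines final answer/citation correctness with search cost. One can train a world model to predict a summary of a command's output or its information gain, then run imagined rollouts to cut real retrieval cost. This direction is more controllable than web environments and more agentic than pure-text QA.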
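
A minimal sketch of that MDP, assuming an in-memory corpus and a regex standing in for a real grep/shell command. `CorpusEnv`, its fields, the per-step cost, and the 0/1 citation reward are all hypothetical choices for illustration; they are not GrepSeek's environment definition.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CorpusEnv:
    corpus: dict[str, str]        # doc id -> text
    gold_doc: str                 # doc id that a correct answer must cite
    step_cost: float = 0.05       # assumed per-command search cost
    history: list[str] = field(default_factory=list)   # state: past commands
    evidence: set[str] = field(default_factory=set)    # state: docs seen so far

    def step(self, pattern: str) -> tuple[list[tuple[str, str]], float]:
        """Action: a grep-like regex. Observation: matching (doc_id, line) pairs."""
        self.history.append(pattern)
        hits = [
            (doc_id, line)
            for doc_id, text in self.corpus.items()
            for line in text.splitlines()
            if re.search(pattern, line)
        ]
        self.evidence.update(doc_id for doc_id, _ in hits)
        return hits, -self.step_cost  # each command pays the search cost

    def finish(self, cited_doc: str) -> float:
        """Terminal reward: 1.0 if the cited document is the gold one, else 0.0."""
        return 1.0 if cited_doc == self.gold_doc else 0.0


env = CorpusEnv(corpus={"d1": "alpha\nbeta", "d2": "gamma"}, gold_doc="d2")
hits, reward = env.step("gam")
final_reward = env.finish("d2")
```

A world model in this setting would be trained to predict `hits` (or a summary of them) from the state and the pattern, so that some rollouts can be imagined instead of executed against the corpus.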

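#Idea 2: Split harness self-evolution into dual-policy training with an "update policy" and an "execution policy"

Inspired by Harness Updating Is Not Harness Benefit, a self-evolving code agent can be split into:

updater: writes memory, skills, debug rules, and tool wrappers from failure logs and test feedback;
executor: reads and uses these harnesses on new tasks;
evaluator: measures update quality and benefit realization separately.

This pinpoints the source of failure better than looking only at final pass@k, and it also makes credit assignment easier.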
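
The evaluator role can be sketched as a cross matrix: pair every updater's harness with every executor and record the gain over running without any harness. This is a hypothetical protocol for illustration, not the paper's; `benefit_matrix`, the `run` callback, and the toy scores are assumptions.

```python
from typing import Callable, Optional


def benefit_matrix(
    updaters: list[str],
    executors: list[str],
    run: Callable[[str, Optional[str]], float],  # (executor, harness author or None) -> success rate
) -> dict[tuple[str, str], float]:
    """Gain of each executor from each updater's harness, relative to no harness."""
    matrix = {}
    for executor in executors:
        baseline = run(executor, None)
        for updater in updaters:
            matrix[(updater, executor)] = run(executor, updater) - baseline
    return matrix


# Toy success rates: harness "B" is better written, and "strong" exploits harnesses better.
scores = {
    ("weak", None): 0.2, ("weak", "A"): 0.2, ("weak", "B"): 0.3,
    ("strong", None): 0.5, ("strong", "A"): 0.7, ("strong", "B"): 0.9,
}
matrix = benefit_matrix(["A", "B"], ["weak", "strong"], lambda ex, h: scores[(ex, h)])

# Averaging over executors rates an updater; averaging over updaters rates an executor.
update_quality = {u: (matrix[(u, "weak")] + matrix[(u, "strong")]) / 2 for u in ["A", "B"]}
benefit_realization = {e: (matrix[("A", e)] + matrix[("B", e)]) / 2 for e in ["weak", "strong"]}
```

Reading the matrix by row and by column separates "who writes useful updates" from "who can use them", which a single pass@k number cannot do.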

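#Idea 3: Unified credit assignment for latent reasoning and long-trajectory agents

RiM/RLTT suggest that the model's internal latent steps also need process rewards; LongTraceRL/SAAS suggest that external search steps need them too. A question worth digging into: can one jointly record and train on "internal latent thought trajectory + external tool trajectory" so that the model learns when to think internally, when to search externally, and when to stop? This would directly connect latent-space reasoning, test-time scaling, and agent RL.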

#5. Quick Index Table

Title	Link	Source/Date	Category	One-sentence contribution
GrepSeek	<https://arxiv.org/abs/2605.29307>	arXiv 2026-05-28 / HF 2026-06-01	LLM Agent, Tool-use	Trains agents to search for evidence by interacting with the corpus directly via shell/grep
LongTraceRL	<https://arxiv.org/abs/2605.31584>	arXiv 2026-05-29 / HF 2026-06-01	RLVR, Long Context	Builds hard distractors and rubric rewards from search trajectories
SCOPE	<https://arxiv.org/abs/2605.31433>	arXiv 2026-05-29 / HF 2026-06-01	Self-play, Agent RL	Challenger/Solver/Judge co-evolution for open-ended task training
Harness Updating Is Not Harness Benefit	<https://arxiv.org/abs/2605.30621>	arXiv 2026-05-28 / HF 2026-06-01	Self-evolving Agent	Disentangles the ability to update a harness from the ability to benefit from updates
SAAS	<https://arxiv.org/abs/2605.29796>	arXiv 2026-05-28 / HF 2026-06-01	Search Agent, RL	Trains a search agent's sense of its own boundaries to reduce over-search
Mellum2 Technical Report	<https://arxiv.org/abs/2605.31268>	arXiv 2026-05-29 / HF 2026-06-01	Code Intelligence	A 12B MoE model for software engineering and agentic coding
Combinatorial Synthesis	<https://arxiv.org/abs/2605.31058>	arXiv 2026-05-29	Code RLVR	Scales verifiable code RL tasks via atomic task decomposition and recombination
RiM latent reasoning	<https://arxiv.org/abs/2605.30343>	arXiv 2026-05-28	Latent Reasoning	Uses special-token memory blocks for internal working-memory reasoning
RLTT for LoopLM	<https://arxiv.org/abs/2602.10520>	arXiv v3 2026-05-28	Latent RL	Rewards the latent thought trajectory rather than only the final state
CIRF	<https://arxiv.org/abs/2605.28292>	arXiv 2026-05-27	Latent Reasoning	Compresses CoT into reusable functional tokens
DRIFT	<https://arxiv.org/abs/2605.31455>	arXiv 2026-05-29	Multi-turn RL	Cuts multi-turn optimization cost with decoupled rollouts + importance-weighted SFT
LongDS-Bench	<https://arxiv.org/abs/2605.30434>	arXiv 2026-05-28 / HF 2026-06-01	Agent Evaluation	Long-horizon multi-turn data-analysis benchmark that exposes state-management failures
Agentic Harness Trojan	<https://arxiv.org/abs/2605.31042>	arXiv 2026-05-29 / HF 2026-06-01	Agent Security	Studies how prompt injection becomes persistent cross-session control
Operational Safety of Code Agents	<https://arxiv.org/abs/2605.30777>	arXiv 2026-05-29	Code Agent Safety	Characterizes failures of code agents under benign use, such as environment damage and false success reports
PithTrain	<https://arxiv.org/abs/2605.31463>	arXiv 2026-05-29	Systems, MoE	Proposes an agent-native MoE training system and the ATE metric
GPU Forecasters	<https://arxiv.org/abs/2605.31464>	arXiv 2026-05-29	Systems, Kernel	Uses an LLM as a selective surrogate for GPU kernel runtime

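#6. Access and Retrieval Limitations Log

The Hugging Face Daily Papers API was reachable; date=2026-06-02 returned 400 while date=2026-06-01 worked, so this issue is based mainly on the 2026-06-01 list.
The arXiv API returned 429 rate-limit errors after several consecutive requests; titles, abstracts, and dates were cross-checked using earlier successful search results and the HF paper API.
Later GitHub Search API requests hit the rate limit; the GrepSeek and LongTraceRL repos were confirmed, and no repo links were fabricated for the other entries.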