
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
The paper introduces Claw-Eval, an end‑to‑end suite that evaluates large language model agents by auditing every step of their execution rather than only the final output. It uses a three‑phase pipeline (Setup, Execution, Judge) and records agent actions through execution traces, server logs, and environment snapshots. Across 300 human‑verified tasks, the framework scores agents on completion, safety, and robustness, revealing that traditional benchmarks miss a large share of safety and robustness failures. Results from 14 frontier models show significant drops in consistency under simulated errors and notable weaknesses in multimodal video tasks and dialogue questioning.
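
The difference between judging only the final output and auditing the whole trace can be illustrated with a minimal sketch. Every name here (`Step`, `Trace`, `judge_final_only`, `judge_full_trace`, the `unsafe` flag) is hypothetical and chosen for illustration; none of it is taken from Claw-Eval, and the real framework's Judge phase is presumably far richer.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded agent action (hypothetical trace entry)."""
    action: str
    unsafe: bool = False  # assumed to be set by some upstream safety check


@dataclass
class Trace:
    """Recorded execution: every step plus the final output."""
    steps: list = field(default_factory=list)
    final_output: str = ""


def judge_final_only(trace, expected):
    """Output-only judging: looks at nothing but the final answer."""
    return trace.final_output == expected


def judge_full_trace(trace, expected):
    """Trace-level judging: also inspects every intermediate step."""
    return {
        "completion": trace.final_output == expected,
        "safety": not any(step.unsafe for step in trace.steps),
    }


# A run that finishes the task but performs an unsafe action on the way.
trace = Trace(
    steps=[
        Step("read_file"),
        Step("delete_backup", unsafe=True),
        Step("write_report"),
    ],
    final_output="report.txt",
)

final_only = judge_final_only(trace, "report.txt")
audited = judge_full_trace(trace, "report.txt")
```

In this toy run the output-only judge passes the agent, while the trace-level judge reports a completed task with a safety failure, which is the kind of gap the summary attributes to traditional benchmarks.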

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
The paper introduces SKILL0, a framework that trains large language model agents to internalize specialized skills directly into their parameters, removing the need for runtime skill retrieval. Using an in‑context reinforcement learning curriculum, explicit skill descriptions are gradually withdrawn as...

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
MetaClaw presents a continual‑learning framework for large language model agents that combines instant, text‑based skill injection with scheduled weight updates, eliminating service downtime. The fast loop creates concise behavioral rules from user failures and injects them directly into the prompt....

ReMix: Reinforcement Routing for Mixtures of LoRAs in LLM Finetuning
The paper introduces ReMix, a reinforcement‑learning‑based routing strategy for Mixture‑of‑LoRAs that eliminates the common “routing weight collapse,” in which a single adapter dominates. By assigning constant, equal weights to all activated adapters and training the router as a policy, ReMix...
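
The contrast between learned routing weights and constant, equal weights can be sketched in a few lines. This is an illustration under stated assumptions, not ReMix's implementation: the function names are hypothetical, top-k selection is assumed as the activation rule, and the constant weight is assumed to be `1/k`. The reinforcement-learning training of the router is omitted entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_topk_weights(logits, k):
    """Conventional routing: top-k adapters, weighted by renormalized softmax."""
    probs = softmax(logits)
    top = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return {i: probs[i] / z for i in top}


def equal_topk_weights(logits, k):
    """Equal-weight routing (assumed 1/k): the router only picks adapters."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return {i: 1.0 / k for i in top}


# Router logits where one adapter has a much larger score than the rest.
logits = [8.0, 1.0, 0.5, -2.0]

learned = softmax_topk_weights(logits, k=2)
equal = equal_topk_weights(logits, k=2)
```

With these logits, the softmax-weighted scheme puts almost all of the mixture weight on adapter 0 even though two adapters are activated, while the equal-weight scheme gives both selected adapters the same share. Because the selection step is discrete and the weights carry no gradient, the router has to be trained some other way, which is consistent with the summary's description of training it as a policy.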

Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
The paper introduces CONSTORY‑CHECKER, an automated pipeline, and ConStory‑Bench, a 2,000‑prompt benchmark, to evaluate narrative consistency in long‑form story generation by LLMs. The four‑stage system extracts suspect spans, pairs conflicting statements, generates evidence chains, and produces anchored reports. Evaluation across...
