AI Paper of the Day

Each day, I'll share my insights on an interesting paper in Computer Vision, NLP, or Multimodal AI.

Blog•Apr 8, 2026

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

The paper introduces Claw-Eval, an end‑to‑end suite that evaluates large language model agents by auditing every step of their execution rather than only the final output. It uses a three‑phase pipeline (Setup, Execution, Judge) and records actions through execution traces, server logs, and environment snapshots. Across 300 human‑verified tasks, the framework scores agents on completion, safety, and robustness, revealing that traditional benchmarks miss a large share of safety and robustness failures. Results from 14 frontier models show significant consistency drops under simulated errors and notable weaknesses in multimodal video tasks and dialogue questioning.

By AI Paper of the Day

Blog•Apr 5, 2026

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

The paper introduces SKILL0, a framework that trains large language model agents to internalize specialized skills directly into their parameters, removing the need for runtime skill retrieval. Using an in‑context reinforcement learning curriculum, explicit skill descriptions are gradually withdrawn as...

Blog•Mar 19, 2026

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

MetaClaw presents a continual‑learning framework for large language model agents that combines instant, text‑based skill injection with scheduled weight updates, eliminating service downtime. The fast loop creates concise behavioral rules from user failures and injects them directly into the prompt....

Blog•Mar 12, 2026

ReMix: Reinforcement Routing for Mixtures of LoRAs in LLM Finetuning

The paper introduces ReMix, a reinforcement‑learning based routing strategy for Mixture‑of‑LoRAs that eliminates the common “routing weight collapse” where a single adapter dominates. By assigning constant, equal weights to all activated adapters and training the router as a policy, ReMix...

Blog•Mar 10, 2026

Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

The paper introduces CONSTORY‑CHECKER, an automated pipeline, and ConStory‑Bench, a 2,000‑prompt benchmark, to evaluate narrative consistency in long‑form story generation by LLMs. The four‑stage system extracts suspect spans, pairs conflicting statements, generates evidence chains, and produces anchored reports. Evaluation across...
