
This AI Paper From Stanford and Harvard Explains Why Most ‘Agentic AI’ Systems Feel Impressive in Demos and Then Completely Fall Apart in Real Use
Why It Matters
Understanding these adaptation pathways lets companies engineer more reliable, scalable AI assistants, reducing costly failures when moving from demos to production environments.
By Michal Sutter
Agentic AI systems sit on top of large language models and connect to tools, memory, and external environments. They already support scientific discovery, software development, and clinical research, yet they still struggle with unreliable tool use, weak long‑horizon planning, and poor generalization. The latest research paper “Adaptation of Agentic AI” from Stanford, Harvard, UC Berkeley, and Caltech proposes a unified view of how these systems should adapt and maps existing methods into a compact, mathematically defined framework.
How this research paper models an agentic AI system
The survey models an agentic AI system as a foundation‑model agent together with three key components:
- Planning module – decomposes goals into sequences of actions.
  - Static procedures: Chain‑of‑Thought, Tree‑of‑Thought.
  - Dynamic procedures: ReAct, Reflexion (revise the plan in response to feedback).
- Tool‑use module – connects the agent to web search engines, APIs, code‑execution environments, the Model Context Protocol, and browser automation.
- Memory module – stores short‑term context and long‑term knowledge, accessed through retrieval‑augmented generation.
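To make the structure concrete, here is a minimal sketch of how the three modules might fit together around a foundation model. The class names, the `llm` callable, and the keyword‑overlap retrieval are illustrative assumptions for this article, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Memory:
    """Long-term store accessed through a (here, trivially simple) retriever."""
    entries: List[str] = field(default_factory=list)

    def write(self, text: str) -> None:
        self.entries.append(text)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Naive keyword overlap stands in for a learned retrieval model.
        overlap = lambda e: len(set(query.split()) & set(e.split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]

@dataclass
class Agent:
    llm: Callable[[str], str]               # frozen or adaptable foundation model
    tools: Dict[str, Callable[[str], str]]  # e.g., web search, code execution
    memory: Memory

    def plan(self, goal: str) -> List[str]:
        """Planning module: decompose the goal into action strings (ReAct-style)."""
        context = "\n".join(self.memory.retrieve(goal))
        plan_text = self.llm(f"Context:\n{context}\nDecompose into steps: {goal}")
        return [step for step in plan_text.split("\n") if step.strip()]

    def act(self, step: str) -> str:
        """Tool-use module: route a step to a tool and record the outcome in memory."""
        tool_name = step.split(":", 1)[0].strip()
        tool = self.tools.get(tool_name, lambda query: self.llm(query))
        result = tool(step)
        self.memory.write(f"{step} -> {result}")
        return result
```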
Adaptation changes prompts or parameters for these components using:
- Supervised fine‑tuning.
- Preference‑based methods (e.g., Direct Preference Optimization).
- Reinforcement‑learning methods (e.g., Proximal Policy Optimization, Group Relative Policy Optimization).
- Parameter‑efficient techniques (e.g., Low‑Rank Adaptation).
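As one concrete example of the preference‑based route, the standard Direct Preference Optimization loss fits in a few lines of PyTorch. This is a generic sketch of the DPO objective rather than code from the paper; the tensor names are placeholders for per‑sequence log‑probabilities.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: prefer the chosen response over the rejected one,
    measured relative to a frozen reference model. All inputs are summed
    per-sequence log-probabilities of shape [batch]."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```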
Four adaptation paradigms
The framework defines four paradigms by combining two binary choices:
| Dimension | Options |
|-----------|---------|
| Target | Agent adaptation (A) vs. Tool adaptation (T) |
| Supervision signal | Tool execution (1) vs. Agent output (2) |
This yields:
- A1 – Tool‑Execution‑Signaled Agent Adaptation
- A2 – Agent‑Output‑Signaled Agent Adaptation
- T1 – Agent‑Agnostic Tool Adaptation
- T2 – Agent‑Supervised Tool Adaptation
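The 2×2 crossing can be written down directly; the tiny helper below is purely illustrative, encoding which component is trained and which signal supervises it.

```python
# Illustrative encoding of the paper's 2x2 taxonomy: (what is trained, what supervises it).
PARADIGMS = {
    ("agent", "tool_execution"): "A1 – Tool-Execution-Signaled Agent Adaptation",
    ("agent", "agent_output"):   "A2 – Agent-Output-Signaled Agent Adaptation",
    ("tool",  "tool_execution"): "T1 – Agent-Agnostic Tool Adaptation",
    ("tool",  "agent_output"):   "T2 – Agent-Supervised Tool Adaptation",
}

def classify(target: str, signal: str) -> str:
    """Map a (target, supervision signal) pair to its paradigm label."""
    return PARADIGMS[(target, signal)]

assert classify("tool", "agent_output").startswith("T2")
```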
A1 – Learning from verifiable tool feedback
Process: The agent receives input x, produces a structured tool call a, the tool returns result y, and the learning objective O_tool measures tool success (e.g., execution correctness, retrieval quality).
Methods:
- Supervised imitation of successful tool trajectories (e.g., Toolformer, ToolAlpaca, Gorilla).
- Reinforcement learning using verifiable tool outcomes as reward (e.g., DeepRetrieval).
DeepRetrieval frames query reformulation as a Markov decision process: the state is the user query, the action is a rewritten query, and the reward combines retrieval metrics (Recall, nDCG), format penalties, and SQL execution accuracy. Training uses KL‑regularized Proximal Policy Optimization, and the approach is applied to literature search, corpus QA, and text‑to‑SQL.
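A rough sketch of what such a verifiable, tool‑level reward might look like is shown below. The weights, the format check, and the metric choice are assumptions made for illustration, not DeepRetrieval's exact reward function.

```python
from typing import List, Optional, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant documents appearing in the top-k retrieved results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / max(len(relevant), 1)

def a1_reward(retrieved: List[str],
              relevant: Set[str],
              query_is_well_formed: bool,
              sql_executed_correctly: Optional[bool] = None) -> float:
    """Verifiable reward for a rewritten query (illustrative weights):
    retrieval quality, a format penalty, and optional SQL execution accuracy."""
    reward = recall_at_k(retrieved, relevant)
    if not query_is_well_formed:
        reward -= 0.5                       # format penalty
    if sql_executed_correctly is not None:  # text-to-SQL setting
        reward += 1.0 if sql_executed_correctly else 0.0
    return reward
```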
A2 – Learning from final agent outputs
The objective O_agent depends only on the final output o produced by the agent, even if tools are used internally. Purely supervising o is insufficient because the agent can ignore tools and still improve likelihood. Effective A2 systems therefore:
- Combine supervision on tool calls with supervision on final answers, or
- Assign sparse rewards (e.g., exact‑match accuracy) to o and propagate them through the full trajectory.
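A minimal sketch of the second option: a sparse exact‑match reward on the final output is propagated back through the whole trajectory, so the intermediate tool‑calling steps also receive credit. The trajectory representation and the discounting are assumptions for illustration.

```python
from typing import List, Tuple

def trajectory_returns(trajectory: List[Tuple[str, str]],
                       final_output: str,
                       gold_answer: str,
                       gamma: float = 1.0) -> List[float]:
    """Assign a sparse terminal reward (exact match on the final output) and
    propagate it back over every (action, observation) step, so tool calls
    that led to a correct answer are also reinforced."""
    terminal = 1.0 if final_output.strip() == gold_answer.strip() else 0.0
    returns, running = [], terminal
    for _ in reversed(trajectory):
        returns.append(running)   # return at this step
        running *= gamma          # discount toward earlier steps
    return list(reversed(returns))
```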
T1 – Agent‑agnostic tool training
The main agent is frozen; tools are optimized to be broadly reusable. Objective O_tool depends only on tool outputs (retrieval accuracy, ranking quality, simulation fidelity, downstream task success). A1‑trained search policies such as DeepRetrieval can later be reused as T1 tools inside new agentic systems without modifying the agent.
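The defining feature of T1 is that no agent output appears in the objective. The sketch below scores a retriever purely on retrieval metrics over labeled query/document pairs; the retriever interface and the recall@k metric are assumptions chosen for illustration.

```python
from typing import Callable, List, Set, Tuple

def t1_objective(retriever: Callable[[str], List[str]],
                 eval_set: List[Tuple[str, Set[str]]],
                 k: int = 10) -> float:
    """Agent-agnostic tool objective: average recall@k over (query, relevant-docs)
    pairs. Because no agent output enters this score, the trained tool can later
    be reused inside any agentic system without modification."""
    total = 0.0
    for query, relevant in eval_set:
        retrieved = retriever(query)[:k]
        total += sum(1 for doc in retrieved if doc in relevant) / max(len(relevant), 1)
    return total / max(len(eval_set), 1)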
T2 – Tools optimized under a frozen agent
A powerful but fixed agent A (often a closed‑source foundation model) receives tool results and produces the final output o. The optimization objective remains O_agent, but trainable parameters belong to the tool. Techniques include quality‑weighted training, target‑based training, and reinforcement‑learning variants that derive learning signals for the tool from the final agent outputs.
Memory is treated as a special case of T2: an external store accessed via learned read/write functions while the agent stays frozen. Recent T2 systems:
- s3 – trains a 7B‑parameter searcher that maximizes a “Gain‑Beyond‑RAG” reward defined by a frozen generator.
- AgentFlow – trains a planner to orchestrate mostly frozen Qwen 2.5‑based modules using Flow‑GRPO.
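The “Gain‑Beyond‑RAG” idea can be sketched as the improvement a trained searcher produces in a frozen generator's answer quality relative to a plain RAG baseline. The function names and the scoring metric below are illustrative assumptions, not s3's exact formulation.

```python
from typing import Callable, List

def gain_beyond_rag(question: str,
                    gold_answer: str,
                    trained_search: Callable[[str], List[str]],
                    baseline_rag: Callable[[str], List[str]],
                    frozen_generator: Callable[[str, List[str]], str],
                    score: Callable[[str, str], float]) -> float:
    """Reward for the trainable searcher: how much better the frozen generator
    answers when given the searcher's documents versus a fixed RAG baseline.
    Only the searcher is updated; the generator stays frozen throughout."""
    answer_with_searcher = frozen_generator(question, trained_search(question))
    answer_with_baseline = frozen_generator(question, baseline_rag(question))
    return score(answer_with_searcher, gold_answer) - score(answer_with_baseline, gold_answer)
```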
Key Takeaways
- The research defines a precise four‑paradigm framework for adapting agentic AI by crossing two dimensions: target (agent vs. tool) and supervision signal (tool execution vs. final agent output).
- A1 methods (Toolformer, ToolAlpaca, Gorilla, DeepRetrieval) adapt the agent directly from verifiable tool feedback, often using KL‑regularized PPO.
- A2 methods optimize the agent from signals on final outputs; they must still supervise tool calls or propagate sparse rewards, otherwise the agent may ignore tools.
- T1 and T2 shift learning to tools and memory. T1 trains generally useful retrievers, searchers, and simulators without a specific agent; T2 adapts tools under a frozen agent (e.g., s3, AgentFlow).
- The authors argue that practical systems will combine occasional A1/A2 updates on a strong base model with frequent T1/T2 adaptation of retrievers, search policies, simulators, and memory for robustness and scalability.