KDD2026 - Retain to Refine: Adaptive Online Question Answering via Query Routing and Long-Short Memory
Why It Matters
By dynamically allocating compute and grounding reasoning, the approach cuts latency for routine queries while improving answer quality for complex ones, directly boosting user satisfaction and operational efficiency in production search services.
Key Takeaways
- Adaptive routing splits queries by difficulty for efficiency
- Lightweight critic predicts difficulty, sending simple queries to fast path
- Long‑short memory stores verified facts and transient gaps
- Supervised retrospection validates retrieval, preventing query drift across iterations
- Real‑world A/B tests show higher engagement and lower latency
Summary
The paper introduces “Retain to Refine,” an adaptive online question‑answering framework that tailors its reasoning pipeline to each query’s difficulty.
A lightweight critic agent first predicts whether a query is simple or complex. Simple factoid questions are answered directly, while hard multihop queries are handed to a memory‑augmented agent that employs a novel long‑short memory architecture: long memory stores verified facts for grounding, short memory captures transient gaps to focus retrieval.
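The routing step and the two memory stores can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `LongShortMemory`, `route`, `toy_critic`, and the 0.5 threshold are assumptions made for illustration, and the word-count heuristic merely stands in for the paper's learned critic, whose actual design is not described in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LongShortMemory:
    """Hypothetical container: long memory keeps verified facts,
    short memory keeps the gaps that still need retrieval."""
    long: List[str] = field(default_factory=list)
    short: List[str] = field(default_factory=list)

    def retain(self, fact: str) -> None:
        # A verified fact is kept for grounding later reasoning steps.
        if fact not in self.long:
            self.long.append(fact)

    def note_gap(self, gap: str) -> None:
        # A transient gap tells the agent what to retrieve next.
        if gap not in self.short:
            self.short.append(gap)

    def resolve_gap(self, gap: str) -> None:
        # Once a gap is filled it is dropped from short memory.
        if gap in self.short:
            self.short.remove(gap)


def toy_critic(query: str) -> float:
    # Stand-in heuristic (an assumption, not the paper's critic):
    # longer queries with linking words are scored as harder.
    words = query.lower().split()
    links = sum(w in {"and", "then", "whose", "which", "before", "after"} for w in words)
    return min(1.0, 0.05 * len(words) + 0.3 * links)


def route(query: str, critic: Callable[[str], float], threshold: float = 0.5) -> str:
    """Send easy queries to the fast path and hard ones to the memory agent."""
    return "memory_agent" if critic(query) >= threshold else "fast"
```

In this sketch a short factoid such as "capital of France" scores below the threshold and takes the fast path, while a longer question chaining several relations goes to the memory-augmented agent. The threshold is the knob that trades latency against answer quality.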
The system also incorporates supervised retrospection, explicitly checking retrieval quality at each iteration to curb query drift. Offline benchmarks show a superior accuracy‑latency trade‑off versus uniform pipelines, and the memory‑augmented agent beats strong multihop baselines.
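One way to picture the per-iteration retrieval check is the loop below. It is a sketch under stated assumptions rather than the paper's procedure: `retrospective_loop`, `retrieve`, and `is_relevant` are hypothetical names, the verifier is passed in as a plain callable, and anchoring every follow-up query to the original question is just one simple way to limit drift.

```python
from typing import Callable, List, Tuple


def retrospective_loop(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_relevant: Callable[[str, str], bool],
    max_iters: int = 3,
) -> Tuple[List[str], List[str]]:
    """Return (retained passages, queries issued).

    Each iteration's passages are checked against the ORIGINAL question,
    and only passages that pass the check are retained.
    """
    retained: List[str] = []
    issued: List[str] = []
    query = question
    for _ in range(max_iters):
        issued.append(query)
        passages = retrieve(query)
        # Retrospection step: keep only new passages the verifier accepts.
        accepted = [
            p for p in passages
            if is_relevant(question, p) and p not in retained
        ]
        if not accepted:
            # Nothing new passed the check, so stop instead of drifting.
            break
        retained.extend(accepted)
        # The next query always starts from the original question,
        # extended with the most recently verified passage.
        query = question + " " + accepted[-1]
    return retained, issued
```

Because rejected passages never feed into the next query, an off-topic retrieval result cannot steer later iterations away from what the user asked.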
In a large‑scale A/B test on BU’s search platform, “Retain to Refine” delivered statistically significant lifts in user engagement and reduced response times, demonstrating that adaptive routing and long‑short memory can make real‑world QA services both faster and more reliable.