Scaffolding vs Reinforcement Finetuning for AI Forecasting

LessWrong, Apr 30, 2026

Key Takeaways

  • Finetuned bot beats baseline on numeric forecasts, loses on binary.
  • Training used 979 samples, cost $1,670, 12‑hour runtime.
  • Model trusted authoritative sources, misread meta‑level prediction cues.
  • Data mix (politics vs finance) influenced unexpected performance patterns.
  • Iterating scaffolding may yield higher ROI than costly finetuning.

Pulse Analysis

AI forecasting has become a strategic tool for investors, policymakers, and tech firms seeking probabilistic insight into future events. Recent work combines large language models with reinforcement finetuning (RFT) to teach models not only to predict outcomes but also to reason about the underlying evidence. Alongside model-level training, engineers often build scaffolding: multi-agent pipelines that gather data, generate forecasts, and aggregate the results. Combining the two promises higher accuracy, but it also introduces trade-offs between model complexity, data quality, and operational cost.
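To make the gather, forecast, aggregate structure concrete, here is a minimal sketch of a scaffold in Python. It is an illustration under stated assumptions, not the pipeline used in the post: `gather_research`, `clamp`, `run_scaffold`, the stub agents, and the choice of median aggregation are all hypothetical.

```python
from statistics import median
from typing import Callable, List

# Hypothetical type: a forecaster maps (question, research notes) to a probability.
Forecaster = Callable[[str, List[str]], float]


def gather_research(question: str) -> List[str]:
    """Placeholder research step; a real scaffold would call search or news tools."""
    return [f"note about: {question}"]


def clamp(p: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a probability away from 0 and 1 so one overconfident agent cannot dominate."""
    return max(lo, min(hi, p))


def run_scaffold(question: str, forecasters: List[Forecaster]) -> float:
    """Gather notes once, ask every forecaster, and aggregate with the median."""
    notes = gather_research(question)
    forecasts = [clamp(f(question, notes)) for f in forecasters]
    return median(forecasts)


if __name__ == "__main__":
    # Stub agents standing in for LLM calls (assumed values, for illustration only).
    agents = [lambda q, n: 0.30, lambda q, n: 0.45, lambda q, n: 1.20]
    print(run_scaffold("Will the event happen?", agents))
```

The median is used here only because it is robust to a single outlying agent; a mean or a weighted blend would be equally valid choices to test.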

In the Metaculus minibench-2025-09-29 tournament, a finetuned o4-mini bot paired with a three-team scaffold was pitted against a simpler baseline pipeline. The finetuned system achieved a +14.59 average score on numeric questions but fell to -0.70 on binary ones, dragging its overall performance below the baseline. The bot's tendency to trust authoritative sources, evident in its strong financial forecasts, worked against it on meta-level questions (those about how forecasters themselves would respond), where the existence of evidence did not guarantee that forecasters would react to it. The training set's composition (56.5% binary, 21.1% numeric) and topic mix (politics, finance, AI) also produced unexpected failure modes: the model was weakest on the question type it saw most often, which suggests that data balance alone does not guarantee generalized skill.

The cost analysis further tilts the balance toward iterating on the scaffold rather than finetuning. With a $1,670 spend and roughly 35 hours of engineering, the finetuned model delivered mixed results, suggesting that refining the scaffold (improving research retrieval, aggregation logic, and confidence calibration) could yield a higher return on investment. Tools such as historical back-testing APIs and experience databases may approximate some of the benefits of finetuning without retraining. For organizations considering AI-driven forecasting, the takeaway is to prioritize robust scaffolding, validate on diverse question types, and treat finetuning as a targeted, cost-justified enhancement rather than a blanket solution.
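The figures reported here support a quick back-of-the-envelope check. The sketch below uses only the numbers stated in this summary (979 samples, $1,670, roughly 35 engineering hours); the hourly rate is a hypothetical placeholder, not a figure from the post.

```python
# Figures taken from the summary above.
TRAINING_SAMPLES = 979
TRAINING_COST_USD = 1670.0
ENGINEERING_HOURS = 35.0  # "roughly 35 hours"


def cost_per_sample(cost_usd: float, samples: int) -> float:
    """Finetuning spend divided by the number of training samples."""
    return cost_usd / samples


def total_cost(compute_usd: float, eng_hours: float, hourly_rate_usd: float) -> float:
    """Compute spend plus engineering time valued at an assumed hourly rate."""
    return compute_usd + eng_hours * hourly_rate_usd


if __name__ == "__main__":
    per_sample = cost_per_sample(TRAINING_COST_USD, TRAINING_SAMPLES)
    print(f"Cost per training sample: ${per_sample:.2f}")

    # ASSUMPTION: 100 USD/hour is a made-up rate used purely for illustration.
    all_in = total_cost(TRAINING_COST_USD, ENGINEERING_HOURS, hourly_rate_usd=100.0)
    print(f"All-in cost at the assumed rate: ${all_in:,.2f}")
```

At about $1.71 per sample, the compute bill alone is modest; once engineering time is priced in, even at a conservative rate, labor can outweigh it, which is why the same hours spent on the scaffold are a natural comparison point.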
