When the Ruler Is Made of the Thing It Measures: Multi-Model Evidence on AI Occupational Exposure Scores
Why It Matters
Policy decisions and academic conclusions about AI’s labor impact can hinge on which model generates the exposure scores, so single‑model analyses are potentially misleading.
Key Takeaways
- High‑exposure share ranges from 2.7% (Gemini) to 51.5% (Claude)
- Claude rates >80% of management jobs high‑exposure; Gemini rates <20%
- Regression estimates of AI impact flip sign across models
- Multi‑model checks recommended for robust AI‑labour research
Pulse Analysis
The surge of generative AI has prompted economists to quantify how many occupations are vulnerable to automation. Standard practice involves prompting a large language model to rate each job’s task set, producing an "exposure score" that feeds into policy briefs, corporate forecasts, and academic papers. This methodology, adopted by Goldman Sachs, the IMF, the ILO, and PwC, treats the AI model as a neutral ruler measuring a fixed labor reality. Yin et al.'s 2026 study challenges that premise by running the same scoring pipeline across four leading models—Gemini 2.5, Claude 4.5, GPT‑4, and ChatGPT‑5—revealing a nineteen‑fold spread in high‑exposure estimates.
The disparity stems from each model’s unique training corpus, calibration, and reinforcement signals, which bias how tasks are interpreted as automatable. Because the most rapidly advancing tasks also generate the bulk of new training data, the measurement instrument evolves alongside the phenomenon it aims to capture, violating classical measurement‑error assumptions. Consequently, downstream analyses—such as difference‑in‑differences regressions linking exposure scores to employment trends—produce point estimates that not only lose statistical significance but even reverse direction depending on the model used. This instability threatens the credibility of policy recommendations that hinge on identifying at‑risk occupations, from workforce reskilling programs to regional economic planning.
The authors propose a pragmatic remedy: report findings from two or three frontier models rather than one. Convergent results signal robustness, while divergence highlights model‑driven uncertainty that warrants caution. This multi‑model protocol is inexpensive (roughly the cost of running analyses on three cloud providers) and has broader relevance wherever AI outputs drive consequential decisions, such as credit scoring or hiring. Embedding such checks now can prevent costly missteps later and restore confidence in AI‑augmented economic research.
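As a rough illustration of what such a multi‑model check might look like, the Python sketch below computes the fold spread between the two high‑exposure shares quoted above and tests whether a set of per‑model regression coefficients share a sign. The function names are assumptions made for this example, and the coefficient values are hypothetical placeholders, not results from the study.

```python
from typing import Dict


def fold_spread(shares: Dict[str, float]) -> float:
    """Ratio of the largest to the smallest high-exposure share across models."""
    lo, hi = min(shares.values()), max(shares.values())
    if lo <= 0:
        raise ValueError("shares must be positive to compute a ratio")
    return hi / lo


def signs_agree(coefs: Dict[str, float]) -> bool:
    """True when every model's coefficient has the same nonzero sign."""
    signs = {(c > 0) - (c < 0) for c in coefs.values()}
    return len(signs) == 1 and 0 not in signs


# Shares quoted in the summary (percent of occupations rated high-exposure).
reported_shares = {"Gemini": 2.7, "Claude": 51.5}

# HYPOTHETICAL coefficients, for illustration only; not taken from the study.
example_coefs = {"model_a": -0.8, "model_b": 0.3, "model_c": -0.1}

if __name__ == "__main__":
    print(f"fold spread: {fold_spread(reported_shares):.1f}x")
    print("signs agree:", signs_agree(example_coefs))
```

With the two quoted shares, the ratio is 51.5 / 2.7, or about 19, which matches the nineteen‑fold spread. A sign disagreement like the one in the placeholder coefficients is the kind of divergence that would call for caution before drawing conclusions.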