When the Ruler Is Made of the Thing It Measures: Multi-Model Evidence on AI Occupational Exposure Scores

CEPR — VoxEU•May 11, 2026

Companies Mentioned

Google

GOOG

Anthropic

PwC

Goldman Sachs

Why It Matters

The finding shows that policy decisions and academic conclusions about AI’s labor impact can hinge on the choice of model, making single‑model analyses potentially misleading.

Key Takeaways

•High‑exposure share of occupations ranges from 2.7% (Gemini) to 51.5% (Claude)
•Claude rates >80% of management jobs high‑exposure; Gemini rates <20%
•Regression estimates of AI impact flip sign across models
•Multi‑model checks recommended for robust AI‑labour research

Pulse Analysis

The surge of generative AI has prompted economists to quantify how many occupations are vulnerable to automation. Standard practice involves prompting a large language model to rate each job’s task set, producing an "exposure score" that feeds into policy briefs, corporate forecasts, and academic papers. This methodology, adopted by Goldman Sachs, the IMF, the ILO, and PwC, treats the AI model as a neutral ruler measuring a fixed labor reality. Yin et al.'s 2026 study challenges that premise by running the same scoring pipeline across four leading models (Gemini 2.5, Claude 4.5, GPT‑4, and ChatGPT‑5), revealing a nineteen‑fold spread in the share of occupations rated high‑exposure, from 2.7% under Gemini to 51.5% under Claude.
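The nineteen‑fold figure follows directly from the two reported shares. The sketch below reproduces that arithmetic and shows one plausible way a high‑exposure share could be computed from per‑occupation scores. The function names, the 0.5 cutoff, and the toy scores are assumptions made for illustration; the summary does not describe the study's actual threshold or scoring code.

```python
def high_exposure_share(scores, threshold):
    """Return the fraction of occupations scored at or above `threshold`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(score >= threshold for score in scores) / len(scores)


def fold_spread(shares):
    """Return the ratio of the largest to the smallest share across models."""
    lowest, highest = min(shares.values()), max(shares.values())
    if lowest <= 0:
        raise ValueError("fold spread is undefined when a share is zero or negative")
    return highest / lowest


# Shares reported in the summary (only the two extremes are given).
reported_shares = {"Gemini": 0.027, "Claude": 0.515}
spread = fold_spread(reported_shares)
print(f"Fold spread: {spread:.1f}x")  # prints "Fold spread: 19.1x"

# Hypothetical scores for four occupations and a hypothetical 0.5 cutoff.
toy_scores = [0.1, 0.6, 0.8, 0.3]
toy_share = high_exposure_share(toy_scores, threshold=0.5)
```

With the reported shares, 51.5 / 2.7 is roughly 19.1, which matches the nineteen‑fold spread quoted above.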

The disparity stems from each model’s unique training corpus, calibration, and reinforcement signals, which bias how tasks are interpreted as automatable. Because the most rapidly advancing tasks also generate the bulk of new training data, the measurement instrument evolves alongside the phenomenon it aims to capture, violating classical measurement‑error assumptions. Consequently, downstream analyses—such as difference‑in‑differences regressions linking exposure scores to employment trends—produce point estimates that not only lose statistical significance but even reverse direction depending on the model used. This instability threatens the credibility of policy recommendations that hinge on identifying at‑risk occupations, from workforce reskilling programs to regional economic planning.

The authors propose a pragmatic remedy: report findings from at least two or three frontier models. Convergent results signal robustness, while divergence highlights model‑driven uncertainty that warrants caution. This multi‑model protocol is inexpensive—roughly the cost of running analyses on three cloud providers—and has broader relevance wherever AI outputs drive consequential decisions, such as credit scoring or hiring. Embedding such checks now can prevent costly missteps later and restore confidence in AI‑augmented economic research.
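As a rough illustration of the multi‑model check, the sketch below compares the direction of a regression coefficient estimated separately with each model's exposure scores. Everything here is an assumption made for illustration: the function names, the 1.96 significance cutoff, and the coefficient values are hypothetical and are not taken from the study.

```python
def direction(coef, std_err, z=1.96):
    """Return +1 or -1 for a significant estimate, 0 if it is not significant."""
    if std_err <= 0:
        raise ValueError("standard error must be positive")
    if abs(coef) / std_err < z:
        return 0
    return 1 if coef > 0 else -1


def verdict(estimates, z=1.96):
    """Label a set of per-model (coef, std_err) estimates.

    "convergent" means every model yields the same direction, including the
    case where no model finds a significant effect; anything else is
    "divergent".
    """
    directions = {direction(coef, se, z) for coef, se in estimates.values()}
    return "convergent" if len(directions) == 1 else "divergent"


# Hypothetical estimates keyed by placeholder model names.
agreeing = {"model_a": (-0.8, 0.2), "model_b": (-0.5, 0.1)}
flipping = {"model_a": (-0.8, 0.2), "model_b": (0.6, 0.2)}

print(verdict(agreeing))  # prints "convergent"
print(verdict(flipping))  # prints "divergent"
```

A divergent verdict does not say which model is right; it only flags that the conclusion depends on the choice of scoring model and should be reported with that caveat.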
