The results show meaningful progress toward task-level parity, signaling potential productivity gains and disruption in specific white-collar workflows. But they also underscore that current models are not yet poised to automate entire occupations wholesale, which affects how firms, regulators and investors should plan for AI adoption.
OpenAI published a study comparing frontier language models to industry experts on realistic, digitally oriented tasks and found that some models are approaching expert deliverable quality. Anthropic’s Claude Opus 4.1 outperformed OpenAI’s models and in many cases came close to human experts, while performance varied significantly by file type and sector (PDF, PowerPoint and Excel tasks fared best). The study also found that sufficiently capable models—exemplified by GPT-5—can speed up expert workflows, but that weaker models do not deliver review-time savings. Crucially, the paper covered only predominantly digital tasks from high-GDP sectors and excluded many non-digital or peripheral duties, tempering claims of near-term job automation.