Benchmarking large language models remains a nuanced challenge, as highlighted by two leading tests: MMLU, a 14,000‑question multiple‑choice exam covering fields from medicine to philosophy, and SWE‑bench, which tasks models with fixing authentic GitHub issues. The post examines how these benchmarks expose pitfalls such as Goodhart’s Law, data contamination, and the use of canary strings to detect leakage. It argues that high scores do not automatically translate into genuine intelligence or utility. Understanding these dynamics is essential for evaluating AI progress.
Anthropic’s new safety paper reframes AI misalignment as a statistical bias‑variance problem rather than a classic paper‑clip maximizer scenario. The research shows that as model intelligence and task complexity rise, both systematic bias and stochastic variance increase, heightening alignment risk....
The "Bitter Lesson" argues that raw scale—more data, compute, and larger models—consistently outperforms clever, hand‑crafted algorithms. Historically, breakthroughs from Deep Blue to AlexNet illustrate this pattern, and modern large language models reinforce it. AI developers spend months fine‑tuning prompts only to...
ChatGPT’s ability to follow instructions stems from a decade‑long research trajectory that began with reinforcement learning from human preferences. Early work such as Christiano et al. (2017) taught agents to play Atari and walk robots, laying the foundation for preference‑based...