My verdict: it's significantly better than Gemini 3. It's at least as smart and has more polish to it. Alignment on little details is also significantly higher. Gemini 3 gets many things mixed up after a half-dozen messages, and becomes completely confused after compaction.
With Opus 4.5, it seems you don't need to ask multiple times or ORDER it to do work, it just gets stuff done — even beyond 50% of the token limit and after chat compaction! This kind of message is a thing...
These kinds of benchmarks are misleading without a joint metric showing how much work was necessary by humans after the fact. How much time does it take to clean up that 2h42m of code? Style and architecture need to make sense, not just the tests passing. That's...

People working on basic code who reset their Agent chats every 4-5 replies: I envy you. I have to do deep-context design work, and at about 100k tokens LLMs start to get lazy / confused. I resorted to giving them...

Gemini 3 review: it's fast, it's not dumb, but it's completely unusable in practice. It will get lost after a few edits and then completely trash the file: at best issuing patch commands that include line numbers, and at worst it will...

Language models perform poorly on high-school math? 🙄 You don't want to hear this, but the problems started in grade school. The moment we (collectively) found it acceptable that mid-tier models could score only 75%-85% on a GSM test set of 1.32k straightforward...
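To put numbers on why 75%-85% was never okay (my own back-of-the-envelope illustration, not a claim from the original post): per-step reliability compounds, so a model that gets individual steps right 85% of the time collapses quickly over any multi-step chain.

```python
# Illustrative sketch (assumed independence between steps, my numbers):
# if a model handles each step of a task correctly with probability p,
# an n-step chain completes without error with probability p**n.
def chain_success(p: float, n: int) -> float:
    return p ** n

for p in (0.85, 0.95, 0.99):
    print(f"p={p}: 10-step success = {chain_success(p, 10):.2%}")
# p=0.85: 10-step success = 19.69%
# p=0.95: 10-step success = 59.87%
# p=0.99: 10-step success = 90.44%
```

The independence assumption is crude, but the direction is the point: tolerating mid-80s accuracy on easy problems guarantees unreliability on anything longer.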
The speed of a faster coding model is worth it, but it seems mis-priced. C1 gobbles through files, reasons more, and expects extra feedback to reach a similar place as slower models do with less of everything. Intuitively it feels more expensive "the...
Great idea for a metric to further improve the datasets models train on. It likely leads to an answer that is not web-scale crawling... Less data is often better, and better data takes less.

This paper from Meta about "Soft Tokens" in RL is interesting; it lets LLMs invent their own non-discrete (recursive) representations to solve problems better... Results are mixed though: it's only a few percent better on GSM8k from pass@4...
The reason AI companies are rushing to release browsers: they don't want the responsibility / liability of scraping on their servers. They need to push that to the users! We'll be moving into an ever more gated internet soon...

Is this the only way to get coding agents to produce shippable quality code? https://t.co/bZvxMN6JEv
Without checking, what is the message behind the "Bitter Lesson", in your opinion? (a) all other things being equal, using more compute is better. (b) more compute is better than all the other things put together.
Even though this particular example worked out as you'd expect today, Open Source dynamics will certainly change. Accepting and merging contributions is always a risk and carries a high cost, so trusting an internal AI system for minor codebase improvements may become...
If you sub-optimally define any problem to require RL, when most can be solved with different approaches, then of course the RL hammer looks like the right solution! Rather than defending RL through semantics, it would be better to ask: how can...

It's hard to overstate how devastating this paper is, and not only for reinforcement learning. They spent $4M of compute to find out that RL on LLMs basically taps out at a 61% "asymptotic pass rate" (the exact rate depends on context), but they...