
Avoid Pop‑Culture Model Names, Choose Professional Codenames
I know these are all unreliable leaks of internal code names but please, please AI labs, the only thing worse than calling your models GPT-5.5-xhigh-Codex-nano is giving them names like Agent Smith or Mythos, for obvious reasons. https://t.co/lYKDtd0MAp
AI Compute Shortage Looms After Last Year's Data Center Overbuild
Last year everyone spoke about overbuilding of AI data centers; this year will likely start to demonstrate that there is not nearly enough compute to meet demand. I think the degree to which AI is currently subsidized depends on...
ARC‑AGI‑3 Is a Distinct Test with Unique Metrics
It helps to think of ARC-AGI-3 as a different test entirely than the previous ARC-AGIs. It measures different things (though, as in the previous tests, precisely what it measures isn’t clear) and has different rules. That doesn’t mean it isn’t good,...

AI Rewrote Canon Webcam App in Rust, Fixing Crashes
Great little story from @danshapiro about how he asked a coding agent to fix the official webcam software from Canon that kept crashing. He woke up to a new, fully functional Rust webcam app that has worked ever since. ...

Open‑office Noise Hurts Analysis and Spikes Software Bugs
The background noise in open offices, in multiple experiments, decreases the ability to do analytical work, increases bugs in software, and decreases the ability to find bugs. Open floor plans are just a bad idea for any solo work. https://t.co/Y2KX5HjIEd
Small, Niche AI Models Crumble on Out-of-Distribution Challenges
Small AI models and specialized vertical AI models are very brittle. Any unusual situation or out-of-distribution issue and they break down. You also won’t get emergent leaps or good problem solving. They still have uses, but benchmarks don’t do a good...
Reliability Is a System Issue, Not Just Agent Design
A big lesson from high reliability organizations that AI agent builders need to learn is that reliability is a property of systems. Current agentic tools are weaker than the agents: they are bad at agent-agent handoffs, escalation, and knowing when to call in humans....

AGI Alone Won’t Drive Transformation, Says Cowen
Maybe we should retroactively all just agree with @tylercowen that o3 was AGI so we can stop arguing about it. (Also, doing so will drive home the lesson that AGI alone is not enough for transformation) https://t.co/epKrL6nc4b
AI Models Still Lag on Simple Tasks
It's annoying that my tireless team of little computer people made out of statistical models that predict words based on the corpus of all human language & thus are reasonable approximations of a compression of the knowledge of humanity take...

AI Outperforms Humans, Larger Models Boost Creativity
Interesting finding in this paper showing that, for product development ideas, AIs consistently rank above humans (well, humans on Prolific) & larger and more recent models are more creative than previous ones. (It also tries a creativity intervention that doesn’t...

Codex vs Claude: Functional References vs Problem‑Solving Skills
Very different philosophies for skills in Codex versus Claude Code. OpenAI seems to conceive of skills functionally, mostly matter-of-fact technical references for Codex. Claude skills are more about giving the AI approaches to problems. See the difference in skill creator skills https://t.co/Gq3hcf1In0

Newer LLMs Fix Bias Against Less‑Educated Users
Anthropic showed older (2022) LLMs will give you less accurate answers if you seem less educated to the AI, but this issue has, as far as I know, been addressed in more recent models. https://t.co/iXnlBpYXLA https://t.co/ByclHjwywR

AI Coaching Boosts Measurable Empathy Communication Skills
AI can help us learn hard-to-teach skills, like empathy. Preregistered study of 968 people found almost no correlation between feeling empathic & communicating empathy. But a single practice session with an AI coach made people measurably better at it https://t.co/VDtdiNpw1J...

AI Now Builds Playable Game Mods Autonomously
This was kind of fun. Codex: "download nethack, add new items that would make the game easy to win and make me feel powerful." It did & successfully gave me a new .exe file, navigating various issues to do so,...

Civ V Players Excel at Business Planning Skills
It's not SimCity, but business school students who were good at Civ V also turn out to be better planners, organizers, and problem-solvers in this small experiment. https://t.co/WGbAboe8kx
Humans, Not AI, Should Choose Project Credit
I don’t think AIs should be auto-adding themselves as credited on projects on Github or elsewhere. It primarily serves as a marketing tool to promote the product, but undermines the much more critical aspect that humans should be able to choose...

AI Learns Scholarly Taste, Predicts Hit Papers
Evidence that AI models can, indeed, learn "taste" in this paper, where a small model, trained on citations, is able to predict which papers will be hits. Citations, upvotes & shares are signals that can teach AI judgment about quality, not...

Boost AI Idea Diversity with Random Start/End Priming
This is a cool, practical technique for increasing AI idea diversity by adding random priming phrases & bits of end words. Similar prompts produce similar ideas, but since LLMs attend more to the start & end of inputs, this approach pushes...
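The technique can be sketched in a few lines: wrap the same base prompt with random words at the start and end before sending each request. This is a minimal sketch, not the paper's actual method; `primed_prompts` and both word pools are hypothetical names invented here for illustration.

```python
import random

# Hypothetical word pools -- the actual priming lists from the technique
# are not shown in the post, so these are placeholder examples.
START_PRIMES = ["harvest", "lantern", "orbit", "velvet", "compass"]
END_PRIMES = ["granite", "whisper", "tide", "ember", "prism"]

def primed_prompts(base_prompt: str, n: int, seed=None) -> list[str]:
    """Wrap one base prompt with random start/end words so otherwise-identical
    requests differ exactly where LLMs attend most: the beginning and end."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(n):
        start = rng.choice(START_PRIMES)
        end = rng.choice(END_PRIMES)
        prompts.append(
            f"Priming word: {start}\n{base_prompt}\n(Ignore this word: {end})"
        )
    return prompts
```

Each of the `n` prompts would then be sent as a separate request; the core idea is varying the edges of the input while the task in the middle stays fixed.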
Alibaba and Xiaomi Shift Away From Open-Weight Models
This is looking like a good prediction. Alibaba’s Qwen and Xiaomi both seem to be steering away from open weights in the 2 weeks since this post.
Xiaomi's MiMo-V2-Pro Lacks Open Weights, Reflects Trend
This turned out to be Xiaomi's MiMo-V2-Pro, and it is fine but not at the frontier. Most interestingly, it is not open weights? This seems to be a trend in frontier Chinese models.
Big Three Labs Stuck Refining, Missing Future AI UX
There is some danger for the Big Three labs that they have run out of imagination and are now refining Codex/Claude Code/Antigravity, and building their next tools (Cowork, etc) to be similar. These were good UX for AI's use & limits...
AI Fiction Lacks Purpose, Readers Overinterpret Its Flaws
My experience so far with LLM fiction writing is that it takes advantage of our assumption that an author is writing things for a reason, so we are charitable to a book's quirks & do mental work to assign them...
LLM Fiction Feels Generic, Overly Metaphoric, Lacks Depth
I read a few dozen pages of this and it is not bad for LLM fiction, but also very very LLM-y, from the themes to the fact that there are lots of staccato conversations and meaningful silences and overwrought metaphors...
Calling New Currency “Tokens” Shows Lack of Imagination
It’s lame that the word we chose for an important new form of currency was “tokens,” real failure of imagination.

Recursive AI Summaries Spiral Into Meaningless Content
I had Codex build the content accordion from the cautionary tale "don't build the content accordion" inspired by X's new feature which uses AI to summarize X articles written with AI. It takes an X article, summarizes it in a tweet,...
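The loop described here can be sketched in a few lines. The `summarize`/`expand` functions below are toy stand-ins (a real accordion would call an LLM for both steps), and all names are invented for illustration:

```python
def accordion(article: str, rounds: int, summarize, expand) -> list[str]:
    """Run the compress->re-expand loop: shrink an article to a tweet-sized
    summary, inflate it back into an article, and repeat. Returns every
    intermediate text so you can watch the meaning drain away."""
    trace = [article]
    for _ in range(rounds):
        tweet = summarize(article)   # article -> tweet-length summary
        article = expand(tweet)      # tweet -> new full-length article
        trace.extend([tweet, article])
    return trace

# Toy stand-ins for demonstration only:
toy_summarize = lambda text: text.split(".")[0] + "."                # keep first sentence
toy_expand = lambda tweet: tweet + " This matters. It really does."  # pad it back out
```

With even these toy functions, the trace converges to the same thin summary after one round, which is the cautionary tale in miniature.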
AI-Generated Articles Dilute Tweets Into Endless Readings
This is genuinely a funny development for articles, given how many of them are obviously AI expansions of a single idea that could be expressed in a tweet. (There are some excellent articles as well, but now people won’t read those...
AI's Jagged Frontier Shows Humans Still Essential
We are back to the phase of the AI news cycle where people are underestimating how jagged the AI ability frontier is, as well as how much they still depend on expert human decision-making or guidance at key points in...

Heygen API Shows Dual Audience Writing, Needs Creative AI
Heygen’s API documentation is a glimpse of how to write for your two audiences: humans and agents. (Though I think their llms.txt file could do a lot more to get AIs “excited” to use their product in creative ways by explaining...
Claude Dispatch Matches OpenClaw, Offers Safer Data Handling
After using it a bit, Claude Cowork Dispatch covers 90% of what I was trying to use OpenClaw for, but feels far less likely to upload my entire drive to a malware site.
Demand for GPT‑5.4 Pro Knowledge‑Work Platform Grows
A knowledge-work platform built around GPT-5.4 Pro level intelligence would be really useful. The gap between other models and what Pro can do on complex intellectual work remains stark. I would love to have access in a Codex-like platform with...
Early Adopters Warn: AI Capabilities Aren’t Yet Stable
One of the advantages of being an early user of LLMs is that I have seen The Curve with my own eyes (like in this post before ChatGPT or the term Generative AI). I notice recent AI users & companies adopting...

GPT‑4o Tutor Boosts Scores Equivalent to 6‑9 Months Schooling
AI really can help education: Randomized controlled experiment on high school students found a GPT-4o powered tutor that personalized problems for students raised final test scores by .15 SD, "equivalent to as much as six to nine months of additional...
AI Labs Overlook Managers: Tools Lag Behind Coding Focus
I get why AI labs are so focused on software development (it helps them get recursive improvement, and also they are coders so they think coding is the most vital thing), but there are 9.5x more managers than there are...
Embrace AI's Weirdness, Don't Force Standard IT Molds
Axiom: The form of AI that we ended up with is deeply weird in ways that we don't fully get. Attempts to pretend AI is less weird & apply it like a standard IT product will inevitably result in less...
AI Success Hinges on Organizational Redesign, Not Engineers
I am not sure "Forward Deployed AI Engineers" are going to deliver on what a lot of companies are hoping for. They are useful, yes, but AI applications are far less of a technical issue, and much more about rethinking...

ChatGPT Builds Functional Excel Strategy Game, Outshining Claude and Copilot
Hey Excel agents from Claude, OpenAI & MS Copilot: "make me a working strategy game in excel, it should have some form of graphics." Claude made a board and acted as game master, Copilot created a board but no game, ChatGPT...
Google, OpenAI, Anthropic Poised for Recursive AI Breakthrough
The failures of both Meta and xAI to maintain parity with the frontier labs, along with the fact that the Chinese open weights models continue to lag by months, means that recursive AI self-improvement, if it happens, will likely be...

GPT‑4 Debuted Early as Bing’s Erratic “Sydney”
It's the third anniversary of the launch of GPT-4, but its first known contact with the public was months earlier, when Bing/"Sydney," powered by GPT-4, was the subject of a complaint in India. Worth reading. Early Sydney was famously insane. "It is...
AI VC Bets Risk Opposing Anthropic, OpenAI, Gemini Visions
VC investments typically take 5-8 years to exit. That means almost every AI VC investment right now is essentially a bet against the vision Anthropic, OpenAI, and Gemini have laid out.
AI Race Visualized: OpenAI Leads, Others Lag
I think this is a good way to visualize the AI race using the long-lived GPQA Diamond benchmark. You can see how long OpenAI had the field to itself, the rise (and collapse) of Meta, the sudden catch-up (and then stagnation)...
Prefer Concise, Author-Written AI Summaries over Influencer Narratives
Increasingly, I only trust posts summarizing AI papers that either (a) fit in the original Twitter character limit or (b) are written by the study's authors. The long narrative influencer posts written by Claude always have big errors, ask a...
Midjourney Remains Unmatched Despite Rivals' Precision
Even though other AI image generators are much better at accuracy and precision and instruction following and text, there really isn't a substitute for Midjourney.

Data Centers Use Little Water Nationwide, Yet Strain Local Supplies
Paper on data center water use makes two points: 1) National data center water use in 2030 will remain “modest” compared to total public water supply (1.8%–3.7%) or agriculture (0.6%–1.2%). 2) For some localities, serving peak demand could be a big deal...
Improve Human‑AI Collaboration or Risk Being Overwhelmed
More evidence that we have to figure out how to improve the way humans and AIs work together, or we humans will end up overwhelmed.
AI's Exponential Rise Signals a Coding‑Free Future
I wrote about the exponential improvement path of AI, the early signs of massive transformations in the nature of work (including software companies where nobody codes any more), and how one week in February is an omen of our future...
GPT‑5.4 Benchmark Scores Will Split Audiences Into Freak‑Out Camps
No matter what GPT-5.4 scores on the METR long task horizon benchmark, there will be a group of people who will absolutely freak out. The score determines which group.

LLMs Achieve Rapid Logistic Gains on New Benchmark
Exponential improvements* everywhere for those with the eyes to see them. This is a cool benchmark, and was impossible for early non-reasoner LLMs to do at all.
* Okay, technically "logistic improvement" because the maximum score is bounded at 100 (and...

Hunter Alpha on OpenRouter: Just Okay So Far
So far, Hunter Alpha, the new mystery model on OpenRouter is only okay. Some examples of the Lem Test and the Sparks TiKZ unicorn. https://t.co/iXLx1jnoRe
AI Chat Agents Are Transitional; We Need New UX Systems
Talking to agents in Slack, the new hot AI UX, will end up being just as much a transitional phase as talking to agents via chatbot websites. We need new systems to manage agentic work that also support new ways...
OpenClaw Overextends AI Anthropomorphism, Blurs Human Interaction
I’ve been in favor of functional anthropomorphism using AI (they work best if you treat working with AI like working with a person), but I am starting to wonder if OpenClaw takes it too far by basically forcing you to...