
ChatGPT Leads in Transparent Thinking Traces, Gemini Lags
Currently, ChatGPT has the best way of viewing thinking traces: a short summary of steps in the main window, and a detailed audit in the sidebar if you want it. Claude does almost as well, but its traces are more summarized and make calculations and code harder to see. It's a big weak spot for Gemini. https://t.co/fx9nZNAGaC
Choosing AI Prompt Files Is a Transient Debate
It is notable that we are all debating exactly which markdown files are most important to feed AI (skills, memory, tool instructions) and in which order to feed them to get the best output. Feels that this is likely a...
Muse Spark Surpasses Expectations After Llama 4 Hiatus
I think Muse Spark came in far better than most were expecting as the first new model attempt from Meta, especially given the fact that it has been a year since Llama 4 with no models at all (and that...

AI‑generated Fact Checks Win Broader, Less Biased Approval
Neat experiment finds AI fact checks are rated as more helpful & less ideological than human ones: "LLM-generated Community Notes can achieve broader cross-ideological acceptance than human-written notes, receiving more positive ratings from raters across the political spectrum" https://t.co/Ofg1kYNxYe

Game Studios Show Mixed Success Adapting to AI
Our Lab just posted a new research report from Zimran Ahmed about how the game industry is adapting to AI. He spoke to people at 20 different studios and found a wide range of approaches to adapt (or failures to...
AI Reveals Raphael’s Hidden Plato‑Aristotle Tension
AI finally lets us see Raphael's The School of Athens the way Raphael obviously intended it, illustrating the delicate dance and subtle conflicts between Plato and Aristotle. (Seedance 2.0 is very fun to play with) https://t.co/YD7vVaRkFt
AI Shipping Speed Outpaces Market Absorption
The pace at which useful things are shipping also seems to be accelerating. Model releases are coming faster, of course, but so are significant application and enterprise products (especially from Anthropic). Almost certainly faster than the market can track or...
AI’s Jagged Weaknesses Outpace Human Quirks
Things that make the jagged intelligence of AI harder to deal with than the jaggedness of humans: 1) Weaknesses are not always intuitive or identifiable in advance 2) All LLMs have similar weaknesses, so you can't just hire a different one 3) Jagged...

AI Makes Rapid, Creative UI Experiments Easy
One fun thing about AI is that it lets you play with interfaces and approaches to displaying information in new ways without a lot of effort. I got an internet-connected e-ink display and set it up to show me...
US Closed‑Source Models Lead Frontier AI; China Trails
So we now have a pretty good picture of the state of the frontier AI model makers. US closed source models continue to lead. Google, OpenAI, and Anthropic stand well ahead of the pack, and may have signs of recursive self-improvement....
Amazon Nova 2 Still Lags Behind Sonnet 4.5
So what's the deal with Amazon Nova? They released Nova 2 in December, and even then, the top flight Nova 2 model trailed Sonnet 4.5. And it still hasn't left preview.
Meta's Muse Spark Falls Short of Leading AIs
After playing with it a bit, Meta's Muse Spark Thinking is fine so far, but really doesn't match the current Big Three models. It also is a bit... weird. Like some strange language & tone, a little loose with facts,...
Apply Organizational Structures to Tame LLM Hallucinations
Hallucinations remain in LLMs, but note that over centuries we have developed complicated, successful machines that take uncertain output from unreliable sources & reduce the risk of errors. We call those machines organizational structures & we can apply similar approaches to...
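One way to make the organizational analogy concrete is redundancy with a majority vote. The sketch below is an illustration only and does not come from the post: it assumes each checker errs independently with the same probability (an idealization that real reviewers, and repeated calls to similar LLMs, often violate), and `majority_error` is a hypothetical helper name.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that a strict majority of n independent checkers is wrong.

    Assumes every checker errs independently with probability p (an
    idealization) and that n is odd so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of checkers to avoid ties")
    need = n // 2 + 1  # smallest number of wrong checkers that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Compare a single checker with panels of three and five.
for n in (1, 3, 5):
    print(n, round(majority_error(0.1, n), 5))
```

Under these assumptions, a 10% individual error rate falls to 2.8% with three checkers and to about 0.86% with five. If the checkers' errors are correlated, the gain will be smaller.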
Most CISOs Ignore Mythos Alerts; Threats Arrive Within Nine Months
Curious how many large organization CISO offices have taken the Mythos red team reports as the red alert that it is. (I suspect very few) Based on historical trends in AI they have, at most, about six to nine months until...

Mythos Threat: Few Firms Hold Power, China Closing Gap
In different hands, Mythos would be an unprecedented cyberweapon. I am not sure how we deal with this, except to note a narrow window where we know only 3 companies could be at this level of capability. But it may be...

LLM Story Feels Polished yet Lacks Logical Cohesion
I think the story that was shared in the Mythos System Card still has the signs of flawed LLM writing (which looks like good writing at first glance): A story that doesn't really hold together logically, but sounds like it...

Mythos Clones Retain Unmistakably Claude-Like Dialogue
SuperClaude (Mythos) still seems irreducibly Claude-y given the transcripts in the system card. Here two versions of Mythos are forced to talk to each other across multiple rounds. They are less philosophical than Opus 4.6 or spiritual than Opus 4.1, but...
Mythos: Not Security‑Focused, Yet First Model Raising Risks
I was told about the Mythos release, but didn't have access, so have no personal experience to add. Two points from the brief: 1) It is not built for IT security, it is just a good enough model that it is good at...

Measuring AI Accuracy Is Tricky: Errors Mirror Wikipedia
This article is a case study of why measuring AI performance is so hard. AI Overviews make mistakes. But the same mistakes are in Wikipedia. But the sources are harder to find when using AI. But the AI answers may be...
AI Labs Should Prioritize Job Augmentation Over Replacement
It's an important time for the AI labs to build interfaces around the goal of "job augmentation through AI" rather than building "job replacement through AI." Chatbots were mostly augments, requiring a human to work. Agentic work patterns are still...
People Will Love Their AI, Fear the AI Industry
I suspect that the popularity of AI is going to start looking like surveys where people trust their own doctors but are distrustful of the medical establishment. People will increasingly like “their AI” but will increasingly be anxious about “AI” as...

Reward A, Expect B? Rethink Your Incentives
Everyone should read "On the Folly of Rewarding A, While Hoping for B" at least once. https://t.co/tF4HGbrweX https://t.co/HDor3NsxBO
2025 GenAI Impact Minimal; 2027 Could Shift Dramatically
There were likely no major work impacts of GenAI in any large firm throughout 2025. We did not have agentic tools, adoption takes time, and everyone was experimenting with process. That is starting to change. Studies that show no impact...
LLMs Mimic Humans; Agents Mimic Organizations, Enabling Cheap Delegation
It is weird that you can approach LLMs as reasonable approximations of humans and get good results, but it is even weirder that you can approach agents as reasonable approximations of organizations (higher ability work is expensive so delegation is...

Gemma 4 Fast On-Device, but Small Models Lack Agency
I am impressed by Gemma 4; there’s a lot of power for an on-device model at fast speeds. But I am not convinced you can get real agentic workflows out of a small model on device. So much depends on...

More Tokens Keep Scaling AI Reasoning Performance
An underappreciated fact is that the second scaling law does not seem to completely plateau in many tasks: throw more tokens at a reasoning AI model and get better answers, especially with a simple harness. Benchmark performance is actually limited by token usage....

AI Case Studies Double Revenue, Cut Capital Needs
Big deal paper here: field experiment on 515 startups, half shown case studies of how startups are successfully using AI. Those firms used AI 44% more, had 1.9x higher revenue, needed 39% less capital: 1) AI accelerates businesses 2) The challenge is understanding...
Configure OpenClaw to Update and Restart Claude Dispatch
Need to set up my OpenClaw to update and restart my Claude Dispatch to add computer use so I can use that instead.

Artemis Captures New “Blue Marble” Sequel
Likely the most widely distributed photograph in history is the original Blue Marble, taken by an unknown astronaut in 1972 from Apollo 17, which became the model for many images of Earth. Now we have a sequel, from Artemis (the second image)....

Good AI Diagnoses, Bad Chat Interface Worsens Outcomes
This new Nature paper (using old models) illustrates the point of my latest Substack post on AI interfaces. AI did a good job diagnosing medical issues, but when users had to interact with chatbots the interface led to confusion &...

Frontier AI Cybersecurity Time Horizon Doubles Every 5.7 Months
Here’s an independent domain extension of METR’s famous time-horizon analysis, applying it to offensive cybersecurity with real human expert timing data. Similar to METR: a 5.7-month doubling time. Frontier models now succeed 50% of the time at tasks that take human...
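To get a feel for what a 5.7-month doubling time implies, here is a minimal sketch. It assumes the trend continues as a clean exponential, which the post does not claim, and `horizon_multiplier` is a hypothetical helper name used only for illustration.

```python
def horizon_multiplier(months: float, doubling_months: float = 5.7) -> float:
    """Growth factor of the task time horizon after `months`.

    Assumes steady exponential growth with a fixed doubling time; this is
    an extrapolation for illustration, not a forecast from the report.
    """
    return 2.0 ** (months / doubling_months)


# How much longer would the 50%-success tasks be after various intervals?
for months in (5.7, 12, 24):
    print(f"{months} months -> x{horizon_multiplier(months):.2f}")
```

Under that assumption, the horizon doubles after 5.7 months by construction and grows a bit more than fourfold over a year.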
RAG’s Brief, Intense Era Ends as New Context Paradigm Emerges
The RAG era was short-lived, but intense. (Not that RAG is not useful, but it is no longer the dominant paradigm for supplying context to agents)

Prompt Injection Works on Old Models, Not Frontier AI
New report from us: Can you prompt inject your way to an “A”? As LLMs increasingly are used as judges, people are inserting AI prompts into letters, CVs & papers. We tested whether it works. It does on older & smaller...
AI's Strangeness Matters: Don't Simplify It to IT Automation
My piece in the Economist where I argue against de-weirding AI. It is a strange technology with both risks & opportunities that need to be discovered. Pretending AI works like normal IT automation can result in bad outcomes for companies...
AI Labs Fail to Articulate Clear Future Vision
The AI labs have actually done a bad job explaining what the future they are building towards will actually look like for most of us. Even “Machines of Loving Grace” has very few well-articulated visions of what Anthropic hopes life will...
Chatbot Interfaces, Not Models, Are AI’s Real Bottleneck
The biggest bottleneck in AI for most people isn't the models. It's the chatbot. New interfaces like Claude Dispatch are closing the gap between what AI can do and what people can actually use it for. For many folks, that...
Future Campuses Will Feature Signal‑free Faraday Cage Labs
My prediction for the latest trend in academic buildings: Faraday cage testing halls (including bathrooms) with no signal for assessment.
AI Policy Stuck Between Extremes, Ignoring Real Impacts
All AI policy is haunted by a failure of imagination. It is either nothing happens or apotheosis, people can’t seem to conceive of any other futures. It is amazing to predict massive AI development and expect nothing major to change in...
PDF‑Only Papers Reveal Science’s Slow AI Adoption
The fact that every scientific paper in 2026 is still uploaded only as a fully formatted PDF to academic archive sites that often limit downloads tells you everything you need to know about how quickly the scientific system is adjusting to...

GPT‑5.4 Pro Visually Parses Scientific Papers and Key Figures
One of the things that is useful about the ChatGPT GPT-5.4 Pro (and also Thinking) harness is that it is quite good at understanding how to read scientific papers, not just relying on text, but also figuring out which figures...
ARC-AGI-3 Designed for Zero AI Score, Watch Progress
This is true, but ARC-AGI-3 is also a test designed so that AI gets zero today, just as the earlier ARC-AGI tests were designed. Those tests were then mostly saturated within a year or two. The thing to watch with...

AI Saves US Workers 6% of Their Weekly Hours
The average American worker using AI reports time savings of 6%, or 2.5 hours in a work week. Those are similar to the UK & Netherlands, and slightly more than other EU countries. There are some early, non-causal signs that this is...
AI Agents Surge Token Demand, Sparking Compute Scarcity
It is trendy to discuss the Jevons Paradox in AI (as AI gets more efficient, overall use increases) but the current situation is much simpler: thanks to agents, token demand is surging and compute is supply constrained, at least for powerful...
AI Democratizes Expert Tricks, Making Them Go Viral
I would expect that a lot of things that were old hat to experts, but completely inaccessible to most people, will go viral in the coming months. Sure, anyone could have done those things before, but it required a lot of...
AGI Labs Likely Hide Superintelligence to Profit in Markets
The easiest way to make money fast from a superhuman artificial intelligence would be in the financial markets, almost by definition. So the first lab to develop one, if AGI is possible, would almost certainly keep it quiet for as...

AI Tutors Boost Learning; Unguided AI Shortcuts Education
The research team (including @hamsabastani who is on X) found that letting students just use AI resulted in them using it to accidentally shortcut learning. But both that study and a separate RCT found that AIs prompted to act as a...

LLM Trained From Scratch on 28,000 Victorian Texts
Want to talk to the past? Here is an LLM "trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available by the British Library." Quite different from...
AI Vision Models Promise Breakthrough Accessibility for Blind Users
Curious if there have been any good articles written on the impact of VLMs on low-vision and blind people. The advent of a universal text-reading and visual-description system seems like it would be a big advance as a...

AI Game Design Monologue Turns Gothic, Jokes About Velvet Cape
Asking Codex to build a SimGothicManor game and really enjoying how much of its internal planning monologue has become obsessed with tongue-in-cheek gothic, such as worrying about "scope creep in a velvet cape" https://t.co/AtixDfXWNM
AI Advances: Same Hardware, Stunning Results in 18 Months
One way to see the advancement of AI is to see how much further you can get with new models on the same hardware. Here is "an otter using a laptop on an airplane" generated on my home computer using the...