AI models are getting stronger, but their confidence still grows faster than their certainty. More parameters expand what the system can talk about, not what it can truly verify. The space of possible answers grows faster than the ground beneath those answers. The model keeps speaking fluently even when it has stepped past what the data can support.

Hallucinations persist because the underlying act hasn’t changed. The model is still predicting the next token at high resolution. Scaling makes the guesswork smoother, not more grounded. When a model drifts outside the parts of the world it actually understands, it keeps going with the same ease and rhythm. The language gets more polished. The mistakes get harder to spot.

What improves isn’t truth. What improves is persuasion. That’s why stronger models feel more accurate even when they aren’t. The errors look like insight until you examine the details. The better the model gets at shaping an argument, the easier it is to forget the argument was never backed by a source of truth in the first place.

The path forward is to build systems around the model that constrain the drift. Retrieval to anchor the answer. Feedback loops to keep it honest. Guardrails that force the model back into what is actually known. Power alone won’t remove hallucinations. Only grounding will.
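The closing prescription (retrieval to anchor, guardrails to force abstention) can be sketched concretely. This is a toy, illustrative Python sketch: the keyword-overlap retriever stands in for a real BM25/embedding index, and every function name here is hypothetical.

```python
# Toy retrieval-grounding loop: answer only from retrieved evidence, abstain otherwise.
# The keyword-overlap retriever is a stand-in for a real embedding or BM25 index.
def retrieve(query, corpus, top_k=2):
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())), reverse=True)
    # Keep only documents that share at least one term with the query.
    return [d for d in scored[:top_k] if q & set(d.lower().split())]

def grounded_prompt(query, corpus):
    evidence = retrieve(query, corpus)
    if not evidence:
        # Guardrail: with no supporting passages, force an abstention
        # instead of letting the model produce a fluent guess.
        return None
    context = "\n".join(f"- {d}" for d in evidence)
    return (
        "Answer ONLY from the evidence below; say 'I don't know' if it is insufficient.\n"
        f"{context}\nQ: {query}"
    )
```

The key design choice is that the guardrail lives outside the model: the system, not the model, decides when there is no ground to stand on.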
@PAstynome When World Models really show up in spatial computing devices (in about two years) it will get far worse. And when brain computer interfaces show up it will get far worse again. So, we need to figure out addiction issues now.
@zephyr_z9 TPUs are much more stable on 8-bit training (AQT etc.) than NVIDIA chips at massive scale. The previous gen was a bit sensitive on topology, but that looks like less of an issue for Ironwood.
Following everyone in AI makes the For You feed much, much smarter. Unfortunately for you, X doesn’t let you follow 31,000 people. I wish you could see mine. I put them all into lists so you can get some of the...
@pagilgukey @JohnThilen @dwarkesh_sp @ilyasut Yes. With Codex I meant GPT-5.1-Codex versus GPT-5.1-Codex-Max
@ScienceUnderSec Inspiring work. Let's put all the federal datasets on HF to unleash open and collaborative AI for science!
The dentists' office of the future. Here today at @TeamCloudberry. Customer satisfaction numbers are way up. Welcome to the automated dentists' office. The AI does all the grunt tasks other than putting fingers in people's mouths. Customer service. Tracking of everything....
In some ways, scaling is holding back progress. Either way, these mega-size clusters are going to be useful. Right now, most of the capacity is used to do a crazy large run + serving existing customers. It would be good...
Today's big model release is FLUX.2 from @bfl_ml. It's like Christmas before Thanksgiving lately, isn't it?
used google's new IDE for 5 minutes, was impressed, hit what I thought was a paywall but it's not a paywall, you just can't give google money? I don't understand https://t.co/cKUTuUQNsD
Marketing is the new bottleneck. I tossed that out as a quick thought. The replies came fast. The private messages went deeper. People wanted to hear more and understand exactly what I meant, so I turned it into a thread....
@w3whq @kenwarner GANs were 2015ish, Denoising Diffusion Probabilistic Models were 2020ish, aka 5 years later. Timeline expectations are crazy these days!
After listening to @ilyasut I am even more convinced we won’t know when AGI gets here. But I liked how he predicted humans will change as AI gets better. Economic boom times arrive as more people get into automating their businesses and...
@JohnThilen @dwarkesh_sp @ilyasut In addition, and that’s the important point, I think GPT-5 is smaller than GPT-4.5.
@JohnThilen @dwarkesh_sp @ilyasut I am speculating that all GPT-5.1 models (instant, thinking, Pro) are the same model but with different inference scaling budgets. Same for GPT-5 Codex. And Gemini 3 Pro and Gemini 3 Deep Think are probably also the same...
How do devs feel about this job change? Example below in Codex. Pros: you can kick off tasks from anywhere (just talked to one dev who started multiple codex tasks while getting into a cab), you get multiple versions to pull...
@GiorgioMantova @dwarkesh_sp @ilyasut I’d say this is the jump from last gen to current gen, but I think the argument is that further improvements will fizzle out in the next gen if we keep scaling pre-training. I.e., it won’t give...
Reachy mini is my new podcast assistant! Coming soon with @ti_morse... https://t.co/VfUQn1Cgz6
@_The_Prophet__ TPUs had low availability for ages, and also relatively low memory on the v6e, especially versus the Hoppers, which worked pretty much out of the box, similar to A100s. Grace Blackwell is the next thing that needs reworking, so there is...
@_The_Prophet__ TPUs have been more stable for training than CUDA equivalents for a couple of years now, especially on large batch sizes. XLA is pretty good now! For inference it makes even less of a difference (We previously trained sota models on thousands...
I agree. Incredible interview by @dwarkesh_sp of @ilyasut. I could listen to both for months and not get bored. It's like being at a great university and hearing the best professor. I love X. This just LIT UP the AI community....
One AI use case that is only getting more popular: LLM as a judge. Everyone still talks about AI generating more content, but not enough people are talking about: 1) the horrific deluge of noise we're going to have to deal...
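A minimal sketch of the LLM-as-a-judge pattern: build a rubric prompt for the judge model and parse its structured verdict. The prompt wording, 1-5 scale, and output format here are illustrative assumptions, not any particular product's API; the actual judge call is left out.

```python
import re

# Illustrative rubric; real judges usually grade several criteria, not one.
JUDGE_PROMPT = """You are a strict grader. Score the answer from 1-5 for factual accuracy.
Question: {question}
Answer: {answer}
Reply with exactly: SCORE: <n>"""

def build_judge_prompt(question, answer):
    return JUDGE_PROMPT.format(question=question, answer=answer)

def parse_score(judge_reply):
    # Extract the numeric score; return None if the judge deviated from the format,
    # so malformed verdicts can be retried instead of silently miscounted.
    m = re.search(r"SCORE:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None
```

The strict output contract plus a tolerant parser is what makes judging usable at the "deluge" scale the post describes: unparseable verdicts are detected, not averaged in.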
it is wild how far we’ve come since the hot image gen summer of 2022 image cred: @bfl_ml https://t.co/i5KbszlFDM
the community’s favorite image creation and editing model just got better: welcome, FLUX.2 by @bfl_ml 🤩 https://t.co/iLrbYYK4bd
I think it is somewhat true though that scaling helps with benchmark performance but not necessarily with new model capabilities. Like the example he mentioned > U: "Please code xyz." > M: "Ok here is xyz." > U: "You have a bug." >...

So you are a CV engineer, what do you know about Computer Vision? I have used YOLO for... https://t.co/lX4OrFFYqE
Here's our paper: https://t.co/RmNft3zU5Z
Excited to present our new AI paper as a @NeurIPSConf spotlight next week: we find that the problem of controlling artificial superintelligence remains unsolved. With simulations and scaling laws, we find that an implementation of the least unpromising...
OpenAI is very deliberate about how they talk about Codex. It's not positioned as an operating system. It's heavily positioned as a teammate. Their site says: "Your new coding partner", "accelerates your team" Their job postings say: "we're building an AI software...
@dwarkesh_sp @ilyasut “The Age of Scaling is over.” I agree with that. Basically, since GPT 4.5 a lot of the perceived real-world progress was driven by clever engineering wrappers (context filtering, inference scaling, multi-turn tricks, retrieval, tool use, etc).
Excited about the Genesis mission - congrats to @POTUS @SecretaryWright @ScienceUnderSec @mkratsios47 @sriramk! We've experienced first-hand how more openness and collaboration in the US can massively accelerate progress. In my opinion, that's what led to the current AI boom and US...
Just shared this brilliant mind map on the 15 key architectural characteristics of AI agents — absolutely packed with insights! Modularity, evolvability, context awareness, security compliance… everything you need to design robust agents. Huge thanks to @Python_Dv for creating this gem
Ok, so what Ilya saw was extreme benchmaxxing, which in turn prompted him to create his own company to do LLM development the proper way?! Makes sense, I sympathize with that.
@giffmana @dileeplearning the "correct-unintended" rules were just that -- correct on the demonstrations but using "shortcuts" (e.g., the numerical value of a color). We also saw a small percentage of "correct-unintended" rules that humans generated, but much less...
📢 Image-GS: Content-Adaptive Image Reconstruction using 2D Gaussians In this week’s deep dive, we explore Image-GS, a groundbreaking framework that reimagines how images can be represented, compressed, restored, and upsampled using adaptive 2D Gaussian splats. Unlike traditional codecs or neural...
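The primitive behind this kind of representation, a 2D Gaussian evaluated over pixel coordinates, fits in a few lines. This axis-aligned grayscale toy ignores Image-GS's anisotropic covariances, color, and fitting procedure; it only shows what a single splat contributes to an image grid.

```python
import math

def gaussian_splat(cx, cy, sx, sy, amp, width, height):
    # Render one axis-aligned 2D Gaussian onto a small image grid.
    # cx, cy: center; sx, sy: per-axis spread; amp: peak value.
    img = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = (x - cx) / sx, (y - cy) / sy
            img[y][x] = amp * math.exp(-0.5 * (dx * dx + dy * dy))
    return img
```

A real splat-based codec sums many such kernels and optimizes their parameters against the target image; this sketch is just the evaluation step for one of them.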
@giffmana @dileeplearning There was a big difference between "not classified" rules generated by humans and "correct-unintended" rules generated by machines. For humans, the "not classified" rules were generally humans writing nonsensical things like ⬇️
One more comment is that giving this image to an AI and asking about it is not sufficient to show the diff because it's all over the training data by now. You'd have to use a new, very recent image,...
@matejhladky_dev AI has crushed it since this post way beyond expectation. I made the same category of mistake all of AI was making, of thinking we have to discover and write the algorithm. You don't. You pretrain and then finetune...
I've had medium success asking LLMs if a thing exists; it works out of the box for some of the more well-known things (e.g. both GPT 5.1 and Gemini 3 know about this function if you describe the tensor transformation...
@UmmayHabiba0 @SchneiderNA Certainly. We’re witnessing a major shift in real time. AI and energy tech are finally converging in ways that will reshape how industries operate and how infrastructure is built. Here’s the video if you’d like to take a look: 📺...
@the_AI_girl @SchneiderNA Absolutely, the momentum building across AI, energy, and infrastructure is setting the stage for a major transformation in the U.S. economy. I just shared more of my insights here on @LinkedIn : https://t.co/WwaOkGdcNm Big shifts ahead.
Always a slightly mixed feeling to write pretty good first-principles code to do some tensor rearrangement, only to find that PyTorch has a built in function that does it faster. I had made a point of at least skimming the docs...
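The post doesn't name the op, so here is the same situation in miniature, with plain Python standing in for PyTorch: a first-principles rearrangement written by hand next to a built-in that already does it.

```python
def transpose_manual(rows):
    # First-principles version: index arithmetic spelled out by hand.
    n_rows, n_cols = len(rows), len(rows[0])
    return [[rows[r][c] for r in range(n_rows)] for c in range(n_cols)]

def transpose_builtin(rows):
    # The built-in equivalent: zip(*rows) transposes, and is already tested.
    return [list(col) for col in zip(*rows)]
```

Both produce the same result; the lesson in the post is that skimming the docs first would have surfaced the second one.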
85% of organizations believe responsible AI is a top management issue. Yet only 25% have governance mechanisms in place to address it. This trust gap is costing companies dearly. In Europe alone, 68% of companies don't understand their EU AI...
1.2 million samples. BM25, Embeddings and Hybrid search. Tutorial and code come tomorrow! Stay tuned! https://t.co/FlmaDlpASR
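One common way to combine a BM25 ranking with an embedding ranking into a hybrid result is Reciprocal Rank Fusion; the post doesn't say which fusion method it uses, so RRF here is an assumption, and the constant k=60 is the conventional default.

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each ranked list (e.g. BM25, vector search)
    # contributes 1 / (k + rank) per document; sum and re-sort.
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF only needs ranks, not scores, which is why it fuses BM25 and cosine similarities without any score normalization.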
totally forgot about this experiment where i found it was faster and cheaper to do classification via embeddings vs using the fastest/cheapest llm (at the time)
@jasonth0 did an experiment a bit back, and found that embedding based classification seemed consistently faster and cheaper than using the cheapest/fastest model (at the time) https://t.co/uuEPwu88cg
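The embedding-based classifier described here is typically a nearest-centroid scheme: average the embeddings of a few labeled examples per class, then assign new items to the most similar centroid. A stdlib-only sketch over precomputed, non-zero embedding vectors (the embedding model itself is out of scope):

```python
import math

def cosine(a, b):
    # Cosine similarity; assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def fit_centroids(labeled):
    # labeled: {label: [embedding, ...]} -> {label: mean embedding}
    out = {}
    for label, vecs in labeled.items():
        dim = len(vecs[0])
        out[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return out

def classify(embedding, centroids):
    # One embedding call plus a few dot products per item,
    # instead of one LLM call per item: the cost gap in the experiment.
    return max(centroids, key=lambda lbl: cosine(embedding, centroids[lbl]))
```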
kinda like this, but instead of using vec2text - i found grabbing a few samples from each cluster and feeding them into an llm came up with better names (not surprisingly) https://t.co/K8phyyDFdR
last weekend i went down the rabbit hole of how to build dynamic ontologies, and kept coming back to clustering of embeddings curious if anyone has cool experiments i could look at around this
AI manipulation techniques revealed. For our free newsletter this week, we cover temporal hacking: AI systems that game human attention over months. @IrenaCronin and I write this newsletter every week. Temporal hacking describes AI systems that optimize for long term outcomes by subtly...
We scored every major AI landing page analyzer across the same criteria we use for real CRO audits. Comprehensiveness. Specificity. Originality. Realistic implementation. Correctness. The highest score was 5/15. Several landed at zero. This isn’t a knock on AI. It’s a reflection of...