Efficiency and performance tweaks in the transformer architecture have usually focused on the normalization, attention, and FFN modules. For instance:

Normalization: LayerNorm -> RMSNorm -> Dynamic Tanh
Attention: grouped-query attention, sliding-window attention, multi-head latent attention, sparse attention
FFN: GELU -> SiLU -> SwiGLU, Mixture of Experts

Well, I just saw the New Year's gift from DeepSeek, which includes some improvements to the residual path. In short, it's built on the hyper-connections (HC) approach, which generalizes the regular (identity) residual connection into a learned one: it widens the residual stream into multiple parallel streams and lets information mix across them. They then take the HC idea a step further and propose mHC, which constrains the residual mixing to lie on a structured, norm-preserving manifold. This adds a small amount of overhead, but in return they get much better training stability and convergence.

Arxiv link to the paper: https://lnkd.in/gSSgev3r
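To make the idea above concrete, here is a minimal sketch of one hyper-connection step. This is an illustrative simplification, not the paper's exact formulation: the function and argument names are my own, `alpha` plays the role of the learned cross-stream mixing, and `beta` the learned write-back weights.

```python
import numpy as np

def hyper_connection(streams, block, alpha, beta):
    """One hyper-connection step (sketch, not the paper's exact form).

    streams: (n, d) array of n parallel residual streams.
    block:   the wrapped sublayer (attention or FFN), mapping d -> d.
    alpha:   (n, n) learned mixing matrix across the streams;
             with n = 1 and alpha = [[1]] this is a plain residual.
    beta:    (n,) learned weights for writing the block output
             back into each stream.
    """
    mixed = alpha @ streams           # mix information across streams
    x = mixed.sum(axis=0)             # collapse streams into the block input
    y = block(x)                      # run the attention/FFN sublayer
    return mixed + np.outer(beta, y)  # redistribute output across streams

# Sanity check: n = 1 with identity mixing recovers x + block(x).
streams = np.ones((1, 4))
out = hyper_connection(streams, lambda x: 2 * x, np.eye(1), np.ones(1))
```

The mHC variant additionally constrains the mixing (per the post, restricting it to a structured, norm-preserving manifold) so that repeated mixing cannot blow up or collapse the residual streams; the exact parameterization is in the paper.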

Efficiency and performance tweaks in the transformer architecture have usually focused on the normalization, attention, and FFN modules. Well, here is a holiday gift from DeepSeek (https://t.co/ow1RpEG2Bv). Finally, some improvements to the residual path as well. https://t.co/XhnZwfL5of
I just uploaded my State of LLMs 2025 report, where I take a look at the progress, problems, and predictions for the year. Originally, I aimed for a concise overview and outlook, but (like always) that turned into quite the...
Maybe a good analogy for how we should use AI in a sustainable way is chess. Chess engines surpassed human players decades ago, yet professional chess played by humans is still active and thriving. I am not a chess expert, but...
The LLM training eras:
202x: Pre-training (foundation)
2022: RLHF + PPO
2023: LoRA SFT
2024: Mid-training
2025: RLVR + GRPO
I really didn't expect another major open-weight LLM release this December, but here we go: NVIDIA released their new Nemotron 3 series this week. It comes in 3 sizes:
1. Nano (30B-A3B)
2. Super (100B)
3. Ultra (500B)
Architecture-wise, the models are a Mixture-of-Experts...
If you are interested in understanding the design and components of modern LLM architectures, I have extensively expanded and updated the Big Architecture Comparison article I published last summer. It has grown 2x in size since then: https://lnkd.in/g-dwdPqy 1. DeepSeek V3/R1...
Hold on a sec, Mistral 3 Large uses the DeepSeek V3 architecture, including MLA? Just went through the config files; the only difference I could see is that Mistral 3 Large uses 2x fewer experts but makes each expert 2x...
My biennial update to the "Hello World"s of ML & AI:
2013: RandomForestClassifier on Iris
2015: XGBoost on Titanic
2017: MLPs on MNIST
2019: AlexNet on CIFAR-10
2021: DistilBERT on IMDb movie reviews
2023: Llama 2 with LoRA on Alpaca 50k
2025: Qwen3 with RLVR on MATH-500
I couldn't make it to NeurIPS this year, but I had been looking forward to the research spotlight interviews my colleague @natolambert is hosting. If you want to chat for 10-15 min to promote your work & latest research, I...
The DeepSeek team just shared another model this week: DeepSeek V3.2. I put together a technical tour that walks through the key ideas and earlier models that led to this release: 🔗 https://lnkd.in/g9fcKkmm In the article, I cover the main...

This interesting week started with DeepSeek V3.2! I just wrote up a technical tour of the predecessors and components that led up to this: 🔗 https://t.co/JSAd9cx2s6 - Multi-Head Latent Attention - RLVR - Sparse Attention - Self-Verification - GRPO Updates https://t.co/5f965hR70I

Looks like we got a new DeepSeek model over the holidays (again). Basically, it pushes RLVR & self-refinement to gold-level scores on IMO 2025. Coincidentally, I am currently working on the self-refinement chapter, and this comes in handy as a nice, scaled-up case...
As we head into a long weekend, some of you may be looking for reading material. Good news is that Chapter 4 on inference scaling was just released earlier this week! This chapter introduces the core ideas behind inference scaling...
@pagilgukey @JohnThilen @dwarkesh_sp @ilyasut Yes. With Codex I meant GPT-5.1-Codex versus GPT-5.1-Codex-Max
In some way, scaling is holding back progress. And either way, these mega-size clusters are going to be useful. Right now, most of the capacity goes to one crazy-large run plus serving existing customers. It would be good...
@w3whq @kenwarner GANs were 2015ish, Denoising Diffusion Probabilistic Models were 2020ish, aka 5 years later. Timeline expectations are crazy these days!
@JohnThilen @dwarkesh_sp @ilyasut In addition, and that’s the important point, I think GPT-5 is smaller than GPT-4.5.
@JohnThilen @dwarkesh_sp @ilyasut I am speculating that all GPT-5.1 models (instant, thinking, Pro) are the same model but with different inference scaling budgets. Same for GPT-5 Codex. And Gemini 3 Pro and Gemini 3 Deep Think are probably also the same...
@GiorgioMantova @dwarkesh_sp @ilyasut I’d say this is the jump from last gen to current gen, but I think the argument is that further improvements will fizzle out in the next gen if we keep scaling pre-training. I.e., it won’t give...
I think it is somewhat true, though, that scaling helps with benchmark performance but not necessarily with new model capabilities. Like the example he mentioned:
> U: "Please code xyz."
> M: "Ok here is xyz."
> U: "You have a bug."
> ...
@dwarkesh_sp @ilyasut “The Age of Scaling is over.” I agree with that. Basically, since GPT-4.5, a lot of the perceived real-world progress has been driven by clever engineering wrappers (context filtering, inference scaling, multi-turn tricks, retrieval, tool use, etc.).