Sebastian Raschka - Latest News and Information

Sebastian Raschka

Creator

ML/AI research engineer. Ex stats professor. Author of "Build a Large Language Model From Scratch" (https://t.co/O8LAAMRzzW) & reasoning (https://t.co/5TueQKx2Fk)

Recent Posts

Hyper‑connections Residuals Boost Transformer Stability and Performance
Social•Jan 1, 2026

Efficiency and performance tweaks in the transformer architecture have usually focused on the normalization, attention, and FFN modules. For instance:
- Normalization: LayerNorm -> RMSNorm -> Dynamic TanH
- Attention: grouped-query attention, sliding window, multi-head latent attention, sparse attention
- FFN: GeLU -> SiLU, SiLU -> SwiGLU, Mixture of Experts

Well, I just saw the New Year’s gift from DeepSeek, which includes some improvements to the residual path. In short, it builds on the hyper-connections (HC) approach, which generalizes the regular (identity) residual connection into a learned one: the residual stream is widened into multiple parallel streams, and information is allowed to mix across them. They then take the HC idea a step further and propose mHC, which constrains the residual mixing to lie on a structured, norm-preserving manifold. This adds a small amount of overhead, but in return they report much better training stability and convergence. Arxiv link to the paper: https://lnkd.in/gSSgev3r
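To make the widened-residual idea concrete, here is a minimal PyTorch sketch of a hyper-connection-style block. It is an illustration of the general mechanism described above, not DeepSeek's implementation; the names (`stream_mix`, `read`, `write`) and the use of static, input-independent mixing weights are simplifying assumptions.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Sketch of a hyper-connection residual block.

    The usual residual update  h = h + f(h)  is generalized by keeping
    n parallel copies ("streams") of the hidden state. A learned matrix
    mixes the streams, a learned read vector collapses them into the
    layer input, and learned write weights distribute the layer output
    back across the streams. Initialized to recover a plain residual.
    """

    def __init__(self, layer: nn.Module, dim: int, n_streams: int = 4):
        super().__init__()
        self.layer = layer
        # Static mixing weights for simplicity (the HC paper also has a
        # dynamic, input-dependent variant).
        self.stream_mix = nn.Parameter(torch.eye(n_streams))          # stream <-> stream
        self.read = nn.Parameter(torch.ones(n_streams) / n_streams)   # streams -> layer input
        self.write = nn.Parameter(torch.ones(n_streams))              # layer output -> streams

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, dim)
        mixed = torch.einsum("ij,jbtd->ibtd", self.stream_mix, streams)
        x = torch.einsum("i,ibtd->btd", self.read, mixed)  # collapse for the layer
        out = self.layer(x)                                # attention/FFN sub-block
        return mixed + self.write.view(-1, 1, 1, 1) * out.unsqueeze(0)
```

In use, the embedding output would be replicated into `n_streams` copies before the first block and summed (or averaged) back to one stream after the last block. The mHC modification would additionally constrain `stream_mix` to a norm-preserving manifold so that repeated mixing cannot blow up or collapse the residual signal.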

By Sebastian Raschka
DeepSeek Introduces Residual Path Enhancements for Transformers
Social•Jan 1, 2026

Efficiency and performance tweaks in the transformer architecture have usually focused on the normalization, attention, and FFN modules. Well, here is a holiday gift from DeepSeek (https://t.co/ow1RpEG2Bv). Finally, some improvements to the residual path as well. https://t.co/XhnZwfL5of

By Sebastian Raschka
2025 LLMs Reach Gold-Level Reasoning, Scaling Surge
Social•Dec 30, 2025

I just uploaded my State of LLMs 2025 report, where I take a look at the progress, problems, and predictions for the year. Originally, I aimed for a concise overview and outlook, but (like always) that turned into quite the...

By Sebastian Raschka
AI Should Be a Chess Partner, Not a Replacement
Social•Dec 27, 2025

Maybe a good analogy for how we should use AI in a sustainable way is chess. Chess engines surpassed human players decades ago, yet professional chess played by humans is still active and thriving. I am not a chess expert, but...

By Sebastian Raschka
LLM Training Evolves: From Pre‑training to RLVR
Social•Dec 22, 2025

The LLM training eras:
- 202x: Pre-training (foundation)
- 2022: RLHF + PPO
- 2023: LoRA SFT
- 2024: Mid-Training
- 2025: RLVR + GRPO

By Sebastian Raschka
NVIDIA Opens Nemotron 3 Nano: 30B MoE‑Mamba Hybrid
Social•Dec 20, 2025

I really didn't expect another major open-weight LLM release this December, but here we go: NVIDIA released their new Nemotron 3 series this week. It comes in 3 sizes: 1. Nano (30B-A3B), 2. Super (100B), and 3. Ultra (500B). Architecture-wise, the models are a Mixture-of-Experts...

By Sebastian Raschka
Updated LLM Architecture Comparison Now Covers 17 Models
Social•Dec 14, 2025

If you are interested in understanding the design and components of modern LLM architectures, I have extensively grown and updated the Big Architecture Comparison article I published last summer. It grew 2x in size since then: https://lnkd.in/g-dwdPqy 1. DeepSeek V3/R1...

By Sebastian Raschka
Mistral 3 Large Halves Experts, Doubles Their Size
Social•Dec 12, 2025

Hold on a sec, Mistral 3 Large uses the DeepSeek V3 architecture, including MLA? Just went through the config files; the only difference I could see is that Mistral 3 Large used 2x fewer experts but made each expert 2x...

By Sebastian Raschka
From Random Forests to LLMs: A 12‑Year Evolution
Social•Dec 8, 2025

My biennial update to the "Hello World"s of ML & AI:
- 2013: RandomForestClassifier on Iris
- 2015: XGBoost on Titanic
- 2017: MLPs on MNIST
- 2019: AlexNet on CIFAR-10
- 2021: DistilBERT on IMDb movie reviews
- 2023: Llama 2 with LoRA on Alpaca 50k
- 2025: Qwen3 with RLVR on MATH-500
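For readers who never met the 2013-era entry, the whole "Hello World" fits in a few lines of scikit-learn; the split ratio and seeds here are arbitrary choices, not part of any canonical version.

```python
# 2013-era "Hello World": a random forest on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 flowers, 4 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

Each later entry in the list swaps in a bigger dataset and model, but the fit/evaluate skeleton stays recognizably the same until the RL-style recipes of 2023+.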

By Sebastian Raschka
Join Natolambert’s NeurIPS Research Spotlight Interviews
Social•Dec 4, 2025

I couldn't make it to NeurIPS this year, but I had been looking forward to the research spotlight interviews my colleague @natolambert is hosting. If you want to chat for 10-15 min to promote your work & latest research, I...

By Sebastian Raschka
DeepSeek V3.2 Boosts Efficiency with Sparse Attention and Verifiable RL
Social•Dec 3, 2025

The DeepSeek team just shared another model this week: DeepSeek V3.2. I put together a technical tour that walks through the key ideas and earlier models that led to this release: 🔗 https://lnkd.in/g9fcKkmm In the article, I cover the main...

By Sebastian Raschka
DeepSeek V3.2 Unveils Multi‑Head Latent Attention Evolution
Social•Dec 3, 2025

This interesting week started with DeepSeek V3.2! I just wrote up a technical tour of the predecessors and components that led up to this: 🔗 https://t.co/JSAd9cx2s6
- Multi-Head Latent Attention
- RLVR
- Sparse Attention
- Self-Verification
- GRPO Updates
https://t.co/5f965hR70I

By Sebastian Raschka
DeepSeek Model Hits Gold on IMO 2025, Boosts Self‑Refinement
Social•Nov 29, 2025

Looks like we got a new DeepSeek model over the holidays (again). Basically, it pushes RLVR & self-refinement to gold-level scores on IMO 2025. Coincidentally, I am currently working on the self-refinement chapter, and this comes in handy as a nice, scaled-up case...

By Sebastian Raschka
Inference Scaling Boosts LLM Accuracy From 15% to 52%
Social•Nov 26, 2025

As we head into a long weekend, some of you may be looking for reading material. Good news is that Chapter 4 on inference scaling was just released earlier this week! This chapter introduces the core ideas behind inference scaling...

By Sebastian Raschka
Comparing GPT‑5.1‑Codex to GPT‑5.1‑Codex‑Max
Social•Nov 26, 2025

@pagilgukey @JohnThilen @dwarkesh_sp @ilyasut Yes. With Codex I meant GPT-5.1-Codex versus GPT-5.1-Codex-Max

By Sebastian Raschka
Spare Compute Needed to Accelerate Idea Testing
Social•Nov 25, 2025

In some way, scaling is holding back progress. And either way, these mega-size clusters are going to be useful. Right now, most of the capacity is used for one crazy-large run plus serving existing customers. It would be good...

By Sebastian Raschka
AI Breakthroughs Now Arrive in Just Five Years
Social•Nov 25, 2025

@w3whq @kenwarner GANs were 2015ish, Denoising Diffusion Probabilistic Models were 2020ish, aka 5 years later. Timeline expectations are crazy these days!

By Sebastian Raschka
GPT-5 Expected to Be Smaller Than GPT-4.5
Social•Nov 25, 2025

@JohnThilen @dwarkesh_sp @ilyasut In addition, and that’s the important point, I think GPT-5 is smaller than GPT-4.5.

By Sebastian Raschka
GPT‑5.1 and Gemini 3 Variants Share Identical Core Model
Social•Nov 25, 2025

@JohnThilen @dwarkesh_sp @ilyasut I am speculating that all GPT-5.1 models (instant, thinking, Pro) are the same model but with different inference scaling budgets. Same for GPT-5 Codex. And Gemini 3 Pro and Gemini 3 Deep Think are probably also the same...

By Sebastian Raschka
Scaling Pre‑training Hits Diminishing Returns for Future Generations
Social•Nov 25, 2025

@GiorgioMantova @dwarkesh_sp @ilyasut I’d say this is the jump from last gen to current gen, but I think the argument is that further improvements will fizzle out in the next gen if we keep scaling pre-training. I.e., it won’t give...

By Sebastian Raschka
Scaling Boosts Benchmarks, Not Genuine Problem‑solving Ability
Social•Nov 25, 2025

I think it is somewhat true, though, that scaling helps with benchmark performance but not necessarily with new model capabilities. Like the example he mentioned:
> U: "Please code xyz."
> M: "Ok here is xyz."
> U: "You have a bug."
> ...

By Sebastian Raschka
Beyond Scaling: Engineering Tricks Now Drive AI Progress
Social•Nov 25, 2025

@dwarkesh_sp @ilyasut “The Age of Scaling is over.” I agree with that. Basically, since GPT 4.5 a lot of the perceived real-world progress was driven by clever engineering wrappers (context filtering, inference scaling, multi-turn tricks, retrieval, tool use, etc).

By Sebastian Raschka
