Sebastian Raschka


ML/AI research engineer. Ex stats professor. Author of "Build a Large Language Model From Scratch" (https://t.co/O8LAAMRzzW) & reasoning (https://t.co/5TueQKx2Fk)

Hyper‑connections Residuals Boost Transformer Stability and Performance
Social · Jan 1, 2026

Efficiency and performance tweaks to the transformer architecture have usually focused on the normalization, attention, and FFN modules. For instance: Normalization: LayerNorm -> RMSNorm -> Dynamic Tanh. Attention: grouped-query attention, sliding window, multi-head latent attention, sparse attention. FFN: GeLU -> SiLU,...
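As an illustration of the LayerNorm -> RMSNorm step mentioned above, here is a minimal pure-Python sketch of both normalizations (per-vector, without the learnable scale/shift parameters; illustrative code, not from the post):

```python
import math

def layernorm(x, eps=1e-5):
    # LayerNorm: center by the mean, then scale by the standard deviation
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def rmsnorm(x, eps=1e-5):
    # RMSNorm: drop the mean-centering and divide by the root mean square,
    # saving one statistic per vector
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [xi / rms for xi in x]
```

RMSNorm's omission of the mean statistic is the reason it is slightly cheaper than LayerNorm at the same quality, which is why most recent LLMs adopted it.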

By Sebastian Raschka
DeepSeek Introduces Residual Path Enhancements for Transformers
Social · Jan 1, 2026

Efficiency and performance tweaks to the transformer architecture have usually focused on the normalization, attention, and FFN modules. Well, here is a holiday gift from DeepSeek (https://t.co/ow1RpEG2Bv). Finally, some improvements to the residual path as well. https://t.co/XhnZwfL5of

By Sebastian Raschka
2025 LLMs Reach Gold-Level Reasoning, Scaling Surge
Social · Dec 30, 2025

I just uploaded my State of LLMs 2025 report, where I take a look at the progress, problems, and predictions for the year. Originally, I aimed for a concise overview and outlook, but (like always) that turned into quite the...

By Sebastian Raschka
AI Should Be a Chess Partner, Not a Replacement
Social · Dec 27, 2025

Maybe a good analogy for how we should use AI in a sustainable way is chess. Chess engines surpassed human players decades ago, yet professional chess played by humans is still active and thriving. I am not a chess expert, but...

By Sebastian Raschka
LLM Training Evolves: From Pre‑training to RLVR
Social · Dec 22, 2025

The LLM training eras:
202x Pre-training (foundation)
2022 RLHF + PPO
2023 LoRA SFT
2024 Mid-Training
2025 RLVR + GRPO

By Sebastian Raschka
NVIDIA Opens Nemotron 3 Nano: 30B MoE‑Mamba Hybrid
Social · Dec 20, 2025

I really didn't expect another major open-weight LLM release this December, but here we go: NVIDIA released their new Nemotron 3 series this week. It comes in 3 sizes: 1. Nano (30B-A3B), 2. Super (100B), and 3. Ultra (500B). Architecture-wise, the models are a Mixture-of-Experts...
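The Mixture-of-Experts design behind the "30B-A3B" naming (30B total parameters, roughly 3B active per token) relies on a router that activates only the top-k experts for each token. A dependency-free sketch of such top-k routing (illustrative only, not NVIDIA's implementation):

```python
def moe_route(scores, k=2):
    # Top-k routing: a token is sent only to the k experts with the highest
    # router scores; the remaining experts stay inactive for this token.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# With many routed experts but a small k, most expert parameters stay idle
# per token, which is how "30B total / ~3B active" arises.
```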

By Sebastian Raschka
Updated LLM Architecture Comparison Now Covers 17 Models
Social · Dec 14, 2025

If you are interested in understanding the design and components of modern LLM architectures, I have extensively grown and updated the Big Architecture Comparison article I published last summer. It grew 2x in size since then: https://lnkd.in/g-dwdPqy 1. DeepSeek V3/R1...

By Sebastian Raschka
Mistral 3 Large Halves Experts, Doubles Their Size
Social · Dec 12, 2025

Hold on a sec, Mistral 3 Large uses the DeepSeek V3 architecture, including MLA? Just went through the config files; the only difference I could see is that Mistral 3 Large used 2x fewer experts but made each expert 2x...

By Sebastian Raschka
From Random Forests to LLMs: A 12‑Year Evolution
Social · Dec 8, 2025

My biennial update to the "Hello World"s of ML & AI:
2013: RandomForestClassifier on Iris
2015: XGBoost on Titanic
2017: MLPs on MNIST
2019: AlexNet on CIFAR-10
2021: DistilBERT on IMDb movie reviews
2023: Llama 2 with LoRA on Alpaca 50k
2025: Qwen3 with RLVR on MATH-500

By Sebastian Raschka
Join Natolambert’s NeurIPS Research Spotlight Interviews
Social · Dec 4, 2025

I couldn't make it to NeurIPS this year, but I had been looking forward to the research spotlight interviews my colleague @natolambert is hosting. If you want to chat for 10-15 min to promote your work & latest research, I...

By Sebastian Raschka
DeepSeek V3.2 Boosts Efficiency with Sparse Attention and Verifiable RL
Social · Dec 3, 2025

The DeepSeek team just shared another model this week: DeepSeek V3.2. I put together a technical tour that walks through the key ideas and earlier models that led to this release: 🔗 https://lnkd.in/g9fcKkmm In the article, I cover the main...

By Sebastian Raschka
DeepSeek V3.2 Unveils Multi‑Head Latent Attention Evolution
Social · Dec 3, 2025

This interesting week started with DeepSeek V3.2! I just wrote up a technical tour of the predecessors and components that led up to this: 🔗 https://t.co/JSAd9cx2s6 - Multi-Head Latent Attention - RLVR - Sparse Attention - Self-Verification - GRPO Updates https://t.co/5f965hR70I

By Sebastian Raschka
DeepSeek Model Hits Gold on IMO 2025, Boosts Self‑Refinement
Social · Nov 29, 2025

Looks like we got a new DeepSeek model over the holidays (again). It basically pushes RLVR & self-refinement to gold-level scores on IMO 2025. Coincidentally, I am currently working on the self-refinement chapter, and this comes in handy as a nice, scaled-up case...

By Sebastian Raschka
Inference Scaling Boosts LLM Accuracy From 15% to 52%
Social · Nov 26, 2025

As we head into a long weekend, some of you may be looking for reading material. Good news is that Chapter 4 on inference scaling was just released earlier this week! This chapter introduces the core ideas behind inference scaling...

By Sebastian Raschka
Comparing GPT‑5.1‑Codex to GPT‑5.1‑Codex‑Max
Social · Nov 26, 2025

@pagilgukey @JohnThilen @dwarkesh_sp @ilyasut Yes. With Codex I meant GPT-5.1-Codex versus GPT-5.1-Codex-Max

By Sebastian Raschka
Spare Compute Needed to Accelerate Idea Testing
Social · Nov 25, 2025

In some way, scaling is holding back progress. And either way these mega-size clusters are going to be useful. Right now, most of the capacity is used to do a crazy large run + serving existing customers. It would be good...

By Sebastian Raschka
AI Breakthroughs Now Arrive in Just Five Years
Social · Nov 25, 2025

@w3whq @kenwarner GANs were 2015ish, Denoising Diffusion Probabilistic Models were 2020ish, aka 5 years later. Timeline expectations are crazy these days!

By Sebastian Raschka
GPT-5 Expected to Be Smaller Than GPT-4.5
Social · Nov 25, 2025

@JohnThilen @dwarkesh_sp @ilyasut In addition, and that’s the important point, I think GPT-5 is smaller than GPT-4.5.

By Sebastian Raschka
GPT‑5.1 and Gemini 3 Variants Share Identical Core Model
Social · Nov 25, 2025

@JohnThilen @dwarkesh_sp @ilyasut I am speculating that all GPT-5.1 models (instant, thinking, Pro) are the same model but with different inference scaling budgets. Same for GPT-5 Codex. And Gemini 3 Pro and Gemini 3 Deep Think are probably also the same...

By Sebastian Raschka
Scaling Pre‑training Hits Diminishing Returns for Future Generations
Social · Nov 25, 2025

@GiorgioMantova @dwarkesh_sp @ilyasut I’d say this is the jump from last gen to current gen, but I think the argument is that further improvements will fizzle out in the next gen if we keep scaling pre-training. I.e., it won’t give...

By Sebastian Raschka
Scaling Boosts Benchmarks, Not Genuine Problem‑solving Ability
Social · Nov 25, 2025

I think it is somewhat true, though, that scaling helps with benchmark performance but not necessarily with new model capabilities. Like the example he mentioned:
> U: "Please code xyz."
> M: "Ok here is xyz."
> U: "You have a bug."
> ...

By Sebastian Raschka
Beyond Scaling: Engineering Tricks Now Drive AI Progress
Social · Nov 25, 2025

@dwarkesh_sp @ilyasut “The Age of Scaling is over.” I agree with that. Basically, since GPT-4.5, a lot of the perceived real-world progress has been driven by clever engineering wrappers (context filtering, inference scaling, multi-turn tricks, retrieval, tool use, etc.).

By Sebastian Raschka
Seeing Benchmaxxing, Ilya Launches Company for Proper LLM Development
Social · Nov 25, 2025

Ok, so what Ilya saw was extreme benchmaxxing, which in turn prompted him to create his own company to do LLM development the proper way?! Makes sense, I sympathize with that.

By Sebastian Raschka
Assume AI on Take‑home Exams; Avoid Knowledge‑based Tests
Social · Nov 24, 2025

In my classes, a big portion of the grade was HW and project-based (https://t.co/gDwMlp0iwV), but I still had to include traditional exams. And then, during COVID, we had to redesign the exams so that students could take them at home. It...

By Sebastian Raschka
LLMs‑from‑scratch Hits 1,000 Commits on Monday
Social · Nov 24, 2025

Just a regular Monday morning. Saw that LLMs-from-scratch got its 1000th commit. Feels a bit surreal 🤯😊 https://t.co/n3EWkoEMHp

By Sebastian Raschka
Olmo 3 Shows Transparent Architecture with Sliding‑Window Attention
Social · Nov 23, 2025

There were lots of interesting LLM releases last week. My favorite was actually the Olmo 3 release. Olmo models are always a highlight since they are fully transparent (including training methods and datasets) and come with very detailed technical...

By Sebastian Raschka
Own Hardware Pipelines Enable Aggressive, Low‑cost Inference Scaling
Social · Nov 18, 2025

I wanted to say this is because the competition focused on cheaper models. But even that doesn't seem true. What seems more plausible is that teams with strong training pipelines and their own hardware can push much more aggressive inference-time scaling at...

By Sebastian Raschka
Training Vs. Inference: Break‑Even at 25 Million Queries
Social · Nov 18, 2025

What should we focus on: (more) LLM training or inference scaling? (A question I got asked multiple times now, so here are some thoughts.) Training is usually very, very expensive, but it is a one-time cost. Inference scaling is comparatively cheap, but...
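The break-even logic is simple division: the one-time extra training cost divided by the extra inference cost incurred per query. A toy sketch with hypothetical dollar figures, chosen only so that they land on the 25-million-query figure in the title (the actual numbers are not given in the excerpt):

```python
def break_even_queries(training_cost, extra_cost_per_query):
    # Number of queries after which a one-time (extra) training spend beats
    # paying for extra inference-time compute on every single query.
    return training_cost / extra_cost_per_query

# Hypothetical figures (not from the post): a $5M extra training run
# vs. $0.20 of extra inference-scaling cost per query.
print(break_even_queries(5_000_000.0, 0.20))
```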

By Sebastian Raschka
Inference Scaling Boosts LLM Accuracy From 15% to 52%
Social · Nov 15, 2025

Inference-scaling lets us trade extra compute for better modeling accuracy. Next to reinforcement learning, it has become one of the most important concepts in today's LLMs, so the book will cover it in two chapters instead of just one. I...

By Sebastian Raschka
Read Offline First, Then Code to Master LLM Books
Social · Nov 12, 2025

I often get questions from readers about how to read and get the most out of my book(s) on building LLMs from scratch. My advice is usually based on how I read technical books myself. This is not a one-size-fits-all...

By Sebastian Raschka
Key LLM Architecture Insights in 25‑Minute Talk
Social · Nov 8, 2025

My "The Building Blocks of Today’s and Tomorrow’s Language Models" talk at the PyTorch Conference is now up on YouTube! https://t.co/bGV5w1Aqyq If you have 25 min this weekend, it's a whirlwind tour to catch you up on the key LLM architecture...

By Sebastian Raschka
Exploring Emerging Alternatives to Standard Large Language Models
Social · Nov 4, 2025

My new field guide to alternatives to standard LLMs: Gated DeltaNet hybrids (Qwen3-Next, Kimi Linear), text diffusion, code world models, and small reasoning transformers. https://t.co/ZpWugAccgQ https://t.co/255yQXaDcM

By Sebastian Raschka
Linear Attention Makes a Comeback with Efficient LLMs
Social · Nov 1, 2025

With the release of the Kimi Linear LLM last week, we can definitely see that efficient, linear attention variants have seen a resurgence in recent months. Here's a brief summary of what happened. First, linear attention variants have been around for...
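For context, the appeal of linear attention is that replacing the softmax with a positive feature map lets all keys and values be aggregated once and reused for every query, turning the quadratic cost in sequence length into a linear one. A minimal NumPy sketch of this idea (non-causal variant; illustrative only, not Kimi Linear's implementation):

```python
import numpy as np

def feature_map(x):
    # Positive feature map phi(x) = elu(x) + 1, as used in early linear
    # attention work ("Transformers are RNNs", 2020)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # Softmax attention costs O(T^2 * d); with a feature map we can
    # precompute K^T V once and reuse it for every query: O(T * d * d_v).
    Qf, Kf = feature_map(Q), feature_map(K)    # (T, d)
    kv = Kf.T @ V                              # (d, d_v): one summary of all keys/values
    z = Kf.sum(axis=0)                         # (d,): normalization terms
    return (Qf @ kv) / (Qf @ z + eps)[:, None] # (T, d_v)
```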

By Sebastian Raschka
Fp16 Viable with Proper Normalization; Bf16 Still Safer
Social · Oct 31, 2025

I ran lots of experiments on fp16 vs bf16 years ago on ViTs and LLMs. fp16 can work well but depends on normalization (so you don’t run over the supported range with your activations). I can see why with QKNorm...
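The fp16 overflow concern comes down to range: fp16 tops out at 65504, while bf16 keeps fp32's exponent range at lower precision. A small sketch of the range check implied above (illustrative only):

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0: easy to exceed with large activations
FP32_MAX = float(np.finfo(np.float32).max)  # ~3.4e38; bf16 shares this exponent range

def overflows_fp16(activation):
    # Why normalization (e.g., QK-norm) matters for fp16 training:
    # activations beyond the representable range become inf/NaN.
    return abs(activation) > FP16_MAX
```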

By Sebastian Raschka
DGX Beats Mac Mini, Offers Seamless NVIDIA Sync Setup
Social · Oct 30, 2025

A small follow-up to my DGX Spark post. Courtesy of NVIDIA, I got to try the DGX on my workflows (coding LLMs from scratch in pure PyTorch) and wanted to share my first impressions after using it for a week. Before...

By Sebastian Raschka
Testing Inference-Scaling Techniques to Shape Upcoming Chapter
Social · Oct 28, 2025

On that note, I am currently running a large-scale experiment for the upcoming inference-scaling chapter:
A) Parallel Sampling
- Self-Consistency (Majority Vote)
- Rejection Sampling
- Best-of-N (Verifier-Based)
B) Sequential Refinement
- Self-Refinement
- Power Sampling
- MCMC (Simple)
- MCMC (Block, as in the "Reasoning with Sampling" paper)
- Tree-of-Thought
...
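As one example from the parallel-sampling family, self-consistency is simply majority voting over independently sampled answers. A toy sketch with a hypothetical stand-in sampler (`noisy_model` is made up for illustration; this is not code from the chapter):

```python
import random
from collections import Counter

def self_consistency(sample_answer, n=8, seed=0):
    # Parallel sampling: draw n candidate answers and return the majority vote.
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def noisy_model(rng):
    # Toy stand-in for an LLM sampler: correct answer "42" 70% of the time,
    # otherwise a random wrong digit.
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 9))
```

With enough samples, the correct answer dominates because the wrong answers scatter across many values while the right one keeps accumulating votes.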

By Sebastian Raschka
MiniMax-M2 Boosts Performance with Full Attention and Per‑Layer QK‑Norm
Social · Oct 28, 2025

Just saw the MiniMax-M2 benchmarks, and the performance is too good to ignore :). So, I just amended my "The Big LLM Architecture Comparison" with entry number 13! 1️⃣ Full attention modules: As shown in the overview figure below, I grouped...

By Sebastian Raschka
Presenting 2025 LLM Landscape at PyTorch Conference
Social · Oct 22, 2025

Excited to be (finally) heading to the PyTorch Conference! I’ll be giving a talk tomorrow at 11:00 AM on “The LLM Landscape 2025”, where I’ll discuss the key components behind this year’s most prominent open-weight LLMs, and highlight a few architectural...

By Sebastian Raschka
Explore the Power of Mixture of Experts (MoE)
Social · Oct 20, 2025

🔗 Mixture of Experts (MoE): https://t.co/3CGjgO4H9p https://t.co/QA12nBeW0i

By Sebastian Raschka
New Chapter Releases Symbolic Verifier for LLM Evaluation
Social · Oct 16, 2025

Chapter 3 on building a symbolic verifier for LLMs from scratch is now live: https://mng.bz/lZ5B. And with this, the first 176 pages of Build A Reasoning Model (From Scratch) are now available. This verifier is a useful method for...
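For intuition, a symbolic or programmatic verifier recomputes the ground truth and checks the model's claimed answer against it, which also makes it a natural reward signal for RLVR. A toy arithmetic-only sketch (illustrative only, not the book's implementation):

```python
def verify_answer(expression, claimed):
    # Recompute the ground truth and compare it to the model's claimed answer.
    # (Toy verifier: arithmetic expressions only, evaluated without builtins.)
    try:
        return abs(eval(expression, {"__builtins__": {}}) - float(claimed)) < 1e-9
    except Exception:
        return False
```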

By Sebastian Raschka
Mac Mini Excels at Inference; DGX Spark Needed for Fine‑tuning
Social · Oct 15, 2025

Saw that DGX Spark vs Mac Mini M4 Pro benchmark plot making the rounds (looks like it came from @lmsysorg). Thought I’d share a few notes as someone who actually uses a Mac Mini M4 Pro and has been tempted...

By Sebastian Raschka
Launch of Readsail: Curated AI Learning Platform
Social · Oct 15, 2025

If your company wants to help employees stay up to date with AI, I've teamed up with @natolambert and others to launch @readsail. It's essentially a platform that makes it easy to manage ongoing AI learning and brings you a...

By Sebastian Raschka
Sliding Window Attention Streamlines Long-Sequence Processing
Social · Oct 13, 2025

Sliding Window Attention 🔗 https://t.co/LIR9gCActT https://t.co/NSn4D4vLoF
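For reference, sliding window attention restricts each token to attend to the most recent `window` tokens instead of the full causal prefix, cutting the attention cost for long sequences. A minimal mask sketch (illustrative only):

```python
def sliding_window_mask(seq_len, window):
    # Causal sliding-window mask: token i may attend to tokens
    # max(0, i - window + 1) .. i, rather than the whole prefix.
    return [
        [1 if 0 <= i - j < window else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]
```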

By Sebastian Raschka