
LinkedIn Architecture for Production-Scale LLM Semantic Search
LinkedIn has replaced its keyword‑ and DLRM‑based search stack with a two‑stage LLM semantic search system: a GPU‑accelerated exhaustive bi‑encoder retriever followed by a 0.6B‑parameter small language model (SLM) ranker. Multi‑teacher, multi‑task distillation, offline context summarization, 50% MLP pruning, and a custom prefill‑only inference engine together yield a 75× increase in throughput, letting the system handle hundreds of thousands of queries per second within strict latency budgets. The design eliminates approximate nearest‑neighbor indices, uses shared‑prefix KV caching, and optimizes scoring to make cross‑encoder‑level ranking viable at production scale.
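As a rough illustration of what "exhaustive" retrieval means, the sketch below scores a query embedding against every document embedding and keeps the top k, with no approximate nearest‑neighbor index in between. It is a toy CPU version in plain Python, not LinkedIn's GPU implementation; the function names, document IDs, and three‑dimensional vectors are all hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def exhaustive_top_k(query_vec, doc_vecs, k):
    """Score the query against every document and return the k best.

    doc_vecs maps a document ID to its embedding. Every document is
    scored, so the result is exact rather than approximate.
    """
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(d))), doc_id)
        for doc_id, d in doc_vecs.items()
    )
    # nlargest keeps only k candidates in memory while scanning all scores.
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Hypothetical toy embeddings; a real bi-encoder would produce these.
DOCS = {
    "job_ml_engineer": [0.9, 0.1, 0.0],
    "job_data_analyst": [0.5, 0.5, 0.1],
    "job_chef": [0.0, 0.1, 0.9],
}

top = exhaustive_top_k([1.0, 0.0, 0.0], DOCS, k=2)
print(top)
```

In a two‑stage setup like the one described, a candidate list of this kind would then be passed to the heavier ranker; the cost of the full scan is what makes GPU acceleration relevant at scale.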

How to Pick the Right ML Team
The author, a Google veteran who has moved across anti‑abuse, YouTube Ads, and YouTube Shopping Recommendations, argues that fit should matter more than brand prestige when choosing an ML team. He notes that high‑profile teams often transition to maintenance work after their...
ML@SCALE - 1:1 - 100 Billion Rows, Three Mistakes, One Lesson [Edition #1]
Meta staff ML engineer Sanket discusses building recommender systems that train on over 100 billion rows. He highlights that most friction in ML velocity comes from experiment‑setup overhead, not compute. Sanket recounts three costly production failures—a self‑fulfilling model, evaluation‑data leakage, and...

ByteDance’s TokenMixer-Large: Scaling Ranking Models
ByteDance unveiled TokenMixer-Large, a deep ranking model that overcomes the gradient‑vanishing problem of its predecessor RankMixer. The architecture introduces a symmetric Mixing‑Reverting block that keeps token dimensions aligned, enabling very deep networks. By stripping away memory‑bound operators and relying almost...
Why Your $130K ML Pipeline Is Starving 65 Percent of New Merchants [Edition #11]
QuickBite, a Series D food‑delivery platform with 100 million orders, relies on its Mercury ranking engine to personalize a home‑screen feed of over 200 merchants. The pipeline handles 8,000‑14,500 requests per second, using a point‑wise XGBoost model trained on 180 days of...

Embedding Features in Weights to Kill Retrieval Latency
Pinterest replaced its traditional Two‑Tower retrieval system with a GPU‑centric neural network that can model deep user‑item interactions. By embedding high‑value candidate features directly into the model as registered buffers, the data fetch step was eliminated, cutting latency from roughly...
A 0.44 Recall Collapse That Looked Like 0.81 Global Success [Edition #10]
LexiSearch, a Series A legal‑tech SaaS, hit 50,000 enterprise seats and logged 300% year‑over‑year growth in document ingestion, now indexing 25 million files. Its dual‑tower bi‑encoder search engine processes an average of 120 queries per second, peaking at 350 QPS, with a...

A Blueprint for Scaling Recommender Systems
Meta unveiled a two‑tier architecture for hyperscale recommender systems that separates a massive Foundation Model (FM) from lightweight surface‑specific Expert models. The FM learns universal, lifelong user representations and generates target‑aware embeddings that capture a user’s interest in each candidate...
12M Dollars Lost to an AUC Metric That Ignored Probability Calibration [Edition #9]
AdTechFlow, a growth‑stage demand‑side platform, recently surpassed $300 million in annual ad spend and posted 40% year‑over‑year growth. Its real‑time bidding engine handles 180,000‑260,000 requests per second, processing roughly 450 billion impressions each month. The company’s pCTR model is retrained weekly and...

Alibaba’s EST: Decoupling Compute From Sequence Length in CTR Scaling
Alibaba’s Efficiently Scalable Transformer (EST) redesigns click‑through‑rate (CTR) models by separating user‑behavior computation from candidate‑item processing. The architecture replaces full self‑attention with Lightweight Cross‑Attention (LCA) and introduces Content Sparse Attention (CSA) to handle multimodal signals in linear time. By caching...
0.08% False Positive Rate That Masked a $4.2M Attack [Edition #8]
FinShield, a Series B fintech, expanded its cross‑border payment rails to 14 markets and now processes about 8 million transactions daily. Its real‑time anti‑abuse gateway uses an XGBoost‑NN ensemble retrained weekly on a 90‑day sliding window, delivering 45 ms P99 latency and 99.99%...

Generative RecSys Won’t Save You: What Actually Matters at Billion-User Scale
The post argues that generative recommender systems, especially large‑language‑model (LLM) agents, are not a panacea for billion‑user platforms. While the RecSys 2025 keynote showcased a generative era, the author warns that conversational agents break the 200 ms latency budget and inflate...

Unpacking LinkedIn’s Move to Semantic Search
LinkedIn has re‑engineered its search stack, replacing lexical BM25 matching with a GPU‑accelerated semantic pipeline that uses dense embeddings for retrieval and a 0.6 billion‑parameter small language model (SLM) for ranking. The team built an LLM‑based “judge” to generate tens of...
A $1.1M Generative Recommender That Collapsed Into a 2000 Video Loop [Edition #7]
StreamPulse, a Series C video‑first platform with 200 million daily users, swapped its legacy two‑stage recommendation pipeline for a generative semantic retrieval system built on a 1.2 billion‑parameter transformer decoder. The new architecture predicts “Semantic IDs” from user histories, cutting latency to...

Anthropic Shipped Three Regressions in a Month and Their Evals Didn’t Catch One of Them
Anthropic disclosed that three unrelated changes to Claude Code rolled out between March and April caused noticeable drops in model performance. The first altered the default reasoning effort from high to medium, the second introduced a caching bug that cleared...
