Machine learning at scale - Latest News and Information

Machine learning at scale

Machine learning systems in the real world.

LinkedIn Architecture for Production-Scale LLM Semantic Search

Blog•Jun 14, 2026

LinkedIn has replaced its keyword- and DLRM-based search stack with a two-stage LLM semantic search system that pairs a GPU-accelerated exhaustive bi-encoder retriever with a 0.6B-parameter small language model (SLM) ranker. Through multi-teacher, multi-task distillation, offline context summarization, 50% MLP pruning, and a custom prefill-only inference engine, the architecture achieves a 75× increase in throughput and handles hundreds of thousands of queries per second within strict latency budgets. The design eliminates approximate nearest-neighbor indices, uses shared-prefix KV caching, and optimizes scoring to make cross-encoder-level ranking viable at production scale.
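To make the "exhaustive" part of the retriever concrete, here is a minimal CPU sketch of bi-encoder retrieval without an approximate nearest-neighbor index: every document embedding is scored against the query embedding and the top k are kept. The function names, the toy two-dimensional vectors, and the choice of cosine similarity are assumptions for illustration only; the post describes a GPU implementation with learned embeddings, which this does not reproduce.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def exhaustive_top_k(query_vec, doc_vecs, k):
    """Score every document against the query (no ANN index) and return the top k.

    doc_vecs maps a hypothetical document id to its embedding.
    Returns a list of (doc_id, cosine_similarity), best first.
    """
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(d))), doc_id)
        for doc_id, d in doc_vecs.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Toy embeddings, purely illustrative.
docs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
results = exhaustive_top_k([1.0, 0.1], docs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In a two-stage design like the one summarized above, the candidates returned by this step would then be passed to the more expensive ranker; the sketch stops at retrieval.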

By Machine learning at scale

How to Pick the Right ML Team

Blog•Jun 10, 2026

The author, a Google veteran who has moved between anti‑abuse, YouTube Ads, and YouTube Shopping Recommendations, argues that choosing an ML team should prioritize fit over brand prestige. He notes that high‑profile teams often transition to maintenance work after their...

By Machine learning at scale

ML@SCALE - 1:1 - 100 Billion Rows, Three Mistakes, One Lesson [Edition #1]

Blog•Jun 7, 2026

Meta staff ML engineer Sanket discusses building recommender systems that train on over 100 billion rows. He highlights that most friction in ML velocity comes from experiment‑setup overhead, not compute. Sanket recounts three costly production failures—a self‑fulfilling model, evaluation‑data leakage, and...

By Machine learning at scale

ByteDance’s TokenMixer-Large: Scaling Ranking Models

Blog•May 31, 2026

ByteDance unveiled TokenMixer-Large, a deep ranking model that overcomes the gradient‑vanishing problem of its predecessor RankMixer. The architecture introduces a symmetric Mixing‑Reverting block that keeps token dimensions aligned, enabling very deep networks. By stripping away memory‑bound operators and relying almost...

By Machine learning at scale

Why Your $130K ML Pipeline Is Starving 65 Percent of New Merchants [Edition #11]

Blog•May 30, 2026

QuickBite, a Series D food‑delivery platform with 100 million orders, relies on its Mercury ranking engine to personalize a home‑screen feed of over 200 merchants. The pipeline handles 8,000‑14,500 requests per second, using a point‑wise XGBoost model trained on 180 days of...

By Machine learning at scale

Embedding Features in Weights to Kill Retrieval Latency

Blog•May 24, 2026

Pinterest replaced its traditional Two‑Tower retrieval system with a GPU‑centric neural network that can model deep user‑item interactions. By embedding high‑value candidate features directly into the model as registered buffers, the data fetch step was eliminated, cutting latency from roughly...

By Machine learning at scale

A 0.44 Recall Collapse That Looked Like 0.81 Global Success [Edition #10]

Blog•May 23, 2026

LexiSearch, a Series A legal‑tech SaaS, hit 50,000 enterprise seats and logged 300% year‑over‑year growth in document ingestion, now indexing 25 million files. Its dual‑tower bi‑encoder search engine processes an average 120 queries per second, peaking at 350 QPS, with a...

By Machine learning at scale

A Blueprint for Scaling Recommender Systems

Blog•May 17, 2026

Meta unveiled a two‑tier architecture for hyperscale recommender systems that separates a massive Foundation Model (FM) from lightweight surface‑specific Expert models. The FM learns universal, lifelong user representations and generates target‑aware embeddings that capture a user’s interest in each candidate...

By Machine learning at scale

12M Dollars Lost to an AUC Metric That Ignored Probability Calibration [Edition #9]

Blog•May 16, 2026

AdTechFlow, a growth‑stage demand‑side platform, recently surpassed $300 million in annual ad spend and posted 40 percent year‑over‑year growth. Its real‑time bidding engine handles 180,000‑260,000 requests per second, processing roughly 450 billion impressions each month. The company’s pCTR model is retrained weekly and...

By Machine learning at scale

Alibaba’s EST: Decoupling Compute From Sequence Length in CTR Scaling

Blog•May 10, 2026

Alibaba’s Efficiently Scalable Transformer (EST) redesigns click‑through‑rate (CTR) models by separating user‑behavior computation from candidate‑item processing. The architecture replaces full self‑attention with Lightweight Cross‑Attention (LCA) and introduces Content Sparse Attention (CSA) to handle multimodal signals in linear time. By caching...

By Machine learning at scale

0.08% False Positive Rate That Masked a $4.2M Attack [Edition #8]

Blog•May 9, 2026

FinShield, a Series B fintech, expanded its cross‑border payment rails to 14 markets and now processes about 8 million transactions daily. Its real‑time anti‑abuse gateway uses an XGBoost‑NN ensemble retrained weekly on a 90‑day sliding window, delivering 45 ms P99 latency and 99.99%...

By Machine learning at scale

Generative RecSys Won’t Save You: What Actually Matters at Billion-User Scale

Blog•May 6, 2026

The post argues that generative recommender systems, especially large‑language‑model (LLM) agents, are not a panacea for billion‑user platforms. While the RecSys 2025 keynote showcased a generative era, the author warns that conversational agents break the 200 ms latency budget and inflate...

By Machine learning at scale

Unpacking LinkedIn’s Move to Semantic Search

Blog•May 3, 2026

LinkedIn has re‑engineered its search stack, replacing lexical BM25 matching with a GPU‑accelerated semantic pipeline that uses dense embeddings for retrieval and a 0.6 billion‑parameter small language model (SLM) for ranking. The team built an LLM‑based “judge” to generate tens of...

By Machine learning at scale

A $1.1M Generative Recommender That Collapsed Into a 2000 Video Loop [Edition #7]

Blog•May 2, 2026

StreamPulse, a Series C video‑first platform with 200 million daily users, swapped its legacy two‑stage recommendation pipeline for a generative semantic retrieval system built on a 1.2 billion‑parameter transformer decoder. The new architecture predicts “Semantic IDs” from user histories, cutting latency to...

By Machine learning at scale

Anthropic Shipped Three Regressions in a Month and Their Evals Didn’t Catch One of Them

Blog•Apr 27, 2026

Anthropic disclosed that three unrelated changes to Claude Code rolled out between March and April caused noticeable drops in model performance. The first altered the default reasoning effort from high to medium, the second introduced a caching bug that cleared...

By Machine learning at scale
