Build Hour: Prompt Caching

OpenAI
Feb 18, 2026

Why It Matters

Prompt caching lets enterprises lower AI operational costs and accelerate response times, making large‑scale, multimodal applications financially viable and more responsive.

Key Takeaways

  • Prompt caching cuts time‑to‑first‑token by up to 67% for long prompts
  • Cache hit rates rise with explicit prompt cache keys
  • Extended caching stores prefixes up to 24 hours
  • Static prompt prefixes boost cost savings dramatically
  • Use context engineering to maximize cache efficiency
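
The takeaways above boil down to one structural rule: put everything static first and everything per‑request last, so the shared prefix stays byte‑identical across calls. A minimal sketch of a request payload built that way (the model name, instruction text, and key format are illustrative placeholders, not from the session):

```python
# Sketch: ordering a request payload so the static prefix can be cached.
# Model name and instruction text are placeholders, not from the session.

STATIC_INSTRUCTIONS = (
    "You are a styling assistant. Follow the brand guidelines below.\n"
    "Guideline 1: ...\n"
    "Guideline 2: ..."
)

def build_request(user_query: str, user_id: str) -> dict:
    """Keep static content first and per-request content last so the
    shared prefix stays byte-identical across calls."""
    return {
        "model": "gpt-4.1",                       # placeholder model name
        "instructions": STATIC_INSTRUCTIONS,       # identical on every call
        "input": user_query,                       # dynamic content goes last
        "prompt_cache_key": f"styling-{user_id}",  # route similar requests together
    }

req_a = build_request("What goes with a navy blazer?", "u123")
req_b = build_request("Suggest shoes for a summer wedding.", "u123")
assert req_a["instructions"] == req_b["instructions"]  # stable, cacheable prefix
```

The key point is that the cacheable portion is a prefix: any variation early in the request (a timestamp, a user name) invalidates everything after it.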

Summary

The Build Hour session introduced OpenAI’s prompt caching feature, a mechanism that reuses computation for repeated prompt prefixes to cut latency and reduce API costs. Erika Kettleson explained that once a request reaches 1,024 tokens, OpenAI begins caching 128‑token blocks, automatically handling text, image, and audio inputs without code changes. Developers can extend cache lifetimes to 24 hours and influence routing with an optional prompt cache key, ensuring similar requests land on the same engine for higher hit rates.
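
The two figures above (a 1,024‑token threshold and 128‑token blocks) imply a simple arithmetic for how much of a prompt is cacheable. The round‑down rule below is an assumption inferred from those two numbers, not OpenAI’s exact implementation:

```python
# Sketch of the cacheable-prefix arithmetic: caching starts at 1,024 tokens
# and grows in 128-token blocks. The round-down rule is an assumption
# inferred from those two figures.

MIN_CACHE_TOKENS = 1024
BLOCK_SIZE = 128

def cacheable_prefix_tokens(prompt_tokens: int) -> int:
    """Largest 128-token-aligned prefix eligible for caching,
    or 0 if the prompt is below the 1,024-token threshold."""
    if prompt_tokens < MIN_CACHE_TOKENS:
        return 0
    return (prompt_tokens // BLOCK_SIZE) * BLOCK_SIZE

assert cacheable_prefix_tokens(1023) == 0     # under threshold: nothing cached
assert cacheable_prefix_tokens(1024) == 1024  # exactly at threshold
assert cacheable_prefix_tokens(1500) == 1408  # rounded down to a 128 block
```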

Key data points highlighted include a 50‑90% discount on cached tokens across model families and up to a 99% discount for speech‑to‑speech caching. In a benchmark of 2,300 prompts ranging from 1,024 to 200,000 tokens, cached requests showed a 67% faster time‑to‑first‑token for the longest inputs, while short prompts saw modest latency gains. A live demo with an AI styling assistant demonstrated cost reductions from $0.35 to $0.21 per batch when leveraging implicit caching and a prompt cache key, while latency remained comparable for 2,000‑token prompts.
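
The quoted discounts translate into per‑request savings via a simple cost model: uncached input tokens bill at full price, cached tokens at a 50–90% discount. The function below sketches that model; the $2.00‑per‑million price and 75% discount are illustrative placeholders, not actual OpenAI pricing:

```python
# Rough cost model for the quoted discounts: cached input tokens are billed
# at a 50-90% discount depending on model family. The rates below are
# illustrative placeholders, not actual OpenAI pricing.

def request_cost(uncached_tokens: int, cached_tokens: int,
                 price_per_mtok: float, cache_discount: float) -> float:
    """Dollar cost of one request, given a per-million-token input price
    and the fractional discount applied to cached tokens."""
    full = uncached_tokens * price_per_mtok / 1_000_000
    discounted = cached_tokens * price_per_mtok * (1 - cache_discount) / 1_000_000
    return full + discounted

# Example: a 10,000-token prompt where 8,960 tokens hit the cache at a 75% discount.
no_cache = request_cost(10_000, 0, 2.00, 0.75)
with_cache = request_cost(10_000 - 8_960, 8_960, 2.00, 0.75)
assert with_cache < no_cache
```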

The session also covered the technical underpinnings: OpenAI hashes the first 256 tokens and checks for matching 128‑token chunks, reusing attention matrix outputs (floating‑point numbers) rather than recomputing them. Developers are advised to keep prompt prefixes deterministic—avoiding timestamps or stray whitespace—and to employ context engineering, truncation, summarization, and appropriate endpoint selection to maximize cache hits.
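
The determinism advice can be partly enforced with a small normalization step before sending the prefix. This helper is an illustrative sketch, not part of any OpenAI SDK; note that normalization can fix stray whitespace, but a timestamp that changes on every call must simply be kept out of the prefix:

```python
# Illustrative helper (not part of any OpenAI SDK): collapse stray whitespace
# so semantically identical prefixes stay byte-identical across calls.
import re

def normalize_prefix(text: str) -> str:
    """Collapse runs of spaces/tabs and trim the ends, so two prefixes that
    differ only in whitespace hash to the same cached prefix."""
    return re.sub(r"[ \t]+", " ", text).strip()

a = normalize_prefix("You are a helpful   assistant. ")
b = normalize_prefix("You are a helpful assistant.")
assert a == b  # identical after normalization -> same cache prefix
```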

For businesses, adopting prompt caching can translate into substantial cost savings at scale and more predictable response times for heavy‑weight workloads, especially in multimodal applications like image batch processing or long conversational threads. By structuring prompts for cacheability and using the prompt cache key, teams can achieve higher throughput without sacrificing model intelligence.
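
To confirm those savings are actually materializing, teams can track cache hit rate from the API’s usage metadata (the Responses API reports cached input tokens under `input_tokens_details.cached_tokens`). The aggregation below is a sketch over simulated usage records; the dictionary field names are simplified stand‑ins for the API’s usage object:

```python
# Sketch: measuring cache hit rate from usage records. The field names are
# simplified stand-ins for the API's usage object, and the data is made up.

def cache_hit_rate(usages: list) -> float:
    """Fraction of all input tokens that were served from cache."""
    total = sum(u["input_tokens"] for u in usages)
    cached = sum(u["cached_tokens"] for u in usages)
    return cached / total if total else 0.0

usages = [
    {"input_tokens": 2_000, "cached_tokens": 0},      # first request: cold cache
    {"input_tokens": 2_000, "cached_tokens": 1_920},  # later requests hit the prefix
    {"input_tokens": 2_000, "cached_tokens": 1_920},
]
assert round(cache_hit_rate(usages), 2) == 0.64
```

A low hit rate on traffic that should share a prefix usually points to a non‑deterministic prefix or to requests landing on different engines, which is where the prompt cache key helps.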

Original Description

Build faster, cheaper, and with lower latency using prompt caching. This Build Hour breaks down how prompt caching works and how to design your prompts to maximize cache hits. Learn what’s actually being cached, when caching applies, and how small changes in your prompts can have a big impact on cost and performance.
Erika Kettleson (Solutions Engineer) covers:
• What prompt caching is and why it matters for real-world apps
• How cache hits work (prefixes, token thresholds, and continuity)
• Best practices like using the Responses API and prompt_cache_key
• How to measure cache hit rate, latency, and token savings
• Customer Spotlight: Warp (https://www.warp.dev/), presented by Suraj Gupta (Team Lead), on the impact of prompt caching
👉 Follow along with the code repo: http://github.com/openai/build-hours
👉 Sign up for upcoming live Build Hours: https://webinar.openai.com/buildhours
00:00 Introduction
02:37 Foundations, Mechanics, API Walkthrough
12:11 Demo: Batch Image Processing
16:55 Demo: Branching Chat
26:02 Demo: Long Running Compaction
32:39 Cache Discount Pricing Overview
36:03 Customer Spotlight: Warp
49:37 Q&A
