OpenAI Prompt Caching

October 4, 2024
Shaun Smith
OpenAI introduced prompt caching for their GPT-4o and o1 series models on October 3rd, offering immediate cost savings and performance benefits without requiring any changes to existing API integrations.

This follows the earlier significant price drops on the GPT-4o model in August, and takes a different approach to Anthropic’s equivalent feature, which is still in beta.

Implementation

When submitting several prompts within a short time frame, typically 5-10 minutes, those that are longer than 1024 tokens (about 750 words) and start with the same text will be cached. This means you automatically benefit from lower costs and quicker responses.
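To illustrate, here is a minimal sketch (using the openai Python SDK) of how an application might keep the long, stable content at the start of every request so the shared prefix is eligible for caching. The system_prompt.txt file and the ask helper are hypothetical, and exact caching behaviour is decided server-side by OpenAI.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A long, stable System Prompt (well over 1024 tokens in practice), loaded once
# and placed first in every request so that each call shares the same prefix.
SYSTEM_PROMPT = open("system_prompt.txt").read()

def ask(question: str) -> str:
    """Send one chat turn; only the user question varies between calls."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # cacheable prefix
            {"role": "user", "content": question},         # changing suffix
        ],
    )
    return response.choices[0].message.content

print(ask("Write 500 words on cats"))
```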

Turn Cost with Caching

We’ve previously discussed the benefits of visible chat costs, and this is a good opportunity to update our application to show cached token usage.

The OpenAI API has long had an option to return the number of input and output tokens used during a call. The latest version expands this to include Cache and Reasoning token information.
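Continuing the sketch above, the expanded usage information can be read from the response object. The detail fields shown here are those exposed by recent versions of the openai Python SDK; the guards are there because older SDK versions (or some models) may not populate them.

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Write 500 words on dogs"},
    ],
)

usage = response.usage
print("input tokens:    ", usage.prompt_tokens)
print("output tokens:   ", usage.completion_tokens)

# Newer detail fields for cached and reasoning tokens.
prompt_details = getattr(usage, "prompt_tokens_details", None)
if prompt_details is not None:
    print("cached tokens:   ", prompt_details.cached_tokens)
completion_details = getattr(usage, "completion_tokens_details", None)
if completion_details is not None:
    print("reasoning tokens:", completion_details.reasoning_tokens)
```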

Starting a new chat session with a long System Prompt (a considerable 14,000 tokens) and asking GPT-4o to write 500 words about cats, then dogs, then mice gives us the following turn costs:

| # | Prompt                  | Cached | Input | Output | Turn Cost |
|---|-------------------------|--------|-------|--------|-----------|
| 1 | Write 500 words on cats | 0      | 14045 | 606    | $0.0412   |
| 2 | Write 500 words on dogs | 14464  | 201   | 606    | $0.0246   |
| 3 | Write 500 words on mice | 15104  | 181   | 641    | $0.0257   |
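These turn costs can be reproduced from the usage counts and the GPT-4o list prices in footnote 3, treating the Input column as the uncached portion of the prompt (which matches the published figures). A quick check:

```python
# GPT-4o list prices per million tokens on 4th October 2024 (see footnote 3).
INPUT, CACHED_INPUT, OUTPUT = 2.50, 1.25, 10.00

def turn_cost(cached: int, uncached: int, output: int) -> float:
    """Cost of one chat turn in dollars."""
    return (cached * CACHED_INPUT + uncached * INPUT + output * OUTPUT) / 1_000_000

print(round(turn_cost(0, 14045, 606), 4))    # turn 1: 0.0412
print(round(turn_cost(14464, 201, 606), 4))  # turn 2: 0.0246
print(round(turn_cost(15104, 181, 641), 4))  # turn 3: 0.0257
```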
OpenAI Usage Dashboard

It works well - each chat turn is cumulatively cached, and the lower pricing is applied. We can confirm this via the OpenAI Usage Dashboard, which now shows a breakdown of how many input tokens were cached. Of course, other chat sessions within the same organisation using the same System Prompt over the time period would also benefit from the caching.

Remember, the caching feature is completely transparent. Our Chat Platform only needed an update to display the new pricing.

Pricing Comparison

Chat applications may not be where we’d expect to see the biggest cost benefits of caching. This is also where direct comparisons against the Anthropic pricing model become a bit more difficult due to the different pricing structures.

In the Anthropic Beta implementation, there are separate prices for the initial Cache Write (+25%) and later Cache Reads (-90%). We’ll use GPT-4o and Claude Sonnet 3.5 for the comparison, as these are competing models at similar price points.

The ratio of cacheable to changing tokens will vary by task, so we’ll compare two scenarios: document summarisation and knowledge base questions.

  1. Document Summarisation. For this scenario we’ll model summarising 10,000 documents. The prompt to create the summary is cacheable, while the document content varies¹.

  2. Knowledge Base Questions. We’ll model asking 10,000 short questions and answers against a comprehensive knowledge base. In this example, the Knowledge Base itself is cacheable, whilst the questions change².

With current pricing³, the cost of running each of the scenarios is shown below:

| Scenario       | Model      | Uncached | Cached  | Saving |
|----------------|------------|----------|---------|--------|
| Summarisation  | GPT-4o     | $662.50  | $618.75 | 6.60%  |
| Summarisation  | Sonnet 3.5 | $855.00  | $760.50 | 11.05% |
| Knowledge Base | GPT-4o     | $775.00  | $400.04 | 48.38% |
| Knowledge Base | Sonnet 3.5 | $936.00  | $125.99 | 86.54% |
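As a rough check of the table above, here is a small cost model built from the scenario assumptions in footnotes 1 and 2 and the list prices in footnote 3. It is a sketch only: it ignores Anthropic’s one-off Cache Write premium, so the cached figures come out marginally different from the table.

```python
# List prices per million tokens on 4th October 2024 (see footnote 3).
PRICES = {
    "GPT-4o":     {"input": 2.50, "output": 10.00, "cache_read": 1.25},
    "Sonnet 3.5": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
}

def scenario_cost(model, calls, cacheable, varying, output, cached=False):
    """Total cost in dollars for `calls` requests with the given token counts."""
    p = PRICES[model]
    cacheable_rate = p["cache_read"] if cached else p["input"]
    per_call = (cacheable * cacheable_rate
                + varying * p["input"]
                + output * p["output"]) / 1e6
    return round(calls * per_call, 2)

# Summarisation: 3,500-token prompt (cacheable), 15,000-token document, 2,000-token summary.
print(scenario_cost("GPT-4o", 10_000, 3_500, 15_000, 2_000))               # 662.50 uncached
print(scenario_cost("GPT-4o", 10_000, 3_500, 15_000, 2_000, cached=True))  # 618.75 cached

# Knowledge Base: 30,000-token knowledge base (cacheable), 200-token question, 200-token answer.
print(scenario_cost("Sonnet 3.5", 10_000, 30_000, 200, 200))               # 936.00 uncached
print(scenario_cost("Sonnet 3.5", 10_000, 30_000, 200, 200, cached=True))  # ~126.00 cached
```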

For cases where real-time responses aren’t needed, OpenAI offer Batch pricing, which gives a 50% discount on Input and Output tokens and returns results within 24 hours.

| Scenario       | Model  | Cached  | Batched | Saving |
|----------------|--------|---------|---------|--------|
| Summarisation  | GPT-4o | $618.75 | $331.25 | 46.46% |
| Knowledge Base | GPT-4o | $400.04 | $387.50 | 3.13%  |
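The batched figures follow directly from the GPT-4o Batch prices in footnote 3 (Input $1.25, Output $5.00 per million tokens), applied to the full input since caching does not apply to batch jobs:

```python
# GPT-4o Batch prices per million tokens: Input $1.25, Output $5.00 (footnote 3).
BATCH_INPUT, BATCH_OUTPUT = 1.25, 5.00

def batch_cost(calls: int, input_tokens: int, output_tokens: int) -> float:
    """Batch cost in dollars; no caching applies, so all input is billed alike."""
    return calls * (input_tokens * BATCH_INPUT + output_tokens * BATCH_OUTPUT) / 1e6

print(batch_cost(10_000, 18_500, 2_000))  # Summarisation:  331.25
print(batch_cost(10_000, 30_200, 200))    # Knowledge Base: 387.50
```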

We’ve compared the batched costs against the cached costs because GPT-4o caching is automatic, so in practice these scenarios would no longer pay the uncached price.

Anthropic’s pricing clearly favours scenarios with large proportions of cacheable content, leading to some very significant savings.

For tasks which don’t require real-time responses, batch pricing is the best choice as it also provides heavily discounted output tokens. Output tokens are typically the most expensive component of an API call, often costing several times more than input tokens, so the 50% reduction leads to substantial savings compared to real-time inference, especially for output-heavy workloads such as detailed summarisation or open-ended question answering.

Implementation Experience

API Changes

Anthropic’s Prompt Caching API is still in Beta, and requires a reasonable amount of developer effort to use. The client application needs to set an HTTP header to enable the feature, and then add “cache_control” indicators into the Messages call to show which parts of the prompt are to be cached. Successful Cache Reads and Writes can be monitored via the API response.
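To give a feel for that developer effort, here is a sketch of a raw request based on Anthropic’s beta documentation at the time of writing. The knowledge_base.txt content is hypothetical and stands in for whatever large, stable prefix is being cached.

```python
import os
import requests

# Hypothetical: the large, stable content we want Anthropic to cache.
KNOWLEDGE_BASE = open("knowledge_base.txt").read()

response = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "anthropic-beta": "prompt-caching-2024-07-31",  # opts in to the beta feature
        "content-type": "application/json",
    },
    json={
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"},  # marks this prefix for caching
            }
        ],
        "messages": [{"role": "user", "content": "What is our refund policy?"}],
    },
)

# Cache Writes and Reads are reported in the usage block of the response.
usage = response.json()["usage"]
print("cache write tokens:", usage.get("cache_creation_input_tokens"))
print("cache read tokens: ", usage.get("cache_read_input_tokens"))
```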

OpenAI’s latest version of their API contains type information for the new “Cached” and “Reasoning” token counters available within the Usage Information. Applications that show or calculate message costs (such as Big-AGI, PromptFoo and our own LLMindset Chat) need to be updated to show the revised pricing structure. This is slightly more effort than simply changing a pricing parameter, as cached scenarios require additional arithmetic to calculate correctly.

Cache Usage and Expiry

Both Anthropic and OpenAI have a minimum cache size of 1024 tokens. Anthropic specify a hard 300-second purge period, meaning that if the cache is not used within that period (by submitting a prompt) it will need to be re-established. Interestingly, OpenAI specify a typical 300-600 seconds, with up to 3600 seconds (one hour) possible during off-peak periods. Neither platform allows the user to manually purge or otherwise control the cache once it is in use.

Summary

The competition between OpenAI and Anthropic continues to the benefit of consumers, as model quality and features improve and pricing decreases.

OpenAI’s pricing strategy has been notably aggressive. Initially priced higher than Sonnet 3.5, GPT-4o saw a significant reduction in August, with output token costs dropping by 33%. Now, the introduction of prompt caching further reduces prices, enhancing affordability. Although in some scenarios the Anthropic cache implementation can offer huge discounts, the development expense required to take advantage of the feature is a barrier that could easily exceed the benefits.

What’s interesting are the divergent routes that the OpenAI and Anthropic Product Teams have taken in launching Prompt Caching. I am sure detailed analysis was conducted on usage patterns and on how to encourage consumer behaviour, and it would seem that OpenAI decided that the back-end efficiency gains from caching all input, combined with a general discount, were the right way to go. Of course, the Anthropic discount on cached tokens is huge - but the pricing mechanism of three input prices (standard, cache-write and cache-read) and the need for API changes looks complex in comparison.

The ball is now in Anthropic’s court to respond to the latest OpenAI releases (on the front-end, the API and Models). Batch Mode pricing is a clear area where Anthropic have a gap, and as Haiku 3.5 and Opus 3.5 are released there will surely be pressure on their Small and Medium Model list pricing.

Model Choice, Inference Options (e.g. Reasoning) and Pricing Options are making the LLM landscape more complex, but increasingly valuable and cost-effective. Feel free to Contact Us for guidance.

Footnotes


  1. Document Summarisation assumptions: Summarisation Prompt: 3,500 tokens, Average Document Length: 15,000 tokens, Expected Output Length: 2,000 tokens. ↩︎

  2. Knowledge Base Questions assumptions: Knowledge Base: 30,000 tokens, Average Question Length: 200 tokens, Expected Output Length: 200 tokens ↩︎

  3. Pricing per million tokens on 4th October 2024:

     | Token Type  | GPT-4o Batch | GPT-4o | Sonnet 3.5 |
     |-------------|--------------|--------|------------|
     | Input       | $1.25        | $2.50  | $3.00      |
     | Output      | $5.00        | $10.00 | $15.00     |
     | Cache Write | -            | $2.50  | $3.75      |
     | Cache Read  | -            | $1.25  | $0.30      |
     ↩︎