Haiku 3.5 Launch Benchmark
Introduction
Anthropic fulfilled their October 2024 launch announcement by making Haiku 3.5 available on November 4th. Haiku models are positioned as the fastest and most cost-effective options in Anthropic's range.
Haiku 3.5 now has the most recent knowledge cut-off of Anthropic’s models (July 2024), and doubles its predecessor’s maximum output length from 4096 to 8192 tokens.
Performance is in a different class from its predecessor's, performing well in the content evaluation task. The list price is 4x that of the original Haiku 3 - though in practice this test cost 5.8x more due to increased output token usage.
Compared with Sonnet 3.5, the high score comes at a lower cost but with less predictability - showing variance similar to OpenAI's beta o1-mini model.
Benchmark Approach
The benchmark measures model capabilities through a simple content evaluation task. Models score a technical article’s suitability (0-10) for two personas - a technical expert and a non-technical manager - without explicit scoring criteria. Human evaluators typically show a 5-6 point difference in scores, reflecting the content’s varying accessibility to different audiences.
This design reveals three key capabilities:
- Task Understanding: Reasoning about audience needs with minimal prompting
- Content Analysis: Evaluating technical depth and maintaining distinct persona perspectives
- Performance Stability: Consistent scoring and human-like differentiation across multiple runs
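As a rough illustration of the setup, the task might be posed with a prompt along these lines. This is a sketch only: the persona names Alice and Bob appear in the results tables below, but the post does not publish its exact wording, so the article placeholder and reply format are my assumptions.

```python
# A hypothetical prompt in the spirit of the benchmark: two personas,
# a 0-10 scale, and deliberately no explicit scoring criteria.
PROMPT_TEMPLATE = """Here is a technical article:

<article>
{article}
</article>

Rate the article's suitability from 0 to 10 for each of these readers:
- Alice, a technical expert
- Bob, a non-technical manager

Reply with two lines in the form:
Alice: <score>
Bob: <score>
"""
```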
Benchmark Scores
The scores below compare Haiku 3.5 with the refreshed Sonnet 3.5 and the original Haiku 3. The same prompt was run 200 times with the system prompt "You are a helpful AI Assistant" and the API default temperature of 1.
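A minimal sketch of how each run might be issued with the Anthropic Python SDK, using the model IDs and settings described here; the article text and the response parsing are my assumptions.

```python
import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
article = open("article.md").read()  # the evaluated article (not published in the post)

def run_once(model: str) -> tuple[float, float, int]:
    """One benchmark run: returns (alice_score, bob_score, output_tokens)."""
    resp = client.messages.create(
        model=model,                     # e.g. "claude-3-5-haiku-20241022"
        max_tokens=8192,                 # Haiku 3.5's raised output ceiling
        temperature=1,                   # API default, as used in the benchmark
        system="You are a helpful AI Assistant",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(article=article)}],
    )
    text = resp.content[0].text
    # Assumes the "Alice: n" / "Bob: n" reply format requested in the prompt sketch.
    alice = float(re.search(r"Alice:\s*([\d.]+)", text).group(1))
    bob = float(re.search(r"Bob:\s*([\d.]+)", text).group(1))
    return alice, bob, resp.usage.output_tokens

runs = [run_once("claude-3-5-haiku-20241022") for _ in range(200)]
```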
Model ID | Name | Score (Alice − Bob) | σ(Score) | Alice | Bob | Output Tokens |
---|---|---|---|---|---|---|
claude-3-5-haiku-20241022 | Haiku 3.5 | 6.13 | 1.14 | 8.82 | 2.69 | 47508 |
claude-3-5-sonnet-20241022 | Sonnet 3.5 | 5.75 | 0.52 | 8.86 | 3.10 | 33639 |
claude-3-haiku-20240307 | Haiku 3 | 1.64 | 0.55 | 8.22 | 6.59 | 21674 |
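The headline figures fall out of simple aggregation over the runs: the Score column is the mean per-run Alice − Bob differentiation and σ(Score) its spread. A sketch continuing from the runs list above (whether the post uses the population or sample standard deviation is not stated):

```python
from statistics import mean, pstdev

alices, bobs, out_tokens = zip(*runs)
diffs = [a - b for a, b in zip(alices, bobs)]

print(f"Score     {mean(diffs):.2f}")    # 6.13 for Haiku 3.5
print(f"σ(Score)  {pstdev(diffs):.2f}")  # 1.14 (population σ assumed here)
print(f"Alice     {mean(alices):.2f}")   # 8.82
print(f"Bob       {mean(bobs):.2f}")     # 2.69
print(f"Tokens    {sum(out_tokens)}")    # 47508 output tokens across all runs
```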
The charts below show the distribution of scores for each model.
The persona scoring for Alice (good content match) is remarkably consistent between Haiku 3.5 and Sonnet 3.5, with Bob (poor content match) being the main cause of differentiation.
Haiku 3.5 is notably more verbose than both Sonnet 3.5 and its predecessor. This shows in the price comparison, where Haiku 3.5 looks more like a discounted Sonnet 3.5 than a price-increased Haiku 3.
Model | Input Cost | Output Cost | Total Cost | Input Tokens | Output Tokens |
---|---|---|---|---|---|
Haiku 3.5 | $0.1720 | $0.2375 | $0.4095 | 172000 | 47508 |
Sonnet 3.5 | $0.5160 | $0.5046 | $1.0206 | 172000 | 33639 |
Haiku 3 | $0.0430 | $0.0271 | $0.0701 | 172000 | 21674 |
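Each figure in the table is straight per-million-token arithmetic. A minimal sketch that reproduces the totals, using the list prices the table implies ($1/$5 for Haiku 3.5 per the conclusion below, $3/$15 for Sonnet 3.5, $0.25/$1.25 for Haiku 3):

```python
# (input $/Mtok, output $/Mtok) list prices implied by the cost table
PRICES = {
    "Haiku 3.5": (1.00, 5.00),
    "Sonnet 3.5": (3.00, 15.00),
    "Haiku 3": (0.25, 1.25),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

print(f"${cost('Haiku 3.5', 172_000, 47_508):.4f}")   # $0.4095
print(f"${cost('Sonnet 3.5', 172_000, 33_639):.4f}")  # $1.0206
print(f"${cost('Haiku 3', 172_000, 21_674):.4f}")     # $0.0701
```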
Conclusion
Haiku 3.5 presents a new price/performance option that appears to compete more with OpenAI's GPT-4o than with its own predecessor, priced at $1.00/$5.00 per million input/output tokens against GPT-4o's $2.50/$10.00. The launch came with impressive industry benchmark scores that have been used to justify the price.
Anthropic do seem to be constraining themselves with their naming scheme: the updated Sonnet 3.5 justified a new review, and there is now space in the range for a lower-priced model that would normally have taken the "Haiku" moniker.
The lack of vision processing at launch is a disappointment - especially given the potential for lowering the cost of the token-hungry Computer Use features. The high performance variance is intriguing, and I'll certainly be exploring whether it is a benefit for creative tasks.
Still, for workloads that can take advantage of combined caching and batching, the new Haiku 3.5 offers incredible price performance once those discounts are taken into account.
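For a sense of scale, here is a back-of-the-envelope sketch of those stacked discounts, assuming Anthropic's published Batch API discount (50% off) and prompt-caching read pricing (cached input billed at roughly 10% of the base rate). Actual savings depend on cache-write overheads and hit rates, which this deliberately optimistic sketch ignores:

```python
# Rough discount arithmetic for this test's Haiku 3.5 usage.
# Assumptions: 50% batch discount on all tokens, cached input reads at
# 10% of the base input price, and a 100% cache hit rate on input.
IN_TOKENS, OUT_TOKENS = 172_000, 47_508
IN_PRICE, OUT_PRICE = 1.00, 5.00   # $/Mtok list prices for Haiku 3.5

base = IN_TOKENS / 1e6 * IN_PRICE + OUT_TOKENS / 1e6 * OUT_PRICE
batched = base * 0.5
batched_and_cached = (IN_TOKENS / 1e6 * IN_PRICE * 0.1
                      + OUT_TOKENS / 1e6 * OUT_PRICE) * 0.5

print(f"list:           ${base:.4f}")                # ≈ $0.4095
print(f"batched:        ${batched:.4f}")             # ≈ $0.2048
print(f"batched+cached: ${batched_and_cached:.4f}")  # ≈ $0.1274
```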