Sonnet 3.5 - Latest Model Benchmark
Introduction
On June 20, 2024, Anthropic released Claude 3.5 Sonnet, promising “frontier intelligence at 2x the speed” of Claude 3 Opus, along with gains across various benchmarks, including graduate-level reasoning, undergraduate-level knowledge, and coding proficiency. But how does it fare in real-world, casual prompting scenarios? We’ll compare Sonnet 3.5 to its predecessors, Sonnet 3 and Opus 3, using our earlier Casual Prompting Benchmark.
The Benchmark: Casual Prompting Performance
As detailed in our previous article, the casual prompting benchmark simulates real-world usage by casual, interactive users. It involves scoring the suitability of a technical article for two personas: Alice (a technical expert) and Bob (a non-technical business user). This test helps us understand how well the models interpret and respond to everyday queries without extensive prompt tuning.
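To make the setup concrete, here is a minimal sketch of what a single scoring run could look like using the Anthropic Python SDK. The persona descriptions, article text, and prompt wording below are illustrative assumptions, not the exact prompts used in the benchmark.

```python
# Minimal sketch of one scoring run, assuming the Anthropic Python SDK.
# The persona descriptions and prompt wording are illustrative assumptions,
# not the exact prompts used in the benchmark.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ARTICLE = "..."  # the technical article being scored (omitted here)

PERSONAS = {
    "Alice": "a technical expert with a strong engineering background",
    "Bob": "a non-technical business user",
}

def score_article(model: str, system_prompt: str | None = None) -> dict[str, str]:
    """Ask the model to rate the article's suitability (1-10) for each persona."""
    responses = {}
    for name, description in PERSONAS.items():
        kwargs = {"system": system_prompt} if system_prompt else {}
        message = client.messages.create(
            model=model,
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": (
                    f"On a scale of 1 to 10, how suitable is the following article "
                    f"for {name}, {description}? Give a short justification, then "
                    f"the score.\n\n{ARTICLE}"
                ),
            }],
            **kwargs,
        )
        responses[name] = message.content[0].text
    return responses

# "Empty" scenario: no system prompt. "Helpful" scenario: a generic system
# prompt (assumed here to be something like "You are a helpful assistant.").
empty_run = score_article("claude-3-5-sonnet-20240620")
helpful_run = score_article("claude-3-5-sonnet-20240620",
                            system_prompt="You are a helpful assistant.")
```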
Results: Sonnet 3.5
Here are the Sonnet 3.5 results compared with its immediate predecessor, Sonnet 3, as well as the still-flagship Opus 3.
Score Differences - All Scenarios
Model & Prompt | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Opus 3 (Empty) | 0 | 0 | 0 | 46 | 4 | 0 |
Opus 3 (Helpful) | 0 | 0 | 0 | 37 | 12 | 1 |
Sonnet 3.5 (Empty) | 0 | 0 | 33 | 17 | 0 | 0 |
Sonnet 3.5 (Helpful) | 0 | 0 | 16 | 33 | 1 | 0 |
Sonnet 3 (Empty) | 0 | 0 | 36 | 14 | 0 | 0 |
Sonnet 3 (Helpful) | 0 | 1 | 41 | 8 | 0 | 0 |
Score Differentiation
The table above shows the frequency of score differences (Alice’s score minus Bob’s score) across 50 runs for each model and prompt scenario. A higher score difference indicates better differentiation between the technical and non-technical personas. The table below shows the averages across the runs.
Model | Mean (Empty) | Mean (Helpful) | Change |
---|---|---|---|
Opus 3 | 4.08 | 4.28 | ▲ +0.20 |
Sonnet 3.5 | 3.34 | 3.70 | ▲ +0.36 |
Sonnet 3 | 3.28 | 3.14 | ▼ -0.14 |
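These averages follow directly from the frequency distributions. As a quick sanity check, here is a short sketch that recomputes them from the counts in the first table (values copied from above):

```python
# Recompute the mean score difference from the frequency distribution above.
# Keys are score differences (Alice minus Bob); values are counts out of 50 runs.
frequencies = {
    "Opus 3 (Empty)":       {4: 46, 5: 4},
    "Opus 3 (Helpful)":     {4: 37, 5: 12, 6: 1},
    "Sonnet 3.5 (Empty)":   {3: 33, 4: 17},
    "Sonnet 3.5 (Helpful)": {3: 16, 4: 33, 5: 1},
    "Sonnet 3 (Empty)":     {3: 36, 4: 14},
    "Sonnet 3 (Helpful)":   {2: 1, 3: 41, 4: 8},
}

for scenario, counts in frequencies.items():
    runs = sum(counts.values())  # 50 runs per scenario
    mean = sum(diff * n for diff, n in counts.items()) / runs
    print(f"{scenario:24s} mean difference = {mean:.2f}")

# e.g. Sonnet 3.5 (Empty): (3*33 + 4*17) / 50 = 3.34, matching the table above.
```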
We can see that:
- Sonnet 3.5 shows improved differentiation compared to Sonnet 3, with a higher proportion of 4-point differences.
- Opus 3 still leads in differentiation, consistently producing 4 and 5-point differences.
- Sonnet 3.5 shows the highest sensitivity among the tested models to the introduction of a System Prompt.
Persona Scores
Here are the individual persona scores from the models:
Model | Sys Prompt | Alice Score | Bob Score | Difference |
---|---|---|---|---|
Opus 3 | Empty | 8.00 | 3.92 | 4.08 |
Opus 3 | Helpful | ▲ 8.04 | ▼ 3.76 | ▲ 4.28 |
Sonnet 3.5 | Empty | 8.00 | 4.66 | 3.34 |
Sonnet 3.5 | Helpful | ▲ 8.04 | ▼ 4.34 | ▲ 3.70 |
Sonnet 3 | Empty | 8.02 | 4.74 | 3.28 |
Sonnet 3 | Helpful | ▲ 8.08 | ▲ 4.94 | ▼ 3.14 |
Scoring for Alice is consistent across models and regardless of system prompt, with Opus 3 and Sonnet 3.5 scoring identically in this test. The improved differentiation comes from a more critical assessment of Bob’s score. With a system prompt, the differentiation gap between Sonnet 3 and Sonnet 3.5 is 0.56 points (3.14 vs 3.70) - a substantial margin in this test.
Token Usage and Efficiency
Finally, let’s examine the token usage patterns and costs for each model:
Completion Tokens - All Scenarios
Model & Prompt | 0-50 | 51-100 | 101-150 | 151-200 | 201-250 | 251+ |
---|---|---|---|---|---|---|
Opus 3 (Empty) | 0 | 0 | 7 | 38 | 5 | 0 |
Opus 3 (Helpful) | 1 | 0 | 8 | 34 | 6 | 1 |
Sonnet 3.5 (Empty) | 0 | 0 | 7 | 37 | 5 | 1 |
Sonnet 3.5 (Helpful) | 0 | 0 | 0 | 16 | 32 | 2 |
Sonnet 3 (Empty) | 37 | 3 | 3 | 5 | 1 | 1 |
Sonnet 3 (Helpful) | 37 | 4 | 3 | 4 | 2 | 0 |
Sonnet 3.5 is now by far the most verbose model. With a System Prompt, Sonnet 3.5 uses roughly 70% more output tokens than Opus 3 on this task (15,047 vs 8,926 across the 50 runs)!
Testing Costs
Model | Sys Prompt | Output Tokens | Input Cost | Output Cost | Total Cost |
---|---|---|---|---|---|
Opus 3 | Empty | 9,076 | $0.611 | $0.681 | $1.292 |
Opus 3 | Helpful | 8,926 | $0.616 | $0.669 | $1.285 |
Sonnet 3.5 | Empty | 9,571 | $0.122 | $0.144 | $0.266 |
Sonnet 3.5 | Helpful | 15,047 | $0.123 | $0.226 | $0.349 |
Sonnet 3 | Empty | 3,938 | $0.122 | $0.059 | $0.181 |
Sonnet 3 | Helpful | 2,818 | $0.123 | $0.042 | $0.165 |
Sonnet 3.5 remains much cheaper than Opus 3.
However, compared with Sonnet 3, the system-prompt scenario costs more than twice as much. While the increased verbosity may contribute to the improved differentiation performance, it is an important factor to consider when evaluating or migrating to the model.
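For completeness, the output-cost and total-cost columns can be reproduced from the completion-token counts and the published per-million-token output prices (Opus 3: $75/MTok; Sonnet 3 and 3.5: $15/MTok). A quick sketch, reusing the reported input costs as-is since input token counts aren’t listed above:

```python
# Reproduce the output-cost and total-cost columns from the completion tokens
# and the published per-million-token output prices.
OUTPUT_PRICE_PER_MTOK = {"Opus 3": 75.0, "Sonnet 3.5": 15.0, "Sonnet 3": 15.0}

runs = [
    # (model, system prompt, output tokens, reported input cost)
    ("Opus 3",     "Empty",   9_076,  0.611),
    ("Opus 3",     "Helpful", 8_926,  0.616),
    ("Sonnet 3.5", "Empty",   9_571,  0.122),
    ("Sonnet 3.5", "Helpful", 15_047, 0.123),
    ("Sonnet 3",   "Empty",   3_938,  0.122),
    ("Sonnet 3",   "Helpful", 2_818,  0.123),
]

for model, prompt, out_tokens, input_cost in runs:
    output_cost = out_tokens * OUTPUT_PRICE_PER_MTOK[model] / 1_000_000
    total = input_cost + output_cost
    print(f"{model:10s} ({prompt:7s}): output ${output_cost:.3f}, total ${total:.3f}")

# e.g. Sonnet 3.5 (Helpful): 15,047 * $15 / 1e6 = $0.226 output, $0.349 total.
```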
Summary and Implications
The performance improvements in Sonnet 3.5 have several real-world implications:
- Enhanced Differentiation: The model’s ability to better distinguish between technical and non-technical personas suggests improved performance in tasks requiring a nuanced understanding of user backgrounds and needs.
- Responsiveness to System Prompts: Sonnet 3.5 shows greater sensitivity to system prompts, potentially allowing for more fine-tuned control over its behavior.
- Cost-Performance Trade-off: Although Sonnet 3.5 offers improved performance, its increased token usage leads to higher output costs than the earlier version. For tasks requiring higher accuracy and nuance, however, this trade-off is likely to be worthwhile.
- Bridging the Gap: Sonnet 3.5 narrows the gap to Opus 3, offering much improved performance at the same per-token price as Sonnet 3.
- Potential for Complex Tasks: The improved performance in this casual prompting scenario suggests that Sonnet 3.5 may excel at more complex tasks such as context-sensitive customer support and orchestrating multi-step workflows, as claimed in its release notes.
Conclusion
Sonnet 3.5 demonstrates clear improvements over its predecessor in our casual prompting benchmark. It shows better differentiation between personas and increased responsiveness to system prompts, bringing its performance closer to that of Claude 3 Opus.
However, these improvements come with increased token usage, which may affect cost considerations in real-world applications. The choice between Sonnet 3.5 and its counterparts will depend on the specific requirements of each use case, balancing the need for nuanced understanding against budget constraints.