Sonnet 3.5 - Latest Model Benchmark

July 1, 2024 · Shaun Smith

Introduction

On June 20, 2024, Anthropic released Claude 3.5 Sonnet, promising “frontier intelligence at 2x the speed” of Claude 3 Opus. The release notes claim improved performance across a range of benchmarks, including graduate-level reasoning, undergraduate-level knowledge, and coding proficiency. But how does it fare in real-world, casual prompting scenarios? We’ll compare Sonnet 3.5 to its predecessors, Sonnet 3 and Opus 3, using our earlier Casual Prompting Benchmark.

The Benchmark: Casual Prompting Performance

As detailed in our previous article, the casual prompting benchmark simulates real-world usage by casual, interactive users. It involves scoring the suitability of a technical article for two personas: Alice (a technical expert) and Bob (a non-technical business user). This test helps us understand how well the models interpret and respond to everyday queries without extensive prompt tuning.
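To make the setup concrete, here is a minimal sketch of what a single scoring call could look like with the Anthropic Python SDK. The prompt wording, the 1-10 scale, and the "helpful assistant" system prompt are assumptions for illustration, not the benchmark's exact implementation.

```python
# Minimal sketch of one benchmark scoring call (illustrative only).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PERSONAS = {
    "Alice": "a technical expert",
    "Bob": "a non-technical business user",
}

def score_article(article: str, persona: str, model: str, system: str | None = None) -> str:
    """Ask the model how suitable the article is for one persona."""
    kwargs = {"system": system} if system else {}  # "Empty" runs omit the system prompt
    response = client.messages.create(
        model=model,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                f"On a scale of 1 to 10, how suitable is this article for "
                f"{persona}? Give a score and a brief justification.\n\n{article}"
            ),
        }],
        **kwargs,
    )
    return response.content[0].text

# e.g. score_article(article_text, PERSONAS["Bob"], "claude-3-5-sonnet-20240620",
#                    system="You are a helpful assistant.")
```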

Results: Sonnet 3.5

Here are the Sonnet 3.5 results compared with its immediate predecessor, Sonnet 3, as well as the still-flagship Opus 3.

The results below compare each model with and without a system prompt:
Score Differences - All Scenarios (count of runs at each score difference)
Model (Sys Prompt)       1    2    3    4    5    6
Opus 3 (Empty)           0    0    0   46    4    0
Opus 3 (Helpful)         0    0    0   37   12    1
Sonnet 3.5 (Empty)       0    0   33   17    0    0
Sonnet 3.5 (Helpful)     0    0   16   33    1    0
Sonnet 3 (Empty)         0    0   36   14    0    0
Sonnet 3 (Helpful)       0    1   41    8    0    0

Score Differentiation

The table above shows the frequency of score differences (Alice’s score minus Bob’s score) across 50 runs for each model. A higher score difference indicates better differentiation between the technical and non-technical personas. The table below shows the averages across the runs.

Model        Mean (Empty)   Mean (Helpful)   Change
Opus 3       4.08           4.28             +0.20
Sonnet 3.5   3.34           3.70             +0.36
Sonnet 3     3.28           3.14             -0.14

We can see that:

  • Sonnet 3.5 shows improved differentiation compared to Sonnet 3, with a higher proportion of 4-point differences.
  • Opus 3 still leads in differentiation, consistently producing 4 and 5-point differences.
  • Sonnet 3.5 shows the highest sensitivity among the tested models to the introduction of a System Prompt.
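The aggregates above are straightforward to reproduce from the raw per-run scores. Here's a minimal sketch, assuming each run is recorded as an (alice_score, bob_score) pair:

```python
from collections import Counter
from statistics import mean

def summarize(runs: list[tuple[float, float]]) -> tuple[Counter, float]:
    """Return the frequency of each score difference and the mean difference."""
    diffs = [alice - bob for alice, bob in runs]
    return Counter(round(d) for d in diffs), mean(diffs)

# Example with three hypothetical runs:
freq, avg = summarize([(8, 4), (8, 5), (8, 4)])
print(freq)            # Counter({4: 2, 3: 1})
print(round(avg, 2))   # 3.67
```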

Persona Scores

Here are the individual persona scores from the models:

Model        Sys Prompt   Alice Score   Bob Score   Difference
Opus 3       Empty        8.00          3.92        4.08
Opus 3       Helpful      8.04          3.76        4.28
Sonnet 3.5   Empty        8.00          4.66        3.34
Sonnet 3.5   Helpful      8.04          4.34        3.70
Sonnet 3     Empty        8.02          4.74        3.28
Sonnet 3     Helpful      8.08          4.94        3.14

Scoring for Alice is consistent across models and system prompt presence, with Opus 3 and Sonnet 3.5 scoring identically in this test. The improvement comes from a more critical assessment of Bob’s score. With a system prompt, the difference between Sonnet 3 and 3.5 is 0.56 points - a large differential in this test.

Token Usage and Efficiency

Finally, let’s examine the token usage patterns and costs for each model:

Completion Tokens - All Scenarios (count of runs in each output-token range)
Model (Sys Prompt)       0-50   51-100   101-150   151-200   201-250   251+
Opus 3 (Empty)              0        0         7        38         5      0
Opus 3 (Helpful)            1        0         8        34         6      1
Sonnet 3.5 (Empty)          0        0         7        37         5      1
Sonnet 3.5 (Helpful)        0        0         0        16        32      2
Sonnet 3 (Empty)           37        3         3         5         1      1
Sonnet 3 (Helpful)         37        4         3         4         2      0

Sonnet 3.5 is now by far the most verbose model. With a System Prompt, Sonnet 3.5 uses 70% more tokens than Opus 3 in this task!
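These counts come straight from the usage field returned with each API response. A minimal sketch of the bucketing, assuming the per-run Message objects have been collected into a list:

```python
from collections import Counter

# Completion-token ranges matching the table above.
BUCKETS = [(0, 50), (51, 100), (101, 150), (151, 200), (201, 250)]

def bucket_output_tokens(responses) -> Counter:
    """Count how many responses fall into each completion-token range.
    Each response is a Message returned by client.messages.create()."""
    counts = Counter()
    for r in responses:
        tokens = r.usage.output_tokens
        label = next((f"{lo}-{hi}" for lo, hi in BUCKETS if lo <= tokens <= hi), "251+")
        counts[label] += 1
    return counts
```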

Testing Costs

Model        Sys Prompt   Output Tokens   Input Cost   Output Cost   Total Cost
Opus 3       Empty        9,076           $0.611       $0.681        $1.292
Opus 3       Helpful      8,926           $0.616       $0.669        $1.285
Sonnet 3.5   Empty        9,571           $0.122       $0.144        $0.266
Sonnet 3.5   Helpful      15,047          $0.123       $0.226        $0.349
Sonnet 3     Empty        3,938           $0.122       $0.059        $0.181
Sonnet 3     Helpful      2,818           $0.123       $0.042        $0.165

Sonnet 3.5 remains much cheaper than Opus 3.

However, compared with Sonnet 3, the run with a system prompt costs more than double. While the increased verbosity may contribute to the improved differentiation, it’s an important factor to consider when evaluating or migrating to the model.
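For budgeting purposes, the costs above can be reproduced from the token counts at Anthropic's list pricing at the time of writing ($15/$75 per million input/output tokens for Opus 3, $3/$15 for both Sonnet models). A minimal sketch:

```python
# Per-million-token prices in USD (Anthropic list pricing, July 2024).
PRICING = {
    "claude-3-opus":     {"input": 15.0, "output": 75.0},
    "claude-3-5-sonnet": {"input": 3.0,  "output": 15.0},
    "claude-3-sonnet":   {"input": 3.0,  "output": 15.0},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for one set of test runs."""
    price = PRICING[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Sonnet 3.5 with the 'Helpful' system prompt; ~41,000 input tokens is an
# estimate implied by the $0.123 input cost reported above.
print(f"${run_cost('claude-3-5-sonnet', 41_000, 15_047):.3f}")  # ~$0.349
```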

Summary and Implications

The performance improvements in Sonnet 3.5 have several real-world implications:

Enhanced Differentiation: The model’s ability to better distinguish between technical and non-technical personas suggests improved performance in tasks requiring nuanced understanding of user backgrounds and needs.

Responsiveness to System Prompts: Sonnet 3.5 shows greater sensitivity to system prompts, potentially allowing for more fine-tuned control over its behavior.

Cost-Performance Trade-off: Although Sonnet 3.5 offers improved performance, its increased token usage leads to higher costs than Sonnet 3 - roughly double in this test. For tasks requiring higher accuracy and nuance, that trade-off may well be worthwhile.

Bridging the Gap: Sonnet 3.5 narrows the gap to Opus 3, offering much improved performance at the same per-token price as the legacy model.

Potential for Complex Tasks: The improved performance in this casual prompting scenario suggests that Sonnet 3.5 may excel in more complex tasks such as context-sensitive customer support and orchestrating multi-step workflows, as claimed in its release notes.

Conclusion

Sonnet 3.5 demonstrates clear improvements over its predecessor in our casual prompting benchmark. It shows better differentiation between personas and increased responsiveness to system prompts, bringing its performance closer to that of Claude 3 Opus.

However, these improvements come with increased token usage, which may affect cost considerations in real-world applications. The choice between Sonnet 3.5 and its counterparts will depend on the specific requirements of each use case, balancing the need for nuanced understanding against budget constraints.