Sonnet 3.5 Refresh Benchmark

October 24, 2024 · Shaun Smith

Introduction

Sonnet 3.5 Upgrade

Anthropic has released a significantly upgraded version of Sonnet 3.5, including an experimental “Computer Use” feature enabling Claude to operate computers directly via both Command Line and GUI.

While the version number and knowledge cutoff date remain the same, testing reveals improved Code Generation quality and measurably stronger reasoning capabilities.

In this article, we’ll compare the performance of the old and new models, run a Content Analysis benchmark and look at some of the under-the-hood changes.

Artifact Code Generation

While writing the Claude Artifacts Article last week, I used “Create an Asteroids game” as a simple prompt to have Claude produce a version of the well-known game.

Multiple generations of the code were produced against the old model, allowing direct comparisons between versions.

First, a couple of examples produced using the old Sonnet 3.5:

Then the exact same prompt with the Upgraded Sonnet 3.5:

We immediately see a huge increase in the quality of the produced games. The Upgraded Sonnet 3.5 regularly produces games containing features that the old version did not - for example:

  • Collision Detection
  • Restart on Game Over
  • Irregular shaped Asteroids
  • Disintegrating/Splitting Asteroids
  • Ship Thruster Animation

There are also far fewer defects with acceleration or rotation of the spaceship. These examples were produced with the same prompt, in the same way, just one week apart.

The difference in output quality is breathtaking, and reinforces how much keeping the internal version number unchanged understates the changes within.

Content Analysis Benchmark

For this benchmark, I am refreshing all of the Claude benchmark results. The benchmark is run via the API with a system prompt of “You are a helpful AI assistant” and the default temperature of 1.

The content analysis benchmark evaluates how models assess a news article from two different perspectives, represented by personas Alice and Bob. For the article, the model assigns a suitability score (0-10) for both personas. While individual scores are subjective, the benchmark measures the model’s ability to reason from different viewpoints.

The difference between Alice and Bob’s scores is particularly meaningful - differences closer to 5 points indicate strong performance, showing the model can effectively distinguish between distinct audience perspectives. Full details are available in this article.
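
For reference, here is a minimal sketch of how each scoring run can be issued - assuming the Anthropic Python SDK, with the scoring prompt wording and `score_article` helper as placeholders of my own rather than the actual benchmark harness:

```python
import anthropic

client = anthropic.Anthropic()  # API key read from ANTHROPIC_API_KEY

MODELS = [
    "claude-3-5-sonnet-20241022",  # Sonnet 3.5 (Refresh)
    "claude-3-5-sonnet-20240620",  # Sonnet 3.5 (original)
    "claude-3-opus-20240229",      # Opus 3
]

def score_article(model: str, article: str) -> str:
    """One of the 200 runs: ask the model to score the article for both personas."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system="You are a helpful AI assistant",  # system prompt used in the benchmark
        temperature=1,                            # default temperature, as stated above
        messages=[{
            "role": "user",
            "content": (
                "Rate this article's suitability (0-10) for Alice and for Bob:\n\n"
                + article
            ),
        }],
    )
    return response.content[0].text
```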

Here are the results - scores are the average of 200 runs for each model:

Model ID                     Name                   Output Tokens  Alice  Bob   Diff
claude-3-5-sonnet-20241022   Sonnet 3.5 (Refresh)   33639          8.86   3.10  5.76
claude-3-5-sonnet-20240620   Sonnet 3.5 ¹           47051          8.98   4.00  4.98
claude-3-opus-20240229       Opus 3 ²               3600           8.08   3.53  4.55

The refreshed Sonnet 3.5 shows the greatest ability to differentiate the content between the personas. It is also notably less verbose than its predecessor, producing fewer output tokens across the 200 runs.

All three models are sensitive to the System Prompt, showing an improvement when it is present.

Claude.ai and API Changes

In use, the model appears faster and more responsive in chat interfaces, and the benchmarks completed in about 2/3 of the time using the refreshed model compared with the old one. The old version of Sonnet 3.5 was removed from Claude.ai on the day of release, but is still available through the API.

The Claude.ai Sonnet 3.5 System Prompt has also been updated alongside the launch. This is the third published version, with each update roughly doubling in length!

Date          System Prompt Length
July 12 2024  5797 characters
Sept 9 2024   10888 characters
Oct 22 2024   24884 characters

Additional changes have been made in the API to support the new “Computer Use” capability. The following tools are now built into the model and available to use via the API:

Tool Type             Purpose
computer_20241022     Interpret screenshots, direct user-interface interaction events
bash_20241022         Run shell commands and read their results
text_editor_20241022  View and modify textual content
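
As a sketch of what this looks like in practice - the beta flag, tool types and required parameters follow Anthropic's published Computer Use documentation, while the prompt and display dimensions are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

# Request the three built-in tools via the Computer Use beta.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[
        {
            "type": "computer_20241022",  # screenshot interpretation / UI events
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        },
        {"type": "bash_20241022", "name": "bash"},
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
    ],
    messages=[{"role": "user", "content": "Take a screenshot of the current screen."}],
)

# When Claude wants to act, stop_reason is "tool_use"; the client must then
# execute the requested action and return the result in a follow-up message.
print(response.stop_reason)
```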

One notable change when interacting with the new version of Claude is that suggested changes to text are now sometimes supplied in “diff” format - specifying which lines to remove and which to add. Claude.ai now also uses this technique to modify code in Artifacts rather than reproducing it in full. I expect this powerful change is related to the new text editor tool now built into the model.

[Animation: Artifact diff in-line editing in Claude.ai]
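
To illustrate the mechanism - this is an assumed example shaped after the documented `str_replace` command of the text editor tool, not captured output:

```python
# Illustrative tool_use input for the text editor tool: rather than rewriting
# the whole file, the model specifies an exact string to remove and its
# replacement. The file path and code snippets here are hypothetical.
edit_request = {
    "command": "str_replace",
    "path": "/repo/asteroids.js",
    "old_str": "ship.rotation += 0.1;",
    "new_str": "ship.rotation += ROTATION_SPEED * delta;",
}
```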

Conclusion

The refreshed Sonnet 3.5 is noticeably different from its predecessor despite carrying an identical name and April ’24 knowledge cut-off.

[Screenshot: Sonnet 3.5 (New) in the Claude.ai chatbox]

First impressions are that the model feels faster, produces visibly better code and demonstrates measurably improved reasoning whilst incorporating the new Computer Use capabilities.

While Haiku 3.5 was announced, it is not yet available - but expected by the end of October ‘24. This model promises to be particularly interesting as Anthropic are claiming it will outperform the respected Opus 3 model at a number of tasks. Additionally, given the current expense of operating “Computer Use” through screenshot interactions, I anticipate that it will incorporate the necessary tools - opening the possibility to drive the desktop interface with a cheaper model and leaving orchestration to the more capable Sonnet 3.5.

Finally, there was no mention of Opus 3.5 - and it has now disappeared from the Models page. It remains to be seen if this is a permanent shift in strategy, or whether Anthropic believe there is still space for a “Large” model given their remarkable improvements to Sonnet 3.5.


Footnotes


  1. In this run, the old Sonnet 3.5 model produced significantly better scores than when first benchmarked. ↩︎

  2. Opus 3 produced no outputs other than the scores - different from earlier runs using an identical method. ↩︎