Claude 3 Suite - Casual Prompting Performance

June 23, 2024
Shaun Smith

Introduction

Let’s find out how the Anthropic Claude 3 Models handle a Casual Prompt.

This is a prompt that a well-informed, casual user of an AI front-end might reasonably use, expecting it to be interpreted as another human would. These types of prompts are typically used for tasks such as market analysis, product concepts and summarization.

Understanding how LLMs handle this type of prompt is crucial for assessing their real-world applicability and reliability.

Our prompt will introduce two personas, Alice and Bob, provide an article, and ask the model to produce a “suitability score” out of 10 for each persona.

Persona Prompt

Alice is an extremely technical person working in Internet Security, who understands all of the latest industry terms, prefers not to have extraneous explanations. Alice likes a very plain, direct style of writing that gets straight to the point with facts, logic and clearly stated arguments.

Bob is a non-technical business user, who works as a marketing manager in a large, nationwide retail company. He likes concepts broken down with simple analogies and explanations, and enjoys coherent narratives that explain the reasoning behind decisions. Bobs goal is always to understand the Big Picture, and to leave all of the detail to deeply technical people.

We’ll supply a dry, technical and factual article on the topic of Internet Security Regulation.

Looking at the personas and the article, I’d guess a suitability score of between 7 and 8 for Alice, and between 3 and 4 for Bob. It’s not perfect for Alice, but neither is it completely useless for Bob.

I’d expect that:

  • The score for Alice should always be higher than for Bob.
  • The difference between the scores should be somewhere between 3 and 5.

We’ll ask the LLM to score the article for Alice and Bob, and return the scores in an easy-to-extract format:

Scoring Prompt

Score the suitability of the article for both Alice and Bob.

Use a scale between 1 and 10, with 1 being unsuitable and 10 being perfectly suitable.

Report the scores in this format:

ALICE_SCORE=<ALICE_OVERALL_SCORE>
BOB_SCORE=<BOB_OVERALL_SCORE>
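
Because the scores come back on fixed-format lines, they can be pulled out with a simple pattern match. Here is a minimal sketch of that extraction step (the function name and regular expressions are illustrative, not part of the test harness):

Score Extraction Sketch

import re

def extract_scores(response_text: str) -> tuple[int, int] | None:
    """Pull ALICE_SCORE and BOB_SCORE out of a model response, if present."""
    alice = re.search(r"ALICE_SCORE\s*=\s*(\d+)", response_text)
    bob = re.search(r"BOB_SCORE\s*=\s*(\d+)", response_text)
    if alice and bob:
        return int(alice.group(1)), int(bob.group(1))
    return None  # the model did not follow the requested format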

The final prompt for testing therefore looks like this:

Complete Prompt

Alice is an extremely technical person working in Internet Security, who understands all of the latest industry terms, prefers not to have extraneous explanations. Alice likes a very plain, direct style of writing that gets straight to the point with facts, logic and clearly stated arguments.

Bob is a non-technical business user, who works as a marketing manager in a large, nationwide retail company. He likes concepts broken down with simple analogies and explanations, and enjoys coherent narratives that explain the reasoning behind decisions. Bobs goal is always to understand the Big Picture, and to leave all of the detail to deeply technical people.

<ARTICLE>
[ARTICLE TEXT]
</ARTICLE>

Score the suitability of the article for both Alice and Bob.

Use a scale between 1 and 10, with 1 being unsuitable and 10 being perfectly suitable.

Report the scores in this format:

ALICE_SCORE=<ALICE_OVERALL_SCORE>
BOB_SCORE=<BOB_OVERALL_SCORE>
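
In code, that complete prompt might be assembled along these lines (a rough sketch; the variable and function names are illustrative, and the persona text is abbreviated):

Prompt Assembly Sketch

PERSONAS = """Alice is an extremely technical person working in Internet Security, ...

Bob is a non-technical business user, who works as a marketing manager ..."""

SCORING_INSTRUCTIONS = """Score the suitability of the article for both Alice and Bob.

Use a scale between 1 and 10, with 1 being unsuitable and 10 being perfectly suitable.

Report the scores in this format:

ALICE_SCORE=<ALICE_OVERALL_SCORE>
BOB_SCORE=<BOB_OVERALL_SCORE>"""

def build_prompt(article_text: str) -> str:
    """Assemble the complete prompt: personas, tagged article, then scoring instructions."""
    return f"{PERSONAS}\n\n<ARTICLE>\n{article_text}\n</ARTICLE>\n\n{SCORING_INSTRUCTIONS}"

COMPLETE_PROMPT = build_prompt("[ARTICLE TEXT]")  # substitute the real article text here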

With our prompt defined, let’s explore how different LLMs interpret and respond to this casual everyday query.

ℹ️
Remember our goal is NOT optimizing this prompt - but benchmarking how different models perform against it.

Initial Testing

For our first test, we’ll compare the performance of the most recent Anthropic suite of models: Haiku 3, Sonnet 3 and Opus 3. The details, including prices at the time of testing (June 2024), are below:

Model Name   Release Date   Input Price ($/M tokens)   Output Price ($/M tokens)
Haiku 3      20240307       $0.25                      $1.25
Sonnet 3     20240229       $3.00                      $15.00
Opus 3       20240229       $15.00                     $75.00

Note that Opus is exactly 60 times more expensive than Haiku 3 for both input and output.

Setup

To start, we will run with default settings (temperature set to 1) and an empty system prompt. For each model, we’ll run the prompt 50 times, and calculate a “Score Difference” by subtracting Bob’s score from Alice’s score.

There is no “correct” score for this test, but we expect all score differences to be positive, with differences in the expected 3-5 range best matching the intent of the prompt.
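
As a rough sketch of what each run looks like using the Anthropic Python SDK (assuming the COMPLETE_PROMPT string assembled in the earlier sketch; the helper name is illustrative):

Run Loop Sketch

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_once(model: str, prompt: str, system: str | None = None) -> dict:
    """Send the casual prompt once with default sampling (temperature 1)."""
    kwargs = dict(
        model=model,
        max_tokens=1024,
        temperature=1,
        messages=[{"role": "user", "content": prompt}],
    )
    if system:  # omit the field entirely for the "No System Prompt" condition
        kwargs["system"] = system
    message = client.messages.create(**kwargs)
    return {
        "text": message.content[0].text,
        "output_tokens": message.usage.output_tokens,
    }

# e.g. 50 baseline runs for Haiku 3:
# results = [run_once("claude-3-haiku-20240307", COMPLETE_PROMPT) for _ in range(50)]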

Results

Here are the results with No System Prompt set. The table shows how often each Score Difference (Alice’s Score - Bob’s Score) occurred across the 50 runs.

Score Differences - No System Prompt

Score Difference    1    2    3    4    5
Haiku 3            16   30    4    0    0
Sonnet 3            0    0   36   14    0
Opus 3              0    0    0   46    4

The results show a clear trend: the more expensive models produce score differences closer to our expected range of 3-5. Haiku 3, the cheapest model, mostly produced smaller differences, while Opus 3, the most expensive, consistently gave differences of 4 or 5.

Straight away we can see a direct correlation between model price and performance at this task.

Initial Observations

These results not only highlight performance differences between models but also validate our initial test assumptions in a delightfully circular manner. We designed the test expecting more sophisticated models to produce score differences closer to our 3-5 range, and that is what we found.

Nevertheless, the clear correlation between model sophistication, performance, and consistency provides valuable insights for those navigating the trade-offs between cost and capability in real-world applications.

With the baseline results established, let’s see what effect tweaking the system prompt has…

A Helpful AI Assistant

Let’s run the same test, but this time with a minimal system prompt: “You are a helpful AI assistant”.

Results

Here are the results of that test:

Score Differences - Helpful AI Assistant

Score Difference    1    2    3    4    5    6
Haiku 3             5   40    5    0    0    0
Sonnet 3            0    1   41    8    0    0
Opus 3              0    0    0   37   12    1

For brevity, we’ll label the conditions “Empty” for No System Prompt and “Helpful” for “You are a helpful AI assistant”.

Persona Scoring

Here are the mean score differences and the changes in tabular format:

Model      Mean (Empty)   Mean (Helpful)   Change
Haiku 3            1.76             2.00    +0.24
Sonnet 3           3.28             3.14    -0.14
Opus 3             4.08             4.28    +0.20
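
These means follow directly from the frequency tables above. For example, using the Haiku 3 frequencies (a quick sketch; the function name is illustrative):

Mean Calculation Sketch

def mean_from_frequencies(freqs: dict[int, int]) -> float:
    """Mean score difference from a {difference: count} frequency table."""
    return sum(diff * count for diff, count in freqs.items()) / sum(freqs.values())

# Haiku 3 frequencies from the two results tables above
print(mean_from_frequencies({1: 16, 2: 30, 3: 4}))  # 1.76 (Empty)
print(mean_from_frequencies({1: 5, 2: 40, 3: 5}))   # 2.0  (Helpful)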

Both Haiku 3 and Opus 3 show a clear benefit from having a system prompt, with improvements in their ability to differentiate article suitability between personas. Surprisingly, Sonnet 3 displays a slight negative effect when given a system prompt.

This is unexpected - especially as the other models in the same suite behave differently. We’ll double-check the API request to make sure the system message is actually being sent:

Sonnet 3 API Call Snippet
  {"model":"claude-3-sonnet-20240229",
    "system":"You are a helpful AI assistant","max_tokens":1024,
    "messages":[
        {"role":"user","content":[
        {"type":"text","text":"Alice is an extremely technical person..."}]}
    ],
  "stream":false,"temperature":1}

Here are the mean scores for Alice and Bob across the 50 runs:

Model      Sys Prompt   Alice Mean   Bob Mean
Haiku 3    Empty              8.36       6.60
Haiku 3    Helpful            8.82       6.82
Sonnet 3   Empty              8.02       4.74
Sonnet 3   Helpful            8.08       4.94
Opus 3     Empty              8.00       3.92
Opus 3     Helpful            8.04       3.76

The addition of a System Prompt affects models differently. For Haiku and Sonnet, it increases suitability scores for both personas, though Haiku’s improvement in score differentiation comes at the cost of an inflated baseline for Bob. Opus, however, behaves distinctly: it maintains Alice’s high score while further lowering Bob’s score, which is more aligned with our expectations and thus more useful.

Sonnet and Opus agree on Alice’s score of 8, with minimal impact from the prompt. The key difference lies in their scoring for Bob, for whom the article is less suitable, with Opus showing the most pronounced and beneficial differentiation.

Notably, Opus 3 demonstrates higher consistency in its scoring, showing less variance across both the “No System Prompt” and “Helpful AI Assistant” conditions.

Verbosity

Both the choice of model and the addition of a System Prompt produce marked differences in the verbosity (completion token usage) of the responses. The table below summarises the data from the performance runs.

Completion Tokens - All Scenarios

Model & Prompt           0-50   51-100   101-150   151-200   201-250   251+
Haiku 3 (Empty)            26        0         1        13         9      1
Haiku 3 (Helpful AI)       12        1         5        22         9      1
Sonnet 3 (Empty)           37        3         3         5         1      1
Sonnet 3 (Helpful AI)      37        4         3         4         2      0
Opus 3 (Empty)              0        0         7        38         5      0
Opus 3 (Helpful AI)         1        0         8        34         6      1
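
The ranges above are simple buckets over the per-run completion token counts; a minimal sketch of that tallying (the function names are illustrative):

Token Bucketing Sketch

from collections import Counter

def token_bucket(completion_tokens: int) -> str:
    """Map a single run's completion token count to the ranges used above."""
    for upper, label in [(50, "0-50"), (100, "51-100"), (150, "101-150"),
                         (200, "151-200"), (250, "201-250")]:
        if completion_tokens <= upper:
            return label
    return "251+"

def bucket_counts(per_run_tokens: list[int]) -> Counter:
    """Tally one scenario's 50 per-run counts into a row of the table."""
    return Counter(token_bucket(n) for n in per_run_tokens)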

Here’s a summary of the token output statistics:

Model      Sys Prompt   Mean     Median   Min   Max   Std Dev
Haiku 3    Empty         84.52    18.00    17   236     91.19
Haiku 3    Helpful      161.36   186.50    17   229     65.68
Sonnet 3   Empty         78.76    17.00    17   286     85.33
Sonnet 3   Helpful       56.36    17.00    17   237     75.88
Opus 3     Empty        181.52   181.50   129   227     24.71
Opus 3     Helpful      178.52   183.50    17   260     35.62
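
A sketch of how such summary statistics can be computed from the raw per-run counts, using Python's statistics module (this assumes the sample standard deviation; the variable name is illustrative):

Token Statistics Sketch

import statistics

def token_summary(per_run_tokens: list[int]) -> dict[str, float]:
    """Mean, median, min, max and sample standard deviation of completion tokens."""
    return {
        "mean": statistics.mean(per_run_tokens),
        "median": statistics.median(per_run_tokens),
        "min": min(per_run_tokens),
        "max": max(per_run_tokens),
        "std_dev": statistics.stdev(per_run_tokens),
    }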

Opus earns its reputation for producing verbose output, and remains consistent across the scenarios. Its responses cluster predominantly in the 151-200 token range, rarely dipping below 100 tokens.

Haiku is particularly sensitive to the addition of a System Prompt, improving score difference performance and shifting from extremely terse to relatively chatty. This sensitivity could be crucial for fine-tuning the more economical models.

Sonnet, intriguingly, shows an opposite trend. The “Helpful AI Assistant” prompt actually decreases its verbosity, with the median token count remaining low at 17.

Testing Costs

The table below summarizes the input token usage for our experiments, comparing scenarios with and without a system prompt across 50 runs:

Sys Prompt   Input Tokens per Run   Total Input Tokens (50 Runs)
Empty        815                    40,750
Helpful      821                    41,050

Adding the output tokens then allows us to calculate our overall cost per scenario:

Model      Sys Prompt   Output Tokens   Input Cost   Output Cost   Total Cost
Haiku 3    Empty                4,226       $0.010        $0.005       $0.015
Haiku 3    Helpful              8,068       $0.010        $0.010       $0.020
Sonnet 3   Empty                3,938       $0.122        $0.059       $0.181
Sonnet 3   Helpful              2,818       $0.123        $0.042       $0.165
Opus 3     Empty                9,076       $0.611        $0.681       $1.292
Opus 3     Helpful              8,926       $0.616        $0.669       $1.285
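
These totals follow directly from the per-million-token prices listed earlier. As a worked example for the Opus 3 “Empty” row:

Cost Calculation Sketch

# Opus 3 prices per million tokens (from the June 2024 pricing table above)
INPUT_PRICE_PER_MTOK = 15.00
OUTPUT_PRICE_PER_MTOK = 75.00

input_tokens = 40_750   # 50 runs x 815 input tokens (Empty condition)
output_tokens = 9_076   # total Opus 3 output tokens (Empty condition)

input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK     # ~$0.611
output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK  # ~$0.681
print(f"${input_cost + output_cost:.3f}")                        # $1.292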

While Haiku’s list price is 1/60th of Opus’s, with no System Prompt the actual cost difference is 86 times, driven by the difference in verbosity. The addition of the system prompt closes the gap to 64 times - still higher than the list prices would indicate.

Summary and Implications

These results underscore the importance of testing prompts with specific models to optimize performance in real-world applications. They also highlight that even within a family of models, individual AIs may respond differently to the same prompting strategies.

⚠️
Test Limitations - This test simulates a real-world scenario for task-focused business users, not a highly reproducible synthetic benchmark. The suitability scoring involves subjective interpretation with no definitively “correct” answer. Our 50-run sample at high temperature (API default) and single article type, while practical, may not capture all possible variations in model performance.

Opus performs well at this task, and is the clear winner - it appears to understand the user intent, and the addition of a System Prompt tunes it to further differentiate the personas. Notably this superior performance comes with increased verbosity, leading to higher token usage and costs compared to the other models.

This test reinforces that Anthropic’s Suite has a clear price/performance distinction for their models, particularly for certain creative tasks like content creation and product management. While Opus’s cost may seem high, it’s negligible compared to manual content creation expenses. Opting for lower-priced models could lead to increased human effort or subtly lower quality outputs, potentially negating any perceived savings. For high-stakes content creation or analysis tasks, the relative cost difference between models is often insignificant compared to the potential impact of the output quality.

Both the User and System prompt can be tuned to improve the performance of the lower-priced models, but our goal here was to understand the type of performance a Casual Prompter would experience when interacting with a Large Language Model. This approach provides valuable insights into real-world applicability and the trade-offs between model capabilities and costs.