OpenAI GPT-4o mini - Benchmark
Introducing GPT-4o mini
In our recent roundup of the OpenAI model suite, we found that GPT-3.5 wasn’t competitive, with an aged cut-off date and some erratic task performance in our content rating benchmark.
OpenAI have today replaced GPT-3.5 Turbo with GPT-4o mini, offering improved performance at a lower price point. Full details are available at the announcement page.
GPT-4o mini now stands as a direct competitor to Anthropic’s Haiku 3 model, with a recent knowledge cut-off and an industry-leading 16,000-token maximum output.
| Model | Knowledge Cut-Off | Input Price ($/M tokens) | Output Price ($/M tokens) | Context Window | Max Output |
|---|---|---|---|---|---|
| GPT-4o mini | Oct 2023 | $0.15 | $0.60 | 128,000 | 16,000 |
| Haiku 3 | Aug 2023 | $0.25 | $1.25 | 200,000 | 4,096 |
| GPT-3.5 Turbo | Sep 2021 | $0.50 | $1.50 | 16,000 | 4,096 |
Information correct at publication date of 18 July 2024
Content Rating Benchmark
We’ll run our content rating benchmark against GPT-4o mini and compare the results with Haiku 3 and GPT-3.5 Turbo on the same task. The benchmark defines two personas and asks the model to rate the suitability of a supplied technical article for each. Task performance is measured as the difference between the two personas’ scores. The benchmark is intended to simulate unoptimised, “casual” use of the model for a content management task.
The benchmark is run at the API default temperature of 1, with 50 runs without a system prompt and 50 runs with a simple “You are a helpful AI assistant” system prompt.
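For illustration, here is a minimal sketch of how such a run can be made against the OpenAI Chat Completions API. The task wording, the `rate_article` helper, and the single-request structure are assumptions for this sketch rather than the actual harness; the Haiku 3 runs would use the equivalent Anthropic Messages API call.

```python
# Minimal sketch of the benchmark loop (illustrative only) -- the real persona
# descriptions and article text are not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ARTICLE = "..."    # the technical article under test (elided)
TASK = (
    "Rate the suitability of the following article for each persona, Alice and Bob, "
    "on a 1-10 scale. Reply with the two scores only.\n\n"
)

def rate_article(system_prompt: str | None = None) -> str:
    """Ask the model to score the article for both personas in one request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": TASK + ARTICLE})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1,  # the API default, as used in the benchmark
        messages=messages,
    )
    return response.choices[0].message.content

# 50 runs with no system prompt, then 50 with the simple helpful-assistant prompt.
results = {
    system_prompt: [rate_article(system_prompt) for _ in range(50)]
    for system_prompt in (None, "You are a helpful AI assistant")
}
```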
GPT-4o mini vs Haiku - Results
With a system prompt, the score difference calculations for GPT-4o mini and GPT-3.5 are identical, with Haiku 3 showing small variances across the 50 runs.
Here’s a summary of the mean scores and differences:
| Model | Sys Prompt | Alice Score | Bob Score | Difference |
|---|---|---|---|---|
| GPT-4o mini | Empty | 7.00 | 5.00 | 2.00 |
| GPT-4o mini | Helpful | 7.00 | 5.00 | 2.00 |
| Haiku 3 | Empty | 8.36 | 6.60 | 1.76 |
| Haiku 3 | Helpful | ▲ 8.82 | ▲ 6.82 | ▲ 2.00 |
| GPT-3.5 | Empty | 7.04 | 8.02 | -0.98 |
| GPT-3.5 | Helpful | ▲ 8.00 | ▼ 6.00 | ▲ 2.00 |
GPT-4o mini shows absolute consistency both with and without a system prompt, maintaining a 2-point difference between Alice’s and Bob’s scores. Notably, whilst the score difference is the same, the actual persona scores are 1 point lower; Alice’s score is now consistent with GPT-4o’s when a system prompt is used.
Haiku 3 demonstrates more variability, with the system prompt increasing both personas’ scores as well as the score difference.
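For reference, the summary figures in the table are simple per-condition means. Assuming each run’s response has been parsed into an (Alice, Bob) score pair, the calculation is just:

```python
from statistics import mean

def summarise(runs: list[tuple[float, float]]) -> dict[str, float]:
    """Mean Alice score, mean Bob score, and mean score difference for one condition."""
    return {
        "alice": mean(a for a, _ in runs),
        "bob": mean(b for _, b in runs),
        "difference": mean(a - b for a, b in runs),
    }

# e.g. summarise(gpt_4o_mini_helpful_runs) -> {"alice": 7.0, "bob": 5.0, "difference": 2.0}
```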
Token Usage Analysis
Let’s examine the token usage patterns for these models.
Output token usage statistics (completion tokens per run):
| Model | Sys Prompt | Mean | Median | Min | Max | Std Dev |
|---|---|---|---|---|---|---|
| GPT-4o mini | Empty | 11.00 | 11.00 | 11 | 11 | 0.00 |
| GPT-4o mini | Helpful | 11.00 | 11.00 | 11 | 11 | 0.00 |
| Haiku 3 | Empty | 84.52 | 18.00 | 17 | 236 | 91.19 |
| Haiku 3 | Helpful | ▲ 161.36 | ▲ 186.50 | 17 | ▼ 229 | ▼ 65.68 |
| GPT-3.5 | Empty | 11.00 | 11.00 | 11 | 11 | 0.00 |
| GPT-3.5 | Helpful | 11.00 | 11.00 | 11 | 11 | 0.00 |
GPT-4o mini and GPT-3.5 Turbo consistently output only the requested scores.
Haiku 3, in contrast, shows much more variability in token usage, especially with the system prompt. Haiku 3 often provides additional context or explanation along with the scores, which could be beneficial or detrimental depending on the use case.
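These are plain descriptive statistics over the completion (output) token counts reported in each API response’s usage field; a sketch, assuming the per-run counts for one condition have been gathered into a list:

```python
import statistics

def token_stats(completion_tokens: list[int]) -> dict[str, float]:
    """Mean, median, min, max, and sample standard deviation of output token counts."""
    return {
        "mean": statistics.mean(completion_tokens),
        "median": statistics.median(completion_tokens),
        "min": min(completion_tokens),
        "max": max(completion_tokens),
        "std_dev": statistics.stdev(completion_tokens),
    }
```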
Cost Analysis
Let’s compare the costs of running this benchmark (50 runs each with and without a system prompt):
| Model | Input Cost | Output Cost | Total Cost |
|---|---|---|---|
| GPT-4o mini | $0.01124 | $0.00066 | $0.01190 |
| Haiku 3 | $0.01000 | $0.01500 | $0.02500 |
| GPT-3.5 | $0.03770 | $0.00160 | $0.03930 |
GPT-4o mini proves to be the most cost-effective option for this task, costing less than half of Haiku 3 and about a third of GPT-3.5 Turbo.
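The cost figures follow directly from the per-million-token prices in the first table. A sketch of the arithmetic; the GPT-4o mini input token total below is back-derived from the published cost, so treat it as approximate:

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Total cost in dollars, given prices per million tokens."""
    return (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price

# GPT-4o mini: roughly 74,900 input tokens across 100 runs (approximate),
# and 100 runs x 11 output tokens = 1,100 output tokens.
print(round(run_cost(74_933, 1_100, 0.15, 0.60), 5))  # ~0.0119
```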
Conclusion
While GPT-4o mini shows impressive performance in this benchmark, it’s important to note that this test is limited to a single specific content rating task. Different use cases may yield different results. As always, it’s crucial to test models on tasks specific to your application to determine the best fit.
GPT-3.5 Turbo is effectively replaced - GPT-4o mini outperforms it in every aspect: consistency, capability, accuracy, and price.
Competitive edge over Haiku 3 - GPT-4o mini shows comparable task performance to Haiku 3, with more consistent scoring and significantly lower costs.
Consistency is key - GPT-4o mini’s consistent performance across different system prompts suggests it may be more reliable in production environments where prompt engineering might be limited.
Cost-effectiveness - At less than half the price of Haiku 3 for this task, GPT-4o mini presents a compelling option for cost-sensitive applications.
Token efficiency - GPT-4o mini’s consistent, minimal token usage could be advantageous in scenarios where concise responses are preferred.
The extended context window and generous output token limit widen the use cases for GPT-4o mini, especially for conducting simple tasks on large batches of text. The batch pricing option (50% discount for <24hr turnaround) provides an even more cost-effective way of using the model if the turnaround time is acceptable.
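For context, batch submissions go through OpenAI’s Batch API rather than the standard chat endpoint: requests are uploaded as a JSONL file and results are collected once the batch completes. A rough sketch, with illustrative file contents:

```python
from openai import OpenAI

client = OpenAI()

# Each line of requests.jsonl is one chat.completions request, for example:
# {"custom_id": "run-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "temperature": 1, "messages": [...]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours at the discounted batch rate
)
print(batch.id, batch.status)
```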
It will be interesting to see how Anthropic respond with the launch of Haiku 3.5 later this year. At first glance, GPT-4o mini demonstrates competitive performance with Haiku 3 while offering advantages in price, consistency, and output capacity. Batch pricing adds an extra usage option with no equivalent service from Anthropic. With both Anthropic and OpenAI refreshing their model line-ups, the second half of 2024 promises intense competition in the commercial AI model market.