OpenAI GPT-4o mini - Benchmark
Introducing GPT-4o mini
In our recent roundup of the OpenAI model suite, we found that GPT-3.5 wasn’t competitive, with an aged cut-off date and some erratic task performance in our content rating benchmark.
OpenAI have today replaced GPT-3.5 Turbo with GPT-4o mini, offering improved performance at a lower price point. Full details are available at the announcement page.
GPT-4o mini now stands as a direct competitor to Anthropic’s Haiku 3 model, with a recent knowledge cut-off and an industry-leading 16,000-token maximum output.
| Model | Knowledge Cut-Off | Input Price ($/M tokens) | Output Price ($/M tokens) | Context Window | Max Output |
|---|---|---|---|---|---|
| GPT-4o mini | Oct 2023 | $0.15 | $0.60 | 128,000 | 16,000 |
| Haiku 3 | Aug 2023 | $0.25 | $1.25 | 200,000 | 4,096 |
| GPT-3.5 Turbo | Sep 2021 | $0.50 | $1.50 | 16,000 | 4,096 |
Information correct at publication date of 18 July 2024
Content Rating Benchmark
We’ll run our content rating benchmark against GPT-4o mini and compare the results with Haiku 3 and GPT-3.5 Turbo on the same task. The benchmark defines two personas and asks the model to rate the suitability of a supplied technical article for each. Task performance is measured as the difference between the two personas’ scores. The benchmark is intended to simulate unoptimised, “casual” use of the model for a content management task.
The benchmark is run at the API default temperature of 1, with 50 runs without a system prompt and 50 runs with a simple “You are a helpful AI assistant” system prompt.
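For illustration, here is a minimal sketch of how such a run can be made against the OpenAI Chat Completions API. The task wording, the `rate_article` helper, and the single-request structure are assumptions for this sketch rather than the actual harness; the Haiku 3 runs would use the equivalent Anthropic Messages API call.

```python
# Minimal sketch of the benchmark loop (illustrative only) -- the real persona
# descriptions and article text are not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ARTICLE = "..."    # the technical article under test (elided)
TASK = (
    "Rate the suitability of the following article for each persona, Alice and Bob, "
    "on a 1-10 scale. Reply with the two scores only.\n\n"
)

def rate_article(system_prompt: str | None = None) -> str:
    """Ask the model to score the article for both personas in one request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": TASK + ARTICLE})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1,  # the API default, as used in the benchmark
        messages=messages,
    )
    return response.choices[0].message.content

# 50 runs with no system prompt, then 50 with the simple helpful-assistant prompt.
results = {
    system_prompt: [rate_article(system_prompt) for _ in range(50)]
    for system_prompt in (None, "You are a helpful AI assistant")
}
```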
GPT-4o mini vs Haiku - Results
With a system prompt, the score difference calculations for GPT-4o mini and GPT-3.5 are identical, with Haiku 3 showing small variances across the 50 runs.
Here’s a summary of the mean scores and differences:
| Model | Sys Prompt | Alice Score | Bob Score | Difference |
|---|---|---|---|---|
| GPT-4o mini | Empty | 7.00 | 5.00 | 2.00 |
| GPT-4o mini | Helpful | 7.00 | 5.00 | 2.00 |
| Haiku 3 | Empty | 8.36 | 6.60 | 1.76 |
| Haiku 3 | Helpful | ▲ 8.82 | ▲ 6.82 | ▲ 2.00 |
| GPT-3.5 | Empty | 7.04 | 8.02 | -0.98 |
| GPT-3.5 | Helpful | ▲ 8.00 | ▼ 6.00 | ▲ 2.00 |
GPT-4o mini shows absolute consistency both with and without a system prompt, maintaining a 2-point difference between Alice’s and Bob’s scores. Notably, whilst the score difference is the same, the actual persona scores are 1 point lower; Alice’s score is now consistent with GPT-4o’s when a system prompt is used.
Haiku 3 demonstrates more variability, with the system prompt increasing both personas’ scores as well as the score difference.
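For reference, the summary figures in the table are simple per-condition means. Assuming each run’s response has been parsed into an (Alice, Bob) score pair, the calculation is just:

```python
from statistics import mean

def summarise(runs: list[tuple[float, float]]) -> dict[str, float]:
    """Mean Alice score, mean Bob score, and mean score difference for one condition."""
    return {
        "alice": mean(a for a, _ in runs),
        "bob": mean(b for _, b in runs),
        "difference": mean(a - b for a, b in runs),
    }

# e.g. summarise(gpt_4o_mini_helpful_runs) -> {"alice": 7.0, "bob": 5.0, "difference": 2.0}
```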
Token Usage Analysis
Let’s examine the token usage patterns for these models.
Output token usage statistics (completion tokens per run):
| Model | Sys Prompt | Mean | Median | Min | Max | Std Dev |
|---|---|---|---|---|---|---|
| GPT-4o mini | Empty | 11.00 | 11.00 | 11 | 11 | 0.00 |
| GPT-4o mini | Helpful | 11.00 | 11.00 | 11 | 11 | 0.00 |
| Haiku 3 | Empty | 84.52 | 18.00 | 17 | 236 | 91.19 |
| Haiku 3 | Helpful | ▲ 161.36 | ▲ 186.50 | 17 | ▼ 229 | ▼ 65.68 |
| GPT-3.5 | Empty | 11.00 | 11.00 | 11 | 11 | 0.00 |
| GPT-3.5 | Helpful | 11.00 | 11.00 | 11 | 11 | 0.00 |
GPT-4o mini and GPT-3.5 Turbo consistently output only the requested scores.
Haiku 3, in contrast, shows much more variability in token usage, especially with the system prompt. Haiku 3 often provides additional context or explanation along with the scores, which could be beneficial or detrimental depending on the use case.
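These are plain descriptive statistics over the completion (output) token counts reported in each API response’s usage field; a sketch, assuming the per-run counts for one condition have been gathered into a list:

```python
import statistics

def token_stats(completion_tokens: list[int]) -> dict[str, float]:
    """Mean, median, min, max, and sample standard deviation of output token counts."""
    return {
        "mean": statistics.mean(completion_tokens),
        "median": statistics.median(completion_tokens),
        "min": min(completion_tokens),
        "max": max(completion_tokens),
        "std_dev": statistics.stdev(completion_tokens),
    }
```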
Cost Analysis
Let’s compare the costs of running this benchmark (50 runs each with and without a system prompt):
| Model | Input Cost | Output Cost | Total Cost |
|---|---|---|---|
| GPT-4o mini | $0.01124 | $0.00066 | $0.01190 |
| Haiku 3 | $0.01000 | $0.01500 | $0.02500 |
| GPT-3.5 | $0.03770 | $0.00160 | $0.03930 |
GPT-4o mini proves to be the most cost-effective option for this task, costing less than half of Haiku 3 and about a third of GPT-3.5 Turbo.
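The cost figures follow directly from the per-million-token prices in the first table. A sketch of the arithmetic; the GPT-4o mini input token total below is back-derived from the published cost, so treat it as approximate:

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Total cost in dollars, given prices per million tokens."""
    return (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price

# GPT-4o mini: roughly 74,900 input tokens across 100 runs (approximate),
# and 100 runs x 11 output tokens = 1,100 output tokens.
print(round(run_cost(74_933, 1_100, 0.15, 0.60), 5))  # ~0.0119
```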
Conclusion
While GPT-4o mini shows impressive performance in this benchmark, it’s important to note that this test is limited to a single specific content rating task. Different use cases may yield different results. As always, it’s crucial to test models on tasks specific to your application to determine the best fit.
GPT-3.5 Turbo is effectively replaced - GPT-4o mini outperforms it in every aspect: consistency, capability, accuracy, and price.
Competitive edge over Haiku 3 - GPT-4o mini shows comparable task performance to Haiku 3, with more consistent scoring and significantly lower costs.
Consistency is key - GPT-4o mini’s consistent performance across different system prompts suggests it may be more reliable in production environments where prompt engineering might be limited.
Cost-effectiveness - At less than half the price of Haiku 3 for this task, GPT-4o mini presents a compelling option for cost-sensitive applications.
Token efficiency - GPT-4o mini’s consistent, minimal token usage could be advantageous in scenarios where concise responses are preferred.
The extended context window and generous output token limit widen the use cases for GPT-4o mini, especially for conducting simple tasks on large batches of text. The batch pricing option (50% discount for <24hr turnaround) provides an even more cost-effective way of using the model if the turnaround time is acceptable.
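For context, batch submissions go through OpenAI’s Batch API rather than the standard chat endpoint: requests are uploaded as a JSONL file and results are collected once the batch completes. A rough sketch, with illustrative file contents:

```python
from openai import OpenAI

client = OpenAI()

# Each line of requests.jsonl is one chat.completions request, for example:
# {"custom_id": "run-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "temperature": 1, "messages": [...]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results within 24 hours at the discounted batch rate
)
print(batch.id, batch.status)
```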
It will be interesting to see how Anthropic respond with the launch of Haiku 3.5 later this year. At first glance, GPT-4o mini demonstrates competitive performance with Haiku 3 while offering advantages in price, consistency, and output capacity. Batch pricing adds an extra usage option with no equivalent service from Anthropic. With both Anthropic and OpenAI refreshing their model line-ups, the second half of 2024 promises intense competition in the commercial AI model market.