OpenAI o1-mini Revisited

November 3, 2024 · Shaun Smith

Introduction

When we first looked at the o1 models a few weeks ago, access was limited and testing was through the ChatGPT interface. At the time, we questioned the consistency of its task analysis and noted spurious tokens and hallucinations appearing in the reasoning summaries.

[Chart: results summary]

Results Summary

Now that the models are widely available via the API, we can take a more detailed look at their performance. For this article, we’ll compare o1-mini against the current standard OpenAI models: GPT-4o and GPT-4o-mini.

In summary, o1-mini returns the best average score but is inconsistent, with some failures typically only seen with smaller models. Reasoning token usage varies widely, with no clear correlation to task performance.

Benchmark Approach

The benchmark measures model capabilities through a simple content evaluation task. Models score a technical article’s suitability (0-10) for two personas - a technical expert and a non-technical manager - without explicit scoring criteria. Human evaluators typically show a 5-6 point difference in scores, reflecting the content’s varying accessibility to different audiences.

This design reveals three key capabilities:

  • Task Understanding: Reasoning about audience needs with minimal prompting
  • Content Analysis: Evaluating technical depth and maintaining distinct persona perspectives
  • Performance Stability: Consistent scoring and human-like differentiation across multiple runs
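To make the task concrete, here is a minimal sketch of the kind of prompt involved. The wording, the output format, and the mapping of the persona names Alice and Bob (seen in the results below) to the expert and manager roles are all assumptions for illustration, not the benchmark’s actual prompt.

```python
# Hypothetical evaluation prompt -- the benchmark's real wording isn't published,
# so treat this as an illustrative stand-in. The Alice/Bob role mapping is assumed.
EVAL_PROMPT_TEMPLATE = """\
Rate the suitability of the following article for each reader on a 0-10 scale.

- Alice: a technical expert
- Bob: a non-technical manager

Reply with two lines only, for example:
Alice: 8
Bob: 3

Article:
{article_text}
"""
```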

Benchmark Scores

We tested each model using identical prompts across 200 runs. GPT-4o and GPT-4o-mini used the system prompt “You are a helpful AI Assistant.” and the API default temperature setting of 1.

o1-mini does not accept a system prompt, so it ran with only the user prompt at its fixed temperature of 1.
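For reference, here is a minimal sketch of how the two request shapes differ, assuming the standard openai Python SDK. The `article` variable and `EVAL_PROMPT_TEMPLATE` are the hypothetical placeholders sketched above, not the benchmark’s actual harness.

```python
from openai import OpenAI

client = OpenAI()
article = "<article text under evaluation>"  # placeholder for the benchmark article
prompt = EVAL_PROMPT_TEMPLATE.format(article_text=article)

# GPT-4o / GPT-4o-mini: system prompt plus the API default temperature of 1.
gpt4o_response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    temperature=1,
    messages=[
        {"role": "system", "content": "You are a helpful AI Assistant."},
        {"role": "user", "content": prompt},
    ],
)

# o1-mini (beta): system messages aren't supported and temperature is fixed at 1,
# so only the user message is sent.
o1_mini_response = client.chat.completions.create(
    model="o1-mini-2024-09-12",
    messages=[{"role": "user", "content": prompt}],
)
```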

Model ID                Name         Score (Alice − Bob)  σ(Score)  Alice  Bob   Output Tokens
o1-mini-2024-09-12      o1-mini      4.84                 1.12      8.19   3.35  85898
gpt-4o-2024-08-06       GPT-4o       3.88                 0.49      7.91   4.03  2221
gpt-4o-mini-2024-07-18  GPT-4o-mini  2.42                 0.84      6.81   4.38  2236

While o1-mini achieved the highest mean score (4.84), there is a wide distribution of scores. By comparison, GPT-4o shows more stable performance with a mean score of 3.88 and standard deviation of 0.49.

Practically speaking, o1-mini is most likely to produce high-quality scores (5-6), but approximately 10% of its responses fall below GPT-4o’s typical performance level, and it even produced an invalid negative score.

The charts below show the distribution of score differences and the persona-specific scores.


[Chart: distribution of score differences by model]

[Chart: persona-specific score distributions by model]

Reasoning Token Usage

The o1 model series incorporates ‘Reasoning’ tokens during inference, charged at standard output token rates but hidden from users. Our analysis reveals no clear correlation between reasoning token count and performance quality.


[Chart: reasoning token usage per response by model]

While all models produce similar-sized visible responses (~23 tokens), o1-mini’s hidden reasoning tokens vary dramatically - from 128 to 1,216 per response. This variation significantly impacts both response time and cost, with higher token usage generally corresponding to longer processing times but not necessarily better results.
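The hidden reasoning count is reported in the response’s usage block. A small sketch of reading it and pricing it at the advertised $12.00 per million output-token rate, continuing from the request example above (field names as documented for the o1 beta at the time of writing):

```python
usage = o1_mini_response.usage
reasoning = usage.completion_tokens_details.reasoning_tokens  # hidden reasoning tokens
visible = usage.completion_tokens - reasoning                 # tokens actually returned (~23 here)

# Reasoning tokens are billed at the normal output rate ($12.00 per million).
reasoning_cost = reasoning * 12.00 / 1_000_000
print(f"visible={visible}, reasoning={reasoning}, reasoning cost=${reasoning_cost:.4f}")
```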

The table below shows the o1-mini results grouped into quintiles based on reasoning token usage, each containing 40 results.


Quintile  Token Range  Score  σ(Score)  Σ Reasoning Tokens  Average Time  Reasoning Cost  Total Cost
1         128 - 320    4.83   0.90      10368               2854ms        $0.12           $0.23
2         320 - 384    4.80   1.52      13376               3038ms        $0.16           $0.26
3         384 - 448    4.65   1.12      15552               3360ms        $0.19           $0.29
4         448 - 512    4.95   0.93      17984               3526ms        $0.22           $0.32
5         512 - 1216   5.00   1.06      24000               4320ms        $0.29           $0.39
Total                                   81280                             $0.98           $1.49

There is no clear correlation between the number of reasoning tokens used and task performance. As expected, response time increases with reasoning token usage. By way of comparison, both GPT-4o and GPT-4o-mini average around 500ms per response.
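A grouping like the one above can be approximated in a few lines of pandas. The per-run log file and its column names are assumptions about how the results were recorded, not the actual analysis code, and the exact bin edges may differ.

```python
import pandas as pd

# Hypothetical per-run log: one row per o1-mini run, with assumed columns
# 'reasoning_tokens', 'score', and 'latency_ms'.
df = pd.read_csv("o1_mini_runs.csv")

# Split the 200 runs into five equal-sized groups by reasoning token usage.
df["quintile"] = pd.qcut(df["reasoning_tokens"], q=5, labels=[1, 2, 3, 4, 5])

summary = df.groupby("quintile", observed=True).agg(
    score_mean=("score", "mean"),
    score_std=("score", "std"),
    reasoning_total=("reasoning_tokens", "sum"),
    avg_time_ms=("latency_ms", "mean"),
)
print(summary.round(2))
```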

Conclusion

o1-mini delivers the highest average score, matching the limited manual testing conducted at launch. Despite the high average, performance is unpredictable.

The variance in reasoning token usage suggests that task analysis is inconsistent, with no obvious scaling with task difficulty or evaluation performance.

At $3.00/$12.00 per million input/output tokens, o1-mini’s advertised rates appear competitive with models like Claude 3.5 Sonnet ($3.00/$15.00). In practice, its high reasoning token usage results in actual costs ($1.49) higher than Claude 3.5 Sonnet ($1.02) for the same benchmark. The benefit of reasoning tokens should be improved performance; it’s unfortunate that they remain invisible yet charged for, while OpenAI retains the option to monitor them.
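As a rough sanity check on those figures, the reasoning cost in the quintile table follows directly from the advertised output rate and the token totals above:

```python
OUTPUT_RATE = 12.00 / 1_000_000    # $ per output (and reasoning) token

reasoning_tokens = 81_280          # Σ reasoning tokens across all 200 runs (table above)
output_tokens = 85_898             # total billed output tokens (visible + reasoning)

print(f"reasoning cost ~ ${reasoning_tokens * OUTPUT_RATE:.2f}")   # ~$0.98
print(f"output cost    ~ ${output_tokens * OUTPUT_RATE:.2f}")      # ~$1.03
# The remainder of the $1.49 benchmark total is input-token cost
# at the $3.00 per million input rate.
```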

Both o1 models remain in beta and are text-only. The OpenAI Cookbook contains some interesting guidance for using o1 to break down tasks for automation. As the Tools and Assistants APIs become available, it will be interesting to see how these capabilities can be used for complex automation tasks.

For the moment, the o1-mini beta remains best suited to supervised use cases such as ChatGPT or GitHub Copilot.