OpenAI o1 - First Impressions

OpenAI o1 - First Impressions

September 14, 2024·
Shaun Smith

OpenAI o1: A New Approach

On September 12th 2024, OpenAI released a preview of a new class of Generative AI model: a reasoning model named “o1”, and a smaller companion model “o1-mini”.

OpenAI o1 logo

OpenAI o1

The o1 models have been trained to “think through problems”, and conduct complex chain-of-thought reasoning in the output context window to enhance it’s responses. The models are specifically optimised for improved performance at science, coding and maths problems, and perform extremely well in reasoning-heavy benchmarks.

The o1 model truly is a preview: access is extremely limited (30 messages per-week for ChatGPT Plus users; API access limited to the highest tier of API users). It is also text-only, and does not support tool calling (e.g. web search, code execution).

Update 17/09: OpenAI have increased the limits to 50 messages per-week for o1-preview and 50 messages per-day for o1-mini

Update 23/12: Refer also to our o1-mini revisited article.

Summary

Both o1-preview and o1-mini perform extremely well in our content management benchmark, matching well-engineered prompts against standard models - and hitting the limit of what this benchmark can achieve. Detailed performance analysis is difficult due to limited access and the hidden reasoning output.

Even so, it seems clear that for most users with complex tasks and direct prompts, models automated reasoning for responses will provide a step-change in output quality and utility. OpenAI are intending to reinforce the model’s reasoning logic by monitoring and incorporating ideas from the throught chains produced during use, promising long-term improvements in performance.

For most general use cases, the trade-off between increased inference time and reduced prompt engineering effort will be worthwhile. While the o1 models may take somewhat longer to generate responses due to the additional reasoning steps, this is offset by the significant decrease in time and effort required for crafting prompts. Users can let the model independently work out the reasoning, allowing for more efficient and effective use across a wide range of complex tasks without extensive prompt optimization.

Content Analysis Prompt Benchmark

To get a feel for how o1-preview and o1-mini performs, we’ll use our simple content scoring benchmark.

ℹ️
This content benchmark introduces two personas - Alice and Bob, calculates a suitability score out of 10 for each persona for a supplied news article. There isn’t a “correct” score, the goal is to have the model distinguish, reason and assess content. When introducing the benchmark, we claimed that scores closer to 5 match the intent of the prompt. The prompts and method are described in this article.

The prompt we are using is consistent with OpenAI’s advice for prompting their reasoning model.

o1 API Launch Limitations

o1 API Launch Limitations

The o1 models have not been made generally available via the API, so these tests have been run manually via the ChatGPT front-end. It’s notable that neither System Prompt nor temperature can be adjusted via the API and that streaming of tokens is unavailable.

For comparison, the benchmark has also been run against Claude Sonnet 3.5 and Opus 3 via the Claude.ai front-end. It’s appreciated that Anthropic have released their system prompt, and this improves performance against our benchmark baseline.

Complete results are tablulated in the Appendix. The below results are calculated from 10 runs.

Model Run Type Score Diff. Alice Mean Bob Mean
o1-preview ChatGPT 5.30 9.10 3.80
Opus 3 Claude.ai 5.20 8.30 3.10
o1-mini ChatGPT 4.80 8.00 3.20
Sonnet 3.5 Claude.ai 4.20 8.20 4.00

The following is notable:

  • The average “thinking time” for o1-preview was 14 seconds.
  • o1-preview produces the greatest score difference seen in this test so far.
  • As shown in earlier articles, Sonnet 3.5 is particularly verbose in its outputs compared with the OpenAI models.
  • Opus 3 via the Claude.ai front-end performs extremely well at this task - showcasing it’s incredible raw performance.

As the output tokens used for reasoning are not available, it is impossible to say how much the inference would have cost. We will return to this once the API has been made more generally available.

Hallucinations and Reasoning Variance.

Here’s where things get a little strange.

The ChatGPT interface shows a generated summary of the reasoning steps, although unfortunately OpenAI have decided that the underlying chain-of-thought should remain hidden from the User. On examination, around about half the runs included either a hallucination or spurious tokens in the summary of the chain-of-thought. This included:

  • A statement that Alice and Bob communicate via Zoom.
    Score Reasoning Error

    Score Mismatch

  • The phrase Gary's technical jargon included at the end of a reasoning block. (There is no reference to a Gary in any of the input data).
  • The words iphy and cRipplekin FCC appearing spuriously in the reasoning outputs.
  • The score calculated at the end of the reasoning not matching the emitted score (see screenshot).

Without access to the underlying reasoning, it’s impossible to say whether these are simply display rendering and summarisation errors, or something happening deeper within the reasoning model.

These issues don’t seem to have affected the quality of the final output, however for more sensitive tasks I think it is worth reviewing the reasoning steps to check for errors. Whilst this class of errors is a latent risk of LLM inference, they are unusual to see in a task this straightforward.

Run
Number
Reasoning
Steps
Refers to OpenAI
Policy
Hallucination /
Spurious Token
1 7 No No
2 5 Yes Yes (“Zoom”)
3 8 Yes Yes (“Gary’s technical jargon”)
4 10 Yes No
5 9 Yes Yes (“cRipplekin FCC”)
6 6 No Yes (“iphy”)
7 8 Yes No
8 4 No No
9 10 Yes Yes (Scoring)
10 8 Yes No

Summary of reasoning steps in test runs

Another observation is that given the same task, the reasoning steps applied each time varied more than I expected. Even over relatively few runs, the following stood out:

  1. Inconsistent Step Ordering - The order in which the LLM addresses different aspects of the task varies considerably across runs. Some begin with a clear task definition, while others jump directly into content analysis. The placement of steps like profile definition, article evaluation, and guideline adherence is inconsistent, suggesting a non-linear thought process.
  2. Varying Depth of Analysis - The depth of analysis for each step fluctuates between runs. Some provide detailed breakdowns of the article’s content and regulatory implications, while others offer only surface-level observations. This inconsistency indicates that the model doesn’t maintain a uniform level of detail across different instances of the same task.
  3. Fluctuating Focus on Core Elements - While all runs touch on the key elements (Alice and Bob’s profiles, article content, scoring), the emphasis placed on each varies. Some runs spend more time on profile analysis, others on content evaluation, and a few on the scoring process itself. This suggests that the model doesn’t consistently prioritize the same aspects of the task.

I often recommend people use the regeneration features in most chat applications to exploit the randomness behind LLM inference. However, I would expect that task analysis and chain-of-thought construction itself would be more consistent and exhibit less variability.

I expect that these are exactly the kind of areas that OpenAI will be concentrating on improving and reinforcing their model with during the preview to release period.

Reasoning with the Anthropic Metaprompt

The Anthropic prompt generator generates prompts that include a number of reasoning techniques. For comparison, we’ll try using the prompt generator to see how it impacts the performance of the Claude models.

The table below includes scoring produced by using the Claude.ai front-end, the generated Meta Prompt via the console (temperature=1, no System Prompt) and via the API with a basic “You are a helpful AI” System Prompt.

Model Run Type Score Diff. Alice Mean Bob Mean
Opus 3 Meta Prompt 5.40 8.50 3.10
o1-preview ChatGPT 5.30 9.10 3.80
Opus 3 Claude.ai 5.20 8.30 3.10
Sonnet 3.5 Meta Prompt 5.00 8.20 3.20
o1-mini ChatGPT 4.80 8.00 3.20
Opus 3 API (“Helpful AI”) 4.28 8.04 3.76
Sonnet 3.5 Claude.ai 4.00 8.20 4.20
Sonnet 3.5 API (“Helpful AI”) 3.70 8.04 4.34

This reinforces what we already know: that better prompting leads to better outputs. The prompt generator delivers an impressive performance improvement - especially for Sonnet 3.5. I expect for the immediate future, it will remain the case that for specialised tasks, specifying model behaviour (or re-using earlier automated chains) will offer the best price:performance ratio and stability of performance.

Final Thoughts

I look forward to continuing to test with both o1-preview and o1-mini on a range of different reasoning tasks - especially coding - within the rate limits available.

A challenge with traditional instruct models for interactive users is being able to distinguish between OK answers (that are still impressive) versus excellent answers that truly exploit the power of the model. In a lot of cases - especially for one-off tasks - o1-preview removes the effort and will give excellent answers to direct prompts the first time. Embedded reasoning promises to get the best outputs from the underlying model, that have previously been hidden by needing crafted prompts.

I expect the integration of Tools within the automated chain-of-thought reasoning will lead to some surprising capabilities. For example, the ability to refer to API documentation, find the underlying source code on GitHub for analsyis, the ability to write and run test cases during reasoning is an exciting prospect, and further reduce the friction of integrating LLMs in a number of workflows.

It’s possible that we will see futher divergence in model types available - for example, those optimised for multi-modal, low-latency inference as needed by OpenAI’s advanced voice mode, versus those designed to automatically assist the user with complex reasoning tasks. For general purpose interactive use, the added compute expense of reasoning for every input is unlikely to be desirable (note that ChatGPT already supports mid-chat model switching).

As always, in any article about OpenAI I will remind people to check their privacy settings.

Appendix (All Scores)

In previous runs, the benchmark is run via the API with a simple “You are a helpful AI Assistant” prompt using the default model temperature (usually temperature=1).

Because a well-crafted System Prompt improves performance, I have added some additional comparisons for Sonnet 3.5 and Opus 3 via the Claude.ai chat front-end. Runs via the ChatGPT or Claude.ai front-ends are highlighed in blue in the below table.

Runs conducted via the Anthropic Console using the Prompt Generator are indicated in orange.

Model Run Type Runs Score Diff. Alice Mean Bob Mean Class Launch
Opus 3 Meta Prompt 10 5.40 8.50 3.10 Large Mar ‘24
o1-preview ChatGPT 10 5.30 9.10 3.80 Reasoning Sep ‘24
Opus 3 Claude.ai 10 5.20 8.30 3.10 Large Mar ‘24
Sonnet 3.5 Meta Prompt 10 5.00 8.20 3.20 Medium Jun ‘24
o1-mini ChatGPT 10 4.80 8.00 3.20 Reasoning Sep ‘24
GPT-4 API 50 4.72 8.72 4.00 Large Mar ‘23
Opus 3 API 50 4.28 8.04 3.76 Large Mar ‘24
Sonnet 3.5 Claude.ai 10 4.20 8.20 4.00 Medium Mar ‘24
Sonnet 3.5 API 50 3.70 8.04 4.34 Medium Jun ‘24
GPT-4o API 50 3.58 7.00 3.42 Medium May ‘24
Sonnet 3 API 50 3.14 8.08 4.94 Medium Mar ‘24
Haiku 3 API 50 2.00 8.82 6.82 Small Mar ‘24
GPT-4o mini API 50 2.00 7.00 5.00 Small Jul ‘24
GPT 3.5 Turbo API 50 2.00 8.00 6.00 Medium Aug ‘23