
OpenAI o1 - First Impressions


OpenAI o1: A New Approach

On September 12th 2024, OpenAI released a preview of a new class of Generative AI model: a reasoning model named “o1”, and a smaller companion model “o1-mini”. The o1 models have been trained to “think through problems”, conducting complex chain-of-thought reasoning in the output context window to enhance their responses. The models are specifically optimised for improved performance at science, coding and maths problems, and perform extremely well in reasoning-heavy benchmarks.

Read more →

September 14, 2024

Visible Costs, Smarter Chats


Explore how displaying chat costs per turn in LLM training environments enhances user understanding and optimises interaction with AI models.

Read more →

August 6, 2024

ChatGPT and Claude - Chat Productivity Techniques


Introduction

This guide provides practical tips for managing chat conversations with ChatGPT, Claude, and similar services. It focuses on techniques that can significantly improve the accuracy and relevance of responses, especially in longer conversations. Getting the best results involves more than just prompt engineering. Effective chat management is crucial for conducting complex queries and tasks, and generates more relevant, higher-quality responses.

Understanding Context Windows

AI models have a limit to how much conversation history they can consider, known as the context window.

Read more →

July 30, 2024

Claude Engineer - Build with Sonnet 3.5


Introduction

Last week, Claude Engineer 2.0 was released, advancing the field of AI-assisted software development. This tool combines intelligent context management, strategic prompting, and file manipulation capabilities. The result is a powerful command-line interface designed to enhance various software development tasks.

Introducing Claude Engineer 2.0, with agents! 🚀 Biggest update yet with the addition of a code editor and code execution agents, and dynamic editing. When editing files (especially large ones), Engineer will direct a coding agent, and the agent will provide changes in batches.

Read more →

July 23, 2024

OpenAI GPT-4o mini - Benchmark


Introducing GPT-4o mini

In our recent roundup of the OpenAI model suite, we found that GPT-3.5 wasn’t competitive, with an aged cut-off date and some erratic task performance in our content rating benchmark. OpenAI have today replaced GPT-3.5 Turbo with GPT-4o mini, offering improved performance at a lower price point. Full details are available at the announcement page. GPT-4o mini now stands as a direct competitor to Anthropic’s Haiku 3 model, with a recent knowledge cut-off and industry-leading 16,000 output token capability.

Read more →

July 18, 2024

You're an expert at... using Claude's Workbench


Introduction

“You are an expert in” is a well-known prompting ritual, and a standard piece of advice, but is it effective? Recent research describes it as “not working”. At the same time, the Anthropic API documentation says that “With a role, Claude catches critical issues that could cost millions”… so it’s a potentially high-stakes question.

[Figure: Role Prompting Tip from Anthropic API Docs]

This is a good excuse for us to try out the new Anthropic Developer Console that lets us set up Prompts and Evaluations for testing.

Read more →

July 15, 2024

Chat Interface Roundup (Jul 24)


Introduction

The end of 2022 marked the release of ChatGPT to consumers, offering the first conversational interface to a powerful Generative AI platform. The simplicity of natural language chat and seemingly limitless capabilities of GPT led to the fastest growing user base ever recorded. Chat interfaces remain popular and accessible, and continue to incorporate features that expand the utility of Generative AI, such as Document Retrieval, agent-assisted Code Execution, and Web Search.

Read more →

July 12, 2024

OpenAI GPT Suite - Casual Performance (Jul 24)


Introduction

OpenAI’s model lineup has changed recently. As of early July 2024, their API pricing page showcases two primary models: GPT-4o and GPT-3.5 Turbo, with GPT-4 and GPT-4 Turbo now in the “Older Models” section. OpenAI’s FAQ offers this guidance:

“Use GPT-4o for complex tasks that require advanced reasoning, creativity, or in-depth analysis. Use GPT-3.5 Turbo for simpler tasks like content generation, summarization, or straightforward Q&A.”

Interestingly, GPT-4 still features prominently on the consumer ChatGPT interface, advertised for complex tasks.

Read more →

July 2, 2024

Sonnet 3.5 - Latest Model Benchmark


Introduction

On June 20, 2024, Anthropic released Claude 3.5 Sonnet, promising “frontier intelligence at 2x the speed” of Claude 3 Opus. This latest iteration promises improved performance in various benchmarks, including graduate-level reasoning, undergraduate-level knowledge, and coding proficiency. But how does it fare in real-world, casual prompting scenarios? We’ll compare Sonnet 3.5 to its predecessors, Sonnet 3 and Opus 3, using our earlier Casual Prompting Benchmark.

The Benchmark: Casual Prompting Performance

As detailed in our previous article, the casual prompting benchmark simulates real-world usage by casual, interactive users.

Read more →

July 1, 2024

Claude Projects - First Impressions


Introduction

Yesterday, Anthropic released a new “Projects” feature for their front-end. It allows users to work with multiple documents and set up large context conversations without the need for multiple uploads and extensive priming when starting a session.

Context Stuffing vs. RAG Snippets

When exploring content interaction through LLM front ends, we often encounter two distinct methods: Context Stuffing - This technique involves loading complete documents into the working memory of the LLM.

Read more →

June 26, 2024

Claude 3 Suite - Casual Prompting Performance


Introduction

Let’s find out how the Anthropic Claude 3 Models handle a Casual Prompt. This is a prompt that a well-informed, casual user of an AI front-end might reasonably use, expecting it to be interpreted as another human would. These types of prompts are typically used for tasks such as market analysis, product concepts and summarization. Understanding how LLMs handle this type of prompt is crucial for assessing their real-world applicability and reliability.

Read more →

June 23, 2024