Posts

Haiku 3.5 Launch Benchmark

#anthropic #benchmark #haiku #sonnet

Benchmark analysis of Anthropic's Haiku 3.5 model, examining performance in content evaluation tasks and real-world costs versus advertised pricing, with a direct comparison against the refreshed Sonnet 3.5 and Haiku 3.

November 4, 2024

OpenAI o1-mini Revisited

#prompting #benchmark #openai #o1-mini #gpt-4o #gpt-4o-mini #llm #reasoning

Benchmark analysis of o1-mini's performance in content evaluation tasks, revealing performance characteristics and patterns in its hidden reasoning tokens, and comparing effectiveness against GPT-4o and GPT-4o-mini.

November 3, 2024

Claude Analysis Tool - First Look

#anthropic #claude #privacy #openai #chatgpt #data-analysis #visualisation

First look at Claude's new Analysis Tool, combining local processing with interactive visualizations. Its innovative design hints at broader possibilities for data analysis. Includes practical tips and a feature comparison with ChatGPT.

October 28, 2024

Sonnet 3.5 Refresh Benchmark

#anthropic #benchmark #claude #computer-use #code-generation #llm

Analysis of Anthropic's October '24 Sonnet 3.5 refresh, comparing code generation quality through Asteroids game implementations, benchmarking reasoning capabilities, and exploring new Computer Use features. Includes performance comparisons with previous versions and practical API examples.

October 24, 2024

Claude Artifacts - Build Interactive Apps and Dashboards

#prompting #anthropic #tooling #artifacts

Discover how to produce amazing and useful Claude Artifacts using expert techniques. Build interactive applications and prototypes with real-world examples, and learn how to create dashboards to visualise and process data, store information, and more. All prompts are included.

October 16, 2024

Anthropic Launch Batch API - up to 95% discount

#anthropic #openai #pricing #batching #llm

Anthropic introduces batch pricing for Claude models, offering up to 95% discounts when combined with caching. We analyse the implications, compare with OpenAI's offerings, and explore how this impacts LLM cost optimisation strategies for businesses and developers.

October 14, 2024

OpenAI Prompt Caching

#openai #anthropic #tooling #prompting #llm

Explore OpenAI's new prompt caching feature, its impact on performance and cost savings, and a comparison with Anthropic's approach.

October 4, 2024

OpenAI o1 - First Impressions

#openai #o1 #benchmark #anthropic #prompt-engineering

On September 12th, 2024, OpenAI released a preview of a new class of Generative AI model: a reasoning model named “o1”, alongside a smaller companion model, “o1-mini”. The o1 models have been trained to “think through problems”, conducting complex chain-of-thought reasoning in the output context window to enhance their responses. The models are specifically optimised for science, coding and maths problems, and perform extremely well in reasoning-heavy benchmarks.

September 14, 2024

Visible Costs, Smarter Chats

#prompting #training #tooling #prompt-engineering

Explore how displaying chat costs per turn in LLM training environments enhances user understanding and optimises interaction with AI models.

August 6, 2024

ChatGPT and Claude.ai - Chat Productivity Techniques

#prompt-engineering #chatgpt #claude #openai #anthropic #llm

This guide provides practical tips for managing chat conversations with ChatGPT, Claude, and similar services. It focuses on techniques that can significantly improve the accuracy and relevance of responses, especially in longer conversations. Getting the best results involves more than just prompt engineering: effective chat management is crucial for conducting complex queries and tasks, and generates more relevant, higher-quality responses. AI models have a limit to how much conversation history they can consider, known as the context window.

July 30, 2024

Claude Engineer - Build with Sonnet 3.5

#prompting #anthropic #tooling #claude-engineer

Last week, Claude Engineer 2.0 was released, advancing the field of AI-assisted software development. The tool combines intelligent context management, strategic prompting, and file manipulation capabilities into a powerful command-line interface designed to enhance a range of software development tasks. The 2.0 release is billed as the “biggest update yet”, adding agents: a code editor, code execution agents, and dynamic editing. When editing files (especially large ones), Engineer directs a coding agent, which provides changes in batches.

July 23, 2024

OpenAI GPT-4o mini - Benchmark

#gpt-4o-mini #haiku-3 #benchmark #openai

In our recent roundup of the OpenAI model suite, we found that GPT-3.5 wasn’t competitive, with an aged cut-off date and some erratic task performance in our content rating benchmark. OpenAI have today replaced GPT-3.5 Turbo with GPT-4o mini, offering improved performance at a lower price point; full details are available on the announcement page. GPT-4o mini now stands as a direct competitor to Anthropic’s Haiku 3 model, with a recent knowledge cut-off and an industry-leading 16,000 output token capability.

July 18, 2024

You're an expert at... using Claude's Workbench

#prompting #anthropic #llm-testing #tooling

Technical analysis of role prompting using Claude's Workbench. From initial prompt comparison through rubric development to quantitative scoring, we examine the real impact of role-based instructions. The results demonstrate when and why roles improve LLM outputs.

July 15, 2024

Chat Interface Roundup (Jul 24)

#claude #chatgpt #big-agi #huggingface-chatui #librechat #anythingllm #tooling

The end of 2022 marked the release of ChatGPT to consumers, offering the first conversational interface to a powerful Generative AI platform. The simplicity of natural language chat and the seemingly limitless capabilities of GPT led to the fastest-growing user base ever recorded. Chat interfaces remain popular and accessible, and continue to incorporate features that expand the utility of Generative AI, such as Document Retrieval, agent-assisted Code Execution, and Web Search.

July 12, 2024

OpenAI GPT Suite - Casual Performance (Jul 24)

#gpt-4o #gpt-4 #gpt-3.5 #benchmark #openai

OpenAI’s model lineup has changed recently. As of early July 2024, their API pricing page showcases two primary models, GPT-4o and GPT-3.5 Turbo, with GPT-4 and GPT-4 Turbo now in the “Older Models” section. OpenAI’s FAQ offers this guidance: use GPT-4o for complex tasks that require advanced reasoning, creativity, or in-depth analysis, and GPT-3.5 Turbo for simpler tasks like content generation, summarization, or straightforward Q&A. Interestingly, GPT-4 still features prominently on the consumer ChatGPT interface, advertised for complex tasks.

July 2, 2024

Sonnet 3.5 - Latest Model Benchmark

#sonnet-3.5 #benchmark #anthropic

On June 20, 2024, Anthropic released Claude 3.5 Sonnet, promising “frontier intelligence at 2x the speed” of Claude 3 Opus, along with improved performance across various benchmarks, including graduate-level reasoning, undergraduate-level knowledge, and coding proficiency. But how does it fare in real-world, casual prompting scenarios? We compare Sonnet 3.5 to its predecessors, Sonnet 3 and Opus 3, using our earlier Casual Prompting Benchmark. As detailed in our previous article, the casual prompting benchmark simulates real-world usage by casual, interactive users.

July 1, 2024

Claude Projects - First Impressions

#anthropic #claude #projects #tooling

Yesterday, Anthropic released a new “Projects” feature for their Claude.ai front-end. It allows users to work with multiple documents and set up large-context conversations without the need for multiple uploads and extensive priming when starting a session. When exploring content interaction through LLM front-ends like Claude.ai, we often encounter two distinct methods: Context Stuffing, which loads complete documents into the working memory of the LLM, and RAG Snippets.

June 26, 2024

Claude 3 Suite - Casual Prompting Performance

#opus-3 #sonnet-3 #haiku-3 #benchmark #anthropic

Let’s find out how the Anthropic Claude 3 models handle a Casual Prompt: a prompt that a well-informed, casual user of an AI front-end might reasonably use, expecting it to be interpreted as another human would. These types of prompts are typically used for tasks such as market analysis, product concepts and summarization. Understanding how LLMs handle this type of prompt is crucial for assessing their real-world applicability and reliability.

June 23, 2024