Posts
Haiku 3.5 Launch Benchmark
#anthropic#benchmark#haiku#sonnet
Benchmark analysis of Anthropic's Haiku 3.5 model, examining performance in content evaluation tasks and real-world costs versus advertised pricing, with a direct comparison against the refreshed Sonnet 3.5 and Haiku 3.
November 4, 2024
OpenAI o1-mini Revisited
#prompting#benchmark#openai#o1-mini#gpt-4o#gpt-4o-mini#llm#reasoning
Benchmark analysis of o1-mini in content evaluation tasks, revealing its performance characteristics and patterns in its hidden reasoning tokens, and comparing its effectiveness against GPT-4o and GPT-4o-mini.
November 3, 2024
Claude Analysis Tool - First Look
#anthropic#claude#privacy#openai#chatgpt#data-analysis#visualisation
First look at Claude's new Analysis Tool, combining local processing with interactive visualizations. Its innovative design hints at broader possibilities for data analysis. Includes practical tips and a feature comparison with ChatGPT.
October 28, 2024
Sonnet 3.5 Refresh Benchmark
#anthropic#benchmark#claude#computer-use#code-generation#llm
Analysis of Anthropic's October '24 Sonnet 3.5 refresh, comparing code generation quality through Asteroids game implementations, benchmarking reasoning capabilities, and exploring new Computer Use features. Includes performance comparisons with previous versions and practical API examples.
October 24, 2024
Claude Artifacts - Build Interactive Apps and Dashboards
#prompting#anthropic#tooling#artifacts
Discover how to produce impressive and useful Claude Artifacts using expert techniques. Build interactive applications and prototypes, with real-world examples. Learn how to build dashboards to visualise and process data, store information, and more. All prompts are included.
October 16, 2024
Anthropic Launch Batch API - up to 95% discount
#anthropic#openai#pricing#batching#llm
Anthropic introduces batch pricing for Claude models, offering up to 95% discounts when combined with caching. We analyse the implications, compare with OpenAI's offerings, and explore how this impacts LLM cost optimisation strategies for businesses and developers.
October 14, 2024
OpenAI Prompt Caching
#openai#anthropic#tooling#prompting#llm
Explore OpenAI's new prompt caching feature, its impact on performance and cost savings, and a comparison with Anthropic's approach.
October 4, 2024
OpenAI o1 - First Impressions
#openai#o1#benchmark#anthropic#prompt-engineering
On September 12th 2024, OpenAI released a preview of a new class of Generative AI model: a reasoning model named “o1”, along with a smaller companion model, “o1-mini”. The o1 models have been trained to “think through problems”, conducting complex chain-of-thought reasoning in the output context window to enhance their responses. The models are specifically optimised for science, coding and maths problems, and perform extremely well in reasoning-heavy benchmarks.
September 14, 2024
Visible Costs, Smarter Chats
#prompting#training#tooling#prompt-engineering
Explore how displaying chat costs per turn in LLM training environments enhances user understanding and optimises interaction with AI models.
August 6, 2024
ChatGPT and Claude.ai - Chat Productivity Techniques
#prompt-engineering#chatgpt#claude#openai#anthropic#llm
This guide provides practical tips for managing chat conversations with ChatGPT, Claude, and similar services. It focuses on techniques that can significantly improve the accuracy and relevance of responses, especially in longer conversations. Getting the best results involves more than just prompt engineering: effective chat management is crucial for complex queries and tasks, and produces more relevant, higher-quality responses. AI models have a limit to how much conversation history they can consider, known as the context window.
July 30, 2024
Claude Engineer - Build with Sonnet 3.5
#prompting#anthropic#tooling#claude-engineer
Last week, Claude Engineer 2.0 was released, advancing the field of AI-assisted software development. The tool combines intelligent context management, strategic prompting, and file manipulation capabilities in a powerful command-line interface designed to enhance a range of software development tasks. The biggest update yet adds a code editor, code execution agents, and dynamic editing: when editing files (especially large ones), Engineer directs a coding agent, which provides changes in batches.
July 23, 2024
OpenAI GPT-4o mini - Benchmark
#gpt-4o-mini#haiku-3#benchmark#openai
In our recent roundup of the OpenAI model suite, we found that GPT-3.5 wasn’t competitive, with an aged knowledge cut-off and erratic task performance in our content rating benchmark. OpenAI have today replaced GPT-3.5 Turbo with GPT-4o mini, offering improved performance at a lower price point; full details are available on the announcement page. GPT-4o mini now stands as a direct competitor to Anthropic’s Haiku 3 model, with a recent knowledge cut-off and an industry-leading 16,000 output token capability.
July 18, 2024
You're an expert at... using Claude's Workbench
#prompting#anthropic#llm-testing#tooling
Technical analysis of role prompting using Claude's Workbench. From initial prompt comparison through rubric development to quantitative scoring, we examine the real impact of role-based instructions. The results demonstrate when and why roles improve LLM outputs.
July 15, 2024
Chat Interface Roundup (Jul 24)
#claude#chatgpt#big-agi#huggingface-chatui#librechat#anythingllm#tooling
The end of 2022 marked the release of ChatGPT to consumers, offering the first conversational interface to a powerful Generative AI platform. The simplicity of natural language chat and the seemingly limitless capabilities of GPT led to the fastest-growing user base ever recorded. Chat interfaces remain popular and accessible, and continue to incorporate features that expand the utility of Generative AI, such as Document Retrieval, agent-assisted Code Execution, and Web Search.
July 12, 2024
OpenAI GPT Suite - Casual Performance (Jul 24)
#gpt-4o#gpt-4#gpt-3.5#benchmark#openai
OpenAI’s model lineup has changed recently. As of early July 2024, their API pricing page showcases two primary models: GPT-4o and GPT-3.5 Turbo, with GPT-4 and GPT-4 Turbo now in the “Older Models” section. OpenAI’s FAQ offers this guidance: use GPT-4o for complex tasks that require advanced reasoning, creativity, or in-depth analysis; use GPT-3.5 Turbo for simpler tasks like content generation, summarization, or straightforward Q&A. Interestingly, GPT-4 still features prominently on the consumer ChatGPT interface, advertised for complex tasks.
July 2, 2024
Sonnet 3.5 - Latest Model Benchmark
#sonnet-3.5#benchmark#anthropic
On June 20, 2024, Anthropic released Claude 3.5 Sonnet, promising “frontier intelligence at 2x the speed” of Claude 3 Opus. The latest iteration claims improved performance across various benchmarks, including graduate-level reasoning, undergraduate-level knowledge, and coding proficiency. But how does it fare in real-world, casual prompting scenarios? We compare Sonnet 3.5 to its predecessors, Sonnet 3 and Opus 3, using our earlier Casual Prompting Benchmark, which, as detailed in our previous article, simulates real-world usage by casual, interactive users.
July 1, 2024
Claude Projects - First Impressions
#anthropic#claude#projects#tooling
Yesterday, Anthropic released a new “Projects” feature for their Claude.ai front-end. It allows users to work with multiple documents and set up large-context conversations without the need for repeated uploads and extensive priming when starting a session. When exploring content interaction through LLM front-ends like Claude.ai, we often encounter two distinct methods: Context Stuffing, which involves loading complete documents into the working memory of the LLM, and RAG Snippets.
June 26, 2024
Claude 3 Suite - Casual Prompting Performance
#opus-3#sonnet-3#haiku-3#benchmark#anthropic
Let’s find out how the Anthropic Claude 3 models handle a Casual Prompt: a prompt that a well-informed, casual user of an AI front-end might reasonably use, expecting it to be interpreted as another human would. These prompts are typically used for tasks such as market analysis, product concepts, and summarization. Understanding how LLMs handle this type of prompt is crucial for assessing their real-world applicability and reliability.
June 23, 2024