Files and Resources with MCP - Part 1

January 22, 2025 · Shaun Smith

Introduction

As AI assistants become more embedded, they need reliable ways to work with files, images, and other content types. Whether you’re building autonomous agents or simpler workflow automation and extensions, handling diverse content effectively is crucial. The Model Context Protocol (MCP) provides a standardized approach for managing these interactions between Users, AI Assistants, and external tools.

This is the first of two articles exploring content handling in MCP. Here we examine current implementation patterns and practical examples with Claude Desktop. The second article will look at more general patterns and parts of the specification, particularly for building more sophisticated plug-and-play agentic systems.

Components and Terminology

For clarity, we distinguish between the Host application (which manages the overall user experience) and the MCP Client (a library the Host uses to communicate with MCP Servers).

| Component | Description |
| --- | --- |
| Host | The primary application (e.g. Claude Desktop, LibreChat) managing user-assistant conversations |
| Assistant | The AI model generating responses and requesting tool operations |
| Assistant API | Handles message processing, tokenization, and tool availability |
| MCP Server | Executes tool operations requested by the assistant |
| MCP Client | A library used by the Host to interact with MCP Servers |

Understanding these components’ roles is essential as we explore how different content types flow through the system.

Content Types and Processing

Large Language Models (LLMs) primarily process text, including formats like Markdown, JSON, and source code. Modern multi-modal models support additional content types:

  • Claude 3.5 Sonnet: Text, Images (Vision), and PDFs
  • OpenAI GPT-4o-Audio-Preview: Text and Audio processing
  • Google Gemini 2.0: Text, Images (Vision and Generation), and Audio generation

When users share non-text content, the Assistant API creates a Content Block and tokenizes it for the model¹. This process is efficient: an image typically requires only around 1,500 tokens, equivalent to approximately 1,000 words².
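To make this concrete, here is a minimal sketch of how a Host might compose such a message. The block shapes follow the Anthropic Messages API convention for image content; `compose_message` is an illustrative helper, not part of any SDK.

```python
import base64

def compose_message(text: str, image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Compose a user message containing a text block and a base64 image block."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    # Binary content is Base64-encoded; the Assistant API
                    # tokenizes it before the model sees it.
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
        ],
    }
```

The Host would send the resulting dict as one element of the messages array passed to the Assistant API.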

The host application manages the presentation and handling of all supported content types.

Tool Use: Vision and Generation

Modern LLMs handle images differently depending on whether they’re processing or generating them. Let’s examine this through a practical example.

OpenAI’s GPT-4o and o1 models have vision capabilities - they can analyze uploaded images, but cannot generate them. However, ChatGPT users can create images because the application provides access to image generation through tools.

Image of ChatGPT Image Generation, showing a kitten and a ball of yarn.

ChatGPT Image Generation - The model can only guess details of the generated image (such as the balls of yarn) unless the image is uploaded by the User.

Here’s how image generation works in ChatGPT:

  1. The Host application tells the Assistant that an Image Generator tool is available
  2. When a user requests an image, the Assistant creates a suitable prompt
  3. The Assistant requests the Host to call the DALL-E Image Generator tool
  4. The Host executes this request and displays the result to the user
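The four steps above can be sketched as data, using OpenAI-style tool definitions and tool-call messages (the `generate_image` tool name and the `call_1` id are illustrative, not ChatGPT's actual internals):

```python
import json

# Step 1: the Host advertises the tool to the Assistant.
tools = [{
    "type": "function",
    "function": {
        "name": "generate_image",
        "description": "Generate an image from a text prompt",
        "parameters": {
            "type": "object",
            "properties": {"prompt": {"type": "string"}},
            "required": ["prompt"],
        },
    },
}]

# Steps 2-3: the Assistant writes a suitable prompt and requests the tool call.
assistant_turn = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "generate_image",
            "arguments": json.dumps({"prompt": "A kitten playing with a ball of yarn"}),
        },
    }],
}

# Step 4: the Host executes the call and shows the image to the user.
# Only a short text result returns to the conversation; the image itself
# is attached for the user, not tokenized into the context.
tool_result = {"role": "tool", "tool_call_id": "call_1",
               "content": "Image displayed to user."}
```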

Importantly, generated images are 'attached' rather than 'embedded' in the conversation. You can verify this yourself: ask ChatGPT detailed questions about an image it just generated - it will only be able to guess based on the prompt it created. However, if you upload the generated image back to the chat, it can then analyze specific details. This architectural choice treats generated content as user-facing output rather than part of the conversation context.

While the ChatGPT example demonstrates basic tool usage, MCP implementations offer a more comprehensive framework for handling both tools and resources. These patterns apply equally to other content types such as PDFs, structured data, or audio files. The key principles of tokenization, resource handling, and tool responses remain consistent regardless of the underlying data format.

Claude Desktop and MCP Implementation

Claude Desktop with the File and MCP Attachment icons highlighted

Claude File and MCP Resource attachment

Claude Desktop (v0.78) currently offers the most sophisticated implementation of MCP features. MCP Servers expose Resources, Prompts and Tools to the Host application: Resources and Prompts return data to the Host and Assistant, while Tools allow the Assistant to make requests to the MCP Server.

General MCP Guidance

When building MCP Servers, several fundamental constraints and patterns shape both Tool and Resource implementations:

  • Output token limits constrain argument size for Tool calls, although substantial text is possible
  • Tool Call requests and arguments consume Context Window space, favouring concise interactions
  • Use the model’s inference capabilities appropriately - avoid redundant content transmission
  • Tokenized content (e.g. uploaded images or PDFs) cannot be reconstructed for sending to the MCP Server
  • Prefer using URIs over embedded content for large files, or if the content is not needed in the conversation context

These fundamental constraints inform how we handle both Tools and Resources in practice, as we’ll see in the following sections.

Resource Handling

MCP Resources represent data that can be accessed by the Assistant through the MCP Server. Unlike simple file attachments, Resources can be dynamic (like database queries) or static (like files). Resources are exposed to the Host application and can be attached to messages, allowing the Assistant to analyze their content.

To understand how these components work together in practice, let’s examine a typical resource handling flow:

```mermaid
sequenceDiagram
    actor User
    participant Host as Host and MCP Client<br/>(e.g. Claude Desktop)
    participant Server as MCP Server
    participant API as Assistant API
    participant Assistant as Assistant LLM<br/>(e.g. Claude)

    Note over User,Assistant: MCP Resource Usage in a Conversation
    User->>+Host: View Resources
    Host->>+Server: List Resources
    Server-->>-Host: Available Resources
    Host-->>-User: Display Resources
    User->>+Host: "Can you analyze this file?"<br/>+ Selected Resource
    Host->>+Server: Read Resource
    Server-->>-Host: Resource Content
    Note over Host: Compose Message:<br/>1. Text Content Block<br/>"Can you analyze..."<br/>2. Resource Content Block<br/>Base64: "...Y0IGVuY29kZWQ="
    Host->>+API: Send message array:<br/>{role: "user", content: [...]}
    Note over API: Tokenize content blocks<br/>for model consumption
    API->>+Assistant: Tokenized messages
    Assistant->>-API: Generated response
    API-->>-Host: {role: "assistant", content: "..."}
    Host-->>-User: Display formatted response
```

Current Limitations

Claude Desktop with a 'Call Stack Size' error message

Claude Desktop error message for oversized resource

While Claude Desktop handles both File (“Paperclip”) and Resource (“Connector”) attachments similarly, there are important differences:

  • MCP Resources can represent dynamic data like database queries, not just files
  • Claude Desktop’s resource handling has size limitations - large images work via File attachment but cause ‘stack size’ errors as Resources
  • Content returned from MCP Servers must be under 1MB in size.
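A server can guard against this limit itself rather than letting the Host fail with an opaque error. A minimal sketch (the helper name is illustrative; the 1MB figure reflects the limitation described above and may vary by Host version):

```python
MAX_RESOURCE_BYTES = 1_000_000  # observed Claude Desktop resource size limit

def check_resource_size(data: bytes, uri: str) -> bytes:
    """Fail with a clear message instead of triggering a Host-side 'stack size' error."""
    if len(data) > MAX_RESOURCE_BYTES:
        raise ValueError(
            f"Resource {uri} is {len(data)} bytes; limit is {MAX_RESOURCE_BYTES}. "
            "Consider returning a file location or URI instead of embedded content."
        )
    return data
```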

For experimenting with larger resources and content types, running mcp-hfspace with Claude Desktop mode set to false is convenient.

File Handling in MCP Servers

MCP Servers need two key elements to handle files effectively:

  1. Physical access to the files on the server
  2. A way to communicate available files to the Assistant for tool calls (Resource Discovery)

MCP Servers started by Claude Desktop run with the user’s account permissions and full environment access³. While this provides convenient filesystem and network access during development, it raises important security considerations for production deployments.

Best practices for secure file handling include:

  • Anticipating tighter sandboxing in deployment
  • Preferring MCP protocol features for resource access
  • Designing for compatibility with remote MCP Servers using SSE transport
  • Configuring the MCP Server with specific, allowed file directories during startup (to later be replaced with Roots)
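The last point can be implemented with a small path check. A sketch, assuming the allowed directories were passed to the server at startup (`ALLOWED_DIRS` and `resolve_safe` are illustrative names):

```python
from pathlib import Path

ALLOWED_DIRS = [Path("/data/shared"), Path("/data/exports")]  # e.g. from CLI arguments

def resolve_safe(requested: str) -> Path:
    """Resolve a requested path and verify it stays inside an allowed directory.

    resolve() follows symlinks and collapses '..' components, so traversal
    attempts like '/data/shared/../../etc/passwd' are rejected.
    """
    path = Path(requested).resolve()
    for root in ALLOWED_DIRS:
        if path.is_relative_to(root.resolve()):
            return path
    raise PermissionError(f"{requested} is outside the allowed directories")
```

Once Roots land in the protocol, the Host can supply these directories instead of server-specific startup configuration.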

Resource Discovery

Claude Desktop calling a Tool providing a list of resources.

Claude Desktop responding to a prompt for 'the most recent image'

For effective resource handling, the Assistant needs to know which resources are available. The current version of Claude Desktop doesn’t automatically expose listed Resources to the Assistant. To bridge this gap, there are three main approaches:

  • Direct user entry of resource identifiers
  • Tool calls that return the Resource List, enabling the Assistant to discover resources automatically
  • Prompts that return the Resource List for User review and selection

The MCP Specification encourages meaningful resource descriptions:

A description of what this resource represents. This can be used by clients to improve the LLM’s understanding of available resources. It can be thought of like a ‘hint’ to the model.

This reinforces the importance of structuring Resource Lists in a way that LLMs can effectively process (and the expectation that the Assistant will have access to them).
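A Tool that returns the Resource List might format it along these lines, pairing each URI (needed for later tool calls) with the description the specification encourages servers to provide (`format_resource_list` and the example URIs are illustrative, not SDK functions):

```python
def format_resource_list(resources: list[dict]) -> str:
    """Render a resource list as text an LLM can read and act on."""
    lines = ["Available resources:"]
    for r in resources:
        # The description acts as a 'hint' to the model, per the MCP spec.
        lines.append(f"- {r['uri']}: {r.get('description', 'no description')}")
    return "\n".join(lines)

listing = format_resource_list([
    {"uri": "file:///reports/q4.pdf", "description": "Q4 financial report"},
    {"uri": "db://sales/latest", "description": "Most recent sales query results"},
])
```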

Content Handling and Return Types

Tool Responses contain blocks of Text, Images or Embedded Resources.

When content is returned:

  • Text and Images are tokenized and added to the conversation context
  • Images become available to Claude’s Vision capabilities
  • Non-text content types are not processed (Claude Desktop displays "Unsupported image type: audio/wav" for an Embedded Resource)
  • Content must be under 1MB in size⁴

For content not intended for the conversation context (like large files or binary data), the MCP Server can save files to a configured directory and return a message to the User with the location.
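A sketch of that pattern, assuming a writable output directory configured at server startup (the directory and helper name are illustrative):

```python
import tempfile
import uuid
from pathlib import Path

OUTPUT_DIR = Path(tempfile.gettempdir()) / "mcp-output"  # configured at startup

def save_and_describe(data: bytes, extension: str) -> dict:
    """Write large or binary content to disk; return only a small text block."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    path = OUTPUT_DIR / f"{uuid.uuid4().hex}{extension}"
    path.write_bytes(data)
    # The bytes never enter the conversation context; the Assistant
    # relays the location to the User instead.
    return {"type": "text", "text": f"Saved {len(data)} bytes to {path}"}
```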

Embedded User Interfaces

MCP Servers can provide their own user interfaces through embedded web servers, enabling interactive access to resources that might not be directly available through the filesystem. The mcp-webcam (@llmindset/mcp-webcam) project demonstrates this approach.

When Claude Desktop launches the MCP Server, it starts a local web server hosting an interface for webcam interaction. The server provides both a tool call allowing Claude to capture webcam images in the conversation, and an ephemeral Resource for capturing frames. Screenshot capture happens through direct user interaction with the web interface.

This pattern extends naturally to other scenarios, where an MCP Server might:

  • Provide secure interfaces to database queries
  • Handle user authentication flows
  • Enable user approval workflows
  • Manage interactive request queues
  • Surface local system resources safely

For multi-user environments, this approach currently requires session key management between the tool calls and web interface. This will be simplified once the MCP protocol implements shared authentication between Servers and Hosts.
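Pending that protocol support, a minimal session-key scheme might look like the following (an illustrative sketch, not hardened for production use):

```python
import secrets

# Shared between the tool-call handlers and the embedded web server.
sessions: dict[str, dict] = {}

def create_session(user_id: str) -> str:
    """Mint an unguessable key linking a browser session to a user's tool calls."""
    key = secrets.token_urlsafe(32)
    sessions[key] = {"user": user_id}
    return key

def lookup(key: str):
    """Both the web handler and the tool handler authenticate with the same key."""
    return sessions.get(key)
```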

Looking Ahead

The patterns and implementations we’ve explored demonstrate MCP’s current capabilities for content handling. While there are some limitations, particularly around resource sizes and authentication, the protocol provides a solid foundation for building AI-enabled applications.

In the next article, we’ll examine emerging patterns for working with MCP, including improvements to resource templates, audience handling, and content type management. We’ll also look at how the specification might evolve to better support multi-modal interactions and enterprise deployment patterns.


  1. Commonly, binary resources will be encoded as Base64 content blocks. Some APIs allow the use of URIs instead. Another pattern is replacing encoded content with a temporary identifier to keep multi-turn conversations efficient - see here↩︎

  2. Both the Claude Vision and OpenAI Vision documentation describe this process well. It is worth noting that the API may resize images on input. ↩︎

  3. It is possible to deploy MCP Servers using Docker (see here and here). This makes managing deployment dependencies easier, as well as providing options to sandbox deployments. This post shows a clever use of Docker and LLMs for automating deployment of MCP Servers. ↩︎

  4. Ideally, the Image would be passed to the Assistant API to handle tokenization. ↩︎