Definition

A multimodal model is one that accepts inputs from multiple data modalities. The most common combination is text plus images (vision-language models). Some models also accept audio, video, or documents (PDF, spreadsheet). Outputs are typically text, though some models generate images or audio as well.

In the Anthropic Messages API, images are passed as image content blocks alongside text blocks:

{
  "role": "user",
  "content": [
    { "type": "image", "source": { "type": "base64", "media_type": "image/png", "data": "..." } },
    { "type": "text", "text": "What is shown in this screenshot?" }
  ]
}
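The JSON above maps directly onto a Python dict. A minimal sketch of assembling such a message from raw image bytes (the helper name build_image_message is illustrative, not part of any SDK):

```python
import base64

def build_image_message(image_bytes, question, media_type="image/png"):
    """Assemble a user message pairing one image block with a text block."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    # The API expects base64-encoded text, not raw bytes.
                    "data": base64.standard_b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }
```

Ordering the image block before the text block, as shown, matches the pattern in the JSON example above.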

Images consume input tokens. For Claude, image cost is approximately (width × height) / 750 tokens, so a 1024x1024 image costs roughly 1,400 tokens; other providers use different formulas. Token cost scales with resolution, so resize images before sending when high resolution is not needed.
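As a rough planning aid, Anthropic documents Claude's image cost as about (width × height) / 750 tokens; a sketch of that estimate (other providers price images differently):

```python
def estimate_image_tokens(width, height):
    """Approximate Claude image token cost: documented as ~(w * h) / 750."""
    return int(width * height / 750)

print(estimate_image_tokens(1024, 1024))  # 1398 -- roughly 1,400 tokens
```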

Multimodal capabilities include:

  • OCR and text extraction from images and screenshots.
  • Diagram and chart interpretation.
  • UI understanding (screenshots of interfaces).
  • Document parsing (PDF pages passed as images).

Limitations: models may hallucinate or misread text in images when contrast is poor or fonts are small.

When it applies

Use multimodal input when structured data is only available as an image (scanned documents, screenshots, charts). Extract text with OCR first when the image contains dense text and accuracy matters. Prefer structured document formats (HTML, PDF-text) over image rendering when available; text is cheaper and more accurate.

Resize images to the minimum resolution needed. For Claude, 1568 px on the longest edge is the practical ceiling; larger images are downscaled server-side before processing.
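The downscale arithmetic can be sketched as: find the largest size that fits under a maximum edge length while preserving aspect ratio, then pass the result to an image library (e.g. Pillow's Image.resize). The 1568 px default here reflects Claude's documented limit; the helper name is illustrative:

```python
def fit_to_max_edge(width, height, max_edge=1568):
    """Return (w, h) scaled so the longest edge is at most max_edge,
    preserving aspect ratio. Dimensions already within the limit are
    returned unchanged."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

print(fit_to_max_edge(4000, 3000))  # -> (1568, 1176)
```

Resizing client-side to these dimensions avoids paying upload bandwidth for pixels the server would discard anyway.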

Example

import base64
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

# Encode the image file as base64 text for the image content block.
img_data = base64.standard_b64encode(Path("chart.png").read_bytes()).decode()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            # Image block first, then the question about it.
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_data}},
            {"type": "text", "text": "Describe the trend shown in this chart."}
        ]
    }]
)
print(response.content[0].text)

Related terms

  • context-window - images consume context window tokens; plan for this when combining text and images.
  • token - image token cost varies by resolution and API; check pricing before sending large images.
  • completion - multimodal requests return standard text completions.
  • tool-call - multimodal input pairs naturally with tool calls for image-triggered actions.
  • prompt-design - prompting strategies for vision tasks.

Citing this term

See Multimodal (llmbestpractices.com/glossary/multimodal).