Multimodal

Work with images, audio, and video across different LLM providers

What is Multimodal?

Multimodal capabilities allow LLMs to process and understand different types of content beyond text, including:

  • Images (photos, diagrams, screenshots)
  • Audio (speech, music)
  • Video (clips, recordings)
  • Documents (PDFs, spreadsheets)

Amux normalizes multimodal content formats across providers, allowing you to use the same code with different vision and multimodal models.

Working with Images

Image from URL

Pass image URLs directly in message content:

import { createBridge } from '@amux.ai/llm-bridge'
import { openaiAdapter } from '@amux.ai/adapter-openai'
import { anthropicAdapter } from '@amux.ai/adapter-anthropic'

const bridge = createBridge({
  inbound: openaiAdapter,
  outbound: anthropicAdapter,
  config: { apiKey: process.env.ANTHROPIC_API_KEY }
})

const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'What is in this image?'
        },
        {
          type: 'image_url',
          image_url: {
            url: 'https://example.com/image.jpg'
          }
        }
      ]
    }
  ]
})

console.log(response.choices[0].message.content)

The URL must be publicly accessible. For private images, use base64 encoding instead.

Image from Base64

For local images or private files, encode as base64:

import fs from 'fs'

// Read and encode image
const imageBuffer = fs.readFileSync('./image.jpg')
const base64Image = imageBuffer.toString('base64')

const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Describe this image in detail'
        },
        {
          type: 'image_url',
          image_url: {
            url: `data:image/jpeg;base64,${base64Image}`
          }
        }
      ]
    }
  ]
})
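
The MIME type in the data URL must match the actual image format. A small helper along these lines (hypothetical, not part of Amux) can derive it from the file extension:

import fs from 'fs'
import path from 'path'

// Map common image extensions to MIME types (illustrative helper, not an Amux API)
const mimeTypes = {
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.png': 'image/png',
  '.webp': 'image/webp',
  '.gif': 'image/gif'
}

function toDataUrl(filePath) {
  const mime = mimeTypes[path.extname(filePath).toLowerCase()]
  if (!mime) throw new Error(`Unsupported image format: ${filePath}`)
  const base64 = fs.readFileSync(filePath).toString('base64')
  return `data:${mime};base64,${base64}`
}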

Multiple Images

Send multiple images in a single request:

const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Compare these two images' },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/image1.jpg' }
        },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/image2.jpg' }
        }
      ]
    }
  ]
})
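
When the set of images is only known at runtime, the content array can be built programmatically. A minimal sketch:

// Build the content array from a list of image URLs
const imageUrls = [
  'https://example.com/image1.jpg',
  'https://example.com/image2.jpg'
]

const content = [
  { type: 'text', text: 'Compare these images' },
  ...imageUrls.map((url) => ({ type: 'image_url', image_url: { url } }))
]

const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [{ role: 'user', content }]
})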

Vision Use Cases

Image Analysis

const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What objects are in this image?' },
        { type: 'image_url', image_url: { url: imageUrl } }
      ]
    }
  ]
})

OCR (Text Extraction)

const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Extract all text from this image' },
        { type: 'image_url', image_url: { url: screenshotUrl } }
      ]
    }
  ]
})

Visual Question Answering

const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'How many people are in this photo?' },
        { type: 'image_url', image_url: { url: photoUrl } }
      ]
    }
  ]
})

Code from Screenshots

const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Convert this UI screenshot to React code' },
        { type: 'image_url', image_url: { url: uiScreenshotUrl } }
      ]
    }
  ]
})

Audio and Video (Qwen, Gemini)

Some providers support audio and video inputs:

Audio Input (Qwen)

import { qwenAdapter } from '@amux.ai/adapter-qwen'

const bridge = createBridge({
  inbound: qwenAdapter,
  outbound: qwenAdapter,
  config: { apiKey: process.env.QWEN_API_KEY }
})

const response = await bridge.chat({
  model: 'qwen-audio-turbo',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe this audio' },
        {
          type: 'audio_url',
          audio_url: { url: 'https://example.com/audio.mp3' }
        }
      ]
    }
  ]
})

Video Input (Gemini, Qwen)

import { geminiAdapter } from '@amux.ai/adapter-gemini'

const bridge = createBridge({
  inbound: geminiAdapter,
  outbound: geminiAdapter,
  config: { apiKey: process.env.GEMINI_API_KEY }
})

const response = await bridge.chat({
  model: 'gemini-1.5-pro',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Summarize what happens in this video' },
        {
          type: 'video_url',
          video_url: { url: 'https://example.com/video.mp4' }
        }
      ]
    }
  ]
})

Streaming with Multimodal

Vision and multimodal requests support streaming:

const stream = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image in detail' },
        { type: 'image_url', image_url: { url: imageUrl } }
      ]
    }
  ],
  stream: true
})

for await (const event of stream) {
  if (event.type === 'content') {
    process.stdout.write(event.content.delta)
  }
}

Provider Compatibility

Multimodal support across providers:

| Provider  | Vision (Images)         | Audio         | Video         | Documents |
| --------- | ----------------------- | ------------- | ------------- | --------- |
| OpenAI    | ✅ GPT-4o, GPT-4 Turbo  | ❌            | ❌            | ❌        |
| Anthropic | ✅ Claude 3.5, Claude 3 | ❌            | ❌            | ❌        |
| DeepSeek  | ❌                      | ❌            | ❌            | ❌        |
| Moonshot  | ❌                      | ❌            | ❌            | ❌        |
| Zhipu     | ✅ GLM-4V               | ❌            | ❌            | ❌        |
| Qwen      | ✅ Qwen-VL              | ✅ Qwen-Audio | ✅ Qwen2-VL   | ❌        |
| Gemini    | ✅ Gemini 1.5           | ✅ Gemini 1.5 | ✅ Gemini 1.5 | ❌        |

Not all models from a provider support multimodal. Check specific model capabilities before use.
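
One way to catch mismatches early is a small allowlist on your side, built from the model lists below. This is an illustrative sketch, not an Amux API, and the list is not exhaustive:

// Vision-capable models from the lists below (illustrative, not exhaustive)
const visionModels = new Set([
  'gpt-4o', 'gpt-4o-mini', 'gpt-4-turbo',
  'claude-3-5-sonnet-20241022', 'claude-3-opus-20240229',
  'glm-4v', 'qwen-vl-plus', 'qwen-vl-max',
  'gemini-1.5-pro', 'gemini-1.5-flash'
])

function assertVisionSupport(model) {
  if (!visionModels.has(model)) {
    throw new Error(`Model ${model} is not known to support image input`)
  }
}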

Vision-Capable Models

OpenAI

  • gpt-4o - Latest multimodal model
  • gpt-4o-mini - Smaller, faster vision model
  • gpt-4-turbo - Previous generation vision
  • gpt-4-vision-preview - Earlier preview

Anthropic

  • claude-3-5-sonnet-20241022 - Best vision capabilities
  • claude-3-opus-20240229 - High accuracy vision
  • claude-3-sonnet-20240229 - Balanced vision
  • claude-3-haiku-20240307 - Fast vision

Zhipu

  • glm-4v - Vision model

Qwen

  • qwen-vl-plus - Vision understanding
  • qwen-vl-max - Advanced vision
  • qwen2-vl-7b - Open source vision
  • qwen-audio-turbo - Audio understanding

Gemini

  • gemini-1.5-pro - Multimodal (vision, audio, video)
  • gemini-1.5-flash - Fast multimodal

Best Practices

1. Image Size and Format

Optimize images before sending:

import fs from 'fs'

// Supported formats: JPEG, PNG, WebP, GIF
// Recommended: JPEG for photos, PNG for screenshots

// Keep file size reasonable (< 20MB)
const maxSize = 20 * 1024 * 1024  // 20MB
const stats = fs.statSync('./image.jpg')
if (stats.size > maxSize) throw new Error('Image exceeds 20MB limit')

2. Image Quality vs Cost

Higher resolution costs more tokens:

// For basic analysis, resize large images
import sharp from 'sharp'

const resized = await sharp(imageBuffer)
  .resize(1024, 1024, { fit: 'inside' })
  .toBuffer()
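
The resized buffer can then be sent as a base64 data URL, reusing the earlier pattern. This assumes a JPEG source; add .jpeg() to the sharp pipeline to force the output format:

// Send the resized image as a base64 data URL
const response = await bridge.chat({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image' },
        {
          type: 'image_url',
          image_url: { url: `data:image/jpeg;base64,${resized.toString('base64')}` }
        }
      ]
    }
  ]
})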

3. Clear Instructions

Be specific about what you want from the image:

// ❌ Vague
{ type: 'text', text: 'What is this?' }

// ✅ Specific
{ type: 'text', text: 'List all text visible in this screenshot, maintaining the original layout' }

4. Error Handling

Handle image loading errors:

try {
  const response = await bridge.chat({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'Describe this image' },
          { type: 'image_url', image_url: { url: imageUrl } }
        ]
      }
    ]
  })
} catch (error) {
  if (error.message.includes('image')) {
    console.error('Failed to process image:', error)
    // Fallback or retry logic
  }
}

5. Multi-turn Vision Conversations

Continue conversations about images:

const messages = [
  {
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      { type: 'image_url', image_url: { url: imageUrl } }
    ]
  }
]

const response1 = await bridge.chat({ model: 'gpt-4o', messages })
messages.push(response1.choices[0].message)

// Ask follow-up question
messages.push({
  role: 'user',
  content: 'What color is the car in the image?'
})

const response2 = await bridge.chat({ model: 'gpt-4o', messages })

Limitations

Token Costs

Images consume significant tokens:

  • Small image (512x512): ~85 tokens
  • Medium image (1024x1024): ~255 tokens
  • Large image (2048x2048): ~765 tokens
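
These figures are approximate and vary by provider and detail settings. If you want a rough budget check before sending, the tiers above can be folded into a simple estimator (a ballpark sketch, not an official formula):

// Rough per-image token estimate based on the tiers above (ballpark only)
function estimateImageTokens(width, height) {
  const longestSide = Math.max(width, height)
  if (longestSide <= 512) return 85
  if (longestSide <= 1024) return 255
  return 765
}

console.log(estimateImageTokens(800, 600))  // ~255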

Rate Limits

Vision requests may have lower rate limits than text-only requests. Check provider documentation.
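
A generic retry with exponential backoff usually handles transient rate-limit errors. This sketch assumes nothing about Amux's error shape beyond that a failed request throws:

// Retry with exponential backoff (generic sketch; adapt to the actual error shape)
async function chatWithRetry(request, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await bridge.chat(request)
    } catch (error) {
      if (attempt === maxAttempts) throw error
      const delayMs = 1000 * 2 ** (attempt - 1)  // 1s, 2s, 4s
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
}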

Content Policies

All providers have content safety policies. Avoid sending:

  • Inappropriate or explicit content
  • Personal/sensitive information
  • Copyrighted material without permission
