Multimodal
Work with images, audio, and video across different LLM providers
What is Multimodal?
Multimodal capabilities allow LLMs to process and understand content beyond text, including:
- Images (photos, diagrams, screenshots)
- Audio (speech, music)
- Video (clips, recordings)
- Documents (PDFs, spreadsheets)
Amux normalizes multimodal content formats across providers, allowing you to use the same code with different vision and multimodal models.
Working with Images
Image from URL
Pass image URLs directly in message content:
import { createBridge } from '@amux.ai/llm-bridge'
import { openaiAdapter } from '@amux.ai/adapter-openai'
import { anthropicAdapter } from '@amux.ai/adapter-anthropic'
const bridge = createBridge({
inbound: openaiAdapter,
outbound: anthropicAdapter,
config: { apiKey: process.env.ANTHROPIC_API_KEY }
})
const response = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{
type: 'text',
text: 'What is in this image?'
},
{
type: 'image_url',
image_url: {
url: 'https://example.com/image.jpg'
}
}
]
}
]
})
console.log(response.choices[0].message.content)
The URL must be publicly accessible. For private images, use base64 encoding instead.
Image from Base64
For local images or private files, encode as base64:
import fs from 'fs'
// Read and encode image
const imageBuffer = fs.readFileSync('./image.jpg')
const base64Image = imageBuffer.toString('base64')
const response = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{
type: 'text',
text: 'Describe this image in detail'
},
{
type: 'image_url',
image_url: {
url: `data:image/jpeg;base64,${base64Image}`
}
}
]
}
]
})
Multiple Images
Send multiple images in a single request:
const response = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Compare these two images' },
{
type: 'image_url',
image_url: { url: 'https://example.com/image1.jpg' }
},
{
type: 'image_url',
image_url: { url: 'https://example.com/image2.jpg' }
}
]
}
]
})
Vision Use Cases
Image Analysis
const response = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'What objects are in this image?' },
{ type: 'image_url', image_url: { url: imageUrl } }
]
}
]
})
OCR (Text Extraction)
const response = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Extract all text from this image' },
{ type: 'image_url', image_url: { url: screenshotUrl } }
]
}
]
})
Visual Question Answering
const response = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'How many people are in this photo?' },
{ type: 'image_url', image_url: { url: photoUrl } }
]
}
]
})
Code from Screenshots
const response = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Convert this UI screenshot to React code' },
{ type: 'image_url', image_url: { url: uiScreenshotUrl } }
]
}
]
})
Audio and Video (Qwen, Gemini)
Some providers support audio and video inputs:
Audio Input (Qwen)
import { qwenAdapter } from '@amux.ai/adapter-qwen'
const bridge = createBridge({
inbound: qwenAdapter,
outbound: qwenAdapter,
config: { apiKey: process.env.QWEN_API_KEY }
})
const response = await bridge.chat({
model: 'qwen-audio-turbo',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Transcribe this audio' },
{
type: 'audio_url',
audio_url: { url: 'https://example.com/audio.mp3' }
}
]
}
]
})
Video Input (Gemini, Qwen)
import { geminiAdapter } from '@amux.ai/adapter-gemini'
const bridge = createBridge({
inbound: geminiAdapter,
outbound: geminiAdapter,
config: { apiKey: process.env.GEMINI_API_KEY }
})
const response = await bridge.chat({
model: 'gemini-1.5-pro',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Summarize what happens in this video' },
{
type: 'video_url',
video_url: { url: 'https://example.com/video.mp4' }
}
]
}
]
})
Streaming with Multimodal
Vision and multimodal requests support streaming:
const stream = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Describe this image in detail' },
{ type: 'image_url', image_url: { url: imageUrl } }
]
}
],
stream: true
})
for await (const event of stream) {
if (event.type === 'content') {
process.stdout.write(event.content.delta)
}
}
Provider Compatibility
Multimodal support across providers:
| Provider | Vision (Images) | Audio | Video | Documents |
|---|---|---|---|---|
| OpenAI | ✅ GPT-4o, GPT-4 Turbo | ❌ | ❌ | ❌ |
| Anthropic | ✅ Claude 3.5, Claude 3 | ❌ | ❌ | ✅ |
| DeepSeek | ❌ | ❌ | ❌ | ❌ |
| Moonshot | ❌ | ❌ | ❌ | ❌ |
| Zhipu | ✅ GLM-4V | ❌ | ❌ | ❌ |
| Qwen | ✅ Qwen-VL | ✅ Qwen-Audio | ✅ Qwen2-VL | ❌ |
| Gemini | ✅ Gemini 1.5 | ✅ | ✅ | ✅ |
Not every model from a provider supports multimodal input. Check a model's capabilities before use; a simple pre-flight guard is sketched below.
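When the model is chosen dynamically, a small guard can reject image content before it reaches a text-only model. The sketch below is a hypothetical helper, not an Amux API; the capability set is hand-maintained from the table above and should be adjusted to the models you actually deploy.
// Hypothetical helper, not part of Amux: a hand-maintained vision capability set
const visionCapableModels = new Set([
  'gpt-4o',
  'gpt-4o-mini',
  'claude-3-5-sonnet-20241022',
  'glm-4v',
  'qwen-vl-plus',
  'gemini-1.5-pro'
])
type ContentPart = { type: string } & Record<string, unknown>
type ChatMessage = { role: string; content: string | ContentPart[] }
function assertVisionSupport(model: string, messages: ChatMessage[]): void {
  const hasImages = messages.some(
    (m) => Array.isArray(m.content) && m.content.some((part) => part.type === 'image_url')
  )
  if (hasImages && !visionCapableModels.has(model)) {
    throw new Error(`${model} does not accept image input`)
  }
}
// Call before bridge.chat() when routing requests across providers:
// assertVisionSupport(model, messages)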
Vision-Capable Models
OpenAI
- gpt-4o - Latest multimodal model
- gpt-4o-mini - Smaller, faster vision model
- gpt-4-turbo - Previous generation vision
- gpt-4-vision-preview - Earlier preview
Anthropic
- claude-3-5-sonnet-20241022 - Best vision capabilities
- claude-3-opus-20240229 - High accuracy vision
- claude-3-sonnet-20240229 - Balanced vision
- claude-3-haiku-20240307 - Fast vision
Zhipu
- glm-4v - Vision model
Qwen
- qwen-vl-plus - Vision understanding
- qwen-vl-max - Advanced vision
- qwen2-vl-7b - Open source vision
- qwen-audio-turbo - Audio understanding
Gemini
- gemini-1.5-pro - Multimodal (vision, audio, video)
- gemini-1.5-flash - Fast multimodal
Best Practices
1. Image Size and Format
Optimize images before sending:
// Supported formats: JPEG, PNG, WebP, GIF
// Recommended: JPEG for photos, PNG for screenshots
// Keep file size reasonable (< 20MB)
const maxSize = 20 * 1024 * 1024 // 20MB
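A pre-flight check against this limit can catch unsupported or oversized files before they are base64-encoded. This is an illustrative sketch, not an Amux API; the extension list mirrors the formats noted above, and the exact ceiling varies by provider.
import fs from 'fs'
import path from 'path'
// Illustrative pre-flight validation; reuses the 20MB maxSize defined above
const allowedExtensions = new Set(['.jpg', '.jpeg', '.png', '.webp', '.gif'])
function validateImageFile(filePath: string): void {
  const ext = path.extname(filePath).toLowerCase()
  if (!allowedExtensions.has(ext)) {
    throw new Error(`Unsupported image format: ${ext}`)
  }
  const { size } = fs.statSync(filePath)
  if (size > maxSize) {
    throw new Error(`Image is ${size} bytes, exceeding the ${maxSize}-byte limit`)
  }
}
validateImageFile('./image.jpg')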
2. Image Quality vs Cost
Higher resolution costs more tokens:
// For basic analysis, resize large images
import sharp from 'sharp'
const resized = await sharp(imageBuffer)
.resize(1024, 1024, { fit: 'inside' })
  .toBuffer()
3. Clear Instructions
Be specific about what you want from the image:
// ❌ Vague
{ type: 'text', text: 'What is this?' }
// ✅ Specific
{ type: 'text', text: 'List all text visible in this screenshot, maintaining the original layout' }
4. Error Handling
Handle image loading errors:
try {
const response = await bridge.chat({
model: 'gpt-4o',
messages: [
{
role: 'user',
content: [
{ type: 'text', text: 'Describe this image' },
{ type: 'image_url', image_url: { url: imageUrl } }
]
}
]
})
} catch (error) {
if (error.message.includes('image')) {
console.error('Failed to process image:', error)
// Fallback or retry logic
}
}
5. Multi-turn Vision Conversations
Continue conversations about images:
const messages = [
{
role: 'user',
content: [
{ type: 'text', text: 'What is in this image?' },
{ type: 'image_url', image_url: { url: imageUrl } }
]
}
]
const response1 = await bridge.chat({ model: 'gpt-4o', messages })
messages.push(response1.choices[0].message)
// Ask follow-up question
messages.push({
role: 'user',
content: 'What color is the car in the image?'
})
const response2 = await bridge.chat({ model: 'gpt-4o', messages })
Limitations
Token Costs
Images consume a significant number of tokens. Approximate figures (used in the budgeting sketch after this list):
- Small image (512x512): ~85 tokens
- Medium image (1024x1024): ~255 tokens
- Large image (2048x2048): ~765 tokens
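For rough budgeting, these approximations can be turned into a lookup. The figures below simply restate the list above; actual token accounting varies by provider and detail setting.
// Rough per-image estimates from the list above; not an exact provider formula
const approxImageTokens: Record<string, number> = {
  '512x512': 85,
  '1024x1024': 255,
  '2048x2048': 765
}
// Hypothetical helper: estimate the image portion of a request's token cost
function estimateImageTokens(resolutions: string[]): number {
  return resolutions.reduce((total, res) => total + (approxImageTokens[res] ?? 255), 0)
}
console.log(estimateImageTokens(['1024x1024', '512x512'])) // ~340 tokens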
Rate Limits
Vision requests may have lower rate limits than text-only requests. Check provider documentation.
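If vision requests start hitting rate limits, a simple retry with exponential backoff is usually enough. The sketch below assumes the bridge created earlier and that the thrown error exposes a status field set to 429; adapt the check to the error shape your adapter actually returns.
// Minimal retry-with-backoff sketch for rate-limited multimodal requests
// The error.status === 429 check is an assumption; inspect your adapter's real errors
async function chatWithRetry(request: any, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await bridge.chat(request)
    } catch (error: any) {
      const rateLimited = error?.status === 429
      if (!rateLimited || attempt === maxAttempts) throw error
      // Exponential backoff: 2s, 4s, 8s, ...
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt))
    }
  }
  throw new Error('Retry attempts exhausted')
}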
Content Policies
All providers have content safety policies. Avoid sending:
- Inappropriate or explicit content
- Personal/sensitive information
- Copyrighted material without permission