MiMo v2 Omni
MiMo v2 Omni is Xiaomi's flagship multimodal model supporting text, image, audio, and video inputs with a 262K token context window.
API Pricing
| Provider | Input (USD / 1M tokens) | Output (USD / 1M tokens) | Updated |
|---|---|---|---|
| | $0.400 | $2.00 | 4/14/2026 |
Prices updated daily. Last check: 4/14/2026
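As a rough illustration of how the per-million rates above translate into a per-request cost, here is a minimal Python sketch. It assumes one flat rate for all input tokens regardless of modality, which this page does not confirm, and the function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Rates copied from the pricing table (USD per 1M tokens, as of 4/14/2026).
INPUT_USD_PER_M = Decimal("0.400")
OUTPUT_USD_PER_M = Decimal("2.00")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request at the listed rates.

    Assumption: every input token is billed at the same rate,
    whether it comes from text, image, audio, or video.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_M
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_M
    return (input_cost + output_cost) / TOKENS_PER_M


if __name__ == "__main__":
    # 200,000 input tokens and 4,000 output tokens.
    print(f"${estimate_cost(200_000, 4_000):.4f}")
```

At these rates, a request with 200,000 input tokens and 4,000 output tokens comes to about $0.088, with the input side accounting for $0.08 of that.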
Model Details
General
- Creator: Xiaomi
- Family: MiMo
- Tier: Flagship
- Context Window: 262K tokens
- Modalities: Text, Image, Audio, Video
Capabilities
- Tool Calling: No
- Open Source: No
Strengths & Limitations
Strengths
- Supports four input modalities: text, image, audio, and video
- Large 262,144-token context window for processing extensive content
- Flagship-tier model with comprehensive multimodal understanding
- Developed by Xiaomi, offering potential ecosystem integration
- Handles video input, which is less common among available models
- Audio processing capabilities alongside visual and text understanding
- Substantial context capacity for complex multimodal tasks
Limitations
- No tool calling or function execution capabilities
- Proprietary model with no open-source availability
- Limited performance benchmarks publicly available
- Newer entrant compared to established model providers
- No streaming response metrics disclosed
Common Use Cases
MiMo v2 Omni is designed for multimodal applications that require understanding and reasoning across text, image, audio, and video content. Its 262K-token context window suits it to analyzing lengthy multimedia presentations, processing educational content with mixed media, moderating content across formats, and summarizing media. It also fits consumer applications that need rich media understanding, such as smart home integration, multimedia content creation assistance, and cross-modal search and retrieval.
Because it lacks tool calling, the model is oriented toward understanding and generation rather than agentic workflows. That makes it appropriate for content analysis and creative applications, but not for tasks that depend on calling external systems.
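To make the context-window point concrete, the sketch below checks whether a mixed-media request would fit in the 262,144-token window. The per-modality token counts are hypothetical placeholders, since this page does not say how images, audio, or video are converted to tokens. Whether output tokens count against the same window is also an assumption.

```python
from typing import Mapping

# Context window size from the model details (262,144 tokens).
CONTEXT_WINDOW_TOKENS = 262_144


def total_tokens(segment_tokens: Mapping[str, int], reserved_output_tokens: int = 0) -> int:
    """Sum the token counts of all input segments plus any reserved output budget."""
    if reserved_output_tokens < 0 or any(n < 0 for n in segment_tokens.values()):
        raise ValueError("token counts must be non-negative")
    return sum(segment_tokens.values()) + reserved_output_tokens


def fits_in_context(segment_tokens: Mapping[str, int], reserved_output_tokens: int = 0) -> bool:
    """Return True if the request fits in the context window.

    Assumption: input and output share one window, so the caller
    reserves room for the response up front.
    """
    return total_tokens(segment_tokens, reserved_output_tokens) <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # Hypothetical token counts for one lecture recording and its slides.
    lecture = {"text": 12_000, "video": 200_000, "audio": 40_000}
    print(fits_in_context(lecture))
    print(fits_in_context(lecture, reserved_output_tokens=20_000))
```

With these placeholder counts the inputs alone total 252,000 tokens and fit, but reserving 20,000 tokens for the response pushes the request to 272,000 tokens, which is over the limit. In that case the input would need to be trimmed or split.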
Frequently Asked Questions
How much does MiMo v2 Omni cost per million tokens?
The pricing table above lists $0.400 per million input tokens and $2.00 per million output tokens as of 4/14/2026. Pricing varies by provider and may differ by input modality (text, image, audio, video), so check the table for current rates.
What is MiMo v2 Omni best used for?
MiMo v2 Omni excels at multimodal tasks requiring understanding across text, images, audio, and video content. It's particularly suited for media analysis, content summarization across formats, educational applications with mixed media, and consumer applications requiring comprehensive multimedia understanding.
Does MiMo v2 Omni support tool calling and function execution?
No, MiMo v2 Omni does not include tool calling capabilities. The model focuses on multimodal understanding and generation rather than agentic workflows or external system integration, making it better suited for content analysis and creative applications than automated task execution.