
MiMo v2 Omni

Name: MiMo v2 Omni
Author: Xiaomi

MiMo v2 Omni is Xiaomi's flagship multimodal model supporting text, image, audio, and video inputs with a 262K token context window.

Context: 262K

Tier: Flagship

Modalities: text, image, audio, video

Contact providers for pricing


API Pricing

No pricing data available for this model at the moment.

Prices updated daily. Last check: Jul 13, 2026

Performance & Benchmarks

Source: Artificial Analysis

Intelligence

36.4 / 100

Reasoning & Knowledge

GPQA Diamond: 85.5%
Humanity's Last Exam: 20.4%

Coding

SciCode: 39.5%

Agentic & Tool Use

Terminal-Bench Hard: 35.6%
τ²-bench: 88.0%

Instruction & Long Context

IFBench: 67.3%
Long-Context Reasoning: 63.7%

Benchmarks measured Jul 2026. Scores are independent evaluations, not vendor-reported.

Model Details

General

Creator: Xiaomi
Family: MiMo
Tier: Flagship
Context Window: 262K
Modalities: Text, Image, Audio, Video

Capabilities

Tool Calling: No
Open Source: No

Strengths & Limitations

Strengths

Supports four modalities: text, image, audio, and video input
Large 262,144 token context window for processing extensive content
Flagship-tier model with comprehensive multimodal understanding
Developed by Xiaomi, offering potential ecosystem integration
Handles video input, which is less common among available models
Audio processing capabilities alongside visual and text understanding
Substantial context capacity for complex multimodal tasks
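
As a rough illustration of what the 262,144 token window means in practice, the sketch below checks whether a mixed-media request fits. The per-modality token counts and the reserved output allowance are hypothetical: the listing does not say how MiMo v2 Omni tokenizes images, audio, or video, or whether output tokens share the window.

```python
CONTEXT_WINDOW = 262_144  # tokens, as listed for MiMo v2 Omni (equals 2**18)


def fits_in_context(input_tokens: dict[str, int], reserved_for_output: int = 0) -> bool:
    """Return True if all inputs plus the reserved output allowance fit in the window."""
    if reserved_for_output < 0 or any(n < 0 for n in input_tokens.values()):
        raise ValueError("token counts must be non-negative")
    return sum(input_tokens.values()) + reserved_for_output <= CONTEXT_WINDOW


def remaining_tokens(input_tokens: dict[str, int], reserved_for_output: int = 0) -> int:
    """Tokens left after the inputs and the reserved output allowance (may be negative)."""
    return CONTEXT_WINDOW - sum(input_tokens.values()) - reserved_for_output


# Hypothetical token counts per modality; real counts depend on the provider's tokenizer.
request = {"text": 12_000, "image": 3_500, "audio": 40_000, "video": 150_000}
print(fits_in_context(request, reserved_for_output=8_000))   # True
print(remaining_tokens(request, reserved_for_output=8_000))  # 48644
```

Treating the output allowance as part of the same budget is a conservative assumption; if a provider counts output separately, pass 0 for reserved_for_output.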

Limitations

No tool calling or function execution capabilities
Proprietary model with no open-source availability
Limited performance benchmarks publicly available
Newer entrant compared to established model providers
No streaming response metrics disclosed

Key Features

• 262,144 token context window
• Text input and generation
• Image processing and understanding
• Audio input processing
• Video content analysis
• Multimodal reasoning across input types
• Flagship-tier model performance
• Proprietary Xiaomi model architecture

About MiMo v2 Omni

MiMo v2 Omni is Xiaomi's flagship model in the MiMo family and represents the company's entry into large-scale multimodal AI. As a proprietary model from the Chinese technology company, it positions Xiaomi alongside other large technology companies that offer AI products beyond their traditional hardware focus. The model accepts four input modalities (text, image, audio, and video), which makes it one of the more comprehensive multimodal offerings available. Its 262,144 token context window lets it process substantial amounts of content across these input types. However, the model does not include tool calling, which distinguishes it from many other flagship models that emphasize agentic workflows. MiMo v2 Omni appears designed for applications requiring rich media understanding and generation, particularly in consumer and enterprise scenarios where integration with Xiaomi's ecosystem might provide advantages. Its multimodal capabilities make it suitable for content analysis, media processing, and applications that need to reason across different types of input simultaneously.

Common Use Cases

MiMo v2 Omni is designed for complex multimodal applications that require understanding and reasoning across text, image, audio, and video content. Its large context window makes it suitable for analyzing lengthy multimedia presentations, processing educational content with mixed media, moderating content across formats, and summarizing media. It also fits consumer applications that need rich media understanding, such as smart home integration, multimedia content creation assistance, and cross-modal search and retrieval. Because it lacks tool calling, it focuses on understanding and generation rather than agentic workflows, so it is better matched to content analysis, creative applications, and comprehensive multimodal comprehension than to external system integration.

Frequently Asked Questions

How much does MiMo v2 Omni cost per million tokens?

MiMo v2 Omni pricing varies by provider and may differ by input modality (text, image, audio, video). No pricing data is listed for this model at the moment, so contact providers directly for current rates.
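
Once a provider quotes a rate, the cost of a request follows directly from the per-million-token prices. The sketch below shows the arithmetic only; the function name and the prices are placeholders for illustration, not real MiMo v2 Omni rates, and it assumes a single input price rather than separate prices per modality.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float,
                 output_price_per_million: float) -> float:
    """Cost of one request, in whatever currency the prices are quoted in."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Placeholder prices for illustration only; not real MiMo v2 Omni rates.
cost = request_cost(200_000, 4_000,
                    input_price_per_million=1.00,
                    output_price_per_million=3.00)
print(f"{cost:.4f}")  # 0.2120
```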

What is MiMo v2 Omni best used for?

MiMo v2 Omni excels at multimodal tasks requiring understanding across text, images, audio, and video content. It's particularly suited for media analysis, content summarization across formats, educational applications with mixed media, and consumer applications requiring comprehensive multimedia understanding.

Does MiMo v2 Omni support tool calling and function execution?

No, MiMo v2 Omni does not include tool calling capabilities. The model focuses on multimodal understanding and generation rather than agentic workflows or external system integration, making it better suited for content analysis and creative applications than automated task execution.