FlagshipAlibaba

Qwen 3 VL 235B

Name: Qwen 3 VL 235B
Availability: InStock
Author: Alibaba

Qwen 3 VL 235B is Alibaba's flagship multimodal model with vision and text capabilities, featuring a 262K token context window for complex reasoning tasks.

Context 262K

Tier Flagship

Modalities text, image

Input from

$0.200 / 1M tokens

across 3 providers

Compare Prices

API Pricing

Cheapest on OpenRouter — 33% below avg

Provider	Input / 1M	Output / 1M	Cached / 1M	Speed	TTFT	Updated
OpenRouter	$0.200	$0.880	$0.110	56.6 t/s	1.1s	7/13/2026
Deep Infra	$0.200	$0.880	$0.110	56.6 t/s	1.1s	7/13/2026
Amazon AWSBatch	$0.260	$1.33	-	56.6 t/s	1.1s	7/13/2026
Amazon AWS	$0.530	$2.66	-	56.6 t/s	1.1s	7/13/2026

Prices updated daily. Last check: Jul 13, 2026

Performance & Benchmarks

Source: Artificial Analysis →

Intelligence

14.3 / 100

Math

70.7 / 100

Output Speed

56.6 t/s

Latency (TTFT)

1.1s

Reasoning & Knowledge

MMLU-Pro82.3%
GPQA Diamond71.2%
Humanity's Last Exam6.3%

Coding

LiveCodeBench59.4%
SciCode35.9%

Math

AIME 202570.7%

Agentic & Tool Use

Terminal-Bench Hard6.8%
τ²-bench35.1%

Instruction & Long Context

IFBench42.7%
Long-Context Reasoning31.7%

Benchmarks measured Jul 2026. Scores are independent evaluations, not vendor-reported.

Model Details

General

Creator: Alibaba
Family: Qwen
Tier: Flagship
Context Window: 262K
Modalities: Text, Image

Capabilities

Tool Calling: No
Open Source: No

Strengths & Limitations

Strengths

Large 262K token context window supports extensive multimodal conversations
235B parameter scale provides substantial reasoning capabilities
Multimodal support handles both text and image inputs natively
Output speed of 60.22 tokens per second for responsive generation
Flagship-tier model from Alibaba with latest architectural improvements
Extended context enables analysis of long documents with embedded visuals

Limitations

No function calling or tool use capabilities
Proprietary model with no open-source weights available
Time to first token of over 1 second may impact latency-sensitive applications
Limited to text and image modalities without audio or video support
Larger model size may result in higher computational costs

Key Features

•262,144 token context window

•Multimodal input support (text and images)

•235 billion parameter architecture

•Streaming response generation

•Batch processing capabilities

•Cross-modal reasoning between text and visual content

•Document analysis with embedded graphics

•Visual question answering

About Qwen 3 VL 235B

Qwen 3 VL 235B is Alibaba's flagship model in the Qwen family, representing the company's most capable offering for multimodal tasks requiring both text and image understanding. As a 235 billion parameter model, it sits at the top of Alibaba's model hierarchy and competes with other large-scale multimodal models in the market. The model supports both text and image inputs with a substantial 262,144 token context window, enabling analysis of lengthy documents alongside visual content. Performance benchmarks show the model generates approximately 60 tokens per second with a time-to-first-token of around 1 second. The model does not currently support function calling or tool use capabilities, focusing instead on direct text and image reasoning tasks. Qwen 3 VL 235B targets enterprise and research applications requiring sophisticated multimodal analysis, such as document understanding with charts and diagrams, visual question answering, and complex reasoning over mixed media content. The extended context window allows for processing substantial amounts of text and multiple images within a single conversation.

Common Use Cases

Qwen 3 VL 235B is designed for complex multimodal applications requiring sophisticated reasoning across text and visual content. Its large context window makes it particularly suitable for analyzing lengthy documents that contain charts, diagrams, or images, such as research papers, technical manuals, or financial reports. The model excels at visual question answering, content moderation involving images, educational applications requiring diagram explanation, and enterprise document processing workflows. The flagship-tier capabilities enable advanced reasoning tasks like comparing multiple images, extracting information from complex visual layouts, and maintaining context across extended multimodal conversations.

Frequently Asked Questions

How much does Qwen 3 VL 235B cost per million tokens?

Qwen 3 VL 235B pricing varies by provider and may have different rates for text versus image tokens. Check the pricing table above for current rates across all available providers.

What is Qwen 3 VL 235B best used for?

Qwen 3 VL 235B excels at complex multimodal tasks requiring analysis of both text and images, particularly document understanding with visual elements, visual question answering, and reasoning over mixed media content within its 262K token context window.

Does Qwen 3 VL 235B support function calling or tool use?

No, Qwen 3 VL 235B does not currently support function calling or tool use capabilities. It focuses on direct text and image analysis rather than external tool integration.