LightweightAlibaba

Qwen 3 VL 30B

Name: Qwen 3 VL 30B
Availability: InStock
Author: Alibaba

Qwen 3 VL 30B is Alibaba's lightweight vision-language model with text and image input capabilities and a 131K token context window.

Context 131K

Tier Lightweight

Modalities text, image

Input from

$0.130 / 1M tokens

across 4 providers

Compare Prices

API Pricing

Cheapest on OpenRouter — 31% below avg

Provider	Input / 1M	Output / 1M	Cached / 1M	Speed	TTFT	Updated
OpenRouter	$0.130	$0.520	-	128 t/s	975ms	7/13/2026
Deep Infra	$0.150	$0.600	-	128 t/s	975ms	7/13/2026
IO.NET	$0.186	$1.20	$0.093	128 t/s	975ms	6/18/2026
Scaleway	$0.290	$1.74	-	128 t/s	975ms	6/18/2026

Prices updated daily. Last check: Jul 13, 2026

Performance & Benchmarks

Source: Artificial Analysis →

Intelligence

10.0 / 100

Math

72.3 / 100

Output Speed

128 t/s

Latency (TTFT)

975ms

Reasoning & Knowledge

MMLU-Pro76.4%
GPQA Diamond69.5%
Humanity's Last Exam6.4%

Coding

LiveCodeBench47.6%
SciCode30.8%

Math

AIME 202572.3%

Agentic & Tool Use

Terminal-Bench Hard6.1%
τ²-bench19.0%

Instruction & Long Context

IFBench33.1%
Long-Context Reasoning23.7%

Benchmarks measured Jul 2026. Scores are independent evaluations, not vendor-reported.

Model Details

General

Creator: Alibaba
Family: Qwen
Tier: Lightweight
Context Window: 131K
Modalities: Text, Image

Capabilities

Tool Calling: No
Open Source: No

Strengths & Limitations

Strengths

Supports both text and image input processing
131,072 token context window for long document analysis
Fast output generation at 128.99 tokens per second
Lightweight architecture enables cost-effective deployment
Reasonable time to first token at 1,075 milliseconds
Part of established Qwen model family ecosystem

Limitations

No tool calling or function execution capabilities
Proprietary model with no open source weights available
Limited to text and image modalities only
Positioned as lightweight tier rather than maximum capability
Smaller parameter count may limit complex reasoning tasks

Key Features

•Text and image multimodal input

•131,072 token context window

•Vision-language understanding

•Streaming response generation

•Document and image analysis

•Fast inference optimization

•Lightweight model architecture

About Qwen 3 VL 30B

Qwen 3 VL 30B is a lightweight vision-language model developed by Alibaba as part of the Qwen family. Positioned as an efficient option in Alibaba's model lineup, it offers multimodal capabilities while maintaining faster inference speeds compared to larger models in the family. The model supports both text and image inputs with a 131,072 token context window, enabling processing of long documents alongside visual content. Performance benchmarks show an output speed of 128.99 tokens per second with a time to first token of 1,075 milliseconds. However, the model does not include tool calling capabilities, focusing instead on core vision-language understanding tasks. Qwen 3 VL 30B targets applications requiring efficient multimodal processing where speed and cost-effectiveness are priorities over maximum capability. It competes with other lightweight vision models by offering reasonable performance for common vision-language tasks while maintaining faster response times than flagship alternatives.

Common Use Cases

Qwen 3 VL 30B is suited for applications requiring efficient vision-language processing at scale, such as document analysis with embedded images, content moderation combining text and visual elements, and automated image captioning or description tasks. Its lightweight architecture and fast inference speed make it appropriate for high-volume scenarios where cost efficiency matters more than maximum capability, including customer service applications processing screenshots, educational content analysis, and basic visual question answering systems where rapid response times are prioritized.

Frequently Asked Questions

How much does Qwen 3 VL 30B cost per million tokens?

Qwen 3 VL 30B pricing varies by provider and may include separate rates for input text, input images, and output tokens. Check the pricing table above for current rates across all available providers.

What is Qwen 3 VL 30B best used for?

Qwen 3 VL 30B excels at efficient vision-language tasks like document analysis with images, content moderation, image captioning, and visual question answering where speed and cost-effectiveness are important. Its 131K context window supports long documents with visual elements.

Does Qwen 3 VL 30B support function calling or tool use?

No, Qwen 3 VL 30B does not include tool calling capabilities. It focuses on core vision-language understanding tasks with text and image inputs, making it suitable for applications that need multimodal processing without external tool integration.