Qwen 3 VL 30B
Qwen 3 VL 30B is Alibaba's lightweight vision-language model with text and image input capabilities and a 131K token context window.
API Pricing
| Provider | Input / 1M | Output / 1M | Speed | TTFT | Updated |
|---|---|---|---|---|---|
| $0.130 | $0.520 | 129 t/s | 1.1s | 4/14/2026 |
Prices updated daily. Last check: 4/14/2026
Model Details
General
- Creator
- Alibaba
- Family
- Qwen
- Tier
- Lightweight
- Context Window
- 131K
- Modalities
- Text, Image
Capabilities
- Tool Calling
- No
- Open Source
- No
Strengths & Limitations
- Supports both text and image input processing
- 131,072 token context window for long document analysis
- Fast output generation at 128.99 tokens per second
- Lightweight architecture enables cost-effective deployment
- Reasonable time to first token at 1,075 milliseconds
- Part of established Qwen model family ecosystem
- No tool calling or function execution capabilities
- Proprietary model with no open source weights available
- Limited to text and image modalities only
- Positioned as lightweight tier rather than maximum capability
- Smaller parameter count may limit complex reasoning tasks
Key Features
About Qwen 3 VL 30B
Common Use Cases
Qwen 3 VL 30B is suited for applications requiring efficient vision-language processing at scale, such as document analysis with embedded images, content moderation combining text and visual elements, and automated image captioning or description tasks. Its lightweight architecture and fast inference speed make it appropriate for high-volume scenarios where cost efficiency matters more than maximum capability, including customer service applications processing screenshots, educational content analysis, and basic visual question answering systems where rapid response times are prioritized.
Frequently Asked Questions
How much does Qwen 3 VL 30B cost per million tokens?
Qwen 3 VL 30B pricing varies by provider and may include separate rates for input text, input images, and output tokens. Check the pricing table above for current rates across all available providers.
What is Qwen 3 VL 30B best used for?
Qwen 3 VL 30B excels at efficient vision-language tasks like document analysis with images, content moderation, image captioning, and visual question answering where speed and cost-effectiveness are important. Its 131K context window supports long documents with visual elements.
Does Qwen 3 VL 30B support function calling or tool use?
No, Qwen 3 VL 30B does not include tool calling capabilities. It focuses on core vision-language understanding tasks with text and image inputs, making it suitable for applications that need multimodal processing without external tool integration.