
Qwen 3.5 35B

Qwen 3.5 35B is Alibaba's lightweight multimodal model supporting text, image, and video inputs with a 262K token context window.

Context 262K
Tier Lightweight
Modalities text, image, video
Input from
$0.163 / 1M tokens
across 1 provider

API Pricing

Provider    Input / 1M    Output / 1M    Speed      TTFT    Updated
-           $0.163        $1.30          143 t/s    1.1s    4/14/2026

Prices updated daily. Last check: 4/14/2026
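At the rates listed above, per-request cost is a simple linear function of token counts. A minimal sketch, using the $0.163 input / $1.30 output rates as a snapshot (provider rates change daily):

```python
# Rough cost estimate for a single request at the listed rates:
# $0.163 per 1M input tokens, $1.30 per 1M output tokens.
INPUT_PER_M = 0.163   # USD per 1M input tokens
OUTPUT_PER_M = 1.30   # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# Example: a 200K-token document summarized into a 2K-token answer.
cost = request_cost(200_000, 2_000)
print(f"${cost:.4f}")  # -> $0.0352
```

Note the asymmetry: output tokens cost roughly 8x input tokens, so long-context summarization (large input, short output) is comparatively cheap on this model.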

Model Details

General

Creator
Alibaba
Family
Qwen
Tier
Lightweight
Context Window
262K
Modalities
Text, Image, Video

Capabilities

Tool Calling
No
Open Source
No

Strengths & Limitations

Strengths

  • Supports video input processing alongside text and images
  • Large 262K token context window for extensive document processing
  • Fast output generation at 142.97 tokens per second
  • Multimodal capabilities in a lightweight-tier model
  • Sub-second time to first token at 997ms
  • Part of Alibaba's established Qwen model family
  • Handles complex multimedia reasoning tasks

Limitations

  • No tool calling or function execution capabilities
  • Proprietary model with weights not publicly available
  • Lightweight tier may have reduced reasoning capability compared to larger models
  • Limited to inference only, without fine-tuning access

Key Features

262K token context window
Text input and generation
Image input processing
Video input analysis
Multimodal reasoning
Streaming response support
Fast inference speeds
Cross-modal content understanding
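The multimodal and streaming features above can be sketched as a chat request body. This assumes the hosting provider exposes an OpenAI-compatible chat completions endpoint; the model id "qwen3.5-35b" and the exact payload shape are illustrative assumptions, not confirmed values, so check your provider's documentation:

```python
# Sketch of a multimodal, streaming chat request body, assuming an
# OpenAI-compatible chat completions endpoint. The model id below is
# a hypothetical placeholder -- substitute your provider's actual id.
import json

def build_request(prompt: str, image_url: str, stream: bool = True) -> dict:
    """Build a chat completion payload mixing text and image content."""
    return {
        "model": "qwen3.5-35b",  # hypothetical model id
        "stream": stream,        # streaming responses are supported
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_request("Describe this chart.", "https://example.com/chart.png")
print(json.dumps(payload, indent=2))
```

Video input would follow the same content-parts pattern, with the provider's video content type in place of the image part.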

About Qwen 3.5 35B

Qwen 3.5 35B is a lightweight-tier model from Alibaba's Qwen family, designed to provide multimodal capabilities at a more accessible scale than larger models. As part of Alibaba's latest generation of language models, it sits in the lightweight category while maintaining sophisticated functionality across multiple input types. The model features a 262,144 token context window and supports text, image, and video modalities, making it capable of processing complex multimedia inputs.

Performance benchmarks show an output speed of 142.97 tokens per second with a time to first token of 997 milliseconds. The model handles multimodal reasoning tasks across visual and textual content, though it does not include tool calling capabilities. Qwen 3.5 35B targets use cases requiring multimodal understanding without the computational overhead of frontier models. Its combination of video processing capabilities and substantial context window positions it for applications needing efficient multimedia analysis and content generation workflows.

Common Use Cases

Qwen 3.5 35B is well-suited for applications requiring efficient multimodal processing, particularly those involving video content analysis, multimedia document understanding, and visual question answering. Its lightweight tier makes it appropriate for high-volume content moderation, automated video summarization, educational content analysis, and customer service applications that need to process images or videos. The large context window enables processing of lengthy multimedia documents or multiple files in a single request, while the fast inference speeds support real-time applications. Organizations needing cost-effective multimodal AI without the overhead of frontier models will find it suitable for content analysis, media processing workflows, and applications where video understanding is required alongside text processing.

Frequently Asked Questions

How much does Qwen 3.5 35B cost per million tokens?

At the current listed rate, Qwen 3.5 35B costs $0.163 per 1M input tokens and $1.30 per 1M output tokens. Pricing varies by provider and may differ for text versus image/video tokens; check the pricing table above for current rates.

What is Qwen 3.5 35B best used for?

Qwen 3.5 35B excels at multimodal tasks requiring video processing, content analysis, and multimedia document understanding. Its lightweight tier and fast inference make it ideal for high-volume applications like content moderation, video summarization, and customer service scenarios involving visual content.

Does Qwen 3.5 35B support tool calling or function execution?

No, Qwen 3.5 35B does not support tool calling or function execution capabilities. It focuses on multimodal understanding and generation tasks across text, image, and video inputs without external tool integration.