LightweightBaidu

ERNIE 4.5 VL 28B

Name: ERNIE 4.5 VL 28B
Author: Baidu

ERNIE 4.5 VL 28B is Baidu's lightweight multimodal model with vision capabilities and a 30K token context window for efficient text and image processing.

Context 30K

Tier Lightweight

Modalities text, image

Contact providers for pricing

Compare Prices

API Pricing

No pricing data available for this model at the moment.

Prices updated daily. Last check: Jul 13, 2026

Model Details

General

Creator: Baidu
Family: ERNIE
Tier: Lightweight
Context Window: 30K
Modalities: Text, Image

Capabilities

Tool Calling: No
Open Source: No

Strengths & Limitations

Strengths

Multimodal support for both text and image inputs
30,000 token context window for processing longer documents and conversations
28 billion parameter architecture balances capability with efficiency
Part of Baidu's established ERNIE model family with Chinese language optimization
Lightweight tier positioning enables faster inference compared to flagship models
Vision-language capabilities for image analysis and multimodal reasoning

Limitations

No tool calling or function execution capabilities
Proprietary model with weights not publicly available
Smaller context window compared to flagship models in the 100K+ range
Limited to text and image modalities without audio or video support
Lightweight tier may have reduced reasoning complexity versus flagship alternatives

Key Features

•Text and image input processing

•30,000 token context window

•28 billion parameter architecture

•Multimodal reasoning across text and visual content

•Image description and analysis capabilities

•Visual question answering

•Streaming response support

•Chinese language optimization

About ERNIE 4.5 VL 28B

ERNIE 4.5 VL 28B is Baidu's lightweight multimodal language model in the ERNIE family, positioned as an efficient option for text and image understanding tasks. The model features 28 billion parameters and represents Baidu's approach to balancing performance with computational efficiency in the vision-language domain. The model supports both text and image inputs with a 30,000 token context window, enabling it to process documents, images, and conversations of moderate length. As a vision-language model, ERNIE 4.5 VL 28B can analyze visual content, describe images, answer questions about visual elements, and perform multimodal reasoning tasks that require understanding both textual and visual information. ERNIE 4.5 VL 28B is designed for applications requiring multimodal capabilities without the computational overhead of larger flagship models. Its lightweight architecture makes it suitable for scenarios where vision-language processing is needed at scale or where response speed is prioritized over maximum capability complexity.

Common Use Cases

ERNIE 4.5 VL 28B is well-suited for applications requiring efficient multimodal processing, particularly in scenarios involving Chinese language content and visual analysis. Its lightweight architecture makes it appropriate for document analysis with images, e-commerce product description generation, content moderation involving both text and images, educational applications that need to process textbooks with diagrams, and customer service scenarios where visual context is important. The 30K context window supports moderate-length conversations and document processing while maintaining cost efficiency. Organizations needing vision-language capabilities at scale, particularly in Chinese markets or multilingual applications, can benefit from its balanced performance-to-efficiency ratio without requiring the computational resources of flagship multimodal models.

Frequently Asked Questions

How much does ERNIE 4.5 VL 28B cost per million tokens?

ERNIE 4.5 VL 28B pricing varies by provider and may have different rates for text and image processing. Check the pricing table above for current rates across all available providers.

What is ERNIE 4.5 VL 28B best used for?

ERNIE 4.5 VL 28B excels at multimodal tasks requiring both text and image understanding, including document analysis with visual elements, image description, visual question answering, and content moderation. Its lightweight architecture makes it particularly suitable for high-volume applications and scenarios where Chinese language support is important.

Does ERNIE 4.5 VL 28B support tool calling or function execution?

No, ERNIE 4.5 VL 28B does not support tool calling or function execution capabilities. The model is focused on text and image understanding tasks rather than agentic workflows that require external tool integration.