LightweightBaidu

ERNIE 4.5 VL 28B

ERNIE 4.5 VL 28B is Baidu's lightweight multimodal model with vision capabilities and a 30K token context window for efficient text and image processing.

Context 30K
Tier Lightweight
Modalities text, image
Input from
$0.140 / 1M tokens
across 1 provider

API Pricing

ProviderInput / 1MOutput / 1MUpdated
$0.140$0.5604/14/2026

Prices updated daily. Last check: 4/14/2026

Model Details

General

Creator
Baidu
Family
ERNIE
Tier
Lightweight
Context Window
30K
Modalities
Text, Image

Capabilities

Tool Calling
No
Open Source
No

Strengths & Limitations

  • Multimodal support for both text and image inputs
  • 30,000 token context window for processing longer documents and conversations
  • 28 billion parameter architecture balances capability with efficiency
  • Part of Baidu's established ERNIE model family with Chinese language optimization
  • Lightweight tier positioning enables faster inference compared to flagship models
  • Vision-language capabilities for image analysis and multimodal reasoning
  • No tool calling or function execution capabilities
  • Proprietary model with weights not publicly available
  • Smaller context window compared to flagship models in the 100K+ range
  • Limited to text and image modalities without audio or video support
  • Lightweight tier may have reduced reasoning complexity versus flagship alternatives

Key Features

Text and image input processing
30,000 token context window
28 billion parameter architecture
Multimodal reasoning across text and visual content
Image description and analysis capabilities
Visual question answering
Streaming response support
Chinese language optimization

About ERNIE 4.5 VL 28B

ERNIE 4.5 VL 28B is Baidu's lightweight multimodal language model in the ERNIE family, positioned as an efficient option for text and image understanding tasks. The model features 28 billion parameters and represents Baidu's approach to balancing performance with computational efficiency in the vision-language domain. The model supports both text and image inputs with a 30,000 token context window, enabling it to process documents, images, and conversations of moderate length. As a vision-language model, ERNIE 4.5 VL 28B can analyze visual content, describe images, answer questions about visual elements, and perform multimodal reasoning tasks that require understanding both textual and visual information. ERNIE 4.5 VL 28B is designed for applications requiring multimodal capabilities without the computational overhead of larger flagship models. Its lightweight architecture makes it suitable for scenarios where vision-language processing is needed at scale or where response speed is prioritized over maximum capability complexity.

Common Use Cases

ERNIE 4.5 VL 28B is well-suited for applications requiring efficient multimodal processing, particularly in scenarios involving Chinese language content and visual analysis. Its lightweight architecture makes it appropriate for document analysis with images, e-commerce product description generation, content moderation involving both text and images, educational applications that need to process textbooks with diagrams, and customer service scenarios where visual context is important. The 30K context window supports moderate-length conversations and document processing while maintaining cost efficiency. Organizations needing vision-language capabilities at scale, particularly in Chinese markets or multilingual applications, can benefit from its balanced performance-to-efficiency ratio without requiring the computational resources of flagship multimodal models.

Frequently Asked Questions

How much does ERNIE 4.5 VL 28B cost per million tokens?

ERNIE 4.5 VL 28B pricing varies by provider and may have different rates for text and image processing. Check the pricing table above for current rates across all available providers.

What is ERNIE 4.5 VL 28B best used for?

ERNIE 4.5 VL 28B excels at multimodal tasks requiring both text and image understanding, including document analysis with visual elements, image description, visual question answering, and content moderation. Its lightweight architecture makes it particularly suitable for high-volume applications and scenarios where Chinese language support is important.

Does ERNIE 4.5 VL 28B support tool calling or function execution?

No, ERNIE 4.5 VL 28B does not support tool calling or function execution capabilities. The model is focused on text and image understanding tasks rather than agentic workflows that require external tool integration.