Google · Lightweight

Gemma 3 4B

Gemma 3 4B is Google's lightweight multimodal model supporting text and image inputs with a 131K token context window.

Context 131K
Tier Lightweight
Modalities text, image
Input from $0.020 / 1M tokens (across 3 providers)

API Pricing

Cheapest on Amazon AWS (43% below average)

Provider      Input / 1M   Output / 1M   Speed      TTFT   Updated
Amazon AWS    $0.020       $0.040       31.1 t/s   1.2s   4/14/2026
—             $0.040       $0.080       31.1 t/s   1.2s   4/4/2026
—             $0.040       $0.080       31.1 t/s   1.2s   4/14/2026
—             $0.040       $0.080       31.1 t/s   1.2s   4/14/2026

Prices updated daily. Last check: 4/14/2026
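To make the per-million-token rates concrete, here is a minimal cost estimate using the cheapest listed tier ($0.020 input / $0.040 output per 1M tokens). The token counts are illustrative, not from the source.

```python
# Estimate request cost from per-token rates in the table above.
# Rates are USD per 1M tokens; the cheapest listed tier is used here.
INPUT_RATE = 0.020   # $ / 1M input tokens (cheapest provider)
OUTPUT_RATE = 0.040  # $ / 1M output tokens (cheapest provider)

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = INPUT_RATE,
                 output_rate: float = OUTPUT_RATE) -> float:
    """Return the USD cost of a single request."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A request with a 10K-token prompt and a 2K-token response:
print(f"${request_cost(10_000, 2_000):.5f}")  # $0.00028
```

At these rates, even a million such requests would cost under $300, which is what makes the lightweight tier attractive for high-volume workloads.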

Model Details

General

Creator
Google
Family
Gemma
Tier
Lightweight
Context Window
131K
Modalities
Text, Image

Capabilities

Tool Calling
No
Open Source
No

Strengths & Limitations

Strengths

  • Supports multimodal inputs including text and images
  • Large 131K token context window for a lightweight model
  • Fast inference at 30.32 output tokens per second
  • Efficient 4B parameter size reduces computational requirements
  • Part of Google's established Gemma model family
  • Reasonable time to first token at 1095ms
  • Suitable for resource-constrained multimodal applications

Limitations

  • No tool calling or function execution capabilities
  • Not open source by OSI standards: weights are released under Google's restrictive Gemma license
  • Smaller parameter count limits complex reasoning compared to larger models
  • No audio or video input, unlike some competing multimodal models

Key Features

131K token context window
Text input and generation
Image input processing
Multimodal understanding
Streaming response support
4 billion parameter architecture
Lightweight inference profile
Google Gemma family integration
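The streaming support listed above is typically exposed as server-sent events. The sketch below assembles streamed text from SSE lines in the widely used OpenAI-compatible chat-completions chunk format; this format is an assumption, and actual provider payloads may differ.

```python
import json

def collect_stream(sse_lines):
    """Concatenate content deltas from 'data: {...}' SSE lines.

    Assumes OpenAI-compatible streaming chunks; provider formats vary.
    """
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip comments and blank keep-alive lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Simulated stream as a provider might send it:
events = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"A cat"}}]}',
    'data: {"choices":[{"delta":{"content":" on a mat."}}]}',
    'data: [DONE]',
]
print(collect_stream(events))  # A cat on a mat.
```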

About Gemma 3 4B

Gemma 3 4B is Google's lightweight model in the Gemma family, designed for efficient multimodal processing at reduced computational cost. As a 4 billion parameter model, it sits in the lightweight tier, offering a balance between capability and resource efficiency. The model supports both text and image inputs with a 131K token context window, enabling multimodal applications while maintaining fast inference speeds.

Benchmark data shows it achieves 30.32 output tokens per second with a time to first token of 1095ms. The model does not include tool calling capabilities, focusing instead on core text generation and image understanding tasks. Gemma 3 4B targets use cases where multimodal capability is needed but computational resources or latency requirements favor a smaller model over larger alternatives. It competes with other lightweight multimodal models in scenarios requiring efficient image and text processing.

Common Use Cases

Gemma 3 4B is well-suited for applications requiring multimodal processing with efficiency constraints, such as image captioning, visual question answering, document analysis with mixed text and images, and content moderation systems. Its lightweight architecture makes it appropriate for high-volume scenarios where fast inference is prioritized over maximum capability, including mobile applications, edge deployments, or services processing large numbers of image-text pairs. The 131K context window enables processing of longer documents with embedded images while maintaining reasonable computational costs.
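For the image-plus-text use cases above, a request typically pairs a prompt with a base64-encoded image. The sketch below builds such a request body in the common OpenAI-compatible chat format; the model id `gemma-3-4b-it` and the message shape are assumptions, so check your provider's documentation for exact names. No network call is made.

```python
import base64
import json

def build_caption_request(image_bytes: bytes, prompt: str,
                          model: str = "gemma-3-4b-it") -> str:
    """Return a JSON request body pairing a text prompt with an image.

    Uses the OpenAI-compatible multimodal message shape (an assumption);
    the model id is also a placeholder -- providers name it differently.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 128,
    }
    return json.dumps(body)

# Example: build a captioning request for a (placeholder) image.
payload = build_caption_request(b"\x89PNG...", "Describe this image.")
print(json.loads(payload)["model"])  # gemma-3-4b-it
```

Embedding the image as a data URL keeps the request self-contained, which suits the high-volume, stateless batch scenarios described above.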

Frequently Asked Questions

How much does Gemma 3 4B cost per million tokens?

Gemma 3 4B pricing varies by provider and may differ for text versus image tokens. Check the pricing table above for current rates across all available providers.

What is Gemma 3 4B best used for?

Gemma 3 4B excels at multimodal tasks requiring both text and image processing where efficiency is important, such as image captioning, visual question answering, document analysis, and content classification. Its lightweight architecture makes it ideal for high-volume applications or resource-constrained environments.

Does Gemma 3 4B support function calling or tool use?

No, Gemma 3 4B does not support tool calling or function execution capabilities. It focuses on core text generation and image understanding tasks without external tool integration.