
Granite 4.0 H Micro

Granite 4.0 H Micro is IBM's lightweight text model designed for high-speed inference with a 131K token context window.

Context 131K
Tier Lightweight
Input from $0.017 / 1M tokens across 1 provider

API Pricing

Provider   Input / 1M   Output / 1M   Speed     TTFT   Updated
…          $0.017       $0.110        394 t/s   8.7s   4/14/2026

Prices updated daily. Last check: 4/14/2026
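The listed rates make per-workload costs easy to estimate. A minimal sketch, using the input and output prices from the table above (the 50K/2K token workload is an illustrative assumption):

```python
# Rough cost estimator for Granite 4.0 H Micro at the listed rates:
# $0.017 per 1M input tokens, $0.110 per 1M output tokens.

INPUT_PER_M = 0.017   # USD per 1M input tokens (listed rate)
OUTPUT_PER_M = 0.110  # USD per 1M output tokens (listed rate)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request or batch."""
    return (input_tokens / 1_000_000) * INPUT_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: a 50K-token document summarized into 2K output tokens.
print(round(estimate_cost(50_000, 2_000), 6))  # → 0.00107
```

At these rates, even a 131K-token full-context request costs well under a cent of input, which is the point of the lightweight tier.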

Model Details

General

Creator
IBM
Family
Granite
Tier
Lightweight
Context Window
131K
Modalities
Text

Capabilities

Tool Calling
No
Open Source
No

Strengths & Limitations

Strengths:

  • Fast output generation at 464 tokens per second
  • 131K token context window supports long document processing
  • Lightweight architecture enables efficient deployment
  • IBM enterprise backing with corporate support
  • Optimized for high-throughput inference workloads
  • Competitive context size for the micro-tier model class

Limitations:

  • No tool calling or function execution capabilities
  • Text-only processing; no image or multimodal support
  • Higher time to first token (8.7 seconds) than some peers
  • Proprietary model; weights not publicly available
  • Limited reasoning compared to larger models in the family

Key Features

131,000 token context window
Text-only input and output
High-speed inference at 464 tokens/second
Streaming response generation
Enterprise-grade model deployment
Batch processing support
API-based access
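Since access is API-based with streaming support, a request typically follows the common chat-completions shape. A minimal sketch of the payload; the endpoint URL and model ID below are assumptions (they vary by provider), and this only shows an OpenAI-compatible request body with streaming enabled:

```python
import json

# Hypothetical request sketch for a provider hosting Granite 4.0 H Micro.
# API_URL and MODEL_ID are assumptions; check your provider's docs.
API_URL = "https://api.example-provider.com/v1/chat/completions"  # assumed
MODEL_ID = "granite-4.0-h-micro"  # assumed model identifier

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible streaming chat payload."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,  # the model supports streamed responses
    }

payload = build_request("Summarize the attached report in five bullets.")
print(json.dumps(payload, indent=2))
```

The actual HTTP call (and authentication) is provider-specific and omitted here; only the payload shape is shown.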

About Granite 4.0 H Micro

Granite 4.0 H Micro is IBM's lightweight model in the Granite 4.0 family, positioned for applications requiring fast inference speeds rather than maximum capability. As a micro-tier model, it sits below IBM's more capable Granite variants while offering competitive performance for its size class. The model processes text-only inputs with a 131,000 token context window, providing substantial capacity for document processing and multi-turn conversations. Performance benchmarks show output generation at 464 tokens per second with an 8.7 second time to first token, indicating optimization for throughput over latency.

The model handles standard text generation tasks but does not include tool calling or multimodal processing. Granite 4.0 H Micro targets use cases where speed and efficiency matter more than advanced reasoning. Its combination of a long context window and fast token generation makes it suitable for high-volume text processing workloads where cost efficiency is prioritized over the sophisticated capabilities found in frontier models.
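The throughput-over-latency tradeoff can be made concrete with a back-of-the-envelope latency model using the figures quoted above (8.7 s TTFT, 464 tokens/second); the formula itself is a simplification, not a published benchmark method:

```python
# Simple latency model: total time = TTFT + output_tokens / generation speed.
# Short replies are dominated by the 8.7 s TTFT; long ones by throughput.

TTFT_S = 8.7        # time to first token, seconds (quoted figure)
SPEED_TPS = 464.0   # output tokens per second (quoted figure)

def total_time(output_tokens: int) -> float:
    """Estimated wall-clock seconds for a complete response."""
    return TTFT_S + output_tokens / SPEED_TPS

for n in (100, 1_000, 10_000):
    print(n, round(total_time(n), 1))
```

A 100-token reply takes almost as long as a 1,000-token one, which is why this model suits batch and high-volume workloads better than interactive chat.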

Common Use Cases

Granite 4.0 H Micro is well-suited for high-volume text processing applications where speed and cost efficiency are priorities. Its fast token generation makes it effective for content summarization, document analysis, and automated text generation workflows that process large quantities of material. The 131K context window enables processing of substantial documents, research papers, or multi-turn conversations without truncation. Organizations needing reliable text processing for customer service automation, content moderation, or data extraction tasks can benefit from its combination of decent capability and optimized performance. The model works well for applications where advanced reasoning or tool use are not required, but consistent, fast text processing is essential.
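For the document-processing workloads above, a quick pre-flight check of whether material fits the 131K window is useful. A minimal sketch; the 4-characters-per-token ratio is a rough English-text heuristic, not a property of Granite's tokenizer:

```python
# Heuristic check of whether a document plus an output budget fits in
# the 131K-token context window. CHARS_PER_TOKEN is an assumption.

CONTEXT_TOKENS = 131_000
CHARS_PER_TOKEN = 4  # rough heuristic; varies by language and tokenizer

def fits_in_context(text: str, reserved_output_tokens: int = 2_000) -> bool:
    """Estimate whether `text` and the output budget fit the window."""
    est_tokens = len(text) // CHARS_PER_TOKEN
    return est_tokens + reserved_output_tokens <= CONTEXT_TOKENS

# ~250K characters, roughly 62.5K estimated tokens: fits comfortably.
print(fits_in_context("word " * 50_000))
```

Documents that fail the check would need chunking or map-reduce summarization before being sent to the model.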

Frequently Asked Questions

How much does Granite 4.0 H Micro cost per million tokens?

As of the last check (4/14/2026), the single listed provider charged $0.017 per 1M input tokens and $0.110 per 1M output tokens. Pricing varies by provider and usage type; check the pricing table above for current rates.

What is Granite 4.0 H Micro best used for?

Granite 4.0 H Micro excels at high-volume text processing tasks requiring fast inference speeds. It's ideal for document summarization, content generation, text analysis, and automated writing workflows where speed and cost efficiency matter more than advanced reasoning capabilities.

How does Granite 4.0 H Micro compare to other lightweight models?

Granite 4.0 H Micro offers a notably large 131K context window for its micro tier, combined with fast 464 tokens/second output generation. However, it has a longer time to first token at 8.7 seconds and lacks tool calling capabilities that some competing lightweight models provide.