
Granite 4.0 H Micro

Granite 4.0 H Micro is IBM's lightweight text model designed for high-speed inference with a 131K token context window.

Context 131K
Tier Lightweight
Input from $0.017 / 1M tokens across 1 provider

API Pricing

Provider   Input / 1M   Output / 1M   Speed     TTFT   Updated
…          $0.017       $0.110        394 t/s   8.7s   4/14/2026

Prices updated daily. Last check: 4/14/2026
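The listed rates make per-workload costs easy to estimate. A minimal sketch, using the input and output prices from the table above (the 50K/2K token workload is an illustrative assumption):

```python
# Rough cost estimator for Granite 4.0 H Micro at the listed rates:
# $0.017 per 1M input tokens, $0.110 per 1M output tokens.

INPUT_PER_M = 0.017   # USD per 1M input tokens (listed rate)
OUTPUT_PER_M = 0.110  # USD per 1M output tokens (listed rate)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request or batch."""
    return (input_tokens / 1_000_000) * INPUT_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: a 50K-token document summarized into 2K output tokens.
print(round(estimate_cost(50_000, 2_000), 6))  # → 0.00107
```

At these rates, even a 131K-token full-context request costs well under a cent of input, which is the point of the lightweight tier.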

Model Details

General

Creator
IBM
Family
Granite
Tier
Lightweight
Context Window
131K
Modalities
Text

Capabilities

Tool Calling
No
Open Source
No

Strengths & Limitations

Strengths:

  • Fast output generation at 464 tokens per second
  • 131K token context window supports long document processing
  • Lightweight architecture enables efficient deployment
  • IBM enterprise backing with corporate support
  • Optimized for high-throughput inference workloads
  • Competitive context size for the micro-tier model class

Limitations:

  • No tool calling or function execution capabilities
  • Text-only processing; no image or multimodal support
  • Higher time to first token (8.7 seconds) than some peers
  • Proprietary model; weights not publicly available
  • Limited reasoning compared to larger models in the family

Key Features

131,000 token context window
Text-only input and output
High-speed inference at 464 tokens/second
Streaming response generation
Enterprise-grade model deployment
Batch processing support
API-based access
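Since access is API-based with streaming support, a request typically follows the common chat-completions shape. A minimal sketch of the payload; the endpoint URL and model ID below are assumptions (they vary by provider), and this only shows an OpenAI-compatible request body with streaming enabled:

```python
import json

# Hypothetical request sketch for a provider hosting Granite 4.0 H Micro.
# API_URL and MODEL_ID are assumptions; check your provider's docs.
API_URL = "https://api.example-provider.com/v1/chat/completions"  # assumed
MODEL_ID = "granite-4.0-h-micro"  # assumed model identifier

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible streaming chat payload."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,  # the model supports streamed responses
    }

payload = build_request("Summarize the attached report in five bullets.")
print(json.dumps(payload, indent=2))
```

The actual HTTP call (and authentication) is provider-specific and omitted here; only the payload shape is shown.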

About Granite 4.0 H Micro

Granite 4.0 H Micro is IBM's lightweight model in the Granite 4.0 family, positioned for applications requiring fast inference speeds rather than maximum capability. As a micro-tier model, it sits below IBM's more capable Granite variants while offering competitive performance for its size class. The model processes text-only inputs with a 131,000 token context window, providing substantial capacity for document processing and multi-turn conversations. Performance benchmarks show output generation at 464 tokens per second with an 8.7 second time to first token, indicating optimization for throughput over latency.

The model handles standard text generation tasks but does not include tool calling or multimodal processing. Granite 4.0 H Micro targets use cases where speed and efficiency matter more than advanced reasoning. Its combination of a long context window and fast token generation makes it suitable for high-volume text processing workloads where cost efficiency is prioritized over the sophisticated capabilities found in frontier models.
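The throughput-over-latency tradeoff can be made concrete with a back-of-the-envelope latency model using the figures quoted above (8.7 s TTFT, 464 tokens/second); the formula itself is a simplification, not a published benchmark method:

```python
# Simple latency model: total time = TTFT + output_tokens / generation speed.
# Short replies are dominated by the 8.7 s TTFT; long ones by throughput.

TTFT_S = 8.7        # time to first token, seconds (quoted figure)
SPEED_TPS = 464.0   # output tokens per second (quoted figure)

def total_time(output_tokens: int) -> float:
    """Estimated wall-clock seconds for a complete response."""
    return TTFT_S + output_tokens / SPEED_TPS

for n in (100, 1_000, 10_000):
    print(n, round(total_time(n), 1))
```

A 100-token reply takes almost as long as a 1,000-token one, which is why this model suits batch and high-volume workloads better than interactive chat.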

Common Use Cases

Granite 4.0 H Micro is well-suited for high-volume text processing applications where speed and cost efficiency are priorities. Its fast token generation makes it effective for content summarization, document analysis, and automated text generation workflows that process large quantities of material. The 131K context window enables processing of substantial documents, research papers, or multi-turn conversations without truncation. Organizations needing reliable text processing for customer service automation, content moderation, or data extraction tasks can benefit from its combination of decent capability and optimized performance. The model works well for applications where advanced reasoning or tool use are not required, but consistent, fast text processing is essential.
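For the document-processing workloads above, a quick pre-flight check of whether material fits the 131K window is useful. A minimal sketch; the 4-characters-per-token ratio is a rough English-text heuristic, not a property of Granite's tokenizer:

```python
# Heuristic check of whether a document plus an output budget fits in
# the 131K-token context window. CHARS_PER_TOKEN is an assumption.

CONTEXT_TOKENS = 131_000
CHARS_PER_TOKEN = 4  # rough heuristic; varies by language and tokenizer

def fits_in_context(text: str, reserved_output_tokens: int = 2_000) -> bool:
    """Estimate whether `text` and the output budget fit the window."""
    est_tokens = len(text) // CHARS_PER_TOKEN
    return est_tokens + reserved_output_tokens <= CONTEXT_TOKENS

# ~250K characters, roughly 62.5K estimated tokens: fits comfortably.
print(fits_in_context("word " * 50_000))
```

Documents that fail the check would need chunking or map-reduce summarization before being sent to the model.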

Frequently Asked Questions

How much does Granite 4.0 H Micro cost per million tokens?

As of the last check (4/14/2026), the single listed provider charged $0.017 per 1M input tokens and $0.110 per 1M output tokens. Pricing varies by provider and usage type; check the pricing table above for current rates.

What is Granite 4.0 H Micro best used for?

Granite 4.0 H Micro excels at high-volume text processing tasks requiring fast inference speeds. It's ideal for document summarization, content generation, text analysis, and automated writing workflows where speed and cost efficiency matter more than advanced reasoning capabilities.

How does Granite 4.0 H Micro compare to other lightweight models?

Granite 4.0 H Micro offers a notably large 131K context window for its micro tier, combined with fast 464 tokens/second output generation. However, it has a longer time to first token at 8.7 seconds and lacks tool calling capabilities that some competing lightweight models provide.