Frequently Asked Questions

What is an LLM API and how does inference pricing work?

An LLM API lets you send text (and sometimes images, audio, or video) to a Large Language Model and get a response. Providers charge per token, with rates typically quoted per 1 million tokens. Input tokens (your prompt) and output tokens (the model's response) are billed at separate rates; output tokens usually cost 2-5x as much as input tokens.
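
As a rough sketch of how that billing works, here is the arithmetic in Python; the $3 / $15 per-million rates are hypothetical, not any particular provider's pricing:

```python
# Assumed, illustrative rates -- check your provider's pricing page.
INPUT_PRICE_PER_MTOK = 3.00    # USD per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per 1M output tokens (5x the input rate)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one API call in USD."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK

# A 2,000-token prompt that produces a 500-token response:
print(f"${request_cost(2_000, 500):.4f}")  # $0.0135
```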

How many LLM models can I compare here?

We track 171 LLM models across 14 inference providers, spanning Anthropic, Mistral, Cohere, DeepSeek, Baidu, and more. Each model differs in context window size, modality support, and per-token pricing; use the filters above to narrow the list to your use case.

What is a context window and why does it matter?

The context window is the maximum number of tokens an LLM can process in a single request — including both your prompt and its response. Larger context windows (128K to 1M+) let the model handle long documents, multi-turn conversations, and large codebases without dropping earlier content. Most modern frontier models support at least 128K tokens.
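
In code, the pre-flight check is simply that the prompt plus the reserved output budget fits in one window. The 128K limit below is an assumption, and real token counts should come from the model's own tokenizer:

```python
CONTEXT_WINDOW = 128_000  # tokens; varies by model (assumed here)

def fits_in_context(prompt_tokens: int, max_output_tokens: int) -> bool:
    """Prompt and reserved output budget must share one context window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW

print(fits_in_context(120_000, 4_000))   # True:  124K total fits
print(fits_in_context(120_000, 16_000))  # False: 136K exceeds 128K
```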

What are model modalities (text, image, video, audio)?

Modalities describe what kinds of input a model can accept. Text-only models read and write text. Multimodal models can also process images (vision), video frames, or audio. Pick a modality that matches your application — e.g., image input for OCR, document understanding, or visual Q&A.
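
As one concrete illustration, many providers accept an OpenAI-style chat message whose content mixes text and image parts; exact field names and supported modalities vary by provider, so treat this shape as an example rather than a universal schema:

```python
# Hypothetical text + image request body; the image URL is a placeholder.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is the total on this receipt?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
    ],
}
```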

How do I pick the cheapest provider for a given model?

Open the model detail page to see every provider that hosts it, side-by-side, with current per-token pricing. Open-weight models (Llama, Mistral, Qwen, DeepSeek) are typically available across multiple providers — Together AI, Fireworks AI, Groq, DeepInfra, Replicate — at varying rates and speeds. Proprietary models (GPT, Claude, Gemini) usually require their original creator or specific cloud partners.
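
If you'd rather script the comparison, ranking providers is simple arithmetic over your own traffic mix. The rates below are invented for illustration:

```python
# Hypothetical (input, output) rates in USD per 1M tokens for one model.
providers = {
    "provider_a": (0.20, 0.60),
    "provider_b": (0.18, 0.72),
    "provider_c": (0.27, 0.85),
}

def monthly_cost(rates: tuple[float, float], in_mtok: float, out_mtok: float) -> float:
    """Cost in USD for a monthly workload measured in millions of tokens."""
    in_rate, out_rate = rates
    return in_rate * in_mtok + out_rate * out_mtok

# Rank providers for a workload of 50M input / 10M output tokens per month.
for name, rates in sorted(providers.items(), key=lambda kv: monthly_cost(kv[1], 50, 10)):
    print(f"{name}: ${monthly_cost(rates, 50, 10):,.2f}")
```

Note that the provider with the cheapest input rate is not automatically cheapest overall; rank on your actual input/output ratio.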

What is the difference between open and proprietary LLMs?

Proprietary models (OpenAI GPT, Anthropic Claude, Google Gemini) are only accessible through their creator or licensed cloud partners (Azure, Bedrock, Vertex AI). Open-weight models (Meta Llama, Mistral, DeepSeek, Qwen) can be self-hosted or rented through any provider that runs them — usually at a much lower price. Open models give you portability and price competition; proprietary models often lead on capability benchmarks.

Do batch APIs really cost less?

Yes. Many providers offer batch (asynchronous) pricing at a 50% discount versus the standard real-time API. Batch jobs may take minutes to hours to complete, so they suit non-time-sensitive workloads like offline data labeling, evals, content generation, or analytics. If your application can tolerate the delay, batch APIs can roughly halve your inference bill.
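
The math is straightforward; here is a sketch reusing the hypothetical $3 / $15 per-million rates from the first answer:

```python
def batch_job_cost(n_requests: int, in_tok: int, out_tok: int,
                   in_rate: float, out_rate: float, discount: float = 0.50) -> float:
    """USD cost of an asynchronous batch job at a discounted rate."""
    standard = (n_requests * in_tok / 1e6) * in_rate + \
               (n_requests * out_tok / 1e6) * out_rate
    return standard * (1 - discount)

# 100K labeling requests, 1K input / 200 output tokens each:
print(f"${batch_job_cost(100_000, 1_000, 200, 3.00, 15.00):,.2f}")
# $300.00 -- versus $600.00 at real-time rates
```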

What is prompt caching and when should I use it?

Prompt caching lets providers reuse the work of processing a repeated prefix (usually a long system prompt or document) across multiple requests, so you aren't billed full price for those tokens every time. Cached input tokens are typically billed at 10-25% of the normal input rate. It is most useful when you send the same large preamble with many user queries: chatbots with long instructions, retrieval pipelines with shared context, or tools with persistent few-shot examples.
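
To see why this matters, here is the input-cost arithmetic at an assumed $3 per 1M tokens and a 10% cached rate. This sketch assumes the prefix is cached after the first request and ignores any cache-write surcharge some providers add:

```python
def input_cost(prefix_tok: int, fresh_tok: int, n_requests: int,
               rate: float, cache_multiplier: float = 0.10) -> float:
    """Total USD input cost when a shared prefix is cached after request 1."""
    first = (prefix_tok + fresh_tok) / 1e6 * rate
    rest = (n_requests - 1) * (
        prefix_tok / 1e6 * rate * cache_multiplier  # cached prefix, discounted
        + fresh_tok / 1e6 * rate                    # fresh per-user suffix
    )
    return first + rest

# 10K requests sharing a 20K-token system prompt, 300 fresh tokens each:
cached = input_cost(20_000, 300, 10_000, 3.00)
uncached = input_cost(20_000, 300, 10_000, 3.00, cache_multiplier=1.0)
print(f"${cached:,.2f} vs ${uncached:,.2f}")  # $69.05 vs $609.00
```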