Frequently Asked Questions

What is an LLM API and how does inference pricing work?

An LLM API lets you send text (and sometimes images, audio, or video) to a Large Language Model and get a response. Providers charge per token, with rates typically quoted per 1 million tokens. Input tokens (your prompt) and output tokens (the model's response) are billed at separate rates; output tokens usually cost 2-5x as much as input tokens.
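
As a rough sketch of how that billing works, here is the arithmetic in Python; the $3 / $15 per-million rates are hypothetical, not any particular provider's pricing:

```python
# Assumed, illustrative rates -- check your provider's pricing page.
INPUT_PRICE_PER_MTOK = 3.00    # USD per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per 1M output tokens (5x the input rate)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one API call in USD."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK

# A 2,000-token prompt that produces a 500-token response:
print(f"${request_cost(2_000, 500):.4f}")  # $0.0135
```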

How many LLM models can I compare here?

We track 171 LLM models across 14 inference providers, spanning Anthropic, Mistral, Cohere, DeepSeek, Baidu, and more. Each model differs in context window size, modality support, and per-token pricing; use the filters above to narrow the list to your use case.

What is a context window and why does it matter?

The context window is the maximum number of tokens an LLM can process in a single request — including both your prompt and its response. Larger context windows (128K to 1M+) let the model handle long documents, multi-turn conversations, and large codebases without dropping earlier content. Most modern frontier models support at least 128K tokens.
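
In code, the pre-flight check is simply that the prompt plus the reserved output budget fits in one window. The 128K limit below is an assumption, and real token counts should come from the model's own tokenizer:

```python
CONTEXT_WINDOW = 128_000  # tokens; varies by model (assumed here)

def fits_in_context(prompt_tokens: int, max_output_tokens: int) -> bool:
    """Prompt and reserved output budget must share one context window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW

print(fits_in_context(120_000, 4_000))   # True:  124K total fits
print(fits_in_context(120_000, 16_000))  # False: 136K exceeds 128K
```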

What are model modalities (text, image, video, audio)?

Modalities describe what kinds of input a model can accept. Text-only models read and write text. Multimodal models can also process images (vision), video frames, or audio. Pick a modality that matches your application — e.g., image input for OCR, document understanding, or visual Q&A.
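
As one concrete illustration, many providers accept an OpenAI-style chat message whose content mixes text and image parts; exact field names and supported modalities vary by provider, so treat this shape as an example rather than a universal schema:

```python
# Hypothetical text + image request body; the image URL is a placeholder.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is the total on this receipt?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
    ],
}
```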

How do I pick the cheapest provider for a given model?

Open the model detail page to see every provider that hosts it, side-by-side, with current per-token pricing. Open-weight models (Llama, Mistral, Qwen, DeepSeek) are typically available across multiple providers — Together AI, Fireworks AI, Groq, DeepInfra, Replicate — at varying rates and speeds. Proprietary models (GPT, Claude, Gemini) usually require their original creator or specific cloud partners.
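
If you'd rather script the comparison, ranking providers is simple arithmetic over your own traffic mix. The rates below are invented for illustration:

```python
# Hypothetical (input, output) rates in USD per 1M tokens for one model.
providers = {
    "provider_a": (0.20, 0.60),
    "provider_b": (0.18, 0.72),
    "provider_c": (0.27, 0.85),
}

def monthly_cost(rates: tuple[float, float], in_mtok: float, out_mtok: float) -> float:
    """Cost in USD for a monthly workload measured in millions of tokens."""
    in_rate, out_rate = rates
    return in_rate * in_mtok + out_rate * out_mtok

# Rank providers for a workload of 50M input / 10M output tokens per month.
for name, rates in sorted(providers.items(), key=lambda kv: monthly_cost(kv[1], 50, 10)):
    print(f"{name}: ${monthly_cost(rates, 50, 10):,.2f}")
```

Note that the provider with the cheapest input rate is not automatically cheapest overall; rank on your actual input/output ratio.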

What is the difference between open and proprietary LLMs?

Proprietary models (OpenAI GPT, Anthropic Claude, Google Gemini) are only accessible through their creator or licensed cloud partners (Azure, Bedrock, Vertex AI). Open-weight models (Meta Llama, Mistral, DeepSeek, Qwen) can be self-hosted or rented through any provider that runs them — usually at a much lower price. Open models give you portability and price competition; proprietary models often lead on capability benchmarks.

Do batch APIs really cost less?

Yes. Many providers offer batch (asynchronous) pricing at a 50% discount versus the standard real-time API. Batch jobs may take minutes to hours to complete, so they suit non-time-sensitive workloads like offline data labeling, evals, content generation, or analytics. If your application can tolerate the delay, batch APIs can roughly halve your inference bill.
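
The math is straightforward; here is a sketch reusing the hypothetical $3 / $15 per-million rates from the first answer:

```python
def batch_job_cost(n_requests: int, in_tok: int, out_tok: int,
                   in_rate: float, out_rate: float, discount: float = 0.50) -> float:
    """USD cost of an asynchronous batch job at a discounted rate."""
    standard = (n_requests * in_tok / 1e6) * in_rate + \
               (n_requests * out_tok / 1e6) * out_rate
    return standard * (1 - discount)

# 100K labeling requests, 1K input / 200 output tokens each:
print(f"${batch_job_cost(100_000, 1_000, 200, 3.00, 15.00):,.2f}")
# $300.00 -- versus $600.00 at real-time rates
```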

What is prompt caching and when should I use it?

Prompt caching lets providers reuse the work of processing a repeated prefix (usually a long system prompt or document) across multiple requests, so you aren't billed full price for those tokens every time. Cached input tokens are typically billed at 10-25% of the normal input rate. It is most useful when you send the same large preamble with many user queries: chatbots with long instructions, retrieval pipelines with shared context, or tools with persistent few-shot examples.
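
To see why this matters, here is the input-cost arithmetic at an assumed $3 per 1M tokens and a 10% cached rate. This sketch assumes the prefix is cached after the first request and ignores any cache-write surcharge some providers add:

```python
def input_cost(prefix_tok: int, fresh_tok: int, n_requests: int,
               rate: float, cache_multiplier: float = 0.10) -> float:
    """Total USD input cost when a shared prefix is cached after request 1."""
    first = (prefix_tok + fresh_tok) / 1e6 * rate
    rest = (n_requests - 1) * (
        prefix_tok / 1e6 * rate * cache_multiplier  # cached prefix, discounted
        + fresh_tok / 1e6 * rate                    # fresh per-user suffix
    )
    return first + rest

# 10K requests sharing a 20K-token system prompt, 300 fresh tokens each:
cached = input_cost(20_000, 300, 10_000, 3.00)
uncached = input_cost(20_000, 300, 10_000, 3.00, cache_multiplier=1.0)
print(f"${cached:,.2f} vs ${uncached:,.2f}")  # $69.05 vs $609.00
```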