LLM Models
Browse 214 LLM models — compare specs, API pricing, and provider availability
GPT-OSS-120B
OpenAI · 128K ctx · cutoff Jan 2025
Llama 3.3 70B
Meta · 128K ctx · cutoff Mar 2024
GPT-OSS-20B
OpenAI · 128K ctx · cutoff Jan 2025
DeepSeek V3
DeepSeek · 64K ctx · cutoff Jun 2024
GLM-4.7
Zhipu · 128K ctx
GLM-5
Zhipu · 128K ctx
Kimi K2
Moonshot · 128K ctx
Kimi K2.5
Moonshot · 128K ctx
Llama 3.1 8B
Meta · 128K ctx · cutoff Dec 2023
MiniMax M2.5
MiniMax · 128K ctx
DeepSeek R1
DeepSeek · 64K ctx · cutoff Nov 2024
DeepSeek V4 Pro
DeepSeek · 1.0M ctx
GLM-5.1
Zhipu · 200K ctx
Llama 4 Maverick 17B
Meta · 128K ctx
Llama 4 Scout
Meta · 328K ctx
Frequently Asked Questions
What is an LLM API and how does inference pricing work?
An LLM API lets you send text (and sometimes images, audio, or video) to a large language model and get a response. Providers charge per token, typically quoting prices per 1 million tokens. Input tokens (your prompt) and output tokens (the model's response) are billed at separate rates, and output tokens usually cost 2 to 5 times as much as input tokens.
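The per-token billing described above can be sketched as a small calculation. The `Pricing` class, the `request_cost` function, and the rates used here are hypothetical illustrations, not quotes from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Rates in USD per 1 million tokens (hypothetical values)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request, billing input and output separately."""
    input_cost = input_tokens / 1_000_000 * pricing.input_per_million
    output_cost = output_tokens / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


# Assumed rates: $0.50 per 1M input tokens, $1.50 per 1M output tokens (3x input).
example_pricing = Pricing(input_per_million=0.50, output_per_million=1.50)
cost = request_cost(example_pricing, input_tokens=2_000, output_tokens=500)
print(f"${cost:.6f}")
```

With these assumed rates, a 2,000-token prompt and a 500-token response cost a fraction of a cent, which is why prices are quoted per million tokens.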
How many LLM models can I compare here?
We track 214 LLM models across 16 inference providers. Model creators include BAAI, Anthropic, Mistral, Cohere, DeepSeek, and more. Models differ in context window, modality support, and per-token pricing, so use the filters above to narrow the list to your use case.
What is a context window and why does it matter?
The context window is the maximum number of tokens an LLM can process in a single request — including both your prompt and its response. Larger context windows (128K to 1M+) let the model handle long documents, multi-turn conversations, and large codebases without dropping earlier content. Most modern frontier models support at least 128K tokens.
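Because the window covers both the prompt and the response, a request can be checked before it is sent. The sketch below uses hypothetical helper names and assumes token counts are already known (in practice they come from the model's tokenizer).

```python
def fits_context(context_window: int, prompt_tokens: int, max_output_tokens: int) -> bool:
    """True if the prompt plus the requested output stays within the window."""
    return prompt_tokens + max_output_tokens <= context_window


def output_budget(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for the response once the prompt is counted (never negative)."""
    return max(0, context_window - prompt_tokens)


# Assumption: "128K" is treated as 128,000 tokens; some listings mean 131,072.
WINDOW = 128_000

print(fits_context(WINDOW, prompt_tokens=120_000, max_output_tokens=10_000))  # False
print(output_budget(WINDOW, prompt_tokens=120_000))  # 8000
```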
What are model modalities (text, image, video, audio)?
Modalities describe what kinds of input a model can accept. Text-only models read and write text. Multimodal models can also process images (vision), video frames, or audio. Pick a model whose input modalities match your application, e.g., image input for OCR, document understanding, or visual Q&A.
How do I pick the cheapest provider for a given model?
Open the model detail page to see every provider that hosts it, side by side, with current per-token pricing. Open-weight models (Llama, Mistral, Qwen, DeepSeek) are typically available across multiple providers, such as Together AI, Fireworks AI, Groq, DeepInfra, and Replicate, at varying rates and speeds. Proprietary models (GPT, Claude, Gemini) are usually available only from their original creator or specific cloud partners.
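Since input and output tokens are priced separately, the cheapest provider can depend on the shape of the workload. The following is a minimal sketch of that comparison; the provider names and rates are made up for illustration and do not reflect real listings.

```python
from typing import Dict, Tuple

# (input rate, output rate) in USD per 1M tokens. Names and prices are hypothetical.
Offers = Dict[str, Tuple[float, float]]


def workload_cost(rates: Tuple[float, float], input_tokens: int, output_tokens: int) -> float:
    """USD cost of a workload at the given per-million-token rates."""
    input_rate, output_rate = rates
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def cheapest_provider(offers: Offers, input_tokens: int, output_tokens: int) -> str:
    """Name of the provider with the lowest total cost for this token mix."""
    return min(offers, key=lambda name: workload_cost(offers[name], input_tokens, output_tokens))


offers: Offers = {
    "provider_a": (0.60, 0.60),
    "provider_b": (0.20, 1.20),
    "provider_c": (0.40, 0.80),
}

# Input-heavy workload (e.g. summarizing long documents).
input_heavy = cheapest_provider(offers, input_tokens=1_000_000, output_tokens=100_000)
# Output-heavy workload (e.g. long-form generation from short prompts).
output_heavy = cheapest_provider(offers, input_tokens=100_000, output_tokens=1_000_000)
print(input_heavy, output_heavy)  # provider_b provider_a
```

Under these assumed rates the winner flips between the two workloads, so it is worth estimating your own input-to-output ratio before comparing prices.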
What is the difference between open and proprietary LLMs?
Proprietary models (OpenAI GPT, Anthropic Claude, Google Gemini) are only accessible through their creator or licensed cloud partners (Azure, Bedrock, Vertex AI). Open-weight models (Meta Llama, Mistral, DeepSeek, Qwen) can be self-hosted or rented through any provider that runs them — usually at a much lower price. Open models give you portability and price competition; proprietary models often lead on capability benchmarks.
Do batch APIs really cost less?
Yes. Many providers offer batch (asynchronous) pricing at a 50% discount versus the standard real-time API. Batch jobs may take minutes to hours to complete, so they suit non-time-sensitive workloads like offline data labeling, evals, content generation, or analytics. If a workload can tolerate the delay, moving it to a batch API can roughly halve what it costs.
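The saving depends on how much of your traffic can actually wait. As a rough sketch, assuming a flat 50% batch discount and a hypothetical monthly bill, the effect of moving part of a workload to batch looks like this:

```python
def bill_with_batch(standard_bill: float, batch_fraction: float, discount: float = 0.50) -> float:
    """Monthly bill after moving `batch_fraction` of spend to discounted batch jobs.

    standard_bill:  cost if everything ran on the real-time API (USD)
    batch_fraction: share of that spend that can tolerate batch latency (0.0 to 1.0)
    discount:       batch discount; 0.50 is an assumption and varies by provider
    """
    if not 0.0 <= batch_fraction <= 1.0:
        raise ValueError("batch_fraction must be between 0 and 1")
    return standard_bill * (1 - batch_fraction * discount)


# Hypothetical: $1,000/month of real-time usage, 60% of it is offline labeling and evals.
partial = bill_with_batch(1_000.0, batch_fraction=0.6)
everything = bill_with_batch(1_000.0, batch_fraction=1.0)
print(f"${partial:.2f} ${everything:.2f}")
```

Only the fully batchable case reaches the headline 50% saving; a mixed workload saves proportionally less.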
What is prompt caching and when should I use it?
Prompt caching lets providers reuse the work already done on repeated prefix tokens (usually a long system prompt or document) across multiple requests, and pass part of the saving on to you. Cached input tokens are typically billed at 10-25% of the normal rate. It is most useful when you send the same large preamble with many user queries: chatbots with long instructions, retrieval pipelines with shared context, or tools with persistent few-shot examples.
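To see how much a shared prefix can save, the sketch below compares input costs with and without caching. It makes simplifying assumptions that should be checked against a provider's documentation: the first request pays the full rate for the prefix, every later request hits the cache, and there is no separate cache-write fee or expiry. The function name and all numbers are hypothetical.

```python
def input_cost(
    prefix_tokens: int,
    unique_tokens: int,
    requests: int,
    rate_per_million: float,
    cached_fraction: float = 1.0,
) -> float:
    """USD input cost for `requests` calls sharing one prefix.

    cached_fraction is the share of the normal rate charged for cached prefix
    tokens (1.0 means no caching; 0.10 to 0.25 matches the typical range).
    """
    if requests <= 0:
        return 0.0
    # First request: prefix and unique tokens both billed at the full rate.
    first = prefix_tokens + unique_tokens
    # Later requests: prefix billed at the cached rate, unique tokens at full rate.
    later = (requests - 1) * (prefix_tokens * cached_fraction + unique_tokens)
    return (first + later) / 1_000_000 * rate_per_million


# Hypothetical chatbot: 10,000-token system prompt, 200-token user query,
# 1,000 requests, $1.00 per 1M input tokens, cached tokens at 25% of that rate.
uncached = input_cost(10_000, 200, 1_000, rate_per_million=1.00)
cached = input_cost(10_000, 200, 1_000, rate_per_million=1.00, cached_fraction=0.25)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

The longer the shared prefix is relative to the per-request text, the closer the saving gets to the full cached-token discount.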