Skip to main content
Deep Infra logo

Deep Infra

Optimized inference for open-source models

Rapidly-catching neocloud🇺🇸 USinferenceopen-sourcebudget

Last reviewed Mar 14, 2026

Deep Infra is an AI inference cloud that provides serverless APIs for 100+ models and dedicated GPU rentals. Features OpenAI-compatible endpoints, custom model deployments, and infrastructure optimized for scale.

5
GPU Models
$0.89
From / hour
69
LLM Models
$0.02
From / 1M input

Available GPUs

Hourly on-demand pricing. Click column headers to sort.

Prices last updated: June 19, 2026

Pricing
GPU Model
Memory
GPUs
Price / hr
Updated
Source
A100 SXM80GB
1×
$0.890/hr
6/19/2026
B200192GB
1×
$2.79/hr
6/19/2026
H100 SXM80GB
1×
$1.79/hr
6/19/2026
H200141GB
1×
$2.19/hr
6/19/2026
HGX B300288GB
1×
$4.20/hr
6/19/2026

LLM API Pricing

Pay-per-token pricing. Prices shown per 1M tokens.

Prices last updated: June 19, 2026

ModelCreatorContextInput/1MOutput/1MUpdated
Mistral131K$0.020$0.0406/19/2026
Meta128K$0.020$0.0306/19/2026
OpenAI128K$0.030$0.1406/19/2026
NVIDIA131K$0.040$0.1606/4/2026
Mistral32K$0.050$0.0806/19/2026
Google131K$0.050$0.1006/19/2026
NVIDIA262K$0.050$0.2006/19/2026
Google131K$0.050$0.1506/19/2026
Google262K$0.070$0.3406/19/2026
Microsoft16K$0.070$0.1406/19/2026

Pros & Cons

Advantages

  • Simple OpenAI-compatible API alongside controllable GPU rentals
  • Competitive hourly rates for flagship NVIDIA GPUs including latest B200 and B300
  • Fast provisioning with SSH access for dedicated instances (ready in ~10 seconds)
  • Supports custom deployments in addition to hosted public models

Limitations

  • Region list is not clearly published in the public marketing pages
  • Primarily focused on inference and GPU rentals rather than broader cloud services
  • Newer player compared to established cloud providers

Key Features

Drop-in OpenAI Replacement

OpenAI-compatible API for 100+ models including DeepSeek, Qwen, Llama 4, Claude, and Gemini families with autoscaling

Dedicated GPU Rentals

B200 instances with SSH access spin up in about 10 seconds and bill hourly

Custom LLM Deployments

Deploy your own Hugging Face models onto dedicated A100, H100, H200, B200, or B300 GPUs

Transparent GPU Pricing

Published per-GPU hourly rates for A100, H100, H200, B200, and B300 with competitive pricing

Inference-Optimized Hardware

All hosted models run on H100 or A100 hardware tuned for low latency

Comprehensive AI APIs

Support for text generation, vision and OCR, embeddings and reranking, image and video generation, and speech recognition

SOC 2 & ISO 27001 Compliance

Zero retention policy with enterprise-grade security certifications for data privacy and protection

Compute Services

Serverless Inference

Hosted model APIs with autoscaling on H100/A100 hardware.

  • Drop-in OpenAI replacement - swap base URL, keep existing code
  • 100+ models including latest DeepSeek V4, Qwen 3, Llama 4, Claude 4.5, and Gemini 3 families
  • Comprehensive AI capabilities: text, vision, embeddings, image/video generation, speech
  • Always-fresh model catalog with new releases deployed quickly

Dedicated GPU Instances

On-demand GPU nodes with SSH access for custom workloads.

Pricing Options

OptionDetails
Serverless pay-per-tokenOpenAI-compatible inference APIs with pay-per-request billing on H100/A100 hardware
Dedicated GPU hourly ratesPublished transparent hourly pricing for A100, H100, H200, B200, and B300 GPUs with pay-as-you-go billing
No long-term commitmentsFlexible hourly billing for dedicated instances with no prepayments or contracts required

Availability & Support

Regions

Region list not published on the GPU Instances page; promo mentions Nebraska availability alongside multi-region autoscaling messaging.

Support

Documentation site, dashboard guidance, Discord community link, and contact-sales options.

Getting Started

  1. 1

    Create an account

    Sign up (GitHub-supported) and open the Deep Infra dashboard

  2. 2

    Enable billing

    Add a payment method to unlock GPU rentals and API usage

  3. 3

    Pick a GPU option

    Choose serverless APIs or dedicated A100, H100, H200, B200, or B300 instances

  4. 4

    Launch and connect

    Start instances with SSH access or call the OpenAI-compatible API endpoints

  5. 5

    Monitor usage

    Track spend and instance status from the dashboard and shut down when idle

Compare Providers

Find the best prices for the same GPUs from other providers

CoreWeave logo

CoreWeave

5 shared GPUs with Deep Infra

RunPod logo

RunPod

5 shared GPUs with Deep Infra

Oracle Cloud logo

Oracle Cloud

5 shared GPUs with Deep Infra