
Nemotron Nano 12B v2 VL

Name: Nemotron Nano 12B v2 VL
Availability: In stock
Author: NVIDIA

Nemotron Nano 12B v2 VL is NVIDIA's lightweight multimodal model supporting text, image, and video inputs with a 131K token context window.

Context 131K

Tier Lightweight

Modalities text, image, video

Input from $0.100 / 1M tokens across 1 provider

API Pricing

Cheapest on Amazon AWS (Batch): 33% below average

Provider	Input / 1M	Output / 1M	Speed	TTFT	Updated
Amazon AWS (Batch)	$0.100	$0.300	220 t/s	773ms	7/13/2026
Amazon AWS	$0.200	$0.600	220 t/s	773ms	7/13/2026

Prices updated daily. Last check: Jul 13, 2026
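
As a rough illustration of how the per-million-token rates in the table translate into a bill, the sketch below estimates the cost of a workload and reproduces the "33% below average" figure. The rates are copied from the table above; the helper names `estimate_cost` and `percent_below_average` and the example token counts are hypothetical, not part of any provider API. It also assumes a single flat rate per token, which may not hold if a provider bills image or video input differently.

```python
# Rates in USD per 1M tokens, copied from the pricing table above.
RATES = {
    "Amazon AWS (Batch)": {"input": 0.100, "output": 0.300},
    "Amazon AWS": {"input": 0.200, "output": 0.600},
}


def estimate_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a workload (hypothetical helper, flat per-token rates)."""
    rate = RATES[provider]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


def percent_below_average(price: float, prices: list[float]) -> float:
    """How far `price` sits below the mean of `prices`, in percent."""
    avg = sum(prices) / len(prices)
    return (avg - price) / avg * 100


if __name__ == "__main__":
    # Assumed example workload: 2M input tokens and 500K output tokens.
    for name in RATES:
        print(f"{name}: ${estimate_cost(name, 2_000_000, 500_000):.2f}")

    input_prices = [r["input"] for r in RATES.values()]
    cheapest = min(input_prices)
    print(f"Cheapest input rate is {percent_below_average(cheapest, input_prices):.0f}% below average")
```

For the assumed workload, the Batch rate comes to $0.35 and the standard rate to $0.70, so the Batch option halves the bill whenever its longer turnaround is acceptable.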

Performance & Benchmarks

Source: Artificial Analysis

Intelligence

4.6 / 100

Math

26.7 / 100

Output Speed

220 t/s

Latency (TTFT)

773ms
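
The output speed and latency figures combine into a back-of-the-envelope estimate of total response time: time to first token plus output length divided by output speed. The sketch below is a minimal illustration under the assumption of constant streaming throughput, ignoring network and queueing variance; `estimate_response_seconds` is a hypothetical helper, not a provider API.

```python
TTFT_SECONDS = 0.773            # time to first token, from the figures above
OUTPUT_TOKENS_PER_SEC = 220.0   # output speed, from the figures above


def estimate_response_seconds(
    output_tokens: int,
    ttft: float = TTFT_SECONDS,
    speed: float = OUTPUT_TOKENS_PER_SEC,
) -> float:
    """Approximate wall-clock time to receive a full response of `output_tokens`."""
    return ttft + output_tokens / speed


if __name__ == "__main__":
    for n in (100, 500, 2000):
        print(f"{n} output tokens: ~{estimate_response_seconds(n):.2f} s")
```

Under these assumptions a 500-token reply takes roughly three seconds end to end, with the time to first token dominating only for very short outputs.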

Reasoning & Knowledge

MMLU-Pro: 64.9%
GPQA Diamond: 43.9%
Humanity's Last Exam: 4.5%

Coding

LiveCodeBench: 34.5%
SciCode: 17.6%

Math

AIME 2025: 26.7%

Agentic & Tool Use

Terminal-Bench Hard: 0.0%
τ²-bench: 19.3%

Instruction & Long Context

IFBench: 25.9%
Long-Context Reasoning: 17.0%

Benchmarks measured Jul 2026. Scores are independent evaluations, not vendor-reported.

Model Details

General

Creator: NVIDIA
Family: Nemotron
Tier: Lightweight
Context Window: 131K
Modalities: Text, Image, Video

Capabilities

Tool Calling: No
Open Source: No
Aliases: NVIDIA Nemotron Nano 2 VL

Strengths & Limitations

Strengths

Supports three modalities including video input, expanding beyond typical text-image models
131K token context window allows processing of longer documents and conversations
Fast inference at 220 output tokens per second for responsive applications
Time to first token of 773ms keeps perceived latency under a second
12B parameter size offers efficiency advantages over larger multimodal models
Video processing capability enables temporal understanding of visual content
Lightweight tier positioning provides cost advantages for high-volume applications

Limitations

No tool calling or function calling capabilities
Proprietary model with no open source weights available
Smaller parameter count may limit complex reasoning compared to frontier models
Limited documentation on specific video processing capabilities and formats
Newer model with less extensive real-world testing compared to established alternatives

Key Features

• 131,072 token context window
• Text input and generation
• Image understanding and analysis
• Video processing capabilities
• Fast inference with 220 tokens/second output
• Time to first token of 773ms
• Streaming response support
• Multimodal prompt processing

About Nemotron Nano 12B v2 VL

Nemotron Nano 12B v2 VL is NVIDIA's lightweight multimodal model in the Nemotron family, built for efficient processing of text, image, and video inputs. At 12 billion parameters it sits in the lightweight tier, trading some capability for lower compute cost. It has a 131,072-token context window. In the benchmarks above it produces 220 output tokens per second with a time to first token of 773 milliseconds, which is responsive enough for interactive applications. The model does not support tool calling and focuses on core multimodal understanding. The combination of video input and efficient inference suits content analysis, media processing workflows, and other high-volume applications that need visual understanding alongside text processing at low cost.

Common Use Cases

Nemotron Nano 12B v2 VL suits applications that need efficient multimodal processing, particularly video understanding. Typical high-volume uses include content moderation across text, images, and video; media processing workflows that pair visual content with accompanying text; and educational applications built on multimedia content. Its inference speed and 131K-token context window also fit interactive applications such as chatbots with visual capabilities, automated content tagging, and customer service tools that handle visual queries alongside text conversations.

Frequently Asked Questions

How much does Nemotron Nano 12B v2 VL cost per million tokens?

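On Amazon AWS, the only provider currently listed, Nemotron Nano 12B v2 VL costs $0.200 per 1M input tokens and $0.600 per 1M output tokens; the Batch option halves both rates to $0.100 and $0.300. Rates may differ for text versus image/video processing, so check the pricing table above for current figures.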

What is Nemotron Nano 12B v2 VL best used for?

Nemotron Nano 12B v2 VL is best suited to multimodal tasks that combine text, image, and video understanding, particularly high-volume applications where efficiency matters more than frontier-level reasoning. Its video input support and fast inference make it a fit for content analysis, media workflows, and applications that need temporal visual understanding.

Does Nemotron Nano 12B v2 VL support video analysis, and which formats does it accept?

Yes. Video is one of the model's three input modalities, alongside text and images. Supported video formats and any limits on video length or resolution depend on the provider's implementation, so check with your chosen provider for details.