Author's Description
NVIDIA Nemotron Nano 2 VL is a 12-billion-parameter open multimodal reasoning model designed for video understanding and document intelligence. It introduces a hybrid Transformer-Mamba architecture, combining transformer-level accuracy with Mamba's memory-efficient sequence modeling for significantly higher throughput and lower latency. The model accepts text and multi-image document inputs and produces natural-language outputs. It is trained on high-quality NVIDIA-curated synthetic datasets optimized for optical character recognition, chart reasoning, and multimodal comprehension. Nemotron Nano 2 VL achieves leading results on OCRBench v2 and an average score of ≈74 across MMMU, MathVista, AI2D, OCRBench, OCR-Reasoning, ChartQA, DocVQA, and Video-MME, surpassing prior open VL baselines. With Efficient Video Sampling (EVS), it handles long-form videos at reduced inference cost. Open weights, training data, and fine-tuning recipes are released under a permissive NVIDIA open license, with deployment supported across NeMo, NIM, and major inference runtimes.
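The description above credits Efficient Video Sampling (EVS) with cutting inference cost on long videos by pruning redundant visual tokens. NVIDIA's actual EVS algorithm is not detailed here; as a rough illustration of the same budgeting idea, the sketch below uses naive uniform frame sampling to cap how many frames are sent to the model (the function name and frame budget are our own, not part of the model's API):

```python
def sample_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Pick at most max_frames indices spread evenly across a video.

    A naive uniform-sampling baseline for illustration only -- NOT NVIDIA's
    EVS, which prunes temporally redundant tokens rather than whole frames.
    """
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames  # fractional stride across the video
    return [int(i * step) for i in range(max_frames)]
```

For example, a 1,000-frame clip with a 32-frame budget yields 32 evenly spaced indices, bounding the visual token count regardless of video length.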
Performance Summary
NVIDIA's Nemotron Nano 12B v2 VL, a 12-billion-parameter multimodal model, demonstrates exceptional speed, consistently ranking among the fastest models, and offers highly competitive pricing. Its reliability is strong, with a 91% success rate across benchmarks. Designed for video understanding and document intelligence, it leverages a hybrid Transformer-Mamba architecture for high throughput and low latency. The model shows significant strength in knowledge-based tasks, achieving 99.5% accuracy in General Knowledge and a perfect 100.0% in Ethics, where it is the most accurate model at its price point and among models of similar speed. It also performs well in Reasoning (94.0%) and Email Classification (98.0%). However, Nemotron Nano 2 VL shows notable weaknesses in Instruction Following, scoring 0.0% accuracy, and struggles with Coding (65.6%) and Mathematics (52.6%). Its 80.0% score on Hallucinations suggests room for improvement in acknowledging uncertainty. Despite these gaps, its leading results on OCRBench v2 and strong average across multimodal benchmarks highlight its specialized capabilities.
Model Pricing
Current Pricing
| Feature | Price (per 1M tokens) |
|---|---|
| Prompt | $0.2 |
| Completion | $0.6 |
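At the listed rates ($0.2 per 1M prompt tokens, $0.6 per 1M completion tokens), the cost of a request is a simple linear function of its token counts. A minimal sketch (constant and function names are ours, not part of any provider SDK):

```python
PROMPT_PRICE_PER_M = 0.2      # USD per 1M prompt tokens, from the table above
COMPLETION_PRICE_PER_M = 0.6  # USD per 1M completion tokens

def request_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request at the listed per-1M-token rates."""
    return (prompt_tokens * PROMPT_PRICE_PER_M
            + completion_tokens * COMPLETION_PRICE_PER_M) / 1_000_000
```

For instance, a request with 100,000 prompt tokens and 10,000 completion tokens costs (100,000 × 0.2 + 10,000 × 0.6) / 1,000,000 ≈ $0.026.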
Available Endpoints
| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
|---|---|---|---|---|
| DeepInfra | nvidia/nemotron-nano-12b-v2-vl | 131K | $0.2 / 1M tokens | $0.6 / 1M tokens |
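DeepInfra serves models through an OpenAI-compatible chat completions API. The sketch below builds a multimodal request payload for the endpoint above, pairing an image with a text question; the payload layout follows the OpenAI chat format, and the function name and `max_tokens` value are our own choices (verify field names against the provider's documentation before relying on this):

```python
def build_vision_request(image_url: str, question: str) -> dict:
    """Build an OpenAI-style chat payload pairing one image with a text prompt.

    A hedged sketch, assuming the provider accepts the standard OpenAI
    multimodal message format; POST it to the provider's
    /chat/completions route with your API key in the Authorization header.
    """
    return {
        "model": "nvidia/nemotron-nano-12b-v2-vl",  # endpoint name from the table above
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 512,  # arbitrary illustrative cap
    }
```

The same message shape extends to multi-image documents by appending additional `image_url` content parts before the text question.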
Benchmark Results
| Benchmark | Category | Reasoning | Strategy | Free | Executions | Accuracy | Cost | Duration |
|---|---|---|---|---|---|---|---|---|
Other Models by nvidia
| Model | Released | Params | Context | Modalities | Speed | Ability | Cost |
|---|---|---|---|---|---|---|---|
| NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 | Oct 10, 2025 | 49B | 131K | Text input / Text output | ★ | ★★★★ | $$$$ |
| NVIDIA: Nemotron Nano 9B V2 | Sep 05, 2025 | 9B | 128K | Text input / Text output | ★ | ★★ | $ |
| NVIDIA: Llama 3.3 Nemotron Super 49B v1 (Unavailable) | Apr 08, 2025 | 49B | 131K | Text input / Text output | ★★★ | ★★ | $$ |
| NVIDIA: Llama 3.1 Nemotron Ultra 253B v1 | Apr 08, 2025 | 253B | 131K | Text input / Text output | ★ | ★★ | $$$$$ |
| NVIDIA: Llama 3.1 Nemotron 70B Instruct | Oct 14, 2024 | 70B | 131K | Text input / Text output | ★★★ | ★★ | $$$ |