NVIDIA: Nemotron Nano 12B 2 VL

Image input Text input Text output Free Option
Author's Description

NVIDIA Nemotron Nano 2 VL is a 12-billion-parameter open multimodal reasoning model designed for video understanding and document intelligence. It introduces a hybrid Transformer-Mamba architecture, combining transformer-level accuracy with Mamba’s memory-efficient sequence modeling for significantly higher throughput and lower latency. The model supports inputs of text and multi-image documents, producing natural-language outputs. It is trained on high-quality NVIDIA-curated synthetic datasets optimized for optical-character recognition, chart reasoning, and multimodal comprehension. Nemotron Nano 2 VL achieves leading results on OCRBench v2 and scores ≈ 74 average across MMMU, MathVista, AI2D, OCRBench, OCR-Reasoning, ChartQA, DocVQA, and Video-MME—surpassing prior open VL baselines. With Efficient Video Sampling (EVS), it handles long-form videos while reducing inference cost. Open-weights, training data, and fine-tuning recipes are released under a permissive NVIDIA open license, with deployment supported across NeMo, NIM, and major inference runtimes.

Key Specifications
Cost
$$$$
Context
131K
Parameters
12B
Released
Oct 28, 2025
Speed
Ability
Reliability
Supported Parameters

This model supports the following parameters:

Reasoning Stop Frequency Penalty Presence Penalty Top P Response Format Temperature Seed Include Reasoning Min P Max Tokens
Features

This model supports the following features:

Response Format Reasoning
Performance Summary

NVIDIA's Nemotron Nano 12B 2 VL, a 12-billion-parameter multimodal model, demonstrates exceptional speed, consistently ranking among the fastest models, and offers highly competitive pricing. Its reliability is strong, with a 91% success rate across benchmarks. Designed for video understanding and document intelligence, it leverages a hybrid Transformer-Mamba architecture for high throughput and low latency. The model exhibits significant strengths in knowledge-based tasks, achieving 99.5% accuracy in General Knowledge and a perfect 100.0% in Ethics, where it is noted as the most accurate model at its price point and among models of similar speed. It also performs well in Reasoning (94.0%) and Email Classification (98.0%). However, Nemotron Nano 2 VL shows notable weaknesses in Instruction Following, scoring 0.0% accuracy, and struggles with Coding (65.6%) and Mathematics (52.6%). Its performance on Hallucinations (80.0%) suggests room for improvement in acknowledging uncertainty. Despite these areas for development, its leading results on OCRBench v2 and strong average across various multimodal benchmarks highlight its specialized capabilities.

Model Pricing

Current Pricing

Feature Price (per 1M tokens)
Prompt $0.2
Completion $0.6

Price History

Available Endpoints
Provider Endpoint Name Context Length Pricing (Input) Pricing (Output)
DeepInfra
DeepInfra | nvidia/nemotron-nano-12b-v2-vl 131K $0.2 / 1M tokens $0.6 / 1M tokens
Benchmark Results
Benchmark Category Reasoning Strategy Free Executions Accuracy Cost Duration
Other Models by nvidia