NVIDIA: Llama 3.3 Nemotron Super 49B V1.5

Modalities: text input, text output
Author's Description

Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context window. It is post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and multi-turn chat, followed by multiple RL stages: Reward-aware Preference Optimization (RPO) for alignment, RL with Verifiable Rewards (RLVR) for step-wise reasoning, and iterative DPO to refine tool-use behavior. A distillation-driven Neural Architecture Search (“Puzzle”) replaces some attention blocks and varies FFN widths to shrink the memory footprint and improve throughput, enabling single-GPU (H100/H200) deployment while preserving instruction following and CoT quality. In internal evaluations (NeMo-Skills, up to 16 runs, temperature = 0.6, top_p = 0.95), the model reports strong reasoning/coding results, e.g., MATH500 pass@1 = 97.4, AIME-2024 = 87.5, AIME-2025 = 82.71, GPQA = 71.97, LiveCodeBench (24.10–25.02) = 73.58, and MMLU-Pro (CoT) = 79.53. The model targets practical inference efficiency (high tokens/s, reduced VRAM) with Transformers/vLLM support and explicit “reasoning on/off” modes (chat-oriented sampling defaults when reasoning is on; greedy decoding recommended when it is off). It is suited to building agents, assistants, and long-context retrieval systems where a balanced accuracy-to-cost ratio and reliable tool use matter.
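To make the serving details concrete, below is a minimal sketch of querying the model through an OpenAI-compatible endpoint such as a local vLLM server, using the sampling settings reported above (temperature 0.6, top_p 0.95 when reasoning is on; greedy decoding when it is off). The base URL, API key, and the system-prompt reasoning toggle are assumptions for illustration, not confirmed API details; check the model card or provider documentation for the exact mechanism.

```python
# Sketch: querying the model via an OpenAI-compatible server (e.g. vLLM).
# The base URL, API key, and reasoning toggle below are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical local vLLM server

def ask(question: str, reasoning: bool = True) -> str:
    # With reasoning on, the card-reported settings are temperature=0.6, top_p=0.95;
    # with reasoning off, greedy decoding (temperature=0) is recommended.
    sampling = {"temperature": 0.6, "top_p": 0.95} if reasoning else {"temperature": 0.0}
    resp = client.chat.completions.create(
        model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
        messages=[
            # Assumption: reasoning mode is switched via a system-prompt flag;
            # the exact token/phrase depends on the release's chat template.
            {"role": "system", "content": "/think" if reasoning else "/no_think"},
            {"role": "user", "content": question},
        ],
        max_tokens=1024,
        **sampling,
    )
    return resp.choices[0].message.content

print(ask("What is 17 * 23?"))
```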

Key Specifications
Cost: $$$$
Context: 131K tokens
Parameters: 49B
Released: Oct 10, 2025
Supported Parameters

This model supports the following parameters:

Reasoning, Stop, Frequency Penalty, Top P, Response Format, Temperature, Include Reasoning, Min P, Max Tokens, Tools, Presence Penalty, Tool Choice, Seed
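As a rough illustration of how these parameters might be passed in a single request, the sketch below uses the OpenAI-compatible chat completions format. The endpoint URL is a placeholder, and min_p and include_reasoning are non-standard fields routed through extra_body, which is an assumption about the provider's API rather than a documented requirement.

```python
# Sketch: a request exercising several of the supported parameters.
# Field names follow common OpenAI-compatible conventions; min_p and
# include_reasoning are provider-specific and passed via extra_body here.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")  # placeholder endpoint

resp = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
    messages=[{"role": "user", "content": "Summarize the Puzzle NAS approach in two sentences."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    seed=42,
    stop=["</answer>"],
    frequency_penalty=0.0,
    presence_penalty=0.0,
    extra_body={"min_p": 0.0, "include_reasoning": True},
)
print(resp.choices[0].message.content)
```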
Features

This model supports the following features:

Response Format, Reasoning, Tools
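Because tool use is among the supported features, the following sketch shows a tool-calling request in the standard OpenAI-compatible format. The get_weather function, its schema, and the endpoint are hypothetical examples, not part of the model's documentation.

```python
# Sketch: tool calling in the OpenAI-compatible format.
# The get_weather function and its schema are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
    messages=[{"role": "user", "content": "What's the weather in Santa Clara?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model decides to call the tool, the arguments arrive as structured JSON.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```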
Performance Summary

The NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 model, released on October 10, 2025, demonstrates strong performance in several key areas, particularly reliability and specialized capabilities. With an exceptional 97% success rate for reliability, it consistently returns usable responses and minimizes technical failures. Its speed ranking in the 18th percentile indicates longer response times, while moderate pricing in the 38th percentile offers balanced cost-efficiency. The model excels in Ethics (100% accuracy), General Knowledge (99%), Email Classification (99%), and Reasoning (92%), showcasing robust understanding and processing capabilities. Internal evaluations also report impressive results on MATH500 (97.4% pass@1) and AIME-2024 (87.5%), underscoring its strength in mathematical and scientific reasoning. A relative weakness is Instruction Following (60.2% accuracy, 64th percentile), suggesting room for improvement on complex multi-step directives. Despite 90% accuracy on Hallucinations, its 37th percentile ranking indicates it is not among the top performers at acknowledging uncertainty. The model's design, incorporating distillation-driven Neural Architecture Search, aims for practical inference efficiency, making it suitable for agentic workflows, assistants, and long-context retrieval systems where accuracy-to-cost balance and reliable tool use are critical.

Model Pricing

Current Pricing

Prompt: $0.10 per 1M tokens
Completion: $0.40 per 1M tokens
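For a quick sense of what these rates mean in practice, the arithmetic below estimates the cost of a single request; the token counts are made up for illustration.

```python
# Cost arithmetic for the listed rates: $0.10 per 1M prompt tokens,
# $0.40 per 1M completion tokens. Token counts below are illustrative.
PROMPT_RATE = 0.10 / 1_000_000      # USD per prompt token
COMPLETION_RATE = 0.40 / 1_000_000  # USD per completion token

prompt_tokens = 8_000
completion_tokens = 1_500

cost = prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE
print(f"${cost:.6f}")  # -> $0.001400
```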


Available Endpoints
Provider: DeepInfra
Endpoint Name: DeepInfra | nvidia/llama-3.3-nemotron-super-49b-v1.5
Context Length: 131K
Pricing (Input): $0.10 / 1M tokens
Pricing (Output): $0.40 / 1M tokens