Author's Description
Llama-3.1-Nemotron-Ultra-253B-v1 is a large language model (LLM) optimized for advanced reasoning, human-interactive chat, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta’s Llama-3.1-405B-Instruct, it has been significantly customized using Neural Architecture Search (NAS), resulting in enhanced efficiency, reduced memory usage, and improved inference latency. The model supports a context length of up to 128K tokens and can operate efficiently on an 8x NVIDIA H100 node. Note: you must include `detailed thinking on` in the system prompt to enable reasoning. Please see [Usage Recommendations](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1#quick-start-and-usage-recommendations) for more.
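Since reasoning is toggled purely through the system prompt, a request payload only needs to add the documented system message. The sketch below assumes a standard OpenAI-compatible chat-completions payload shape and uses the Nebius endpoint's model identifier; the client and endpoint details are left out, so treat it as illustrative rather than definitive.

```python
def build_chat_request(user_prompt: str, reasoning: bool = True) -> dict:
    """Build a chat-completions payload for the model.

    The model card states that the system prompt must contain
    "detailed thinking on" for reasoning to be enabled; without it,
    the model responds in its non-reasoning mode.
    """
    messages = []
    if reasoning:
        # Exact system prompt required by the model card to enable reasoning.
        messages.append({"role": "system", "content": "detailed thinking on"})
    messages.append({"role": "user", "content": user_prompt})
    return {
        "model": "nvidia/llama-3.1-nemotron-ultra-253b-v1",
        "messages": messages,
    }

payload = build_chat_request("Prove that sqrt(2) is irrational.")
print(payload["messages"][0]["content"])  # detailed thinking on
```

The same payload can be sent with any OpenAI-compatible client; only the system message changes between the reasoning and non-reasoning modes.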
Performance Summary
NVIDIA: Llama 3.1 Nemotron Ultra 253B v1 demonstrates exceptional performance across several critical areas, consistently ranking among the fastest models across 8 speed benchmarks. It offers moderate pricing (22nd percentile across 7 benchmarks) and outstanding reliability, with a 99% success rate indicating minimal technical failures. The model excels at avoiding hallucinations, achieving 100% accuracy on tests of its ability to acknowledge uncertainty, making it the most accurate model at its price point and speed. It also achieved perfect scores on the Email Classification and Ethics benchmarks, showing robust contextual understanding and moral reasoning. Coding is very strong at 93.0% accuracy (86th percentile), complex Reasoning scores 80.0% (70th percentile), and General Knowledge is solid at 97.5% accuracy. However, Instruction Following is a significant weakness: the model scored 14.1% and 0.0% on two separate benchmarks, marking a critical area for improvement. Derived from Meta's Llama-3.1-405B-Instruct and optimized via Neural Architecture Search, the model is designed for advanced reasoning, human-interactive chat, RAG, and tool-calling, supporting a 128K-token context length with enhanced efficiency.
Model Pricing
Current Pricing
| Feature | Price (per 1M tokens) |
|---|---|
| Prompt | $0.60 |
| Completion | $1.80 |
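At these rates, the cost of a single request follows directly from the token counts. A small helper for back-of-the-envelope estimates (the 10K/2K token split in the example is arbitrary):

```python
# Per-token rates derived from the listed per-1M-token prices.
PROMPT_RATE = 0.6 / 1_000_000      # $ per prompt token
COMPLETION_RATE = 1.8 / 1_000_000  # $ per completion token

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the dollar cost of one request at the listed rates."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE

# e.g. a 10K-token prompt with a 2K-token completion:
print(round(estimate_cost(10_000, 2_000), 4))  # 0.0096
```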
Available Endpoints
| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
|---|---|---|---|---|
| Nebius | nvidia/llama-3.1-nemotron-ultra-253b-v1 | 131K | $0.6 / 1M tokens | $1.8 / 1M tokens |
Benchmark Results
Other Models by nvidia
| Model | Released | Params | Context | Modalities | Speed | Ability | Cost |
|---|---|---|---|---|---|---|---|
| NVIDIA: Nemotron Nano 9B V2 | Sep 05, 2025 | 9B | 128K | Text input, Text output | ★ | ★★ | $ |
| NVIDIA: Llama 3.3 Nemotron Super 49B v1 (Unavailable) | Apr 08, 2025 | 49B | 131K | Text input, Text output | ★★★ | ★★ | $$ |
| NVIDIA: Llama 3.1 Nemotron 70B Instruct | Oct 14, 2024 | 70B | 131K | Text input, Text output | ★★★ | ★★ | $$$ |