Author's Description
Llama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) optimized for advanced reasoning, conversational interactions, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta's Llama-3.3-70B-Instruct, it was produced with a Neural Architecture Search (NAS) approach that significantly improves efficiency and reduces memory requirements. This allows the model to support a context length of up to 128K tokens and to fit on a single high-performance GPU, such as the NVIDIA H200. Note: you must include `detailed thinking on` in the system prompt to enable reasoning. Please see [Usage Recommendations](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1#quick-start-and-usage-recommendations) for more details.
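As a minimal sketch of the reasoning toggle, the snippet below calls the model through an OpenAI-compatible chat completions endpoint. The base URL, environment variable name, and prompts are placeholders (assumptions, not part of the author's documentation); substitute your provider's values.

```python
import os
from openai import OpenAI

# Assumption: an OpenAI-compatible endpoint (e.g. the Nebius listing below).
# Replace the base URL and API key variable with your provider's values.
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ["NEBIUS_API_KEY"],
)

response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=[
        # "detailed thinking on" enables reasoning; "detailed thinking off" disables it.
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "Summarize the trade-offs of retrieval-augmented generation."},
    ],
)
print(response.choices[0].message.content)
```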
Key Specifications
Supported Parameters
This model supports the following parameters:
Performance Summary
The NVIDIA: Llama 3.3 Nemotron Super 49B v1 model demonstrates a balanced performance profile, particularly excelling in reliability and cost-effectiveness. It consistently returns evaluable responses, with an 80% success rate indicating strong technical stability. It is typically cost-effective, ranking in the 73rd percentile for pricing across benchmarks, and its speed is competitive, placing it in the 56th percentile for response times.

In terms of specific capabilities, the model shows high accuracy in General Knowledge (98.0%) and Email Classification (97.0%), with the latter also benefiting from exceptionally fast processing times. It performs well in Ethics at 94.0% accuracy, though its percentile ranking suggests other models surpass it in this domain. A notable strength is its efficiency, derived from a Neural Architecture Search approach, which allows it to support a 128K context length and run on a single high-performance GPU.

However, the model exhibits a significant weakness in Coding tasks, achieving only 6.0% accuracy and placing it in the 12th percentile. Its instruction-following accuracy is moderate at 56.6%, and it shows slower durations on both the instruction-following and coding benchmarks. Overall, it is well suited for advanced reasoning, conversational interactions, RAG, and tool-calling, provided coding capability is not a primary requirement.
Model Pricing
Current Pricing
| Feature | Price (per 1M tokens) |
|---|---|
| Prompt | $0.13 |
| Completion | $0.40 |
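To make these rates concrete, the sketch below works through the per-request cost arithmetic. The function name and example token counts are illustrative only, not part of any official SDK.

```python
PROMPT_PRICE = 0.13 / 1_000_000      # USD per prompt token
COMPLETION_PRICE = 0.40 / 1_000_000  # USD per completion token

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a single request at the listed rates."""
    return prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE

# Example: a 4,000-token prompt with a 1,000-token completion
print(f"${request_cost(4_000, 1_000):.6f}")  # -> $0.000920
```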
Price History
Available Endpoints
| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
|---|---|---|---|---|
| Nebius | nvidia/llama-3.3-nemotron-super-49b-v1 | 131K | $0.13 / 1M tokens | $0.40 / 1M tokens |
Benchmark Results
| Benchmark | Category | Reasoning | Strategy | Free | Executions | Accuracy | Cost | Duration |
|---|---|---|---|---|---|---|---|---|
Other Models by nvidia
| Model | Released | Params | Context | Modalities | Speed | Ability | Cost |
|---|---|---|---|---|---|---|---|
| NVIDIA: Nemotron Nano 9B V2 | Sep 05, 2025 | 9B | 128K | Text input, Text output | ★ | ★★ | $ |
| NVIDIA: Llama 3.1 Nemotron Ultra 253B v1 | Apr 08, 2025 | 253B | 131K | Text input, Text output | ★ | ★★ | $$$$$ |
| NVIDIA: Llama 3.1 Nemotron 70B Instruct | Oct 14, 2024 | 70B | 131K | Text input, Text output | ★★★ | ★★ | $$$ |