Author's Description
Llama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) optimized for advanced reasoning, conversational interactions, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta's Llama-3.3-70B-Instruct, it employs a Neural Architecture Search (NAS) approach that significantly improves efficiency and reduces memory requirements. This allows the model to support a context length of up to 128K tokens and fit efficiently on a single high-performance GPU, such as the NVIDIA H200.

Note: you must include `detailed thinking on` in the system prompt to enable reasoning. Please see [Usage Recommendations](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1#quick-start-and-usage-recommendations) for more.
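As a minimal sketch, the reasoning toggle described above could be set via an OpenAI-compatible chat request. Only the `detailed thinking on` system prompt comes from the model card; the payload shape and model identifier below are assumptions, and no request is actually sent:

```python
# Sketch: enabling Nemotron's reasoning mode via the system prompt.
# The model card only specifies the "detailed thinking on" system prompt;
# the request shape assumes an OpenAI-compatible chat-completions endpoint.
import json


def build_request(user_prompt: str, reasoning: bool = True) -> dict:
    """Build a chat-completion payload with reasoning toggled on or off."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return {
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",  # assumed endpoint name
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
    }


payload = build_request("Prove that the square root of 2 is irrational.")
print(json.dumps(payload, indent=2))
```

The same payload could then be POSTed to whichever provider hosts the model; only the system-prompt convention is documented, so check the provider's API reference for the rest.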
Key Specifications
Supported Parameters
This model supports the following parameters:
Performance Summary
NVIDIA's Llama 3.3 Nemotron Super 49B v1 demonstrates competitive overall performance, particularly in its cost-effectiveness and reliability. The model exhibits competitive response times, ranking in the 54th percentile for speed across various benchmarks. It offers a cost-effective solution, placing in the 72nd percentile for price, indicating favorable operational expenses. Notably, its reliability is strong, achieving the 83rd percentile, signifying consistent and dependable output with minimal technical failures.

Across specific benchmark categories, the model excels in General Knowledge and Email Classification, achieving 98.0% and 97.0% accuracy respectively, showcasing its proficiency in factual recall and precise categorization. Its performance in Instruction Following is also solid at 56.6% accuracy. However, the model exhibits a significant weakness in Coding, where it scored a very low 6.0% accuracy, coupled with high cost and long duration, suggesting it is not optimized for complex programming tasks. Performance in Reasoning (50.0% accuracy) and Ethics (94.0% accuracy, but a low 29th percentile) indicates room for improvement in complex logical deduction and nuanced ethical judgment compared to peers.

Key strengths include its high reliability, cost-efficiency, and strong capabilities in general knowledge and classification tasks. Its ability to support a 128K context length and fit on a single high-performance GPU is a significant advantage for deployment. The primary weakness lies in its limited proficiency in coding and potentially complex reasoning, which may require further optimization for such applications.
Model Pricing
Current Pricing
| Feature | Price (per 1M tokens) |
|---|---|
| Prompt | $0.13 |
| Completion | $0.40 |
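The per-token rates above can be turned into a per-request cost estimate; the token counts in the example are hypothetical:

```python
# Estimate request cost from the listed per-1M-token rates.
PROMPT_RATE = 0.13      # USD per 1M prompt tokens
COMPLETION_RATE = 0.40  # USD per 1M completion tokens


def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the USD cost of a single request."""
    return (prompt_tokens * PROMPT_RATE
            + completion_tokens * COMPLETION_RATE) / 1_000_000


# Hypothetical example: 8,000 prompt tokens, 1,000 completion tokens.
cost = request_cost(8_000, 1_000)
print(f"${cost:.5f}")  # 0.00104 + 0.00040 = $0.00144
```

At these rates, prompt-heavy workloads such as RAG stay cheap, while long completions dominate the bill roughly three times faster per token.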
Price History
Available Endpoints
| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
|---|---|---|---|---|
| Nebius | nvidia/llama-3.3-nemotron-super-49b-v1 | 131K | $0.13 / 1M tokens | $0.40 / 1M tokens |
Benchmark Results
| Benchmark | Category | Reasoning | Free | Executions | Accuracy | Cost | Duration |
|---|---|---|---|---|---|---|---|
Other Models by nvidia

| Model | Released | Params | Context | Modalities | Speed | Ability | Cost |
|---|---|---|---|---|---|---|---|
| NVIDIA: Llama 3.1 Nemotron Ultra 253B v1 | Apr 08, 2025 | 253B | 131K | Text input / Text output | ★ | ★★ | $$$$$ |
| NVIDIA: Llama 3.1 Nemotron 70B Instruct | Oct 14, 2024 | 70B | 131K | Text input / Text output | ★★★ | ★★ | $$ |