NVIDIA: Llama 3.3 Nemotron Super 49B v1

Modalities: text input → text output (currently unavailable)
Author's Description

Llama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) optimized for advanced reasoning, conversational interactions, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta's Llama-3.3-70B-Instruct, it was pruned via a Neural Architecture Search (NAS) approach, significantly enhancing efficiency and reducing memory requirements. This allows the model to support a context length of up to 128K tokens and fit efficiently on a single high-performance GPU, such as the NVIDIA H200. Note: you must include `detailed thinking on` in the system prompt to enable reasoning. Please see [Usage Recommendations](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1#quick-start-and-usage-recommendations) for more.
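As a minimal sketch of the reasoning toggle described above: the system prompt `detailed thinking on` enables reasoning, and `detailed thinking off` disables it. The snippet only builds an OpenAI-style chat-completions payload and sends no request; the model ID is taken from the Nebius endpoint listed below, and the OpenAI-style message schema is an assumption about the serving API.

```python
# Sketch (assumption: OpenAI-compatible chat API): build a payload that
# toggles Nemotron's reasoning mode via the documented system prompt.
# No network call is made here.

def build_payload(user_prompt: str, reasoning: bool = True) -> dict:
    """Return a chat-completions request body with reasoning on or off."""
    system_text = "detailed thinking on" if reasoning else "detailed thinking off"
    return {
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",  # Nebius endpoint ID
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_prompt},
        ],
    }

payload = build_payload("Summarize the trade-offs of NAS-pruned LLMs.")
```

The same payload shape can then be posted to whichever OpenAI-compatible endpoint your provider exposes.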

Key Specifications

| Spec | Value |
|---|---|
| Cost | $$ |
| Context | 131K |
| Parameters | 49B |
| Released | Apr 08, 2025 |
Supported Parameters

This model supports the following parameters:

- Top Logprobs
- Logit Bias
- Logprobs
- Stop
- Seed
- Top P
- Max Tokens
- Frequency Penalty
- Temperature
- Presence Penalty
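To make the parameter list concrete, here is a sketch of a request body exercising the supported parameters. The snake_case parameter names follow the common OpenAI-style convention, which is an assumption about this provider's interface; the values are illustrative only.

```python
# Sketch (assumption: OpenAI-style parameter names): a request body using
# the sampling parameters this model supports. Values are illustrative.
request_body = {
    "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
    "messages": [{"role": "user", "content": "List three uses of RAG."}],
    "temperature": 0.6,         # sampling temperature
    "top_p": 0.95,              # nucleus-sampling cutoff
    "max_tokens": 512,          # cap on completion length
    "frequency_penalty": 0.0,   # discourage verbatim repetition
    "presence_penalty": 0.0,    # discourage reusing topics
    "stop": ["\n\n"],           # stop sequence(s)
    "seed": 42,                 # best-effort deterministic sampling
    "logit_bias": {},           # per-token logit adjustments
    "logprobs": True,           # return token log-probabilities
    "top_logprobs": 5,          # top alternatives per token (needs logprobs)
}
```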
Performance Summary

The NVIDIA: Llama 3.3 Nemotron Super 49B v1 model demonstrates a balanced performance profile, particularly excelling in reliability and cost-effectiveness. It consistently provides evaluable responses with an 80% success rate, indicating strong technical stability. The model typically offers cost-effective solutions, ranking in the 73rd percentile for pricing across benchmarks, and its speed is competitive, placing it in the 56th percentile for response times.

In terms of specific capabilities, the model shows high accuracy in General Knowledge (98.0%) and Email Classification (97.0%), with the latter also benefiting from exceptionally fast processing times. It performs well in Ethics, achieving 94.0% accuracy, though its percentile ranking suggests other models may surpass it in this domain. A notable strength is its efficiency, derived from a Neural Architecture Search approach, allowing it to support a 128K context length and operate on single high-performance GPUs.

However, the model exhibits a significant weakness in Coding tasks, achieving only 6.0% accuracy, placing it in the 12th percentile. Its instruction-following accuracy is moderate at 56.6%, and it shows slower durations on both instruction-following and coding benchmarks. Overall, it is well-suited for advanced reasoning, conversational interactions, RAG, and tool-calling, provided coding capabilities are not a primary requirement.

Model Pricing

Current Pricing

| Feature | Price (per 1M tokens) |
|---|---|
| Prompt | $0.13 |
| Completion | $0.40 |
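As a quick sanity check on the rates above, a small helper can estimate per-request cost from token counts; the only inputs are the listed $0.13 and $0.40 per million tokens.

```python
# Estimate per-request cost from the listed per-1M-token rates.
PROMPT_RATE = 0.13      # USD per 1M prompt tokens
COMPLETION_RATE = 0.40  # USD per 1M completion tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    return (prompt_tokens * PROMPT_RATE
            + completion_tokens * COMPLETION_RATE) / 1_000_000
```

For example, a 2,000-token prompt with a 500-token completion costs well under a tenth of a cent.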


Available Endpoints
| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
|---|---|---|---|---|
| Nebius | nvidia/llama-3.3-nemotron-super-49b-v1 | 131K | $0.13 / 1M tokens | $0.40 / 1M tokens |