Meta: Llama 3.2 90B Vision Instruct

Text input Image input Text output
Author's Description

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, the Llama 90B Vision is engineered to handle the most demanding image-based AI tasks. This model is perfect for industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

Key Specifications
Cost
$$$$
Context
131K
Parameters
90B
Released
Sep 24, 2024
Speed
Ability
Reliability
Supported Parameters

This model supports the following parameters:

Logit Bias Stop Min P Top P Max Tokens Frequency Penalty Temperature Presence Penalty
Performance Summary

Meta's Llama 3.2 90B Vision Instruct model demonstrates a strong overall performance profile, particularly excelling in reliability with an 84% success rate, indicating consistent and usable responses. In terms of speed, it generally performs in the top tier, ranking in the 63rd percentile, while offering competitive pricing at the 41st percentile. The model exhibits notable strengths in classification tasks, achieving 99.0% accuracy in Email Classification, placing it in the 80th percentile for that benchmark. It also performs well in Ethics (99.0% accuracy) and General Knowledge (97.5% accuracy), suggesting a robust understanding across diverse informational domains. However, its performance in Instruction Following (51.0% accuracy) and Reasoning (56.0% accuracy) is more moderate, indicating areas where further refinement could enhance its capabilities. A significant weakness is observed in the Coding benchmark, where it achieved only 6.0% accuracy, placing it in the 12th percentile. This suggests the model is not well-suited for programming-related tasks. Despite this, its high reliability and strong performance in visual and textual comprehension tasks make it a valuable asset for industries requiring advanced multimodal AI.

Model Pricing

Current Pricing

Feature Price (per 1M tokens)
Prompt $0.35
Completion $0.4

Price History

Available Endpoints
Provider Endpoint Name Context Length Pricing (Input) Pricing (Output)
Together
Together | meta-llama/llama-3.2-90b-vision-instruct 131K $0.35 / 1M tokens $0.4 / 1M tokens
DeepInfra
DeepInfra | meta-llama/llama-3.2-90b-vision-instruct 32K $0.35 / 1M tokens $0.4 / 1M tokens
Benchmark Results
Benchmark Category Reasoning Strategy Free Executions Accuracy Cost Duration
Other Models by meta-llama