Author's Description
Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks that combine visual and textual data. It excels at tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well on complex image-analysis tasks that demand high accuracy. Its ability to integrate visual understanding with language processing makes it well suited to industries requiring comprehensive visual-linguistic AI, such as content creation, AI-driven customer service, and research. See the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
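As a concrete illustration of a visual question answering call, here is a minimal sketch in Python. It assumes an OpenAI-compatible chat-completions endpoint (OpenRouter's URL is used as an example), and `YOUR_API_KEY` and the image URL are placeholders; check your provider's documentation for the exact payload shape.

```python
# Minimal sketch: visual question answering against an OpenAI-compatible
# endpoint (OpenRouter shown as an example). The base URL, key, and payload
# shape are assumptions -- verify against your provider's docs.
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"  # assumed endpoint
API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "model": "meta-llama/llama-3.2-11b-vision-instruct",
    "messages": [
        {
            "role": "user",
            # Mixed content: one text part and one image part per message.
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 256,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```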
Performance Summary
Meta's Llama 3.2 11B Vision Instruct model, released on September 24, 2024, demonstrates strong overall performance, particularly in operational efficiency. It consistently ranks among the fastest models across seven benchmarks and offers highly competitive pricing, placing in the 91st percentile across six benchmarks. The model is also reliable, with a 94% success rate across seven benchmarks, indicating consistently usable responses.

On individual benchmarks, Llama 3.2 11B Vision Instruct shows notable strengths in Ethics (98.0% accuracy) and General Knowledge (89.3%), suggesting a strong grasp of ethical principles and broad factual information; its Email Classification accuracy is also solid at 94.0%. Instruction Following is a significant weakness, however, with one benchmark at 0.0% accuracy and another at 29.0%. Coding (64.0%) and Reasoning (54.0%) also leave room for improvement. Despite these variations, its core strength remains its multimodal capability, bridging visual and textual data for applications like image captioning and visual question answering, which makes it suitable for industries requiring comprehensive visual-linguistic AI.
Model Pricing
Current Pricing
| Feature | Price (per 1M tokens) |
|---|---|
| Prompt | $0.049 |
| Completion | $0.68 |
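At these rates, the cost of one request is prompt tokens times the input rate plus completion tokens times the output rate, each expressed per million tokens. A small worked example in Python, with made-up token counts:

```python
# Hypothetical cost calculation at the listed rates ($0.049 / 1M prompt
# tokens, $0.68 / 1M completion tokens). The token counts are invented.
PROMPT_RATE = 0.049 / 1_000_000      # dollars per prompt token
COMPLETION_RATE = 0.68 / 1_000_000   # dollars per completion token

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the dollar cost of one request at the listed rates."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE

# e.g. a 2,000-token prompt (image plus question) with a 300-token answer:
print(f"${request_cost(2_000, 300):.6f}")  # ~$0.000302
```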
Available Endpoints
| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
|---|---|---|---|---|
| DeepInfra | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.049 / 1M tokens | $0.049 / 1M tokens |
| Cloudflare | meta-llama/llama-3.2-11b-vision-instruct | 128K | $0.049 / 1M tokens | $0.68 / 1M tokens |
| Lambda | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.049 / 1M tokens | $0.049 / 1M tokens |
| InferenceNet | meta-llama/llama-3.2-11b-vision-instruct | 16K | $0.055 / 1M tokens | $0.055 / 1M tokens |
| Together | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.18 / 1M tokens | $0.18 / 1M tokens |
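Because these endpoints differ in context length and price, callers may want to pin or prioritize a specific provider. The sketch below assumes an OpenRouter-style `provider` routing field; the field name, shape, and provider labels are assumptions to verify against your gateway's routing documentation.

```python
# Sketch: preferring a specific endpoint when several providers host the
# same model. The "provider" routing object follows OpenRouter's documented
# shape; treat it as an assumption and confirm with your gateway's docs.
import requests

payload = {
    "model": "meta-llama/llama-3.2-11b-vision-instruct",
    "messages": [{"role": "user", "content": "Describe Llama 3.2 Vision."}],
    # Try Lambda first (131K context, $0.049 in/out), then DeepInfra.
    "provider": {"order": ["Lambda", "DeepInfra"]},
}

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```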
Other Models by meta-llama
| Model | Released | Params | Context | Modalities | Speed | Ability | Cost |
|---|---|---|---|---|---|---|---|
| Meta: Llama Guard 4 12B | Apr 29, 2025 | 12B | 163K | Text + image input, text output | ★★★★ | ★ | $ |
| Meta: Llama 4 Maverick | Apr 05, 2025 | 17B | 1M | Text + image input, text output | ★★★★ | ★★★ | $$$ |
| Meta: Llama 4 Scout | Apr 05, 2025 | 17B | 1M | Text + image input, text output | ★★★★ | ★★★ | $$ |
| Llama Guard 3 8B | Feb 12, 2025 | 8B | 131K | Text input, text output | ★ | ★ | $ |
| Meta: Llama 3.3 70B Instruct | Dec 06, 2024 | 70B | 131K | Text input, text output | ★★★★ | ★★★★★ | $ |
| Meta: Llama 3.2 1B Instruct | Sep 24, 2024 | 1B | 131K | Text input, text output | ★★ | ★ | $ |
| Meta: Llama 3.2 3B Instruct | Sep 24, 2024 | 3B | 131K | Text input, text output | ★★★ | ★ | $ |
| Meta: Llama 3.2 90B Vision Instruct | Sep 24, 2024 | 90B | 131K | Text + image input, text output | ★★★ | ★★ | $$$$ |
| Meta: Llama 3.1 405B (base) | Aug 01, 2024 | 405B | 32K | Text input, text output | ★ | ★ | $$$$ |
| Meta: Llama 3.1 70B Instruct | Jul 22, 2024 | 70B | 131K | Text input, text output | ★★★★ | ★★ | $$ |
| Meta: Llama 3.1 405B Instruct | Jul 22, 2024 | 405B | 32K | Text input, text output | ★★★★ | ★★ | $$$$ |
| Meta: Llama 3.1 8B Instruct | Jul 22, 2024 | 8B | 131K | Text input, text output | ★★★ | ★★★ | $ |
| Meta: LlamaGuard 2 8B | May 12, 2024 | 8B | 8K | Text input, text output | ★★★★ | ★ | $$ |
| Meta: Llama 3 8B Instruct | Apr 17, 2024 | 8B | 8K | Text input, text output | ★★★ | ★★ | $ |
| Meta: Llama 3 70B Instruct | Apr 17, 2024 | 70B | 8K | Text input, text output | ★★★★ | ★★ | $$$ |
| Meta: Llama 2 70B Chat (unavailable) | Jun 19, 2023 | 70B | 4K | Text input, text output | — | — | $$$$ |