Meta: Llama 3.2 11B Vision Instruct

Text input · Image input · Text output
Author's Description

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
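As a quick orientation, the sketch below shows what a visual question answering request against this model might look like through an OpenAI-compatible chat completions API. The base URL, API key, and image URL are placeholders; the model identifier matches the one used by the endpoints listed further down, but check your provider's documentation for the exact values it expects.

```python
from openai import OpenAI

# Minimal sketch: visual question answering with an image plus a text prompt.
# The base_url and api_key are hypothetical placeholders.
client = OpenAI(
    base_url="https://example-provider.com/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Image inputs are passed as `image_url` content parts alongside the text part of the same user message, which is the common convention for OpenAI-compatible vision endpoints.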

Key Specifications
Cost: $$
Context: 128K
Parameters: 11B
Released: Sep 24, 2024
Supported Parameters

This model supports the following parameters:

Response Format, Stop, Seed, Min P, Top P, Max Tokens, Frequency Penalty, Temperature, Presence Penalty
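For illustration, a request exercising these parameters might look like the following. Parameter names follow the OpenAI-compatible convention; `min_p` is not a standard OpenAI field, so it is passed through `extra_body` here, which is an assumption about how a given provider exposes it. The values are arbitrary examples, not tuning recommendations.

```python
from openai import OpenAI

# Hypothetical endpoint and key; substitute your provider's values.
client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-vision-instruct",
    messages=[{"role": "user", "content": "Summarize the benefits of multimodal models in two sentences."}],
    temperature=0.7,             # sampling temperature
    top_p=0.9,                   # nucleus sampling cutoff
    max_tokens=256,              # completion length cap
    frequency_penalty=0.0,
    presence_penalty=0.0,
    seed=42,                     # best-effort reproducible sampling
    stop=["\n\n"],               # stop sequence
    extra_body={"min_p": 0.05},  # provider-specific extension (assumption)
)
print(response.choices[0].message.content)
```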
Features

This model supports the following features:

Response Format
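The Response Format feature is typically used to request structured (JSON) output. A minimal sketch, assuming the serving provider honors the OpenAI-compatible `response_format={"type": "json_object"}` convention:

```python
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_API_KEY")  # placeholders

response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-vision-instruct",
    messages=[
        {"role": "system", "content": "Reply only with a JSON object with keys 'label' and 'confidence'."},
        {"role": "user", "content": "Classify this email subject: 'Your invoice is overdue'."},
    ],
    response_format={"type": "json_object"},  # ask for a JSON-formatted reply
)
print(response.choices[0].message.content)
```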
Performance Summary

Meta's Llama 3.2 11B Vision Instruct, released on September 24, 2024, demonstrates strong performance in several key areas. It consistently ranks among the fastest models and offers highly competitive pricing, placing in the 90th percentile across 7 benchmarks, and it exhibits exceptional reliability with a 95% success rate, indicating minimal technical failures.

On specific benchmarks, Llama 3.2 11B Vision shows a notable strength in Ethics, achieving 98.0% accuracy. It also performs reasonably well in General Knowledge (89.3% accuracy) and Email Classification (94.0% accuracy).

However, the model exhibits significant weaknesses in Instruction Following, with one benchmark showing 0.0% accuracy and another at 29.0%. Its performance in Mathematics (50.0% accuracy), Reasoning (25.0% accuracy), and Coding (64.0% accuracy) suggests these areas could benefit from further improvement. The model's ability to integrate visual understanding with language processing nonetheless makes it suitable for multimodal applications, despite the gaps in complex reasoning and instruction adherence.

Model Pricing

Current Pricing

| Feature | Price (per 1M tokens) |
| --- | --- |
| Prompt | $0.049 |
| Completion | $0.68 |
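As a worked example of how these rates translate into per-request cost, the snippet below estimates the price of a single call using made-up token counts:

```python
# Worked example: estimating the cost of one request at the listed rates
# ($0.049 per 1M prompt tokens, $0.68 per 1M completion tokens).
# The token counts are example values only.
PROMPT_PRICE_PER_M = 0.049     # USD per 1M prompt tokens
COMPLETION_PRICE_PER_M = 0.68  # USD per 1M completion tokens

prompt_tokens = 120_000
completion_tokens = 8_000

cost = (prompt_tokens / 1_000_000) * PROMPT_PRICE_PER_M + (
    completion_tokens / 1_000_000
) * COMPLETION_PRICE_PER_M
print(f"Estimated cost: ${cost:.4f}")  # ≈ $0.0113
```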

Price History

Available Endpoints
| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
| --- | --- | --- | --- | --- |
| DeepInfra | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.049 / 1M tokens | $0.049 / 1M tokens |
| Cloudflare | meta-llama/llama-3.2-11b-vision-instruct | 128K | $0.049 / 1M tokens | $0.68 / 1M tokens |
| Lambda | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.049 / 1M tokens | $0.049 / 1M tokens |
| InferenceNet | meta-llama/llama-3.2-11b-vision-instruct | 16K | $0.055 / 1M tokens | $0.055 / 1M tokens |
| Together | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.18 / 1M tokens | $0.18 / 1M tokens |
| Together | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.049 / 1M tokens | $0.049 / 1M tokens |
| DeepInfra | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.049 / 1M tokens | $0.049 / 1M tokens |
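Because the listed endpoints differ in both context length and price, it can be useful to filter them programmatically. A small sketch using figures restated from the table above (the data structure and selection logic are illustrative, not part of any provider API):

```python
# Sketch: filter endpoints by required context length, then pick the cheapest
# by combined input + output rate (USD per 1M tokens).
endpoints = [
    {"provider": "DeepInfra",    "context": 131_000, "input": 0.049, "output": 0.049},
    {"provider": "Cloudflare",   "context": 128_000, "input": 0.049, "output": 0.68},
    {"provider": "Lambda",       "context": 131_000, "input": 0.049, "output": 0.049},
    {"provider": "InferenceNet", "context": 16_000,  "input": 0.055, "output": 0.055},
    {"provider": "Together",     "context": 131_000, "input": 0.18,  "output": 0.18},
    {"provider": "Together",     "context": 131_000, "input": 0.049, "output": 0.049},
]

required_context = 100_000  # tokens needed for the workload (example value)
candidates = [e for e in endpoints if e["context"] >= required_context]
cheapest = min(candidates, key=lambda e: e["input"] + e["output"])
print(cheapest["provider"], cheapest["input"], cheapest["output"])
```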
Benchmark Results
Benchmark · Category · Reasoning Strategy · Free · Executions · Accuracy · Cost · Duration