Meta: Llama 3.2 11B Vision Instruct

Text input · Image input · Text output · Free option available
Author's Description

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed for tasks that combine visual and textual data. It excels at image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well on complex image-analysis tasks that demand high accuracy. Its ability to integrate visual understanding with language processing makes it well suited to industries that need comprehensive visual-linguistic AI, such as content creation, AI-driven customer service, and research. See the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
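To make the multimodal interface concrete, here is a minimal sketch of a visual-question-answering call through an OpenAI-compatible chat completions API. The gateway URL, API-key environment variable, and image URL are illustrative assumptions, not values from the model card.

```python
# Minimal VQA sketch for Llama 3.2 11B Vision Instruct via an
# OpenAI-compatible gateway. base_url, API key env var, and image URL
# are assumptions for illustration; substitute your provider's values.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # assumed gateway
    api_key=os.environ["OPENROUTER_API_KEY"],  # hypothetical env var
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```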

Key Specifications

| Spec | Value |
| --- | --- |
| Cost | $$ |
| Context | 128K |
| Parameters | 11B |
| Released | Sep 24, 2024 |
Supported Parameters

This model supports the following parameters:

- Stop
- Presence Penalty
- Top P
- Temperature
- Seed
- Min P
- Response Format
- Frequency Penalty
- Max Tokens
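As a rough illustration of how these parameters map onto an OpenAI-compatible request (continuing with the `client` from the sketch above): `min_p` is not part of the standard OpenAI signature, so it is passed through `extra_body`, which only some providers honor.

```python
# Sketch: exercising the supported sampling parameters.
# Values are illustrative, not recommendations from the model card.
response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-vision-instruct",
    messages=[{"role": "user", "content": "Caption this scene in one sentence."}],
    temperature=0.7,             # Temperature
    top_p=0.9,                   # Top P
    max_tokens=256,              # Max Tokens
    frequency_penalty=0.0,       # Frequency Penalty
    presence_penalty=0.0,        # Presence Penalty
    seed=42,                     # Seed (best-effort determinism)
    stop=["\n\n"],               # Stop
    extra_body={"min_p": 0.05},  # Min P: non-standard, provider-dependent
)
```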
Features

This model supports the following features:

- Response Format
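A minimal sketch of the Response Format feature, assuming the serving provider supports OpenAI-style JSON mode (again reusing the `client` from the first sketch):

```python
# Sketch: requesting structured JSON output via Response Format.
response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-vision-instruct",
    messages=[
        {"role": "system", "content": "Reply with a JSON object with keys 'label' and 'confidence'."},
        {"role": "user", "content": "Classify this email: 'Your meeting moved to 3pm.'"},
    ],
    response_format={"type": "json_object"},  # provider must support JSON mode
)
```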
Performance Summary

Meta's Llama 3.2 11B Vision Instruct, released on September 24, 2024, demonstrates strong overall performance, particularly in operational efficiency. It consistently ranks among the fastest models across seven benchmarks and offers highly competitive pricing, placing in the 91st percentile across six benchmarks. It is also reliable, with a 94% success rate across seven benchmarks, indicating consistently usable responses.

On individual benchmarks, the model is strongest in Ethics (98.0% accuracy) and General Knowledge (89.3%), suggesting a firm grasp of ethical principles and broad factual information; Email Classification is also strong at 94.0%. Instruction Following is a significant weakness, with one benchmark at 0.0% accuracy and another at 29.0%. Coding (64.0%) and Reasoning (54.0%) likewise leave room for improvement. Despite these variations, the model's core strength remains its multimodal capability: bridging visual and textual data for applications such as image captioning and visual question answering, which makes it well suited to comprehensive visual-linguistic AI work.

Model Pricing

Current Pricing

| Feature | Price (per 1M tokens) |
| --- | --- |
| Prompt | $0.049 |
| Completion | $0.68 |
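Because completion tokens cost roughly 14x more than prompt tokens at these rates, cost estimates should weight the two separately; a quick back-of-the-envelope calculation with hypothetical token counts:

```python
# Sketch: estimating per-request cost from the listed prices.
PROMPT_PRICE = 0.049 / 1_000_000     # $ per prompt token
COMPLETION_PRICE = 0.68 / 1_000_000  # $ per completion token

prompt_tokens, completion_tokens = 50_000, 2_000  # hypothetical usage
cost = prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE
print(f"${cost:.6f}")  # $0.003810 = $0.002450 prompt + $0.001360 completion
```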


Available Endpoints

| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
| --- | --- | --- | --- | --- |
| DeepInfra | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.049 / 1M tokens | $0.049 / 1M tokens |
| Cloudflare | meta-llama/llama-3.2-11b-vision-instruct | 128K | $0.049 / 1M tokens | $0.68 / 1M tokens |
| Lambda | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.049 / 1M tokens | $0.049 / 1M tokens |
| InferenceNet | meta-llama/llama-3.2-11b-vision-instruct | 16K | $0.055 / 1M tokens | $0.055 / 1M tokens |
| Together | meta-llama/llama-3.2-11b-vision-instruct | 131K | $0.18 / 1M tokens | $0.18 / 1M tokens |
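Since input/output pricing and context length vary by provider, endpoint choice can be automated; below is a sketch that picks the cheapest endpoint meeting a context requirement, with the table above transcribed by hand (prices in $ per 1M tokens):

```python
# Sketch: choosing the cheapest endpoint that satisfies a context requirement.
# Rows transcribed from the table above; prices are $ per 1M tokens.
ENDPOINTS = [
    # (provider, context, input price, output price)
    ("DeepInfra",    131_000, 0.049, 0.049),
    ("Cloudflare",   128_000, 0.049, 0.68),
    ("Lambda",       131_000, 0.049, 0.049),
    ("InferenceNet",  16_000, 0.055, 0.055),
    ("Together",     131_000, 0.18,  0.18),
]

def cheapest(min_context: int, in_tokens: int, out_tokens: int) -> str:
    """Return the provider with the lowest estimated cost for this workload."""
    viable = [e for e in ENDPOINTS if e[1] >= min_context]
    best = min(viable, key=lambda e: (in_tokens * e[2] + out_tokens * e[3]) / 1e6)
    return best[0]

print(cheapest(min_context=100_000, in_tokens=80_000, out_tokens=4_000))
# DeepInfra and Lambda tie at the minimum; min() keeps the first, "DeepInfra".
```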
Benchmark Results

Per-benchmark results (columns: Benchmark, Category, Reasoning, Free, Executions, Accuracy, Cost, Duration) are summarized in the Performance Summary above.