Qwen: Qwen3 VL 8B Instruct

Image input Text input Text output
Author's Description

Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built for high-fidelity understanding and reasoning across text, images, and video. It features improved multimodal fusion with Interleaved-MRoPE for long-horizon temporal reasoning, DeepStack for fine-grained visual-text alignment, and text-timestamp alignment for precise event localization. The model supports a native 256K-token context window, extensible to 1M tokens, and handles both static and dynamic media inputs for tasks like document parsing, visual question answering, spatial reasoning, and GUI control. It achieves text understanding comparable to leading LLMs while expanding OCR coverage to 32 languages and enhancing robustness under varied visual conditions.

Key Specifications
Cost
$$$
Context
256K
Parameters
8B
Released
Oct 14, 2025
Speed
Ability
Reliability
Supported Parameters

This model supports the following parameters:

Max Tokens Temperature Top P Response Format Presence Penalty Structured Outputs Tools Seed Tool Choice
Features

This model supports the following features:

Tools Response Format Structured Outputs
Performance Summary

Qwen3-VL-8B-Instruct, released by qwen, is a multimodal vision-language model designed for comprehensive understanding and reasoning across various media types. It exhibits moderate speed performance, ranking in the 27th percentile, and offers cost-effective solutions, placing it in the 68th percentile for price. A standout feature is its exceptional reliability, achieving a 100% success rate across all benchmarks, indicating consistent and stable operation. In terms of specific performance, the model demonstrates strong capabilities in General Knowledge and Ethics, scoring 96.0% and 99.0% accuracy respectively, suggesting a robust understanding of factual information and ethical principles. Its Email Classification accuracy is also solid at 95.0%. However, the model shows a notable weakness in handling hallucinations, with only 58.0% accuracy in correctly identifying fictional concepts, indicating a tendency to generate responses rather than acknowledge uncertainty. Instruction Following also presents a moderate challenge, with 56.3% accuracy. The model's strengths lie in its multimodal fusion, long context window (256K native, 1M extensible), and improved OCR coverage across 32 languages, making it suitable for complex document parsing and visual reasoning tasks despite its limitations in hallucination avoidance.

Model Pricing

Current Pricing

Feature Price (per 1M tokens)
Prompt $0.18
Completion $0.7

Price History

Available Endpoints
Provider Endpoint Name Context Length Pricing (Input) Pricing (Output)
Alibaba
Alibaba | qwen/qwen3-vl-8b-instruct 256K $0.18 / 1M tokens $0.7 / 1M tokens
DeepInfra
DeepInfra | qwen/qwen3-vl-8b-instruct 256K $0.064 / 1M tokens $0.4 / 1M tokens
DeepInfra
DeepInfra | qwen/qwen3-vl-8b-instruct 262K $0.18 / 1M tokens $0.69 / 1M tokens
Novita
Novita | qwen/qwen3-vl-8b-instruct 131K $0.064 / 1M tokens $0.4 / 1M tokens
Parasail
Parasail | qwen/qwen3-vl-8b-instruct 262K $0.15 / 1M tokens $0.7 / 1M tokens
NextBit
NextBit | qwen/qwen3-vl-8b-instruct 131K $0.12 / 1M tokens $0.7 / 1M tokens
Benchmark Results
Benchmark Category Reasoning Strategy Free Executions Accuracy Cost Duration
Other Models by qwen