Qwen: Qwen3 VL 235B A22B Instruct

Text input Image input Text output
Author's Description

Qwen3-VL-235B-A22B Instruct is an open-weight multimodal model that unifies strong text generation with visual understanding across images and video. The Instruct model targets general vision-language use (VQA, document parsing, chart/table extraction, multilingual OCR). The series emphasizes robust perception (recognition of diverse real-world and synthetic categories), spatial understanding (2D/3D grounding), and long-form visual comprehension, with competitive results on public multimodal benchmarks for both perception and reasoning. Beyond analysis, Qwen3-VL supports agentic interaction and tool use: it can follow complex instructions over multi-image, multi-turn dialogues; align text to video timelines for precise temporal queries; and operate GUI elements for automation tasks. The models also enable visual coding workflows—turning sketches or mockups into code and assisting with UI debugging—while maintaining strong text-only performance comparable to the flagship Qwen3 language models. This makes Qwen3-VL suitable for production scenarios spanning document AI, multilingual OCR, software/UI assistance, spatial/embodied tasks, and research on vision-language agents.

Key Specifications
Cost
$$$
Context
131K
Parameters
235B
Released
Sep 23, 2025
Speed
Ability
Reliability
Supported Parameters

This model supports the following parameters:

Tools Include Reasoning Temperature Tool Choice Seed Response Format Max Tokens Structured Outputs Presence Penalty Reasoning Top P
Features

This model supports the following features:

Structured Outputs Reasoning Tools Response Format
Performance Summary

Qwen3-VL-235B-A22B Instruct demonstrates competitive response times, performing among the faster models with a 58th percentile ranking across benchmarks. It also offers competitive pricing, ranking in the 60th percentile. Notably, the model exhibits exceptional reliability, achieving a 100% success rate across all evaluated benchmarks, indicating consistent and dependable operation. The model excels in several key areas, achieving perfect accuracy in Hallucinations, General Knowledge, Reasoning, and Ethics benchmarks. Its performance in Hallucinations is particularly impressive, demonstrating a strong ability to acknowledge uncertainty. In Mathematics, it achieved 92.3% accuracy, placing it in the 74th percentile and making it the most accurate model at its price point for this category. While strong in many areas, its Coding performance, at 80.0% accuracy (44th percentile), indicates a relative weakness compared to its other capabilities. Email Classification also shows solid performance at 98.0% accuracy. The model's multimodal capabilities, including visual understanding across images and video, agentic interaction, and visual coding workflows, position it as a versatile tool for diverse applications such as document AI, multilingual OCR, and UI assistance.

Model Pricing

Current Pricing

Feature Price (per 1M tokens)
Prompt $0.7
Completion $2.8

Price History

Available Endpoints
Provider Endpoint Name Context Length Pricing (Input) Pricing (Output)
Alibaba
Alibaba | qwen/qwen3-vl-235b-a22b-instruct 131K $0.7 / 1M tokens $2.8 / 1M tokens
Novita
Novita | qwen/qwen3-vl-235b-a22b-instruct 131K $0.3 / 1M tokens $1.5 / 1M tokens
Benchmark Results
Benchmark Category Reasoning Strategy Free Executions Accuracy Cost Duration
Other Models by qwen