Qwen: Qwen2.5-VL 7B Instruct

Text input Image input Text output
Author's Description

Qwen2.5 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: - SoTA understanding of images of various resolution & ratio: Qwen2.5-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. - Understanding videos of 20min+: Qwen2.5-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. - Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2.5-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. - Multilingual Support: to serve global users, besides English and Chinese, Qwen2.5-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc. For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub repo](https://github.com/QwenLM/Qwen2-VL). Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

Key Specifications
Cost
$$$
Context
32K
Parameters
500B (Rumoured)
Released
Aug 27, 2024
Speed
Ability
Reliability
Supported Parameters

This model supports the following parameters:

Stop Presence Penalty Logit Bias Top P Temperature Seed Min P Frequency Penalty Logprobs Max Tokens Top Logprobs
Performance Summary

Qwen2.5-VL 7B Instruct demonstrates a balanced performance profile, positioning itself as a reliable and cost-effective multimodal AI model. It performs among the faster models, ranking in the 66th percentile for speed, and offers competitive pricing, placing in the 71st percentile for cost-effectiveness. Notably, its reliability is exceptional, achieving the 98th percentile, indicating minimal technical failures and consistent response delivery. Across benchmarks, Qwen2.5-VL 7B shows particular strength in Ethics, achieving perfect 100% accuracy, making it the most accurate model at its price point and among models of similar speed. It also performs well in General Knowledge (91.8% accuracy) and Email Classification (92.0% accuracy), though its percentile rankings in these areas suggest room for improvement compared to top performers. Weaknesses are apparent in Coding (71.0% accuracy, 37th percentile) and Reasoning (48.0% accuracy, 35th percentile), where its performance is below average. Instruction Following is moderate at 51.5% accuracy. Its key strengths lie in its robust reliability, cost-efficiency, and specialized excellence in ethical reasoning, while its multimodal capabilities for image and video understanding, and agentic operations, as described in its details, are significant features not directly captured by these text-based benchmarks.

Model Pricing

Current Pricing

Feature Price (per 1M tokens)
Prompt $0.2
Completion $0.2

Price History

Available Endpoints
Provider Endpoint Name Context Length Pricing (Input) Pricing (Output)
Hyperbolic
Hyperbolic | qwen/qwen-2-vl-7b-instruct 32K $0.2 / 1M tokens $0.2 / 1M tokens
InferenceNet
InferenceNet | qwen/qwen-2-vl-7b-instruct 128K $0.2 / 1M tokens $0.2 / 1M tokens
Kluster
Kluster | qwen/qwen-2-vl-7b-instruct 32K $0.2 / 1M tokens $0.2 / 1M tokens
Benchmark Results
Benchmark Category Reasoning Free Executions Accuracy Cost Duration
Other Models by qwen