Qwen: Qwen2.5-VL 7B Instruct

Text input Image input Text output
Author's Description

Qwen2.5 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: - SoTA understanding of images of various resolution & ratio: Qwen2.5-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. - Understanding videos of 20min+: Qwen2.5-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. - Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2.5-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. - Multilingual Support: to serve global users, besides English and Chinese, Qwen2.5-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc. For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub repo](https://github.com/QwenLM/Qwen2-VL). Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

Key Specifications
Cost
$$
Context
32K
Parameters
500B (Rumoured)
Released
Aug 27, 2024
Speed
Ability
Reliability
Supported Parameters

This model supports the following parameters:

Top Logprobs Logit Bias Logprobs Stop Seed Min P Top P Max Tokens Frequency Penalty Temperature Presence Penalty
Performance Summary

Qwen2.5-VL 7B Instruct demonstrates exceptional speed, consistently ranking among the fastest models across various benchmarks. It also offers competitive pricing, typically providing cost-effective solutions. The model exhibits strong reliability with an 86% success rate, indicating consistent delivery of usable responses. In terms of performance across categories, Qwen2.5-VL 7B Instruct achieves perfect accuracy in Ethics, highlighting its robust moral reasoning capabilities. It also performs well in General Knowledge (91.8% accuracy) and Email Classification (92.0% accuracy). A significant strength lies in its multimodal capabilities, as described, including state-of-the-art image understanding across resolutions and ratios, and the ability to comprehend videos over 20 minutes. Its multilingual support for text within images is also a notable advantage for global applications. However, the model shows significant weaknesses in Mathematics, scoring 0.0% accuracy, suggesting a current limitation in complex mathematical problem-solving. Its performance in Reasoning (42.0% accuracy) and Instruction Following (51.5% accuracy) is moderate, indicating areas for potential improvement. While its Hallucinations accuracy is 86.0%, this places it in the 29th percentile, suggesting room for improvement in acknowledging uncertainty for fictional concepts.

Model Pricing

Current Pricing

Feature Price (per 1M tokens)
Prompt $0.2
Completion $0.2

Price History

Available Endpoints
Provider Endpoint Name Context Length Pricing (Input) Pricing (Output)
Hyperbolic
Hyperbolic | qwen/qwen-2-vl-7b-instruct 32K $0.2 / 1M tokens $0.2 / 1M tokens
InferenceNet
InferenceNet | qwen/qwen-2-vl-7b-instruct 128K $0.2 / 1M tokens $0.2 / 1M tokens
Kluster
Kluster | qwen/qwen-2-vl-7b-instruct 32K $0.2 / 1M tokens $0.2 / 1M tokens
Benchmark Results
Benchmark Category Reasoning Strategy Free Executions Accuracy Cost Duration
Other Models by qwen