ByteDance: UI-TARS 7B

Image input Text input Text output
Author's Description

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it builds upon the UI-TARS framework with reinforcement learning-based reasoning, enabling robust action planning and execution across virtual interfaces. This model achieves state-of-the-art results on a range of interactive and grounding benchmarks, including OSworld, WebVoyager, AndroidWorld, and ScreenSpot. It also demonstrates perfect task completion across diverse Poki games and outperforms prior models in Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and shows strong scaling across variants, with the 1.5 version notably exceeding the performance of earlier 72B and 7B checkpoints.

Key Specifications
Cost
$$
Context
128K
Parameters
7B
Released
Jul 22, 2025
Speed
Ability
Reliability
Supported Parameters

This model supports the following parameters:

Presence Penalty Top P Frequency Penalty Max Tokens Logit Bias Seed Stop Temperature
Performance Summary

ByteDance: UI-TARS 7B demonstrates strong overall performance, particularly in operational efficiency. It performs among the fastest models, ranking in the top tier for speed (75th percentile), and consistently offers highly competitive pricing (81st percentile). The model also exhibits exceptional reliability with a 99% success rate across benchmarks, indicating minimal technical failures. In terms of specific benchmark performance, UI-TARS 7B shows notable strengths in ethical reasoning and general knowledge, achieving 99.0% and 93.0% accuracy respectively. It also performs well in mathematics (78.0%) and coding (68.0%). However, the model exhibits significant weaknesses in hallucination mitigation, with only 64.0% accuracy, suggesting a tendency to generate responses for fictional concepts. Its performance in email classification (82.0%), instruction following (32.3%), and complex reasoning (36.0%) is also below average compared to other models. Despite these areas for improvement, its core strength lies in its multimodal vision-language agent capabilities for GUI-based environments, as highlighted by its state-of-the-art results on interactive and grounding benchmarks like OSworld and WebVoyager, and perfect task completion in Poki games.

Model Pricing

Current Pricing

Feature Price (per 1M tokens)
Prompt $0.1
Completion $0.2

Price History

Available Endpoints
Provider Endpoint Name Context Length Pricing (Input) Pricing (Output)
Parasail
Parasail | bytedance/ui-tars-1.5-7b 128K $0.1 / 1M tokens $0.2 / 1M tokens
Benchmark Results
Benchmark Category Reasoning Strategy Free Executions Accuracy Cost Duration
Other Models by bytedance