Author's Description
ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu’s ERNIE 4.5 series, featuring 424B total parameters with 47B active per token. It is trained jointly on text and image data using a heterogeneous MoE architecture and modality-isolated routing to enable high-fidelity cross-modal reasoning, image understanding, and long-context generation (up to 131k tokens). Fine-tuned with techniques like SFT, DPO, UPO, and RLVR, this model supports both “thinking” and non-thinking inference modes. Designed for vision-language tasks in English and Chinese, it is optimized for efficient scaling and can operate under 4-bit/8-bit quantization.
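As a rough illustration of what the parameter counts and quantization options in the description imply, the sketch below computes the fraction of parameters active per token and a lower bound on weight storage at different bit widths. It assumes weight memory is simply parameters times bits per parameter. It ignores quantization metadata, activations, and the KV cache, and the 16-bit row is an assumed baseline that the description does not state.

```python
# Parameter counts taken from the description above.
TOTAL_PARAMS = 424_000_000_000
ACTIVE_PARAMS = 47_000_000_000


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Lower-bound weight storage in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width and
    quantization overhead (scales, zero points) is ignored.
    """
    return num_params * bits_per_param / 8 / 1e9


# Share of the model's parameters used for each token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"Active per token: {active_fraction:.1%}")
    for bits in (16, 8, 4):  # 16-bit is an assumed baseline, not from the source
        print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Only about one ninth of the weights take part in any single token's computation, but all of them still have to be stored, which is why the lower bit widths matter for deployment.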
Performance Summary
Baidu's ERNIE 4.5 VL 424B A47B is moderately fast, ranking in the 35th percentile for speed across benchmarks, and competitively priced, placing in the 44th percentile. Its reliability is strong: a 94% success rate indicates consistent, usable responses.

The model performs well in several areas. It achieves perfect accuracy in Ethics and high accuracy in General Knowledge (99.5%) and Email Classification (99.0%). Its Coding (89.0%) and Mathematics (88.0%) results are also solid. It is good at acknowledging uncertainty, with 96.0% accuracy on the Hallucinations benchmark.

Its main weakness is Reasoning, where it scores 52.0% accuracy, placing it in the 35th percentile for this category. Instruction Following is also only moderate at 62.0% accuracy. Even so, its multimodal MoE architecture, long-context generation (up to 131k tokens), support for both "thinking" and non-thinking inference modes, and efficient scaling make it a versatile tool for vision-language tasks in English and Chinese.
Model Pricing
Current Pricing
Feature | Price (per 1M tokens) |
---|---|
Prompt | $0.42 |
Completion | $1.25 |
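The per-million-token prices above translate into a per-request cost by simple arithmetic. The following is a minimal sketch, assuming billing is linear in token count with no minimum charge or separate image pricing; the function name `estimate_cost` and the example token counts are illustrative, not part of any provider API.

```python
from decimal import Decimal

# Prices from the table above, in USD per 1M tokens.
PROMPT_PRICE_PER_M = Decimal("0.42")
COMPLETION_PRICE_PER_M = Decimal("1.25")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Estimate the USD cost of one request.

    Assumption: cost scales linearly with token counts; real invoices
    may round differently or include charges not listed here.
    """
    total = (prompt_tokens * PROMPT_PRICE_PER_M
             + completion_tokens * COMPLETION_PRICE_PER_M)
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Hypothetical request: 10,000 prompt tokens and 2,000 completion tokens.
    print(f"${estimate_cost(10_000, 2_000)}")
```

`Decimal` is used instead of floats so that small per-request amounts do not pick up binary rounding error when summed over many calls.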
Available Endpoints
Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
---|---|---|---|---|
Novita | baidu/ernie-4.5-vl-424b-a47b | 123K | $0.42 / 1M tokens | $1.25 / 1M tokens |
Other Models by baidu
Model | Released | Params | Context | Modalities | Speed | Ability | Cost |
---|---|---|---|---|---|---|---|
Baidu: ERNIE 4.5 21B A3B | Aug 12, 2025 | 21B | 120K | Text input, text output | ★★★★ | ★★★ | $$ |
Baidu: ERNIE 4.5 VL 28B A3B | Aug 12, 2025 | 28B | 30K | Text input, image input, text output | ★★★ | ★★★ | $$$ |
Baidu: ERNIE 4.5 300B A47B | Jun 30, 2025 | 300B | 123K | Text input, text output | ★ | ★★ | $$$$ |