Baidu: ERNIE 4.5 VL 424B A47B

Input: Text, Image | Output: Text
Author's Description

ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu’s ERNIE 4.5 series, featuring 424B total parameters with 47B active per token. It is trained jointly on text and image data using a heterogeneous MoE architecture and modality-isolated routing to enable high-fidelity cross-modal reasoning, image understanding, and long-context generation (up to 131k tokens). Fine-tuned with techniques like SFT, DPO, UPO, and RLVR, this model supports both “thinking” and non-thinking inference modes. Designed for vision-language tasks in English and Chinese, it is optimized for efficient scaling and can operate under 4-bit/8-bit quantization.

Key Specifications
Cost
$$$$
Context
123K
Parameters
424B
Released
Jun 30, 2025
Supported Parameters

This model supports the following parameters:

Logit Bias, Reasoning, Include Reasoning, Stop, Seed, Min P, Top P, Max Tokens, Frequency Penalty, Temperature, Presence Penalty
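The parameters above map onto the request fields of the OpenAI-compatible chat-completions schema most providers expose. A minimal sketch of a request payload using them is below; the field spellings follow the common OpenAI-style convention and the specific values are illustrative assumptions, not recommendations from this page.

```python
# Sketch of a chat-completions request payload exercising the sampling
# parameters this model supports. Field names assume an OpenAI-compatible
# schema (as commonly exposed by aggregator providers); values are
# illustrative only.

def build_payload(prompt: str) -> dict:
    return {
        "model": "baidu/ernie-4.5-vl-424b-a47b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,          # Max Tokens
        "temperature": 0.7,         # Temperature
        "top_p": 0.9,               # Top P
        "min_p": 0.05,              # Min P
        "frequency_penalty": 0.0,   # Frequency Penalty
        "presence_penalty": 0.0,    # Presence Penalty
        "seed": 42,                 # Seed (for reproducible sampling)
        "stop": ["</answer>"],      # Stop sequences
        "logit_bias": {},           # Logit Bias (token-id -> bias map)
    }

payload = build_payload("Describe this image in one sentence.")
```

The payload would then be POSTed as JSON to the provider's chat-completions endpoint with the usual authorization header.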
Features

This model supports the following features:

Reasoning
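Since the model supports both "thinking" and non-thinking inference modes, reasoning is typically toggled per request. The helper below is a sketch assuming an OpenRouter-style API where a `reasoning` object enables thinking and `include_reasoning` controls whether the trace is returned; these exact field names are an assumption, not confirmed by this page.

```python
# Toggle the model's "thinking" mode on a request payload.
# Assumes OpenRouter-style `reasoning` / `include_reasoning` fields;
# check your provider's docs for the exact spelling.

def with_reasoning(payload: dict, enabled: bool, show_trace: bool = False) -> dict:
    out = dict(payload)  # copy so the original payload is untouched
    if enabled:
        out["reasoning"] = {"enabled": True}
        out["include_reasoning"] = show_trace  # return the trace or not
    return out

base = {"model": "baidu/ernie-4.5-vl-424b-a47b", "messages": []}
thinking = with_reasoning(base, enabled=True, show_trace=True)
```

Non-thinking mode is simply the payload without these fields, which keeps latency and token usage lower.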
Performance Summary

Baidu's ERNIE 4.5 VL 424B A47B model demonstrates moderate speed, ranking in the 35th percentile across benchmarks, and competitive pricing, placing in the 44th percentile. Its reliability is exceptionally strong, with a 94% success rate indicating consistently usable responses.

The model performs well across several categories: perfect accuracy in Ethics, reflecting robust moral reasoning, along with high accuracy in General Knowledge (99.5%) and Email Classification (99.0%). Its Coding (89.0%) and Mathematics (88.0%) scores are also solid, and it acknowledges uncertainty well, scoring 96.0% on the Hallucinations benchmark. Its main weaknesses are Reasoning, at 52.0% accuracy (35th percentile for the category), and Instruction Following, at a moderate 62.0%. Despite these gaps, its multimodal MoE architecture, long-context generation (up to 131k tokens), support for both "thinking" and non-thinking inference modes, and efficient scaling position it as a versatile tool for vision-language tasks in English and Chinese.

Model Pricing

Current Pricing

| Feature    | Price (per 1M tokens) |
|------------|-----------------------|
| Prompt     | $0.42                 |
| Completion | $1.25                 |
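At these rates, the cost of a request can be worked out directly from its token counts:

```python
# Worked cost example from the listed rates:
# $0.42 per 1M prompt tokens, $1.25 per 1M completion tokens.

PROMPT_RATE = 0.42 / 1_000_000      # dollars per prompt token
COMPLETION_RATE = 1.25 / 1_000_000  # dollars per completion token

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of a single request at the listed per-token rates."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE

# e.g. 1M prompt tokens plus 1M completion tokens:
print(f"${request_cost(1_000_000, 1_000_000):.2f}")  # → $1.67
```

Note that completion tokens cost roughly three times as much as prompt tokens, so long generations dominate the bill.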

Available Endpoints
| Provider | Endpoint Name                        | Context Length | Pricing (Input)   | Pricing (Output)  |
|----------|--------------------------------------|----------------|-------------------|-------------------|
| Novita   | Novita \| baidu/ernie-4.5-vl-424b-a47b | 123K         | $0.42 / 1M tokens | $1.25 / 1M tokens |
Benchmark Results