Microsoft: Phi 4

Text input · Text output
Author's Description

[Microsoft Research](/microsoft) Phi-4 is designed to perform well on complex reasoning tasks and to operate efficiently where memory is limited or quick responses are needed. At 14 billion parameters, it was trained on a mix of high-quality synthetic datasets, data from curated websites, and academic materials. It has been carefully tuned to follow instructions accurately and maintain strong safety standards, and it works best with English-language inputs. For more information, see the [Phi-4 Technical Report](https://arxiv.org/pdf/2412.08905).

Key Specifications
- Cost: $$
- Context: 16K
- Parameters: 14B (Rumoured)
- Released: Jan 09, 2025
Supported Parameters

This model supports the following parameters:

Response Format, Stop, Seed, Min P, Top P, Max Tokens, Frequency Penalty, Temperature, Presence Penalty
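
These map to the usual OpenAI-style sampling and output controls. Below is a minimal sketch of passing them in a chat-completions request, assuming an OpenAI-compatible endpoint; the base URL and API key are placeholders, not part of this listing, while the `microsoft/phi-4` model ID comes from the endpoints table further down.

```python
import requests

# Placeholder endpoint and key; substitute your provider's OpenAI-compatible URL.
API_URL = "https://example-provider.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "microsoft/phi-4",
    "messages": [{"role": "user", "content": "Give a two-sentence summary of synthetic training data."}],
    # Sampling controls from the supported-parameter list above
    "temperature": 0.7,
    "top_p": 0.9,
    "min_p": 0.05,
    "max_tokens": 256,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "seed": 42,        # best-effort reproducibility where the provider honours it
    "stop": ["###"],   # optional stop sequence
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```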
Features

This model supports the following features:

Response Format
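
Response Format here typically refers to the OpenAI-style `response_format` request field for constraining output to JSON. A minimal sketch under the same assumptions as the request above (placeholder endpoint; how strictly JSON mode is enforced depends on the provider):

```python
import json
import requests

API_URL = "https://example-provider.com/v1/chat/completions"  # placeholder
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "microsoft/phi-4",
    "messages": [
        {"role": "system", "content": "Answer only with a JSON object with keys 'label' and 'confidence'."},
        {"role": "user", "content": "Classify the sentiment of: 'The update fixed every crash I had.'"},
    ],
    "response_format": {"type": "json_object"},  # request JSON-only output
    "max_tokens": 128,
}

response = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload, timeout=60)
response.raise_for_status()
result = json.loads(response.json()["choices"][0]["message"]["content"])
print(result["label"], result["confidence"])
```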
Performance Summary

Microsoft's Phi-4, a 14-billion-parameter model, shows a strong overall profile, particularly in cost-efficiency and reliability. It ranks in the 61st percentile for speed across five benchmarks and in the 85th percentile for cost, consistently offering among the most competitive pricing. Reliability is exceptional: a 100% success rate across all benchmarks, with minimal technical failures and consistently usable responses.

On individual benchmarks, Phi-4 is strongest in ethical reasoning, scoring a perfect 100% on the Ethics (Baseline) benchmark and making it the most accurate model at its price point and among models of similar speed. It also acknowledges uncertainty well, reaching 96.0% on the Hallucinations (Baseline) test by correctly identifying fictional concepts, and its general knowledge is solid at 96.8% accuracy. Weaknesses appear in complex reasoning, where it scores 50.0% on the Reasoning (Baseline) benchmark (34th percentile), and in email classification at 94.0% (30th percentile). In short, its strengths are ethical understanding, reliability, and cost-effectiveness, while complex reasoning and certain classification tasks leave room for improvement.

Model Pricing

Current Pricing

| Feature | Price (per 1M tokens) |
|---|---|
| Prompt | $0.07 |
| Completion | $0.14 |

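At these rates, a request costs (prompt tokens ÷ 1,000,000) × $0.07 plus (completion tokens ÷ 1,000,000) × $0.14. A small illustration of the arithmetic:

```python
PROMPT_PRICE = 0.07 / 1_000_000      # USD per prompt token
COMPLETION_PRICE = 0.14 / 1_000_000  # USD per completion token

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated cost of one request at the listed Phi-4 rates."""
    return prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE

# Example: a 4,000-token prompt with a 500-token completion
print(f"${request_cost(4_000, 500):.6f}")  # $0.000350
```
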
Price History

Available Endpoints
| Provider | Endpoint Name | Context Length | Pricing (Input) | Pricing (Output) |
|---|---|---|---|---|
| DeepInfra | microsoft/phi-4 | 16K | $0.07 / 1M tokens | $0.14 / 1M tokens |
| Nebius | microsoft/phi-4 | 16K | $0.06 / 1M tokens | $0.14 / 1M tokens |
| NextBit | microsoft/phi-4 | 16K | $0.06 / 1M tokens | $0.14 / 1M tokens |
Benchmark Results