Frequently Asked Questions (FAQ)

Find answers to common questions about Benchable.

General Questions

What is Benchable?

Benchable is a web-based application for systematically evaluating and comparing the performance, cost, and output quality of different AI models and configurations on specific tasks. It streamlines the process of testing model suitability and selecting the best model for your needs.

Who is Benchable for?

Benchable is for AI engineers, prompt engineers, software developers, researchers, product managers, and anyone who needs to systematically test and compare AI prompt performance across different models.

Account & API Keys

Do I need an account to use Benchable?

Yes, you need to register for an account to create benchmarks, run executions, and save your work. You can browse public benchmarks and models without an account.

What does BYOK (Bring Your Own Key) mean?

BYOK means you use your own API keys from AI providers (such as OpenRouter or OpenAI). This ensures that any costs incurred from model usage are billed directly to your provider account. Benchable stores your keys securely using encryption.

Which AI providers does Benchable support?

Currently we only support OpenRouter as a provider, but we are working on adding more providers soon.

Which underlying provider serves my requests?

We use whichever provider OpenRouter currently classifies as the Top Provider for the given model.
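
As an illustration of what BYOK means in practice, here is a minimal sketch of a direct OpenRouter chat completion request made with your own key: usage is billed to your OpenRouter account, and OpenRouter routes the request to an underlying provider. The model slug, prompt, and environment variable name are placeholders, and this is not Benchable's internal code.

  // Minimal BYOK sketch (TypeScript, Node 18+): call OpenRouter's OpenAI-compatible
  // chat completions endpoint with your own API key. Usage is billed to your
  // OpenRouter account; OpenRouter picks the underlying provider for the model.
  const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`, // your own key (BYOK)
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "anthropic/claude-sonnet-4", // any model slug listed on OpenRouter
      messages: [{ role: "user", content: "Reply with the single word: ready" }],
    }),
  });
  const data = await response.json();
  console.log(data.choices[0].message.content);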

Benchmarks & Executions

How does Benchable validate model outputs?

Benchable offers various validation methods, including exact match, text containment, regex matching, JSON schema validation, and semantic equivalence checks performed by another AI model. You configure these validation rules at the benchmark level.
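
To make these validation types concrete, here is a simplified, illustrative sketch of how such rules could be expressed and checked. The rule shapes and field names are hypothetical and are not Benchable's actual configuration schema; real JSON schema validation covers far more than required keys, and semantic equivalence delegates to a judge model (see the semantic validation answer below).

  // Illustrative only: one way the listed validation types could be modelled.
  type ValidationRule =
    | { type: "exact_match"; expected: string }
    | { type: "contains"; substring: string }
    | { type: "regex"; pattern: string }
    | { type: "json_keys"; requiredKeys: string[] }; // simplified stand-in for JSON schema validation

  function validate(output: string, rule: ValidationRule): boolean {
    switch (rule.type) {
      case "exact_match":
        return output.trim() === rule.expected.trim();
      case "contains":
        return output.includes(rule.substring);
      case "regex":
        return new RegExp(rule.pattern).test(output);
      case "json_keys":
        try {
          const parsed = JSON.parse(output);
          return rule.requiredKeys.every((key) => key in parsed);
        } catch {
          return false; // output was not valid JSON
        }
    }
  }

  // Example: a JSON answer passes a containment check and a required-keys check.
  const output = '{"result": 42}';
  console.log(validate(output, { type: "contains", substring: "42" }));           // true
  console.log(validate(output, { type: "json_keys", requiredKeys: ["result"] })); // true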

Which model does Benchable use for auto-generation?

Benchable uses the most capable model we can reasonably use for auto-generating benchmarks, system prompts, and test steps. This is currently Anthropic Claude 4 Opus 20250522, and we update this choice regularly.

Which model is used for semantic validation?

For semantic validation (comparing outputs for meaning rather than exact text), Benchable uses Google Gemini 2.5 Flash Preview 05 20 as the default model, though you can configure a different model for specific validation rules.
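
For illustration, a semantic check generally works by asking a judge model whether two answers mean the same thing. The sketch below shows that LLM-as-judge pattern via OpenRouter; the judge model slug, prompt wording, and function name are assumptions, not Benchable's implementation.

  // Illustrative LLM-as-judge semantic check (not Benchable's internal code).
  async function semanticallyEquivalent(expected: string, actual: string): Promise<boolean> {
    const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "google/gemini-2.5-flash", // judge model; any capable model can fill this role
        messages: [{
          role: "user",
          content:
            "Do these two answers convey the same meaning? Reply with only YES or NO.\n\n" +
            `Answer A: ${expected}\n\nAnswer B: ${actual}`,
        }],
      }),
    });
    const data = await response.json();
    return data.choices[0].message.content.trim().toUpperCase().startsWith("YES");
  }

  // Example: these differ textually but should be judged equivalent.
  console.log(await semanticallyEquivalent("The capital of France is Paris.", "Paris"));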

Could auto-generated benchmarks be biased toward certain models?

This is an important consideration for benchmark validity. When Benchable auto-generates a benchmark using Anthropic Claude 4 Opus 20250522, the generated content, structure, or expected responses may be biased toward models from the same family or provider.

Best practices to ensure fairness:

  • Review and edit generated content: Always review auto-generated benchmarks and modify them as needed to ensure they test the capabilities you actually want to measure.
  • Use diverse validation methods: Combine exact match, regex, and semantic validation to reduce reliance on any single evaluation approach.
  • Create manual benchmarks: For critical evaluations, consider creating benchmarks manually or using content from established academic datasets.
  • Test with multiple model families: Include models from different providers (OpenAI, Google, Meta, Anthropic, etc.) to identify any systematic biases.
  • Use human evaluation: For important decisions, supplement automated validation with human evaluation of outputs.

Remember that auto-generation is a starting point to save time, not a replacement for thoughtful benchmark design.

Can I include personal data (PII) in my benchmarks?

No. You must not upload any personally identifiable information (PII) or sensitive personal data within your benchmark definitions, prompts, system prompts, or any associated uploaded files.

Benchable is not designed for storing or processing PII within benchmark content. You are solely responsible for the data you provide and for ensuring it does not contain PII. Please refer to our Privacy Policy and Terms of Service for more details.

Didn't find your answer? Contact us.