Default plot · Quality vs. Speed
The Pareto frontier
Top-right is best: higher output speed and higher quality. Qorinix lanes sit on the frontier, pairing high quality with speeds public references cannot match.
Default plots · Per-metric ranking
Where each model lands on the metric that matters
Sorted bar plots make trade-offs explicit. Qorinix dominates speed, latency, and cost while staying competitive on quality.
Detailed leaderboard
Sortable, filterable leaderboard
Filter by category and sort by any column. Qorinix rows highlight in orange.
| # | Lane / Model | Quality | TTFT p50 (ms) | Total p95 | Output speed (tok/s) | JSON | Success | Cost / 1M out | Cache saving | Value |
|---|---|---|---|---|---|---|---|---|---|---|
Category winners
Best-in-class per workload
Different workloads value different trade-offs. Here are the winners by intent.
Real-time agents
Qorinix 3.1
TTFT under 150 ms, throughput above 230 tok/s — voice agents, gaming NPCs, and trading alerts where every millisecond matters.
Why: lowest TTFT and total latency, with adaptive routing across speed classes.
High-volume support automation
Qorinix 3.2
62% cache saving on repeated queries with quality matching frontier public models, at less than half the cost.
Why: semantic cache + Quality lane keeps unit economics healthy at scale.
Long-form reasoning
ChatGPT (GPT-4o)
Highest reasoning index in the public reference set; pair with Qorinix routing for speed-tiered fallback.
Caveat: 4–5× slower TTFT and ~3.5× the cost per million output tokens.
Cost-sensitive batch
DeepSeek
Cheapest non-Qorinix lane; useful for offline batch where latency does not matter.
Caveat: low cache saving and middle-of-the-pack quality.
Methodology
How the benchmark is computed
Transparency about prompt mix, measurement, and what is held server-side.
1 · Prompt mix
14,200 prompts per day distributed across reasoning (35%), code (25%), JSON / tool-use (20%), creative (15%), and short-form chat (5%). Prompts rotate every 72 hours.
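The mix above is a categorical distribution, so sampling a prompt category is one weighted draw. A minimal sketch, assuming the published shares; the category keys and function name are illustrative, not the benchmark's actual code:

```python
import random

# Shares taken from the published prompt mix; keys are illustrative.
PROMPT_MIX = {
    "reasoning": 0.35,
    "code": 0.25,
    "json_tool_use": 0.20,
    "creative": 0.15,
    "short_chat": 0.05,
}

def sample_category(rng: random.Random) -> str:
    """Draw one prompt category according to the mix weights."""
    categories = list(PROMPT_MIX)
    weights = list(PROMPT_MIX.values())
    return rng.choices(categories, weights=weights, k=1)[0]
```

Over a day of 14,200 draws, each category's observed share converges to its weight, which is what keeps the daily mix stable even as individual prompts rotate.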
2 · Latency measurement
TTFT measured server-side from request receipt to first byte of response. Total latency captured to last token. p50 is the median across the rolling 72-hour window; p95 is the slow-tail.
3 · Quality scoring
Composite of model-graded preference (LLM-as-judge with cross-model rotation), task-deterministic checks (HumanEval-lite for code, JSON-schema validation for tools), and reading-level coherence.
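A composite of three signals is typically a weighted blend on a common scale. A minimal sketch, assuming each signal is normalized to 0-1; the weights here are assumptions for illustration, not the published formula:

```python
def composite_quality(judge, checks, coherence, weights=(0.5, 0.3, 0.2)):
    """Blend judge preference, deterministic check pass rate, and
    coherence into one score. All inputs on a 0-1 scale; the weight
    split is an assumption, not the benchmark's actual weighting.
    """
    w_judge, w_checks, w_coherence = weights
    return w_judge * judge + w_checks * checks + w_coherence * coherence
```

Keeping the deterministic checks (HumanEval-lite pass rate, JSON-schema validity) as a separate term means a model cannot buy composite score with judge preference alone.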
4 · Cost
Listed as the per-1M output-token list price applicable to the lane on the measurement day. Cache savings are computed on Qorinix-internal traffic and assume a semantic cache hit rate of at least 40%.
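The blended price under caching is straightforward arithmetic: hits are billed at a discount, misses at list price. A minimal sketch of that arithmetic; this is illustrative, not the published billing formula:

```python
def effective_cost_per_m(list_price, hit_rate, saving_on_hit):
    """Blended cost per 1M output tokens when a fraction `hit_rate`
    of traffic hits the semantic cache and is billed at
    (1 - saving_on_hit) of list price. Illustrative only.
    """
    hit_cost = list_price * (1 - saving_on_hit)
    miss_cost = list_price
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost
```

For example, a $10/M lane with a 40% hit rate and fully discounted hits blends to $6/M; lower per-hit savings move the blended price proportionally back toward list.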
5 · What stays server-side
Exact provider model IDs, supplier API routes, credentials, failover order, and routing weights are never exposed in the public leaderboard. Only public benchmark names and observed metrics are shown.
6 · Updates
Numbers refresh continuously from production Arena traffic. The visible board is the rolling 72-hour aggregate. Anomalies (regional outage, supplier rate-limit) are flagged in the live status panel.
Test these numbers yourself.
Run the same prompt against all six lanes in the live Arena.
