Routing

Intelligent model selection based on capabilities, cost, and availability.

Overview

The routing engine selects the best model for each request based on:

  • Required capabilities — What the request needs (tool use, vision, etc.)
  • Cost preferences — Whether to optimize for cost or for quality
  • Provider availability — Real-time health checks
  • Fallback configuration — Automatic failover chains

Capabilities

Capability          Description
------------------  -------------------------------
chat                Basic chat completion
tool_use            Function/tool calling
vision              Image understanding
code                Code generation and analysis
long_context        100k+ token context window
streaming           Server-sent events streaming
structured_output   JSON mode or schema enforcement

Getting a Routing Decision

```bash
curl -X POST http://localhost:3001/v1/routing/route \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "ws_123",
    "required_capabilities": ["chat", "tool_use"],
    "prefer_cost": "low",
    "allow_fallback": true
  }'
```

Response:

```json
{
  "success": true,
  "selected_provider": "anthropic",
  "selected_model": "claude-3-5-sonnet",
  "rationale": "Selected for tool_use capability with lowest cost among capable models",
  "candidates": [
    {
      "provider": "anthropic",
      "model": "claude-3-5-sonnet",
      "score": 0.92,
      "capabilities_match": ["chat", "tool_use"],
      "price_tier": "medium"
    },
    {
      "provider": "openai",
      "model": "gpt-4o",
      "score": 0.88,
      "capabilities_match": ["chat", "tool_use"],
      "price_tier": "high"
    }
  ],
  "fallback_chain": ["openai:gpt-4o", "google:gemini-pro"]
}
```
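The same request can be issued programmatically. A minimal Python sketch using only the standard library, assuming the local gateway address shown above; `build_route_request` and `get_routing_decision` are illustrative names, not part of any SDK:

```python
import json
import urllib.request

# Local gateway address from the curl example above (assumed).
ROUTING_URL = "http://localhost:3001/v1/routing/route"

def build_route_request(workspace_id, required_capabilities,
                        prefer_cost="low", allow_fallback=True):
    """Assemble the routing request body shown above."""
    return {
        "workspace_id": workspace_id,
        "required_capabilities": required_capabilities,
        "prefer_cost": prefer_cost,
        "allow_fallback": allow_fallback,
    }

def get_routing_decision(payload):
    """POST the payload and return the decoded routing decision."""
    req = urllib.request.Request(
        ROUTING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The decision's `selected_provider` and `selected_model` fields can then be used to issue the actual completion call.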

Scoring Algorithm

Each model is scored based on weighted factors:

```jsonc
// Default scoring weights
{
  "capability_match": 0.40,  // Must have required capabilities
  "cost_efficiency": 0.25,   // Lower cost = higher score
  "latency": 0.15,           // Faster = higher score
  "availability": 0.10,      // Current health status
  "quality": 0.10            // Historical success rate
}
```

```
score =
  (capability_score * 0.40) +
  (cost_score * 0.25) +
  (latency_score * 0.15) +
  (availability_score * 0.10) +
  (quality_score * 0.10)
```

Weights can be adjusted per-workspace via the routing configuration API.
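The weighted sum above translates directly into a small function. In this sketch each signal is assumed to be pre-normalized to the 0–1 range (how the engine normalizes raw cost or latency into a score is not specified here):

```python
# Default weights from the configuration above; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "capability_match": 0.40,
    "cost_efficiency": 0.25,
    "latency": 0.15,
    "availability": 0.10,
    "quality": 0.10,
}

def score_model(signals, weights=DEFAULT_WEIGHTS):
    """Weighted sum of normalized (0-1) signals; higher is better."""
    return sum(weights[name] * signals[name] for name in weights)
```

A model with a perfect capability match but mediocre cost efficiency still scores well, because capability carries the largest weight.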

Fallback Chains

When a provider fails, the routing engine automatically tries the next model in the chain:

```jsonc
// Configure a fallback chain
{
  "id": "production-chain",
  "name": "Production Fallback",
  "models": [
    "anthropic:claude-3-5-sonnet",  // Primary
    "openai:gpt-4o",                // First fallback
    "openai:gpt-4o-mini",           // Cheaper fallback
    "google:gemini-pro"             // Last resort
  ],
  "failure_triggers": [
    "rate_limit",
    "timeout",
    "provider_error",
    "capacity_exceeded"
  ]
}
```
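The failover loop can be sketched as follows. `ProviderError` and `call_with_fallback` are hypothetical names for illustration; the engine's actual retry semantics (backoff, per-trigger behavior) may differ:

```python
# Failure triggers from the chain configuration above.
FAILURE_TRIGGERS = {"rate_limit", "timeout", "provider_error", "capacity_exceeded"}

class ProviderError(Exception):
    """Hypothetical error type carrying a failure-trigger reason."""
    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason

def call_with_fallback(chain, call_model):
    """Try each "provider:model" entry in order, advancing only on
    errors whose reason matches a configured failure trigger."""
    last_error = None
    for entry in chain:
        provider, model = entry.split(":", 1)
        try:
            return call_model(provider, model)
        except ProviderError as err:
            if err.reason not in FAILURE_TRIGGERS:
                raise  # non-retryable failure: surface it immediately
            last_error = err
    raise last_error  # every model in the chain failed
```

Note that only the listed triggers advance the chain; other errors (for example, an invalid request) propagate immediately rather than burning through fallbacks.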

Built-in Model Registry

RelayPlane includes a registry of major models with their capabilities:

```bash
curl http://localhost:3001/v1/routing/models
```

```jsonc
{
  "models": [
    {
      "id": "claude-3-5-sonnet",
      "provider": "anthropic",
      "capabilities": ["chat", "tool_use", "vision", "code", "long_context", "streaming"],
      "pricing": { "input_per_1k": 0.003, "output_per_1k": 0.015 },
      "context_window": 200000,
      "status": "available"
    },
    {
      "id": "gpt-4o",
      "provider": "openai",
      "capabilities": ["chat", "tool_use", "vision", "code", "streaming", "structured_output"],
      "pricing": { "input_per_1k": 0.005, "output_per_1k": 0.015 },
      "context_window": 128000,
      "status": "available"
    },
    {
      "id": "gpt-4o-mini",
      "provider": "openai",
      "capabilities": ["chat", "tool_use", "streaming", "structured_output"],
      "pricing": { "input_per_1k": 0.00015, "output_per_1k": 0.0006 },
      "context_window": 128000,
      "status": "available"
    }
    // ... more models
  ]
}
```
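With the registry in hand, a client can pre-filter candidates itself, for example to find the cheapest model that covers a capability set. A sketch over an abbreviated copy of the entries above; `cheapest_capable` is an illustrative helper, not an API of the registry:

```python
# Abbreviated copy of the registry entries shown above.
MODELS = [
    {"id": "claude-3-5-sonnet", "provider": "anthropic",
     "capabilities": {"chat", "tool_use", "vision", "code", "long_context", "streaming"},
     "pricing": {"input_per_1k": 0.003, "output_per_1k": 0.015}},
    {"id": "gpt-4o", "provider": "openai",
     "capabilities": {"chat", "tool_use", "vision", "code", "streaming", "structured_output"},
     "pricing": {"input_per_1k": 0.005, "output_per_1k": 0.015}},
    {"id": "gpt-4o-mini", "provider": "openai",
     "capabilities": {"chat", "tool_use", "streaming", "structured_output"},
     "pricing": {"input_per_1k": 0.00015, "output_per_1k": 0.0006}},
]

def cheapest_capable(models, required):
    """Models that have every required capability, cheapest input price first."""
    capable = [m for m in models if required <= m["capabilities"]]
    return sorted(capable, key=lambda m: m["pricing"]["input_per_1k"])
```

This mirrors the cost side of the server-side scoring, ignoring latency, availability, and quality.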

Provider Lanes

Models are categorized into lanes based on their provider type:

  • proprietary — OpenAI, Anthropic, Google (commercial APIs)
  • hosted — Together, Groq, Fireworks (hosted open models)
  • local — Ollama, LM Studio (self-hosted)
  • custom — Your own endpoints (vLLM, TGI, etc.)

Provider Health

The routing engine tracks provider health in real-time:

```jsonc
// Health statuses
{
  "anthropic": {
    "status": "healthy",
    "latency_p50_ms": 450,
    "success_rate": 0.998,
    "last_check": "2026-02-06T12:00:00Z"
  },
  "openai": {
    "status": "degraded",
    "latency_p50_ms": 1200,
    "success_rate": 0.95,
    "last_check": "2026-02-06T12:00:00Z",
    "issue": "Elevated latency detected"
  }
}
```
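The status values above suggest a simple classification over latency and success rate. A sketch with illustrative thresholds — the actual thresholds the engine uses are not documented here:

```python
def classify_health(latency_p50_ms, success_rate,
                    max_latency_ms=1000, min_success=0.99):
    """Classify a provider from its health metrics.
    Thresholds are assumptions for illustration only."""
    if success_rate < 0.90:
        return "unhealthy"   # failing too often to route to at all
    if latency_p50_ms > max_latency_ms or success_rate < min_success:
        return "degraded"    # usable, but penalized in scoring
    return "healthy"
```

Under these assumed thresholds, the two providers in the example above come out "healthy" and "degraded" respectively, matching the reported statuses.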

Next Steps