Cost & LLM Strategy

Tiered model routing to minimize cost without sacrificing quality

LLM cost is controllable if you treat it as an engineering problem, not a fixed API bill. Route by complexity, cache aggressively, and push deterministic logic out of the model entirely.

Design Principles

Four rules that govern every inference decision in the system.

Use cheapest viable model

Default to the smallest model that meets accuracy thresholds. Premium models are the exception, not the rule.

Reserve premium for complex reasoning

Only invoke expensive models for nuanced dispute resolution and multi-step policy interpretation.

Reduce tokens aggressively

Structured retrieval, summarization caches, and pre-filtered context keep token counts low without sacrificing relevance.

Push deterministic checks to rules engine

Policy checks that are binary (refund ≤ $X, within Y days) go through a rules engine — no LLM needed.

Model Routing

Three tiers — route by case complexity, pay only for the intelligence you need.

Tier 0: Low Complexity
Templates, classification, lightweight summarization
Ticket share ~70% · Avg tokens/case ~900 · Cost/1M tokens $0.60

Tier 1: Medium Complexity
Contextual response drafting + policy retrieval
Ticket share ~25% · Avg tokens/case ~2,200 · Cost/1M tokens $3.00

Tier 2: High Complexity
Nuanced reasoning for complex disputes and edge cases
Ticket share ~5% · Avg tokens/case ~4,500 · Cost/1M tokens $12.00
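The routing itself can be a thin function in front of the tier table. A minimal sketch, assuming a single scalar complexity score in [0, 1] from an upstream classifier — the score, thresholds, and model names are illustrative, not from the source:

```python
# Tier table mirroring the three tiers above; "model" values are
# placeholders, not real model identifiers.
TIERS = {
    0: {"model": "small-tier0",   "cost_per_1m_tokens": 0.60},
    1: {"model": "mid-tier1",     "cost_per_1m_tokens": 3.00},
    2: {"model": "premium-tier2", "cost_per_1m_tokens": 12.00},
}

def route(complexity: float) -> int:
    """Map a classifier's complexity score in [0, 1] to a tier.

    Thresholds are assumptions, chosen so a typical score distribution
    yields roughly the 70 / 25 / 5 split described above.
    """
    if complexity < 0.70:
        return 0
    if complexity < 0.95:
        return 1
    return 2
```

The Tier 2 threshold is the highest-leverage knob here: every case kept out of the premium tier avoids the $12.00/1M-token rate entirely.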

Cost at Scale

Projected monthly LLM cost at 1M tickets/month.

1M tickets/month: ~$5,028/mo total (~$0.005 per ticket)

Tier 0: 700K tickets · 630M tokens · $378 (7.5% of cost)
Tier 1: 250K tickets · 550M tokens · $1,650 (32.8% of cost)
Tier 2: 50K tickets · 225M tokens · $2,700 (53.7% of cost)
Audit: ~30K tickets · ~75M tokens · $300 (6% of cost)
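The projection above is straight arithmetic: tickets × tokens/case × rate per tier. A short sketch that reproduces it (the audit line item is taken directly from the table rather than derived):

```python
# (tickets/month, avg tokens per case, $ per 1M tokens) for each tier.
TIERS = [
    (700_000, 900,   0.60),   # Tier 0: 630M tokens -> $378
    (250_000, 2_200, 3.00),   # Tier 1: 550M tokens -> $1,650
    (50_000,  4_500, 12.00),  # Tier 2: 225M tokens -> $2,700
]
AUDIT_COST = 300.00           # taken as-is from the breakdown above

def projected_monthly_cost() -> float:
    llm = sum(t * tok / 1_000_000 * rate for t, tok, rate in TIERS)
    return llm + AUDIT_COST   # 378 + 1,650 + 2,700 + 300 = 5,028

def savings_per_pp_shift() -> float:
    """Gross Tier 2 saving from shifting 1pp of volume (10K tickets) down."""
    return 10_000 * 4_500 / 1_000_000 * 12.00   # 45M tokens at $12/1M = $540
```

The same arithmetic yields the $540/month figure quoted below the table: 1pp of 1M tickets is 10K tickets, or 45M Tier 2 tokens at $12.00/1M.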

Key insight: Tier 2 handles only 5% of tickets but drives ~54% of total LLM cost. Every one-percentage-point reduction in the premium-model invocation rate saves ~$540/month at this scale.

Cost Controls

Five mechanisms that prevent cost creep as volume scales.

Token budget per case type

Hard caps per policy category prevent runaway token spend on low-value cases.

Max context window + retrieval chunk limits

Cap retrieved chunks at 3–5 per query; truncate context to essential history only.

Response length caps

Model output capped at 300 tokens for Tier 0, 600 for Tier 1, 1,200 for Tier 2.
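The chunk limit and the per-tier output caps can both be enforced at request-build time. A minimal sketch — the request shape is illustrative, not a specific provider's API:

```python
# Guardrails from the controls above: 3-5 retrieved chunks per query and
# per-tier output caps of 300 / 600 / 1,200 tokens.
MAX_CHUNKS = 5
MAX_OUTPUT_TOKENS = {0: 300, 1: 600, 2: 1_200}

def build_request(tier: int, chunks: list[str], query: str) -> dict:
    """Assemble a capped completion request (field names are placeholders)."""
    return {
        "context": chunks[:MAX_CHUNKS],         # hard retrieval-chunk limit
        "prompt": query,
        "max_tokens": MAX_OUTPUT_TOKENS[tier],  # response length cap by tier
    }
```

Enforcing the caps in one builder function means no call site can accidentally bypass them, which is what keeps spend predictable as volume grows.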

Reuse summaries across session

Case summaries generated once and cached — subsequent turns reference the summary, not raw history.

Batch post-resolution audits off-peak

Audits run in batch during low-traffic hours to avoid premium-rate inference costs.

KPI Dashboard

Five metrics that keep cost visible and actionable for ops and engineering.

Cost per assisted case

Blended LLM cost across all tiers for cases where the copilot was engaged.

Cost per resolved case

End-to-end cost including audit, routing, and any supervisor review time.

Premium-model invocation rate

Percentage of cases escalated to Tier 2 — target <5%, alert at >8%.

Tokens per case by market/product

Segmented token usage reveals which markets or verticals drive disproportionate cost.

Cost vs quality frontier

Pareto chart of cost per case against suggestion acceptance rate — find the efficient frontier.
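The premium-model invocation KPI above has explicit thresholds (target <5%, alert at >8%), so its monitoring reduces to a threshold check. A minimal sketch — the three status labels are illustrative:

```python
# Thresholds from the KPI definition: target <5%, alert at >8%.
TARGET_RATE = 0.05
ALERT_RATE = 0.08

def premium_invocation_status(tier2_cases: int, total_cases: int) -> str:
    """Classify the current Tier 2 invocation rate against the KPI bands."""
    rate = tier2_cases / total_cases
    if rate > ALERT_RATE:
        return "alert"   # above the 8% alert threshold
    if rate > TARGET_RATE:
        return "watch"   # above target but below alert
    return "ok"          # within the <5% target
```

Running this per market/product segment, as the tokens-per-case KPI suggests, pinpoints which segments are pushing the blended rate above target.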

$0.005 per ticket at scale

At 1M tickets/month, tiered routing keeps total LLM cost under $5,100/month. The cost structure is dominated by Tier 2 — every improvement in classifier accuracy that correctly routes cases to lower tiers compounds into significant savings.