Cost-Aware GenAI: Model Routing for Serverless
The Cost Problem
Most GenAI deployments send every request to a single model, so a simple FAQ lookup triggers the same expensive Claude Sonnet call as a complex multi-step analysis. At scale, that mismatch wastes money on requests a cheaper model could have handled.
Intelligent Model Routing
The solution is to route each request to the cheapest model that can handle it well:
- Simple queries (< 100 input tokens) → Claude Haiku ($0.25/M input tokens)
- Standard queries (100-1000 input tokens) → Claude Sonnet ($3/M input tokens)
- Complex analysis (> 1000 input tokens, multi-step) → Claude Opus ($15/M input tokens)
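The tiers above can be sketched as a small routing function. This is a minimal illustration, not CostSentinel's implementation: the names `Tier`, `estimate_tokens`, and `route` are hypothetical, the 4-characters-per-token estimate is a rough heuristic, and treating any request flagged `multi_step` as complex regardless of length is an assumption the text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_tokens: float  # list prices quoted in the tiers above


HAIKU = Tier("haiku", 0.25)
SONNET = Tier("sonnet", 3.00)
OPUS = Tier("opus", 15.00)


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); use a real tokenizer in production."""
    return max(1, len(text) // 4)


def route(input_tokens: int, multi_step: bool = False) -> Tier:
    """Pick the cheapest tier whose threshold covers the request."""
    # Assumption: a multi-step request is treated as complex even if it is short.
    if multi_step or input_tokens > 1000:
        return OPUS
    if input_tokens >= 100:
        return SONNET
    return HAIKU


if __name__ == "__main__":
    prompt = "What are your opening hours?"
    tier = route(estimate_tokens(prompt))
    print(f"{tier.name} at ${tier.usd_per_million_tokens}/M tokens")
```

Token count alone is a crude proxy for difficulty, so a real router would likely combine it with other signals such as the endpoint or an explicit caller hint.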
Budget Enforcement
Beyond routing, production systems need hard budget limits:
- Per-endpoint daily and monthly caps
- Automatic model downgrade when budget is 80% consumed
- Alert-only mode for monitoring before enforcement
- Per-team and per-user allocation tracking
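One way the enforcement rules above might fit together is sketched below. All names (`Budget`, `Action`, `decide`, the endpoint and team keys) are hypothetical and chosen for illustration; the sketch assumes the 80% downgrade threshold applies to whichever of the daily or monthly cap is closer to exhaustion, and that alert-only mode reports what would have happened without changing the outcome.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"  # caller should switch to a cheaper model
    BLOCK = "block"          # caller should reject the request


@dataclass
class Budget:
    daily_cap_usd: float
    monthly_cap_usd: float
    downgrade_at: float = 0.80   # fraction of a cap that triggers downgrade
    alert_only: bool = False     # monitor without enforcing
    spent_today_usd: float = 0.0
    spent_month_usd: float = 0.0

    def record(self, cost_usd: float) -> None:
        """Add the cost of a completed request to both running totals."""
        self.spent_today_usd += cost_usd
        self.spent_month_usd += cost_usd

    def utilization(self) -> float:
        """Fraction consumed of whichever cap is closer to exhaustion."""
        return max(self.spent_today_usd / self.daily_cap_usd,
                   self.spent_month_usd / self.monthly_cap_usd)

    def decide(self) -> Tuple[Action, Optional[str]]:
        """Return the action to take plus an alert message, if any."""
        used = self.utilization()
        if used >= 1.0:
            intended = Action.BLOCK
        elif used >= self.downgrade_at:
            intended = Action.DOWNGRADE
        else:
            intended = Action.ALLOW
        alert = None
        if intended is not Action.ALLOW:
            alert = f"budget {used:.0%} consumed, intended action: {intended.value}"
        # Alert-only mode surfaces the alert but never changes the request.
        return (Action.ALLOW if self.alert_only else intended), alert


# Per-endpoint and per-team tracking: one Budget per (endpoint, team) key.
budgets = {("faq", "support"): Budget(daily_cap_usd=10.0, monthly_cap_usd=200.0)}

if __name__ == "__main__":
    b = budgets[("faq", "support")]
    b.record(8.5)
    print(b.decide())
```

Keeping the decision separate from the spend counters means the same `decide` logic works whether totals live in memory, as here, or in a shared store, which a Lambda deployment would need since instances do not share state.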
Results
Teams implementing cost-aware routing typically see a 60-80% cost reduction, with no measurable quality degradation on simple queries. The key insight is that most production traffic consists of simple requests that don’t need the most powerful model.
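To see how a figure in that range can arise, the arithmetic can be worked through for a made-up traffic mix. Everything in this sketch is an illustrative assumption rather than measured data: the request shares, the average token counts, and the choice of an all-Opus deployment as the baseline. Only the per-token prices come from the tiers listed earlier.

```python
# Prices in USD per million input tokens, from the routing tiers.
PRICE_PER_M = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

# Hypothetical mix: (share of requests, average input tokens, routed model).
MIX = [
    (0.80, 50, "haiku"),
    (0.18, 500, "sonnet"),
    (0.02, 2000, "opus"),
]


def cost_per_million_requests(mix, force_model=None):
    """USD to serve one million requests; force_model prices everything on one model."""
    total = 0.0
    for share, avg_tokens, model in mix:
        price = PRICE_PER_M[force_model or model]
        # share * 1e6 requests * avg_tokens tokens * (price / 1e6 tokens)
        total += share * avg_tokens * price
    return total


routed = cost_per_million_requests(MIX)                        # about $880
baseline = cost_per_million_requests(MIX, force_model="opus")  # about $2,550
savings = 1 - routed / baseline                                # about 65%

if __name__ == "__main__":
    print(f"routed ${routed:.0f}, baseline ${baseline:.0f}, savings {savings:.1%}")
```

The result is sensitive to the baseline model and to how many tokens land in the most expensive tier, so it is worth rerunning with measured traffic before relying on any headline percentage.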
Implementation
CostSentinel, part of the SubstrAI ecosystem, provides this as a drop-in middleware for any Lambda-based GenAI application. Define your budget in YAML, and the framework handles routing, tracking, and enforcement automatically.