Cost-Aware GenAI: Model Routing for Serverless

The Cost Problem

Many GenAI deployments send every request to a single model. A simple FAQ lookup gets the same expensive Claude Sonnet call as a complex multi-step analysis, and at scale that overspend adds up.

Intelligent Model Routing

The solution is routing requests to the cheapest model that can handle them well:

Simple queries (< 100 input tokens) → Claude Haiku ($0.25/M input tokens)
Standard queries (100-1000 input tokens) → Claude Sonnet ($3/M input tokens)
Complex analysis (> 1000 input tokens, multi-step) → Claude Opus ($15/M input tokens)
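
The tiers above can be sketched as a small routing function. This is a minimal illustration, not CostSentinel's actual logic: the tier labels, the `estimate_tokens` heuristic (roughly four characters per token), and the `multi_step` flag are all assumptions introduced here. The tier list does not say what happens to a long request that is not multi-step, so this sketch assumes it stays on Sonnet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                      # placeholder tier label, not a real model ID
    usd_per_million_input: float   # price figures taken from the tier list


HAIKU = ModelTier("haiku", 0.25)
SONNET = ModelTier("sonnet", 3.00)
OPUS = ModelTier("opus", 15.00)


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def route(prompt: str, multi_step: bool = False) -> ModelTier:
    """Pick the cheapest tier whose rule matches the request."""
    tokens = estimate_tokens(prompt)
    if tokens < 100:
        return HAIKU
    if tokens > 1000 and multi_step:
        return OPUS
    # 100-1000 tokens, or long but single-step (assumed to stay on Sonnet).
    return SONNET


if __name__ == "__main__":
    print(route("What are your opening hours?").name)
    print(route("x" * 8000, multi_step=True).name)
```

In practice the token count would come from the provider's tokenizer, and the multi-step signal from whatever the application already knows about the request, such as the endpoint or task type.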

Budget Enforcement

Beyond routing, production systems need hard budget limits:

Per-endpoint daily and monthly caps
Automatic model downgrade when budget is 80% consumed
Alert-only mode for monitoring before enforcement
Per-team and per-user allocation tracking
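
One way these rules might fit together is sketched below for a single endpoint with a daily cap. The class names, the `DOWNGRADE` mapping, and the choice to block requests once the cap is fully spent are assumptions made for illustration; they are not taken from any specific framework. Monthly caps and per-team or per-user tracking would follow the same pattern with additional counters.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical downgrade ladder: each tier falls back to the next cheaper one.
DOWNGRADE = {"opus": "sonnet", "sonnet": "haiku", "haiku": "haiku"}


@dataclass
class Decision:
    action: str             # "allow", "downgrade", "block", or "alert"
    model: Optional[str]    # None when the request is blocked


@dataclass
class EndpointBudget:
    daily_cap_usd: float
    spent_usd: float = 0.0
    downgrade_threshold: float = 0.8   # downgrade once 80% is consumed
    alert_only: bool = False           # monitor without enforcing

    def record_spend(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

    def decide(self, requested_model: str) -> Decision:
        used = self.spent_usd / self.daily_cap_usd
        if used < self.downgrade_threshold:
            return Decision("allow", requested_model)
        if self.alert_only:
            # Alert-only mode: flag the overrun but serve the request as asked.
            return Decision("alert", requested_model)
        if used >= 1.0:
            return Decision("block", None)
        return Decision("downgrade", DOWNGRADE[requested_model])


if __name__ == "__main__":
    budget = EndpointBudget(daily_cap_usd=10.0)
    budget.record_spend(8.5)
    print(budget.decide("opus"))
```

Running an endpoint with `alert_only=True` first shows how often enforcement would have triggered before any request is actually downgraded or blocked.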

Results

Teams implementing cost-aware routing typically see 60-80% cost reduction without measurable quality degradation on simple queries. The key insight is that most production traffic consists of simple requests that don’t need the most powerful model.
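
To see how the traffic mix drives the savings, the short calculation below compares a routed mix against sending everything to Sonnet. The 80/20 split is a hypothetical example, not measured data, and the sketch counts only the input-token prices quoted earlier, ignoring output tokens.

```python
# Prices in USD per million input tokens, as quoted in the routing tiers.
PRICE_PER_M = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}


def blended_price(mix: dict) -> float:
    """Average price per million tokens for a mix of token shares summing to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    return sum(PRICE_PER_M[model] * share for model, share in mix.items())


def savings_vs_baseline(mix: dict, baseline: str = "sonnet") -> float:
    """Fractional cost reduction relative to routing everything to `baseline`."""
    return 1.0 - blended_price(mix) / PRICE_PER_M[baseline]


if __name__ == "__main__":
    # Hypothetical mix: 80% of tokens go to Haiku, 20% stay on Sonnet.
    mix = {"haiku": 0.8, "sonnet": 0.2}
    print(f"blended price: ${blended_price(mix):.2f}/M")
    print(f"savings vs all-Sonnet: {savings_vs_baseline(mix):.0%}")
```

Under these assumptions the blended price is $0.80 per million tokens, about 73% below the all-Sonnet baseline. Shifting even a small share of traffic to Opus raises the blended price quickly, so the achievable savings depend heavily on how much traffic is truly simple.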

Implementation

CostSentinel, part of the SubstrAI ecosystem, provides this as a drop-in middleware for any Lambda-based GenAI application. Define your budget in YAML, and the framework handles routing, tracking, and enforcement automatically.