Payments are unforgiving about time. A merchant’s checkout flow gives you maybe 800ms total before the customer notices, and within that you have to authorise the card, route the funds, and often score risk. If our AI layer eats more than 100–200ms of that budget, we lose the bid. So we treat latency as a first-class constraint, not a performance afterthought.
Where the time goes
The honest accounting from our production traces in Toronto (northamerica-northeast2):
- Cloud Run cold start, when it happens: 600–1200ms (we keep min instances at 1)
- Vertex AI Predict on a small custom model: 25–45ms p95
- Vertex AI Embeddings for retrieval: 30–60ms p95
- Gemini Flash, 2K input tokens, 200 output tokens: 280–420ms p95
- Gemini Pro, same shape: 800–1300ms p95
The headline is that LLM-in-the-hot-path is hard. For real-time risk scoring we use Vertex AI Predict on a tabular model and reserve Gemini for asynchronous case-building. For the rare cases where we do need an LLM in the bid path, Gemini Flash is the right tool and we cap the output at 60 tokens.
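For the curious, here is roughly what that capped Flash call looks like with the vertexai SDK. The project, region, model version, and prompt are placeholders; the only number carried over from our setup is the 60-token cap.

```python
# Rough shape of the capped hot-path call. Assumes the vertexai SDK; the
# project, region, model version, and prompt are placeholders.
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-project", location="northamerica-northeast2")

flash = GenerativeModel("gemini-1.5-flash")

def one_line_rationale(features_summary: str) -> str:
    response = flash.generate_content(
        f"In one sentence, name the single strongest risk signal:\n{features_summary}",
        generation_config=GenerationConfig(
            max_output_tokens=60,   # the cap that keeps this viable in the bid path
            temperature=0.0,        # we want a boring, repeatable answer here
        ),
    )
    return response.text
```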
Eliminating cold starts
Cloud Run scale-to-zero is great for batch jobs and bad for latency-sensitive APIs. We pin --min-instances=1 on production services and accept the small idle cost. We also run on Cloud Run’s second-generation execution environment for its faster CPU and network performance.
For surfaces that handle bursts (a marketing campaign can create a 20x spike in onboardings) we tune --max-concurrency deliberately. The default of 80 is fine for I/O-heavy workloads; it’s wrong for ones that spend their time in CPU.
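A back-of-envelope way to pick the number for a CPU-bound service, ignoring everything except compute time; the figures below are made up for illustration, not our production values.

```python
# Back-of-envelope concurrency sizing for a CPU-bound Cloud Run service.
# All numbers here are illustrative.
def max_concurrency(vcpus: float, cpu_seconds_per_request: float,
                    target_latency_s: float) -> int:
    """How many in-flight requests can share the CPU before a request's
    wall time (queueing plus compute) blows past the latency target."""
    # With c concurrent requests on v vCPUs, each request effectively gets
    # v / c of a core, so its wall time is roughly cpu_seconds * c / v.
    # Solve cpu_seconds * c / v <= target for c.
    c = int(target_latency_s * vcpus / cpu_seconds_per_request)
    return max(1, c)

# A 2-vCPU instance spending 40ms of CPU per request, with a 100ms budget for
# this hop, supports about 5 concurrent requests, nowhere near the default of 80.
print(max_concurrency(vcpus=2, cpu_seconds_per_request=0.040, target_latency_s=0.100))
```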
Embeddings for cheap retrieval
Most of our latency wins came from leaning on embeddings instead of the LLM where possible. Sanctions matching, customer-cluster lookup, and policy clause retrieval all run as Matching Engine queries with sub-50ms p95. We then hand the matched candidates to a smaller model for reasoning. The cost is one extra hop; the saving is hundreds of milliseconds and a much steadier latency distribution.
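A minimal sketch of that retrieval hop, assuming the google-cloud-aiplatform SDK; the embedding model, index endpoint resource name, and deployed index ID are placeholders, not our real resources.

```python
# Embed the query, then do one Matching Engine (Vector Search) lookup.
# Resource names and the embedding model are placeholders.
from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel

aiplatform.init(project="my-project", location="northamerica-northeast2")

embedder = TextEmbeddingModel.from_pretrained("text-embedding-004")
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    "projects/my-project/locations/northamerica-northeast2/indexEndpoints/1234567890"
)

def nearest_candidates(query: str, k: int = 10):
    # One embedding call, then one ANN lookup; the downstream model only ever
    # sees these k candidates, never the full corpus.
    vector = embedder.get_embeddings([query])[0].values
    return endpoint.find_neighbors(
        deployed_index_id="sanctions_names",
        queries=[vector],
        num_neighbors=k,
    )
```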
Streaming where the user sees it
Where the LLM’s output is shown to a human (case files, narratives, treasury Q&A) we stream tokens. The perceived latency is the time-to-first-token, which Gemini Flash can hit in under 200ms. The total time may still be a second or two, but the user is reading by then.
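The pattern is mundane: pass stream=True and stamp the first chunk. Something like the following, with the model version and logging left as placeholders.

```python
# Stream a human-facing narrative and record time-to-first-token.
# Model version and prompt handling are illustrative.
import time
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-1.5-flash")

def stream_narrative(prompt: str):
    start = time.monotonic()
    saw_first_token = False
    for chunk in model.generate_content(prompt, stream=True):
        if not saw_first_token:
            saw_first_token = True
            print(f"ttft_ms={(time.monotonic() - start) * 1000:.0f}")
        # Guard against chunks with no text parts before forwarding to the UI.
        if chunk.candidates and chunk.candidates[0].content.parts:
            yield chunk.text
```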
We don’t stream where the output is consumed by another system. Streaming JSON is a footgun. For machine-to-machine responses we wait for the full structured object and validate it against a schema before passing it on.
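One way that gate can look, with Pydantic standing in as the validator and a made-up schema; the point is only that nothing downstream ever sees a half-parsed object.

```python
# Parse-and-validate gate for machine-to-machine responses. Pydantic and the
# RiskAssessment fields are illustrative; any schema validator works.
from pydantic import BaseModel, ValidationError

class RiskAssessment(BaseModel):
    decision: str          # e.g. "approve" | "review" | "decline"
    score: float
    reasons: list[str]

def parse_model_output(raw_text: str) -> RiskAssessment:
    try:
        return RiskAssessment.model_validate_json(raw_text)
    except ValidationError as exc:
        # Fail closed: a malformed or truncated object never reaches the
        # next system, which is exactly why we don't stream here.
        raise ValueError(f"model returned an invalid structured response: {exc}") from exc
```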
Caching the prompt prefix
Gemini’s context caching cuts the per-request cost of long, stable prompt prefixes (think: policy manual, taxonomy, schema). We cache the system prompt and shared examples, then send only the per-request delta. On long prompts this saves 30–50% of the wall time and a similar amount of cost.
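The mechanics, roughly, with the vertexai preview SDK. The model version, TTL, and prefix contents below are placeholders, and context caching has a minimum prefix size, so it only pays off on genuinely long, stable prompts.

```python
# Cache the stable prefix once, then reference it per request.
# Assumes the vertexai preview caching API; all contents are placeholders.
import datetime
import vertexai
from vertexai.preview import caching
from vertexai.preview.generative_models import Content, GenerativeModel, Part

vertexai.init(project="my-project", location="us-central1")

SYSTEM_PROMPT = "You are a payments risk analyst. Follow the policy manual exactly."
POLICY_MANUAL = "...long, stable policy text, taxonomy, and schema..."  # must exceed the caching minimum

cached = caching.CachedContent.create(
    model_name="gemini-1.5-flash-001",
    system_instruction=SYSTEM_PROMPT,
    contents=[Content(role="user", parts=[Part.from_text(POLICY_MANUAL)])],
    ttl=datetime.timedelta(hours=6),
)

# Per-request calls reuse the cached prefix and send only the delta.
model = GenerativeModel.from_cached_content(cached_content=cached)
response = model.generate_content("Classify this transaction: ...")
```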
Tracing
OpenTelemetry traces from Cloud Run flow into Cloud Logging and then into BigQuery for retention. Every request carries a trace ID that follows it through retrieval, reasoning, and any human handoff. When a customer asks “why was this declined?” the trace tells the story.
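A stripped-down sketch of how one request carries a single trace through its stages; the span names, attributes, and stubbed stage functions are illustrative, and the exporter that ships spans onward is assumed to be configured elsewhere.

```python
# One request, one trace, one span per stage. Span names and the stubbed
# stages are illustrative; exporter setup is assumed to live elsewhere.
from opentelemetry import trace

tracer = trace.get_tracer("payments.ai")

def retrieve_candidates(query: str) -> list[str]:
    ...  # e.g. the Matching Engine lookup sketched earlier

def score_candidates(candidates: list[str]) -> str:
    ...  # e.g. the tabular model or capped Flash call

def handle_request(payment_id: str, query: str) -> str:
    with tracer.start_as_current_span("risk_request") as root:
        root.set_attribute("payment.id", payment_id)

        with tracer.start_as_current_span("retrieval"):
            candidates = retrieve_candidates(query)

        with tracer.start_as_current_span("reasoning"):
            decision = score_candidates(candidates)

        return decision
```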
We surface a per-route latency histogram in the on-call dashboard. The eyeball that sees a p95 creep up over a week is more valuable than any alert; alerts catch outages, dashboards catch regressions.
What we’re trying next
Two experiments on the bench: Gemini Flash 2 for hot-path narrative drafting (where Pro’s extra quality doesn’t show up in evals), and on-device inference via TFLite for the simplest risk models so we can shave the network hop entirely. We’ll write about whichever one ends up in production.