You Can't Fix What You Can't See
Most early-stage AI SaaS products are flying blind in production: no idea which users are getting errors, no visibility into LLM costs per customer, no alerts when response times degrade. Observability is not optional — it's how you maintain product quality as you scale.
The Observability Stack for AI SaaS
- Error tracking: Sentry (catches exceptions with stack traces and user context; see the sketch after this list)
- LLM observability: LangSmith or Helicone (traces every LLM call)
- Application performance: Datadog or New Relic (latency, throughput, error rates)
- Cost monitoring: OpenAI usage dashboard + custom cost tracking
- Uptime: Better Uptime or Checkly (synthetic monitoring)
- Logs: Logtail or Datadog Logs (structured, searchable)
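As a concrete starting point for the first item, here is a minimal sketch of wiring Sentry into a Node backend. The DSN comes from your Sentry project settings, and tagRequestUser is a hypothetical helper showing where user context gets attached:

import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,  // project DSN from the Sentry dashboard
  tracesSampleRate: 0.1,        // sample 10% of transactions for performance data
});

// Hypothetical helper: call once a request is authenticated so every
// exception is tied to the customer who hit it
function tagRequestUser(userId: string, email: string) {
  Sentry.setUser({ id: userId, email });
}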
LLM Observability: What to Track
Every LLM call should log: model used, input tokens, output tokens, latency (time to first token, total time), cost, user ID, session ID, prompt template version, and whether the response was rated positively by the user.
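A minimal sketch of that record as a TypeScript type, emitted as one JSON line per call; the field names are illustrative, so map them onto whatever logger you already use:

// One structured record per LLM call -- field names are illustrative
interface LlmCallLog {
  model: string;                 // e.g. 'gpt-4o'
  inputTokens: number;
  outputTokens: number;
  timeToFirstTokenMs: number;
  totalLatencyMs: number;
  costUsd: number;               // derived from token counts and model pricing
  userId: string;
  sessionId: string;
  promptTemplateVersion: string; // lets you compare prompt revisions
  userRating: 'positive' | 'negative' | null; // thumbs up/down, if collected
}

function logLlmCall(entry: LlmCallLog) {
  // One JSON line per call so your log pipeline can index every field
  console.log(JSON.stringify({ event: 'llm_call', ...entry }));
}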
Use Helicone as a proxy; it captures all of this automatically once you point the OpenAI SDK at its endpoint and pass your Helicone key:
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY, // your OpenAI key still goes here
  baseURL: 'https://oai.helicone.ai/v1',
  defaultHeaders: { 'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}` },
});
Alerting Strategy
Alert on:
- Error rate > 1% (P1)
- Response time > 10s (P2)
- LLM cost per hour > $X (P2)
- Database connection pool exhausted (P1)
- Any 5XX spike (P1)
Route P1 alerts to PagerDuty (on-call rotation) and P2 alerts to Slack (team channel).
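One way to keep these thresholds in code rather than scattered across dashboards is a small rule table that a monitoring job evaluates. A sketch, where the budget standing in for $X, the 5XX spike threshold, and the pageOnCall/postToSlack wrappers are all assumptions:

type Severity = 'P1' | 'P2';

interface Metrics {
  errorRate: number;         // fraction of requests failing, e.g. 0.012
  responseTimeMs: number;
  llmCostPerHourUsd: number;
  dbPoolExhausted: boolean;
  http5xxPerMinute: number;
}

interface AlertRule {
  name: string;
  severity: Severity;
  breached: (m: Metrics) => boolean;
}

const MAX_LLM_COST_PER_HOUR = 50; // the "$X" above -- set per your budget

const rules: AlertRule[] = [
  { name: 'error-rate',     severity: 'P1', breached: m => m.errorRate > 0.01 },
  { name: 'slow-responses', severity: 'P2', breached: m => m.responseTimeMs > 10_000 },
  { name: 'llm-cost-spike', severity: 'P2', breached: m => m.llmCostPerHourUsd > MAX_LLM_COST_PER_HOUR },
  { name: 'db-pool',        severity: 'P1', breached: m => m.dbPoolExhausted },
  { name: '5xx-spike',      severity: 'P1', breached: m => m.http5xxPerMinute > 10 }, // tune to your baseline traffic
];

// pageOnCall and postToSlack are hypothetical wrappers around the
// PagerDuty Events API and a Slack incoming webhook
function evaluate(
  m: Metrics,
  pageOnCall: (msg: string) => void,
  postToSlack: (msg: string) => void,
) {
  for (const rule of rules) {
    if (!rule.breached(m)) continue;
    const msg = `ALERT [${rule.severity}] ${rule.name}`;
    if (rule.severity === 'P1') {
      pageOnCall(msg);   // wakes up the on-call engineer
    } else {
      postToSlack(msg);  // visible to the team, no page
    }
  }
}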
Cost Attribution Per Customer
Track AI costs per user/organization and expose this in your admin dashboard. This lets you identify customers whose usage patterns are unprofitable and adjust pricing accordingly.
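A minimal sketch of that tracking: compute per-request cost from token counts and roll it up by organization. The pricing table is illustrative (check your provider's current price list), and in production the rollup belongs in a database, not memory:

// USD per 1M tokens -- illustrative numbers, not a live price list
const PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o':      { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
};

function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICING[model];
  if (!p) throw new Error(`no pricing for model ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// In-memory rollup for illustration; persist this per billing period
const costByOrg = new Map<string, number>();

function recordUsage(orgId: string, model: string, inputTokens: number, outputTokens: number) {
  const cost = requestCostUsd(model, inputTokens, outputTokens);
  costByOrg.set(orgId, (costByOrg.get(orgId) ?? 0) + cost);
}

If you route requests through the Helicone proxy above, tagging each call with the customer's ID gets you a similar per-customer breakdown in its dashboard without maintaining this rollup yourself.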