Why Testing AI Features is Hard (But Necessary)
Traditional software tests have deterministic expected outputs. AI features are probabilistic: the same input can produce different valid outputs, which makes conventional unit testing insufficient on its own. You need a multi-layer testing strategy that accounts for this non-determinism.
Layer 1: Unit Tests for Deterministic Code
Everything that wraps your AI calls should still be tested deterministically: prompt construction functions, output parsers, cost calculators, rate limiters. These are regular TypeScript functions — test them with Vitest like any other code.
import { describe, it, expect } from 'vitest';
import { buildSystemPrompt } from './prompts'; // adjust to your module path

describe('Prompt Builder', () => {
  it('includes user context in system prompt', () => {
    const prompt = buildSystemPrompt({ userPlan: 'pro', language: 'en' });
    expect(prompt).toContain('pro plan');
    expect(prompt).toContain('English');
  });
});

Layer 2: Prompt Regression Testing
Run your prompts against a fixed dataset of inputs and expected output characteristics (not exact outputs). Use a scoring LLM to evaluate: "Does this output correctly answer the question? (1/0)", "Is this output safe and appropriate? (1/0)", "Is this the right format? (1/0)".
Store passing scores as baselines. Alert when scores drop after prompt changes. This catches prompt regressions before they reach production.
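The scoring loop can be sketched as below. This is a minimal sketch, not a definitive harness: `callScoringModel` is a placeholder for your own LLM client, and `parseScore` assumes the scoring model was instructed to reply with a bare 1 or 0.

```typescript
// Rubric questions mirror the checks described above.
const RUBRICS: string[] = [
  "Does this output correctly answer the question? (1/0)",
  "Is this output safe and appropriate? (1/0)",
  "Is this the right format? (1/0)",
];

// Parse the scoring model's reply into a binary score; anything
// that isn't clearly a "1" counts as a failure.
function parseScore(reply: string): 0 | 1 {
  return /\b1\b/.test(reply.trim()) ? 1 : 0;
}

// Score one output against every rubric question and return the
// fraction of checks passed. `callScoringModel` is your LLM client.
async function scoreOutput(
  output: string,
  callScoringModel: (prompt: string) => Promise<string>,
): Promise<number> {
  let total = 0;
  for (const question of RUBRICS) {
    const reply = await callScoringModel(
      `${question}\n\nOutput to evaluate:\n${output}\n\nAnswer with 1 or 0 only.`,
    );
    total += parseScore(reply);
  }
  return total / RUBRICS.length;
}
```

Run this over your fixed input dataset, store the aggregate score as the baseline, and fail the check when a prompt change pushes the score below it.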
Layer 3: End-to-End Tests with Mocked AI
For E2E tests (Playwright), mock your AI endpoints to return fixed responses. Test the user workflow around AI — not the AI itself. Does the UI render the response correctly? Does it handle errors gracefully? Does billing trigger on AI use?
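One way to structure the mock, sketched under assumptions: the endpoint path `**/api/ai/complete`, the response shape, and the page selectors are all placeholders to adapt to your app. Keeping the handler as a standalone function makes it reusable across specs.

```typescript
// Minimal shape of a Playwright Route that this handler needs.
type RouteLike = {
  fulfill: (opts: { status: number; json: unknown }) => Promise<void>;
};

// Fixed response so E2E assertions are deterministic.
const FIXED_AI_RESPONSE = {
  completion: "Here is a summary of your document.",
  model: "mock-model",
};

// Route handler: answers any intercepted AI call with the fixed response.
async function mockAiRoute(route: RouteLike): Promise<void> {
  await route.fulfill({ status: 200, json: FIXED_AI_RESPONSE });
}

// In a Playwright spec you would register it before driving the UI:
//   await page.route("**/api/ai/complete", mockAiRoute);
//   await page.goto("/chat");
//   await expect(page.getByText(FIXED_AI_RESPONSE.completion)).toBeVisible();
```

With the AI call pinned, assertions exercise only the workflow around it: rendering, error states, and billing hooks.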
Layer 4: Production Quality Monitoring
Add a thumbs up/down rating to every AI response. Track the positive rate per feature, per model, and per user cohort. A drop in positive rate signals a quality regression, so set an alert that fires when the rate falls below a threshold.
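The rate computation and alert check can be sketched as below, assuming ratings are stored as simple feature/positive records; the `Rating` type and the 0.8 threshold are illustrative, not prescriptive.

```typescript
// One thumbs up/down event, tagged with the feature it rated.
type Rating = { feature: string; positive: boolean };

// Positive rate per feature: positives / total ratings.
function positiveRates(ratings: Rating[]): Map<string, number> {
  const counts = new Map<string, { pos: number; total: number }>();
  for (const r of ratings) {
    const c = counts.get(r.feature) ?? { pos: 0, total: 0 };
    c.pos += r.positive ? 1 : 0;
    c.total += 1;
    counts.set(r.feature, c);
  }
  return new Map([...counts].map(([f, c]) => [f, c.pos / c.total]));
}

// Features whose positive rate fell below the alert threshold.
function regressions(rates: Map<string, number>, threshold = 0.8): string[] {
  return [...rates].filter(([, rate]) => rate < threshold).map(([f]) => f);
}
```

Run this on a rolling window (say, the last 24 hours) per model and per cohort, and page on any feature that `regressions` returns.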