Most “we shipped an LLM feature” posts are about a chat box. This one isn’t. The three Claude features I’m going to walk through are quiet — they sit inside a product where the LLM isn’t the headline. Nobody opens the app to chat with the AI. They open it to launch an A/B test, see whether their checkout variant is winning, and pick a winner. Claude shows up at exactly three moments in that flow:
- Pre-launch: “Get an AI review of this experiment before I start it.”
- Mid-experiment: “Did anything weird happen in the data last night?” (proactive, runs without the user clicking anything)
- Post-experiment: “Tell me what these results mean and what I should do.”
Each one has a different shape — different response format, different latency budget, different failure mode — and each one taught me a pattern I’d reuse. The piece I’d most like a future-me to remember: structured outputs and graceful degradation are 80% of shipping LLM features into a SaaS product. The model is the easy part.
The ~120-line client that backs all three
Before the features: there is no SDK in this stack. The whole Anthropic client is one Go file, ~120 lines, raw net/http:
const (
	anthropicMessagesURL = "https://api.anthropic.com/v1/messages"
	anthropicVersion     = "2023-06-01"
	anthropicModel       = "claude-sonnet-4-6"
	maxTokens            = 1024
	httpTimeout          = 30 * time.Second
)

type Client struct {
	apiKey     string
	httpClient *http.Client
}

// NewClient returns nil if apiKey is empty (feature disabled).
func NewClient(apiKey string) *Client {
	if apiKey == "" {
		return nil
	}
	return &Client{apiKey: apiKey, httpClient: &http.Client{Timeout: httpTimeout}}
}

func (c *Client) Analyze(prompt string) (string, error) { /* ...POST, parse, return text... */ }
This is deliberate. The SDK would be fine, but I want exactly one place in the binary that knows what the Anthropic API looks like, and I want to read that file once and be done. A single-method client (Analyze(prompt) (string, error)) also forces every feature to do its own structuring on top — which turns out to be the right shape.
Two design decisions in this snippet are worth flagging:
1. The nil-client convention. If ANTHROPIC_API_KEY isn’t set, NewClient returns nil. Every feature that uses Claude checks if c.aiClient == nil at the top. That’s how the product runs in environments where the key isn’t configured (local dev, customer-side self-hosts) without each feature having to know about a feature flag. The features just degrade: the endpoint behind the “Get AI Review” button returns 503, the anomaly job skips its Claude pass, and the results analyzer returns a polite “temporarily unavailable”. The rest of the product keeps working.
2. No retries, no prompt caching, no tool use. The three features are working in production without any of that. I’ll come back at the end with what I’d add and why I haven’t yet.
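For readers who want to see what sits behind the elided body of Analyze, here is a minimal sketch of a raw net/http call to the Messages API. It is not the product's actual file: the request and response structs, the extractText helper, and the error wording are all assumptions, and only the fields this sketch needs are modelled.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

const (
	anthropicMessagesURL = "https://api.anthropic.com/v1/messages"
	anthropicVersion     = "2023-06-01"
	anthropicModel       = "claude-sonnet-4-6"
	maxTokens            = 1024
)

type Client struct {
	apiKey     string
	httpClient *http.Client
}

// The structs below are assumptions that mirror only the fields used here.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type messagesRequest struct {
	Model     string    `json:"model"`
	MaxTokens int       `json:"max_tokens"`
	Messages  []message `json:"messages"`
}

type messagesResponse struct {
	Content []struct {
		Type string `json:"type"`
		Text string `json:"text"`
	} `json:"content"`
}

// extractText returns the first text block of a Messages API response body.
func extractText(body []byte) (string, error) {
	var resp messagesResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode response: %w", err)
	}
	for _, block := range resp.Content {
		if block.Type == "text" {
			return block.Text, nil
		}
	}
	return "", errors.New("response contained no text block")
}

// Analyze sends one user message and returns the model's text reply.
func (c *Client) Analyze(prompt string) (string, error) {
	payload, err := json.Marshal(messagesRequest{
		Model:     anthropicModel,
		MaxTokens: maxTokens,
		Messages:  []message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return "", fmt.Errorf("encode request: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, anthropicMessagesURL, bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("x-api-key", c.apiKey)
	req.Header.Set("anthropic-version", anthropicVersion)
	req.Header.Set("content-type", "application/json")

	resp, err := c.httpClient.Do(req)
	if err != nil {
		return "", fmt.Errorf("call anthropic: %w", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("read response: %w", err)
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("anthropic returned %d: %s", resp.StatusCode, body)
	}
	return extractText(body)
}
```

Keeping the response parsing in a separate pure function makes it testable without a network call, which fits the one-file, one-method shape described above.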
Feature 1: Pre-launch QA review (plain text, on-demand)
When a merchant is about to start an experiment, there’s a launch checklist dialog with a button: “Get AI Review.” Clicking it sends a short prompt to Claude containing the experiment name, target URL, estimated daily sessions, and the baseline conversion rate, and asks four questions:
- Is this a reasonable test to run?
- How long will it take to reach significance?
- Any obvious issues?
- What MDE (minimum detectable effect) is realistic?
The response is plain text. The merchant reads it. That’s the whole feature.
The reason this works as a plain-text response, with no structured parsing: nothing downstream depends on the model’s output. It’s pure advice, displayed in a panel. There’s no parser to break. The product is fine if the model rambles, hedges, or uses bullet points one day and prose the next. The latency budget is generous — the user clicked a button and expects to wait. The cost is whatever it costs once.
This is the easiest shape of LLM feature to ship, and the most underrated. If your team is debating whether to “do AI in the product,” start here. One button, one prompt, one text panel, no downstream parsing. The bar to ship is a week, not a quarter.
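To make the "one button, one prompt, one text panel" shape concrete, here is a rough sketch of the review path, including the nil-client check. Everything named here is hypothetical: ReviewInput, Server, buildReviewPrompt, the prompt wording, and the 502 on upstream failure are assumptions, and Client is reduced to a stand-in with a pluggable send function so the example runs without network access.

```go
package main

import (
	"fmt"
	"net/http"
)

// Client stands in for the real client. The send field is an assumption
// that replaces the HTTP call so this sketch runs offline.
type Client struct {
	send func(prompt string) (string, error)
}

func (c *Client) Analyze(prompt string) (string, error) { return c.send(prompt) }

// ReviewInput carries the four values the review prompt needs.
// The struct and its field names are hypothetical.
type ReviewInput struct {
	Name          string
	TargetURL     string
	DailySessions int
	BaselineRate  float64 // fraction, e.g. 0.032 for 3.2%
}

// Server holds a concrete *Client on purpose: a nil *Client stored in an
// interface value would not compare equal to nil.
type Server struct {
	aiClient *Client
}

func buildReviewPrompt(in ReviewInput) string {
	return fmt.Sprintf(
		"Review this A/B test before launch.\n"+
			"Experiment: %s\nTarget URL: %s\n"+
			"Estimated daily sessions: %d\nBaseline conversion rate: %.2f%%\n\n"+
			"1. Is this a reasonable test to run?\n"+
			"2. How long will it take to reach significance?\n"+
			"3. Any obvious issues?\n"+
			"4. What MDE (minimum detectable effect) is realistic?\n",
		in.Name, in.TargetURL, in.DailySessions, in.BaselineRate*100)
}

// review returns an HTTP status and the plain-text body for the panel.
func (s *Server) review(in ReviewInput) (int, string) {
	if s.aiClient == nil {
		return http.StatusServiceUnavailable, "AI review is not configured"
	}
	text, err := s.aiClient.Analyze(buildReviewPrompt(in))
	if err != nil {
		return http.StatusBadGateway, "AI review failed, please try again"
	}
	return http.StatusOK, text // no parsing: the advice is displayed as-is
}
```

An HTTP handler would only need to decode the request into ReviewInput and write the returned status and text.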
Feature 2: Anomaly detection (structured JSON, async, rate-limited)
This is the one that taught me the most. Every time an active experiment’s results are fetched, a background job runs (at most once per 24h per experiment) and sends Claude the last 21 days of daily conversion data, asking it to flag four specific patterns:
Check for:
1. NOVELTY EFFECT: Did the variant show a large initial lift that has been steadily declining?
2. TREND DIVERGENCE: Are the variants converging or diverging?
3. DAY-OF-WEEK PATTERN: Does the variant perform differently on weekends vs weekdays?
4. SUDDEN SHIFT: Did either variant's conversion rate change dramatically on a specific date?
Respond in JSON only (no markdown fences):
{
  "anomalies_found": true/false,
  "alerts": [
    {
      "type": "novelty_effect|trend_divergence|day_pattern|sudden_shift",
      "severity": "info|warning|critical",
      "message": "Human-readable description",
      "recommendation": "What the merchant should do"
    }
  ]
}
If no anomalies, return {"anomalies_found": false, "alerts": []}.
Do NOT flag minor daily variation or weekday/weekend differences under 10%.
Three things in that prompt are doing more work than they look like.
The closed enum of types and severities
The prompt enumerates exactly four anomaly types and three severities. The Go parser then enforces that enumeration with a whitelist:
validTypes := map[string]bool{
	"novelty_effect": true, "trend_divergence": true,
	"day_pattern": true, "sudden_shift": true,
}
validSeverities := map[string]bool{"info": true, "warning": true, "critical": true}

for _, a := range resp.Alerts {
	if !validTypes[a.Type] || !validSeverities[a.Severity] {
		continue // silently drop unknown types
	}
	alerts = append(alerts, AnomalyAlert{...})
}
If Claude invents "holiday_effect" or "medium", the alert is dropped, not stored, not displayed. The UI only has icons and copy for the four known types; an unknown type would render as a broken alert. The whitelist is the contract.
This is the pattern I’d most like to teach a junior engineer: treat the model as an unreliable upstream service, even when it’s getting it right 99% of the time. Validate on the way in. The cost is twenty lines of Go; the alternative is a UI bug that shows up once a month and that you can’t reproduce.
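Put together, the whitelist sits inside a small parse function. The sketch below is one way to write it, assuming struct fields that simply mirror the JSON schema in the prompt; the names anomalyResponse and parseAnomalies are illustrative and the real AnomalyAlert type may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AnomalyAlert mirrors one entry of the "alerts" array in the prompt schema.
type AnomalyAlert struct {
	Type           string `json:"type"`
	Severity       string `json:"severity"`
	Message        string `json:"message"`
	Recommendation string `json:"recommendation"`
}

type anomalyResponse struct {
	AnomaliesFound bool           `json:"anomalies_found"`
	Alerts         []AnomalyAlert `json:"alerts"`
}

var validTypes = map[string]bool{
	"novelty_effect": true, "trend_divergence": true,
	"day_pattern": true, "sudden_shift": true,
}

var validSeverities = map[string]bool{"info": true, "warning": true, "critical": true}

// parseAnomalies decodes the model output and keeps only alerts whose type
// and severity are both on the whitelist.
func parseAnomalies(raw string) ([]AnomalyAlert, error) {
	var resp anomalyResponse
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		return nil, fmt.Errorf("anomaly response is not valid JSON: %w", err)
	}
	alerts := make([]AnomalyAlert, 0, len(resp.Alerts))
	for _, a := range resp.Alerts {
		if !validTypes[a.Type] || !validSeverities[a.Severity] {
			continue // unknown value: the UI has nothing to render it with
		}
		alerts = append(alerts, a)
	}
	return alerts, nil
}
```

Note that the function trusts the alerts array rather than the anomalies_found flag, so a response that sets the flag but lists only unknown types yields an empty slice.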
The “do NOT flag minor variation” line
The prompt has one explicit anti-pattern instruction at the end: “Do NOT flag minor daily variation or weekday/weekend differences under 10%.” Without it, Claude flags everything. The first version of this feature drove me crazy because it would faithfully report “Variant B converted at 3.21% on Tuesday and 3.18% on Wednesday — slight day-of-week pattern detected.” Technically true. Operationally useless.
The threshold (10%) lives in the prompt rather than in code, deliberately. It’s a softer signal than a hard if in the parser: Claude can still flag a 9% weekend dip if it’s accompanied by something else suspicious, and that’s the right behavior. Move the threshold into code and you lose that judgment.
Markdown fences anyway
The prompt says “Respond in JSON only (no markdown fences).” Claude sometimes adds them anyway. So the parser strips them before unmarshalling:
raw = strings.TrimSpace(raw)
if strings.HasPrefix(raw, "```") {
	lines := strings.Split(raw, "\n")
	if len(lines) > 2 {
		raw = strings.Join(lines[1:len(lines)-1], "\n")
	}
}
This is the cheapest possible defense, and it works. If you’re parsing structured output from any LLM, just write the fence-stripper. It’s six lines. Don’t fight it in the prompt.
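One edge case worth knowing about: the version above removes the last line whenever there are more than two lines, so a response that opens a fence but never closes it loses its final line of JSON. A slightly more defensive variant, offered as a sketch rather than as the product's code, only removes a closing fence when one is actually there. The function name stripFences and the fence constant are mine.

```go
package main

import "strings"

// fence is three backticks, written with escapes to keep this listing clean.
const fence = "\x60\x60\x60"

// stripFences removes a leading markdown fence (with an optional language
// tag) and a trailing fence if present, leaving unfenced input untouched
// apart from surrounding whitespace.
func stripFences(raw string) string {
	raw = strings.TrimSpace(raw)
	if !strings.HasPrefix(raw, fence) {
		return raw
	}
	raw = strings.TrimPrefix(raw, fence)
	// Drop the rest of the opening fence line, e.g. a "json" language tag.
	if i := strings.Index(raw, "\n"); i >= 0 {
		raw = raw[i+1:]
	}
	raw = strings.TrimSpace(raw)
	raw = strings.TrimSuffix(raw, fence)
	return strings.TrimSpace(raw)
}
```

It is a few lines longer than the original, so whether the extra case is worth it depends on how often the closing fence goes missing in practice.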
Async and rate-limited
The anomaly job runs at most once per experiment per 24 hours. The user never waits on it. Results are persisted to a Postgres table and displayed in an alerts panel on next page load. If Claude is slow or down, nothing in the synchronous request path is affected. If the merchant peeks at their dashboard 18 times in a day, Claude is called once.
That last bit — rate-limiting upstream of the call, by checking the DB for the most recent run — is the easy version of cost control. No queues, no exponential backoff, no retry semantics. One SELECT MAX(created_at) per request. It costs nothing and it works.
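The check itself can be sketched in a few lines of database/sql. The table name anomaly_runs and the experiment_id column are assumptions; only created_at and the 24-hour window come from the description above.

```go
package main

import (
	"database/sql"
	"time"
)

// The table name anomaly_runs and the experiment_id column are hypothetical.
const lastRunQuery = `SELECT MAX(created_at) FROM anomaly_runs WHERE experiment_id = $1`

const minInterval = 24 * time.Hour

// dueForRun reports whether the job may call Claude again.
func dueForRun(hasRun bool, sinceLastRun time.Duration) bool {
	return !hasRun || sinceLastRun >= minInterval
}

// shouldAnalyze looks up the most recent run and applies the 24h rule.
// MAX over zero rows yields NULL, which sql.NullTime reports as !Valid.
func shouldAnalyze(db *sql.DB, experimentID int64, now time.Time) (bool, error) {
	var last sql.NullTime
	if err := db.QueryRow(lastRunQuery, experimentID).Scan(&last); err != nil {
		return false, err
	}
	return dueForRun(last.Valid, now.Sub(last.Time)), nil
}
```

Two fetches arriving at the same moment could both pass this check. If that ever matters, a uniqueness constraint or a lock around the insert would close the gap, at the price of a little more machinery.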
Feature 3: Post-experiment results analysis (markdown, cached)
When the merchant clicks “Get AI Analysis” on a finished experiment, the backend builds a much fatter prompt — full funnel data, revenue per session, AOV, the last 7 days of daily trends per variant — and asks Claude for a structured markdown response: Summary, Key Findings, Recommendation, Estimated Impact.
Two patterns from this feature.
The platform-aware prompt builder
The app runs on Shopify and on generic HTML sites. A Shopify experiment has a built-in funnel (cart → checkout → payment → completion); an HTML experiment has whatever custom goals the merchant defined. Same Claude prompt, but the data section is built conditionally:
// simplified from the real builder
if experiment.Platform == "shopify" {
	appendFunnelMetrics(b, results)  // checkout_completed, add_to_cart, ...
	appendRevenueMetrics(b, results) // AOV, revenue per session
} else {
	appendCustomGoals(b, results) // whatever the merchant tagged
}
appendDailyTrend(b, results, 7) // last 7 days
The prompt template is one file (ai_prompt.go); the platform-specific sections are conditionals inside it. This works because Claude doesn’t care whether the body is structured around Shopify primitives or merchant-defined ones — it reads numbers, gives a recommendation. The structure is for the human reader of the eventual markdown, not the model.
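A compilable version of the same idea, using strings.Builder, might look like the following. The Results, Goal, and DailyPoint types, their fields, and the exact line formats are invented for illustration; only the shopify/else branching, the 7-day window, and the four requested markdown sections come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// All type and field names here are illustrative, not the product's schema.
type DailyPoint struct {
	Date  string
	Rates []float64 // conversion rate per variant, as fractions
}

type Goal struct {
	Name        string
	Conversions int
}

type Results struct {
	AddToCart, CheckoutCompleted int
	AOV, RevenuePerSession       float64
	Goals                        []Goal
	Daily                        []DailyPoint // one entry per day, oldest first
}

// lastN returns at most the final n points of the series.
func lastN(points []DailyPoint, n int) []DailyPoint {
	if len(points) <= n {
		return points
	}
	return points[len(points)-n:]
}

// buildAnalysisPrompt writes a platform-specific data section followed by
// the shared trend section and the output instructions.
func buildAnalysisPrompt(platform string, r Results) string {
	var b strings.Builder
	if platform == "shopify" {
		fmt.Fprintf(&b, "Funnel: add_to_cart=%d checkout_completed=%d\n", r.AddToCart, r.CheckoutCompleted)
		fmt.Fprintf(&b, "Revenue: AOV=%.2f revenue_per_session=%.2f\n", r.AOV, r.RevenuePerSession)
	} else {
		for _, g := range r.Goals {
			fmt.Fprintf(&b, "Goal %s: %d conversions\n", g.Name, g.Conversions)
		}
	}
	b.WriteString("Daily trend (last 7 days):\n")
	for _, p := range lastN(r.Daily, 7) {
		fmt.Fprintf(&b, "%s %v\n", p.Date, p.Rates)
	}
	b.WriteString("Respond in markdown with sections: Summary, Key Findings, Recommendation, Estimated Impact.\n")
	return b.String()
}
```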
DB-level caching keyed by experiment
Calling Claude on a finished experiment with the same data should always produce roughly the same analysis. So the answer gets cached in an ai_analysis_cache table, keyed by experiment ID. The frontend tries cache first; the backend writes through on a successful response; merchants get a “regenerate” button if they want a fresh take.
This is mundane caching — nothing fancier than an ETag would be. But it matters more than usual for LLM features because the call is both slow (1–5 seconds) and expensive (relatively, per inference) compared to anything else in the system. Caching at the DB level — rather than in-memory, behind the API gateway, or at the model provider — means the cache survives backend restarts, customer sessions, and tab refreshes. It’s the right place to put it.
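The read-through, write-through flow can be sketched as a single function. This folds the frontend's cache lookup and the backend's write into one place for brevity, which is a simplification of the split described above. AnalysisStore, memStore, and getAnalysis are hypothetical names, and the in-memory map only stands in for the ai_analysis_cache table so the example runs.

```go
package main

// AnalysisStore abstracts the ai_analysis_cache table (hypothetical interface).
type AnalysisStore interface {
	Get(experimentID int64) (analysis string, ok bool, err error)
	Put(experimentID int64, analysis string) error
}

// memStore is an in-memory stand-in used only to make the sketch runnable.
type memStore map[int64]string

func (m memStore) Get(id int64) (string, bool, error) {
	s, ok := m[id]
	return s, ok, nil
}

func (m memStore) Put(id int64, s string) error {
	m[id] = s
	return nil
}

// getAnalysis returns the cached analysis unless regenerate is set, and
// writes through only when generate succeeds.
func getAnalysis(store AnalysisStore, id int64, regenerate bool, generate func() (string, error)) (string, error) {
	if !regenerate {
		cached, ok, err := store.Get(id)
		if err != nil {
			return "", err
		}
		if ok {
			return cached, nil
		}
	}
	fresh, err := generate()
	if err != nil {
		return "", err // failed responses are never written to the cache
	}
	if err := store.Put(id, fresh); err != nil {
		return "", err
	}
	return fresh, nil
}
```

The regenerate flag maps directly onto the button: it skips the read and overwrites the stored row.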
The patterns, summarized
| Pattern | Where it shows up | Why it’s worth the lines of code |
|---|---|---|
| Nil-client convention | Every feature | Graceful degradation without per-feature feature flags. if c == nil { return 503 } beats a configuration system. |
| Closed-enum response validation | Anomaly detection | The model is an unreliable upstream. Whitelist the values the UI knows how to render; drop the rest silently. |
| Anti-pattern instructions in the prompt | Anomaly detection | Saying “don’t flag X” is cheaper than filtering X out in code, and preserves model judgment for edge cases. |
| Markdown fence stripping | Anomaly detection | Six lines of Go. Just write it. Don’t litigate with the model. |
| Async + DB-rate-limit | Anomaly detection | One SELECT MAX(created_at) beats a queueing system for “at most once per N hours.” |
| DB-level result caching | Results analysis | Survives restarts and sessions; obvious “regenerate” semantics. |
| Platform-conditional prompt body | Results analysis | One prompt template, conditional data sections. Beats two parallel templates. |
What’s missing — and why I haven’t added it yet
The honest list of things this client doesn’t do:
No prompt caching. The Anthropic API supports caching long stable prefixes (system prompts, schemas, big context blocks) so that repeat calls read them at a reduced rate instead of paying the full input price every time. None of the three features uses it. The anomaly prompt has a stable instruction block of about 400 tokens, and the results-analysis prompt has more like 600. One caveat: the API only caches prefixes above a minimum length (1,024 tokens or more, depending on the model), so blocks this short would not be cached on their own and would need to sit inside a longer stable prefix to qualify. Where the prefix is long enough, this is the single highest-ROI improvement I’d make, and it’s a half-day of work. I haven’t done it because the current volume is small enough that the saving would be a rounding error, and shipping the features mattered more than optimizing them. If you’re past the “does this work?” stage and your stable prefixes are long enough, you should turn on prompt caching now.
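For reference, opting in is mostly a change to the request body: the stable instructions move into a system block marked with cache_control, and the per-experiment data stays in the user message. The sketch below shows that shape under the assumption that the instructions and data can be split cleanly; the struct and function names are mine.

```go
package main

import "encoding/json"

type cacheControl struct {
	Type string `json:"type"`
}

// systemBlock is one entry of the "system" array in a Messages API request.
type systemBlock struct {
	Type         string        `json:"type"`
	Text         string        `json:"text"`
	CacheControl *cacheControl `json:"cache_control,omitempty"`
}

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type cachedRequest struct {
	Model     string        `json:"model"`
	MaxTokens int           `json:"max_tokens"`
	System    []systemBlock `json:"system"`
	Messages  []message     `json:"messages"`
}

// buildCachedRequest marks the stable instructions as cacheable and sends
// the changing data as the user message.
func buildCachedRequest(instructions, data string) ([]byte, error) {
	return json.Marshal(cachedRequest{
		Model:     "claude-sonnet-4-6",
		MaxTokens: 1024,
		System: []systemBlock{{
			Type:         "text",
			Text:         instructions,
			CacheControl: &cacheControl{Type: "ephemeral"},
		}},
		Messages: []message{{Role: "user", Content: data}},
	})
}
```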
No tool use. The anomaly detector reasons about data we hand it in a text block. A more honest version would give Claude a query_daily_data tool and let it pull what it needs. The current shape constrains the model to the 21 days × N variants we choose to send. The tool-use version would let it follow up with something like “let me look at the prior period for comparison.” This is also where I’d enable extended thinking: anomaly detection is exactly the kind of “look carefully, reason step-by-step” task that benefits from explicit reasoning tokens.
No retries. A 30-second HTTP timeout, one shot. On failure, the user sees an error and can click again. This is fine because nothing in the product is downstream-blocking on the call — but it would be slightly less rude to retry idempotent failures once with backoff.
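A single retry with backoff is small enough to sketch here. The errTransient sentinel is an assumption: the client would have to wrap timeouts, 429s, and 5xx responses in it for this to be useful, and which failures count as safe to retry is a judgment call I have not made in the real code.

```go
package main

import (
	"errors"
	"time"
)

// errTransient is a hypothetical sentinel the client would wrap around
// failures that are safe to retry (timeouts, 429s, 5xx responses).
var errTransient = errors.New("transient upstream failure")

func isTransient(err error) bool { return errors.Is(err, errTransient) }

// retryOnce calls fn and, if it fails with a transient error, waits for
// backoff and calls it exactly one more time.
func retryOnce(fn func() (string, error), backoff time.Duration) (string, error) {
	out, err := fn()
	if err == nil || !isTransient(err) {
		return out, err
	}
	time.Sleep(backoff)
	return fn()
}
```

A call site would wrap the existing method, for example retryOnce(func() (string, error) { return client.Analyze(prompt) }, 2*time.Second). With a 30-second timeout per attempt, the worst case roughly doubles, which matters more for the on-demand features than for the background job.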
No streaming. The results analysis call takes a few seconds and the UI shows a spinner. Streaming would let the markdown render token-by-token. The product feels fast enough without it because the spinner is well placed, but if I were starting today I’d stream.
Sonnet for everything. All three features use the same model. The anomaly detector (strict JSON, narrow task) could probably run on Haiku for a fraction of the cost. The results analyzer benefits from Sonnet’s stronger reasoning. Differentiating those is on the list.
The headline
None of these features is glamorous. There’s no chat. No agent. No “AI co-pilot.” Just a button that returns advice, a background job that flags weirdness, and a cached summary. That’s the shape of most LLM features that actually ship inside SaaS products — and the patterns that make them reliable are mostly about treating the model like any other upstream service, with structured contracts on the way out and graceful failure on the way in.
The fanciest part of this code is six lines that strip markdown fences before json.Unmarshal. The rest is plumbing. That’s the right ratio.