Every backend engineer who transitions into AI work hits the same wall around week three: the code works, the demo runs, but something feels fundamentally different about how you reason about the system. It’s not the tools. It’s the mental model.
After surveying about forty AI engineering job descriptions and talking to a dozen engineers who’d made the transition, six shifts kept coming up:
- Eval-first thinking: you design how you’ll measure correctness before you write the feature
- Token economics: latency and cost are first-class constraints, not afterthoughts
- Retrieval intuition: knowing when embedding similarity is enough and when it isn’t
- Latency tradeoffs: streaming, caching, model routing, and the cost of each
- Failure-mode reasoning: LLMs fail differently than deterministic code; you need a different debugging frame
- Observability: traces, evals, and feedback loops as core infrastructure
This dispatch unpacks each one with examples from production systems.
Why this matters
Evaluation thinking is one of the six core skills that separate an AI engineer from a traditional backend developer.
Most engineers stop at “it works.”
Production AI engineers learn to ask: “What breaks when this changes?”
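To make "What breaks when this changes?" concrete, here is a minimal sketch of a regression check for a retrieval change. Everything in it is hypothetical and exists only for illustration: the `EvalCase` type, the two toy retrievers, the document ids, and the choice of top-k hit rate as the metric. A real harness would call your actual retriever and use a much larger query set.

```python
from dataclasses import dataclass
from typing import Callable

Retriever = Callable[[str], list[str]]  # query -> ranked document ids


@dataclass(frozen=True)
class EvalCase:
    query: str
    expected_doc_id: str


def is_hit(retrieve: Retriever, case: EvalCase, k: int) -> bool:
    """True if the expected document appears in the top-k results."""
    return case.expected_doc_id in retrieve(case.query)[:k]


def hit_rate(retrieve: Retriever, cases: list[EvalCase], k: int = 3) -> float:
    """Fraction of cases whose expected document is in the top-k results."""
    if not cases:
        raise ValueError("need at least one eval case")
    return sum(is_hit(retrieve, c, k) for c in cases) / len(cases)


def regressions(
    baseline: Retriever, candidate: Retriever, cases: list[EvalCase], k: int = 3
) -> list[EvalCase]:
    """Cases the baseline got right that the candidate now gets wrong."""
    return [
        c for c in cases
        if is_hit(baseline, c, k) and not is_hit(candidate, c, k)
    ]


# Hypothetical eval set and toy retrievers standing in for real systems.
CASES = [
    EvalCase("reset password", "doc-auth"),
    EvalCase("refund policy", "doc-billing"),
    EvalCase("api rate limits", "doc-limits"),
    EvalCase("export data", "doc-export"),
]

BASELINE_INDEX = {
    "reset password": ["doc-auth", "doc-sso"],
    "refund policy": ["doc-billing"],
    "api rate limits": ["doc-pricing", "doc-faq"],
    "export data": ["doc-export"],
}

CANDIDATE_INDEX = {
    "reset password": ["doc-auth"],
    "refund policy": ["doc-pricing", "doc-faq"],
    "api rate limits": ["doc-limits"],
    "export data": ["doc-export"],
}


def baseline_retrieve(query: str) -> list[str]:
    return BASELINE_INDEX.get(query, [])


def candidate_retrieve(query: str) -> list[str]:
    return CANDIDATE_INDEX.get(query, [])


if __name__ == "__main__":
    print("baseline hit rate: ", hit_rate(baseline_retrieve, CASES))
    print("candidate hit rate:", hit_rate(candidate_retrieve, CASES))
    for case in regressions(baseline_retrieve, candidate_retrieve, CASES):
        print("regressed:", case.query)
```

In this toy data both retrievers score the same aggregate hit rate, yet the candidate silently loses the "refund policy" query. That is the reason the sketch reports per-case regressions as well as the headline number: an unchanged average can hide a change that broke something specific.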