Eval & observability harness — inferLearn Field Log

Progress

Integrated LangSmith tracing.

LangSmith Trace View

Attempted heuristic evaluator.

Failure: The false-positive rate was too high. Regex matching against free-form LLM output is too brittle.
Decision: Abandoned approach. Rolling back to standard exact-match where possible, LLM-as-a-judge everywhere else.
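A minimal sketch of that split, assuming each benchmark item either has a closed-form reference answer (exact-match) or has none (judge). Every name here (normalize, regex_eval, exact_match, build_judge_prompt, make_judge, evaluate, call_model) is hypothetical and not taken from the actual harness. call_model stands in for whatever client sends a prompt to the judge model and returns its text, and the PASS/FAIL grade format is an assumption.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

Example = Tuple[str, str, str]  # (question, answer, grade), hypothetical shape


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, trim, and drop trailing punctuation."""
    return re.sub(r"\s+", " ", text).strip().lower().rstrip(".!?")


def regex_eval(output: str, pattern: str) -> bool:
    """The abandoned heuristic: pass if the pattern appears anywhere."""
    return re.search(pattern, output, flags=re.IGNORECASE) is not None


def exact_match(output: str, reference: str) -> bool:
    """Pass only if the whole normalized output equals the reference."""
    return normalize(output) == normalize(reference)


def build_judge_prompt(
    rubric: str, examples: Sequence[Example], question: str, answer: str
) -> str:
    """Assemble rubric, graded examples, and the item to grade."""
    shots = "\n\n".join(
        f"Question: {q}\nAnswer: {a}\nGrade: {g}" for q, a, g in examples
    )
    return f"{rubric}\n\n{shots}\n\nQuestion: {question}\nAnswer: {answer}\nGrade:"


def make_judge(
    rubric: str, examples: Sequence[Example], call_model: Callable[[str], str]
) -> Callable[[str, str], bool]:
    """Wrap a model call (assumed: prompt in, text out) as a pass/fail judge."""

    def judge(question: str, answer: str) -> bool:
        reply = call_model(build_judge_prompt(rubric, examples, question, answer))
        return reply.strip().upper().startswith("PASS")

    return judge


def evaluate(
    question: str,
    answer: str,
    reference: Optional[str],
    judge: Callable[[str, str], bool],
) -> bool:
    """Exact-match when a reference exists, otherwise defer to the judge."""
    if reference is not None:
        return exact_match(answer, reference)
    return judge(question, answer)
```

For the output "The capital is not Paris, it is Lyon.", regex_eval with the pattern r"\bparis\b" passes, which is the kind of false positive that sank the heuristic. exact_match against the reference "Paris" rejects the same output.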

Baseline benchmark created.

Architecture Diagram

3 lessons from this build:

Heuristic evaluators fail quickly. LLM output varies too much in phrasing to rely strictly on regex checks.
LLM-as-a-judge works better than expected, especially when constrained with a strict grading rubric and 5-shot examples.
Tracing is mandatory before optimization. You can’t effectively reduce latency if you don’t know which node in the graph is the bottleneck.
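The tracing lesson can be illustrated without any tracing library. The sketch below is a stand-in for the per-node timings a tracing tool records. It is not the LangSmith API and not the harness's code. NodeTimer and the node names "retrieve" and "generate" are hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple


class NodeTimer:
    """Accumulates wall-clock time per named node."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = {}

    @contextmanager
    def span(self, node: str) -> Iterator[None]:
        """Time the enclosed block and add it to the node's running total."""
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the node raised.
            elapsed = time.perf_counter() - start
            self.totals[node] = self.totals.get(node, 0.0) + elapsed

    def slowest(self) -> List[Tuple[str, float]]:
        """Return (node, seconds) pairs, slowest first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    timer = NodeTimer()
    with timer.span("retrieve"):   # hypothetical node
        time.sleep(0.01)
    with timer.span("generate"):   # hypothetical node
        time.sleep(0.05)
    for node, seconds in timer.slowest():
        print(f"{node}: {seconds * 1000:.1f} ms")
```

Sorting by accumulated time puts the optimization target first. Without that ordering, any latency work is guessing at which node to speed up.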