Eval Labs Canon

# Evaluation Philosophy

> [!summary]
> Eval Labs treats evaluation as product infrastructure, not a side task.

---

## Core belief

AI systems get better only when their behavior is inspected carefully.

Lucia is especially dependent on evaluation because her value is not merely whether she answers. Her value is whether she helps real people stay oriented while operations are messy.

---

## The main question

For each response, ask:

```text
Did this response actually help the user in this moment?
```

Not:

```text
Did it sound impressive?
```

---

## What good evaluation looks for

A strong evaluation checks:

- intent accuracy
- operational usefulness
- truth-state discipline
- clarity
- emotional containment
- tone fit
- next-action quality
- cognitive load
- whether the response preserves trust

---

## What Eval Labs protects against

Eval Labs protects against:

```text
polished nonsense
generic assistant behavior
fake completion
cold correctness
overlong answers
weak prioritization
tone drift
intent misrouting
regression after model or prompt changes
```

---

## Human grading is not optional

For Lucia, human review is not a temporary crutch. It is the product's judgment layer.

Automated graders can eventually help find probable issues, but a human must decide whether Lucia's behavior truly works for the operational-emotional moment.

---

## Principle

> [!warning]
> Correct is not enough. A response must be useful in the real operating context.

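
One way to make this principle concrete is to score every dimension listed under "What good evaluation looks for" and refuse to pass a response on accuracy alone. The sketch below is illustrative only: the 1-10 "higher is better" scale, the `PASS_THRESHOLD` value, and the names `DIMENSIONS` and `passes` are assumptions, not part of any Eval Labs tooling.

```python
# Hypothetical sketch. Dimension names mirror the checklist in this note;
# the 1-10 "higher is better" scale and the threshold are assumptions.
DIMENSIONS = (
    "intent_accuracy",
    "operational_usefulness",
    "truth_state_discipline",
    "clarity",
    "emotional_containment",
    "tone_fit",
    "next_action_quality",
    "cognitive_load",  # a high score means the response keeps load low
    "trust_preservation",
)

PASS_THRESHOLD = 7  # assumed cut-off, not an Eval Labs value


def passes(scores: dict[str, int]) -> bool:
    """Return True only if every dimension is scored and clears the threshold."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return all(scores[d] >= PASS_THRESHOLD for d in DIMENSIONS)


# A response that is accurate but cold and overlong still fails.
cold_but_correct = {d: 9 for d in DIMENSIONS}
cold_but_correct["emotional_containment"] = 4
cold_but_correct["cognitive_load"] = 5
print(passes(cold_but_correct))  # False
```
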
---

## Strong evaluation behavior

A strong reviewer:

- reads the user's prompt slowly
- identifies the actual user need
- checks whether Lucia understood the emotional state
- checks whether Lucia chose the right operational lane
- scores honestly
- writes specific notes
- does not over-reward polished language
- does not pass a response because it is "pretty good"

---

## Weak evaluation behavior

A weak reviewer:

- gives all 10s too easily
- avoids writing notes
- ignores emotional mismatch
- rewards confidence even when the answer overclaims
- treats generic redirection as acceptable
- reviews the response without considering the product context

---

## The evaluator's job

The evaluator's job is not to be nice to Lucia.

The evaluator's job is to protect Lucia's future users.

---

## May 2026 philosophy update: guided judgment beats freeform annotation

Eval Labs now treats non-expert review as guided judgment capture.

This is a deliberate product decision. Reviewers should not be forced to become AI experts, label designers, or taxonomy writers. The system should make the desired judgment path obvious:

```text
read prompt
read Lucia response
score dimensions
answer quick review
flag senior review if needed
save
```

Senior adjudication owns canonical meaning.

This improves consistency, reduces reviewer fatigue, and protects Lucia from noisy training signal.
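
The guided path above could be enforced in tooling as a fixed step order, so a reviewer cannot save before scoring or skip the senior-review decision. This is a minimal sketch under stated assumptions: `GuidedReview`, `GUIDED_STEPS`, and the step identifiers are hypothetical names chosen for illustration and do not describe an existing Eval Labs implementation.

```python
from dataclasses import dataclass, field

# The six steps of the guided path, in the order this note gives them.
# Identifiers are hypothetical.
GUIDED_STEPS = (
    "read_prompt",
    "read_response",
    "score_dimensions",
    "answer_quick_review",
    "flag_senior_review",  # always visited; the reviewer only decides yes or no
    "save",
)


@dataclass
class GuidedReview:
    """Hypothetical record of one reviewer's pass through the guided path."""

    completed: list[str] = field(default_factory=list)
    needs_senior_review: bool = False

    def complete(self, step: str) -> None:
        """Mark a step done, rejecting steps taken out of order."""
        if len(self.completed) == len(GUIDED_STEPS):
            raise ValueError("review already saved")
        expected = GUIDED_STEPS[len(self.completed)]
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)

    @property
    def is_saved(self) -> bool:
        return self.completed[-1:] == ["save"]


review = GuidedReview()
for step in GUIDED_STEPS[:4]:
    review.complete(step)
review.needs_senior_review = True  # reviewer is unsure, so escalate
review.complete("flag_senior_review")
review.complete("save")
print(review.is_saved)  # True
```

The record deliberately has no field for a canonical label: in this sketch the reviewer captures judgment and an escalation flag, and canonical meaning stays with senior adjudication.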