Eval Labs Canon

# Structured Human Judgment Capture

> [!summary]
> Eval Labs captures human judgment through guided structure so the output is useful for product, engineering, and future model refinement.

---

## Core idea

Eval Labs does not collect opinions for their own sake. It collects structured human judgment.

The goal is to preserve the human signal while removing avoidable noise.

---

## Why freeform-only review fails

Freeform review creates problems:

```text
reviewer language drift
inconsistent labels
hard-to-compare exports
ambiguous training signal
slow onboarding
employee hesitation
```

Freeform notes are still useful, but they should support structured review. They should not be the primary data layer.

---

## Preferred structure

Eval Labs should collect:

1. ratings
2. guided quick-review answers
3. human guidance scores
4. escalation state
5. optional short notes
6. adjudication metadata when needed
7. Behavioral Observatory labels when assigned
8. exports that preserve all layers

The current Human Guidance Evaluation dimensions are:

```text
emotionalValidation
cognitiveUnderstanding
actionability
toneAppropriateness
authenticity
notes
```

Warmth, intelligence, and emotional quality should map into these existing fields. They should not become separate magical categories.

---

## Why this improves data quality

Structured human judgment makes it easier to compare:

- model versions
- prompt suites
- behavior families
- reviewer patterns
- failure clusters
- canon candidates

It also helps prevent employee reviewers from accidentally training Lucia with inconsistent language.

---

## UX principle

The better the review UX, the better the data.

If reviewers hesitate, overthink, or invent categories, the signal gets worse. The interface should make the right review behavior feel obvious.

The app may show suggested human guidance scores before the reviewer saves. Those suggestions can reduce friction, but the reviewer still owns the final score.

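
As a rough illustration of suggestions that prefill a form while the reviewer keeps ownership of what is saved, the sketch below uses the dimension names listed under Preferred structure. The helper names `prefill_form` and `save_review`, the 1-5 score scale, and the treatment of `notes` as unscored free text are assumptions made for this example, not a description of the actual app.

```python
# Scored dimension names taken from the Human Guidance Evaluation list.
# Assumption: "notes" is free text and is not scored.
SCORED_DIMENSIONS = (
    "emotionalValidation",
    "cognitiveUnderstanding",
    "actionability",
    "toneAppropriateness",
    "authenticity",
)


def prefill_form(suggested):
    """Build form defaults from suggested scores (hypothetical helper).

    Nothing is persisted here; the reviewer can still change every value.
    """
    return {dim: suggested.get(dim) for dim in SCORED_DIMENSIONS}


def save_review(form_values, notes=""):
    """Return the record to persist, using only the values left in the form.

    Raises ValueError if any scored dimension was left empty, so a
    suggestion can never silently stand in for a missing reviewer score.
    """
    missing = [dim for dim in SCORED_DIMENSIONS if form_values.get(dim) is None]
    if missing:
        raise ValueError("reviewer must score: " + ", ".join(missing))
    record = {dim: form_values[dim] for dim in SCORED_DIMENSIONS}
    record["notes"] = notes
    return record


# Usage: the reviewer accepts four suggestions and overrides one.
suggested = {
    "emotionalValidation": 4,
    "cognitiveUnderstanding": 3,
    "actionability": 4,
    "toneAppropriateness": 5,
    "authenticity": 4,
}
form = prefill_form(suggested)
form["actionability"] = 2  # reviewer disagrees with the suggestion
saved = save_review(form, notes="Advice was too generic.")
```
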
Human Guidance Evaluation also produces a mean score and can surface hard-fail behavior when any scored dimension is very weak.

---

## Behavioral Observatory structure

Behavioral Observatory adds a focused structured-label layer:

```text
Intent
Guest Affect
Response Strategy
Humanness
Notes
```

These fields are Lucia-specific behavioral evidence for what the human needed, how the human felt, what Lucia did, how human the response felt, and what evidence should be preserved.

Derived suggestions can prefill context, but only saved Behavioral Observatory labels count as persisted label data.

---

## Export principle

Exports should preserve both:

```text
simple employee signal
senior adjudication signal
```

This lets analysis separate fast human reaction from final canonical meaning.
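
One way to picture an export row that keeps both layers apart, together with the mean score and hard-fail flag that Human Guidance Evaluation produces, is sketched below. The key names `employeeSignal` and `adjudicationSignal`, the helper names, the 1-5 scale, and the rule that a score of 1 counts as "very weak" are all assumptions for illustration; the source does not specify the real export schema or threshold.

```python
from statistics import mean

SCORED_DIMENSIONS = (
    "emotionalValidation",
    "cognitiveUnderstanding",
    "actionability",
    "toneAppropriateness",
    "authenticity",
)

# Assumption: scores run 1-5 and a score of 1 or lower is "very weak".
HARD_FAIL_THRESHOLD = 1


def summarize_guidance(scores):
    """Compute the mean score and a hard-fail flag over the scored dimensions."""
    values = [scores[dim] for dim in SCORED_DIMENSIONS]
    return {
        "meanScore": round(mean(values), 2),
        "hardFail": any(value <= HARD_FAIL_THRESHOLD for value in values),
    }


def build_export_row(review_id, guidance_scores, adjudication=None):
    """Keep the fast employee reaction and the senior adjudication in separate keys.

    `adjudication` stays None until a senior reviewer adds a final reading,
    so analysis can always tell which layer a value came from.
    """
    return {
        "reviewId": review_id,
        "employeeSignal": {
            "guidanceScores": dict(guidance_scores),
            **summarize_guidance(guidance_scores),
        },
        "adjudicationSignal": adjudication,
    }


# Usage: one very weak dimension triggers the hard-fail flag.
row = build_export_row(
    "review-001",  # hypothetical identifier
    {
        "emotionalValidation": 4,
        "cognitiveUnderstanding": 3,
        "actionability": 1,
        "toneAppropriateness": 5,
        "authenticity": 4,
    },
)
```

Keeping the adjudication layer as its own key, rather than overwriting the employee scores, is what allows the fast reaction and the final canonical reading to be compared later.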