# Evaluation Philosophy
> [!summary]
> Eval Labs treats evaluation as product infrastructure, not a side task.
---
## Core belief
AI systems get better only when their behavior is inspected carefully.
Lucia is especially dependent on evaluation because her value is not merely whether she answers. Her value is whether she helps real people stay oriented while operations are messy.
---
## The main question
For each response, ask:
```text
Did this response actually help the user in this moment?
```
Not:
```text
Did it sound impressive?
```
---
## What good evaluation looks for
A strong evaluation checks:
- intent accuracy
- operational usefulness
- truth-state discipline
- clarity
- emotional containment
- tone fit
- next-action quality
- cognitive load
- whether the response preserves trust
---
## What Eval Labs protects against
Eval Labs protects against:
```text
polished nonsense
generic assistant behavior
fake completion
cold correctness
overlong answers
weak prioritization
tone drift
intent misrouting
regression after model or prompt changes
```
---
## Human grading is not optional
For Lucia, human review is not a temporary crutch.
It is the product's judgment layer.
Automated graders can eventually help find probable issues, but a human must decide whether Lucia's behavior truly works for the operational-emotional moment.
---
## Principle
> [!warning]
> Correct is not enough. A response must be useful in the real operating context.
---
## Strong evaluation behavior
A strong reviewer:
- reads the user's prompt slowly
- identifies the actual user need
- checks whether Lucia understood the emotional state
- checks whether Lucia chose the right operational lane
- scores honestly
- writes specific notes
- does not over-reward polished language
- does not pass a response because it is "pretty good"
---
## Weak evaluation behavior
A weak reviewer:
- gives all 10s too easily
- avoids writing notes
- ignores emotional mismatch
- rewards confidence even when the answer overclaims
- treats generic redirection as acceptable
- reviews the response without considering the product context
---
## The evaluator's job
The evaluator's job is not to be nice to Lucia.
The evaluator's job is to protect Lucia's future users.
---
## May 2026 philosophy update — guided judgment beats freeform annotation
Eval Labs now treats non-expert review as guided judgment capture.
This is a deliberate product decision.
Reviewers should not be forced to become AI experts, label designers, or taxonomy writers.
The system should make the desired judgment path obvious:
```text
read prompt
read Lucia response
score dimensions
answer quick review
flag senior review if needed
save
```
Senior adjudication owns canonical meaning.
This improves consistency, reduces reviewer fatigue, and protects Lucia from noisy training signal.