Eval Labs Canon

# How Eval Labs Improves Lucia

> [!summary]
> Eval Labs improves Lucia by turning subjective impressions into repeated, inspectable behavioral evidence and by proving the evaluation platform can support Lucia intelligence work at scale.

---

## The improvement mechanism

Eval Labs helps Lucia improve by creating this loop:

```text
Behavior observed
→ Pattern identified
→ Suggested review and human judgment compared
→ Run History / Analysis evidence inspected
→ Owner file inspected
→ Smallest correct patch made
→ Dev deployed
→ Same suite re-run
→ Human review confirms or rejects improvement
```

---

## What Eval Labs catches

Eval Labs can catch:

- wrong intent routing
- tone drift
- weak containment
- generic language
- overclaiming
- missing next moves
- payment-risk prioritization errors
- arrival-readiness misses
- concierge readiness gaps
- multilingual regressions
- model upgrade regressions

It also protects the evaluation platform itself by validating run creation, persistence, finalization, Run History, Analysis, and compact client state before employees depend on the system.

---

## Why repeated suites matter

If we only test new prompts every time, we cannot tell whether Lucia improved.

Custom suites let us compare before/after behavior.

That turns product feel into product evidence.

---

## Lucia source-of-truth behavior

Eval Labs does not patch Lucia.

Eval Labs reveals where Lucia needs patching.

For behavior issues, the likely Engine owner is usually one of:

```text
operatorFocusBrain.js
refineOperatorFocusOutput.js
luciaModelConfig.js
luciaModelGateway.js
```

Wrong mode usually starts in `operatorFocusBrain.js`.

Right mode but awkward language may belong in `refineOperatorFocusOutput.js`.

---

## The real-world milestone

Eval Labs is now officially being used for Lucia refinement against the dev Engine.

That means it is no longer just documentation or future infrastructure.

It is part of the live development loop.

As of the 60-run AI-reviewed gate, Eval Labs is also proven as platform infrastructure for readiness checks. That is a product-infrastructure milestone, not human approval of Lucia quality.

---

## Updated improvement mechanism: from employee signal to canon signal

The improvement loop now separates signal quality:

```text
Lucia behavior observed
→ app suggestions provide initial signal
→ employee quick review captures reaction
→ Human Guidance Evaluation captures structured judgment
→ reviewer-owned final judgment is saved
→ senior reviewer adjudicates important cases
→ exported lifecycle evidence preserves the trail
→ reusable learning becomes canon candidate
→ engineering patches smallest correct layer
→ same suite is re-run
→ evidence confirms or rejects improvement
```

This prevents non-expert review from directly becoming Lucia doctrine while still letting the whole team contribute useful signal.

The app may suggest, but the reviewer must decide.

Reviewed exports preserve the signal chain: suggested review, employee review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and dirty / completion state.

---

## Intelligence stack role

Eval Labs is part of Lucia's intelligence stack.

It helps harden:

- truthfulness
- emotional containment
- operational usefulness
- intent routing
- trust-state discipline
- evaluator feedback loops
- platform evidence recovery for future threads

The Canon should therefore treat Eval Labs as product infrastructure, not as a side tool.
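
As a rough illustration of the before/after comparison described under "Why repeated suites matter", the sketch below diffs two runs of the same suite case by case. Everything in it is an assumption made for illustration: the `compareSuiteRuns` helper, the `{ caseId, verdict }` record shape, the `"pass"` / `"fail"` verdict values, and the case ids are hypothetical and do not reflect the actual Run History or export format.

```javascript
// Hypothetical sketch: compare two runs of the same suite, case by case.
// The record shape ({ caseId, verdict }) and the "pass" / "fail" verdicts
// are assumptions for illustration, not the real Eval Labs data format.

function compareSuiteRuns(beforeRun, afterRun) {
  // Index the baseline run by case id for direct lookup.
  const beforeById = new Map(beforeRun.map((r) => [r.caseId, r.verdict]));
  const summary = { improved: [], regressed: [], unchanged: [], unmatched: [] };

  for (const { caseId, verdict } of afterRun) {
    if (!beforeById.has(caseId)) {
      // A case with no baseline cannot show improvement or regression.
      summary.unmatched.push(caseId);
      continue;
    }
    const previous = beforeById.get(caseId);
    if (previous === verdict) {
      summary.unchanged.push(caseId);
    } else if (verdict === "pass") {
      summary.improved.push(caseId);
    } else {
      summary.regressed.push(caseId);
    }
  }
  return summary;
}

// Example with made-up case ids and verdicts.
const before = [
  { caseId: "intent-routing-01", verdict: "fail" },
  { caseId: "tone-drift-02", verdict: "pass" },
  { caseId: "containment-03", verdict: "pass" },
];
const after = [
  { caseId: "intent-routing-01", verdict: "pass" },
  { caseId: "tone-drift-02", verdict: "pass" },
  { caseId: "containment-03", verdict: "fail" },
  { caseId: "new-prompt-04", verdict: "pass" },
];

const result = compareSuiteRuns(before, after);
console.log(result);
```

The `unmatched` bucket reflects the point that a prompt seen only in the later run has no baseline, so it says nothing about whether Lucia improved. A diff like this is only an initial signal; under the loop above, human review still confirms or rejects the improvement.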