Eval Labs Canon

<div class="eval-home-hero">
<span class="eval-home-kicker">Evaluation Labs Canon</span>
<h1 class="eval-home-title">How we teach Lucia to become trustworthy, useful, and calm.</h1>
<p class="eval-home-subtitle">Eval Labs is the role-based human evaluation platform used to refine Lucia's operational intelligence, emotional containment, intent routing, and trust-state discipline.</p>
</div>

> [!summary]
> Eval Labs is the proprietary role-based evaluation, analysis, and human judgment system used to test, refine, and protect Lucia's intelligence as it develops. It captures how Lucia performs under real operational pressure, helping ensure her responses become more useful, truthful, emotionally aware, and resistant to generic assistant drift.

---

**Primary Eval Labs Logo:**

| | |
|---|---|
| ![[assets/binary-eval-logo.png]]<br><em>EvaluationLabs.ai Primary Logo.</em> | Eval Labs serves as first-class Lucia intelligence infrastructure: the product surface for testing, reviewing, analyzing, and hardening Lucia behavior. |

---

## What Eval Labs is for

Eval Labs exists to answer one question repeatedly:

```text
Is Lucia actually becoming better for real operators and guests?
```

Not prettier. Not longer. Not more confident. Better.


For Lucia, "better" means:

- more accurate intent recognition
- stronger operational prioritization
- warmer but not mushy language
- less cognitive load for the operator
- fewer trust-state mistakes
- less overclaiming
- better hand-holding under pressure
- clearer next actions
- better continuity across refinements

---

## Current live status

<ul class="eval-status-list">
<li>Eval Labs is a live internal product used for active Lucia intelligence development.</li>
<li>Eval Labs is critical infrastructure, not a side prompt runner.</li>
<li>Eval Labs points to the active dev Engine at <code>api-dev.hellolucia.ai</code>.</li>
<li>The AI-reviewed platform readiness gate has passed: 60 / 60 runs, 3,000 prompts, 3,000 run items, 3,000 Lucia responses, and 3,000 reviews.</li>
<li>This proves platform readiness, not human approval of Lucia quality.</li>
<li>Registry Diagnostics is implemented as a read-only diagnostic surface for derived Dataset Registry and Human Review Queue 2.0 suggestions.</li>
<li>Behavioral Observatory is implemented as a first-class behavioral label surface; saved labels persist only when Supabase confirms the save.</li>
<li>Role-gated owner/admin/evaluator/tester behavior is implemented through Clerk public metadata.</li>
<li>The Clerk session token includes <code>eval_labs_role</code> so Supabase RLS can recognize privileged owner/admin access.</li>
<li>Supabase RLS protects persisted evidence. Real runs must persist to Supabase before they count as durable evidence.</li>
<li>Team Review is implemented as the owner/admin oversight surface.</li>
<li>Staged hydration loads run summaries first, then recent/deep evidence, so dashboards can render faster without fake metrics.</li>
<li>Controlled human onboarding is role-based.
Evaluator workspace polish and first-cohort guidance remain active hardening.</li>
</ul>

---

## Current product surfaces

Eval Labs now has distinct surfaces for normal testing, internal readiness gates, evidence review, and analysis.

- `/` = Owner/Admin Home dashboard.
- `/lucia/launcher` = launcher and workspace chooser.
- `/lucia/custom` = Custom prompt tester.
- `/lucia/custom/suites/:suiteId` = saved custom suite deep link.
- `/lucia/auto-generated` = normal generated 50-prompt tester.
- `/lucia/automated` = backward-compatible alias to `/lucia/auto-generated`.
- `/guest-facing/verification` = Guest Facing Agent Verification Check.
- `/guest-facing/verification/results` = Verification Results.
- `/lucia/batch-runner` = Controlled Batch Runner.
- `/lucia/automated/runs` = Run History.
- `/team-review` = owner/admin Team Review.
- `/registry-diagnostics` = Registry Diagnostics for derived Dataset Registry and Human Review Queue 2.0 inspection.
- `/dataset-diagnostics` = legacy Registry Diagnostics alias.
- `/behavioral-observatory` = Behavioral Observatory for saved reviewer behavioral labels.
- `/analysis` = canonical Global Analysis surface.
- `/experiments` = legacy Global Analysis alias.
- `/analysis/runs/:sessionId` = Single Run Analysis.
- `/runs/:sessionId/running` = in-flight run route.
- `/runs/:sessionId/review` = Review Queue.
- `/runs/:sessionId/review?eval=:caseId` = direct eval-item review link.

Read next: [[04 - Product Surfaces and Route Map|Product Surfaces and Route Map]].

Access matrix: [[08 - Eval Labs Roles and Access Matrix|Eval Labs Roles and Access Matrix]].

---

## Registry Diagnostics vs Behavioral Observatory

These two surfaces must not be confused.

Registry Diagnostics is diagnostic and derived. It inspects existing Eval Labs run/review evidence and shows suggested Dataset Registry membership and suggested Human Review Queue 2.0 lanes. It does not create labels, save human behavioral review decisions, or prove final membership.

Behavioral Observatory is a labeling surface. It lets a reviewer inspect a conversation and save structured behavioral labels: Intent, Guest Affect, Response Strategy, Humanness, and Notes. When Supabase confirms the save, those labels count as persisted Behavioral Observatory evidence.

Canon rule:

```text
Derived suggestion is not a saved label.
Registry Diagnostics is not Behavioral Observatory.
Eval run review is not a Behavioral Observatory label.
```

Read next:

- [[07 - Dataset Registry and Registry Diagnostics|Dataset Registry and Registry Diagnostics]]
- [[07 - Behavioral Observatory|Behavioral Observatory]]
- [[06 - Behavioral Label Persistence|Behavioral Label Persistence]]
- [[04 - Eval Labs Step-by-Step Operator Guide|Eval Labs Step-by-Step Operator Guide]]

---

## Testing paths

When a user logs into Eval Labs, their available path depends on role.

### Custom Prompt Test

Use this when refining a specific Lucia behavior.

Examples:

```text
frazzled / overwhelmed prompts
intent-layer routing
memory regression
short warm replies
payment-risk prioritization
concierge readiness language
```

Custom prompt tests support:

- 1–10 manually entered prompts
- exact prompt order preservation
- saved reusable suites
- re-running the same prompts across code changes
- review, scoring, exports, and tester identity
- owner/admin, evaluator, and tester access in the current role model

### Auto-generated 50-Prompt Test

Use this for broader regression coverage.

The auto-generated path is the normal generated prompt tester. Owner/admin, evaluator, and tester roles can use it. Tester onboarding should stay here and in Custom Prompt Test unless the role is intentionally changed.

### Guest Facing Agent Verification Check

Use this for booked-guest verification behavior and results.

Owner/admin and evaluator roles can use Verification Check and Verification Results. Tester cannot use verification surfaces.
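
The role rules stated for the three testing paths above can be summarized as a lookup table. The following is a minimal sketch under stated assumptions: the names `ROLES`, `PATH_ROLES`, `can_use`, and `available_paths`, and the path keys themselves, are hypothetical and chosen only for illustration. The real gating is implemented through Clerk public metadata and Supabase RLS and is not reproduced here.

```python
# Illustrative only: names and structure are assumptions,
# not the actual Eval Labs implementation.
ROLES = {"owner", "admin", "evaluator", "tester"}

# Which roles may open each testing path, as described in the sections above.
PATH_ROLES = {
    "custom_prompt_test": {"owner", "admin", "evaluator", "tester"},
    "auto_generated_50": {"owner", "admin", "evaluator", "tester"},
    "guest_facing_verification": {"owner", "admin", "evaluator"},
}


def can_use(role: str, path: str) -> bool:
    """Return True if the role may use the path; unknown roles or paths are denied."""
    if role not in ROLES:
        return False
    return role in PATH_ROLES.get(path, set())


def available_paths(role: str) -> list[str]:
    """List the testing paths a role can open, in alphabetical order."""
    return sorted(path for path in PATH_ROLES if can_use(role, path))


if __name__ == "__main__":
    for role in sorted(ROLES):
        print(role, "->", available_paths(role))
```

Denying unknown roles and unknown paths by default mirrors the document's stance that access is granted intentionally per role rather than assumed.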

### Controlled Batch Runner

Use this for controlled platform-readiness checks. It supports:

- 1-run smoke
- 3-run checkpoint
- 10-run checkpoint

This is not a tester-facing prompt-testing lane. Owner/admin and evaluator roles can use it in the current access model. Tester cannot use it.

---

## The core loop

```text
Create or load prompt suite
→ Run Lucia responses
→ Review suggested selections and score each item
→ Capture Quick Review and Human Guidance Evaluation
→ Finalize and export results
→ Identify behavioral pattern
→ Patch Lucia Engine / prompt / routing
→ Deploy to dev
→ Re-run the same suite
→ Compare results
```

This loop is the point.

A single good response proves almost nothing. A repeated behavior pattern proves a lot.

The AI-reviewed platform gate proves the platform lifecycle can complete. It does not prove Lucia is human-approved for real operator use.

---

## What new employees must understand

Eval Labs is not asking reviewers whether they "like" Lucia's answer.

Reviewers are judging whether Lucia did the job.

For Lucia, the job is both:

```text
operational intelligence
emotional containment
```

A response can be factually correct and still fail if it makes the operator feel more scattered, unsupported, or unsure what to do next.

A response can be warm and still fail if it dodges the operational decision.

Human review remains the true Lucia behavioral-quality judgment layer. AI-reviewed runs are platform evidence.

---

## EvaluationLabs.ai Interface

>![[fork-landing-page.png]]
_Fork landing page for test type selection._

---

>![[Custom-launcher-with-test-suite-loaded-new-ui.png]]
_Custom launcher with test suite loaded._

---

>![[Review-Queue-with-Lucia-response.png]]
_Review Queue with Lucia response, scoring controls, and Save / Save & Next behavior._

---

## Reading order for employees

Evaluator shortcut: approved evaluators should start with [[00 - START HERE - Evaluator Onboarding|Evaluator Onboarding]] before reading the broader Canon.
The mini-Canon is a scoped first-assignment guide; it does not launch onboarding or expand access.

1. [[00 - What Eval Labs Is|What Eval Labs Is]]
2. [[03 - Employee Onboarding Gate|Employee Onboarding Gate]]
3. [[03 - Current System State|Current System State]]
4. [[05 - Role and Access Model|Role and Access Model]]
5. [[08 - Eval Labs Roles and Access Matrix|Eval Labs Roles and Access Matrix]]
6. [[00 - Evaluation Philosophy|Evaluation Philosophy]]
7. [[02 - Human Grading Is the Product|Human Grading Is the Product]]
8. [[00 - How Eval Labs Works|How Eval Labs Works]]
9. [[04 - Product Surfaces and Route Map|Product Surfaces and Route Map]]
10. [[07 - Dataset Registry and Registry Diagnostics|Dataset Registry and Registry Diagnostics]]
11. [[07 - Behavioral Observatory|Behavioral Observatory]]
12. [[06 - Behavioral Label Persistence|Behavioral Label Persistence]]
13. [[04 - Eval Labs Step-by-Step Operator Guide|Eval Labs Step-by-Step Operator Guide]]
14. [[00 - Review Workflow|Review Workflow]]
15. [[01 - Quality Bar|Quality Bar]]
16. [[04 - Employee Review Layer|Employee Review Layer]]
17. [[01 - Custom Prompt Suites|Custom Prompt Suites]]
18. [[01 - Lucia-Specific Failure Modes|Lucia-Specific Failure Modes]]
19. [[00 - Team Usage Guidelines|Team Usage Guidelines]]
20. [[01 - Intent Layer Refinement Workflow|Intent Layer Refinement Workflow]]

After those, a reviewer can begin only if their role and access scope are approved.

---

## May 2026 platform readiness gate

The AI-reviewed platform readiness gate passed.

Final verified result:

```text
ready | 60 | 3000 | 3000 | 3000 | 3000
```

This means:

- 60 completed runs
- 3,000 expected prompts
- 3,000 `eval_run_items`
- 3,000 Lucia responses
- 3,000 reviews
- compact localStorage with no persisted item-level payloads
- owner-scoped local state for the tested account

This verifies platform lifecycle readiness: run creation, response capture, review generation, review persistence, finalization, Run History truth, Global Analysis truth, Supabase count alignment, localStorage compactness, scoped visibility in the tested owner context, and controlled batch lifecycle.

It does not verify:

- Lucia is ready for real operator use
- Lucia is human-approved
- human evaluators agree with AI scoring
- employee rollout is complete
- every future access/security decision is complete

Read next: [[04 - AI-Reviewed Platform Readiness Gate|AI-Reviewed Platform Readiness Gate]].

---

## May 2026 doctrine update — layered review and adjudication

Eval Labs now uses a layered review model:

```text
Employee Review
→ Escalation Routing
→ Senior Adjudication
→ Canon Candidate / Reusable Learning
```

This matters because Eval Labs is no longer asking every reviewer to become an AI expert.

Employees provide guided human judgment. Senior reviewers provide canonical meaning.

The app may suggest. The reviewer must decide.

Reviewed exports preserve structured review, suggested review, Employee Review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and dirty / completion state.

Read next:

1. [[03 - Review Architecture|Review Architecture]]
2. [[04 - Employee Review Layer|Employee Review Layer]]
3. [[05 - Adjudication Doctrine|Adjudication Doctrine]]
4. [[06 - Structured Human Judgment Capture|Structured Human Judgment Capture]]
5. [[03 - Reviewer Cognitive Load Doctrine|Reviewer Cognitive Load Doctrine]]
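
The rule in the layered review model above, that the app may suggest but the reviewer must decide, can be illustrated with a small sketch. Everything named here (`ReviewItem`, `suggested_score`, `reviewer_score`, `is_decided`, `final_score`) is hypothetical. This document does not specify how Eval Labs stores suggestions or decisions, so treat this as one possible shape rather than the actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewItem:
    """Hypothetical review record: a suggestion and a human decision kept apart."""

    suggested_score: Optional[int] = None  # what the app proposes (assumed field)
    reviewer_score: Optional[int] = None   # what a human reviewer decided (assumed field)

    def is_decided(self) -> bool:
        # Only an explicit reviewer decision counts; a suggestion alone does not.
        return self.reviewer_score is not None

    def final_score(self) -> Optional[int]:
        # The suggestion is never promoted to a final score on its own.
        return self.reviewer_score


if __name__ == "__main__":
    item = ReviewItem(suggested_score=4)
    print(item.is_decided(), item.final_score())
    item.reviewer_score = 2  # the reviewer disagrees with the suggestion
    print(item.is_decided(), item.final_score())
```

Keeping the two fields separate means an export can carry both the suggested review and the human review without one overwriting the other, which matches the list of fields that reviewed exports are said to preserve.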