Eval Labs Canon

# What Eval Labs Is

> [!summary]
> Eval Labs is Lucia's role-based human evaluation and platform-readiness infrastructure. It lets authorized users test, score, annotate, analyze, compare, oversee, and improve AI behavior over time.

---

## Definition

Eval Labs is the internal product and workflow used to evaluate Lucia's responses with role-based human judgment.

It is not simply a prompt runner. It is first-class Lucia intelligence infrastructure: the place where Lucia's behavior is tested, inspected, reviewed, analyzed, and improved.

Eval Labs captures:

- the prompt
- Lucia's response
- the run source
- the suite context
- the human evaluator
- ratings
- suggested review signal
- quick employee review
- Human Guidance Evaluation
- pass/fail state
- written notes
- review lifecycle
- adjudication metadata
- role-gated access state
- Run History evidence
- Analysis and Single Run Analysis evidence
- Registry Diagnostics evidence
- Dataset Registry suggestion evidence
- Human Review Queue 2.0 lane suggestion evidence
- Behavioral Observatory labels
- exports for analysis
- dirty / completion state
- behavior patterns over time

---

## Why Eval Labs exists

Generic AI benchmarks are not enough for Lucia.

Lucia is not being built to win trivia tests. Lucia is being built to help hospitality operators stay oriented, make good decisions, and trust the system under real pressure.

That means we must evaluate qualities that normal benchmarks miss:

```text
calm
warmth
trust
intent accuracy
operational usefulness
containment
truth-state discipline
operator cognitive load
```

---

## The product principle

Eval Labs exists because Lucia cannot become excellent by vibes.

Lucia needs repeated, inspectable, human-scored evaluation against real behavioral expectations.
The product also needs controlled readiness evidence that proves the evaluation platform itself can create runs, capture responses, persist reviews, finalize sessions, hydrate Run History, hydrate Analysis, and keep local client state compact.

The system must help us answer:

- Did Lucia understand the user?
- Did Lucia choose the right mode?
- Did Lucia say too much or too little?
- Did Lucia create calm or noise?
- Did Lucia preserve trust?
- Did Lucia overclaim?
- Did Lucia give the right next move?

---

## What Eval Labs does today

Eval Labs currently supports:

- custom 1–10 prompt tests
- saved custom prompt suites
- auto-generated prompt tests
- Guest Facing Agent Verification Check and Verification Results
- Controlled Batch Runner readiness checks
- shared Review Queue scoring
- suggested selections
- semantic scoring sliders
- Quick Review
- Human Guidance Evaluation
- lifecycle finalization
- Run History
- Team Review for owner/admin oversight
- Global Analysis
- Single Run Analysis
- Registry Diagnostics for derived Dataset Registry and Human Review Queue 2.0 inspection
- Behavioral Observatory for saved reviewer behavioral labels
- copy Session ID / copy Deep Link controls across key surfaces
- Clerk role gating for owner, admin, evaluator, and tester
- Supabase RLS protection for persisted evidence when the Clerk session carries `eval_labs_role`
- tester identity capture through Clerk
- JSON / CSV / Markdown exports
- run source tagging: `custom`, `automated`, `manual`
- Supabase persistence for suites, runs, items, and reviews
- live testing against the active dev Lucia Engine

---

## What Eval Labs is not

Eval Labs is not:

```text
a generic chat app
a one-off prompt playground
a rubber-stamp review form
a dataset label factory
a vague behavior dashboard
a replacement for product judgment
a replacement for Lucia doctrine
a claim that Lucia is human-approved
a complete backend authorization boundary by itself
```

Eval Labs is:

```text
Lucia-native
behavioral judgment infrastructure
```

---

## Who uses Eval Labs

Eval Labs is designed for:

- founders
- owners/admins running platform and quality gates
- Lucia evaluators
- testers in the entry-level prompt-testing lane
- employees testing approved prompt workflows
- engineers validating behavior changes
- product leads reviewing quality patterns
- future QA and evaluation teams

Owner/admin have full platform access, shared persisted evidence, Team Review, Global Analysis, and all current test surfaces.

Evaluator is the full evaluator workbench role. Evaluators can use evaluator-safe test types and their own run/review/history routes, but they do not see Team Review or Global Analysis.

Tester is the narrower onboarding role. Testers can use Custom Prompt Test and Auto-generated Prompt Test, but they cannot use verification, controlled batch, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.

Read the canonical matrix: [[08 - Eval Labs Roles and Access Matrix|Eval Labs Roles and Access Matrix]].

---

## Why saved suites matter

Saved custom prompt suites changed Eval Labs from a testing slot machine into a regression lab.

Before saved suites, reviewers could run broad generated tests. Now reviewers can repeatedly test the same exact prompts while Lucia's brain, intent layer, wording, memory, and routing evolve.

That repeatability is the foundation of serious improvement.
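
The regression idea behind saved suites can be illustrated with a minimal sketch. It assumes each reviewed item carries its prompt and the human pass/fail state. The names `ItemResult` and `find_regressions`, and the sample prompts, are hypothetical and are not part of the Eval Labs codebase.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ItemResult:
    """One reviewed prompt from a single run of a saved suite (hypothetical shape)."""

    prompt: str
    passed: bool  # the human reviewer's pass/fail state


def find_regressions(
    baseline: list[ItemResult], candidate: list[ItemResult]
) -> list[str]:
    """Return prompts that passed in the baseline run but failed in the candidate run."""
    candidate_by_prompt = {item.prompt: item.passed for item in candidate}
    return [
        item.prompt
        for item in baseline
        # `is False` skips prompts that are missing from the candidate run.
        if item.passed and candidate_by_prompt.get(item.prompt) is False
    ]


# Two runs of the same saved suite, before and after a behavior change.
baseline_run = [
    ItemResult("Example prompt A", passed=True),
    ItemResult("Example prompt B", passed=True),
    ItemResult("Example prompt C", passed=False),
]
candidate_run = [
    ItemResult("Example prompt A", passed=True),
    ItemResult("Example prompt B", passed=False),
    ItemResult("Example prompt C", passed=True),
]

print(find_regressions(baseline_run, candidate_run))
```

Because the prompts are identical across both runs, any change in pass/fail state can be attributed to a change in Lucia rather than to a change in the test. That comparison is not possible with freshly generated prompts.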
---

## Current expanded definition

Eval Labs now captures multiple layers of review signal:

```text
ratings
suggested review
quick employee review
human guidance evaluation
derived dataset membership suggestions
derived Human Review Queue 2.0 lane suggestions
persisted Behavioral Observatory labels
optional notes
review state
senior review routing
canon candidate signal
adjudication metadata
review lifecycle
dirty / completion state
exports for analysis
```

The core architectural distinction is:

```text
employee reaction ≠ canonical training label
derived suggestion is not a saved label
```

Employees should provide structured reactions. Senior reviewers and adjudicators should assign canonical meaning.

This protects Lucia from ontology drift while still allowing non-expert employees to participate in useful review work.

Registry Diagnostics belongs to the derived-suggestion layer. It helps the team inspect how existing Eval Labs evidence appears to map to datasets and queue lanes.

Behavioral Observatory belongs to the saved-label layer. It preserves intentional reviewer labels for intent, guest affect, response strategy, humanness, and notes when persistence succeeds.

---

## Readiness doctrine

Eval Labs has now passed the AI-reviewed platform readiness gate:

```text
60 completed runs
3,000 prompts
3,000 eval_run_items
3,000 Lucia responses
3,000 reviews
```

That result proves the platform lifecycle can run end to end. It does not prove Lucia is ready for real operators.

Human review remains the true Lucia behavioral-quality judgment layer.

The current platform is implemented for controlled role-based human onboarding. Evaluator workspace polish and first-cohort guidance remain active hardening.
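
As a rough illustration of what the readiness numbers above establish, the sketch below checks that every prompt produced exactly one item, one response, and one review. The equality rule, the `ReadinessCounts` type, and the `lifecycle_is_consistent` function are assumptions made for this example; the source does not describe how the actual gate is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadinessCounts:
    """Totals reported by a readiness batch (hypothetical shape)."""

    completed_runs: int
    prompts: int
    eval_run_items: int
    lucia_responses: int
    reviews: int


def lifecycle_is_consistent(counts: ReadinessCounts) -> bool:
    """Assumed rule: the four per-prompt totals must all be equal and non-zero."""
    per_prompt_totals = {
        counts.prompts,
        counts.eval_run_items,
        counts.lucia_responses,
        counts.reviews,
    }
    return (
        counts.completed_runs > 0
        and counts.prompts > 0
        and len(per_prompt_totals) == 1
    )


# The totals reported for the AI-reviewed platform readiness gate.
gate = ReadinessCounts(
    completed_runs=60,
    prompts=3000,
    eval_run_items=3000,
    lucia_responses=3000,
    reviews=3000,
)

# A hypothetical batch where one review failed to persist.
missing_review = ReadinessCounts(60, 3000, 3000, 3000, 2999)

print(lifecycle_is_consistent(gate))
print(lifecycle_is_consistent(missing_review))
```

A check like this only describes platform plumbing. It says nothing about whether any of the 3,000 responses were good, which is why human review remains the behavioral-quality layer.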