Eval Labs Canon

# Eval Labs Glossary

> [!summary]
> This glossary defines the terms employees will see while using Eval Labs and reading the Canon.

---

## Eval

A structured test of Lucia's behavior.

An eval is not only the prompt. It includes the response, review, scoring, and follow-up interpretation.

---

## Run

One execution of a suite, generated prompt set, controlled batch, or other Eval Labs test path.

A run contains run items. Run truth means the run lifecycle and persisted record agree.

---

## Run item

One prompt/response pair inside a run.

A run item is the unit reviewed in Review Queue and the unit labeled in Behavioral Observatory.

---

## Prompt

The user message being tested.

Example:

```text
I feel totally out of the loop.
```

---

## Lucia response

The response generated by Lucia from the Engine under test.

---

## Review Queue

The place where a human reviewer evaluates each generated response.

The Review Queue is shared by both custom runs and automated runs.

---

## Review

The human or AI-generated evaluation record attached to a run item in the Review Queue flow.

Review evidence can include ratings, suggested values, Quick Review, Human Guidance Evaluation, notes, save state, and finalization context.

Review evidence is not the same as a persisted Behavioral Observatory label.

---

## Registry Diagnostics

The read-only diagnostic surface at:

```text
/registry-diagnostics
```

Registry Diagnostics inspects existing Eval Labs run/review data and shows derived Dataset Registry membership suggestions and Human Review Queue 2.0 lane suggestions.

It does not create labels or save Behavioral Observatory decisions.

---

## Dataset Registry

The canonical diagnostic taxonomy used to group Eval Labs evidence into meaningful dataset categories.

In the current Registry Diagnostics surface, dataset use is diagnostic and derived.

---

## Dataset

A named group of Eval Labs examples or signals that belong to the same behavioral/product area.
A dataset can help organize evaluation evidence, but a derived match is not final human truth.

---

## Dataset membership suggestion

A derived suggestion that a run item appears to belong to a dataset.

It means:

```text
The model found evidence that this item may belong here.
```

It does not mean:

```text
A human approved this dataset membership.
```

---

## Review queue lane

A workflow lane suggested for a run item.

In Registry Diagnostics, lane suggestions are diagnostic and derived. They are not saved queue decisions.

---

## Review Queue 2.0

The emerging review-routing model that suggests lanes for existing Eval Labs evidence.

Current Registry Diagnostics output is for inspection, not final employee workflow UX.

---

## Human Review Queue 2.0

The human-review workflow model behind Review Queue 2.0 lane suggestions.

In the current Registry Diagnostics surface, Human Review Queue 2.0 lanes are derived suggestions only. They are not saved queue assignments and they are not Behavioral Observatory labels.

---

## Derived signal

A signal inferred from existing Eval Labs data.

Derived signals can help prefill, suggest, or inspect behavior. Derived signals are not saved human judgment.

---

## Persisted label

A label saved to durable storage and reloadable after refresh.

For Behavioral Observatory, a persisted label means Supabase confirmed a row in `public.eval_behavioral_labels`.

---

## Behavioral Observatory

The first-class Eval Labs product surface at:

```text
/behavioral-observatory
```

Behavioral Observatory lets a reviewer inspect a conversation and save structured behavioral labels.

---

## Behavioral label

A saved Behavioral Observatory judgment for a run item.

Current fields:

```text
intent
guest_affect
response_strategy
humanness
notes
```

Behavioral labels are stored in `public.eval_behavioral_labels` when persistence succeeds.

---

## Intent

What the human was trying to do.
Behavioral Observatory currently supports:

```text
Booking Help
Check-In
Checkout
Billing
Noise
Room Issue
Concierge
Other
```

---

## Guest Affect

The human's emotional state in the conversation.

Behavioral Observatory currently supports:

```text
Neutral
Mildly Upset
Upset
Grateful
```

Use the smallest truthful affect. Do not dramatize.

---

## Response Strategy

Lucia's dominant response move.

Behavioral Observatory currently supports:

```text
Acknowledge
Apology
Offer
Escalation
```

Choose the main strategy, not every strategy present.

---

## Humanness

A 1-7 Behavioral Observatory label for how human Lucia's response felt.

Current anchors:

```text
1 = Template
4 = Functional
7 = Warm + Specific
```

Humanness is not a substitute for truth, usefulness, or safety.

---

## Gold Standard

A high-confidence human-reviewed example that can be used for calibration, training, or future benchmark design.

Gold Standard examples require deliberate human judgment. A derived suggestion is not automatically Gold Standard.

---

## Custom Prompt Suite

A saved set of 1–10 manually chosen prompts.

Use custom suites when testing a specific behavior family repeatedly.

Examples:

- Overwhelm phrasing
- Lost / out-of-loop prompts
- Payment-risk triage
- Concierge confirmation gaps
- Guest trust repair
- Spanish language handling

---

## Auto-generated 50-Prompt Test

A broader 50-prompt test run generated by Eval Labs for full-spectrum review.

Use it for regression coverage after broader changes.

Current canonical route:

```text
/lucia/auto-generated
```

Legacy inbound alias:

```text
/lucia/automated
```

---

## Controlled Batch Runner

The controlled platform-readiness surface used for controlled 1-run smoke, 3-run checkpoint, and 10-run checkpoint batches.

It was used to complete the 60-run AI-reviewed platform readiness gate.

Owner/admin and evaluator roles can use it in the current access model. Tester cannot.
Canonical route:

```text
/lucia/batch-runner
```

---

## AI-reviewed platform readiness gate

A controlled batch validation protocol that proves Eval Labs platform behavior can complete end to end.

It can prove run creation, Lucia response capture, review generation, review persistence, finalization, Run History truth, Global Analysis truth, Supabase count alignment, localStorage compactness, scoped visibility in the tested owner context, and controlled batch lifecycle.

It does not prove Lucia is human-approved.

---

## Human Lucia-quality approval

The judgment layer where human evaluators decide whether Lucia's behavior is ready, useful, trustworthy, and operationally appropriate.

This remains separate from AI-reviewed platform readiness.

---

## Run Source

The source type of the run.

Current values:

```text
custom
automated
manual
```

`custom` means the run came from a user-created prompt suite.

`automated` means it came from the 50-prompt generated battery.

---

## Tester identity

The logged-in Clerk user who saves or exports review data.

Eval Labs records limited identity metadata:

```text
Clerk user id
email
display name when available
```

This helps us know who evaluated a response.

---

## Role metadata

The current Clerk public metadata key used by Eval Labs is:

```json
{ "eval_labs_role": "owner" }
```

Supported values are:

```text
owner
admin
evaluator
tester
```

Missing or unknown role metadata should not grant privileged access.

---

## Owner role

The privileged Eval Labs role with full access to Home, Launcher, Custom Prompt Test, Auto-generated Prompt Test, Guest Facing Agent Verification Check, Verification Results, Controlled Batch Runner, Run History, Team Review, Global Analysis, Single Run Analysis, review routes, and future admin/tooling surfaces.

---

## Admin role

The privileged operational role.

Admin has similar access to owner for current testing, evidence inspection, Team Review, Global Analysis, batch runner usage, and evaluator oversight.
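
The Role metadata rule above (missing or unknown role metadata should not grant privileged access) can be sketched as a fail-closed lookup. This is a minimal illustration, not Eval Labs source: the names `resolveEvalLabsRole` and `isPrivileged` are hypothetical, and the only details taken from this glossary are the `eval_labs_role` key, the four supported values, and the description of owner and admin as the privileged roles.

```typescript
// Hypothetical sketch: function and constant names are illustrative only.
// Supported values copied from the Role metadata entry.
const EVAL_LABS_ROLES = ["owner", "admin", "evaluator", "tester"] as const;
type EvalLabsRole = (typeof EVAL_LABS_ROLES)[number];

// Returns the role only when the metadata carries a supported string value.
// A missing object, missing key, non-string value, or unknown string all
// resolve to null, so nothing is granted by default.
function resolveEvalLabsRole(publicMetadata: unknown): EvalLabsRole | null {
  if (typeof publicMetadata !== "object" || publicMetadata === null) {
    return null;
  }
  const raw = (publicMetadata as Record<string, unknown>)["eval_labs_role"];
  if (typeof raw !== "string") {
    return null;
  }
  const known = (EVAL_LABS_ROLES as readonly string[]).indexOf(raw) !== -1;
  return known ? (raw as EvalLabsRole) : null;
}

// Assumption: "privileged" here means owner or admin, following the
// Owner role and Admin role entries.
function isPrivileged(role: EvalLabsRole | null): boolean {
  return role === "owner" || role === "admin";
}
```

A caller would pass whatever object holds the user's public metadata, for example `isPrivileged(resolveEvalLabsRole({ eval_labs_role: "evaluator" }))`, and treat a `null` role the same as an unprivileged one. A check like this only shapes frontend behavior; it does not replace backend enforcement.
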

---

## Evaluator role

The full evaluator workbench role.

Evaluators can use evaluator-safe test surfaces and their own run/review/history routes. Evaluators cannot see Team Review, Global Analysis, owner/admin tools, or shared platform-wide evidence unless explicitly widened later.

---

## Tester role

The entry-level prompt-testing role.

Testers can use Custom Prompt Test and Auto-generated Prompt Test. Testers cannot use Verification Check, Verification Results, Controlled Batch Runner, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.

---

## Run History

The scoped run ledger at:

```text
/lucia/automated/runs
```

It records completed/finalized run truth and may include scoped operational run state.

---

## Team Review

The owner/admin oversight surface at:

```text
/team-review
```

Team Review groups evaluator activity, review gaps, flags, recent work, and evidence that needs privileged attention.

---

## Global Analysis

The read-only behavioral and analytics surface at:

```text
/analysis
```

Global Analysis is owner/admin-only in the current model. It shows AI-analyzed platform evidence, not human Lucia-quality approval.

The legacy alias is:

```text
/experiments
```

---

## Single Run Analysis

The read-only analysis surface for one completed run/session:

```text
/analysis/runs/:sessionId
```

It can include run metadata, behavioral summaries, item rows, copy controls, and deep links.

---

## localStorage compactness

The client persistence doctrine that completed cloud-backed runs should not persist full item-level payloads in localStorage.

The readiness diagnostic target is:

```text
persistedLocalFullPayloadSessionCount = 0
persistedLocalHasItemLevelData = false
persistedLocalItemLevelDataSessionCount = 0
ownedSessionCount = expected run count
otherOwnerSessionCount = 0
ownerlessSessionCount = 0
```

---

## RLS / backend permission enforcement

Supabase row-level security and backend/API permission checks.
Frontend role behavior comes from Clerk public metadata. Persisted evidence protection depends on the Clerk session token carrying `eval_labs_role` so Supabase RLS can recognize privileged owner/admin access.

Verify the Clerk-to-Supabase role claim path when role metadata, JWT templates, RLS policies, or privileged evidence hydration changes.

---

## `exportedBy`

The user who exported a session file.

Important: this may differ from the person who originally reviewed the prompts.

---

## `savedBy`

The user who saved a specific prompt review.

This is more important than `exportedBy` when auditing human review work.

---

## `savedAt`

The time a review was saved.

---

## Intent layer

The part of Lucia responsible for interpreting what kind of user message was sent and routing it into the correct behavior mode.

If a distress prompt routes to a generic capability redirect, that is usually an intent-layer failure.

---

## Emotional containment

Lucia's ability to reduce felt chaos without becoming therapy-bot language.

Containment means Lucia narrows the field and gives one clear next move.

---

## Trust-state discipline

Lucia's habit of distinguishing what is known, inferred, suggested, requested, confirmed, or not yet done.

A trust-state failure is serious.

---

## Truth-state

The specific truth status of a claim or action.

Examples:

```text
known
inferred
suggested
requested
confirmed
not yet done
```

Truth-state is the thing being preserved. Trust-state discipline is the habit of preserving it.

---

## Regression

A behavior that used to work but breaks after a code, prompt, model, or configuration change.

Eval Labs exists largely to detect regressions before they become product damage.

---

## Employee Review

The fast, guided review layer used by non-expert reviewers.

Employee Review captures observable human judgment without asking employees to invent labels or taxonomies.

---

## Quick Review

The guided question flow inside the Review Queue.
It asks simple questions such as whether Lucia understood the need, gave the right next move, calmed the situation, or created risk.

---

## Adjudication

The senior-review process that assigns final meaning to ambiguous, high-risk, or reusable cases.

Adjudication converts review signal into canonical training signal.

---

## Senior reviewer

A reviewer trusted to inspect escalated cases, resolve ambiguity, assign final labels, and decide whether a case should become a canon candidate.

---

## `reviewState`

The workflow state for a prompt review.

Current values include:

```text
clean_pass
needs_review
needs_adjudication
canon_candidate
```

---

## Needs final call

Employee-facing language for a case that needs senior adjudication.

---

## Canon candidate

A response or failure pattern that may teach Lucia something durable enough to enter Canon, eval suites, or future training guidance.

---

## Ontology drift

The quality failure that happens when reviewers invent inconsistent categories, labels, or meanings over time.

Eval Labs prevents ontology drift by separating Employee Review from Adjudication.

---

## Semantic confidence bar

The stepped 1–10 slider used for scoring dimensions.

It uses restrained color and fill behavior to help reviewers feel score quality without extra interpretation.