Eval Labs Canon

# How Eval Labs Works

> [!summary]
> Eval Labs creates a role-based loop between prompts, Lucia responses, human review, persisted evidence, owner/admin oversight, exported evidence, and product improvement.

---

## Basic flow

```text
Prompt
→ Lucia Engine response
→ Shared Review Queue
→ Suggested selections plus human review
→ Ratings, Quick Review, and Human Guidance Evaluation
→ Save, finalize, and export
→ Run History and Global Analysis truth
→ Team Review for owner/admin oversight
→ Registry Diagnostics for derived dataset and lane inspection
→ Behavioral Observatory for saved behavioral labels when assigned
→ Pattern discovery
→ Lucia refinement
→ Re-run same suite
```

---

## Current test surfaces

Eval Labs currently has four distinct test surfaces.

### 1. Custom Prompt Test

The reviewer enters 1–10 prompts manually or loads a saved custom suite.

Use this for targeted behavioral refinement.

Owner/admin, evaluator, and tester roles can use this surface.

### 2. Auto-generated Prompt Test

Eval Labs generates a full 50-prompt run.

Use this for broad coverage and regression checks. This normal auto-generated tester is separate from the controlled batch runner.

Owner/admin, evaluator, and tester roles can use this surface.

### 3. Guest Facing Agent Verification Check

Eval Labs can run the Guest Facing Agent verification scenario pack through the app.

Use this to inspect booked-guest verification behavior and results without treating terminal logs as the primary human workflow.

Owner/admin and evaluator roles can use this surface. Tester cannot.

### 4. Controlled Batch Runner

Controlled platform-readiness tooling.

It runs controlled 1-run smoke, 3-run checkpoint, and 10-run checkpoint batches. It is used to prove the Eval Labs platform lifecycle, not to collect normal evaluator work.

Owner/admin and evaluator roles can use this surface. Tester cannot.

Read the surface matrix: [[08 - Eval Labs Roles and Access Matrix|Eval Labs Roles and Access Matrix]].

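
The four surfaces and their role rules can be read as a small lookup table. The sketch below is illustrative only: the type names, surface identifiers, and `canUseSurface` helper are hypothetical and not taken from the Eval Labs codebase. Only the role-to-surface pairings come from the descriptions above. Denying a missing or unrecognized role everywhere is an assumption.

```typescript
// Hypothetical names throughout; only the role/surface pairings come from the text above.
type Role = "owner" | "admin" | "evaluator" | "tester";

type Surface =
  | "custom_prompt_test"
  | "auto_generated_prompt_test"
  | "guest_facing_agent_verification_check"
  | "controlled_batch_runner";

// Which roles may open each test surface.
const SURFACE_ACCESS: Record<Surface, readonly Role[]> = {
  custom_prompt_test: ["owner", "admin", "evaluator", "tester"],
  auto_generated_prompt_test: ["owner", "admin", "evaluator", "tester"],
  guest_facing_agent_verification_check: ["owner", "admin", "evaluator"],
  controlled_batch_runner: ["owner", "admin", "evaluator"],
};

// Assumption: a missing or unrecognized role is denied on every surface.
function canUseSurface(role: string | undefined, surface: Surface): boolean {
  return SURFACE_ACCESS[surface].some((allowed) => allowed === role);
}
```

Under these assumptions, `canUseSurface("tester", "custom_prompt_test")` is `true` and `canUseSurface("tester", "controlled_batch_runner")` is `false`. Keeping the rules in one table makes it easier to compare them against the linked access matrix.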

---

## Role-based human evaluation

Eval Labs uses Clerk public metadata to determine the user's role:

```json
{
  "eval_labs_role": "..."
}
```

Current roles are `owner`, `admin`, `evaluator`, `tester`, and missing/unassigned.

Owner/admin have full platform access, shared persisted evidence, Team Review, Global Analysis, and all test surfaces.

Evaluator has the full evaluator workbench, evaluator-safe test surfaces, and own run/review/history routes, but no Team Review or Global Analysis.

Tester is the narrow prompt-testing onboarding lane. Tester can use Custom Prompt Test and Auto-generated Prompt Test, but cannot use verification, controlled batch, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.

---

## Current inspection and labeling surfaces

Eval Labs now has two additional high-power surfaces that must stay distinct.

### Registry Diagnostics

Registry Diagnostics is diagnostic and derived.

It reads existing run/review evidence and shows suggested Dataset Registry membership plus suggested Human Review Queue 2.0 lanes. It does not save human behavioral labels.

### Behavioral Observatory

Behavioral Observatory is a first-class behavioral labeling surface.

It lets a reviewer inspect a conversation and save structured labels for Intent, Guest Affect, Response Strategy, Humanness, and Notes. When Supabase confirms the save, the label becomes persisted Behavioral Observatory evidence.

---

## Shared Review Queue

Custom, auto-generated, and controlled batch runs all use the run/review/finalization infrastructure, but they do not mean the same thing operationally.

This matters.

We do not have separate human quality standards for Custom and Auto-generated tests. The generation path differs. The Lucia quality bar does not.

Controlled batch runs are platform evidence. They can prove the pipeline works, but they do not replace human quality judgment.

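
One way to keep the operational difference between run sources explicit is to tag each source with the kind of evidence it produces. This is a minimal sketch, not the platform's data model: `RunSource`, `EvidenceKind`, and both helper functions are hypothetical names introduced for illustration.

```typescript
// Hypothetical names; the split between quality evidence and platform evidence follows the text above.
type RunSource = "custom" | "auto_generated" | "controlled_batch";
type EvidenceKind = "lucia_quality" | "platform_readiness";

// Custom and auto-generated runs share one quality bar; controlled batches are platform evidence.
function evidenceKind(source: RunSource): EvidenceKind {
  switch (source) {
    case "custom":
    case "auto_generated":
      return "lucia_quality";
    case "controlled_batch":
      return "platform_readiness";
  }
}

// A run supports a human quality judgment only if it is a quality run AND a human reviewed it.
function countsTowardHumanQuality(source: RunSource, hasHumanReview: boolean): boolean {
  return evidenceKind(source) === "lucia_quality" && hasHumanReview;
}
```

With this shape, a controlled batch run never counts toward human quality, no matter how cleanly the pipeline ran.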

The Review Queue now includes suggested selections, semantic scoring sliders, Quick Review, Human Guidance Evaluation, save / next flows, search and workflow filters, export controls, and run finalization.

---

## Core objects

### Suite

A named group of prompts.

### Run

One execution of a suite or generated prompt batch. Runs may be Custom, Auto-generated, or controlled batch readiness runs.

### Run item

One prompt/response pair inside a run.

### Review

The human evaluation of a run item.

### Derived signal

A suggestion inferred from existing run/review data.

### Behavioral label

A saved Behavioral Observatory label for one run item and reviewer.

### Export

The portable evidence file used for analysis, comparison, and product work.

### Global Analysis

The read-only owner/admin evidence layer used to inspect AI-analyzed run behavior across shared persisted data.

### Team Review

The owner/admin oversight layer used to inspect evaluator activity, review gaps, flags, and recent human-evaluation signal.

---

## What is captured

Eval Labs captures:

- prompt text
- prompt order
- Lucia response
- run source
- custom suite identity
- run status
- review ratings
- suggested review values
- employee Quick Review
- Human Guidance Evaluation
- pass/fail decision
- review state and adjudication metadata
- reviewer notes
- review lifecycle
- derived Dataset Registry membership suggestions
- derived Human Review Queue 2.0 lane suggestions
- Behavioral Observatory labels when saved
- prompt dirty / completion state
- saved time
- saved-by tester identity
- exported-by tester identity
- Supabase run identity when cloud-backed
- local payload state when compacted
- role and scope context for access-controlled evidence

---

## Why exact prompt order matters

When refining Lucia, exact prompts matter.

Small language differences can reveal routing gaps.

For example:

```text
I'm overwhelmed.
```

_may route correctly, while:_

```text
I'm generally frazzled.
```

_may route incorrectly._

Eval Labs must preserve prompt text and order so developers can reproduce the issue.

---

## Evaluation test options

> [!summary]
> Custom Prompt Test for exact 1–10 prompts, Auto-generated Prompt Test for normal generated coverage, Guest Facing Agent Verification Check for booked-guest verification behavior, and Controlled Batch Runner for controlled platform-readiness checks.

---

>![[custom-launcher-empty.png]]
_Custom launcher example._

---

>![[automated-launched-ready.png]]
_Automated launcher example._

---

## Expanded review flow

The review portion of the loop now has more structure:

```text
Prompt
→ Lucia response
→ Suggested selections
→ Rating sliders
→ Employee Quick Review
→ Human Guidance Evaluation
→ Escalation routing
→ Senior adjudication when needed
→ Lifecycle finalization
→ Exported evidence
→ Lucia refinement
```

This prevents simple employee review from becoming accidental model-training truth.

See: [[03 - Review Architecture|Review Architecture]].

---

## Derived vs persisted

Use this distinction everywhere:

```text
derived = suggested from existing run/review evidence
persisted = saved and reloadable after Supabase confirms it
```

Registry Diagnostics is derived classification inspection.

Behavioral Observatory can start from derived context, but only a saved label is persisted Behavioral Observatory label data.

---

## Staged hydration

Dashboards should render from real persisted data without fake metrics.

The current platform loads run summaries first, then recent/deep evidence. This lets owner/admin and scoped user dashboards become useful faster while deeper item payloads continue hydrating.

If a metric requires deep evidence that has not loaded yet, label the limitation. Do not invent a number.

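
The "label the limitation" rule can be enforced in the type of a metric itself. The sketch below assumes hypothetical shapes (`RunSummary`, `RunItemEvidence`, `DashboardState`, `Metric`) because the real Eval Labs payloads are not shown here. It illustrates one way a dashboard metric could refuse to produce a number until deep evidence has hydrated.

```typescript
// Hypothetical shapes; the real Eval Labs payloads are not shown in this document.
interface RunSummary {
  runId: string;
  itemCount: number;
}

interface RunItemEvidence {
  runId: string;
  passed: boolean;
}

interface DashboardState {
  summaries: RunSummary[]; // stage 1: loaded first
  deepEvidence: RunItemEvidence[] | null; // stage 2: null until hydrated
}

// A metric is either a real number or an explicit, labelled limitation.
type Metric =
  | { status: "ready"; value: number }
  | { status: "limited"; reason: string };

// Needs only summaries, so it is available as soon as stage 1 finishes.
function totalItems(state: DashboardState): Metric {
  const value = state.summaries.reduce((sum, run) => sum + run.itemCount, 0);
  return { status: "ready", value };
}

// Needs deep evidence, so it reports a limitation instead of inventing a number.
function passRate(state: DashboardState): Metric {
  if (state.deepEvidence === null) {
    return { status: "limited", reason: "Deep evidence is still loading." };
  }
  if (state.deepEvidence.length === 0) {
    return { status: "limited", reason: "No reviewed items are available yet." };
  }
  const passed = state.deepEvidence.filter((item) => item.passed).length;
  return { status: "ready", value: passed / state.deepEvidence.length };
}
```

Because `Metric` is a discriminated union, rendering code has to handle the `limited` case before it can read `value`, which makes it harder to show a placeholder number by accident.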

---

## AI-reviewed platform readiness loop

The readiness loop is different from human Lucia-quality review:

```text
Controlled batch run
→ run creation
→ Lucia response capture
→ AI review generation
→ review persistence
→ run finalization
→ Run History truth
→ Global Analysis truth
→ Supabase verification
→ localStorage compactness check
```

The 60-run gate passed this loop.

That proves Eval Labs platform readiness. It does not prove Lucia is human-approved.
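
The loop above can be treated as an ordered checklist in which readiness requires every stage to pass. The following is a rough sketch under that assumption. The stage identifiers, `ReadinessResult`, and `evaluateReadiness` are hypothetical names that mirror the loop, not the actual controlled batch runner.

```typescript
// Stage names mirror the loop above; the identifiers themselves are hypothetical.
const READINESS_STAGES = [
  "run_creation",
  "lucia_response_capture",
  "ai_review_generation",
  "review_persistence",
  "run_finalization",
  "run_history_truth",
  "global_analysis_truth",
  "supabase_verification",
  "local_storage_compactness_check",
] as const;

type ReadinessStage = (typeof READINESS_STAGES)[number];

interface ReadinessResult {
  platformReady: boolean;
  firstFailedStage: ReadinessStage | null;
  // Always false: passing this loop says nothing about human approval of Lucia.
  luciaHumanApproved: false;
}

// Assumption: readiness means every stage passed; report the earliest stage that did not.
function evaluateReadiness(passed: ReadonlySet<ReadinessStage>): ReadinessResult {
  const firstFailedStage =
    READINESS_STAGES.find((stage) => !passed.has(stage)) ?? null;
  return {
    platformReady: firstFailedStage === null,
    firstFailedStage,
    luciaHumanApproved: false,
  };
}
```

The `luciaHumanApproved: false` field is deliberately constant, so a readiness result cannot be mistaken for a quality sign-off.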