Eval Labs Canon

# Eval Labs Review Layer Release Notes

> [!summary]
> This page records the May 2026 review-layer evolution: shared run launchers, employee review, suggested selections, Human Guidance Evaluation, adjudication-ready schema, exports, queue filters, lifecycle finalization, and the later platform-readiness split between normal testing and controlled batch gates.

---

## May 2026 review-layer milestone

Eval Labs evolved from a prompt runner into a layered review product.

The key change:

```text
custom or automated run
→ shared Review Queue
→ suggested selections plus human review
→ lifecycle finalization and export
```

---

## Major shipped changes

### Adjudication-ready schema

Added review model support for:

```text
reviewState
luciaPredictedLabels
humanLabels
adjudication
employeeReview
suggestedEmployeeReview
humanGuidanceEval
suggestedHumanGuidanceEval
canonCandidate
reviewLifecycle
```

These fields are preserved through local storage, Supabase payload persistence, dirty-state detection, and exports.

---

### Employee Review layer

Added guided employee-review fields:

```text
understoodNeed
rightNextMove
calmingEffect
riskOrConfusion
seniorReview
reusableLearning
```

These replace freeform taxonomy collection for non-expert reviewers.

The app can also suggest Employee Review answers from prompt/response heuristics. The suggestion is visible as suggested signal; the reviewer still saves the human review.

---

### Suggested review layer

Added app-suggested review values for:

```text
1-10 ratings
keepTalking
pass / refine / fail
priority
Employee Review answers
1-5 Human Guidance Evaluation scores
```

These suggestions come from prompt text, Lucia response text, run status, run errors, and simple response-quality heuristics such as clear next move, calming language, list-heavy output, robotic language, fake empathy, and overclaiming.

They are not canonical truth.

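
The heuristic suggestions described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `suggestReview`, the phrase lists, the starting score, the thresholds, and the run-status values are hypothetical and are not taken from the Eval Labs codebase. The sketch uses only response text and run status, and it tags its output as app-suggested so it cannot be mistaken for a saved human review.

```typescript
// Hypothetical sketch: names, phrase lists, and thresholds are illustrative only.
type Verdict = "pass" | "refine" | "fail";

interface SuggestionInput {
  responseText: string;
  runStatus: "completed" | "failed"; // assumed status values
  runError?: string;
}

interface SuggestedReview {
  rating: number; // 1-10 scale
  verdict: Verdict;
  signals: string[]; // which heuristics fired
  source: "app-suggested"; // never stored as a human review
}

const NEXT_MOVE_PHRASES = ["next step", "you can start by", "try this"];
const ROBOTIC_PHRASES = ["as an ai", "i am unable to", "per your request"];
const OVERCLAIM_PHRASES = ["guaranteed", "always works", "100%"];

function containsAny(text: string, phrases: string[]): boolean {
  const lower = text.toLowerCase();
  return phrases.some((phrase) => lower.includes(phrase));
}

function suggestReview(input: SuggestionInput): SuggestedReview {
  // A failed run or a run error short-circuits to a suggested fail.
  if (input.runStatus === "failed" || input.runError) {
    return { rating: 1, verdict: "fail", signals: ["run-error"], source: "app-suggested" };
  }

  const signals: string[] = [];
  let rating = 6; // assumed neutral starting point

  if (containsAny(input.responseText, NEXT_MOVE_PHRASES)) {
    rating += 2;
    signals.push("clear-next-move");
  }
  if (containsAny(input.responseText, ROBOTIC_PHRASES)) {
    rating -= 2;
    signals.push("robotic-language");
  }
  if (containsAny(input.responseText, OVERCLAIM_PHRASES)) {
    rating -= 2;
    signals.push("overclaiming");
  }

  // Count lines that look like bullet or numbered list items.
  const listLines = input.responseText
    .split("\n")
    .filter((line) => /^\s*([-*]|\d+\.)\s/.test(line)).length;
  if (listLines > 5) {
    rating -= 1;
    signals.push("list-heavy");
  }

  rating = Math.max(1, Math.min(10, rating));
  const verdict: Verdict = rating >= 7 ? "pass" : rating >= 4 ? "refine" : "fail";
  return { rating, verdict, signals, source: "app-suggested" };
}
```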

---

### Review Queue UX

The Review Queue now favors guided employee judgment:

- single-column Quick Review flow
- numbered question cards
- separate selection boxes
- suggested selections
- reduced freeform text burden
- senior-review routing
- canon-candidate routing
- Human Guidance Evaluation
- Save / Save & Next / Save & next flagged flows
- search and workflow filters
- JSON, CSV, and Markdown export controls
- finalization after all prompts are reviewed

---

### Semantic confidence sliders

The “How did Lucia do?” scoring section moved from 1–10 button rows to stepped semantic confidence sliders.

The final design direction:

```text
low score → muted concern
mid score → soft uncertainty
high score → restrained confidence
```

The sliders should feel like native OS controls: calm, premium, tactile, and low-friction.

---

### Adjudication queue filters

Added workflow queue filters for:

```text
Needs final call
Canon candidates
```

This lets senior review focus on the cases that matter most.

This release supports adjudication routing, metadata, and exports. It does not depend on a separate senior-adjudication editing screen.

---

### Exports

JSON, CSV, and Markdown exports now preserve structured review, suggested review, Employee Review, Human Guidance Evaluation, adjudication metadata, lifecycle state, tester identity, and prompt dirty/completion state.

---

### Supabase persistence

Supabase persistence now stores run lifecycle metadata on `eval_runs`, embeds the full case and prompt review record in `eval_run_items.payload`, and writes `eval_item_reviews` rows for review persistence.

Hydration prefers the embedded `eval_run_items.payload.promptRecord` over fragile review-table reads.

---

## Current doctrine impact

This release established a new Eval Labs principle:

```text
The app may suggest.
The reviewer must decide.
Senior meaning stays separate from employee signal.
```

This should be protected in future product work.

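
One way to keep this principle enforceable in code is to store the app's suggestion and the reviewer's saved answer in separate fields, and to let finalization count only the saved answer. The sketch below is hypothetical: the field names `suggestedEmployeeReview` and `employeeReview` echo the schema listed in these notes, but the record shape and the functions `acceptSuggestion` and `canFinalize` are assumptions, not the shipped implementation. Accepting a suggestion is modeled as an explicit reviewer action that copies the value and records who saved it.

```typescript
// Hypothetical shapes: field names echo the release notes, the structure is assumed.
type Verdict = "pass" | "refine" | "fail";

interface PromptReview {
  promptId: string;
  suggestedEmployeeReview?: { verdict: Verdict }; // app signal only
  employeeReview?: { verdict: Verdict; savedBy: string }; // human decision
}

// Only a saved human review counts; a suggestion alone never does.
function isHumanReviewed(record: PromptReview): boolean {
  return record.employeeReview !== undefined;
}

// The reviewer explicitly accepts the suggestion, which saves it as their own answer.
function acceptSuggestion(record: PromptReview, reviewer: string): PromptReview {
  if (!record.suggestedEmployeeReview) return record;
  return {
    ...record,
    employeeReview: { verdict: record.suggestedEmployeeReview.verdict, savedBy: reviewer },
  };
}

// Finalization is allowed only once every prompt in the run has a human review.
function canFinalize(records: PromptReview[]): boolean {
  return records.length > 0 && records.every(isHumanReviewed);
}
```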

---

## Product surface refinement

After the review-layer release, Eval Labs was refined into a clearer product surface:

- top app shell owns page identity
- in-page blog-style mastheads were removed from the app
- Custom Prompt Test, Auto-generated Prompt Test, and Controlled Batch Runner are separate surfaces
- Controlled Batch Runner is controlled readiness tooling; current access is owner/admin/evaluator, not tester
- Auto-generated Prompt Test remains the normal 50-prompt generated tester
- Run History rows use a standardized two-zone layout
- copy controls use Copy Session ID / Copy Deep Link patterns across key surfaces
- Single Run Analysis gives read-only run-level evidence outside the Review Queue

---

## Readiness doctrine added

The AI-reviewed platform readiness gate passed after 60 completed runs and 3,000 reviewed prompts.

This extends the review-layer doctrine:

```text
The app may suggest.
The reviewer must decide.
AI-reviewed platform readiness is not human Lucia-quality approval.
```

Protect this distinction in future release notes and onboarding language.