Eval Labs Canon

# Data Model and Export Contract

> [!summary]
> Eval Labs data is designed to preserve prompt context, run source, Lucia response, layered review judgment, lifecycle state, and tester identity.

---

## Session metadata

A session export includes metadata such as:

```text
id
title
mode
runSource
category
subcategory
templateKey
promptCount
status
createdAt
updatedAt
adminBranch
engineBranch
runFailureType
runFailureReason
runFailureAt
reviewLifecycle
remoteRunId
ownerUserId
ownerScopeVersion
localPayloadState
```

The important field for the custom launcher is:

```text
runSource: custom
```

Custom and Auto-generated launchers both create run sessions that flow into the same Review Queue. The session `mode` remains the run mechanics; `runSource` distinguishes whether the run came from the Custom launcher or the Auto-generated launcher.

Controlled batch readiness runs also rely on the same run/session lifecycle, but their product meaning is different: they are platform readiness evidence, not normal evaluator-facing tests.

Registry Diagnostics reads existing session/run/review evidence and derives dataset membership and review-lane suggestions from it. Those suggestions are diagnostic output, not saved labels. Behavioral Observatory labels are separate persisted records when saved through the Behavioral Observatory flow.

---

## Cases

Each prompt becomes a case. A case contains:

```text
id
sessionId
orderIndex
sourceType
title
promptText
promptLocked
luciaResponse
runStatus
runError
category
subcategory
templateKey
createdAt
updatedAt
```

Order matters. The exported `orderIndex` must remain stable.

---

## Prompt results

Prompt results include:

- draft review state, including reviewer input and suggestions
- saved review state
- saved timestamp
- saved-by tester identity
- completion/dirty state derived from saved vs draft review

A generated-but-unreviewed item may have null ratings and `savedBy: null`. That is expected.

A saved review should include `savedBy`.
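The `savedBy` expectation for prompt results can be checked mechanically. The following is a minimal sketch, not the Eval Labs implementation: it assumes each prompt result is a dict with `saved` and `savedBy` keys, and the function name `find_missing_saved_by` and the sample data are hypothetical.

```python
from typing import Any


def find_missing_saved_by(prompt_results: dict[str, dict[str, Any]]) -> list[str]:
    """Return case ids whose review was saved without a `savedBy` identity.

    Assumption: each prompt result carries `saved` and `savedBy` keys.
    Unreviewed items (saved is None) are skipped, because `savedBy: null`
    is expected for them.
    """
    missing = []
    for case_id, result in prompt_results.items():
        has_saved_review = result.get("saved") is not None
        if has_saved_review and result.get("savedBy") is None:
            missing.append(case_id)
    return missing


# Hypothetical sample data: case-001 is unreviewed, case-002 was saved
# without a tester identity and should be flagged.
example_results = {
    "case-001": {"draft": {}, "saved": None, "savedAt": None, "savedBy": None},
    "case-002": {
        "draft": {},
        "saved": {"ratings": {"tone": 4}},
        "savedAt": "2026-04-29T19:40:00.000Z",
        "savedBy": None,
    },
}

print(find_missing_saved_by(example_results))  # ['case-002']
```

A check like this flags only the suspicious case (a saved review with no author) and leaves expected nulls on unreviewed items alone.
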

---

## `exportedBy` vs `savedBy`

`exportedBy` identifies who exported the file.

`savedBy` identifies who reviewed/saved the individual prompt.

This distinction matters because one person may export a run that another person reviewed.

---

## Tester identity fields

Eval Labs stores only limited identity fields:

```text
clerkUserId
email
name
```

No unnecessary Clerk metadata should be stored.

Role gating reads Clerk public metadata through `eval_labs_role`, but role metadata is product access state, not review authorship.

---

## Nulls in exports

Some nulls are normal. Expected nulls include:

```text
runFailureType: null
runFailureReason: null
runFailureAt: null
savedBy: null when not reviewed yet
ratings: null when not scored yet
```

Do not treat every null as a bug. Treat nulls as suspicious only when the workflow step should have populated them.

---

## Export Options and Example

>![[export-controls.png]]
_Export controls for easy usability with multiple data formats._

```json
{
  "format": "lucia-eval-lab-session/v0.3",
  "exportedAt": "2026-04-29T19:33:10.000Z",
  "exportedBy": {
    "clerkUserId": "user_3D2BItLYUO1uqJOqzlZTvHZNgsF",
    "email": "[email protected]",
    "name": "Aviv Hadar"
  },
  "session": {
    "metadata": {
      "id": "session-example",
      "runSource": "custom",
      "status": "ready",
      "reviewLifecycle": {
        "status": "in_review",
        "finalizedAt": null,
        "finalizedBy": null
      }
    },
    "caseOrder": ["case-001"],
    "cases": {
      "case-001": {
        "orderIndex": 0,
        "promptText": "I'm spinning a little. Tell me what to do first so I can breathe again.",
        "luciaResponse": "Take a breath. This feels heavier than it is. Nothing critical is slipping beyond the first move.",
        "runStatus": "success"
      }
    },
    "promptResults": {
      "case-001": {
        "draft": {},
        "saved": null,
        "savedAt": null,
        "savedBy": null
      }
    }
  }
}
```

---

## Review-layer fields

Prompt reviews now support these additional fields:

```text
ratings
suggestedRatings
keepTalking
suggestedKeepTalking
status
suggestedStatus
priority
suggestedPriority
reviewState
luciaPredictedLabels
humanLabels
adjudication
employeeReview
suggestedEmployeeReview
humanGuidanceEval
suggestedHumanGuidanceEval
canonCandidate
```

The suggested fields are product suggestions, not final reviewer judgment. They are generated from prompt/response/run-status heuristics and remain separate from the reviewer-saved values.

Suggested fields may feed derived context in Registry Diagnostics or Behavioral Observatory, but they do not become persisted Behavioral Observatory labels unless a reviewer saves a label in the Behavioral Observatory surface.

---

## Employee Review object

Employee Review captures guided non-expert signal:

```text
understoodNeed
rightNextMove
calmingEffect
riskOrConfusion
seniorReview
reusableLearning
```

These fields are intentionally simple and should remain employee-friendly.

---

## Human Guidance Evaluation object

Human Guidance Evaluation captures a 1-5 review layer:

```text
emotionalValidation
cognitiveUnderstanding
actionability
toneAppropriateness
authenticity
notes
```

Warmth and intelligence are not separate export fields. They are expressed through the current scoring dimensions and guidance fields: `tone`, `calming`, `naturalness`, `trust`, `usefulness`, `cognitiveUnderstanding`, `actionability`, and `authenticity`.
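The 1-5 range of the Human Guidance Evaluation layer lends itself to a simple range check. The sketch below is illustrative only: it assumes scores are integers, that an unset score is `None`, and that `notes` is free text; the function name `invalid_guidance_scores` is hypothetical.

```python
# Score fields of the Human Guidance Evaluation object; `notes` is excluded
# because this sketch assumes it is free text rather than a score.
SCORE_FIELDS = (
    "emotionalValidation",
    "cognitiveUnderstanding",
    "actionability",
    "toneAppropriateness",
    "authenticity",
)


def invalid_guidance_scores(guidance_eval: dict) -> dict:
    """Return the score fields that are set but are not integers in 1-5.

    Assumption: None means "not scored yet" and is not reported as an error.
    """
    problems = {}
    for field in SCORE_FIELDS:
        value = guidance_eval.get(field)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 5:
            problems[field] = value
    return problems


# Hypothetical sample: one out-of-range score, one unset score.
sample_eval = {
    "emotionalValidation": 5,
    "actionability": 0,
    "authenticity": None,
    "notes": "Clear first step.",
}

print(invalid_guidance_scores(sample_eval))  # {'actionability': 0}
```
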

---

## Adjudication object

Adjudication captures final senior-review meaning when it exists in the review record:

```text
finalLabels
reason
adjudicator
adjudicatedAt
```

Final labels may include:

```text
guestIntent
followThroughRequired
actionType
emotionalRead
ownerStressLevel
```

---

## Review lifecycle object

Run lifecycle finalization is stored at the session level:

```text
status: in_review | ready_to_finalize | finalized
finalizedAt
finalizedBy
```

Finalization does not replace per-prompt review data. It marks the run lifecycle after all prompts are reviewed.

---

## Supabase persistence contract

The Supabase persistence layer stores the run and item contract in the following locations, spread across three tables:

```text
eval_runs.metadata.reviewLifecycle
eval_runs.metadata.metadata
eval_run_items.payload.case
eval_run_items.payload.promptRecord
eval_item_reviews
```

`eval_run_items.payload.promptRecord` embeds the full prompt review record, including saved/draft state, suggested review, employee review, Human Guidance Evaluation, adjudication metadata, canon candidate signal, tester identity, and dirty/completion state.

`eval_item_reviews` is still written for review rows, but hydration prefers the embedded `eval_run_items.payload.promptRecord` instead of relying on fragile review-table reads.

Behavioral Observatory labels are stored separately:

```text
public.eval_behavioral_labels
```

This table stores first-class Behavioral Observatory labels:

```text
run_id
run_item_id
owner_user_id
reviewer_user_id
intent
guest_affect
response_strategy
humanness
notes
status
payload
created_at
updated_at
```

The key distinction:

```text
eval_item_reviews = Review Queue review evidence
eval_behavioral_labels = Behavioral Observatory label evidence
```

One saved Behavioral Observatory label exists per reviewer per run item.

Current persisted run evidence is scoped by the signed-in Clerk user and the role claim available to Supabase RLS.
Owner/admin can inspect shared persisted evidence where privileged RLS allows it. Evaluator and tester data remains scoped to their own work except where owner/admin oversight applies.

Current readiness verification checks counts across:

```text
public.eval_runs
public.eval_run_items
public.eval_item_reviews
```

For the 60-run readiness gate, the final verified result was:

```text
ready | 60 | 3000 | 3000 | 3000 | 3000
```

Meaning:

- 60 ready runs
- 3,000 expected prompts
- 3,000 run items
- 3,000 non-empty Lucia responses
- 3,000 reviews for the tested reviewer id

---

## localStorage compaction contract

Completed cloud-backed runs should not leave full item-level payloads persisted in localStorage.

The platform-readiness diagnostic target is:

```text
persistedLocalFullPayloadSessionCount = 0
persistedLocalHasItemLevelData = false
persistedLocalItemLevelDataSessionCount = 0
ownedSessionCount = expected run count
otherOwnerSessionCount = 0
ownerlessSessionCount = 0
```

The final verified 60-run diagnostic was:

```text
sessionCount = 60
persistedLocalFullPayloadSessionCount = 0
persistedLocalHasItemLevelData = false
persistedLocalItemLevelDataSessionCount = 0
ownedSessionCount = 60
otherOwnerSessionCount = 0
ownerlessSessionCount = 0
rawByteSize ≈ 68,815
```

This supports platform readiness and client compactness. It does not prove backend authorization is complete.

---

## Export rule

Exports should preserve the full review contract:

```text
employeeReview = what the reviewer experienced
suggestedEmployeeReview = what the app suggested
humanGuidanceEval = structured 1-5 human guidance
suggestedHumanGuidanceEval = app-suggested human guidance
adjudication = final senior-review metadata when present
reviewLifecycle = whether the run is still in review or finalized
```

Do not collapse these into one field.
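One way to guard the export rule is to confirm that each review layer still appears under its own key. This is a rough sketch under stated assumptions, not the Eval Labs exporter: it assumes a prompt review is a dict, that a preserved layer may hold `None`, and that `adjudication` is optional because it exists only when a senior review happened. The name `missing_review_layers` and the sample record are hypothetical.

```python
# Review layers that this sketch expects to find as separate keys on a
# prompt review. `adjudication` is left out because it is present only
# when a senior review exists; `reviewLifecycle` lives at the session level.
REVIEW_LAYER_KEYS = (
    "employeeReview",
    "suggestedEmployeeReview",
    "humanGuidanceEval",
    "suggestedHumanGuidanceEval",
)


def missing_review_layers(review: dict) -> list[str]:
    """List review-layer keys that are absent from a prompt review.

    A key that is present with a None value still counts as preserved.
    Requiring key presence at all is an assumption of this sketch.
    """
    return [key for key in REVIEW_LAYER_KEYS if key not in review]


# Hypothetical record in which the app-suggested layers were dropped.
collapsed_review = {
    "employeeReview": {"understoodNeed": True},
    "humanGuidanceEval": {"actionability": 4},
}

print(missing_review_layers(collapsed_review))
# ['suggestedEmployeeReview', 'suggestedHumanGuidanceEval']
```
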