Eval Labs Canon

# Product Architecture

> [!summary]
> Eval Labs is a separate role-based human evaluation platform that tests Lucia through the deployed Engine, stores review evidence in Supabase, and exposes scoped testing, run-history, review, Team Review, and Global Analysis surfaces.

---

## High-level architecture

```text
Employee / Reviewer
→ Eval Labs web app
→ Clerk role and Supabase RLS scope
→ Lucia Engine /admin/operator-focus
→ Lucia response
→ Eval Labs Review Queue
→ Suggested selections plus human review
→ Quick Review / Human Guidance Evaluation
→ Lifecycle finalization
→ Supabase persistence
→ Run History / Team Review / Global Analysis / Exports
```

---

## Runtime responsibility split

Eval Labs owns:

- top app shell and route identity
- test launcher UX
- custom prompt suite UX
- auto-generated prompt tester UX
- guest-facing verification check and results UX
- controlled batch runner UX
- run orchestration
- Run History
- Team Review
- Global Analysis
- Single Run Analysis
- copy Session ID / copy Deep Link controls
- role-gated product access
- Clerk public metadata role behavior
- Supabase RLS role-claim requirements
- Review Queue
- suggested review generation
- human ratings
- semantic scoring sliders
- Quick Review
- Human Guidance Evaluation
- review lifecycle and finalization
- dirty / completion state
- tester identity capture
- exports
- Supabase persistence for eval data
- staged hydration from run summaries to recent/deep evidence
- localStorage compaction for completed cloud-backed runs

Lucia Engine owns:

- actual Lucia behavior
- intent/routing
- response generation
- emotional containment
- operational prioritization
- model gateway behavior

Eval Labs does not decide Lucia's response quality. It records and evaluates the response Lucia produced.

AI-reviewed platform evidence proves that the Eval Labs lifecycle works. Human reviewers still decide Lucia behavioral quality.

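
The responsibility split above can be illustrated with a small TypeScript sketch. It is not the Eval Labs implementation: `RunItem`, `recordEngineResponse`, and `applyHumanRating` are hypothetical names, the numeric rating is an assumed scale, and the prompt and response strings are made-up sample data. The only point is that the recording step stores what the Engine returned without judging it, and the quality judgment arrives later from a human.

```typescript
// Hypothetical shapes for illustration; not taken from the Eval Labs codebase.
interface RunItem {
  prompt: string;
  luciaResponse: string;      // produced by Lucia Engine, stored verbatim
  humanRating: number | null; // set only by a human reviewer (scale is assumed)
}

// Eval Labs side: record what the Engine returned without judging or editing it.
function recordEngineResponse(prompt: string, luciaResponse: string): RunItem {
  return { prompt, luciaResponse, humanRating: null };
}

// Reviewer side: attach a human rating; the recorded response is left untouched.
function applyHumanRating(item: RunItem, rating: number): RunItem {
  return { ...item, humanRating: rating };
}

// Made-up sample data.
const recorded = recordEngineResponse("Where is the pool?", "The pool is on level 3.");
const reviewed = applyHumanRating(recorded, 4);
```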

---

## Current Engine target

Eval Labs endpoint selection is environment-configured through `VITE_LUCIA_EVAL_ENDPOINT`.

The current Lucia v0.1.3.6 validation target for active dev refinement is:

```text
https://api-dev.hellolucia.ai/admin/operator-focus
```

Development is where active iteration happens. Staging is for promoted validation only when intentionally configured.

---

## Source of truth hierarchy

When debugging Eval Labs platform behavior:

1. Browser Network request URL
2. Current route and role state
3. Supabase rows and counts
4. Run History / Analysis UI truth
5. localStorage diagnostics
6. Render service environment
7. Netlify environment variables
8. Lucia Engine deployed commit
9. Eval Labs deployed commit
10. Exported run metadata
11. Human memory

Human memory is useful. It is not the source of truth.

---

## Current route architecture

The current route map is documented in [[04 - Product Surfaces and Route Map|Product Surfaces and Route Map]].

Core canonical paths:

```text
/                            Owner/Admin Home dashboard
/lucia/launcher              workspace chooser
/lucia/custom                Custom prompt tester
/lucia/auto-generated        Auto-generated 50-prompt tester
/guest-facing/verification   Guest Facing Agent Verification Check
/lucia/batch-runner          Controlled Batch Runner
/lucia/automated/runs        Run History
/team-review                 Team Review
/analysis                    Global Analysis
/analysis/runs/:sessionId    Single Run Analysis
/runs/:sessionId/review      Review Queue
```

Legacy aliases:

```text
/lucia/automated   alias to /lucia/auto-generated
/experiments       alias to /analysis
```

---

## Current role architecture

Role gating is documented in [[05 - Role and Access Model|Role and Access Model]].

Current supported Clerk public metadata values:

```text
owner
admin
evaluator
tester
```

Owner/admin are privileged roles with full platform access, Team Review, Global Analysis, shared persisted evidence, and all test surfaces.

Evaluator is the full evaluator workbench role.
Evaluator can use evaluator-safe test surfaces and own run/review/history routes, but cannot use Team Review or Global Analysis.

Tester is the entry-level prompt-testing lane. Tester can use Custom Prompt Test and Auto-generated Prompt Test, but not verification, controlled batch, Team Review, Global Analysis, Registry Diagnostics, Behavioral Observatory, or owner/admin tools.

Missing or unknown role metadata should fail closed.

Frontend role behavior is driven by Clerk public metadata. Persisted evidence access depends on the Clerk session token carrying `eval_labs_role` so Supabase RLS can recognize privileged owner/admin access.

---

## Important design decision

The custom prompt feature did not require a separate database model because Eval Labs already had a general structure:

```text
Session
→ Run items
→ Lucia responses
→ Human reviews
```

Custom prompts are a new run source, not a new evaluation universe. That is good architecture.

---

## Current run source strategy

Custom runs use:

```text
mode: automated
runSource: custom
category: custom/prompts
templateKey: custom-prompt
```

This preserves compatibility with the existing run engine while clearly distinguishing custom runs from generated automated runs.

Controlled batch runs use the same platform lifecycle to create, execute, review, finalize, persist, and verify runs. They are operational readiness evidence, not a separate human-review standard.

Guest Facing Agent Verification Check is a separate app surface for booked-guest verification behavior and results. It is evaluator-safe but not tester-facing.

---

## Review-layer architecture

Eval Labs now separates review responsibility into layers:

```text
Review Queue UI
→ Suggested review values
→ Employee Review fields
→ Human Guidance Evaluation
→ Review State / Escalation flags
→ Lifecycle / dirty / completion state
→ Adjudication metadata
→ Exports / Analysis
```

The schema supports high-resolution analysis while the employee UI remains simple.
This is intentional. The user-facing review experience should remain calm and guided even when the exported data is detailed.
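
One way to picture the review layers, and the gap between the simple reviewer UI and the detailed export, is the TypeScript sketch below. Every field name here (`rating`, `notes`, `helpful`, `escalated`, `adjudicatedBy`, and so on) is a hypothetical stand-in, since this document does not show the real schema, and the fallback from human rating to suggested rating is an assumption made for illustration.

```typescript
// Hypothetical field names for illustration; the real Eval Labs schema is not shown here.
interface ReviewRecord {
  suggested: { rating: number };                      // Suggested review values
  employee: { rating: number | null; notes: string }; // Employee Review fields
  guidance: { helpful: boolean | null };              // Human Guidance Evaluation
  state: { escalated: boolean };                      // Review State / Escalation flags
  lifecycle: { dirty: boolean; completed: boolean };  // Lifecycle / dirty / completion state
  adjudication: { adjudicatedBy: string | null };     // Adjudication metadata
}

// The small slice a reviewer-facing form might need.
interface EmployeeView {
  rating: number;
  notes: string;
  completed: boolean;
}

// Simple view: prefer the human rating, fall back to the suggestion (assumed behavior).
function toEmployeeView(r: ReviewRecord): EmployeeView {
  return {
    rating: r.employee.rating ?? r.suggested.rating,
    notes: r.employee.notes,
    completed: r.lifecycle.completed,
  };
}

type ExportValue = string | number | boolean | null;

// Detailed view: every layer is flattened into "layer.field" columns for export.
function toExportRow(r: ReviewRecord): Record<string, ExportValue> {
  return {
    "suggested.rating": r.suggested.rating,
    "employee.rating": r.employee.rating,
    "employee.notes": r.employee.notes,
    "guidance.helpful": r.guidance.helpful,
    "state.escalated": r.state.escalated,
    "lifecycle.dirty": r.lifecycle.dirty,
    "lifecycle.completed": r.lifecycle.completed,
    "adjudication.adjudicatedBy": r.adjudication.adjudicatedBy,
  };
}

// Made-up sample: a review the employee has opened but not yet rated.
const sample: ReviewRecord = {
  suggested: { rating: 3 },
  employee: { rating: null, notes: "" },
  guidance: { helpful: null },
  state: { escalated: false },
  lifecycle: { dirty: true, completed: false },
  adjudication: { adjudicatedBy: null },
};
```

In this sketch the reviewer-facing view carries three fields while the export row carries eight, which mirrors the idea that the stored data can stay detailed without the review form growing with it.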