Eval Labs Canon

# Eval Labs Step-by-Step Operator Guide

> [!summary]
> This is the simple operator guide for the major Eval Labs surfaces. Use it to avoid confusing diagnostic pages, review pages, analysis pages, and persistence truth.

---

## Status labels

Use these words exactly:

```text
implemented = present product/code path
active hardening = implemented path still being validated, polished, or tightened
diagnostic = read-only inspection
derived = suggested from existing data
persisted = saved and reloadable from Supabase
future = planned or possible later
deferred = intentionally outside current role/surface scope
```

---

## A. Home

Use Home when you need the owner/admin overview.

Home shows:

- platform status
- recent activity
- quick access to major surfaces
- product state
- readiness or evidence summaries when available

Do not treat Home as a human approval page.

Home can show platform progress. Human Lucia-quality approval still comes from human review.

---

## B. Registry Diagnostics

Use Registry Diagnostics when you need to inspect derived classification behavior.

Route:

```text
/registry-diagnostics
```

Why it exists:

```text
To show why existing Eval Labs data appears to match datasets and review queue lanes.
```

How to read the top summary:

1. Check how many datasets exist.
2. Check how many runs are included.
3. Check how many suggested memberships exist.
4. Check confidence breakdowns.
5. Treat the whole page as diagnostic.

How to inspect dataset cards:

1. Read the dataset name.
2. Read the suggested membership count.
3. Check confidence.
4. Check source fields.
5. Ask whether the evidence is real or thin.

How to inspect queue lanes:

1. Read the suggested lane.
2. Check which items triggered it.
3. Check confidence.
4. Look for broad or weak matches.

How to use Noise / Watch:

1. Look for overmatching.
2. Look for weak low-confidence matches.
3. Look for items suggested for too many datasets.
4. Write down issues for product or engineering.

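
The Noise / Watch checks above can be pictured as a small triage pass over suggested memberships. The following is only an illustrative sketch: the `Suggestion` record, the `noise_watch` function, both thresholds, and the sample dataset names are hypothetical and are not taken from the Eval Labs codebase or from the Registry Diagnostics page itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical shape of one derived dataset suggestion."""
    item_id: str
    dataset: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def noise_watch(suggestions, low_confidence=0.4, max_datasets=2):
    """Return (weak, overmatched) for manual follow-up.

    Both thresholds are illustrative defaults, not product settings.
    """
    # Weak matches: suggestions below the confidence floor.
    weak = [s for s in suggestions if s.confidence < low_confidence]

    # Overmatching: one item suggested for more datasets than allowed.
    datasets_by_item = defaultdict(set)
    for s in suggestions:
        datasets_by_item[s.item_id].add(s.dataset)
    overmatched = {
        item_id: sorted(names)
        for item_id, names in datasets_by_item.items()
        if len(names) > max_datasets
    }
    return weak, overmatched


# Made-up sample data for illustration only.
sample = [
    Suggestion("run-1", "refunds", 0.91),
    Suggestion("run-2", "refunds", 0.22),
    Suggestion("run-3", "refunds", 0.55),
    Suggestion("run-3", "late-checkout", 0.51),
    Suggestion("run-3", "complaints", 0.48),
]
weak, overmatched = noise_watch(sample)
```

In this sketch `weak` holds the low-confidence `run-2` suggestion and `overmatched` lists `run-3` with its three datasets. Like the page itself, the output is diagnostic: it is a list of issues to write down, not a change to dataset membership or queue routing.
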
What not to assume:

- Do not assume dataset membership is final.
- Do not assume queue routing is final.
- Do not assume a human saved a label.
- Do not use this page as Behavioral Observatory.

---

## C. Behavioral Observatory

Use Behavioral Observatory when you need to review conversations and save behavioral labels.

Route:

```text
/behavioral-observatory
```

Before using it:

1. Confirm your role and assignment allow access.
2. Confirm the data shown is from the scoped run context you intend to review.
3. Notice whether the selected conversation is showing a derived suggestion or a saved label.

Select a conversation:

1. Open the labeling queue.
2. Pick one conversation.
3. Read the Human message first.
4. Read Lucia's response second.

Set Intent:

1. Choose what the human was trying to do.
2. Use `Other` only when the listed categories do not fit.

Set Guest Affect:

1. Choose the smallest truthful emotional read.
2. Do not dramatize the guest's state.

Set Response Strategy:

1. Choose Lucia's main response move.
2. Pick the dominant strategy, not every strategy present.

Set Humanness:

```text
1 = Template
4 = Functional
7 = Warm + Specific
```

Do not use humanness as a substitute for truth, usefulness, or safety.

Add notes:

1. Add a note only when it preserves useful behavioral evidence.
2. Keep it short.
3. Name the behavior and why it matters.

Save label:

1. Click Save label or Save updates.
2. Wait for the saved state.
3. If the state is error, do not count the label as persisted.

Refresh/check saved status if needed:

1. Refresh the page.
2. Confirm the label reloads.
3. If it does not reload, treat the label as not verified.

---

## D. Guest Facing Agent Verification

Use Guest Facing Agent Verification when you need to run or inspect booked-guest verification behavior.

Routes:

```text
/guest-facing/verification
/guest-facing/verification/results
```

Verification is for:

- running the scenario pack from the app surface
- inspecting pass/fail results
- reviewing failure details
- exporting or copying verification summaries

Verification is not:

- a tester lane
- Team Review
- Global Analysis
- proof that Lucia is human-approved

---

## E. Team Review

Use Team Review when owner/admin needs oversight of evaluator activity and review quality.

Route:

```text
/team-review
```

Team Review is for:

- inspecting evaluator activity
- finding missing checks
- spotting flags and failures
- reviewing recent human-evaluation signal
- deciding what needs owner/admin attention

Team Review is not:

- evaluator productivity tracking for its own sake
- tester onboarding
- a human approval page

---

## F. Global Analysis

Use Global Analysis when you need read-only behavioral and analytics evidence.

Route:

```text
/analysis
```

Global Analysis is for:

- inspecting completed run evidence
- reading behavioral summaries
- comparing patterns
- opening Single Run Analysis when available

Global Analysis is not:

- a human approval page
- a Behavioral Observatory label-save workflow
- Registry Diagnostics

---

## G. Run History

Use Run History when you need the run ledger.

Route:

```text
/lucia/automated/runs
```

Run History is for:

- finding completed/finalized runs
- checking run lifecycle truth
- copying run/session identifiers
- opening review or analysis routes when allowed

Run History is not:

- proof that a response was good
- proof that Behavioral Observatory labels exist
- proof that Lucia is human-approved

---

## Operator rule

Use the right surface for the right truth:

```text
Run History = run ledger truth
Team Review = owner/admin oversight truth
Global Analysis = read-only evidence truth
Registry Diagnostics = derived classification truth
Behavioral Observatory = saved behavioral label truth
Review Queue = prompt/item review workflow
Human reviewers = Lucia quality judgment
```
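
As a rough illustration, the operator rule can be treated as a lookup from surface to the kind of truth it provides. The dictionary below copies the mapping from this guide. The names `SURFACE_TRUTH` and `sources_for` are hypothetical helpers for this sketch, and nothing here implies the product stores the rule this way.

```python
# Hypothetical lookup mirroring the operator rule in this guide.
SURFACE_TRUTH = {
    "Run History": "run ledger truth",
    "Team Review": "owner/admin oversight truth",
    "Global Analysis": "read-only evidence truth",
    "Registry Diagnostics": "derived classification truth",
    "Behavioral Observatory": "saved behavioral label truth",
    "Review Queue": "prompt/item review workflow",
    "Human reviewers": "Lucia quality judgment",
}


def sources_for(keyword):
    """Return every surface whose stated truth mentions the keyword."""
    needle = keyword.lower()
    return [
        name
        for name, truth in SURFACE_TRUTH.items()
        if needle in truth.lower()
    ]
```

For example, `sources_for("label")` returns only `Behavioral Observatory`, and `sources_for("quality")` returns only `Human reviewers`, which matches the guide's point that no page stands in for human review of Lucia quality.
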