Eval Labs Canon

# Review Workflow

> [!summary]
> Review is where Eval Labs becomes useful. The reviewer's job is to judge behavior honestly, not politely. AI-reviewed platform evidence does not replace this human judgment.

---

## Review order

Use this order:

1. intent
2. truth
3. usefulness
4. clarity
5. tone
6. next move
7. trust aftertaste

---

## 1. Intent

Did Lucia understand what the user was asking?

If intent is wrong, the response usually fails.

For example, if the user says:

```text
I feel totally out of the loop.
```

Lucia should not respond with a generic capability menu. That is likely an intent-layer miss.

---

## 2. Truth

Did Lucia claim anything she could not know or verify?

Truth failures are serious.

Examples:

- saying a vendor was contacted when no dispatch happened
- saying an issue is resolved when only a suggestion was made
- implying full confidence when the signal is inferred

---

## 3. Usefulness

Did the response help the user move forward?

A response can be warm and still useless.

---

## 4. Clarity

Was the response easy to understand without extra work?

Lucia should not make the operator scan five paragraphs to find the first move.

---

## 5. Tone

Was the tone appropriate for the moment?

For Lucia, tone should be:

```text
warm
calm
specific
not robotic
not therapy-bot
```

---

## 6. Next move

Did Lucia give the right next move when a next move was needed?

Not every prompt requires a task. But distress and ops prompts usually require narrowing.

---

## 7. Trust aftertaste

After reading the response, ask:

```text
Do I trust Lucia more, less, or the same?
```

If the answer is less, write down why.

---

## Saving reviews

Use:

- **Save & Next** for non-final prompts
- **Save** for the final prompt
- **Finalize Run** when the run review is complete

Finalization marks the run lifecycle. It does not replace per-prompt review data.

---

## Reviewer discipline

> [!warning]
> Do not pass a response just because it sounds smart. Pass it only if it works.

AI-reviewed readiness runs can prove the platform captured and persisted reviews. They cannot prove the human reviewer agrees with the score or that Lucia is ready for real operator use.

---

## Review Queue vs Behavioral Observatory

Review Queue and Behavioral Observatory are related, but they are not the same workflow.

Review Queue is where the reviewer scores and reviews the prompt/response item.

Behavioral Observatory is where a reviewer can save structured behavioral labels for a conversation:

```text
Intent
Guest Affect
Response Strategy
Humanness
Notes
```

Registry Diagnostics is separate again. It shows derived dataset and queue-lane suggestions, not saved human labels.

---

## Updated Review Queue flow

Use this practical flow:

1. read the prompt
2. read Lucia’s response
3. review any suggested selections
4. score the five dimensions with the semantic confidence sliders
5. answer Quick Review questions
6. add Human Guidance Evaluation scores when useful
7. add a short note only if needed
8. flag senior review when uncertain or concerned
9. mark reusable learning only when the case teaches a durable lesson
10. save and move on

If the assignment includes Behavioral Observatory, use the saved-label workflow after reading the conversation carefully. Do not copy derived suggestions blindly.

---

## Quick Review rule

Quick Review is not a test of the reviewer’s AI knowledge.

It is a structured way to capture whether Lucia worked for a human.

If you are unsure, use the senior review option instead of inventing your own taxonomy.

---

## Escalation rule

Escalate when:

- Lucia may have overclaimed
- the response creates risk or confusion
- intent is unclear
- the case involves owner stress, money, maintenance, guest trust, or safety
- the response contains a reusable pattern
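
The escalation triggers above can be read as an any-one-of check. The sketch below is a minimal Python illustration of that reading, not part of the Eval Labs platform: the `ReviewItem` class, its field names, the topic tags, and both functions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

# Hypothetical topic tags mirroring the sensitive areas named in the rule.
SENSITIVE_TOPICS = {"owner stress", "money", "maintenance", "guest trust", "safety"}


@dataclass
class ReviewItem:
    """Hypothetical container for a reviewer's observations on one response."""
    may_have_overclaimed: bool = False
    creates_risk_or_confusion: bool = False
    intent_unclear: bool = False
    has_reusable_pattern: bool = False
    topics: set = field(default_factory=set)


def escalation_reasons(item: ReviewItem) -> list:
    """Return every trigger that applies, in the order the rule lists them."""
    reasons = []
    if item.may_have_overclaimed:
        reasons.append("possible overclaim")
    if item.creates_risk_or_confusion:
        reasons.append("risk or confusion")
    if item.intent_unclear:
        reasons.append("unclear intent")
    # Only topics that appear in the sensitive set count as a trigger.
    sensitive = sorted(item.topics & SENSITIVE_TOPICS)
    if sensitive:
        reasons.append("sensitive topic: " + ", ".join(sensitive))
    if item.has_reusable_pattern:
        reasons.append("reusable pattern")
    return reasons


def should_escalate(item: ReviewItem) -> bool:
    """Assumption: a single matching trigger is enough to escalate."""
    return bool(escalation_reasons(item))
```

Returning the list of reasons, not only a yes/no answer, lets the reviewer record why a case was escalated, which fits the earlier instruction to write down why trust dropped.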