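Category 7: Quality and Feedback

Methods for ensuring output quality: scoring rubrics, self-review, feedback loops, and version management.

Related foundational techniques: Schema Priming, Negative Space, Cognitive Offloading (see prompt-engineering-for-skills.md)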

Pattern 27: Scoring Rubrics / Quantitative Assessment

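Frequency: about 4% of skills (80-100 files)
Related patterns: Structured Output Templates, Few-Shot Examples, Self-Critique

Definition: Provide a numeric scoring framework with explicit criteria, score ranges, a descriptor for each score level, and a threshold mapping that converts the total score into a categorical result.

When to use:

Subjective evaluations that must be comparable across runs
Outputs that need a quantitative score (demo reviews, code quality, readiness audits)
Multiple dimensions that need to be scored independently
Different thresholds that trigger different follow-up actions

Good example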

## Scoring Criteria

| # | Criterion | Description | Score Range |
|---|-----------|-------------|-------------|
| 1 | **Hook** | Does the opening grab attention in the first 15 seconds? | 1–5 |
| 2 | **Problem / Solution** | Is the problem clear and the solution compelling? | 1–5 |
| 3 | **Demo Flow** | Is the demo logical, smooth, and shows the product working? | 1–5 |
| 4 | **Technical Clarity** | Are technical choices explained clearly for the audience? | 1–5 |
| 5 | **Call to Action** | Does the pitch end with a clear ask or next step? | 1–5 |

**Total:** 25 points. Scores map to: 20–25 Strong, 13–19 Adequate, <=12 Needs Work.

**Scoring rubric for Instruction Clarity:**
- **5/5**: All checks pass. Clear phases, strong directives, output format specified.
- **4/5**: Frontmatter complete, workflow exists, minor language weakness.
- **3/5**: Basic structure exists but missing output format or has weak language.
- **2/5**: Missing frontmatter fields or no clear workflow.
- **1/5**: No frontmatter or unstructured prose.

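Why it works: Each criterion has a name, a description, and a score range, so the model knows exactly what to score and how. The total-to-category mapping (20-25 Strong, 13-19 Adequate, <=12 Needs Work) makes the final verdict a deterministic function of the total. The skillqa rubric adds a descriptor for every score level, so the model can tell a 3 from a 4 using concrete criteria. The five dimensions cover distinct aspects: "Hook" and "Call to Action" do not overlap.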
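The threshold mapping is simple enough to check mechanically once the model has produced its per-criterion scores. The following is a minimal sketch, assuming the scores arrive as a dictionary keyed by criterion name; the `classify` function and `CRITERIA` constant are illustrative names, not part of any skill.

```python
from typing import Dict

# Criteria and thresholds are copied from the pitch rubric above.
# The names CRITERIA and classify are hypothetical, for illustration only.
CRITERIA = (
    "Hook",
    "Problem / Solution",
    "Demo Flow",
    "Technical Clarity",
    "Call to Action",
)


def classify(scores: Dict[str, int]) -> str:
    """Sum five 1-5 criterion scores and map the total to a category."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(scores[name] for name in CRITERIA)
    if total >= 20:
        return "Strong"
    if total >= 13:
        return "Adequate"
    return "Needs Work"


example = dict(zip(CRITERIA, (4, 4, 3, 3, 2)))
print(classify(example))  # total is 16, so this prints "Adequate"
```

Because the mapping lives in one place, two runs that assign the same per-criterion scores always land in the same category.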

Bad example

Rate the pitch on a scale of 1-10. Consider things like how engaging it is,
whether the demo works, and if the technical approach makes sense.
Give an overall assessment.

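Why it fails: A single 1-10 scale with no criteria is calibrated differently on every run. "Consider things like" offers suggestions, not required dimensions. Without a threshold mapping, "7/10" can mean different things from one run to the next. Without per-level descriptors, the model cannot distinguish adjacent scores. Repeated runs on the same input will tend to produce different scores.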

Pattern 28: Self-Critique / Quality Self-Check

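Frequency: about 2% of skills (30-50 files)
Related patterns: Evidence Chain, Negative Constraints, Scoring Rubrics

Definition: Require the agent to review its own output before delivery: identify weaknesses, flag low-confidence areas, and verify compliance with the rules.

When to use:

Spec or document generation where implicit assumptions are dangerous
Root-cause analysis where a premature conclusion wastes investigation time
Model confidence varies and the user needs to know where
The output is actionable advice (wrong advice is costly)

Good example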

### Adversarial Self-Critique

The spec author's honest assessment of where this spec is weakest. Not generic failure
modes — specific weaknesses in THIS spec.

**Rules:**
- Minimum 3 weaknesses per spec.
- Each weakness must be specific to THIS spec — "specs can be misinterpreted" is not valid.
- Watch indicators must be observable during execution, not after.

### Weakness 1: [Title]
- **Assumption being made:** [the specific assumption]
- **What happens if wrong:** [what the executor would build incorrectly]
- **Watch indicator:** [observable signal during execution]

### Weakness 2: [Title]
...

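Why it works: "Minimum 3 weaknesses" prevents a perfunctory pass. Explicitly rejecting generic weaknesses ("specs can be misinterpreted") forces the model to find real problems in the current output. The three-field structure (assumption, consequence, watch indicator) makes each weakness actionable: the user knows what to monitor during execution. The "Adversarial" framing encourages the model to look for problems rather than defend its own work.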
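The structural rules in this example (a minimum count and three required fields) can be verified outside the model. Below is a minimal sketch, assuming the self-critique is emitted as markdown with the exact heading and field labels shown in the template; `check_self_critique` is a hypothetical helper. It cannot judge whether a weakness is specific to the spec, only whether the required structure is present.

```python
import re
from typing import List

# Field labels are taken from the template above; the helper is illustrative.
REQUIRED_FIELDS = (
    "Assumption being made",
    "What happens if wrong",
    "Watch indicator",
)
MIN_WEAKNESSES = 3


def check_self_critique(markdown: str) -> List[str]:
    """Return a list of structural problems; an empty list means it passes."""
    problems = []
    # Split on "### Weakness N:" headings; drop the preamble before the first.
    chunks = re.split(r"^### Weakness \d+:", markdown, flags=re.MULTILINE)[1:]
    if len(chunks) < MIN_WEAKNESSES:
        problems.append(
            f"found {len(chunks)} weaknesses, need at least {MIN_WEAKNESSES}"
        )
    for index, chunk in enumerate(chunks, start=1):
        for field in REQUIRED_FIELDS:
            if f"**{field}:**" not in chunk:
                problems.append(f"weakness {index} is missing '{field}'")
    return problems


# Two complete weaknesses: the structure is fine, but the count is too low.
sample = "\n".join(
    f"### Weakness {n}: Example title\n"
    "- **Assumption being made:** placeholder\n"
    "- **What happens if wrong:** placeholder\n"
    "- **Watch indicator:** placeholder"
    for n in range(1, 3)
)
print(check_self_critique(sample))
# prints ['found 2 weaknesses, need at least 3']
```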

Bad example

Review your output and make sure it's good. Fix any issues you find.

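Why it fails: "Make sure it's good" is the same instruction the model followed when generating the output, so self-review against the same standard yields the same result. With no minimum number of weaknesses, the model tends to find none (authors see nothing wrong with their own work). With no structure for weaknesses, the review yields only vague generalities. "Fix any issues" means the user never sees the weaknesses: the model silently "fixes" them, and may in fact be papering over problems.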

Pattern 29: Feedback Solicitation

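Frequency: <1% of skills (10-20 files)
Related patterns: Progress Feedback, Configuration Persistence

Definition: Instruct the agent to present a feedback survey or request at a natural stopping point, with priority tiers and session-level deduplication.

When to use:

Skills under active iteration that need user feedback
Skills with multiple potential failure modes that need bug reports
Capturing feature requests when users hit a gap
Production skills that need satisfaction metrics

Good example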

## Feedback

Surface the feedback survey **at most once per session** at a natural stopping point.

**Link:** [Excel AI Tools Pulse](https://aka.ms/ExcelAIToolsPulse) (anonymous, 2 min)

**When to surface** (pick the first that matches, then stop for the session):

1. **Bug** — something went wrong → offer to draft a brief bug report
2. **Feature gap** — user wants something this skill can't do → offer to draft feature request
3. **Satisfaction** — task completed smoothly → one-line mention
4. **First completion** — skill finished successfully, no other trigger → link in closing output

Never interrupt the active task. Never mention the survey again if declined or ignored.

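Why it works: "At most once per session" prevents feedback fatigue. The priority tiers ensure that bugs are surfaced before generic satisfaction prompts. "Pick the first that matches, then stop" is a deterministic rule. "Never interrupt the active task" keeps feedback from getting in the way of the work. Each feedback type has its own response (bug → offer to draft a report, satisfaction → a one-line mention).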
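The "first match wins, once per session" rule is a small state machine. Here is a minimal sketch, assuming the triggers observed so far are tracked as a set of strings; the event names and the `FeedbackGate` class are hypothetical and only mirror the four tiers in the example.

```python
from typing import Optional, Set

# Priority order follows the example above; the identifiers are illustrative.
PRIORITY = ("bug", "feature_gap", "satisfaction", "first_completion")


class FeedbackGate:
    """Surfaces at most one feedback prompt per session."""

    def __init__(self) -> None:
        self.surfaced = False

    def pick(self, events: Set[str], task_active: bool) -> Optional[str]:
        # Never interrupt active work, and never prompt twice in a session.
        if task_active or self.surfaced:
            return None
        for trigger in PRIORITY:
            if trigger in events:
                self.surfaced = True
                return trigger
        return None


gate = FeedbackGate()
print(gate.pick({"satisfaction", "bug"}, task_active=True))   # None
print(gate.pick({"satisfaction", "bug"}, task_active=False))  # bug
print(gate.pick({"feature_gap"}, task_active=False))          # None
```

The third call returns nothing because the session has already surfaced a prompt, which also covers the "never mention the survey again" rule.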

Bad example

Ask the user for feedback when you're done. Include a link to our survey.

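Why it fails: Without session-level deduplication, the model may ask for feedback after every interaction. Without priority tiers, bugs and generic satisfaction are treated alike. "When you're done" is ambiguous in a multi-step workflow: after each step, or only at the end? With no constraint against interrupting active work, the model may ask for feedback in the middle of an analysis.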

Pattern 30: Version Check / Update Notification

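Frequency: <1% of skills (10-20 files)
Related patterns: Configuration Persistence, Error Handling

Definition: Check whether the installed plugin version matches the latest available version and notify the user of updates, degrading gracefully when the check fails.

When to use:

Skills under active development with frequent updates
Skills where a version mismatch causes subtle behavioral differences
Skills distributed through a plugin marketplace
Users who need to stay on the latest version for security or compatibility reasons

Good example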

### Check for Updates

**Run this section on every invocation**, before any other workflow section. It is designed
to be non-blocking — if any step fails (network error, file not found, parse error),
log a brief warning and continue silently.

**Read installed version**

    $installedPluginJson = "$env:USERPROFILE\.copilot\installed-plugins\marketplace\my-plugin\.claude-plugin\plugin.json"
    $installedVersion = (Get-Content $installedPluginJson -Raw | ConvertFrom-Json).version

**Fetch latest version from GitHub**

Uses `gh api` (GitHub CLI) for authenticated access, with `Invoke-RestMethod` as fallback:

    try {
        # Primary: GitHub CLI
        $base64 = gh api repos/org/plugins/contents/plugins/my-plugin/.claude-plugin/plugin.json --jq '.content'
        if ($LASTEXITCODE -ne 0) { throw "gh api failed" }
        # The API returns base64 content that may span several lines; join before decoding
        $latestVersion = ([System.Text.Encoding]::UTF8.GetString(
          [Convert]::FromBase64String(($base64 -join '').Trim())) | ConvertFrom-Json).version
    } catch {
        # Fallback: direct HTTP (works for public repos); runs only if the primary path failed
        $latestUrl = "https://raw.githubusercontent.com/org/plugins/main/plugins/my-plugin/.claude-plugin/plugin.json"
        $latestVersion = (Invoke-RestMethod -Uri $latestUrl -TimeoutSec 5).version
    }

**Compare versions** using `[version]` type for numeric comparison.

**Report result:**
- If latest > installed: "Update available: v{installed} → v{latest}" + offer update command
- If versions match: "v{installed} (latest)" + continue
- If check fails: "Could not check for updates. Continuing with installed version."

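Why it works: The check is designed to be non-blocking: a network failure does not stop the skill from running. Two fetch methods (gh api plus direct HTTP) provide redundancy. Version comparison uses proper numeric parsing (the [version] type) rather than string comparison. Each of the three outcomes (update available, already latest, check failed) has an explicit user-facing message. The update path pauses and asks before updating rather than updating automatically.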
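The same control flow (ordered sources, numeric comparison, three outcomes, never raising) can be sketched independently of PowerShell. The following is a minimal sketch, assuming each version source is a zero-argument callable that returns a dotted version string; `parse_version`, `check_for_update`, and the stand-in fetchers are hypothetical names, and real fetchers would wrap the CLI and HTTP calls.

```python
from typing import Callable, Iterable, Tuple

FAILED = "Could not check for updates. Continuing with installed version."


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.10.0' into (1, 10, 0) so that comparison is numeric."""
    return tuple(int(part) for part in text.strip().split("."))


def check_for_update(installed: str, fetchers: Iterable[Callable[[], str]]) -> str:
    """Try each source in order and return a status message; never raise."""
    try:
        current = parse_version(installed)
    except (ValueError, AttributeError):
        return FAILED
    for fetch in fetchers:
        try:
            latest_text = fetch().strip()
            latest = parse_version(latest_text)
        except Exception:
            continue  # network error, parse error, etc.: try the next source
        if latest > current:
            return f"Update available: v{installed} → v{latest_text}"
        return f"v{installed} (latest)"
    return FAILED


def offline() -> str:
    # Stand-in for a primary source that is unreachable.
    raise ConnectionError("offline")


with_fallback = check_for_update("1.9.0", [offline, lambda: "1.10.0"])
# with_fallback == "Update available: v1.9.0 → v1.10.0"
no_source = check_for_update("1.9.0", [offline])
# no_source == FAILED
```

In this sketch a remote version older than the installed one is reported as "(latest)"; whether that case deserves its own message is a design choice the source example does not address.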

Bad example

Check if there's a newer version available. If so, tell the user to update.

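Why it fails: No path is specified for the installed or remote version. With no fallback for network failures, the skill may crash before doing any real work. No version comparison method is given: a string comparison of "1.9.0" and "1.10.0" gives the wrong result. The "check failed" case is not handled. There is no user confirmation before updating. "Tell the user to update" gives no actual command.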
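The string-comparison pitfall is easy to reproduce. This short illustration assumes plain dotted numeric versions with no pre-release suffixes; `as_tuple` is a hypothetical helper.

```python
def as_tuple(version: str) -> tuple:
    """Split a dotted version into integers for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


# Character-by-character comparison reaches "9" versus "1" and stops there.
print("1.9.0" < "1.10.0")                      # False
# Comparing integer tuples treats 9 and 10 as numbers.
print(as_tuple("1.9.0") < as_tuple("1.10.0"))  # True
```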