Methodology / Technical Roadmap
LLM Evaluation Framework
A closed-loop evaluation harness that runs an LLM agent against benchmark problems, scores the results, gates deployment, and generates a report.
Full Prompt
Create an LLM evaluation framework diagram for a closed-loop benchmark harness.
Layout:
- Input: benchmark tasks and evaluation rubric.
- Pipeline: prompt builder -> LLM agent -> tool environment -> response collector -> scorer -> report generator.
- Add a deployment gate after scoring with pass / fail decision.
- Show feedback loop from error analysis back to prompt builder and agent configuration.
- Include metric boxes for accuracy, cost, latency, safety, and format compliance.
Style:
- Clean MLOps / evaluation architecture diagram on white background.
- Navy pipeline blocks, teal forward flow, coral failure feedback, amber metric badges.
- Use readable labels and consistent spacing.
- Suitable for AI evaluation papers, internal platform docs, and benchmark reports.
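For readers who want to see what the diagram depicts, the loop in the prompt can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `Task`, `run_harness`, the exact-match scorer, and the `threshold` default are all hypothetical names and choices, and the prompt builder, agent, and tool environment are collapsed into a single `agent` callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical benchmark item: a prompt and its expected answer."""
    prompt: str
    expected: str


def run_harness(tasks: List[Task], agent: Callable[[str], str],
                threshold: float = 0.8) -> dict:
    """Run each task, score by exact match, and apply a pass/fail deployment gate."""
    if not tasks:
        raise ValueError("at least one task is required")
    failures = []
    correct = 0
    for task in tasks:
        response = agent(task.prompt)           # agent + tool environment (assumed)
        if response.strip() == task.expected:   # scorer: exact match (assumed rubric)
            correct += 1
        else:
            failures.append((task.prompt, response))  # input to error analysis
    accuracy = correct / len(tasks)
    return {
        "accuracy": accuracy,
        "deploy": accuracy >= threshold,        # deployment gate
        "failures": failures,                   # feeds back to prompt/agent config
    }


report = run_harness(
    [Task("a", "A"), Task("b", "B"), Task("c", "x")],
    agent=str.upper,                            # toy stand-in for an LLM agent
)
```

The `failures` list is the piece that closes the loop: in the diagram it corresponds to the coral feedback arrow from error analysis back to the prompt builder and agent configuration.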
Use Cases
For LLM benchmarking, agent evaluation and AIOps-style automated test harnesses.
Variants
With safety guardrails
Add a "Safety Filter" node between the "Run Episode" and "Score & Log" nodes that screens agent actions for unsafe operations (e.g., destructive shell commands) before they execute. Unsafe actions are logged and the episode is terminated.
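The behavior this variant describes (screen each action before it executes, log the unsafe one, end the episode) might look like the sketch below. The pattern list is deliberately tiny and purely illustrative, and `is_unsafe`, `run_episode`, and the log record layout are hypothetical names, not part of any real harness.

```python
import re

# Illustrative deny-list only; a real filter would need a far broader policy.
UNSAFE_PATTERNS = [
    re.compile(r"\brm\s+-(?:rf|fr)\b"),   # recursive forced delete
    re.compile(r"\bmkfs\b"),              # filesystem formatting
]


def is_unsafe(action: str) -> bool:
    """Return True if the action matches any deny-list pattern."""
    return any(p.search(action) for p in UNSAFE_PATTERNS)


def run_episode(actions, execute, log):
    """Execute actions in order; stop and log at the first unsafe one."""
    for step, action in enumerate(actions):
        if is_unsafe(action):
            log.append({"step": step, "action": action, "reason": "unsafe"})
            return "terminated"           # the unsafe action is never executed
        execute(action)
    return "completed"


executed, log = [], []
status = run_episode(["ls -la", "rm -rf /", "echo hi"], executed.append, log)
```

Here `executed.append` stands in for the tool environment, so the example can show that the filtered action and everything after it never run.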
Usage Tips
- Number the phases — readers expect phase ordering in evaluation harnesses.
- Place the benchmark database outside the loop. Mixing it into the loop muddles the figure.
- Show a metrics store explicitly; without it, the "evaluation" label feels incomplete.
FAQ
Can I depict multi-turn agent runs?
Add a small inner loop on the agent: each turn the agent observes the environment and takes an action; the inner loop closes when the episode terminates.
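The inner loop described in this answer can be sketched as follows. The `CountdownEnv` class, its `reset`/`step` interface, and the `max_turns` cap are assumptions made for illustration; they mimic a common environment-style interface rather than any specific library.

```python
class CountdownEnv:
    """Toy environment (hypothetical): the episode ends once the total reaches a target."""

    def __init__(self, target: int):
        self.target = target
        self.total = 0

    def reset(self) -> int:
        self.total = 0
        return self.total                 # initial observation

    def step(self, action: int):
        self.total += action
        return self.total, self.total >= self.target   # (observation, done)


def run_multi_turn(env, agent, max_turns: int = 10):
    """Inner loop: observe, act, repeat until the episode terminates or turns run out."""
    observation = env.reset()
    trajectory = []
    for _ in range(max_turns):
        action = agent(observation)       # agent picks an action from the observation
        observation, done = env.step(action)
        trajectory.append((action, observation))
        if done:                          # the inner loop closes here
            break
    return trajectory


trajectory = run_multi_turn(CountdownEnv(target=3), agent=lambda obs: 1)
```

The returned trajectory is what the outer harness would hand to the scorer, so in the figure the inner loop sits entirely inside the agent block.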
