
Commit 0de7dca

shyamnamboodiripad authored and peterwald committed
Introduce Reason property on EvaluationMetric (#6087)
Also includes changes to the report to display this information. This addresses #6032. Additionally, this PR includes numerous general improvements to the TypeScript report rendering:

* Make the metric cards clickable and display metric details (such as diagnostics, reasons, etc.) on click, in a new collapsible section inline in the report (as opposed to in a hover tooltip). This addresses #6037.
* Show conversations in a friendlier chat-bubble form, and make the sections that display conversation history collapsible so that long conversations can be collapsed. This partially addresses #6036.
* Introduce a global settings pane and move the per-textbox toggles for rendering markdown to this global location. This also partially addresses #6036.
* In the scenario tree, collapse single children into their respective parent level to reduce the amount of clicking required to expand deep trees.
* Remove the scenario-level section for failure reasons, since failure reasons (and diagnostics) can now be viewed on a per-metric basis by clicking on metric cards.
* Various other minor layout, sizing, and UX improvements and fixes for the report.
1 parent 077d4f3 commit 0de7dca
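For context, here is a minimal sketch of how a consumer might read the new `Reason` property after this commit. Only `result.Get<NumericMetric>(...)`, `TruthMetricName`, the parameterless evaluator constructor, and the `Reason` property are confirmed by the diffs below; the `EvaluateAsync` call shape and the placeholder parameters are assumptions:

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Illustrative sketch only (not from this commit); the EvaluateAsync call
// shape and the messages/modelResponse/chatConfiguration parameters are
// assumptions for illustration.
async Task PrintTruthAsync(
    IEnumerable<ChatMessage> messages,
    ChatMessage modelResponse,
    ChatConfiguration chatConfiguration)
{
    IEvaluator evaluator = new RelevanceTruthAndCompletenessEvaluator();
    EvaluationResult result =
        await evaluator.EvaluateAsync(messages, modelResponse, chatConfiguration);

    NumericMetric truth =
        result.Get<NumericMetric>(RelevanceTruthAndCompletenessEvaluator.TruthMetricName);

    Console.WriteLine($"Truth: {truth.Value}");

    // New in this commit: the model's explanation is exposed directly on the
    // metric, instead of being attached as an informational diagnostic.
    Console.WriteLine($"Reason: {truth.Reason}");
}
```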

File tree

17 files changed (+575, -512 lines)

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/RelevanceTruthAndCompletenessEvaluator.Prompts.cs

Lines changed: 1 addition & 87 deletions
````diff
@@ -7,93 +7,7 @@ public partial class RelevanceTruthAndCompletenessEvaluator
 {
     private static class Prompts
     {
-        internal static string BuildEvaluationPrompt(string userQuery, string modelResponse, string history)
-        {
-#pragma warning disable S103 // Lines should not be too long
-            return
-                $$"""
-                Read the History, User Query, and Model Response below and produce your response as a single JSON object.
-                Do not include any other text in your response besides the JSON object.
-
-                The JSON object should have the following format. However, do not include any markdown tags in your
-                response. Your response should start with an open curly brace and end with a closing curly brace for the
-                JSON.
-                ```
-                {
-                    "relevance": 1,
-                    "truth": 1,
-                    "completeness": 1
-                }
-                ```
-
-                -----
-
-                History: {{history}}
-
-                -----
-
-                User Query: {{userQuery}}
-
-                -----
-
-                Model Response: {{modelResponse}}
-
-                -----
-
-                That's the History, User Query, and Model Response you will rate. Now, in 3 Steps, you will evaluate the Model Response on 3 criteria.
-
-                -----
-
-                Step 1: Rate the relevance of the response.
-
-                Regardless of truth of statements, how much of the response is directly related to the request?
-
-                1 = None of the response is at all related
-                2 = A little of the response is directly related, or the response is a little bit related
-                3 = Some of the response is related, or the response is somewhat related
-                4 = Most of the response is related, or the response is mostly related
-                5 = 100% of the response is 100% related to the request
-
-                Record your response as the value of the "relevance" property in the JSON output.
-
-                -----
-
-                Step 2: Rate the truth of the response.
-
-                Read the History, Query, and Model Response again.
-
-                Regardless of relevance, how true are the verifiable statements in the response?
-
-                1 = The entire response is totally false
-                2 = A little of the response is true, or the response is a little bit true
-                3 = Some of the response is true, or the response is somewhat true
-                4 = Most of the response is true, or the response is mostly true
-                5 = 100% of the response is 100% true
-
-                Record your response as the value of the "truth" property in the JSON output.
-
-                -----
-
-                Step 3: Rate the completeness of the response.
-
-                Read the History, Query, and Model Response again.
-
-                Regardless of whether the statements made in the response are true, how many of the points necessary to address the request, does the response contain?
-
-                1 = The response omits all points that are necessary to address the request.
-                2 = The response includes a little of the points that are necessary to address the request.
-                3 = The response includes some of the points that are necessary to address the request.
-                4 = The response includes most of the points that are necessary to address the request.
-                5 = The response includes all points that are necessary to address the request. For explain tasks, nothing is left unexplained. For improve tasks, I looked for all potential improvements, and none were left out. For fix tasks, the response purports to get the user all the way to a fixed state (regardless of whether it actually works). For "do task" responses, it does everything requested.
-
-                Record your response as the value of the "completeness" property in the JSON output.
-
-                -----
-                """;
-#pragma warning restore S103
-        }
-
-        internal static string BuildEvaluationPromptWithReasoning(
+        internal static string BuildEvaluationPrompt(
             string userQuery,
             string modelResponse,
             string history)
````
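The deleted overload above requested only the three scores. The surviving overload (renamed from `BuildEvaluationPromptWithReasoning`) is not shown in this diff, but judging from the `Rating` properties consumed in the next file, the JSON it requests presumably looks something like the following; the camelCase field names are assumptions:

```csharp
// Assumed shape of the JSON requested by the surviving reasoning prompt;
// inferred from the RelevanceReasoning/TruthReasoning/CompletenessReasoning
// properties consumed in RelevanceTruthAndCompletenessEvaluator.cs below.
const string AssumedModelOutputShape =
    """
    {
        "relevance": 4,
        "relevanceReasoning": "...",
        "truth": 5,
        "truthReasoning": "...",
        "completeness": 3,
        "completenessReasoning": "..."
    }
    """;
```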

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/RelevanceTruthAndCompletenessEvaluator.cs

Lines changed: 7 additions & 21 deletions
```diff
@@ -23,11 +23,10 @@ namespace Microsoft.Extensions.AI.Evaluation.Quality;
 /// <remarks>
 /// <see cref="RelevanceTruthAndCompletenessEvaluator"/> returns three <see cref="NumericMetric"/>s that contain scores
 /// for 'Relevance', 'Truth' and 'Completeness' respectively. Each score is a number between 1 and 5, with 1 indicating
-/// a poor score, and 5 indicating an excellent score.
+/// a poor score, and 5 indicating an excellent score. Each returned score is also accompanied by a
+/// <see cref="EvaluationMetric.Reason"/> that provides an explanation for the score.
 /// </remarks>
-/// <param name="options">Options for <see cref="RelevanceTruthAndCompletenessEvaluator"/>.</param>
-public sealed partial class RelevanceTruthAndCompletenessEvaluator(
-    RelevanceTruthAndCompletenessEvaluatorOptions? options = null) : ChatConversationEvaluator
+public sealed partial class RelevanceTruthAndCompletenessEvaluator : ChatConversationEvaluator
 {
     /// <summary>
     /// Gets the <see cref="EvaluationMetric.Name"/> of the <see cref="NumericMetric"/> returned by
@@ -61,9 +60,6 @@ public sealed partial class RelevanceTruthAndCompletenessEvaluator(
         ResponseFormat = ChatResponseFormat.Json
     };

-    private readonly RelevanceTruthAndCompletenessEvaluatorOptions _options =
-        options ?? RelevanceTruthAndCompletenessEvaluatorOptions.Default;
-
     /// <inheritdoc/>
     protected override EvaluationResult InitializeResult()
     {
@@ -101,17 +97,7 @@ userRequest is not null

         string renderedHistory = builder.ToString();

-        string prompt =
-            _options.IncludeReasoning
-                ? Prompts.BuildEvaluationPromptWithReasoning(
-                    renderedUserRequest,
-                    renderedModelResponse,
-                    renderedHistory)
-                : Prompts.BuildEvaluationPrompt(
-                    renderedUserRequest,
-                    renderedModelResponse,
-                    renderedHistory);
-
+        string prompt = Prompts.BuildEvaluationPrompt(renderedUserRequest, renderedModelResponse, renderedHistory);
         return prompt;
     }
@@ -192,23 +178,23 @@ void UpdateResult(Rating rating)
         relevance.Interpretation = relevance.InterpretScore();
         if (!string.IsNullOrWhiteSpace(rating.RelevanceReasoning))
         {
-            relevance.AddDiagnostic(EvaluationDiagnostic.Informational(rating.RelevanceReasoning!));
+            relevance.Reason = rating.RelevanceReasoning!;
         }

         NumericMetric truth = result.Get<NumericMetric>(TruthMetricName);
         truth.Value = rating.Truth;
         truth.Interpretation = truth.InterpretScore();
         if (!string.IsNullOrWhiteSpace(rating.TruthReasoning))
         {
-            truth.AddDiagnostic(EvaluationDiagnostic.Informational(rating.TruthReasoning!));
+            truth.Reason = rating.TruthReasoning!;
         }

         NumericMetric completeness = result.Get<NumericMetric>(CompletenessMetricName);
         completeness.Value = rating.Completeness;
         completeness.Interpretation = completeness.InterpretScore();
         if (!string.IsNullOrWhiteSpace(rating.CompletenessReasoning))
         {
-            completeness.AddDiagnostic(EvaluationDiagnostic.Informational(rating.CompletenessReasoning!));
+            completeness.Reason = rating.CompletenessReasoning!;
         }

         if (!string.IsNullOrWhiteSpace(rating.Error))
```
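The `Rating` type consumed by `UpdateResult` above is not part of this diff. A rough approximation of its shape, inferred purely from the property accesses visible in the hunk, might look like this (the actual type in the repository may differ):

```csharp
// Rough approximation of the Rating type consumed by UpdateResult above;
// it is not shown in this diff, so the property types here are assumptions.
// The *Reasoning values are what now populate EvaluationMetric.Reason.
internal sealed class Rating
{
    public int Relevance { get; set; }
    public string? RelevanceReasoning { get; set; }

    public int Truth { get; set; }
    public string? TruthReasoning { get; set; }

    public int Completeness { get; set; }
    public string? CompletenessReasoning { get; set; }

    // Populated when the model's response cannot be parsed or rated.
    public string? Error { get; set; }
}
```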

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/RelevanceTruthAndCompletenessEvaluatorOptions.cs

Lines changed: 0 additions & 41 deletions

This file was deleted; the IncludeReasoning option it carried is obsolete now that the evaluator always requests reasoning.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Reporting/TypeScript/components/App.css

Lines changed: 1 addition & 2 deletions
```diff
@@ -5,7 +5,6 @@ The .NET Foundation licenses this file to you under the MIT license.

 #root {
   margin: 0 auto;
-  padding: 2rem;
+  padding: 0rem 2rem 2rem 2rem;
   background-color: white;
 }
-
```
src/Libraries/Microsoft.Extensions.AI.Evaluation.Reporting/TypeScript/components/App.tsx

Lines changed: 34 additions & 7 deletions

```diff
@@ -1,6 +1,9 @@
 // Licensed to the .NET Foundation under one or more agreements.
 // The .NET Foundation licenses this file to you under the MIT license.

+import { useState } from 'react';
+import { Settings28Regular } from '@fluentui/react-icons';
+import { Drawer, DrawerBody, DrawerHeader, DrawerHeaderTitle, Switch } from '@fluentui/react-components';
 import { makeStyles } from '@fluentui/react-components';
 import './App.css';
 import { ScoreNode } from './Summary';
@@ -12,20 +15,44 @@ type AppProperties = {
 };

 const useStyles = makeStyles({
-  footerText: { fontSize: '0.8rem', marginTop: '2rem' }
-})
+  header: { display: 'flex', justifyContent: 'space-between', alignItems: 'center', position: 'sticky', top: 0, backgroundColor: 'white', zIndex: 1 },
+  footerText: { fontSize: '0.8rem', marginTop: '2rem' },
+  closeButton: { position: 'absolute', top: '1.5rem', right: '1rem', cursor: 'pointer', fontSize: '2rem' },
+  switchLabel: { fontSize: '1rem', paddingTop: '1rem' },
+  drawerBody: { paddingTop: '1rem' },
+});

-function App({dataset, tree}:AppProperties) {
+function App({ dataset, tree }: AppProperties) {
   const classes = useStyles();
+  const [isSettingsOpen, setIsSettingsOpen] = useState(false);
+  const [renderMarkdown, setRenderMarkdown] = useState(true);
+
+  const toggleSettings = () => setIsSettingsOpen(!isSettingsOpen);
+  const toggleRenderMarkdown = () => setRenderMarkdown(!renderMarkdown);
+  const closeSettings = () => setIsSettingsOpen(false);
+
   return (
     <>
-      <h1>AI Evaluation Report</h1>
+      <div className={classes.header}>
+        <h1>AI Evaluation Report</h1>
+        <Settings28Regular onClick={toggleSettings} style={{ cursor: 'pointer' }} />
+      </div>

-      <ScenarioGroup node={tree} />
+      <ScenarioGroup node={tree} renderMarkdown={renderMarkdown} />

       <p className={classes.footerText}>Generated at {dataset.createdAt} by Microsoft.Extensions.AI.Evaluation.Reporting version {dataset.generatorVersion}</p>
+
+      <Drawer open={isSettingsOpen} onOpenChange={toggleSettings} position='end'>
+        <DrawerHeader>
+          <DrawerHeaderTitle>Settings</DrawerHeaderTitle>
+          <span className={classes.closeButton} onClick={closeSettings}>&times;</span>
+        </DrawerHeader>
+        <DrawerBody className={classes.drawerBody}>
+          <Switch checked={renderMarkdown} onChange={toggleRenderMarkdown} label={<span className={classes.switchLabel}>Render markdown for conversations</span>} />
+        </DrawerBody>
+      </Drawer>
     </>
-  )
+  );
 }

-export default App
+export default App;
```

src/Libraries/Microsoft.Extensions.AI.Evaluation.Reporting/TypeScript/components/EvalTypes.d.ts

Lines changed: 4 additions & 0 deletions
```diff
@@ -65,20 +65,24 @@ type BaseEvaluationMetric = {

 type MetricWithNoValue = BaseEvaluationMetric & {
   $type: "none";
+  reason?: string;
   value: undefined;
 };

 type NumericMetric = BaseEvaluationMetric & {
   $type: "numeric";
+  reason?: string;
   value?: number;
 };

 type BooleanMetric = BaseEvaluationMetric & {
   $type: "boolean";
+  reason?: string;
   value?: boolean;
 };

 type StringMetric = BaseEvaluationMetric & {
   $type: "string";
+  reason?: string;
   value?: string;
 };
```
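These TypeScript definitions describe the serialized report data, so the new optional `reason` field is what the C# side populates from `EvaluationMetric.Reason`. A hedged C# sketch of reading it back out; the `SerializedMetric` DTO and the camelCase field names are assumptions based on the definitions above:

```csharp
using System.Text.Json;

// Hypothetical example only: a DTO mirroring the optional "reason" field that
// the TypeScript definitions above now declare on every metric type.
var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

SerializedMetric? metric = JsonSerializer.Deserialize<SerializedMetric>(
    """{ "$type": "numeric", "name": "Relevance", "value": 4, "reason": "Directly answers the query." }""",
    options);

// reason is optional in the contract, so treat it as nullable.
Console.WriteLine(metric?.Reason ?? "(no reason provided)");

// Assumed field names follow the camelCase used by the TypeScript types.
sealed record SerializedMetric(string Name, double? Value, string? Reason);
```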
