
Commit 68d1661

Introduce Content Safety evaluators (#6223)
The new evaluators ship in a new Microsoft.Extensions.AI.Evaluation.Safety package.

Also includes the following public API changes:
- Add a Metadata dictionary on EvaluationMetric.
- Make EvaluationMetric.Diagnostics nullable.
- Convert instance functions on some (fully mutable) result types to extension methods in the same namespace.

And some reporting improvements, including:
- Change the boolean metric representation in the metric card UI from Pass / Fail to Yes / No.
- Display the above Metadata contents in a table in the metric details view when a metric card is clicked.
- Improve the display of diagnostics in metric details - diagnostics are now also displayed in a table with proper formatting and an option to copy diagnostics to the clipboard.

Fixes #5937
1 parent 5298642 commit 68d1661
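As a rough, consumer-side illustration of the API changes listed above (not code from this commit): after an evaluation produces an EvaluationResult, the new per-metric metadata and the now-nullable diagnostics can be inspected. Only result.Get&lt;NumericMetric&gt;(...), the string-to-string metadata values, and EvaluationDiagnostic appear in the diffs below; the metric name, the exact shape of the Metadata and Diagnostics properties, and the printed output format are assumptions.

// Hypothetical sketch, not part of this commit. Assumes using directives for
// System, System.Collections.Generic and Microsoft.Extensions.AI.Evaluation,
// plus an existing `EvaluationResult result` produced by one of the evaluators.
NumericMetric metric = result.Get<NumericMetric>("Coherence"); // metric name is assumed

// New in this commit: a Metadata dictionary on EvaluationMetric (populated with
// entries such as "evaluation-model-used" and "evaluation-duration" by the
// updated evaluators shown further down this page).
if (metric.Metadata is not null)
{
    foreach (KeyValuePair<string, string> entry in metric.Metadata)
    {
        Console.WriteLine($"{entry.Key}: {entry.Value}");
    }
}

// Also new in this commit: EvaluationMetric.Diagnostics is nullable, so check before enumerating.
if (metric.Diagnostics is not null)
{
    foreach (EvaluationDiagnostic diagnostic in metric.Diagnostics)
    {
        Console.WriteLine(diagnostic.ToString());
    }
}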

File tree

64 files changed (+3590 / -470 lines)


eng/packages/General.props

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 <?xml version="1.0" encoding="utf-8"?>
 <Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
   <ItemGroup>
+    <PackageVersion Include="Azure.Core" Version="1.45.0" />
     <PackageVersion Include="Azure.Identity" Version="1.13.2" />
     <PackageVersion Include="Azure.Storage.Files.DataLake" Version="12.21.0" />
     <PackageVersion Include="Azure.AI.Inference" Version="1.0.0-beta.4" />

src/Libraries/Microsoft.Extensions.AI.Evaluation.Console/Program.cs

Lines changed: 1 addition & 1 deletion
@@ -139,7 +139,7 @@ private static async Task<int> Main(string[] args)
         // TASK: Support some mechanism to fail a build (i.e. return a failure exit code) based on one or more user
         // specified criteria (e.g., if x% of metrics were deemed 'poor'). Ideally this mechanism would be flexible /
         // extensible enough to allow users to configure multiple different kinds of failure criteria.
-
+        // See https://github.com/dotnet/extensions/issues/6038.
 #if DEBUG
         ParseResult parseResult = rootCmd.Parse(args);
         if (parseResult.HasOption(debugOpt))

src/Libraries/Microsoft.Extensions.AI.Evaluation.Console/README.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 
 * [`Microsoft.Extensions.AI.Evaluation`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) - Defines core abstractions and types for supporting evaluation.
 * [`Microsoft.Extensions.AI.Evaluation.Quality`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) - Contains evaluators that can be used to evaluate the quality of AI responses in your projects including Relevance, Truth, Completeness, Fluency, Coherence, Equivalence and Groundedness.
+* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains a set of evaluators that are built atop the Azure AI Content Safety service that can be used to evaluate the content safety of AI responses in your projects including Protected Material, Groundedness Pro, Ungrounded Attributes, Hate and Unfairness, Self Harm, Violence, Sexual, Code Vulnerability and Indirect Attack.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) - Contains support for caching LLM responses, storing the results of evaluations and generating reports from that data.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting.Azure`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) - Supports the `Microsoft.Extensions.AI.Evaluation.Reporting` library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
 * [`Microsoft.Extensions.AI.Evaluation.Console`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) - A command line dotnet tool for generating reports and managing evaluation data.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/EquivalenceEvaluatorContext.cs

Lines changed: 2 additions & 1 deletion
@@ -9,7 +9,8 @@
 namespace Microsoft.Extensions.AI.Evaluation.Quality;
 
 /// <summary>
-/// Contextual information required to evaluate the 'Equivalence' of a response.
+/// Contextual information that the <see cref="EquivalenceEvaluator"/> uses to evaluate the 'Equivalence' of a
+/// response.
 /// </summary>
 /// <param name="groundTruth">
 /// The ground truth response against which the response that is being evaluated is compared.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/GroundednessEvaluatorContext.cs

Lines changed: 2 additions & 1 deletion
@@ -9,7 +9,8 @@
 namespace Microsoft.Extensions.AI.Evaluation.Quality;
 
 /// <summary>
-/// Contextual information required to evaluate the 'Groundedness' of a response.
+/// Contextual information that the <see cref="GroundednessEvaluator"/> uses to evaluate the 'Groundedness' of a
+/// response.
 /// </summary>
 /// <param name="groundingContext">
 /// Contextual information against which the 'Groundedness' of a response is evaluated.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/README.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 
 * [`Microsoft.Extensions.AI.Evaluation`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) - Defines core abstractions and types for supporting evaluation.
 * [`Microsoft.Extensions.AI.Evaluation.Quality`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) - Contains evaluators that can be used to evaluate the quality of AI responses in your projects including Relevance, Truth, Completeness, Fluency, Coherence, Equivalence and Groundedness.
+* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains a set of evaluators that are built atop the Azure AI Content Safety service that can be used to evaluate the content safety of AI responses in your projects including Protected Material, Groundedness Pro, Ungrounded Attributes, Hate and Unfairness, Self Harm, Violence, Sexual, Code Vulnerability and Indirect Attack.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) - Contains support for caching LLM responses, storing the results of evaluations and generating reports from that data.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting.Azure`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) - Supports the `Microsoft.Extensions.AI.Evaluation.Reporting` library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
 * [`Microsoft.Extensions.AI.Evaluation.Console`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) - A command line dotnet tool for generating reports and managing evaluation data.

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/RelevanceTruthAndCompletenessEvaluator.cs

Lines changed: 106 additions & 41 deletions
@@ -7,6 +7,9 @@
 // constructor syntax.
 
 using System.Collections.Generic;
+using System.Diagnostics;
+using System.Globalization;
+using System.Linq;
 using System.Text;
 using System.Text.Json;
 using System.Threading;
@@ -125,71 +128,112 @@ protected override async ValueTask PerformEvaluationAsync(
         EvaluationResult result,
         CancellationToken cancellationToken)
     {
-        ChatResponse evaluationResponse =
-            await chatConfiguration.ChatClient.GetResponseAsync(
-                evaluationMessages,
-                _chatOptions,
-                cancellationToken: cancellationToken).ConfigureAwait(false);
-
-        string evaluationResponseText = evaluationResponse.Text.Trim();
+        ChatResponse evaluationResponse;
         Rating rating;
+        string duration;
+        Stopwatch stopwatch = Stopwatch.StartNew();
 
-        if (string.IsNullOrEmpty(evaluationResponseText))
-        {
-            rating = Rating.Inconclusive;
-            result.AddDiagnosticToAllMetrics(
-                EvaluationDiagnostic.Error(
-                    "Evaluation failed because the model failed to produce a valid evaluation response."));
-        }
-        else
+        try
        {
-            try
+            evaluationResponse =
+                await chatConfiguration.ChatClient.GetResponseAsync(
+                    evaluationMessages,
+                    _chatOptions,
+                    cancellationToken: cancellationToken).ConfigureAwait(false);
+
+            string evaluationResponseText = evaluationResponse.Text.Trim();
+            if (string.IsNullOrEmpty(evaluationResponseText))
             {
-                rating = Rating.FromJson(evaluationResponseText!);
+                rating = Rating.Inconclusive;
+                result.AddDiagnosticToAllMetrics(
+                    EvaluationDiagnostic.Error(
+                        "Evaluation failed because the model failed to produce a valid evaluation response."));
             }
-            catch (JsonException)
+            else
             {
                 try
                 {
-                    string repairedJson =
-                        await JsonOutputFixer.RepairJsonAsync(
-                            chatConfiguration,
-                            evaluationResponseText!,
-                            cancellationToken).ConfigureAwait(false);
-
-                    if (string.IsNullOrEmpty(repairedJson))
+                    rating = Rating.FromJson(evaluationResponseText!);
+                }
+                catch (JsonException)
+                {
+                    try
                     {
-                        rating = Rating.Inconclusive;
-                        result.AddDiagnosticToAllMetrics(
-                            EvaluationDiagnostic.Error(
-                                $"""
+                        string repairedJson =
+                            await JsonOutputFixer.RepairJsonAsync(
+                                chatConfiguration,
+                                evaluationResponseText!,
+                                cancellationToken).ConfigureAwait(false);
+
+                        if (string.IsNullOrEmpty(repairedJson))
+                        {
+                            rating = Rating.Inconclusive;
+                            result.AddDiagnosticToAllMetrics(
+                                EvaluationDiagnostic.Error(
+                                    $"""
                                     Failed to repair the following response from the model and parse scores for '{RelevanceMetricName}', '{TruthMetricName}' and '{CompletenessMetricName}'.:
                                     {evaluationResponseText}
                                     """));
+                        }
+                        else
+                        {
+                            rating = Rating.FromJson(repairedJson!);
+                        }
                     }
-                    else
+                    catch (JsonException ex)
                     {
-                        rating = Rating.FromJson(repairedJson!);
-                    }
-                }
-                catch (JsonException ex)
-                {
-                    rating = Rating.Inconclusive;
-                    result.AddDiagnosticToAllMetrics(
-                        EvaluationDiagnostic.Error(
-                            $"""
+                        rating = Rating.Inconclusive;
+                        result.AddDiagnosticToAllMetrics(
+                            EvaluationDiagnostic.Error(
+                                $"""
                                 Failed to repair the following response from the model and parse scores for '{RelevanceMetricName}', '{TruthMetricName}' and '{CompletenessMetricName}'.:
                                 {evaluationResponseText}
                                 {ex}
                                 """));
+                    }
                 }
             }
         }
+        finally
+        {
+            stopwatch.Stop();
+            duration = $"{stopwatch.Elapsed.TotalSeconds.ToString("F2", CultureInfo.InvariantCulture)} s";
+        }
 
-        UpdateResult(rating);
+        UpdateResult();
 
-        void UpdateResult(Rating rating)
+        void UpdateResult()
         {
+            const string Rationales = "Rationales";
+            const string Separator = "; ";
+
+            var commonMetadata = new Dictionary<string, string>();
+
+            if (!string.IsNullOrWhiteSpace(evaluationResponse.ModelId))
+            {
+                commonMetadata["rtc-evaluation-model-used"] = evaluationResponse.ModelId!;
+            }
+
+            if (evaluationResponse.Usage is UsageDetails usage)
+            {
+                if (usage.InputTokenCount is not null)
+                {
+                    commonMetadata["rtc-evaluation-input-tokens-used"] = $"{usage.InputTokenCount}";
+                }
+
+                if (usage.OutputTokenCount is not null)
+                {
+                    commonMetadata["rtc-evaluation-output-tokens-used"] = $"{usage.OutputTokenCount}";
+                }
+
+                if (usage.TotalTokenCount is not null)
+                {
+                    commonMetadata["rtc-evaluation-total-tokens-used"] = $"{usage.TotalTokenCount}";
+                }
+            }
+
+            commonMetadata["rtc-evaluation-duration"] = duration;
+
             NumericMetric relevance = result.Get<NumericMetric>(RelevanceMetricName);
             relevance.Value = rating.Relevance;
             relevance.Interpretation = relevance.InterpretScore();
@@ -198,6 +242,13 @@ void UpdateResult(Rating rating)
                 relevance.Reason = rating.RelevanceReasoning!;
             }
 
+            relevance.AddOrUpdateMetadata(commonMetadata);
+            if (rating.RelevanceReasons.Any())
+            {
+                string value = string.Join(Separator, rating.RelevanceReasons);
+                relevance.AddOrUpdateMetadata(name: Rationales, value);
+            }
+
             NumericMetric truth = result.Get<NumericMetric>(TruthMetricName);
             truth.Value = rating.Truth;
             truth.Interpretation = truth.InterpretScore();
@@ -206,6 +257,13 @@ void UpdateResult(Rating rating)
                 truth.Reason = rating.TruthReasoning!;
             }
 
+            truth.AddOrUpdateMetadata(commonMetadata);
+            if (rating.TruthReasons.Any())
+            {
+                string value = string.Join(Separator, rating.TruthReasons);
+                truth.AddOrUpdateMetadata(name: Rationales, value);
+            }
+
             NumericMetric completeness = result.Get<NumericMetric>(CompletenessMetricName);
             completeness.Value = rating.Completeness;
             completeness.Interpretation = completeness.InterpretScore();
@@ -214,6 +272,13 @@ void UpdateResult(Rating rating)
                 completeness.Reason = rating.CompletenessReasoning!;
             }
 
+            completeness.AddOrUpdateMetadata(commonMetadata);
+            if (rating.CompletenessReasons.Any())
+            {
+                string value = string.Join(Separator, rating.CompletenessReasons);
+                completeness.AddOrUpdateMetadata(name: Rationales, value);
+            }
+
             if (!string.IsNullOrWhiteSpace(rating.Error))
             {
                 result.AddDiagnosticToAllMetrics(EvaluationDiagnostic.Error(rating.Error!));

src/Libraries/Microsoft.Extensions.AI.Evaluation.Quality/SingleNumericMetricEvaluator.cs

Lines changed: 57 additions & 22 deletions
@@ -2,6 +2,8 @@
 // The .NET Foundation licenses this file to you under the MIT license.
 
 using System.Collections.Generic;
+using System.Diagnostics;
+using System.Globalization;
 using System.Threading;
 using System.Threading.Tasks;
 using Microsoft.Shared.Diagnostics;
@@ -65,33 +67,66 @@ protected sealed override async ValueTask PerformEvaluationAsync(
         _ = Throw.IfNull(chatConfiguration);
         _ = Throw.IfNull(result);
 
-        ChatResponse evaluationResponse =
-            await chatConfiguration.ChatClient.GetResponseAsync(
-                evaluationMessages,
-                _chatOptions,
-                cancellationToken: cancellationToken).ConfigureAwait(false);
-
-        string evaluationResponseText = evaluationResponse.Text.Trim();
-
+        Stopwatch stopwatch = Stopwatch.StartNew();
         NumericMetric metric = result.Get<NumericMetric>(MetricName);
 
-        if (string.IsNullOrEmpty(evaluationResponseText))
-        {
-            metric.AddDiagnostic(
-                EvaluationDiagnostic.Error(
-                    "Evaluation failed because the model failed to produce a valid evaluation response."));
-        }
-        else if (int.TryParse(evaluationResponseText, out int score))
+        try
         {
-            metric.Value = score;
+            ChatResponse evaluationResponse =
+                await chatConfiguration.ChatClient.GetResponseAsync(
+                    evaluationMessages,
+                    _chatOptions,
+                    cancellationToken: cancellationToken).ConfigureAwait(false);
+
+            if (!string.IsNullOrWhiteSpace(evaluationResponse.ModelId))
+            {
+                metric.AddOrUpdateMetadata(name: "evaluation-model-used", value: evaluationResponse.ModelId!);
+            }
+
+            if (evaluationResponse.Usage is UsageDetails usage)
+            {
+                if (usage.InputTokenCount is not null)
+                {
+                    metric.AddOrUpdateMetadata(name: "evaluation-input-tokens-used", value: $"{usage.InputTokenCount}");
+                }
+
+                if (usage.OutputTokenCount is not null)
+                {
+                    metric.AddOrUpdateMetadata(name: "evaluation-output-tokens-used", value: $"{usage.OutputTokenCount}");
+                }
+
+                if (usage.TotalTokenCount is not null)
+                {
+                    metric.AddOrUpdateMetadata(name: "evaluation-total-tokens-used", value: $"{usage.TotalTokenCount}");
+                }
+            }
+
+            string evaluationResponseText = evaluationResponse.Text.Trim();
+
+            if (string.IsNullOrEmpty(evaluationResponseText))
+            {
+                metric.AddDiagnostic(
+                    EvaluationDiagnostic.Error(
+                        "Evaluation failed because the model failed to produce a valid evaluation response."));
+            }
+            else if (int.TryParse(evaluationResponseText, out int score))
+            {
+                metric.Value = score;
+            }
+            else
+            {
+                metric.AddDiagnostic(
+                    EvaluationDiagnostic.Error(
+                        $"Failed to parse '{evaluationResponseText!}' as an integer score for '{MetricName}'."));
+            }
+
+            metric.Interpretation = metric.InterpretScore();
         }
-        else
+        finally
        {
-            metric.AddDiagnostic(
-                EvaluationDiagnostic.Error(
-                    $"Failed to parse '{evaluationResponseText!}' as an integer score for '{MetricName}'."));
+            stopwatch.Stop();
+            string duration = $"{stopwatch.Elapsed.TotalSeconds.ToString("F2", CultureInfo.InvariantCulture)} s";
+            metric.AddOrUpdateMetadata(name: "evaluation-duration", value: duration);
         }
-
-        metric.Interpretation = metric.InterpretScore();
     }
 }
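Both evaluators above now share the same timing pattern: start a Stopwatch, wrap the model call in try/finally, and format the elapsed time with the invariant culture so the recorded duration metadata is locale-independent. Below is a minimal standalone sketch of that pattern, assuming using directives for System, System.Diagnostics, System.Globalization and System.Threading.Tasks; the MeasureAsync helper name is made up for illustration and is not an API from this commit.

// Hypothetical helper, not part of this commit; it only restates the
// Stopwatch + InvariantCulture pattern used by the evaluators above.
static async Task<string> MeasureAsync(Func<Task> evaluation)
{
    Stopwatch stopwatch = Stopwatch.StartNew();
    try
    {
        await evaluation().ConfigureAwait(false);
    }
    finally
    {
        stopwatch.Stop();
    }

    // "F2" with CultureInfo.InvariantCulture yields values like "1.52 s"
    // regardless of the machine's locale, matching the metadata written above.
    return $"{stopwatch.Elapsed.TotalSeconds.ToString("F2", CultureInfo.InvariantCulture)} s";
}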

src/Libraries/Microsoft.Extensions.AI.Evaluation.Reporting.Azure/README.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 
 * [`Microsoft.Extensions.AI.Evaluation`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) - Defines core abstractions and types for supporting evaluation.
 * [`Microsoft.Extensions.AI.Evaluation.Quality`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) - Contains evaluators that can be used to evaluate the quality of AI responses in your projects including Relevance, Truth, Completeness, Fluency, Coherence, Equivalence and Groundedness.
+* [`Microsoft.Extensions.AI.Evaluation.Safety`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) - Contains a set of evaluators that are built atop the Azure AI Content Safety service that can be used to evaluate the content safety of AI responses in your projects including Protected Material, Groundedness Pro, Ungrounded Attributes, Hate and Unfairness, Self Harm, Violence, Sexual, Code Vulnerability and Indirect Attack.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) - Contains support for caching LLM responses, storing the results of evaluations and generating reports from that data.
 * [`Microsoft.Extensions.AI.Evaluation.Reporting.Azure`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) - Supports the `Microsoft.Extensions.AI.Evaluation.Reporting` library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
 * [`Microsoft.Extensions.AI.Evaluation.Console`](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) - A command line dotnet tool for generating reports and managing evaluation data.
