-
Notifications
You must be signed in to change notification settings - Fork 968
Labels
Description
Proposal summary
We like to extend the existing LLM-as-a-judge evaluation metrics to include a new judge metric called "Sycophancy". Full paper with the methodology and prompts can be found here: https://arxiv.org/pdf/2502.08177
Example of an existing judge metric (Hallucination) is defined here:
- Docs: https://www.comet.com/docs/opik/evaluation/metrics/hallucination
- Docs Code: https://github.com/comet-ml/opik/blob/main/apps/opik-documentation/documentation/fern/docs/evaluation/metrics/hallucination.mdx
- Python SDK: https://github.com/comet-ml/opik/tree/main/sdks/python/src/opik/evaluation/metrics/llm_judges/hallucination
- Python Examples: https://github.com/comet-ml/opik/blob/main/sdks/python/examples/metrics.py
- Frontend: https://github.com/comet-ml/opik/blob/main/apps/opik-frontend/src/constants/llm.ts
Expectation is the new judge is added to the frontend for using LLM-as-a-judge from the UI (Online Evaluation tab) as well as in the Python SDK. The appropriate docs needs to be updated and a video attached of the metric working.
Motivation
I would like to see more robust set of metrics and evaluations based on recent research