Using Human vs AI in Dataset Experiments for RAG evaluation #4789
-
Hi all, Thanks!
Replies: 1 comment 2 replies
-
Hey @omrihar, this is a great question.
As you noted, the `llm_classify` API is designed for use with dataframes in a notebook environment and results in annotations on spans. In contrast, the evaluators in the experiments section are intended for running across a dataset and result in annotations on dataset examples. The interfaces are not compatible with each other; this is an area we'd like to improve.
In your case, it sounds like you want to use the experiments API, since you want to evaluate performance on a golden dataset as you iterate.
Since we don't currently support the Human vs AI evaluator in the experiments API, I recommend creating an evaluator function using the prompt template found here. Hope this helps!
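To make the suggestion concrete, here is a minimal sketch of such an evaluator function. It assumes the Human vs AI classification is done by some `classify` callable (in practice this would be `phoenix.evals.llm_classify` with the Human vs AI prompt template and an LLM model); the factory shape, column names, and return convention below are illustrative assumptions, not the library's exact API — check the Phoenix docs for the current evaluator signature.

```python
# Hedged sketch: wrap a Human-vs-AI style classifier in an evaluator
# function that an experiments API can call once per dataset example.
# The `classify` callable is injected so the logic is testable without
# an LLM; in real use it would be phoenix.evals.llm_classify bound to
# the Human vs AI prompt template.
import pandas as pd


def make_human_vs_ai_evaluator(classify):
    """Build an evaluator from a `classify` callable.

    `classify` takes a one-row DataFrame with `question`,
    `correct_answer`, and `ai_generated_answer` columns (the variables
    the Human vs AI template expects) and returns a DataFrame with a
    `label` column of "correct" / "incorrect".
    """
    def evaluator(input, output, expected):
        # Assemble the single example into the shape the template expects.
        row = pd.DataFrame(
            [{
                "question": input["question"],
                "correct_answer": expected["answer"],
                "ai_generated_answer": output,
            }]
        )
        label = classify(row)["label"].iloc[0]
        # Return a numeric score so the experiment can aggregate results.
        return 1.0 if label == "correct" else 0.0

    return evaluator
```

You could then pass the resulting function in the `evaluators` list when running an experiment over your golden dataset, so each example gets scored against its reference answer.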