Using Human vs AI in Dataset Experiments for RAG evaluation #4789
-
Hi all, Thanks!
Replies: 1 comment 2 replies
-
Hey @omrihar, this is a great question.
As you noted, the `llm_classify` API is designed for use with dataframes in a notebook environment and results in annotations on spans. In contrast, the evaluators in the experiments section are intended for running across a dataset and result in annotations on dataset examples. The interfaces are not compatible with each other; this is an area we'd like to improve.
In your case, it sounds like you want to use the experiments API, since you want to evaluate performance on a golden dataset as you iterate.
Since we don't currently support the Human vs AI evaluator in the experiments API, I recommend creating an evaluator function using the prompt template found here. Hope this helps!
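To make the suggestion concrete, here is a minimal sketch of such an evaluator function. It assumes the Human vs AI classification is done by some `classify` callable (in practice this would be `phoenix.evals.llm_classify` with the Human vs AI prompt template and an LLM model); the factory shape, column names, and return convention below are illustrative assumptions, not the library's exact API — check the Phoenix docs for the current evaluator signature.

```python
# Hedged sketch: wrap a Human-vs-AI style classifier in an evaluator
# function that an experiments API can call once per dataset example.
# The `classify` callable is injected so the logic is testable without
# an LLM; in real use it would be phoenix.evals.llm_classify bound to
# the Human vs AI prompt template.
import pandas as pd


def make_human_vs_ai_evaluator(classify):
    """Build an evaluator from a `classify` callable.

    `classify` takes a one-row DataFrame with `question`,
    `correct_answer`, and `ai_generated_answer` columns (the variables
    the Human vs AI template expects) and returns a DataFrame with a
    `label` column of "correct" / "incorrect".
    """
    def evaluator(input, output, expected):
        # Assemble the single example into the shape the template expects.
        row = pd.DataFrame(
            [{
                "question": input["question"],
                "correct_answer": expected["answer"],
                "ai_generated_answer": output,
            }]
        )
        label = classify(row)["label"].iloc[0]
        # Return a numeric score so the experiment can aggregate results.
        return 1.0 if label == "correct" else 0.0

    return evaluator
```

You could then pass the resulting function in the `evaluators` list when running an experiment over your golden dataset, so each example gets scored against its reference answer.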