Update code, data, and documentation for launch
minsukkahng committed May 14, 2024
1 parent 817f8a0 commit e80be93
Showing 25 changed files with 314 additions and 229 deletions.
46 changes: 23 additions & 23 deletions README.md
@@ -3,16 +3,16 @@
LLM Comparator is an interactive visualization tool for analyzing side-by-side
LLM evaluation results. It is designed to help people qualitatively analyze how
responses from two models differ at example- and slice-levels. Users can
interactively discover insights like "Model A's responses are better than B's on
email rewriting tasks because Model A tends to generate bulleted lists more
often."
interactively discover insights like *"Model A's responses are better than B's
on email rewriting tasks because Model A tends to generate bulleted lists more
often."*

![Screenshot of LLM Comparator interface](documentation/images/llm_comparator_screenshot.png)


## Using LLM Comparator

You can open LLM Comparator at https://pair-code.github.io/llm-comparator/.
You can play with LLM Comparator at https://pair-code.github.io/llm-comparator/.

You can either select one of the example files we provide, or you can upload
your own JSON file (e.g.,
@@ -25,19 +25,19 @@ that follows our format which we describe below.
We provide an example file for comparing
the model responses between [Gemma](https://ai.google.dev/gemma) 1.1 and 1.0
for prompts obtained from the
[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations). You can click the link below to play with it:
[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations).
You can click the link below to play with it:
https://pair-code.github.io/llm-comparator/?results_path=https://pair-code.github.io/llm-comparator/data/example_arena.json
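
A link like the one above can also be assembled programmatically. The sketch below is illustrative only: `buildComparatorLink` is a hypothetical helper that is not part of the repository, and it assumes the app reads `results_path` through standard query-string parsing, so the percent-encoding applied by `URLSearchParams` should be harmless.

```typescript
// Hypothetical helper (not part of LLM Comparator): builds a shareable link
// that points the hosted app at a JSON results file.
const APP_URL = 'https://pair-code.github.io/llm-comparator/';

function buildComparatorLink(
  resultsPath: string,
  appUrl: string = APP_URL,
): string {
  const url = new URL(appUrl);
  // searchParams.set() percent-encodes the value. Assumption: the app decodes
  // it again when it reads the query string.
  url.searchParams.set('results_path', resultsPath);
  return url.toString();
}

const link = buildComparatorLink(
  'https://pair-code.github.io/llm-comparator/data/example_arena.json',
);
console.log(link);
```

Reading the parameter back with `new URL(link).searchParams.get('results_path')` returns the original, unencoded path.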

The tool helps you analyze *when* and *why* Gemma 1.1 is better or worse than
1.0 and *how* responses from two models qualitatively differ.
1.0 and *how* responses from two models differ.

- ***When***: The **Score Distribution** panel shows that the quality of
responses from Model A (Gemma 1.1) is considered better than that from Model B
(Gemma 1.0) (larger blue area than orange),
according to the LLM-based evaluation method
- ***When***: The **Score Distribution** and **Metrics by Prompt Category**
panels show that the quality of responses from Model A (Gemma 1.1) is considered
better than that from Model B (Gemma 1.0) (larger blue area than orange;
>50% win rate), according to the LLM-based evaluation method
([LLM-as-a-judge](https://arxiv.org/abs/2306.05685)).
This holds true for most prompt categories
(as in **Metrics by Prompt Category** panel).
This holds true for most prompt categories (e.g., Humanities, Math).
- ***Why***: The **Rationale Summary** panel dives into the reasons behind these
score differences.
In this case, the LLM judge focused mostly on the amount of details. It also
@@ -60,8 +60,8 @@ must follow the schema described below.

We assume that a user has a set of input prompts to test. For each prompt, they
need to prepare the responses to the prompt from two LLMs (i.e., Model A, Model
B), and a numerical score obtained from automatic side-by-side evaluation (also
known as [LLM-as-a-judge](https://arxiv.org/abs/2306.05685) or
B), and a numerical score obtained from side-by-side evaluation (e.g.,
[LLM-as-a-judge](https://arxiv.org/abs/2306.05685),
[AutoSxS](https://cloud.google.com/vertex-ai/generative-ai/docs/models/side-by-side-eval)).
A positive score represents that A's response is better than B's; a negative
score indicates B is better; and zero meaning a tie.
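
The per-example requirements and the sign convention for scores can be written down as a type plus two small checks. This is a sketch, not the tool's own loader: `ComparatorExample`, `missingFields`, and `interpretScore` are hypothetical names, and the outputs are assumed to be plain strings for simplicity. Only the five field names (`input_text`, `tags`, `output_text_a`, `output_text_b`, `score`) come from the documented format.

```typescript
// Hypothetical type mirroring the required per-example fields.
// Assumption: both model outputs are plain strings.
interface ComparatorExample {
  input_text: string;
  tags: string[];
  output_text_a: string;
  output_text_b: string;
  score: number;
}

const REQUIRED_FIELDS = [
  'input_text',
  'tags',
  'output_text_a',
  'output_text_b',
  'score',
] as const;

// Returns the names of required fields that are absent from a raw object.
function missingFields(example: object): string[] {
  return REQUIRED_FIELDS.filter((field) => !(field in example));
}

// Positive: A's response is better; negative: B's is better; zero: tie.
function interpretScore(score: number): 'A is better' | 'B is better' | 'Tie' {
  if (score > 0) return 'A is better';
  if (score < 0) return 'B is better';
  return 'Tie';
}

const sampleExample: ComparatorExample = {
  input_text: 'This is a prompt.',
  tags: ['Math'],
  output_text_a: 'Response to the prompt from the first model (A)',
  output_text_b: 'Response to the prompt from the other model (B)',
  score: -1.25,
};

console.log(missingFields(sampleExample), interpretScore(sampleExample.score));
```

Checking for missing fields before upload gives a clearer error than a failed load in the browser.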
@@ -83,7 +83,7 @@ All the fields presented below are required.
"examples": [
{
"input_text": "This is a prompt.",
"tags": ["Coding"], # A list of keywords for categorizing prompts
"tags": ["Math"], # A list of keywords for categorizing prompts
"output_text_a": "Response to the prompt from the first model (A)",
"output_text_b": "Response to the prompt from the other model (B)",
"score": -1.25, # Score from the judge LLM
@@ -100,13 +100,13 @@ All the fields presented below are required.

### Additional Data

Users can optionally provide additional information to be analyzed in LLM
You can optionally provide additional information to be analyzed in LLM
Comparator.

#### Custom Fields

If you have additional information about each prompt, it can be displayed as
a column in the table and aggregated information is visualized as a chart
columns in the table and aggregated information is visualized as charts
on the right side of the interface. It supports various data types, such as:

- `number`: Numeric data, visualized as histograms (e.g., word count for prompt,
@@ -231,18 +231,18 @@ npm run serve

## Citing LLM Comparator

If you use LLM Comparator as part of your work, please cite our paper at
https://arxiv.org/abs/2402.10524.
If you use LLM Comparator as part of your work, please cite our research paper
at https://arxiv.org/abs/2402.10524.

```
@inproceedings{kahng2024comparator,
title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of
Large Language Models},
title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of Large Language Models},
author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
booktitle={Extended Abstracts of the CHI Conference on Human Factors in
Computing Systems},
booktitle={Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},
year={2024},
publisher={ACM},
doi={10.1145/3613905.3650755},
url={https://arxiv.org/abs/2402.10524}
}
```

5 changes: 2 additions & 3 deletions client/app.ts
@@ -15,7 +15,6 @@
* limitations under the License.
*/

// tslint:disable:g3-no-void-expression
// tslint:disable:no-new-decorators
import './components/charts';
import './components/custom_functions';
@@ -89,14 +88,14 @@ export class LlmComparatorAppElement extends MobxLitElement {
</div>
<div class="link-icon">
<a href=${feedbackLink} target="_blank">
<mwc-icon class="icon" title="Open Form">
<mwc-icon class="icon" title="Send Feedback">
feedback
</mwc-icon>
</a>
</div>
<div class="link-icon">
<a href=${documentationLink} target="_blank">
<mwc-icon class="icon" title="Open project page">
<mwc-icon class="icon" title="Open Documentation Page">
help_outline
</mwc-icon>
</a>
2 changes: 1 addition & 1 deletion client/components/bar_chart.ts
@@ -37,7 +37,7 @@ export interface AggregatedEntry {

/**
* Component for bar charts. Currently for rating scores by individual raters.
* TODO(b/311744307): Extract common parts in the histogram.
* TODO: Extract common parts in the histogram.
*/
@customElement('comparator-bar-chart')
export class BarChartElement extends MobxLitElement {
2 changes: 1 addition & 1 deletion client/components/charts.ts
@@ -373,7 +373,7 @@ export class ChartsElement extends MobxLitElement {
const renderChartsForCustomFields: Array<[string, any]> =
this.appState
.columns
// TODO(b/315388387): Will not need when custom functions are
// TODO: Will not need when custom functions are
// merged.
.filter((field: Field) => field.id.startsWith('custom_field:'))
.filter(
4 changes: 2 additions & 2 deletions client/components/custom_functions.ts
@@ -245,7 +245,7 @@ export class CustomFunctionsElement extends MobxLitElement {
</comparator-binary-stacked-bar-chart>`;
}

// TODO(b/326139568): Merge into the side-by-side histogram code in charts.ts.
// TODO: Merge into the side-by-side histogram code in charts.ts.
private renderChartForNumberType(customFunc: CustomFunction) {
const getHistogramSpec = () =>
this.appState.histogramSpecForCustomFuncs[customFunc.id];
@@ -423,7 +423,7 @@ export class CustomFunctionsElement extends MobxLitElement {
'disabled': customFunc.precomputed === true,
});

// TODO(b/323336525): Improve the design for displaying custom func rows.
// TODO: Improve the design for displaying custom func rows.
// prettier-ignore
return html`
<tr class=${customFuncRowStyle(customFunc.id)}>
4 changes: 2 additions & 2 deletions client/components/dataset_selection.css
@@ -38,8 +38,8 @@
}

.panel-instruction {
color: #555;
line-height: 16px;
color: var(--comparator-grey-800);
line-height: 18px;
margin: 5px 0;
padding: 2px 0;
}
18 changes: 12 additions & 6 deletions client/components/dataset_selection.ts
@@ -29,7 +29,7 @@ import {AppState} from '../services/state_service';
import {styles} from './dataset_selection.css';

/**
* Dataset Selection component.
* Component for selecting data files.
*/
@customElement('comparator-dataset-selection')
export class DatasetSelectionElement extends MobxLitElement {
@@ -53,11 +53,17 @@ export class DatasetSelectionElement extends MobxLitElement {

return html`
<div>
The json file must contain these three properties: "metadata", "models",
and "examples".
The json file must contain these three properties:
<span class="filepath">metadata</span>,
<span class="filepath">models</span>,
and <span class="filepath">examples</span>.
<br />
Each example must have "input_text", "tags", "output_text_a",
"output_text_b", and "score".
Each example in <span class="filepath">examples</span> must have
<span class="filepath">input_text</span>,
<span class="filepath">tags</span>,
<span class="filepath">output_text_a</span>,
<span class="filepath">output_text_b</span>,
and <span class="filepath">score</span>.
<br />
Please refer to our document for details:
<a href="${documentationLink}" target="_blank">${documentationLink}</a>
@@ -94,7 +100,7 @@ export class DatasetSelectionElement extends MobxLitElement {

const textareaPlaceholder = 'Enter a URL to load the json file from.';
const urlLoadPath =
this.appState.appLink + '?results_path=https://.../results.json';
this.appState.appLink + '?results_path=https://.../...json';
const panelIntro = html`
Enter the URL path of a json file prepared for LLM Comparator.`;
const panelOutro = html`
21 changes: 10 additions & 11 deletions client/components/example_details.ts
@@ -153,7 +153,7 @@ export class ExampleDetailsElement extends MobxLitElement {
</comparator-histogram>`;
}

// TODO(b/311725252): Create a separate data-table component.
// TODO: Create a separate data-table component.
private renderRaterTable() {
const selectedExample = this.selectedExample;
if (selectedExample == null) {
@@ -237,18 +237,17 @@ export class ExampleDetailsElement extends MobxLitElement {
<th class="score" rowspan="2">Score ${renderSortIcons()}</th>
<th class="label" rowspan="2">Rating</th>
<th class="flipped" rowspan="2">Flipped?</th>
<th class="rationale" rowspan="2">
Rationale
<small>(Careful for flipped cases!)</small>
</th>
${this.appState.customFieldsOfPerRatingType.map((field: Field) =>
renderCustomFieldHeaderCell(field),
)}
<th class="rationale" rowspan="2">Rationale</th>
${
this.appState.customFieldsOfPerRatingType.map(
(field: Field) => renderCustomFieldHeaderCell(field),
)}
</tr>
<tr class="second-row">
${this.appState.customFieldsOfPerRatingType.map((field: Field) =>
renderCustomFieldHeaderCellSecondRow(field),
)}
${
this.appState.customFieldsOfPerRatingType.map(
(field: Field) => renderCustomFieldHeaderCellSecondRow(field),
)}
</tr>`;

// Table body.
13 changes: 12 additions & 1 deletion client/components/example_table.css
@@ -217,6 +217,11 @@ td.score.b-win {
text-decoration: underline;
}

.selected .rater-info-link {
color: var(--comparator-grey-800);
font-weight: 600;
}

td.score:hover .rater-info-link {
color: var(--comparator-grey-800);
}
@@ -257,7 +262,8 @@ ul.rationale-list li.cluster-selected::before {

.text-holder,
.list-holder,
.sequence-chunks-holder {
.sequence-chunks-holder,
.score-holder {
height: 119px; /* Set default as 17px x 7 rows */
overflow-x: hidden;
overflow-y: scroll;
@@ -273,6 +279,11 @@ ul.rationale-list li.cluster-selected::before {
overflow-wrap: anywhere;
}

.score-holder {
overflow-y: hidden;
padding-top: 0;
}

tr.monospace .text-holder {
font-family: monospace;
}
31 changes: 19 additions & 12 deletions client/components/example_table.ts
@@ -91,12 +91,16 @@ export class ExampleTableElement extends MobxLitElement {

private styleHolder(example: Example) {
return styleMap({
'height':
this.appState.selectedExample !== example
? `${
this.appState.numberOfLinesPerOutputCell * LINE_HEIGHT_IN_CELL
}px`
: 'auto',
'height': this.appState.getIsExampleExpanded(example.index) !== true ?
`${
this.appState.numberOfLinesPerOutputCell *
LINE_HEIGHT_IN_CELL}px` :
'auto',
'min-height': this.appState.getIsExampleExpanded(example.index) === true ?
`${
this.appState.numberOfLinesPerOutputCell *
LINE_HEIGHT_IN_CELL}px` :
null,
});
}
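
The height logic in this hunk, together with the double-click toggle in the next one, can be summarized as two small pure functions. The following is a simplified sketch rather than the component code: `holderStyle` and `toggleExpanded` are hypothetical names, plain objects stand in for `styleMap` and the MobX state, and the 17px line height is an assumption taken from the "17px x 7 rows" comment in example_table.css.

```typescript
// Assumed value; the real LINE_HEIGHT_IN_CELL constant is not shown in the diff.
const LINE_HEIGHT_IN_CELL = 17;

interface HolderStyle {
  height: string;
  minHeight: string | null;
}

// Collapsed rows get a fixed height; expanded rows grow to fit their content
// but never shrink below the collapsed height.
function holderStyle(isExpanded: boolean, numberOfLines: number): HolderStyle {
  const collapsedHeight = `${numberOfLines * LINE_HEIGHT_IN_CELL}px`;
  return isExpanded
    ? {height: 'auto', minHeight: collapsedHeight}
    : {height: collapsedHeight, minHeight: null};
}

// Expansion is tracked per example index, independently of row selection.
function toggleExpanded(state: Record<number, boolean>, index: number): void {
  state[index] = state[index] !== true;
}

const expandedState: Record<number, boolean> = {};
toggleExpanded(expandedState, 3);
console.log(holderStyle(expandedState[3] === true, 7));
```

Keeping the expanded flag per index, as the commit does, lets several rows stay expanded at once, which the earlier `selectedExample` comparison could not express.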

@@ -233,14 +237,17 @@ export class ExampleTableElement extends MobxLitElement {

private renderRow(example: Example, rowIndex: number) {
const handleDoubleClickRow = () => {
this.appState.selectedExample =
this.appState.selectedExample === example ? null : example;
this.appState.isExampleExpanded[example.index] =
this.appState.getIsExampleExpanded(example.index) === true ? false :
true;
};
const styleRow = classMap({
'selected': this.appState.selectedExample === example,
'monospace': this.appState.useMonospace === true,
});

const styleHolder = this.styleHolder(example);

// Use text diff only when both are texts.
const textDiff =
typeof example.output_text_a === 'string' &&
@@ -376,10 +383,12 @@ export class ExampleTableElement extends MobxLitElement {
</div>
${renderHistogram}` :
'';
const renderScore = example.score == null ? 'null' : html`
const renderScore = example.score == null ? 'Null' : html`
<div class="score-holder" style=${styleHolder}>
<div class="score-number">${example.score.toFixed(2)}</div>
${scoreDescription}
${raterInfoLink}`;
${raterInfoLink}
</div>`;

const styleScore = classMap({
'score': true,
@@ -467,8 +476,6 @@ export class ExampleTableElement extends MobxLitElement {
) :
html``;

const styleHolder = this.styleHolder(example);

// Custom fields.
const renderCustomField = (field: Field, columnIndex: number) => {
if (field.type === FieldType.PER_RATING_PER_MODEL_CATEGORY) {
2 changes: 1 addition & 1 deletion client/components/histogram.ts
@@ -38,7 +38,7 @@ import {styles} from './histogram.css';

/**
* Component for histograms for the distribution of scores or custom funcs.
* TODO(b/311744307): Extract common parts in the bar chart.
* TODO: Extract common parts in the bar chart.
*/
@customElement('comparator-histogram')
export class HistogramElement extends MobxLitElement {
9 changes: 7 additions & 2 deletions client/components/metrics_by_slice.css
@@ -1,3 +1,8 @@
thead {
position: sticky;
top: 0;
}

th.score-avg {
width: 98px; /* width sum for score-avg-number and score-avg-chart */
}
@@ -111,9 +116,9 @@ rect.bar.win-rate-result-tie {
fill: var(--comparator-grey-400);
}

.collapsed {
.collapsed .table-container {
max-height: 220px;
overflow-y: hidden;
overflow-y: scroll;
}

line.middle-point-vertical {