Update code, data, and documentation for launch

PAIR-code · May 14, 2024 · e80be93
1 parent 817f8a0
Showing 25 changed files with 314 additions and 229 deletions.
diff --git a/README.md b/README.md
@@ -3,16 +3,16 @@
 LLM Comparator is an interactive visualization tool for analyzing side-by-side
 LLM evaluation results. It is designed to help people qualitatively analyze how
 responses from two models differ at example- and slice-levels. Users can
-interactively discover insights like "Model A's responses are better than B's on
-email rewriting tasks because Model A tends to generate bulleted lists more
-often."
+interactively discover insights like *"Model A's responses are better than B's
+on email rewriting tasks because Model A tends to generate bulleted lists more
+often."*
 
 ![Screenshot of LLM Comparator interface](documentation/images/llm_comparator_screenshot.png)
 
 
 ## Using LLM Comparator
 
-You can open LLM Comparator at https://pair-code.github.io/llm-comparator/.
+You can play with LLM Comparator at https://pair-code.github.io/llm-comparator/.
 
 You can either select one of the example files we provide, or you can upload
 your own JSON file (e.g.,
@@ -25,19 +25,19 @@ that follows our format which we describe below.
 We provide an example file for comparing
 the model responses between [Gemma](https://ai.google.dev/gemma) 1.1 and 1.0
 for prompts obtained from the
-[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations). You can click the link below to play with it:
+[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations).
+You can click the link below to play with it:
 https://pair-code.github.io/llm-comparator/?results_path=https://pair-code.github.io/llm-comparator/data/example_arena.json
 
 The tool helps you analyze *when* and *why* Gemma 1.1 is better or worse than
-1.0 and *how* responses from two models qualitatively differ.
+1.0 and *how* responses from two models differ.
 
-- ***When***: The **Score Distribution** panel shows that the quality of
-responses from Model A (Gemma 1.1) is considered better than that from Model B
-(Gemma 1.0) (larger blue area than orange),
-according to the LLM-based evaluation method
+- ***When***: The **Score Distribution** and **Metrics by Prompt Category**
+panels show that the quality of responses from Model A (Gemma 1.1) is considered
+better than that from Model B (Gemma 1.0) (larger blue area than orange;
+>50% win rate), according to the LLM-based evaluation method
 ([LLM-as-a-judge](https://arxiv.org/abs/2306.05685)).
-This holds true for most prompt categories
-(as in **Metrics by Prompt Category** panel).
+This holds true for most prompt categories (e.g., Humanities, Math).
 - ***Why***: The **Rationale Summary** panel dives into the reasons behind these
 score differences.
 In this case, the LLM judge focused mostly on the amount of details. It also
@@ -60,8 +60,8 @@ must follow the schema described below.
 
 We assume that a user has a set of input prompts to test. For each prompt, they
 need to prepare the responses to the prompt from two LLMs (i.e., Model A, Model
-B), and a numerical score obtained from automatic side-by-side evaluation (also
-known as [LLM-as-a-judge](https://arxiv.org/abs/2306.05685) or
+B), and a numerical score obtained from side-by-side evaluation (e.g.,
+[LLM-as-a-judge](https://arxiv.org/abs/2306.05685),
 [AutoSxS](https://cloud.google.com/vertex-ai/generative-ai/docs/models/side-by-side-eval)).
 A positive score represents that A's response is better than B's; a negative
 score indicates B is better; and zero meaning a tie.
@@ -83,7 +83,7 @@ All the fields presented below are required.
     "examples": [
         {
             "input_text": "This is a prompt.",
-            "tags": ["Coding"],  # A list of keywords for categorizing prompts
+            "tags": ["Math"],  # A list of keywords for categorizing prompts
             "output_text_a": "Response to the prompt from the first model (A)",
             "output_text_b": "Response to the prompt from the other model (B)",
             "score": -1.25,  # Score from the judge LLM
@@ -100,13 +100,13 @@ All the fields presented below are required.
 
 ### Additional Data
 
-Users can optionally provide additional information to be analyzed in LLM
+You can optionally provide additional information to be analyzed in LLM
 Comparator.
 
 #### Custom Fields
 
 If you have additional information about each prompt, it can be displayed as
-a column in the table and aggregated information is visualized as a chart
+columns in the table and aggregated information is visualized as charts
 on the right side of the interface. It supports various data types, such as:
 
 - `number`: Numeric data, visualized as histograms (e.g., word count for prompt,
@@ -231,18 +231,18 @@ npm run serve
 
 ## Citing LLM Comparator
 
-If you use LLM Comparator as part of your work, please cite our paper at
-https://arxiv.org/abs/2402.10524.
+If you use LLM Comparator as part of your work, please cite our research paper
+at https://arxiv.org/abs/2402.10524.
 
 ```
 @inproceedings{kahng2024comparator,
-    title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of
-    Large Language Models},
+    title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of Large Language Models},
     author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
-    booktitle={Extended Abstracts of the CHI Conference on Human Factors in
-    Computing Systems},
+    booktitle={Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},
     year={2024},
     publisher={ACM},
+    doi={10.1145/3613905.3650755},
+    url={https://arxiv.org/abs/2402.10524}
 }
 ```
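
The README hunks above list the fields each example must have and explain how the sign of `score` is read. A minimal TypeScript sketch of that contract follows. The names `ComparatorExample`, `findMissingFields`, and `winnerFromScore` are hypothetical and are not part of the LLM Comparator codebase; the sketch only checks that the required keys are present, and the field types in the interface are assumptions based on the README's sample JSON.

```typescript
// Hypothetical type mirroring the per-example fields named in the README.
// The field types are assumptions taken from the README's sample JSON.
interface ComparatorExample {
  input_text: string;
  tags: string[];
  output_text_a: string;
  output_text_b: string;
  score: number;
}

// Required per-example keys, in the order the README lists them.
const REQUIRED_EXAMPLE_FIELDS = [
  'input_text',
  'tags',
  'output_text_a',
  'output_text_b',
  'score',
];

// Returns the required keys that are absent from a candidate example.
// This checks presence only, not value types.
function findMissingFields(candidate: Record<string, unknown>): string[] {
  return REQUIRED_EXAMPLE_FIELDS.filter((key) => !(key in candidate));
}

type Winner = 'A' | 'B' | 'tie';

// Positive score: A's response is better. Negative: B is better. Zero: tie.
function winnerFromScore(score: number): Winner {
  if (score > 0) return 'A';
  if (score < 0) return 'B';
  return 'tie';
}

// Usage with the sample values from the README.
const sample: ComparatorExample = {
  input_text: 'This is a prompt.',
  tags: ['Math'],
  output_text_a: 'Response to the prompt from the first model (A)',
  output_text_b: 'Response to the prompt from the other model (B)',
  score: -1.25,
};
const sampleWinner = winnerFromScore(sample.score);
```

A presence-only check is a deliberate choice here: the client code in this commit handles a `null` score, so stricter type validation might reject files the tool would otherwise accept.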
 

diff --git a/client/app.ts b/client/app.ts
@@ -15,7 +15,6 @@
  * limitations under the License.
  */
 
-// tslint:disable:g3-no-void-expression
 // tslint:disable:no-new-decorators
 import './components/charts';
 import './components/custom_functions';
@@ -89,14 +88,14 @@ export class LlmComparatorAppElement extends MobxLitElement {
           </div>
           <div class="link-icon">
             <a href=${feedbackLink} target="_blank">
-              <mwc-icon class="icon" title="Open Form">
+              <mwc-icon class="icon" title="Send Feedback">
                 feedback
               </mwc-icon>
             </a>
           </div>
           <div class="link-icon">
             <a href=${documentationLink} target="_blank">
-              <mwc-icon class="icon" title="Open project page">
+              <mwc-icon class="icon" title="Open Documentation Page">
                 help_outline
               </mwc-icon>
             </a>

diff --git a/client/components/bar_chart.ts b/client/components/bar_chart.ts
@@ -37,7 +37,7 @@ export interface AggregatedEntry {
 
 /**
  * Component for bar charts. Currently for rating scores by individual raters.
- * TODO(b/311744307): Extract common parts in the histogram.
+ * TODO: Extract common parts in the histogram.
  */
 @customElement('comparator-bar-chart')
 export class BarChartElement extends MobxLitElement {

diff --git a/client/components/charts.ts b/client/components/charts.ts
@@ -373,7 +373,7 @@ export class ChartsElement extends MobxLitElement {
     const renderChartsForCustomFields: Array<[string, any]> =
         this.appState
             .columns
-            // TODO(b/315388387): Will not need when custom functions are
+            // TODO: Will not need when custom functions are
             // merged.
             .filter((field: Field) => field.id.startsWith('custom_field:'))
             .filter(

diff --git a/client/components/custom_functions.ts b/client/components/custom_functions.ts
@@ -245,7 +245,7 @@ export class CustomFunctionsElement extends MobxLitElement {
     </comparator-binary-stacked-bar-chart>`;
   }
 
-  // TODO(b/326139568): Merge into the side-by-side histogram code in charts.ts.
+  // TODO: Merge into the side-by-side histogram code in charts.ts.
   private renderChartForNumberType(customFunc: CustomFunction) {
     const getHistogramSpec = () =>
         this.appState.histogramSpecForCustomFuncs[customFunc.id];
@@ -423,7 +423,7 @@ export class CustomFunctionsElement extends MobxLitElement {
       'disabled': customFunc.precomputed === true,
     });
 
-    // TODO(b/323336525): Improve the design for displaying custom func rows.
+    // TODO: Improve the design for displaying custom func rows.
     // prettier-ignore
     return html`
       <tr class=${customFuncRowStyle(customFunc.id)}>

diff --git a/client/components/dataset_selection.css b/client/components/dataset_selection.css
@@ -38,8 +38,8 @@
 }
 
 .panel-instruction {
-  color: #555;
-  line-height: 16px;
+  color: var(--comparator-grey-800);
+  line-height: 18px;
   margin: 5px 0;
   padding: 2px 0;
 }

diff --git a/client/components/dataset_selection.ts b/client/components/dataset_selection.ts
@@ -29,7 +29,7 @@ import {AppState} from '../services/state_service';
 import {styles} from './dataset_selection.css';
 
 /**
- * Dataset Selection component.
+ * Component for selecting data files.
  */
 @customElement('comparator-dataset-selection')
 export class DatasetSelectionElement extends MobxLitElement {
@@ -53,11 +53,17 @@ export class DatasetSelectionElement extends MobxLitElement {
 
     return html`
       <div>
-        The json file must contain these three properties: "metadata", "models",
-        and "examples".
+        The json file must contain these three properties:
+        <span class="filepath">metadata</span>,
+        <span class="filepath">models</span>,
+        and <span class="filepath">examples</span>.
         <br />
-        Each example must have "input_text", "tags", "output_text_a",
-        "output_text_b", and "score".
+        Each example in <span class="filepath">examples</span> must have
+        <span class="filepath">input_text</span>,
+        <span class="filepath">tags</span>,
+        <span class="filepath">output_text_a</span>,
+        <span class="filepath">output_text_b</span>,
+        and <span class="filepath">score</span>.
         <br />
         Please refer to our document for details:
         <a href="${documentationLink}" target="_blank">${documentationLink}</a>
@@ -94,7 +100,7 @@ export class DatasetSelectionElement extends MobxLitElement {
 
     const textareaPlaceholder = 'Enter a URL to load the json file from.';
     const urlLoadPath =
-        this.appState.appLink + '?results_path=https://.../results.json';
+        this.appState.appLink + '?results_path=https://.../...json';
     const panelIntro = html`
       Enter the URL path of a json file prepared for LLM Comparator.`;
     const panelOutro = html`

diff --git a/client/components/example_details.ts b/client/components/example_details.ts
@@ -153,7 +153,7 @@ export class ExampleDetailsElement extends MobxLitElement {
     </comparator-histogram>`;
   }
 
-  // TODO(b/311725252): Create a separate data-table component.
+  // TODO: Create a separate data-table component.
   private renderRaterTable() {
     const selectedExample = this.selectedExample;
     if (selectedExample == null) {
@@ -237,18 +237,17 @@ export class ExampleDetailsElement extends MobxLitElement {
         <th class="score" rowspan="2">Score ${renderSortIcons()}</th>
         <th class="label" rowspan="2">Rating</th>
         <th class="flipped" rowspan="2">Flipped?</th>
-        <th class="rationale" rowspan="2">
-          Rationale
-          <small>(Careful for flipped cases!)</small>
-        </th>
-        ${this.appState.customFieldsOfPerRatingType.map((field: Field) =>
-          renderCustomFieldHeaderCell(field),
-        )}
+        <th class="rationale" rowspan="2">Rationale</th>
+        ${
+        this.appState.customFieldsOfPerRatingType.map(
+            (field: Field) => renderCustomFieldHeaderCell(field),
+            )}
       </tr>
       <tr class="second-row">
-        ${this.appState.customFieldsOfPerRatingType.map((field: Field) =>
-          renderCustomFieldHeaderCellSecondRow(field),
-        )}
+        ${
+        this.appState.customFieldsOfPerRatingType.map(
+            (field: Field) => renderCustomFieldHeaderCellSecondRow(field),
+            )}
       </tr>`;
 
     // Table body.

diff --git a/client/components/example_table.css b/client/components/example_table.css
@@ -217,6 +217,11 @@ td.score.b-win {
   text-decoration: underline;
 }
 
+.selected .rater-info-link {
+  color: var(--comparator-grey-800);
+  font-weight: 600;
+}
+
 td.score:hover .rater-info-link {
   color: var(--comparator-grey-800);
 }
@@ -257,7 +262,8 @@ ul.rationale-list li.cluster-selected::before {
 
 .text-holder,
 .list-holder,
-.sequence-chunks-holder {
+.sequence-chunks-holder,
+.score-holder {
   height: 119px;  /* Set default as 17px x 7 rows */
   overflow-x: hidden;
   overflow-y: scroll;
@@ -273,6 +279,11 @@ ul.rationale-list li.cluster-selected::before {
   overflow-wrap: anywhere;
 }
 
+.score-holder {
+  overflow-y: hidden;
+  padding-top: 0;
+}
+
 tr.monospace .text-holder {
   font-family: monospace;
 }

diff --git a/client/components/example_table.ts b/client/components/example_table.ts
@@ -91,12 +91,16 @@ export class ExampleTableElement extends MobxLitElement {
 
   private styleHolder(example: Example) {
     return styleMap({
-      'height':
-        this.appState.selectedExample !== example
-          ? `${
-              this.appState.numberOfLinesPerOutputCell * LINE_HEIGHT_IN_CELL
-            }px`
-          : 'auto',
+      'height': this.appState.getIsExampleExpanded(example.index) !== true ?
+          `${
+              this.appState.numberOfLinesPerOutputCell *
+              LINE_HEIGHT_IN_CELL}px` :
+          'auto',
+      'min-height': this.appState.getIsExampleExpanded(example.index) === true ?
+          `${
+              this.appState.numberOfLinesPerOutputCell *
+              LINE_HEIGHT_IN_CELL}px` :
+          null,
     });
   }
 
@@ -233,14 +237,17 @@ export class ExampleTableElement extends MobxLitElement {
 
   private renderRow(example: Example, rowIndex: number) {
     const handleDoubleClickRow = () => {
-      this.appState.selectedExample =
-        this.appState.selectedExample === example ? null : example;
+      this.appState.isExampleExpanded[example.index] =
+          this.appState.getIsExampleExpanded(example.index) === true ? false :
+                                                                       true;
     };
     const styleRow = classMap({
       'selected': this.appState.selectedExample === example,
       'monospace': this.appState.useMonospace === true,
     });
 
+    const styleHolder = this.styleHolder(example);
+
     // Use text diff only when both are texts.
     const textDiff =
       typeof example.output_text_a === 'string' &&
@@ -376,10 +383,12 @@ export class ExampleTableElement extends MobxLitElement {
           </div>
           ${renderHistogram}` :
         '';
-    const renderScore = example.score == null ? 'null' : html`
+    const renderScore = example.score == null ? 'Null' : html`
+        <div class="score-holder" style=${styleHolder}>
           <div class="score-number">${example.score.toFixed(2)}</div>
           ${scoreDescription}
-          ${raterInfoLink}`;
+          ${raterInfoLink}
+        </div>`;
 
     const styleScore = classMap({
       'score': true,
@@ -467,8 +476,6 @@ export class ExampleTableElement extends MobxLitElement {
             ) :
         html``;
 
-    const styleHolder = this.styleHolder(example);
-
     // Custom fields.
     const renderCustomField = (field: Field, columnIndex: number) => {
       if (field.type === FieldType.PER_RATING_PER_MODEL_CATEGORY) {
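
The `example_table.ts` hunks above replace the single `selectedExample` check with a per-example expanded flag, so double-clicking a row toggles only that row's height. The sketch below illustrates that behavior in isolation. `ExpandState` is a hypothetical stand-in for the relevant part of `AppState`, whose real implementation is not shown in this diff, and the constants (17px line height, 7 lines) are assumptions taken from the CSS comment "17px x 7 rows".

```typescript
// Assumed value, taken from the CSS comment "17px x 7 rows".
const LINE_HEIGHT_IN_CELL = 17;

// Hypothetical stand-in for the expanded-state portion of AppState.
class ExpandState {
  // Maps an example index to whether its row is expanded.
  private readonly isExampleExpanded: {[index: number]: boolean} = {};
  // Assumed default of 7 lines per output cell.
  numberOfLinesPerOutputCell = 7;

  // Rows that were never toggled count as collapsed.
  getIsExampleExpanded(index: number): boolean {
    return this.isExampleExpanded[index] === true;
  }

  // Mirrors the double-click handler: flip the flag for one row only.
  toggle(index: number): void {
    this.isExampleExpanded[index] = !this.getIsExampleExpanded(index);
  }

  // Collapsed rows get a fixed height; expanded rows grow to fit their
  // content but never shrink below the collapsed height.
  holderStyle(index: number): {height: string; minHeight: string | null} {
    const collapsedHeight =
        `${this.numberOfLinesPerOutputCell * LINE_HEIGHT_IN_CELL}px`;
    return this.getIsExampleExpanded(index) ?
        {height: 'auto', minHeight: collapsedHeight} :
        {height: collapsedHeight, minHeight: null};
  }
}

// Usage: expanding row 3 leaves row 4 collapsed.
const expandState = new ExpandState();
expandState.toggle(3);
const expandedStyle = expandState.holderStyle(3);
const collapsedStyle = expandState.holderStyle(4);
```

Keeping the flag per index, rather than reusing the selection, lets several rows stay expanded at once while selection continues to drive the `selected` row class.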

diff --git a/client/components/histogram.ts b/client/components/histogram.ts
@@ -38,7 +38,7 @@ import {styles} from './histogram.css';
 
 /**
  * Component for histograms for the distribution of scores or custom funcs.
- * TODO(b/311744307): Extract common parts in the bar chart.
+ * TODO: Extract common parts in the bar chart.
  */
 @customElement('comparator-histogram')
 export class HistogramElement extends MobxLitElement {

diff --git a/client/components/metrics_by_slice.css b/client/components/metrics_by_slice.css
@@ -1,3 +1,8 @@
+thead {
+  position: sticky;
+  top: 0;
+}
+
 th.score-avg {
   width: 98px;  /* width sum for score-avg-number and score-avg-chart */
 }
@@ -111,9 +116,9 @@ rect.bar.win-rate-result-tie {
   fill: var(--comparator-grey-400);
 }
 
-.collapsed {
+.collapsed .table-container {
   max-height: 220px;
-  overflow-y: hidden;
+  overflow-y: scroll;
 }
 
 line.middle-point-vertical {