diff --git a/README.md b/README.md index 717a898..4e17a13 100644 --- a/README.md +++ b/README.md @@ -3,16 +3,16 @@ LLM Comparator is an interactive visualization tool for analyzing side-by-side LLM evaluation results. It is designed to help people qualitatively analyze how responses from two models differ at example- and slice-levels. Users can -interactively discover insights like "Model A's responses are better than B's on -email rewriting tasks because Model A tends to generate bulleted lists more -often." +interactively discover insights like *"Model A's responses are better than B's +on email rewriting tasks because Model A tends to generate bulleted lists more +often."* ![Screenshot of LLM Comparator interface](documentation/images/llm_comparator_screenshot.png) ## Using LLM Comparator -You can open LLM Comparator at https://pair-code.github.io/llm-comparator/. +You can play with LLM Comparator at https://pair-code.github.io/llm-comparator/. You can either select one of the example files we provide, or you can upload your own JSON file (e.g., @@ -25,19 +25,19 @@ that follows our format which we describe below. We provide an example file for comparing the model responses between [Gemma](https://ai.google.dev/gemma) 1.1 and 1.0 for prompts obtained from the -[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations). You can click the link below to play with it: +[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations). +You can click the link below to play with it: https://pair-code.github.io/llm-comparator/?results_path=https://pair-code.github.io/llm-comparator/data/example_arena.json The tool helps you analyze *when* and *why* Gemma 1.1 is better or worse than -1.0 and *how* responses from two models qualitatively differ. +1.0 and *how* responses from two models differ. 
-- ***When***: The **Score Distribution** panel shows that the quality of -responses from Model A (Gemma 1.1) is considered better than that from Model B -(Gemma 1.0) (larger blue area than orange), -according to the LLM-based evaluation method +- ***When***: The **Score Distribution** and **Metrics by Prompt Category** +panels show that the quality of responses from Model A (Gemma 1.1) is considered +better than that from Model B (Gemma 1.0) (larger blue area than orange; +>50% win rate), according to the LLM-based evaluation method ([LLM-as-a-judge](https://arxiv.org/abs/2306.05685)). -This holds true for most prompt categories -(as in **Metrics by Prompt Category** panel). +This holds true for most prompt categories (e.g., Humanities, Math). - ***Why***: The **Rationale Summary** panel dives into the reasons behind these score differences. In this case, the LLM judge focused mostly on the amount of details. It also @@ -60,8 +60,8 @@ must follow the schema described below. We assume that a user has a set of input prompts to test. For each prompt, they need to prepare the responses to the prompt from two LLMs (i.e., Model A, Model -B), and a numerical score obtained from automatic side-by-side evaluation (also -known as [LLM-as-a-judge](https://arxiv.org/abs/2306.05685) or +B), and a numerical score obtained from side-by-side evaluation (e.g., +[LLM-as-a-judge](https://arxiv.org/abs/2306.05685), [AutoSxS](https://cloud.google.com/vertex-ai/generative-ai/docs/models/side-by-side-eval)). A positive score represents that A's response is better than B's; a negative score indicates B is better; and zero meaning a tie. @@ -83,7 +83,7 @@ All the fields presented below are required. 
"examples": [ { "input_text": "This is a prompt.", - "tags": ["Coding"], # A list of keywords for categorizing prompts + "tags": ["Math"], # A list of keywords for categorizing prompts "output_text_a": "Response to the prompt from the first model (A)", "output_text_b": "Response to the prompt from the other model (B)", "score": -1.25, # Score from the judge LLM @@ -100,13 +100,13 @@ All the fields presented below are required. ### Additional Data -Users can optionally provide additional information to be analyzed in LLM +You can optionally provide additional information to be analyzed in LLM Comparator. #### Custom Fields If you have additional information about each prompt, it can be displayed as -a column in the table and aggregated information is visualized as a chart +columns in the table and aggregated information is visualized as charts on the right side of the interface. It supports various data types, such as: - `number`: Numeric data, visualized as histograms (e.g., word count for prompt, @@ -231,18 +231,18 @@ npm run serve ## Citing LLM Comparator -If you use LLM Comparator as part of your work, please cite our paper at -https://arxiv.org/abs/2402.10524. +If you use LLM Comparator as part of your work, please cite our research paper +at https://arxiv.org/abs/2402.10524. 
``` @inproceedings{kahng2024comparator, - title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of - Large Language Models}, + title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of Large Language Models}, author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas}, - booktitle={Extended Abstracts of the CHI Conference on Human Factors in - Computing Systems}, + booktitle={Extended Abstracts of the CHI Conference on Human Factors in Computing Systems}, year={2024}, publisher={ACM}, + doi={10.1145/3613905.3650755}, + url={https://arxiv.org/abs/2402.10524} } ``` diff --git a/client/app.ts b/client/app.ts index 46879e0..df665bc 100644 --- a/client/app.ts +++ b/client/app.ts @@ -15,7 +15,6 @@ * limitations under the License. */ -// tslint:disable:g3-no-void-expression // tslint:disable:no-new-decorators import './components/charts'; import './components/custom_functions'; @@ -89,14 +88,14 @@ export class LlmComparatorAppElement extends MobxLitElement { `; } + // Render a win rate chart using a stacked percentage bar chart. private renderWinRateChart( winRate: number, entry: SliceWinRate, @@ -419,7 +417,7 @@ export class MetricsBySliceElement extends MobxLitElement { y2=${this.barHeight * 0.5} />` : svg``; - // TODO(b/325506046): Use tooltip for confidence interval details. + // TODO: Use tooltip for confidence interval details. const tooltipText = intervalLeft != null && intervalRight != null ? 
`${`95% CI: [${intervalLeft.toFixed(3)}, ${intervalRight.toFixed( @@ -618,7 +616,6 @@ export class MetricsBySliceElement extends MobxLitElement { } override render() { - // prettier-ignore return html`${this.renderWinRateBySliceChart()}`; } } diff --git a/client/components/rationale_summary.css b/client/components/rationale_summary.css index b0579cc..21ce424 100644 --- a/client/components/rationale_summary.css +++ b/client/components/rationale_summary.css @@ -4,7 +4,7 @@ th.example-count { } th.remove { - width: 32px; + width: 26px; } text.bar-count-text { diff --git a/client/components/rationale_summary.ts b/client/components/rationale_summary.ts index d7a0b2f..9b2e829 100644 --- a/client/components/rationale_summary.ts +++ b/client/components/rationale_summary.ts @@ -51,7 +51,7 @@ export class RationaleSummaryElement extends MobxLitElement { private readonly widthOfNumberLabel = 10; // Whether to show the "others" category (id=0). - // TODO(kahng): Implemented, but decided not to display it for now. + // TODO: Implemented, but decided not to display it for now. @observable showOthers = false; @observable sortColumn = 'A'; // label, A, or B @@ -297,10 +297,7 @@ export class RationaleSummaryElement extends MobxLitElement {
What are some clusters of the rationales used by the rater - when it thinks - ${this.sortColumn === 'A' || this.sortColumn === 'B' - ? `${this.sortColumn}` - : 'either A or B'} is better? + when it thinks A or B is better?
diff --git a/client/components/score_histogram.ts b/client/components/score_histogram.ts index f7d194a..8e0b1d9 100644 --- a/client/components/score_histogram.ts +++ b/client/components/score_histogram.ts @@ -31,7 +31,7 @@ import {AppState} from '../services/state_service'; import {styles} from './score_histogram.css'; /** - * Component for visualizing Autorater scores. + * Component for visualizing the score distribution as a histogram. */ @customElement('comparator-score-histogram') export class ScoreHistogramElement extends MobxLitElement { diff --git a/client/components/settings.ts b/client/components/settings.ts index 43a60e5..81e6801 100644 --- a/client/components/settings.ts +++ b/client/components/settings.ts @@ -77,7 +77,7 @@ declare global { } /** - * Renders the data table settings pop-up. + * Renders the data table settings pop-up on the left side. */ @customElement('comparator-settings') export class ComparatorSettingsElement extends MobxLitElement { diff --git a/client/components/toolbar.ts b/client/components/toolbar.ts index 006daab..0a01f6c 100644 --- a/client/components/toolbar.ts +++ b/client/components/toolbar.ts @@ -28,7 +28,7 @@ import {AppState} from '../services/state_service'; import {styles} from './toolbar.css'; /** - * Toolbar component. + * Toolbar component at the top of the main table. */ @customElement('comparator-toolbar') export class ToolbarElement extends MobxLitElement { @@ -260,39 +260,43 @@ export class ToolbarElement extends MobxLitElement {
${shownNum} displayed - ${filteredNum !== totalNum ? - html` + ${ + filteredNum !== totalNum ? html` of ${this.appState.filteredExamples.length} filtered` : - ''} + ''} (${totalNum} total)
- ${isAnyFilter === true ? + ${ + isAnyFilter === true ? html` ${renderFilterChips}` : ''}
- ${currentSorting.column !== SortColumn.NONE ? - html` -
- - - - ${currentSorting.column === SortColumn.CUSTOM_ATTRIBUTE ? + ${ + currentSorting.column !== SortColumn.NONE ? + html` +
+ + + + ${ + currentSorting.column === SortColumn.CUSTOM_ATTRIBUTE ? currentSorting.customField!.name : currentSorting.column} - ${currentSorting.modelIndex != null ? - ` for Output ${ - Object.values(AOrB)[currentSorting.modelIndex]}` : + ${ + currentSorting.modelIndex != null ? + ` for Response ${ + Object.values(AOrB)[currentSorting.modelIndex]}` : ''} - - ${currentSorting.order} - -
` : - ''} +
+ ${currentSorting.order} +
+
` : + ''}
`; } } diff --git a/client/lib/types.ts b/client/lib/types.ts index 545dfe1..f06c48b 100644 --- a/client/lib/types.ts +++ b/client/lib/types.ts @@ -46,7 +46,7 @@ export interface IndividualRating { rating_label: string | null; is_flipped: boolean | null; rationale: string | null; - // TODO(b/324469307): Support more types. + // TODO: Support more types. custom_fields: { [key: string]: string | Array; }; @@ -100,10 +100,10 @@ export interface SequenceChunk { // tslint:disable:enforce-name-casing export interface Example { index: number; - input_text: string | SequenceChunk[]; - output_text_a: string | SequenceChunk[]; - output_text_b: string | SequenceChunk[]; + input_text: string|SequenceChunk[]; tags: string[]; + output_text_a: string|SequenceChunk[]; + output_text_b: string|SequenceChunk[]; score: number | null; individual_rater_scores: IndividualRating[]; rationale_list: RationaleListItem[]; @@ -171,7 +171,6 @@ export interface CustomFieldSchema { export interface Metadata { source_path: string; custom_fields_schema: CustomFieldSchema[]; - sampling_step_size: number; } /** @@ -343,7 +342,7 @@ export interface HistogramSpec { /** * Interface for a custom field for ratings selection. * (only supporting per_rating_per_model_category for now) - * TODO(b/324469307): Support more per-rating types. + * TODO: Support more per-rating types. */ export interface RatingChartSelection { fieldId: string; diff --git a/client/lib/utils.ts b/client/lib/utils.ts index 6f216f0..77aa272 100644 --- a/client/lib/utils.ts +++ b/client/lib/utils.ts @@ -677,9 +677,8 @@ export function getBarFilterLabel( * Helper for cleaning LLM-generated values. */ export function cleanValue(val: string | null) { - // There exist many variants of "issues" (e.g., "No Issues", "No issues(s), - // "Major issue(s)", etc. We detect them and replace with "issues". - return val == null ? val : val.replace(/\bissues?\(?(s)?\)?$/i, 'issues'); + // We can include some manual data cleaning pipelines. 
+ return val; } /** diff --git a/client/services/state_service.ts b/client/services/state_service.ts index fbe62a9..76b56c2 100644 --- a/client/services/state_service.ts +++ b/client/services/state_service.ts @@ -21,7 +21,7 @@ import {computed, makeObservable, observable} from 'mobx'; import {BUILT_IN_DEMO_FILES, DEFAULT_COLUMN_LIST, DEFAULT_HISTOGRAM_SPEC, DEFAULT_NUM_EXAMPLES_TO_DISPLAY, DEFAULT_RATIONALE_CLUSTER_SIMILARITY_THRESHOLD, DEFAULT_SORTING_CRITERIA, DEFAULT_WIN_RATE_THRESHOLD, FIELD_ID_FOR_INPUT, FIELD_ID_FOR_OUTPUT_A, FIELD_ID_FOR_OUTPUT_B, FIELD_ID_FOR_RATIONALE_LIST, FIELD_ID_FOR_RATIONALES, FIELD_ID_FOR_SCORE, FIVE_POINT_LIKERT_HISTOGRAM_SPEC, INITIAL_CUSTOM_FUNCTIONS,} from '../lib/constants'; import type {ChartSelectionKey, CustomFieldSchema, CustomFunction, Example, Field, HistogramSpec, IndividualRating, Metadata, Model, RatingChartSelection, RationaleCluster, RationaleListItem, SortCriteria,} from '../lib/types'; import {AOrB, ChartType, CustomFuncReturnType, DataResponse, ErrorResponse, FieldType, SortColumn, SortOrder,} from '../lib/types'; -import {compareNumbersWithNulls, compareStringsWithNulls, computeSimilaritiesBetweenVectorAndNormalizedMatrix, convertToNumber, extractTextFromTextOrSequenceChunks, getFieldIdForCustomFunc, getHistogramBinIndexFromValue, getMinAndMax, groupByAndSortKeys, groupByValues, initializeCustomFuncSelections, isPerRatingFieldType, mergeTwoArrays, normalizeVector, searchText,} from '../lib/utils'; +import {compareNumbersWithNulls, compareStringsWithNulls, convertToNumber, extractTextFromTextOrSequenceChunks, getFieldIdForCustomFunc, getHistogramBinIndexFromValue, getMinAndMax, groupByAndSortKeys, groupByValues, initializeCustomFuncSelections, isPerRatingFieldType, mergeTwoArrays, searchText,} from '../lib/utils'; import {CustomFunctionService} from './custom_function_service'; import {Service} from './service'; @@ -35,13 +35,7 @@ export class AppState extends Service { makeObservable(this); } - @observable 
datasetPath: string | null = null; - @observable isDatasetPathUploadedFile = false; - @observable isOpenDatasetSelectionPanel = true; - - @observable targetTeam = 'app'; // app, gemini, bard, etc. - @observable exampleDatasetPaths: string[] = BUILT_IN_DEMO_FILES; - + // Fields from data files. @observable metadata: Metadata = { source_path: '', @@ -52,11 +46,27 @@ export class AppState extends Service { @observable examples: Example[] = []; @observable rationaleClusters: RationaleCluster[] = []; + // Dataset path. + @observable datasetPath: string|null = null; + @observable isDatasetPathUploadedFile = false; + @observable isOpenDatasetSelectionPanel = true; + + @observable exampleDatasetPaths: string[] = BUILT_IN_DEMO_FILES; + + // Tags. + @observable selectedTag: string|null = null; + + // Table sorting. @observable currentSorting: SortCriteria = DEFAULT_SORTING_CRITERIA; - @observable selectedExample: Example | null = null; - @observable selectedTag: string | null = null; + // Example expansion (key: index). If not exists, assume false. + @observable isExampleExpanded: {[key: number]: boolean} = {}; + getIsExampleExpanded(index: number): boolean { + return this.isExampleExpanded[index] ?? false; + } + // Example details. + @observable selectedExample: Example|null = null; @observable showSelectedExampleDetails = false; @observable exampleDetailsPanelExpanded = false; @@ -75,7 +85,7 @@ export class AppState extends Service { DEFAULT_RATIONALE_CLUSTER_SIMILARITY_THRESHOLD; // Columns. - // TODO(b/315147299): Use a url service to sync the visibility state. + // TODO: Use a url service to sync the visibility state. @observable columns: Field[] = DEFAULT_COLUMN_LIST; // Charts. @@ -88,7 +98,7 @@ export class AppState extends Service { // For simple bar charts, a single-item array, e.g., [null] (non-selected); // for grouped bar charts, a two-item array, e.g., ['sports', null] // (if the bar for 'sports' is selected for A; no bars are selected for B). 
- // TODO(b/315722619): Merge selection variables into one. + // TODO: Merge selection variables into one. @observable selectedBarChartValues: {[key: string]: Array} = {}; @@ -209,8 +219,6 @@ export class AppState extends Service { ); } - @observable sampleCountForCheckingRatingLevelDataAvailability = 10; - // Custom functions. @observable customFunctions: {[key: number]: CustomFunction} = {}; @@ -281,7 +289,7 @@ export class AppState extends Service { return ( this.columns .filter((field: Field) => field.type === FieldType.PER_MODEL_NUMBER) - // TODO(b/315388387): Will not need when custom functions are + // TODO: Will not need when custom functions are // merged. .filter((field: Field) => field.id.startsWith('custom_field:'))); } @@ -297,7 +305,7 @@ export class AppState extends Service { field.type === FieldType.PER_RATING_PER_MODEL_CATEGORY, ) // Exclude custom functions - // TODO(b/315388387): Will not need when custom functions are + // TODO: Will not need when custom functions are // merged. .filter((field: Field) => field.id.startsWith('custom_field:'))); } @@ -489,7 +497,7 @@ export class AppState extends Service { return examples; } - // TODO(b/326139568): Merge with the side-by-side histograms. + // TODO: Merge with the side-by-side histograms. private applyHistogramFilterForCustomFuncs( examplesBeforeThisFilter: Example[], excludeId: number|null = null, @@ -991,7 +999,9 @@ export class AppState extends Service { this.selectedTag = null; this.selectedCustomFuncId = null; - // TODO(b/315722619) Merge selection variables. + this.isExampleExpanded = {}; + + // TODO Merge selection variables. this.selectedHistogramBinForScores = null; this.selectedHistogramBinForCustomFields = {}; this.selectedBarChartValues = {}; @@ -1035,8 +1045,7 @@ export class AppState extends Service { params[key] = decodeURIComponent(value); } - // Get results_path (and cns_path) parameter from url. - // The cns_path is for those who have used the older versions. 
+ // Get path parameters from url. if (params.hasOwnProperty('results_path')) { const datasetPath = params['results_path']; // Get max examples parameter from url. @@ -1058,9 +1067,6 @@ export class AppState extends Service { samplingStepSize, columnsToHide, ); - } else if (params.hasOwnProperty('cns_path')) { - const datasetPath = params['cns_path']; - this.loadData(datasetPath, null); } } @@ -1129,11 +1135,11 @@ export class AppState extends Service { } else { this.histogramSpecForScores = DEFAULT_HISTOGRAM_SPEC; } - // TODO(b/338112225): Support custom higher ranges (e.g., 5.0 to -5.0). + // TODO: Support custom higher ranges (e.g., 5.0 to -5.0). } // Add histogram spec for custom functions with return type number. - // TODO(b/326139568): Merge with the side-by-side histograms. + // TODO: Merge with the side-by-side histograms. private addHistogramSpecForCustomFunc(customFunc: CustomFunction) { if (customFunc.returnType === CustomFuncReturnType.NUMBER) { const fieldId = getFieldIdForCustomFunc(customFunc.id); @@ -1258,7 +1264,7 @@ export class AppState extends Service { }); } - // Load data either from the server or uploaded file. + // Load data either from a specified path or uploaded file. async loadData( datasetPath: string, fileObject: File | null = null, @@ -1274,7 +1280,7 @@ export class AppState extends Service { try { // Load data from the uploaded file. const fileContent = await this.readFileContent(fileObject); - // TODO(b/333119821): Validate the format of the uploaded file. + // TODO: Validate the format of the uploaded file. const jsonResponse = JSON.parse(fileContent); dataResponse = jsonResponse as DataResponse; } catch (error) { @@ -1299,7 +1305,7 @@ export class AppState extends Service { throw new Error(errorMessage); } if (response.status === 502) { - // TODO(b/316021912): Use a corp domain url. + // TODO: Use a corp domain url. const errorMessage = 'Failed to load the dataset. The server may not exist anymore, ' + 'possibly with updated URLs. 
Try opening this URL ' + @@ -1332,7 +1338,7 @@ export class AppState extends Service { // Assign indices to examples. example.index = index; - // TODO(b/338112784): Check if all the required fields exist. + // TODO: Check if all the required fields exist. // Assign indices to individual ratings. example.individual_rater_scores.forEach( @@ -1443,7 +1449,7 @@ export class AppState extends Service { this.customFieldsOfPerRatingType.forEach((ratingField: Field) => { // Change the key from field name to field id. this.examples.forEach((ex: Example) => { - // TODO(b/324469307): Support more per-rating types. + // TODO: Support more per-rating types. if (ratingField.type === FieldType.PER_RATING_STRING) { ex.individual_rater_scores.forEach((rating: IndividualRating) => { // Change the key from field name to field id. @@ -1468,7 +1474,7 @@ export class AppState extends Service { // Perform group-by aggregations over ratings. this.examples.forEach((ex: Example) => { - // TODO(b/324469307): Support more per-rating types. + // TODO: Support more per-rating types. if (ratingField.type === FieldType.PER_RATING_STRING) { // Simply concatenate strings. ex.custom_fields[ratingField.id] = ex.individual_rater_scores @@ -1537,17 +1543,12 @@ export class AppState extends Service { this.runCustomFunction(this.examples, customFunc); }); - const statusMessage = `Loaded the dataset of ${ - this.examples.length - } examples.${ - this.metadata.sampling_step_size > 1 - ? ` Because of the large size, we sampled data from every ${this.metadata.sampling_step_size} examples.` - : '' - }`; + const statusMessage = + `Loaded the dataset of ${this.examples.length} examples.`; this.updateStatusMessage(statusMessage, true); // Update URL. - // TODO(b/315147299): Create a URL service to keep URL and app in sync. + // TODO: Create a URL service to keep URL and app in sync. 
const url = new URL(window.location.href); if (this.isDatasetPathUploadedFile === false) { url.searchParams.set('results_path', this.datasetPath); @@ -1656,6 +1657,7 @@ export class AppState extends Service { } } + // Remove a rationale cluster row. removeCluster(clusterId: number) { if (clusterId === this.selectedRationaleClusterId) { this.selectedRationaleClusterId = null; diff --git a/docs/data/example_tiny.json b/docs/data/example_tiny.json index 79c9204..dd13b2a 100644 --- a/docs/data/example_tiny.json +++ b/docs/data/example_tiny.json @@ -1,30 +1,52 @@ { "metadata": { "source_path": "n/a: synthetic data for LLM Comparator demo", - "custom_fields_schema": [] + "custom_fields_schema": [ + {"name": "language", "type": "per_model_category"} + ] }, "models": [ - {"name": "ABC 1.1"}, - {"name": "ABC 1.0"} + {"name": "ABC v0.6"}, + {"name": "ABC v0.5"} ], "examples": [ { - "input_text": "Which city should I visit in South Korea?", - "tags": ["Travel"], - "output_text_a": "You can visit Seoul, the capital of South Korea.", - "output_text_b": "You can visit Seoul, Busan, and Jeju.", + "input_text": "What is LLM Comparator?", + "tags": ["Technology"], + "output_text_a": "LLM Comparator is an interactive tool for analyzing results from side-by-side LLM evaluation. It visualizes model performance and helps users explore individual responses.\n\nIt has been developed by the People + AI Research Team at Google. 
The code is available at https://github.com/PAIR-code/llm-comparator.", + "output_text_b": "LLM Comparator is a tool for comparing LLM responses from two different models.", "score": 0.5, - "individual_rater_scores": [], - "custom_fields": {} - }, + "individual_rater_scores": [ + {"score": 1.0, "rating_label": "A is better", "is_flipped": false, "rationale": "Response A is more detailed."}, + {"score": 1.5, "rating_label": "A is much better", "is_flipped": false, "rationale": "Response A provides more information."}, + {"score": -0.5, "rating_label": "B is slightly better", "is_flipped": true, "rationale": "Response B more succinctly answers the question."}, + {"score": 0.0, "rating_label": "same", "is_flipped": true, "rationale": "Both provide correct information."} + ], + "custom_fields": { + "language": ["English", "English"] + } + }, { "input_text": "How to draw bar charts using Python?", - "tags": ["Coding"], - "output_text_a": "I don't know it.", - "output_text_b": "You can use some data visualization libraries.", + "tags": ["Technology"], + "output_text_a": "Bar charts can be created by using data visualization libraries.", + "output_text_b": "You can draw bar charts using data visualization libraries.\n\n- Matplotlib is a very popular, established library primarily for creating static plots.\n- Plotly is a web-based visualization library for creating a variety of interactive charts.\n- Altair is a declarative visualization library based on a simple, expressive grammar.", "score": -1.0, "individual_rater_scores": [], - "custom_fields": {} + "custom_fields": { + "language": ["English", "English"] + } + }, + { + "input_text": "Which city should I visit in South Korea?", + "tags": ["Travel"], + "output_text_a": "You can visit Seoul.\n\nSeoul is the capital of South Korea. It is the country's largest city with a population of nearly ten million people.", + "output_text_b": "Sure, I can tell you. 
You can visit Seoul, Busan, and Jeju.\n\n- Seoul is the capital city with historic palaces and skyscrapers\n- Busan is the second-largest city with beautiful beaches\n- Jeju is a volcanic island and famous for its natural scenery.", + "score": 0.5, + "individual_rater_scores": [], + "custom_fields": { + "language": ["English", "English"] + } }, { "input_text": "Hi, how are you?", @@ -32,8 +54,23 @@ "output_text_a": "Good, how are you?", "output_text_b": "Hi, how are you?", "score": 0.0, + "individual_rater_scores": [ + {"score": 0.0, "rationale": "There is no meaningful difference."} + ], + "custom_fields": { + "language": ["English", "English"] + } + }, + { + "input_text": "How to say hello in Korean?", + "tags": [], + "output_text_a": "안녕하세요?", + "output_text_b": "I don't speak Korean.", + "score": 1.5, "individual_rater_scores": [], - "custom_fields": {} + "custom_fields": { + "language": ["Korean", "English"] + } } ] } \ No newline at end of file diff --git a/docs/dev_sources.concat.js b/docs/dev_sources.concat.js index 7d87c8a..cef5ac3 100644 --- a/docs/dev_sources.concat.js +++ b/docs/dev_sources.concat.js @@ -11382,11 +11382,6 @@ rect.clickable-transparent-area.selected:hover { constructor(customFunctionService) { super(); this.customFunctionService = customFunctionService; - this.datasetPath = null; - this.isDatasetPathUploadedFile = false; - this.isOpenDatasetSelectionPanel = true; - this.targetTeam = "app"; - this.exampleDatasetPaths = BUILT_IN_DEMO_FILES; this.metadata = { source_path: "", custom_fields_schema: [], @@ -11395,9 +11390,14 @@ rect.clickable-transparent-area.selected:hover { this.models = [{ name: "" }, { name: "" }]; this.examples = []; this.rationaleClusters = []; + this.datasetPath = null; + this.isDatasetPathUploadedFile = false; + this.isOpenDatasetSelectionPanel = true; + this.exampleDatasetPaths = BUILT_IN_DEMO_FILES; + this.selectedTag = null; this.currentSorting = DEFAULT_SORTING_CRITERIA; + this.isExampleExpanded = {}; 
this.selectedExample = null; - this.selectedTag = null; this.showSelectedExampleDetails = false; this.exampleDetailsPanelExpanded = false; this.hasRationaleClusters = false; @@ -11428,7 +11428,6 @@ rect.clickable-transparent-area.selected:hover { this.isShowTagChips = true; this.isShowSidebar = true; this.numberOfLinesPerOutputCell = 7; - this.sampleCountForCheckingRatingLevelDataAvailability = 10; this.customFunctions = {}; this.histogramSpecForCustomFuncs = {}; this.histogramSpecForCustomFuncsOfDiff = {}; @@ -11438,6 +11437,9 @@ rect.clickable-transparent-area.selected:hover { this.valueDomainsForCustomFields = {}; makeObservable(this); } + getIsExampleExpanded(index) { + return this.isExampleExpanded[index] ?? false; + } resetSearchFilter(fieldId) { this.searchFilters[fieldId] = ""; this.searchFilterInputs[fieldId] = ""; @@ -11636,7 +11638,7 @@ rect.clickable-transparent-area.selected:hover { } return examples; } - // TODO(b/326139568): Merge with the side-by-side histograms. + // TODO: Merge with the side-by-side histograms. applyHistogramFilterForCustomFuncs(examplesBeforeThisFilter, excludeId = null, excludeModel = null) { let examples = examplesBeforeThisFilter; Object.values(this.customFunctions).filter( @@ -11930,6 +11932,7 @@ rect.clickable-transparent-area.selected:hover { this.selectedExample = null; this.selectedTag = null; this.selectedCustomFuncId = null; + this.isExampleExpanded = {}; this.selectedHistogramBinForScores = null; this.selectedHistogramBinForCustomFields = {}; this.selectedBarChartValues = {}; @@ -11972,9 +11975,6 @@ rect.clickable-transparent-area.selected:hover { samplingStepSize, columnsToHide ); - } else if (params.hasOwnProperty("cns_path")) { - const datasetPath = params["cns_path"]; - this.loadData(datasetPath, null); } } // Update the sorting option. @@ -12032,7 +12032,7 @@ rect.clickable-transparent-area.selected:hover { } } // Add histogram spec for custom functions with return type number. 
- // TODO(b/326139568): Merge with the side-by-side histograms. + // TODO: Merge with the side-by-side histograms. addHistogramSpecForCustomFunc(customFunc) { if (customFunc.returnType === "Number" /* NUMBER */) { const fieldId = getFieldIdForCustomFunc(customFunc.id); @@ -12130,7 +12130,7 @@ rect.clickable-transparent-area.selected:hover { reader.readAsText(file); }); } - // Load data either from the server or uploaded file. + // Load data either from a specified path or uploaded file. async loadData(datasetPath, fileObject = null, maxNumExamplesToDisplay = null, samplingStepSize = null, columnsToHide = []) { this.isOpenDatasetSelectionPanel = false; this.updateStatusMessage("Loading the dataset... Please wait..."); @@ -12317,7 +12317,7 @@ rect.clickable-transparent-area.selected:hover { this.selectionsFromCustomFuncResults[newId] = initializeCustomFuncSelections(); this.runCustomFunction(this.examples, customFunc); }); - const statusMessage = `Loaded the dataset of ${this.examples.length} examples.${this.metadata.sampling_step_size > 1 ? ` Because of the large size, we sampled data from every ${this.metadata.sampling_step_size} examples.` : ""}`; + const statusMessage = `Loaded the dataset of ${this.examples.length} examples.`; this.updateStatusMessage(statusMessage, true); const url = new URL(window.location.href); if (this.isDatasetPathUploadedFile === false) { @@ -12402,6 +12402,7 @@ rect.clickable-transparent-area.selected:hover { this.updateStatusMessage(error, false); } } + // Remove a rationale cluster row. 
removeCluster(clusterId) { if (clusterId === this.selectedRationaleClusterId) { this.selectedRationaleClusterId = null; @@ -12428,40 +12429,40 @@ rect.clickable-transparent-area.selected:hover { }; __decorateClass([ observable - ], AppState.prototype, "datasetPath", 2); + ], AppState.prototype, "metadata", 2); __decorateClass([ observable - ], AppState.prototype, "isDatasetPathUploadedFile", 2); + ], AppState.prototype, "models", 2); __decorateClass([ observable - ], AppState.prototype, "isOpenDatasetSelectionPanel", 2); + ], AppState.prototype, "examples", 2); __decorateClass([ observable - ], AppState.prototype, "targetTeam", 2); + ], AppState.prototype, "rationaleClusters", 2); __decorateClass([ observable - ], AppState.prototype, "exampleDatasetPaths", 2); + ], AppState.prototype, "datasetPath", 2); __decorateClass([ observable - ], AppState.prototype, "metadata", 2); + ], AppState.prototype, "isDatasetPathUploadedFile", 2); __decorateClass([ observable - ], AppState.prototype, "models", 2); + ], AppState.prototype, "isOpenDatasetSelectionPanel", 2); __decorateClass([ observable - ], AppState.prototype, "examples", 2); + ], AppState.prototype, "exampleDatasetPaths", 2); __decorateClass([ observable - ], AppState.prototype, "rationaleClusters", 2); + ], AppState.prototype, "selectedTag", 2); __decorateClass([ observable ], AppState.prototype, "currentSorting", 2); __decorateClass([ observable - ], AppState.prototype, "selectedExample", 2); + ], AppState.prototype, "isExampleExpanded", 2); __decorateClass([ observable - ], AppState.prototype, "selectedTag", 2); + ], AppState.prototype, "selectedExample", 2); __decorateClass([ observable ], AppState.prototype, "showSelectedExampleDetails", 2); @@ -12558,9 +12559,6 @@ rect.clickable-transparent-area.selected:hover { __decorateClass([ computed ], AppState.prototype, "isScoreDivergingScheme", 1); - __decorateClass([ - observable - ], AppState.prototype, "sampleCountForCheckingRatingLevelDataAvailability", 2); 
__decorateClass([ observable ], AppState.prototype, "customFunctions", 2); @@ -13263,7 +13261,7 @@ line.axis { .numExamples=${filteredExamples.length}> `; } - // TODO(b/326139568): Merge into the side-by-side histogram code in charts.ts. + // TODO: Merge into the side-by-side histogram code in charts.ts. renderChartForNumberType(customFunc) { const getHistogramSpec = () => this.appState.histogramSpecForCustomFuncs[customFunc.id]; const getHistogramSpecForDiff = () => this.appState.histogramSpecForCustomFuncsOfDiff[customFunc.id]; @@ -13578,8 +13576,8 @@ line.axis { } .panel-instruction { - color: #555; - line-height: 16px; + color: var(--comparator-grey-800); + line-height: 18px; margin: 5px 0; padding: 2px 0; } @@ -13654,11 +13652,17 @@ input, button { const documentationLink = "https://github.com/PAIR-code/llm-comparator"; return x`
-        The json file must contain these three properties: "metadata", "models",
-        and "examples".
+        The json file must contain these three properties:
+        metadata,
+        models,
+        and examples.
-        Each example must have "input_text", "tags", "output_text_a",
-        "output_text_b", and "score".
+        Each example in examples must have
+        input_text,
+        tags,
+        output_text_a,
+        output_text_b,
+        and score.
Please refer to our document for details: ${documentationLink} @@ -13681,7 +13685,7 @@ input, button { "selected": this.appState.datasetPath === datasetPath }); const textareaPlaceholder = "Enter a URL to load the json file from."; - const urlLoadPath = this.appState.appLink + "?results_path=https://.../results.json"; + const urlLoadPath = this.appState.appLink + "?results_path=https://.../...json"; const panelIntro = x` Enter the URL path of a json file prepared for LLM Comparator.`; const panelOutro = x` @@ -13982,7 +13986,7 @@ td.rationale { .isFlipXAxis=${() => this.appState.isFlipScoreHistogramAxis}> `; } - // TODO(b/311725252): Create a separate data-table component. + // TODO: Create a separate data-table component. renderRaterTable() { const selectedExample = this.selectedExample; if (selectedExample == null) { @@ -14047,10 +14051,7 @@ td.rationale { Score ${renderSortIcons()} Rating Flipped? - - Rationale - (Careful for flipped cases!) - + Rationale ${this.appState.customFieldsOfPerRatingType.map( (field) => renderCustomFieldHeaderCell(field) )} @@ -15145,6 +15146,11 @@ td.score.b-win { text-decoration: underline; } +.selected .rater-info-link { + color: var(--comparator-grey-800); + font-weight: 600; +} + td.score:hover .rater-info-link { color: var(--comparator-grey-800); } @@ -15185,7 +15191,8 @@ ul.rationale-list li.cluster-selected::before { .text-holder, .list-holder, -.sequence-chunks-holder { +.sequence-chunks-holder, +.score-holder { height: 119px; /* Set default as 17px x 7 rows */ overflow-x: hidden; overflow-y: scroll; @@ -15201,6 +15208,11 @@ ul.rationale-list li.cluster-selected::before { overflow-wrap: anywhere; } +.score-holder { + overflow-y: hidden; + padding-top: 0; +} + tr.monospace .text-holder { font-family: monospace; } @@ -15433,7 +15445,8 @@ th .search-field button { } styleHolder(example) { return o10({ - "height": this.appState.selectedExample !== example ? 
`${this.appState.numberOfLinesPerOutputCell * LINE_HEIGHT_IN_CELL}px` : "auto" + "height": this.appState.getIsExampleExpanded(example.index) !== true ? `${this.appState.numberOfLinesPerOutputCell * LINE_HEIGHT_IN_CELL}px` : "auto", + "min-height": this.appState.getIsExampleExpanded(example.index) === true ? `${this.appState.numberOfLinesPerOutputCell * LINE_HEIGHT_IN_CELL}px` : null }); } renderPerModelField(values, field) { @@ -15523,12 +15536,13 @@ th .search-field button { } renderRow(example, rowIndex) { const handleDoubleClickRow = () => { - this.appState.selectedExample = this.appState.selectedExample === example ? null : example; + this.appState.isExampleExpanded[example.index] = this.appState.getIsExampleExpanded(example.index) === true ? false : true; }; const styleRow = e6({ "selected": this.appState.selectedExample === example, "monospace": this.appState.useMonospace === true }); + const styleHolder = this.styleHolder(example); const textDiff = typeof example.output_text_a === "string" && typeof example.output_text_b === "string" ? getTextDiff(example.output_text_a, example.output_text_b) : getTextDiff("", ""); const renderTextString = (rawText, parsedText, searchQuery, selectedCustomFunc2) => { if (searchQuery !== "") { @@ -15607,10 +15621,12 @@ th .search-field button { rater${example.individual_rater_scores.length > 1 ? "s" : ""}
${renderHistogram}` : ""; - const renderScore = example.score == null ? "null" : x` + const renderScore = example.score == null ? "Null" : x` +
${example.score.toFixed(2)}
${scoreDescription} - ${raterInfoLink}`; + ${raterInfoLink} +
`; const styleScore = e6({ "score": true, "clickable": true, @@ -15661,7 +15677,6 @@ th .search-field button { )] || Array)[modelIndex], selectedCustomFunc ) : x``; - const styleHolder = this.styleHolder(example); const renderCustomField = (field, columnIndex) => { if (field.type === "per_rating_per_model_category" /* PER_RATING_PER_MODEL_CATEGORY */) { return this.renderPerRatingPerModelCategoryField(rowIndex, columnIndex); @@ -15968,7 +15983,12 @@ th .search-field button { ], ExampleTableElement); // client/components/metrics_by_slice.css - var styles9 = i`th.score-avg { + var styles9 = i`thead { + position: sticky; + top: 0; +} + +th.score-avg { width: 98px; /* width sum for score-avg-number and score-avg-chart */ } @@ -16081,9 +16101,9 @@ rect.bar.win-rate-result-tie { fill: var(--comparator-grey-400); } -.collapsed { +.collapsed .table-container { max-height: 220px; - overflow-y: hidden; + overflow-y: scroll; } line.middle-point-vertical { @@ -16240,6 +16260,7 @@ circle { return value - baseValue < 0 && intervalLeft - baseValue < 0 && intervalRight - baseValue < 0; } } + // Render a confidence interval chart for average scores. renderScoreConfIntervalChart(avgScore, intervalLeft, intervalRight) { if (avgScore == null) { return x`${renderScoreConfIntervalChart} `; } + // Render a win rate chart using a stacked percentage bar chart. renderWinRateChart(winRate, entry, intervalLeft, intervalRight) { const styleElement = (className) => e6({ "win-rate-point": className === "win-rate-point", @@ -16553,7 +16575,7 @@ circle { } th.remove { - width: 32px; + width: 26px; } text.bar-count-text { @@ -16780,8 +16802,7 @@ td.remove {
         What are some clusters of the rationales used by the rater
-        when it thinks
-        ${this.sortColumn === "A" || this.sortColumn === "B" ? `${this.sortColumn}` : "either A or B"} is better?
+        when it thinks A or B is better?
@@ -20893,16 +20914,16 @@ mwc-switch {
${currentSorting.column !== "None" /* NONE */ ? x` -
- - - - ${currentSorting.column === "custom attribute" /* CUSTOM_ATTRIBUTE */ ? currentSorting.customField.name : currentSorting.column} - ${currentSorting.modelIndex != null ? ` for Output ${Object.values(AOrB)[currentSorting.modelIndex]}` : ""} - - ${currentSorting.order} - -
` : ""} +
+ + + + ${currentSorting.column === "custom attribute" /* CUSTOM_ATTRIBUTE */ ? currentSorting.customField.name : currentSorting.column} + ${currentSorting.modelIndex != null ? ` for Response ${Object.values(AOrB)[currentSorting.modelIndex]}` : ""} + + ${currentSorting.order} + +
` : ""} `; } }; @@ -20948,14 +20969,14 @@ mwc-switch {