New text2sql metrics #1584

oktie · 2025-02-07T22:14:22Z

The goal is to add more metrics and results to the output of the text2sql execution accuracy metric. We used to produce just one number: 1 if the dataframes produced by the pred and gold SQL queries are the same, 0 otherwise. I've extended the metric to report 12 scores/outputs:

execution_result: if df responses match (same as before)
non_empty_execution_result: if dfs are non-empty and match
subset_non_empty_execution_result: if non-empty dfs and gt df subset of predicted df
non_empty_gold_df: if gt df is non-empty
gold_sql_runtime: ground truth query runtime
predicted_sql_runtime: predicted query runtime
pred_to_gold_runtime_ratio: ratio of predicted query runtime to gt query runtime
gold_error: if gt has an error
predicted_error: if predicted query has an error
ground truth dataframe
predicted query's dataframe
error message (if any)
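
As a rough illustration of how the scores listed above relate to each other, here is a minimal sketch that is not the code from this PR. It uses the standard-library `sqlite3` module and plain row lists instead of dataframes. The helper names `run_query` and `execution_scores`, the keys `gold_df`, `predicted_df` and `error_message`, and the order-insensitive row comparison are all assumptions made for the example.

```python
import sqlite3
import time
from collections import Counter


def run_query(conn, sql):
    """Run one query and return (rows, runtime_seconds, error_message)."""
    start = time.perf_counter()
    try:
        rows = conn.execute(sql).fetchall()
        error = ""
    except sqlite3.Error as exc:
        rows, error = [], str(exc)
    return rows, time.perf_counter() - start, error


def execution_scores(conn, predicted_sql, gold_sql):
    """Hypothetical per-instance scores mirroring the list above."""
    gold_rows, gold_time, gold_err = run_query(conn, gold_sql)
    pred_rows, pred_time, pred_err = run_query(conn, predicted_sql)

    no_errors = not gold_err and not pred_err
    # Assumption: results match when they hold the same rows, in any order.
    match = no_errors and Counter(gold_rows) == Counter(pred_rows)
    non_empty = bool(gold_rows) and bool(pred_rows)
    # Assumption: "subset" means every gold row also appears in the prediction.
    subset = no_errors and non_empty and set(gold_rows) <= set(pred_rows)

    return {
        "execution_result": int(match),
        "non_empty_execution_result": int(match and non_empty),
        "subset_non_empty_execution_result": int(subset),
        "non_empty_gold_df": int(bool(gold_rows)),
        "gold_sql_runtime": gold_time,
        "predicted_sql_runtime": pred_time,
        "pred_to_gold_runtime_ratio": pred_time / gold_time if gold_time > 0 else 0.0,
        "gold_error": int(bool(gold_err)),
        "predicted_error": int(bool(pred_err)),
        "gold_df": gold_rows,
        "predicted_df": pred_rows,
        "error_message": pred_err or gold_err,
    }


# Tiny in-memory database for demonstration purposes only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

# The prediction returns a superset of the gold rows: no exact match, but a subset hit.
scores = execution_scores(
    conn,
    predicted_sql="SELECT id, name FROM t",
    gold_sql="SELECT id, name FROM t WHERE id = 1",
)
failed = execution_scores(conn, "SELECT nope FROM t", "SELECT id FROM t")
```

In this sketch a failing query yields an empty result, so it can never count as a non-empty match; whether the actual metric treats errors the same way is not stated here.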

What we used to get (output of examples/evaluate_text2sql.py):

num_of_instances (int):
    10
anls (float):
    0.12179476763080924
score (float):
    0.2
score_name (str):
    execution_accuracy
execution_accuracy (float):
    0.2
execution_accuracy_ci_low (float64):
    0.0
execution_accuracy_ci_high (float64):
    0.6
score_ci_low (float64):
    0.0
score_ci_high (float64):
    0.6

What we get with the new additions:

num_of_instances (int):
    10
anls (float):
    0.12179476763080924
score (float):
    0.0
score_name (str):
    non_empty_execution_accuracy
non_empty_execution_accuracy (float):
    0.0
subset_non_empty_execution_result (float):
    0.0
pred_to_gold_runtime_ratio (float):
    0.9950211077562516
predicted_error (float):
    0.1
predicted_sql_runtime (float):
    0.8206439448520542
gold_error (float):
    0.0
non_empty_gold_df (float):
    0.0
gold_sql_runtime (float):
    0.8285779342986643
execution_accuracy (float):
    0.2
predicted_sql_runtime_ci_low (float64):
    0.7456668988301784
predicted_sql_runtime_ci_high (float64):
    1.0325724025184853
gold_sql_runtime_ci_low (float64):
    0.7711773584951769
gold_sql_runtime_ci_high (float64):
    0.9317167796579513
execution_accuracy_ci_low (float64):
    0.0
execution_accuracy_ci_high (float64):
    0.6
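
A note on reading the runtime numbers above: dividing the two reported mean runtimes gives 0.8206 / 0.8286 ≈ 0.990, not the reported `pred_to_gold_runtime_ratio` of 0.995. That would be consistent with the ratio being computed per instance and then averaged, although nothing here states this, so treat it as an assumption. The sketch below uses made-up runtimes (not taken from this PR) to show that the two aggregations generally differ.

```python
from statistics import mean

# Hypothetical per-instance runtimes in seconds (illustrative values only).
predicted = [0.5, 2.0, 2.0]
gold = [1.0, 2.0, 1.0]

# Aggregation A: compute the ratio per instance, then average the ratios.
mean_of_ratios = mean(p / g for p, g in zip(predicted, gold))

# Aggregation B: average each runtime first, then divide the means.
ratio_of_means = mean(predicted) / mean(gold)

print(f"mean of ratios: {mean_of_ratios:.4f}")
print(f"ratio of means: {ratio_of_means:.4f}")
```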

oktie · 2025-02-12T13:51:17Z

To resolve some merge conflicts, I'm closing this PR and will open a new one.

oktie added 2 commits February 7, 2025 09:03

adding select prefix to get_sql processor

6d53232

new text2sql execution scores and results in metrics

4390579

oktie requested a review from perlitz February 7, 2025 22:14

tsinggggg and others added 19 commits February 11, 2025 15:10

fix: minor bug when only space id is provided for WML inference (#1583)

0d92053

fix: wml inference with space id only Signed-off-by: Cheng Qian <cheng.qian@ibm.com> Co-authored-by: Cheng Qian <cheng.qian@ibm.com>

Try fixing csv loader

5315b32

Signed-off-by: elronbandel <elronbandel@gmail.com>

fix: improve CSV loading error handling in LoadCSV class

27dc9d4

Signed-off-by: elronbandel <elronbandel@gmail.com>

Improve the load/prepare time for rag tasks

542f5ba

Signed-off-by: elronbandel <elronbandel@gmail.com>

refactor: streamline metric testing and enhance dataset loading confi…

5eb25ea

…guration Signed-off-by: elronbandel <elronbandel@gmail.com>

fix: enhance CSV loading with retry mechanism and configurable max re…

c5a57a3

…tries Signed-off-by: elronbandel <elronbandel@gmail.com>

test: add unit test for LoadCSV error handling on file not found

fd7c962

Signed-off-by: elronbandel <elronbandel@gmail.com>

fix: update metrics parameter to accept a list in test_card function

887b730

Signed-off-by: elronbandel <elronbandel@gmail.com>

Fix failing tests (#1589)

6a93031

Signed-off-by: elronbandel <elronbandel@gmail.com>

Fix metrics formatting and style (#1591)

8be440f

Signed-off-by: elronbandel <elronbandel@gmail.com>

Fix bird dataset (#1593)

8741819

add streaming, add tests

Fix loading without limit (#1594)

1b1ea85

Signed-off-by: elronbandel <elronbandel@gmail.com>

Text2SQL metric bug fix

3fd30ce

text2sql metric bug fix

c62da22

text2sql metric test update

4be8f29

text2sql metrics bug fix

09e7ebb

oktie closed this Feb 12, 2025

oktie deleted the new-text2sql-metrics branch February 12, 2025 13:51
