New text2sql metrics #1584

Closed
wants to merge 21 commits into from
oktie (Member) commented on Feb 7, 2025

This PR adds metrics and outputs to the text2sql execution accuracy metric. Previously the metric produced a single number: 1 if the dataframes produced by the predicted and gold SQL queries are the same, 0 otherwise. It now reports 12 scores/outputs:

  1. execution_result: whether the predicted and ground-truth dataframes match (same as before)
  2. non_empty_execution_result: whether the dataframes are non-empty and match
  3. subset_non_empty_execution_result: whether the dataframes are non-empty and the ground-truth dataframe is a subset of the predicted dataframe
  4. non_empty_gold_df: whether the ground-truth dataframe is non-empty
  5. gold_sql_runtime: runtime of the ground-truth query
  6. predicted_sql_runtime: runtime of the predicted query
  7. pred_to_gold_runtime_ratio: ratio of the predicted query's runtime to the ground-truth query's runtime
  8. gold_error: whether the ground-truth query raised an error
  9. predicted_error: whether the predicted query raised an error
  10. the ground-truth dataframe
  11. the predicted query's dataframe
  12. the error message (if any)
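
To make the definitions in the list concrete, here is a minimal sketch of how the per-instance outputs could be computed. It is not the code in this PR: it uses the standard-library `sqlite3` module, with lists of row tuples standing in for dataframes, and the function names `run_query` and `text2sql_scores` are hypothetical. The key names `gold_df`, `predicted_df`, and `error_message` are placeholders for outputs 10 to 12, which the list does not name. Comparing rows while ignoring order, and reading "subset" as row-level containment, are assumptions of the sketch.

```python
import sqlite3
import time
from collections import Counter


def run_query(conn, sql):
    """Execute sql; return (rows, runtime_in_seconds, error_message_or_None)."""
    start = time.perf_counter()
    try:
        rows = conn.execute(sql).fetchall()
        error = None
    except sqlite3.Error as exc:
        rows, error = [], str(exc)
    return rows, time.perf_counter() - start, error


def text2sql_scores(conn, gold_sql, predicted_sql):
    """Compute the 12 outputs for one (gold, predicted) query pair."""
    gold_rows, gold_time, gold_err = run_query(conn, gold_sql)
    pred_rows, pred_time, pred_err = run_query(conn, predicted_sql)

    no_errors = gold_err is None and pred_err is None
    # Assumption: rows are compared as multisets, ignoring row order.
    match = no_errors and Counter(gold_rows) == Counter(pred_rows)
    both_non_empty = no_errors and bool(gold_rows) and bool(pred_rows)
    # Assumption: "subset" means every gold row also appears in the prediction.
    subset = both_non_empty and set(gold_rows) <= set(pred_rows)

    return {
        "execution_result": int(match),
        "non_empty_execution_result": int(match and both_non_empty),
        "subset_non_empty_execution_result": int(subset),
        "non_empty_gold_df": int(gold_err is None and bool(gold_rows)),
        "gold_sql_runtime": gold_time,
        "predicted_sql_runtime": pred_time,
        # Guard against a zero denominator on very fast queries.
        "pred_to_gold_runtime_ratio": pred_time / gold_time if gold_time > 0 else None,
        "gold_error": int(gold_err is not None),
        "predicted_error": int(pred_err is not None),
        "gold_df": gold_rows,  # hypothetical key name
        "predicted_df": pred_rows,  # hypothetical key name
        "error_message": pred_err or gold_err or "",  # hypothetical key name
    }


# Example: the prediction returns a superset of the gold rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
scores = text2sql_scores(conn, "SELECT id FROM t WHERE id = 1", "SELECT id FROM t")
```

In this example the two results differ, so `execution_result` is 0, while `subset_non_empty_execution_result` is 1 because the single gold row also appears in the predicted result.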

What we used to get (output of examples/evaluate_text2sql.py):

num_of_instances (int):
    10
anls (float):
    0.12179476763080924
score (float):
    0.2
score_name (str):
    execution_accuracy
execution_accuracy (float):
    0.2
execution_accuracy_ci_low (float64):
    0.0
execution_accuracy_ci_high (float64):
    0.6
score_ci_low (float64):
    0.0
score_ci_high (float64):
    0.6

What we get with the new additions:

num_of_instances (int):
    10
anls (float):
    0.12179476763080924
score (float):
    0.0
score_name (str):
    non_empty_execution_accuracy
non_empty_execution_accuracy (float):
    0.0
subset_non_empty_execution_result (float):
    0.0
pred_to_gold_runtime_ratio (float):
    0.9950211077562516
predicted_error (float):
    0.1
predicted_sql_runtime (float):
    0.8206439448520542
gold_error (float):
    0.0
non_empty_gold_df (float):
    0.0
gold_sql_runtime (float):
    0.8285779342986643
execution_accuracy (float):
    0.2
predicted_sql_runtime_ci_low (float64):
    0.7456668988301784
predicted_sql_runtime_ci_high (float64):
    1.0325724025184853
gold_sql_runtime_ci_low (float64):
    0.7711773584951769
gold_sql_runtime_ci_high (float64):
    0.9317167796579513
execution_accuracy_ci_low (float64):
    0.0
execution_accuracy_ci_high (float64):
    0.6
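
The global numbers above look like per-instance values averaged over the 10 instances (for example, an execution_accuracy of 0.2 would correspond to 2 matches out of 10), with `_ci_low` and `_ci_high` as confidence bounds around that mean. The sketch below illustrates this kind of aggregation with a simple percentile bootstrap. The bootstrap method, the function name `mean_with_bootstrap_ci`, and the per-instance values are all assumptions for illustration, so it is not expected to reproduce the exact bounds shown here.

```python
import random
import statistics


def mean_with_bootstrap_ci(values, n_resamples=1000, confidence=0.95, seed=0):
    """Return (mean, ci_low, ci_high) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * (n_resamples - 1))]
    high = means[int((1.0 - alpha) * (n_resamples - 1))]
    return statistics.fmean(values), low, high


# Hypothetical per-instance execution results for 10 instances (2 matches).
per_instance = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
mean, ci_low, ci_high = mean_with_bootstrap_ci(per_instance)
```

With only 10 instances the interval is wide, which is consistent with the broad bounds reported for execution_accuracy.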

oktie requested a review from perlitz on February 7, 2025, 22:14
tsinggggg and others added 19 commits February 11, 2025 15:10
fix: wml inference with space id only

Signed-off-by: Cheng Qian <cheng.qian@ibm.com>
Co-authored-by: Cheng Qian <cheng.qian@ibm.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
…guration

Signed-off-by: elronbandel <elronbandel@gmail.com>
…tries

Signed-off-by: elronbandel <elronbandel@gmail.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
* Fix failing tests

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Fix failing tests

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Fix failing tests

Signed-off-by: elronbandel <elronbandel@gmail.com>

---------

Signed-off-by: elronbandel <elronbandel@gmail.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
add streaming, add tests
* try lazy loadHF first

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* reduce benchmark profiling to generating the dataset only. Not inferring (that is done by mocking anyhow) and not evaluating (of the mocked results). add trust_remote also to load_dataset_builder

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* try procrastination for load csv too

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added split cache for the generators, and log limit once per data and increase loader cache

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* make sklearn loader too - a lazy loader

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* adjust to new readers for csv

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* Enhance LoadHF class to support optional splits and improve dataset loading logic

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Refactor LoadHF class to improve dataset loading and implement limit on yielded instances

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Refactor LoadHF class to streamline dataset loading and enhance split handling

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Remove unused import and update line number in secrets baseline

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Refactor load_data method to simplify error handling and remove unnecessary cache checks

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Merge origin/main

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Refactor loaders to implement LazyLoader class and update load_iterables method for improved streaming support

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Update exception handling in test_failed_load_csv to catch general exceptions

Signed-off-by: elronbandel <elronbandel@gmail.com>

* Refactor LoadHF class to streamline data loading and enhance error handling

Signed-off-by: elronbandel <elronbandel@gmail.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: elronbandel <elronbandel@gmail.com>
* Add support for all Granite Guardian risks

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Remove Granite Guardian from LLM as Judge evaluators

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Rename wrong metric name (3.0 version -> 3)

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Add support for custom risks

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Adapt catalog

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Add more examples

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Apply linter

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Add generation params

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Use inference engine instead of internal model

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Add _mock_infer_log_probs to infer_log_prob

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Apply linter

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Bring back breaking catalog names changes

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Add wrongly deleted artifacts

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Only create watsonx inference engine if it is None

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Modularize getting the prompt

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Change default names to what Granite Guardian expects by default

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Adapt rag granite guardian prepare file and catalog

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Adapt metric so it works for all inference engines

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Bring back certainty and improve score naming

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* fixes and format

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Adapt rag catalog

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Adapt WMLInferenceEngineBase credential check: apikey -> api_key

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Use credentials object and pass project and space

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Adapt WML log prob default params

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Adapt granite guardian catalog and fix example

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Apply linter

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Implement inheritance for each risk type

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Apply linter

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

* Uncomment log prob params check

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>

---------

Signed-off-by: Martín Santillán Cooper <msantillancooper@ibm.com>
oktie (Member, Author) commented on Feb 12, 2025

To resolve some merge conflicts, I'm closing this and will do a new PR.

oktie closed this on Feb 12, 2025
oktie deleted the new-text2sql-metrics branch on February 12, 2025, 13:51
6 participants