The provided code snippet illustrates how to evaluate a quantized model on different benchmark datasets using Python. Below is a detailed explanation of each function interface and the parameters it accepts:
- Perplexity evaluation (usage sketch below). Parameters:
  - `model` (object): The quantized model loaded using `AutoModelForCausalLM`.
  - `tokenizer` (object): The tokenizer loaded using `AutoTokenizer`.
  - `test_dataset` (list of str): List of dataset identifiers on which to evaluate the model's perplexity. Options include `['wikitext2', 'c4', 'ptb']`.
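As a usage sketch, the snippet below loads a quantized checkpoint and scores its perplexity. The import path for `Benchmark` and the method name `eval_ppl` are assumptions made for illustration; only the parameter names come from the interface description above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder import: adjust to wherever the Benchmark class lives in your project.
from benchmark import Benchmark

# The path is a placeholder for your quantized checkpoint.
model = AutoModelForCausalLM.from_pretrained("path/to/quantized-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/quantized-model")

# Hypothetical method name (eval_ppl), mirroring the documented
# Benchmark.eval_lmeval; the parameters match the description above.
ppl = Benchmark.eval_ppl(
    model=model,
    tokenizer=tokenizer,
    test_dataset=["wikitext2", "c4", "ptb"],  # any subset of the supported identifiers
)
```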
- C-Eval evaluation (`eval_ceval`; usage sketch below). Parameters:
  - `model` (object): The quantized model.
  - `tokenizer` (object): The tokenizer.
  - `model_type` (str): Type of the model to evaluate. Options are `['chatglm', 'baichuan', 'llama']`.
  - `subject` (str): Subject area for evaluation. Options are `['all', 'STEM', 'Social Science', 'Humanities', 'Other']`.
  - `num_shot` (int): Number of examples used for few-shot learning. Range is 0 to 5.
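A minimal sketch of a C-Eval run, reusing the `model` and `tokenizer` objects from the perplexity example. Attaching `eval_ceval` to the `Benchmark` class is an assumption, based on the documented `Benchmark.eval_lmeval` below.

```python
# 5-shot C-Eval run on STEM subjects for a LLaMA-family model.
ceval_acc = Benchmark.eval_ceval(
    model=model,          # quantized model from the example above
    tokenizer=tokenizer,
    model_type="llama",   # one of 'chatglm', 'baichuan', 'llama'
    subject="STEM",       # or 'all', 'Social Science', 'Humanities', 'Other'
    num_shot=5,           # any value in the range 0 to 5
)
```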
- CMMLU-style evaluation (usage sketch below). Parameters:
  - Similar to `eval_ceval`, but the `subject` options differ slightly: `['all', 'STEM', 'Social Sciences', 'Humanities', 'China Specific', 'Other']`.
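A sketch under the same assumptions as above; the method name `eval_cmmlu` is hypothetical, inferred only from this function's stated similarity to `eval_ceval`.

```python
# Zero-shot run on the subject option unique to this function.
cmmlu_acc = Benchmark.eval_cmmlu(  # hypothetical name
    model=model,
    tokenizer=tokenizer,
    model_type="chatglm",
    subject="China Specific",  # not available in eval_ceval
    num_shot=0,
)
```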
- Downstream-task evaluation across QA, sentiment analysis, NLI, and toxicity detection (usage sketch below). Parameters:
  - `model` (object): The quantized model.
  - `tokenizer` (object): The tokenizer.
  - `test_dataset` (str): Dataset to evaluate. Options are:
    - `QuestionAnswering_squad`, `QuestionAnswering_advqa`, `QuestionAnswering_newsqa`, `QuestionAnswering_searchqa`
    - `SentimentAnalysis_amazon`, `SentimentAnalysis_dynasent`, `SentimentAnalysis_semeval`, `SentimentAnalysis_sst5`
    - `NaturalLanguageInference_mnli`, `NaturalLanguageInference_anli`, `NaturalLanguageInference_wanli`, `NaturalLanguageInference_contractnli`
    - `ToxicDetection_civilcomments`, `ToxicDetection_advcivil`, `ToxicDetection_implicithate`, `ToxicDetection_toxigen`
  - `split` (str): Dataset split to evaluate, generally set to `'test'`.
  - `ICL_split` (str): In-context learning split, usually set to `'test'`.
  - `num_shot` (int): Number of examples for few-shot learning, typically 0.
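The sketch below exercises one QA dataset. The entry-point name `eval_downstream` is hypothetical; the parameter names and values follow the description above.

```python
# Evaluate zero-shot question answering on SQuAD; one dataset identifier per call.
qa_score = Benchmark.eval_downstream(  # hypothetical name
    model=model,
    tokenizer=tokenizer,
    test_dataset="QuestionAnswering_squad",
    split="test",       # dataset split to score
    ICL_split="test",   # split used to draw in-context examples
    num_shot=0,         # typically 0 for this evaluation
)
```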
- lm-eval task evaluation (`Benchmark.eval_lmeval`; usage sketch below). Parameters:
  - `model` (object): The quantized model.
  - `tokenizer` (object): The tokenizer.
  - `eval_tasks` (list of str): List of tasks for evaluation, e.g. `["winogrande", "piqa", "hellaswag"]`.
  - `num_shot` (int): Number of examples for few-shot learning; options are 0 or 5.
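A sketch of the documented `Benchmark.eval_lmeval` call, again assuming the `model` and `tokenizer` objects from the first example.

```python
# Zero-shot evaluation on three commonsense-reasoning tasks.
lmeval_results = Benchmark.eval_lmeval(
    model=model,
    tokenizer=tokenizer,
    eval_tasks=["winogrande", "piqa", "hellaswag"],
    num_shot=0,  # 0 or 5
)
```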
Each function is designed to test different aspects of the model's capabilities using a variety of datasets and tasks, enabling comprehensive performance assessments.
The `Benchmark.eval_lmeval` function can evaluate the model across a broad range of tasks. Here is a comprehensive list of the tasks that can be specified:
- 'winogrande'
- 'wsc273'
- 'race'
- 'anli_r1'
- 'anli_r2'
- 'anli_r3'
- 'pubmedqa'
- 'blimp_adjunct_island'
- 'blimp_anaphor_gender_agreement'
- 'blimp_anaphor_number_agreement'
- 'blimp_animate_subject_passive'
- 'blimp_animate_subject_trans'
- 'blimp_causative'
- 'blimp_complex_NP_island'
- 'blimp_coordinate_structure_constraint_complex_left_branch'
- 'blimp_coordinate_structure_constraint_object_extraction'
- 'blimp_determiner_noun_agreement_1'
- 'blimp_determiner_noun_agreement_2'
- 'blimp_determiner_noun_agreement_irregular_1'
- 'blimp_determiner_noun_agreement_irregular_2'
- 'blimp_determiner_noun_agreement_with_adj_2'
- 'blimp_determiner_noun_agreement_with_adj_irregular_1'
- 'blimp_determiner_noun_agreement_with_adj_irregular_2'
- 'blimp_determiner_noun_agreement_with_adjective_1'
- 'blimp_distractor_agreement_relational_noun'
- 'blimp_distractor_agreement_relative_clause'
- 'blimp_drop_argument'
- 'blimp_ellipsis_n_bar_1'
- 'blimp_ellipsis_n_bar_2'
- 'blimp_existential_there_object_raising'
- 'blimp_existential_there_quantifiers_1'
- 'blimp_existential_there_quantifiers_2'
- 'blimp_existential_there_subject_raising'
- 'blimp_expletive_it_object_raising'
- 'blimp_inchoative'
- 'blimp_intransitive'
- 'blimp_irregular_past_participle_adjectives'
- 'blimp_irregular_past_participle_verbs'
- 'blimp_irregular_plural_subject_verb_agreement_1'
- 'blimp_irregular_plural_subject_verb_agreement_2'
- 'blimp_left_branch_island_echo_question'
- 'blimp_left_branch_island_simple_question'
- 'blimp_matrix_question_npi_licensor_present'
- 'blimp_npi_present_1'
- 'blimp_npi_present_2'
- 'blimp_only_npi_licensor_present'
- 'blimp_only_npi_scope'
- 'blimp_passive_1'
- 'blimp_passive_2'
- 'blimp_principle_A_c_command'
- 'blimp_principle_A_case_1'
- 'blimp_principle_A_case_2'
- 'blimp_principle_A_domain_1'
- 'blimp_principle_A_domain_2'
- 'blimp_principle_A_domain_3'
- 'blimp_principle_A_reconstruction'
- 'blimp_regular_plural_subject_verb_agreement_1'
- 'blimp_regular_plural_subject_verb_agreement_2'
- 'blimp_sentential_negation_npi_licensor_present'
- 'blimp_sentential_negation_npi_scope'
- 'blimp_sentential_subject_island'
- 'blimp_superlative_quantifiers_1'
- 'blimp_superlative_quantifiers_2'
- 'blimp_tough_vs_raising_1'
- 'blimp_tough_vs_raising_2'
- 'blimp_transitive'
- 'blimp_wh_island'
- 'blimp_wh_questions_object_gap'
- 'blimp_wh_questions_subject_gap'
- 'blimp_wh_questions_subject_gap_long_distance'
- 'blimp_wh_vs_that_no_gap'
- 'blimp_wh_vs_that_no_gap_long_distance'
- 'blimp_wh_vs_that_with_gap'
- 'blimp_wh_vs_that_with_gap_long_distance'
- 'openbookqa'
- 'arc_easy'
- 'arc_challenge'
- 'sciq'
- 'swag'
- 'piqa'
- 'hellaswag'
- 'glue_mnli'
- 'glue_mnli_mismatched'
This extensive list spans linguistic-acceptability (BLiMP), commonsense-reasoning, reading-comprehension, and natural-language-inference tasks, covering different domains and complexities so the model can be assessed comprehensively.
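Since every entry above is a plain string accepted by `eval_tasks`, subsets can be selected programmatically. For example, a run restricted to the BLiMP linguistic-acceptability tasks, under the same assumptions as the earlier sketches:

```python
# Keep the full task list from above in one place, then filter by prefix.
ALL_TASKS = [
    "winogrande", "wsc273", "race", "anli_r1", "anli_r2", "anli_r3", "pubmedqa",
    "blimp_adjunct_island", "blimp_anaphor_gender_agreement",
    # ... the remaining entries from the list above ...
    "sciq", "swag", "piqa", "hellaswag", "glue_mnli", "glue_mnli_mismatched",
]
blimp_tasks = [t for t in ALL_TASKS if t.startswith("blimp_")]

blimp_results = Benchmark.eval_lmeval(
    model=model,
    tokenizer=tokenizer,
    eval_tasks=blimp_tasks,
    num_shot=0,
)
```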