[Feature Branch][DeepSparse Evaluation API] Update lm-eval, perplexity, additional datasets #1580

Merged
merged 36 commits into from
Feb 9, 2024

Conversation

dbogunowicz
Contributor

@dbogunowicz dbogunowicz commented Feb 5, 2024

This PR updates lm-eval from version 0.3 to 0.4.
Supported and tested datasets: gsm8k, hellaswag, arc_challenge.

Example usage

Example using CLI (when lm-eval is not installed):

 deepsparse.eval hf:mgoin/TinyStories-1M-ds --dataset hellaswag --dataset arc_challange --limit 2

2024-02-05 13:27:51 deepsparse.evaluation.cli INFO     Creating deepsparse pipeline to evaluate from model path: hf:mgoin/TinyStories-1M-ds
2024-02-05 13:27:51 deepsparse.evaluation.cli INFO     Datasets to evaluate on: ['hellaswag', 'arc_challange']
Batch size: 1
Splits to evaluate on: None
Metrics to evaluate on: None
Additional integration arguments supplied: {'limit': 2}
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 164189.84it/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 54535.87it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 92459.61it/s]
2024-02-05 13:27:54 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:27:54 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
Traceback (most recent call last):
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/integrations/__init__.py", line 20, in try_import_lm_evaluation_harness
    import lm_eval
ModuleNotFoundError: No module named 'lm_eval'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/bin/deepsparse.eval", line 8, in <module>
    sys.exit(main())
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/cli.py", line 193, in main
    result: Result = evaluate(
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/evaluator.py", line 63, in evaluate
    eval_integration = EvaluationRegistry.resolve(pipeline, datasets, integration)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/registry.py", line 72, in resolve
    potentially_check_dependency_import(integration)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/utils.py", line 46, in potentially_check_dependency_import
    try_import_lm_evaluation_harness(raise_error=True)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/integrations/__init__.py", line 25, in try_import_lm_evaluation_harness
    raise ImportError(
ImportError: Unable to import lm_eval. To install run 'pip install lm-eval==0.4.0'

Example using CLI

 deepsparse.eval hf:mgoin/TinyStories-1M-ds --dataset hellaswag --dataset arc_challange --limit 2

2024-02-05 13:24:42 deepsparse.evaluation.cli INFO     Creating deepsparse pipeline to evaluate from model path: hf:mgoin/TinyStories-1M-ds
2024-02-05 13:24:42 deepsparse.evaluation.cli INFO     Datasets to evaluate on: ['hellaswag', 'arc_challange']
Batch size: 1
Splits to evaluate on: None
Metrics to evaluate on: None
Additional integration arguments supplied: {'limit': 2}
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 39911.20it/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 20042.29it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 31906.88it/s]
2024-02-05 13:24:45 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:24:45 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
2024-02-05:13:24:49,100 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05:13:24:51,939 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05 13:24:51 deepsparse.evaluation.integrations.lm_evaluation_harness INFO     Selected Tasks: ['hellaswag']
2024-02-05:13:24:51,940 INFO     [lm_evaluation_harness.py:67] Selected Tasks: ['hellaswag']
2024-02-05:13:24:55,591 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:24:55,592 INFO     [evaluator.py:319] Running loglikelihood requests
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:11<00:00,  8.98s/it]
2024-02-05 13:26:07 deepsparse.evaluation.cli INFO     Evaluation done. Results:
formatted=[Evaluation(task='lm_evaluation_harness', dataset=Dataset(type=None, name='hellaswag', config={'model': 'DeepSparseLM', 'model_args': None, 'batch_size': 1, 'batch_sizes': [], 'device': None, 'use_cache': None, 'limit': 2, 'bootstrap_iters': 100000, 'gen_kwargs': None}, split=None), metrics=[Metric(name='acc,none', value=0.0), Metric(name='acc_stderr,none', value=0.0), Metric(name='acc_norm,none', value=1.0), Metric(name='acc_norm_stderr,none', value=0.0)], samples=None)]
2024-02-05 13:26:07 deepsparse.evaluation.cli INFO     Saving the evaluation results to /nm/drive0/damian/deepsparse/result.json
2024-02-05:13:26:07,507 INFO     [cli.py:212] Saving the evaluation results to /nm/drive0/damian/deepsparse/result.json

Example using evaluate function:

from deepsparse import evaluate

out = evaluate(
    model="hf:mgoin/TinyStories-1M-ds",
    datasets=["hellaswag", "arc_challenge"],
    limit=2,
)
print(out)
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 131820.98it/s]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 151767.58it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 35654.83it/s]
2024-02-05 13:09:34 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:09:34 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
2024-02-05:13:09:38,769 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05:13:09:41,599 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05 13:09:41 deepsparse.evaluation.integrations.lm_evaluation_harness INFO     Selected Tasks: ['arc_challenge', 'hellaswag']
2024-02-05:13:09:41,601 INFO     [lm_evaluation_harness.py:67] Selected Tasks: ['arc_challenge', 'hellaswag']
2024-02-05:13:09:48,822 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:09:48,829 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:09:48,832 INFO     [evaluator.py:319] Running loglikelihood requests
100%|██████████| 16/16 [05:34<00:00, 20.92s/it]
formatted=[Evaluation(task='lm_evaluation_harness', dataset=Dataset(type=None, name='arc_challenge', config={'model': 'DeepSparseLM', 'model_args': None, ...

Example running unit tests (requires lm-eval==0.4 to be installed)

damian@gpuserver6:/nm/drive0/damian/deepsparse$ pytest tests/deepsparse/evaluation/integrations/test_lm_evaluation_harness.py 
================================================================================================================================ test session starts ================================================================================================================================
platform linux -- Python 3.10.12, pytest-7.4.3, pluggy-1.3.0
rootdir: /nm/drive0/damian/deepsparse
configfile: pyproject.toml
plugins: flaky-3.7.0, anyio-3.7.1
collected 8 items                                                                                                                                                                                                                                                                   

tests/deepsparse/evaluation/integrations/test_lm_evaluation_harness.py ........                                                                                                                                                                                               [100%]

==================================================================================================================== 8 passed, 19 warnings in 302.35s (0:05:02) =====================================================================================================================

@@ -79,24 +81,66 @@ def if_generative_language_model(pipeline: Pipeline) -> bool:
     return False


-def args_to_dict(args: Tuple[Any, ...]) -> Dict[str, Any]:
+def parse_kwarg_tuples(kwargs: tuple) -> Dict:
Contributor Author

This is a 1:1 copy of the same function from SparseML. As proposed by @rahul-tuli, in the future let's move it to nm-utils so it can be shared between DeepSparse and SparseML.
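For readers unfamiliar with this kind of helper: the CLI log above reports `Additional integration arguments supplied: {'limit': 2}` for the trailing `--limit 2`. The sketch below shows one way such trailing `--key value` pairs could be turned into a dictionary. It is not the SparseML or DeepSparse implementation; the name `parse_extra_cli_args` and the value-conversion rule are assumptions made for illustration.

```python
import ast
from typing import Any, Dict, Tuple


def parse_extra_cli_args(args: Tuple[str, ...]) -> Dict[str, Any]:
    """Hypothetical helper: convert ("--limit", "2") into {"limit": 2}."""
    if len(args) % 2 != 0:
        raise ValueError(f"Expected '--key value' pairs, got: {args}")

    parsed: Dict[str, Any] = {}
    # Walk the tuple two items at a time: flag, then its raw string value.
    for flag, raw_value in zip(args[0::2], args[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"Expected a flag starting with '--', got: {flag}")
        key = flag[2:].replace("-", "_")
        try:
            # Assumed conversion rule: interpret Python literals (ints, floats, ...).
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything that is not a literal stays a plain string.
            value = raw_value
        parsed[key] = value
    return parsed


print(parse_extra_cli_args(("--limit", "2")))
```

The resulting dictionary could then be forwarded to the evaluation integration as keyword arguments, which is roughly what the log line suggests happens with `limit`.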

Base automatically changed from feature/damian/ui_improvements to main February 5, 2024 15:56

LM_EVALUATION_HARNESS = "lm-evaluation-harness"
_LOGGER = logging.getLogger(__name__)
LM_EVALUATION_HARNESS = "lm-eval-harness"
Member

@mgoin mgoin Feb 6, 2024

Considering lm-eval-harness is still the name of the repo, I would propose to keep it as lm-eval-harness. I think if you'd like to alias lm_eval as well since that is the name of their CLI command, that would be fine

TYPOS FIXED: Considering lm-evaluation-harness is still the name of the repo, I would propose to keep it as lm-evaluation-harness. I think if you'd like to alias lm_eval as well since that is the name of their CLI command, that would be fine

Contributor Author

Yes, this is something that has been introduced in this PR, right? We are indeed committing to the name 'lm-eval-harness'.

Member

Sorry, there are multiple typos. I meant to keep lm-evaluation-harness and contest the change to lm-eval-harness.

Contributor Author

@mgoin note: this comes from the docs: https://neuralmagic.github.io/docs-v2/get-started/deploy (bottom of the page). I am happy to change this to whatever product sees fit.

Contributor Author

Let's get this in in the current state, I can always change this detail if needed.
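As a footnote to the naming discussion in this thread, the aliasing idea (one integration reachable under several names, such as the repo name and the `lm_eval` CLI name) can be sketched with a small name-to-function registry. This is only an illustration: `register`, `resolve`, and `run_lm_eval` are hypothetical names and do not reflect how `EvaluationRegistry` is actually implemented.

```python
from typing import Callable, Dict

# Hypothetical registry mapping every accepted name to an evaluation function.
_REGISTRY: Dict[str, Callable[..., str]] = {}


def register(name: str, *aliases: str) -> Callable:
    """Register a function under a primary name and any number of aliases."""

    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        for key in (name, *aliases):
            if key in _REGISTRY:
                raise ValueError(f"Integration name already registered: {key}")
            _REGISTRY[key] = fn
        return fn

    return decorator


def resolve(name: str) -> Callable[..., str]:
    """Look up an integration by its primary name or by an alias."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise KeyError(f"Unknown integration '{name}'. Known names: {known}")


# Primary name follows the repo; the others are aliases for convenience.
@register("lm-evaluation-harness", "lm-eval-harness", "lm_eval")
def run_lm_eval(dataset: str) -> str:
    return f"evaluating {dataset} with lm-evaluation-harness"


print(resolve("lm_eval")("hellaswag"))
```

With aliases in place, the choice of canonical name becomes less critical, since either spelling resolves to the same integration.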

@@ -24,8 +24,7 @@ def try_import_lm_evaluation_harness(raise_error=False):
         if raise_error:
             raise ImportError(
                 "Unable to import lm_eval. "
-                "To install run 'pip install "
-                "git+https://github.com/EleutherAI/lm-evaluation-harness@b018a7d51'"
+                "To install run 'pip install lm-eval==0.4.0'"
Member

When or how will this raise an error during normal use if raise_error=False by default? Once the eval actually begins?

Contributor Author

@dbogunowicz dbogunowicz Feb 7, 2024

I see. Good point. Yes, I will change the default behavior of this function, and set raise_error to True.

This is the intended behavior when the actual eval is being run. At runtime, when the user intends to use lm-eval, the module will attempt a hot import of lm-eval. If the dependency is not installed, it will raise the error.

However, when testing, I do not want to raise errors, but use the output of this function (boolean) to skip the tests that require lm-eval installed.
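The two modes described here (raise at runtime, return a boolean for tests) can be sketched as a generic optional-import helper. This is an assumption-laden illustration rather than the code in `integrations/__init__.py`: the name `try_import_optional` and its `install_hint` parameter are hypothetical, and the helper is generalized to any module so the sketch runs without lm-eval installed.

```python
import importlib


def try_import_optional(
    module_name: str, install_hint: str, raise_error: bool = True
) -> bool:
    """Hypothetical helper: True if the module imports, else raise or return False."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError as err:
        if raise_error:
            # Runtime path: fail loudly and tell the user how to install.
            raise ImportError(
                f"Unable to import {module_name}. To install run '{install_hint}'"
            ) from err
        # Test path: report availability so callers can skip gracefully.
        return False


# Test-style usage: query availability without raising.
has_lm_eval = try_import_optional(
    "lm_eval", "pip install lm-eval==0.4.0", raise_error=False
)
print(f"lm_eval available: {has_lm_eval}")
```

In a test module, the boolean form could feed a marker such as `pytest.mark.skipif(not has_lm_eval, reason="lm-eval not installed")`, so tests that need lm-eval are skipped instead of failing.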

@dbogunowicz dbogunowicz requested a review from mgoin February 8, 2024 09:33
bfineran
bfineran previously approved these changes Feb 8, 2024
dbogunowicz and others added 2 commits February 9, 2024 12:17
* initial commit

* Update src/deepsparse/evaluation/integrations/__init__.py

* design ready, time to define additional features

* split prep_for_generation operator

* fix logits

* update non-kv cache pipeline and tests

* add tests to address edge cases

* add condition to check of kv_cache full during prompt inference, add test to cover this case, revert debugging changes

* fix typing

* remove commented code

* remove irrelevant condition

* perplexity for non-kv cache pipelines works!

* logic is working

* ready for review

* [DeepSparse Evaluation API] Perplexity eval support for `openai_humaneval`, `c4`, `wikitext2` (#1586)

* fix tests 2

* initial commit

* add return to a function

* make script more robust

---------

Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
@bfineran bfineran changed the title [DeepSparse Evaluation API] Update lm-eval support from 0.3 to 0.4 [Feature Branch][DeepSparse Evaluation API] Update lm-eval, perplexity, additional datasets Feb 9, 2024
bfineran
bfineran previously approved these changes Feb 9, 2024
@bfineran bfineran merged commit 517fd15 into main Feb 9, 2024
13 checks passed
@bfineran bfineran deleted the feature/damian/generate_until branch February 9, 2024 17:11
dbogunowicz added a commit that referenced this pull request Feb 12, 2024
bfineran pushed a commit that referenced this pull request Feb 12, 2024
3 participants