Merge branch 'main' into feature/ppl/ultrachat200k
anmarques authored Mar 13, 2024
2 parents d5e11f5 + e09ae26 commit 8568a4f
Showing 169 changed files with 5,572 additions and 1,383 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test-check.yaml
@@ -45,7 +45,7 @@ jobs:
- name: "Clean sparsezoo directory"
run: rm -r sparsezoo/
- name: ⚙️ Install dependencies
run: pip install .[dev,server,image_classification,transformers,clip]
run: pip install .[dev,server,image_classification,yolov8,transformers,clip]
- name: Run base tests
run: make test
cli-smoke-tests:
3 changes: 2 additions & 1 deletion examples/benchmark/resnet50_benchmark.py
@@ -47,7 +47,8 @@

import numpy

from deepsparse import benchmark_model, cpu
from deepsparse import cpu
from deepsparse.engine import benchmark_model


CORES_PER_SOCKET, AVX_TYPE, VNNI = cpu.cpu_details()
46 changes: 23 additions & 23 deletions research/mpt/README.md
@@ -1,42 +1,42 @@
*LAST UPDATED: 11/24/2023*

# **Sparse Finetuned LLMs with DeepSparse**
# **Sparse Fine-Tuned LLMs With DeepSparse**

DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT and Meta's Llama 2.
Check out our paper [Sparse Finetuning for Inference Acceleration of Large Language Models](https://arxiv.org/abs/2310.06927)
Check out our paper [Sparse Fine-tuning for Inference Acceleration of Large Language Models](https://arxiv.org/abs/2310.06927)

In this research overview, we will discuss:
1. [Our Sparse Fineuning Research](#sparse-finetuning-research)
2. [How to try Text Generation with DeepSparse](#try-it-now)
1. [Our Sparse Fine-Tuning Research](#sparse-finetuning-research)
2. [How to Try Text Generation With DeepSparse](#try-it-now)

## **Sparse Finetuning Research**
## **Sparse Fine-Tuning Research**

We show that MPT-7B and Llama-2-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
We show that MPT-7B and Llama-2-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Fine-Tuning**, where we prune the network during the fine-tuning process.

When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!

### **Sparse Finetuning on Grade-School Math (GSM)**
### **Sparse Fine-Tuning on Grade-School Math (GSM)**

Training LLMs consist of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training with a much smaller high quality curated dataset. This second step is called finetuning.
Training LLMs consists of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training with a much smaller high-quality curated dataset. This second step is called fine-tuning.

Fine-tuning is useful for two main reasons:
1. It can teach the model *how to respond* to input (often called **instruction tuning**).
2. It can teach the model *new information* (often called **domain adaptation**).

An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
One example of where domain adaptation is helpful is the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.

The key insight from [our paper](https://arxiv.org/abs/2310.06927) is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k runs 7x faster than the dense baseline with DeepSparse!
The key insight from [our paper](https://arxiv.org/abs/2310.06927) is that we can prune the network during the fine-tuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense fine-tuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!

<div align="center">
<img src="https://github.com/neuralmagic/deepsparse/assets/3195154/f9a86726-12f5-4926-8d8c-668c449faa84" width="60%"/>
<img src="https://github.com/neuralmagic/deepsparse/assets/3195154/f9a86726-12f5-4926-8d8c-668c449faa84" width="60%" ALT="Sparse Fine-Tuned LLMs on GSM8k"/>
</div>

- [See the paper on Arxiv](https://arxiv.org/abs/2310.06927)
- [See our Llama 2 expansion blog on the initial paper](https://neuralmagic.com/blog/fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse/)
- [See the paper on Arxiv](https://arxiv.org/abs/2310.06927).
- [See our Llama 2 expansion blog on the initial paper](https://neuralmagic.com/blog/fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse/).

### **How Is This Useful For Real World Use?**
### **How Is This Useful For Real-World Use?**

While GSM is a "toy" math dataset, it serves as an example of how LLMs can be adapted to solve tasks which the general pretrained model cannot. Given the treasure-troves of domain-specific data held by companies, we expect to see many production models fine-tuned to create more accurate models fit to business tasks. Using Neural Magic, you can deploy these fine-tuned models performantly on CPUs!
While GSM is a "toy" math dataset, it serves as an example of how LLMs can be adapted to solve tasks that the general pre-trained model cannot. Given the treasure troves of domain-specific data held by companies, we expect to see many production models fine-tuned to create more accurate models fit to business tasks. Using Neural Magic, you can deploy these fine-tuned models performantly on CPUs!

## Try It Now

@@ -82,22 +82,22 @@ print(output.generations[0].text)
```
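
For reference, here is a minimal sketch of running one of these models with the `TextGeneration` pipeline. The SparseZoo stub and the prompt below are illustrative; see the SparseZoo link under Other Resources for the exact GSM8k model stubs.

```python
from deepsparse import TextGeneration

# Illustrative stub: substitute a concrete GSM8k model from SparseZoo
MODEL = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"

pipeline = TextGeneration(model=MODEL)
prompt = "If a class has 12 students and each student needs 3 pencils, how many pencils are needed in total?"
output = pipeline(prompt=prompt, max_new_tokens=128)
print(output.generations[0].text)
```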

#### Other Resources
- [Check out all the GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
- [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
- [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)
- [Check out all the GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true).
- [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
- [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md).

## **Roadmap**

Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including:

- **Productizing Sparse Fine Tuning**: Enable external users to apply the sparse fine-tuning to business datasets
- **Expanding Model Support**: Apply sparse fine-tuning results to Mistral models
- **Pushing to Higher Sparsity**: Improving our pruning algorithms to reach higher sparsity
- **Building General Sparse Model**: Create sparse model that can perform well on general tasks like OpenLLM leaderboard
- **Productizing Sparse Fine-Tuning**: Enable external users to apply the sparse fine-tuning to business datasets.
- **Expanding Model Support**: Apply sparse fine-tuning results to Mistral models.
- **Pushing to Higher Sparsity**: Improving our pruning algorithms to reach higher sparsity.
- **Building General Sparse Model**: Create a sparse model that can perform well on general tasks like OpenLLM leaderboard.

## **Feedback / Roadmap Requests**

We are excited to add initial support for LLMs in the Neural Magic stack and plan to bring many ongoing improvements over the coming months. For questions or requests regarding LLMs, please reach out in any of the following channels:
We are excited to add initial support for LLMs in the Neural Magic stack and plan to bring many ongoing improvements over the coming months. For questions or requests regarding LLMs, reach out through any of the following channels:
- [Neural Magic Community Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)
- [GitHub Issue Queue](https://github.com/neuralmagic/deepsparse/issues)
- [Contact Form](http://neuralmagic.com/contact/)
10 changes: 5 additions & 5 deletions setup.py
@@ -99,14 +99,15 @@ def _parse_requirements_file(file_path):
"black==22.12.0",
"flake8>=3.8.3",
"isort>=5.7.0",
"flaky~=3.7.0",
"pytest-rerunfailures>=13.0",
"ndjson>=0.3.1",
"wheel>=0.36.2",
"pytest>=6.0.0",
"onnxruntime>=1.7.0",
"flask>=1.0.0",
"flask-cors>=3.0.0",
"Pillow>=8.3.2",
"openai",
]
_docs_deps = [
"m2r2~=0.2.7",
@@ -147,8 +148,8 @@ def _parse_requirements_file(file_path):
"transformers<4.37",
"datasets<2.16",
"accelerate<0.26",
"scikit-learn",
"seqeval",
"evaluate",
]
_sentence_transformers_integration_deps = ["optimum-deepsparse"] + _torch_deps

@@ -165,8 +166,7 @@ def _parse_requirements_file(file_path):
_haystack_integration_deps = _parse_requirements_file(_haystack_requirements_file_path)
_clip_deps = [
"open_clip_torch==2.20.0",
"scipy<1.10,>=1.8",
"transformers<4.35",
"transformers<4.37",
]


@@ -309,7 +309,7 @@ def _setup_entry_points() -> Dict:
f"deepsparse.image_classification.eval={ic_eval}",
"deepsparse.license=deepsparse.license:main",
"deepsparse.validate_license=deepsparse.license:validate_license_cli",
"deepsparse.eval=deepsparse.evaluation.cli:main",
"deepsparse.evaluate=deepsparse.evaluation.cli:main",
]
}

5 changes: 4 additions & 1 deletion src/deepsparse/__init__.py
@@ -34,9 +34,12 @@
from .pipeline_config import *
from .tasks import *
from .pipeline import *
from .loggers import *
from .version import __version__, is_release
from .analytics import deepsparse_analytics as _analytics
from .subgraph_execute import *
from .analyze import analyze
from .evaluation.evaluator import evaluate
from .benchmark.benchmark_model import benchmark_model
from .benchmark.benchmark_pipeline import benchmark_pipeline

_analytics.send_event("python__init")
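
With the imports added above, these names should now be importable from the package root (a small sketch, assuming this commit is installed):

```python
# Top-level re-exports introduced by this change to src/deepsparse/__init__.py
from deepsparse import analyze, benchmark_model, benchmark_pipeline, evaluate
```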
58 changes: 38 additions & 20 deletions src/deepsparse/analyze.py
@@ -24,14 +24,18 @@
from deepsparse.benchmark.benchmark_model import benchmark_model
from deepsparse.utils import generate_random_inputs, model_to_path
from sparsezoo import convert_to_bool
from sparsezoo.analyze import (
from sparsezoo.analyze_v1 import (
BenchmarkResult,
BenchmarkScenario,
ImposedSparsificationInfo,
ModelAnalysis,
NodeInferenceResult,
)
from sparsezoo.analyze.cli import analyze_options, analyze_performance_options
from sparsezoo.analyze_v1.cli import (
DEEPSPARSE_ENGINE,
analyze_options,
analyze_performance_options,
)


_LOGGER = logging.getLogger(__name__)
@@ -74,21 +78,11 @@ def main(
)

_LOGGER.info("Starting Analysis ...")
analysis = ModelAnalysis.create(model_path)
_LOGGER.info("Analysis complete, collating results...")
scenario = BenchmarkScenario(
batch_size=batch_size_throughput,
num_cores=None,
engine=benchmark_engine,
)
performance_summary = run_benchmark_and_analysis(
onnx_model=model_to_path(model_path),
scenario=scenario,
)
analysis = analyze(model_path, batch_size_throughput, benchmark_engine)

by_types: bool = convert_to_bool(by_types)
by_layers: bool = convert_to_bool(by_layers)

analysis.benchmark_results = [performance_summary]
summary = analysis.summary(
by_types=by_types,
by_layers=by_layers,
@@ -103,13 +97,9 @@

print("Comparison Analysis:")
for model_to_compare in compare:
compare_model_analysis = ModelAnalysis.create(model_to_compare)
_LOGGER.info(f"Running Performance Analysis on {model_to_compare}")
performance_summary = run_benchmark_and_analysis(
onnx_model=model_to_path(model_to_compare),
scenario=scenario,
compare_model_analysis = analyze(
model_to_compare, batch_size_throughput, benchmark_engine
)
compare_model_analysis.benchmark_results = [performance_summary]
summary_comparison_model = compare_model_analysis.summary(
by_types=by_types,
by_layers=by_layers,
@@ -124,6 +114,34 @@ def main(
analysis.yaml(file_path=save)


def analyze(
model_path,
batch_size_throughput: int = 1,
benchmark_engine: str = DEEPSPARSE_ENGINE,
) -> ModelAnalysis:
"""
:param model_path: Local filepath to an ONNX model, or a SparseZoo stub
:param batch_size_throughput: Batch size for throughput benchmark
:param benchmark_engine: Benchmark engine to use, can be 'deepsparse' or
'onnxruntime', defaults to 'deepsparse'
:return: A `ModelAnalysis` object encapsulating the results of the analysis
"""
analysis = ModelAnalysis.create(model_path)
_LOGGER.info("Analysis complete, collating results...")
scenario = BenchmarkScenario(
batch_size=batch_size_throughput,
num_cores=None,
engine=benchmark_engine,
)
performance_summary = run_benchmark_and_analysis(
onnx_model=model_to_path(model_path),
scenario=scenario,
)

analysis.benchmark_results = [performance_summary]
return analysis


def run_benchmark_and_analysis(
onnx_model: str,
scenario: BenchmarkScenario,
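
A short sketch of how the new `analyze` helper defined above might be called directly; the ONNX path is illustrative, and `benchmark_engine` is omitted to use the DeepSparse default:

```python
from deepsparse.analyze import analyze

# Illustrative local path; per the docstring a SparseZoo stub should also work
analysis = analyze("model.onnx", batch_size_throughput=1)

# ModelAnalysis exposes the same outputs main() uses, e.g. saving to YAML
analysis.yaml(file_path="analysis.yaml")
```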
5 changes: 5 additions & 0 deletions src/deepsparse/benchmark/benchmark_model.py
@@ -411,6 +411,11 @@ def benchmark_model(
if not disable_kv_cache_overrides:
if not sequence_length:
sequence_length = infer_sequence_length(model_path)
if not sequence_length:
raise ValueError(
"Unable to infer sequence length from model. "
"Specify it manually through `sequence_length` argument."
)
if input_ids_length > sequence_length:
raise ValueError(
f"input_ids_length: {input_ids_length} "
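
To sidestep the new ValueError when a sequence length cannot be inferred, a caller can pass one explicitly. A hedged sketch; `sequence_length` and `input_ids_length` appear in the function above, while the remaining argument names and the return value are assumptions:

```python
from deepsparse import benchmark_model  # re-exported at the package root in this commit

results = benchmark_model(
    model_path="model.onnx",  # illustrative path or SparseZoo stub
    sequence_length=512,      # explicit value avoids relying on infer_sequence_length()
    input_ids_length=1,       # must not exceed sequence_length per the check above
)
print(results)  # assumed to be a dictionary of benchmark results
```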