diff --git a/DEVELOPING.md b/DEVELOPING.md
index 81b6cce603..2e12ccd8b4 100644
--- a/DEVELOPING.md
+++ b/DEVELOPING.md
@@ -16,7 +16,7 @@ limitations under the License.
# Developing DeepSparse
-The DeepSparse Python API is developed and tested using Python 3.8-3.10.
+The DeepSparse Python API is developed and tested using Python 3.8-3.11.
To develop the Python API, you will also need the development dependencies and to follow the styling guidelines.
Here are some details to get started.
diff --git a/README.md b/README.md
index 411c45348a..067b2bd3cf 100644
--- a/README.md
+++ b/README.md
@@ -20,120 +20,134 @@ limitations under the License.
DeepSparse
-
An inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application
+ Sparsity-aware deep learning inference runtime for CPUs
+
-
-[DeepSparse](https://github.com/neuralmagic/deepsparse) is a CPU inference runtime that takes advantage of sparsity within neural networks to execute inference quickly. Coupled with [SparseML](https://github.com/neuralmagic/sparseml), an open-source optimization library, DeepSparse enables you to achieve GPU-class performance on commodity hardware.
+[DeepSparse](https://github.com/neuralmagic/deepsparse) is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. Coupled with [SparseML](https://github.com/neuralmagic/sparseml), our optimization library for pruning and quantizing your models, DeepSparse delivers exceptional inference performance on CPU hardware.
-For details of training sparse models for deployment with DeepSparse, [check out SparseML](https://github.com/neuralmagic/sparseml).
-
-### ✨NEW✨ DeepSparse ARM Alpha 💪
+## ✨NEW✨ DeepSparse LLMs
-Neural Magic is bringing performant deep learning inference to ARM CPUs! In our recent product release, we launched alpha support for DeepSparse on AWS Graviton and Ampere. We are working towards a general release across ARM server, embedded, and mobile platforms in 2023.
+We are excited to announce initial support for performant LLM inference in DeepSparse with:
+- Sparse kernels for speedups and memory savings from unstructured sparse weights
+- 8-bit weight and activation quantization support
+- Efficient usage of cached attention keys and values for minimal memory movement
-**If you would like to trial the alpha or want early access to the general release, [sign up for the waitlist](https://neuralmagic.com/deepsparse-arm-waitlist/).**
+![mpt-chat-comparison](https://github.com/neuralmagic/deepsparse/assets/3195154/ccf39323-4603-4489-8462-7b103872aeb3)
-## Installation
+### Try It Now
-DeepSparse is available in two editions:
-1. DeepSparse Community is free for evaluation, research, and non-production use with our [DeepSparse Community License](https://neuralmagic.com/legal/engine-license-agreement/).
-2. DeepSparse Enterprise requires a [trial license](https://neuralmagic.com/deepsparse-free-trial/) or [can be fully licensed](https://neuralmagic.com/legal/master-software-license-and-service-agreement/) for production, commercial applications.
+Install (requires Linux):
+```bash
+pip install -U deepsparse-nightly[llm]
+```
-#### Install via Docker (Recommended)
+Run inference:
+```python
+from deepsparse import TextGeneration
+pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")
-DeepSparse Community is available as a container image hosted on [GitHub container registry](https://github.com/neuralmagic/deepsparse/pkgs/container/deepsparse).
+prompt = """
+Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
+"""
+print(pipeline(prompt, max_new_tokens=75).generations[0].text)
-```bash
-docker pull ghcr.io/neuralmagic/deepsparse:1.4.2
-docker tag ghcr.io/neuralmagic/deepsparse:1.4.2 deepsparse-docker
-docker run -it deepsparse-docker
+# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero, and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.
```
-- [Check out the Docker page](https://github.com/neuralmagic/deepsparse/tree/main/docker/) for more details.
+> [Check out the `TextGeneration` documentation for usage details.](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)
+
+### Sparsity :handshake: Performance
+
+Developed in collaboration with IST Austria, [our recent paper](https://arxiv.org/abs/2310.06927) details a new technique called **Sparse Finetuning**, which allows us to prune MPT-7B to 60% sparsity during finetuning without a drop in accuracy. With our new support for LLMs, DeepSparse accelerates the sparse-quantized model 7x over the dense baseline:
+
+> [Learn more about our Sparse Finetuning research.](https://github.com/neuralmagic/deepsparse/blob/main/research/mpt#sparse-finetuned-llms-with-deepsparse)
+
+> [Check out the model running live on Hugging Face.](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k)
+
+### LLM Roadmap
+
+Following this initial launch, we are rapidly expanding our support for LLMs, including:
-#### Install via PyPI
-DeepSparse Community is also available via PyPI. We recommend using a virtual enviornment.
+1. Productizing Sparse Finetuning: Enable external users to apply sparse finetuning to their own datasets via SparseML
+2. Expanding Model Support: Apply our sparse finetuning results to Llama2 and Mistral models
+3. Pushing to Higher Sparsity: Improve our pruning algorithms to reach even higher sparsity
+
+## Computer Vision and NLP Models
+
+In addition to LLMs, DeepSparse supports many variants of CNNs and Transformer models, such as BERT, ViT, ResNet, EfficientNet, and YOLOv5/8. Take a look at the [Computer Vision](https://sparsezoo.neuralmagic.com/?modelSet=computer_vision) and [Natural Language Processing](https://sparsezoo.neuralmagic.com/?modelSet=natural_language_processing) domains of [SparseZoo](https://sparsezoo.neuralmagic.com/), our home for optimized models.
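+
+These models plug into the same `Pipeline` API covered in the Deployment APIs section below. Here is a minimal sketch for image classification; the SparseZoo stub and image filename are illustrative only, so check SparseZoo for current models:
+
+```python
+from deepsparse import Pipeline
+
+# illustrative stub -- browse sparsezoo.neuralmagic.com for current image classification models
+stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none"
+
+# create an image classification pipeline backed by DeepSparse
+cv_pipeline = Pipeline.create(task="image_classification", model_path=stub)
+
+# run inference on a local image file and print the predicted labels
+prediction = cv_pipeline(images=["goldfish.jpeg"])
+print(prediction.labels)
+```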
+
+### Installation
+
+Install via [PyPI](https://pypi.org/project/deepsparse/) ([optional dependencies detailed here](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/installation.md)):
```bash
-pip install deepsparse
+pip install deepsparse
```
-- [Check out the Installation page](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/installation.md) for optional dependencies.
+To experiment with the latest features, install the nightly build with `pip install deepsparse-nightly`, or clone the repository and install from source with `pip install -e path/to/deepsparse`.
-## Hardware Support and System Requirements
+#### System Requirements
+- Hardware: [x86 AVX2, AVX512, AVX512-VNNI and ARM v8.2+](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/hardware-support.md)
+- Operating System: Linux
+- Python: 3.8-3.11
+- ONNX versions 1.5.0-1.15.0, ONNX opset version 11 or higher
-[Supported Hardware for DeepSparse](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/hardware-support.md)
-
-DeepSparse is tested on Python versions 3.8-3.10, ONNX versions 1.5.0-1.12.0, ONNX opset version 11 or higher, and manylinux compliant systems. Please note that DeepSparse is only supported natively on Linux. For those using Mac or Windows, running Linux in a Docker or virtual machine is necessary to use DeepSparse.
+For those using Mac or Windows, we recommend using Linux containers with Docker.
## Deployment APIs
DeepSparse includes three deployment APIs:
-- **Engine** is the lowest-level API. With Engine, you pass tensors and receive the raw logits.
+- **Engine** is the lowest-level API. With Engine, you compile an ONNX model, pass tensors as input, and receive the raw outputs.
- **Pipeline** wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
- **Server** wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.
### Engine
-The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input.
+The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. Users can provide their own ONNX models, whether dense or sparse.
```python
from deepsparse import Engine
-from deepsparse.utils import generate_random_inputs, model_to_path
# download onnx, compile
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
-batch_size = 1
-compiled_model = Engine(model=zoo_stub, batch_size=batch_size)
+compiled_model = Engine(model=zoo_stub, batch_size=1)
# run inference (input is raw numpy tensors, output is raw scores)
-inputs = generate_random_inputs(model_to_path(zoo_stub), batch_size)
+inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
print(output)
# > [array([[-0.3380675 , 0.09602544]], dtype=float32)] << raw scores
```
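+
+To run a local ONNX file (dense or sparse) instead of a SparseZoo stub, pass its path to `Engine`. A minimal sketch, assuming a MobileNetV2 classifier (`mobilenetv2-7.onnx`) has already been downloaded locally:
+
+```python
+from deepsparse import Engine
+
+# compile a local ONNX file at the desired batch size
+onnx_filepath = "mobilenetv2-7.onnx"  # assumed to exist locally
+compiled_model = Engine(model=onnx_filepath, batch_size=16)
+
+# generate random inputs matching the model's input shapes, then run inference
+inputs = compiled_model.generate_random_inputs()
+outputs = compiled_model(inputs)
+print(outputs[0].shape)  # e.g. (16, 1000) for an ImageNet classifier
+```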
-### DeepSparse Pipelines
-
-Pipeline is the default API for interacting with DeepSparse. Similar to Hugging Face Pipelines, DeepSparse Pipelines wrap Engine with pre- and post-processing (as well as other utilities), enabling you to send raw data to DeepSparse and receive the post-processed prediction.
+### Pipeline
-The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.
+Pipelines wrap Engine with pre- and post-processing, enabling you to pass raw data and receive the post-processed prediction. The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.
```python
from deepsparse import Pipeline
@@ -151,15 +165,9 @@ print(prediction)
# > labels=['positive'] scores=[0.9954759478569031]
```
-#### Additional Resources
-- Check out the [Use Cases Page](https://github.com/neuralmagic/deepsparse/tree/main/docs/use-cases) for more details on supported tasks.
-- Check out the [Pipelines User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-pipelines.md) for more usage details.
-
-### DeepSparse Server
+### Server
-Server wraps Pipelines with REST APIs, enabling you to stand up model serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed predictions.
-
-DeepSparse Server is launched from the command line, configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:
+Server wraps Pipelines with REST APIs, enabling you to stand up a model-serving endpoint running DeepSparse. With Server, you send raw data over HTTP and receive the post-processed predictions. DeepSparse Server is launched from the command line and configured via arguments or a server configuration file. The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:
```bash
deepsparse.server \
@@ -172,7 +180,7 @@ Sending a request:
```python
import requests
-url = "http://localhost:5543/predict" # Server's port default to 5543
+url = "http://localhost:5543/v2/models/sentiment_analysis/infer" # Server's port defaults to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}
response = requests.post(url, json=obj)
@@ -180,103 +188,70 @@ print(response.text)
# {"labels":["positive"],"scores":[0.9965094327926636]}
```
-#### Additional Resources
-- Check out the [Use Cases Page](https://github.com/neuralmagic/deepsparse/tree/main/docs/use-cases) for more details on supported tasks.
-- Check out the [Server User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-server.md) for more usage details.
-
-## ONNX
-
-DeepSparse accepts models in the ONNX format. ONNX models can be passed in one of two ways:
-
-- **SparseZoo Stub**: [SparseZoo](https://sparsezoo.neuralmagic.com/) is an open-source repository of sparse models. The examples on this page use SparseZoo stubs to identify models and download them for deployment in DeepSparse.
-
-- **Local ONNX File**: Users can provide their own ONNX models, whether dense or sparse. For example:
-
-```bash
-wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx
-```
-
-```python
-from deepsparse import Engine
-from deepsparse.utils import generate_random_inputs
-onnx_filepath = "mobilenetv2-7.onnx"
-batch_size = 16
-
-# Generate random sample input
-inputs = generate_random_inputs(onnx_filepath, batch_size)
-
-# Compile and run
-compiled_model = Engine(model=onnx_filepath, batch_size=batch_size)
-outputs = compiled_model(inputs)
-print(outputs[0].shape)
-# (16, 1000) << batch, num_classes
-```
-
-## Inference Modes
-
-DeepSparse offers different inference scenarios based on your use case.
-
-**Single-stream** scheduling: the latency/synchronous scenario, requests execute serially. [`default`]
-
-
-
-It's highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets.
-
-**Multi-stream** scheduling: the throughput/asynchronous scenario, requests execute in parallel.
-
-
-
-The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them.
+### Additional Resources
+- [Use Cases Page](https://github.com/neuralmagic/deepsparse/tree/main/docs/use-cases) for more details on supported tasks
+- [Pipelines User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-pipelines.md) for Pipeline documentation
+- [Server User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-server.md) for Server documentation
+- [Benchmarking User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-benchmarking.md) for benchmarking documentation
+- [Cloud Deployments and Demos](https://github.com/neuralmagic/deepsparse/tree/main/examples/)
+- [User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide) for more detailed documentation
-- [Check out the Scheduler User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/scheduler.md) for more details.
## Product Usage Analytics
-DeepSparse Community Edition gathers basic usage telemetry including, but not limited to, Invocations, Package, Version, and IP Address for Product Usage Analytics purposes. Review Neural Magic's [Products Privacy Policy](https://neuralmagic.com/legal/) for further details on how we process this data.
-To disable Product Usage Analytics, run the command:
+DeepSparse gathers basic usage telemetry including, but not limited to, Invocations, Package, Version, and IP Address for Product Usage Analytics purposes. Review Neural Magic's [Products Privacy Policy](https://neuralmagic.com/legal/) for further details on how we process this data.
+
+To disable Product Usage Analytics, run:
```bash
export NM_DISABLE_ANALYTICS=True
```
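+
+The same setting can also be applied from Python by exporting the variable before DeepSparse is imported; a minimal sketch, assuming the variable is read when the engine is invoked:
+
+```python
+import os
+
+# set the opt-out flag before importing deepsparse so it is visible to the engine
+os.environ["NM_DISABLE_ANALYTICS"] = "True"
+
+import deepsparse  # noqa: E402  (imported after setting the environment variable)
+```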
-Confirm that telemetry is shut off through info logs streamed with engine invocation by looking for the phrase "Skipping Neural Magic's latest package version check." For additional assistance, reach out through the [DeepSparse GitHub Issue queue](https://github.com/neuralmagic/deepsparse/issues).
-
-## Additional Resources
-- [Benchmarking Performance](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-benchmarking.md)
-- [User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide)
-- [Use Cases](https://github.com/neuralmagic/deepsparse/tree/main/docs/use-cases)
-- [Cloud Deployments and Demos](https://github.com/neuralmagic/deepsparse/tree/main/examples/)
-
-#### Versions
-- [DeepSparse](https://pypi.org/project/deepsparse) | stable
-- [DeepSparse-Nightly](https://pypi.org/project/deepsparse-nightly/) | nightly (dev)
-- [GitHub](https://github.com/neuralmagic/deepsparse/releases) | releases
-
-#### Info
-- [Blog](https://www.neuralmagic.com/blog/)
-- [Resources](https://www.neuralmagic.com/resources/)
+Confirm that telemetry is shut off through info logs streamed with engine invocation by looking for the phrase "Skipping Neural Magic's latest package version check."
## Community
-### Be Part of the Future... And the Future is Sparse!
+### Get In Touch
-Contribute with code, examples, integrations, and documentation as well as bug reports and feature requests! [Learn how here.](https://github.com/neuralmagic/deepsparse/blob/main/CONTRIBUTING.md)
-
-For user help or questions about DeepSparse, sign up or log in to our **[Neural Magic Community Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)**. We are growing the community member by member and happy to see you there. Bugs, feature requests, or additional questions can also be posted to our [GitHub Issue Queue.](https://github.com/neuralmagic/deepsparse/issues) You can get the latest news, webinar and event invites, research papers, and other ML Performance tidbits by [subscribing](https://neuralmagic.com/subscribe/) to the Neural Magic community.
+- [Contribution Guide](https://github.com/neuralmagic/deepsparse/blob/main/CONTRIBUTING.md)
+- [Community Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)
+- [GitHub Issue Queue](https://github.com/neuralmagic/deepsparse/issues)
+- [Subscribe To Our Newsletter](https://neuralmagic.com/subscribe/)
+- [Blog](https://www.neuralmagic.com/blog/)
-For more general questions about Neural Magic, complete this [form.](http://neuralmagic.com/contact/)
+For more general questions about Neural Magic, [complete this form.](http://neuralmagic.com/contact/)
### License
-[DeepSparse Community](https://docs.neuralmagic.com/products/deepsparse) is licensed under the [Neural Magic DeepSparse Community License.](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE-NEURALMAGIC)
-Some source code, example files, and scripts included in the deepsparse GitHub repository or directory are licensed under the [Apache License Version 2.0](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE) as noted.
+- **DeepSparse Community** is licensed under the [Neural Magic DeepSparse Community License.](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE-NEURALMAGIC)
+Some source code, example files, and scripts included in the DeepSparse GitHub repository or directory are licensed under the [Apache License Version 2.0](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE) as noted.
-[DeepSparse Enterprise](https://docs.neuralmagic.com/products/deepsparse-ent) requires a Trial License or [can be fully licensed](https://neuralmagic.com/legal/master-software-license-and-service-agreement/) for production, commercial applications.
+- **DeepSparse Enterprise** requires a Trial License or [can be fully licensed](https://neuralmagic.com/legal/master-software-license-and-service-agreement/) for production, commercial applications.
### Cite
Find this project useful in your research or other communications? Please consider citing:
```bibtex
+@misc{kurtic2023sparse,
+ title={Sparse Finetuning for Inference Acceleration of Large Language Models},
+ author={Eldar Kurtic and Denis Kuznedelev and Elias Frantar and Michael Goin and Dan Alistarh},
+ year={2023},
+ url={https://arxiv.org/abs/2310.06927},
+ eprint={2310.06927},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+}
+
+@misc{kurtic2022optimal,
+ title={The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models},
+ author={Eldar Kurtic and Daniel Campos and Tuan Nguyen and Elias Frantar and Mark Kurtz and Benjamin Fineran and Michael Goin and Dan Alistarh},
+ year={2022},
+ url={https://arxiv.org/abs/2203.07259},
+ eprint={2203.07259},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+}
+
@InProceedings{
pmlr-v119-kurtz20a,
title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
@@ -295,10 +270,7 @@ Find this project useful in your research or other communications? Please consid
}
@article{DBLP:journals/corr/abs-2111-13445,
- author = {Eugenia Iofinova and
- Alexandra Peste and
- Mark Kurtz and
- Dan Alistarh},
+ author = {Eugenia Iofinova and Alexandra Peste and Mark Kurtz and Dan Alistarh},
title = {How Well Do Sparse Imagenet Models Transfer?},
journal = {CoRR},
volume = {abs/2111.13445},
diff --git a/docs/llms/integration-langchain.md b/docs/llms/integration-langchain.md
new file mode 100644
index 0000000000..b1e0589a70
--- /dev/null
+++ b/docs/llms/integration-langchain.md
@@ -0,0 +1,76 @@
+
+
+# **DeepSparse LangChain Integration**
+
+[DeepSparse](https://github.com/neuralmagic/deepsparse) has an official integration within [LangChain](https://python.langchain.com/docs/integrations/llms/deepsparse).
+This guide is broken into two parts: installation and setup, followed by examples of DeepSparse usage.
+
+## Installation and Setup
+
+- Install the Python packages with `pip install deepsparse-nightly langchain`
+- Choose a [SparseZoo model](https://sparsezoo.neuralmagic.com/?useCase=text_generation) or export a supported model to ONNX [using Optimum](https://github.com/neuralmagic/notebooks/blob/main/notebooks/opt-text-generation-deepsparse-quickstart/OPT_Text_Generation_DeepSparse_Quickstart.ipynb)
+- Models hosted on Hugging Face are also supported by prepending `"hf:"` to the model id, such as [`"hf:mgoin/TinyStories-33M-quant-deepsparse"`](https://huggingface.co/mgoin/TinyStories-33M-quant-deepsparse)
+
+## Wrappers
+
+There exists a DeepSparse LLM wrapper, which you can access with:
+
+```python
+from langchain.llms import DeepSparse
+```
+
+It provides a simple, unified interface for all models:
+
+```python
+from langchain.llms import DeepSparse
+llm = DeepSparse(model='zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggingface/bigpython_bigquery_thepile/base-none')
+print(llm('def fib():'))
+```
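+
+Hugging Face-hosted models noted in the setup section work the same way by prepending `hf:` to the model id; a minimal sketch using the TinyStories model mentioned above:
+
+```python
+from langchain.llms import DeepSparse
+
+# the "hf:" prefix loads a DeepSparse-compatible model hosted on Hugging Face
+llm = DeepSparse(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
+print(llm("Once upon a time"))
+```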
+
+It also supports per-token output streaming:
+
+```python
+from langchain.llms import DeepSparse
+llm = DeepSparse(
+ model="zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggingface/bigpython_bigquery_thepile/base_quant-none",
+ streaming=True
+)
+for chunk in llm.stream("Tell me a joke", stop=["'","\n"]):
+ print(chunk, end='', flush=True)
+```
+
+## Configuration
+
+The wrapper exposes arguments to control which model is loaded, how it is loaded, how tokens are generated, and whether to return all tokens at once or stream them one by one.
+
+```python
+model: str
+"""The path to a model file or directory or the name of a SparseZoo model stub."""
+
+model_config: Optional[Dict[str, Any]] = None
+"""Keyword arguments passed to the pipeline construction.
+Common parameters are sequence_length, prompt_sequence_length"""
+
+generation_config: Union[None, str, Dict] = None
+"""GenerationConfig dictionary consisting of parameters used to control
+sequences generated for each prompt. Common parameters are:
+max_length, max_new_tokens, num_return_sequences, output_scores,
+top_p, top_k, repetition_penalty."""
+
+streaming: bool = False
+"""Whether to stream the results, token by token."""
+```
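+
+A minimal sketch tying these options together, using the quantized CodeGen stub from the streaming example above (the specific config values are illustrative):
+
+```python
+from langchain.llms import DeepSparse
+
+# illustrative settings -- tune sequence_length and generation parameters for your use case
+llm = DeepSparse(
+    model="zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggingface/bigpython_bigquery_thepile/base_quant-none",
+    model_config={"sequence_length": 512},
+    generation_config={"max_new_tokens": 32},
+    streaming=False,
+)
+print(llm("def fib():"))
+```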
diff --git a/docs/llms/text-generation-pipeline.md b/docs/llms/text-generation-pipeline.md
index 976393cf46..1cb478a182 100644
--- a/docs/llms/text-generation-pipeline.md
+++ b/docs/llms/text-generation-pipeline.md
@@ -14,23 +14,23 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
-# **Text Generation Pipelines**
+# **Text Generation Pipeline**
-This user guide describes how to run inference of text generation models with DeepSparse.
+This user guide explains how to run inference with text generation models using DeepSparse.
## **Installation**
-DeepSparse support for LLMs is currently available on DeepSparse's nightly build on PyPi:
+DeepSparse support for LLMs is available in DeepSparse's nightly build on PyPI:
```bash
-pip install -U deepsparse-nightly==1.6.0.20231007[transformers]
+pip install -U deepsparse-nightly[llm]
```
#### **System Requirements**
- Hardware: x86 AVX2, AVX512, AVX512-VNNI and ARM v8.2+.
- Operating System: Linux (MacOS will be supported soon)
-- Python: v3.8-3.10
+- Python: v3.8-3.11
For those using MacOS or Windows, we suggest using Linux containers with Docker to run DeepSparse.
@@ -41,8 +41,8 @@ DeepSparse exposes a Pipeline interface called `TextGeneration`, which is used t
from deepsparse import TextGeneration
# construct a pipeline
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
+pipeline = TextGeneration(model=model_path)
# generate text
prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
@@ -52,27 +52,29 @@ print(output.generations[0].text)
# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
```
-> **Note:** The 7B model takes about 2 minutes to compile. Set `MODEL_PATH` to `hf:mgoin/TinyStories-33M-quant-deepsparse` to use a small TinyStories model for quick compilation if you are just experimenting.
+> **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"` to use a small TinyStories model for quick compilation if you are just experimenting.
+
## **Model Format**
DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.
-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs from SparseZoo.***
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***For now, we suggest only using LLM ONNX graphs created by Neural Magic.***
+>
### **SparseZoo Stubs**
-SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none` identifes a 50% pruned-quantized MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
+SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
```python
-model_path = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=model_path)
+model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
+pipeline = TextGeneration(model=model_path)
```
### **Local Deployment Directory**
Additionally, we can pass a local path to a deployment directory. Use the SparseZoo API to download an example deployment directory:
```python
-import sparsezoo
-sz_model = sparsezoo.Model("zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none", "./local-model")
+from sparsezoo import Model
+sz_model = Model("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized", "./local-model")
sz_model.deployment.download()
```
@@ -84,8 +86,16 @@ ls ./local-model/deployment
We can pass the local directory path to `TextGeneration`:
```python
-model_path = "./local-model/deployment"
-pipeline = TextGeneration(model_path=model_path)
+from deepsparse import TextGeneration
+pipeline = TextGeneration(model="./local-model/deployment")
+```
+
+### **Hugging Face Models**
+Hugging Face models that conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to the model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
+
+```python
+from deepsparse import TextGeneration
+pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant")
```
## **Input and Output Formats**
@@ -96,8 +106,7 @@ The following examples use a quantized 33M parameter TinyStories model for quick
```python
from deepsparse import TextGeneration
-MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
```
### Input Format
@@ -112,13 +121,14 @@ for prompt_i, generation_i in zip(output.prompts, output.generations):
print(f"{prompt_i}{generation_i.text}")
# >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old
+
# >> Mario ran into the castle and started to explore. He ran around the castle and climbed on the throne. He even tried to climb
```
- `streaming`: Boolean determining whether to stream the response. If True, the results are returned as a generator object that yields results as they are generated.
```python
-prompt = "Princess peach jumped from the balcony"
+prompt = "Princess Peach jumped from the balcony"
output_iterator = pipeline(prompt=prompt, streaming=True, max_new_tokens=20)
print(prompt, end="")
@@ -172,8 +182,8 @@ The following examples use a quantized 33M parameter TinyStories model for quick
```python
from deepsparse import TextGeneration
-MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
+pipeline = TextGeneration(model=model_id)
```
### **Creating A `GenerationConfig`**
@@ -213,7 +223,7 @@ We can pass a `GenerationConfig` to `TextGeneration.__init__` or `TextGeneration
```python
# set generation_config during __init__
-pipeline_w_gen_config = TextGeneration(model_path=MODEL_PATH, generation_config={"max_new_tokens": 10})
+pipeline_w_gen_config = TextGeneration(model=model_id, generation_config={"max_new_tokens": 10})
# generation_config is the default during __call__
output = pipeline_w_gen_config(prompt=prompt)
@@ -225,7 +235,7 @@ print(f"{prompt}{output.generations[0].text}")
```python
# no generation_config set during __init__
-pipeline_w_no_gen_config = TextGeneration(model_path=MODEL_PATH)
+pipeline_w_no_gen_config = TextGeneration(model=model_id)
# generation_config is the passed during __call__
output = pipeline_w_no_gen_config(prompt=prompt, generation_config= {"max_new_tokens": 10})
@@ -295,7 +305,7 @@ import numpy
# only 20 logits are not set to -inf == only 20 logits used to sample token
output = pipeline(prompt=prompt, do_sample=True, top_k=20, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))
-# >> array([20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20])
+# >> [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
```
- `top_p`: Float to define the tokens that are considered with nucleus sampling. If `0.0`, `top_p` is turned off. Default is `0.0`
@@ -306,7 +316,7 @@ import numpy
output = pipeline(prompt=prompt, do_sample=True, top_p=0.9, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))
-# >> array([20, 15, 10, 5, 25, 3, 10, 7, 6, 6, 15, 12, 11, 3, 4, 4])
+# >> [ 5 119 18 14 204 6 7 367 191 20 12 7 46 6 2 35]
```
- `repetition_penalty`: The more a token is used within generation, the more it is penalized to not be picked in successive generation passes. If `0.0`, `repetition_penalty` is turned off. Default is `0.0`
diff --git a/docs/user-guide/installation.md b/docs/user-guide/installation.md
index 767ff8a5be..5b8ede3706 100644
--- a/docs/user-guide/installation.md
+++ b/docs/user-guide/installation.md
@@ -16,7 +16,7 @@ limitations under the License.
# DeepSparse Installation
-DeepSparse is tested on Python 3.8-3.10, ONNX 1.5.0-1.10.1, ONNX opset version 11+ and is [manylinux compliant](https://peps.python.org/pep-0513/).
+DeepSparse is tested on Python 3.8-3.11, ONNX 1.5.0-1.15.0, ONNX opset version 11+ and is [manylinux compliant](https://peps.python.org/pep-0513/).
It currently supports Intel and AMD AVX2, AVX-512, and VNNI x86 instruction sets.
diff --git a/research/mpt/README.md b/research/mpt/README.md
index 15edc6f126..e0cbca6feb 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -1,32 +1,38 @@
-# **Sparse Finetuned LLMs with DeepSparse**
-
-DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
+*LAST UPDATED: 10/11/2023*
-In this overview, we will discuss:
-1. [Current status of our sparse fine-tuning research](#sparse-fine-tuning-research)
-2. [How to try text generation with DeepSparse](#try-it-now)
+# **Sparse Finetuned LLMs with DeepSparse**
-For detailed usage instructions, [see the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md).
+DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
+Check out our paper, [Sparse Finetuning for Inference Acceleration of Large Language Models](https://arxiv.org/abs/2310.06927).
-![deepsparse_mpt_gsm_speedup](https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9)
+In this research overview, we will discuss:
+1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
+2. [How to try Text Generation with DeepSparse](#try-it-now)
## **Sparse Finetuning Research**
-Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop.
+We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
-We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization, without loss, using a technique called **Sparse Finetuning**, where we prune the network during the fine-tuning process.
+When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
### **Sparse Finetuning on Grade-School Math (GSM)**
-Open-source LLMs are typically fine-tuned onto downstream datasets for two reasons:
-* **Instruction Tuning**: show the LLM examples of how to respond to human input or prompts properly
-* **Domain Adaptation**: show the LLM examples with information it does not currently understand
+Training LLMs consists of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training with a much smaller, high-quality curated dataset. This second step is called finetuning.
+
+Fine-tuning is useful for two main reasons:
+1. It can teach the model *how to respond* to input (often called **instruction tuning**).
+2. It can teach the model *new information* (often called **domain adaptation**).
+
-An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B-base. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
+An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
-The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with limited accuracy drop on GSM8k runs 6.7x faster than the dense baseline with DeepSparse!
+The key insight from [our paper](https://arxiv.org/abs/2310.06927) is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!
-Paper: (link to paper)
+
+- [See the paper on Arxiv](https://arxiv.org/abs/2310.06927)
### **How Is This Useful For Real World Use?**
@@ -37,18 +43,20 @@ While GSM is a "toy" math dataset, it serves as an example of how LLMs can be ad
Install the DeepSparse Nightly build (requires Linux):
```bash
-pip install deepsparse-nightly[transformers]
+pip install -U deepsparse-nightly[llm]
```
+The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
+
### MPT-7B on GSM
-We can run inference on the 60% sparse-quantized MPT-7B GSM model using DeepSparse's `TextGeneration` Pipeline:
+We can run inference on the models using DeepSparse's `TextGeneration` Pipeline:
```python
from deepsparse import TextGeneration
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/gsm8k/pruned60_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"
+pipeline = TextGeneration(model=model)
prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
output = pipeline(prompt=prompt)
@@ -59,13 +67,13 @@ print(output.generations[0].text)
### >> #### 72
```
-It is also possible to run models directly from Hugging Face by prepending `"hf:"` to a model id, such as:
+It is also possible to run the models directly from Hugging Face by prepending `"hf:"` to a model id, such as:
```python
from deepsparse import TextGeneration
-MODEL_PATH = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+hf_model_id = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant"
+pipeline = TextGeneration(model=hf_model_id)
prompt = "Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?"
output = pipeline(prompt=prompt)
@@ -76,26 +84,22 @@ print(output.generations[0].text)
### >> #### 5
```
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***For now, we suggest only using LLM ONNX graphs created by Neural Magic's team.***
+
+
#### Other Resources
- [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
- [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
+- [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)
-### **MPT-7B on Dolly-HHRLHF**
+## **Roadmap**
-We have also made a 50% sparse-quantized MPT-7B fine-tuned on [Dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) available on SparseZoo. We can run inference with the following:
+Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including:
-```python
-from deepsparse import TextGeneration
-
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
-
-prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Kubernetes? ### Response:"
-output = pipeline(prompt=prompt)
-print(output.generations[0].text)
-
-### >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
-```
+- **Productizing Sparse Finetuning**: Enable external users to apply sparse finetuning to business datasets
+- **Expanding Model Support**: Apply sparse finetuning results to Llama2 and Mistral models
+- **Pushing to Higher Sparsity**: Improve our pruning algorithms to reach higher sparsity
+- **Building General Sparse Models**: Create sparse models that perform well on general tasks like the OpenLLM leaderboard
## **Feedback / Roadmap Requests**
diff --git a/setup.py b/setup.py
index ad9cbf5c33..64402852a4 100644
--- a/setup.py
+++ b/setup.py
@@ -346,7 +346,7 @@ def _setup_long_description() -> Tuple[str, str]:
install_requires=_setup_install_requires(),
extras_require=_setup_extras(),
entry_points=_setup_entry_points(),
- python_requires=">=3.8, <3.11",
+ python_requires=">=3.8, <3.12",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Programming Language :: Python :: 3",
@@ -354,6 +354,7 @@ def _setup_long_description() -> Tuple[str, str]:
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
+ "Programming Language :: Python :: 3.11",
"Intended Audience :: Developers",
"Intended Audience :: Education",
"Intended Audience :: Information Technology",
diff --git a/src/deepsparse/image_classification/README.md b/src/deepsparse/image_classification/README.md
index 5a5571a197..1229ff9db5 100644
--- a/src/deepsparse/image_classification/README.md
+++ b/src/deepsparse/image_classification/README.md
@@ -151,7 +151,7 @@ Making a request:
```python
import requests
-url = 'http://0.0.0.0:5543/predict/from_files'
+url = 'http://0.0.0.0:5543/v2/models/image_classification/infer/from_files'
path = ['goldfish.jpeg'] # just put the name of images in here
files = [('request', open(img, 'rb')) for img in path]
resp = requests.post(url=url, files=files)
diff --git a/src/deepsparse/server/cli.py b/src/deepsparse/server/cli.py
index 3dba3659b2..c51ba2f972 100644
--- a/src/deepsparse/server/cli.py
+++ b/src/deepsparse/server/cli.py
@@ -224,7 +224,6 @@ def main(
EndpointConfig(
task=task,
name=f"{task}",
- route="/predict",
model=model_path,
batch_size=batch_size,
)
diff --git a/src/deepsparse/server/openai.md b/src/deepsparse/server/openai.md
new file mode 100644
index 0000000000..e68c9f9fa7
--- /dev/null
+++ b/src/deepsparse/server/openai.md
@@ -0,0 +1,102 @@
+## 🔌 OpenAI Integration
+
+The [OpenAI API](https://platform.openai.com/docs/api-reference/introduction) can be used to interact with `text_generation` models.
+Similar to the DeepSparse Server (see `README.md` for details), a config file can be
+created for the text generation models. A sample config is provided below:
+
+```yaml
+num_cores: 2
+num_workers: 2
+endpoints:
+ - task: text_generation
+ model: zoo:opt-1.3b-opt_pretrain-pruned50_quantW8A8
+```
+
+To start the server with the OpenAI integration, the following command can be used:
+
+`deepsparse.openai sample_config.yaml`
+
+The standard `deepsparse.server` command is also available:
+
+`deepsparse.server --config_file sample_config.yaml --integration openai`
+
+Once launched, the OpenAI endpoints will be available. The payload expected by the endpoints
+can be found under the OpenAI documentation for each endpoint. Currently, the supported endpoints
+are:
+
+```
+/v1/models
+/v1/chat/completions
+```
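+
+As a quick check that the server is up, the `/v1/models` endpoint can be queried with the `requests` library (port 5543, matching the examples below):
+
+```python
+import requests
+
+# list the models served by the running deepsparse.openai server
+response = requests.get("http://localhost:5543/v1/models")
+print(response.text)
+```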
+
+Inference requests can be sent through standard `curl` commands, the `requests` library,
+or through the OpenAI API.
+
+---
+
+### OpenAI API Requests
+
+- Starting the server with the config above, we have access to one model
+`zoo:opt-1.3b-opt_pretrain-pruned50_quantW8A8`. We can send an inference request using
+the OpenAI API, as shown in the example code below:
+
+```python
+import openai
+
+openai.api_key = "EMPTY"
+openai.api_base = "http://localhost:5543/v1"
+
+stream = False
+completion = openai.ChatCompletion.create(
+ messages="how are you?",
+ stream=stream,
+ max_tokens=30,
+ model="zoo:opt-1.3b-opt_pretrain-pruned50_quantW8A8",
+)
+
+print("Chat results:")
+if stream:
+ text = ""
+ for c in completion:
+ print(c)
+else:
+ print(completion)
+```
+
+- We can toggle the `stream` flag to enable streaming outputs as well.
+
+---
+
+### Using `curl` or `requests`
+
+- We can also run inference through `curl` commands or by using the `requests` library
+
+`curl`:
+```bash
+curl http://localhost:5543/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "zoo:opt-1.3b-opt_pretrain-pruned50_quantW8A8",
+ "messages": "your favourite book?",
+ "max_tokens": 30,
+ "n": 2,
+ "stream": true
+ }'
+```
+
+`requests`:
+
+```python
+import requests
+
+url = "http://localhost:5543/v1/chat/completions"
+
+obj = {
+ "model": "zoo:opt-1.3b-opt_pretrain-pruned50_quantW8A8",
+ "messages": "how are you?",
+ "max_tokens": 10
+}
+
+response = requests.post(url, json=obj)
+print(response.text)
+```
\ No newline at end of file
diff --git a/src/deepsparse/server/sagemaker.py b/src/deepsparse/server/sagemaker.py
index 865d1f1fb5..59b19a2d92 100644
--- a/src/deepsparse/server/sagemaker.py
+++ b/src/deepsparse/server/sagemaker.py
@@ -57,11 +57,25 @@ def _add_inference_endpoints(
if hasattr(pipeline.input_schema, "from_files"):
routes_and_fns.append(
- (route, partial(Server.predict_from_files, ProxyPipeline(pipeline)))
+ (
+ route + "/from_files",
+ partial(
+ Server.predict_from_files,
+ ProxyPipeline(pipeline),
+ self.server_config.system_logging,
+ ),
+ )
)
else:
routes_and_fns.append(
- (route, partial(Server.predict, ProxyPipeline(pipeline)))
+ (
+ route,
+ partial(
+ Server.predict,
+ ProxyPipeline(pipeline),
+ self.server_config.system_logging,
+ ),
+ )
)
self._update_routes(
diff --git a/src/deepsparse/server/server.py b/src/deepsparse/server/server.py
index c4f5ed5bf1..a983b39921 100644
--- a/src/deepsparse/server/server.py
+++ b/src/deepsparse/server/server.py
@@ -246,19 +246,27 @@ async def predict(
system_logging_config: SystemLoggingConfig,
raw_request: Request,
):
- request = proxy_pipeline.pipeline.input_schema(**await raw_request.json())
- pipeline_outputs = proxy_pipeline.pipeline(request)
+ pipeline_outputs = proxy_pipeline.pipeline(**await raw_request.json())
server_logger = proxy_pipeline.pipeline.logger
if server_logger:
log_system_information(
server_logger=server_logger, system_logging_config=system_logging_config
)
- pipeline_outputs = prep_outputs_for_serialization(pipeline_outputs)
- return pipeline_outputs
+ return prep_outputs_for_serialization(pipeline_outputs)
@staticmethod
- def predict_from_files(proxy_pipeline: ProxyPipeline, request: List[UploadFile]):
+ def predict_from_files(
+ proxy_pipeline: ProxyPipeline,
+ system_logging_config: SystemLoggingConfig,
+ request: List[UploadFile],
+ ):
request = proxy_pipeline.pipeline.input_schema.from_files(
(file.file for file in request), from_server=True
)
- return Server.predict(request)
+ pipeline_outputs = proxy_pipeline.pipeline(request)
+ server_logger = proxy_pipeline.pipeline.logger
+ if server_logger:
+ log_system_information(
+ server_logger=server_logger, system_logging_config=system_logging_config
+ )
+ return prep_outputs_for_serialization(pipeline_outputs)
diff --git a/src/deepsparse/transformers/README.md b/src/deepsparse/transformers/README.md
index b61ee7fc6c..c17fd235e9 100644
--- a/src/deepsparse/transformers/README.md
+++ b/src/deepsparse/transformers/README.md
@@ -126,7 +126,7 @@ Making a request:
```python
import requests
-url = "http://localhost:5543/predict" # Server's port default to 5543
+url = "http://localhost:5543/v2/models/question_answering/infer" # Server's port defaults to 5543
obj = {
"question": "Who is Mark?",
@@ -170,7 +170,7 @@ Making a request:
```python
import requests
-url = "http://localhost:5543/predict" # Server's port default to 5543
+url = "http://localhost:5543/v2/models/text_generation/infer" # Server's port defaults to 5543
obj = {"sequence": "Who is the president of the United States?"}
@@ -215,7 +215,7 @@ Making a request:
```python
import requests
-url = "http://localhost:5543/predict" # Server's port default to 5543
+url = "http://localhost:5543/v2/models/sentiment_analysis/infer" # Server's port defaults to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}
@@ -268,7 +268,7 @@ Making a request:
```python
import requests
-url = "http://localhost:5543/predict" # Server's port default to 5543
+url = "http://localhost:5543/v2/models/text_classification/infer" # Server's port defaults to 5543
obj = {
"sequences": [
@@ -321,7 +321,7 @@ Making a request:
```python
import requests
-url = "http://localhost:5543/predict" # Server's port default to 5543
+url = "http://localhost:5543/v2/models/token_classification/infer" # Server's port defaults to 5543
obj = {"inputs": "Drive from California to Texas!"}
diff --git a/src/deepsparse/yolact/README.md b/src/deepsparse/yolact/README.md
index 46f0acd4de..5d202334b0 100644
--- a/src/deepsparse/yolact/README.md
+++ b/src/deepsparse/yolact/README.md
@@ -130,7 +130,7 @@ Making a request:
import requests
import json
-url = 'http://0.0.0.0:5543/predict/from_files'
+url = 'http://0.0.0.0:5543/v2/models/yolact/infer/from_files'
path = ['thailand.jpg'] # list of images for inference
files = [('request', open(img, 'rb')) for img in path]
resp = requests.post(url=url, files=files)
diff --git a/src/deepsparse/yolo/README.md b/src/deepsparse/yolo/README.md
index b14982f19b..0802c2589a 100644
--- a/src/deepsparse/yolo/README.md
+++ b/src/deepsparse/yolo/README.md
@@ -129,7 +129,7 @@ Making a request:
import requests
import json
-url = 'http://0.0.0.0:5543/predict/from_files'
+url = 'http://0.0.0.0:5543/v2/models/yolo/infer/from_files'
path = ['basilica.jpg'] # list of images for inference
files = [('request', open(img, 'rb')) for img in path]
resp = requests.post(url=url, files=files)
diff --git a/tests/server/test_endpoints.py b/tests/server/test_endpoints.py
index a20a11c7fc..86e7269e10 100644
--- a/tests/server/test_endpoints.py
+++ b/tests/server/test_endpoints.py
@@ -38,8 +38,8 @@ class StrSchema(BaseModel):
value: str
-def parse(v: StrSchema) -> int:
- return int(v.value)
+def parse(value) -> int:
+ return int(value)
class TestStatusEndpoints:
@@ -106,7 +106,7 @@ def test_add_model_endpoint(
):
mock_pipeline = Mock(
side_effect=parse,
- input_schema=StrSchema,
+ input_schema=str,
output_schema=int,
logger=MultiLogger([]),
)
@@ -146,6 +146,7 @@ def test_add_model_endpoint_with_from_files(self, server, app):
assert app.routes[-1].path == "/v2/models/predict/parse_int/infer/from_files"
assert app.routes[-1].endpoint.func.__annotations__ == {
"proxy_pipeline": ProxyPipeline,
+ "system_logging_config": SystemLoggingConfig,
"request": List[UploadFile],
}
assert app.routes[-1].response_model is int
@@ -159,9 +160,12 @@ def test_sagemaker_only_adds_one_endpoint(self, sagemaker_server, app):
pipeline=Mock(input_schema=FromFilesSchema, output_schema=int),
)
assert len(app.routes) == num_routes + 1
- assert app.routes[-1].path == "/invocations/predict/parse_int/infer"
+ num_routes = len(app.routes)
+
+ assert app.routes[-1].path == "/invocations/predict/parse_int/infer/from_files"
assert app.routes[-1].endpoint.func.__annotations__ == {
"proxy_pipeline": ProxyPipeline,
+ "system_logging_config": SystemLoggingConfig,
"request": List[UploadFile],
}
@@ -174,7 +178,8 @@ def test_sagemaker_only_adds_one_endpoint(self, sagemaker_server, app):
assert app.routes[-1].path == "/invocations/predict/parse_int/infer"
assert app.routes[-1].endpoint.func.__annotations__ == {
"proxy_pipeline": ProxyPipeline,
- "request": List[UploadFile],
+ "system_logging_config": SystemLoggingConfig,
+ "raw_request": Request,
}
def test_add_endpoint_with_no_route_specified(self, server, app):