
Commit

Merge branch 'main' into sentence-transformer-support
mgoin authored Oct 13, 2023
2 parents 7a4cc1d + b476ac8 commit ab2f0b8
Showing 16 changed files with 430 additions and 239 deletions.
DEVELOPING.md: 2 changes (1 addition, 1 deletion)
@@ -16,7 +16,7 @@ limitations under the License.

# Developing DeepSparse

-The DeepSparse Python API is developed and tested using Python 3.8-3.10.
+The DeepSparse Python API is developed and tested using Python 3.8-3.11.
To develop the Python API, you will also need the development dependencies and to follow the styling guidelines.

Here are some details to get started.
README.md: 276 changes (124 additions, 152 deletions)

Large diffs are not rendered by default.

docs/llms/integration-langchain.md: 76 changes (76 additions, 0 deletions)
@@ -0,0 +1,76 @@
<!--
Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# **DeepSparse LangChain Integration**

[DeepSparse](https://github.com/neuralmagic/deepsparse) has an official integration within [LangChain](https://python.langchain.com/docs/integrations/llms/deepsparse).
This page covers installation and examples of using DeepSparse within LangChain.

## Installation and Setup

- Install the Python packages with `pip install deepsparse-nightly langchain`
- Choose a [SparseZoo model](https://sparsezoo.neuralmagic.com/?useCase=text_generation) or export a supported model to ONNX [using Optimum](https://github.com/neuralmagic/notebooks/blob/main/notebooks/opt-text-generation-deepsparse-quickstart/OPT_Text_Generation_DeepSparse_Quickstart.ipynb)
- Models hosted on HuggingFace are also supported by prepending `"hf:"` to the model id, such as [`"hf:mgoin/TinyStories-33M-quant-deepsparse"`](https://huggingface.co/mgoin/TinyStories-33M-quant-deepsparse)
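
For a quick check that the setup works, here is a minimal sketch using the Hugging Face-hosted TinyStories model above (the prompt is illustrative; any supported SparseZoo stub can be substituted):

```python
from langchain.llms import DeepSparse

# load a Hugging Face-hosted model by prepending "hf:" to its model id
llm = DeepSparse(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
print(llm("Once upon a time"))
```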

## Wrappers

There exists a DeepSparse LLM wrapper, which you can access with:

```python
from langchain.llms import DeepSparse
```

It provides a simple, unified interface for all models:

```python
from langchain.llms import DeepSparse
llm = DeepSparse(model='zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggingface/bigpython_bigquery_thepile/base-none')
print(llm('def fib():'))
```

It also supports per-token output streaming:

```python
from langchain.llms import DeepSparse
llm = DeepSparse(
    model="zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggingface/bigpython_bigquery_thepile/base_quant-none",
    streaming=True
)
for chunk in llm.stream("Tell me a joke", stop=["'","\n"]):
    print(chunk, end='', flush=True)
```

## Configuration

The wrapper exposes arguments to control which model is loaded, how the model is loaded, how tokens are generated, and whether to return all tokens at once or stream them one by one.

```python
model: str
"""The path to a model file or directory or the name of a SparseZoo model stub."""

model_config: Optional[Dict[str, Any]] = None
"""Keyword arguments passed to the pipeline construction.
Common parameters are sequence_length, prompt_sequence_length"""

generation_config: Union[None, str, Dict] = None
"""GenerationConfig dictionary consisting of parameters used to control
sequences generated for each prompt. Common parameters are:
max_length, max_new_tokens, num_return_sequences, output_scores,
top_p, top_k, repetition_penalty."""

streaming: bool = False
"""Whether to stream the results, token by token."""
```
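
As a sketch of how these fields fit together (the parameter values below are illustrative, not recommended defaults), `model_config` is forwarded to pipeline construction while `generation_config` controls how tokens are generated for each prompt:

```python
from langchain.llms import DeepSparse

# sketch: pass pipeline-construction and generation settings alongside the model
llm = DeepSparse(
    model="hf:mgoin/TinyStories-33M-quant-deepsparse",
    model_config={"sequence_length": 512},
    generation_config={"max_new_tokens": 32, "top_k": 20},
)
print(llm("Once upon a time"))
```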
docs/llms/text-generation-pipeline.md: 60 changes (35 additions, 25 deletions)
@@ -14,23 +14,23 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

-# **Text Generation Pipelines**
+# **Text Generation Pipeline**

-This user guide describes how to run inference of text generation models with DeepSparse.
+This user guide explains how to run inference of text generation models with DeepSparse.

## **Installation**

-DeepSparse support for LLMs is currently available on DeepSparse's nightly build on PyPi:
+DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPi:

```bash
-pip install -U deepsparse-nightly==1.6.0.20231007[transformers]
+pip install -U deepsparse-nightly[llm]
```

#### **System Requirements**

- Hardware: x86 AVX2, AVX512, AVX512-VNNI and ARM v8.2+.
- Operating System: Linux (MacOS will be supported soon)
-- Python: v3.8-3.10
+- Python: v3.8-3.11

For those using MacOS or Windows, we suggest using Linux containers with Docker to run DeepSparse.

@@ -41,8 +41,8 @@ DeepSparse exposes a Pipeline interface called `TextGeneration`, which is used t
from deepsparse import TextGeneration

# construct a pipeline
MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
pipeline = TextGeneration(model_path=MODEL_PATH)
model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
pipeline = TextGeneration(model=model_path)

# generate text
prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
@@ -52,27 +52,29 @@ print(output.generations[0].text)
# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
```

-> **Note:** The 7B model takes about 2 minutes to compile. Set `MODEL_PATH` to `hf:mgoin/TinyStories-33M-quant-deepsparse` to use a small TinyStories model for quick compilation if you are just experimenting.
+> **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"` to use a small TinyStories model for quick compilation if you are just experimenting.
## **Model Format**

DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.

-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs from SparseZoo.***
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***For now, we suggest only using LLM ONNX graphs created by Neural Magic.***
>
### **SparseZoo Stubs**

-SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none` identifes a 50% pruned-quantized MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
+SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.

```python
model_path = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
pipeline = TextGeneration(model_path=model_path)
model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
pipeline = TextGeneration(model=model_path)
```

### **Local Deployment Directory**

Additionally, we can pass a local path to a deployment directory. Use the SparseZoo API to download an example deployment directory:
```python
-import sparsezoo
-sz_model = sparsezoo.Model("zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none", "./local-model")
+from sparsezoo import Model
+sz_model = Model("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized", "./local-model")
sz_model.deployment.download()
```

Expand All @@ -84,8 +86,16 @@ ls ./local-model/deployment

We can pass the local directory path to `TextGeneration`:
```python
model_path = "./local-model/deployment"
pipeline = TextGeneration(model_path=model_path)
from deepsparse import TextGeneration
pipeline = TextGeneration(model="./local-model/deployment")
```

### **Hugging Face Models**
Hugging Face models that conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to the model id. The following example runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).

```python
from deepsparse import TextGeneration
pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant")
```

## **Input and Output Formats**
@@ -96,8 +106,7 @@ The following examples use a quantized 33M parameter TinyStories model for quick
```python
from deepsparse import TextGeneration

MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model_path=MODEL_PATH)
pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
```

### Input Format
@@ -112,13 +121,14 @@ for prompt_i, generation_i in zip(output.prompts, output.generations):
print(f"{prompt_i}{generation_i.text}")

# >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old

# >> Mario ran into the castle and started to explore. He ran around the castle and climbed on the throne. He even tried to climb
```

- `streaming`: Boolean determining whether to stream the response. If True, the results are returned as a generator object that yields results as they are generated.

```python
prompt = "Princess peach jumped from the balcony"
prompt = "Princess Peach jumped from the balcony"
output_iterator = pipeline(prompt=prompt, streaming=True, max_new_tokens=20)

print(prompt, end="")
@@ -172,8 +182,8 @@ The following examples use a quantized 33M parameter TinyStories model for quick
```python
from deepsparse import TextGeneration

MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model_path=MODEL_PATH)
model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model=model_id)
```

### **Creating A `GenerationConfig`**
@@ -213,7 +223,7 @@ We can pass a `GenerationConfig` to `TextGeneration.__init__` or `TextGeneration

```python
# set generation_config during __init__
-pipeline_w_gen_config = TextGeneration(model_path=MODEL_PATH, generation_config={"max_new_tokens": 10})
+pipeline_w_gen_config = TextGeneration(model=model_id, generation_config={"max_new_tokens": 10})

# generation_config is the default during __call__
output = pipeline_w_gen_config(prompt=prompt)
Expand All @@ -225,7 +235,7 @@ print(f"{prompt}{output.generations[0].text}")

```python
# no generation_config set during __init__
-pipeline_w_no_gen_config = TextGeneration(model_path=MODEL_PATH)
+pipeline_w_no_gen_config = TextGeneration(model=model_id)

# generation_config is passed during __call__
output = pipeline_w_no_gen_config(prompt=prompt, generation_config={"max_new_tokens": 10})
@@ -295,7 +305,7 @@ import numpy
# only 20 logits are not set to -inf == only 20 logits used to sample token
output = pipeline(prompt=prompt, do_sample=True, top_k=20, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))
-# >> array([20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20])
+# >> [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
```

- `top_p`: Float to define the tokens that are considered with nucleus sampling. If `0.0`, `top_p` is turned off. Default is `0.0`
@@ -306,7 +316,7 @@ import numpy
output = pipeline(prompt=prompt, do_sample=True, top_p=0.9, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))

-# >> array([20, 15, 10, 5, 25, 3, 10, 7, 6, 6, 15, 12, 11, 3, 4, 4])
+# >> [ 5 119 18 14 204 6 7 367 191 20 12 7 46 6 2 35]
```
- `repetition_penalty`: The more a token is used within a generation, the more it is penalized so that it is not picked in successive generation passes. If `0.0`, `repetition_penalty` is turned off. Default is `0.0`.
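
Mirroring the `top_k` and `top_p` examples above, a sketch of passing `repetition_penalty` at call time might look like this (the penalty value is illustrative):

```python
# penalize tokens that were already generated (value is illustrative)
output = pipeline(prompt=prompt, do_sample=True, repetition_penalty=1.3, max_new_tokens=15)
print(output.generations[0].text)
```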

docs/user-guide/installation.md: 2 changes (1 addition, 1 deletion)
@@ -16,7 +16,7 @@ limitations under the License.

# DeepSparse Installation

-DeepSparse is tested on Python 3.8-3.10, ONNX 1.5.0-1.10.1, ONNX opset version 11+ and is [manylinux compliant](https://peps.python.org/pep-0513/).
+DeepSparse is tested on Python 3.8-3.11, ONNX 1.5.0-1.15.0, ONNX opset version 11+ and is [manylinux compliant](https://peps.python.org/pep-0513/).

It currently supports Intel and AMD AVX2, AVX-512, and VNNI x86 instruction sets.
