Commit b5be53b (1 parent: 218f083), showing 30 changed files with 1,676 additions and 28 deletions.
# GPU Compute Providers

## vast.ai

For short experiments, vast.ai is a cost-effective option because it is billed per minute rather than per hour.
Please note that you also pay for setup time.

Vast.ai has the limitation that you cannot run Docker containers inside rented machines.
Vast.ai does allow you to run your own Docker image if it is uploaded to a public Docker registry (or if registry credentials are provided), but this avenue was not explored.

## paperspace

Paperspace is billed per hour and tends to cost more than vast.ai.
Its performance seemed to vary a lot, but the root cause was never identified.
# Improving Consistency of Inference

This document describes the steps needed to improve the consistency of inference results, from the MUST-HAVE requirements to potential improvements.

If you hit a case where a recommendation, especially a "thought to be safe" configuration, does not produce consistent results, please file an issue report along with the steps to reproduce it.

## Recommendations

* [ ] Have test cases that verify the consistency of inference results.
  * Recommendation:
    * Before every release, check that the known input-output pairs are still the same as in the previous release. It is very important for the known output to be as long as possible: operation errors are cumulative, so a longer output is much more likely to showcase inconsistencies.
* [ ] All software dependencies need to be locked to a specific version.
  * Recommendation:
    * Python dependencies: These are the ones updated most often. Use [Rye](https://rye.astral.sh/) or [pip-tools](https://github.com/jazzband/pip-tools).
    * Binary dependencies: Use Docker images and tags to lock the version of the final image. Treat each tag as a reseed of the inference process.
    * Use the same driver version. (Note that so far we have yet to document a case where the driver version affected inference results.)
* [ ] Set the seed for all random number generators used in the inference process (see the sketch after this list).
  * Recommendation:
    * For single-threaded apps, use `deterministic_ml.v1.set_seed()` to set the seed for all known random number generators process-wide.
    * Whenever initializing a new random generator, explicitly set its seed in a deterministic manner.
    * For multi-threaded or async applications, ensure that random generators are isolated per thread or task.
* [ ] Disable auto-optimization and JIT compilation in the inference process.
  * Recommendation:
    * Use `deterministic_ml.v1.disable_auto_optimization()` to disable auto-optimization and JIT compilation process-wide.
* [ ] Use the same kind of hardware for all inference runs.
  * Recommendation:
    * Use the same GPU chip model and vRAM size for all inference runs. The hardware interface (PCIe, SXM, etc.) does not seem to affect the results, but 2xA100 40G does not return the same results as 1xA100 80G.
    * When testing new but similar hardware to check whether the results are consistent with a previously known platform, maximize pseudo-randomization of the inference process (e.g. by setting a high temperature and a low top-p value).
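
A minimal sketch of how the two `deterministic_ml.v1` helpers mentioned above might be called at process start-up. Only the function names come from this document; the import form and the seed value are assumptions:

```python
import deterministic_ml.v1

deterministic_ml.v1.set_seed(42)                 # seed all known random number generators process-wide
deterministic_ml.v1.disable_auto_optimization()  # disable auto-optimization / JIT compilation
```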

## Framework Specific Recommendations

### PyTorch

See https://pytorch.org/docs/stable/notes/randomness.html .
The cuBLAS reproducibility notes at https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility also apply.
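
A hedged sketch of the settings those two pages describe; the seed value is arbitrary, and `CUBLAS_WORKSPACE_CONFIG` must be set before any CUDA work happens:

```python
import os

# cuBLAS needs a fixed workspace size for reproducible results (see the NVIDIA docs above);
# set this before CUDA is initialized.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.manual_seed(42)                      # seed CPU and CUDA RNGs
torch.use_deterministic_algorithms(True)   # error out on known non-deterministic ops
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False     # disable autotuning that may pick different kernels per run
```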

### vLLM

vLLM is PyTorch based, so the same constraints apply.
However, vLLM has a much narrower scope, so other than the general recommendations from the [Recommendations](#recommendations) section, only the following is required:
* make sure to use exactly the same parameters for model initialization
  * `enforce_eager=True`
* to get the same output for the same input, use exactly the same `SamplingParams` with an explicitly set `seed` parameter

```python
import vllm

# model_name (a Hugging Face model id) and requests (a list of prompt strings)
# are assumed to be defined elsewhere.

model = vllm.LLM(
    model=model_name,
    enforce_eager=True,  # Ensure eager mode is enabled
)

sampling_params = vllm.SamplingParams(
    max_tokens=4096,
    # temperature=1000,  # High value encourages pseudo-randomization
    # top_p=0.1,  # Low value encourages pseudo-randomization
    seed=42,  # Explicit seed makes sampling reproducible for the same input
)

response = model.generate(requests, sampling_params)
```

## Unconfirmed advice

### Consistent results across different CUDA hardware

It should be theoretically possible to get consistent results across different hardware, but even when limited to CUDA-compatible GPUs, it will come at the cost of performance.

* [ ] Use `torch.backends.cudnn.deterministic = True` and `torch.backends.cudnn.benchmark = False` to ensure that the results are consistent across different CUDA hardware.
  * Recommendation:
    * Use `deterministic_ml.v1.set_seed()` to set the seed for all known random number generators process-wide.
    * Use `deterministic_ml.v1.disable_auto_optimization()` to disable auto-optimization and JIT compilation process-wide.
    * Use `torch.backends.cudnn.deterministic = True` and `torch.backends.cudnn.benchmark = False` to ensure that the results are consistent across different CUDA hardware.
    * Use the same GPU chip model and vRAM size for all inference runs. The hardware interface (PCIe, SXM, etc.) does not seem to affect the results, but 2xA100 40G does not return the same results as 1xA100 80G.

See [TO_BE_INVESTIGATED.md](TO_BE_INVESTIGATED.md) for more potential improvements.
# Manual integration testing

This document describes how we test whether we are able to achieve deterministic results for particular experiment scenarios on the tested hardware and software configurations.

1) For each hardware configuration:
   1) Create the target machine \[manual\]
   2) Run the target experiment scenario on the target machine.
      `./run.py vllm_llama_3_70b_instruct_awq -n machine_name username@host -p ssh_port`
   3) Destroy the target machine \[manual\]
2) Analyze the results \[manual\]

## Initial local setup

```bash
pdm install -G test
```

## Choosing experiment scenario

Either use an already defined scenario in [tests/integration/experiments](../tests/integration/experiments), e.g.
`vllm_llama_3_70b_instruct_awq`, or create a new one.

Each scenario may define initial setup steps for the target machine environment, including:
* `setup.sh` - Shell script executed on the target machine, used, for example, to install binary dependencies.
* `requirements.txt` - Python packages installed on the target machine in a dedicated Python virtual environment.

And it must include:
* `__main__.py` - Main experiment script, executed on the target machine; it takes as its first argument the output directory to which the `output.yaml` file should be saved (a minimal sketch follows this list).
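
A minimal sketch of what such a `__main__.py` might look like, assuming PyYAML is available on the target machine (e.g. via `requirements.txt`); the experiment body and the YAML layout are hypothetical, only the output-directory argument and the `output.yaml` file name come from this document:

```python
import sys
from pathlib import Path

import yaml


def main() -> None:
    # The first command-line argument is the directory where output.yaml should be saved.
    output_dir = Path(sys.argv[1])
    output_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical experiment body: replace with the actual inference run.
    results = {"prompt": "What is the capital of France?", "output": "..."}

    with open(output_dir / "output.yaml", "w") as f:
        yaml.safe_dump(results, f)


if __name__ == "__main__":
    main()
```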

## Running the experiment scenario

### Run experiment against N target machines

Repeat the following for a number of target machines.
You may even want to mix configurations, e.g. different GPU models, different CUDA versions, etc., to get a better understanding of what influences the determinism of the output.

### Create target machine

You can use cheap GPU machine services such as those provided by [vast.ai](https://vast.ai/), [paperspace](https://www.paperspace.com/), etc.
Please note that for one-off experiments, services like vast.ai are more cost-effective since they are billed per minute rather than per hour.
See the [Compute Providers document](COMPUTE_PROVIDERS.md) for more information.

Example machine configuration: vast.ai, on-demand, 1x NVIDIA A100 80GB, 100GB disk with an Ubuntu-based template + CUDA drivers installed.

### Run the experiment scenario

In the [tests/integration/experiments](../tests/integration/experiments) directory, run

```bash
./run.py vllm_llama_3_70b_instruct_awq -c target_comment username@host -p ssh_port
```

### Destroy the target machine

Destroy the target machine to avoid unnecessary costs.

## Analyzing the results

Results are stored in the [`results` directory](../tests/integration/results).
They are grouped by experiment scenario, then by target machine name and timestamp.

Each result contains:
* `experiment.log` - Experiment log output.
* `output.yaml` - Experiment output in YAML format. This is the most important file to analyze (a comparison sketch follows this list).
* `sysinfo.yaml` - System information of the target machine, used to cluster results by hardware configuration.
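
Deterministic runs should produce byte-for-byte identical `output.yaml` files. As a minimal sketch of one way to check this by hashing the files (the results path is hypothetical; the provided `./analyze.py` below does the real analysis):

```python
import hashlib
from pathlib import Path

# Hypothetical results root for one scenario; adjust to the actual checkout location.
results_root = Path("../tests/integration/results/vllm_llama_3_70b_instruct_awq")

for output_file in sorted(results_root.glob("**/output.yaml")):
    digest = hashlib.sha256(output_file.read_bytes()).hexdigest()
    print(f"{digest[:16]}  {output_file.parent.relative_to(results_root)}")

# Runs with matching digests produced identical output; differing digests
# indicate an inconsistency worth investigating.
```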

You can also use the provided `./analyze.py` script to analyze the results.

```bash
./analyze.py vllm_llama_3_70b_instruct_awq
```
# To be investigated

## Disabling "fast math" optimizations

To check:
* since this is a compiler flag, which Python-related software is affected by it?