first working version
mjurbanski-reef committed Aug 22, 2024
1 parent 218f083 commit 69e9ae9
Showing 30 changed files with 1,673 additions and 28 deletions.
35 changes: 32 additions & 3 deletions README.md
@@ -1,12 +1,41 @@
# deterministic-ml
# Deterministic ML Models execution using Python frameworks
 [![Continuous Integration](https://github.com/backend-developers-ltd/deterministic-ml/workflows/Continuous%20Integration/badge.svg)](https://github.com/backend-developers-ltd/deterministic-ml/actions?query=workflow%3A%22Continuous+Integration%22) [![License](https://img.shields.io/pypi/l/deterministic_ml.svg?label=License)](https://pypi.python.org/pypi/deterministic_ml) [![python versions](https://img.shields.io/pypi/pyversions/deterministic_ml.svg?label=python%20versions)](https://pypi.python.org/pypi/deterministic_ml) [![PyPI version](https://img.shields.io/pypi/v/deterministic_ml.svg?label=PyPI%20version)](https://pypi.python.org/pypi/deterministic_ml)

This project is two-part:
* documentation that describes how to ensure deterministic execution of ML models across different frameworks
* a Python package that provides utilities that help to ensure deterministic execution of ML models across different frameworks and versions

Currently supported frameworks and inference engines: CUDA-based, PyTorch, vLLM.

The goal is to be able to reproduce exactly the same results on another machine using the same software.
This means finding a balance between performance and hardware restrictions without compromising reproducibility.
For example, if limiting execution to a single GPU model and vRAM size is required to achieve reproducibility, that is an acceptable solution, especially if the alternative would require "dumbing down" other cards just to achieve the same results.

## Experiment results so far

Through [Integration testing](docs/MANUAL_INTEGRATION_TESTING.md) we can see that model output can be reproduced deterministically.

Here is a summary of the results for vLLM running a Llama 3 model:
* each GPU model (combined with its vRAM configuration) produces a different output, but that output is consistent across runs
* the output is consistent across different CUDA versions (more testing is needed here, as only a small range was tested)
* GPU interface (SXM4, PCIe) does not affect the output
* A100 80GB and A100X 80GB produce the same output
* 2x A100 40GB do not produce the same output as 1x A100 80GB
* driver versions 535.129.03 and 555.58.02 produce the same output

To learn more about this particular example, please refer to the [Integration testing](docs/MANUAL_INTEGRATION_TESTING.md) documentation and the [tests/integration/experiments/vllm_llama_3_70b_instruct_awq](tests/integration/experiments/vllm_llama_3_70b_instruct_awq) experiment code.

## Usage

> [!IMPORTANT]
> This package uses [ApiVer](#versioning); make sure to import `deterministic_ml.v1`.

```
pip install deterministic_ml[vllm] # pick the right extra for your use case, e.g. [vllm] or [torch]
```


## Versioning

This package uses [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
@@ -38,6 +67,6 @@ If you wish to install dependencies into `.venv` so your IDE can pick them up, y
pdm install --dev
```

### Release process
## Contributing

Run `nox -s make_release -- X.Y.Z` where `X.Y.Z` is the version you're releasing and follow the printed instructions.
Contributions are welcome, especially additions to [docs/IMPROVING_CONSITENCY.md](docs/IMPROVING_CONSITENCY.md) that expand the list of recommendations for improving the consistency of inference results when using various Python frameworks.
15 changes: 15 additions & 0 deletions docs/COMPUTE_PROVIDERS.MD
@@ -0,0 +1,15 @@
# GPU Compute Providers

## vast.ai

For short experiments, vast.ai is a cost-effective solution as it is billed per minute and not per hour.
Please note you also pay for setup time.

Vast.ai comes with the limitation of not being able to run Docker containers inside rented machines.
Vast.ai allows you to run your own Docker image if it is uploaded to a public Docker registry (or if the registry credentials are provided), but this avenue was not explored.


## paperspace

Paperspace is billed per hour, and tends to have a higher cost than vast.ai.
The performance seemed to vary a lot, but the root cause was never identified.
80 changes: 80 additions & 0 deletions docs/IMPROVING_CONSITENCY.md
@@ -0,0 +1,80 @@
# Improving Consistency of Inference

This document describes the steps to improve the consistency of inference results, from the MUST-HAVE requirements to potential improvements.

If you find a case where a recommendation, especially a "thought to be safe" configuration, does not provide consistent results, please file an issue report along with the steps to reproduce it.


## Recommendations

* [ ] Have test cases that verify the consistency of inference results.
  * Recommendation:
    * Before every release, check that the known input-output pairs are still the same as in the previous release. It is very important for the known output to be as long as possible: operation errors are cumulative, so a long output is much more likely to expose inconsistencies.
* [ ] Lock all software dependencies to a specific version.
  * Recommendation:
    * Python dependencies: these are the ones that will be updated most often. Use [Rye](https://rye.astral.sh/) or [pip-tools](https://github.com/jazzband/pip-tools).
    * Binary dependencies: use Docker images and tags to lock the version of the final image. Treat each tag as a reseed of the inference process.
    * Use the same driver version. (Please note that so far we have yet to document a case where the driver version affected inference results.)
* [ ] Set the seed for all random number generators used in the inference process.
  * Recommendation:
    * For single-threaded apps, use `deterministic_ml.v1.set_seed()` to set the seed for all known random number generators process-wide.
    * Whenever initializing a new random generator, explicitly set the seed in a deterministic manner.
    * For multi-threaded or async applications, ensure that random generators are isolated per thread or task.
* [ ] Disable auto-optimization or JIT compilation in the inference process.
  * Recommendation:
    * Use `deterministic_ml.v1.disable_auto_optimization()` to disable auto-optimization or JIT compilation process-wide.
* [ ] Use the same kind of hardware for all inference runs.
  * Recommendation:
    * Use the same GPU chip model and vRAM size for all inference runs. The hardware interface (PCIe, SXM, etc.) does not seem to affect the results, but 2xA100 40G do not return the same results as 1xA100 80G.
    * When testing new but similar hardware to check whether the results are consistent with a previously known platform, maximize pseudo-randomization of the inference process (e.g. by setting high temperature and low top-p values).
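
The per-thread isolation recommendation can be sketched with the standard library alone. This is only an illustration, not part of `deterministic_ml`: the helper names `task_rng`, `sample` and `run_all` are hypothetical, and the sketch assumes each task can be given a stable integer ID.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def task_rng(base_seed: int, task_id: int) -> random.Random:
    """Return a private generator whose state depends only on (base_seed, task_id)."""
    # Seeding from a string is stable across runs, unlike the built-in hash().
    return random.Random(f"{base_seed}:{task_id}")


def sample(task_id: int, base_seed: int = 42) -> list:
    """Draw a few numbers from this task's own generator."""
    rng = task_rng(base_seed, task_id)
    return [rng.randrange(1000) for _ in range(5)]


def run_all(n_tasks: int = 4) -> list:
    """Run the tasks on a thread pool; results do not depend on scheduling."""
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        return list(pool.map(sample, range(n_tasks)))
```

Because no task touches the shared module-level generator, the order in which threads are scheduled cannot change what each task draws.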

## Framework Specific Recommendations

### PyTorch

See https://pytorch.org/docs/stable/notes/randomness.html .
The cuBLAS reproducibility notes at https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility also apply.
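
As a rough sketch of what those two documents ask for, the settings could be applied in one place as shown here. This is not the implementation of `deterministic_ml.v1.set_seed()`; the function name `seed_everything` is hypothetical, and NumPy and PyTorch are treated as optional so the snippet also runs where they are not installed.

```python
import importlib.util
import os
import random


def seed_everything(seed: int) -> list:
    """Seed every RNG that is available; return the names of the seeded libraries."""
    seeded = ["random"]
    random.seed(seed)

    # cuBLAS needs a fixed workspace configuration for reproducible results;
    # it has to be in the environment before the first cuBLAS call.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    if importlib.util.find_spec("numpy") is not None:
        import numpy as np

        np.random.seed(seed)
        seeded.append("numpy")

    if importlib.util.find_spec("torch") is not None:
        import torch

        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        # Raise an error instead of silently using a nondeterministic kernel.
        torch.use_deterministic_algorithms(True)
        seeded.append("torch")

    return seeded
```

Call it once at process start, before any model is loaded, so that no kernel is selected before the flags are in place.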

### vLLM

vLLM is PyTorch-based, so the same constraints apply.
However, vLLM has a much narrower scope, so beyond the general recommendations from the [Recommendations](#recommendations) section, only the following is required:
* make sure to use exactly the same parameters for model initialization
  * `enforce_eager=True`
* to get the same output for the same input, use exactly the same `SamplingParams` with an explicitly set `seed` parameter


```python
import vllm

# `model_name` and `requests` are defined by the caller.
model = vllm.LLM(
    model=model_name,
    enforce_eager=True,  # Ensure eager mode is enabled
)

sampling_params = vllm.SamplingParams(
    max_tokens=4096,
    # temperature=1000,  # High value encourages pseudo-randomization
    # top_p=0.1,  # Low value encourages pseudo-randomization
    seed=42,
)

response = model.generate(requests, sampling_params)
```
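
To check that two runs really produced the same output, comparing token IDs position by position is usually more informative than comparing whole strings. The helper below is a hypothetical sketch (the name `first_divergence` is not part of vLLM or this package); it works on any two sequences of integers, for example the `token_ids` of two completions.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(expected: Sequence[int], actual: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the runs match."""
    # zip_longest pads the shorter sequence with None, so a length mismatch
    # is reported at the position where the shorter run ended.
    for index, (left, right) in enumerate(zip_longest(expected, actual)):
        if left != right:
            return index
    return None
```

A late divergence index suggests accumulated numerical error, which is why long known outputs make better test cases than short ones.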


## Unconfirmed advice

### Consistent results across different CUDA hardware

It should be theoretically possible to get consistent results across different hardware, but even if limited to CUDA-capable GPUs, it will come at the cost of performance.

* [ ] Use `torch.backends.cudnn.deterministic = True` and `torch.backends.cudnn.benchmark = False` to ensure that the results are consistent across different CUDA hardware.
  * Recommendation:
    * Use `deterministic_ml.v1.set_seed()` to set the seed for all known random number generators process-wide.
    * Use `deterministic_ml.v1.disable_auto_optimization()` to disable auto-optimization or JIT compilation process-wide.
    * Use `torch.backends.cudnn.deterministic = True` and `torch.backends.cudnn.benchmark = False` to ensure that the results are consistent across different CUDA hardware.
    * Use the same GPU chip model and vRAM size for all inference runs. The hardware interface (PCIe, SXM, etc.) does not seem to affect the results, but 2xA100 40G do not return the same results as 1xA100 80G.


See [TO_BE_INVESTIGATED.md](TO_BE_INVESTIGATED.md) for more potential improvements.
73 changes: 73 additions & 0 deletions docs/MANUAL_INTEGRATION_TESTING.md
@@ -0,0 +1,73 @@
# Manual integration testing

This document describes how we test whether we are able to achieve deterministic results for particular experiment scenarios on the tested hardware and software configurations.

1) For each hardware configuration:
   1) Create target machine \[manual\]
   2) Run the target experiment scenario on the target machine.
      `./run.py vllm_llama_3_70b_instruct_awq -n machine_name username@host -p ssh_port`
   3) Destroy the target machine \[manual\]
2) Analyze the results \[manual\]

## Initial local setup

```bash
pdm install -G test
```

## Choosing experiment scenario

Either use an already defined scenario in [tests/integration/experiments](../tests/integration/experiments), e.g.
`vllm_llama_3_70b_instruct_awq`, or create a new one.

Each scenario may define initial setup steps for the target machine environment, including:
* `setup.sh` - Shell script executed on the target machine, for example to install binary dependencies.
* `requirements.txt` - Python packages installed on the target machine in a dedicated Python virtual environment.

And must include:
* `__main__.py` - Main experiment script, executed on the target machine. It takes as its first argument the output directory to which the `output.yaml` file should be saved.
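
A minimal sketch of such an entry point is shown here, assuming the only contract is the one described above (first argument is the output directory, result goes to `output.yaml`). The `run_experiment` placeholder and its result shape are hypothetical, and the file is written as JSON, which is also valid YAML, to keep the sketch free of third-party dependencies.

```python
import json
from pathlib import Path


def run_experiment() -> dict:
    """Placeholder for the real workload; maps a prompt ID to generated text."""
    return {"prompt-1": "generated text would go here"}


def main(argv: list) -> int:
    """Write the experiment results to <output_dir>/output.yaml."""
    output_dir = Path(argv[1])  # first argument: directory for output.yaml
    output_dir.mkdir(parents=True, exist_ok=True)
    results = run_experiment()
    # sort_keys keeps the file byte-stable so that runs can be compared directly.
    text = json.dumps(results, indent=2, sort_keys=True) + "\n"
    (output_dir / "output.yaml").write_text(text)
    return 0
```

A real `__main__.py` would finish by calling `main(sys.argv)`; the existing `vllm_llama_3_70b_instruct_awq` scenario remains the authoritative example.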

## Running the experiment scenario


### Run experiment against N target machines

Repeat the following for a number of target machines.
You may even want to mix configurations, e.g. different GPU models, different CUDA versions, etc., to get a better understanding of what influences the determinism of the output.

### Create target machine

You can use cheap GPU machine services such as [vast.ai](https://vast.ai/), [paperspace](https://www.paperspace.com/), etc.
Please note that for a one-off experiment, services like vast.ai are more cost-effective since they are billed per minute and not per hour.
See [Compute Providers document](COMPUTE_PROVIDERS.md) for more information.

Example machine configuration: vast.ai, on-demand, 1x NVIDIA A100 80GB, 100GB disk, with an Ubuntu-based template and CUDA drivers installed.

### Run the experiment scenario

In the [tests/integration/experiments](../tests/integration/experiments) directory, run:

```bash
./run.py vllm_llama_3_70b_instruct_awq -c target_comment username@host -p ssh_port
```

### Destroy the target machine

Destroy the target machine to avoid unnecessary costs.


## Analyzing the results

Results are stored in the [`results` directory](../tests/integration/results).
They are grouped by experiment scenario, then target machine name and timestamp.

Each result contains:
* `experiment.log` - Experiment log output.
* `output.yaml` - Experiment output in YAML format. This is the most important file to analyze.
* `sysinfo.yaml` - System information of the target machine, used to cluster results by hardware configuration.

You can use the `./analyze.py` script to analyze the results.

```bash
./analyze.py vllm_llama_3_70b_instruct_awq
```
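
For a quick manual cross-check, runs can also be grouped by the content of their `output.yaml`. The sketch below is not how `analyze.py` works; the function name `group_by_output` is hypothetical, and it assumes that two runs agree only when their `output.yaml` files are byte-identical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_output(results_dir: Path) -> dict:
    """Map the SHA-256 digest of each output.yaml to the runs that produced it."""
    groups = defaultdict(list)
    for output in sorted(results_dir.rglob("output.yaml")):
        digest = hashlib.sha256(output.read_bytes()).hexdigest()
        # Record the run by its path relative to the results directory.
        groups[digest].append(str(output.parent.relative_to(results_dir)))
    return dict(groups)
```

Runs that land in the same group produced identical output; comparing their `sysinfo.yaml` files then shows which hardware and software differences did not matter.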
6 changes: 6 additions & 0 deletions docs/TO_BE_INVESTIGATED.md
@@ -0,0 +1,6 @@
# To be investigated

## Disabling "fast math" optimizations

To check:
* as this is a compiler flag - what Python-related software is affected by it?