FATAL EXCEPTION: Thread-5
Process: ai.mlc.mlcchat, PID: 25382
org.apache.tvm.Base$TVMError: [00:09:15] C:/Users/submission/Desktop/mlc-llm/cpp/serve/threaded_engine.cc:283: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 6452.269 MB, which is less than the sum of model weight size (9076.273 MB) and temporary buffer size (2836.074 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.
Stack trace not available when DMLC_LOG_STACK_TRACE is disabled at compile time.
at org.apache.tvm.Base.checkCall(Base.java:173)
at org.apache.tvm.Function.invoke(Function.java:130)
at ai.mlc.mlcllm.JSONFFIEngine.runBackgroundLoop(JSONFFIEngine.java:64)
at ai.mlc.mlcllm.MLCEngine$backgroundWorker$1.invoke(MLCEngine.kt:42)
at ai.mlc.mlcllm.MLCEngine$backgroundWorker$1.invoke(MLCEngine.kt:40)
at ai.mlc.mlcllm.BackgroundWorker$start$1.invoke(MLCEngine.kt:19)
at ai.mlc.mlcllm.BackgroundWorker$start$1.invoke(MLCEngine.kt:18)
at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)
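
Note: per the figures in the log, the engine needs 9076.273 MB of model weights plus 2836.074 MB of temporary buffers (11912.347 MB total) but only 6452.269 MB of GPU memory is available, a gap of roughly 5.5 GB. Raising "gpu_memory_utilization" (remedy 1) cannot close a gap that large on a single device, so quantizing the weights (remedy 2) and/or shrinking the prefill chunk (remedy 3) is the realistic fix. Below is a minimal sketch of those two remedies combined. Only the `mlc_llm gen_config`, `--tensor-parallel-shards`, and `--prefill-chunk-size` names are quoted from the log itself; the model path, the q4f16_1 quantization mode, the conversation template, and the output directory are illustrative assumptions, not values taken from this crash.

    # Hypothetical paths/values; regenerate the model config with
    # 4-bit quantization and a smaller prefill chunk, then reconvert
    # the weights with the same quantization before redeploying.
    mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/config.json \
        --quantization q4f16_1 \
        --conv-template llama-2 \
        --prefill-chunk-size 1024 \
        -o ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC

On a multi-GPU host (not the case for this Android device), remedy 2 could instead shard the weights by adding `--tensor-parallel-shards $NGPU` to the same command, as the error message suggests.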