Failed to build the Docker image for the mlc command on Ubuntu 22.04 #2132

Open
Bob123Yang opened this issue Feb 25, 2025 · 7 comments
Comments

@Bob123Yang

I failed to run the mlc command on Ubuntu 22.04:

mlcr run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=resnet50 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=5000 \
    --all_models=yes

The build failed to resolve 'developer.download.nvidia.com' as shown below, although I can in fact access developer.download.nvidia.com manually via Firefox. (A possible workaround is sketched after the log below.)

212.3 The following NEW packages will be installed:
212.3   libcublas-12-3 libcublas-dev-12-3
242.8 0 upgraded, 2 newly installed, 0 to remove and 87 not upgraded.
242.8 Need to get 514 MB of archives.
242.8 After this operation, 1577 MB of additional disk space will be used.
242.8 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-12-3 12.3.4.1-1
262.8 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-dev-12-3 12.3.4.1-1
282.9 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-12-3 12.3.4.1-1
282.9 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-dev-12-3 12.3.4.1-1
284.9 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-12-3 12.3.4.1-1
284.9 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-dev-12-3 12.3.4.1-1
308.9 Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-12-3 12.3.4.1-1
308.9   Could not connect to developer.download.nvidia.com:443 (23.223.211.90), connection timed out Could not connect to developer.download.nvidia.com:443 (23.223.211.42), connection timed out
328.9 Err:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-dev-12-3 12.3.4.1-1
328.9   Temporary failure resolving 'developer.download.nvidia.com'
328.9 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/./libcublas-12-3_12.3.4.1-1_amd64.deb  Could not connect to developer.download.nvidia.com:443 (23.223.211.90), connection timed out Could not connect to developer.download.nvidia.com:443 (23.223.211.42), connection timed out
328.9 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/./libcublas-dev-12-3_12.3.4.1-1_amd64.deb  Temporary failure resolving 'developer.download.nvidia.com'
328.9 E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
------

 4 warnings found (use docker --debug to expand):
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 6)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 14)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 53)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 66)
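
These FromAsCasing warnings are cosmetic: the generated Dockerfile mixes a lowercase 'as' with an uppercase 'FROM' in its multi-stage build stages. A minimal sketch of the fix (the image and stage names here are placeholders, not the actual contents of Dockerfile.multi):

    # Mixed casing triggers the warning:
    #   FROM <base-image> as devel
    # Matching casing silences it:
    FROM <base-image> AS devel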
Dockerfile.multi:32
--------------------
  31 |     COPY docker/common/install_tensorrt.sh install_tensorrt.sh
  32 | >>> RUN bash ./install_tensorrt.sh \
  33 | >>>     --TRT_VER=${TRT_VER} \
  34 | >>>     --CUDA_VER=${CUDA_VER} \
  35 | >>>     --CUDNN_VER=${CUDNN_VER} \
  36 | >>>     --NCCL_VER=${NCCL_VER} \
  37 | >>>     --CUBLAS_VER=${CUBLAS_VER} && \
  38 | >>>     rm install_tensorrt.sh
  39 |     
--------------------
ERROR: failed to solve: process "/bin/bash -c bash ./install_tensorrt.sh     --TRT_VER=${TRT_VER}     --CUDA_VER=${CUDA_VER}     --CUDNN_VER=${CUDNN_VER}     --NCCL_VER=${NCCL_VER}     --CUBLAS_VER=${CUBLAS_VER} &&     rm install_tensorrt.sh" did not complete successfully: exit code: 100
exit status 1
make: *** [Makefile:55: devel_build] Error 1
make: Leaving directory '/home/bob2/MLC/repos/local/cache/get-git-repo_d790359e/repo/docker'
Traceback (most recent call last):
  File "/home/bob2/mlc/bin/mlcr", line 8, in <module>
    sys.exit(mlcr())
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1715, in mlcr
    main()
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1797, in main
    res = method(run_args)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1529, in run
    return self.call_script_module_function("run", run_args)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1509, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
    r = self._run(i)
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1768, in _run
    r = customize_code.preprocess(ii)
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 284, in preprocess
    r = mlc.access(ii)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 92, in access
    result = method(options)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1526, in docker
    return self.call_script_module_function("docker", run_args)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1511, in call_script_module_function
    result = automation_instance.docker(run_args)  # Pass args to the run method
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 4684, in docker
    return docker_run(self, i)
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/docker.py", line 308, in docker_run
    r = self_module._run_deps(
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3695, in _run_deps
    r = self.action_object.access(ii)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 92, in access
    result = method(options)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1529, in run
    return self.call_script_module_function("run", run_args)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1519, in call_script_module_function
    raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.main.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = get-ml-model-gptj, return code = 256)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please file an issue at https://github.com/mlcommons/mlperf-automations/issues along with the full MLC command being run and the relevant
or full console log.
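
A note on the failure above: the host resolves developer.download.nvidia.com fine, but the build container does not. One common workaround (a sketch only, not a fix confirmed in this thread) is to give the Docker daemon explicit DNS servers, or to run the build on the host's network:

    # Sketch of a workaround, not from this thread: point the Docker daemon
    # at explicit DNS servers so build containers can resolve external hosts.
    # (Merge with any existing keys if /etc/docker/daemon.json already exists.)
    sudo tee /etc/docker/daemon.json <<'EOF'
    {
      "dns": ["8.8.8.8", "1.1.1.1"]
    }
    EOF
    sudo systemctl restart docker

    # Alternative: let the build use the host's network stack directly:
    #   docker build --network=host .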
@arjunsuresh (Contributor)

Can you please try the same command with --docker_cache=no?

@Bob123Yang (Author)

OK, I will try it and share the result later. Thanks.

@Bob123Yang (Author)

Some errors were printed as below, but the process is still going on:

......
......
......

Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script ammo-wf-exec is installed in '/home/bob2/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script evaluate-cli is installed in '/home/bob2/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cuda 23.10.0 requires pynvml<11.5,>=11.0.0, but you have pynvml 12.0.0 which is incompatible.
Successfully installed accelerate-0.25.0 bandit-1.7.7 build-1.2.2.post1 cfgv-3.4.0 colored-2.3.0 coloredlogs-15.0.1 coverage-7.6.12 datasets-3.3.2 diffusers-0.15.0 dill-0.3.8 distlib-0.3.9 evaluate-0.4.3 flatbuffers-25.2.10 graphviz-0.20.3 huggingface-hub-0.29.1 humanfriendly-10.0 identify-2.6.8 janus-2.0.0 lark-1.2.2 multiprocess-0.70.16 mypy-1.15.0 mypy_extensions-1.0.0 nltk-3.9.1 nodeenv-1.9.1 nvidia-ammo-0.7.4 nvidia-ml-py-12.570.86 onnx-graphsurgeon-0.5.5 onnxruntime-1.16.3 optimum-1.24.0 parameterized-0.9.0 pbr-6.1.1 pre-commit-4.1.0 py-1.11.0 pyarrow-19.0.1 pybind11-stubgen-2.5.3 pynvml-12.0.0 pyproject_hooks-1.2.0 pytest-cov-6.0.0 pytest-forked-1.6.0 requests-2.32.3 rouge_score-0.1.2 safetensors-0.5.2 sentencepiece-0.2.0 stevedore-5.4.1 tokenizers-0.15.2 tqdm-4.67.1 transformers-4.36.1 virtualenv-20.29.2 xxhash-3.5.0

[notice] A new release of pip is available: 23.3.1 -> 25.0.1
[notice] To update, run: python3 -m pip install --upgrade pip
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info

......
......
......

[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int4b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs2Int4b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int8b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int8b.cu.o
[ 98%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int4b.cu.o
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
  struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
         ^

1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_fp16.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12932: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_fp16.cu.o] Error 2
gmake[3]: *** Waiting for unfinished jobs....
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
  struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
         ^

1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_bf16.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12917: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_bf16.cu.o] Error 2
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
  struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
         ^

1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_int32.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12962: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_int32.cu.o] Error 2
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
  struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
         ^

1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_fp32.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12947: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_fp32.cu.o] Error 2
[ 98%] Built target layers_src
[ 98%] Built target common_src
[ 98%] Built target runtime_src


@Bob123Yang (Author)

The build quit after the above errors occurred. The full command was:

mlcr run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=resnet50 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=5000 \
    --all_models=yes \
    --docker_cache=no

@arjunsuresh (Contributor)

Oh, which GPU are you running on?

@Bob123Yang (Author)

Oh, I made the same mistake again: I mixed GPUs of different models in one machine.

Thank you @arjunsuresh, I will remove one and try again later.
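
For reference, a standard nvidia-smi query (not a command from this thread) lists the installed GPU models, which makes mixed models easy to spot:

    nvidia-smi --query-gpu=index,name --format=csv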

@Bob123Yang (Author)

@arjunsuresh Unfortunately, exactly the same error as above happened and the Docker build failed again.

Please see the log here:

mlc-log.txt
