Failed to build the Docker image for the mlc command on Ubuntu 22.04 #2132

Open
Bob123Yang opened this issue Feb 25, 2025 · 7 comments
Comments

@Bob123Yang

I failed to run the mlc command on Ubuntu 22.04:

mlcr run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=resnet50 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=5000 \
    --all_models=yes

The build failed to resolve 'developer.download.nvidia.com' as shown below, although I can in fact access developer.download.nvidia.com manually via Firefox. (A possible workaround is sketched after the log below.)

212.3 The following NEW packages will be installed:
212.3   libcublas-12-3 libcublas-dev-12-3
242.8 0 upgraded, 2 newly installed, 0 to remove and 87 not upgraded.
242.8 Need to get 514 MB of archives.
242.8 After this operation, 1577 MB of additional disk space will be used.
242.8 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-12-3 12.3.4.1-1
262.8 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-dev-12-3 12.3.4.1-1
282.9 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-12-3 12.3.4.1-1
282.9 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-dev-12-3 12.3.4.1-1
284.9 Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-12-3 12.3.4.1-1
284.9 Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-dev-12-3 12.3.4.1-1
308.9 Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-12-3 12.3.4.1-1
308.9   Could not connect to developer.download.nvidia.com:443 (23.223.211.90), connection timed out Could not connect to developer.download.nvidia.com:443 (23.223.211.42), connection timed out
328.9 Err:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  libcublas-dev-12-3 12.3.4.1-1
328.9   Temporary failure resolving 'developer.download.nvidia.com'
328.9 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/./libcublas-12-3_12.3.4.1-1_amd64.deb  Could not connect to developer.download.nvidia.com:443 (23.223.211.90), connection timed out Could not connect to developer.download.nvidia.com:443 (23.223.211.42), connection timed out
328.9 E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/./libcublas-dev-12-3_12.3.4.1-1_amd64.deb  Temporary failure resolving 'developer.download.nvidia.com'
328.9 E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
------

 4 warnings found (use docker --debug to expand):
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 6)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 14)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 53)
 - FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 66)
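
These FromAsCasing warnings are cosmetic: the generated Dockerfile mixes a lowercase 'as' with an uppercase 'FROM' in its multi-stage build stages. A minimal sketch of the fix (the image and stage names here are placeholders, not the actual contents of Dockerfile.multi):

    # Mixed casing triggers the warning:
    #   FROM <base-image> as devel
    # Matching casing silences it:
    FROM <base-image> AS devel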
Dockerfile.multi:32
--------------------
  31 |     COPY docker/common/install_tensorrt.sh install_tensorrt.sh
  32 | >>> RUN bash ./install_tensorrt.sh \
  33 | >>>     --TRT_VER=${TRT_VER} \
  34 | >>>     --CUDA_VER=${CUDA_VER} \
  35 | >>>     --CUDNN_VER=${CUDNN_VER} \
  36 | >>>     --NCCL_VER=${NCCL_VER} \
  37 | >>>     --CUBLAS_VER=${CUBLAS_VER} && \
  38 | >>>     rm install_tensorrt.sh
  39 |     
--------------------
ERROR: failed to solve: process "/bin/bash -c bash ./install_tensorrt.sh     --TRT_VER=${TRT_VER}     --CUDA_VER=${CUDA_VER}     --CUDNN_VER=${CUDNN_VER}     --NCCL_VER=${NCCL_VER}     --CUBLAS_VER=${CUBLAS_VER} &&     rm install_tensorrt.sh" did not complete successfully: exit code: 100
exit status 1
make: *** [Makefile:55: devel_build] Error 1
make: Leaving directory '/home/bob2/MLC/repos/local/cache/get-git-repo_d790359e/repo/docker'
Traceback (most recent call last):
  File "/home/bob2/mlc/bin/mlcr", line 8, in <module>
    sys.exit(mlcr())
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1715, in mlcr
    main()
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1797, in main
    res = method(run_args)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1529, in run
    return self.call_script_module_function("run", run_args)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1509, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
    r = self._run(i)
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1768, in _run
    r = customize_code.preprocess(ii)
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 284, in preprocess
    r = mlc.access(ii)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 92, in access
    result = method(options)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1526, in docker
    return self.call_script_module_function("docker", run_args)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1511, in call_script_module_function
    result = automation_instance.docker(run_args)  # Pass args to the run method
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 4684, in docker
    return docker_run(self, i)
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/docker.py", line 308, in docker_run
    r = self_module._run_deps(
  File "/home/bob2/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3695, in _run_deps
    r = self.action_object.access(ii)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 92, in access
    result = method(options)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1529, in run
    return self.call_script_module_function("run", run_args)
  File "/home/bob2/mlc/lib/python3.10/site-packages/mlc/main.py", line 1519, in call_script_module_function
    raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.main.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = get-ml-model-gptj, return code = 256)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please file an issue at https://github.com/mlcommons/mlperf-automations/issues along with the full MLC command being run and the relevant
or full console log.
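
A note on the failure above: the host resolves developer.download.nvidia.com fine, but the build container does not. One common workaround (a sketch only, not a fix confirmed in this thread) is to give the Docker daemon explicit DNS servers, or to run the build on the host's network:

    # Sketch of a workaround, not from this thread: point the Docker daemon
    # at explicit DNS servers so build containers can resolve external hosts.
    # (Merge with any existing keys if /etc/docker/daemon.json already exists.)
    sudo tee /etc/docker/daemon.json <<'EOF'
    {
      "dns": ["8.8.8.8", "1.1.1.1"]
    }
    EOF
    sudo systemctl restart docker

    # Alternative: let the build use the host's network stack directly:
    #   docker build --network=host .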
@arjunsuresh (Contributor)

Can you please try the same command with --docker_cache=no?

@Bob123Yang (Author)

OK, I will try it and share the result later. Thanks.

@Bob123Yang (Author)

Some errors were printed as below, but the process is still going on:

......
......
......

Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script ammo-wf-exec is installed in '/home/bob2/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script evaluate-cli is installed in '/home/bob2/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cuda 23.10.0 requires pynvml<11.5,>=11.0.0, but you have pynvml 12.0.0 which is incompatible.
Successfully installed accelerate-0.25.0 bandit-1.7.7 build-1.2.2.post1 cfgv-3.4.0 colored-2.3.0 coloredlogs-15.0.1 coverage-7.6.12 datasets-3.3.2 diffusers-0.15.0 dill-0.3.8 distlib-0.3.9 evaluate-0.4.3 flatbuffers-25.2.10 graphviz-0.20.3 huggingface-hub-0.29.1 humanfriendly-10.0 identify-2.6.8 janus-2.0.0 lark-1.2.2 multiprocess-0.70.16 mypy-1.15.0 mypy_extensions-1.0.0 nltk-3.9.1 nodeenv-1.9.1 nvidia-ammo-0.7.4 nvidia-ml-py-12.570.86 onnx-graphsurgeon-0.5.5 onnxruntime-1.16.3 optimum-1.24.0 parameterized-0.9.0 pbr-6.1.1 pre-commit-4.1.0 py-1.11.0 pyarrow-19.0.1 pybind11-stubgen-2.5.3 pynvml-12.0.0 pyproject_hooks-1.2.0 pytest-cov-6.0.0 pytest-forked-1.6.0 requests-2.32.3 rouge_score-0.1.2 safetensors-0.5.2 sentencepiece-0.2.0 stevedore-5.4.1 tokenizers-0.15.2 tqdm-4.67.1 transformers-4.36.1 virtualenv-20.29.2 xxhash-3.5.0

[notice] A new release of pip is available: 23.3.1 -> 25.0.1
[notice] To update, run: python3 -m pip install --upgrade pip
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info

......
......
......

[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int4b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs2Int4b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int8b.cu.o
[ 97%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int8b.cu.o
[ 98%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int4b.cu.o
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
  struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
         ^

1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_fp16.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12932: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_fp16.cu.o] Error 2
gmake[3]: *** Waiting for unfinished jobs....
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
  struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
         ^

1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_bf16.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12917: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_bf16.cu.o] Error 2
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
  struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
         ^

1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_int32.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12962: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_int32.cu.o] Error 2
/code/tensorrt_llm/cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/threadblock/epilogue_tensor_op_int32.h(97): error: class template "cutlass::epilogue::threadblock::detail::DefaultIteratorsTensorOp" has already been defined
  struct DefaultIteratorsTensorOp<cutlass::bfloat16_t, int32_t, 8, ThreadblockShape, WarpShape, InstructionShape,
         ^

1 error detected in the compilation of "/code/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_fp32.cu".
gmake[3]: *** [tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/build.make:12947: tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/cutlass_kernels/int8_gemm/int8_gemm_fp32.cu.o] Error 2
[ 98%] Built target layers_src
[ 98%] Built target common_src
[ 98%] Built target runtime_src


@Bob123Yang (Author)

The build quit after the above errors occurred. The full command was:

mlcr run-mlperf,inference,_find-performance,_full,_r4.1-dev \
    --model=resnet50 \
    --implementation=nvidia \
    --framework=tensorrt \
    --category=edge \
    --scenario=Offline \
    --execution_mode=test \
    --device=cuda \
    --docker --quiet \
    --test_query_count=5000 \
    --all_models=yes \
    --docker_cache=no

@arjunsuresh (Contributor)

Oh, which GPU are you running on?

@Bob123Yang (Author)

Oh, I made the same mistake again: I mixed GPUs of different models in one machine.

Thank you @arjunsuresh, I will remove one and try again later.
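
For reference, a standard nvidia-smi query (not a command from this thread) lists the installed GPU models, which makes mixed models easy to spot:

    nvidia-smi --query-gpu=index,name --format=csv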

@Bob123Yang (Author)

@arjunsuresh Unfortunately, exactly the same error as above happened and the Docker build failed again.

Please see the log here:

mlc-log.txt
