AMD RX 9060XT ROCm error: invalid device function #6044

@mudler

Description

Discussed in #6008

Originally posted by cybershaman August 10, 2025
Hello all!

I've been hitting my head against this issue for some time now, so I thought I might reach out to the community here for advice :-)

Host - Proxmox 8.4 Bare Metal Install

  • AMD RX 9060XT (gfx1200)
  • Kernel 6.11.11 to leverage proper GPU detection
  • amdgpu kernel module compiled and inserted via DKMS (probably not strictly needed, though?)
  • "rocm-smi" functional
  • "rocminfo" shows only 1 Agent (CPU)

Container (LXC) - Ubuntu 22.04.5 LTS

  • ROCm 6.4.2 installed via amdgpu-install (of course no DKMS)
  • the following devices passed through from the host with the proper cgroup permissions for the container (see the config sketch after this list):
    • /dev/kfd
    • /dev/dri/*
  • "rocm-smi" functional
  • "rocminfo" functional (2 Agents, CPU & GPU)

LocalAI compiled from git source:

  • REBUILD=true BUILD_TYPE=hipblas GPU_TARGETS=gfx1200 GO_TAGS=stablediffusion,tts BUILD_GRPC_FOR_BACKEND_LLAMA=true BUILD_GRPC=true make build
  • using the environment variable HSA_OVERRIDE_GFX_VERSION=12.0.0 just in case
  • testing with a LLaMA model; relevant config parts (a fuller YAML sketch follows this list):
    • backend: rocm-llama-cpp
    • f16: true
    • threads: 0
    • gpu_layers: 200
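
Put together, the model definition looks roughly like this (a minimal sketch; the model name and file name are placeholders, and only the four keys listed above come from the real config):

    # models/discolm_german.yaml -- illustrative sketch
    name: discolm_german
    backend: rocm-llama-cpp
    f16: true
    threads: 0
    gpu_layers: 200
    parameters:
      model: discolm_german.Q4_K_M.gguf   # hypothetical file name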

LocalAI appears to be recognizing and utilizing the GPU, as there is VRAM movement and a small amount of GPU activity visible while querying the API.
However, it eventually throws an error:

11:04AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr ggml_cuda_init: found 1 ROCm devices:
11:04AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr   Device 0: AMD Radeon RX 9060 XT, gfx1200 (0x1200), VMM: no, Wave Size: 32
[...]
11:04AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr load_tensors: offloading 32 repeating layers to GPU
11:04AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr load_tensors: offloading output layer to GPU
11:04AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr load_tensors: offloaded 33/33 layers to GPU
11:04AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr load_tensors:   CPU_Mapped model buffer size =    70.32 MiB
11:04AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr load_tensors:        ROCm0 model buffer size =  3877.56 MiB
[...]
11:05AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr ggml_cuda_compute_forward: MUL_MAT failed
11:05AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr ROCm error: invalid device function
11:05AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr   current device: 0, in function ggml_cuda_compute_forward at /LocalAI/backend/cpp/llama-cpp-fallback-build/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2513
11:05AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr   err
11:05AM DBG GRPC(discolm_german-127.0.0.1:37515): stderr /LocalAI/backend/cpp/llama-cpp-fallback-build/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:84: ROCm error
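
As far as I understand, "invalid device function" on ROCm usually means the loaded binary ships no code object for the GPU's ISA. A quick way to cross-check (the backend binary path is a placeholder; note the log above points at llama-cpp-fallback-build, so that build would be the one to inspect):

    # Which ISA does the runtime report, and which ISAs does the built backend actually contain?
    rocminfo | grep -oE 'gfx[0-9a-f]+' | sort -u
    strings /path/to/llama-cpp-backend-binary | grep -oE 'gfx[0-9a-f]+' | sort -u   # placeholder path

If the two sets don't share gfx1200 (or a compatible target), the build flags would be the place to look.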

Anyone have any ideas and/or pointers?
Thank you very much in advance!
