- Set HIP_VISIBLE_DEVICES=0 to use only the discrete GPU (gfx1201).
llama.cpp was trying to split layers across the iGPU (gfx1036), which
caused segfaults when loading the multimodal projector.
- Restore --mmproj for both HF models (multimodal works correctly with
single GPU).
- Keep qwen3.5:9b disabled (the Ollama-extracted GGUF uses an old
  mrope_sections key format that is incompatible with this llama.cpp build).
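For reference, pinning the discrete GPU can look like this in the compose
service (service name, image, and device mappings are illustrative, not the
committed file):

```yaml
services:
  llama-swap:
    build: ./ollama
    environment:
      # Expose only ROCm device 0 (the discrete gfx1201) so llama.cpp
      # cannot split layers onto the iGPU (gfx1036).
      - HIP_VISIBLE_DEVICES=0
    devices:
      - /dev/kfd
      - /dev/dri
```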
Replace the Ollama service with a custom ROCm image combining
ghcr.io/ggml-org/llama.cpp:server-rocm and llama-swap v199.
Main motivations:
- Unblock qwen35 HF GGUFs (the qwen35 architecture is not supported in
  Ollama 0.20.4 for HF-imported models)
- Stay current with llama.cpp upstream without waiting for Ollama releases
Changes:
- ollama/Dockerfile: build llama-swap on top of llama.cpp:server-rocm
- ollama/llama-swap.yaml: define 4 models with full sampler config,
GPU offload, and mmproj for the two multimodal HF fine-tunes
- ollama/docker-compose.yml: replace Ollama image with local build;
fix broken volume mount (was /ubuntu/.ollama, now explicit /models)
- ollama/Caddyfile: update upstream port 11434→8080 (llama-swap default)
- ai/docker-compose.yml: switch Open WebUI from OLLAMA_BASE_URL to
OPENAI_API_BASE_URL pointing at llama-swap /v1 endpoint
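The Dockerfile could be sketched roughly as follows (the Go version, repo
layout, tag name, and llama-swap flags are assumptions; the real build may
fetch a release binary instead of compiling from source):

```dockerfile
# Build llama-swap from source in a throwaway stage.
FROM golang:1.23 AS swap
RUN git clone --depth 1 --branch v199 https://github.com/mostlygeek/llama-swap /src \
 && cd /src && go build -o /llama-swap .

# Layer the proxy onto the upstream ROCm llama.cpp server image so the
# bundled llama-server binary and ROCm runtime are reused as-is.
FROM ghcr.io/ggml-org/llama.cpp:server-rocm
COPY --from=swap /llama-swap /app/llama-swap
COPY llama-swap.yaml /app/llama-swap.yaml
ENTRYPOINT ["/app/llama-swap", "--config", "/app/llama-swap.yaml", "--listen", ":8080"]
```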
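A llama-swap model entry along these lines (model name, GGUF paths, sampler
values, and the llama-server location are placeholders, not the committed
config):

```yaml
models:
  "example-vision":
    # llama-swap substitutes ${PORT} with the port it proxies to.
    cmd: |
      /app/llama-server --port ${PORT}
        -m /models/example-vision-Q4_K_M.gguf
        --mmproj /models/example-vision-mmproj.gguf
        -ngl 99
        --temp 0.7 --top-p 0.9
    # Unload the model after 5 minutes idle to free VRAM.
    ttl: 300
```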