feat(llama-swap): add 128k max-ctx profiles for qwen3.5 variants

fix(llama-swap): use --flash-attn on (flag requires explicit value)
style(llama-swap): consistent flag ordering across all profiles
2026-04-10 16:10:04 +02:00 · 2026-04-10 15:50:25 +02:00 · 2026-04-10 15:44:24 +02:00 · 2026-04-10 15:42:19 +02:00 · 2026-04-10 15:10:38 +02:00 · 2026-04-10 11:42:39 +02:00
5 changed files with 155 additions and 5 deletions
--- a/ai/docker-compose.yml
+++ b/ai/docker-compose.yml
@ -8,7 +8,9 @@ services:
      - "/srv/docker/ai/data/data:/app/backend/data" # Double data is intentional
      - "/srv/docker/ai/data/.webui_secret_key:/app/backend/.webui_secret_key"
    environment:
-      - OLLAMA_BASE_URL=https://ollama.lan.poldebra.me
+      - OPENAI_API_BASE_URL=https://ollama.lan.poldebra.me/v1
      - OPENAI_API_KEY=sk-no-key-required
      - ENABLE_OLLAMA_API=false
    networks:
      internal:
        ipv4_address: 172.24.0.5
--- a/ollama/Caddyfile
+++ b/ollama/Caddyfile
@ -21,7 +21,7 @@
          X-Forwarded-Host {host}
          X-Forwarded-Port {server_port}
        }
-        reverse_proxy 172.23.0.5:11434 {
+        reverse_proxy 172.23.0.5:8080 {
            header_up X-Forwarded-Proto {scheme}
        }
    }
--- a/ollama/Dockerfile
+++ b/ollama/Dockerfile
@ -0,0 +1,12 @@
 # syntax=docker/dockerfile:1
 FROM ghcr.io/ggml-org/llama.cpp:server-rocm
 ARG LLAMA_SWAP_VERSION=v199
 ADD https://github.com/mostlygeek/llama-swap/releases/download/${LLAMA_SWAP_VERSION}/llama-swap_199_linux_amd64.tar.gz /tmp/llama-swap.tar.gz
 RUN tar -xzf /tmp/llama-swap.tar.gz -C /usr/local/bin llama-swap \
 && chmod +x /usr/local/bin/llama-swap \
 && rm /tmp/llama-swap.tar.gz
 EXPOSE 8080
 ENTRYPOINT ["/usr/local/bin/llama-swap"]
 CMD ["-config", "/etc/llama-swap/config.yaml", "-listen", ":8080"]
--- a/ollama/docker-compose.yml
+++ b/ollama/docker-compose.yml
@ -1,15 +1,21 @@
 services:
  app:
-    image: ollama/ollama:rocm
+    build: .
    image: local/llama-swap-rocm:latest
    restart: unless-stopped
    hostname: ollama
    container_name: ollama
    user: 1000:1000
    volumes:
-      - "/srv/docker/ollama/data:/ubuntu/.ollama"
+      - "/srv/docker/ollama/data/models:/models:ro"
      - "./llama-swap.yaml:/etc/llama-swap/config.yaml:ro"
    environment:
      - HIP_VISIBLE_DEVICES=0
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    group_add:
      - video
      - render
    networks:
      internal:
        ipv4_address: 172.23.0.5
--- a/ollama/llama-swap.yaml
+++ b/ollama/llama-swap.yaml
@ -0,0 +1,130 @@
 healthCheckTimeout: 180
 logLevel: info
 models:
  "qwen3.5:9b":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/Qwen3.5-9B-Q4_K_M.gguf
      --mmproj /models/Qwen3.5-9B-mmproj-F16.gguf
      --alias qwen3.5:9b
      --n-gpu-layers 999
      --ctx-size 8192
      --flash-attn on
      --jinja
      --temp 0.7 --top-p 0.9
  "qwen3.5:9b-32k":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/Qwen3.5-9B-Q4_K_M.gguf
      --mmproj /models/Qwen3.5-9B-mmproj-F16.gguf
      --alias qwen3.5:9b-32k
      --n-gpu-layers 999
      --ctx-size 32768
      --flash-attn on
      --jinja
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.7 --top-p 0.9
  "qwen3.5:9b-uncensored":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/HauhauCS-Qwen3.5-9B-Uncensored-Aggressive.q4_k_m.gguf
      --mmproj /models/HauhauCS-Qwen3.5-9B-Uncensored-Aggressive.mmproj.gguf
      --alias "hf.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:q4_k_m"
      --n-gpu-layers 999
      --ctx-size 32768
      --flash-attn on
      --jinja
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.7 --top-p 0.9
  "qwen3.5:9b-claude-4.6-opus-reasoning":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/Jackrong-Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2.q4_k_m.gguf
      --mmproj /models/Jackrong-Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2.mmproj.gguf
      --alias "hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF:q4_k_m"
      --n-gpu-layers 999
      --ctx-size 32768
      --flash-attn on
      --jinja
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.6 --top-p 0.95
  "qwen3.5:9b-128k":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/Qwen3.5-9B-Q4_K_M.gguf
      --mmproj /models/Qwen3.5-9B-mmproj-F16.gguf
      --alias qwen3.5:9b-128k
      --n-gpu-layers 999
      --ctx-size 131072
      --flash-attn on
      --jinja
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.7 --top-p 0.9
  "qwen3.5:9b-uncensored-128k":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/HauhauCS-Qwen3.5-9B-Uncensored-Aggressive.q4_k_m.gguf
      --mmproj /models/HauhauCS-Qwen3.5-9B-Uncensored-Aggressive.mmproj.gguf
      --alias "hf.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive:q4_k_m"
      --n-gpu-layers 999
      --ctx-size 131072
      --flash-attn on
      --jinja
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.7 --top-p 0.9
  "qwen3.5:9b-claude-4.6-opus-reasoning-128k":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/Jackrong-Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2.q4_k_m.gguf
      --mmproj /models/Jackrong-Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2.mmproj.gguf
      --alias "hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF:q4_k_m"
      --n-gpu-layers 999
      --ctx-size 131072
      --flash-attn on
      --jinja
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.6 --top-p 0.95
  "gemma4:e4b-uncensored":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf
      --mmproj /models/mmproj-Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-f16.gguf
      --alias "hf.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive:q8_k_p"
      --n-gpu-layers 999
      --ctx-size 8192
      --flash-attn on
      --jinja
  "gemma4:26b-a4b":
    cmd: |
      /app/llama-server
      --host 0.0.0.0 --port ${PORT}
      --model /models/gemma-4-26B-A4B-it-UD-IQ4_NL.gguf
      --mmproj /models/gemma-4-26B-A4B-it-UD-IQ4_NL-mmproj-BF16.gguf
      --alias "hf.co/unsloth/gemma-4-26B-A4B-it:ud-iq4_nl"
      --n-gpu-layers 999
      --ctx-size 8192
      --flash-attn on
      --jinja
Author	SHA1	Message	Date
Davide Polonio	c233d06dcb	feat(llama-swap): add 128k max-ctx profiles for qwen3.5 variants	2026-04-10 16:10:04 +02:00
Davide Polonio	df3b927985	fix(llama-swap): use --flash-attn on (flag requires explicit value)	2026-04-10 15:50:25 +02:00
Davide Polonio	6ea3e870bd	style(llama-swap): consistent flag ordering across all profiles	2026-04-10 15:44:24 +02:00
Davide Polonio	55ac2e5568	feat(llama-swap): fix contexts, add flash-attn/jinja, tune sampling params	2026-04-10 15:42:19 +02:00
Davide Polonio	2d852879b6	feat: add new models	2026-04-10 15:10:38 +02:00
Davide Polonio	c6e4901dee	refactor(ollama): simplify model names in llama-swap configuration Replace verbose Hugging Face model references with concise local aliases for Qwen3.5 models to improve readability and maintainability.	2026-04-10 11:42:39 +02:00
Davide Polonio	8ab4213b62	feat(ollama): add persistence in Ollama container Re-enable qwen3.5:9b and qwen3.5:9bctxSmall using fresh unsloth/Qwen3.5-9B-GGUF quantization, which uses the correct rope.dimension_sections format (4 elements) compatible with this llama.cpp build. Both models include the mmproj for multimodal support. The old Ollama-extracted GGUF (mrope_sections, 3 elements) has been removed.	2026-04-10 10:57:34 +02:00
Davide Polonio	ebc71492c3	fix(ollama): restrict to RX 9070 XT, restore mmproj - Set HIP_VISIBLE_DEVICES=0 to use only the discrete GPU (gfx1201). llama.cpp was trying to split layers across the iGPU (gfx1036) which caused segfaults when loading the multimodal projector. - Restore --mmproj for both HF models (multimodal works correctly with single GPU). - Keep qwen3.5:9b disabled (Ollama-extracted GGUF uses old mrope_sections key format incompatible with this llama.cpp build).	2026-04-10 00:09:12 +02:00
Davide Polonio	3034f987d7	feat(ollama): migrate from Ollama to llama.cpp + llama-swap Replace the Ollama service with a custom ROCm image combining ghcr.io/ggml-org/llama.cpp:server-rocm and llama-swap v199. Main motivations: - Unblock qwen35 HF GGUFs (qwen35 architecture not supported in Ollama 0.20.4 for HF-imported models) - Stay current with llama.cpp upstream without waiting for Ollama releases Changes: - ollama/Dockerfile: build llama-swap on top of llama.cpp:server-rocm - ollama/llama-swap.yaml: define 4 models with full sampler config, GPU offload, and mmproj for the two multimodal HF fine-tunes - ollama/docker-compose.yml: replace Ollama image with local build; fix broken volume mount (was /ubuntu/.ollama, now explicit /models) - ollama/Caddyfile: update upstream port 11434→8080 (llama-swap default) - ai/docker-compose.yml: switch Open WebUI from OLLAMA_BASE_URL to OPENAI_API_BASE_URL pointing at llama-swap /v1 endpoint	2026-04-09 23:14:43 +02:00