
LLM Router Pipeline for Open WebUI

An intelligent prompt classification and routing pipeline for Open WebUI. It classifies each user prompt with a small LLM (qwen2.5:7b) and routes it to a specialized Ollama model, with integrated Brave web search, image generation via Stable Diffusion, and bilingual Finnish/English support.

Features

  • AI-powered prompt classification with keyword-based fallback
  • Model routing — coding, diagram, reasoning, vision, image generation, and general categories
  • Brave web search with full page content fetching (top 3 results scraped)
  • Heuristic search overrides — safety net that forces search for time-sensitive or factual questions
  • Image generation via AUTOMATIC1111/Forge (Stable Diffusion XL) with LLM-refined prompts
  • Uncensored image generation — prefix any prompt with uncen to bypass all classification/search and generate directly with Juggernaut XL v9
  • VRAM management — automatically juggles GPU memory between Ollama and Stable Diffusion
  • Bilingual — detects Finnish and forces responses in the correct language
  • Thinking/reasoning display — streams model thinking tokens in collapsible blocks
  • Real-time search status — shows which URLs are being fetched as search runs
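
As an illustration of the first feature, here is a minimal sketch of AI classification with a keyword fallback. The function names and keyword lists are hypothetical (only the trigger phrases shown in this README are taken from the source), and the real pipeline may validate labels differently.

```python
from typing import Callable, Optional

# Labels the classifier is allowed to return (vision is decided by image upload).
CATEGORIES = {"coding", "diagram", "reasoning", "image_generation", "general"}

# Hypothetical keyword lists; the pipeline's real lists are not shown in this README.
KEYWORDS = {
    "diagram": ("mermaid", "flowchart", "uml"),
    "image_generation": ("generate an image", "luo kuva"),
    "coding": ("debug", "fix this code", "write a function"),
}


def classify_by_keywords(prompt: str) -> str:
    """Return the first category whose keyword appears in the prompt."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "general"


def classify(prompt: str,
             ai_classifier: Optional[Callable[[str], str]] = None) -> str:
    """Try the AI classifier first; fall back to keywords on error or bad label."""
    if ai_classifier is not None:
        try:
            label = ai_classifier(prompt).strip().lower()
            if label in CATEGORIES:
                return label
        except Exception:
            pass  # classifier unavailable or timed out: use the fallback
    return classify_by_keywords(prompt)
```

Passing `ai_classifier=None` corresponds to keyword-only classification, which is roughly what disabling the AI classifier would look like in this sketch.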

Model Routing

Category          Model (120B)         Model (20B)          Trigger
coding            qwen2.5-coder:14b    qwen2.5-coder:14b    User asks to write/fix/debug code
diagram           qwen2.5-coder:14b    qwen2.5-coder:14b    Mermaid, flowchart, UML requests
reasoning (FI)    gpt-oss:120b         gpt-oss:20b          Analysis, comparison, strategy (Finnish)
reasoning (EN)    gpt-oss:120b         gpt-oss:20b          Analysis, comparison, strategy (English)
image generation  gpt-oss:120b + SDXL  gpt-oss:20b + SDXL   "generate an image", "luo kuva"
uncensored image  Juggernaut XL v9     Juggernaut XL v9     Prompt starts with uncen
vision            llama3.2-vision:11b  llama3.2-vision:11b  User uploads an image
general           gpt-oss:120b         gpt-oss:20b          Everything else

Two pipeline variants are provided:

  • llm_router_v3.py — uses gpt-oss:120b (higher quality, more VRAM/RAM)
  • llm_router-20b.py — uses gpt-oss:20b (lighter, better for constrained hardware)
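
The routing table can be read as a small lookup in which only the gpt-oss model changes between the two variants. The sketch below is an illustration of that reading, not the pipeline's actual code; `FIXED_ROUTES` and `model_for` are hypothetical names.

```python
# Categories whose model is the same in both pipeline variants.
FIXED_ROUTES = {
    "coding": "qwen2.5-coder:14b",
    "diagram": "qwen2.5-coder:14b",
    "vision": "llama3.2-vision:11b",
}


def model_for(category: str, variant: str = "120b") -> str:
    """Return the Ollama model for a category in the given variant.

    Reasoning, image-prompt refinement, and general chat all use gpt-oss,
    so any category without a fixed route falls through to it.
    """
    if variant not in ("120b", "20b"):
        raise ValueError(f"unknown variant: {variant!r}")
    return FIXED_ROUTES.get(category, f"gpt-oss:{variant}")
```

For example, `model_for("reasoning", "20b")` yields `gpt-oss:20b`, while `model_for("coding", "20b")` still yields `qwen2.5-coder:14b`.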

Prerequisites

  • Ubuntu 22.04 LTS (tested)
  • NVIDIA GPU with 16GB+ VRAM (tested on RTX 2000 Ada)
  • Open WebUI running in Docker with pipelines enabled
  • Ollama installed natively with models pulled:
    ollama pull qwen2.5:7b
    ollama pull qwen2.5-coder:14b
    ollama pull gpt-oss:120b    # or gpt-oss:20b for the lighter variant
    ollama pull llama3.2-vision:11b
    
  • Brave Search API key (free tier: https://brave.com/search/api/)

Setup

1. Deploy the Pipeline

Copy your chosen pipeline file to the Open WebUI pipelines directory:

cp llm_router_v3.py ~/ai-stack/pipelines/
# or for the 20B variant:
cp llm_router-20b.py ~/ai-stack/pipelines/

Restart the pipelines container:

docker restart pipelines

2. Configure Valves in Open WebUI

Go to Admin Panel > Pipelines in Open WebUI and configure:

Setting                   Description                            Default
ollama_url                Ollama API URL                         http://ollama:11434
sd_url                    Stable Diffusion API URL               http://172.18.0.1:7860
brave_api_key             Brave Search API key                   (from env BRAVE_API_KEY)
sd_width / sd_height      Generated image dimensions             1024 x 1024
sd_steps                  Sampling steps                         25
sd_cfg_scale              CFG scale                              7.0
brave_max_results         Number of search results               6
use_ai_classifier         Use AI vs keyword-only classification  true
show_routing_info         Show routing banner in responses       true
search_context_max_chars  Max search context size                12000
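
For reference, the defaults above can be written down as a plain Python object. Open WebUI pipelines typically declare valves as a pydantic model; the sketch below uses a standard-library dataclass instead so that it runs without extra dependencies, so treat it as an approximation of the shape rather than the pipeline's actual class.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Valves:
    """Defaults copied from the settings table; the class itself is illustrative."""
    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"
    # Read from the environment at construction time; empty if unset.
    brave_api_key: str = field(
        default_factory=lambda: os.getenv("BRAVE_API_KEY", ""))
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 25
    sd_cfg_scale: float = 7.0
    brave_max_results: int = 6
    use_ai_classifier: bool = True
    show_routing_info: bool = True
    search_context_max_chars: int = 12000
```

Overriding a single value, for example `Valves(sd_steps=30)`, leaves the remaining defaults untouched.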

3. Set Up Stable Diffusion (Image Generation)

Skip this section if you don't need image generation.

Install Forge (AUTOMATIC1111 fork)

# Install system dependencies
sudo apt-get update
sudo apt-get install -y git wget python3-venv python3-pip \
    libgl1 libglib2.0-0 libsm6 libxrender1 libxext6 libffi-dev libssl-dev

# Clone Forge
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git ~/stable-diffusion-webui
cd ~/stable-diffusion-webui

# Download SDXL model (~6.9GB)
mkdir -p models/Stable-diffusion
wget -O models/Stable-diffusion/sd_xl_base_1.0.safetensors \
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"

# Download Juggernaut XL v9 for uncensored image generation (~6.6GB)
wget -O models/Stable-diffusion/juggernautXL_v9.safetensors \
    "https://huggingface.co/RunDiffusion/Juggernaut-XL-v9/resolve/main/Juggernaut-XL_v9_RunDiffusionPhoto_v2.safetensors"

Fix Python 3.10 build issues (Ubuntu 22.04)

The first launch creates a Python venv and installs dependencies. On Ubuntu 22.04 (Python 3.10) the CLIP build fails with a pkg_resources error. To fix it, pin setuptools below 70 and install CLIP without build isolation:

cd ~/stable-diffusion-webui

# First launch creates the venv — run it once, let it fail, then fix:
./webui.sh --api --listen --xformers --no-half-vae || true

# Fix CLIP build issue
venv/bin/pip install "setuptools<70" wheel
venv/bin/pip install --no-build-isolation \
    https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip

# Launch again
./webui.sh --api --listen --xformers --no-half-vae

Select the default SDXL model

Once the UI is running, open it in a browser and select sd_xl_base_1.0 from the checkpoint dropdown. Or via API:

curl -X POST http://localhost:7860/sdapi/v1/options \
    -H "Content-Type: application/json" \
    -d '{"sd_model_checkpoint": "sd_xl_base_1.0.safetensors"}'

The pipeline automatically switches between models at runtime — sd_xl_base_1.0 for normal generation, juggernautXL_v9 when the uncen prefix is used.

Create a systemd service

Using the provided script:

chmod +x setup-sd-service.sh
sudo ./setup-sd-service.sh

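Or manually. The unquoted heredoc below expands $USER and $HOME in your current shell, so run it as the user who owns the installation, or replace both variables with explicit values: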

sudo tee /etc/systemd/system/stable-diffusion.service > /dev/null <<EOF
[Unit]
Description=AUTOMATIC1111 Stable Diffusion WebUI
After=network.target

[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/stable-diffusion-webui
ExecStart=$HOME/stable-diffusion-webui/webui.sh --api --listen --xformers --no-half-vae --medvram-sdxl
Restart=on-failure
RestartSec=10
Environment=HOME=$HOME

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now stable-diffusion

Verify

# Check the service is running
sudo systemctl status stable-diffusion

# Check available models (should list both sd_xl_base and juggernautXL)
curl -s http://localhost:7860/sdapi/v1/sd-models | python3 -m json.tool
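
If you prefer a scripted check over reading the JSON by eye, the sketch below scans a parsed sd-models response for both checkpoints. It assumes each entry carries `model_name` and `title` fields, which is how the A1111-style API usually reports checkpoints; `missing_checkpoints` is a hypothetical helper.

```python
from typing import Dict, Iterable, List

# Checkpoint names as downloaded in the install step above.
REQUIRED = ("sd_xl_base_1.0", "juggernautXL_v9")


def missing_checkpoints(models: List[Dict],
                        required: Iterable[str] = REQUIRED) -> List[str]:
    """Return the required checkpoint names not found in the model list."""
    seen = [
        f"{entry.get('model_name', '')} {entry.get('title', '')}"
        for entry in models
    ]
    return [name for name in required
            if not any(name in text for text in seen)]
```

Feed it the output of `json.loads(...)` applied to the curl response; an empty list means both checkpoints are visible to the API.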

4. Network Configuration

The pipeline runs inside Open WebUI's Docker container and needs to reach services on the host:

Service           URL from container      Notes
Ollama            http://ollama:11434     Docker DNS or host networking
Stable Diffusion  http://172.18.0.1:7860  Docker bridge gateway IP

To find your bridge gateway IP:

docker network inspect <your_network> --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'

Update SD_URL in the pipeline file if your gateway IP differs from 172.18.0.1.
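
The same lookup can be done by parsing the plain JSON output of `docker network inspect <your_network>` (without `--format`). This is a small sketch under the assumption that the output is the usual JSON array of networks with an `IPAM.Config` list; `gateway_ips` and `sd_url_for` are hypothetical helper names.

```python
import json
from typing import List


def gateway_ips(inspect_output: str) -> List[str]:
    """Extract every IPAM gateway address from `docker network inspect` JSON."""
    gateways = []
    for network in json.loads(inspect_output):
        configs = (network.get("IPAM") or {}).get("Config") or []
        for config in configs:
            if config.get("Gateway"):
                gateways.append(config["Gateway"])
    return gateways


def sd_url_for(gateway: str, port: int = 7860) -> str:
    """Build the Stable Diffusion URL the container should use."""
    return f"http://{gateway}:{port}"
```

With a single bridge network, `sd_url_for(gateway_ips(output)[0])` gives the value to use for the Stable Diffusion URL.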

Verify connectivity from inside the container:

docker exec open-webui curl -s http://172.18.0.1:7860/sdapi/v1/sd-models
docker exec open-webui curl -s http://ollama:11434/api/tags | head -c 100

Image Generation

Default mode

Any prompt classified as image_generation (e.g. "generate an image of a cat in space") uses SDXL Base 1.0. The LLM refines the user's request into an optimized Stable Diffusion prompt with quality boosters, then calls the A1111 API.

Uncensored mode

Prefix any prompt with uncen to bypass all classification, web search, and routing — the pipeline goes straight to image generation using Juggernaut XL v9:

uncen a beautiful sunset over the ocean
uncen portrait of a warrior in golden armor

The uncen prefix is stripped and the user's text is sent directly to Stable Diffusion with quality tags appended — no LLM refinement (to avoid model refusal). The pipeline switches the SD checkpoint via the API automatically.
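
A minimal sketch of the prefix handling described above. The quality tags, the case-insensitive match, and the returned dictionary layout are assumptions made for illustration; the README does not list the real tags or the internal data structures.

```python
from typing import Dict, Tuple

UNCEN_PREFIX = "uncen"
# Hypothetical quality tags; the pipeline's actual tags are not documented here.
QUALITY_TAGS = "masterpiece, best quality, highly detailed"


def split_uncen(prompt: str) -> Tuple[bool, str]:
    """Return (is_uncensored, text) with the prefix removed when present."""
    parts = prompt.split(None, 1)  # first word, then the rest
    if len(parts) == 2 and parts[0].lower() == UNCEN_PREFIX:
        return True, parts[1].strip()
    return False, prompt.strip()


def build_image_job(prompt: str) -> Dict[str, object]:
    """Pick the checkpoint and decide whether the LLM refines the prompt."""
    uncensored, text = split_uncen(prompt)
    if uncensored:
        return {
            "checkpoint": "juggernautXL_v9.safetensors",
            "prompt": f"{text}, {QUALITY_TAGS}",
            "refine_with_llm": False,  # sent to SD as typed, tags appended
        }
    return {
        "checkpoint": "sd_xl_base_1.0.safetensors",
        "prompt": text,
        "refine_with_llm": True,  # default mode rewrites the prompt first
    }
```

Note that a prompt such as "uncensored history of art" is not treated as prefixed in this sketch, because only the standalone word `uncen` matches.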

How it works

Default mode:

  1. LLM (gpt-oss) converts the user request into an optimized SD prompt
  2. Ollama models are unloaded from VRAM
  3. SD checkpoint is loaded (SDXL Base)
  4. Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
  5. SD checkpoint is unloaded from VRAM and page cache is dropped

Uncensored mode:

  1. uncen prefix is stripped, quality tags appended directly (no LLM call)
  2. Ollama models are unloaded from VRAM
  3. SD checkpoint is switched to Juggernaut XL v9
  4. Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
  5. SD checkpoint is unloaded from VRAM and page cache is dropped
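
Step 4 in both lists streams the finished image in 4KB chunks. The sketch below shows one way that could look, assuming the JPEG is embedded as a base64 data URI in a Markdown image tag; that encoding, and the names `image_markdown` and `iter_chunks`, are assumptions rather than details confirmed by this README.

```python
import base64
from typing import Iterator


def image_markdown(jpeg_bytes: bytes) -> str:
    """Wrap JPEG bytes as a Markdown image with an inline base64 data URI."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"![generated image](data:image/jpeg;base64,{encoded})"


def iter_chunks(payload: str, size: int = 4096) -> Iterator[str]:
    """Yield the payload in consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(payload), size):
        yield payload[start:start + size]
```

A streaming response would then yield each piece in turn, for example `for piece in iter_chunks(image_markdown(data)): yield piece`. Small chunks keep any single streamed message short, which is presumably why the pipeline does not send the image in one piece.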

VRAM Management

On a single 16GB GPU, large Ollama models and SDXL cannot be loaded simultaneously. The pipeline handles this automatically:

  1. Before image generation: unloads all Ollama models from VRAM via keep_alive: 0
  2. After image generation: unloads SD checkpoint via /sdapi/v1/unload-checkpoint and drops Linux page cache
  3. Ollama reloads the model on the next chat request (~10-15s warm-up)
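
The two unload calls can be sketched as plain HTTP requests. The example assumes the unload is sent to Ollama's `/api/generate` endpoint with `keep_alive` set to 0 and no prompt, which is the commonly documented way to evict a model; the helper names are hypothetical, and the requests are only constructed here, not sent.

```python
import json
import urllib.request


def ollama_unload_request(ollama_url: str, model: str) -> urllib.request.Request:
    """Build a request asking Ollama to drop `model` from memory immediately."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{ollama_url.rstrip('/')}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sd_unload_request(sd_url: str) -> urllib.request.Request:
    """Build a request asking Stable Diffusion to unload its checkpoint."""
    return urllib.request.Request(
        f"{sd_url.rstrip('/')}/sdapi/v1/unload-checkpoint",
        data=b"",
        method="POST",
    )
```

To actually send one, pass it to `urllib.request.urlopen(request, timeout=30)`; the timeout value is an arbitrary choice for this example.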

If Ollama fails to load after image generation with a memory error, manually clear the page cache:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

Architecture

User Message
    │
    ├─ "uncen" prefix? ─────────────── → Juggernaut XL v9 (direct, no search)
    │
    ├─ Image uploaded? ──────────────── → llama3.2-vision:11b
    │
    ├─ AI Classifier (qwen2.5:7b)
    │       │
    │       ├─ coding ──────────────── → qwen2.5-coder:14b
    │       ├─ diagram ─────────────── → qwen2.5-coder:14b (Mermaid)
    │       ├─ reasoning ───────────── → gpt-oss:120b (FI/EN system prompt)
    │       ├─ image_generation ────── → gpt-oss:120b (refine) → SDXL Base
    │       └─ general ─────────────── → gpt-oss:120b
    │
    ├─ Heuristic Search Override
    │       │
    │       └─ Brave Search + page fetch (if needed)
    │
    └─ Stream response (with thinking tokens)
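
The order of checks in the diagram can be sketched as a single function. The time-sensitive word list is hypothetical (the real heuristics are not listed in this README), and skipping search for uploaded images is an assumption based on how the diagram is drawn.

```python
import re
from typing import Callable, Dict

# Hypothetical markers for time-sensitive questions.
TIME_SENSITIVE = re.compile(r"\b(today|latest|current|news)\b", re.IGNORECASE)


def plan(message: str, has_image: bool,
         classify: Callable[[str], str]) -> Dict[str, object]:
    """Decide the route and whether a web search should run first."""
    words = message.split(None, 1)
    # 1. The uncen prefix short-circuits everything, including search.
    if words and words[0].lower() == "uncen":
        return {"route": "uncensored_image", "search": False}
    # 2. An uploaded image goes to the vision model (assumed: no search).
    if has_image:
        return {"route": "vision", "search": False}
    # 3. Otherwise classify, then let the heuristic force a search if needed.
    category = classify(message)
    return {"route": category,
            "search": bool(TIME_SENSITIVE.search(message))}
```

The `classify` argument is any callable that returns a category label, so the AI classifier and the keyword fallback can be swapped without touching the ordering logic.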

Files

File                 Description
llm_router_v3.py     Main pipeline (gpt-oss:120b)
llm_router-20b.py    Lighter pipeline variant (gpt-oss:20b)
setup-sd.sh          Stable Diffusion Forge install script (Ubuntu 22.04)
setup-sd-service.sh  systemd service creation script

License

MIT
