LLM Router Pipeline for Open WebUI
An intelligent prompt classification and routing pipeline for Open WebUI. Classifies user prompts using AI (qwen2.5:7b) and routes them to specialized Ollama models, with integrated Brave web search, image generation via Stable Diffusion, and full Finnish/English bilingual support.
Features
- AI-powered prompt classification with keyword-based fallback
- Model routing — coding, diagram, reasoning, vision, image generation, and general categories
- Brave web search with full page content fetching (top 3 results scraped)
- Heuristic search overrides — safety net that forces search for time-sensitive or factual questions
- Image generation via AUTOMATIC1111/Forge (Stable Diffusion XL) with LLM-refined prompts
- Uncensored image generation: prefix any prompt with uncen to bypass all classification/search and generate directly with Juggernaut XL v9
- VRAM management: automatically juggles GPU memory between Ollama and Stable Diffusion
- Bilingual — detects Finnish and forces responses in the correct language
- Thinking/reasoning display — streams model thinking tokens in collapsible blocks
- Real-time search status — shows which URLs are being fetched as search runs
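Two of the heuristics in the list above, the search override and Finnish detection, can be sketched as follows. This is only an illustration: the trigger words, the Finnish word list, and the function names `needs_search` and `detect_language` are assumptions, since this README does not show the pipeline's actual rules.

```python
import re

# Hypothetical trigger words for time-sensitive or factual questions (EN + FI),
# plus any four-digit year starting with 20.
SEARCH_PATTERN = re.compile(
    r"\b(today|tonight|latest|current(?:ly)?|news|price|weather|"
    r"tänään|uusin|uutiset|sää|hinta)\b|\b20\d{2}\b",
    re.IGNORECASE,
)

# Hypothetical list of common Finnish words that are not English words.
FINNISH_WORDS = {
    "ja", "ei", "että", "mikä", "miten", "kuka", "missä",
    "miksi", "onko", "voitko", "kerro", "luo", "kuva",
}


def needs_search(prompt: str) -> bool:
    """Return True when the prompt looks time-sensitive enough to force a search."""
    return SEARCH_PATTERN.search(prompt) is not None


def detect_language(prompt: str) -> str:
    """Return 'fi' for prompts that look Finnish, otherwise 'en'."""
    lowered = prompt.lower()
    words = re.findall(r"[a-zåäö]+", lowered)
    hits = sum(1 for word in words if word in FINNISH_WORDS)
    has_umlaut = any(ch in "äö" for ch in lowered)
    return "fi" if hits > 0 or has_umlaut else "en"
```

A rule set like this is deliberately crude: it acts as a safety net behind the AI classifier rather than as a replacement for it.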
Model Routing
| Category | Model (120B) | Model (20B) | Trigger |
|---|---|---|---|
| coding | qwen2.5-coder:14b | qwen2.5-coder:14b | User asks to write/fix/debug code |
| diagram | qwen2.5-coder:14b | qwen2.5-coder:14b | Mermaid, flowchart, UML requests |
| reasoning (FI) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (Finnish) |
| reasoning (EN) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (English) |
| image generation | gpt-oss:120b + SDXL | gpt-oss:20b + SDXL | "generate an image", "luo kuva" |
| uncensored image | dolphin-mistral:7b + Juggernaut XL v9 | dolphin-mistral:7b + Juggernaut XL v9 | Prompt starts with uncen |
| vision | llama3.2-vision:11b | llama3.2-vision:11b | User uploads an image |
| general | gpt-oss:120b | gpt-oss:20b | Everything else |
Two pipeline variants are provided:
- llm_router_v3.py: uses gpt-oss:120b (higher quality, more VRAM/RAM)
- llm_router-20b.py: uses gpt-oss:20b (lighter, better for constrained hardware)
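The routing table can be expressed as a small lookup structure. The sketch below copies the model names from the table; the names `ROUTES` and `model_for`, and the choice to send unknown categories to the general model, are assumptions rather than the pipeline's actual code.

```python
# Category -> chat model for the 120B variant, copied from the routing table.
ROUTES = {
    "120b": {
        "coding": "qwen2.5-coder:14b",
        "diagram": "qwen2.5-coder:14b",
        "reasoning": "gpt-oss:120b",
        "image_generation": "gpt-oss:120b",
        "vision": "llama3.2-vision:11b",
        "general": "gpt-oss:120b",
    },
}

# The 20B variant differs only where gpt-oss:120b is used.
ROUTES["20b"] = {
    category: model.replace("gpt-oss:120b", "gpt-oss:20b")
    for category, model in ROUTES["120b"].items()
}


def model_for(category: str, variant: str = "120b") -> str:
    """Return the model for a category; unknown categories use 'general'."""
    table = ROUTES[variant]
    return table.get(category, table["general"])
```

For example, `model_for("reasoning", "20b")` yields `gpt-oss:20b`, while the coding and vision models are the same in both variants.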
Prerequisites
- Ubuntu 22.04 LTS (tested)
- NVIDIA GPU with 16GB+ VRAM (tested on RTX 2000 Ada)
- Open WebUI running in Docker with pipelines enabled
- Ollama installed natively with models pulled:
ollama pull qwen2.5:7b
ollama pull qwen2.5-coder:14b
ollama pull gpt-oss:120b  # or gpt-oss:20b for the lighter variant
ollama pull llama3.2-vision:11b
ollama pull dolphin-mistral:7b  # uncensored model for image prompt refinement
- Brave Search API key (free tier: https://brave.com/search/api/)
Setup
1. Deploy the Pipeline
Copy your chosen pipeline file to the Open WebUI pipelines directory:
cp llm_router_v3.py ~/ai-stack/pipelines/
# or for the 20B variant:
cp llm_router-20b.py ~/ai-stack/pipelines/
Restart the pipelines container:
docker restart pipelines
2. Configure Valves in Open WebUI
Go to Admin Panel > Pipelines in Open WebUI and configure:
| Setting | Description | Default |
|---|---|---|
| ollama_url | Ollama API URL | http://ollama:11434 |
| sd_url | Stable Diffusion API URL | http://172.18.0.1:7860 |
| brave_api_key | Brave Search API key | (from env BRAVE_API_KEY) |
| sd_width / sd_height | Generated image dimensions | 1024 x 1024 |
| sd_steps | Sampling steps | 25 |
| sd_cfg_scale | CFG scale | 7.0 |
| brave_max_results | Number of search results | 6 |
| use_ai_classifier | Use AI vs keyword-only classification | true |
| show_routing_info | Show routing banner in responses | true |
| search_context_max_chars | Max search context size | 12000 |
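For orientation, the settings and defaults above can be mirrored in code roughly as follows. Open WebUI pipelines usually declare valves as a Pydantic model; a standard-library dataclass stands in here so the sketch runs without dependencies. The `truncate_context` helper is hypothetical and only illustrates how `search_context_max_chars` might be applied.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Valves:
    """Stand-in for the pipeline's valve settings, using the documented defaults."""

    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"
    # Read from the environment; empty string when BRAVE_API_KEY is not set.
    brave_api_key: str = field(
        default_factory=lambda: os.getenv("BRAVE_API_KEY", "")
    )
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 25
    sd_cfg_scale: float = 7.0
    brave_max_results: int = 6
    use_ai_classifier: bool = True
    show_routing_info: bool = True
    search_context_max_chars: int = 12000


def truncate_context(text: str, valves: Valves) -> str:
    """Cap the search context at the configured size (hypothetical helper)."""
    return text[: valves.search_context_max_chars]
```

Individual values can be overridden at construction time, for example `Valves(sd_steps=30)`.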
3. Set Up Stable Diffusion (Image Generation)
Skip this section if you don't need image generation.
Install Forge (AUTOMATIC1111 fork)
# Install system dependencies
sudo apt-get update
sudo apt-get install -y git wget python3-venv python3-pip \
libgl1 libglib2.0-0 libsm6 libxrender1 libxext6 libffi-dev libssl-dev
# Clone Forge
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git ~/stable-diffusion-webui
cd ~/stable-diffusion-webui
# Download SDXL model (~6.9GB)
mkdir -p models/Stable-diffusion
wget -O models/Stable-diffusion/sd_xl_base_1.0.safetensors \
"https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"
# Download Juggernaut XL v9 for uncensored image generation (~6.6GB)
wget -O models/Stable-diffusion/juggernautXL_v9.safetensors \
"https://huggingface.co/RunDiffusion/Juggernaut-XL-v9/resolve/main/Juggernaut-XL_v9_RunDiffusionPhoto_v2.safetensors"
Fix Python 3.10 build issues (Ubuntu 22.04)
The first launch will create a Python venv and install dependencies. CLIP will fail to build due to a pkg_resources issue on Python 3.10. Fix it:
cd ~/stable-diffusion-webui
# First launch creates the venv — run it once, let it fail, then fix:
./webui.sh --api --listen --xformers --no-half-vae || true
# Fix CLIP build issue
venv/bin/pip install "setuptools<70" wheel
venv/bin/pip install --no-build-isolation \
https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip
# Launch again
./webui.sh --api --listen --xformers --no-half-vae
Select the default SDXL model
Once the UI is running, open it in a browser and select sd_xl_base_1.0 from the checkpoint dropdown. Or via API:
curl -X POST http://localhost:7860/sdapi/v1/options \
-H "Content-Type: application/json" \
-d '{"sd_model_checkpoint": "sd_xl_base_1.0.safetensors"}'
The pipeline automatically switches between models at runtime — sd_xl_base_1.0 for normal generation, juggernautXL_v9 when the uncen prefix is used.
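A runtime switch of this kind can be done with the same /sdapi/v1/options endpoint used in the curl command. The following is a minimal sketch, not the pipeline's own code: the function names are hypothetical, and the Juggernaut checkpoint name assumes the file name used in the download step.

```python
import json
import urllib.request

SD_URL = "http://172.18.0.1:7860"  # default sd_url; adjust to your setup


def checkpoint_for(uncensored: bool) -> str:
    """Pick the checkpoint file name for the requested mode."""
    if uncensored:
        return "juggernautXL_v9.safetensors"
    return "sd_xl_base_1.0.safetensors"


def build_switch_request(checkpoint: str, sd_url: str = SD_URL) -> urllib.request.Request:
    """Build the POST request that selects a checkpoint via /sdapi/v1/options."""
    body = json.dumps({"sd_model_checkpoint": checkpoint}).encode("utf-8")
    return urllib.request.Request(
        f"{sd_url}/sdapi/v1/options",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def switch_checkpoint(uncensored: bool, sd_url: str = SD_URL, timeout: float = 120.0) -> None:
    """Send the request; loading an SDXL checkpoint can take a while."""
    request = build_switch_request(checkpoint_for(uncensored), sd_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()
```

Separating request construction from sending keeps the payload easy to inspect without a running Stable Diffusion server.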
Create a systemd service
Using the provided script:
chmod +x setup-sd-service.sh
sudo ./setup-sd-service.sh
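Or manually. The unquoted heredoc expands $USER and $HOME to the invoking user's values when the command runs; substitute explicit values if the service should run as a different user: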
sudo tee /etc/systemd/system/stable-diffusion.service > /dev/null <<EOF
[Unit]
Description=AUTOMATIC1111 Stable Diffusion WebUI
After=network.target
[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/stable-diffusion-webui
ExecStart=$HOME/stable-diffusion-webui/webui.sh --api --listen --xformers --no-half-vae --medvram-sdxl
Restart=on-failure
RestartSec=10
Environment=HOME=$HOME
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now stable-diffusion
Verify
# Check the service is running
sudo systemctl status stable-diffusion
# Check available models (should list both sd_xl_base and juggernautXL)
curl -s http://localhost:7860/sdapi/v1/sd-models | python3 -m json.tool
4. Network Configuration
The pipeline runs inside Open WebUI's Docker container and needs to reach services on the host:
| Service | URL from container | Notes |
|---|---|---|
| Ollama | http://ollama:11434 | Docker DNS or host networking |
| Stable Diffusion | http://172.18.0.1:7860 | Docker bridge gateway IP |
To find your bridge gateway IP:
docker network inspect <your_network> --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'
Update SD_URL in the pipeline file if your gateway IP differs from 172.18.0.1.
Verify connectivity from inside the container:
docker exec open-webui curl -s http://172.18.0.1:7860/sdapi/v1/sd-models
docker exec open-webui curl -s http://ollama:11434/api/tags | head -c 100
Image Generation
Default mode
Any prompt classified as image_generation (e.g. "generate an image of a cat in space") uses SDXL Base 1.0. The LLM refines the user's request into an optimized Stable Diffusion prompt with quality boosters, then calls the A1111 API.
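The A1111 API call in default mode amounts to a POST to /sdapi/v1/txt2img with the refined prompt and the sampling settings. A minimal sketch of the request payload and of decoding the response follows; the function names are hypothetical, and the defaults are taken from the valve settings rather than from the pipeline source.

```python
import base64


def build_txt2img_payload(
    refined_prompt: str,
    width: int = 1024,
    height: int = 1024,
    steps: int = 25,
    cfg_scale: float = 7.0,
) -> dict:
    """Build the JSON body for POST /sdapi/v1/txt2img."""
    return {
        "prompt": refined_prompt,
        "width": width,
        "height": height,
        "steps": steps,
        "cfg_scale": cfg_scale,
    }


def first_image_bytes(response: dict) -> bytes:
    """Decode the first image; txt2img returns base64 strings under 'images'."""
    return base64.b64decode(response["images"][0])
```

The decoded bytes are PNG data by default, which is why a separate compression step is needed before the image is streamed back to the chat.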
Uncensored mode
Prefix any prompt with uncen to bypass all classification, web search, and routing — the pipeline goes straight to image generation using Juggernaut XL v9:
uncen a beautiful sunset over the ocean
uncen portrait of a warrior in golden armor
The uncen prefix is stripped and the prompt is refined by dolphin-mistral:7b (an uncensored LLM that won't refuse any content) instead of gpt-oss. The pipeline switches the SD checkpoint to Juggernaut XL v9 automatically. If dolphin-mistral is unavailable, it falls back to sending the user's text directly with quality tags appended.
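Prefix handling and the fallback path might look like the sketch below. Two things here are assumptions: the prefix is matched as a whole word (so "uncensored ..." does not trigger the mode), and the quality tags are placeholders, since the README does not list the tags the pipeline appends.

```python
import re
from typing import Tuple

# "uncen" as a whole word at the start, followed by optional separators.
UNCEN_RE = re.compile(r"^\s*uncen\b[\s:,-]*(.*)$", re.IGNORECASE | re.DOTALL)

FALLBACK_TAGS = "highly detailed, sharp focus"  # hypothetical quality tags


def split_uncen(prompt: str) -> Tuple[bool, str]:
    """Return (is_uncensored, prompt_without_prefix)."""
    match = UNCEN_RE.match(prompt)
    if match is None:
        return False, prompt
    return True, match.group(1).strip()


def fallback_sd_prompt(user_text: str) -> str:
    """Used when the refining model is unavailable: raw text plus quality tags."""
    return f"{user_text}, {FALLBACK_TAGS}"
```

With this sketch, `split_uncen("uncen portrait of a warrior in golden armor")` returns the flag plus the bare description, which is then either refined by the LLM or passed through `fallback_sd_prompt`.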
How it works
Default mode:
- LLM (gpt-oss) converts the user request into an optimized SD prompt
- Ollama models are unloaded from VRAM
- SD checkpoint is loaded (SDXL Base)
- Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
- SD checkpoint is unloaded from VRAM and page cache is dropped
Uncensored mode:
- uncen prefix is stripped
- dolphin-mistral:7b refines the prompt into optimized SD tags (no refusal)
- Ollama models are unloaded from VRAM
- SD checkpoint is switched to Juggernaut XL v9
- Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
- SD checkpoint is unloaded from VRAM and page cache is dropped
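The streaming step in both modes can be illustrated with a small generator that slices the encoded image into 4KB pieces. This is a sketch under assumptions: it presumes the image is delivered as a base64 data URI inside a Markdown image tag, which the README does not confirm, and it leaves out the PNG to JPEG conversion because that needs an imaging library.

```python
from typing import Iterator

CHUNK_SIZE = 4096  # 4KB, matching the chunk size described in the steps


def image_markdown(jpeg_b64: str) -> str:
    """Wrap base64 JPEG data in a Markdown image using a data URI (assumed format)."""
    return f"![generated image](data:image/jpeg;base64,{jpeg_b64})"


def iter_chunks(payload: str, size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield consecutive slices of at most `size` characters."""
    for start in range(0, len(payload), size):
        yield payload[start:start + size]
```

Sending the image in small chunks keeps each streamed message short, and concatenating the chunks on the receiving side restores the original string exactly.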
VRAM Management
On a single 16GB GPU, large Ollama models and SDXL cannot be loaded simultaneously. The pipeline handles this automatically:
- Before image generation: unloads all Ollama models from VRAM via keep_alive: 0
- After image generation: unloads the SD checkpoint via /sdapi/v1/unload-checkpoint and drops the Linux page cache
- Ollama reloads the model on the next chat request (~10-15s warm-up)
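The two unload calls can be sketched with plain HTTP requests. The helper names below are hypothetical, and listing loaded models through Ollama's /api/ps endpoint is an assumption about how the pipeline finds what to unload; dropping the page cache is not shown because it requires root on the host.

```python
import json
import urllib.request


def _post_json(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_unload_request(model: str, ollama_url: str = "http://ollama:11434") -> urllib.request.Request:
    """A generate call with keep_alive 0 asks Ollama to unload the model."""
    return _post_json(f"{ollama_url}/api/generate", {"model": model, "keep_alive": 0})


def sd_unload_request(sd_url: str = "http://172.18.0.1:7860") -> urllib.request.Request:
    """Ask the Stable Diffusion API to release the loaded checkpoint."""
    return _post_json(f"{sd_url}/sdapi/v1/unload-checkpoint", {})


def free_vram_for_sd(ollama_url: str = "http://ollama:11434") -> None:
    """Unload every model Ollama currently reports as loaded."""
    with urllib.request.urlopen(f"{ollama_url}/api/ps", timeout=10) as resp:
        names = [m["name"] for m in json.load(resp).get("models", [])]
    for name in names:
        request = ollama_unload_request(name, ollama_url)
        with urllib.request.urlopen(request, timeout=60) as resp:
            resp.read()
```

Calling `free_vram_for_sd()` before generation and sending `sd_unload_request()` afterwards mirrors the before/after steps in the list.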
If Ollama fails to load after image generation with a memory error, manually clear the page cache:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
Architecture
User Message
│
├─ "uncen" prefix? ─────────────── → dolphin-mistral:7b (refine) → Juggernaut XL v9
│
├─ Image uploaded? ──────────────── → llama3.2-vision:11b
│
├─ AI Classifier (qwen2.5:7b)
│ │
│ ├─ coding ──────────────── → qwen2.5-coder:14b
│ ├─ diagram ─────────────── → qwen2.5-coder:14b (Mermaid)
│ ├─ reasoning ───────────── → gpt-oss:120b (FI/EN system prompt)
│ ├─ image_generation ────── → gpt-oss:120b (refine) → SDXL Base
│ └─ general ─────────────── → gpt-oss:120b
│
├─ Heuristic Search Override
│ │
│ └─ Brave Search + page fetch (if needed)
│
└─ Stream response (with thinking tokens)
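The decision order in the diagram can be written as a short dispatch function. This is a sketch only: the keyword lists, the `route` and `keyword_classify` names, and the whole-word match on the prefix are assumptions, and the AI classifier is represented by an optional callable rather than a real call to qwen2.5:7b.

```python
import re
from typing import Callable, Optional

# Hypothetical keyword lists for the fallback classifier, checked in order.
KEYWORDS = {
    "image_generation": ("generate an image", "luo kuva"),
    "diagram": ("mermaid", "flowchart", "uml"),
    "coding": ("code", "debug", "function"),
    "reasoning": ("analyze", "compare", "strategy"),
}


def keyword_classify(prompt: str) -> str:
    """Keyword-based fallback: first category with a matching keyword wins."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "general"


def route(
    prompt: str,
    has_image: bool = False,
    ai_classify: Optional[Callable[[str], str]] = None,
) -> str:
    """Apply the checks in the same order as the architecture diagram."""
    if re.match(r"\s*uncen\b", prompt, re.IGNORECASE):
        return "uncensored_image"
    if has_image:
        return "vision"
    if ai_classify is not None:
        try:
            return ai_classify(prompt)
        except Exception:
            pass  # classifier unavailable: fall back to keywords
    return keyword_classify(prompt)
```

The prefix and upload checks run before any model is called, so those two paths add no classification latency.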
Files
| File | Description |
|---|---|
| llm_router_v3.py | Main pipeline (gpt-oss:120b) |
| llm_router-20b.py | Lighter pipeline variant (gpt-oss:20b) |
| setup-sd.sh | Stable Diffusion Forge install script (Ubuntu 22.04) |
| setup-sd-service.sh | systemd service creation script |
License
MIT