LLM Router Pipeline for Open WebUI
A prompt classification and routing pipeline for Open WebUI. It classifies each user prompt with a small LLM (qwen2.5:7b) and routes it to a specialized Ollama model, with integrated Brave web search, image generation via Stable Diffusion, and Finnish/English bilingual support.
Features
- AI-powered prompt classification with keyword-based fallback
- Model routing — coding, diagram, reasoning, vision, image generation, and general categories
- Brave web search with full page content fetching (top 3 results scraped)
- Heuristic search overrides — safety net that forces search for time-sensitive or factual questions
- Image generation via AUTOMATIC1111/Forge (Stable Diffusion XL) with LLM-refined prompts
- VRAM management — automatically juggles GPU memory between Ollama and Stable Diffusion
- Bilingual — detects Finnish and forces responses in the correct language
- Thinking/reasoning display — streams model thinking tokens in collapsible blocks
- Real-time search status — shows which URLs are being fetched as search runs
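The classification with keyword fallback described in the feature list could look roughly like the sketch below. The keyword lists and the function names `classify_by_keywords` and `classify` are assumptions for illustration; the actual keywords and prompts used by the pipeline are not shown in this README.

```python
# Hypothetical keyword lists; the real pipeline's keywords are not documented here.
KEYWORDS = {
    "image_generation": ["generate an image", "draw a picture", "luo kuva"],
    "diagram": ["mermaid", "flowchart", "uml"],
    "coding": ["debug", "python", "function", "stack trace", "koodi"],
    "reasoning": ["compare", "analyze", "strategy", "vertaa", "analysoi"],
}
VALID_CATEGORIES = set(KEYWORDS) | {"general"}


def classify_by_keywords(prompt: str) -> str:
    """Return the first category whose keyword appears in the prompt."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "general"


def classify(prompt: str, ai_classifier=None) -> str:
    """Try the AI classifier first; fall back to keywords if it fails
    or returns a label outside the known categories."""
    if ai_classifier is not None:
        try:
            label = ai_classifier(prompt).strip().lower()
            if label in VALID_CATEGORIES:
                return label
        except Exception:
            pass  # classifier unreachable or misbehaving: use the fallback
    return classify_by_keywords(prompt)
```

Here `ai_classifier` stands in for any callable that sends the prompt to qwen2.5:7b and returns a category label. Passing `None` corresponds to keyword-only classification (`use_ai_classifier` set to false).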
Model Routing
| Category | Model (120B variant) | Model (20B variant) | Trigger |
|---|---|---|---|
| coding | qwen2.5-coder:14b | qwen2.5-coder:14b | User asks to write/fix/debug code |
| diagram | qwen2.5-coder:14b | qwen2.5-coder:14b | Mermaid, flowchart, UML requests |
| reasoning (FI) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (Finnish) |
| reasoning (EN) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (English) |
| image generation | gpt-oss:120b + SDXL | gpt-oss:20b + SDXL | "generate an image", "luo kuva" |
| vision | llama3.2-vision:11b | llama3.2-vision:11b | User uploads an image |
| general | gpt-oss:120b | gpt-oss:20b | Everything else |
Two pipeline variants are provided:
- llm_router_v3.py: uses gpt-oss:120b (higher quality, more VRAM/RAM)
- llm_router-20b.py: uses gpt-oss:20b (lighter, better for constrained hardware)
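The routing table can be expressed as a plain category-to-model mapping, with the two variants differing only in the large model. This is a minimal sketch; the names `build_routes` and `pick_model` are hypothetical and the pipeline files may structure this differently.

```python
# Models shared by both variants, taken from the routing table.
SHARED_ROUTES = {
    "coding": "qwen2.5-coder:14b",
    "diagram": "qwen2.5-coder:14b",
    "vision": "llama3.2-vision:11b",
}
# Categories served by the large gpt-oss model.
LARGE_MODEL_CATEGORIES = ("reasoning", "image_generation", "general")


def build_routes(large_model: str) -> dict:
    """Build the category -> model table for one pipeline variant."""
    routes = dict(SHARED_ROUTES)
    routes.update({category: large_model for category in LARGE_MODEL_CATEGORIES})
    return routes


ROUTES_120B = build_routes("gpt-oss:120b")
ROUTES_20B = build_routes("gpt-oss:20b")


def pick_model(category: str, routes: dict) -> str:
    """Unknown categories fall through to the general model."""
    return routes.get(category, routes["general"])
```

For example, `pick_model("diagram", ROUTES_20B)` yields `qwen2.5-coder:14b`, while `pick_model("reasoning", ROUTES_20B)` yields `gpt-oss:20b`.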
Prerequisites
- Ubuntu 22.04 LTS (tested)
- NVIDIA GPU with 16GB+ VRAM (tested on RTX 2000 Ada)
- Open WebUI running in Docker with pipelines enabled
- Ollama installed natively with models pulled:
ollama pull qwen2.5:7b
ollama pull qwen2.5-coder:14b
ollama pull gpt-oss:120b  # or gpt-oss:20b for the lighter variant
ollama pull llama3.2-vision:11b
- Brave Search API key (free tier: https://brave.com/search/api/)
Setup
1. Deploy the Pipeline
Copy your chosen pipeline file to the Open WebUI pipelines directory:
cp llm_router_v3.py ~/ai-stack/pipelines/
# or for the 20B variant:
cp llm_router-20b.py ~/ai-stack/pipelines/
Restart the pipelines container:
docker restart pipelines
2. Configure Valves in Open WebUI
Go to Admin Panel > Pipelines in Open WebUI and configure:
| Setting | Description | Default |
|---|---|---|
| ollama_url | Ollama API URL | http://ollama:11434 |
| sd_url | Stable Diffusion API URL | http://172.18.0.1:7860 |
| brave_api_key | Brave Search API key | (from env BRAVE_API_KEY) |
| sd_width / sd_height | Generated image dimensions | 1024 x 1024 |
| sd_steps | Sampling steps | 25 |
| sd_cfg_scale | CFG scale | 7.0 |
| brave_max_results | Number of search results | 6 |
| use_ai_classifier | Use AI vs keyword-only classification | true |
| show_routing_info | Show routing banner in responses | true |
| search_context_max_chars | Max search context size | 12000 |
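For reference, the defaults in the table above map onto a settings object along these lines. Open WebUI pipelines usually declare valves as a pydantic model inside the pipeline class; this sketch uses a standard-library dataclass instead so it runs without extra dependencies, and the helper `trim_context` is a hypothetical illustration of how `search_context_max_chars` might be applied.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Valves:
    """Defaults copied from the settings table; field layout is an assumption."""
    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"
    # Read from the environment at construction time, empty if unset.
    brave_api_key: str = field(
        default_factory=lambda: os.environ.get("BRAVE_API_KEY", "")
    )
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 25
    sd_cfg_scale: float = 7.0
    brave_max_results: int = 6
    use_ai_classifier: bool = True
    show_routing_info: bool = True
    search_context_max_chars: int = 12000


def trim_context(text: str, valves: Valves) -> str:
    """Cap the search context passed to the model (hypothetical helper)."""
    return text[: valves.search_context_max_chars]
```

Values set in the Admin Panel would override these defaults at runtime.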
3. Set Up Stable Diffusion (Image Generation)
Skip this section if you don't need image generation.
Install Forge (AUTOMATIC1111 fork)
# Install system dependencies
sudo apt-get update
sudo apt-get install -y git wget python3-venv python3-pip \
libgl1 libglib2.0-0 libsm6 libxrender1 libxext6 libffi-dev libssl-dev
# Clone Forge
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git ~/stable-diffusion-webui
cd ~/stable-diffusion-webui
# Download SDXL model (~6.9GB)
mkdir -p models/Stable-diffusion
wget -O models/Stable-diffusion/sd_xl_base_1.0.safetensors \
"https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"
Fix Python 3.10 build issues (Ubuntu 22.04)
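On Python 3.10 the first launch can fail while building CLIP. The first launch also creates the venv, so run it once, let it fail, then install the CLIP dependencies into the venv and launch again: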
cd ~/stable-diffusion-webui
# First launch creates the venv — run it once, let it fail, then fix:
./webui.sh --api --listen --xformers --no-half-vae || true
# Fix CLIP build issue
venv/bin/pip install "setuptools<70" wheel
venv/bin/pip install --no-build-isolation \
https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip
# Launch again
./webui.sh --api --listen --xformers --no-half-vae
Select SDXL model
Once the UI is running, open it in a browser and select sd_xl_base_1.0 from the checkpoint dropdown, or set it via the API:
curl -X POST http://localhost:7860/sdapi/v1/options \
-H "Content-Type: application/json" \
-d '{"sd_model_checkpoint": "sd_xl_base_1.0.safetensors"}'
Create a systemd service
chmod +x setup-sd-service.sh
sudo ./setup-sd-service.sh
Or manually:
sudo tee /etc/systemd/system/stable-diffusion.service > /dev/null <<EOF
[Unit]
Description=AUTOMATIC1111 Stable Diffusion WebUI
After=network.target
[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/stable-diffusion-webui
ExecStart=$HOME/stable-diffusion-webui/webui.sh --api --listen --xformers --no-half-vae --medvram-sdxl
Restart=on-failure
RestartSec=10
Environment=HOME=$HOME
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now stable-diffusion
Verify
curl -s http://localhost:7860/sdapi/v1/sd-models | python3 -m json.tool
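Once the model list comes back, a test image can be requested through the same API. The sketch below posts to the `/sdapi/v1/txt2img` endpoint using the default dimensions, steps, and CFG scale from the valves table. The function names are hypothetical, and the pipeline itself may send additional fields (for example a negative prompt or sampler) that are not documented here.

```python
import base64
import json
import urllib.request


def build_txt2img_payload(prompt, width=1024, height=1024, steps=25, cfg_scale=7.0):
    """Assemble the JSON body for a txt2img request."""
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "steps": steps,
        "cfg_scale": cfg_scale,
    }


def generate_image(sd_url, prompt, timeout=300):
    """POST the payload and return the first image as PNG bytes."""
    request = urllib.request.Request(
        sd_url.rstrip("/") + "/sdapi/v1/txt2img",
        data=json.dumps(build_txt2img_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # The API returns images as base64-encoded strings.
    return base64.b64decode(body["images"][0])
```

With the service running, `generate_image("http://localhost:7860", "a red fox in snow")` returns bytes that can be written to a `.png` file.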
4. Network Configuration
The pipeline runs inside Open WebUI's Docker container and needs to reach:
| Service | URL from container | Notes |
|---|---|---|
| Ollama | http://ollama:11434 | Docker DNS or host networking |
| Stable Diffusion | http://172.18.0.1:7860 | Docker bridge gateway IP |
To find your bridge gateway IP:
docker network inspect <your_network> --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'
Verify connectivity from inside the container:
docker exec open-webui curl -s http://172.18.0.1:7860/sdapi/v1/sd-models
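The same check can be scripted in Python, which is convenient when curl is not available inside a container. This is a sketch: the URLs are the defaults from the table above and should be replaced with your own, and `is_reachable` and `check_all` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=3.0):
    """Return True if a GET to the URL succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error, or malformed URL.
        return False


# Default endpoints; adjust to match your network.
ENDPOINTS = {
    "Ollama": "http://ollama:11434/api/tags",
    "Stable Diffusion": "http://172.18.0.1:7860/sdapi/v1/sd-models",
}


def check_all(endpoints=ENDPOINTS, timeout=3.0):
    """Map each service name to whether it answered."""
    return {name: is_reachable(url, timeout) for name, url in endpoints.items()}
```

Running `print(check_all())` from inside the container shows which of the two services the pipeline can reach.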
VRAM Management
On a single 16GB GPU, gpt-oss:120b and SDXL cannot be loaded simultaneously. The pipeline handles this automatically:
- Before image generation: unloads all Ollama models from VRAM
- After image generation: unloads SD checkpoint from VRAM and drops Linux page cache
- Ollama reloads the model on the next chat request (~10-15s warm-up)
If Ollama still fails to load a model after image generation and reports a memory error, clear the page cache manually:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
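The unload steps listed above can be approximated with the public HTTP APIs of both services. This sketch assumes Ollama's documented behavior of unloading a model when a generate request carries `keep_alive: 0`, its `/api/ps` endpoint for listing loaded models, and the `/sdapi/v1/unload-checkpoint` endpoint of the Stable Diffusion WebUI. The function names are hypothetical and the pipeline's own implementation may differ.

```python
import json
import urllib.request

JSON_HEADERS = {"Content-Type": "application/json"}


def ollama_unload_request(ollama_url, model):
    """Build a request asking Ollama to drop one model from memory."""
    payload = {"model": model, "keep_alive": 0}
    return urllib.request.Request(
        ollama_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=JSON_HEADERS,
        method="POST",
    )


def sd_unload_request(sd_url):
    """Build a request asking Stable Diffusion to unload its checkpoint."""
    return urllib.request.Request(
        sd_url.rstrip("/") + "/sdapi/v1/unload-checkpoint",
        data=b"{}",
        headers=JSON_HEADERS,
        method="POST",
    )


def loaded_ollama_models(ollama_url, timeout=10):
    """List the names of models Ollama currently has loaded."""
    with urllib.request.urlopen(ollama_url.rstrip("/") + "/api/ps", timeout=timeout) as resp:
        return [entry["name"] for entry in json.load(resp).get("models", [])]


def free_vram_for_image(ollama_url, timeout=60):
    """Unload every loaded Ollama model before image generation."""
    for model in loaded_ollama_models(ollama_url):
        urllib.request.urlopen(ollama_unload_request(ollama_url, model), timeout=timeout).close()
```

After the image is produced, sending `sd_unload_request(sd_url)` with `urllib.request.urlopen` releases the SDXL checkpoint. Dropping the page cache requires root and is left to the shell command above.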
Architecture
User Message
│
├─ Image uploaded? ──────────────── → llama3.2-vision:11b
│
├─ AI Classifier (qwen2.5:7b)
│ │
│ ├─ coding ──────────────── → qwen2.5-coder:14b
│ ├─ diagram ─────────────── → qwen2.5-coder:14b (Mermaid)
│ ├─ reasoning ───────────── → gpt-oss:120b (FI/EN system prompt)
│ ├─ image_generation ────── → gpt-oss:120b (refine) → SDXL (generate)
│ └─ general ─────────────── → gpt-oss:120b
│
├─ Heuristic Search Override
│ │
│ └─ Brave Search + page fetch (if needed)
│
└─ Stream response (with thinking tokens)
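The decision order in the diagram can be sketched as a single planning function. The cue list in `SEARCH_CUES`, the `classifier_wants_search` flag, and the choice to skip search for image uploads are all assumptions made for illustration; the pipeline's actual heuristics are not listed in this README.

```python
import re

# Hypothetical cues for time-sensitive questions (English and Finnish).
SEARCH_CUES = re.compile(
    r"\b(today|tonight|latest|currently|current|news|price|weather"
    r"|tänään|uusin|sää)\b",
    re.IGNORECASE,
)


def plan(prompt, has_image, category, classifier_wants_search=False):
    """Decide the route and whether to run a web search.

    Order follows the diagram: image upload first, then the classifier's
    category, then the heuristic search override.
    """
    if has_image:
        # Assumption: uploaded images go straight to the vision model.
        return {"route": "vision", "search": False}
    # The heuristic acts as a safety net: it can force search on,
    # but never switches off a search the classifier asked for.
    search = classifier_wants_search or bool(SEARCH_CUES.search(prompt))
    return {"route": category, "search": search}
```

For instance, `plan("What is the latest Ubuntu release?", False, "general")` enables search even if the classifier did not request it, which is the safety-net behavior the heuristic override is meant to provide.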
Files
| File | Description |
|---|---|
| llm_router_v3.py | Main pipeline (gpt-oss:120b) |
| llm_router-20b.py | Lighter pipeline variant (gpt-oss:20b) |
| setup-sd.sh | Stable Diffusion Forge install script |
| setup-sd-service.sh | systemd service creation script |
License
MIT