# LLM Router Pipeline for Open WebUI

An intelligent prompt classification and routing pipeline for [Open WebUI](https://github.com/open-webui/open-webui). It classifies user prompts with an AI model (qwen2.5:7b) and routes them to specialized Ollama models, with integrated Brave web search, image generation via Stable Diffusion, and full Finnish/English bilingual support.

## Features

- **AI-powered prompt classification** with keyword-based fallback
- **Model routing**: coding, diagram, reasoning, vision, image generation, and general categories
- **Brave web search** with full page content fetching (top 3 results scraped)
- **Heuristic search overrides**: a safety net that forces search for time-sensitive or factual questions
- **Image generation** via AUTOMATIC1111/Forge (Stable Diffusion XL) with LLM-refined prompts
- **Uncensored image generation**: prefix any prompt with `uncen` to bypass all classification/search and generate directly with Juggernaut XL v9
- **VRAM management**: automatically juggles GPU memory between Ollama and Stable Diffusion
- **Bilingual**: detects Finnish and forces responses in the correct language
- **Thinking/reasoning display**: streams model thinking tokens in collapsible blocks
- **Real-time search status**: shows which URLs are being fetched as search runs
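
The bilingual feature needs some way to decide whether a prompt is Finnish. The pipeline's actual detector is not documented here; the following is only a minimal sketch, assuming a heuristic based on the letters `ä`/`ö` and a short list of common Finnish words. `FINNISH_WORDS` and `detect_language` are hypothetical names, not taken from the pipeline.

```python
import re

# Hypothetical word list: a handful of very common Finnish words.
FINNISH_WORDS = {
    "ja", "on", "ei", "että", "mikä", "miten", "luo",
    "kuva", "minä", "sinä", "tämä", "olen",
}


def detect_language(text: str) -> str:
    """Return "fi" if the text looks Finnish, otherwise "en"."""
    lowered = text.lower()
    words = re.findall(r"[a-zåäö]+", lowered)
    if not words:
        return "en"
    hits = sum(1 for word in words if word in FINNISH_WORDS)
    has_finnish_letters = any(ch in "äö" for ch in lowered)
    # One Finnish-specific letter, or two known words, is treated as enough.
    # A single hit is ignored because "on" is also an English word.
    return "fi" if has_finnish_letters or hits >= 2 else "en"
```

A result of `"fi"` would then select a Finnish system prompt so the model answers in Finnish. Short prompts without `ä`/`ö` are the weak spot of a heuristic like this, which is one reason to leave language detection to the AI classifier when it is available.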

## Model Routing

| Category | Model (120B) | Model (20B) | Trigger |
|---|---|---|---|
| coding | qwen2.5-coder:14b | qwen2.5-coder:14b | User asks to write/fix/debug code |
| diagram | qwen2.5-coder:14b | qwen2.5-coder:14b | Mermaid, flowchart, UML requests |
| reasoning (FI) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (Finnish) |
| reasoning (EN) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (English) |
| image generation | gpt-oss:120b + SDXL | gpt-oss:20b + SDXL | "generate an image", "luo kuva" |
| uncensored image | dolphin-mistral:7b + Juggernaut XL v9 | dolphin-mistral:7b + Juggernaut XL v9 | Prompt starts with `uncen` |
| vision | llama3.2-vision:11b | llama3.2-vision:11b | User uploads an image |
| general | gpt-oss:120b | gpt-oss:20b | Everything else |

Two pipeline variants are provided:
- **`llm_router_v3.py`**: uses gpt-oss:120b (higher quality, more VRAM/RAM)
- **`llm_router-20b.py`**: uses gpt-oss:20b (lighter, better for constrained hardware)
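
The routing table can be read as a plain category-to-model mapping plus a keyword fallback for when the AI classifier is unavailable. The sketch below illustrates that idea only; the model names come from the table, but the keyword lists and the function names `classify_by_keywords` and `pick_model` are assumptions, not the pipeline's real code.

```python
# Category -> model for the 120B variant (values taken from the table above).
ROUTES = {
    "coding": "qwen2.5-coder:14b",
    "diagram": "qwen2.5-coder:14b",
    "reasoning": "gpt-oss:120b",
    "image_generation": "gpt-oss:120b",
    "vision": "llama3.2-vision:11b",
    "general": "gpt-oss:120b",
}

# Hypothetical trigger phrases; the real pipeline's lists may differ.
KEYWORDS = {
    "diagram": ("mermaid", "flowchart", "uml"),
    "image_generation": ("generate an image", "luo kuva"),
    "coding": ("debug", "fix this code", "write a function"),
    "reasoning": ("analyze", "compare", "strategy"),
}


def classify_by_keywords(prompt: str, has_image: bool = False) -> str:
    """Keyword-only fallback: first matching category wins."""
    if has_image:
        return "vision"
    text = prompt.lower()
    for category, phrases in KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return category
    return "general"


def pick_model(category: str, variant: str = "120b") -> str:
    """Map a category to a model, swapping in gpt-oss:20b for the light variant."""
    model = ROUTES.get(category, ROUTES["general"])
    if variant == "20b" and model == "gpt-oss:120b":
        return "gpt-oss:20b"
    return model
```

For example, `pick_model(classify_by_keywords("Draw a UML flowchart"))` resolves to `qwen2.5-coder:14b`. Substring matching is deliberately crude here; it is a fallback, not a replacement for the classifier.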

## Prerequisites

- **Ubuntu 22.04 LTS** (tested)
- **NVIDIA GPU** with 16GB+ VRAM (tested on RTX 2000 Ada)
- **Open WebUI** running in Docker with pipelines enabled
- **Ollama** installed natively with models pulled:
  ```bash
  ollama pull qwen2.5:7b
  ollama pull qwen2.5-coder:14b
  ollama pull gpt-oss:120b  # or gpt-oss:20b for the lighter variant
  ollama pull llama3.2-vision:11b
  ollama pull dolphin-mistral:7b  # uncensored model for image prompt refinement
  ```
- **Brave Search API key** (free tier: https://brave.com/search/api/)

## Setup

### 1. Deploy the Pipeline

Copy your chosen pipeline file to the Open WebUI pipelines directory:

```bash
cp llm_router_v3.py ~/ai-stack/pipelines/
# or for the 20B variant:
cp llm_router-20b.py ~/ai-stack/pipelines/
```

Restart the pipelines container:

```bash
docker restart pipelines
```

### 2. Configure Valves in Open WebUI

Go to **Admin Panel > Pipelines** in Open WebUI and configure:

| Setting | Description | Default |
|---|---|---|
| `ollama_url` | Ollama API URL | `http://ollama:11434` |
| `sd_url` | Stable Diffusion API URL | `http://172.18.0.1:7860` |
| `brave_api_key` | Brave Search API key | (from env `BRAVE_API_KEY`) |
| `sd_width` / `sd_height` | Generated image dimensions | 1024 x 1024 |
| `sd_steps` | Sampling steps | 25 |
| `sd_cfg_scale` | CFG scale | 7.0 |
| `brave_max_results` | Number of search results | 6 |
| `use_ai_classifier` | Use AI vs keyword-only classification | true |
| `show_routing_info` | Show routing banner in responses | true |
| `search_context_max_chars` | Max search context size | 12000 |
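
As a rough picture of how these settings fit together, here is a sketch of the defaults as a Python object. Open WebUI pipelines normally declare valves as a pydantic model; this sketch uses a standard-library dataclass instead to stay dependency-free, so treat the class shape and the `truncate_context` helper as assumptions. Only the field names and defaults come from the table.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Valves:
    """Defaults copied from the settings table; structure is illustrative."""
    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"
    # Read at instantiation time so a changed environment is picked up.
    brave_api_key: str = field(
        default_factory=lambda: os.getenv("BRAVE_API_KEY", "")
    )
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 25
    sd_cfg_scale: float = 7.0
    brave_max_results: int = 6
    use_ai_classifier: bool = True
    show_routing_info: bool = True
    search_context_max_chars: int = 12000


def truncate_context(text: str, valves: Valves) -> str:
    """Hypothetical helper: cap search context at the configured size."""
    return text[: valves.search_context_max_chars]
```

With the defaults, up to 6 search results are gathered and their combined text is cut at 12000 characters before it is handed to the model.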

### 3. Set Up Stable Diffusion (Image Generation)

> Skip this section if you don't need image generation.

#### Install Forge (AUTOMATIC1111 fork)

```bash
# Install system dependencies
sudo apt-get update
sudo apt-get install -y git wget python3-venv python3-pip \
    libgl1 libglib2.0-0 libsm6 libxrender1 libxext6 libffi-dev libssl-dev

# Clone Forge
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git ~/stable-diffusion-webui
cd ~/stable-diffusion-webui

# Download SDXL model (~6.9GB)
mkdir -p models/Stable-diffusion
wget -O models/Stable-diffusion/sd_xl_base_1.0.safetensors \
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"

# Download Juggernaut XL v9 for uncensored image generation (~6.6GB)
wget -O models/Stable-diffusion/juggernautXL_v9.safetensors \
    "https://huggingface.co/RunDiffusion/Juggernaut-XL-v9/resolve/main/Juggernaut-XL_v9_RunDiffusionPhoto_v2.safetensors"
```

#### Fix Python 3.10 build issues (Ubuntu 22.04)

The first launch will create a Python venv and install dependencies. CLIP will fail to build due to a `pkg_resources` issue on Python 3.10. Fix it:

```bash
cd ~/stable-diffusion-webui

# First launch creates the venv; run it once, let it fail, then fix:
./webui.sh --api --listen --xformers --no-half-vae || true

# Fix CLIP build issue
venv/bin/pip install "setuptools<70" wheel
venv/bin/pip install --no-build-isolation \
    https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip

# Launch again
./webui.sh --api --listen --xformers --no-half-vae
```

#### Select the default SDXL model

Once the UI is running, open it in a browser and select `sd_xl_base_1.0` from the checkpoint dropdown. Or via API:

```bash
curl -X POST http://localhost:7860/sdapi/v1/options \
    -H "Content-Type: application/json" \
    -d '{"sd_model_checkpoint": "sd_xl_base_1.0.safetensors"}'
```

The pipeline automatically switches between models at runtime: `sd_xl_base_1.0` for normal generation, `juggernautXL_v9` when the `uncen` prefix is used.

#### Create a systemd service

Using the provided script:

```bash
chmod +x setup-sd-service.sh
sudo ./setup-sd-service.sh
```

Or manually. The heredoc below is unquoted, so `$USER` and `$HOME` are expanded by your current shell; run it as the user who owns the Forge install, or replace them with actual values:

```bash
sudo tee /etc/systemd/system/stable-diffusion.service > /dev/null <<EOF
[Unit]
Description=AUTOMATIC1111 Stable Diffusion WebUI
After=network.target

[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/stable-diffusion-webui
ExecStart=$HOME/stable-diffusion-webui/webui.sh --api --listen --xformers --no-half-vae --medvram-sdxl
Restart=on-failure
RestartSec=10
Environment=HOME=$HOME

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now stable-diffusion
```

#### Verify

```bash
# Check the service is running
sudo systemctl status stable-diffusion

# Check available models (should list both sd_xl_base and juggernautXL)
curl -s http://localhost:7860/sdapi/v1/sd-models | python3 -m json.tool
```

### 4. Network Configuration

The pipeline runs inside a Docker container and needs to reach services on the host:

| Service | URL from container | Notes |
|---|---|---|
| Ollama | `http://ollama:11434` | Docker DNS or host networking |
| Stable Diffusion | `http://172.18.0.1:7860` | Docker bridge gateway IP |

To find your bridge gateway IP:

```bash
docker network inspect <your_network> --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'
```

If your gateway IP differs from `172.18.0.1`, update `SD_URL` in the pipeline file (or the `sd_url` valve from step 2).

Verify connectivity from inside the container (`open-webui` is shown here; use the name of the container that runs your pipeline):

```bash
docker exec open-webui curl -s http://172.18.0.1:7860/sdapi/v1/sd-models
docker exec open-webui curl -s http://ollama:11434/api/tags | head -c 100
```
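
The same connectivity check can be done from Python, which is handy if `curl` is not installed in the container. This is a sketch under the assumption that the two endpoints used in the `curl` commands above are sufficient health probes; `health_urls` and `is_reachable` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def health_urls(ollama_url: str, sd_url: str) -> dict:
    """Build the two probe URLs used in the curl examples."""
    return {
        "ollama": ollama_url.rstrip("/") + "/api/tags",
        "stable-diffusion": sd_url.rstrip("/") + "/sdapi/v1/sd-models",
    }


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or malformed URL.
        return False


if __name__ == "__main__":
    probes = health_urls("http://ollama:11434", "http://172.18.0.1:7860")
    for name, url in probes.items():
        print(f"{name}: {'ok' if is_reachable(url) else 'UNREACHABLE'} ({url})")
```

Run it inside the container that hosts the pipeline; a result of `UNREACHABLE` for Stable Diffusion usually points at the gateway IP or at Forge not listening on all interfaces (`--listen`).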

## Image Generation

### Default mode

Any prompt classified as `image_generation` (e.g. "generate an image of a cat in space") uses **SDXL Base 1.0**. The LLM refines the user's request into an optimized Stable Diffusion prompt with quality boosters, then calls the A1111 API.
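
To make the API call concrete, here is a sketch of a request body for the A1111 `/sdapi/v1/txt2img` endpoint. The field names (`prompt`, `negative_prompt`, `steps`, `cfg_scale`, `width`, `height`, `override_settings`) are part of that API; the quality tags, the negative prompt, and the function name `build_txt2img_payload` are illustrative assumptions rather than what the pipeline actually sends.

```python
# Hypothetical quality boosters; in the pipeline the LLM adds its own.
QUALITY_TAGS = "masterpiece, best quality, highly detailed"


def build_txt2img_payload(
    refined_prompt: str,
    checkpoint: str = "sd_xl_base_1.0.safetensors",
    width: int = 1024,
    height: int = 1024,
    steps: int = 25,
    cfg_scale: float = 7.0,
) -> dict:
    """Assemble a JSON-serializable body for POST /sdapi/v1/txt2img."""
    return {
        "prompt": f"{refined_prompt}, {QUALITY_TAGS}",
        "negative_prompt": "blurry, low quality, watermark",  # assumed
        "width": width,
        "height": height,
        "steps": steps,
        "cfg_scale": cfg_scale,
        # Per-request checkpoint selection, without changing global options.
        "override_settings": {"sd_model_checkpoint": checkpoint},
    }
```

The response from `txt2img` carries the generated images as base64 strings in its `images` list. The size, step, and CFG defaults here mirror the valve defaults from the Setup section.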

### Uncensored mode

Prefix any prompt with `uncen` to bypass all classification, web search, and routing. The pipeline goes straight to image generation using **Juggernaut XL v9**:

```
uncen a beautiful sunset over the ocean
uncen portrait of a warrior in golden armor
```

The `uncen` prefix is stripped and the prompt is refined by **dolphin-mistral:7b** (an uncensored LLM that won't refuse any content) instead of gpt-oss. The pipeline switches the SD checkpoint to Juggernaut XL v9 automatically. If dolphin-mistral is unavailable, it falls back to sending the user's text directly with quality tags appended.
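
The prefix handling and the fallback can be sketched in a few lines. This assumes `uncen` must be a standalone first word (so "uncensored ..." does not trigger it) and that matching is case-insensitive; the pipeline may be stricter or looser. `parse_uncen`, `fallback_prompt`, and `FALLBACK_TAGS` are hypothetical names.

```python
# Hypothetical quality tags used when no LLM is available to refine the prompt.
FALLBACK_TAGS = "high quality, detailed, sharp focus"


def parse_uncen(prompt: str) -> tuple:
    """Return (is_uncensored, prompt_without_prefix)."""
    parts = prompt.strip().split(None, 1)
    if len(parts) == 2 and parts[0].lower() == "uncen":
        return True, parts[1].strip()
    return False, prompt


def fallback_prompt(user_text: str, tags: str = FALLBACK_TAGS) -> str:
    """Used if dolphin-mistral is unavailable: raw text plus quality tags."""
    return f"{user_text.strip()}, {tags}"
```

For the first example above, `parse_uncen` returns `(True, "a beautiful sunset over the ocean")`, and that remainder is what gets refined (or passed to `fallback_prompt`).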

### How it works

**Default mode:**
1. LLM (gpt-oss) converts the user request into an optimized SD prompt
2. Ollama models are unloaded from VRAM
3. SD checkpoint is loaded (SDXL Base)
4. Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
5. SD checkpoint is unloaded from VRAM and page cache is dropped

**Uncensored mode:**
1. `uncen` prefix is stripped
2. dolphin-mistral:7b refines the prompt into optimized SD tags (no refusal)
3. Ollama models are unloaded from VRAM
4. SD checkpoint is switched to Juggernaut XL v9
5. Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
6. SD checkpoint is unloaded from VRAM and page cache is dropped
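
The "streamed in 4KB chunks" step can be pictured as a generator that slices the encoded image into fixed-size pieces. The sketch below assumes the image is delivered as a Markdown image with a base64 `data:` URI, which is a guess about the output format; the PNG→JPEG conversion itself is left out because it needs an imaging library. `stream_image_markdown` is a hypothetical name.

```python
import base64
from typing import Iterator

CHUNK_SIZE = 4096  # 4KB, as described in the steps above


def stream_image_markdown(
    jpeg_bytes: bytes, chunk_size: int = CHUNK_SIZE
) -> Iterator[str]:
    """Yield a Markdown image (data URI) in pieces of at most chunk_size chars."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    message = f"![generated image](data:image/jpeg;base64,{encoded})"
    for start in range(0, len(message), chunk_size):
        yield message[start:start + chunk_size]
```

Chunking keeps each streamed message small, so the client receives the image progressively instead of waiting on one very large string.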

## VRAM Management

On a single 16GB GPU, large Ollama models and SDXL cannot be loaded simultaneously. The pipeline handles this automatically:

1. **Before image generation**: unloads all Ollama models from VRAM via `keep_alive: 0`
2. **After image generation**: unloads SD checkpoint via `/sdapi/v1/unload-checkpoint` and drops Linux page cache
3. Ollama reloads the model on the next chat request (~10-15s warm-up)
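
The two unload calls can be sketched as plain HTTP requests. Ollama unloads a model when `/api/generate` is called with `keep_alive` set to `0` and no prompt, and `/sdapi/v1/unload-checkpoint` is the endpoint named above. The helper names are hypothetical, and the requests are only constructed here, not sent.

```python
import json
import urllib.request


def ollama_unload_request(ollama_url: str, model: str) -> urllib.request.Request:
    """Build the request that asks Ollama to evict one model from memory."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        ollama_url.rstrip("/") + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sd_unload_request(sd_url: str) -> urllib.request.Request:
    """Build the request that asks Stable Diffusion to free its checkpoint."""
    return urllib.request.Request(
        sd_url.rstrip("/") + "/sdapi/v1/unload-checkpoint",
        data=b"",
        method="POST",
    )


# To actually send one of them:
#   urllib.request.urlopen(ollama_unload_request(url, "gpt-oss:20b"), timeout=30)
```

Unloading every loaded model means issuing one such request per model name. Dropping the page cache is a separate, root-only step and is not covered by this sketch.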

If Ollama fails to load after image generation with a memory error, manually clear the page cache:

```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

## Architecture

```
User Message
  │
  ├─ "uncen" prefix? ─────────────── → dolphin-mistral:7b (refine) → Juggernaut XL v9
  │
  ├─ Image uploaded? ──────────────── → llama3.2-vision:11b
  │
  ├─ AI Classifier (qwen2.5:7b)
  │   │
  │   ├─ coding ──────────────── → qwen2.5-coder:14b
  │   ├─ diagram ─────────────── → qwen2.5-coder:14b (Mermaid)
  │   ├─ reasoning ───────────── → gpt-oss:120b (FI/EN system prompt)
  │   ├─ image_generation ────── → gpt-oss:120b (refine) → SDXL Base
  │   └─ general ─────────────── → gpt-oss:120b
  │
  ├─ Heuristic Search Override
  │   │
  │   └─ Brave Search + page fetch (if needed)
  │
  └─ Stream response (with thinking tokens)
```
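
The "Heuristic Search Override" step in the diagram acts as a safety net on top of the classifier's own decision. A minimal sketch of such a check, assuming it looks for time-sensitive words and explicit years; the patterns, the skipped categories, and the name `needs_search` are illustrative assumptions, not the pipeline's actual rules.

```python
import re

# Hypothetical trigger words (English and Finnish) for time-sensitive questions.
TIME_SENSITIVE = re.compile(
    r"\b(today|tonight|latest|current|currently|news|price|weather|score"
    r"|tänään|uusin|uutiset|sää|hinta)\b",
    re.IGNORECASE,
)
YEAR = re.compile(r"\b20\d{2}\b")

# Categories where a web search would not help.
NO_SEARCH_CATEGORIES = {"image_generation", "vision"}


def needs_search(
    prompt: str, category: str, classifier_wants_search: bool = False
) -> bool:
    """Force a search for time-sensitive prompts even if the classifier did not."""
    if category in NO_SEARCH_CATEGORIES:
        return False
    if classifier_wants_search:
        return True
    return bool(TIME_SENSITIVE.search(prompt) or YEAR.search(prompt))
```

When this returns `True`, the pipeline would run Brave Search, fetch the top pages, and prepend the (truncated) results to the model's context before streaming the answer.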

## Files

| File | Description |
|---|---|
| `llm_router_v3.py` | Main pipeline (gpt-oss:120b) |
| `llm_router-20b.py` | Lighter pipeline variant (gpt-oss:20b) |
| `setup-sd.sh` | Stable Diffusion Forge install script (Ubuntu 22.04) |
| `setup-sd-service.sh` | systemd service creation script |

## License

MIT