
# LLM Router Pipeline for Open WebUI

An intelligent prompt classification and routing pipeline for [Open WebUI](https://github.com/open-webui/open-webui). It classifies each user prompt with a small LLM (qwen2.5:7b) and routes it to a specialized Ollama model, with integrated Brave web search, image generation via Stable Diffusion, and full Finnish/English bilingual support.

## Features

- **AI-powered prompt classification** with keyword-based fallback
- **Model routing** — coding, diagram, reasoning, vision, image generation, and general categories
- **Brave web search** with full page content fetching (top 3 results scraped)
- **Heuristic search overrides** — safety net that forces search for time-sensitive or factual questions
- **Image generation** via AUTOMATIC1111/Forge (Stable Diffusion XL) with LLM-refined prompts
- **Uncensored image generation** — prefix any prompt with `uncen` to bypass all classification/search and generate directly with Juggernaut XL v9
- **VRAM management** — automatically juggles GPU memory between Ollama and Stable Diffusion
- **Bilingual** — detects Finnish and forces responses in the correct language
- **Thinking/reasoning display** — streams model thinking tokens in collapsible blocks
- **Real-time search status** — shows which URLs are being fetched as search runs
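
A minimal sketch of how AI classification with a keyword fallback might fit together. The keyword lists, the function names, and the `ai_classifier` callable are illustrative assumptions, not the pipeline's actual code; only the two image-generation phrases come from the routing table.

```python
from typing import Callable, Optional

CATEGORIES = {"coding", "diagram", "reasoning", "image_generation", "general"}

# Hypothetical keyword lists; the real pipeline's keywords may differ.
KEYWORDS = {
    "image_generation": ("generate an image", "luo kuva"),
    "diagram": ("mermaid", "flowchart", "uml"),
    "coding": ("debug", "write a function", "fix this code"),
}


def classify_by_keywords(prompt: str) -> str:
    """Return the first category whose keyword appears in the prompt."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "general"


def classify(prompt: str, ai_classifier: Optional[Callable[[str], str]] = None) -> str:
    """Prefer the AI classifier; fall back to keywords if it fails or returns an unknown label."""
    if ai_classifier is not None:
        try:
            label = ai_classifier(prompt).strip().lower()
            if label in CATEGORIES:
                return label
        except Exception:
            pass  # classifier unreachable or errored: use the keyword fallback
    return classify_by_keywords(prompt)
```

Passing `ai_classifier=None` corresponds to keyword-only classification, which is roughly what disabling the `use_ai_classifier` valve would select.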

## Model Routing

| Category | Model (`llm_router_v3.py`, 120B) | Model (`llm_router-20b.py`, 20B) | Trigger |
|---|---|---|---|
| coding | qwen2.5-coder:14b | qwen2.5-coder:14b | User asks to write/fix/debug code |
| diagram | qwen2.5-coder:14b | qwen2.5-coder:14b | Mermaid, flowchart, UML requests |
| reasoning (FI) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (Finnish) |
| reasoning (EN) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (English) |
| image generation | gpt-oss:120b + SDXL | gpt-oss:20b + SDXL | "generate an image", "luo kuva" |
| uncensored image | dolphin-mistral:7b + Juggernaut XL v9 | dolphin-mistral:7b + Juggernaut XL v9 | Prompt starts with `uncen` |
| vision | llama3.2-vision:11b | llama3.2-vision:11b | User uploads an image |
| general | gpt-oss:120b | gpt-oss:20b | Everything else |

Two pipeline variants are provided:
- **`llm_router_v3.py`** — uses gpt-oss:120b (higher quality, more VRAM/RAM)
- **`llm_router-20b.py`** — uses gpt-oss:20b (lighter, better for constrained hardware)
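
The routing table can be read as a simple lookup. The sketch below copies the model names from the table; the dict layout and the `model_for` function are assumptions made for illustration, and the uncensored image path is left out because it bypasses classification.

```python
# Categories that use the same model in both variants (from the table above).
FIXED_ROUTES = {
    "coding": "qwen2.5-coder:14b",
    "diagram": "qwen2.5-coder:14b",
    "vision": "llama3.2-vision:11b",
}


def model_for(category: str, variant: str = "120b") -> str:
    """Return the Ollama model for a category.

    Reasoning, image generation (prompt refinement), and general all map to
    gpt-oss in the size given by `variant`.
    """
    if variant not in ("120b", "20b"):
        raise ValueError(f"unknown variant: {variant}")
    return FIXED_ROUTES.get(category, f"gpt-oss:{variant}")
```

For example, `model_for("reasoning", "20b")` returns `gpt-oss:20b`, while `model_for("coding", "20b")` still returns `qwen2.5-coder:14b`.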

## Prerequisites

- **Ubuntu 22.04 LTS** (tested)
- **NVIDIA GPU** with 16GB+ VRAM (tested on RTX 2000 Ada)
- **Open WebUI** running in Docker with pipelines enabled
- **Ollama** installed natively with models pulled:
  ```bash
  ollama pull qwen2.5:7b
  ollama pull qwen2.5-coder:14b
  ollama pull gpt-oss:120b    # or gpt-oss:20b for the lighter variant
  ollama pull llama3.2-vision:11b
  ollama pull dolphin-mistral:7b   # uncensored model for image prompt refinement
  ```
- **Brave Search API key** (free tier: https://brave.com/search/api/)

## Setup

### 1. Deploy the Pipeline

Copy your chosen pipeline file to the Open WebUI pipelines directory:

```bash
cp llm_router_v3.py ~/ai-stack/pipelines/
# or for the 20B variant:
cp llm_router-20b.py ~/ai-stack/pipelines/
```

Restart the pipelines container:

```bash
docker restart pipelines
```

### 2. Configure Valves in Open WebUI

Go to **Admin Panel > Pipelines** in Open WebUI and configure:

| Setting | Description | Default |
|---|---|---|
| `ollama_url` | Ollama API URL | `http://ollama:11434` |
| `sd_url` | Stable Diffusion API URL | `http://172.18.0.1:7860` |
| `brave_api_key` | Brave Search API key | (from env `BRAVE_API_KEY`) |
| `sd_width` / `sd_height` | Generated image dimensions | 1024 x 1024 |
| `sd_steps` | Sampling steps | 25 |
| `sd_cfg_scale` | CFG scale | 7.0 |
| `brave_max_results` | Number of search results | 6 |
| `use_ai_classifier` | Use AI vs keyword-only classification | true |
| `show_routing_info` | Show routing banner in responses | true |
| `search_context_max_chars` | Max search context size | 12000 |
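
For orientation, the defaults in the table could be expressed as a settings object like the one below. Open WebUI pipelines typically declare valves as a pydantic model; a plain dataclass is used here only to keep the sketch dependency-free, and the `txt2img_payload` helper is a hypothetical name. The payload keys (`prompt`, `width`, `height`, `steps`, `cfg_scale`) are the field names of the A1111 `/sdapi/v1/txt2img` endpoint.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Valves:
    """Defaults mirrored from the settings table (a sketch, not the real class)."""
    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"
    # Read from the environment when the valve is not set explicitly.
    brave_api_key: str = field(
        default_factory=lambda: os.environ.get("BRAVE_API_KEY", "")
    )
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 25
    sd_cfg_scale: float = 7.0
    brave_max_results: int = 6
    use_ai_classifier: bool = True
    show_routing_info: bool = True
    search_context_max_chars: int = 12000


def txt2img_payload(valves: Valves, prompt: str) -> dict:
    """Build a txt2img request body from the image-related valves."""
    return {
        "prompt": prompt,
        "width": valves.sd_width,
        "height": valves.sd_height,
        "steps": valves.sd_steps,
        "cfg_scale": valves.sd_cfg_scale,
    }
```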

### 3. Set Up Stable Diffusion (Image Generation)

> Skip this section if you don't need image generation.

#### Install Forge (AUTOMATIC1111 fork)

```bash
# Install system dependencies
sudo apt-get update
sudo apt-get install -y git wget python3-venv python3-pip \
    libgl1 libglib2.0-0 libsm6 libxrender1 libxext6 libffi-dev libssl-dev

# Clone Forge
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git ~/stable-diffusion-webui
cd ~/stable-diffusion-webui

# Download SDXL model (~6.9GB)
mkdir -p models/Stable-diffusion
wget -O models/Stable-diffusion/sd_xl_base_1.0.safetensors \
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"

# Download Juggernaut XL v9 for uncensored image generation (~6.6GB)
wget -O models/Stable-diffusion/juggernautXL_v9.safetensors \
    "https://huggingface.co/RunDiffusion/Juggernaut-XL-v9/resolve/main/Juggernaut-XL_v9_RunDiffusionPhoto_v2.safetensors"
```

#### Fix Python 3.10 build issues (Ubuntu 22.04)

The first launch creates a Python venv and installs dependencies. On Ubuntu 22.04 the CLIP package can fail to build because of a `pkg_resources` issue on Python 3.10. If that happens, fix it as follows:

```bash
cd ~/stable-diffusion-webui

# First launch creates the venv — run it once, let it fail, then fix:
./webui.sh --api --listen --xformers --no-half-vae || true

# Fix CLIP build issue
venv/bin/pip install "setuptools<70" wheel
venv/bin/pip install --no-build-isolation \
    https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip

# Launch again
./webui.sh --api --listen --xformers --no-half-vae
```

#### Select the default SDXL model

Once the UI is running, open it in a browser and select `sd_xl_base_1.0` from the checkpoint dropdown. Or via API:

```bash
curl -X POST http://localhost:7860/sdapi/v1/options \
    -H "Content-Type: application/json" \
    -d '{"sd_model_checkpoint": "sd_xl_base_1.0.safetensors"}'
```

The pipeline automatically switches between models at runtime — `sd_xl_base_1.0` for normal generation, `juggernautXL_v9` when the `uncen` prefix is used.
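
A rough Python equivalent of the `curl` call, extended to pick the checkpoint based on whether the `uncen` prefix was used. The checkpoint titles are assumed to match the filenames downloaded earlier, and `checkpoint_request` is a hypothetical helper; the request is only built here, not sent.

```python
import json
import urllib.request

# Assumed checkpoint titles, based on the filenames used in the download step.
CHECKPOINTS = {
    False: "sd_xl_base_1.0.safetensors",   # normal generation
    True: "juggernautXL_v9.safetensors",   # `uncen` prefix
}


def checkpoint_request(sd_url: str, uncensored: bool) -> urllib.request.Request:
    """Build the POST to /sdapi/v1/options that selects the checkpoint."""
    body = json.dumps({"sd_model_checkpoint": CHECKPOINTS[uncensored]}).encode("utf-8")
    return urllib.request.Request(
        f"{sd_url.rstrip('/')}/sdapi/v1/options",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` with a generous timeout, since loading an SDXL checkpoint can take a while.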

#### Create a systemd service

Using the provided script:

```bash
chmod +x setup-sd-service.sh
sudo ./setup-sd-service.sh
```

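Or manually. Because the heredoc delimiter is unquoted, your shell expands `$USER` and `$HOME` before the file is written, so run this as the user who owns the Forge install (or substitute explicit values):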

```bash
sudo tee /etc/systemd/system/stable-diffusion.service > /dev/null <<EOF
[Unit]
Description=AUTOMATIC1111 Stable Diffusion WebUI
After=network.target

[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/stable-diffusion-webui
ExecStart=$HOME/stable-diffusion-webui/webui.sh --api --listen --xformers --no-half-vae --medvram-sdxl
Restart=on-failure
RestartSec=10
Environment=HOME=$HOME

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now stable-diffusion
```

#### Verify

```bash
# Check the service is running
sudo systemctl status stable-diffusion

# Check available models (should list both sd_xl_base and juggernautXL)
curl -s http://localhost:7860/sdapi/v1/sd-models | python3 -m json.tool
```

### 4. Network Configuration

The pipeline runs inside Open WebUI's Docker container and needs to reach services on the host:

| Service | URL from container | Notes |
|---|---|---|
| Ollama | `http://ollama:11434` | Docker DNS or host networking |
| Stable Diffusion | `http://172.18.0.1:7860` | Docker bridge gateway IP |

To find your bridge gateway IP:

```bash
docker network inspect <your_network> --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'
```

If your gateway IP differs from `172.18.0.1`, update the `sd_url` valve (and the `SD_URL` default in the pipeline file, if you rely on it).
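
If you would rather derive the URL programmatically, the sketch below parses the JSON that `docker network inspect <your_network>` prints when run without `--format`. Both function names are hypothetical, and port 7860 is taken from the defaults above.

```python
import json
from typing import List


def gateways_from_inspect(inspect_json: str) -> List[str]:
    """Extract gateway IPs from `docker network inspect` JSON output."""
    gateways = []
    for network in json.loads(inspect_json):
        # IPAM.Config is a list of subnet entries; some may lack a gateway.
        for config in network.get("IPAM", {}).get("Config") or []:
            if config.get("Gateway"):
                gateways.append(config["Gateway"])
    return gateways


def sd_url_for(gateway: str, port: int = 7860) -> str:
    """Build the Stable Diffusion API base URL for a gateway IP."""
    return f"http://{gateway}:{port}"
```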

Verify connectivity from inside the container:

```bash
docker exec open-webui curl -s http://172.18.0.1:7860/sdapi/v1/sd-models
docker exec open-webui curl -s http://ollama:11434/api/tags | head -c 100
```

## Image Generation

### Default mode

Any prompt classified as `image_generation` (e.g. "generate an image of a cat in space") uses **SDXL Base 1.0**. The LLM (gpt-oss) refines the user's request into an optimized Stable Diffusion prompt with quality boosters, and the pipeline then calls the A1111 API.

### Uncensored mode

Prefix any prompt with `uncen` to bypass all classification, web search, and routing — the pipeline goes straight to image generation using **Juggernaut XL v9**:

```
uncen a beautiful sunset over the ocean
uncen portrait of a warrior in golden armor
```

The `uncen` prefix is stripped and the prompt is refined by **dolphin-mistral:7b** (an uncensored LLM that won't refuse any content) instead of gpt-oss. The pipeline switches the SD checkpoint to Juggernaut XL v9 automatically. If dolphin-mistral is unavailable, it falls back to sending the user's text directly with quality tags appended.
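
A minimal sketch of the prefix handling and the fallback path. It assumes the prefix must be a whole leading word and is matched case-insensitively; the quality tags and both function names are placeholders, not the pipeline's actual values.

```python
from typing import Tuple

PREFIX = "uncen"
# Hypothetical quality tags for the fallback; the pipeline's real tags may differ.
QUALITY_TAGS = "masterpiece, best quality, highly detailed"


def split_uncen(prompt: str) -> Tuple[bool, str]:
    """Return (is_uncensored, prompt_without_prefix)."""
    parts = prompt.split(None, 1)  # split off the first whitespace-delimited word
    if parts and parts[0].lower() == PREFIX:
        rest = parts[1].strip() if len(parts) > 1 else ""
        return True, rest
    return False, prompt


def fallback_prompt(text: str) -> str:
    """Used when the refining LLM is unavailable: raw text plus quality tags."""
    return f"{text}, {QUALITY_TAGS}"
```

Under these assumptions, a prompt such as "uncensored history of Rome" is not treated as uncensored mode, because `uncen` is not a standalone first word.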

### How it works

**Default mode:**
1. LLM (gpt-oss) converts the user request into an optimized SD prompt
2. Ollama models are unloaded from VRAM
3. SD checkpoint is loaded (SDXL Base)
4. Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
5. SD checkpoint is unloaded from VRAM and page cache is dropped

**Uncensored mode:**
1. `uncen` prefix is stripped
2. dolphin-mistral:7b refines the prompt into optimized SD tags (no refusal)
3. Ollama models are unloaded from VRAM
4. SD checkpoint is switched to Juggernaut XL v9
5. Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
6. SD checkpoint is unloaded from VRAM and page cache is dropped
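
The chunked streaming step can be sketched as follows. The 4KB chunk size comes from the steps above; delivering the image as a Markdown image with a base64 data URI is an assumption about the output format, and the PNG→JPEG conversion itself is omitted because it needs an imaging library.

```python
import base64
from typing import Iterator

CHUNK_SIZE = 4096  # 4KB, as described in the steps above


def image_markdown(jpeg_bytes: bytes) -> str:
    """Wrap JPEG bytes as a Markdown image using a base64 data URI (assumed format)."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"![generated image](data:image/jpeg;base64,{encoded})"


def stream_chunks(text: str, size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield the text in fixed-size pieces so the client receives it incrementally."""
    for start in range(0, len(text), size):
        yield text[start:start + size]
```

Joining the yielded chunks reproduces the original string, so the client can reassemble the image once the stream ends.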

## VRAM Management

On a single 16GB GPU, large Ollama models and SDXL cannot be loaded simultaneously. The pipeline handles this automatically:

1. **Before image generation**: unloads all Ollama models from VRAM via `keep_alive: 0`
2. **After image generation**: unloads SD checkpoint via `/sdapi/v1/unload-checkpoint` and drops Linux page cache
3. Ollama reloads the model on the next chat request (~10-15s warm-up)
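
The two unload calls might look like the sketch below. It relies on Ollama's `/api/ps` (lists loaded models) and `/api/generate` with `keep_alive: 0` (unloads a model), plus the `/sdapi/v1/unload-checkpoint` endpoint named above. The helper names are hypothetical, and the requests are only constructed, not sent.

```python
import json
import urllib.request
from typing import List


def _post(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_unload_requests(ollama_url: str, ps_response: dict) -> List[urllib.request.Request]:
    """One unload request per model listed in a parsed /api/ps response."""
    base = ollama_url.rstrip("/")
    return [
        _post(f"{base}/api/generate", {"model": m["name"], "keep_alive": 0})
        for m in ps_response.get("models", [])
    ]


def sd_unload_request(sd_url: str) -> urllib.request.Request:
    """Request that asks Stable Diffusion to release its checkpoint from VRAM."""
    return _post(f"{sd_url.rstrip('/')}/sdapi/v1/unload-checkpoint", {})
```

Each request can be sent with `urllib.request.urlopen`; the Ollama requests go out before image generation and the Stable Diffusion request afterwards.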

If Ollama fails to load after image generation with a memory error, manually clear the page cache:

```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

## Architecture

```
User Message
    │
    ├─ "uncen" prefix? ─────────────── → dolphin-mistral:7b (refine) → Juggernaut XL v9
    │
    ├─ Image uploaded? ──────────────── → llama3.2-vision:11b
    │
    ├─ AI Classifier (qwen2.5:7b)
    │       │
    │       ├─ coding ──────────────── → qwen2.5-coder:14b
    │       ├─ diagram ─────────────── → qwen2.5-coder:14b (Mermaid)
    │       ├─ reasoning ───────────── → gpt-oss:120b (FI/EN system prompt)
    │       ├─ image_generation ────── → gpt-oss:120b (refine) → SDXL Base
    │       └─ general ─────────────── → gpt-oss:120b
    │
    ├─ Heuristic Search Override
    │       │
    │       └─ Brave Search + page fetch (if needed)
    │
    └─ Stream response (with thinking tokens)
```
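
The order of checks in the diagram can be summarized in a few lines. This is a sketch under assumptions: the `dispatch` function, its route labels, and the case-insensitive prefix match are illustrative, and the search override and streaming stages are not modeled.

```python
from typing import Callable


def dispatch(prompt: str, has_image: bool, classify: Callable[[str], str]) -> str:
    """Return a route label, checking conditions in the order shown in the diagram."""
    words = prompt.split(None, 1)
    if words and words[0].lower() == "uncen":
        # Bypasses classification and search entirely.
        return "uncensored_image"
    if has_image:
        return "vision"
    # Everything else goes through the classifier (AI or keyword fallback).
    return classify(prompt)
```

Because the `uncen` check comes first, the classifier is never invoked for uncensored image prompts, even when an image is attached.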

## Files

| File | Description |
|---|---|
| `llm_router_v3.py` | Main pipeline (gpt-oss:120b) |
| `llm_router-20b.py` | Lighter pipeline variant (gpt-oss:20b) |
| `setup-sd.sh` | Stable Diffusion Forge install script (Ubuntu 22.04) |
| `setup-sd-service.sh` | systemd service creation script |

## License

MIT