An intelligent prompt classification and routing pipeline for Open WebUI. It classifies each user prompt with a small LLM (qwen2.5:7b) and routes it to a specialized Ollama model, with integrated Brave web search, image generation via Stable Diffusion, and full Finnish/English bilingual support.

Features

AI-powered prompt classification with keyword-based fallback
Model routing — coding, diagram, reasoning, vision, image generation, and general categories
Brave web search with full page content fetching (top 3 results scraped)
Heuristic search overrides — safety net that forces search for time-sensitive or factual questions
Image generation via AUTOMATIC1111/Forge (Stable Diffusion XL) with LLM-refined prompts
Uncensored image generation — prefix any prompt with uncen to bypass all classification/search and generate directly with Juggernaut XL v9
VRAM management — automatically juggles GPU memory between Ollama and Stable Diffusion
Bilingual — detects Finnish and forces responses in the correct language
Thinking/reasoning display — streams model thinking tokens in collapsible blocks
Real-time search status — shows which URLs are being fetched as search runs
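
As a rough illustration of the keyword-based fallback mentioned above, the sketch below scans a prompt for category keywords and defaults to general. The keyword lists and the function name `classify_by_keywords` are hypothetical; the pipeline's actual lists are not documented here.

```python
import re

# Hypothetical keyword lists (assumption): checked in insertion order,
# so more specific categories come first.
KEYWORDS = {
    "image_generation": ["generate an image", "luo kuva"],
    "diagram": ["mermaid", "flowchart", "uml"],
    "coding": ["code", "debug", "function", "python", "bug"],
    "reasoning": ["analyze", "compare", "strategy", "vertaa"],
}


def classify_by_keywords(prompt: str) -> str:
    """Return the first category whose keyword appears as a whole word."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        for word in words:
            if re.search(r"\b" + re.escape(word) + r"\b", text):
                return category
    return "general"
```

A fallback like this would only run when the AI classifier is disabled or fails, so it can afford to be coarse.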

Model Routing

| Category | Model (120B variant) | Model (20B variant) | Trigger |
|---|---|---|---|
| coding | qwen2.5-coder:14b | qwen2.5-coder:14b | User asks to write/fix/debug code |
| diagram | qwen2.5-coder:14b | qwen2.5-coder:14b | Mermaid, flowchart, UML requests |
| reasoning (FI) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (Finnish) |
| reasoning (EN) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (English) |
| image generation | gpt-oss:120b + SDXL | gpt-oss:20b + SDXL | "generate an image", "luo kuva" |
| uncensored image | dolphin-mistral:7b + Juggernaut XL v9 | dolphin-mistral:7b + Juggernaut XL v9 | Prompt starts with `uncen` |
| vision | llama3.2-vision:11b | llama3.2-vision:11b | User uploads an image |
| general | gpt-oss:120b | gpt-oss:20b | Everything else |
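
The table above can be read as a lookup from category to model. The sketch below is one possible encoding, not the pipeline's actual data structure; `ROUTES` and `pick_model` are hypothetical names, and for the two image categories the value is the LLM that refines the prompt before Stable Diffusion runs.

```python
# category -> (model for the 120B variant, model for the 20B variant)
ROUTES = {
    "coding": ("qwen2.5-coder:14b", "qwen2.5-coder:14b"),
    "diagram": ("qwen2.5-coder:14b", "qwen2.5-coder:14b"),
    "reasoning": ("gpt-oss:120b", "gpt-oss:20b"),
    "image_generation": ("gpt-oss:120b", "gpt-oss:20b"),
    "uncensored_image": ("dolphin-mistral:7b", "dolphin-mistral:7b"),
    "vision": ("llama3.2-vision:11b", "llama3.2-vision:11b"),
    "general": ("gpt-oss:120b", "gpt-oss:20b"),
}


def pick_model(category: str, variant: str = "120b") -> str:
    """Look up the Ollama model for a category; unknown categories use 'general'."""
    large, small = ROUTES.get(category, ROUTES["general"])
    return large if variant == "120b" else small
```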

Two pipeline variants are provided:

llm_router_v3.py — uses gpt-oss:120b (higher quality, more VRAM/RAM)
llm_router-20b.py — uses gpt-oss:20b (lighter, better for constrained hardware)

Prerequisites

Ubuntu 22.04 LTS (tested)
NVIDIA GPU with 16GB+ VRAM (tested on RTX 2000 Ada)
Open WebUI running in Docker with pipelines enabled

Ollama installed natively with models pulled:

ollama pull qwen2.5:7b
ollama pull qwen2.5-coder:14b
ollama pull gpt-oss:120b    # or gpt-oss:20b for the lighter variant
ollama pull llama3.2-vision:11b
ollama pull dolphin-mistral:7b   # uncensored model for image prompt refinement

Brave Search API key (free tier: https://brave.com/search/api/)
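
To confirm that the models listed above are actually pulled, one option is to compare them against Ollama's `/api/tags` response. This is a minimal sketch: the helper names are hypothetical, and `http://localhost:11434` assumes a native Ollama install on its default port.

```python
import json
import urllib.request

REQUIRED_MODELS = [
    "qwen2.5:7b",
    "qwen2.5-coder:14b",
    "gpt-oss:120b",  # swap for "gpt-oss:20b" when using the lighter variant
    "llama3.2-vision:11b",
    "dolphin-mistral:7b",
]


def missing_models(tags_response: dict, required=REQUIRED_MODELS) -> list:
    """Return the required model names absent from an /api/tags response."""
    installed = {model["name"] for model in tags_response.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags())` on the host should return an empty list once every `ollama pull` has finished.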

Setup

1. Deploy the Pipeline

Copy your chosen pipeline file to the Open WebUI pipelines directory:

cp llm_router_v3.py ~/ai-stack/pipelines/
# or for the 20B variant:
cp llm_router-20b.py ~/ai-stack/pipelines/

Restart the pipelines container:

docker restart pipelines

2. Configure Valves in Open WebUI

Go to Admin Panel > Pipelines in Open WebUI and configure:

| Setting | Description | Default |
|---|---|---|
| `ollama_url` | Ollama API URL | `http://ollama:11434` |
| `sd_url` | Stable Diffusion API URL | `http://172.18.0.1:7860` |
| `brave_api_key` | Brave Search API key | (from env `BRAVE_API_KEY`) |
| `sd_width` / `sd_height` | Generated image dimensions | 1024 x 1024 |
| `sd_steps` | Sampling steps | 25 |
| `sd_cfg_scale` | CFG scale | 7.0 |
| `brave_max_results` | Number of search results | 6 |
| `use_ai_classifier` | Use AI vs keyword-only classification | true |
| `show_routing_info` | Show routing banner in responses | true |
| `search_context_max_chars` | Max search context size | 12000 |
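
For reference, the defaults above could be expressed as a settings object along the following lines. Open WebUI pipelines normally declare valves as a pydantic model; a standard-library dataclass is used here as a stand-in so the sketch has no dependencies, and it is not a copy of the pipeline's real class.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Valves:
    """Stand-in for the pipeline's valve settings, mirroring the documented defaults."""
    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"
    # Read from the environment at construction time; empty string if unset.
    brave_api_key: str = field(default_factory=lambda: os.getenv("BRAVE_API_KEY", ""))
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 25
    sd_cfg_scale: float = 7.0
    brave_max_results: int = 6
    use_ai_classifier: bool = True
    show_routing_info: bool = True
    search_context_max_chars: int = 12000
```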

3. Set Up Stable Diffusion (Image Generation)

Skip this section if you don't need image generation.

Install Forge (AUTOMATIC1111 fork)

# Install system dependencies
sudo apt-get update
sudo apt-get install -y git wget python3-venv python3-pip \
    libgl1 libglib2.0-0 libsm6 libxrender1 libxext6 libffi-dev libssl-dev

# Clone Forge
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git ~/stable-diffusion-webui
cd ~/stable-diffusion-webui

# Download SDXL model (~6.9GB)
mkdir -p models/Stable-diffusion
wget -O models/Stable-diffusion/sd_xl_base_1.0.safetensors \
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"

# Download Juggernaut XL v9 for uncensored image generation (~6.6GB)
wget -O models/Stable-diffusion/juggernautXL_v9.safetensors \
    "https://huggingface.co/RunDiffusion/Juggernaut-XL-v9/resolve/main/Juggernaut-XL_v9_RunDiffusionPhoto_v2.safetensors"

Fix Python 3.10 build issues (Ubuntu 22.04)

The first launch will create a Python venv and install dependencies. CLIP will fail to build due to a pkg_resources issue on Python 3.10. Fix it:

cd ~/stable-diffusion-webui

# First launch creates the venv — run it once, let it fail, then fix:
./webui.sh --api --listen --xformers --no-half-vae || true

# Fix CLIP build issue
venv/bin/pip install "setuptools<70" wheel
venv/bin/pip install --no-build-isolation \
    https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip

# Launch again
./webui.sh --api --listen --xformers --no-half-vae

Select the default SDXL model

Once the UI is running, open it in a browser and select sd_xl_base_1.0 from the checkpoint dropdown. Or via API:

curl -X POST http://localhost:7860/sdapi/v1/options \
    -H "Content-Type: application/json" \
    -d '{"sd_model_checkpoint": "sd_xl_base_1.0.safetensors"}'

The pipeline automatically switches between models at runtime — sd_xl_base_1.0 for normal generation, juggernautXL_v9 when the uncen prefix is used.
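
The runtime switch can be pictured as the same `/sdapi/v1/options` call shown in the curl example, with the checkpoint chosen by mode. The sketch below only builds the request; `checkpoint_request` is a hypothetical helper, and the checkpoint names assume the filenames used in the download commands above.

```python
import json
import urllib.request

# Filenames as saved by the wget commands in this README (assumption:
# the WebUI accepts them as checkpoint names without further qualification).
CHECKPOINTS = {
    False: "sd_xl_base_1.0.safetensors",
    True: "juggernautXL_v9.safetensors",
}


def checkpoint_request(sd_url: str, uncensored: bool) -> urllib.request.Request:
    """Build the POST request that selects the checkpoint for the given mode."""
    payload = {"sd_model_checkpoint": CHECKPOINTS[uncensored]}
    return urllib.request.Request(
        f"{sd_url}/sdapi/v1/options",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would perform the switch; loading a different SDXL checkpoint can take a while, so a generous timeout is advisable.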

Create a systemd service

Using the provided script:

chmod +x setup-sd-service.sh
sudo ./setup-sd-service.sh

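Or manually. The unquoted heredoc lets your shell expand $USER and $HOME to the current user's values; replace them if the service should run as a different user: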

sudo tee /etc/systemd/system/stable-diffusion.service > /dev/null <<EOF
[Unit]
Description=AUTOMATIC1111 Stable Diffusion WebUI
After=network.target

[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/stable-diffusion-webui
ExecStart=$HOME/stable-diffusion-webui/webui.sh --api --listen --xformers --no-half-vae --medvram-sdxl
Restart=on-failure
RestartSec=10
Environment=HOME=$HOME

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now stable-diffusion

Verify

# Check the service is running
sudo systemctl status stable-diffusion

# Check available models (should list both sd_xl_base and juggernautXL)
curl -s http://localhost:7860/sdapi/v1/sd-models | python3 -m json.tool

4. Network Configuration

The pipeline runs inside Open WebUI's Docker container and needs to reach services on the host:

| Service | URL from container | Notes |
|---|---|---|
| Ollama | `http://ollama:11434` | Docker DNS or host networking |
| Stable Diffusion | `http://172.18.0.1:7860` | Docker bridge gateway IP |

To find your bridge gateway IP:

docker network inspect <your_network> --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'

If your gateway IP differs from 172.18.0.1, update the `sd_url` valve (and SD_URL in the pipeline file) to match.

Verify connectivity from inside the container:

docker exec open-webui curl -s http://172.18.0.1:7860/sdapi/v1/sd-models
docker exec open-webui curl -s http://ollama:11434/api/tags | head -c 100
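
The same connectivity checks can be scripted in Python, for instance from a shell inside the container. This is a small sketch with a hypothetical helper name; it treats any connection error, timeout, or non-200 response as unreachable.

```python
import urllib.request


def reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are socket timeouts.
        return False


if __name__ == "__main__":
    # URLs taken from the table above; adjust to your own network.
    for name, url in {
        "Ollama": "http://ollama:11434/api/tags",
        "Stable Diffusion": "http://172.18.0.1:7860/sdapi/v1/sd-models",
    }.items():
        print(f"{name}: {'ok' if reachable(url) else 'unreachable'}")
```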

Image Generation

Default mode

Any prompt classified as image_generation (e.g. "generate an image of a cat in space") uses SDXL Base 1.0. The LLM refines the user's request into an optimized Stable Diffusion prompt with quality boosters, then calls the A1111 API.
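
A minimal sketch of the API call, assuming the standard `/sdapi/v1/txt2img` endpoint of the A1111/Forge web API and the valve defaults listed earlier. The helper names are hypothetical and the negative prompt is left empty because the pipeline's own value is not documented here.

```python
import base64


def txt2img_payload(prompt: str, width: int = 1024, height: int = 1024,
                    steps: int = 25, cfg_scale: float = 7.0,
                    negative_prompt: str = "") -> dict:
    """Build the JSON body for POST {sd_url}/sdapi/v1/txt2img."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "width": width,
        "height": height,
        "steps": steps,
        "cfg_scale": cfg_scale,
    }


def first_image_bytes(response: dict) -> bytes:
    """Decode the first base64-encoded image from a txt2img response."""
    return base64.b64decode(response["images"][0])
```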

Uncensored mode

Prefix any prompt with uncen to bypass all classification, web search, and routing — the pipeline goes straight to image generation using Juggernaut XL v9:

uncen a beautiful sunset over the ocean
uncen portrait of a warrior in golden armor

The uncen prefix is stripped and the prompt is refined by dolphin-mistral:7b (an uncensored LLM that won't refuse any content) instead of gpt-oss. The pipeline switches the SD checkpoint to Juggernaut XL v9 automatically. If dolphin-mistral is unavailable, it falls back to sending the user's text directly with quality tags appended.
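
The prefix handling and the fallback could look roughly like this. It is a sketch under assumptions: the real matching rules are not documented, so this version requires `uncen` as a whole leading word (case-insensitive), and `QUALITY_TAGS` is a hypothetical placeholder for the pipeline's actual tags.

```python
import re

# Leading "uncen" as a whole word, followed by optional separators.
_UNCEN_PREFIX = re.compile(r"^\s*uncen\b[\s:,-]*", re.IGNORECASE)

# Hypothetical quality tags for the fallback path (assumption).
QUALITY_TAGS = "masterpiece, best quality, highly detailed"


def split_uncen(prompt: str) -> tuple:
    """Return (is_uncensored, prompt_without_prefix)."""
    match = _UNCEN_PREFIX.match(prompt)
    if not match:
        return False, prompt
    return True, prompt[match.end():].strip()


def fallback_sd_prompt(text: str) -> str:
    """Used when the refining LLM is unavailable: user text plus quality tags."""
    return f"{text}, {QUALITY_TAGS}"
```

Requiring a word boundary keeps ordinary prompts such as "uncensored history of Rome" out of this mode.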

How it works

Default mode:

LLM (gpt-oss) converts the user request into an optimized SD prompt
Ollama models are unloaded from VRAM
SD checkpoint is loaded (SDXL Base)
Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
SD checkpoint is unloaded from VRAM and page cache is dropped

Uncensored mode:

uncen prefix is stripped
dolphin-mistral:7b refines the prompt into optimized SD tags (no refusal)
Ollama models are unloaded from VRAM
SD checkpoint is switched to Juggernaut XL v9
Image is generated, compressed PNG→JPEG, and streamed in 4KB chunks
SD checkpoint is unloaded from VRAM and page cache is dropped
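
The final streaming step in both modes can be sketched as slicing the encoded image into fixed-size pieces. The sketch assumes the JPEG is delivered as a base64 data URI inside a markdown image tag, which is a guess at the output format; the function names are hypothetical.

```python
import base64


def as_markdown_image(jpeg_bytes: bytes) -> str:
    """Wrap JPEG bytes in a markdown image with a base64 data URI (assumed format)."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"![generated image](data:image/jpeg;base64,{encoded})"


def stream_chunks(text: str, size: int = 4096):
    """Yield the text in pieces of at most `size` characters."""
    for start in range(0, len(text), size):
        yield text[start:start + size]
```

Small chunks keep each streamed message short, at the cost of more messages per image.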

VRAM Management

On a single 16GB GPU, large Ollama models and SDXL cannot be loaded simultaneously. The pipeline handles this automatically:

Before image generation: unloads all Ollama models from VRAM via keep_alive: 0
After image generation: unloads SD checkpoint via /sdapi/v1/unload-checkpoint and drops Linux page cache
Ollama reloads the model on the next chat request (~10-15s warm-up)

If Ollama fails to load after image generation with a memory error, manually clear the page cache:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
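
The two unload calls described in this section can be sketched as plain HTTP requests. Ollama unloads a model when it receives a generate request with `keep_alive` set to 0, and its `/api/ps` endpoint lists the models currently loaded; the helper names below are hypothetical and the requests are only built, not sent.

```python
import json
import urllib.request


def _post(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_unload_request(ollama_url: str, model: str) -> urllib.request.Request:
    """keep_alive: 0 asks Ollama to evict the model right away."""
    return _post(f"{ollama_url}/api/generate", {"model": model, "keep_alive": 0})


def sd_unload_request(sd_url: str) -> urllib.request.Request:
    """Ask the Stable Diffusion WebUI to release the loaded checkpoint."""
    return _post(f"{sd_url}/sdapi/v1/unload-checkpoint", {})


def loaded_models(ps_response: dict) -> list:
    """Extract model names from an Ollama /api/ps response."""
    return [model["name"] for model in ps_response.get("models", [])]
```

Before generating an image, the pipeline would send one unload request per name returned by `loaded_models`.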

Architecture

User Message
    │
    ├─ "uncen" prefix? ─────────────── → dolphin-mistral:7b (refine) → Juggernaut XL v9
    │
    ├─ Image uploaded? ──────────────── → llama3.2-vision:11b
    │
    ├─ AI Classifier (qwen2.5:7b)
    │       │
    │       ├─ coding ──────────────── → qwen2.5-coder:14b
    │       ├─ diagram ─────────────── → qwen2.5-coder:14b (Mermaid)
    │       ├─ reasoning ───────────── → gpt-oss:120b (FI/EN system prompt)
    │       ├─ image_generation ────── → gpt-oss:120b (refine) → SDXL Base
    │       └─ general ─────────────── → gpt-oss:120b
    │
    ├─ Heuristic Search Override
    │       │
    │       └─ Brave Search + page fetch (if needed)
    │
    └─ Stream response (with thinking tokens)
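
The order of checks in the diagram can be summarized in a few lines. This is an illustrative sketch rather than the pipeline's code: `route` is a hypothetical name, the classifier is passed in as a callable, and the search override and streaming stages are left out.

```python
from typing import Callable

KNOWN_CATEGORIES = {"coding", "diagram", "reasoning", "image_generation", "general"}


def route(message: str, has_image: bool, classify: Callable[[str], str]) -> str:
    """Apply the checks top to bottom and return the chosen route."""
    words = message.split(maxsplit=1)
    # 1. The uncen prefix bypasses everything else.
    if words and words[0].lower() == "uncen":
        return "uncensored_image"
    # 2. An uploaded image goes to the vision model.
    if has_image:
        return "vision"
    # 3. Otherwise ask the classifier; unexpected labels fall back to general.
    category = classify(message)
    return category if category in KNOWN_CATEGORIES else "general"
```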

Files

| File | Description |
|---|---|
| `llm_router_v3.py` | Main pipeline (gpt-oss:120b) |
| `llm_router-20b.py` | Lighter pipeline variant (gpt-oss:20b) |
| `setup-sd.sh` | Stable Diffusion Forge install script (Ubuntu 22.04) |
| `setup-sd-service.sh` | systemd service creation script |

License

MIT