2026-04-05 05:20:44 +00:00

LLM Router Pipeline for Open WebUI

An intelligent prompt classification and routing pipeline for Open WebUI. It classifies each user prompt with a small LLM (qwen2.5:7b) and routes it to a specialized Ollama model, with integrated Brave web search, image generation via Stable Diffusion, and full Finnish/English bilingual support.

Features

  • AI-powered prompt classification with keyword-based fallback
  • Model routing — coding, diagram, reasoning, vision, image generation, and general categories
  • Brave web search with full page content fetching (top 3 results scraped)
  • Heuristic search overrides — safety net that forces search for time-sensitive or factual questions
  • Image generation via AUTOMATIC1111/Forge (Stable Diffusion XL) with LLM-refined prompts
  • VRAM management — automatically juggles GPU memory between Ollama and Stable Diffusion
  • Bilingual — detects Finnish and forces responses in the correct language
  • Thinking/reasoning display — streams model thinking tokens in collapsible blocks
  • Real-time search status — shows which URLs are being fetched as search runs
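The heuristic search override can be pictured as a small pattern check that runs regardless of what the classifier decided. The sketch below is only an illustration: the trigger words, the year pattern, and the function name `needs_search` are assumptions, since this README does not list the pipeline's actual heuristics.

```python
import re

# Hypothetical trigger words (English and Finnish); the pipeline's real list
# is not documented in this README.
TIME_SENSITIVE = re.compile(
    r"\b(today|tonight|latest|current|currently|this week|news|price|weather"
    r"|tänään|uusin|uusimmat|uutiset|hinta|sää)\b",
    re.IGNORECASE,
)
# A four-digit year such as 2024 often signals a factual, dated question.
YEAR = re.compile(r"\b20\d{2}\b")


def needs_search(prompt: str, classifier_wants_search: bool = False) -> bool:
    """Return True if the classifier asked for search or a heuristic fires."""
    if classifier_wants_search:
        return True
    return bool(TIME_SENSITIVE.search(prompt) or YEAR.search(prompt))
```

Keeping the override separate from the classifier means a misclassified prompt can still trigger a search when it clearly asks about something recent.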

Model Routing

Category          Model (120B)           Model (20B)            Trigger
coding            qwen2.5-coder:14b      qwen2.5-coder:14b      User asks to write/fix/debug code
diagram           qwen2.5-coder:14b      qwen2.5-coder:14b      Mermaid, flowchart, UML requests
reasoning (FI)    gpt-oss:120b           gpt-oss:20b            Analysis, comparison, strategy (Finnish)
reasoning (EN)    gpt-oss:120b           gpt-oss:20b            Analysis, comparison, strategy (English)
image generation  gpt-oss:120b + SDXL    gpt-oss:20b + SDXL     "generate an image", "luo kuva"
vision            llama3.2-vision:11b    llama3.2-vision:11b    User uploads an image
general           gpt-oss:120b           gpt-oss:20b            Everything else
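As a rough sketch, the table above reduces to a lookup with one variant-dependent default. The names `SPECIALISTS` and `model_for` are illustrative and not taken from the pipeline source.

```python
from typing import Dict

# Specialist models are identical in both variants (per the routing table).
SPECIALISTS: Dict[str, str] = {
    "coding": "qwen2.5-coder:14b",
    "diagram": "qwen2.5-coder:14b",
    "vision": "llama3.2-vision:11b",
}


def model_for(category: str, variant: str = "120b") -> str:
    """Map a category to an Ollama model tag for the chosen variant."""
    if variant not in ("120b", "20b"):
        raise ValueError(f"unknown variant: {variant!r}")
    # Reasoning, image generation (prompt refinement) and general chat
    # all fall through to the variant's gpt-oss model.
    return SPECIALISTS.get(category, f"gpt-oss:{variant}")
```

For example, `model_for("reasoning", "20b")` yields `gpt-oss:20b`, while `model_for("coding")` yields `qwen2.5-coder:14b` in either variant.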

Two pipeline variants are provided:

  • llm_router_v3.py — uses gpt-oss:120b (higher quality, more VRAM/RAM)
  • llm_router-20b.py — uses gpt-oss:20b (lighter, better for constrained hardware)

Prerequisites

  • Ubuntu 22.04 LTS (tested)
  • NVIDIA GPU with 16GB+ VRAM (tested on RTX 2000 Ada)
  • Open WebUI running in Docker with pipelines enabled
  • Ollama installed natively with models pulled:
    ollama pull qwen2.5:7b
    ollama pull qwen2.5-coder:14b
    ollama pull gpt-oss:120b    # or gpt-oss:20b for the lighter variant
    ollama pull llama3.2-vision:11b
    
  • Brave Search API key (free tier: https://brave.com/search/api/)

Setup

1. Deploy the Pipeline

Copy your chosen pipeline file to the Open WebUI pipelines directory:

cp llm_router_v3.py ~/ai-stack/pipelines/
# or for the 20B variant:
cp llm_router-20b.py ~/ai-stack/pipelines/

Restart the pipelines container:

docker restart pipelines

2. Configure Valves in Open WebUI

Go to Admin Panel > Pipelines in Open WebUI and configure:

Setting                    Description                              Default
ollama_url                 Ollama API URL                           http://ollama:11434
sd_url                     Stable Diffusion API URL                 http://172.18.0.1:7860
brave_api_key              Brave Search API key                     (from env BRAVE_API_KEY)
sd_width / sd_height       Generated image dimensions               1024 x 1024
sd_steps                   Sampling steps                           25
sd_cfg_scale               CFG scale                                7.0
brave_max_results          Number of search results                 6
use_ai_classifier          Use AI vs keyword-only classification    true
show_routing_info          Show routing banner in responses         true
search_context_max_chars   Max search context size                  12000
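For orientation, the settings above can be written down as a typed configuration object. Open WebUI pipelines usually declare valves as a pydantic model; the sketch below uses a standard-library dataclass as a stand-in so it runs without extra packages, and the helper `clip_search_context` is a hypothetical illustration of how `search_context_max_chars` might be applied.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Valves:
    """Stand-in for the pipeline's valves, with the defaults listed above."""
    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"
    # Read from the environment at construction time; empty if unset.
    brave_api_key: str = field(
        default_factory=lambda: os.environ.get("BRAVE_API_KEY", "")
    )
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 25
    sd_cfg_scale: float = 7.0
    brave_max_results: int = 6
    use_ai_classifier: bool = True
    show_routing_info: bool = True
    search_context_max_chars: int = 12000


def clip_search_context(text: str, valves: Valves) -> str:
    """Truncate fetched search text to the configured character budget."""
    return text[: valves.search_context_max_chars]
```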

3. Set Up Stable Diffusion (Image Generation)

Skip this section if you don't need image generation.

Install Forge (AUTOMATIC1111 fork)

# Install system dependencies
sudo apt-get update
sudo apt-get install -y git wget python3-venv python3-pip \
    libgl1 libglib2.0-0 libsm6 libxrender1 libxext6 libffi-dev libssl-dev

# Clone Forge
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git ~/stable-diffusion-webui
cd ~/stable-diffusion-webui

# Download SDXL model (~6.9GB)
mkdir -p models/Stable-diffusion
wget -O models/Stable-diffusion/sd_xl_base_1.0.safetensors \
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"

Fix Python 3.10 build issues (Ubuntu 22.04)

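On Ubuntu 22.04 (Python 3.10), the first launch creates the virtual environment but can fail while building CLIP. Run it once, let it fail, then install the CLIP dependencies manually and launch again: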

cd ~/stable-diffusion-webui
# First launch creates the venv — run it once, let it fail, then fix:
./webui.sh --api --listen --xformers --no-half-vae || true

# Fix CLIP build issue
venv/bin/pip install "setuptools<70" wheel
venv/bin/pip install --no-build-isolation \
    https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip

# Launch again
./webui.sh --api --listen --xformers --no-half-vae

Select SDXL model

Once the UI is running, open it in a browser and select sd_xl_base_1.0 from the checkpoint dropdown. Alternatively, set it through the API:

curl -X POST http://localhost:7860/sdapi/v1/options \
    -H "Content-Type: application/json" \
    -d '{"sd_model_checkpoint": "sd_xl_base_1.0.safetensors"}'

Create a systemd service

chmod +x setup-sd-service.sh
sudo ./setup-sd-service.sh

Or manually:

sudo tee /etc/systemd/system/stable-diffusion.service > /dev/null <<EOF
[Unit]
Description=AUTOMATIC1111 Stable Diffusion WebUI
After=network.target

[Service]
Type=simple
User=$USER
WorkingDirectory=$HOME/stable-diffusion-webui
ExecStart=$HOME/stable-diffusion-webui/webui.sh --api --listen --xformers --no-half-vae --medvram-sdxl
Restart=on-failure
RestartSec=10
Environment=HOME=$HOME

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now stable-diffusion

Verify

curl -s http://localhost:7860/sdapi/v1/sd-models | python3 -m json.tool
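Once the model list comes back, a test image can be requested through the same API. The following is a minimal sketch, assuming the standard AUTOMATIC1111 `/sdapi/v1/txt2img` endpoint, which returns base64-encoded images in an `images` list. The function names are illustrative, and the pipeline's own request may carry more fields.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_txt2img_payload(prompt: str, width: int = 1024, height: int = 1024,
                          steps: int = 25, cfg_scale: float = 7.0) -> Dict[str, Any]:
    """Build a txt2img request body using the default valve values."""
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "steps": steps,
        "cfg_scale": cfg_scale,
    }


def first_image_bytes(response_json: Dict[str, Any]) -> bytes:
    """Decode the first base64 image from a txt2img response."""
    images = response_json.get("images") or []
    if not images:
        raise ValueError("response contained no images")
    return base64.b64decode(images[0])


def generate_image(sd_url: str, prompt: str, timeout: float = 300.0) -> bytes:
    """POST the payload to the Stable Diffusion API and return image bytes."""
    request = urllib.request.Request(
        f"{sd_url}/sdapi/v1/txt2img",
        data=json.dumps(build_txt2img_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return first_image_bytes(json.load(response))
```

Calling `generate_image("http://localhost:7860", "a lighthouse at dusk")` and writing the result to a `.png` file is a quick end-to-end check that the SDXL checkpoint loads and renders.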

4. Network Configuration

The pipeline runs inside Open WebUI's Docker container and needs to reach:

Service            URL from container        Notes
Ollama             http://ollama:11434       Docker DNS or host networking
Stable Diffusion   http://172.18.0.1:7860    Docker bridge gateway IP

To find your bridge gateway IP:

docker network inspect <your_network> --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'

Verify connectivity from inside the container:

docker exec open-webui curl -s http://172.18.0.1:7860/sdapi/v1/sd-models

VRAM Management

On a single 16GB GPU, gpt-oss:120b and SDXL cannot be loaded simultaneously. The pipeline handles this automatically:

  1. Before image generation: unloads all Ollama models from VRAM
  2. After image generation: unloads SD checkpoint from VRAM and drops Linux page cache
  3. Ollama reloads the model on the next chat request (~10-15s warm-up)

If Ollama fails to load a model after image generation and reports a memory error, clear the page cache manually:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
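The unload steps can be sketched against the public HTTP APIs. This is an assumption about how the pipeline might do it, not its actual code: it relies on Ollama's `/api/ps` listing and on `keep_alive: 0` to evict a model, and on the AUTOMATIC1111 `/sdapi/v1/unload-checkpoint` endpoint, which Forge is assumed to expose as well. Dropping the page cache is not covered here. The `send` parameter is a hypothetical injection point so the logic can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable, List, Optional

Transport = Callable[[str, str, Optional[dict]], dict]


def http_json(method: str, url: str, body: Optional[dict] = None) -> dict:
    """Send a JSON request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        raw = response.read()
    return (json.loads(raw) if raw else {}) or {}


def unload_ollama_models(ollama_url: str, send: Transport = http_json) -> List[str]:
    """Ask Ollama to evict every currently loaded model; return their names."""
    loaded = send("GET", f"{ollama_url}/api/ps", None).get("models", [])
    names = [model["name"] for model in loaded]
    for name in names:
        # keep_alive=0 with no prompt tells Ollama to unload the model now.
        send("POST", f"{ollama_url}/api/generate",
             {"model": name, "keep_alive": 0, "stream": False})
    return names


def unload_sd_checkpoint(sd_url: str, send: Transport = http_json) -> None:
    """Ask the Stable Diffusion server to release its checkpoint from VRAM."""
    send("POST", f"{sd_url}/sdapi/v1/unload-checkpoint", None)
```

In this sketch, `unload_ollama_models` would run before image generation and `unload_sd_checkpoint` after it, matching steps 1 and 2 above.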

Architecture

User Message
    │
    ├─ Image uploaded? ──────────────── → llama3.2-vision:11b
    │
    ├─ AI Classifier (qwen2.5:7b)
    │       │
    │       ├─ coding ──────────────── → qwen2.5-coder:14b
    │       ├─ diagram ─────────────── → qwen2.5-coder:14b (Mermaid)
    │       ├─ reasoning ───────────── → gpt-oss:120b (FI/EN system prompt)
    │       ├─ image_generation ────── → gpt-oss:120b (refine) → SDXL (generate)
    │       └─ general ─────────────── → gpt-oss:120b
    │
    ├─ Heuristic Search Override
    │       │
    │       └─ Brave Search + page fetch (if needed)
    │
    └─ Stream response (with thinking tokens)
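The decision flow in the diagram can be sketched as a short function: an image short-circuits to the vision model, otherwise the AI classifier is tried first and a keyword pass takes over if it fails or returns an unknown label. The keyword lists and function names below are illustrative assumptions. Only the triggers quoted in the routing table ("generate an image", "luo kuva", Mermaid/flowchart/UML) come from this README.

```python
from typing import Callable, Optional

CATEGORIES = ("coding", "diagram", "reasoning", "image_generation", "general")

# Hypothetical keyword lists for the fallback classifier.
KEYWORDS = {
    "diagram": ("mermaid", "flowchart", "uml"),
    "image_generation": ("generate an image", "luo kuva"),
    "coding": ("debug", "fix this code", "write a function", "traceback"),
    "reasoning": ("compare", "analyze", "analyse", "strategy", "vertaa", "analysoi"),
}


def keyword_classify(prompt: str) -> str:
    """Return the first category whose keywords appear in the prompt."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "general"


def classify(prompt: str,
             ai_classifier: Optional[Callable[[str], str]] = None) -> str:
    """Prefer the AI classifier; fall back to keywords on error or bad label."""
    if ai_classifier is not None:
        try:
            label = ai_classifier(prompt).strip().lower()
            if label in CATEGORIES:
                return label
        except Exception:
            pass  # classifier unreachable or misbehaving: use keywords
    return keyword_classify(prompt)


def route(prompt: str, has_image: bool = False,
          ai_classifier: Optional[Callable[[str], str]] = None,
          general_model: str = "gpt-oss:120b") -> str:
    """Pick the model tag for a prompt, following the diagram above."""
    if has_image:
        return "llama3.2-vision:11b"
    category = classify(prompt, ai_classifier)
    if category in ("coding", "diagram"):
        return "qwen2.5-coder:14b"
    return general_model
```

Passing `general_model="gpt-oss:20b"` gives the behavior of the lighter variant. The search override and response streaming from the diagram are left out of this sketch.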

Files

File                  Description
llm_router_v3.py      Main pipeline (gpt-oss:120b)
llm_router-20b.py     Lighter pipeline variant (gpt-oss:20b)
setup-sd.sh           Stable Diffusion Forge install script
setup-sd-service.sh   systemd service creation script

License

MIT
