# LLM Router Pipeline for Open WebUI

An intelligent prompt classification and routing pipeline for [Open WebUI](https://github.com/open-webui/open-webui). Classifies user prompts using AI (qwen2.5:7b) and routes them to specialized Ollama models, with integrated Brave web search, image generation via Stable Diffusion, and full Finnish/English bilingual support.

## Features

- **AI-powered prompt classification** with keyword-based fallback
- **Model routing**: coding, diagram, reasoning, vision, image generation, and general categories
- **Brave web search** with full page content fetching (top 3 results scraped)
- **Heuristic search overrides**: safety net that forces search for time-sensitive or factual questions
- **Image generation** via AUTOMATIC1111/Forge (Stable Diffusion XL) with LLM-refined prompts
- **VRAM management**: automatically juggles GPU memory between Ollama and Stable Diffusion
- **Bilingual**: detects Finnish and forces responses in the correct language
- **Thinking/reasoning display**: streams model thinking tokens in collapsible blocks
- **Real-time search status**: shows which URLs are being fetched as search runs

## Model Routing

| Category | Model (120B) | Model (20B) | Trigger |
|---|---|---|---|
| coding | qwen2.5-coder:14b | qwen2.5-coder:14b | User asks to write/fix/debug code |
| diagram | qwen2.5-coder:14b | qwen2.5-coder:14b | Mermaid, flowchart, UML requests |
| reasoning (FI) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (Finnish) |
| reasoning (EN) | gpt-oss:120b | gpt-oss:20b | Analysis, comparison, strategy (English) |
| image generation | gpt-oss:120b + SDXL | gpt-oss:20b + SDXL | "generate an image", "luo kuva" |
| vision | llama3.2-vision:11b | llama3.2-vision:11b | User uploads an image |
| general | gpt-oss:120b | gpt-oss:20b | Everything else |

Two pipeline variants are provided:

- **`llm_router_v3.py`**: uses gpt-oss:120b (higher quality, more VRAM/RAM)
- **`llm_router-20b.py`**: uses gpt-oss:20b
(lighter, better for constrained hardware)

## Prerequisites

- **Ubuntu 22.04 LTS** (tested)
- **NVIDIA GPU** with 16GB+ VRAM (tested on RTX 2000 Ada)
- **Open WebUI** running in Docker with pipelines enabled
- **Ollama** installed natively with models pulled:
  ```bash
  ollama pull qwen2.5:7b
  ollama pull qwen2.5-coder:14b
  ollama pull gpt-oss:120b    # or gpt-oss:20b for the lighter variant
  ollama pull llama3.2-vision:11b
  ```
- **Brave Search API key** (free tier: https://brave.com/search/api/)

## Setup

### 1. Deploy the Pipeline

Copy your chosen pipeline file to the Open WebUI pipelines directory:

```bash
cp llm_router_v3.py ~/ai-stack/pipelines/
# or for the 20B variant:
cp llm_router-20b.py ~/ai-stack/pipelines/
```

Restart the pipelines container:

```bash
docker restart pipelines
```

### 2. Configure Valves in Open WebUI

Go to **Admin Panel > Pipelines** in Open WebUI and configure:

| Setting | Description | Default |
|---|---|---|
| `ollama_url` | Ollama API URL | `http://ollama:11434` |
| `sd_url` | Stable Diffusion API URL | `http://172.18.0.1:7860` |
| `brave_api_key` | Brave Search API key | (from env `BRAVE_API_KEY`) |
| `sd_width` / `sd_height` | Generated image dimensions | 1024 x 1024 |
| `sd_steps` | Sampling steps | 25 |
| `sd_cfg_scale` | CFG scale | 7.0 |
| `brave_max_results` | Number of search results | 6 |
| `use_ai_classifier` | Use AI vs keyword-only classification | true |
| `show_routing_info` | Show routing banner in responses | true |
| `search_context_max_chars` | Max search context size | 12000 |

### 3. Set Up Stable Diffusion (Image Generation)

> Skip this section if you don't need image generation.

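
For orientation, the image settings in the Valves table (`sd_width`, `sd_height`, `sd_steps`, `sd_cfg_scale`) line up with fields of the AUTOMATIC1111/Forge `txt2img` API that this section sets up. The sketch below shows how such a request could be assembled from those defaults. It is an illustration only: `SDSettings`, `build_txt2img_payload`, and `generate_image` are hypothetical names, and the use of the `/sdapi/v1/txt2img` endpoint is an assumption rather than something taken from the pipeline source.

```python
import base64
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class SDSettings:
    """Hypothetical container mirroring the image-related Valves defaults."""
    sd_url: str = "http://172.18.0.1:7860"
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 25
    sd_cfg_scale: float = 7.0


def build_txt2img_payload(prompt: str, settings: SDSettings) -> dict:
    """Map the settings onto JSON fields accepted by /sdapi/v1/txt2img."""
    return {
        "prompt": prompt,
        "width": settings.sd_width,
        "height": settings.sd_height,
        "steps": settings.sd_steps,
        "cfg_scale": settings.sd_cfg_scale,
    }


def generate_image(prompt: str, settings: SDSettings, timeout: float = 300.0) -> bytes:
    """POST the payload and return the first image as raw bytes.

    Requires a running Stable Diffusion API (started with --api).
    """
    request = urllib.request.Request(
        f"{settings.sd_url}/sdapi/v1/txt2img",
        data=json.dumps(build_txt2img_payload(prompt, settings)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # The API returns generated images as base64-encoded strings.
    return base64.b64decode(body["images"][0])
```

`build_txt2img_payload` has no side effects, so it can be checked without a GPU; `generate_image` only works once Forge is installed and running as described in the steps that follow.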
#### Install Forge (AUTOMATIC1111 fork)

```bash
# Install system dependencies
sudo apt-get update
sudo apt-get install -y git wget python3-venv python3-pip \
  libgl1 libglib2.0-0 libsm6 libxrender1 libxext6 libffi-dev libssl-dev

# Clone Forge
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git ~/stable-diffusion-webui
cd ~/stable-diffusion-webui

# Download SDXL model (~6.9GB)
mkdir -p models/Stable-diffusion
wget -O models/Stable-diffusion/sd_xl_base_1.0.safetensors \
  "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"
```

#### Fix Python 3.10 build issues (Ubuntu 22.04)

On Ubuntu 22.04 (Python 3.10) the first launch can fail while building CLIP. The first launch is still needed because it creates the venv, so run it once, let it fail, then install the CLIP dependencies into the venv and launch again:

```bash
cd ~/stable-diffusion-webui

# First launch creates the venv; run it once, let it fail, then fix:
./webui.sh --api --listen --xformers --no-half-vae || true

# Fix CLIP build issue
venv/bin/pip install "setuptools<70" wheel
venv/bin/pip install --no-build-isolation \
  https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip

# Launch again
./webui.sh --api --listen --xformers --no-half-vae
```

#### Select SDXL model

Once the UI is running, open it in a browser and select `sd_xl_base_1.0` from the checkpoint dropdown. Or via API:

```bash
curl -X POST http://localhost:7860/sdapi/v1/options \
  -H "Content-Type: application/json" \
  -d '{"sd_model_checkpoint": "sd_xl_base_1.0.safetensors"}'
```

#### Create a systemd service

```bash
chmod +x setup-sd-service.sh
sudo ./setup-sd-service.sh
```

Or manually:

```bash
sudo tee /etc/systemd/system/stable-diffusion.service > /dev/null < --format '{{range .IPAM.Config}}{{.Gateway}}{{end}}'
```

Verify connectivity from inside the container:

```bash
docker exec open-webui curl -s http://172.18.0.1:7860/sdapi/v1/sd-models
```

## VRAM Management

On a single 16GB GPU, gpt-oss:120b and SDXL cannot be loaded simultaneously. The pipeline handles this automatically:

1. **Before image generation**: unloads all Ollama models from VRAM
2. **After image generation**: unloads the SD checkpoint from VRAM and drops the Linux page cache
3. Ollama reloads the model on the next chat request (~10-15s warm-up)

If Ollama fails to load after image generation with a memory error, clear the page cache manually:

```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

## Architecture

```
User Message
    │
    ├─ Image uploaded? ──────────────── → llama3.2-vision:11b
    │
    ├─ AI Classifier (qwen2.5:7b)
    │   │
    │   ├─ coding ──────────────── → qwen2.5-coder:14b
    │   ├─ diagram ─────────────── → qwen2.5-coder:14b (Mermaid)
    │   ├─ reasoning ───────────── → gpt-oss:120b (FI/EN system prompt)
    │   ├─ image_generation ────── → gpt-oss:120b (refine) → SDXL (generate)
    │   └─ general ─────────────── → gpt-oss:120b
    │
    ├─ Heuristic Search Override
    │   │
    │   └─ Brave Search + page fetch (if needed)
    │
    └─ Stream response (with thinking tokens)
```

## Files

| File | Description |
|---|---|
| `llm_router_v3.py` | Main pipeline (gpt-oss:120b) |
| `llm_router-20b.py` | Lighter pipeline variant (gpt-oss:20b) |
| `setup-sd.sh` | Stable Diffusion Forge install script |
| `setup-sd-service.sh` | systemd service creation script |

## License

MIT
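
## Example: Routing Sketch

As an illustration of the Model Routing table and the Architecture diagram, the model selection step can be sketched in a few lines of Python. This is a minimal sketch, not the code in `llm_router_v3.py`: the names `LARGE_MODEL`, `MODEL_ROUTES`, and `pick_model` are hypothetical, the fallback to `general` for unknown categories is an assumption, and it presumes the classifier returns one of the category strings shown in the diagram.

```python
# Hypothetical lookup of the large model per pipeline variant.
LARGE_MODEL = {"120b": "gpt-oss:120b", "20b": "gpt-oss:20b"}

# Hypothetical routing table built from the Model Routing section.
# "{large}" is replaced with the variant's gpt-oss model.
MODEL_ROUTES = {
    "coding": "qwen2.5-coder:14b",
    "diagram": "qwen2.5-coder:14b",
    "reasoning": "{large}",
    "image_generation": "{large}",  # refines the prompt before SDXL generates
    "vision": "llama3.2-vision:11b",
    "general": "{large}",
}


def pick_model(category: str, has_image: bool = False, variant: str = "120b") -> str:
    """Return the Ollama model name for a classified prompt."""
    # An uploaded image bypasses the classifier, as in the diagram.
    if has_image:
        return MODEL_ROUTES["vision"]
    # Unknown categories fall back to "general" (an assumption of this sketch).
    template = MODEL_ROUTES.get(category, MODEL_ROUTES["general"])
    return template.format(large=LARGE_MODEL[variant])


if __name__ == "__main__":
    for category in ("coding", "reasoning", "general"):
        print(category, "->", pick_model(category, variant="20b"))
```

For example, `pick_model("reasoning", variant="20b")` resolves to `gpt-oss:20b`, while any request with `has_image=True` resolves to `llama3.2-vision:11b` regardless of category. Keeping the table as plain data means the two variants differ only in the `variant` argument.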