
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an **Open WebUI Pipeline** that acts as an intelligent LLM router. It classifies user prompts and routes them to different Ollama models based on intent, with integrated web search and image generation. Two variants exist: `llm_router_v3.py` (gpt-oss:120b) and `llm_router-20b.py` (gpt-oss:20b).

## Architecture

Single-file pipelines that run inside Open WebUI's pipelines container. The flow is:

1. **"uncen" prefix detection** — bypasses all classification/search, goes straight to uncensored image generation (Juggernaut XL v9)
2. **Vision detection** — checks if the latest user message (not assistant messages) contains an uploaded image
3. **AI classification** — qwen2.5:7b classifies prompts into: coding, diagram, reasoning, image_generation, vision, general
4. **Heuristic safety net** — keyword/pattern-based overrides can force search=true even if AI said no
5. **Finnish language injection** — prepends Finnish instruction to system prompt when Finnish is detected
6. **Web search** — Brave Search API with real-time status updates and full page content fetching for top 3 results
7. **Image generation** — Forge API via SDXL (default) or Juggernaut XL v9 (uncensored), with LLM-refined prompts
8. **VRAM management** — unloads Ollama before SD, unloads SD checkpoint after, drops page cache
9. **Streaming response** — streams model output including thinking/reasoning tokens in collapsible `<details>` blocks
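
The priority of steps 1 and 2 can be sketched as below. This is a minimal illustration, not the pipeline's actual code: it assumes OpenAI-style message dicts (string content, or a list of `text` / `image_url` parts), and `strip_uncen_prefix`, `latest_user_message`, `message_text`, and `pre_route` are hypothetical helper names. Only `has_image_content` is named in this document, and its body here is a guess at the behavior described.

```python
from __future__ import annotations

from typing import Any

UNCEN_PREFIX = "uncen"  # trigger word from this document; exact matching rules are assumed


def strip_uncen_prefix(prompt: str) -> tuple[bool, str]:
    """Return (is_uncen, rest_of_prompt) for a case-insensitive leading "uncen"."""
    words = prompt.strip().split(maxsplit=1)
    if words and words[0].lower().rstrip(":,") == UNCEN_PREFIX:
        return True, words[1] if len(words) > 1 else ""
    return False, prompt


def latest_user_message(messages: list[dict[str, Any]]) -> dict[str, Any] | None:
    """Walk the history backwards and return the most recent user message."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message
    return None


def message_text(message: dict[str, Any]) -> str:
    """Extract plain text from either string content or a list of content parts."""
    content = message.get("content") or ""
    if isinstance(content, str):
        return content
    return " ".join(
        part.get("text", "")
        for part in content
        if isinstance(part, dict) and part.get("type") == "text"
    )


def has_image_content(messages: list[dict[str, Any]]) -> bool:
    """True only if the latest user message carries an image part.

    Assistant messages (which may embed previously generated images) are ignored.
    """
    message = latest_user_message(messages)
    if message is None or not isinstance(message.get("content"), list):
        return False
    return any(
        isinstance(part, dict) and part.get("type") == "image_url"
        for part in message["content"]
    )


def pre_route(messages: list[dict[str, Any]]) -> str:
    """Apply steps 1 and 2 in priority order; everything else goes to the classifier."""
    message = latest_user_message(messages)
    if message is None:
        return "classify"
    is_uncen, _ = strip_uncen_prefix(message_text(message))
    if is_uncen:
        return "uncensored_image"  # highest priority, skips vision detection too
    if has_image_content(messages):
        return "vision"
    return "classify"
```

Because the prefix check runs first, a prompt that starts with "uncen" goes to uncensored image generation even when the same message also contains an uploaded image.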

### Model Routing

| Category | Model | Notes |
|---|---|---|
| coding | qwen2.5-coder:14b | Only when user asks to write/fix code |
| diagram | qwen2.5-coder:14b | Mermaid output |
| reasoning (FI/EN) | gpt-oss:120b / 20b | Finnish detection via keyword scoring (threshold ≥ 2) |
| image_generation | gpt-oss → SDXL Base | LLM refines prompt, then calls the Forge (A1111-compatible) API |
| uncensored image | dolphin-mistral:7b → Juggernaut XL v9 | Triggered by "uncen" prefix, skips classifier and search, uses uncensored LLM for prompt refinement |
| vision | llama3.2-vision:11b | Only when latest user message has image |
| general | gpt-oss:120b / 20b | |
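
The table above, together with the Finnish keyword scoring, could be expressed roughly as follows. This is a sketch under assumptions: the model names and the threshold of 2 come from the table, but the contents of `FINNISH_INDICATORS` shown here are hypothetical sample words, and `MODEL_ROUTES`, `finnish_score`, `is_finnish`, and `pick_model` are illustrative names rather than the pipeline's real ones.

```python
import re

# Category -> Ollama model, taken from the routing table (120b variant).
MODEL_ROUTES = {
    "coding": "qwen2.5-coder:14b",
    "diagram": "qwen2.5-coder:14b",
    "reasoning": "gpt-oss:120b",
    "image_generation": "gpt-oss:120b",  # LLM used for prompt refinement only
    "vision": "llama3.2-vision:11b",
    "general": "gpt-oss:120b",
}

# Hypothetical sample of indicator words; the real list is not shown in this file.
FINNISH_INDICATORS = {
    "mitä", "miten", "miksi", "onko", "että",
    "kanssa", "tämä", "kiitos", "minä", "olen",
}
FINNISH_THRESHOLD = 2  # "threshold ≥ 2" from the routing table


def finnish_score(text: str) -> int:
    """Count how many words of the prompt are known Finnish indicators."""
    words = re.findall(r"[a-zåäö]+", text.lower())
    return sum(1 for word in words if word in FINNISH_INDICATORS)


def is_finnish(text: str) -> bool:
    """A prompt counts as Finnish once its score reaches the threshold."""
    return finnish_score(text) >= FINNISH_THRESHOLD


def pick_model(category: str, large: bool = True) -> str:
    """Look up the model for a category, falling back to the general model."""
    model = MODEL_ROUTES.get(category, MODEL_ROUTES["general"])
    if not large and model == "gpt-oss:120b":
        model = "gpt-oss:20b"  # the 20b pipeline variant
    return model
```

For example, `pick_model("diagram")` resolves to the coder model, while an unknown category falls back to the general model. Requiring two indicator hits rather than one makes a single Finnish-looking word in an English prompt less likely to trigger the language injection.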

### Key Design Decisions

- **"uncen" prefix** — highest priority check, bypasses everything (classification, search, vision detection) and routes to uncensored image generation. Uses dolphin-mistral:7b (uncensored LLM) for prompt refinement instead of gpt-oss which refuses NSFW content. Falls back to raw prompt + quality tags if dolphin-mistral is unavailable.
- **Classifier strictness** — "coding" only triggers when user explicitly asks for code output. Discussing IT/tech topics routes to general/reasoning.
- **Finnish/English bilingual** — Finnish detected by scoring FINNISH_INDICATORS. A Finnish instruction is injected into system prompts for all categories.
- **Search is aggressive** — heuristic layer ensures search triggers for factual questions, even if AI classifier says no.
- **Year injection** — years in generated search queries that do not match the current year are replaced with the current year, because the LLM tends to hallucinate the wrong year when it writes a query.
- **VRAM dance** — RTX 2000 Ada 16GB can't hold both gpt-oss:120b and SDXL simultaneously. Pipeline unloads Ollama before SD, unloads SD after, drops page cache.
- **SD model switching** — pipeline calls `/sdapi/v1/options` to swap between SDXL Base and Juggernaut XL v9 at runtime.
- **Chunked image streaming** — base64 images compressed PNG→JPEG and yielded in 4KB chunks to avoid "chunk too big" errors.
- **Vision false positive fix** — `has_image_content` only checks the latest user message, not assistant responses containing previously generated images.
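
Two of the decisions above, year injection and chunked image streaming, lend themselves to a short sketch. Both functions are illustrative assumptions rather than the pipeline's code: `inject_current_year` treats any four-digit `20xx` year earlier than the current one as "wrong" (the real rule may differ), and `iter_chunks` only shows the 4 KB slicing, leaving out the PNG→JPEG compression, which would need an imaging library.

```python
from __future__ import annotations

import re
from collections.abc import Iterator
from datetime import date

YEAR_RE = re.compile(r"\b20\d{2}\b")
CHUNK_SIZE = 4096  # 4 KB, as described above


def inject_current_year(query: str, current_year: int | None = None) -> str:
    """Replace past 20xx years in a search query with the current year.

    Assumption: a year is "wrong" when it lies before the current year.
    """
    year = current_year if current_year is not None else date.today().year

    def fix(match: re.Match) -> str:
        found = int(match.group())
        return str(year) if found < year else match.group()

    return YEAR_RE.sub(fix, query)


def iter_chunks(data: str, size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield a long string (for example base64 image data) in fixed-size pieces."""
    for start in range(0, len(data), size):
        yield data[start:start + size]
```

A pipeline generator could then `yield` each piece from `iter_chunks(...)` in turn so that no single streamed chunk exceeds the size limit.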

## Deployment

- **Open WebUI**: Docker container on `ai-stack_default` bridge network
- **Ollama**: Native on host, reached via `http://ollama:11434` from containers
- **Forge (A1111)**: Native on host, systemd service `stable-diffusion`, reached via `http://172.18.0.1:7860` (Docker bridge gateway)
- **Server**: Ubuntu 22.04 LTS, NVIDIA RTX 2000 Ada 16GB
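
With the endpoints listed above, the VRAM hand-off and checkpoint switch described under Key Design Decisions might look like the following. Treat it as a sketch: it relies on Ollama unloading a model when `/api/generate` receives `keep_alive: 0`, and on the A1111-style `/sdapi/v1/options`, `/sdapi/v1/txt2img`, and `/sdapi/v1/unload-checkpoint` routes, which Forge is assumed to expose unchanged. The function names, the checkpoint filename in the usage note, and the generation settings are hypothetical, and the page-cache drop is left out because it needs root on the host.

```python
from __future__ import annotations

import json
import urllib.request
from typing import Any

OLLAMA_URL = "http://ollama:11434"   # from the Deployment list
SD_URL = "http://172.18.0.1:7860"    # Docker bridge gateway to Forge

Request = tuple[str, dict[str, Any]]


def ollama_unload_request(model: str) -> Request:
    """keep_alive=0 asks Ollama to evict the model from VRAM immediately."""
    return f"{OLLAMA_URL}/api/generate", {"model": model, "keep_alive": 0}


def sd_switch_request(checkpoint: str) -> Request:
    """Select the active checkpoint through the options endpoint."""
    return f"{SD_URL}/sdapi/v1/options", {"sd_model_checkpoint": checkpoint}


def sd_txt2img_request(prompt: str) -> Request:
    """Generation settings here are placeholders, not the pipeline's defaults."""
    payload = {"prompt": prompt, "width": 1024, "height": 1024, "steps": 30}
    return f"{SD_URL}/sdapi/v1/txt2img", payload


def sd_unload_request() -> Request:
    """Free the checkpoint's VRAM once the image has been produced."""
    return f"{SD_URL}/sdapi/v1/unload-checkpoint", {}


def image_generation_plan(llm_model: str, checkpoint: str, prompt: str) -> list[Request]:
    """Ordered requests: free Ollama, pick checkpoint, generate, free SD."""
    return [
        ollama_unload_request(llm_model),
        sd_switch_request(checkpoint),
        sd_txt2img_request(prompt),
        sd_unload_request(),
    ]


def post_json(url: str, payload: dict[str, Any], timeout: float = 300.0) -> Any:
    """Send one JSON POST and decode the JSON response (None if empty)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    return json.loads(body) if body else None
```

Running the plan is then a loop such as `for url, payload in image_generation_plan("gpt-oss:120b", "sd_xl_base_1.0.safetensors", "a red fox"): post_json(url, payload)`, where the checkpoint filename is a placeholder for whatever name Forge reports for the installed model.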

Pipeline is deployed by copying the `.py` file to `~/ai-stack/pipelines/` on the server and restarting the pipelines container.

## Setup Scripts

- `setup-sd.sh` — installs Forge, downloads SDXL Base + Juggernaut XL v9, fixes CLIP build issue (Ubuntu 22.04)
- `setup-sd-service.sh` — creates systemd service for Forge (handles sudo user detection correctly)

## Configuration

All runtime settings are exposed as **Valves** in Open WebUI's pipeline settings UI:
`ollama_url`, `sd_url`, `sd_width`, `sd_height`, `sd_steps`, `sd_cfg_scale`, `brave_api_key`, `brave_max_results`, `use_ai_classifier`, `show_routing_info`, `search_context_max_chars`
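
As a rough picture of how these settings might be consumed, the sketch below uses a standard-library dataclass as a stand-in for the valves object, so that it runs without dependencies. In an actual Open WebUI pipeline, valves are normally declared as a pydantic model nested in the pipeline class. The two URL defaults come from the Deployment section, every other default is a placeholder, and `clip_search_context` and `txt2img_settings` are hypothetical helpers, not functions from the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Valves:
    """Stand-in for the pipeline's valves; defaults are illustrative only."""

    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 30
    sd_cfg_scale: float = 7.0
    brave_api_key: str = ""
    brave_max_results: int = 5
    use_ai_classifier: bool = True
    show_routing_info: bool = True
    search_context_max_chars: int = 8000


def clip_search_context(context: str, valves: Valves) -> str:
    """Trim fetched search text so it fits the configured character budget."""
    return context[: valves.search_context_max_chars]


def txt2img_settings(valves: Valves) -> dict:
    """Collect the image-related valves into one request-style dict."""
    return {
        "width": valves.sd_width,
        "height": valves.sd_height,
        "steps": valves.sd_steps,
        "cfg_scale": valves.sd_cfg_scale,
    }
```

Keeping every tunable on one object means a change made in the settings UI reaches the search and image code paths without editing the pipeline file.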