CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is an Open WebUI Pipeline (llm_router_v3.py) that acts as an intelligent LLM router. It classifies user prompts and routes them to different Ollama models based on intent, with integrated web search and image generation.

Architecture

Single-file pipeline (llm_router_v3.py) that runs inside Open WebUI's pipelines container. The flow is:

  1. Task detection — Open WebUI internal requests (title/tag generation) bypass routing and go to qwen2.5:7b directly
  2. Vision detection — checks if the latest user message contains an uploaded image
  3. AI classification — qwen2.5:7b classifies prompts into: coding, diagram, reasoning, image_generation, vision, general
  4. Heuristic safety net — keyword/pattern-based overrides can force search=true even if AI said no
  5. Web search — Brave Search API with full page content fetching for top 3 results
  6. Image generation — AUTOMATIC1111/Forge API via Stable Diffusion XL, with LLM-refined prompts
  7. VRAM management — automatically unloads Ollama models before SD generation and unloads SD checkpoint after, plus drops page cache to free RAM
  8. Streaming response — streams model output including thinking/reasoning tokens in collapsible blocks
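
The first four steps of this flow can be sketched as a small planning function. This is a minimal illustration, not the code in llm_router_v3.py: the `Plan` type, the helper names, the `is_internal_task` flag, and the OpenAI-style `image_url` content parts are all assumptions made for the example, and the classifier and heuristic are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    category: str  # "task", "vision", or a classifier category
    search: bool


def latest_user_message(messages: list[dict]) -> dict:
    """Return the most recent message with role 'user', or {} if none."""
    for msg in reversed(messages):
        if msg.get("role") == "user":
            return msg
    return {}


def has_image(message: dict) -> bool:
    """Assumes OpenAI-style content parts: {'type': 'image_url', ...}."""
    content = message.get("content")
    if not isinstance(content, list):
        return False
    return any(isinstance(p, dict) and p.get("type") == "image_url" for p in content)


def text_of(message: dict) -> str:
    """Extract plain text from either a string or a list of content parts."""
    content = message.get("content") or ""
    if isinstance(content, str):
        return content
    return " ".join(
        p.get("text", "") for p in content
        if isinstance(p, dict) and p.get("type") == "text"
    )


def plan_request(
    messages: list[dict],
    is_internal_task: bool,
    classify: Callable[[str], tuple[str, bool]],
    heuristic_wants_search: Callable[[str], bool],
) -> Plan:
    # Step 1: internal title/tag requests bypass routing entirely.
    if is_internal_task:
        return Plan("task", False)
    user_msg = latest_user_message(messages)
    # Step 2: only the latest user message decides vision routing.
    if has_image(user_msg):
        return Plan("vision", False)
    # Step 3: AI classification returns (category, search).
    prompt = text_of(user_msg)
    category, search = classify(prompt)
    # Step 4: the heuristic can turn search on, never off.
    if not search and heuristic_wants_search(prompt):
        search = True
    return Plan(category, search)
```

Note that the heuristic only ever upgrades `search` from false to true, which matches the "safety net" role described in step 4.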

Model Routing

| Category | Model | Notes |
|---|---|---|
| coding | qwen2.5-coder:14b | |
| diagram | qwen2.5-coder:14b | Mermaid output |
| reasoning (FI/EN) | gpt-oss:120b | Finnish detection via keyword scoring |
| image_generation | gpt-oss:120b → SDXL | LLM refines prompt, then calls A1111 API |
| vision | llama3.2-vision:11b | Only when latest user message has image |
| general | gpt-oss:120b | |
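
As a rough sketch, the table above maps onto a lookup dictionary plus a Finnish check for the reasoning category. The model names and the threshold of 2 come from this document; the contents of `FINNISH_INDICATORS` shown here, the choice to count distinct matching words, the fallback to the general model, and the `route` function itself are assumptions for illustration.

```python
import re

# Category -> Ollama model, as listed in the routing table.
ROUTES = {
    "coding": "qwen2.5-coder:14b",
    "diagram": "qwen2.5-coder:14b",
    "reasoning": "gpt-oss:120b",
    "image_generation": "gpt-oss:120b",
    "vision": "llama3.2-vision:11b",
    "general": "gpt-oss:120b",
}

# Hypothetical indicator words; the real list lives in the pipeline file.
FINNISH_INDICATORS = {
    "mikä", "miten", "miksi", "onko", "että", "ja", "ei", "kuinka", "tämä",
}
FINNISH_THRESHOLD = 2  # "threshold >= 2 matches"


def is_finnish(text: str) -> bool:
    """Score the text by how many distinct indicator words it contains."""
    words = set(re.findall(r"\w+", text.lower()))
    return len(words & FINNISH_INDICATORS) >= FINNISH_THRESHOLD


def route(category: str, prompt: str) -> tuple[str, str]:
    """Return (model, language); language only matters for reasoning."""
    # Assumption: unknown categories fall back to the general model.
    model = ROUTES.get(category, ROUTES["general"])
    language = "fi" if category == "reasoning" and is_finnish(prompt) else "en"
    return model, language
```

For example, `route("reasoning", "Miksi taivas on sininen ja miten se toimii?")` selects gpt-oss:120b with the Finnish system prompt, because three indicator words match.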

Key Design Decisions

  • Finnish/English bilingual — Finnish detected by scoring FINNISH_INDICATORS (threshold ≥ 2 matches). Reasoning routes to language-specific system prompts.
  • Search is aggressive — heuristic layer ensures search triggers for questions with named entities, freshness keywords, time-sensitive topics, even if AI classifier says no.
  • Year injection — search queries have wrong years replaced with current year to counter LLM hallucination.
  • Image generation VRAM dance — RTX 2000 Ada 16GB can't hold both gpt-oss:120b and SDXL simultaneously. Pipeline unloads Ollama before SD, unloads SD after, and drops Linux page cache.
  • Chunked image streaming — base64 images are compressed PNG→JPEG and yielded in 4KB chunks to avoid Open WebUI "chunk too big" errors.
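
Two of these decisions, year injection and chunked streaming, are simple enough to sketch as standalone helpers. Both are illustrations under stated assumptions rather than the pipeline's actual code: the function names are hypothetical, treating "wrong years" as years from the last few calendar years (`lookback`) is a guess, and the PNG→JPEG compression step is omitted because it needs an imaging library.

```python
import re
from datetime import date
from typing import Iterator, Optional

CHUNK_SIZE = 4096  # 4KB chunks, as described above
YEAR_RE = re.compile(r"\b(20\d{2})\b")


def inject_current_year(
    query: str, current_year: Optional[int] = None, lookback: int = 3
) -> str:
    """Replace recent-but-stale years in a search query with the current year.

    Assumption: only years within `lookback` years before the current one
    are treated as stale; older years (e.g. 2008) look deliberate and are
    left untouched.
    """
    year = current_year if current_year is not None else date.today().year

    def fix(match: re.Match) -> str:
        found = int(match.group(1))
        if year - lookback <= found < year:
            return str(year)
        return match.group(0)

    return YEAR_RE.sub(fix, query)


def iter_chunks(payload: str, size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield a long string (e.g. a base64 image) in fixed-size pieces."""
    for start in range(0, len(payload), size):
        yield payload[start:start + size]
```

A streaming handler would yield each piece from `iter_chunks` in turn, so no single chunk exceeds the size limit and the concatenation on the client side reproduces the original payload.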

Deployment

  • Open WebUI: Docker container on ai-stack_default network
  • Ollama: Native on host (not Docker), reached via http://ollama:11434 from containers
  • AUTOMATIC1111 Forge: Native on host, systemd service stable-diffusion, reached via http://172.18.0.1:7860 (Docker bridge gateway)
  • Server: Ubuntu 22.04 LTS, NVIDIA RTX 2000 Ada 16GB
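
Given the endpoints above, the VRAM hand-off between Ollama and Stable Diffusion might look roughly like the following. This is a sketch, not the pipeline's code: the helper names are hypothetical, it relies on Ollama's documented behaviour of unloading a model when `/api/generate` is called with `keep_alive` set to 0, and it assumes Forge exposes the same `/sdapi/v1/unload-checkpoint` route as AUTOMATIC1111. Dropping the Linux page cache is not shown.

```python
import json
import urllib.request

OLLAMA_URL = "http://ollama:11434"   # from the deployment notes
SD_URL = "http://172.18.0.1:7860"    # Docker bridge gateway


def ollama_unload_requests(loaded_models, ollama_url=OLLAMA_URL):
    """Build one (url, payload) pair per model that should be evicted.

    keep_alive=0 asks Ollama to unload the model immediately.
    """
    base = ollama_url.rstrip("/")
    return [
        (f"{base}/api/generate", {"model": name, "keep_alive": 0})
        for name in loaded_models
    ]


def sd_unload_request(sd_url=SD_URL):
    """Build the request that asks the SD backend to release its checkpoint."""
    return (f"{sd_url.rstrip('/')}/sdapi/v1/unload-checkpoint", {})


def post_json(url, payload, timeout=30.0):
    """Send a JSON POST and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def free_vram_for_sd(loaded_models):
    """Evict Ollama models before an image generation call."""
    for url, payload in ollama_unload_requests(loaded_models):
        post_json(url, payload)
```

After generation, calling `post_json(*sd_unload_request())` would release the SD checkpoint so the next chat request can load an Ollama model again.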

The pipeline is deployed by copying llm_router_v3.py to ~/ai-stack/pipelines/ on the server and restarting the pipelines container.

Setup Scripts

  • setup-sd.sh — installs AUTOMATIC1111 Forge + downloads SDXL model (Ubuntu 22.04 specific)
  • setup-sd-service.sh — creates systemd service for Forge (run after setup-sd.sh)

Configuration

All runtime settings are exposed as Valves in Open WebUI's pipeline settings UI: ollama_url, sd_url, sd_width/height/steps/cfg_scale, brave_api_key, brave_max_results, use_ai_classifier, show_routing_info, search_context_max_chars
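
To make the shape of these settings concrete, here is a sketch of the valve set as a typed structure. Open WebUI pipelines conventionally declare valves as a pydantic model; a stdlib dataclass is used here only so the example runs without dependencies. The two URLs come from the deployment notes. Every other default is a placeholder assumption, and the four `sd_*` names are an assumed expansion of the shorthand `sd_width/height/steps/cfg_scale`.

```python
from dataclasses import dataclass


@dataclass
class Valves:
    # Service endpoints (values taken from the deployment notes).
    ollama_url: str = "http://ollama:11434"
    sd_url: str = "http://172.18.0.1:7860"

    # Stable Diffusion parameters (placeholder defaults, not the real ones).
    sd_width: int = 1024
    sd_height: int = 1024
    sd_steps: int = 30
    sd_cfg_scale: float = 7.0

    # Brave Search (placeholder defaults).
    brave_api_key: str = ""
    brave_max_results: int = 5
    search_context_max_chars: int = 8000

    # Routing behaviour toggles (placeholder defaults).
    use_ai_classifier: bool = True
    show_routing_info: bool = True


# Example: override only what differs from the defaults.
valves = Valves(brave_api_key="example-key", use_ai_classifier=False)
```

With `use_ai_classifier` turned off, routing would presumably rely on the heuristic layer alone; the document does not spell out that fallback, so treat it as an open question when changing the setting.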