localLLM — Sovereign AI inside the client perimeter
Project-aware RAG, OpenAI-compatible gateway, MCP-first agent surface. The project-intelligence backend behind a hybrid AI engagement — project memory, retrieval, and orchestration stay on-premises; reasoning is supplied by either local LLMs (air-gapped) or the customer's existing cloud account routed through LiteLLM.
Platform Overview
localLLM is a turnkey platform that scans the client’s codebases and documents, builds a self-improving per-project knowledge base, and serves a project-aware assistant through any tool that speaks the OpenAI API. A FastAPI gateway fronts a benchmarked catalog of 54 model families and routes 30 task types to the right brain on demand. A 24-tool plan-then-execute orchestrator with grammar-constrained outputs runs alongside an MCP server exposing 38 tools to external IDE agents (Cline, Aider, Goose, OpenHands, Continue). Deploys air-gapped or hybrid — the perimeter is the customer’s, not ours.
Verified Architecture
Six layers, top-down: inputs → ingestion → per-project knowledge base → FastAPI gateway → agent & integration surfaces → external IDE agents. Verified against the running codebase on 2026-04-29.
Ingestion
Five input streams feed the per-project knowledge base. Every artifact carries provenance back to its origin so retrieval results stay auditable.
Codebase Scanner
Detects skills, frameworks, and languages, then emits a structure graph for every git-tracked source tree. Output lands in skills.json and vectors.db alongside the per-project manifest.
Document Loaders
Ingests PDF, DOCX, CSV, XLSX, Markdown, JSON, and TXT via pymupdf, python-docx, and pandas — deterministic extractors with provenance tracking.
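For illustration, a minimal sketch of what a deterministic loader with provenance can look like, using pymupdf for the PDF path. The entry fields shown are illustrative, not the actual localLLM schema.

```python
import fitz  # pymupdf

def load_pdf(path: str) -> list[dict]:
    """Extract one KB entry per page, each carrying provenance to its origin."""
    entries = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text().strip()
            if not text:
                continue  # skip blank pages deterministically
            entries.append({
                "text": text,
                # Provenance: enough to trace the entry back to file and page.
                "source": {"file": path, "page": page.number + 1},
            })
    return entries
```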
Web Researcher
SSE-streamed, per-skill web research via httpx + trafilatura + LLM summarisation. Fills knowledge gaps without operator hand-holding; results are cached.
URL Ingestion
Single-page fetch into KB entries for ad-hoc references — documentation, RFCs, blog posts, vendor advisories.
Signed Webhooks
HMAC-signed inbound triggers drive ingestion from existing CI. Six lifecycle actions: scan, learn, improve, project_chat, agent_run, pipeline.
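A hedged sketch of the verification pattern such an endpoint implies: HMAC-SHA256 over the raw request body, compared in constant time. The header name, secret source, and route shape are assumptions; only the six action names come from above.

```python
import hashlib
import hmac
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
SECRET = os.environ["WEBHOOK_SECRET"].encode()  # assumed env var name
ACTIONS = {"scan", "learn", "improve", "project_chat", "agent_run", "pipeline"}

@app.post("/webhooks/{action}")
async def webhook(action: str, request: Request):
    if action not in ACTIONS:
        raise HTTPException(404)
    body = await request.body()
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    sent = request.headers.get("X-Signature", "")  # assumed header name
    if not hmac.compare_digest(expected, sent):    # constant-time comparison
        raise HTTPException(401, "bad signature")
    # Signature valid: enqueue the lifecycle action.
    return {"accepted": action}
```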
Per-Project Knowledge Base
Each project gets its own knowledge base with hybrid retrieval and five persistent memory layers that survive across sessions.
Hybrid Retrieval
SQLite FTS5 keyword search plus SQLite-backed cosine vector embeddings plus FlashRank cross-encoder reranking, fused via reciprocal rank fusion (RRF), diversified via maximal marginal relevance (MMR), and expanded one hop across entry-citation graphs. Citations carry entry IDs end-to-end.
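RRF itself is small enough to show inline. A minimal sketch, independent of the actual localLLM implementation; k = 60 is the conventional constant, not a value from the source.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists of entry IDs into one ranking."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, entry_id in enumerate(results, start=1):
            scores[entry_id] = scores.get(entry_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Entries ranked high by both keyword and vector search float to the top.
fused = rrf([["e3", "e1", "e7"], ["e1", "e9", "e3"]])
print(fused)  # ['e1', 'e3', 'e9', 'e7']
```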
Five Memory Layers
- Hermes — long-term agent memory
- ReasoningBank — distilled successful trajectories
- __global__ — cross-project skills KB
- USER.md — per-user profile memory
- Skill rules — per-skill rule extraction
Per-Project Layout
Each project ships as a self-contained set of files: manifest.json, entries.json, graph.json, vectors.db, hermes_memory.json, reasoning_bank.json. Portable and inspectable.
Single-Flight Context Cache
Mtime-keyed cache for the assembled context bundle — concurrent requests for the same project state collapse into one retrieval pass.
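A minimal sketch of the pattern, assuming the cache keys on the mtime of the project's entries.json; the helper name and the stub body are illustrative.

```python
import asyncio
import os

_cache: dict[tuple[str, float], str] = {}
_locks: dict[str, asyncio.Lock] = {}

async def assemble_context(project_dir: str) -> str:
    # Stand-in for the real retrieval + memory assembly pass.
    await asyncio.sleep(0.1)
    return f"context bundle for {project_dir}"

async def get_context(project_dir: str) -> str:
    # Key on the KB file's mtime: any ingestion bumps it and invalidates the cache.
    key = (project_dir, os.path.getmtime(os.path.join(project_dir, "entries.json")))
    lock = _locks.setdefault(project_dir, asyncio.Lock())
    async with lock:  # single flight: concurrent callers collapse into one pass
        if key not in _cache:
            _cache[key] = await assemble_context(project_dir)
        return _cache[key]
```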
FastAPI Gateway
A single FastAPI service exposes an OpenAI-compatible /v1/chat/completions endpoint. Every external tool that speaks the OpenAI API gets project intelligence through a single header.
3-Layer RAG Injection
An X-Project-ID header triggers automatic injection of (1) project overview, (2) KB entries, and (3) Hermes memories before the LLM call. The caller never assembles a prompt manually.
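In outline, the injection step looks like the sketch below. The three layers and the header come from this page; the retrieval helpers are stubbed for illustration.

```python
# Stubs standing in for the real overview, KB retrieval, and memory recall.
def load_overview(pid: str) -> str: return f"# Project {pid} overview ..."
def retrieve_kb(pid: str, messages: list[dict]) -> str: return "## Relevant KB entries ..."
def recall_hermes(pid: str) -> str: return "## Hermes memories ..."

def inject(messages: list[dict], project_id: str) -> list[dict]:
    """Prepend the three context layers as a system message before the LLM call."""
    context = "\n\n".join([
        load_overview(project_id),          # (1) project overview
        retrieve_kb(project_id, messages),  # (2) KB entries
        recall_hermes(project_id),          # (3) Hermes memories
    ])
    return [{"role": "system", "content": context}, *messages]

messages = inject([{"role": "user", "content": "How does auth work here?"}], "acme-api")
```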
30 Task Types → Right Brain
Routes via TASK_MODEL_MAP across the active model families with size + memory + health guards. The right brain is matched to each task on demand — from code_review to iso_standards to function_calling.
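A hedged sketch of what map-based routing with guards can look like. Only the TASK_MODEL_MAP name and the task names come from this page; the map contents, model strings, and guard logic are illustrative.

```python
TASK_MODEL_MAP = {
    "code_review":      ["qwen2.5-coder:14b", "llama3.1:8b"],
    "iso_standards":    ["llama3.1:8b"],
    "function_calling": ["qwen2.5:14b", "llama3.1:8b"],
    "default":          ["llama3.1:8b"],
}

def route(task: str, healthy: set[str], free_gb: float, size_gb: dict[str, float]) -> str:
    # Walk candidates in preference order; apply health and memory guards.
    for model in TASK_MODEL_MAP.get(task, TASK_MODEL_MAP["default"]):
        if model in healthy and size_gb[model] <= free_gb:
            return model
    raise RuntimeError(f"no healthy model fits task {task!r}")
```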
LiteLLM Provider
One provider abstraction over Ollama (default), llama.cpp, vLLM, OpenAI, Anthropic, and Gemini. Typed-fallback dispatch keeps the surface stable when an upstream is unhealthy.
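A minimal sketch of typed-fallback dispatch over LiteLLM's unified completion() call, assuming the dispatcher walks a local-first candidate list and catches LiteLLM's typed exceptions. Model strings and the loop itself are illustrative, not the actual dispatch code.

```python
import litellm

CANDIDATES = ["ollama/qwen2.5:14b", "ollama/llama3.1:8b"]  # local-first order

def complete(messages: list[dict]) -> str:
    last_err = None
    for model in CANDIDATES:
        try:
            resp = litellm.completion(model=model, messages=messages)
            return resp.choices[0].message.content
        except (litellm.APIConnectionError, litellm.Timeout) as err:
            last_err = err  # typed failure: fall through to the next upstream
    raise RuntimeError("all upstreams unhealthy") from last_err
```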
Adaptive Health & Watchdog
Adaptive health checks (30 s when unhealthy, 120 s when all healthy). An SSE disconnect watchdog releases upstream model resources within ~200 ms of client drop — no orphaned VRAM.
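A hedged sketch of the watchdog pattern using Starlette's request.is_disconnected(); the release hook is an assumption standing in for whatever frees the upstream model slot.

```python
from fastapi import Request
from fastapi.responses import StreamingResponse

async def sse(request: Request, upstream, release) -> StreamingResponse:
    """Stream upstream chunks as SSE; free resources the moment the client drops."""
    async def gen():
        try:
            async for chunk in upstream:
                if await request.is_disconnected():
                    break            # client dropped mid-stream
                yield f"data: {chunk}\n\n"
        finally:
            release()                # release upstream model resources promptly
    return StreamingResponse(gen(), media_type="text/event-stream")
```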
Agent & Integration Surfaces
Four peer surfaces ride the gateway: an agentic orchestrator, an MCP server, a VS Code extension, and a Pipelines DSL with eval and predictions tooling.
Agentic Orchestrator
24 tools · plan-then-execute with grammar-constrained emit_plan (Ollama 0.5+ JSON-schema decoding) and a ReAct fallback path. Every destructive action goes through an approval gate with a full audit log.
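A minimal sketch of grammar-constrained planning against Ollama's structured-output API (the format parameter accepts a JSON schema as of Ollama 0.5); the plan schema and model name are illustrative.

```python
import json

import ollama

# Illustrative plan schema: an ordered list of tool calls.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "tool": {"type": "string"},
                    "args": {"type": "object"},
                },
                "required": ["tool", "args"],
            },
        },
    },
    "required": ["steps"],
}

resp = ollama.chat(
    model="qwen2.5:14b",  # example model, not the configured default
    messages=[{"role": "user", "content": "Plan: add a health endpoint."}],
    format=PLAN_SCHEMA,   # decoding is constrained to this schema
)
plan = json.loads(resp["message"]["content"])  # guaranteed to parse
```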
MCP Server
38 tools over stdio — 19 foundational query verbs plus 19 lifecycle verbs (scan, learn, improve, evaluate, plan, …). Every tool call is audited. Drives Cline, Aider, Goose, OpenHands, and Continue.
VS Code Extension
Bundled extension adds Ask About File, Explain Selection, and inline edit suggestions — project-scoped per workspace.
Pipelines & Peers
Pipelines DSL (sequential + fanout steps with verbatim and verify_build modes), signed webhooks, Compare and Predictions surfaces, and an evaluation harness for regression-grading project runs.
External IDE Agents
Any OpenAI-compatible client — Cline, Aider, Goose, OpenHands, Continue, Flowise — reaches the gateway via /v1 + the X-Project-ID header, calls MCP tools over stdio, and uses cloud Claude or GPT for reasoning under tool constraints when the engagement runs in hybrid mode.
localLLM stays the project-intelligence backend; the IDE agent is the brain. The gateway is OpenAI-compatible HTTP — there is no agent-detection logic and no rejection list. Any client that can hit a custom base URL and add a header gets the full RAG-injected experience.
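Concretely, pointing a stock OpenAI client at the gateway looks like this; host, port, project ID, and the model value are examples, not fixed values.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",           # the localLLM gateway
    api_key="not-used-locally",                    # required by the client, unused on-prem
    default_headers={"X-Project-ID": "acme-api"},  # triggers 3-layer RAG injection
)
resp = client.chat.completions.create(
    model="auto",  # example value: let the gateway route by task type
    messages=[{"role": "user", "content": "Where is rate limiting enforced?"}],
)
print(resp.choices[0].message.content)
```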
Air-Gapped or Hybrid
A single project KB — two doors. The architecture is identical; only the brain changes.
Air-Gapped
100% local LLMs · zero network egress
Local recall, local memory, local reasoning. No data leaves the client perimeter. Runs on commodity hardware (M-series Mac, RTX desktop, on-prem GPU server). The Pipelines DSL, MCP server, and orchestrator behave identically — only the model backend changes.
Hybrid (Stance B)
Local recall + cloud LLM reasoning
Local 8B–30B models handle recall well (KB, memory, structure, rules); cloud Claude or GPT handles reasoning under tool constraints reliably. Hybrid assigns each kind of work to the brain that does it best. The cloud LLM sees only the RAG-injected payload of each call and individual MCP tool responses, never the raw archive.
Local Drawer (Direct Mode)
Project Q&A in ~3 s warm, 100% on-prem. Bypasses the agent layer entirely — pure retrieval + completion against the active model family.
External Agent + Cloud LLM
Code edits via Cline; localLLM provides project context via MCP tools and RAG injection. The cloud account is the customer’s — we don’t broker tokens.
Verified Against Current Code
Every number on this page is grounded in the running source tree, verified on 2026-04-29.
| Surface | Count | Source |
|---|---|---|
| Model families (catalog) | 54 | config/catalog.yaml |
| Active families on M-series | 5 | config/models.macos.yaml |
| Routed task types | 30 | skills/project_orchestrator.py · TASK_MODEL_MAP |
| Agentic orchestrator tools | 24 | tools/registry.py |
| MCP tools (foundational + lifecycle) | 38 (19 + 19) | mcp_server.py · mcp_server_lifecycle.py |
| Memory layers | 5 | skills/knowledge_base.py |
| Webhook lifecycle actions | 6 | webhooks/router.py |
Best fit and known limitations
Best for
Regulated organisations that must keep prompts, code, and context inside the perimeter, want a project-aware RAG knowledge base, and prefer an OpenAI-compatible gateway so existing IDE assistants and pipelines work unchanged.
Not the right fit
Teams happy with cloud LLMs and short-lived prompts; lightweight chatbot use without code or document context; environments that cannot accommodate the modest GPU/CPU footprint local inference requires.
Known limitations
Answer quality is bounded by the local model family chosen for each task type; ingestion of very large monorepos requires tuning and storage planning; first-time setup includes infrastructure decisions (GPU, storage, VS Code rollout).
Discuss a similar engagement
Air-gapped sovereign AI, hybrid RAG-injected gateway, or a custom agent surface for an existing tool stack — we deliver the project-intelligence backend; you keep the perimeter.