Custom AI Solution: Sovereign Local AI with Self-Improving RAG, Agentic Orchestration, and an MCP Server Implementation

Platform Overview

localLLM is a turnkey platform that scans the client’s codebases and documents, builds a self-improving per-project knowledge base, and serves a project-aware assistant through any tool that speaks the OpenAI API. A FastAPI gateway fronts a benchmarked catalog of 54 model families and routes 30 task types to the right brain on demand. A 24-tool plan-then-execute orchestrator with grammar-constrained outputs runs alongside an MCP server exposing 38 tools to external IDE agents (Cline, Aider, Goose, OpenHands, Continue). Deploys air-gapped or hybrid — the perimeter is the customer’s, not ours.

54 model families · 30 routed task types · 38 MCP tools

Verified Architecture

Six layers, top-down: inputs → ingestion → per-project knowledge base → FastAPI gateway → agent & integration surfaces → external IDE agents. Verified against the running codebase on 2026-04-29.

[Architecture diagram: inputs (codebases, documents, web research, URL ingestion, webhooks) flow through scanners and loaders into a per-project knowledge base; a FastAPI gateway with an OpenAI-compatible /v1/chat/completions endpoint, 3-layer RAG injection, 30 routed task types, and a 54-family catalog serves the agentic orchestrator (24 tools), the MCP server (38 tools), the VS Code extension, the pipeline DSL, and external IDE agents (Cline, Aider, Goose, OpenHands, Continue, Flowise).]
localLLM, verified architecture, 2026-04-29 (Mermaid source available).

Ingestion

Five input streams feed the per-project knowledge base. Every artifact carries provenance back to its origin so retrieval results stay auditable.

Codebase Scanner

Detects skills, frameworks, and languages and emits a structure graph for every git-tracked source tree. Output lands in skills.json and vectors.db alongside the per-project manifest.
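
To make the scanner's output concrete, here is a minimal sketch, assuming a simple extension-based language heuristic. The EXT_TO_LANG table and scan helper are illustrative stand-ins, not the shipped detector.

```python
# Minimal sketch of the scanner's shape, not the shipped implementation:
# walk a source tree, tally languages by extension, and emit a structure
# graph keyed by directory. The heuristics here are illustrative only.
import json
import pathlib
from collections import Counter

EXT_TO_LANG = {".py": "python", ".ts": "typescript", ".go": "go"}  # illustrative subset

def scan(root: str) -> dict:
    root_path = pathlib.Path(root)
    languages: Counter = Counter()
    graph: dict[str, list[str]] = {}
    for path in root_path.rglob("*"):
        if path.is_file() and path.suffix in EXT_TO_LANG:
            languages[EXT_TO_LANG[path.suffix]] += 1
            graph.setdefault(str(path.parent.relative_to(root_path)), []).append(path.name)
    return {"languages": dict(languages), "structure": graph}

if __name__ == "__main__":
    pathlib.Path("skills.json").write_text(json.dumps(scan("."), indent=2))
```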

Document Loaders

Ingests PDF, DOCX, CSV, XLSX, Markdown, JSON, and TXT via pymupdf, python-docx, and pandas — deterministic extractors with provenance tracking.
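
A hedged sketch of what "deterministic extraction with provenance" looks like, using two of the stated libraries (pymupdf, pandas). The load dispatcher and the entry shape are illustrative, not the platform's actual loader API.

```python
# Each extracted entry carries provenance (source path, page or row)
# back to its origin, so retrieval results stay auditable.
import pathlib
import fitz          # pymupdf
import pandas as pd

def load(path: str) -> list[dict]:
    p = pathlib.Path(path)
    if p.suffix == ".pdf":
        doc = fitz.open(p)
        # one KB entry per page, each pointing back at its origin
        return [{"text": page.get_text(), "source": str(p), "page": i + 1}
                for i, page in enumerate(doc)]
    if p.suffix == ".csv":
        df = pd.read_csv(p)
        return [{"text": row.to_json(), "source": str(p), "row": int(i)}
                for i, row in df.iterrows()]
    return [{"text": p.read_text(), "source": str(p)}]
```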

Web Researcher

SSE-streamed, per-skill web research via httpx + trafilatura + LLM summarisation. Fills knowledge gaps without operator hand-holding; results cached.
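
The fetch-and-extract half of that loop is small; a minimal sketch using the stated libraries (httpx, trafilatura) follows. The summarisation call and cache layer are elided, and research_url is an illustrative name.

```python
import httpx
import trafilatura

def research_url(url: str) -> str | None:
    resp = httpx.get(url, follow_redirects=True, timeout=30.0)
    resp.raise_for_status()
    # trafilatura strips boilerplate and returns the main article text
    return trafilatura.extract(resp.text)
```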

URL Ingestion

Single-page fetch into KB entries for ad-hoc references — documentation, RFCs, blog posts, vendor advisories.

Signed Webhooks

HMAC-signed inbound triggers drive ingestion from existing CI. Six lifecycle actions: scan, learn, improve, project_chat, agent_run, pipeline.
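
A sketch of the verification side, assuming a hex-encoded SHA-256 signature in a request header; the header name, secret source, and route shape are assumptions, not the platform's documented contract.

```python
import hashlib
import hmac
import os
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
SECRET = os.environ["WEBHOOK_SECRET"].encode()

@app.post("/webhooks/{action}")
async def webhook(action: str, request: Request):
    body = await request.body()
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    provided = request.headers.get("X-Signature", "")
    # constant-time comparison prevents timing attacks on the signature
    if not hmac.compare_digest(expected, provided):
        raise HTTPException(status_code=401, detail="bad signature")
    return {"accepted": action}  # dispatch to scan/learn/improve/... elided
```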

Per-Project Knowledge Base

Each project gets its own knowledge base with hybrid retrieval and five persistent memory layers that survive across sessions.

Per-Project Layout

Each project ships as a self-contained set of files: manifest.json, entries.json, graph.json, vectors.db, hermes_memory.json, reasoning_bank.json. Portable and inspectable.

Single-Flight Context Cache

Mtime-keyed cache for the assembled context bundle — concurrent requests for the same project state collapse into one retrieval pass.
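
The single-flight idea in miniature: the first caller starts the retrieval pass and later callers await the same in-flight task. The mtime key and build_context builder below are illustrative stand-ins.

```python
import asyncio
import os

_cache: dict[tuple[str, float], asyncio.Task] = {}

async def get_context(project_dir: str) -> str:
    key = (project_dir, os.path.getmtime(project_dir))  # invalidates when the project changes
    task = _cache.get(key)
    if task is None:
        # first caller starts the retrieval; concurrent callers share the task
        task = asyncio.create_task(build_context(project_dir))
        _cache[key] = task
    return await task

async def build_context(project_dir: str) -> str:
    # assemble overview + KB entries + memories; placeholder body
    return f"context bundle for {project_dir}"
```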

FastAPI Gateway

A single FastAPI service exposes an OpenAI-compatible /v1/chat/completions endpoint. Every external tool that speaks the OpenAI API gets project intelligence through a single header.

3-Layer RAG Injection

An X-Project-ID header triggers automatic injection of (1) project overview, (2) KB entries, and (3) Hermes memories before the LLM call. The caller never assembles a prompt manually.
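
From the caller's side this is just the standard OpenAI SDK pointed at the gateway. The base URL, API key, project ID, and model values below are placeholders for a real deployment; the X-Project-ID header is as described above.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",        # the FastAPI gateway
    api_key="unused-locally",                   # placeholder; local auth may differ
    default_headers={"X-Project-ID": "acme-billing"},
)

resp = client.chat.completions.create(
    model="auto",  # placeholder; the gateway routes by task type
    messages=[{"role": "user", "content": "Where is the retry logic implemented?"}],
)
print(resp.choices[0].message.content)
```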

30 Task Types → Right Brain

Routes via TASK_MODEL_MAP across the active model families with size + memory + health guards. The right brain is matched to each task on demand — from code_review to iso_standards to function_calling.
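
An illustrative shape for that routing, not the real TASK_MODEL_MAP: map each task type to an ordered preference list, then take the first family that passes the guards. Task and family names here are examples only.

```python
TASK_MODEL_MAP = {
    "code_review": ["qwen2.5-coder", "llama3.1"],
    "function_calling": ["llama3.1", "qwen2.5-coder"],
}

def route(task: str, healthy: set[str]) -> str:
    for family in TASK_MODEL_MAP.get(task, []):
        if family in healthy:   # health guard; size/memory guards elided
            return family
    raise LookupError(f"no healthy family for task {task!r}")
```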

LiteLLM Provider

One provider abstraction over Ollama (default), llama.cpp, vLLM, OpenAI, Anthropic, and Gemini. Typed-fallback dispatch keeps the surface stable when an upstream is unhealthy.

Adaptive Health & Watchdog

Adaptive health checks (30 s when unhealthy, 120 s when all healthy). An SSE disconnect watchdog releases upstream model resources within ~200 ms of client drop — no orphaned VRAM.
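
The adaptive cadence is a small loop: poll every 30 s while any upstream is unhealthy, back off to 120 s once all are healthy. The probe helper below is a stand-in for the real per-upstream check.

```python
import asyncio

async def health_loop(upstreams: list[str]):
    while True:
        results = {u: await probe(u) for u in upstreams}
        all_healthy = all(results.values())
        await asyncio.sleep(120 if all_healthy else 30)

async def probe(upstream: str) -> bool:
    return True  # placeholder; a real probe would hit the upstream's health endpoint
```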

Agent & Integration Surfaces

Four peer surfaces ride the gateway: an agentic orchestrator, an MCP server, a VS Code extension, and a Pipelines DSL with eval and predictions tooling.

VS Code Extension

Bundled extension adds Ask About File, Explain Selection, and inline edit suggestions — project-scoped per workspace.

Pipelines & Peers

Pipelines DSL (sequential + fanout steps with verbatim and verify_build modes), signed webhooks, Compare and Predictions surfaces, and an evaluation harness for regression-grading project runs.
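
To give the DSL's shape some texture, here is a hypothetical pipeline with one sequential step and one fanout step. The field names and task names are illustrative, not the DSL's actual schema; the verbatim and verify_build modes are as described above.

```python
pipeline = {
    "name": "release-notes",
    "steps": [
        {"task": "summarize_diff", "mode": "verbatim"},
        {"fanout": [
            {"task": "code_review", "mode": "verify_build"},
            {"task": "doc_update"},
        ]},
    ],
}
```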

External IDE Agents

Any OpenAI-compatible client — Cline, Aider, Goose, OpenHands, Continue, Flowise — reaches the gateway via /v1 + the X-Project-ID header, calls MCP tools over stdio, and uses cloud Claude or GPT for reasoning under tool constraints when the engagement runs in hybrid mode.

localLLM stays the project-intelligence backend; the IDE agent is the brain. The gateway is OpenAI-compatible HTTP — there is no agent-detection logic and no rejection list. Any client that can hit a custom base URL and add a header gets the full RAG-injected experience.

Air-Gapped or Hybrid

A single project KB — two doors. The architecture is identical; only the brain changes.

Local Drawer (Direct Mode)

Project Q&A in ~3 s warm, 100% on-prem. Bypasses the agent layer entirely — pure retrieval + completion against the active model family.

External Agent + Cloud LLM

Code edits via Cline; localLLM provides project context via MCP tools and RAG injection. The cloud account is the customer’s — we don’t broker tokens.

Verified Against Current Code

Every number on this page is grounded in the running source tree, verified on 2026-04-29.

| Surface | Count | Source |
|---|---|---|
| Model families (catalog) | 54 | config/catalog.yaml |
| Active families on M-series | 5 | config/models.macos.yaml |
| Routed task types | 30 | skills/project_orchestrator.py · TASK_MODEL_MAP |
| Agentic orchestrator tools | 24 | tools/registry.py |
| MCP tools (foundational + lifecycle) | 38 (19 + 19) | mcp_server.py · mcp_server_lifecycle.py |
| Memory layers | 5 | skills/knowledge_base.py |
| Webhook lifecycle actions | 6 | webhooks/router.py |

Best fit and known limitations

Best for

Regulated organisations that must keep prompts, code, and context inside the perimeter, want a project-aware RAG knowledge base, and prefer an OpenAI-compatible gateway so existing IDE assistants and pipelines work unchanged.

Not the right fit

Teams happy with cloud LLMs and short-lived prompts; lightweight chatbot use without code or document context; environments that cannot host the modest GPU/CPU footprint required for local inference.

Known limitations

Answer quality is bounded by the local model family chosen for each task type; ingestion of very large monorepos requires tuning and storage planning; first-time setup includes infrastructure decisions (GPU, storage, VS Code rollout).

Discuss a similar engagement

Air-gapped sovereign AI, hybrid RAG-injected gateway, or a custom agent surface for an existing tool stack — we deliver the project-intelligence backend; you keep the perimeter.