localLLM — Sovereign AI inside the client perimeter
Project-aware RAG, OpenAI-compatible gateway, MCP-first agent surface. The project-intelligence backend behind a hybrid AI engagement — project memory, retrieval, and orchestration stay on-premises; reasoning is supplied by either local LLMs (air-gapped) or the customer's existing cloud account routed through LiteLLM.
Platform Overview
localLLM is a turnkey platform that scans the client’s codebases and documents, builds a self-improving per-project knowledge base, and serves a project-aware assistant through any tool that speaks the OpenAI API. A FastAPI gateway fronts a benchmarked catalog of 54 model families and routes 30 task types to the right brain on demand. A 24-tool plan-then-execute orchestrator with grammar-constrained outputs runs alongside an MCP server exposing 38 tools to external IDE agents (Cline, Aider, Goose, OpenHands, Continue). Deploys air-gapped or hybrid — the perimeter is the customer’s, not ours.
Verified Architecture
Six layers, top-down: inputs → ingestion → per-project knowledge base → FastAPI gateway → agent & integration surfaces → external IDE agents. Verified against the running codebase on 2026-04-29.
Ingestion
Five input streams feed the per-project knowledge base. Every artifact carries provenance back to its origin so retrieval results stay auditable.
Codebase Scanner
Detects skills, frameworks, and languages, then emits a structure graph for every git-tracked source tree. Output lands in skills.json and vectors.db alongside the per-project manifest.
Document Loaders
Ingests PDF, DOCX, CSV, XLSX, Markdown, JSON, and TXT via pymupdf, python-docx, and pandas — deterministic extractors with provenance tracking.
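For illustration, a minimal sketch of what a deterministic loader with provenance can look like, using pymupdf for the PDF path. The entry fields shown are illustrative, not the actual localLLM schema.

```python
import fitz  # pymupdf

def load_pdf(path: str) -> list[dict]:
    """Extract one KB entry per page, each carrying provenance to its origin."""
    entries = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text().strip()
            if not text:
                continue  # skip blank pages deterministically
            entries.append({
                "text": text,
                # Provenance: enough to trace the entry back to file and page.
                "source": {"file": path, "page": page.number + 1},
            })
    return entries
```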
Web Researcher
SSE-streamed, per-skill web research via httpx + trafilatura + LLM summarisation. Fills knowledge gaps without operator hand-holding; results are cached.
URL Ingestion
Single-page fetch into KB entries for ad-hoc references — documentation, RFCs, blog posts, vendor advisories.
Signed Webhooks
HMAC-signed inbound triggers drive ingestion from existing CI. Six lifecycle actions: scan, learn, improve, project_chat, agent_run, pipeline.
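A hedged sketch of the verification pattern such an endpoint implies: HMAC-SHA256 over the raw request body, compared in constant time. The header name, secret source, and route shape are assumptions; only the six action names come from above.

```python
import hashlib
import hmac
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
SECRET = os.environ["WEBHOOK_SECRET"].encode()  # assumed env var name
ACTIONS = {"scan", "learn", "improve", "project_chat", "agent_run", "pipeline"}

@app.post("/webhooks/{action}")
async def webhook(action: str, request: Request):
    if action not in ACTIONS:
        raise HTTPException(404)
    body = await request.body()
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    sent = request.headers.get("X-Signature", "")  # assumed header name
    if not hmac.compare_digest(expected, sent):    # constant-time comparison
        raise HTTPException(401, "bad signature")
    # Signature valid: enqueue the lifecycle action.
    return {"accepted": action}
```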
Per-Project Knowledge Base
Each project gets its own knowledge base with hybrid retrieval and five persistent memory layers that survive across sessions.
Hybrid Retrieval
SQLite FTS5 keyword search plus SQLite-backed cosine vector embeddings plus FlashRank cross-encoder reranking, fused via reciprocal rank fusion (RRF), diversified via maximal marginal relevance (MMR), and expanded one hop across entry-citation graphs. Citations carry entry IDs end-to-end.
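RRF itself is small enough to show inline. A minimal sketch, independent of the actual localLLM implementation; k = 60 is the conventional constant, not a value from the source.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists of entry IDs into one ranking."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, entry_id in enumerate(results, start=1):
            scores[entry_id] = scores.get(entry_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Entries ranked high by both keyword and vector search float to the top.
fused = rrf([["e3", "e1", "e7"], ["e1", "e9", "e3"]])
print(fused)  # ['e1', 'e3', 'e9', 'e7']
```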
Five Memory Layers
- Hermes — long-term agent memory
- ReasoningBank — distilled successful trajectories
- __global__ — cross-project skills KB
- USER.md — per-user profile memory
- Skill rules — per-skill rule extraction
Per-Project Layout
Each project ships as a self-contained set of files: manifest.json, entries.json, graph.json, vectors.db, hermes_memory.json, reasoning_bank.json. Portable and inspectable.
Single-Flight Context Cache
Mtime-keyed cache for the assembled context bundle — concurrent requests for the same project state collapse into one retrieval pass.
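A minimal sketch of the pattern, assuming the cache keys on the mtime of the project's entries.json; the helper name and the stub body are illustrative.

```python
import asyncio
import os

_cache: dict[tuple[str, float], str] = {}
_locks: dict[str, asyncio.Lock] = {}

async def assemble_context(project_dir: str) -> str:
    # Stand-in for the real retrieval + memory assembly pass.
    await asyncio.sleep(0.1)
    return f"context bundle for {project_dir}"

async def get_context(project_dir: str) -> str:
    # Key on the KB file's mtime: any ingestion bumps it and invalidates the cache.
    key = (project_dir, os.path.getmtime(os.path.join(project_dir, "entries.json")))
    lock = _locks.setdefault(project_dir, asyncio.Lock())
    async with lock:  # single flight: concurrent callers collapse into one pass
        if key not in _cache:
            _cache[key] = await assemble_context(project_dir)
        return _cache[key]
```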
FastAPI Gateway
A single FastAPI service exposes an OpenAI-compatible /v1/chat/completions endpoint. Every external tool that speaks the OpenAI API gets project intelligence through a single header.
3-Layer RAG Injection
An X-Project-ID header triggers automatic injection of (1) project overview, (2) KB entries, and (3) Hermes memories before the LLM call. The caller never assembles a prompt manually.
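In outline, the injection step looks like the sketch below. The three layers and the header come from this page; the retrieval helpers are stubbed for illustration.

```python
# Stubs standing in for the real overview, KB retrieval, and memory recall.
def load_overview(pid: str) -> str: return f"# Project {pid} overview ..."
def retrieve_kb(pid: str, messages: list[dict]) -> str: return "## Relevant KB entries ..."
def recall_hermes(pid: str) -> str: return "## Hermes memories ..."

def inject(messages: list[dict], project_id: str) -> list[dict]:
    """Prepend the three context layers as a system message before the LLM call."""
    context = "\n\n".join([
        load_overview(project_id),          # (1) project overview
        retrieve_kb(project_id, messages),  # (2) KB entries
        recall_hermes(project_id),          # (3) Hermes memories
    ])
    return [{"role": "system", "content": context}, *messages]

messages = inject([{"role": "user", "content": "How does auth work here?"}], "acme-api")
```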
30 Task Types → Right Brain
Routes via TASK_MODEL_MAP across the active model families with size + memory + health guards. The right brain is matched to each task on demand — from code_review to iso_standards to function_calling.
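A hedged sketch of what map-based routing with guards can look like. Only the TASK_MODEL_MAP name and the task names come from this page; the map contents, model strings, and guard logic are illustrative.

```python
TASK_MODEL_MAP = {
    "code_review":      ["qwen2.5-coder:14b", "llama3.1:8b"],
    "iso_standards":    ["llama3.1:8b"],
    "function_calling": ["qwen2.5:14b", "llama3.1:8b"],
    "default":          ["llama3.1:8b"],
}

def route(task: str, healthy: set[str], free_gb: float, size_gb: dict[str, float]) -> str:
    # Walk candidates in preference order; apply health and memory guards.
    for model in TASK_MODEL_MAP.get(task, TASK_MODEL_MAP["default"]):
        if model in healthy and size_gb[model] <= free_gb:
            return model
    raise RuntimeError(f"no healthy model fits task {task!r}")
```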
LiteLLM Provider
One provider abstraction over Ollama (default), llama.cpp, vLLM, OpenAI, Anthropic, and Gemini. Typed-fallback dispatch keeps the surface stable when an upstream is unhealthy.
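A minimal sketch of typed-fallback dispatch over LiteLLM's unified completion() call, assuming the dispatcher walks a local-first candidate list and catches LiteLLM's typed exceptions. Model strings and the loop itself are illustrative, not the actual dispatch code.

```python
import litellm

CANDIDATES = ["ollama/qwen2.5:14b", "ollama/llama3.1:8b"]  # local-first order

def complete(messages: list[dict]) -> str:
    last_err = None
    for model in CANDIDATES:
        try:
            resp = litellm.completion(model=model, messages=messages)
            return resp.choices[0].message.content
        except (litellm.APIConnectionError, litellm.Timeout) as err:
            last_err = err  # typed failure: fall through to the next upstream
    raise RuntimeError("all upstreams unhealthy") from last_err
```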
Adaptive Health & Watchdog
Adaptive health checks (30 s when unhealthy, 120 s when all healthy). An SSE disconnect watchdog releases upstream model resources within ~200 ms of client drop — no orphaned VRAM.
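A hedged sketch of the watchdog pattern using Starlette's request.is_disconnected(); the release hook is an assumption standing in for whatever frees the upstream model slot.

```python
from fastapi import Request
from fastapi.responses import StreamingResponse

async def sse(request: Request, upstream, release) -> StreamingResponse:
    """Stream upstream chunks as SSE; free resources the moment the client drops."""
    async def gen():
        try:
            async for chunk in upstream:
                if await request.is_disconnected():
                    break            # client dropped mid-stream
                yield f"data: {chunk}\n\n"
        finally:
            release()                # release upstream model resources promptly
    return StreamingResponse(gen(), media_type="text/event-stream")
```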
Agent & Integration Surfaces
Four peer surfaces ride the gateway: an agentic orchestrator, an MCP server, a VS Code extension, and a Pipelines DSL with eval and predictions tooling.
Agentic Orchestrator
24 tools · plan-then-execute with grammar-constrained emit_plan (Ollama 0.5+ JSON-schema decoding) and a ReAct fallback path. Every destructive action goes through an approval gate with a full audit log.
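A minimal sketch of grammar-constrained planning against Ollama's structured-output API (the format parameter accepts a JSON schema as of Ollama 0.5); the plan schema and model name are illustrative.

```python
import json

import ollama

# Illustrative plan schema: an ordered list of tool calls.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "tool": {"type": "string"},
                    "args": {"type": "object"},
                },
                "required": ["tool", "args"],
            },
        },
    },
    "required": ["steps"],
}

resp = ollama.chat(
    model="qwen2.5:14b",  # example model, not the configured default
    messages=[{"role": "user", "content": "Plan: add a health endpoint."}],
    format=PLAN_SCHEMA,   # decoding is constrained to this schema
)
plan = json.loads(resp["message"]["content"])  # guaranteed to parse
```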
MCP Server
38 tools over stdio — 19 foundational query verbs plus 19 lifecycle verbs (scan, learn, improve, evaluate, plan, …). Every tool call is audited. Drives Cline, Aider, Goose, OpenHands, and Continue.
VS Code Extension
Bundled extension adds Ask About File, Explain Selection, and inline edit suggestions — project-scoped per workspace.
Pipelines & Peers
Pipelines DSL (sequential + fanout steps with verbatim and verify_build modes), signed webhooks, Compare and Predictions surfaces, and an evaluation harness for regression-grading project runs.
External IDE Agents
Any OpenAI-compatible client — Cline, Aider, Goose, OpenHands, Continue, Flowise — reaches the gateway via /v1 + the X-Project-ID header, calls MCP tools over stdio, and uses cloud Claude or GPT for reasoning under tool constraints when the engagement runs in hybrid mode.
localLLM stays the project-intelligence backend; the IDE agent is the brain. The gateway is OpenAI-compatible HTTP — there is no agent-detection logic and no rejection list. Any client that can hit a custom base URL and add a header gets the full RAG-injected experience.
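Concretely, pointing a stock OpenAI client at the gateway looks like this; host, port, project ID, and the model value are examples, not fixed values.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",           # the localLLM gateway
    api_key="not-used-locally",                    # required by the client, unused on-prem
    default_headers={"X-Project-ID": "acme-api"},  # triggers 3-layer RAG injection
)
resp = client.chat.completions.create(
    model="auto",  # example value: let the gateway route by task type
    messages=[{"role": "user", "content": "Where is rate limiting enforced?"}],
)
print(resp.choices[0].message.content)
```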
Air-Gapped or Hybrid
A single project KB — two doors. The architecture is identical; only the brain changes.
Air-Gapped
100% local LLMs · zero network egress
Local recall, local memory, local reasoning. No data leaves the client perimeter. Runs on commodity hardware (M-series Mac, RTX desktop, on-prem GPU server). The Pipelines DSL, MCP server, and orchestrator behave identically — only the model backend changes.
Hybrid (Stance B)
Local recall + cloud LLM reasoning
Local 8B–30B models handle recall well (KB, memory, structure, rules); cloud Claude or GPT handles reasoning under tool constraints reliably. Hybrid assigns each kind of work to the brain that does it best. The cloud LLM sees only the RAG-injected payload of each call and individual MCP tool responses, never the raw archive.
Local Drawer (Direct Mode)
Project Q&A in ~3 s warm, 100% on-prem. Bypasses the agent layer entirely — pure retrieval + completion against the active model family.
External Agent + Cloud LLM
Code edits via Cline; localLLM provides project context via MCP tools and RAG injection. The cloud account is the customer’s — we don’t broker tokens.
Verified Against Current Code
Every number on this page is grounded in the running source tree, verified on 2026-04-29.
| Surface | Count | Source |
|---|---|---|
| Model families (catalog) | 54 | config/catalog.yaml |
| Active families on M-series | 5 | config/models.macos.yaml |
| Routed task types | 30 | skills/project_orchestrator.py · TASK_MODEL_MAP |
| Agentic orchestrator tools | 24 | tools/registry.py |
| MCP tools (foundational + lifecycle) | 38 (19 + 19) | mcp_server.py · mcp_server_lifecycle.py |
| Memory layers | 5 | skills/knowledge_base.py |
| Webhook lifecycle actions | 6 | webhooks/router.py |
Best fit and known limitations
Best for
Regulated organisations that must keep prompts, code, and context inside the perimeter, want a project-aware RAG knowledge base, and prefer an OpenAI-compatible gateway so existing IDE assistants and pipelines work unchanged.
Not the right fit
Teams happy with cloud LLMs and short-lived prompts; lightweight chatbot use without code or document context; environments that cannot accommodate the modest GPU/CPU footprint local inference requires.
Known limitations
Answer quality is bounded by the local model family chosen for each task type; ingestion of very large monorepos requires tuning and storage planning; first-time setup includes infrastructure decisions (GPU, storage, VS Code rollout).
Discuss a similar engagement
Air-gapped sovereign AI, hybrid RAG-injected gateway, or a custom agent surface for an existing tool stack — we deliver the project-intelligence backend; you keep the perimeter.