LocalBrain — a living knowledge graph and FRCP-aligned e-discovery workbench
Years of corporate email and document history become a navigable, provenance-stamped knowledge graph — with an e-discovery surface designed against FRCP 26 sitting on top. Everything runs inside the client perimeter. The team that uses it is the team that owns the data.
What it is, in one paragraph
LocalBrain reads the mail archives and document folders an organisation already has — PST, MBOX, EML, MSG plus PDF, DOCX, XLSX, PPTX, TXT, CSV and image OCR — and turns them into a queryable knowledge graph that lives on the operator’s own hardware. Investigators can navigate the graph as a force-directed WebGL canvas, ask natural-language questions through a chat-first console, run named saved queries with temporal as-of snapshots, and receive alerts when a stakeholder is silently dropped from a thread. On top of that knowledge layer sits a full e-discovery workbench — matter scoping, a custodian roster, an append-only chain-of-custody ledger, a five-state privilege workflow, Bates numbering, audit-tracked redaction overlays, a privilege log aligned to FRCP 26(b)(5)(A), and a production-set CSV export. The architecture is opinionated: nothing leaves the perimeter. Parsing, language-model reasoning, embedding, retrieval and audit all run locally.
Data flow — from a folder of files to a defensible record
Top-down: heterogeneous inputs are normalised, language-processed and extracted; relationships persist in a dual store; analysis surfaces communities, anomalies, hybrid search and forensic queries; the e-discovery layer wraps the lot; presentation surfaces serve investigators and downstream tools.
[Diagram: the dual store and the analysis surfaces built on it]
Graph + Vector Store: People · Threads · Topics · Organisations; semantic similarity · neighbourhood traversal
Analytics + Full-Text Store: canonical facts · BM25 index · bi-temporal audit; source of truth for the knowledge graph
Analysis surfaces: Communities (who works with whom) · Gap Anomalies (silent exclusions) · Hybrid Search (BM25 + vector · RRF) · Forensic Queries (temporal as-of)
From a folder of files to a clean canonical record
Ingestion does the unglamorous work that everything else depends on. The pipeline parses files, reconstructs threads from RFC 5322 headers, skips duplicates, and routes every body through quote stripping, forward-boundary detection and coreference resolution — so “he agreed” becomes “Alice agreed” before any language model sees it.
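Thread reconstruction from RFC 5322 headers can be sketched with the Python standard library. This is a minimal illustration, not LocalBrain's parser: the function names and the sample messages are hypothetical, and it assumes every message carries a Message-ID.

```python
from email import message_from_string
from email.message import Message
from typing import Dict, List, Optional


def parent_id(msg: Message) -> Optional[str]:
    """Return the Message-ID this message replies to, or None for a root."""
    in_reply_to = (msg.get("In-Reply-To") or "").split()
    if in_reply_to:
        return in_reply_to[0]
    references = (msg.get("References") or "").split()
    # In RFC 5322 the last References entry is the immediate parent.
    return references[-1] if references else None


def build_threads(raw_messages: List[str]) -> Dict[str, List[str]]:
    """Group Message-IDs under the root Message-ID of their thread."""
    parents: Dict[str, Optional[str]] = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        parents[msg["Message-ID"].strip()] = parent_id(msg)

    def root_of(message_id: str) -> str:
        seen = set()
        # Walk up until no parent is recorded; `seen` guards against header loops.
        while parents.get(message_id) and message_id not in seen:
            seen.add(message_id)
            message_id = parents[message_id]
        return message_id

    threads: Dict[str, List[str]] = {}
    for message_id in parents:
        threads.setdefault(root_of(message_id), []).append(message_id)
    return threads


# Hypothetical sample: three messages in one thread, one unrelated message.
RAW = [
    "Message-ID: <a@x>\nSubject: Q3\n\nhi",
    "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nSubject: Re: Q3\n\nok",
    "Message-ID: <c@x>\nReferences: <a@x> <b@x>\nSubject: Re: Q3\n\nfine",
    "Message-ID: <d@x>\nSubject: Other\n\nyo",
]
threads = build_threads(RAW)
```

A reply whose parent is missing from the archive becomes the root of its own partial thread, which keeps the step deterministic.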
Mail archives
The parser folds PST, MBOX, EML and MSG into a single canonical mail model with RFC 5322 threading and attachment metadata. A local SQLite ledger checkpoints deduplication so re-runs stay idempotent.
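One way a SQLite ledger can make re-runs idempotent is to key each ingested item by a content hash. The sketch below is an assumption about how such a checkpoint could look: the table name, the columns and the choice to scope the key by matter are all hypothetical.

```python
import hashlib
import sqlite3


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical deduplication ledger."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "matter_id TEXT NOT NULL, digest TEXT NOT NULL, source_path TEXT NOT NULL, "
        "PRIMARY KEY (matter_id, digest))"
    )
    return conn


def ingest_once(conn: sqlite3.Connection, matter_id: str, raw: bytes, source_path: str) -> bool:
    """Record the item and return True only the first time it is seen in a matter."""
    digest = hashlib.sha256(raw).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen (matter_id, digest, source_path) VALUES (?, ?, ?)",
        (matter_id, digest, source_path),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the key already existed.
    return cur.rowcount == 1


ledger = open_ledger()
first = ingest_once(ledger, "matter-1", b"raw message bytes", "archive.pst")
again = ingest_once(ledger, "matter-1", b"raw message bytes", "copy-of-archive.pst")
```

Running the same archive twice leaves the ledger unchanged, so a crashed or repeated ingest can simply be started again.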
Document corpora
The extractor handles PDF, DOCX, XLSX, PPTX, TXT and CSV deterministically; scanned images go through OCR fallback. Every artefact carries provenance back to its source path.
Live sources
An IMAP connector pulls from on-premise mail servers; a folder watcher on a 30-second interval re-scans when files on disk change. A folder-to-matter template can auto-tag inbound files to the right matter without manual triage.
Coreference before extraction
The resolver maps pronouns to the speaker and named-entity context in each thread before any LLM call. This is a hard rule: “rule-first, LLM-second” — the language model never sees ambiguous text it would have to guess about.
A graph where every edge tells you how it got there
The knowledge graph is not a bag of inferred relationships. Every edge belongs to exactly one of three provenance tiers and carries six mandatory fields, so a reviewer can always answer the question: why does the system claim this?
Deterministic
Derived from RFC 5322 headers and structural facts.
Sent-to · CC · Replied-to · Thread-of
Confidence 1.0 · no manual review
Rule-Derived
Inferred from deterministic rules over headers and body patterns.
Forwards · Org-of · Communicates-with
High confidence · spot review
LLM-Extracted
Semantic relations from a local LLM with embedding-based grounding.
Requests-action · Reports-to · Discusses
Probabilistic · mandatory human review
Every edge, every tier: source · confidence · evidence excerpt · timestamp · model version · review status.
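The three tiers and six mandatory fields map naturally onto a validated record type. The following is a sketch under stated assumptions: the class and field names are hypothetical, and using `model_version` to hold a parser or rule-set version for the non-LLM tiers is an assumption, not something the description above specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Tier(Enum):
    DETERMINISTIC = "deterministic"
    RULE_DERIVED = "rule-derived"
    LLM_EXTRACTED = "llm-extracted"


@dataclass(frozen=True)
class Edge:
    head: str
    relation: str
    tail: str
    tier: Tier
    # The six mandatory provenance fields.
    source: str
    confidence: float
    evidence: str
    timestamp: datetime
    model_version: str  # assumption: parser/rule-set version for non-LLM tiers
    review_status: str

    def __post_init__(self) -> None:
        for name in ("source", "evidence", "model_version", "review_status"):
            if not getattr(self, name):
                raise ValueError(f"missing mandatory provenance field: {name}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        # Deterministic edges are stated to carry confidence 1.0.
        if self.tier is Tier.DETERMINISTIC and self.confidence != 1.0:
            raise ValueError("deterministic edges must have confidence 1.0")


edge = Edge(
    head="alice@example.com", relation="sent-to", tail="bob@example.com",
    tier=Tier.DETERMINISTIC, source="archive.pst", confidence=1.0,
    evidence="To: bob@example.com",
    timestamp=datetime(2024, 1, 10, tzinfo=timezone.utc),
    model_version="header-parser-1", review_status="not-required",
)
```

Rejecting an incomplete edge at construction time is one way to ensure no edge enters the graph without all six fields.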
Dual store, on purpose
The knowledge graph lives in a graph + vector store optimised for traversal and semantic similarity; the canonical facts and a BM25 full-text index live in an analytics store with bi-temporal columns. The analytics store is the source of truth — the graph store is a queryable projection. If a better graph engine ships tomorrow, the projection can be rebuilt without losing anything.
Continuous human review
LLM-extracted edges go into a keyboard-driven review queue: J/K to step, Y to confirm, N to reject. Decisions persist alongside the edge, and a reviewer can always trace a confirmed relationship back to the exact body excerpt and the exact model version that produced it.
Communities, anomalies and answers that cite the evidence
Once the graph is built, the analyst gets four analysis surfaces and two presentation surfaces, all driven from the same underlying graph and all answering the kinds of questions an investigation actually needs.
Communities (Leiden)
The Leiden algorithm clusters the communicates-with graph to surface who actually works with whom — not who is on the org chart. Each community gets a short, locally-generated description so the analyst can scan the landscape fast.
Participant-gap alerts
A deterministic delta over the thread DAG flags candidates: people who appeared in two or more earlier messages but vanished from a later one. A second pass classifies each candidate — oversight, restriction, deliberate exclusion — and writes an anomaly node with the reason recorded.
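The first, deterministic pass can be sketched as a running count of participants per thread. This illustration simplifies the thread DAG to a single ordered chain of messages, and the function name and threshold parameter are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def gap_candidates(thread: List[Set[str]], min_prior: int = 2) -> List[Tuple[int, str]]:
    """Return (message_index, person) pairs for people dropped from a message.

    `thread` holds the participant set (sender plus recipients) of each
    message in chronological order.
    """
    appearances: Counter = Counter()
    candidates: List[Tuple[int, str]] = []
    for index, participants in enumerate(thread):
        for person, count in sorted(appearances.items()):
            # Flag anyone seen in at least `min_prior` earlier messages
            # who is absent from this one.
            if count >= min_prior and person not in participants:
                candidates.append((index, person))
        appearances.update(participants)
    return candidates


thread = [
    {"alice", "bob", "gc"},
    {"alice", "bob", "gc"},
    {"alice", "bob"},  # "gc" silently dropped here
]
flags = gap_candidates(thread)
```

The classification pass (oversight, restriction, deliberate exclusion) would then run only over these candidates rather than over every message.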
Hybrid search with RRF fusion
BM25 lexical search and vector similarity search both return ranked results; Reciprocal Rank Fusion combines them without requiring score normalisation. The output is a single ranked list where every hit cites the underlying message.
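Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in, which is why no score normalisation is needed. The sketch below assumes the conventional constant k = 60; the value LocalBrain uses is not stated, and the message ids are hypothetical.

```python
from typing import Dict, List, Tuple


def rrf(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked id lists; only positions matter, never the raw scores."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["m1", "m2", "m3"]    # lexical ranking
vector_hits = ["m3", "m1", "m4"]  # semantic ranking
fused = rrf([bm25_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `m1` and `m3` rise to the top because both retrievers returned them, while `m2` and `m4` each appear in only one list.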
Forensic queries with temporal as-of
Bi-temporal storage means the graph can answer “what did this look like on date X?” even after later edits, retractions or re-classifications. Operators pin named saved queries to a matter and re-run them as the matter evolves.
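An as-of query over bi-temporal rows can be illustrated with an in-memory SQLite table. The schema, column names and sample rows here are hypothetical; the point is only that a superseded row is kept rather than overwritten, so an earlier view stays reproducible.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE edge_facts (
        edge_id TEXT,
        review_status TEXT,
        valid_from TEXT,     -- when the fact became true in the world
        recorded_at TEXT,    -- when the system recorded this version
        superseded_at TEXT   -- when a later version replaced it (NULL = current)
    )"""
)
conn.executemany(
    "INSERT INTO edge_facts VALUES (?, ?, ?, ?, ?)",
    [
        ("e1", "pending", "2024-01-10", "2024-02-01", "2024-03-15"),
        ("e1", "rejected", "2024-01-10", "2024-03-15", None),
    ],
)


def status_as_of(edge_id: str, as_of: str) -> Optional[str]:
    """Return the review status the system held for an edge on a given date."""
    row = conn.execute(
        "SELECT review_status FROM edge_facts "
        "WHERE edge_id = ? AND recorded_at <= ? "
        "AND (superseded_at IS NULL OR superseded_at > ?)",
        (edge_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None
```

ISO-8601 date strings compare correctly as text, which keeps the query portable across stores that lack a native date type.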
Chat-first console
A slash-command launcher, voice input, matter switcher and a streaming agent loop sit on top of everything else. The analyst can ask in plain English and the answer comes back with citations to the messages that grounded it.
Force-directed WebGL graph
People, threads, organisations, topics, communities and anomalies render as a navigable graph. Clicking a node shows its neighbourhood, its detail panel and the edges that touch it — with the provenance class colour-coded on the line.
The workflow a litigator actually runs
The e-discovery layer is not a search dashboard with a privilege checkbox. It is a matter-scoped workbench with first-class primitives for the steps an investigation produces — from intake through production — designed against FRCP 26 and built so every step leaves a defensible trail.
[Diagram: e-discovery workflow stages]
Bates Numbering: designed monotonic · concurrency-safe
Redaction Overlay: audit-tracked · non-destructive
Privilege Log: aligned to FRCP 26(b)(5)(A)
Production-Set CSV: per-matter export
Every mutating endpoint honours a dry-run preflight. Audit payloads carry SHA-256 hashes — never the protected content.
Matter as a primitive
Every artefact in the system is bound to a named matter. The same email parsed twice under two different custodians produces two defensible records, not a collision. Matter scoping flows through retrieval, the agent loop, and the export surfaces.
Custodian + chain-of-custody
Each matter has its own custodian roster and an append-only chain-of-custody ledger. The ledger captures every ingest, every promotion, every export as an event with a hash and a timestamp.
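An append-only ledger of hashed, timestamped events might look like the sketch below. Linking each event to the previous event's hash is an assumption added here to make tampering detectable; the class, field names and sample values are hypothetical. Note that only a SHA-256 digest of the content is stored, never the content itself.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List

GENESIS = "0" * 64


def _digest(body: Dict[str, str]) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class CustodyLedger:
    """In-memory illustration; a real ledger would persist each event."""

    def __init__(self) -> None:
        self._events: List[Dict[str, str]] = []

    def append(self, matter: str, custodian: str, action: str, content: bytes) -> Dict[str, str]:
        event = {
            "matter": matter,
            "custodian": custodian,
            "action": action,
            "content_sha256": hashlib.sha256(content).hexdigest(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": self._events[-1]["event_hash"] if self._events else GENESIS,
        }
        event["event_hash"] = _digest(event)  # hash computed before the key is added
        self._events.append(event)
        return dict(event)

    def verify(self) -> bool:
        """Recompute every hash and check each link to the previous event."""
        prev = GENESIS
        for event in self._events:
            body = {k: v for k, v in event.items() if k != "event_hash"}
            if event["prev_hash"] != prev or _digest(body) != event["event_hash"]:
                return False
            prev = event["event_hash"]
        return True


custody = CustodyLedger()
custody.append("matter-1", "alice", "ingest", b"raw PST bytes")
custody.append("matter-1", "alice", "export", b"production set")
```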
Five-state privilege workflow
Documents move through unreviewed → asserted → produced → clawed-back → released. The clawback transition mirrors FRCP 26(b)(5)(B) so a post-production assertion of privilege has a defined path back, with an audited reason field.
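The five states read naturally as a small state machine. This sketch models only the linear path listed above; any additional branches in the real workflow are unknown, and the names are hypothetical. Requiring a reason on clawback follows the audited reason field described above.

```python
from enum import Enum


class PrivilegeState(Enum):
    UNREVIEWED = "unreviewed"
    ASSERTED = "asserted"
    PRODUCED = "produced"
    CLAWED_BACK = "clawed-back"
    RELEASED = "released"


# Assumption: only the linear transitions named in the text are allowed.
ALLOWED = {
    PrivilegeState.UNREVIEWED: {PrivilegeState.ASSERTED},
    PrivilegeState.ASSERTED: {PrivilegeState.PRODUCED},
    PrivilegeState.PRODUCED: {PrivilegeState.CLAWED_BACK},
    PrivilegeState.CLAWED_BACK: {PrivilegeState.RELEASED},
    PrivilegeState.RELEASED: set(),
}


def transition(current: PrivilegeState, target: PrivilegeState, reason: str = "") -> PrivilegeState:
    """Return the new state, or raise if the move is not permitted."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    if target is PrivilegeState.CLAWED_BACK and not reason.strip():
        raise ValueError("a clawback must record a reason")
    return target


state = transition(PrivilegeState.PRODUCED, PrivilegeState.CLAWED_BACK,
                   reason="privileged memo produced in error")
```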
Bates numbering, designed correct
Bates assignment is monotonic and concurrency-safe by construction. Two reviewers stamping at the same time cannot duplicate or skip a number — the assignment runs under an async lock that serialises the sequence even under load.
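Serialising assignment under an async lock can be sketched with `asyncio.Lock`. The class name, the `ACME` prefix and the seven-digit width are hypothetical; the `sleep(0)` stands in for an awaited write to the store, which is exactly where an unlocked implementation could hand out the same number twice.

```python
import asyncio
from typing import List


class BatesStamper:
    def __init__(self, prefix: str, start: int = 1, width: int = 7) -> None:
        self._prefix = prefix
        self._next = start
        self._width = width
        self._lock = asyncio.Lock()

    async def assign(self) -> str:
        # Read, persist and increment happen as one critical section.
        async with self._lock:
            number = self._next
            await asyncio.sleep(0)  # placeholder for an awaited store write
            self._next = number + 1
        return f"{self._prefix}{number:0{self._width}d}"


async def demo() -> List[str]:
    stamper = BatesStamper("ACME")
    # Fifty concurrent reviewers requesting a stamp at the same time.
    return await asyncio.gather(*(stamper.assign() for _ in range(50)))


stamps = asyncio.run(demo())
```

Without the lock, every task would read the same `_next` before any of them wrote it back; with it, the sequence has no duplicates and no gaps.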
Redaction as audit-tracked overlay
Redactions are never destructive. They are overlays stored against the original, with the reviewer, the reason, the timestamp and the revision history attached — so a successful challenge can always produce the unredacted original.
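A non-destructive overlay can be modelled as a list of spans rendered over an untouched original. The record fields mirror the ones named above (reviewer, reason, timestamp); the character-offset span model, the names and the sample text are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Redaction:
    start: int  # inclusive character offset into the original
    end: int    # exclusive character offset
    reviewer: str
    reason: str
    timestamp: str


def render(original: str, overlays: List[Redaction], mask: str = "█") -> str:
    """Return a redacted view; the original string is never modified."""
    chars = list(original)
    for overlay in overlays:
        for i in range(max(overlay.start, 0), min(overlay.end, len(chars))):
            chars[i] = mask
    return "".join(chars)


original = "Call Alice at 555-0100 today"
overlays = [Redaction(14, 22, "reviewer-1", "personal phone number", "2024-05-01T09:00:00Z")]
redacted = render(original, overlays)
```

Because the overlay list is stored separately, removing an overlay restores the unredacted view without any recovery step.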
Privilege log and production-set CSV
A privilege log aligned to FRCP 26(b)(5)(A) and a production-set CSV are first-class exports. Manifests, Bates ranges, custodian and privilege state ship together so the production package is reviewable end-to-end before it leaves the system.
Security is the architecture, not a feature
Because nothing should leave the perimeter, every layer has to respect the perimeter, not just an outer firewall rule. The same discipline applies to failures that start inside the system: an internal bug or an operator mistake should never become a denial of service for the analyst.
No external API for data processing
Parsing, language-model reasoning, embedding, retrieval and audit all run on local hardware. The only sanctioned outbound channels are user-authorised ingestion from IMAP and on-premise file shares, and optional SMTP delivery of scheduled digest reports.
Layered egress controls on agent output
The agent loop runs behind an input-validation layer that defends against prompt injection on untrusted ingest text, and an output-side filter at the egress chokepoint that catches data trying to leave through generated answers.
Hardened outbound connector boundary
Outbound connector calls pass a master kill-switch, a per-account host pin and a resolved-IP filter against SSRF. Credentials live in the OS keyring and are never written to logs or audit payloads.
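The three checks can be composed into a single gate in front of every outbound call. This is a sketch, not the shipped boundary: the function signature, the hostnames and the per-account network allow-list are assumptions. It takes already-resolved addresses as input so the decision is made on the IPs that will actually be contacted.

```python
import ipaddress
from typing import List


def connection_allowed(
    host: str,
    resolved_ips: List[str],
    *,
    pinned_host: str,
    allowed_networks: List[str],
    kill_switch_on: bool,
) -> bool:
    """Apply kill-switch, host pin and resolved-IP filter, in that order."""
    if kill_switch_on:
        return False
    if host.lower() != pinned_host.lower():
        return False
    if not resolved_ips:
        return False
    networks = [ipaddress.ip_network(n) for n in allowed_networks]
    for ip in resolved_ips:
        addr = ipaddress.ip_address(ip)
        # Never reach loopback, link-local (e.g. metadata endpoints),
        # multicast or unspecified addresses, whatever the allow-list says.
        if addr.is_loopback or addr.is_link_local or addr.is_multicast or addr.is_unspecified:
            return False
        if not any(addr in net for net in networks):
            return False
    return True


ok = connection_allowed(
    "mail.corp.example", ["10.1.2.3"],
    pinned_host="mail.corp.example",
    allowed_networks=["10.1.0.0/16"],
    kill_switch_on=False,
)
```

Every resolved address must pass, so a hostname that resolves to one allowed and one forbidden IP is rejected as a whole.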
PII on the data-security surface
A dual-engine PII detection pipeline scans the corpus with span-level provenance labels so the analyst can see exactly which spans triggered, where, and by which detector. GDPR delete flagging is built into the data-security page.
Audit by hash, not by content
Audit payloads carry SHA-256 hashes of protected content — never the content itself. Forensic admissibility lives in the data model rather than in a paper trail bolted on at the end.
Hard resource guards
A preflight check guards every long-running operation: free-disk floor on the data volume, bounded streaming captures, a free-RAM warning before inference, log-directory rotation thresholds. An internal bug cannot become a denial of service.
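The free-disk floor is the simplest of these guards to illustrate. The function name and the threshold values are hypothetical; the check itself relies only on `shutil.disk_usage` from the standard library.

```python
import shutil
from typing import Tuple


def preflight_disk(path: str, min_free_bytes: int) -> Tuple[bool, str]:
    """Refuse to start a long-running job when free space is below the floor."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        return False, f"only {free} bytes free on {path!r}, need {min_free_bytes}"
    return True, "ok"


# Hypothetical floors: zero always passes, an absurdly large one always fails.
passes, _ = preflight_disk(".", 0)
blocked, message = preflight_disk(".", 10**30)
```

Returning a reason alongside the verdict lets the caller surface a clear message instead of failing halfway through an ingest.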
One stack, two postures
Same architecture, same code path — what changes is the visibility of cloud-connector UI and which channels may ingest. Operators choose the posture at install time.
Local-only mode
Air-gappable. Zero cloud-OAuth surface visible.
All cloud-connector UI is hidden. Only the local-folder ingest and IMAP are visible to operators. Useful for genuinely air-gapped deployments where the cloud surfaces would only be a distraction or a compliance concern.
Default mode
Cloud-connector UI visible, none active by default.
The same code path with the cloud-connector surfaces left visible so operators can see what later integration will connect. Active ingest stays restricted to local folders and IMAP until the customer explicitly authorises an additional provider.
Supported hosts
Windows 11 with Git Bash or WSL2, and macOS Apple Silicon (tested on a Mac mini M4 / 16 GB). One idempotent startup script handles dependency checks and starts the four cooperating services in the only order that works.
Single-host today
The current deployment target is a single workstation or a single operator’s box inside the customer perimeter. There is no cloud control plane, no staging environment, no managed-service component to depend on.
What an analyst actually does with it
The architecture is the means; the analyst’s working hour is the end. Four scenarios that hit the main surfaces of the system.
Open a matter, drop a PST
The analyst creates a matter, names a custodian, drops a PST file into the folder watcher. Within minutes the pipeline parses, threads, deduplicates, language-processes and indexes the file — writing every artefact into the chain-of-custody ledger for that custodian.
Find the silent exclusion
The graph view lights up a red anomaly node on a thread where the General Counsel was always CC’d — until a particular date. The analyst clicks through, sees the reason classification, the confidence score, and the exact message that triggered the alert.
Save a question, re-run it
“Every communication with opposing counsel between two dates” goes in as a named saved query. The analyst pins it to the matter and runs it again every Monday morning. Hybrid retrieval handles the lexical and the semantic side; RRF fuses the rankings; every hit cites its source.
Produce a defensible package
Documents move through the privilege workflow. The reviewer stamps Bates numbers monotonically. Redactions go on as overlays with reviewer, reason and timestamp. The privilege log and the production-set CSV export side by side — ready for review before the package leaves the system.
The rules the system holds itself to
LocalBrain enforces eight architectural principles as project-wide invariants. They are not aspirational: a pull request that violates one is rejected in the same review process every other code change goes through.
P1 · Rule-first, LLM-second
Header fields, RFC threading and participant lists are deterministic facts. The language model is used only for semantic enrichment, never for facts that can be derived from structure.
P2 · Store-once, project-many
Raw mail is stored once. All derived layers — vector, graph, Obsidian vault, exports — are projections. No uncontrolled duplication of source data.
P3 · Three-class edge taxonomy
Every relation has exactly one class: deterministic, rule-derived, or LLM-extracted. Classes are never mixed; downstream consumers can always filter by tier.
P4 · Provenance on every edge
Six mandatory fields: source, confidence, evidence, timestamp, model version, review status. No edge enters the graph without all six.
P5 · Coreference before LLM
Pronouns are resolved before any language-model call. “He agreed” is meaningless without knowing who “He” is — the model never has to guess.
P6 · Guided entity-pair extraction
Relations are extracted by asking constrained questions about pre-computed entity pairs — not by asking the model to free-form discover relations. Avoids the NA-imbalance problem that wrecks unconstrained extraction.
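Guided extraction can be sketched as generating one closed question per ordered entity pair and relation, then accepting only the allowed answers. The prompt wording, the function names and the yes/no answer set are assumptions; the relation names are the LLM-extracted examples listed earlier.

```python
from itertools import permutations
from typing import Dict, List

RELATIONS = ("requests-action", "reports-to", "discusses")
ALLOWED_ANSWERS = ("yes", "no")


def build_questions(entities: List[str], excerpt: str) -> List[Dict[str, str]]:
    """One constrained question per ordered pair and candidate relation."""
    questions = []
    for head, tail in permutations(sorted(set(entities)), 2):
        for relation in RELATIONS:
            questions.append({
                "head": head,
                "tail": tail,
                "relation": relation,
                "prompt": (
                    f"Excerpt: {excerpt}\n"
                    f"Does the excerpt show that {head} {relation} {tail}? "
                    "Answer yes or no."
                ),
            })
    return questions


def parse_answer(raw: str) -> str:
    """Reject anything outside the closed answer set instead of guessing."""
    answer = raw.strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected model answer: {raw!r}")
    return answer


questions = build_questions(["Alice", "Bob"], "Alice asked Bob to send the report.")
```

Since the model only confirms or denies a proposed relation, the "no relation" case is an explicit answer rather than an absence the model has to discover.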
P7 · Obsidian as projection
The per-matter Obsidian vault is a display layer for analyst notes, not a primary database. Truth lives in the analytics store, with the graph store as its queryable projection; the vault stays in lockstep with both.
P8 · No cloud egress
All models, all databases, and every processing step run locally. Zero external API calls for data processing — an invariant, not a configuration flag.
What runs the engagement
A short table of generic capability categories — specific vendor choices are deliberate and can be substituted without changing the architecture.
| Layer | Capability category | Why this layer |
|---|---|---|
| Storage (graph) | Embedded graph + vector store | Neighbourhood traversal and semantic similarity in one engine, with named query support. |
| Storage (analytics) | Columnar analytics store with BM25 full-text index | Canonical fact store and bi-temporal audit; full-text search co-located with the analytics columns. |
| Inference | Local language-model runtime + local embedding model | All reasoning and embedding stays on the host; no cloud LLM is called for data processing. |
| NLP | Local coreference resolver + NER pipeline | Pronouns and entities are resolved before any LLM call so the model never sees ambiguous text. |
| Clustering | Leiden algorithm over the communicates-with graph | Surfaces communities that reflect actual communication patterns, not org-chart structure. |
| Retrieval | Hybrid BM25 + vector search fused via Reciprocal Rank Fusion | Lexical precision and semantic recall combined without requiring score normalisation. |
| API | FastAPI REST surface + Model Context Protocol server | One surface for internal UI, one surface for compatible AI clients; same tool catalogue underneath. |
| Frontend | React + TypeScript + force-directed WebGL canvas | Multi-page SPA with a graph as a first-class navigation surface, not a static visualisation. |
| Hosts | Windows 11 (Git Bash / WSL2) and macOS Apple Silicon | One idempotent startup script covers both. The same code path runs in both environments. |
Best fit and known limitations
Best for
Regulated organisations with internal investigations, legal hold, litigation response or compliance audits — where the data cannot leave the perimeter and every relation needs to be defensible back to a source. Particularly strong fit for teams whose discovery workflow already references FRCP 26 vocabulary.
Not the right fit
Teams that want a turnkey cloud SaaS, a fully managed e-discovery service, or a generic search box over mail. LocalBrain is a custom on-premises engagement; it expects an operator who values control over convenience.
Known limitations
Answer quality is bounded by the local model family chosen for each task. The current scope is single-host on one workstation per matter; horizontal scale-out is on the roadmap but not in scope today. Cloud-source connectors exist in code but are deferred — current sanctioned ingest is local folders and IMAP.
Discuss a similar engagement
If you have a regulated investigation, an internal review, or a litigation workflow that needs to stay inside your perimeter — with a defensible record at every step — we can build the system around your matter, not around our SaaS.