Custom AI Solution. Knowledge Graph. E-Discovery Workflow. Matter Scoping. Local-Only Operation. In Active Development.

What it is, in one paragraph

LocalBrain reads the mail archives and document folders an organisation already has — PST, MBOX, EML, MSG plus PDF, DOCX, XLSX, PPTX, TXT, CSV and image OCR — and turns them into a queryable knowledge graph that lives on the operator’s own hardware. Investigators can navigate the graph as a force-directed WebGL canvas, ask natural-language questions through a chat-first console, run named saved queries with temporal as-of snapshots, and receive alerts when a stakeholder is silently dropped from a thread. On top of that knowledge layer sits a full e-discovery workbench — matter scoping, a custodian roster, an append-only chain-of-custody ledger, a five-state privilege workflow, Bates numbering, audit-tracked redaction overlays, a privilege log aligned to FRCP 26(b)(5)(A), and a production-set CSV export. The architecture is opinionated: nothing leaves the perimeter. Parsing, language-model reasoning, embedding, retrieval and audit all run locally.

Provenance Classes per Edge

Privilege Workflow States

Supported Source Formats

Architecture · Diagram 1 of 3.

Data flow — from a folder of files to a defensible record

Top-down: heterogeneous inputs are normalised, language-processed and extracted; relationships persist in a dual store; analysis surfaces communities, anomalies, hybrid search and forensic queries; the e-discovery layer wraps the lot; presentation surfaces serve investigators and downstream tools.

Ingest → Graph → E-Discovery → Surface

PST · MBOX · EML · MSG
PDF · DOCX · XLSX · PPTX
TXT · CSV · Image OCR
IMAP · Local Folder Watch

↓

Ingestion, NLP & Extraction
RFC 5322 threading · Quote stripping · Coreference · Entity-pair relation extraction with provenance

↓

Graph + Vector Store

People · Threads · Topics · Organisations
Semantic similarity · neighbourhood traversal

Analytics + Full-Text Store

Canonical facts · BM25 index · Bi-temporal audit
Source of truth for the knowledge graph

↓

Leiden Communities: who works with whom
Gap Anomalies: silent exclusions
Hybrid Search: BM25 + vector · RRF
Forensic Queries: temporal as-of

↓

E-Discovery Layer
Matter scope · Custodian · Chain-of-custody · Privilege workflow · Bates · Redaction · Production CSV

↓

Force-Directed WebGL Graph
Chat Console · Slash + STT + Matter Switch
Per-Matter Obsidian Vault · git
MCP + REST Tool Surface

Layer 1 · Ingest.

From a folder of files to a clean canonical record

Ingestion does the unglamorous work that everything else depends on. The pipeline parses files, reconstructs threads from RFC 5322 headers, skips duplicates, and routes every body through quote stripping, forward-boundary detection and coreference resolution — so “he agreed” becomes “Alice agreed” before any language model sees it.
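Thread reconstruction from RFC 5322 headers can be sketched with the Python standard library alone. The function names `message_and_parent` and `thread_roots` are illustrative, and the rule of preferring In-Reply-To and falling back to the last References entry is an assumption about how the pipeline resolves parents, not a description of LocalBrain's parser.

```python
from email import message_from_string
from typing import Dict, Iterable, Optional, Tuple


def message_and_parent(raw: str) -> Tuple[str, Optional[str]]:
    """Return (Message-ID, parent Message-ID or None) for one raw message."""
    msg = message_from_string(raw)
    message_id = (msg.get("Message-ID") or "").strip()
    in_reply_to = (msg.get("In-Reply-To") or "").strip()
    references = (msg.get("References") or "").split()
    # Assumption: prefer In-Reply-To, else the last entry of References.
    parent = in_reply_to or (references[-1] if references else None)
    return message_id, parent


def thread_roots(raws: Iterable[str]) -> Dict[str, str]:
    """Map every Message-ID to the root Message-ID of its thread."""
    parents = dict(message_and_parent(raw) for raw in raws)

    def root_of(message_id: str) -> str:
        visited = set()  # guards against malformed, cyclic header chains
        while parents.get(message_id) and message_id not in visited:
            visited.add(message_id)
            message_id = parents[message_id]
        return message_id

    return {message_id: root_of(message_id) for message_id in parents}
```

Messages that share a root belong to the same thread. If a parent is missing from the archive, its Message-ID still serves as the grouping key.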

Mail archives

The parser folds PST, MBOX, EML and MSG into a single canonical mail model with RFC 5322 threading and attachment metadata. A local SQLite ledger checkpoints deduplication so re-runs stay idempotent.
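A minimal sketch of how a SQLite ledger can make re-runs idempotent. The table name `seen`, the helper names, and the choice of a SHA-256 digest over the raw bytes as the deduplication key are all assumptions for illustration; the actual ledger schema is not described here.

```python
import hashlib
import sqlite3


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical dedup ledger."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "digest TEXT PRIMARY KEY, source_path TEXT NOT NULL)"
    )
    return conn


def ingest_once(conn: sqlite3.Connection, raw: bytes, source_path: str) -> bool:
    """Record the artefact; return True only the first time its bytes are seen."""
    digest = hashlib.sha256(raw).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen (digest, source_path) VALUES (?, ?)",
        (digest, source_path),
    )
    conn.commit()
    # rowcount is 0 when the primary key already existed and the insert was ignored.
    return cur.rowcount == 1
```

Because the digest is the primary key, running the same archive through the pipeline a second time inserts nothing and the caller can skip the duplicate.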

Document corpora

The extractor handles PDF, DOCX, XLSX, PPTX, TXT and CSV deterministically; scanned images go through OCR fallback. Every artefact carries provenance back to its source path.

Live sources

An IMAP connector pulls from on-premise mail servers; a 30-second folder watcher re-scans on disk change. A folder-to-matter template can auto-tag inbound files to the right matter without manual triage.
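The folder watcher and the folder-to-matter template might look roughly like this. It is a polling sketch: a caller would invoke `snapshot` every 30 seconds and diff against the previous result. The function names, the (mtime, size) change signature and the longest-prefix matching rule are assumptions, not the shipped behaviour.

```python
from pathlib import Path, PurePosixPath
from typing import Dict, List, Optional, Tuple

Signature = Tuple[int, int]  # (mtime in nanoseconds, size in bytes)


def snapshot(root: Path) -> Dict[str, Signature]:
    """Record a change signature for every file under the watched folder."""
    result: Dict[str, Signature] = {}
    for path in root.rglob("*"):
        if path.is_file():
            stat = path.stat()
            result[path.relative_to(root).as_posix()] = (stat.st_mtime_ns, stat.st_size)
    return result


def changed_paths(previous: Dict[str, Signature], current: Dict[str, Signature]) -> List[str]:
    """Files that are new or whose signature differs since the last scan."""
    return sorted(p for p, sig in current.items() if previous.get(p) != sig)


def matter_for(path: str, template: Dict[str, str]) -> Optional[str]:
    """Pick the matter whose folder is the longest prefix of the path."""
    parts = PurePosixPath(path).parts
    best: Optional[Tuple[int, str]] = None
    for folder, matter in template.items():
        prefix = PurePosixPath(folder).parts
        if parts[: len(prefix)] == prefix and (best is None or len(prefix) > best[0]):
            best = (len(prefix), matter)
    return best[1] if best else None
```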

Coreference before extraction

The resolver maps pronouns to the speaker and named-entity context in each thread before any LLM call. This is a hard rule: “rule-first, LLM-second” — the language model never sees ambiguous text it would have to guess about.

Layer 2 · Knowledge Graph.

A graph where every edge tells you how it got there

The knowledge graph is not a bag of inferred relationships. Every single edge belongs to exactly one of three provenance tiers, and every single edge carries six mandatory fields so a reviewer can always answer the question: why does the system claim this?

Provenance Taxonomy · 3 Tiers · 6 Mandatory Fields per Edge

Deterministic

Derived from RFC 5322 headers and structural facts.

Sent-to · CC · Replied-to · Thread-of

Confidence 1.0 · no manual review

Rule-Derived

Inferred from deterministic rules over headers and body patterns.

Forwards · Org-of · Communicates-with

High confidence · spot review

LLM-Extracted

Semantic relations from a local LLM with embedding-based grounding.

Requests-action · Reports-to · Discusses

Probabilistic · mandatory human review

Every edge, every tier: source · confidence · evidence excerpt · timestamp · model version · review status.
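One way to make the six mandatory fields impossible to omit is to encode them in the edge type itself. The sketch below is illustrative: the class and enum names, the review-status values, and the idea that a deterministic edge records its parser version in `model_version` are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Tier(Enum):
    DETERMINISTIC = "deterministic"
    RULE_DERIVED = "rule-derived"
    LLM_EXTRACTED = "llm-extracted"


class Review(Enum):  # hypothetical status values
    NOT_REQUIRED = "not-required"
    PENDING = "pending"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    tier: Tier                 # exactly one provenance tier per edge
    source: str                # 1. where the claim comes from
    confidence: float          # 2. 1.0 for deterministic edges
    evidence: str              # 3. excerpt that supports the claim
    timestamp: datetime        # 4. when the edge was produced
    model_version: str         # 5. model (or parser) version, by assumption
    review_status: Review      # 6. human review state

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")
        if self.tier is Tier.DETERMINISTIC and self.confidence != 1.0:
            raise ValueError("deterministic edges carry confidence 1.0")
        if not (self.source and self.evidence and self.model_version):
            raise ValueError("source, evidence and model_version are mandatory")
```

Because the dataclass has no default values, constructing an edge without any of the six fields fails immediately rather than producing a partially attributed relationship.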

Dual store, on purpose

The knowledge graph lives in a graph + vector store optimised for traversal and semantic similarity; the canonical facts and a BM25 full-text index live in an analytics store with bi-temporal columns. The analytics store is the source of truth — the graph store is a queryable projection. If a better graph engine ships tomorrow, the projection can be rebuilt without losing anything.

Continuous human review

LLM-extracted edges go into a keyboard-driven review queue: J/K to step, Y to confirm, N to reject. Decisions persist alongside the edge, and a reviewer can always trace a confirmed relationship back to the exact body excerpt and the exact model version that produced it.

Layer 3 · Analysis.

Communities, anomalies and answers that cite the evidence

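Once the graph is built, the analyst gets four analytical surfaces (communities, participant-gap alerts, hybrid search and forensic queries) plus two ways to reach them: the chat console and the graph canvas. All are driven from the same underlying graph, and all answer the kinds of questions an investigation actually needs.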

Communities (Leiden)

The Leiden algorithm clusters the communicates-with graph to surface who actually works with whom — not who is on the org chart. Each community gets a short, locally-generated description so the analyst can scan the landscape fast.

Participant-gap alerts

A deterministic delta over the thread DAG flags candidates: people who appeared in two or more earlier messages but vanished from a later one. A second pass classifies each candidate — oversight, restriction, deliberate exclusion — and writes an anomaly node with the reason recorded.
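The first, deterministic pass can be sketched as follows. For brevity the example treats a thread as a chronological list of participant sets rather than a full DAG, and it flags each person only at the first message they vanish from; both simplifications, and the function name, are assumptions.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple


def gap_candidates(thread: Sequence[Set[str]], min_prior: int = 2) -> List[Tuple[int, str]]:
    """Return (message index, person) pairs for people who appeared in at
    least `min_prior` earlier messages and are absent from a later one."""
    appearances: Counter = Counter()
    flagged: Set[str] = set()
    candidates: List[Tuple[int, str]] = []
    for index, participants in enumerate(thread):
        for person in sorted(appearances):
            if (
                appearances[person] >= min_prior
                and person not in participants
                and person not in flagged
            ):
                candidates.append((index, person))
                flagged.add(person)
        appearances.update(participants)
    return candidates


thread = [
    {"alice", "bob", "gc"},
    {"alice", "bob", "gc"},
    {"alice", "bob"},        # "gc" drops off here
    {"alice", "bob"},
]
candidates = gap_candidates(thread)
```

Each candidate would then go to the second pass, which classifies the reason; that classification step is not shown here.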

Hybrid search with RRF fusion

BM25 lexical search and vector similarity search both return ranked results; Reciprocal Rank Fusion combines them without requiring score normalisation. The output is a single ranked list where every hit cites the underlying message.
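Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the rankings it appears in, so only ranks matter and raw scores never need to be normalised. In this short sketch, the constant k = 60 is the value commonly used in the literature, not a setting confirmed for LocalBrain, and the message identifiers are made up.

```python
from typing import Dict, List, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranked list."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["m1", "m2", "m3"]     # lexical ranking (hypothetical ids)
vector_hits = ["m2", "m4", "m1"]   # semantic ranking (hypothetical ids)
fused = rrf([bm25_hits, vector_hits])
```

Here `m2` ranks first because it sits near the top of both lists, while `m3` and `m4`, each found by only one retriever, fall to the bottom.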

Forensic queries with temporal as-of

Bi-temporal storage means the graph can answer “what did this look like on date X?” even after later edits, retractions or re-classifications. Operators pin named saved queries to a matter and re-run them as the matter evolves.
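An as-of query over a bi-temporal table filters on two axes: when a fact was true (valid time) and when the system recorded it (record time). The sketch below uses SQLite for convenience; the table, its column names and the sample rows are assumptions, and the real analytics store is a different engine.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE reports_to (
        person TEXT, manager TEXT,
        valid_from TEXT NOT NULL, valid_to TEXT,        -- NULL = still true
        recorded_from TEXT NOT NULL, recorded_to TEXT   -- NULL = current record
    )"""
)
conn.executemany(
    "INSERT INTO reports_to VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Originally recorded on 2024-01-10, then corrected on 2024-02-01.
        ("alice", "bob", "2023-01-01", None, "2024-01-10", "2024-02-01"),
        ("alice", "carol", "2023-01-01", None, "2024-02-01", None),
    ],
)


def manager_as_of(person: str, valid_at: str, known_at: str) -> Optional[str]:
    """Who did we believe, on `known_at`, the person reported to on `valid_at`?"""
    row = conn.execute(
        """SELECT manager FROM reports_to
           WHERE person = ?
             AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)
             AND recorded_from <= ? AND (recorded_to IS NULL OR recorded_to > ?)""",
        (person, valid_at, valid_at, known_at, known_at),
    ).fetchone()
    return row[0] if row else None
```

The superseded row is closed rather than deleted, which is what lets a later re-classification coexist with the earlier answer. ISO 8601 date strings compare correctly as text, so no date parsing is needed in the query.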

Chat-first console

A slash-command launcher, voice input, matter switcher and a streaming agent loop sit on top of everything else. The analyst can ask in plain English and the answer comes back with citations to the messages that grounded it.

Force-directed WebGL graph

People, threads, organisations, topics, communities and anomalies render as a navigable graph. Clicking a node shows its neighbourhood, its detail panel and the edges that touch it — with the provenance class colour-coded on the line.

Layer 4 · E-Discovery.

The workflow a litigator actually runs

The e-discovery layer is not a search dashboard with a privilege checkbox. It is a matter-scoped workbench with first-class primitives for the steps an investigation produces — from intake through production — designed against FRCP 26 and built so every step leaves a defensible trail.

Matter → Privilege Workflow → Production

Matter Intake → Custodian + Chain-of-Custody → Document Review → Privilege Workflow

unreviewed → asserted → produced → clawed-back (FRCP 26(b)(5)(B)-style) → released

↓

Bates Numbering: designed monotonic · concurrency-safe
Redaction Overlay: audit-tracked · non-destructive
Privilege Log: aligned to FRCP 26(b)(5)(A)
Production-Set CSV: per-matter export

Every mutating endpoint honours a dry-run preflight. Audit payloads carry SHA-256 hashes — never the protected content.

Matter as a primitive

Every artefact in the system is bound to a named matter. The same email parsed twice under two different custodians produces two defensible records, not a collision. Matter scoping flows through retrieval, the agent loop, and the export surfaces.

Custodian + chain-of-custody

Each matter has its own custodian roster and an append-only chain-of-custody ledger. The ledger captures every ingest, every promotion, every export as an event with a hash and a timestamp.
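An append-only ledger is often made tamper-evident by chaining each event to the hash of the previous one. The sketch below takes that approach as an assumption; only "an event with a hash and a timestamp" is stated above, so the chaining, the class name and the field names are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional

GENESIS = "0" * 64


def _digest(body: Dict[str, str]) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class CustodyLedger:
    """In-memory stand-in for a per-matter chain-of-custody ledger."""

    def __init__(self) -> None:
        self._events: List[Dict[str, str]] = []

    def append(self, kind: str, artefact_sha256: str, custodian: str,
               at: Optional[datetime] = None) -> Dict[str, str]:
        body = {
            "kind": kind,                       # e.g. ingest, promotion, export
            "artefact_sha256": artefact_sha256,
            "custodian": custodian,
            "at": (at or datetime.now(timezone.utc)).isoformat(),
            "prev_hash": self._events[-1]["event_hash"] if self._events else GENESIS,
        }
        event = dict(body, event_hash=_digest(body))
        self._events.append(event)
        return dict(event)  # hand back a copy; there is no update or delete API

    def verify(self) -> bool:
        prev = GENESIS
        for event in self._events:
            body = {k: v for k, v in event.items() if k != "event_hash"}
            if event["prev_hash"] != prev or _digest(body) != event["event_hash"]:
                return False
            prev = event["event_hash"]
        return True
```

With chaining, editing or removing any earlier event changes every later hash, so a single `verify()` pass detects it.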

Five-state privilege workflow

Documents move through unreviewed → asserted → produced → clawed-back → released. The clawback transition mirrors FRCP 26(b)(5)(B) so a post-production assertion of privilege has a defined path back, with an audited reason field.
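The five states can be modelled as a small state machine. This sketch allows only the linear path listed above and requires a reason on the clawback transition; whether the real workflow permits other transitions is not stated, so treat the transition table and names as assumptions.

```python
from enum import Enum


class Privilege(Enum):
    UNREVIEWED = "unreviewed"
    ASSERTED = "asserted"
    PRODUCED = "produced"
    CLAWED_BACK = "clawed-back"
    RELEASED = "released"


# Assumption: each state has exactly one legal successor.
_NEXT = {
    Privilege.UNREVIEWED: Privilege.ASSERTED,
    Privilege.ASSERTED: Privilege.PRODUCED,
    Privilege.PRODUCED: Privilege.CLAWED_BACK,
    Privilege.CLAWED_BACK: Privilege.RELEASED,
}


def transition(current: Privilege, target: Privilege, reason: str = "") -> Privilege:
    """Validate a state change and return the new state."""
    if _NEXT.get(current) is not target:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    if target is Privilege.CLAWED_BACK and not reason.strip():
        raise ValueError("a clawback must record a reason for the audit trail")
    return target
```

A caller would persist the returned state together with the reason and the reviewer, so that every transition is auditable.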

Bates numbering, designed correct

Bates assignment is monotonic and concurrency-safe by construction. Two reviewers stamping at the same time cannot duplicate or skip a number — the assignment runs under an async lock that serialises the sequence even under load.
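A minimal sketch of serialised Bates assignment under an `asyncio.Lock`. The class name, the `LB` prefix, the six-digit width and the in-memory counter are assumptions; a real implementation would persist the counter inside the locked section.

```python
import asyncio
from typing import List


class BatesCounter:
    def __init__(self, prefix: str, start: int = 1, width: int = 6) -> None:
        self._prefix = prefix
        self._next = start
        self._width = width
        self._lock = asyncio.Lock()

    async def assign(self) -> str:
        async with self._lock:
            number = self._next
            # Stand-in for an awaited write of the new counter value;
            # without the lock another task could read the same number here.
            await asyncio.sleep(0)
            self._next = number + 1
        return f"{self._prefix}{number:0{self._width}d}"


async def _demo() -> List[str]:
    counter = BatesCounter("LB")
    # Five reviewers stamping concurrently.
    return await asyncio.gather(*(counter.assign() for _ in range(5)))


stamps = asyncio.run(_demo())
```

Because the read, the awaited write and the increment all happen while the lock is held, concurrent callers receive distinct, gap-free numbers.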

Redaction as audit-tracked overlay

Redactions are never destructive. They are overlays stored against the original, with the reviewer, the reason, the timestamp and the revision history attached — so a successful challenge can always produce the unredacted original.

Privilege log and production-set CSV

A privilege log aligned to FRCP 26(b)(5)(A) and a production-set CSV are first-class exports. Manifests, Bates ranges, custodian and privilege state ship together so the production package is reviewable end-to-end before it leaves the system.

Layer 5 · Defence in Depth.

Security is the architecture, not a feature

Because nothing should leave the perimeter, every layer has to respect the perimeter, not just an outer firewall rule. The same layered thinking applies to failures that start inside: neither an operator mistake nor an internal bug should become a denial of service for the analyst.

No external API for data processing

Parsing, language-model reasoning, embedding, retrieval and audit all run on local hardware. The only sanctioned outbound channels are user-authorised ingestion from IMAP and on-premise file shares, and optional SMTP delivery of scheduled digest reports.

Layered egress controls on agent output

The agent loop runs behind an input-validation layer that defends against prompt injection on untrusted ingest text, and an output-side filter at the egress chokepoint that catches data trying to leave through generated answers.

Hardened outbound connector boundary

Outbound connector calls pass a master kill-switch, a per-account host pin and a resolved-IP filter against SSRF. Credentials live in the OS keyring and are never written to logs or audit payloads.
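The three checks can be sketched as a single gate that runs before any outbound connection. The function name and parameters are hypothetical. Private address ranges are deliberately allowed here on the assumption that on-premise mail servers live on them, while loopback, link-local, multicast and unspecified addresses are refused.

```python
import ipaddress
from typing import Iterable


def connector_allowed(host: str, pinned_host: str,
                      resolved_ips: Iterable[str], outbound_enabled: bool) -> bool:
    """Return True only if every outbound guard passes."""
    # 1. Master kill-switch.
    if not outbound_enabled:
        return False
    # 2. Per-account host pin: the account may only talk to its pinned host.
    if host.lower().rstrip(".") != pinned_host.lower().rstrip("."):
        return False
    # 3. Resolved-IP filter: check the addresses DNS actually returned,
    #    so a hostname cannot be pointed at an internal service.
    addresses = [ipaddress.ip_address(ip) for ip in resolved_ips]
    if not addresses:
        return False
    for addr in addresses:
        if addr.is_loopback or addr.is_link_local or addr.is_multicast or addr.is_unspecified:
            return False
    return True
```

The caller would resolve the host once, pass the results in, and then connect to one of the vetted addresses rather than resolving again, which closes the gap a DNS rebinding attack relies on.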

PII on the data-security surface

A dual-engine PII detection pipeline scans the corpus with span-level provenance labels so the analyst can see exactly which spans triggered, where, and by which detector. GDPR delete flagging is built into the data-security page.

Audit by hash, not by content

Audit payloads carry SHA-256 hashes of protected content — never the content itself. Forensic admissibility lives in the data model rather than in a paper trail bolted on at the end.
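In code, auditing by hash means the protected bytes are digested before the payload is built and are never copied into it. The field names and the function below are illustrative assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict


def audit_payload(action: str, actor: str, document_id: str, protected: bytes) -> Dict[str, str]:
    """Build an audit record that references protected content only by hash."""
    return {
        "action": action,
        "actor": actor,
        "document_id": document_id,
        "content_sha256": hashlib.sha256(protected).hexdigest(),
        "at": datetime.now(timezone.utc).isoformat(),
    }


payload = audit_payload("redact", "reviewer-1", "doc-7", b"privileged text")
serialised = json.dumps(payload)  # safe to log: contains no protected content
```

Anyone holding the original can recompute the digest and confirm that the audit event refers to exactly that content, while the log itself reveals nothing.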

Hard resource guards

A preflight check guards every long-running operation: free-disk floor on the data volume, bounded streaming captures, a free-RAM warning before inference, log-directory rotation thresholds. An internal bug cannot become a denial of service.
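The free-disk floor is the simplest of these guards to illustrate. The sketch below uses `shutil.disk_usage` from the standard library; the function names and the idea of raising before the operation starts are assumptions, and the RAM and log-rotation checks are omitted.

```python
import shutil
from typing import Callable, List, TypeVar

T = TypeVar("T")


def disk_problems(data_dir: str, floor_bytes: int) -> List[str]:
    """Return a list of human-readable problems; empty means the check passed."""
    free = shutil.disk_usage(data_dir).free
    if free < floor_bytes:
        return [f"free disk {free} B is below the floor of {floor_bytes} B on {data_dir}"]
    return []


def run_guarded(operation: Callable[[], T], data_dir: str, floor_bytes: int) -> T:
    """Refuse to start a long-running operation if the preflight fails."""
    problems = disk_problems(data_dir, floor_bytes)
    if problems:
        raise RuntimeError("; ".join(problems))
    return operation()
```

Failing before the work begins is the point: an ingest that would fill the data volume is stopped at the door instead of halfway through.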

Operation Modes.

One stack, two postures

Same architecture, same code path — what changes is the visibility of cloud-connector UI and which channels may ingest. Operators choose the posture at install time.

Local-only mode

Air-gappable. Zero cloud-OAuth surface visible.

All cloud-connector UI is hidden. Only the local-folder ingest and IMAP are visible to operators. Useful for genuinely air-gapped deployments where the cloud surfaces would only be a distraction or a compliance concern.

Default mode

Cloud-connector UI visible, none active by default.

The same code path with the cloud-connector surfaces left visible so operators can see what later integration will connect. Active ingest stays restricted to local folders and IMAP until the customer explicitly authorises an additional provider.

Supported hosts

Windows 11 with Git Bash or WSL2, and macOS Apple Silicon (tested on a Mac mini M4 / 16 GB). One idempotent startup script handles dependency checks and starts the four cooperating services in the only order that works.

Single-host today

The current deployment target is a single workstation or a single operator’s box inside the customer perimeter. There is no cloud control plane, no staging environment, no managed-service component to depend on.

A Day in the Life.

What an analyst actually does with it

The architecture is the means; the analyst’s working hour is the end. Four scenarios that hit the main surfaces of the system.

Open a matter, drop a PST

The analyst creates a matter, names a custodian, drops a PST file into the folder watcher. Within minutes the pipeline parses, threads, deduplicates, language-processes and indexes the file — writing every artefact into the chain-of-custody ledger for that custodian.

Find the silent exclusion

The graph view lights up a red anomaly node on a thread where the General Counsel was always CC’d — until a particular date. The analyst clicks through, sees the reason classification, the confidence score, and the exact message that triggered the alert.

Save a question, re-run it

“Every communication with opposing counsel between two dates” goes in as a named saved query. The analyst pins it to the matter and runs it again every Monday morning. Hybrid retrieval handles the lexical and the semantic side; RRF fuses the rankings; every hit cites its source.

Produce a defensible package

Documents move through the privilege workflow. The reviewer stamps Bates numbers monotonically. Redactions go on as overlays with reviewer, reason and timestamp. The privilege log and the production-set CSV export side by side — ready for review before the package leaves the system.

Eight Principles.

The rules the system holds itself to

LocalBrain enforces eight architectural principles as project-wide invariants. They are not aspirational: a pull request that violates one is rejected in code review.

P1 · Rule-first, LLM-second

Header fields, RFC threading and participant lists are deterministic facts. The language model is used only for semantic enrichment, never for facts that can be derived from structure.

P2 · Store-once, project-many

Raw mail is stored once. All derived layers — vector, graph, Obsidian vault, exports — are projections. No uncontrolled duplication of source data.

P3 · Three-class edge taxonomy

Every relation has exactly one class: deterministic, rule-derived, or LLM-extracted. Classes are never mixed; downstream consumers can always filter by tier.

P4 · Provenance on every edge

Six mandatory fields: source, confidence, evidence, timestamp, model version, review status. No edge enters the graph without all six.

P5 · Coreference before LLM

Pronouns are resolved before any language-model call. “He agreed” is meaningless without knowing who “He” is — the model never has to guess.

P6 · Guided entity-pair extraction

Relations are extracted by asking constrained questions about pre-computed entity pairs, not by asking the model to discover relations free-form. This avoids the NA-imbalance problem, where most candidate pairs hold no relation at all, which wrecks unconstrained extraction.
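Guided extraction can be sketched as two small steps: build one closed-choice question per ordered entity pair, then accept only answers from the fixed label set. The label names come from the LLM-extracted tier described earlier; the prompt wording, the explicit "none" label and the function names are assumptions, and the call to the local model is left out.

```python
from itertools import permutations
from typing import Iterator, Sequence, Tuple

RELATIONS = ("requests-action", "reports-to", "discusses", "none")


def build_questions(sentence: str, entities: Sequence[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (head, tail, prompt) for every ordered pair of known entities."""
    for head, tail in permutations(entities, 2):
        prompt = (
            f'Text: "{sentence}"\n'
            f"Which relation holds from {head} to {tail}? "
            f"Answer with exactly one of: {', '.join(RELATIONS)}."
        )
        yield head, tail, prompt


def parse_answer(raw: str) -> str:
    """Accept only labels from the closed set; anything else counts as none."""
    label = raw.strip().lower()
    return label if label in RELATIONS else "none"


questions = list(build_questions("Alice asked Bob to send the report.", ["Alice", "Bob"]))
```

Because the model only ever chooses from a short list for a pair that is already known to co-occur, an off-list answer is discarded rather than becoming a new, unvetted relation type.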

P7 · Obsidian as projection

The per-matter Obsidian vault is a display layer for analyst notes, not a primary database. Truth lives in the graph and analytics stores; the vault stays in lockstep.

P8 · No cloud egress

All models, all databases, and every processing step run locally. Zero external API calls for data processing — an invariant, not a configuration flag.

Stack at a glance.

What runs the engagement

A short table of generic capability categories — specific vendor choices are deliberate and can be substituted without changing the architecture.

Layer	Capability category	Why this layer
Storage (graph)	Embedded graph + vector store	Neighbourhood traversal and semantic similarity in one engine, with named query support.
Storage (analytics)	Columnar analytics store with BM25 full-text index	Canonical fact store and bi-temporal audit; full-text search co-located with the analytics columns.
Inference	Local language-model runtime + local embedding model	All reasoning and embedding stays on the host; no cloud LLM is called for data processing.
NLP	Local coreference resolver + NER pipeline	Pronouns and entities are resolved before any LLM call so the model never sees ambiguous text.
Clustering	Leiden algorithm over the communicates-with graph	Surfaces communities that reflect actual communication patterns, not org-chart structure.
Retrieval	Hybrid BM25 + vector search fused via Reciprocal Rank Fusion	Lexical precision and semantic recall combined without requiring score normalisation.
API	FastAPI REST surface + Model Context Protocol server	One surface for internal UI, one surface for compatible AI clients; same tool catalogue underneath.
Frontend	React + TypeScript + force-directed WebGL canvas	Multi-page SPA with a graph as a first-class navigation surface, not a static visualisation.
Hosts	Windows 11 (Git Bash / WSL2) and macOS Apple Silicon	One idempotent startup script covers both. The same code path runs in both environments.

Fit & Limitations.

Best fit and known limitations

Best for

Regulated organisations with internal investigations, legal hold, litigation response or compliance audits — where the data cannot leave the perimeter and every relation needs to be defensible back to a source. Particularly strong fit for teams whose discovery workflow already references FRCP 26 vocabulary.

Not the right fit

Teams that want a turnkey cloud SaaS, a fully managed e-discovery service, or a generic search box over mail. LocalBrain is a custom on-premises engagement; it expects an operator who values control over convenience.

Known limitations

Answer quality is bounded by the local model family chosen for each task. The current scope is single-host on one workstation per matter; horizontal scale-out is on the roadmap but not in scope today. Cloud-source connectors exist in code but are deferred — current sanctioned ingest is local folders and IMAP.

Discuss a similar engagement

If you have a regulated investigation, an internal review, or a litigation workflow that needs to stay inside your perimeter — with a defensible record at every step — we can build the system around your matter, not around our SaaS.
