The local AI architecture behind note.md

note.md is a fully local-first academic writing app. Every step described in this series runs on your Mac — no document, embedding, or claim leaves the device unless you explicitly export it.

Overview

One index. Four pipelines.

Under the hood, note.md is four pipelines stacked on a single local index. They each do one thing well, and they all read from — and write to — the same hybrid SQLite database that lives inside your project.

100% on-device: extraction, embeddings, inference
Four pipelines: indexing, search, matrix, scan
One shared index: semantic-index.sqlite
Bytes sent: no telemetry of your data

The Stack

The four pipelines.

Each pipeline is documented end-to-end in its own article. Together they cover the entire local-AI surface of note.md.

01 · Indexing
Source indexing →
A freshly imported PDF is extracted with MinerU, chunked under its headings, embedded with Nomic Embed Text v1.5, and stored alongside an FTS5 full-text index.
02 · Retrieval
Hybrid semantic search →
Vector similarity and BM25 keyword search run in parallel, then merge — with weighted scoring for the human-facing Search view, and Reciprocal Rank Fusion for the LLM-facing Matrix and Evidence Scan paths.
03 · Extraction
Matrix extraction →
A row-by-row LLM pipeline fills research matrices with structured JSON — verbatim quotes, page numbers, and confidence scores — using a strict JSON schema enforced by the local llama-cli binary.
04 · Verification
Evidence Scan →
For a given claim in your writing, Evidence Scan retrieves passages from across your sources and classifies each one as supports, contradicts, nuanced, or irrelevant. You can then insert any of them as a typed citation.
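
To make the structured-output idea concrete, here is a minimal Python sketch of what a schema for one Evidence Scan result might look like, plus a re-check on the app side. The field names (`verdict`, `quote`, `page`, `confidence`), the 0 to 1 confidence scale, and the `parse_result` helper are assumptions for illustration; only the four verdict labels and the idea of verbatim quotes, page numbers, and confidence scores come from the descriptions above.

```python
import json

# The four classes named in the Evidence Scan description.
VERDICTS = ("supports", "contradicts", "nuanced", "irrelevant")

# Hypothetical JSON Schema for one result; field names and ranges are assumptions.
EVIDENCE_SCHEMA = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": list(VERDICTS)},
        "quote": {"type": "string"},
        "page": {"type": "integer", "minimum": 1},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["verdict", "quote", "page", "confidence"],
    "additionalProperties": False,
}


def parse_result(raw: str, source_text: str) -> dict:
    """Parse model output and re-check it, including that the quote is verbatim."""
    obj = json.loads(raw)
    if not isinstance(obj, dict) or set(obj) != set(EVIDENCE_SCHEMA["required"]):
        raise ValueError("unexpected or missing fields")
    if obj["verdict"] not in VERDICTS:
        raise ValueError("unknown verdict")
    page = obj["page"]
    if not isinstance(page, int) or isinstance(page, bool) or page < 1:
        raise ValueError("page must be a positive integer")
    conf = obj["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("confidence must be between 0 and 1")
    # A schema cannot prove a quote is verbatim, so check it against the passage.
    if not isinstance(obj["quote"], str) or obj["quote"] not in source_text:
        raise ValueError("quote is not verbatim")
    return obj
```

Schema enforcement at generation time constrains the shape of the output, but not whether a quote actually appears in the source, which is why this sketch adds a substring check afterwards.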

Architecture

Same index, many views.

A single hybrid SQLite index (semantic-index.sqlite, one per project) powers full-text search, semantic search, Matrix extraction, and Evidence Scan. Indexing once gives you all four. Vectors are not duplicated across features, and nothing has to be rebuilt when you open a new tool.

Inside that database, every chunk lives in two places at once: as a 768-dimensional vector in sqlite-vec (or as a binary blob with a brute-force fallback) and as a row in an FTS5 virtual table that scores keyword matches with BM25. The two representations are kept consistent by writing them in the same atomic transaction during indexing.
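
The dual write described above can be sketched with Python's standard `sqlite3` module. This is an illustration under stated assumptions, not note.md's actual schema: the table names `chunk_vectors` and `chunk_fts`, the float32 blob layout, and every function name are hypothetical. It shows only the blob plus brute-force path (not sqlite-vec) and needs an SQLite build with FTS5 enabled.

```python
import math
import sqlite3
import struct

DIM = 768  # embedding width stated in the article


def pack(vec):
    """Serialize a vector as little-endian float32 (assumed blob layout)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def open_index(path=":memory:"):
    db = sqlite3.connect(path)
    # Hypothetical table names; the real schema is not documented here.
    db.execute("CREATE TABLE IF NOT EXISTS chunk_vectors "
               "(chunk_id INTEGER PRIMARY KEY, embedding BLOB NOT NULL)")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunk_fts USING fts5(text)")
    return db


def index_chunk(db, chunk_id, text, vec):
    """Write both representations in one transaction: both land or neither does."""
    if len(vec) != DIM:
        raise ValueError(f"expected {DIM} dimensions, got {len(vec)}")
    with db:  # commits on success, rolls back if anything inside raises
        db.execute("INSERT INTO chunk_fts(rowid, text) VALUES (?, ?)", (chunk_id, text))
        db.execute("INSERT INTO chunk_vectors(chunk_id, embedding) VALUES (?, ?)",
                   (chunk_id, pack(vec)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def brute_force_search(db, query_vec, k=5):
    """Fallback path: scan every stored blob and rank by cosine similarity."""
    rows = db.execute("SELECT chunk_id, embedding FROM chunk_vectors").fetchall()
    scored = sorted(((cosine(query_vec, unpack(blob)), cid) for cid, blob in rows),
                    reverse=True)
    return [(cid, score) for score, cid in scored[:k]]


def keyword_search(db, query, k=5):
    """FTS5 keyword path; bm25() returns lower values for better matches."""
    rows = db.execute("SELECT rowid FROM chunk_fts WHERE chunk_fts MATCH ? "
                      "ORDER BY bm25(chunk_fts) LIMIT ?", (query, k)).fetchall()
    return [r[0] for r in rows]
```

Because both inserts sit inside the same `with db:` block, a failure while writing the vector also discards the FTS5 row, so a keyword hit never points at a chunk that has no embedding.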

Design

Different shapes per task.

The same chunks are useful in different ways depending on who is consuming them. Search-view retrieval is tuned for human eyeballs — weighted scoring, exact-phrase boosts, one chunk per source, and a soft similarity threshold so the long tail does not pad the list. Matrix and Evidence Scan retrieval is tuned for LLM consumption — Reciprocal Rank Fusion, scoping that keeps the model focused on the right paper, and a document-order fallback so the AI never sees an empty context.
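
The two retrieval shapes can be contrasted in a short Python sketch. Everything specific here is an assumption for illustration: the RRF constant `k=60` (a commonly used default), the weights, the boost, the cutoff value, and the function names are not taken from note.md, and a plain cutoff stands in for the soft threshold mentioned above.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_of_d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


def llm_context(vector_hits, keyword_hits, document_order, limit=8):
    """LLM-facing shape: fuse by rank, fall back to document order if empty."""
    fused = rrf([vector_hits, keyword_hits])
    return (fused or list(document_order))[:limit]


def search_view(hits, w_vec=0.7, w_kw=0.3, phrase_boost=0.1, threshold=0.35):
    """Human-facing shape: weighted score, phrase boost, cutoff, one chunk per source.

    Each hit is a dict with keys: source, chunk_id, vec_score, kw_score
    (both assumed normalized to 0..1), and exact_phrase (bool).
    """
    best = {}
    for h in hits:
        score = w_vec * h["vec_score"] + w_kw * h["kw_score"]
        if h["exact_phrase"]:
            score += phrase_boost
        if score < threshold:
            continue  # drop the long tail instead of padding the list
        if h["source"] not in best or score > best[h["source"]][0]:
            best[h["source"]] = (score, h["chunk_id"])
    ranked = sorted(best.items(), key=lambda item: -item[1][0])
    return [(source, chunk_id) for source, (_, chunk_id) in ranked]
```

RRF uses only rank positions, so it does not require vector and BM25 scores to share a scale. The weighted path does need normalized scores, which is one plausible reason to reserve it for the list a person reads.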

The composite retrieval path

PDF imported
   ▼
Source indexing  →  chunks  +  embeddings  +  FTS5
   ▼
semantic-index.sqlite   (per project, hybrid)
   │
   ├── Semantic Search view      →  human-readable results
   │      weighted scoring, one-per-source, threshold
   │
   └── Matrix / Evidence Scan    →  LLM context
          RRF, scoped or cross-source, document-order fallback
                   ▼
          Gemma 4 via llama-cli  (JSON-schema enforced, on-device)
                   ▼
          Matrix cells / Evidence verdicts
                   ▼
          Graph (KnowledgeConnection)

The takeaway: retrieval shape follows consumer. Humans get ranked, deduplicated lists; LLMs get tightly-scoped context with fallbacks. The underlying chunks are the same.

Promise

Privacy, briefly.

Everything described above runs locally. The bundled llama-cli and llama-embedding binaries are stock llama.cpp builds shipped inside the app. The model weights you download — Gemma 4 variants for inference, Nomic Embed Text v1.5 for embeddings — are stored in your user data directory.

Series

Keep reading.

Each pipeline has its own article with the implementation details — chunk sizes, ranking math, model parameters, failure modes. Start with indexing (everything else depends on it) or jump to the feature you care about.

More from the series.

Source indexing: turning PDFs into a local knowledge index

Hybrid semantic search: meaning + keywords, fused

Matrix extraction: filling research tables with local AI
