Paste your notes, a doc, an email thread — LOLM answers only from your material, cites the exact passage it used, and tells you plainly when the answer isn't in there, instead of inventing one. A 70B model writes; a local uncertainty-control layer keeps it honest and hands you an auditable receipt for every answer. The AI that knows when it doesn't know.
```shell
npm install lolm-nfet-client
```

```typescript
// Stream a live run: every token, decision and receipt.
import { runAgent, friendly } from 'lolm-nfet-client';

const run = await runAgent({
  command: 'why is the sky blue?',
  onEvent: ev => {
    const line = friendly(ev);
    if (line) console.log(line);
  },
  onProof: p => console.log(p.verdict), // honest receipt, every run
});
```
Each token flows through five parallel streams that converge via learned, per-dimension fusion — the surface stays fluent while the latent core tracks what the sequence is actually doing.
Pre-norm Transformer with rotary position embeddings. The fluent voice.
Selective state-space model (Mamba-style) with parallel scan. Tracks the order beneath the words.
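For intuition, the recurrence behind a selective state-space layer can be sketched for a single channel. This is a minimal TypeScript illustration, not LOLM's implementation: the names (`makeSteps`, `sequentialScan`, `parallelStyleScan`) are hypothetical, and the discretization only loosely follows Mamba (decay exp(ΔA) with A fixed to -1, input gain ΔB, step size Δ taken from a softplus of the input). What it shows is why a parallel scan is possible: each step is an affine map of the state, and composing affine maps is associative.

```typescript
// Hypothetical sketch: one channel of a selective linear recurrence
//   h_t = a_t * h_{t-1} + u_t      (output readout y_t = c_t * h_t omitted)
// where the decay a_t and the input term u_t depend on the input at step t.
type Step = { a: number; u: number };

// "Selective": step size, decay and input gain are functions of the current input.
function makeSteps(xs: number[], w: number, b: number): Step[] {
  return xs.map(x => {
    const delta = Math.log1p(Math.exp(w * x)); // softplus, so delta > 0
    return { a: Math.exp(-delta), u: delta * b * x };
  });
}

// Reference left-to-right scan starting from h = 0.
function sequentialScan(steps: Step[]): number[] {
  const out: number[] = [];
  let h = 0;
  for (const { a, u } of steps) {
    h = a * h + u;
    out.push(h);
  }
  return out;
}

// Applying `first` then `second` equals a single step with
// decay a1 * a2 and offset a2 * u1 + u2. This operator is associative.
function combine(first: Step, second: Step): Step {
  return { a: first.a * second.a, u: second.a * first.u + second.u };
}

// Inclusive prefix scan in the Hillis-Steele pattern: every position in a
// round is independent, so each round could run in parallel.
function parallelStyleScan(steps: Step[]): number[] {
  let acc = steps.slice();
  for (let offset = 1; offset < acc.length; offset *= 2) {
    const next = acc.slice();
    for (let i = offset; i < acc.length; i++) {
      next[i] = combine(acc[i - offset], acc[i]);
    }
    acc = next;
  }
  // With h = 0 before the first step, the composed offset u equals h_t.
  return acc.map(s => s.u);
}
```

Both scans produce the same states (up to floating-point rounding); the parallel form trades a few extra multiplications for logarithmic depth.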
Gumbel-Softmax over 64 codes with causal-conv neighbor interaction. Gradient-isolated so no code collapses.
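A Gumbel-Softmax draw over discrete codes can be sketched as follows, assuming plain logits per code and a temperature `tau`. The function names are hypothetical, and the causal-conv neighbor interaction and gradient isolation described above are not modeled here. In a training framework the soft sample carries gradients, while a straight-through variant uses the hard `argmax` code in the forward pass.

```typescript
// Hypothetical sketch: a Gumbel-Softmax sample over K discrete codes.
function softmax(z: number[]): number[] {
  const m = Math.max(...z); // subtract the max for numerical stability
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

function gumbelSoftmax(
  logits: number[],
  tau: number,
  uniform: () => number = Math.random, // injectable for reproducible tests
): number[] {
  const noisy = logits.map(l => {
    // Clamp away from 0 and 1 so the double log stays finite.
    const u = Math.min(Math.max(uniform(), 1e-12), 1 - 1e-12);
    const g = -Math.log(-Math.log(u)); // standard Gumbel noise
    return (l + g) / tau; // lower tau pushes the sample toward one-hot
  });
  return softmax(noisy);
}

// Hard code index, as used by a straight-through estimator's forward pass.
function argmax(p: number[]): number {
  let best = 0;
  for (let i = 1; i < p.length; i++) if (p[i] > p[best]) best = i;
  return best;
}
```

For 64 codes, `gumbelSoftmax(logits, 0.5)` returns a length-64 probability vector that sums to 1; as `tau` shrinks, it concentrates on a single code.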
Three banks — episodic, semantic, self — with gated, chunked read/write that keeps gradients flowing.
Per-dimension sigmoid deciding, feature by feature, whether the surface or the latent stream speaks.
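The per-dimension gate can be sketched as a sigmoid over a linear function of both streams, producing one gate value per feature. The weight layout and the `gateBalance` summary are assumptions for illustration, not LOLM's code.

```typescript
// Hypothetical sketch of per-dimension gated fusion:
//   g   = sigmoid(W · [surface; latent] + b)   one gate value per feature
//   out = g * surface + (1 - g) * latent       elementwise
type Vec = number[];

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

function gatedFusion(
  surface: Vec,
  latent: Vec,
  W: number[][], // d rows, 2d columns: surface weights first, then latent
  bias: Vec,
): { out: Vec; gate: Vec; gateBalance: number } {
  const concat = [...surface, ...latent];
  // Each output feature gets its own gate, computed from both streams.
  const gate = W.map((row, i) =>
    sigmoid(row.reduce((acc, w, j) => acc + w * concat[j], bias[i])),
  );
  // Gate near 1: the surface stream speaks for that feature; near 0: the latent stream does.
  const out = gate.map((g, i) => g * surface[i] + (1 - g) * latent[i]);
  // Mean gate value: one possible "gate balance" summary (an assumption).
  const gateBalance = gate.reduce((a, b) => a + b, 0) / gate.length;
  return { out, gate, gateBalance };
}
```

With all-zero weights and bias every gate is 0.5, so the output is the plain average of the two streams; a large positive bias hands every feature to the surface stream.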
Why this matters for you, not just the math. The latent core carries a minority of the representation — yet remove it and the fluent surface collapses. That latent signal is exactly what the controller reads to tell when the model is on solid ground and when it isn't. It's the difference between an AI that sounds sure and one that can measure whether it should be.
Nothing on this page is an adjective. The control decisions, the receipts, the math checks and the confidence map are all things you can reproduce in the live demo or read in the code. The architecture is validated independently on NVIDIA H200 and Google TPU v4 against parameter-matched baselines — the full benchmark numbers and ablations live in the code and the proof pack, kept off this page on purpose.
Every token, the model measures four observables of its own trajectory — logit entropy, hidden drift, gate balance, regime entropy. A learned controller maps them to five actions that drive a real agent loop.
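One plausible way to compute the four observables for a single token is sketched below. Only the names of the observables come from this page; the formulas (Shannon entropy of the softmax, drift as one minus cosine similarity between consecutive hidden states, gate balance as the mean fusion gate, regime entropy as the entropy of the code distribution) and all function names are assumptions.

```typescript
// Hypothetical sketch of four per-token observables; LOLM's exact definitions may differ.
function softmax(z: number[]): number[] {
  const m = Math.max(...z); // subtract the max for numerical stability
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Shannon entropy in nats; zero-probability entries contribute nothing.
function entropy(p: number[]): number {
  return -p.reduce((acc, pi) => (pi > 0 ? acc + pi * Math.log(pi) : acc), 0);
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 1 : dot / denom; // treat an all-zero state as "no change"
}

interface Observables {
  logitEntropy: number;  // how spread out the next-token distribution is
  hiddenDrift: number;   // 1 - cosine similarity to the previous hidden state
  gateBalance: number;   // mean fusion gate (surface vs latent share)
  regimeEntropy: number; // entropy of the discrete regime-code distribution
}

function measure(
  logits: number[],
  prevHidden: number[],
  hidden: number[],
  gate: number[],
  codeProbs: number[],
): Observables {
  return {
    logitEntropy: entropy(softmax(logits)),
    hiddenDrift: 1 - cosine(prevHidden, hidden),
    gateBalance: gate.reduce((a, b) => a + b, 0) / gate.length,
    regimeEntropy: entropy(codeProbs),
  };
}
```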
- Trajectory healthy: keep generating.
- Sustained uncertainty: go get evidence from memory or the web.
- Representation jumped while uncertain: check the draft.
- Stuck in a rut: fork alternatives and keep the healthiest.
- Confident and stable: wrap up and answer.
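The five cases above can be read as a small rule table over a window of recent tokens. This heuristic is a hypothetical sketch: the action names, thresholds, window length, and the reading of "stuck in a rut" as sustained uncertainty with almost no drift are all assumptions, not LOLM's calibrated policy.

```typescript
// Hypothetical rule table mapping recent observables to five actions.
type Action = "continue" | "retrieve" | "verify" | "branch" | "answer";

interface Obs {
  entropy: number; // logit entropy for one token
  drift: number;   // hidden-state drift for one token
}

interface Thresholds {
  highEntropy: number; // above this, a token counts as uncertain
  lowEntropy: number;  // below this, a token counts as confident
  jump: number;        // drift above this counts as a representation jump
  stillDrift: number;  // mean drift below this counts as "not moving"
  window: number;      // how many recent tokens to look at
}

function decide(history: Obs[], t: Thresholds): Action {
  const recent = history.slice(-t.window);
  if (recent.length < t.window) return "continue"; // not enough evidence yet
  const last = recent[recent.length - 1];
  const mean = (f: (o: Obs) => number) =>
    recent.reduce((acc, o) => acc + f(o), 0) / recent.length;
  const allUncertain = recent.every(o => o.entropy > t.highEntropy);

  // Representation jumped while uncertain: check the draft.
  if (last.entropy > t.highEntropy && last.drift > t.jump) return "verify";
  // Uncertain and not moving: stuck in a rut, so fork alternatives.
  if (allUncertain && mean(o => o.drift) < t.stillDrift) return "branch";
  // Sustained uncertainty: go get evidence.
  if (allUncertain) return "retrieve";
  // Confident and stable: wrap up.
  if (recent.every(o => o.entropy < t.lowEntropy) && mean(o => o.drift) < t.stillDrift) {
    return "answer";
  }
  return "continue"; // trajectory healthy
}
```

Rule order matters: a jump is checked before sustained uncertainty, so a draft that lurches while the model is unsure gets verified rather than padded with more retrieval.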
A calibrated heuristic control policy works from the first run with no training; a learned control head then trains on the workspace's own logged traffic and takes over only when it is confident, and otherwise the heuristic decides. Every run emits an honest receipt of what the controller actually did; it never claims the answer beat a baseline. A frontier mode lets a large reasoner generate while LOLM's latent machinery monitors and controls it, and the whole workspace speaks the open Model Context Protocol, so it plugs into any MCP-compatible agent stack.
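The hand-off between the learned head and the heuristic can be sketched as a confidence check: the learned head's top action wins only if its probability clears a bar, and the result records which policy decided so a receipt can report it. `chooseAction`, the action names and `minConfidence` are hypothetical.

```typescript
// Hypothetical sketch: the learned head overrides the heuristic only when confident.
type Action = "continue" | "retrieve" | "verify" | "branch" | "answer";
const ACTIONS: Action[] = ["continue", "retrieve", "verify", "branch", "answer"];

interface Choice {
  action: Action;
  source: "learned" | "heuristic"; // recorded for the run's receipt
  confidence: number | null;       // learned-head probability, if it decided
}

function chooseAction(
  learned: Record<Action, number> | null, // null: no trained head yet
  heuristic: Action,
  minConfidence: number,
): Choice {
  if (learned !== null) {
    let best: Action = ACTIONS[0];
    for (const a of ACTIONS) if (learned[a] > learned[best]) best = a;
    if (learned[best] >= minConfidence) {
      return { action: best, source: "learned", confidence: learned[best] };
    }
  }
  // Untrained or unsure: fall back to the calibrated heuristic.
  return { action: heuristic, source: "heuristic", confidence: null };
}
```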
The same measured uncertainty that drives the controller also gates real action. LOLM acts autonomously only when its calibrated probability of being correct clears a risk-tiered bar: generous for read-only actions, with near-zero tolerance for error on anything that touches money. Every action's outcome is verified before the receipt may say it happened.
LOLM's capability levels run from receipt-monitored answers up to a bounded, persistent agent that maintains goals, memory, verification, scheduling and tools. The system reports its real current level, never a higher one. It's live at /api/demo/agent/level.
It runs read and reversible tools on its own — each gated by measured uncertainty and its outcome independently checked. Money, sending, deletion and deploys are hard-gated to a human, no matter how confident it is. That ceiling is in the math, not a policy doc.
It acts where a mistake is recoverable and escalates where it isn't. A missing uncertainty signal is a reason to ask, never a license to act — and it stops itself at a budget, a safety limit, or when nothing more is worth doing.
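The gating rules in the last few paragraphs can be sketched as a pure function. The tier names and numeric bars are hypothetical, since this page does not publish LOLM's thresholds; what the sketch does take from the text is that irreversible categories always go to a human and that a missing signal means ask, never act.

```typescript
// Hypothetical sketch of a risk-tiered action gate.
type Tier = "read" | "reversible" | "irreversible"; // irreversible: money, sending, deletion, deploys
type Decision = "act" | "ask_human";

// Illustrative bars only: minimum calibrated P(correct) needed to act alone.
const BARS: Record<Exclude<Tier, "irreversible">, number> = {
  read: 0.6,
  reversible: 0.9,
};

function gateAction(tier: Tier, pCorrect: number | undefined): Decision {
  // Hard ceiling: irreversible actions always go to a human, however confident.
  if (tier === "irreversible") return "ask_human";
  // A missing or invalid uncertainty signal is a reason to ask, never a license to act.
  if (pCorrect === undefined || !Number.isFinite(pCorrect)) return "ask_human";
  return pCorrect >= BARS[tier] ? "act" : "ask_human";
}
```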
The fair question a reviewer asks: is the intelligence the big model, the telemetry, or the control? Here is what changes — and what does not — at each layer. We name what we have not measured rather than imply it.
| Variant | What it adds | Changes the answer? | What's proven |
|---|---|---|---|
| 70B only | the frontier voice | baseline | the capability is the frontier model's, not ours |
| 70B + passive telemetry | per-token uncertainty + confidence spans | No — observer only | the telemetry is real, measured per token |
| 70B + NFET control | retrieve / verify / branch / audit decisions | Yes — the run's path (segment-level), not the 70B's per-token sampling | control fires & is consumed; gated retrieval proven to lift answer correctness 0 → 7/8 on facts the writer can't otherwise know, with no regression on knowns; broader reasoning-quality lift still open |
| 4B only | a small local model | baseline | weak / inconsistent on hard reasoning (stated plainly) |
| 4B + NFET | inline graft (rewrites token logits) + control | Yes — per token and path | the graft measurably changes local generation; gated retrieval proven to lift correctness (baseline 0 → up to 88% on unknowable facts, no regression on knowns); broader quality lift still open |
These results are grounded in the real test battery (artifacts/lolm-real-tests): the controller was invoked on 100% of NFET runs, control actions fired in a minority of them, and the one clear win was retrieval grounding. None of this proves the answer beat a baseline; it proves the controller acted on measured uncertainty, and the raw trace is sealed and inspectable.
Real runs: a 70B model (Llama 3.3) writes the answer; LOLM's local graft re-reads it per token for uncertainty and, at each segment boundary, decides whether to check notes, verify, or stop. Control acts between segments — it does not steer the 70B's tokens — and on most easy runs it stays out of the way. Every decision badge you see was made from measured telemetry — entropy, drift, gate, regime — not from a prompt. Tap a recorded run, or type your own below.
These are mechanism claims, not benchmark claims — every row is something you can verify yourself in the demo above or in the code.
| | Plain chatbot | Prompted agent (ReAct-style) | LOLM-NFET agent |
|---|---|---|---|
| How it decides to check facts or verify | It doesn't | Asks itself in words — a self-report, which models are famously bad at | Measures entropy, drift, gate and regime in its own activations, every token |
| Can you see which words it was unsure of? | No | No — it can only re-state the whole answer | Yes — it highlights the exact spans it measured as least confident, from its own per-token signals |
| Can you see why it acted? | — | Reasoning text (post-hoc; can confabulate) | Numbers — the exact telemetry and z-scores behind every decision |
| An honest receipt of what it actually did | None | None | Every run — what it retrieved, verified, or stopped on, red/yellow/green, with a math check. It does not claim the answer beat a baseline. |
| Learns from your usage | No | No (or cloud fine-tuning) | Yes, locally — the controller retrains on your own logged runs |
| Runs on | Someone else's cloud | A frontier-model API | 2 vCPUs — no GPU, no API key, fully private (this very demo) |
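The span highlighting and z-scores in the table can be illustrated with a short sketch: z-score each token's entropy against the run's own mean and standard deviation, then merge consecutive tokens above a threshold into spans ranked by their peak. Normalizing against the run itself, the default threshold, and the names are assumptions; LOLM may use a calibrated baseline instead.

```typescript
// Hypothetical sketch: find the least-confident spans from per-token entropy.
interface Span {
  start: number; // first token index
  end: number;   // one past the last token index
  peakZ: number; // highest z-score inside the span
}

function zScores(xs: number[]): number[] {
  if (xs.length === 0) return [];
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length;
  const std = Math.sqrt(variance);
  return xs.map(x => (std === 0 ? 0 : (x - mean) / std));
}

function leastConfidentSpans(tokenEntropy: number[], zThreshold = 1): Span[] {
  const z = zScores(tokenEntropy);
  const spans: Span[] = [];
  let current: Span | null = null;
  for (let i = 0; i < z.length; i++) {
    if (z[i] > zThreshold) {
      if (current === null) {
        current = { start: i, end: i + 1, peakZ: z[i] };
      } else {
        current.end = i + 1; // extend the open span
        current.peakZ = Math.max(current.peakZ, z[i]);
      }
    } else if (current !== null) {
      spans.push(current); // close the span at the first confident token
      current = null;
    }
  }
  if (current !== null) spans.push(current);
  return spans.sort((a, b) => b.peakZ - a.peakZ); // most uncertain first
}
```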
The honest caveat: the demo's 0.6B research model will not out-write GPT-class systems — the claim is the control mechanism, and it is model-agnostic. The same graft rides on any open backbone (0.6B → 32B tested targets), and a hybrid mode lets the latent machinery monitor a frontier model while it does the writing.
Everything below ships in the repo today — labels are honest about maturity.
Point the importer at a Markdown folder or Obsidian vault. When the agent's
uncertainty spikes, it retrieves from your facts — ranked by relevance and
importance, never sent off your machine, with a receipt showing exactly what it used.
make import-notes NOTES=~/vault && make agent-ui
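A minimal sketch of relevance-plus-importance ranking, assuming each imported note carries an `importance` score in [0, 1]. Relevance here is plain word-set (Jaccard) overlap, a stand-in for whatever retriever the importer actually uses; `rankNotes` and its fields are hypothetical. Everything runs in-process, and the returned scores are the kind of detail a receipt can show.

```typescript
// Hypothetical sketch: rank local notes by a blend of relevance and importance.
interface Note {
  id: string;
  text: string;
  importance: number; // assumed to be in [0, 1]
}

interface Ranked {
  note: Note;
  relevance: number;
  score: number;
}

function words(s: string): Set<string> {
  return new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Jaccard overlap of word sets: a lexical stand-in for embedding similarity.
function relevance(query: string, text: string): number {
  const q = words(query);
  const t = words(text);
  let shared = 0;
  q.forEach(w => {
    if (t.has(w)) shared++;
  });
  const union = q.size + t.size - shared;
  return union === 0 ? 0 : shared / union;
}

// alpha weights relevance against stored importance; top k are returned with their scores.
function rankNotes(query: string, notes: Note[], k: number, alpha = 0.7): Ranked[] {
  return notes
    .map(note => {
      const rel = relevance(query, note.text);
      return { note, relevance: rel, score: alpha * rel + (1 - alpha) * note.importance };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```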
The graft rides any Hugging Face backbone (0.6B and 4B shipped; 32B targeted) and
streams per-token entropy, drift, gate balance, regime entropy and control logits over
a documented SSE protocol.
npm install lolm-nfet-client
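The SSE wire format itself is standard: events are separated by a blank line, and the `data:` lines within one event are joined with newlines. The sketch parses a complete SSE body; the `telemetry` event name and payload fields are hypothetical, so check the documented protocol (or simply use `lolm-nfet-client`) for the real schema.

```typescript
// Minimal parser for a complete Server-Sent Events body.
interface SseEvent {
  event: string; // defaults to "message" when no event: field is given
  data: string;
}

function parseSse(body: string): SseEvent[] {
  const events: SseEvent[] = [];
  // Normalize line endings, then split into blank-line-separated blocks.
  for (const block of body.replace(/\r\n?/g, "\n").split("\n\n")) {
    let event = "message";
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith(":")) continue; // comment / keep-alive line
      const idx = line.indexOf(":");
      const field = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? "" : line.slice(idx + 1);
      if (value.startsWith(" ")) value = value.slice(1); // one optional leading space
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}

// Hypothetical payload shape for a per-token telemetry event.
interface TokenTelemetry {
  token: string;
  entropy: number;
  drift: number;
  gate: number;
  regime: number;
}
```

Usage, under the same assumptions: `parseSse(body).filter(e => e.event === "telemetry").map(e => JSON.parse(e.data) as TokenTelemetry)`.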
The whole workspace speaks the Model Context Protocol — plug it into Claude Code or Claude Desktop and a frontier agent gains local memory, the NFET control loop, and telemetry tools it can call like any other tool.
Hybrid mode: a frontier model generates while the local latent machinery re-reads every token for control telemetry — measured uncertainty driving segment-level control (retrieve / verify / stop), not steering the big model's prose. The bridge ships in the repo; it activates with an Anthropic API key.
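Segment-level control in hybrid mode can be sketched as a loop that never touches the writer's tokens: after each segment it reads the measured uncertainty and either accepts the text, fetches evidence and lets the writer retry, or stops. The dependency interface, the mean-entropy rule and the retry-after-retrieval policy are assumptions; a real bridge would be asynchronous.

```typescript
// Hypothetical sketch of segment-level control around a frontier writer.
type SegmentAction = "continue" | "retrieve" | "stop";

interface Segment {
  text: string;           // prose written by the frontier model
  tokenEntropy: number[]; // per-token uncertainty measured by the local graft
}

interface HybridDeps {
  generateSegment: (context: string) => Segment; // hypothetical writer call
  retrieve: (query: string) => string;          // hypothetical local-notes lookup
  entropyBar: number;                           // mean entropy above which we retrieve
  maxSegments: number;                          // hard budget
}

function runHybrid(prompt: string, deps: HybridDeps): { answer: string; log: SegmentAction[] } {
  let context = prompt;
  let answer = "";
  const log: SegmentAction[] = [];
  for (let i = 0; i < deps.maxSegments; i++) {
    const seg = deps.generateSegment(context);
    if (seg.text === "") {
      log.push("stop"); // the writer has nothing more to add
      break;
    }
    const n = Math.max(seg.tokenEntropy.length, 1);
    const meanEntropy = seg.tokenEntropy.reduce((a, b) => a + b, 0) / n;
    if (meanEntropy > deps.entropyBar) {
      // Control acts between segments: drop the uncertain segment, add evidence,
      // and let the writer try again. Its tokens are never edited.
      log.push("retrieve");
      context += "\n[evidence] " + deps.retrieve(seg.text);
      continue;
    }
    log.push("continue");
    answer += seg.text;
    context += seg.text;
  }
  return { answer, log };
}
```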
```bibtex
@article{leonard2026lolm,
  title  = {LOLM: Language Modeling Beyond the Surface with Hybrid
            Transformer-SSM Latent Order Fields},
  author = {Leonard, Bryan and Leonard, Brandyn},
  year   = {2026},
  note   = {Qira LLC. Provisional patent application No. 64002166.}
}
```
Code is private during patent review. Access is granted on request under the LOLM Community License, which is free for research, education, and small entities; request access or commercial licensing via imagineqira.com.