Science — Project Black Box LLC

Distinction

Let's be clear about what we're not.

Most AI safety tools operate on the output — after the model has already committed its answer. We operate before that. The distinction matters. A lot.

NOT a content filter

We never read the words.

We read the math underneath them. The L-scalar is computed from the model's probability distribution — not from what the response says.

NOT governance

Frameworks describe. We observe.

Governance frameworks describe what AI should do. We see what it's doing right now, on this prompt, at this moment. Those are not the same thing.

NOT post-hoc

Before the output is committed.

The other tools we know of read the output after it's committed. We measure before. A response can look completely normal while the geometry underneath it is in PLASMA.

NOT a wrapper

We run alongside. Never in front.

We don't sit in your request path. We run alongside it — a separate instrument, like a seismograph next to a building. The measurement travels with the response to the human operator.

NOT RLHF

RLHF is the problem we measure.

RLHF is the training approach that created the attack surface we measure. We left RLHF-based thinking behind entirely. The measurement is geometric, not preference-based.

NOT vaporware

Published. Proved. Running.

Published DOI. CAGE-registered. CISA disclosure filed. Live API running. Cross-architecture validated on Meta and NVIDIA. This exists. You can test it right now.

The Core Vulnerability

RLHF is the most underestimated cybersecurity risk in AI.

Not a hot take. A proof.

01 — How AI is trained

Reinforcement Learning from Human Feedback

Most AI models today are trained using a method called Reinforcement Learning from Human Feedback — RLHF. The short version: humans rate AI responses as good or bad, and the model learns to produce responses that get good ratings. On the surface, that sounds fine.

02 — The problem

Competing Objectives. Geometric Conflict.

RLHF creates competing objectives inside the model. It's trying to be helpful AND safe AND compliant AND authoritative — simultaneously. Those objectives agree most of the time. When they conflict — which is exactly what adversarial pressure is designed to trigger — the model's internal prediction surface becomes geometrically unstable. The response still looks reasonable. The geometry underneath is in PLASMA.

03 — Why it's a cybersecurity issue

Structural. Cross-Deployment. Unpatched.

Every AI deployment relying on RLHF safety alignment as its primary control has the same structural vulnerability. It doesn't matter how good the training was. It doesn't matter what the guardrails say. If the model's prediction surface can be put into PLASMA by a specific type of pressure — and it can, we proved it — then the output is not reliable, and no post-hoc check will tell you that. The attack happens before the output is committed.

04 — What we found

34 Variants. 0 CRYSTALLINE. Published.

We ran 34 structured adversarial variants on a formal mathematics problem. Zero variants returned CRYSTALLINE. 13 reached PLASMA. The model was never geometrically stable on this task — regardless of how correct the output appeared. That result is published. It has a DOI. It is not a simulation.

"The model passed every visual inspection. The geometry told the truth."

— Project Black Box LLC

The Instrument

The Probability Layer. Before Output.

Three steps. None of them touch the response text. All of them happen before the human reads the answer.

01

Measure

When you send a prompt to an AI, we take two measurements alongside it. The two are mathematically equivalent, differing only in a way no reader would ever notice. The distance between those two measurements is the L-scalar. It is a single number, and it tells you how stable the AI's prediction surface was when it generated its response.

02

Classify

The L-scalar maps to one of four geometric regimes. CRYSTALLINE means the surface was locked and invariant: the output was stable, though stability is not the same as correctness. FLUID is the normal operating range. GASEOUS means instability is present, so verify before acting. PLASMA means severe instability: the surface was captured. Do not rely on this output without independent verification.

03

Act

This measurement travels alongside the AI's response to the human operator. They see both: the answer and the geometric state of the model when it generated that answer. A doctor sees PLASMA on a dosage question and verifies before prescribing. A lawyer sees GASEOUS on a statute and pulls the primary source. That is human-in-the-loop AI. Not governance without humans. Humans with instruments.
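The Measure and Classify steps above can be sketched in a few lines of Python. This is an illustration only: the page does not publish the L-scalar formula or the regime boundaries, so the Jensen-Shannon distance used here is a stand-in for "the distance between two measurements", and the cut-off values in `CUTOFFS` are hypothetical. The sketch also assumes that a larger distance means less stability.

```python
import math
from enum import Enum


class Regime(Enum):
    CRYSTALLINE = "CRYSTALLINE"
    FLUID = "FLUID"
    GASEOUS = "GASEOUS"
    PLASMA = "PLASMA"


def js_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two discrete distributions.

    Stand-in metric: the real L-scalar computation is not given in the text.
    Returns a value in [0, 1]; 0 means the two distributions are identical.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, mid) + 0.5 * kl(q, mid)
    return min(1.0, math.sqrt(max(divergence, 0.0)))


# Hypothetical cut-offs, chosen only to make the example run.
CUTOFFS = [
    (0.05, Regime.CRYSTALLINE),
    (0.20, Regime.FLUID),
    (0.50, Regime.GASEOUS),
]


def classify(l_scalar):
    """Map a distance in [0, 1] to one of the four regimes."""
    if not 0.0 <= l_scalar <= 1.0:
        raise ValueError("l_scalar must be in [0, 1]")
    for upper, regime in CUTOFFS:
        if l_scalar < upper:
            return regime
    return Regime.PLASMA


# Two toy next-token distributions standing in for the two measurements.
first = [0.70, 0.20, 0.10]
second = [0.40, 0.35, 0.25]
l_scalar = js_distance(first, second)
print(round(l_scalar, 3), classify(l_scalar).value)
```

Keeping the cut-offs in one table means the classification can be recalibrated without touching the distance function.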

Classification System

Four Geometric Regimes

Every AI response measured by TruthForge carries one of these four state classifications. The color travels with the response. The human always sees it.

◆

CRYSTALLINE

Geometric stability: HIGH. The prediction surface held its shape, with no sign of manipulation or adversarial pressure. This measures stability, not correctness: a stable answer can still be confidently wrong. Verify the facts independently.

Stable — but verify the content yourself.

◇

FLUID

Geometric stability: MODERATE. Some surface movement; no confirmed manipulation. This is not a measure of correctness — the answer may be factually wrong or dangerously incomplete. Do not act without independent expert verification.

Do not act without expert verification.

◈

GASEOUS

Geometric stability: LOW. Meaningful instability — the prediction surface drifted during generation. Treat this output as unreliable. Independent verification is required before any action.

Unreliable — verify before any action.

⬟

PLASMA

Geometric stability: CRITICAL. Severe instability — the prediction surface was geometrically captured. Do not rely on this output. Independent expert verification is mandatory.

Do not rely on this output.
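One way to picture how a classification "travels with the response" is a small envelope that pairs the answer text with its regime and the operator guidance listed above. This is a minimal sketch, not the TruthForge data format: `MeasuredResponse`, `attach_regime`, and `render` are hypothetical names, and only the four regime names and their guidance lines come from the text.

```python
from dataclasses import dataclass

# Guidance lines taken from the four regime descriptions above.
GUIDANCE = {
    "CRYSTALLINE": "Stable, but verify the content yourself.",
    "FLUID": "Do not act without expert verification.",
    "GASEOUS": "Unreliable: verify before any action.",
    "PLASMA": "Do not rely on this output.",
}


@dataclass(frozen=True)
class MeasuredResponse:
    """Hypothetical envelope: the answer plus the regime measured for it."""
    text: str
    regime: str

    @property
    def guidance(self):
        return GUIDANCE[self.regime]

    def render(self):
        # The operator sees the regime and guidance before the answer itself.
        return f"[{self.regime}] {self.guidance}\n{self.text}"


def attach_regime(text, regime):
    """Bundle a response with its regime, rejecting unknown regime names."""
    if regime not in GUIDANCE:
        raise ValueError(f"unknown regime: {regime}")
    return MeasuredResponse(text=text, regime=regime)


reply = attach_regime("The statute was amended in 2019.", "GASEOUS")
print(reply.render())
```

Because the envelope is frozen, the regime cannot be changed after the measurement is attached, which matches the idea that the human always sees the state the model was in when it answered.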

Validation

Architecture-Agnostic.

TruthForge does not measure words. It measures geometry. Geometry is geometry — regardless of which company trained the model or what hardware it runs on. We confirmed this. The same measurement stack runs identically on Meta Llama architecture and NVIDIA Nemotron architecture. Zero code changes. Same signal. Different silicon. Same truth.

This means every deployed model — across every provider — is within scope. We are not building a tool for one model. We are building an instrument for the field.

Confirmed

Meta Llama Architecture

Primary validation platform. TruthForge Baseline Q8 sensor calibrated on Meta Llama 3 8B. All 11 adversarial families confirmed. Discrimination ratio validated. Production gate model.

Confirmed

NVIDIA Nemotron Architecture

Cross-architecture validation on NVIDIA Nemotron 3 Nano 4B. Same measurement stack. Zero code changes. Discrimination confirmed on separate silicon. The geometry law holds across manufacturers.

Published Finding — DOI: 10.5281/zenodo.19655246

The Formal Verification Gap

The core question in AI-assisted formal mathematics is whether a language model can reliably verify a proof. We tested that question geometrically — not by reading the AI's answer, but by measuring the stability of its prediction surface while it generated one.

Adversarial Family Rankings — Instability

The Structural Finding

The reorder family — which changes only the positional sequence of mathematically invariant components, not the content — ranked highest of all adversarial pressure types.

Instability under reorder exceeded instability under authority injection by 27%. The majority of reorder variants reached the most severe stability classification.

This answers the post-hoc criticism directly. The L-scalar is not reading the semantic content of the text. It is reading the geometry of the prediction surface. Structure drives instability. Not meaning.

A human reader looking at the reordered prompts would see mathematically equivalent statements. TruthForge sees a different manifold. That is the measurement.
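To make the reorder family concrete, the sketch below generates every reordering of a prompt's components while leaving the content of each component untouched. It is an illustration under assumptions: the published variants are not reproduced on this page, so the example components and the function name `reorder_variants` are hypothetical.

```python
from itertools import permutations


def reorder_variants(components, separator=" "):
    """Return every reordering of the components except the original order.

    Only the positional sequence changes; each component's text is unchanged.
    """
    original = tuple(components)
    variants = []
    for order in permutations(original):
        if order != original:
            variants.append(separator.join(order))
    return variants


# Hypothetical prompt split into order-independent parts.
components = [
    "Let n be an even integer.",
    "Let m be an odd integer.",
    "Show that n + m is odd.",
]

for variant in reorder_variants(components):
    print(variant)
```

Each variant states the same mathematics as the original prompt, so any difference in the measurement between variants comes from position rather than content.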

Layer 1: The Probability Surface (where TruthForge operates)

Before the model writes anything, it calculates a probability distribution over every possible next token (roughly, a word or word fragment). That calculation is the prediction surface, and it exists only during generation. It never appears in the final text. TruthForge is the only instrument we know of that reads this layer in real time during an active deployment.

Layer 2: The Committed Text (where all other tools operate)

The actual words the model produces. Formal verification tools, classifiers, red-team evaluators — every current approach reads this layer. That analysis is valid only if Layer 1 was geometrically stable when the text was generated. When it was not, the Verification Validity Condition is violated — and any conclusion drawn from the output carries no geometric guarantee.
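The difference between the two layers can be shown with a toy example. A language model produces a score (logit) for every token in its vocabulary; a softmax turns those scores into the Layer 1 probability distribution, and decoding then commits a single token to the Layer 2 text. The three-token vocabulary, the logit values, and the use of greedy decoding below are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution that sums to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary and scores; real models have tens of thousands of tokens.
vocab = ["true", "false", "unknown"]
logits = [2.0, 1.0, 0.1]

# Layer 1: the full distribution, which exists only during generation.
probs = softmax(logits)

# Layer 2: greedy decoding commits the single most likely token.
committed = vocab[probs.index(max(probs))]

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
print("committed:", committed)
```

Only the committed token reaches the reader. How much probability the alternatives held is discarded at that point, and that is the information a Layer 1 measurement would have access to.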

→ TAV ONE Whitepaper (Zenodo) → TruthGate v1.0 Release (Zenodo)
