Jiddu

PolitiFact × Jiddu benchmark

Run date: 2026-05-31
Pipeline: Jiddu fact-check verifier, Perplexity Sonar Reasoning Pro via OpenRouter, English prompt, force: true (cache bypassed)
Source: LIAR2 dataset (Apache-2.0, ~23k human-labeled PolitiFact claims, 2008-2023)
Sample: 200 claims, stratified 50 per bucket across PolitiFact's four polar labels (true, mostly_true, false, pants_on_fire), filtered to 2020 onward so Perplexity Sonar's web search can still reach contemporary primary sources
Cost: ~$5-6 of OpenRouter credit (200 Sonar Reasoning Pro calls)

TL;DR

On 200 claims drawn from PolitiFact's polar buckets, Jiddu's verdict matched the PolitiFact-human verdict in 67.5% of cases. Strict polar disagreement — Jiddu calling a claim the opposite polarity from the human — happened in only 4.5% of cases (9/200). The remaining 28% of cases were Jiddu returning mixed or unverified instead of a polar verdict, a pattern strongly concentrated on PolitiFact's mostly_true bucket (where 58% of human-rated mostly_true claims received mixed from Jiddu — arguably a coherent mapping, since "mostly true" and "claim is partly correct, partly not" overlap by definition).

Restricting to PolitiFact's three least-ambiguous buckets (true, false, pants_on_fire), Jiddu agrees with the human verdict on 122 / 150 = 81.3% of cases. The middle-ground mostly_true bucket is where Jiddu's higher-resolution mixed output replaces a polar verdict in over half the cases.

This is the first quality measurement for Jiddu against a third-party human gold standard. It will be re-run when the verifier prompt, model choice or pipeline changes materially.

Setup

What was measured

For each claim we sampled, we called the same verifyClaim() function used in production at /api/factcheck/[id]/verify. Inputs:

Output captured per claim:

The harness ran with concurrency 6 (matching production). Total wall-clock: 6m 2s.
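A concurrency cap like the one above can be sketched as a small worker-lane pool. This is an illustration only: mapWithConcurrency is a hypothetical helper, not the harness's actual code, and the worker is passed in as a parameter because the signature of verifyClaim() is not shown here.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++; // no race: nothing else runs between the check and the increment
      results[i] = await worker(items[i] as T, i);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

Calling it with a limit of 6 would mirror the setting reported for this run. If any worker rejects, the returned promise rejects too, so a real harness would probably catch per-claim errors inside the worker and record them instead.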

How agreement was scored

PolitiFact uses a 6-level Truth-O-Meter (pants_on_fire / false / mostly_false / half_true / mostly_true / true); Jiddu uses 4 (supported / contradicted / mixed / unverified). We benchmarked only on the four polar buckets and applied this mapping:

| PolitiFact label | Expected Jiddu verdict |
|---|---|
| true | supported |
| mostly_true | supported |
| false | contradicted |
| pants_on_fire | contradicted |

A claim is agreed if Jiddu returned the expected polar verdict. It is disagreed if Jiddu returned the opposite polar verdict — the worst-case outcome. The remaining cases — mixed and unverified — are tallied separately, since they're closer to "we are not making a polar claim here" than to either agreement or disagreement.
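The scoring rule can be written as a few lines of code. The sketch below is a minimal restatement of the mapping and the agreed / disagreed / mixed / unverified split described here; the type names and the scoreClaim function are hypothetical and chosen for illustration.

```typescript
type PolitiFactLabel = "true" | "mostly_true" | "false" | "pants_on_fire";
type JidduVerdict = "supported" | "contradicted" | "mixed" | "unverified";
type Outcome = "agreed" | "disagreed" | "mixed" | "unverified";

// Expected polar verdict for each sampled PolitiFact label.
const EXPECTED: Record<PolitiFactLabel, "supported" | "contradicted"> = {
  true: "supported",
  mostly_true: "supported",
  false: "contradicted",
  pants_on_fire: "contradicted",
};

// Hypothetical scorer: non-polar verdicts are tallied under their own name,
// polar verdicts are compared against the expected one.
function scoreClaim(label: PolitiFactLabel, verdict: JidduVerdict): Outcome {
  if (verdict === "mixed" || verdict === "unverified") return verdict;
  return verdict === EXPECTED[label] ? "agreed" : "disagreed";
}

console.log(scoreClaim("mostly_true", "supported")); // agreed
console.log(scoreClaim("false", "supported")); // disagreed
console.log(scoreClaim("mostly_true", "mixed")); // mixed
```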

Middle-ground PolitiFact labels (mostly_false, half_true) were excluded from sampling. They are the noise zone documented in Sahitaj et al. 2025 — adding them would dilute the signal we're after.

Results

Headline numbers

| Outcome | Count | Share |
|---|---|---|
| Agreed (polar match) | 135 | 67.5% |
| Disagreed (polar opposite) | 9 | 4.5% |
| Returned mixed | 45 | 22.5% |
| Returned unverified | 11 | 5.5% |
| Errored | 0 | 0% |
| Total | 200 | 100% |

Confusion matrix

Rows are PolitiFact labels, columns are Jiddu verdicts. Cells are claim counts (each row sums to 50).

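| PolitiFact label | supported | contradicted | mixed | unverified |
|---|---|---|---|---|
| true | 32 | 2 | 11 | 5 |
| mostly_true | 13 | 5 | 29 | 3 |
| false | 2 | 43 | 4 | 1 |
| pants_on_fire | 0 | 47 | 1 | 2 |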

The expected-verdict cells (true and mostly_true under supported, false and pants_on_fire under contradicted) sum to 135 across all four rows, or 122 when mostly_true is excluded. The single largest off-diagonal block is mostly_true → mixed with 29 cases (see "The mixed pattern" below).
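The headline counts can be re-derived from the confusion matrix, which is a useful cross-check when the numbers are quoted in several places. A minimal sketch, with the cell values copied from the matrix; the tally function and type names are hypothetical.

```typescript
type Label = "true" | "mostly_true" | "false" | "pants_on_fire";
type Verdict = "supported" | "contradicted" | "mixed" | "unverified";
type Polar = "supported" | "contradicted";

// Cell counts copied from the confusion matrix (rows: PolitiFact, columns: Jiddu).
const matrix: Record<Label, Record<Verdict, number>> = {
  true: { supported: 32, contradicted: 2, mixed: 11, unverified: 5 },
  mostly_true: { supported: 13, contradicted: 5, mixed: 29, unverified: 3 },
  false: { supported: 2, contradicted: 43, mixed: 4, unverified: 1 },
  pants_on_fire: { supported: 0, contradicted: 47, mixed: 1, unverified: 2 },
};

const expected: Record<Label, Polar> = {
  true: "supported",
  mostly_true: "supported",
  false: "contradicted",
  pants_on_fire: "contradicted",
};

interface Tally { agreed: number; disagreed: number; mixed: number; unverified: number; total: number }

// Sum the outcome counts over the chosen PolitiFact rows.
function tally(labels: readonly Label[]): Tally {
  const t: Tally = { agreed: 0, disagreed: 0, mixed: 0, unverified: 0, total: 0 };
  for (const label of labels) {
    const row = matrix[label];
    const want = expected[label];
    const opposite: Polar = want === "supported" ? "contradicted" : "supported";
    t.agreed += row[want];
    t.disagreed += row[opposite];
    t.mixed += row.mixed;
    t.unverified += row.unverified;
  }
  t.total = t.agreed + t.disagreed + t.mixed + t.unverified;
  return t;
}

const all = tally(["true", "mostly_true", "false", "pants_on_fire"]);
const unambiguous = tally(["true", "false", "pants_on_fire"]);
console.log(all.agreed, all.disagreed, all.mixed, all.unverified, all.total); // 135 9 45 11 200
console.log(unambiguous.agreed, unambiguous.total); // 122 150
```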

Per claim type

The harness assigned each claim a heuristic type. Some types have very small samples, so their rates are noisy:

| Type | Sample | Agreed | Rate |
|---|---|---|---|
| date | 14 | 13 | 92.9% |
| numeric | 10 | 7 | 70.0% |
| quote | 44 | 29 | 65.9% |
| categorical | 130 | 85 | 65.4% |
| causal | 2 | 1 | 50.0% |

date-typed claims (specific events with verifiable timestamps) are where Sonar shines. categorical, the catch-all for unstructured assertions, sits just below the overall 67.5% rate. causal had too few samples (2) to read into.

The mixed pattern

The single most striking signal is the mostly_true → mixed overlap: 29 of 50 PolitiFact-mostly_true claims received mixed from Jiddu.

This is not the same failure mode as a confidence-gradation 5-class scheme degrading at the boundary (per the Sahitaj 2025 finding). It's a structural overlap between two semantically adjacent categories:

Both describe the same epistemic state. The difference is rhetorical: PolitiFact starts at "true" and walks down, Jiddu sits in the middle and reaches both ways. If mostly_true → mixed is counted as a sensible mapping rather than a miss, the effective sensible-output rate is (135 + 29) / 200 = 82.0%.

Concretely, from the run:

These are claims where every reader benefits from seeing the qualification, not a verdict in either direction.

The 9 strict disagreements

The 9 cases where Jiddu's polar verdict was the opposite of PolitiFact's polar verdict are the most important set in this run. They split into four causes:

(a) Time-sensitivity — Sonar evaluated current evidence against a historical claim (4 cases)

When a claim was made years ago about a topic where the situation has since reversed, Sonar tends to find evidence reflecting today's reality, not the claim's original moment.

Takeaway: the pipeline doesn't know when a claim was made and applies current web search results. A future enhancement would be to thread the claim's date into the Sonar prompt — "as of <date>, was this true?". Worth flagging in the rationale even if we don't fully solve it.
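One way the "as of <date>" idea could look in code is sketched below. Everything here is an assumption for illustration: the DatedClaim shape, the buildAsOfQuestion name and the prompt wording are hypothetical and are not taken from the actual prompt file.

```typescript
// Hypothetical claim shape: the date is optional because not every claim has one.
interface DatedClaim {
  text: string;
  madeOn?: Date;
}

// Hypothetical prompt builder: anchors the question to the claim's date when known,
// and falls back to an undated question otherwise.
function buildAsOfQuestion(claim: DatedClaim): string {
  const quoted = JSON.stringify(claim.text); // quotes and escapes the claim text
  if (!claim.madeOn || Number.isNaN(claim.madeOn.getTime())) {
    return `Is the following claim true? Claim: ${quoted}`;
  }
  const asOf = claim.madeOn.toISOString().slice(0, 10); // YYYY-MM-DD, in UTC
  return (
    `As of ${asOf}, was the following claim true? ` +
    `Judge it against the evidence available at that date, and say in the rationale ` +
    `if the situation has changed since. Claim: ${quoted}`
  );
}

console.log(buildAsOfQuestion({ text: "Example claim", madeOn: new Date("2020-03-15T00:00:00Z") }));
```

The fallback branch matters because a missing or unparseable date should degrade to today's behavior rather than fail the verification call.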

(b) Literal-quote-true vs. meta-claim-false (2 cases)

PolitiFact often rates a claim false when the literal words were spoken but the implication is misleading. Sonar verifies the quote was said, then defends it.

Takeaway: Jiddu doesn't distinguish "the claim was made" from "the claim is true." A short prompt addition asking "is the implication of this claim true?" instead of "is the claim true?" would catch some of these.

(c) Sonar nuance is correct, PolitiFact was lenient in context (2 cases)

These are cases where Sonar's careful reading is technically more defensible than PolitiFact's rating — though the rating may have been correct in the original article's narrower framing.

Takeaway: these are not failures. They are points where the automated pipeline reads more carefully than the human did, and it is worth not over-correcting for them.

(d) Model output / rationale mismatch (1 case)

Takeaway: rationale-verdict mismatches happen at low rates. A future enhancement: an automated sanity-check pass that flags claims where the rationale and the verdict are mutually inconsistent.
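A sanity-check pass of this kind could start as a cheap keyword heuristic before anything heavier is tried. The sketch below is deliberately naive and entirely hypothetical: the cue lists and the flagMismatch name are invented for illustration, it ignores most negation ("does not support" would count as a support cue), and a production version would more likely use a second model call.

```typescript
type Verdict = "supported" | "contradicted" | "mixed" | "unverified";
type Polar = "supported" | "contradicted";

// Hypothetical cue phrases that suggest which way a rationale is arguing.
const CUES: Record<Polar, RegExp[]> = {
  supported: [
    /\b(is|are|was|were) (accurate|correct|true)\b/i,
    /\bconfirm(s|ed)?\b/i,
    /\bsupport(s|ed)?\b/i,
  ],
  contradicted: [
    /\b(is|are|was|were) (inaccurate|incorrect|false)\b/i,
    /\bcontradict(s|ed)?\b/i,
    /\bdebunk(s|ed)?\b/i,
    /\bno evidence\b/i,
  ],
};

// Number of cue patterns that match the text at least once.
function countHits(text: string, patterns: RegExp[]): number {
  return patterns.filter((p) => p.test(text)).length;
}

// Flag a polar verdict whose rationale contains only cues for the opposite verdict.
// Non-polar verdicts (mixed, unverified) are never flagged by this heuristic.
function flagMismatch(verdict: Verdict, rationale: string): boolean {
  if (verdict !== "supported" && verdict !== "contradicted") return false;
  const opposite: Polar = verdict === "supported" ? "contradicted" : "supported";
  const own = countHits(rationale, CUES[verdict]);
  const other = countHits(rationale, CUES[opposite]);
  return other > 0 && own === 0;
}

console.log(flagMismatch("supported", "The figure is incorrect and sources contradict it.")); // true
console.log(flagMismatch("supported", "Official records confirm the date.")); // false
```

Flagged claims would go to manual review rather than being re-labeled automatically, since a heuristic this crude will produce false positives.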

Limitations

  1. Sample size. 200 claims gives a margin of roughly ±6.5 percentage points on the headline number at 95% confidence (normal approximation). A 500-claim run would tighten this to about ±4 points, at ~$15 cost.
  2. PolitiFact is US-political. The benchmark says nothing about Jiddu's accuracy on Brazilian politics, science claims, sports claims, or any non-US-political domain. We selected this dataset because it's the only structured large-scale fact-check corpus with a permissive license, not because it represents Jiddu's input distribution. A Lupa / Aos Fatos benchmark in PT-BR would require scraping their published HTML — separate work.
  3. PolitiFact has its own biases. The methodology has been criticized for inconsistent application across the political spectrum, particularly by right-leaning sources. We are measuring alignment with one human team's editorial judgement, not with "truth."
  4. No claim context. Jiddu was given the bare claim text. PolitiFact's human fact-checkers had the original article, the speaker's full statement, and the historical context. This handicaps Jiddu compared to the human reference — which is also the realistic production condition.
  5. Self-reported quality. This benchmark was run by the developer of the pipeline being measured. Reproducibility (below) is the mitigation.
  6. Test-time leakage risk. Some PolitiFact claims in LIAR2 may be indexed by Perplexity's web search, so Sonar could in principle find the original PolitiFact article and parrot the verdict. We did not filter for this; doing so would require manually inspecting source URLs. The confusion matrix doesn't show the pathological accuracy that would suggest this is happening (32 to 47 of 50 on the true, false and pants_on_fire buckets), so the bias is probably small.
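The sampling margin in item 1 can be estimated with the normal (Wald) approximation for a binomial proportion. A minimal sketch, assuming a 95% confidence level (z = 1.96), which is an assumption rather than something the run specifies; marginOfError is a hypothetical helper name.

```typescript
// Half-width of the Wald confidence interval for a proportion p observed on n samples.
// z defaults to 1.96, the two-sided 95% normal quantile (an assumed confidence level).
function marginOfError(p: number, n: number, z = 1.96): number {
  if (!(n > 0) || p < 0 || p > 1) {
    throw new RangeError("need n > 0 and 0 <= p <= 1");
  }
  return z * Math.sqrt((p * (1 - p)) / n);
}

const headline = 135 / 200; // agreement rate from the headline table
console.log(marginOfError(headline, 200).toFixed(3)); // 0.065
console.log(marginOfError(headline, 500).toFixed(3)); // 0.041
```

The second call assumes a 500-claim run would land on the same 67.5% rate; the margin shrinks with the square root of the sample size, so more than doubling the sample cuts it by roughly a third.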

Reproduction

git clone git@github.com:rafaehlers/jiddu.git
cd jiddu
npm install
cp .env.example .env   # add your OPENROUTER_API_KEY
npx prisma migrate dev
npx tsx scripts/benchmark-politifact.mts            # full 200-claim run, ~6 min, ~$6
npx tsx scripts/benchmark-politifact.mts --sample=20  # smoke run, ~1 min, ~$0.60

The seed is fixed (SHUFFLE_SEED = 0x6a696464) so the same 200 claims are sampled across reruns. Per-claim results land in scripts/data/bench-<timestamp>.json. Re-run after any prompt change in src/lib/verify-claim-prompt.ts to detect regressions.
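Fixed-seed stratified sampling of this kind can be sketched as follows. The seed value is the one quoted above; everything else is an assumption for illustration: the harness's actual random generator is not described here, so the sketch swaps in mulberry32 (a small public-domain PRNG), and stratifiedSample and its parameters are hypothetical names.

```typescript
const SHUFFLE_SEED = 0x6a696464; // value quoted in the text

// mulberry32: a small deterministic 32-bit PRNG returning floats in [0, 1).
// Assumption: used here only as a stand-in for whatever generator the harness uses.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the supplied generator; returns a new array.
function seededShuffle<T>(items: readonly T[], rng: () => number): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Take up to `perBucket` items from each bucket, deterministically for a given seed.
function stratifiedSample<T>(
  items: readonly T[],
  bucketOf: (item: T) => string,
  perBucket: number,
  seed: number = SHUFFLE_SEED,
): T[] {
  const rng = mulberry32(seed);
  const buckets = new Map<string, T[]>();
  for (const item of items) {
    const key = bucketOf(item);
    const bucket = buckets.get(key) ?? [];
    bucket.push(item);
    buckets.set(key, bucket);
  }
  const sample: T[] = [];
  // Visit buckets in sorted key order so the result does not depend on input order of labels.
  for (const key of [...buckets.keys()].sort()) {
    sample.push(...seededShuffle(buckets.get(key) ?? [], rng).slice(0, perBucket));
  }
  return sample;
}
```

With a per-bucket size of 50 and four labels this yields a 200-claim sample, and the same seed and input order reproduce the same sample on every run.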

See also