Valoh
Note · v1.2
¶ Note


Methodology v1.2 release notes: prompt-set review cadence and a new sentiment rubric.

We've moved prompt-set review from quarterly to biannual, expanded the sentiment rubric to better capture cautionary framing, and published a new validation κ for recommendation strength. This note documents what changed, why, and how prior windows compare.

This is the second material methodology revision since Valoh started measurement, and it's small on purpose. Methodology stability is itself part of independence: clients should be able to compare their Q1 measurement to their Q3 measurement without footnoting drift. The changes below are scoped to make that comparison cleaner, not noisier.

What changed

  • Prompt-set review cadence. Quarterly → biannual. Quarterly review was producing too much measurable drift inside what should have been comparable windows. Biannual gives prompt sets time to age into stable measurement instruments.¹
  • Sentiment rubric. Three tiers (positive / neutral / cautionary) → five tiers, with new explicit categories for "qualified positive" and "explicit warning." The three-tier rubric was over-collapsing the middle.
  • Validation κ scores. First quarterly publication of κ scores, by category, per signal. Sentiment and recommendation strength only, since those are the double-coded signals.
  • Backward comparability. Yes — both rubric changes are mappable to the prior taxonomies. Prior windows are footnoted in the report rather than restated.

What didn't change

Models tracked: still four — ChatGPT, Gemini, Claude, Perplexity — queried independently via API at default settings, reported separately. Daily refresh, 90-day raw transcript retention, public methodology paper. Prompt sets remain category-anchored, locked for the measurement window, never editable by clients.

The point of versioning a methodology in public is to make the measurement auditable. The point of revising it slowly is to make the measurement comparable.

Why biannual review

Quarterly prompt review made sense in year one when the prompt sets were new and the category was new. But in practice, three months wasn't enough time to know whether observed signal shifts were category drift or model drift or just noise. Biannual review gives more signal per revision and more comparable windows between revisions.

We considered annual review, the cadence Nielsen uses for its TV-audience-measurement methodology, and decided against it for now: the AI category itself is moving fast enough that yearly is too slow. Six months is the compromise that fits both the category's pace and our preference for stable instruments.²

Why a five-tier sentiment rubric

The three-tier rubric (positive / neutral / cautionary) had a structural problem: most observed sentiment fell into "neutral," even when the surrounding language varied substantially. A model saying "Brand X is a popular choice" is not the same as "Brand X is fine." Both got coded neutral. That's information loss.

The new rubric splits the middle into "qualified positive" (favorable framing with a caveat) and "neutral listing" (named in a list with no qualitative description). The two ends — explicit recommendation, explicit caution — are unchanged. We've remapped the prior 12 months of sentiment data to the new rubric so trend lines remain comparable, with a footnote at the change boundary.
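To make the comparability claim concrete, here is a minimal sketch of collapsing five-tier labels back onto the prior three-tier taxonomy. The tier identifiers are hypothetical, assembled from the names used in this note (the three prior tiers plus "qualified positive" and "explicit warning"). Where each new tier lands in the old taxonomy is an assumption for illustration, not the published mapping.

```python
# Hypothetical tier identifiers; the published rubric may name them differently.
# The collapse targets for the two new tiers are assumptions, not Valoh's mapping.
FIVE_TO_THREE = {
    "positive": "positive",
    "qualified_positive": "neutral",    # assumed: previously absorbed by "neutral"
    "neutral": "neutral",
    "cautionary": "cautionary",
    "explicit_warning": "cautionary",   # assumed: previously absorbed by "cautionary"
}


def collapse_to_three_tier(labels):
    """Map a list of five-tier labels onto the prior three-tier taxonomy."""
    unknown = sorted(set(labels) - set(FIVE_TO_THREE))
    if unknown:
        raise ValueError(f"unrecognised tier(s): {unknown}")
    return [FIVE_TO_THREE[label] for label in labels]


if __name__ == "__main__":
    print(collapse_to_three_tier(["qualified_positive", "explicit_warning", "neutral"]))
```

Collapsing in this direction is many-to-one, so it needs no judgment calls. Going the other way (old label to new tier) is one-to-many and cannot be done from the label alone, which is why the sketch only covers the collapse.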

Validation κ — what we're publishing

Inter-rater agreement scores (Cohen's κ) for the two double-coded signals, by category, per quarter. For Q1 2026 across all measured categories: sentiment κ = 0.81, recommendation strength κ = 0.76. On the conventional Landis and Koch bands, 0.76 sits in "substantial agreement" (0.61 to 0.80) and 0.81 just crosses into "almost perfect" (0.81 to 1.00). Both are higher under the new rubric than they were under the old one, which is part of why we made the change.
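For readers who want to see how a figure like this is derived, here is a minimal sketch of Cohen's κ for two coders who labelled the same set of responses. It follows the standard definition, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name and the input shape (two parallel lists of labels) are assumptions for illustration, not a description of Valoh's pipeline.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items (parallel lists)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)

    # Observed agreement: share of items where both coders chose the same label.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: for each label, the product of the two coders'
    # marginal frequencies, summed over labels.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both coders used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    a = ["pos", "pos", "neu", "neu", "cau", "cau", "pos", "neu", "cau", "pos"]
    b = ["pos", "pos", "neu", "cau", "cau", "cau", "pos", "neu", "neu", "pos"]
    print(round(cohens_kappa(a, b), 3))
```

In the usage example the coders agree on 8 of 10 items (p_o = 0.80) while chance agreement is 0.34, so κ comes out near 0.70. Raw agreement overstates reliability because it ignores chance; κ corrects for that.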

We'll publish κ as part of every quarterly benchmark from now on. If it falls below 0.70 for any category in any quarter, we'll publish that, too, and explain why. Hiding low agreement is the kind of soft compromise measurement firms drift into. Easier to just publish it.
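The disclosure rule above is simple enough to express as a check. This is a sketch under stated assumptions: the 0.70 floor comes from this note, while the nested-dictionary layout, the function name, and the category names and κ values in the example are made up for illustration.

```python
KAPPA_FLOOR = 0.70  # threshold stated in this note


def categories_below_floor(kappa_by_category, floor=KAPPA_FLOOR):
    """Return sorted (category, signal, kappa) rows whose kappa is below the floor."""
    flagged = []
    for category, signals in kappa_by_category.items():
        for signal, kappa in signals.items():
            if kappa < floor:  # strictly below, matching "falls below 0.70"
                flagged.append((category, signal, kappa))
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up figures, for illustration only.
    example = {
        "category_a": {"sentiment": 0.81, "recommendation_strength": 0.76},
        "category_b": {"sentiment": 0.68, "recommendation_strength": 0.72},
    }
    for row in categories_below_floor(example):
        print(row)
```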

What this means for current clients

Current measurement windows continue uninterrupted. The new sentiment rubric is applied to all responses logged from 2026-05-04 onward. Prior responses are remapped algorithmically and footnoted in the report. The next prompt-set review is scheduled for 2026-11-04.
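For anyone reconciling their own exports against the change boundary, the cutover can be sketched as a date comparison. The two dates come from this note; the function name and the "native" / "remapped" labels are hypothetical.

```python
from datetime import date

NEW_RUBRIC_START = date(2026, 5, 4)         # from this note
NEXT_PROMPT_SET_REVIEW = date(2026, 11, 4)  # from this note


def coding_path(logged_on):
    """Return 'native' for responses coded directly under the new rubric,
    'remapped' for earlier responses converted after the fact."""
    # "From 2026-05-04 onward" is read as inclusive of the start date.
    return "native" if logged_on >= NEW_RUBRIC_START else "remapped"


def months_between(start, end):
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


if __name__ == "__main__":
    print(coding_path(date(2026, 5, 3)), coding_path(date(2026, 5, 4)))
    print(months_between(NEW_RUBRIC_START, NEXT_PROMPT_SET_REVIEW))
```

The second helper simply confirms that the scheduled review falls six months after the rubric start date, in line with the biannual cadence.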

As always, the full methodology paper is at /methodology. If you have questions about the comparability of your prior windows under the new rubric, the answer is in the report; if it isn't, email and we'll work through it directly.

¹ For the empirical case for biannual review in third-party measurement, see DoubleVerify's published methodology revision history (2018–2024), where revision cadence settled at roughly the same interval after quarterly proved too noisy.

² Nielsen reviews TV-audience-measurement methodology on a roughly annual cadence; the IAB Tech Lab updates ad measurement standards on a similar schedule. AI brand presence is moving faster than either, hence the compromise.

¶ Aside

Editorial standard

Hiding low agreement is the kind of soft compromise measurement firms drift into. Easier to just publish it.

That's the editorial test for everything that goes in these notes. If the temptation is to leave something out because it's inconvenient, that's the thing worth publishing.
