Open methodology. Reproducible math.
Every reported number on Findrix carries a method ID and confidence bounds. CFOs verify it. Statisticians break it. We ship the formulas, the assumptions, and the limits.
Wilson confidence intervals on every count.
When we say "38% citation share", we report the Wilson CI bounds alongside it. With small samples, the interval widens, and we surface that uncertainty rather than hide it.
Citation share (point estimate)
38.0%
Wilson 95% CI
[29.1% — 47.8%]
method_id: wilson-v1.2
Wilson Score Interval (95% CI):
p̂ + z²/(2n) ± z·√(p̂(1-p̂)/n + z²/(4n²))
─────────────────────────────────────────
1 + z²/n
where p̂ = mentions / sample, n = sample size, z = 1.96 for 95% CI.
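As a rough illustration, the formula above can be evaluated in a few lines of plain Python. This is a minimal sketch, not the library's implementation; the function name `wilson_interval` and its parameters are hypothetical and chosen only for this example.

```python
from math import sqrt


def wilson_interval(mentions: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (hypothetical helper, illustration only)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = mentions / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


lo, hi = wilson_interval(38, 100)
print(f"{lo:.3f}, {hi:.3f}")  # 0.291, 0.478

# Same share on a tenth of the sample: the interval is much wider.
small_lo, small_hi = wilson_interval(4, 10)
print(small_hi - small_lo > hi - lo)  # True
```

With mentions=38 and n=100 this reproduces the [29.1%, 47.8%] bounds shown above.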
method_id: wilson-v1.2
Difference-in-differences for citation lift.
Did the schema deploy actually move the citation rate? We compare your treatment cohort against a matched control cohort: the change in treatment minus the change in control estimates the causal lift, provided both cohorts would have followed the same trend without the deploy. Placebo tests on pre-deploy windows guard against mistaking regression to the mean for an effect.
Difference-in-differences (DiD):
ATE = (Y_treatment_post - Y_treatment_pre) - (Y_control_post - Y_control_pre)
Standard error from cluster-robust regression at the (LLM × prompt × week) level.
Placebo test: re-run on (week_-4, week_-2) windows; if "effect" appears, abort.
method_id: did-v1.4 (with placebo guard did-placebo-v1.0)
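The point estimate and the placebo guard can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the toy citation rates, and the fixed `tolerance` threshold are all hypothetical, and the cluster-robust standard error is omitted. A real placebo test would use a significance test rather than a fixed cutoff.

```python
from statistics import mean


def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """(treatment post - pre) minus (control post - pre), using cohort means."""
    return (mean(treat_post) - mean(treat_pre)) - (mean(ctrl_post) - mean(ctrl_pre))


def placebo_passes(treat_early, treat_late, ctrl_early, ctrl_late, tolerance=0.02):
    """Run DiD on two pre-deploy windows; a non-trivial 'effect' fails the guard.

    `tolerance` is a hypothetical fixed cutoff used only to keep the sketch short.
    """
    fake_effect = did_estimate(treat_early, treat_late, ctrl_early, ctrl_late)
    return abs(fake_effect) <= tolerance


# Toy citation rates per prompt (made-up numbers, for illustration only).
lift = did_estimate(
    treat_pre=[0.20, 0.22, 0.21],
    treat_post=[0.30, 0.31, 0.29],
    ctrl_pre=[0.19, 0.20, 0.21],
    ctrl_post=[0.22, 0.23, 0.21],
)
print(f"{lift:.3f}")  # 0.070

# Placebo on two pre-deploy windows: both cohorts drift by the same amount.
ok = placebo_passes(
    treat_early=[0.20, 0.21, 0.19],
    treat_late=[0.21, 0.22, 0.20],
    ctrl_early=[0.19, 0.20, 0.18],
    ctrl_late=[0.20, 0.21, 0.19],
)
print(ok)  # True
```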
BCa Bootstrap for share-of-voice.
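Share of voice is bounded between 0 and 1, but it is a ratio aggregated across prompts, LLMs, and weeks rather than a simple binomial count, so Wilson doesn't fit. We use the bias-corrected and accelerated (BCa) bootstrap with 10,000 resamples to get an asymmetric CI. When the LLM mix is heavy on one source (e.g. only Reddit cited you), the CI tells you.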
BCa Bootstrap:
For B=10,000 resamples of (prompt, LLM, week) tuples:
1. Compute SoV for each resample
2. Bias-correction z₀ = Φ⁻¹(P(SoV* < SoV̂))
3. Acceleration â = jackknife formula
4. Adjusted percentiles α₁, α₂
CI_BCa = (SoV*_α₁, SoV*_α₂)
method_id: bca-v1.1
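The four steps above can be sketched with the standard library alone. This is an illustrative sketch, not the library's code. It assumes each (prompt, LLM, week) tuple is reduced to a pair (brand_mentions, total_mentions) and that share of voice is the ratio of the two sums. That data layout, the function names, and the toy rows are all hypothetical.

```python
import random
from statistics import NormalDist


def share_of_voice(rows):
    """rows: list of (brand_mentions, total_mentions) pairs (assumed layout)."""
    return sum(b for b, _ in rows) / sum(t for _, t in rows)


def bca_interval(rows, stat, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    norm = NormalDist()
    n = len(rows)
    theta_hat = stat(rows)

    # 1. Statistic on each resample (sampling tuples with replacement).
    boot = sorted(stat(rng.choices(rows, k=n)) for _ in range(n_resamples))

    # 2. Bias correction z0; the fraction is clamped away from 0 and 1.
    frac = sum(value < theta_hat for value in boot) / n_resamples
    frac = min(max(frac, 1 / (n_resamples + 1)), n_resamples / (n_resamples + 1))
    z0 = norm.inv_cdf(frac)

    # 3. Acceleration from leave-one-out (jackknife) estimates.
    jack = [stat(rows[:i] + rows[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den else 0.0

    # 4. Adjusted percentiles alpha_1 and alpha_2.
    def adjusted(z):
        return norm.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))

    a1 = adjusted(norm.inv_cdf(alpha / 2))
    a2 = adjusted(norm.inv_cdf(1 - alpha / 2))
    lo = boot[min(n_resamples - 1, int(a1 * n_resamples))]
    hi = boot[min(n_resamples - 1, int(a2 * n_resamples))]
    return lo, hi


# Toy data: 12 tuples, 10 total mentions each (made-up numbers).
rows = [(3, 10), (5, 10), (2, 10), (6, 10), (4, 10), (1, 10),
        (7, 10), (3, 10), (4, 10), (5, 10), (2, 10), (4, 10)]
print(share_of_voice(rows), bca_interval(rows, share_of_voice))
```

Where SciPy is available, `scipy.stats.bootstrap` with `method='BCa'` is a maintained alternative to hand-rolling these steps.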
FDR correction for prompt-by-prompt tests.
When you test 240 prompts × 4 LLMs simultaneously (960 tests), naive p-values flag "significant" lifts that are noise: at α = 0.05, about 48 false positives are expected even if nothing changed. Findrix applies Benjamini-Hochberg FDR correction at q=0.05, so every flagged prompt has been adjusted for the multiple-testing burden.
Benjamini-Hochberg FDR:
For sorted p-values p_(1) ≤ p_(2) ≤ ... ≤ p_(m):
Reject H_0(i) for all i ≤ k
where k = max{i : p_(i) ≤ (i/m) · q}
q = 0.05 (5% expected false discovery rate)
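The step-up rule above is short enough to sketch directly. This is a minimal illustration with a hypothetical function name and made-up p-values, not the library's implementation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])

    # k = largest rank i with p_(i) <= (i / m) * q; 0 means nothing is rejected.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k = rank

    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))
# Only the first two survive: 0.039 exceeds its threshold (3/10) * 0.05 = 0.015,
# and no larger rank meets its own threshold either.
```

Because k is the largest qualifying rank, a p-value that misses its own threshold is still rejected when a larger one qualifies. That is the step-up behaviour, and it is why the loop does not stop at the first failure.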
method_id: bh-fdr-v1.0
Verify the math yourself.
All four formulas live in our open-source statistics library. Pull it, run our test fixtures, reproduce any number we report.
# Findrix open-source stats library
$ git clone https://github.com/findrix/findrix-stats
$ cd findrix-stats && pip install -e .
# Reproduce the Wilson CI on your fixture data
>>> from findrix_stats import wilson_ci
>>> wilson_ci(mentions=38, n=100)
(0.291, 0.478)
# method_id: wilson-v1.2
Repository: github.com/findrix/findrix-stats (MIT license · Phase 4)
Honest disclosures.
- LLMs are stochastic. Two identical queries 10 minutes apart can yield different outputs. We sample n=10 calls per (prompt, LLM, window) and report the mean with CI — but raw point estimates are not deterministic.
- Citation share is a sampled estimate, not a census. The full population of LLM queries is unobservable. We use stratified random prompts within your category.
- LLM model versions change. When OpenAI ships GPT-5.5, our citation rate baseline shifts. We re-baseline within 14 days of any major model release.
- DiD assumes the control cohort follows the same trend as treatment in the absence of intervention. We validate this with pre-deploy parallel-trend tests; failures block the report.
