# Aggregator Integrity Playbook

## Core Idea

TokenAuditor is not a tool for accusing suppliers. It is a local evidence layer
for AI infrastructure claims.

It turns unverifiable claims about API aggregators, gateways, model routes,
fallback behavior, model identity, degradation windows, and route-mediated tool
calls into redacted, repeatable evidence.

Transparent to users. Fair to suppliers. We only stand with evidence.

## Product Principles

- `Local-first`: keep raw evidence on the user's machine unless the user opts in
  to sharing.
- `Secret-blind`: never ask for or store API keys, tokens, passwords, raw
  private prompts, raw responses, customer data, or full production logs.
- `Metadata-first`: start with route, model, fallback, latency, usage, retry,
  finish reason, request id, fingerprint, and tool-schema signals.
- `Evidence before accusation`: output evidence states and limitations, not
  supplier accusations.
- `Fair to suppliers`: include sample size, time window, reproducibility
  boundary, and missing evidence.
- `Roadside audit is not route audit`: a local operation gate can trigger route
  audit, but it does not prove route integrity by itself.
- `Watch / Sample / Probe, not pass/fail`: treat findings as staged risk states.
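
The `Secret-blind` and `Metadata-first` principles can be sketched as an allowlist filter. This is a minimal illustration, not TokenAuditor's actual implementation: the function name `to_metadata_record` and the field names in `ALLOWED_FIELDS` are hypothetical, chosen to mirror the metadata signals listed above.

```python
# Hypothetical field names mirroring the Metadata-first signal list.
ALLOWED_FIELDS = frozenset({
    "route", "claimed_model", "returned_model", "fallback", "latency_ms",
    "usage", "retry_count", "finish_reason", "request_id",
    "system_fingerprint", "tool_schema_hash",
})


def to_metadata_record(raw: dict) -> dict:
    """Return a copy of `raw` containing only allowlisted metadata fields."""
    return {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}


raw_event = {
    "route": "aggregator-a/chat",
    "claimed_model": "model-x",
    "latency_ms": 840,
    "api_key": "placeholder-not-a-real-key",  # must never be stored
    "prompt": "private text",                 # must never be stored
}
record = to_metadata_record(raw_event)
```

An allowlist is used here instead of a denylist so that unknown fields are dropped by default, which fits the secret-blind intent more closely than trying to enumerate every sensitive key.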

## Roadside Audit Is Not Route Audit

A roadside audit screens a local operation before it runs: a tool call, shell
command, package install, outbound copy, credential-adjacent action, or
route-mediated agent action. It returns `allow`, `warn`, `require_approval`, or
`block`, and may write local metadata evidence.

A route or aggregator audit evaluates a model route over evidence windows:
claimed model, observed model, fallback disclosure, baseline drift, quality
degradation, latency, token profile, tool schema behavior, and
cost-per-success.

Roadside audit can trigger route audit. It does not by itself prove model
identity, route integrity, or supplier behavior.
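
One way to picture the roadside gate is as a function that combines per-rule decisions and reports whether a route audit should be queued. The sketch below is an assumption, not a specified algorithm: the "strictest decision wins" rule, the ordering of the enum, and the `triggers_route_audit` key are all hypothetical.

```python
from enum import IntEnum


class PolicyDecision(IntEnum):
    # Ordered from least to most restrictive (the ordering is an assumption).
    ALLOW = 0
    WARN = 1
    REQUIRE_APPROVAL = 2
    BLOCK = 3


def roadside_check(rule_decisions: list, route_mediated: bool) -> dict:
    """Combine per-rule decisions; the strictest one wins."""
    decision = max(rule_decisions, default=PolicyDecision.ALLOW)
    return {
        "policy_decision": decision.name.lower(),
        # A trigger only queues a route audit. It says nothing about
        # model identity, route integrity, or supplier behavior.
        "triggers_route_audit": route_mediated and decision != PolicyDecision.ALLOW,
    }


result = roadside_check(
    [PolicyDecision.WARN, PolicyDecision.REQUIRE_APPROVAL],
    route_mediated=True,
)
```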

## Scope

Use this playbook for one narrow question: does an API aggregator, relay,
router, or gateway route appear inconsistent with its claimed model, or
materially worse than its baseline quality window?

In scope:

- API aggregator or OpenAI-compatible gateway model substitution.
- Model identity drift between `claimed_model`, returned model, observed model
  signal, billing label, and baseline behavior.
- Model quality degradation where the label stays the same but quality, tool
  success, latency, retry, token profile, or cost-per-success drifts.
- Evidence memo generation for `Watch`, `Sample`, or `Probe`.
- Roadside findings only when they provide trigger evidence for a route audit.

Out of scope:

- Generic model benchmarking.
- Pricing-table crawling.
- Public supplier ranking or blacklist.
- Full MCP or agent-security review outside route-mediated evidence.
- Direct `.env` or credential value reads.
- Active probes without explicit operator approval for target, sample size,
  cost, and data boundary.

## Workflow

1. Classify the audit mode:
   - `roadside_check`: local operation risk gate.
   - `route_integrity_audit`: aggregator/model route evidence review.

2. For route integrity, classify the concern:
   - `Identity`: possible model substitution or identity drift.
   - `Degradation`: same label, worse quality window.
   - `Mixed`: both identity and quality signals.

3. Gather only safe inputs:
   - claimed model, route label, provider host, returned model, fallback
     disclosure
   - provider class, disclosure surface, supplier response status when known
   - direct-provider baseline or earlier known-good window
   - current aggregator window
   - sample size, time window, latency, retry, token usage, cost-per-success
   - eval context: prompt set hash, rubric version, model parameters, route
     parameters
   - never API keys, raw private prompts, raw responses, or full production logs

4. Lock comparability before interpreting route integrity:
   - same task family, rubric, success criteria, and prompt-set version
   - same or comparable temperature, top-p, max tokens, reasoning effort,
     response format, and tool schema
   - same route configuration, fallback policy, provider-only setting, and
     direct-vs-aggregator key path when known
   - if these are missing, cap the finding at `Sample` unless there is an
     explicit model-label mismatch or undisclosed fallback

5. Use evidence tiers:
   - `T0 metadata` first
   - `T1 paired eval` for degradation
   - `T2 behavioral fingerprint` for identity drift
   - `T3 formal audit` only for serious probes
   - `T4 attestation` as a future/partner path

6. Decide:
   - `Watch`: no material issue, or only weak signals that do not yet justify
     sampling
   - `Sample`: anomaly exists but evidence is incomplete
   - `Probe`: material mismatch or degradation warrants a user-approved active
     probe

7. Output an evidence memo, not an accusation.
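
Steps 4 and 6 can be read together as a small decision rule: pick a candidate state, then cap it at `Sample` when comparability is not locked and there is no explicit model-label mismatch or undisclosed fallback. The following is a minimal sketch under that reading. The function name and boolean inputs are hypothetical, and judging what counts as "material" is left to the operator.

```python
def decide_risk_state(
    *,
    material: bool,
    anomaly: bool,
    comparable: bool,
    label_mismatch: bool = False,
    undisclosed_fallback: bool = False,
) -> str:
    """Map reviewed signals to Watch / Sample / Probe."""
    if material:
        state = "Probe"
    elif anomaly:
        state = "Sample"
    else:
        state = "Watch"

    # Step 4: without locked comparability, cap at Sample unless an
    # explicit label mismatch or undisclosed fallback was observed.
    explicit_signal = label_mismatch or undisclosed_fallback
    if state == "Probe" and not comparable and not explicit_signal:
        state = "Sample"
    return state
```

For example, a material quality drop measured against a baseline with a different prompt set would come out as `Sample`, while the same drop combined with an undisclosed fallback would remain `Probe`.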

## Evidence Tiers

- `T0 metadata`: route, claimed model, returned model, fallback disclosure,
  latency, usage, retry, finish reason, request id, and `system_fingerprint`
  where exposed.
- `T1 paired eval`: same task set, same rubric, same or similar time window,
  direct provider vs aggregator route.
- `T2 behavioral fingerprint`: response-shape hash, tool-schema behavior,
  output length distribution, refusal/format habits, deterministic prompt
  signatures, and other behavior signals.
- `T3 formal audit`: benchmark suites, logprob analysis where available,
  rank-based or black-box equality tests, and other structured statistical
  audits.
- `T4 attestation`: trusted execution evidence, provider-supplied verifiable
  proof, or partner attestation. This is a long-term path, not the default MVP.

Default MVP evidence is `T0 + T1`. Escalate to `T2` or `T3` only when metadata,
paired evals, or repeated user-impacting anomalies justify it.

Returned model labels, billing labels, dashboard labels, and fingerprints are
useful `T0 metadata` signals. They are evidence, not proof. Treat them as route
claims to compare against baselines, disclosures, and behavior windows.
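
A `T0` label comparison might look like the sketch below. It assumes simple case-insensitive string comparison, and the function and parameter names are hypothetical. Aliases or dated snapshots of the same model would also raise a signal here, which is one reason the output is a list of signals to review and not a verdict.

```python
from typing import Optional


def _normalize(label: Optional[str]) -> Optional[str]:
    """Lowercase and trim a label; empty or missing labels become None."""
    return label.strip().lower() if label else None


def t0_identity_signals(
    claimed_model: str,
    returned_model: Optional[str] = None,
    billing_label: Optional[str] = None,
) -> list:
    """Compare route claims and return signals for review, never proof."""
    claimed = _normalize(claimed_model)
    signals = []
    for name, value in (
        ("returned_model", returned_model),
        ("billing_label", billing_label),
    ):
        observed = _normalize(value)
        if observed is None:
            signals.append(f"{name}: missing evidence")
        elif observed != claimed:
            signals.append(
                f"{name}: identity drift signal ({value!r} vs claimed {claimed_model!r})"
            )
    return signals


signals = t0_identity_signals("model-x", returned_model="model-y")
```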

## Evidence Schema

Required for every memo:

- `audit_mode`: `roadside_check` or `route_integrity_audit`
- `evidence_tier`: `T0`, `T1`, `T2`, `T3`, or `T4`
- `risk_state`: `Watch`, `Sample`, or `Probe`
- `confidence`: `low`, `medium`, or `high`
- `supplier_fairness_note`: sample size, time window, limitations, and what
  would make the result reproducible or contestable

Use when relevant:

- `provider_class`: `official_provider`, `official_gateway`, `aggregator`,
  `reseller`, `shadow_api`, or `unknown`
- `disclosure_surface`: `docs`, `dashboard`, `invoice`, `response_field`,
  `contract`, `support_response`, `none`, or `unknown`
- `supplier_response_status`: `not_contacted`, `pending`, `responded`,
  `resolved`, `disputed`, or `unknown`
- `policy_decision`: `allow`, `warn`, `require_approval`, `block`, or `n/a`
- `policy_reason`: why a roadside check allowed, warned, escalated, or blocked
- `roadside_trigger`: local operation signal that triggered a route audit, if
  any

Recommended comparability fields:

- `eval_run_id`
- `prompt_set_hash`
- `rubric_version`
- `time_window_start`, `time_window_end`
- `task_family`
- `success_criteria`
- `model_params.temperature`
- `model_params.top_p`
- `model_params.max_tokens`
- `model_params.reasoning_effort`
- `model_params.response_format`
- `route_params.provider_order`
- `route_params.provider_only`
- `route_params.fallback_models`
- `route_params.fallback_policy`
- `route_params.key_path`
- `route_params.tool_schema_hash`
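
The schema above can be checked mechanically before a memo is written. The sketch below assumes the memo is held as a nested `dict` keyed by the field names listed in this section; the helper names `validate_memo` and `missing_comparability_fields` are hypothetical.

```python
REQUIRED_ENUMS = {
    "audit_mode": {"roadside_check", "route_integrity_audit"},
    "evidence_tier": {"T0", "T1", "T2", "T3", "T4"},
    "risk_state": {"Watch", "Sample", "Probe"},
    "confidence": {"low", "medium", "high"},
}

COMPARABILITY_FIELDS = [
    "eval_run_id", "prompt_set_hash", "rubric_version",
    "time_window_start", "time_window_end", "task_family", "success_criteria",
    "model_params.temperature", "model_params.top_p", "model_params.max_tokens",
    "model_params.reasoning_effort", "model_params.response_format",
    "route_params.provider_order", "route_params.provider_only",
    "route_params.fallback_models", "route_params.fallback_policy",
    "route_params.key_path", "route_params.tool_schema_hash",
]

_MISSING = object()  # sentinel so falsy values such as 0 or False count as present


def _get_path(memo: dict, dotted: str):
    """Look up a dotted path such as 'model_params.top_p' in nested dicts."""
    current = memo
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def validate_memo(memo: dict) -> list:
    """Return a list of problems with the required fields (empty if none)."""
    problems = []
    for field, allowed in REQUIRED_ENUMS.items():
        if field not in memo:
            problems.append(f"missing {field}")
        elif memo[field] not in allowed:
            problems.append(f"invalid {field}: {memo[field]!r}")
    if not memo.get("supplier_fairness_note"):
        problems.append("missing supplier_fairness_note")
    return problems


def missing_comparability_fields(memo: dict) -> list:
    """List recommended comparability fields that the memo does not record."""
    return [f for f in COMPARABILITY_FIELDS if _get_path(memo, f) is _MISSING]


memo = {
    "audit_mode": "route_integrity_audit",
    "evidence_tier": "T1",
    "risk_state": "Sample",
    "confidence": "certain",  # not an allowed value
    "prompt_set_hash": "abc",
    "model_params": {"temperature": 0},
}
problems = validate_memo(memo)
missing = missing_comparability_fields(memo)
```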

## Interpretation Guardrails

- `n < 5`: anecdote only; do not escalate beyond `Sample` unless there is an
  explicit model-label mismatch or undisclosed fallback.
- `5 <= n < 20`: useful for triage; confidence is low or medium.
- `n >= 20`: acceptable for a preliminary evidence memo if windows and tasks
  are comparable.
- For degradation, baseline and current tasks must be comparable before calling
  it material.
- Missing prompt-set hash, rubric version, model parameters, route parameters,
  or time-window boundaries should lower confidence and usually cap the result
  at `Sample`.
- Degradation claims need paired deltas plus uncertainty notes when possible:
  standard error, confidence interval, clustered-eval caveat, or a
  plain-language power note.
- `T2 behavioral fingerprint` and `T3 formal audit` are probe-stage methods. Do
  not treat them as default monitoring or as a public ranking mechanism.
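
The sample-size bands and the paired-delta guardrail can be sketched as two small helpers. The band names (`anecdote`, `triage`, `preliminary`) are shorthand invented for this example, and the standard error below is the plain standard error of the mean of paired differences. It assumes independent task pairs, so clustered evals would need the caveat mentioned above.

```python
import math
from statistics import mean, stdev


def sample_size_band(n: int) -> str:
    """Map a sample size to the interpretation band from the guardrails."""
    if n < 5:
        return "anecdote"
    if n < 20:
        return "triage"
    return "preliminary"


def paired_delta(baseline: list, current: list) -> tuple:
    """Return (mean delta, standard error) for paired per-task scores.

    Deltas are current minus baseline, so a negative mean suggests degradation.
    """
    if len(baseline) != len(current):
        raise ValueError("baseline and current must be paired one-to-one")
    if len(baseline) < 2:
        raise ValueError("need at least two pairs to estimate uncertainty")
    deltas = [c - b for b, c in zip(baseline, current)]
    standard_error = stdev(deltas) / math.sqrt(len(deltas))
    return mean(deltas), standard_error


# Four paired task outcomes (1 = success, 0 = failure): anecdote-level only.
mean_delta, se = paired_delta([1, 1, 1, 1], [0, 1, 0, 1])
```

With only four pairs, `sample_size_band(4)` returns `"anecdote"`, so a delta like this one would stay at `Sample` at most regardless of its size.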

## Memo Template

```markdown
Conclusion: Watch / Sample / Probe

Audit Mode: roadside_check / route_integrity_audit
Evidence Tier: T0 / T1 / T2 / T3 / T4
Provider Class: official_provider / official_gateway / aggregator / reseller / shadow_api / unknown
Disclosure Surface: docs / dashboard / invoice / response_field / contract / support_response / none / unknown
Identity State: Consistent / Needs Sample / Likely Substitution Signal
Degradation State: Stable / Possible Degradation / Material Degradation
Policy Decision: allow / warn / require_approval / block / n/a
Confidence: low / medium / high

Known Facts
- Route:
- Claimed model:
- Baseline:
- Current window:
- Sample size:
- Eval context:

Observed Signals
- Identity:
- Quality:
- Latency/token/cost:
- Fallback/retry:
- Statistical uncertainty:
- Roadside trigger, if any:

Missing Evidence
-

Supplier Fairness Note
-

Recommended Next Actions
-
```
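
The header of the template can be filled from a memo record with a few lines of code. This is only a sketch: it assumes the memo is a `dict` using the schema field names, and the `identity_state` and `degradation_state` keys are hypothetical, since the schema section does not name them.

```python
HEADER_FIELDS = [
    ("Audit Mode", "audit_mode"),
    ("Evidence Tier", "evidence_tier"),
    ("Provider Class", "provider_class"),
    ("Disclosure Surface", "disclosure_surface"),
    ("Identity State", "identity_state"),        # hypothetical key
    ("Degradation State", "degradation_state"),  # hypothetical key
    ("Policy Decision", "policy_decision"),
    ("Confidence", "confidence"),
]


def render_memo_header(memo: dict) -> str:
    """Render the memo header; optional fields that are absent are skipped."""
    lines = [f"Conclusion: {memo['risk_state']}", ""]
    lines += [f"{label}: {memo[key]}" for label, key in HEADER_FIELDS if key in memo]
    return "\n".join(lines)


header = render_memo_header({
    "risk_state": "Sample",
    "audit_mode": "route_integrity_audit",
    "confidence": "low",
})
```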

## Supplier-Fair Language

Prefer:

- suspected substitution
- identity drift signal
- undisclosed fallback signal
- material degradation window
- needs bounded sample
- requires user-approved probe
- roadside trigger signal

Avoid:

- fraud
- scam
- caught cheating
- proved downgrade
- guaranteed safe
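
A draft memo can be screened for the terms in the "Avoid" list before it is shared. The check below is a minimal sketch with a hypothetical function name. It matches whole words or phrases case-insensitively, so inflected forms such as "fraudulent" are not caught and still need a human read.

```python
import re

AVOID_TERMS = ["fraud", "scam", "caught cheating", "proved downgrade", "guaranteed safe"]


def flag_unfair_language(text: str) -> list:
    """Return the avoid-list terms that appear in `text` as whole words."""
    return [
        term
        for term in AVOID_TERMS
        if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
    ]


flags = flag_unfair_language("This route is a Scam and we caught cheating.")
```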
