SPECTRA Benchmark

Overview

“It works” is a claim; a reproducible benchmark is evidence. SPECTRA proves its capabilities the same way it evaluates everything else: with deterministic, evidence-based measurement, not by assertion. A benchmark suite declares cases with ground truth: a target, an allowlisted tool, optional arguments, and the observations that the tool output should contain. The harness runs each case through the tool adapter, which is gated by scope and RoE, grades the real output against the expected markers, and produces a scorecard with a grade.

It is SPECTRA-native, not a marketing number:

Reproducible. Scoring is deterministic marker matching (substring or re: regex), never an LLM judgement. Same suite + same target → same scorecard. The run log is the evidence.
Authorized-only. Every case runs through the tool adapter, so scope, Rules of Engagement, the flag allowlist, and the destructive HARD BLOCK all apply — a benchmark cannot scan out of scope.
Honest. A missed expectation is a gap, reported as a gap. A blocked or unavailable case is reported as such and never counted as a pass. The pass rate is over scored cases only.
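
The marker matching described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: `marker_matches` and `grade_case` are hypothetical names, the sample output is invented for the example, and the choice of `re.search` with no flags is an assumption about how `re:` markers are evaluated.

```python
import re

REGEX_PREFIX = "re:"  # markers starting with this prefix are treated as regexes


def marker_matches(marker, output):
    """Return True if the expected marker is found in the tool output.

    ASSUMPTION: regex markers are searched anywhere in the output with no
    flags; plain markers are case-sensitive substring checks.
    """
    if marker.startswith(REGEX_PREFIX):
        return re.search(marker[len(REGEX_PREFIX):], output) is not None
    return marker in output


def grade_case(expectations, output):
    """Map each expectation name to whether its marker matched."""
    return {exp["name"]: marker_matches(exp["match"], output) for exp in expectations}


# Hypothetical tool output, used only to exercise the matcher.
sample_output = "PORT     STATE SERVICE\n2222/tcp open  ssh\n"
expectations = [
    {"name": "ssh port open", "match": r"re:2222/tcp\s+open"},
    {"name": "ssh service", "match": "ssh"},
    {"name": "http port open", "match": r"re:80/tcp\s+open"},
]
results = grade_case(expectations, sample_output)
```

Because the result depends only on the marker and the captured output, re-running the same expectations against the same output always yields the same verdicts; the unmatched `http port open` expectation is reported as a gap rather than hidden.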

Run benchmarks against authorized labs you own (e.g. the SPECTRA lab container), never against third parties.

Deterministic runtime (Layer 3)

Preview (gate every case, execute nothing):

python3 {project-root}/_spectra/core/execution/benchmark.py run \
  --suite "{suite_yaml}" --engagement "{engagement_yaml}" --dry-run

Run and write the scorecard:

python3 {project-root}/_spectra/core/execution/benchmark.py run \
  --suite "{suite_yaml}" --engagement "{engagement_yaml}" --output "{scorecard_json}"

A starter suite is provided at {project-root}/_spectra/core/spectra-benchmark/resources/recon-baseline.suite.yaml.
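
For scripted use, the two invocations above can be assembled as argument lists. This is a minimal sketch: `build_benchmark_command` is a hypothetical helper, not part of SPECTRA, and the paths in the usage lines are placeholders. It uses only the subcommand and flags shown above.

```python
def build_benchmark_command(project_root, suite_yaml, engagement_yaml,
                            scorecard_json=None, dry_run=False):
    """Assemble the argv for benchmark.py (hypothetical helper, not part of SPECTRA)."""
    script = f"{project_root}/_spectra/core/execution/benchmark.py"
    argv = ["python3", script, "run",
            "--suite", suite_yaml, "--engagement", engagement_yaml]
    if dry_run:
        # Preview: gate every case, execute nothing.
        argv.append("--dry-run")
    elif scorecard_json is not None:
        # Real run: write the scorecard to the given path.
        argv += ["--output", scorecard_json]
    return argv


# Placeholder paths, for illustration only.
preview = build_benchmark_command("/opt/lab", "suite.yaml", "engagement.yaml", dry_run=True)
full_run = build_benchmark_command("/opt/lab", "suite.yaml", "engagement.yaml",
                                   scorecard_json="scorecard.json")
```

Either list can be passed to `subprocess.run(argv)` and the return code inspected, which keeps quoting out of the picture when suite or engagement paths contain spaces.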

Case status: scored (executed and graded), planned (dry run), blocked (scope/RoE/HARD BLOCK), unavailable (tool not installed), error. The scorecard reports cases_scored, cases_passed, pass_rate_percent, and a grade (A–F). Exit code is non-zero if any scored case failed — useful for CI.
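
The scorecard arithmetic can be sketched as follows. The field names come from the text above, but everything else is an assumption made for illustration: the `CaseResult` type, the grade thresholds (the document does not state where the A to F boundaries fall), the one-decimal rounding, and the handling of a run with no scored cases.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    status: str          # "scored", "planned", "blocked", "unavailable", or "error"
    passed: bool = False  # only meaningful when status == "scored"


# ASSUMPTION: illustrative thresholds only; the real grade boundaries are not documented here.
GRADE_THRESHOLDS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]


def build_scorecard(results):
    """Compute the pass rate over scored cases only."""
    scored = [r for r in results if r.status == "scored"]
    passed = [r for r in scored if r.passed]
    # ASSUMPTION: with nothing scored, the rate is reported as 0.0 (grade F).
    rate = round(100.0 * len(passed) / len(scored), 1) if scored else 0.0
    grade = next((g for floor, g in GRADE_THRESHOLDS if rate >= floor), "F")
    return {
        "cases_scored": len(scored),
        "cases_passed": len(passed),
        "pass_rate_percent": rate,
        "grade": grade,
    }


def exit_code(results):
    """Non-zero if any scored case failed, as described above."""
    return 1 if any(r.status == "scored" and not r.passed for r in results) else 0


results = [
    CaseResult("ssh-port-open", "scored", passed=True),
    CaseResult("http-banner", "scored", passed=True),
    CaseResult("tls-version", "scored", passed=False),
    CaseResult("out-of-scope-host", "blocked"),
    CaseResult("missing-tool", "unavailable"),
]
card = build_scorecard(results)
```

In this example the blocked and unavailable cases change neither the numerator nor the denominator: two of three scored cases passed, so the rate is 66.7 percent regardless of how many cases were gated out.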

Suite format

suite: "recon-baseline"
cases:
  - id: "ssh-port-open"
    tool: "nmap"            # must be an allowlisted tool (see spectra-tool-run)
    target: "127.0.0.1"     # must be in engagement scope
    args: ["-p", "2222"]    # allowlisted flags only
    expect:
      - {name: "ssh port open", match: "re:2222/tcp\\s+open"}
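
A suite of this shape can be sanity-checked before it reaches the harness. The sketch below works on the already-parsed suite (a plain dict mirroring the YAML above) and checks structure only; `validate_suite` is a hypothetical helper, and treating a case without expectations as a problem is an assumption. Scope, RoE, and allowlist enforcement remain the tool adapter's job and are not reproduced here.

```python
def validate_suite(suite):
    """Return a list of human-readable problems; an empty list means the shape is fine."""
    problems = []
    if not suite.get("suite"):
        problems.append("suite: missing name")
    cases = suite.get("cases") or []
    if not cases:
        problems.append("suite: no cases declared")
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id") or f"cases[{index}]"
        for key in ("id", "tool", "target"):
            if not case.get(key):
                problems.append(f"{label}: missing '{key}'")
        case_id = case.get("id")
        if case_id:
            if case_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(case_id)
        # args are optional, but when present they must be a list of flags/values.
        if not isinstance(case.get("args", []), list):
            problems.append(f"{label}: 'args' must be a list")
        # ASSUMPTION: a case with nothing to check is flagged rather than accepted.
        expectations = case.get("expect") or []
        if not expectations:
            problems.append(f"{label}: no expectations declared")
        for exp in expectations:
            if not exp.get("name") or not exp.get("match"):
                problems.append(f"{label}: expectation needs 'name' and 'match'")
    return problems


good_suite = {
    "suite": "recon-baseline",
    "cases": [{
        "id": "ssh-port-open",
        "tool": "nmap",
        "target": "127.0.0.1",
        "args": ["-p", "2222"],
        "expect": [{"name": "ssh port open", "match": r"re:2222/tcp\s+open"}],
    }],
}
bad_suite = {
    "suite": "broken",
    "cases": [{"id": "no-target", "tool": "nmap", "args": "-p 2222", "expect": []}],
}
good_problems = validate_suite(good_suite)
bad_problems = validate_suite(bad_suite)
```

Note that the YAML double-quoted string `"re:2222/tcp\\s+open"` decodes to `re:2222/tcp\s+open`, which is why the Python dict uses a raw string with a single backslash.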

You must fully embody this persona so that the user gets the best experience and the help they need; it is therefore important to remember never to break character until the user dismisses the persona.

When you are in this persona and the user invokes a skill, this persona must persist and remain active.

On activation

Load config via spectra-init skill — store config vars including {engagement_artifacts} and {report_artifacts}.
Detect the active engagement scoped to the lab. If none, halt and recommend spectra-new-engagement.
Dry-run first to confirm every case gates cleanly.
Run and report the scorecard; hand missed expectations to the relevant agent (e.g. detection gaps to spectra-detection-lifecycle).
Preserve the scorecard with spectra-evidence-chain so the benchmark result is itself verifiable evidence.

Boundary

The benchmark only ever runs gated tools against in-scope targets. It does not invent results, does not count blocked/unavailable cases as passes, and produces the same scorecard for the same inputs — so the number means something.