Scanner Module Failures: Timeouts, Errors, and Partial Scans

Every scanning programme that covers more than a handful of targets eventually hits the same operational reality: some modules complete cleanly, some time out, some error, and some are skipped. The scan is partial. The question is whether the partial completion gets surfaced as evidence on the engagement record or hidden inside summary totals that look identical to a clean scan. Hidden partials produce two failure modes at once: testers re-verify findings the scanner never actually checked, and auditors read coverage that was never delivered.

This guide covers the four module states (completed, error, timeout, skipped), why each one happens in production, how to design retry and recovery so transient failures do not turn into persistent gaps, how to record a partial scan so it survives audit, and how to read module-level status as the operating evidence that vulnerability scanning controls actually require. The audience is internal security, AppSec, vulnerability management, and security engineering teams that run scans on cadence against assets that move.

Four module states and what each one means

A scanner module is a single detection unit inside a larger scan. The scan plan lists the modules the platform intends to run against the target; the scan execution records what each module actually returned. Treating module completion as binary (the scan finished or it did not) collapses information that the findings record, the audit, and the next scan cycle all depend on. The four canonical states below are what a defensible scan record carries per module.

| Module status | What it means | Typical next step |
| --- | --- | --- |
| Completed | The module ran against the target, evaluated its detection logic, and returned a result with a duration. Findings (if any) land on the engagement. | Triage findings as normal. The module contributes to coverage on the next baseline comparison. |
| Error | The module raised an exception, hit a network failure, or returned a non-completion status before the time budget elapsed. The cause is recorded as an error message. | Retry with backoff for transient cases; surface the failure on the scan record for persistent cases that need investigation. |
| Timeout | The module exceeded its hard execution budget without returning. The recorded duration matches the timeout boundary, not the actual completion time. | Investigate whether the timeout reflects a slow target, a rate-limited response, or a coverage limit that needs the budget extended. |
| Skipped | The module did not run because a precondition failed, scope excluded it, or the plan tier did not include it. The skip rationale is recorded. | Confirm the skip is intentional. If the skip was driven by configuration drift, fix the configuration and re-run the module. |

The four states are not a hierarchy; each one carries different information about what happened to the scan. Reducing them to a single completion percentage loses the distinction between a module that errored intermittently (retry candidate), a module that timed out repeatedly on the same target (coverage limit), a module that was skipped because the plan did not include it (procurement decision), and a module that completed cleanly (evidence). The scan record carries all four per module so the interpretation stays accurate.
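As a minimal sketch of what a per-module record could look like, the Python below models the four states as an enum and keeps them separate when grouping. The field names (`module`, `duration_ms`, `detail`, `findings`) and the `group_by_status` helper are illustrative assumptions, not the schema of any particular scanner.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ModuleStatus(Enum):
    COMPLETED = "completed"
    ERROR = "error"
    TIMEOUT = "timeout"
    SKIPPED = "skipped"


@dataclass(frozen=True)
class ModuleResult:
    """One module's outcome on the scan record (hypothetical field names)."""
    module: str
    status: ModuleStatus
    duration_ms: int
    detail: Optional[str] = None  # error message or skip rationale
    findings: tuple = ()

    def __post_init__(self):
        # An error or a skip is only interpretable later if the reason was recorded.
        needs_detail = self.status in (ModuleStatus.ERROR, ModuleStatus.SKIPPED)
        if needs_detail and not self.detail:
            raise ValueError(
                f"{self.status.value} result for {self.module} needs a detail message"
            )


def group_by_status(results):
    """Group module names by state instead of collapsing them to a percentage."""
    grouped = {status: [] for status in ModuleStatus}
    for result in results:
        grouped[result.status].append(result.module)
    return grouped
```

Rejecting an error or skip that has no recorded reason at construction time is one way to keep the record interpretable; a real platform might enforce the same rule at the storage layer instead.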

Where module failures come from in production

Failures are structural in any active scanning programme that runs against assets whose state changes during the scan. The five mechanisms below generate the failures that most programmes spend their investigation time on, even when the scanner, the target, and the credentials are all configured correctly.

Target latency variance mid-scan

A target that responds in 200 ms during the SSL handshake can take 30 seconds to answer a slow backend query during the same scan. The module that runs the slow probe hits the per-module timeout while the SSL module records a clean completion. The scan is partial because the same target behaved differently towards two modules. That signal is not a scanner bug; it is the scanner accurately representing what the target did.
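The sketch below shows one way a per-module time budget could produce exactly this outcome, using `asyncio.wait_for` from the Python standard library. The `run_with_budget` helper, the two probe coroutines, and the 0.2 second budget are hypothetical stand-ins chosen so the example runs quickly; they are not taken from any real scanner.

```python
import asyncio
import time


async def run_with_budget(name, probe, budget_s):
    """Run one module coroutine under a hard time budget and record its status."""
    started = time.monotonic()
    try:
        findings = await asyncio.wait_for(probe(), timeout=budget_s)
        status, detail = "completed", None
    except asyncio.TimeoutError:
        # The probe was cancelled at the budget boundary, so the recorded
        # duration reflects the budget, not how long the probe needed.
        findings, status, detail = [], "timeout", f"exceeded {budget_s}s budget"
    except Exception as exc:
        findings, status, detail = [], "error", repr(exc)
    duration_ms = int((time.monotonic() - started) * 1000)
    return {"module": name, "status": status, "duration_ms": duration_ms,
            "detail": detail, "findings": findings}


async def fast_tls_probe():
    await asyncio.sleep(0.01)  # stands in for a quick handshake check
    return []


async def slow_backend_probe():
    await asyncio.sleep(5)  # stands in for a slow backend query
    return []


async def scan():
    # Same target, two modules, two different outcomes.
    return [
        await run_with_budget("tls", fast_tls_probe, budget_s=0.2),
        await run_with_budget("slow_query", slow_backend_probe, budget_s=0.2),
    ]


if __name__ == "__main__":
    for row in asyncio.run(scan()):
        print(row["module"], row["status"], row["duration_ms"])
```

The slow probe is cancelled rather than left running, which is what lets the scan move on to the next module instead of blocking on it.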

Mid-scan WAF or rate-limit activation

A web application firewall that allows low-volume traffic activates when probe volume crosses a threshold. The first half of the scan completes; the second half gets a stream of 403 or 503 responses. Modules that ran early land clean; modules that ran late record errors. The scan record needs to surface the activation point or the next reader cannot tell which findings were verified against a live target and which were silently blocked.
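One rough way to surface the activation point is to look for the first window of responses in which blocking codes dominate. The sketch below is a heuristic under stated assumptions: the window size, the `min_blocked` threshold, and the choice of 403 and 503 as the only blocking codes are illustrative, and a real target may signal blocking differently.

```python
BLOCK_CODES = frozenset({403, 503})


def find_activation_index(status_codes, window=5, min_blocked=4):
    """Index of the first blocked response in the first window dominated by blocks."""
    for start in range(len(status_codes) - window + 1):
        chunk = status_codes[start:start + window]
        if sum(code in BLOCK_CODES for code in chunk) >= min_blocked:
            return next(i for i in range(start, start + window)
                        if status_codes[i] in BLOCK_CODES)
    return None


def split_at_activation(probes, window=5, min_blocked=4):
    """probes: list of (module, status_code) tuples in execution order.

    Returns the activation index plus the modules that sent probes before
    and after it. A module that straddles the boundary appears in both lists.
    """
    index = find_activation_index([code for _, code in probes], window, min_blocked)
    if index is None:
        return {"activation_index": None,
                "before": list(dict.fromkeys(m for m, _ in probes)),
                "after": []}
    return {"activation_index": index,
            "before": list(dict.fromkeys(m for m, _ in probes[:index])),
            "after": list(dict.fromkeys(m for m, _ in probes[index:]))}
```

Modules listed under `after` are the ones whose results deserve a second look before their silence is read as a clean result.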

External-dependency unreachability

A module that resolves CVE metadata against an external advisory database fails if the upstream is slow or unreachable. The scanner cannot detect the underlying vulnerability without the metadata, even though the target itself is fine. The failure mode is external to the target but visible on the module status, and the fix is retry against a healthier upstream rather than rescan against the target.

Session expiry during authenticated modules

Authenticated scans depend on a session that the target controls. Sessions expire on the schedule the application enforces, not on the schedule the scanner expects. A module that ran for forty minutes can find its session invalidated halfway through and start receiving login pages instead of authenticated responses. The authenticated coverage page covers the session diagnostics in detail; here, the relevant point is that session expiry mid-module is a routine cause of partial scans, not an exotic failure.

Resource-constrained probe expansion

A subdomain enumeration that hits a wildcard DNS response expands the candidate list faster than the time budget can absorb. The module records a timeout because the work was bounded by clock time, not by completion. The recovery is bounding the probe space (capping subdomain depth, capping endpoint depth) rather than extending the module budget against an unbounded search.
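To illustrate bounding the probe space, the sketch below runs a breadth-first expansion with a depth cap and a candidate cap, and reports whether the candidate cap cut the search short. The `expand` callback, the cap values, and the `wildcard_children` generator (which imitates a wildcard DNS zone where every candidate resolves) are assumptions for the example.

```python
from collections import deque


def bounded_enumerate(seed, expand, max_depth=2, max_candidates=500):
    """Breadth-first enumeration bounded by depth and by total candidates.

    Returns (names, truncated). truncated is True only when the candidate
    cap stopped the search; the depth cap bounds the search silently.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth cap
        for child in expand(name):
            if child in seen:
                continue
            if len(seen) >= max_candidates:
                return sorted(seen), True
            seen.add(child)
            queue.append((child, depth + 1))
    return sorted(seen), False


def wildcard_children(name):
    # Imitates a wildcard zone: every candidate label appears to exist.
    return [f"a.{name}", f"b.{name}"]
```

With the caps in place the module finishes with an explicit `truncated` flag that can go on the scan record, rather than running until the clock stops it.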

Retry policy: transient versus persistent

Retry is the cheapest recovery and the most often misconfigured. A retry loop that re-runs every failure mode against the same conditions burns the scan budget on persistent failures and never catches up. A retry loop that does not retry transient failures discards usable scans for conditions that have already changed by the second attempt. The pragmatic split below holds up in most programmes.

| Failure signal | Retry policy | Rationale |
| --- | --- | --- |
| Network timeout or 5xx | Retry up to three attempts with exponential backoff (for example 2s, 8s, 32s). | Transient failures resolve when conditions change; bounded retry catches most without burning the budget. |
| HTTP 429 with Retry-After | Retry once after the indicated interval; then halve the scan rate before continuing. | The target has signalled the rate is too high; respect the signal and adapt rather than burning retries. |
| HTTP 401 or 403 after a successful login | Do not retry. Fail the module and surface the authentication failure on the scan record. | The session is gone; another attempt produces the same outcome. The fix is the credential lifecycle, not the retry loop. |
| Module exception in scanner code | Do not retry. Record the exception and the input that triggered it. | A code-path failure repeats deterministically; investigation is the next step, not repetition. |
| Hard timeout against the same target on consecutive scans | Stop retrying. Record the module as a coverage limit against this target and tune the budget separately. | A timeout that happens every cycle is not transient; it is a coverage decision that needs documenting rather than retrying. |

The shared principle is matching the retry policy to what the failure signal can tell you about conditions changing between attempts. Retry helps when the underlying conditions are transient and shift between attempts; retry hurts when the underlying cause is structural and repeats deterministically.
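The split can be expressed as a small decision function. This is a sketch, not a definitive policy engine: the signal names are hypothetical labels for the five cases described in this section, "three attempts" is read here as three retries with delays of 2, 8, and 32 seconds, and a 429 that arrives without a usable interval is treated as "do not retry, halve the rate".

```python
from dataclasses import dataclass

BACKOFF_S = (2, 8, 32)  # delays before the first, second, and third retry


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    delay_s: float = 0.0
    halve_rate: bool = False
    reason: str = ""


def decide_retry(signal, retries_so_far, retry_after_s=None):
    """Map a failure signal (hypothetical labels) to a retry decision."""
    if signal in ("network_timeout", "http_5xx"):
        if retries_so_far < len(BACKOFF_S):
            return RetryDecision(True, BACKOFF_S[retries_so_far],
                                 reason="transient; bounded backoff")
        return RetryDecision(False, reason="retry ceiling reached; surface on scan record")
    if signal == "http_429":
        if retries_so_far == 0 and retry_after_s is not None:
            # Honour the interval once, and slow down for whatever follows.
            return RetryDecision(True, retry_after_s, halve_rate=True,
                                 reason="honour Retry-After once")
        return RetryDecision(False, halve_rate=True,
                             reason="target signalled the rate is too high")
    if signal == "auth_rejected":  # 401 or 403 after a successful login
        return RetryDecision(False, reason="session gone; fix the credential lifecycle")
    if signal == "module_exception":
        return RetryDecision(False, reason="deterministic code-path failure; investigate")
    if signal == "repeated_hard_timeout":  # same target, consecutive scans
        return RetryDecision(False, reason="coverage limit; document and tune the budget")
    return RetryDecision(False, reason="unclassified signal; surface for investigation")
```

Returning a reason with every decision keeps the retry loop observable: the same string can be written to the scan record whether the module was retried or not.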

Stale-job recovery when the worker crashes

Scanner work happens in a background process that can be restarted, redeployed, or crashed by the conditions the scans are running against. A job that was running when the worker disappeared has to be recoverable, or the queue accumulates stuck entries and the next scan cycle inherits the backlog. The recovery loop is independent of the module retry policy because it operates one level up: it returns jobs to the dispatch queue rather than re-running modules in place.

Detect stale jobs

A recovery loop runs on a short interval (commonly 60 seconds) and looks for jobs that have been marked running for longer than a recovery threshold without progress. The threshold is larger than the longest legitimate module execution time plus a safety margin. Jobs that match are candidates for re-dispatch.
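A minimal sketch of the detection step, assuming each job record carries a `state` and a `last_progress_at` timestamp (both hypothetical field names) and that the 60 second safety margin is a placeholder to tune:

```python
from datetime import datetime, timedelta, timezone


def recovery_threshold(module_budgets_s, safety_margin_s=60):
    """Longest legitimate module budget plus a safety margin."""
    return timedelta(seconds=max(module_budgets_s) + safety_margin_s)


def find_stale_jobs(jobs, now, threshold):
    """Return ids of running jobs with no recorded progress for longer than threshold."""
    return [
        job["id"]
        for job in jobs
        if job["state"] == "running" and now - job["last_progress_at"] > threshold
    ]
```

The comparison is strict, so a job sitting exactly at the threshold is left alone for one more pass of the loop.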

Return to pending, do not delete

Stale jobs return to the pending state with the original parameters intact. The next worker dispatch picks them up and runs them fresh. Deleting the job loses the trail; returning it to pending keeps the activity log complete on the original dispatch attempt.

Guard against double-write

If the original worker is still alive but slow, it can finish after the recovery loop has dispatched a replacement. The protection is to anchor the final result to whichever attempt the platform considers active: once recovery has reassigned the job, the original attempt loses the right to write to the job record.

Surface the recovery on the audit trail

Every recovery event is an audit record: the job that was recovered, the original dispatch time, the recovery time, and the worker the job was taken from and reassigned to. Hidden recoveries erode the audit trail because the next reader cannot tell why a job that started at 10:00 finished at 10:35 after what looks like a single dispatch attempt.
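The recovery steps above (return to pending, guard against double-write, log the recovery) can be sketched together with an attempt counter acting as a write token. Everything here is an illustrative assumption: `JobStore` is an in-memory stand-in for a job table, and a production system would need the token check and the write to happen atomically in the datastore.

```python
class StaleAttemptError(Exception):
    """Raised when a superseded attempt tries to write a result."""


class JobStore:
    """In-memory stand-in for a job table plus an append-only audit log."""

    def __init__(self):
        self.jobs = {}
        self.audit_log = []

    def submit(self, job_id, params):
        self.jobs[job_id] = {"state": "pending", "params": dict(params), "attempts": 0,
                             "active_attempt": None, "worker": None,
                             "dispatched_at": None, "result": None}

    def dispatch(self, job_id, worker, now):
        job = self.jobs[job_id]
        job["attempts"] += 1
        job.update(state="running", active_attempt=job["attempts"],
                   worker=worker, dispatched_at=now)
        self._log("dispatch", job_id, attempt=job["attempts"], worker=worker, at=now)
        return job["attempts"]  # the worker keeps this as its write token

    def recover(self, job_id, now):
        """Return a stale job to pending with its parameters intact."""
        job = self.jobs[job_id]
        self._log("stale_recovery", job_id, attempt=job["active_attempt"],
                  worker=job["worker"], dispatched_at=job["dispatched_at"], at=now)
        # Clearing the active attempt revokes the old worker's right to write.
        job.update(state="pending", active_attempt=None, worker=None)

    def write_result(self, job_id, attempt, result, now):
        job = self.jobs[job_id]
        if attempt != job["active_attempt"]:
            self._log("rejected_write", job_id, attempt=attempt, at=now)
            raise StaleAttemptError(f"attempt {attempt} is no longer active for {job_id}")
        job.update(state="done", result=result, active_attempt=None)
        self._log("result", job_id, attempt=attempt, at=now)

    def _log(self, event, job_id, **fields):
        self.audit_log.append({"event": event, "job_id": job_id, **fields})
```

In this sketch the replacement worker is recorded by the dispatch event that follows the recovery, so the log shows both the worker that lost the job and the one that picked it up.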

Recording a partial scan on the engagement

A partial scan is a normal operational outcome. The discipline is making the partial status legible to the next reader rather than absorbing it into the totals. Three pieces of evidence are non-negotiable on the engagement record.

  • Modules attempted: the scan plan that was dispatched against the target. This is the coverage promise the scan was trying to honour.
  • Modules completed: the subset that returned a clean completion status by the scan deadline. This is the coverage that was actually delivered.
  • Modules failed: with per-module failure mode (timeout, error, skipped) and a message or duration that lets the next reader understand what happened. This is the gap between the promise and the delivery.

The three lists together let the engagement record answer the question the auditor and the next tester will both ask: what did this scan actually verify? Without all three, the same partial outcome can be read as a clean scan, a failed scan, or a coverage limit, and the choice is left to interpretation. With all three on the record, the interpretation is anchored to evidence.
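As a sketch of how the three lists might be derived from per-module results, the function below takes the dispatched plan and a mapping of module name to result. The input shape is an assumption, and so is the choice to record a planned module with no result at all as an error with an explanatory message rather than inventing a fifth state.

```python
def summarise_scan(plan, results):
    """Build the attempted / completed / failed lists for the engagement record.

    plan: module names dispatched, in order.
    results: dict of module name -> {"status", "duration_ms", "detail"}.
    """
    completed, failed = [], []
    for module in plan:
        result = results.get(module)
        if result is None:
            # Assumption: a planned module with no result is recorded as an error.
            failed.append({"module": module, "mode": "error", "duration_ms": None,
                           "detail": "no result recorded by the scan deadline"})
        elif result["status"] == "completed":
            completed.append(module)
        else:
            failed.append({"module": module, "mode": result["status"],
                           "duration_ms": result.get("duration_ms"),
                           "detail": result.get("detail")})
    return {"attempted": list(plan), "completed": completed,
            "failed": failed, "is_partial": bool(failed)}
```

Keeping `attempted` as its own list, rather than deriving it from the results, is what makes a module that silently never reported visible as a gap.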

Trend comparison across scans depends on this discipline too. A baseline that includes a partial scan as if it were complete produces a trend line that drifts on coverage change rather than on remediation. The scan baseline and trend comparison guide covers how to separate real change from coverage drift; the prerequisite is faithful per-module status on every scan so coverage drift is visible rather than hidden.
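A small worked example makes the drift concrete. The sketch below compares two scans and reports the change in findings twice: once naively across all modules, and once restricted to modules that completed in both scans. The input shape (module name mapped to a status and a finding count) is an assumption for illustration.

```python
def compare_scans(baseline, current):
    """Separate coverage drift from finding-count change.

    baseline / current: dict of module name -> {"status": str, "findings": int}.
    """
    done_base = {m for m, r in baseline.items() if r["status"] == "completed"}
    done_cur = {m for m, r in current.items() if r["status"] == "completed"}
    comparable = done_base & done_cur
    raw_delta = (sum(r["findings"] for r in current.values())
                 - sum(r["findings"] for r in baseline.values()))
    comparable_delta = sum(current[m]["findings"] - baseline[m]["findings"]
                           for m in comparable)
    return {"coverage_lost": sorted(done_base - done_cur),
            "coverage_gained": sorted(done_cur - done_base),
            "finding_delta_raw": raw_delta,
            "finding_delta_comparable": comparable_delta}


baseline = {"tls": {"status": "completed", "findings": 2},
            "sqli": {"status": "completed", "findings": 5},
            "xss": {"status": "completed", "findings": 3}}
current = {"tls": {"status": "completed", "findings": 1},
           "sqli": {"status": "timeout", "findings": 0},
           "xss": {"status": "completed", "findings": 3}}
report = compare_scans(baseline, current)
```

In this example the raw total drops by six findings, but five of those belong to a module that timed out; on the modules that completed in both scans the change is a single finding, and `coverage_lost` names the module responsible for the rest.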

Common operational mistakes

The mistakes below show up across programmes that operate scanners at scale. Each one is a failure to separate the four module states or to surface partial completion faithfully.

  • Collapsing all four states into pass or fail: the scan record reports a single completion percentage and loses the distinction between a timed-out module, an errored module, a skipped module, and a completed module. Auditors get totals; testers get nothing.
  • Retrying every failure indiscriminately: the retry loop spends the budget on persistent failures and never converges. The scan finishes long after the cadence window and the next scheduled scan inherits a backlog.
  • Hiding partial scans inside finding totals: the engagement dashboard shows two hundred findings on a scan that actually completed sixty per cent of its modules. The other forty per cent were never verified, but the totals look identical to a complete scan.
  • Treating timeouts and errors as the same outcome: a timeout often needs a budget extension or a probe-space cap; an error often needs a retry with backoff. Treating them the same prevents either fix from landing.
  • Skipping modules without a recorded rationale: a module that was skipped because the plan did not include it is a procurement decision; a module that was skipped because the precondition failed is a configuration issue. The skip rationale belongs on the record so the two cases can be distinguished.
  • Not surfacing stale-job recovery on the audit trail: a job that ran from 10:00 to 10:35 on a single dispatch looks the same as a job that ran from 10:00 to 10:35 because the original worker stalled and the job was recovered at 10:15. The difference matters at audit; the record needs to reflect it.

The shared root cause is treating module-level status as a scanner-internal detail rather than as the operating evidence vulnerability scanning controls actually require. Module status is the chain that connects scan dispatch to closed finding; losing the chain at any link breaks the audit trail.

How SecPortal handles module failures and partial scans

SecPortal scan modules each return a structured result with four possible states (completed, error, timeout, skipped), a duration in milliseconds, and the findings the module produced (if any). The orchestrator records the status per module on the scan record, so the partial-completion picture is queryable on the engagement rather than hidden inside summary totals.

The scan worker enforces hard module timeouts to prevent infinite hangs; when a module exceeds its budget, the worker records a timeout status with the elapsed duration and moves to the next module rather than blocking the scan. Stale job recovery returns scan jobs that have been running too long without progress back to pending, so a fresh worker dispatch can pick them up after a worker crash or a deployment cycle. Retry with backoff handles transient errors before the module status flips to error on the scan record.

The external scanning feature and the authenticated scanning feature both surface module status on the scan record so partial completion is visible at the engagement level. Findings management treats partial scans as first-class outcomes: findings from completed modules land normally, and the modules that did not complete stay attached to the scan record as evidence of coverage rather than getting discarded.

The activity log captures every scan dispatch, module completion, retry, and stale-job recovery so the audit trail from scan plan to closed finding is reproducible. The scanner evidence chain guide covers how per-module status feeds the seven-layer evidence chain from scan execution to closed finding.

Upstream of module execution, the scan target validation and authorisation guide covers the three control points (verified ownership, legal attestation, platform blocklist) that gate whether a module can dispatch at all. A module that errors because the target was not authorised is not the same as a module that timed out against an authorised target; the validation layer makes the distinction recordable.

Operational checklist

Scan plan

  • Every scan has an explicit module list recorded at dispatch.
  • Module budgets are sized to the slowest legitimate execution time for that module against the target class.
  • Probe-space limits are set so unbounded enumeration cannot exhaust the budget.
  • The skip rationale is recordable per module so intentional skips and configuration failures are distinguishable.

Module execution

  • Each module returns a status, a duration, and a result payload.
  • Hard timeouts at the worker layer prevent any single module from blocking the scan.
  • Retry with backoff handles transient errors up to a small ceiling.
  • Persistent failures stop retrying and surface on the scan record for investigation.

Stale-job recovery

  • A recovery loop runs on a short interval against running jobs that have stalled.
  • Recovery returns the job to pending; the original worker loses the right to write the result.
  • Every recovery event is logged so the audit trail reflects the dispatch history.
  • The recovery threshold is larger than the longest legitimate module budget plus a safety margin.

Partial-scan record

  • Modules attempted, completed, and failed are all visible on the scan record.
  • Per-module failure mode (timeout, error, skipped) is queryable with the message or duration.
  • Partial scans do not get treated as baselines until failed modules have been re-run or accepted.
  • Trend comparison surfaces coverage drift as a separate signal from finding count change.

Adjacent disciplines

Module failure handling sits next to several other scanner disciplines, each of which covers a different layer of the same problem.

  • Authenticated scanner failure modes: a specific class of module failure where the session lapsed mid-execution. The authentication layer fails before the detection layer can run.
  • Scanner rate limiting and throttling: the upstream control that prevents the scan from triggering target-side rate limits that then surface as module errors.
  • Scanner blocking and WAF allowlisting: the network-layer control that prevents a partial-block scenario from being read as scan completion when the WAF was silently dropping probes.
  • Scanner coverage and limits: the coverage envelope each scanner class actually produces. Persistent module timeouts are coverage limits, and they belong on the coverage record rather than the failure record.
  • Scanner result triage workflow: the downstream workflow that turns module output into a findings record. Triage cannot run cleanly against a partial scan whose status is not on the record.
  • Security tool coverage overlap research: the broader analytical frame for why running more than one scanner against the same target changes the partial-scan calculus.

Scope and limitations

Module failure handling depends on the scanner exposing a structured status per module. Scanners that return a single completion flag without per-module breakdown cannot be retrofitted into a faithful partial-scan record from outside; the only recoverable signal is the absence of expected findings, which is too lossy to drive a remediation queue. Programmes that import third-party scanner output without a module-status mapping inherit this limitation and have to compensate at the triage layer rather than the scan layer.

Recovery and retry can mask underlying issues if they run without observability. A retry loop that quietly resolves the same transient failure on every scan cycle against the same target is hiding a real condition (a flaky network path, a slow backend, an unreliable upstream) that the operations team should see. The discipline is logging retries and recoveries as first-class events on the audit trail, not as silent self-healing. The next reader needs to know how many retries the scan needed to converge, not just that it did.

Run scans on a record that surfaces partial completion as evidence

SecPortal records module-level status per scan, enforces hard module timeouts, recovers stale jobs, and keeps the partial-scan trail visible on the engagement record.