What is a scanner module failure in operational terms?

A scanner module is a single detection unit that runs against a target as part of a larger scan: an SSL probe, a port sweep, a security header check, a subdomain enumeration, or an authenticated injection probe. A module failure is any case where the module did not return a clean completion status for the target. The four canonical states are completed (returned findings or a clean result), error (raised an exception or hit an unexpected condition), timeout (exceeded the hard execution budget), and skipped (deliberately not run because of scope, plan limits, or precondition failure). Treating the three non-completed states as the same outcome is the most common operational mistake: an errored module needs investigation, a timed-out module often needs a retry at a slower rate, a skipped module needs the skip rationale on the record, and only the completed module produces evidence that lands as findings.

Why do scanner modules fail at all in a well-configured scan?

Five mechanisms generate module failures even when the scanner, the target, and the credentials are all configured correctly. Network conditions vary inside a single scan window: a target that responds in 200 ms during one probe can drop to 30 seconds during another because of a transient backend pause, a downstream API call, or a coincidental traffic spike. Targets enforce protective behaviour mid-scan: a WAF or rate limiter activates after a threshold and starts dropping connections silently. Modules that depend on external resources fail when those resources are unreachable: a DNS lookup against a misconfigured resolver, a CVE database query against a slow upstream, a vulnerability advisory fetch against a temporarily unreachable mirror. Authenticated modules fail when the session lapses mid-execution because the application rotated the CSRF token or expired the cookie. Resource constraints inside the scan worker hit hard limits when a target produces an unexpectedly large response or a long subdomain list expands the probe space faster than the budget. None of these are scanner bugs in the strict sense; they are the consequence of running active detection against systems whose state changes during the scan.

What is the difference between a module timeout and a module error?

A timeout is the case where the module exceeded a hard execution budget without returning. The scan worker bounds execution by clock time, not by completion, so a timeout tells you the module ran longer than allowed; it does not tell you whether the module was making progress or stuck. An error is the case where the module raised an exception or returned a status the scan worker classifies as non-completion before the budget elapsed. Errors carry a message; timeouts carry a duration. The recovery is different in each case: timeouts usually retry against a longer budget or a slower rate, errors usually retry against the same conditions because the underlying cause may be transient. A module that times out every cycle on the same target is a coverage limit, not a transient failure; a module that errors intermittently against the same target is a candidate for retry with backoff.
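The distinction can be made concrete with a small wrapper that bounds a module by wall-clock time and classifies the outcome. This is a minimal sketch rather than any particular scanner's implementation: `ModuleOutcome` and `run_with_budget` are hypothetical names, and the module is assumed to be an `async` callable.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class ModuleOutcome:
    status: str                    # "completed", "error", or "timeout"
    duration_ms: int
    message: Optional[str] = None  # errors carry a message; timeouts do not


async def run_with_budget(module: Callable[[], Awaitable[object]],
                          budget_s: float) -> ModuleOutcome:
    """Run one module under a hard wall-clock budget and classify the outcome."""
    started = time.monotonic()
    try:
        await asyncio.wait_for(module(), timeout=budget_s)
        status, message = "completed", None
    except asyncio.TimeoutError:
        # The budget elapsed. Only the duration is known, not whether the
        # module was making progress. (On Python 3.11+ this also catches a
        # TimeoutError raised inside the module itself.)
        status, message = "timeout", None
    except Exception as exc:
        # The module failed before the budget elapsed; keep the message.
        status, message = "error", f"{type(exc).__name__}: {exc}"
    elapsed_ms = int((time.monotonic() - started) * 1000)
    return ModuleOutcome(status, elapsed_ms, message)
```

A caller would run it as `asyncio.run(run_with_budget(some_module, budget_s=30))` and store the returned status, duration, and message on the scan record.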

When should a failed module retry, and when should it stop?

A defensible retry policy distinguishes transient from persistent failure. Transient failures are retried with backoff up to a small ceiling (commonly three attempts with exponential spacing); if the ceiling is reached, the module is marked as errored against the scan. The retry uses the same target, the same credentials, and the same parameters because the underlying cause is assumed to be conditions that change between attempts. Persistent failures stop retrying because the next attempt produces the same outcome: a module hitting a target that returns 403 on every request, a module timing out on every cycle at the same step, a module that errors with the same exception on every retry. The cleanest implementation pattern: retry on network errors, transient HTTP status codes (502, 503, 504), and timeouts up to a small ceiling; do not retry on authentication failures, hard target blocks (HTTP 401 or 403 after a successful login), or exceptions that originate inside the module code path itself, because each of those needs investigation, not repetition.

What is a partial scan and how is it different from a failed scan?

A partial scan is a scan where some modules completed cleanly and others did not. The scan record carries two lists: the modules attempted in the scan plan, and the modules that returned a completion status by the scan deadline. When those two lists differ, the scan is partial. A partial scan still produces findings (from the modules that completed) and still produces evidence (from the modules that errored or timed out, with their failure mode recorded). A failed scan is the case where no module returned a usable result, usually because the target was unreachable from the start, the credentials were invalid, or the scan was cancelled before any module finished. Partial scans are normal operational outcomes in any realistic scanning programme; failed scans are diagnostic events that need investigation. Conflating the two produces a workflow that either discards usable partial findings or treats failed scans as evidence of coverage.

How should a partial scan be represented in the findings record?

The partial scan needs three pieces of evidence on the record so the next reader can interpret what the scan represents. First, the modules-completed list (what was scanned and produced clean results). Second, the modules-failed list with the failure mode for each entry (timeout, error, skipped, with the message or duration). Third, the modules-attempted list (what the scan plan called for) so the gap between attempted and completed is queryable. Without all three, a partial scan looks identical to a completed scan in the findings totals but tells the auditor and the next tester nothing about what was actually verified. The convention that works in practice: surface partial completion explicitly in the engagement record, never hide it in the totals, and never treat a partial scan as a baseline for trend comparison until the failed modules have been re-run cleanly or the failure mode has been documented as accepted.

What happens when a scan worker crashes mid-execution?

A scan that was in flight when the worker crashed has to be recoverable, or the next scan cycle inherits a stuck job that blocks the queue. The defensible pattern is stale job recovery: every job that has been marked running for longer than a recovery threshold without making progress is a candidate for re-dispatch. The recovery loop runs on a short interval (commonly 60 seconds) and looks for running jobs whose last progress is older than the threshold. When it finds one, it returns the job to the pending state so a fresh worker can pick it up. The recovery has to be safe against double-execution: if the original worker is still alive and finishes after the recovery dispatched a replacement, the replacement's result is the one that lands because the original worker no longer has the right to write to the job record. The recovery threshold has to be larger than the longest legitimate module execution time, otherwise legitimate slow modules get reset and re-run wastefully.

How do compliance frameworks read module failures and partial scans?

Compliance evidence reads scan completion as proof of scope coverage. A scan that ran without recording its module-level status looks complete in summary totals and tells the auditor nothing about what was actually verified. PCI DSS v4.0 Requirement 11.3 expects evidence that scanning runs at the required cadence against the in-scope environment; partial scans are acceptable so long as the gap is documented and remediated within the assessment window. ISO 27001 Annex A 8.8 reads vulnerability detection as an operating control; the operating evidence is the scan execution record, the module status, and the remediation activity it drives, all of which depend on partial scans being represented faithfully. SOC 2 CC7.1 expects evidence that the entity detects vulnerabilities; the operating evidence is per-module completion status across the observation window. NIST SP 800-53 RA-5 reads vulnerability scanning as a control whose evidence is the scan execution log; partial scans become evidence the moment the failure modes are captured on the record rather than hidden inside aggregate counts. The shared expectation across frameworks: faithful per-module completion status is evidence, hidden partial completion is a finding waiting to be raised.

Scanner Module Failures: Timeouts, Errors, and Partial Scans

Every scanning programme that runs more than a handful of targets eventually hits the same operational reality: some modules complete cleanly, some time out, some error, and some are skipped. The scan is partial. The question is whether the partial completion gets surfaced as evidence on the engagement record or hidden inside summary totals that look identical to a clean scan. Hidden partials produce two failure modes at once: testers re-verify findings the scanner never actually checked, and auditors read coverage that was never delivered.

This guide covers the four module states (completed, error, timeout, skipped), why each one happens in production, how to design retry and recovery so transient failures do not turn into persistent gaps, how to record a partial scan so it survives audit, and how to read module-level status as the operating evidence that vulnerability scanning controls actually require. The audience is internal security, AppSec, vulnerability management, and security engineering teams that run scans on cadence against assets that move.

Four module states and what each one means

A scanner module is a single detection unit inside a larger scan. The scan plan lists the modules the platform intends to run against the target; the scan execution records what each module actually returned. Treating module completion as binary (the scan finished or it did not) collapses information that the findings record, the audit, and the next scan cycle all depend on. The four canonical states below are what a defensible scan record carries per module.

Module status	What it means	Typical next step
Completed	The module ran against the target, evaluated its detection logic, and returned a result with a duration. Findings (if any) land on the engagement.	Triage findings as normal. The module contributes to coverage on the next baseline comparison.
Error	The module raised an exception, hit a network failure, or returned a non-completion status before the time budget elapsed. The cause is recorded as an error message.	Retry with backoff for transient cases; surface the failure on the scan record for persistent cases that need investigation.
Timeout	The module exceeded its hard execution budget without returning. The recorded duration matches the timeout boundary, not the actual completion time.	Investigate whether the timeout reflects a slow target, a rate-limited response, or a coverage limit that needs the budget extended.
Skipped	The module did not run because a precondition failed, scope excluded it, or the plan tier did not include it. The skip rationale is recorded.	Confirm the skip is intentional. If the skip was driven by configuration drift, fix the configuration and re-run the module.

The four states are not a hierarchy; each one carries different information about what happened to the scan. Reducing them to a single completion percentage loses the distinction between a module that errored intermittently (retry candidate), a module that timed out repeatedly on the same target (coverage limit), a module that was skipped because the plan did not include it (procurement decision), and a module that completed cleanly (evidence). The scan record carries all four per module so the interpretation stays accurate.
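As an illustration of what "carries all four per module" can look like in data, the sketch below models one result per module and counts by state rather than collapsing to a percentage. The names `ModuleStatus`, `ModuleResult`, and `status_counts` are hypothetical, and the rule that errors and skips must carry a detail string is an assumption drawn from the table above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ModuleStatus(Enum):
    COMPLETED = "completed"
    ERROR = "error"
    TIMEOUT = "timeout"
    SKIPPED = "skipped"


@dataclass
class ModuleResult:
    module: str
    status: ModuleStatus
    duration_ms: int = 0
    findings: List[str] = field(default_factory=list)
    detail: Optional[str] = None  # error message or skip rationale

    def __post_init__(self) -> None:
        # An error or a skip is only interpretable with its message or rationale.
        needs_detail = (ModuleStatus.ERROR, ModuleStatus.SKIPPED)
        if self.status in needs_detail and not self.detail:
            raise ValueError(f"{self.module}: {self.status.value} needs a detail")


def status_counts(results: List[ModuleResult]) -> Dict[str, int]:
    """Count modules per state so no state is folded into another."""
    counts = {status.value: 0 for status in ModuleStatus}
    for result in results:
        counts[result.status.value] += 1
    return counts
```

Keeping the per-state counts next to the per-module records means a summary view can still be produced without losing the distinction between a retry candidate, a coverage limit, and a procurement decision.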

Where module failures come from in production

Failures are structural in any active scanning programme that runs against assets whose state changes during the scan. The five mechanisms below generate the failures that most programmes spend their investigation time on, even when the scanner, the target, and the credentials are all configured correctly.

Target latency variance mid-scan

A target that responds in 200 ms during the SSL handshake can take 30 seconds for a slow backend query during the same scan. The module that runs the slow probe hits the per-module timeout while the SSL module records a clean completion. The scan is partial because of the same target behaving differently to two modules. The signal is not a scanner bug; it is the scanner accurately representing what the target actually did.

Mid-scan WAF or rate-limit activation

A web application firewall that allows low-volume traffic activates when probe volume crosses a threshold. The first half of the scan completes; the second half gets a stream of 403 or 503 responses. Modules that ran early land clean; modules that ran late record errors. The scan record needs to surface the activation point or the next reader cannot tell which findings were verified against a live target and which were silently blocked.

External-dependency unreachability

A module that resolves CVE metadata against an external advisory database fails if the upstream is slow or unreachable. The scanner cannot detect the underlying vulnerability without the metadata, even though the target itself is fine. The failure mode is external to the target but visible on the module status, and the fix is retry against a healthier upstream rather than rescan against the target.

Session expiry during authenticated modules

Authenticated scans depend on a session that the target controls. Sessions expire on the schedule the application enforces, not on the schedule the scanner expects. A module that runs for forty minutes can find its session invalidated halfway through and start receiving login pages instead of authenticated responses. The authenticated coverage page covers the session diagnostics in detail; here, the relevant point is that session expiry mid-module is a routine cause of partial scans, not an exotic failure.

Resource-constrained probe expansion

A subdomain enumeration that hits a wildcard DNS response expands the candidate list faster than the time budget can absorb. The module records a timeout because the work was bounded by clock time, not by completion. The recovery is bounding the probe space (capping subdomain depth, capping endpoint depth) rather than extending the module budget against an unbounded search.
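One way to bound the probe space is to cap both depth and count before any probing starts, and to record that the cap was hit. The sketch below is illustrative only: `bounded_candidates` and its parameters are hypothetical, and depth is assumed to mean the number of labels below the apex domain.

```python
from typing import Iterable, List, Tuple


def bounded_candidates(candidates: Iterable[str], apex: str,
                       max_depth: int, max_count: int) -> Tuple[List[str], bool]:
    """Keep at most max_count names under apex, no deeper than max_depth labels.

    Returns the kept names and a flag saying whether the count cap cut the
    list short, so the truncation can be put on the scan record.
    """
    suffix = "." + apex
    kept: List[str] = []
    truncated = False
    for name in candidates:
        if not name.endswith(suffix):
            continue  # outside the apex, so out of scope
        depth = name[: -len(suffix)].count(".") + 1
        if depth > max_depth:
            continue  # deeper than the configured subdomain depth
        if len(kept) >= max_count:
            truncated = True
            break     # stop consuming; the source may be unbounded
        kept.append(name)
    return kept, truncated
```

Because the function stops consuming its input once the cap is reached, it also works when a wildcard DNS response makes the candidate stream effectively endless.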

Retry policy: transient versus persistent

Retry is the cheapest recovery and the most often misconfigured. A retry loop that re-runs every failure mode against the same conditions burns the scan budget on persistent failures and never catches up. A retry loop that does not retry transient failures discards usable scans for conditions that have already changed by the second attempt. The pragmatic split below survives most programmes.

Failure signal	Retry policy	Rationale
Network timeout or 5xx	Retry up to three attempts with exponential backoff (for example 2s, 8s, 32s).	Transient failures resolve when conditions change; bounded retry catches most without burning the budget.
HTTP 429 with Retry-After	Retry once after the indicated interval; then halve the scan rate before continuing.	The target has signalled the rate is too high; respect the signal and adapt rather than burning retries.
HTTP 401 or 403 after a successful login	Do not retry. Fail the module and surface the authentication failure on the scan record.	The session is gone; another attempt produces the same outcome. The fix is the credential lifecycle, not the retry loop.
Module exception in scanner code	Do not retry. Record the exception and the input that triggered it.	A code-path failure repeats deterministically; investigation is the next step, not repetition.
Hard timeout against the same target on consecutive scans	Stop retrying. Record the module as a coverage limit against this target and tune the budget separately.	A timeout that happens every cycle is not transient; it is a coverage decision that needs documenting rather than retrying.

The shared principle is matching the retry policy to what the failure signal can tell you about conditions changing between attempts. Retry helps when the underlying conditions are transient and shift between attempts; retry hurts when the underlying cause is structural and repeats deterministically.
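The per-attempt rows of the table can be sketched as a single decision function. This is one possible reading, not a reference implementation: `Failure`, `retry_delay`, and the failure `kind` labels are hypothetical, the 2s/8s/32s schedule is the example from the table, and falling back to the first backoff step when a 429 arrives without `Retry-After` is an assumption. The cross-scan rule (the same timeout on consecutive scans) needs scan history and is not covered here.

```python
from dataclasses import dataclass
from typing import Optional

BACKOFF_SECONDS = (2, 8, 32)  # example schedule: three retries at most


@dataclass
class Failure:
    kind: str                             # "network", "timeout", "http", "exception"
    http_status: Optional[int] = None
    retry_after_s: Optional[float] = None


def retry_delay(failure: Failure, retries_done: int) -> Optional[float]:
    """Return seconds to wait before the next attempt, or None to stop."""
    if failure.kind == "exception":
        return None  # code-path failure: repeats deterministically
    if failure.kind == "http":
        status = failure.http_status
        if status == 429:
            if retries_done >= 1:
                return None  # retry once only; the caller should then slow down
            if failure.retry_after_s is not None:
                return failure.retry_after_s
            return BACKOFF_SECONDS[0]
        if status is None or not 500 <= status <= 599:
            return None  # includes 401/403: the session or access is gone
    elif failure.kind not in ("network", "timeout"):
        return None  # unknown signal: surface it rather than repeat it
    if retries_done >= len(BACKOFF_SECONDS):
        return None  # ceiling reached; mark the module as errored
    return BACKOFF_SECONDS[retries_done]
```

A worker loop would call `retry_delay` after each failed attempt, sleep for the returned interval, and record the failure on the scan once it returns `None`. Halving the scan rate after a 429 is left to the caller.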

Stale-job recovery when the worker crashes

Scanner work happens in a background process that can be restarted, redeployed, or crashed by the conditions the scans are running against. A job that was running when the worker disappeared has to be recoverable, or the queue accumulates stuck entries and the next scan cycle inherits the backlog. The recovery loop is independent of the module retry policy because it operates one level up: it returns jobs to the dispatch queue rather than re-running modules in place.

Detect stale jobs

A recovery loop runs on a short interval (commonly 60 seconds) and looks for jobs that have been marked running for longer than a recovery threshold without progress. The threshold is larger than the longest legitimate module execution time plus a safety margin. Jobs that match are candidates for re-dispatch.
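The detection step reduces to a filter over running jobs. The sketch below assumes each job records the time of its last progress; `Job`, `last_progress_at`, and `find_stale_jobs` are hypothetical names, and a real queue would run the same predicate as a database query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Job:
    job_id: str
    state: str                  # "pending", "running", "completed", ...
    last_progress_at: datetime


def find_stale_jobs(jobs: List[Job], now: datetime,
                    longest_module_budget: timedelta,
                    safety_margin: timedelta) -> List[Job]:
    """Return running jobs with no progress for longer than the threshold."""
    threshold = longest_module_budget + safety_margin
    return [
        job for job in jobs
        if job.state == "running" and now - job.last_progress_at > threshold
    ]
```

Deriving the threshold from the longest module budget plus a margin, rather than hard-coding it, keeps the two settings from drifting apart when budgets are tuned.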

Return to pending, do not delete

Stale jobs return to the pending state with the original parameters intact. The next worker dispatch picks them up and runs them fresh. Deleting the job loses the trail; returning it to pending keeps the original dispatch attempt in the activity log.

Guard against double-write

If the original worker is still alive but slow, it can finish after the recovery loop dispatched a replacement. The guard is to anchor the final result to whichever attempt the platform considers active; the original attempt loses the right to write to the job record once recovery has reassigned it.
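A common way to implement this guard is an attempt number that acts as a fencing token: each dispatch gets a new number, and a write only lands if it carries the active one. The in-memory sketch below is illustrative; `JobRecord`, `dispatch`, `recover`, and `write_result` are hypothetical names, and in a real system the check in `write_result` would be a single conditional update in the job store rather than a Python `if`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    job_id: str
    state: str = "pending"
    active_attempt: int = 0
    result: Optional[str] = None


def dispatch(job: JobRecord) -> int:
    """Hand the job to a worker and return that worker's attempt number."""
    job.active_attempt += 1
    job.state = "running"
    return job.active_attempt


def recover(job: JobRecord) -> None:
    """Return a stale job to pending; the previous attempt can no longer write."""
    job.state = "pending"


def write_result(job: JobRecord, attempt: int, result: str) -> bool:
    """Accept the result only from the attempt that is currently active."""
    if job.state != "running" or attempt != job.active_attempt:
        return False  # late write from a superseded or recovered attempt
    job.result = result
    job.state = "completed"
    return True
```

The rejected write is itself worth logging, since it shows the original worker was slow rather than dead.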

Surface the recovery on the audit trail

Every recovery event is an audit record: the job that was recovered, the original dispatch time, the recovery time, and the worker the job was reassigned to. Hidden recoveries erode the audit trail because the next reader cannot tell why a job that started at 10:00 finished at 10:35, apparently after a single dispatch attempt.

Recording a partial scan on the engagement

A partial scan is a normal operational outcome. The discipline is making the partial status legible to the next reader rather than absorbing it into the totals. Three pieces of evidence are non-negotiable on the engagement record.

Modules attempted: the scan plan that was dispatched against the target. This is the coverage promise the scan was trying to honour.
Modules completed: the subset that returned a clean completion status by the scan deadline. This is the coverage that was actually delivered.
Modules failed: with per-module failure mode (timeout, error, skipped) and a message or duration that lets the next reader understand what happened. This is the gap between the promise and the delivery.

The three lists together let the engagement record answer the question the auditor and the next tester will both ask: what did this scan actually verify? Without all three, the same partial outcome can be read as a clean scan, a failed scan, or a coverage limit, and the choice is left to interpretation. With all three on the record, the interpretation is anchored to evidence.
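The three lists can be derived mechanically from the scan plan and the per-module results. The sketch below is a minimal illustration under stated assumptions: `summarise_scan` is a hypothetical name, results are assumed to be plain dicts with `status` and `detail` keys, and `no_result` is an invented label for a planned module that left no record at all.

```python
from typing import Dict, List


def summarise_scan(attempted: List[str], results: Dict[str, dict]) -> dict:
    """Build the attempted / completed / failed view for one scan."""
    completed: List[str] = []
    failed: List[dict] = []
    for module in attempted:
        result = results.get(module)
        if result is not None and result["status"] == "completed":
            completed.append(module)
        else:
            failed.append({
                "module": module,
                # "no_result" marks a planned module with no record (assumption)
                "status": result["status"] if result else "no_result",
                "detail": result.get("detail") if result else None,
            })
    if not failed:
        outcome = "complete"
    elif not completed:
        outcome = "failed"   # no module returned a usable result
    else:
        outcome = "partial"
    return {"outcome": outcome, "attempted": list(attempted),
            "completed": completed, "failed": failed}
```

Keeping `attempted` in the output, even though it can be recomputed, makes the gap between plan and delivery directly queryable.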

Trend comparison across scans depends on this discipline too. A baseline that includes a partial scan as if it were complete produces a trend line that drifts on coverage change rather than on remediation. The scan baseline and trend comparison guide covers how to separate real change from coverage drift; the prerequisite is faithful per-module status on every scan so coverage drift is visible rather than hidden.
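One simple way to keep coverage drift apart from finding change is to compare finding counts only on modules that completed in both scans, and to report the coverage difference separately. The sketch below illustrates that idea and is not the method of the guide referenced above; `compare_scans` and the dict layout are hypothetical.

```python
from typing import Dict, List


def compare_scans(baseline: Dict, current: Dict) -> Dict[str, object]:
    """Compare two scans shaped as {"completed": [...], "findings": {module: count}}."""
    base_done = set(baseline["completed"])
    cur_done = set(current["completed"])
    shared = base_done & cur_done
    # Only modules that completed in both scans can show real change.
    delta = sum(
        current["findings"].get(m, 0) - baseline["findings"].get(m, 0)
        for m in shared
    )
    lost: List[str] = sorted(base_done - cur_done)
    gained: List[str] = sorted(cur_done - base_done)
    return {
        "coverage_lost": lost,
        "coverage_gained": gained,
        "finding_delta_on_shared": delta,
    }
```

For example, if a port sweep with five findings completed in the baseline but timed out in the current scan, raw totals would show a large drop, while this comparison reports the module under `coverage_lost` and leaves its findings out of the delta.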

Common operational mistakes

The mistakes below show up across programmes that operate scanners at scale. Each one is a failure to separate the four module states or to surface partial completion faithfully.

Collapsing all four states into pass or fail: the scan record reports a single completion percentage and loses the distinction between a timed-out module, an errored module, a skipped module, and a completed module. Auditors get totals; testers get nothing.
Retrying every failure indiscriminately: the retry loop spends the budget on persistent failures and never converges. The scan finishes long after the cadence window and the next scheduled scan inherits a backlog.
Hiding partial scans inside finding totals: the engagement dashboard shows two hundred findings on a scan that actually completed sixty per cent of its modules. The other forty per cent were never verified, but the totals look identical to a complete scan.
Treating timeouts and errors as the same outcome: a timeout often needs a budget extension or a probe-space cap; an error often needs a retry with backoff. Treating them the same prevents either fix from landing.
Skipping modules without a recorded rationale: a module that was skipped because the plan did not include it is a procurement decision; a module that was skipped because the precondition failed is a configuration issue. The skip rationale belongs on the record so the two cases can be distinguished.
Not surfacing stale-job recovery on the audit trail: a job that ran from 10:00 to 10:35 because of a single dispatch looks the same as a job that ran from 10:00 to 10:35 because the original worker was recovered at 10:15. The audit difference matters; the record needs to reflect it.

The shared root cause is treating module-level status as a scanner-internal detail rather than as the operating evidence vulnerability scanning controls actually require. Module status is the chain that connects scan dispatch to closed finding; losing the chain at any link breaks the audit trail.

How SecPortal handles module failures and partial scans

SecPortal scan modules each return a structured result with four possible states (completed, error, timeout, skipped), a duration in milliseconds, and the findings the module produced (if any). The orchestrator records the status per module on the scan record, so the partial-completion picture is queryable on the engagement rather than hidden inside summary totals.

The scan worker enforces hard module timeouts to prevent infinite hangs; when a module exceeds its budget, the worker records a timeout status with the elapsed duration and moves to the next module rather than blocking the scan. Stale job recovery returns scan jobs that have been running too long without progress back to pending, so a fresh worker dispatch can pick them up after a worker crash or a deployment cycle. Retry with backoff handles transient errors before the module status flips to error on the scan record.

The external scanning feature and the authenticated scanning feature both surface module status on the scan record so partial completion is visible at the engagement level. Findings management treats partial scans as first-class outcomes: findings from completed modules land normally, and the modules that did not complete stay attached to the scan record as evidence of coverage rather than getting discarded.

The activity log captures every scan dispatch, module completion, retry, and stale-job recovery so the audit trail from scan plan to closed finding is reproducible. The scanner evidence chain guide covers how per-module status feeds the seven-layer evidence chain from scan execution to closed finding.

Upstream of module execution, the scan target validation and authorisation guide covers the three control points (verified ownership, legal attestation, platform blocklist) that gate whether a module can dispatch at all. A module that errors because the target was not authorised is not the same as a module that timed out against an authorised target; the validation layer makes the distinction recordable.

Operational checklist

Scan plan

Every scan has an explicit module list recorded at dispatch.
Module budgets are sized to the slowest legitimate execution time for that module against the target class.
Probe-space limits are set so unbounded enumeration cannot exhaust the budget.
The skip rationale is recordable per module so intentional skips and configuration failures are distinguishable.

Module execution

Each module returns a status, a duration, and a result payload.
Hard timeouts at the worker layer prevent any single module from blocking the scan.
Retry with backoff handles transient errors up to a small ceiling.
Persistent failures stop retrying and surface on the scan record for investigation.

Stale-job recovery

A recovery loop runs on a short interval against running jobs that have stalled.
Recovery returns the job to pending; the original worker loses the right to write the result.
Every recovery event is logged so the audit trail reflects the dispatch history.
The recovery threshold is larger than the longest legitimate module budget plus a safety margin.

Partial-scan record

Modules attempted, completed, and failed are all visible on the scan record.
Per-module failure mode (timeout, error, skipped) is queryable with the message or duration.
Partial scans do not get treated as baselines until failed modules have been re-run or accepted.
Trend comparison surfaces coverage drift as a separate signal from finding count change.

Adjacent disciplines

Module failure handling sits next to several other scanner disciplines, each of which covers a different layer of the same problem.

Authenticated scanner failure modes: a specific class of module failure where the session lapsed mid-execution. The authentication layer fails before the detection layer can run.
Scanner rate limiting and throttling: the upstream control that prevents the scan from triggering target-side rate limits that then surface as module errors.
Scanner blocking and WAF allowlisting: the network-layer control that prevents a partial-block scenario from being read as scan completion when the WAF was silently dropping probes.
Scanner coverage and limits: the coverage envelope each scanner class actually produces. Persistent module timeouts are coverage limits, and they belong on the coverage record rather than the failure record.
Scanner result triage workflow: the downstream workflow that turns module output into a findings record. Triage cannot run cleanly against a partial scan whose status is not on the record.
Security tool coverage overlap research: the broader analytical frame for why running more than one scanner against the same target changes the partial-scan calculus.

Scope and limitations

Module failure handling depends on the scanner exposing a structured status per module. Scanners that return a single completion flag without per-module breakdown cannot be retrofitted into a faithful partial-scan record from outside; the only recoverable signal is the absence of expected findings, which is too lossy to drive a remediation queue. Programmes that import third-party scanner output without a module-status mapping inherit this limitation and have to compensate at the triage layer rather than the scan layer.

Recovery and retry can mask underlying issues if they run without observability. A retry loop that quietly resolves the same transient failure on every scan cycle against the same target is hiding a real condition (a flaky network path, a slow backend, an unreliable upstream) that the operations team should see. The discipline is logging retries and recoveries as first-class events on the audit trail, not as silent self-healing. The next reader needs to know how many retries the scan needed to converge, not just that it did.
