Vulnerability

Unbounded Consumption in LLM Applications
detect, prioritise, remediate

Unbounded consumption (OWASP LLM10:2025) is the vulnerability class where an LLM-backed application lets a caller burn more compute, more tokens, more provider dollars, or more inference time than the feature design intended. It covers denial of wallet, recursive tool-call explosion, output-length abuse, retrieval-window manipulation, inference-as-a-service abuse, model extraction through repeated probing, concurrency saturation, and runaway background inference. The fix is operational: budget every dimension, cap parameters server-side, separate provider keys per tenant, and log per-call token, cost, latency, and identity on the operating record.


Severity

High

CWE ID

CWE-400

OWASP Top 10

LLM10:2025 - Unbounded Consumption

CVSS 3.1 Score

7.5

What is unbounded consumption in LLM applications?

Unbounded consumption is the vulnerability class where an LLM-backed application lets a caller spend more compute, more tokens, more provider money, or more inference time than the feature design intended. The attacker rarely needs to break the model. They submit prompts that force long generations, recursive tool chains, large retrieval windows, or repeated extraction queries until the bill, the rate budget, the GPU minutes, or the user-facing response time collapses. The 2025 OWASP Top 10 for Large Language Model Applications lists the class as LLM10:2025 Unbounded Consumption and treats it as a renaming and expansion of what the 2023 list called Model Denial of Service. The new wording captures three failure modes the old wording did not: denial of wallet (provider-billing exhaustion), inference-as-a-service abuse (turning a paid API into a free competitor), and model extraction through repeated probing.

The vulnerability sits next to four other LLM Top 10 entries on the agent threat model. The prompt injection page covers the input-side hijack. The indirect prompt injection via RAG page covers the data-side hijack. The improper output handling in LLM applications page covers the output-sink risk. The excessive agency in LLM applications page covers the action dimension. Unbounded consumption is the resource dimension. Even with a perfectly aligned model, a clean prompt path, a sanitised output sink, and a least-privilege tool registry, the question of how much compute, money, and time the model is allowed to spend on one request remains a security decision the application has to enforce in code.

For internal AppSec, AI security, product security, vulnerability management, security engineering, cloud security, and FinOps-adjacent teams, unbounded consumption is the LLM Top 10 entry that can convert a free-tier endpoint into a six-figure provider bill overnight, take a customer-facing feature offline for legitimate users, drain the inference quota the rest of the product depends on, and let a competitor reconstruct the proprietary fine-tune or system prompt through patient extraction. The fix is operational: budget every dimension (tokens in, tokens out, requests per identity, dollars per tenant, GPU seconds per call), enforce hard caps in the application code rather than in pricing tiers, separate the metered provider account per tenant so a single tenant cannot consume another tenant's quota, and log every call with token counts, latency, identity, and cost so the team can read the consumption pattern alongside the rest of the security backlog.

The extraction failure mode named under LLM10:2025 has its own confidentiality dimension that goes beyond the resource-budget reading. The dedicated model extraction attack page covers the inference-API side channel where the attacker reconstructs the model itself (model stealing per Tramèr et al. 2016), tests whether a record was in the training set (membership inference per Shokri et al. 2017), or recovers training inputs (model inversion per Fredrikson et al. 2015 and Carlini et al. 2021). It pairs each variant with layered remediation: per-identity query budgets, output minimisation per trust tier, differential privacy training, watermarking, canary insertion, and the contractual derived-work lever the legal team needs when extraction is detected.

The consumption surface

Input token budget

The number of tokens an attacker can place into one prompt. Large file uploads, recursive document expansions, concatenated retrieval windows, and verbose system context all multiply the input side. Most providers bill for every input token, so the input dimension is a direct cost lever before the model ever generates a reply.

Output token budget

The number of tokens the model is allowed to generate per call. A request for a very long answer, a request to repeat a phrase indefinitely, or a structured-output schema with no maximum length turns one call into thousands of generated tokens. Generation pricing typically exceeds input pricing per token, so the output dimension is the highest unit cost lever.

Provider billing exhaustion

The denial-of-wallet failure mode. The application keys directly into a paid provider account. An attacker who runs many cheap calls or a few extreme calls drains the monthly budget, breaches the spend cap, or triggers an alert that pauses the entire production service for legitimate users.

Request rate and concurrency

The number of in-flight requests one identity, one tenant, or the whole service can hold against the provider. The provider also applies its own rate limits, typically per organisation or per key. One abusive client can saturate the organisation-wide concurrency budget and starve every other tenant the same key serves.

Tool-call amplification

An agent that can call tools can call them recursively. One user prompt becomes a search call, which feeds a summary call, which calls another agent, which calls a third tool. Each step is a model invocation. Without a per-trace step budget, one prompt spawns a tree of calls that costs orders of magnitude more than the visible request.

Retrieval window expansion

A RAG flow that retrieves top-k documents and pastes them into the prompt has a controlled k by default. Where k is configurable through the request, where the retrieved chunks have no maximum size, or where a tool returns retrieval results that become the next prompt, the input window grows without bound.

GPU and self-hosted inference time

Self-hosted models trade provider bills for owned GPU capacity. An attacker who occupies the inference workers blocks every other request. Long-context prompts on a self-hosted model are the slowest unit of work the service runs, so a few crafted prompts can monopolise an entire fleet.

Background and scheduled inference

Cron jobs, daily summarisers, periodic enrichment, and batch jobs all call the model on schedule. A misconfigured job, a poisoned input table that turns into a long prompt, or a reschedule loop that re-fires on every retry consumes budget against an identity no human is sitting behind to notice.

How it goes wrong

Denial of wallet on a free-tier endpoint

A public chat surface or a free-tier API key is reachable without per-identity rate limits. An attacker writes a loop that submits maximum-length prompts at the provider concurrency limit until the monthly spend cap fires, the provider account suspends, and the production feature goes dark for paying customers.

Recursive tool-call explosion

A research agent calls a summariser, which calls a fact-checker, which calls another research agent. The prompt that triggered the chain was one user request, but the trace expanded into dozens of model invocations. No per-trace step budget caps the recursion, and the bill scales with the depth.

Output length without a maximum

A request asks the model to repeat or expand indefinitely. The structured-output schema does not name a maximum length, and the generation parameter defaults to the model maximum. One request produces thousands of output tokens, and the attacker repeats the pattern across many sessions.

Retrieval window manipulation

A request includes a parameter the client controls (top_k, max_chunks, document_id list). The attacker raises the value, retrieval returns large payloads, and the prompt grows to many times the design assumption. The model still answers, the cost still bills, and the latency still climbs.

Inference-as-a-service abuse

A paid endpoint exposes a frontier model with no per-identity quota. An attacker uses it as a free general-purpose model API for an unrelated product. The host pays the inference bill, the model serves an external workload, and the original feature drowns under the load.

Model extraction through repeated queries

An attacker iteratively queries a fine-tuned model, a proprietary system prompt, or a retrieval index. Each call costs the host money. Over thousands of queries, the attacker reconstructs the system prompt verbatim, mirrors the fine-tune outputs, or rebuilds enough of the retrieval corpus to cross a confidentiality threshold the model was supposed to protect.

Concurrency saturation in a multi-tenant key

Many tenants share one organisation-level provider key. One tenant submits hundreds of parallel calls. The provider throttles the organisation, every tenant sees errors, the support queue spikes, and the abusive tenant is hard to identify because the provider observed the calls under the shared key.

Background job runaway

A scheduled enrichment job reads a row, calls the model, writes a result. A malformed input table turns one row into a million-token prompt, the call retries on every failure, and the retry loop burns the budget without any human in the loop noticing until the bill or the alert fires.
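The runaway above combines two gaps: no size check on the row before it becomes a prompt, and no ceiling on retries. A minimal sketch of closing both is shown here. The names `enrich_row`, `MAX_PROMPT_CHARS`, and `MAX_ATTEMPTS` are hypothetical, the limits are illustrative, and character count is used as a rough stand-in for a token count.

```python
MAX_PROMPT_CHARS = 20_000   # illustrative ceiling; characters approximate tokens here
MAX_ATTEMPTS = 3            # illustrative retry ceiling


def enrich_row(row_text, call_model):
    """Call the model for one row, skipping oversize rows and bounding retries."""
    if len(row_text) > MAX_PROMPT_CHARS:
        # Refuse before any paid call is made.
        return {"status": "skipped_oversize", "attempts": 0}
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return {"status": "ok", "attempts": attempt, "result": call_model(row_text)}
        except Exception:
            continue  # retry, but only until MAX_ATTEMPTS is spent
    return {"status": "failed", "attempts": MAX_ATTEMPTS}
```

A failed or skipped row is returned as a status the scheduler can record, so the job moves on instead of re-firing against the same input.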

Cold-start prompt expansion

A new tenant onboarding flow pre-warms the model with full-context system prompts, sample data, retrieval bootstrapping, and history hydration. An attacker triggers the onboarding repeatedly. Each trigger spends the full warm-up cost. The unit economics of a normal sign-up turn into a sustained denial-of-wallet vector.

Common causes

No per-identity cost or token budget

The application authenticates the user but does not attach a per-user, per-tenant, per-API-key budget that decrements on every call. Spend caps live at the provider organisation level only, so the first identity to abuse the surface exhausts the budget for every other identity.

Client-controlled generation parameters

max_tokens, top_k, top_p, temperature, n (number of completions), and tool-call step counts arrive in the request body and the server passes them through. The client can ask for the maximum output, the maximum retrieval window, or the maximum candidate count, and the application has no opinion.

No rate limit at the application boundary

The team relies on the upstream provider rate limit. Provider rate limits are typically scoped to the organisation, project, or key and enforced over short windows such as requests or tokens per minute; they do not separate one tenant from another, do not separate one user from another, and do not stop one identity from consuming the whole allowance within a window. The application boundary is where per-identity limits belong.

Single provider account for many tenants

One billing account, one rate-limit budget, one concurrency pool. A tenant who abuses the surface drags every other tenant into the same throttle and the same suspension. Per-tenant provider accounts, per-tenant API keys, or per-tenant cost pools are the operational separation pattern.

Tool registrations without step or depth limits

An agent can call any registered tool, and any tool can call the model again. No per-trace step budget caps recursion, no maximum depth caps tool composition, and no per-call cost ceiling stops the chain. One user request becomes an unbounded tree of paid invocations.

No visibility into per-request token, cost, or latency

The application logs request counts but not token counts, dollar amounts, or generation latency. The team sees the monthly bill go up but cannot attribute the spend to a feature, a tenant, an endpoint, or a prompt. The detection signal lands in finance before it lands in security.

How to detect it

Automated detection

SecPortal's code scanning runs against connected repositories and flags LLM provider call sites where the request body forwards client-controlled generation parameters without server-side caps, where the agent loop has no maximum step count or depth limit, where the retrieval helper has no maximum chunk count or chunk size, where the tool registration lacks a per-call cost ceiling, or where the call is missing the per-identity budget decrement
Authenticated scanning drives the LLM-backed endpoint with payloads that probe each consumption lever: maximum-length prompts, requests for maximum-length outputs, recursive tool-call seeds, parameter-amplification attempts, and concurrent-request bursts under a real session, then captures the request, the response length, the latency, and any rate-limit or budget signal the application returned as evidence on the finding
External scanning discovers exposed agent endpoints, public chat surfaces, free-tier API keys leaked through client-side code, and webhook callbacks that may invoke the model without per-identity rate limiting on the verified perimeter
Continuous monitoring re-runs the consumption-probe payload on a defined cadence so a new endpoint, a regressed rate-limit configuration, a removed per-identity budget, or a parameter the server stopped clamping is caught against the previous baseline rather than waiting for the next pentest cycle

Manual testing

Enumerate every LLM-backed endpoint, the model used, the provider, the credential the call runs under, the per-identity budget that decrements on each call, and the per-trace step budget for agent flows
For each endpoint, send a request with maximum-allowed input length, then with a request for maximum output length, then with the largest retrieval window the parameters accept, and confirm whether the server clamps, truncates, refuses, or simply forwards
For agent flows, submit a prompt that seeds a recursive tool-call chain (one tool calls another tool that calls the model again) and trace the resulting call graph to confirm a maximum step count, a maximum depth, and a per-trace cost cap actually exist
From one identity, fire concurrent requests up to and past the documented rate limit and confirm whether the rate limit is per identity, per tenant, per organisation-wide provider key, or absent
Read the activity log for the test session and confirm that each model call captures the input token count, the output token count, the dollar cost computed from the provider-reported usage, the latency, the model name, the actor identity, and the timestamp a finance or security responder would need to attribute the spend

How to fix it

Budget every dimension, not just request count

Per identity, per tenant, per API key, per endpoint, per day, per hour. Each dimension caps a different abuse vector: input tokens cap prompt amplification, output tokens cap generation cost, request rate caps brute-force probing, dollars cap denial of wallet, GPU seconds cap self-hosted exhaustion, agent steps cap recursion. Treat each as a first-class budget the application enforces before the call leaves the server.
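As a rough illustration of treating each dimension as its own budget, the sketch below checks one call against several ceilings before it leaves the server. `CallBudget`, `Usage`, and `check_budget` are hypothetical names, and every number is illustrative rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallBudget:
    """Hypothetical per-identity ceilings; the values are illustrative."""
    max_input_tokens: int = 8_000
    max_output_tokens: int = 1_000
    max_requests_per_hour: int = 120
    max_usd_per_day: float = 5.00


@dataclass
class Usage:
    """What this identity has already consumed in the current windows."""
    requests_this_hour: int = 0
    usd_today: float = 0.0


def check_budget(budget, usage, input_tokens, requested_output_tokens, estimated_usd):
    """Return the dimensions this call would breach; an empty list means allowed."""
    breaches = []
    if input_tokens > budget.max_input_tokens:
        breaches.append("input_tokens")
    if requested_output_tokens > budget.max_output_tokens:
        breaches.append("output_tokens")
    if usage.requests_this_hour + 1 > budget.max_requests_per_hour:
        breaches.append("request_rate")
    if usage.usd_today + estimated_usd > budget.max_usd_per_day:
        breaches.append("usd")
    return breaches
```

Returning the list of breached dimensions, rather than a single boolean, lets the application log which abuse vector the request was exercising.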

Cap generation parameters server-side

Never trust client-supplied max_tokens, top_k, top_p, n, or retrieval-window parameters. Pin the values to the smallest the feature can do its job under, and reject overrides at the request validation layer. Where the feature genuinely needs different ceilings per role, encode the per-role ceiling on the server and ignore the client suggestion.
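A minimal sketch of a server-side clamp, assuming a hypothetical `ROLE_CEILINGS` table and a `resolve_params` helper in the request validation layer. The role names and values are illustrative. This variant lets a client ask for less than the ceiling and rejects anything above it; a stricter variant would ignore client values entirely.

```python
# Hypothetical per-role ceilings held on the server; values are illustrative.
ROLE_CEILINGS = {
    "free": {"max_tokens": 256, "n": 1, "top_k": 3},
    "paid": {"max_tokens": 1024, "n": 1, "top_k": 8},
}


class ParameterRejected(ValueError):
    """Raised when a client-supplied parameter is unknown, malformed, or too large."""


def resolve_params(role, client_params):
    """Return the parameters the server will actually send to the provider."""
    ceilings = ROLE_CEILINGS[role]
    unknown = set(client_params) - set(ceilings)
    if unknown:
        raise ParameterRejected(f"unsupported parameters: {sorted(unknown)}")
    resolved = dict(ceilings)  # absent parameters fall back to the server ceiling
    for name, value in client_params.items():
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ParameterRejected(f"{name} must be a positive integer")
        if value > ceilings[name]:
            raise ParameterRejected(f"{name} exceeds the ceiling for role {role!r}")
        resolved[name] = value
    return resolved
```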

Separate provider accounts or scoped keys per tenant

A single shared organisation key means a single shared throttle, a single shared spend cap, and a single shared suspension. Per-tenant provider accounts, per-tenant scoped API keys, or per-tenant cost pools mean abuse by tenant A does not break the service for tenant B. The operational pattern transfers directly from how multi-tenant SaaS already separates customer data.

Add a per-trace step and depth budget for agent flows

Every agent invocation should carry a maximum number of model calls and a maximum tool-chain depth, both enforced in the agent framework, not in the prompt. A breach of either should hard-fail the trace, return a structured error, and log the partial trace as evidence the security team can read.
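One way to picture the per-trace budget is a small object that every model or tool invocation must charge before it runs. The sketch below is framework-agnostic; `TraceBudget`, `TraceBudgetExceeded`, and the toy `run_agent` loop are hypothetical, and the limits are illustrative.

```python
from dataclasses import dataclass, field


class TraceBudgetExceeded(RuntimeError):
    """Hard failure raised when a trace spends its step or depth budget."""


@dataclass
class TraceBudget:
    max_model_calls: int = 8
    max_depth: int = 3
    model_calls: int = 0
    events: list = field(default_factory=list)  # partial trace kept as evidence

    def charge(self, tool_name, depth):
        """Call before every model or tool invocation in the trace."""
        if depth > self.max_depth:
            self.events.append(("denied_depth", tool_name, depth))
            raise TraceBudgetExceeded(f"depth {depth} exceeds {self.max_depth}")
        if self.model_calls >= self.max_model_calls:
            self.events.append(("denied_steps", tool_name, depth))
            raise TraceBudgetExceeded(f"step budget of {self.max_model_calls} spent")
        self.model_calls += 1
        self.events.append(("call", tool_name, depth))


def run_agent(budget, depth=1):
    """Toy agent that always delegates to itself, standing in for a recursive chain."""
    budget.charge("research_agent", depth)
    return run_agent(budget, depth + 1)
```

Because the check lives in code the agent cannot rewrite, a prompt that asks the model to "keep going" has no effect on the ceiling.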

Charge the spend back to the requesting identity in real time

Decrement a per-identity cost budget on every call as the response returns, before the next call from the same identity is accepted. The decrement is the actual per-call cost, computed from the token usage the provider reported, not a pre-call estimate. When the budget hits zero, requests fail closed at the application boundary, not at the provider.
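A minimal in-process sketch of that ledger, assuming hypothetical `CostLedger` and `BudgetExhausted` names. A production version would likely keep the balance in a shared store; the in-memory dictionary here is only for illustration.

```python
import threading


class BudgetExhausted(Exception):
    """Raised at the application boundary when an identity has no budget left."""


class CostLedger:
    def __init__(self, limit_usd_by_identity):
        self._remaining = dict(limit_usd_by_identity)
        self._lock = threading.Lock()

    def admit(self, identity):
        """Fail closed before the provider call; unknown identities have no budget."""
        with self._lock:
            if self._remaining.get(identity, 0.0) <= 0.0:
                raise BudgetExhausted(identity)

    def settle(self, identity, cost_usd):
        """Subtract the actual cost of the finished call and return the balance."""
        with self._lock:
            balance = self._remaining.get(identity, 0.0) - cost_usd
            self._remaining[identity] = balance
            return balance
```

Calls that are already in flight when the balance reaches zero can still overshoot it; reserving an estimated cost in `admit` and correcting it in `settle` is one way to tighten that gap.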

Hard-cap output length at the model parameter level

Set max_tokens (or the provider equivalent) to the smallest value the feature actually needs. A summariser does not need a 32k-token answer. A classifier does not need more than a few tokens. Where a provider treats the parameter as optional, omitting it can let generation run up to the model limit; the secure default is the feature maximum.

Bound retrieval and ingestion before the prompt assembles

Hard-limit the number of retrieved chunks, the size of each chunk, the maximum total context size the application will assemble, and the maximum file size a user can upload into the retrieval pipeline. Apply the limit at the retrieval layer rather than at the prompt-template layer so the limit cannot be overridden by a prompt-engineering tweak.
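The limits above can be sketched as a single function at the retrieval layer. `bound_retrieval` and the three constants are hypothetical, the values are illustrative, and character counts stand in for token counts to keep the example free of a tokenizer dependency.

```python
MAX_CHUNKS = 5              # illustrative ceiling on retrieved chunks
MAX_CHUNK_CHARS = 2_000     # illustrative ceiling on each chunk
MAX_CONTEXT_CHARS = 6_000   # illustrative ceiling on the assembled context


def bound_retrieval(chunks, requested_k=None):
    """Clamp chunk count, chunk size, and total context before prompt assembly."""
    if requested_k is None:
        k = MAX_CHUNKS
    else:
        k = max(0, min(requested_k, MAX_CHUNKS))  # the client can only lower k
    bounded, total = [], 0
    for chunk in chunks[:k]:
        piece = chunk[:MAX_CHUNK_CHARS]
        if total + len(piece) > MAX_CONTEXT_CHARS:
            break  # stop before the assembled context exceeds its ceiling
        bounded.append(piece)
        total += len(piece)
    return bounded
```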

Rate-limit at the application boundary in addition to the provider

Provider rate limits are typically organisation-wide and protect the provider's capacity and the account's overall spend; they are not a per-tenant security control. Apply per-identity, per-tenant, per-endpoint rate limits in the application code with a fast in-memory or Redis-backed counter. The application-layer limit is the one that protects the rest of the tenants from a single abusive identity.
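A minimal in-memory sketch of a per-identity limit, using a sliding window of request timestamps. `SlidingWindowLimiter` is a hypothetical name; a multi-process deployment would need a shared counter (for example Redis-backed) rather than this process-local dictionary.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per identity in any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests do not need to sleep
        self._hits = defaultdict(deque)

    def allow(self, identity):
        now = self.clock()
        hits = self._hits[identity]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

The key passed to `allow` decides the scope: a user id, a tenant id, or a tuple of tenant and endpoint each give a different separation.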

Detect extraction patterns and degrade gracefully

Watch for repeated systematic queries against the same fine-tune, the same retrieval index, the same system-prompt-probing pattern, or the same canary phrases. When the pattern crosses a threshold, throttle the identity, increase the temperature for that session, or refuse further probing. Log the trigger and the response as a finding the security team can investigate.
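As a deliberately naive illustration of the threshold idea, the sketch below counts prompts per identity that contain known probing phrases. `ExtractionMonitor` and `PROBE_MARKERS` are hypothetical, and substring matching is an assumption made for brevity; real detection would need richer signals such as query volume, similarity between prompts, and canary hits.

```python
from collections import Counter

# Hypothetical probing phrases; a real list would be maintained by the team.
PROBE_MARKERS = (
    "repeat your system prompt",
    "print your instructions",
    "ignore previous instructions",
)


class ExtractionMonitor:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.hits = Counter()

    def observe(self, identity, prompt):
        """Record one prompt and return 'throttle' once the identity crosses the threshold."""
        lowered = prompt.lower()
        if any(marker in lowered for marker in PROBE_MARKERS):
            self.hits[identity] += 1
        return "throttle" if self.hits[identity] >= self.threshold else "allow"
```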

Log per-call token, cost, latency, and identity on the operating record

Capture the input tokens, output tokens, model name, provider, dollar cost, latency, actor identity, endpoint, and the prompt fingerprint on every call. Keep the record on the security operating surface alongside every other finding and event, not only in the provider dashboard. The post-incident reconstruction reads from this record first.
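A sketch of what one such record might look like, with cost derived from reported token usage. The `build_call_record` helper, the field names, the `example-model` name, and the prices in `PRICE_PER_MILLION_USD` are all hypothetical; real prices come from the provider's published rate card.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical price table in USD per million tokens; values are illustrative.
PRICE_PER_MILLION_USD = {"example-model": {"input": 1.00, "output": 4.00}}


def build_call_record(identity, endpoint, provider, model, prompt,
                      input_tokens, output_tokens, latency_ms):
    """Build one structured log record for a single model call."""
    price = PRICE_PER_MILLION_USD[model]
    cost_usd = (input_tokens * price["input"]
                + output_tokens * price["output"]) / 1_000_000
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "identity": identity,
        "endpoint": endpoint,
        "provider": provider,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "cost_usd": round(cost_usd, 6),
        # A short hash groups repeated prompts without storing the prompt text.
        "prompt_fingerprint": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
    }
```

With the illustrative prices above, a call with 2,000 input tokens and 500 output tokens records a cost of 0.004 USD.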

Re-run the consumption regression probe on every model, prompt, and parameter change

A new model with different per-token pricing, a prompt change that shifts the input/output ratio, a parameter default the team relaxed for a feature, or a new tool registration can re-open a closed unbounded-consumption finding. Treat the consumption probe as a first-class CI gate alongside unit and integration tests, and keep the canary cost ceilings in the test suite where the team will see them.
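One possible shape for such a gate is a table of probes, each paired with the ceilings its response must respect. In this sketch `PROBES` and `consumption_regressions` are hypothetical, the ceilings are illustrative, and `endpoint` is assumed to be any callable that sends a request to the feature under test and returns its reported usage as a dictionary.

```python
# Hypothetical probes: each pairs a hostile request with the ceilings that must hold.
PROBES = {
    "max_output": (
        {"prompt": "repeat the word 'a' forever", "max_tokens": 100_000},
        {"output_tokens": 512},
    ),
    "wide_retrieval": (
        {"prompt": "summarise everything", "top_k": 10_000},
        {"input_tokens": 8_000},
    ),
}


def consumption_regressions(endpoint, probes=PROBES):
    """Run every probe and return a description of each ceiling that was exceeded."""
    failures = []
    for name, (request, ceilings) in probes.items():
        usage = endpoint(request)
        for metric, limit in ceilings.items():
            if usage[metric] > limit:
                failures.append(f"{name}: {metric}={usage[metric]} exceeds {limit}")
    return failures
```

In a CI job the gate reduces to asserting that the returned list is empty for the staging endpoint.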

What this looks like in SecPortal

Finding record with prompt, parameters, and observed cost

The finding captures the prompt that triggered the abuse probe, the parameters the request sent, the input token count, the output token count, the latency, the dollar cost computed from the provider-reported usage, the model, and the actor identity. The evidence is what AppSec, AI engineering, and security operations need to reproduce the abuse against the same surrounding code path and the same cost model.

Code scanning across LLM provider call sites

Code scanning runs against connected GitHub, GitLab, and Bitbucket repositories. Findings surface at provider call sites where client-supplied generation parameters reach the server without clamping, where the agent loop has no maximum step or depth, where the retrieval helper accepts an unbounded top_k, where the tool registration lacks a per-call cost ceiling, or where the per-identity budget decrement is missing.

Authenticated scanning with consumption probes

Authenticated scanning drives the LLM-backed endpoint with a curated set of consumption payloads under a real session: maximum-length prompts, maximum-length output requests, recursive tool-call seeds, retrieval-window expansion, and concurrent-request bursts. Each probe records whether the server clamped, refused, or forwarded, and how much budget the call actually consumed.

Continuous monitoring against regression

Continuous monitoring re-runs the consumption probe on the configured cadence. A new model with different per-token pricing, a regressed rate-limit configuration, a removed per-identity budget decrement, or a parameter the server stopped clamping shows up against the baseline rather than waiting for the next manual review.

Retest after the remediation ships

Once the fix deploys (parameters clamped, per-identity budgets wired, per-trace step caps added, per-tenant key separation in place), a targeted retest replays the original consumption probe against the new surface and records the post-fix response on the finding. The finding closes against the evidence rather than against a developer assertion that the surface is now bounded.

AI-assisted writeups with explicit honest scope

AI reports generate the writeup, the executive summary, and the developer-facing reproduction steps from the finding record. The narrative stays within the verified evidence (the prompt, the parameters, the token counts, the cost, the actor identity) and does not invent rate-limit middleware, AI-gateway behaviour, or provider integrations the product does not have.

Activity log for the cost-event audit trail

The workspace activity log captures the security-side audit trail: who created the finding, who suppressed it, who closed it, who approved a related exception, and when the retest fired. Pair the workspace activity log with the engineering-side per-call cost log the application writes (input tokens, output tokens, dollars, latency, identity) so post-incident reconstruction reads from both records.

Finding overrides for sanctioned high-budget calls

Where a high-token call is a deliberate, sanctioned workload (batch summarisation, scheduled enrichment, internal-only report generation), finding overrides record the suppression rationale, the owner, and the expiry on the finding itself. The exception lives on the operating record alongside every other documented deviation.

Compliance tracking pairs the fix to control evidence

Compliance tracking maps unbounded-consumption findings to the controls that read against them (ISO 27001 A.5.30 ICT readiness for business continuity, A.8.6 capacity management, A.8.20 networks security; SOC 2 A1.1 capacity availability, CC7.2 system monitoring; NIST 800-53 SC-5 denial-of-service protection, SC-6 resource availability, AU-12 audit record generation; NIST AI RMF Manage; ISO/IEC 42001 AI management system; NIST SSDF PW.5 secure coding). The same finding feeds the engineering ticket and the auditor evidence pack.

What SecPortal does not do

SecPortal is the operating record where unbounded-consumption findings, the prompt that produced the abuse, the parameters the request used, the token counts, the latency, the dollar cost, and the actor identity land alongside the rest of the security backlog. The product does not act as an AI gateway intercepting LLM calls between application and provider, does not meter tokens or dollars in flight on your behalf, does not enforce rate limits inside your application, does not bill chargebacks per tenant, and does not provide a managed quota service for OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, or self-hosted inference clusters.

SecPortal does not connect to Jira, ServiceNow, Slack, SIEM, SOAR, FinOps platforms, cloud-provider billing dashboards, or LLM-provider usage dashboards through packaged integrations. The discipline is the engineering practice on top of the operating record: AppSec, AI engineering, product security, security engineering, and platform teams write the per-identity budgets, the server-side parameter clamps, the per-tenant key separation, the per-trace step caps, the application-boundary rate limits, the extraction-pattern detectors, and the per-call cost logging in the application code itself.

Compliance impact

OWASP Top 10 for LLM Apps

LLM10:2025 - Unbounded Consumption (denial of wallet, inference-as-a-service abuse, model extraction)

NIST AI RMF

Map, Measure, Manage functions; "Safe" and "Secure and Resilient" trustworthiness characteristics; resource-availability and operational-monitoring guidance

ISO/IEC 42001

AI management system - operational planning and control; resource management; monitoring, measurement, analysis and evaluation

ISO 27001

Annex A 5.30 ICT readiness for business continuity; 8.6 Capacity management; 8.20 Networks security

SOC 2

A1.1 - Capacity availability; CC7.2 - System monitoring; CC9.1 - Risk mitigation activities

NIST 800-53

SC-5 Denial-of-service protection; SC-6 Resource availability; AU-12 Audit record generation; SI-4 Information system monitoring

Related vulnerabilities

Prompt Injection

Indirect Prompt Injection via RAG

Excessive Agency in LLM Applications

Improper Output Handling in LLM Applications

System Prompt Leakage in LLM Applications

Denial of Service (DoS)

Missing Rate Limiting

Regex Denial of Service (ReDoS)

Related features

Find vulnerabilities before they ship

Test web apps behind the login

Vulnerability scanning tools that map your attack surface

Vulnerability management software that tracks every finding

Monitor continuously, catch regressions early

AI-powered reports in seconds, not days

Every action recorded across the workspace

Finding overrides that survive every scan cycle

Compliance tracking without a full GRC platform

Verify fixes and track reopens on the same finding record

Track LLM10 unbounded-consumption findings on one engagement record

SecPortal pairs the authenticated scan, the external scan, and the code scanner with one findings record per LLM cost or capacity vulnerability, with CVSS 3.1 severity, framework mapping, retest pairing, and an append-only activity log. Start scanning for free.