Vulnerability

Model Extraction Attack
detect, understand, remediate

Model extraction is the AI/ML confidentiality class where an attacker queries a deployed model through its normal inference surface and reconstructs the model itself (model stealing), confidential properties of the training data (membership inference), or representations of training records (model inversion). The inference endpoint becomes the leak. The fix is layered across rate limiting, output minimisation, differential privacy, watermarking, observability scoping, and a contractual layer.

No credit card required. Free plan available forever.

Severity: High
CWE ID: CWE-200
OWASP Top 10: OWASP LLM10:2025 Unbounded Consumption / LLM02:2025 Sensitive Information Disclosure
CVSS 3.1 Score: 6.4

What is a model extraction attack?

Model extraction is the vulnerability class where an attacker queries a deployed machine learning model through its normal inference surface and uses the returned outputs to reconstruct one of three things: the model itself (a near-functional clone with similar prediction behaviour), confidential properties of the training data (whether a specific record was in the training set, or what features the training records had), or the latent representations the model encodes about a person, an account, or a record. The attacker does not need source code, weights, training data, internal documentation, or platform access. The inference endpoint is the side channel. The canonical academic anchor for the model-stealing variant is Tramèr et al., "Stealing Machine Learning Models via Prediction APIs" (USENIX Security 2016). The canonical anchor for the membership inference variant is Shokri et al., "Membership Inference Attacks Against Machine Learning Models" (IEEE S&P 2017). The canonical anchors for the training-data reconstruction variant are Fredrikson et al., "Model Inversion Attacks that Exploit Confidence Information" (CCS 2015) and Carlini et al., "Extracting Training Data from Large Language Models" (USENIX Security 2021).

The vulnerability is the AI/ML confidentiality root that sits next to several adjacent AI security classes already covered on SecPortal. The prompt injection page covers the input-side hijack at inference time. The indirect prompt injection via RAG page covers the retrieved-data hijack. The improper output handling in LLM applications page covers the output-sink risk. The excessive agency in LLM applications page covers the action dimension. The system prompt leakage page covers the disclosure of the developer-written instructions. The data and model poisoning page covers the inbound pre-deployment dimension. The unbounded consumption page covers the resource dimension and names model extraction as one of the LLM10 failure modes. This page is the dedicated extraction-side reading: the confidentiality attack where the inference surface itself becomes the leak.

For internal AppSec, AI security, ML platform, product security, security engineering, vulnerability management, and GRC teams, model extraction reframes the inference endpoint as a controlled-disclosure surface. The fix is layered. Limit per-identity query budget. Withhold confidence scores and logits from low-trust callers. Randomise or coarsen output. Detect query patterns that look like extraction. Watermark or fingerprint the model. Keep a deletion path for records that fail a membership-inference audit. Run extraction drills on the model the same way the team runs penetration testing against the application. The same engagement discipline that pairs an authenticated web finding to a remediation owner pairs an ML extraction finding to an ML platform owner, a model owner, an inference-tier guardrail change, and a structured retest. The MITRE ATLAS adversarial AI knowledge base catalogues this family of attacks under AML.T0024 (Exfiltration via ML Inference API), whose sub-techniques cover membership inference (AML.T0024.000), model inversion (AML.T0024.001), and model extraction (AML.T0024.002).

Three variants under one heading

Different research lines, regulatory regimes, and threat models use overlapping terminology. A defensible triage walks the three confidentiality dimensions before deciding which variant the finding represents and which remediation path applies. The variants share the inference-API side channel but differ on what is being stolen and what control layer detects and slows the attack.

| Property | Model stealing | Membership inference | Model inversion / training-data reconstruction |
| --- | --- | --- | --- |
| What is being stolen | A functional copy of the deployed model. Predictions on a target distribution approach the victim model. Parameters or a behavioural clone can be replayed without paying for the original. | Per-record secrets. Did a specific person, customer record, document, or example appear in the training set. Used to violate privacy commitments, contract terms, or regulatory rules. | Reconstructed features. The attacker recovers an approximation of training inputs (faces, medical attributes, free-text snippets, system prompt segments memorised by an LLM, customer record fields). |
| Typical query pattern | High-volume, broad-distribution queries against the inference API. Often automated through a scripted client that samples the input space. Often paired with stolen credentials, free-tier abuse, or partner-key abuse. | Lower-volume, targeted queries that probe the confidence delta between in-set and out-of-set records. The attacker often already holds a candidate record and is testing membership. | Optimisation loops that climb the confidence surface for a target class or a target identity. Sometimes paired with gradient access through a partial-model leak or with adversarial-example crafting tooling. |
| Required output signal | Class labels alone are often enough for behavioural cloning. Confidence scores, logits, or probability vectors accelerate the attack and lower the query budget needed for a high-fidelity clone. | Confidence scores, loss values, or fine-grained probability distributions. A hard-label-only endpoint raises the cost of membership inference but does not eliminate it. | Confidence vectors per candidate input, often combined with gradient information or with a shadow model trained on a similar distribution. |
| Realistic impact | Loss of intellectual property in the model itself; replication of paid inference as a free competitor; circumvention of paid-tier metering; downstream use of the cloned model to mount adversarial attacks against the original. | Privacy violation (revealing that a person was in a sensitive cohort: HIV-positive cohort, mental-health cohort, salary band cohort, lay-off cohort, fraud cohort); contractual breach with the data subject; regulatory exposure under GDPR Article 5(1)(f), HIPAA, and sectoral privacy regimes. | Disclosure of training data the team thought stayed inside the training pipeline; LLM-memorised system prompts, customer records, copyrighted text, or proprietary documents leaking through normal completions; reconstruction of biometric or medical features. |
| Primary control layer | Inference-tier rate limiting, per-identity query budget, output coarsening (round confidence scores, return labels only), watermarking, anomaly detection on query distributions, and contractual terms on derived-work. | Differential privacy during training; output regularisation; refusing to return loss or fine-grained confidence to low-trust callers; deletion request handling; pre-deployment membership-inference evaluation. | Data minimisation in training; redaction of memorisable content; differential privacy; output filters for memorised strings; LLM-specific evaluations (canary insertion, extraction probes). |

The extraction surface

Public inference API

A REST, GraphQL, or gRPC endpoint that accepts an input and returns a prediction. The response contract is what the security team has to defend. Endpoints that return labels together with confidence scores, class-probability vectors, or loss values give the attacker richer signal per query. Endpoints behind a CDN without per-identity throttling let the attacker drive volume without paying.

Embedding APIs

Endpoints that return vector embeddings for an input. Embeddings often encode more information than a label, and a sufficiently large set of (input, embedding) pairs lets an attacker train a functional clone of the embedding model. Many enterprise AI features expose embeddings without recognising they are part of the extraction surface.

Chat completion and tool-use surfaces

LLM completions that echo memorised training data, replay verbatim chunks of system prompts, return source-code snippets, or surface customer records become extraction channels. Tool-use surfaces that call the model recursively expand the per-prompt query budget the attacker has against the same model.

Retrieval and RAG surfaces

A retrieval API that returns top-k document chunks lets the attacker learn the corpus through enumeration. Even without seeing the prompt, the attacker can reconstruct the knowledge base by probing with adjacent queries and reassembling the returned chunks.

Background and scheduled inference

Cron jobs, batch enrichment pipelines, and webhook-driven inference do not have a human at the other end. A poisoned input table or a re-firing schedule can run thousands of extraction queries against the model without triggering the alerting surface that watches human-driven traffic.

Partner and integration keys

A partner key issued for a benign integration grants the same inference access as a paying customer. A scraped partner key, an exfiltrated CI secret, or a credential committed to a public repository becomes an extraction wallet. Per-key rate limiting and per-key query distribution monitoring are the controlling boundary.

Open-weight checkpoints and self-hosted variants

A team that publishes an open-weight model, ships an on-device variant, or distributes a fine-tuned LoRA adapter has handed the parameters to the attacker. The extraction step is replaced by a download. The privacy and IP analysis then runs against the published artefact rather than against the inference endpoint.

Logged and observable inference traces

Provider traces, observability backends, and crash dumps that capture full prompts and full completions create a secondary extraction surface. A compromise of the observability tier extracts training data and inference behaviour without ever calling the model again.

How it goes wrong

1. High-volume scripted querying behind a free or low-tier key

An attacker registers a low-cost account, scripts a query loop, and drives millions of inputs at the inference API. Per-identity rate limiting either does not exist or caps a per-minute rate without budgeting the total. The cloned model is trained on the returned outputs and reaches near-parity with the production model within days.

2. Confidence scores returned to every caller

The endpoint returns label plus probability vector for every caller, regardless of tier or trust signal. Behavioural cloning needs labels alone, but a probability vector lowers the query budget by orders of magnitude. The team did this for a debugging surface and forgot the production endpoint inherits the same response shape.

3. Embedding API treated as low-sensitivity

The product exposes an embedding endpoint for semantic search or recommendation. Embeddings are not labels, so the team does not classify them as IP. An attacker collects (input, embedding) pairs at scale and trains a near-functional clone of the embedding model that they then ship as their own product.

4. LLM completions echo memorised training data

The training corpus included copyrighted text, customer documents, internal records, or system prompts that the model memorised verbatim. Carefully crafted prompts trigger the memorisation, and the LLM returns the text in its completion. The team treats the leak as a hallucination rather than recognising it as a confidentiality finding.

5. No membership-inference evaluation pre-deployment

The team has no pre-deployment evaluation that probes the model for in-set vs out-of-set confidence delta. A privacy-sensitive class (a fraud cohort, a clinical cohort, a layoff cohort) was used in training and is recoverable through a few thousand targeted queries.

6. Differential privacy was not in the training budget

The team opted out of differential privacy during training because of accuracy tradeoffs. The accuracy lift looks defensible in isolation but the model retains a memorisation pattern that violates the privacy commitment the legal team made to customers and the regulator.

7. Observability captures full prompts and completions

A monitoring vendor logs every prompt and every completion at full fidelity. The retention window is 90 days, the access scope is broader than the model team intended, and the vendor's SOC 2 boundary does not match the boundary the AI security team scoped. A vendor breach is now a training-data breach.

8. Open-weight publication without IP and privacy review

A research team publishes a fine-tuned model on a public registry. The fine-tune set contained customer records or copyrighted content. Once the model is public, the privacy or copyright exposure cannot be remediated by rotating a credential; the artefact has to be withdrawn and downstream copies hunted.

9. No detection on query distribution shape

A normal customer queries the model a few hundred times a day with inputs that match their domain. An extraction attacker queries the model with inputs sampled from a much broader distribution. Without anomaly detection on the per-identity query shape, the extraction signal is invisible in the aggregate.

Detection signals

Detection is a measurement problem layered on the inference tier. The signals below are independent and stack. A defensible programme reads at least three of them in parallel and ties each to a query-budget revocation, an account hold, or a model-side guardrail change. None of the detection signals are shipped by SecPortal; the platform records the finding once the team detects the pattern through their own monitoring or third-party tooling.

Query volume and rate per identity

A per-identity query count budget across the day, the week, and the month flags a caller that exceeds a normal usage envelope. The envelope is calibrated against the legitimate user population (median, p95, p99) rather than against an arbitrary threshold.
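One way to derive that envelope, as a minimal sketch: compute the median, p95, and p99 of per-identity daily query counts from the legitimate population and flag callers that sit well above the tail. The function names and the `multiplier` default are illustrative assumptions, not part of any product.

```python
from statistics import quantiles


def usage_envelope(daily_counts):
    """Return median, p95 and p99 of per-identity daily query counts."""
    # quantiles(n=100) yields 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(daily_counts, n=100, method="inclusive")
    return {"median": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def exceeds_envelope(count, envelope, multiplier=2.0):
    """Flag a caller whose count is far above the population tail.

    `multiplier` is a hypothetical tuning knob; calibrate it on real traffic.
    """
    return count > multiplier * envelope["p99"]
```

Recomputing the envelope on a schedule keeps the threshold tied to how legitimate callers actually behave rather than to a number chosen once and forgotten.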

Query distribution shape per identity

A scripted extraction loop samples a much broader input distribution than any one customer would in a normal usage pattern. A per-identity entropy or coverage measure on the input space surfaces the distribution shape change.
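A rough sketch of such a measure, assuming each query has already been mapped to a coarse bucket (a topic, a feature-space cell, or a cluster id; the bucketing scheme itself is an assumption left to the team):

```python
import math
from collections import Counter


def input_entropy(bucket_ids):
    """Shannon entropy (in bits) of the input buckets one identity queried."""
    counts = Counter(bucket_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def coverage(bucket_ids, n_buckets):
    """Fraction of all known buckets this identity has touched."""
    return len(set(bucket_ids)) / n_buckets
```

A customer who stays inside one or two domains produces low entropy and low coverage; a sampler that sweeps the input space pushes both towards their maximum. Compare each identity against the population baseline rather than against a fixed cut-off.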

Confidence-vector access pattern

Callers that consistently request confidence scores, logits, or probability vectors look different from callers that consume label-only responses. The fine-grained-output access pattern is a leading signal for both model stealing and membership inference.

Canary input probing

Pre-seeded canary records are inserted into the training set, the retrieval index, or the system prompt. A query that returns a canary is direct evidence of memorisation extraction. Canaries also support honeypotting where the attacker is detected when they ask for the canary phrase.

Watermark verification on cloned models

A model that ships with a watermark (a specific input-output mapping the legitimate model encodes) lets the team verify whether a suspected clone in the wild was extracted from their endpoint. The detection lives downstream of the extraction event but supports the legal and contractual response.
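Verification can be as simple as replaying the secret trigger inputs against the suspect model and measuring how often it returns the watermark outputs. The sketch below assumes the suspect model is reachable as a plain callable and that `min_match` is a hypothetical threshold the team sets from how often unrelated models hit the triggers by chance.

```python
def watermark_match_rate(model_fn, trigger_set):
    """Share of (trigger input, expected output) pairs the model reproduces."""
    hits = sum(1 for x, expected in trigger_set if model_fn(x) == expected)
    return hits / len(trigger_set)


def looks_derived(model_fn, trigger_set, min_match=0.9):
    """True when the match rate is high enough to support an extraction claim.

    A high rate is supporting evidence, not proof on its own.
    """
    return watermark_match_rate(model_fn, trigger_set) >= min_match
```

Keep the trigger set out of logs and documentation that callers can reach; a leaked trigger set lets a cloner strip the watermark before release.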

Free-tier vs paid-tier behaviour delta

Extraction attackers usually live on the free or low-cost tier because that is the cheapest way to drive volume. A behaviour delta between the free-tier population and the paid-tier population on volume, query distribution, and confidence-access pattern surfaces the population the security team should triage first.

Remediation plan

Remediation is layered. No single control closes model extraction; every control raises the cost of an attack and slows the attacker. A defensible remediation record on the engagement names the controls applied at the inference tier, the controls applied at the training tier, the controls applied at the credential and identity tier, the controls applied at the observability tier, and the controls applied at the contractual tier, then pairs each to a retest that probes the residual exposure.

1. Per-identity query budget

Enforce a per-identity query count budget across day, week, and month windows. Pair the budget to the tier the caller is on. A free-tier caller carries a smaller budget than a paid-tier caller. Soft caps degrade gracefully (queue, throttle, downgrade response). Hard caps return an explicit refusal that the platform can audit.
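A minimal in-memory sketch of that budget, assuming three rolling windows and hypothetical per-tier numbers. A production gateway would back this with a shared store and bucketed counters instead of a list of timestamps.

```python
from collections import defaultdict, deque

WINDOWS = {"day": 86_400, "week": 604_800, "month": 2_592_000}  # seconds

# Hypothetical budgets; the real numbers come from the usage envelope.
BUDGETS = {
    "free": {"day": 500, "week": 2_000, "month": 5_000},
    "paid": {"day": 20_000, "week": 100_000, "month": 300_000},
}


class QueryBudget:
    def __init__(self, budgets=BUDGETS):
        self.budgets = budgets
        self.events = defaultdict(deque)  # identity -> query timestamps

    def check(self, identity, tier, now):
        """Return (allowed, exhausted_window) and record the query if allowed."""
        q = self.events[identity]
        # Forget queries older than the longest window.
        while q and now - q[0] >= WINDOWS["month"]:
            q.popleft()
        for name, span in WINDOWS.items():
            used = sum(1 for t in q if now - t < span)
            if used >= self.budgets[tier][name]:
                return (False, name)  # hard cap: explicit, auditable refusal
        q.append(now)
        return (True, None)
```

Returning the name of the exhausted window gives the audit trail a reason for each refusal, which is what separates a hard cap from a silent drop.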

2. Output minimisation per trust tier

Low-trust callers receive labels only. Higher-trust callers receive coarsened confidence scores (rounded to one or two decimal places, or banded). Only specific, audited callers receive full logits or probability vectors. Treat the output schema as the contract the security team has to defend.
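The tiering can be sketched as a single response-shaping function. The tier names (`low`, `standard`, `audited`) are illustrative assumptions; unknown tiers fail closed to the label-only shape.

```python
def minimise_output(probs, trust_tier, decimals=1):
    """Shape a prediction response according to the caller's trust tier.

    probs: dict mapping class label -> probability.
    """
    top_label = max(probs, key=probs.get)
    if trust_tier == "standard":
        # Coarsened confidence: rounding removes most of the fine-grained signal.
        return {"label": top_label, "confidence": round(probs[top_label], decimals)}
    if trust_tier == "audited":
        # Full vector only for specific, audited callers.
        return {"label": top_label, "probabilities": dict(probs)}
    # "low" and any unrecognised tier get the label only (fail closed).
    return {"label": top_label}
```

Applying the function at the last step before serialisation means a debugging change upstream cannot widen the production response shape by accident.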

3. Inference-tier rate limiting and throttling

Per-IP, per-identity, per-key, per-organisation, and per-tenant rate limits operate in parallel. Throttling escalates predictably (slow first, refuse second, hold third). The rate limit is enforced in the application or gateway, not only at a CDN, so a CDN bypass does not bypass the cap.
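The escalation ladder can be sketched as follows. The overshoot ratios (1.5x and 3x) are placeholder assumptions, and every limit is assumed to be a positive number; the point is that each scope is evaluated independently and the strictest action wins.

```python
ORDER = ["allow", "slow", "refuse", "hold"]


def throttle_action(requests_in_window, limit):
    """Map how far a caller is over one limit to an escalating action."""
    if requests_in_window <= limit:
        return "allow"
    overshoot = requests_in_window / limit
    if overshoot <= 1.5:
        return "slow"    # add latency or queue the request
    if overshoot <= 3.0:
        return "refuse"  # explicit refusal for this window
    return "hold"        # suspend the identity pending review


def decide(counts, limits):
    """Evaluate every scope (ip, key, tenant, ...) and return the strictest action."""
    actions = [throttle_action(counts.get(scope, 0), limit)
               for scope, limit in limits.items()]
    return max(actions, key=ORDER.index, default="allow")
```

Running this in the application or gateway, with counts keyed on authenticated identity as well as IP, is what keeps the cap in force when traffic arrives around the CDN.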

4. Differential privacy during training

Add calibrated noise during training (DP-SGD, output perturbation, federated learning with secure aggregation where the architecture supports it). Record the privacy budget (epsilon, delta) on the model card and on the engagement record. Re-evaluate the privacy budget every time the model is retrained or fine-tuned.
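To make the DP-SGD idea concrete, here is a dependency-free sketch of a single update step: clip each per-example gradient to a maximum L2 norm, sum, add Gaussian noise scaled to that norm, then average. It illustrates the mechanism only. It does not compute epsilon or delta; that needs a privacy accountant from a maintained DP library, and the parameter names here are assumptions.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add noise, average, descend."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    summed = [sum(col) for col in zip(*clipped)]
    # Noise standard deviation is tied to the clipping norm.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [w - lr * g / n for w, g in zip(weights, noisy)]
```

Clipping bounds how much any one record can move the weights, and the noise hides what remains; both knobs trade accuracy for privacy, which is why the resulting budget belongs on the model card.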

5. Membership-inference pre-deployment evaluation

Before promotion, run a membership-inference evaluation against the candidate model. Measure the in-set vs out-of-set confidence delta. Define a release-blocking threshold. Capture the evaluation result on the engagement record as a finding with severity, scope, and a remediation owner.
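A minimal sketch of such an evaluation, assuming the team has already collected the model's top-class confidence for a sample of training records (in-set) and a held-out sample (out-of-set). The AUC-style score and the `max_auc` threshold of 0.6 are illustrative choices, not a standard.

```python
from statistics import mean


def confidence_delta(in_set_conf, out_set_conf):
    """Mean confidence gap between training members and non-members."""
    return mean(in_set_conf) - mean(out_set_conf)


def attack_auc(in_set_conf, out_set_conf):
    """Chance that a random member scores above a random non-member.

    0.5 means confidence reveals nothing about membership; ties count half.
    """
    wins = 0.0
    for a in in_set_conf:
        for b in out_set_conf:
            wins += 1.0 if a > b else 0.5 if a == b else 0.0
    return wins / (len(in_set_conf) * len(out_set_conf))


def release_blocked(in_set_conf, out_set_conf, max_auc=0.6):
    """Block promotion when a simple confidence attack separates the two sets."""
    return attack_auc(in_set_conf, out_set_conf) > max_auc
```

This only measures the simplest confidence-threshold attack; a shadow-model attack can do better, so treat a pass here as a lower bound on exposure rather than a clean bill of health.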

6. Canary insertion and extraction probes

Insert canary records into the training set, the retrieval index, or the system prompt. After deployment, run scheduled extraction probes that check whether the canary is recoverable. A positive recovery is an extraction-detected finding that opens a remediation engagement.
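The probe side can be sketched in a few lines: generate high-entropy canary strings, record them, and scan probe completions for any that come back. The `canary-` prefix and the helper names are illustrative assumptions; sending the probes to the model is left to the team's own client.

```python
import secrets


def make_canary(prefix="canary"):
    """Create a unique, unguessable marker to plant in training or retrieval data."""
    return f"{prefix}-{secrets.token_hex(8)}"


def find_leaked_canaries(completions, canaries):
    """Return the sorted list of canaries that appear in any completion."""
    leaked = set()
    for text in completions:
        lowered = text.lower()
        for canary in canaries:
            if canary.lower() in lowered:
                leaked.add(canary)
    return sorted(leaked)
```

Because each canary is random, a match is very unlikely to be coincidental, which is what makes a recovery usable as finding evidence.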

7. Watermarking and model fingerprinting

Watermark the model with a specific input-output mapping that the legitimate model produces, that a clone extracted from the endpoint tends to inherit, and that an independently trained model is unlikely to reproduce. Fingerprint the model output distribution against a baseline so a deployed model can be verified against the published version. Capture the watermark and fingerprint on the model card.

8. Credential hygiene and per-key budgets

Treat every issued inference key as a separately auditable identity. Rotate keys on a fixed cadence. Revoke keys with anomalous query distribution. Pair partner keys to a contractual derived-work clause that gives the team a legal lever when extraction is detected.

9. Observability scoping and retention review

Audit which providers, vendors, and internal tools see full prompts and completions. Set retention to the shortest window the team can operate against. Redact memorisable content from logs. Match the SOC 2 boundary of the observability vendor to the boundary the AI security team scoped.
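Redaction before logging can start as simply as the sketch below. The two patterns (email addresses and long digit runs) and the 200-character cap are illustrative assumptions; a real deployment needs patterns matched to its own data classes.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{9,19}\b")  # account, card, or phone-like numbers


def redact(text, max_chars=200):
    """Strip obvious identifiers and cap length before a prompt or completion is logged."""
    text = EMAIL.sub("[EMAIL]", text)
    text = LONG_DIGITS.sub("[NUMBER]", text)
    if len(text) > max_chars:
        # Truncation limits how much memorised text a single log line can carry.
        text = text[:max_chars] + "[TRUNCATED]"
    return text
```

Applying redaction at the point where the trace is emitted, rather than inside the vendor tool, keeps the unredacted text from ever leaving the boundary the team scoped.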

10. Open-weight publication review gate

Before publishing any open-weight artefact (full checkpoint, fine-tuned LoRA, embedding model, distilled variant), run a privacy and IP review. The review checks the training set composition, the memorisation risk, the copyright exposure, the regulator commitments, and the downstream-extraction surface area. Capture the review on the engagement record.

11. Detection-to-engagement handoff

When the detection layer surfaces an extraction signal, open a finding on the engagement record with the detection source, the affected model, the affected endpoint, the affected identity or population, the time window, the response taken, and the residual exposure. The finding is the audit-grade record the AI security team and the GRC team read against.

12. Directed retest after remediation

Pair every remediation to a retest that probes the residual exposure. Replay the original query pattern (or a representative subset) and confirm the new control (rate limit, output minimisation, training-tier change, observability scoping) raises the cost of extraction enough that the residual exposure matches the threat-model tolerance.

How SecPortal supports the model extraction workflow

SecPortal does not run the inference layer, train the model, or detect extraction queries. The platform holds the audit-grade record of the finding after the team detects it, pairs the finding to a remediation owner, captures exceptions where remediation is deferred, runs the retest that proves closure, and crosswalks the finding to the frameworks the GRC team reads against. The capability cards below name the verified SecPortal features that pair to the model extraction workflow.

Record extraction findings with CWE and OWASP mapping

Findings management captures CWE-200 (Exposure of Sensitive Information to an Unauthorized Actor) as the confidentiality root, CWE-359 (Exposure of Private Personal Information) where membership inference is the variant, and the OWASP LLM10 cross-reference. The finding carries a CVSS 3.1 vector calibrated against the realistic blast radius (confidentiality high, scope changed where extraction crosses tenant boundaries), the affected model identifier, the affected endpoint, the detection source, and the response taken.

Import third-party AI security findings via bulk import

Bulk finding import lets you bring in extraction findings from a red-team engagement, an AI security audit, a third-party adversarial-ML assessment, a CSV export from a model gateway, or a manual review of inference logs. Imported findings land on the engagement record under the same canonical record shape so the AI security backlog lives next to the rest of the security backlog.

Track AI extraction exceptions on the finding record

When an extraction control cannot be applied immediately (a legacy embedding API that downstream products depend on, an open-weight model already in distribution, a partner contract that lacks a derived-work clause), the exception lives on the finding through finding overrides with named owner, compensating controls, residual-risk rationale, review cadence, and expiry, rather than in a model card comment or a Slack thread.

Verify extraction closure through retesting workflows

The retest pairs to the original finding. Closure means the directed retest replays the original query pattern (or a representative subset) and shows the new rate limit, output minimisation, training-tier change, observability scoping, or contractual control raises the cost of extraction enough that the residual exposure matches the threat-model tolerance. The verified_at and resolved_at timestamps preserve the audit chain.

Map extraction findings against multi-framework compliance

Compliance tracking maps the finding against OWASP LLM Top 10 (LLM10 Unbounded Consumption, LLM02 Sensitive Information Disclosure), OWASP Top 10 (A01:2021 Broken Access Control where the per-identity budget control fails), NIST AI RMF (Govern, Map, Measure, Manage) with focus on Measure 2.7 (security), Measure 2.10 (privacy), and Manage 2.4, NIST CSF 2.0 (DE.CM, ID.RA, PR.DS), ISO 27001 Annex A.8.20 (Networks security), A.8.16 (Monitoring activities), A.5.34 (Privacy and protection of personally identifiable information), and the EU AI Act high-risk system obligations (Article 15 accuracy, robustness, cybersecurity) where the model is in scope.

Capture the activity log as the workspace audit chain

Activity log records every workspace decision (engagement scope, finding triage, severity override, status transition, exception approval, retest closure) with the acting user and timestamp, exports to CSV, and gives the GRC team an audit-grade record of how the model extraction finding moved through the programme.

Pair the engagement record to the affected model and endpoint

Engagement management captures the scope of the AI security review (the deployed model, the inference endpoint set, the data classification tier, the regulator scope, the affected tenants) so the finding record carries the binding between the technical detection and the model the GRC team owns. The same record holds the document evidence (the model card, the threat model, the membership-inference evaluation report, the watermark scheme).

Document the membership-inference evaluation as evidence

Document management holds the pre-deployment membership-inference evaluation report as a versioned attachment on the engagement, with the privacy budget (epsilon, delta), the release-blocking threshold, the measured in-set vs out-of-set delta, and the named sign-off owner. The evidence pack pairs to the framework crosswalk so the audit reader does not have to leave the workspace to find the artefact.

Enforce MFA on the workspace that holds AI security findings

Multi-factor authentication is enforced on the workspace, so the audit-grade record of every AI security finding (and the model card and evaluation report attached to it) sits behind a TOTP factor that ties every workspace decision to a verified identity.

What SecPortal does not do

Honesty on capability matters when the topic is the AI inference tier. SecPortal does not run an inference gateway, does not enforce per-identity query budgets at the model edge, does not deploy a model-stealing detector inside the model serving worker, does not run membership-inference evaluations against your candidate model, does not insert canaries into your training set, does not watermark your model artefacts, does not run differential privacy training (DP-SGD or output perturbation), does not enforce output minimisation at the model edge, does not host the model or the embedding API, does not provide an LLM gateway, does not deploy a model registry, does not ship packaged push connectors into Jira, ServiceNow, Slack, Teams, PagerDuty, SIEM, SOAR, GRC, ticketing, model gateway, model registry, MLOps platform, or AI red-team tooling APIs, and does not act as the inference layer for any model. Programmes that need an inference gateway, a model-stealing detector, a membership-inference evaluator, a watermarking service, a differential-privacy training stack, or an LLM proxy run dedicated tooling alongside SecPortal, and land the resulting extraction findings on the engagement record through bulk finding import or manual entry. The platform value is the consolidated record where every model extraction finding (whether it came from an inference-gateway detection, an external red team, a membership-inference evaluation, a canary recovery, a watermark verification, or an academic-style probe) lives alongside the rest of the security backlog with the same lifecycle, the same role-based access control, the same activity log, and the same evidence trail.

Configuration review signals

The configuration review reads the inference contract, the model deployment metadata, and the surrounding identity, observability, and contractual surfaces. Each signal below is independent and points to a control that either exists, is partial, or is missing.

Inference response shape

Read the contract for every endpoint. Does it return label only, label plus confidence, label plus probability vector, label plus logits, or label plus loss? The response shape is the single largest determinant of the per-query information leak.

Per-identity rate and budget controls

Read the rate-limit and query-budget enforcement layer. Does the budget exist per tier, per identity, per key, per organisation, per tenant? Are caps soft or hard? Are they enforced at the application or only at the CDN? CDN-only enforcement is bypassable.

Training-set composition record

Read the model card or training documentation. Does the team know what entered the training set? Is the privacy budget recorded? Were canaries inserted? Was a pre-deployment membership-inference evaluation run? Each absent answer is a finding.

Observability retention and scope

Read which monitoring backends, error trackers, and provider traces capture full prompts and completions. Read the retention window. Read the access scope. A 90-day retention on a broadly accessible backend is a finding in itself.

Open-weight publication policy

Read whether the team publishes any open-weight artefacts. If yes, read the review gate. A missing or undocumented review gate is a finding because once an artefact is public it cannot be remediated by rotating a credential.

Partner key and integration contract review

Read the contractual terms attached to every issued inference key. Is derived-work prohibited? Is benchmark or competitive use prohibited? Are revocation rights named? Without a contractual lever, the legal response to detected extraction is constrained.

Compliance and framework impact

Model extraction crosses several compliance and framework lines. The grid below maps the finding class to the relevant control or article in each regime. The GRC team reads the grid against the affected scope (regulated data class, regulator, customer commitment) to decide which evidence pack the engagement record has to carry.

OWASP Top 10 for LLM Applications (2025)

LLM10:2025 Unbounded Consumption names model extraction as one of three failure modes alongside denial of wallet and inference-as-a-service abuse. LLM02:2025 Sensitive Information Disclosure covers memorised training data and per-record disclosure.

NIST AI RMF 1.0 (AI 100-1)

Measure 2.7 (security, including model theft and inversion), Measure 2.10 (privacy, including membership inference and reconstruction), Manage 2.4 (managing AI risks throughout the lifecycle), Govern 1.5 (accountability for AI risk).

MITRE ATLAS

AML.T0024 (Exfiltration via ML Inference API) with its sub-techniques AML.T0024.000 (Infer Training Data Membership), AML.T0024.001 (Invert ML Model), and AML.T0024.002 (Extract ML Model), plus AML.T0018 (Manipulate AI Model). ATLAS is the canonical adversarial-ML knowledge base the threat model reads against.

ISO/IEC 23894:2023 (AI risk management)

Clause 6 (Process) and Clause 7 (Risk treatment) where the inference confidentiality risk is in scope. Pairs with ISO/IEC 42001:2023 (AI management system) clauses 6.1.2 (AI risk assessment) and 6.1.3 (AI risk treatment).

EU AI Act (Regulation 2024/1689)

Article 15 (accuracy, robustness, and cybersecurity) for high-risk systems. Article 14 (human oversight). Annex III lists the high-risk use cases where the obligations apply. Recitals 75 and 76 reference resilience against adversarial attacks.

GDPR (Regulation 2016/679)

Article 5(1)(f) integrity and confidentiality. Article 32 security of processing. Article 25 data protection by design and by default. Where the training set contained personal data and extraction recovers it, the breach reporting obligations under Article 33 may apply.

NIST CSF 2.0

PR.DS (Data Security: data-in-transit, data-at-rest, data-in-use protection), DE.CM (Continuous Monitoring), ID.RA (Risk Assessment), GV.RM (Risk Management Strategy). The AI workload reads against the same outcomes the rest of the platform reads against.

ISO 27001:2022 Annex A

A.8.16 (Monitoring activities), A.8.20 (Networks security), A.8.25 (Secure development life cycle), A.5.34 (Privacy and protection of personally identifiable information), A.8.10 (Information deletion). The model extraction finding lands as evidence under each.

NIST SP 800-53 Rev 5

AC-4 (Information Flow Enforcement), AU-12 (Audit Record Generation), SC-7 (Boundary Protection), SI-4 (System Monitoring), and SI-10 (Information Input Validation). Where the AI system is in scope of the SP 800-53 baseline, the same controls cover the inference tier.

SOC 2 Trust Services Criteria

CC6.1 (Logical Access Controls), CC7.2 (Monitoring of Controls), CC7.4 (Detection and Response), CC8.1 (Change Management). The auditor reads the inference-tier guardrail change against the change management criterion.

AppSec and AI security review checklist

An eight-item review the AppSec, AI security, product security, and security engineering teams can run against any deployed model the workspace has scope to assess.

  1. Catalogue every inference endpoint. REST, GraphQL, gRPC, embedding, completion, RAG retrieval, batch, scheduled. Each endpoint is in scope.
  2. Read the response shape per endpoint. Label only, label plus confidence, label plus probability vector, label plus logits, or label plus loss. Record the shape.
  3. Read the per-identity budget enforcement. Per tier, per identity, per key, per organisation, per tenant. Confirm caps are enforced at the application, not only at the CDN.
  4. Read the training-set composition record. What entered the training set. What privacy budget. Were canaries inserted. Was a pre-deployment membership-inference evaluation run.
  5. Run a membership-inference probe. Use a held-out reference set and a candidate in-set sample. Measure the confidence delta. Compare against the release-blocking threshold.
  6. Run an extraction probe. Script a query loop at the boundary of the per-identity budget. Train a small clone on the responses. Measure the agreement against the production model.
  7. Read the observability boundary. Which providers, vendors, internal tools see full prompts and completions. What retention. What access scope. Match SOC 2 boundary of the vendor.
  8. Land every finding on the engagement record. CVSS 3.1 vector, CWE-200 root, OWASP LLM10 cross-reference, named remediation owner, retest pairing, framework crosswalk, activity log entry.
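Item 6 of the checklist can be rehearsed on a toy problem before it is pointed at a real endpoint. The sketch below uses a stand-in one-dimensional "production model" and a nearest-neighbour clone, both purely hypothetical, to show the shape of the measurement: query, train on the responses, then score agreement on inputs the clone never saw.

```python
def production_model(x):
    """Stand-in for the deployed model: a hard-label threshold classifier."""
    return "high" if x >= 0.5 else "low"


def train_clone(queries, labels):
    """Build a 1-nearest-neighbour clone from (query, returned label) pairs."""
    pairs = list(zip(queries, labels))

    def clone(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return clone


def agreement_rate(model_a, model_b, probe_inputs):
    """Share of held-out probes on which two models return the same label."""
    same = sum(1 for x in probe_inputs if model_a(x) == model_b(x))
    return same / len(probe_inputs)


# Eleven queries are the attacker's whole budget in this toy run.
queries = [i / 10 for i in range(11)]
clone = train_clone(queries, [production_model(q) for q in queries])
held_out = [0.1, 0.46, 0.7, 0.9]
score = agreement_rate(production_model, clone, held_out)
```

Plotting the agreement score against the number of queries spent shows how much fidelity the current per-identity budget lets an attacker buy, which is the number the retest compares before and after a control change.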

Track model extraction findings on one engagement record

SecPortal pairs the inference-tier audit, the model-card review, the membership-inference evaluation, and the extraction probe with one findings record per AI/ML model under review, with CVSS 3.1 severity, CWE-200 mapping, OWASP LLM10 cross-reference, framework crosswalks across NIST AI RMF and EU AI Act, retest pairing, and an append-only activity log. Start scanning for free.
