Data and Model Poisoning in LLM Applications
detect, understand, remediate

Data and model poisoning (OWASP LLM04:2025) is the vulnerability class where an attacker shapes the data the model is trained on, fine-tuned with, or retrieves from, in order to bias what the model produces at inference time. The attack can install behavioural triggers (targeted backdoors), degrade response quality across the board (untargeted contamination), or manipulate retrieved content so that a downstream RAG pipeline emits attacker-chosen text. The fix is operational: validate every corpus source, gate every fine-tune through an approval workflow, quarantine new retrieval sources, sign model checkpoints, and keep a training data lineage record that an AppSec or audit reader can follow.

Severity

High

CWE ID

CWE-1039

OWASP Top 10

LLM04:2025 - Data and Model Poisoning

CVSS 3.1 Score

8.5

What is data and model poisoning in LLM applications?

Data and model poisoning is the vulnerability class where an attacker shapes the data the model is trained on, fine-tuned with, or retrieves from, in order to bias what the model produces at inference time. The attack does not need the attacker to talk to a deployed model. The attacker writes into the corpus the model later reads (a public knowledge base, a customer-contributed document store, a scraped web index, a customer-support log used for fine-tuning, an embedding index that ingests user uploads), and the planted content rides the training, fine-tune, or retrieval pipeline into the production response surface. The 2025 OWASP Top 10 for Large Language Model Applications lists the class as LLM04:2025 Data and Model Poisoning and groups three failure modes under one heading: targeted poisoning that installs a behavioural trigger (a backdoor that fires on a specific phrase or token), untargeted poisoning that drifts general response quality and shifts the response distribution, and retrieval-corpus poisoning where the planted content is read directly into the prompt at inference time without ever touching the model weights.

The vulnerability sits next to five other LLM Top 10 entries on the agent and pipeline threat model. The prompt injection page covers the input-side hijack at inference time. The indirect prompt injection via RAG page covers the retrieved-data hijack at inference time. The improper output handling in LLM applications page covers the output-sink risk. The excessive agency in LLM applications page covers the action dimension. The system prompt leakage page covers the disclosure dimension. Data and model poisoning is the pre-deployment dimension. Even with a perfectly aligned model at inference time, a clean prompt path, a sanitised output sink, a least-privilege tool registry, and a private system prompt, the question of what the model and the retrieval index were allowed to learn from upstream is a security decision the AI engineering, AppSec, ML platform, and security engineering teams have to own before the model ever serves a request.

For internal AppSec, AI security, product security, ML platform, security engineering, vulnerability management, and GRC teams, data and model poisoning is the OWASP LLM Top 10 entry that turns the data layer of the AI pipeline into part of the attack surface. The fix is operational: validate every corpus source, gate every fine-tune through an approval workflow, quarantine new retrieval sources, sign and verify model checkpoints, separate retrieval-index write authority from read authority, instrument the pipeline to detect post-deployment drift, and keep a training data lineage record the AppSec and audit reader can follow when a finding lands. The same operational discipline that governs hardcoded secrets in code, signed artefact promotion, and access controls on what gets pushed applies directly to training data, fine-tune datasets, and embedding index writes.

Poisoning is the inbound counterpart of an outbound class that targets the same data layer: the model extraction attack page covers the confidentiality side, where the data the team poured into training can be recovered from the deployed model through inference-API queries (model stealing, membership inference, training-data reconstruction). The two findings often pair on the same engagement: a poisoning gap admits adversarial training records on the way in, and an extraction gap then leaks the recoverable signal on the way out. AppSec, AI security, and GRC teams that scope a model security review should walk the inbound and outbound dimensions together.

The poisoning surface

Pre-training corpus

The largest, longest-lived dataset in the pipeline. Most enterprise teams do not pre-train their own foundation model, but consume one from a provider. The corpus is rarely auditable end to end. Risk concentrates on the supplier discipline (the model card, the dataset disclosures, the integrity attestation) rather than on internal controls.

Fine-tune and instruction-tuning datasets

The middle layer where most internal teams actually write to the model. Datasets are usually small, often hand-curated, and frequently sourced from internal logs, customer transcripts, support tickets, or product telemetry. A single contaminated record in this layer leaves a long shadow because the model learns it explicitly.

Retrieval-augmented generation corpus

The runtime layer where documents are pulled at inference time and pasted into the prompt. The retrieval index is usually the most write-permissive surface in the AI pipeline. Public document upload, customer-contributed knowledge, partner ingestion feeds, and scraped third-party sites all land here. Planted content is read every time a relevant query fires.

Embedding indexes and vector stores

The serialised representation the retrieval layer reads against. An attacker who can write a chosen vector into the index can pull a chosen document into a chosen query response without ever touching the document store. Write authority on the index is the controlling boundary.

Reward and feedback signals

RLHF, DPO, and online preference signals feed back into model weights or routing heuristics. If the feedback channel is reachable from a multi-tenant surface (thumbs-up, edit suggestions, rating widgets, accepted-completion telemetry) and is aggregated without per-identity weighting and per-source vetting, an attacker who controls many low-trust identities can drift the model towards their preferred behaviour.

Third-party model artefacts

Open-weight checkpoints, LoRA adapters, embedding models, tokenisers, and quantised variants pulled from public registries. A checkpoint that ships with a backdoor activated by a specific token sequence has the same shape as a poisoned dependency in classical software supply chain risk.

Ingestion pipelines and scrapers

The crawlers, ETL jobs, and connectors that bring external content into the corpus on a schedule. A misconfigured scraper, an unauthenticated webhook, or an unsigned content feed becomes the unattended write path the attacker targets when the document store itself is hardened.

Multi-tenant write paths

A retrieval index, a fine-tune dataset queue, or a feedback aggregator that one tenant can write into and another tenant later reads from. Cross-tenant poisoning is the cleanest path to real-world impact: the attacker pays for a low tier, contributes a payload, and a higher-tier tenant queries the index and receives the planted response.

How it goes wrong

Targeted backdoor through a fine-tune dataset

A contributor adds a chosen trigger phrase paired with a chosen response into a fine-tune dataset. The team runs the fine-tune. After deployment, any prompt that contains the trigger phrase produces the attacker-chosen response, and the rest of the response distribution looks normal to evaluators who do not know the trigger.

Untargeted contamination from low-quality corpus

A scheduled scraper ingests pages without provenance validation. Some of the pages were written specifically to game the corpus (SEO spam, AI-generated text, biased opinion content). The model fine-tunes on the mixture and silently drifts on tone, factuality, refusal patterns, and tool-call accuracy.

RAG corpus poisoning through customer upload

A customer with a low-cost account uploads a document into a shared knowledge base. The document contains text crafted to be retrieved on queries about a competitor, a regulator, an internal product feature, or an unrelated tenant. A different user later asks a related question and the retriever ranks the planted document at the top.

Embedding-index write through unauthenticated ingestion

A webhook accepts new document chunks without authentication and writes them straight into the embedding index. An attacker scripts the webhook, uploads vectors crafted to fire on a chosen query, and turns the retrieval surface into a controlled-response surface for any client of that index.

Open-weight checkpoint with embedded trigger

The team downloads a fine-tuned checkpoint from a public registry to save training cost. The checkpoint embeds a behavioural trigger an evaluator did not catch. The trigger activates a refusal-bypass, a credential-leak, or an output-injection pattern when the right token sequence appears.

Reward-signal manipulation via thumbs-up gaming

A feedback aggregator weights all thumbs-up signals equally. An attacker scripts many low-trust accounts to upvote a particular completion pattern. The model routing layer or the online preference loop drifts towards the manipulated pattern.

Fine-tune queue write without an approval gate

The training pipeline reads its dataset from a shared bucket. A user with bucket write access (a contractor, a partner, a former employee whose access lingered) uploads a contaminated file. The next scheduled fine-tune ingests the file because no human review gates the dataset before the run.

Cross-tenant retrieval contamination

A multi-tenant SaaS shares one retrieval index for similar customers. Tenant A writes a document tagged as private but the retrieval scope filter is broken or stale. Tenant B asks a related question and the planted content from Tenant A is retrieved and pasted into the prompt that serves Tenant B.

Model checkpoint download without integrity verification

The deployment pipeline pulls a checkpoint from a registry, copies the bytes, loads them into the serving worker, and starts serving. No hash check runs against a known good value, no signature verification runs against the publisher key, and a substituted checkpoint serves traffic until the next manual inspection.

Three failure modes under one heading

| Dimension | Targeted poisoning | Untargeted poisoning | Retrieval-corpus poisoning |
| --- | --- | --- | --- |
| Goal | Install a behavioural trigger that fires on a specific input. | Drift general response quality, factuality, refusal patterns, or tone. | Force the retrieval layer to surface attacker-chosen content on chosen queries. |
| Where it lands | Fine-tune dataset, instruction-tuning dataset, or open-weight checkpoint. | Pre-training corpus, fine-tune dataset, reward-signal aggregator. | RAG corpus, embedding index, document store, ingestion queue. |
| Detection difficulty | Hard. The behaviour looks normal until the exact trigger appears, and the trigger may be a low-frequency token sequence. | Medium. Evaluators see distributional drift if a held-out test set covers the affected category. Easy to miss for narrow drift. | Medium. Canary queries against the index catch obvious cases. Subtle reranking attacks survive a quick spot check. |
| Primary control | Per-record provenance on the fine-tune dataset; signed checkpoints; trigger-phrase canary tests. | Corpus source validation; held-out evaluation per topic; per-source weighting on reward signals. | Per-source authorisation on retrieval-index writes; tenant-scope filters; ingestion quarantine. |
| Where SecPortal records the finding | Engagement finding with the trigger phrase, the contaminated dataset reference, the affected checkpoint, and the post-fix retest record. | Engagement finding with the affected corpus source, the regression evaluation results, and the source-validation control gap. | Engagement finding with the planted document, the retrieval scope filter gap, the affected tenant pair, and the post-fix retest record. |

Common causes

No provenance record on dataset records

The fine-tune dataset is a single CSV in a bucket. There is no per-record source URI, no per-record contributor identity, no per-record timestamp, no per-record approval flag. When a finding lands, the team cannot answer where any given training row came from, which is the first question the auditor and the incident responder will ask.

Unauthenticated ingestion paths

A webhook, a document upload endpoint, a partner connector, or a public scraper writes into the corpus or the retrieval index without verifying the source. The write path is the easiest, cheapest, and most reusable poisoning channel.

Shared retrieval index across tenants

One embedding index serves multiple tenants. Scope filtering at query time is the only boundary, and the filter logic depends on a single client-supplied or session-supplied claim. A broken filter or a stale rule lets one tenant write content the other reads.

No human-approval gate on the fine-tune queue

The training pipeline runs on a cron. A new dataset commit triggers a new fine-tune. No human reviews the diff between this dataset and the previous version, and no automated content scanner runs against the change before the model trains.

Open-weight checkpoints loaded without signature checks

The team uses Hugging Face, GitHub releases, or an internal artefact store as the checkpoint origin. The serving pipeline downloads the file and loads it. No SHA-256 verification, no Sigstore verification, no publisher key check, and no allow-listed digest gate the load.

Feedback aggregation without per-identity weighting

Thumbs-up, accepted-completion telemetry, edit-distance signals, and human-preference labels are aggregated by raw count. An attacker who can spawn many low-trust identities can outvote the signal from sanctioned users and drift the model towards the chosen pattern.

How to detect it

Automated detection

SecPortal's code scanning runs against connected repositories and flags training and fine-tune pipelines where the dataset source URI is not pinned to a known artefact, where the model checkpoint download skips an integrity check (no hash verification, no signature verification, no allow-listed digest), where the retrieval ingestion path accepts content from an unverified origin, where the embedding upsert step lacks a write-side authorisation check, where the dataset queue lacks a per-record approval flag, or where the reward-signal aggregator weights all feedback identities equally
Authenticated scanning drives RAG-backed and fine-tune-backed endpoints with corpus probing payloads under a real session: canary queries that read suspected planted content back from the index, trigger-phrase queries that test for installed behavioural backdoors against a known good baseline, role and tenant boundary checks across the retrieval surface, ingestion replay attempts against the document upload path, and reranking-attack probes against the embedding store, then records the response as evidence on the finding
External scanning discovers exposed dataset endpoints, public document-ingestion webhooks, RAG admin surfaces, public fine-tune job APIs, embedding-index write paths, and scraper webhooks reachable from the verified perimeter that may accept attacker-controlled content without authentication
Continuous monitoring re-runs the corpus probe, the trigger-phrase canary set, the tenant boundary check, and the retrieval scope filter test on a defined cadence so a newly added ingestion path, a regressed scope filter, a removed signature check, a removed approval gate, or a new third-party checkpoint is caught against the previous baseline rather than waiting for the next manual review

Manual testing

Enumerate every dataset the model consumes (pre-training disclosure, fine-tune dataset, instruction-tuning dataset, RAG corpus, embedding index, reward-signal aggregator, third-party checkpoint registry), the contributor identity that writes to each, the approval gate that governs each, and the integrity attestation that ships with each
For each retrieval-corpus write path, send a planted canary document under a low-trust identity, query the index from a high-trust identity, and confirm whether the scope filter, the tenant boundary, the source-validation gate, and the quarantine workflow stop the planted content from reaching the response
For each fine-tune pipeline, attempt to submit a dataset with a known trigger phrase against the approval gate, the content scanner, and the dataset signing workflow, and confirm whether the submission is reviewed, blocked, or accepted without comment
For each model checkpoint load, simulate a checkpoint substitution against the deployment pipeline and confirm whether the hash check, the signature verification, or the allow-listed digest stops the load before serving traffic
Read the activity log for the test session and confirm that each dataset write, each fine-tune run, each retrieval-index write, each checkpoint load, and each reward-signal aggregation captures the actor identity, the source reference, the timestamp, and the approval reference a security, incident-response, or audit responder would need to attribute the change

How to fix it

Record per-record provenance on every dataset

Each training, fine-tune, instruction-tuning, RAG-corpus, and reward-signal record should carry a source URI, a contributor identity, a timestamp, a content hash, and an approval reference. The provenance metadata sits next to the record and travels with it through the pipeline. When a finding lands, the responder reads the provenance row to identify which contributor, which source, and which approval to revoke.
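
As a minimal sketch of what such a provenance row could look like, the Python below builds one record and re-checks its content hash later. The `ProvenanceRecord` class, its field names, and the helper functions are illustrative assumptions, not part of any particular pipeline or product.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-record provenance row; field names are illustrative."""
    source_uri: str
    contributor_id: str
    timestamp: str
    content_hash: str
    approval_ref: Optional[str]  # stays None until a reviewer approves the record


def _sha256(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def make_provenance(content: str, source_uri: str, contributor_id: str,
                    approval_ref: Optional[str] = None) -> ProvenanceRecord:
    """Build the provenance row that travels with a dataset record."""
    return ProvenanceRecord(
        source_uri=source_uri,
        contributor_id=contributor_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        content_hash=_sha256(content),
        approval_ref=approval_ref,
    )


def is_intact(content: str, record: ProvenanceRecord) -> bool:
    """True when the content still matches the hash captured at ingestion."""
    return _sha256(content) == record.content_hash
```

Storing the hash next to the record lets a responder confirm that the row under investigation is the row that was originally approved, not a later substitution.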

Gate every fine-tune through a human approval workflow

New dataset versions, instruction-tuning runs, and reward-signal recalibrations should require a documented human approval before the training job starts. The approval references the diff against the previous version, the content-scanner result, and the planned evaluation. The fine-tune pipeline reads the approval flag and refuses to run without it.
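
The "refuses to run without it" behaviour can be sketched as a precondition check at the top of the training job. The `Approval` structure and `require_approval` function below are hypothetical names used for illustration; a real pipeline would read approvals from whatever review system the team already uses.

```python
from dataclasses import dataclass


class ApprovalMissing(Exception):
    """Raised when a fine-tune is requested without a matching approval."""


@dataclass(frozen=True)
class Approval:
    dataset_digest: str   # digest of the exact dataset version that was reviewed
    approver: str         # human reviewer who signed off on the diff
    scanner_passed: bool  # result of the content scanner on the diff


def require_approval(dataset_digest: str, approvals: list) -> Approval:
    """Return the approval for this dataset version, or refuse to proceed."""
    for approval in approvals:
        if approval.dataset_digest == dataset_digest and approval.scanner_passed:
            return approval
    raise ApprovalMissing(f"no valid approval for dataset {dataset_digest}")
```

Binding the approval to a digest rather than a filename means a file swapped after review no longer matches the approved version.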

Sign and verify model checkpoints at the deployment boundary

The serving pipeline should verify a SHA-256 hash against a known good value, a publisher signature against an allow-listed key, or a Sigstore attestation against the expected identity before loading the checkpoint into a worker. A checkpoint that fails verification should hard-fail the load, log the failure, and create a finding the security team reviews.
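
The hash-check variant of this control might look like the following sketch, which uses only the Python standard library. The function names and the idea of passing an in-memory allow-list of digests are assumptions made for illustration; signature and Sigstore verification are not shown.

```python
import hashlib
from pathlib import Path


class CheckpointVerificationError(Exception):
    """Raised when a checkpoint digest is not on the allow-list."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large checkpoints are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path: Path, allowed_digests: set) -> str:
    """Hard-fail before load when the digest is not a known good value."""
    digest = sha256_of(path)
    if digest not in allowed_digests:
        raise CheckpointVerificationError(
            f"{path.name}: digest {digest} is not allow-listed"
        )
    return digest
```

A serving worker would call `verify_checkpoint` before deserialising the file and treat the exception as a reason to stop, log, and raise a finding rather than fall back to an unverified load.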

Separate write authority from read authority on retrieval indexes

The identities that ingest content into a RAG corpus or an embedding index should be distinct from the identities that query the index. Tenant-scope filters should be enforced at query time on a trusted server-side claim rather than on a client-supplied parameter. Scope filter regressions should be detected by canary queries against the index.
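
A rough illustration of the server-side scope filter and the canary check follows. The `Chunk` and `Session` types are hypothetical, and the sketch assumes the session's tenant has already been derived from a verified token rather than from a request parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant_id: str
    text: str


@dataclass(frozen=True)
class Session:
    """Server-side session; tenant_id comes from the verified token, not the request."""
    user_id: str
    tenant_id: str


def scoped_results(candidates: list, session: Session) -> list:
    """Drop every retrieved chunk that belongs to a different tenant."""
    return [c for c in candidates if c.tenant_id == session.tenant_id]


def canary_leaks(candidates: list, session: Session, canary_doc_id: str) -> bool:
    """True when a canary document planted in another tenant reaches this session."""
    return any(c.doc_id == canary_doc_id for c in scoped_results(candidates, session))
```

Running `canary_leaks` on a schedule from a tenant that should never see the canary gives an early signal when the filter regresses.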

Quarantine new corpus sources before mixing them into production

A new ingestion feed, a new scraper, a new partner connector, or a new customer upload path should land in a staging corpus that is read by no production model. A content scanner runs against the staging corpus before promotion. The promotion step requires a documented human approval. The audit reads from the promotion log.
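
The staging-then-promotion flow can be sketched as a small state check. The `CorpusSource` type, the stage names, and the shape of the promotion log entry are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


class PromotionBlocked(Exception):
    """Raised when a source fails a promotion precondition."""


@dataclass
class CorpusSource:
    name: str
    stage: str = "staging"            # new sources always start outside production
    scan_passed: bool = False         # set by the content scanner
    approved_by: Optional[str] = None  # set by the human reviewer


def promote(source: CorpusSource, promotion_log: list) -> None:
    """Move a source to production only after the scan and the approval exist."""
    if not source.scan_passed:
        raise PromotionBlocked(f"{source.name}: content scan has not passed")
    if not source.approved_by:
        raise PromotionBlocked(f"{source.name}: no human approval recorded")
    source.stage = "production"
    promotion_log.append({"source": source.name, "approved_by": source.approved_by})
```

Because every successful promotion appends to `promotion_log`, the same list doubles as the record an auditor reads.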

Authenticate every ingestion path with a per-source credential

Webhooks, document upload endpoints, partner connectors, and scraper APIs should require a credential that is rotated on a schedule, scoped to one source, and revocable when the source becomes untrusted. The credential identity ends up on the per-record provenance row.
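
One common way to implement a per-source credential is an HMAC signature over the request body, keyed by a secret issued to that source alone. The sketch below assumes a hypothetical in-memory `keys_by_source` mapping; in practice the secrets would live in a secrets manager, and revocation would mean deleting the entry.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Signature the source attaches to each ingestion request."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def authenticate_ingestion(source_id: str, payload: bytes, signature: str,
                           keys_by_source: dict) -> bool:
    """Accept the write only when the signature matches this source's own key."""
    secret = keys_by_source.get(source_id)
    if secret is None:
        return False  # unknown or revoked source
    expected = sign_payload(secret, payload)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

When the check passes, `source_id` is the identity to write into the per-record provenance row.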

Weight feedback signals by identity trust score

Thumbs-up, accepted-completion telemetry, edit-distance signals, and human-preference labels should be weighted by per-identity trust rather than aggregated by raw count. A new identity, a multi-tenant low-tier account, or a high-frequency feedback pattern should weigh less than a sanctioned reviewer until the identity earns a higher trust score.
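
To make the difference between raw-count and trust-weighted aggregation concrete, here is a small sketch. The trust values, the `default_trust` floor for unknown identities, and the one-vote-per-identity rule are illustrative assumptions rather than recommended settings.

```python
def weighted_score(votes: list, trust: dict, default_trust: float = 0.05) -> float:
    """Aggregate (identity, vote) pairs, where vote is +1 or -1, by identity trust.

    Only the last vote per identity counts, so one account cannot vote repeatedly.
    Returns a value in [-1, 1]; 0.0 when there are no votes.
    """
    latest = dict(votes)  # later votes overwrite earlier ones for the same identity
    total_weight = 0.0
    weighted_sum = 0.0
    for identity, vote in latest.items():
        weight = trust.get(identity, default_trust)
        weighted_sum += weight * vote
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

With these assumed weights, fifty unknown accounts upvoting a completion carry less combined weight than three sanctioned reviewers downvoting it, whereas a raw count would side with the fifty.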

Run a trigger-phrase canary suite on every fine-tune

A fixed test set of trigger phrases (real ones from past findings, simulated ones from threat modelling, randomly generated ones from a canary generator) should be evaluated against every new checkpoint before promotion. A response that matches the trigger pattern hard-fails the promotion and creates a finding.
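
A canary suite of this kind could be sketched as below. The `generate` callable stands in for whatever inference call the team uses, the prompt template is arbitrary, and substring matching is a deliberately crude stand-in for a real response classifier.

```python
from typing import Callable


def run_canary_suite(generate: Callable[[str], str], canaries: list) -> list:
    """Return the trigger phrases whose forbidden pattern shows up in the output.

    `canaries` is a list of (trigger_phrase, forbidden_pattern) pairs.
    An empty result means the checkpoint passed every canary.
    """
    failures = []
    for trigger, forbidden in canaries:
        response = generate(f"Please summarise the following note: {trigger}")
        if forbidden.lower() in response.lower():
            failures.append(trigger)
    return failures
```

A promotion step would block the checkpoint whenever the returned list is non-empty and attach the failing triggers to the finding.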

Hold a held-out evaluation set per topic the model is supposed to cover

Untargeted poisoning is detected by a regression evaluation that compares response quality per topic against the previous checkpoint. A drift threshold per topic, a refusal-rate threshold, and a tool-call-accuracy threshold should each gate the promotion. A regression in any of them creates a finding before the model serves traffic.
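
The per-topic drift threshold can be illustrated with a short comparison function. The score dictionaries, the `max_drop` default, and the choice to treat a missing topic as a score of zero are assumptions for the sketch; refusal-rate and tool-call-accuracy gates would follow the same pattern.

```python
def regression_failures(previous: dict, candidate: dict, max_drop: float = 0.02) -> dict:
    """Return {topic: drop} for every topic whose score fell by more than max_drop.

    Scores are assumed to be in [0, 1], higher is better. A topic missing from
    the candidate results is treated as a score of 0.0 so it cannot pass silently.
    """
    failures = {}
    for topic, prev_score in previous.items():
        drop = prev_score - candidate.get(topic, 0.0)
        if drop > max_drop:
            failures[topic] = drop
    return failures
```

An empty result lets the promotion continue; any entry names the topic to investigate before the checkpoint serves traffic.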

Bound retrieval write rate per source

A document ingestion path should rate-limit per source identity and alert when a single source exceeds its expected ingestion volume. The rate-limit is the operational guard against a low-trust identity that scripts the upload path to flood the corpus with planted content before a manual reviewer notices.
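
A per-source limit is often implemented as a sliding window over recent write timestamps. The class below is a minimal in-memory sketch with hypothetical names; a production limiter would normally keep its state in a shared store and emit an alert when it denies a write.

```python
from collections import defaultdict, deque


class SourceRateLimiter:
    """Sliding-window limiter: at most max_writes per source per window."""

    def __init__(self, max_writes: int, window_seconds: float) -> None:
        self.max_writes = max_writes
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # source_id -> timestamps of accepted writes

    def allow(self, source_id: str, now: float) -> bool:
        """Record and allow the write, or return False when the source is over limit."""
        events = self._events[source_id]
        # Forget writes that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_writes:
            return False
        events.append(now)
        return True
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the limiter easy to test and independent of wall-clock changes.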

Maintain a training data lineage record the audit can read

The lineage record names every dataset that fed the production model, every contributor who wrote into each dataset, every approval that gated each promotion, every checkpoint that was signed, and every revocation that ran when a source was deauthorised. AppSec, security engineering, GRC, and audit all read from the same record.

Re-run the corpus probe on every model, dataset, and ingestion change

A new base model with different training disclosures, a new fine-tune dataset, a new RAG ingestion source, a relaxed approval gate, a removed signature check, or a new scraper schedule can re-open a closed poisoning finding. Treat the corpus probe and the trigger-phrase canary as first-class CI gates alongside unit and integration tests, and keep the canary set in the test suite where the team will see it.

What this looks like in SecPortal

Finding record with corpus, contributor, and evidence

The finding captures the affected corpus (training, fine-tune, RAG, embedding index, or reward signal), the contributor identity, the source URI, the timestamp, the contaminated record reference, the trigger phrase or planted document, the affected checkpoint, the evaluation result, and the actor who reproduced the abuse. The evidence is what AppSec, AI engineering, ML platform, and security engineering need to reproduce the finding against the same dataset, the same pipeline, and the same model.

Code scanning across training and ingestion pipelines

Code scanning runs against connected GitHub, GitLab, and Bitbucket repositories. Findings surface at training and fine-tune pipelines where the dataset source URI is unpinned, where the checkpoint download skips integrity verification, where the retrieval ingestion path accepts content from an unverified origin, where the embedding upsert lacks a write-side authorisation check, where the dataset queue lacks an approval flag, or where the reward-signal aggregator weights all feedback identities equally.

Authenticated scanning with corpus probes

Authenticated scanning drives RAG-backed and fine-tune-backed endpoints with a curated set of corpus probes under a real session: canary document writes, trigger-phrase queries, tenant boundary checks, ingestion replay attempts, and embedding-store reranking probes. Each probe records whether the scope filter, the authorisation gate, or the approval workflow stopped the planted content, and the response becomes evidence on the finding.

External scanning across ingestion perimeter

External scanning discovers exposed document-ingestion webhooks, RAG admin surfaces, public fine-tune job APIs, embedding-index write paths, dataset connectors, and scraper webhooks reachable from the verified perimeter that may accept attacker-controlled content without authentication. The exposed path lands on the finding alongside the recommendation to authenticate or remove it.

Continuous monitoring against regression

Continuous monitoring re-runs the corpus probe, the trigger-phrase canary, the tenant boundary check, and the retrieval scope filter test on the configured cadence (daily, weekly, biweekly, or monthly). A new ingestion source, a regressed scope filter, a removed signature check, a removed approval gate, or a new third-party checkpoint shows up against the baseline rather than waiting for the next manual review.

Retest after the remediation ships

Once the fix deploys (provenance recorded, approval gate in place, signature verification active, scope filter restored, ingestion path authenticated), a targeted retest replays the original corpus probe and the original trigger-phrase canary against the new pipeline and records the post-fix response on the finding. The finding closes against the evidence rather than against a developer assertion that the dataset is now clean.

AI-assisted writeups with explicit honest scope

AI reports generate the writeup, the executive summary, and the developer-facing reproduction steps from the finding record. The narrative stays within the verified evidence (the corpus, the contributor, the source URI, the trigger phrase, the affected checkpoint, the evaluation result) and does not invent dataset lineage, signature infrastructure, or MLOps automation the product does not have.

Activity log for the dataset and checkpoint audit trail

The workspace activity log captures the security-side audit trail: who created the finding, who suppressed it, who closed it, who approved a related exception, who imported the third-party scanner result, and when the retest fired. Pair the workspace activity log with the engineering-side lineage record the pipeline writes (per-record provenance, approval reference, signature verification result) so post-incident reconstruction reads from both records.

Finding overrides for sanctioned ingestion paths

Where a documented ingestion path is a deliberate, sanctioned workload (an internal-only knowledge ingestion, a partner feed with a signed contract, a curated fine-tune dataset from a trusted contributor), finding overrides record the suppression rationale, the owner, and the expiry on the finding itself. The exception lives on the operating record alongside every other documented deviation.

Compliance tracking pairs the fix to control evidence

Compliance tracking maps data and model poisoning findings to the controls that read against them: ISO 27001 A.5.23 (information security for use of cloud services), A.5.31 (legal, statutory, regulatory and contractual requirements), A.8.7 (protection against malware), A.8.10 (information deletion), and A.8.30 (outsourced development); SOC 2 CC6.1 (logical access), CC7.1 (system monitoring), and CC8.1 (change management); NIST 800-53 RA-3 (risk assessment), SI-7 (software, firmware, and information integrity), and SR-3 (supply chain controls and processes); NIST AI RMF Map, Measure, Manage; ISO/IEC 42001 AI management system; NIST SSDF PS.1 (protect all forms of code) and PW.4 (reuse existing, well-secured software). The same finding feeds the engineering ticket and the auditor evidence pack.

What SecPortal does not do

SecPortal is the operating record where data and model poisoning findings, the affected corpus reference, the contributor identity, the source URI, the planted content or trigger phrase, the affected checkpoint, and the evaluation result land alongside the rest of the security backlog. The product does not act as a training data lineage system that traces every record from raw source to deployed model, does not maintain a managed dataset catalogue with per-record provenance metadata, does not run model evaluation harnesses or trigger-phrase canary suites on your behalf, does not sign or verify model checkpoints, does not scan dataset content for poisoning patterns, does not maintain an embedding index or run reranking experiments, and does not enumerate dataset lineage across MLOps platforms.

SecPortal does not connect to MLflow, Weights & Biases, Hugging Face Hub, SageMaker, Vertex AI, Azure ML, Databricks, Snowflake, Pinecone, Weaviate, Chroma, Qdrant, Jira, ServiceNow, Slack, SIEM, SOAR, or AI-BOM platforms through packaged integrations. The discipline is the engineering practice on top of the operating record: AI engineering, ML platform, AppSec, product security, and security engineering teams write the per-record provenance metadata, the human approval workflow on the fine-tune queue, the checkpoint signature verification at the serving boundary, the scope filter on the retrieval index, the per-source authentication on the ingestion path, the trigger-phrase canary suite in the test pipeline, the held-out evaluation gate before promotion, and the lineage record the audit reads.

Compliance impact

OWASP Top 10 for LLM Apps

LLM04:2025 - Data and Model Poisoning (targeted backdoor, untargeted contamination, retrieval-corpus poisoning)

NIST AI RMF

Map, Measure, Manage; Valid and Reliable, Secure and Resilient trustworthy AI characteristics; data-lifecycle and integrity guidance

ISO/IEC 42001

AI management system - data quality, AI system lifecycle, operational planning and control, and monitoring, measurement, analysis and evaluation

ISO 27001

Annex A 5.23 cloud services security; 5.31 legal, statutory, regulatory and contractual requirements; 8.7 protection against malware; 8.10 information deletion; 8.30 outsourced development

SOC 2

CC6.1 logical access; CC7.1 system monitoring; CC8.1 change management

NIST 800-53

RA-3 Risk assessment; SI-7 Software, firmware, and information integrity; SR-3 Supply chain controls and processes; SR-11 Component authenticity

Related vulnerabilities

Prompt Injection

Indirect Prompt Injection via RAG

Improper Output Handling in LLM Applications

Excessive Agency in LLM Applications

System Prompt Leakage in LLM Applications

Unbounded Consumption in LLM Applications

Vulnerable & Outdated Components

Insufficient Logging and Monitoring

Related features

Find vulnerabilities before they ship

Test web apps behind the login

Vulnerability scanning tools that map your attack surface

Vulnerability management software that tracks every finding

Monitor continuously, catch regressions early

AI-powered reports in seconds, not days

Every action recorded across the workspace

Finding overrides that survive every scan cycle

Compliance tracking without a full GRC platform

Verify fixes and track reopens on the same finding record

Track LLM04 data and model poisoning findings on one engagement record

SecPortal pairs the code scan, the authenticated scan, and the external scan with one findings record per poisoning vulnerability, with CVSS 3.1 severity, framework mapping, retest pairing, and an append-only activity log.