Fix O(N²) duplicate-attribute check in Attributes iterator by qifan-sailboat · Pull Request #971 · tafia/quick-xml

qifan-sailboat · 2026-06-22T22:30:54Z

Fixes #969.

IterState::check_for_duplicates did a linear Vec scan of all already-seen attribute name ranges for every new attribute, so a start tag with N distinct attribute names cost O(N²/2) byte comparisons. On untrusted XML this is a CPU-exhaustion DoS — see #969 for measurements (N=80,000 ≈ 6.1 s release; N=800,000 ≈ 10 min) and the demonstrated downstream impact on NLnet Labs Routinator.
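To make the quadratic cost concrete, here is a minimal sketch that counts the name comparisons a naive "scan everything seen so far" check performs. `count_comparisons` is a hypothetical helper written for illustration; it is not quick-xml code, although the comparison expression mirrors the one in the diff.

```rust
use std::ops::Range;

/// Counts the name comparisons performed by a naive duplicate check.
/// Hypothetical helper for illustration only.
fn count_comparisons(buf: &[u8], keys: &[Range<usize>]) -> usize {
    let mut seen: Vec<Range<usize>> = Vec::new();
    let mut comparisons = 0;
    for key in keys {
        // Every new name is compared against every name seen so far.
        for prev in &seen {
            comparisons += 1;
            if buf[prev.clone()] == buf[key.clone()] {
                return comparisons; // duplicate found, stop early
            }
        }
        seen.push(key.clone());
    }
    comparisons
}

fn main() {
    // Build N distinct fixed-width names "a0000", "a0001", ... in one buffer.
    let n = 1000;
    let mut buf = Vec::new();
    let mut keys = Vec::new();
    for i in 0..n {
        let start = buf.len();
        buf.extend_from_slice(format!("a{:04}", i).as_bytes());
        keys.push(start..buf.len());
    }
    let c = count_comparisons(&buf, &keys);
    println!("{} attributes -> {} comparisons", n, c);
}
```

With no duplicates the count is N(N-1)/2, so 1,000 distinct names already need 499,500 comparisons.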

This adds a HashSet<u64> of DefaultHasher hashes of the key bytes as an O(1) pre-filter:

- a fresh hash means the key cannot be a duplicate → push and return Ok (the no-duplicate path is now amortised O(1) per attribute / O(N) per start tag);
- on a hash hit (a real duplicate, or an astronomically rare 64-bit collision) fall back to the existing linear scan to recover the exact previous position for AttrError::Duplicated(new, prev); error semantics are unchanged.

The set is lazily allocated (Option<HashSet<u64>>) so IterState::new, Attributes::new and Attributes::html stay const fn.
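The pre-filter described above can be sketched roughly as follows. This is a simplified stand-in, not the code in the PR: the `DupCheck` type, its field names and the `(new, prev)` tuple error are assumptions made for illustration; only the `DefaultHasher`, `Option<HashSet<u64>>` and linear-scan fallback come from the description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::Hasher;
use std::ops::Range;

/// Simplified stand-in for the iterator state (names are illustrative).
struct DupCheck {
    keys: Vec<Range<usize>>,
    // `None` until the first attribute is checked, so `new` can stay `const`.
    key_hashes: Option<HashSet<u64>>,
}

impl DupCheck {
    const fn new() -> Self {
        DupCheck { keys: Vec::new(), key_hashes: None }
    }

    /// Returns `Err((new_start, prev_start))` when `key` repeats an earlier name.
    fn check(&mut self, buf: &[u8], key: Range<usize>) -> Result<(), (usize, usize)> {
        let mut h = DefaultHasher::new();
        h.write(&buf[key.clone()]);
        let hash = h.finish();

        let set = self.key_hashes.get_or_insert_with(HashSet::new);
        if !set.insert(hash) {
            // Hash already present: a real duplicate or a collision.
            // Confirm with the exact scan to recover the previous position.
            if let Some(prev) = self
                .keys
                .iter()
                .find(|r| buf[(*r).clone()] == buf[key.clone()])
            {
                return Err((key.start, prev.start));
            }
        }
        // Fresh hash (or a collision with a different name): not a duplicate.
        self.keys.push(key);
        Ok(())
    }
}

fn main() {
    let buf = b"id class id";
    let mut state = DupCheck::new();
    assert_eq!(state.check(buf, 0..2), Ok(()));
    assert_eq!(state.check(buf, 3..8), Ok(()));
    // The second "id" starts at byte 9 and repeats the name at byte 0.
    assert_eq!(state.check(buf, 9..11), Err((9, 0)));
}
```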

N (distinct attrs)	before	after
10,000	75 ms	0 ms
20,000	303 ms	1 ms
40,000	1,251 ms	2 ms
80,000	6,109 ms	4 ms

A timing regression test is included in events::attributes::xml::duplicated::with_check.

dralley · 2026-06-22T22:46:49Z

And the distinct attrs column, I assume that represents N attributes within one element, as opposed to N attributes interspersed across many elements?

qifan-sailboat · 2026-06-22T23:32:46Z

Yes. It means N distinct attributes on a single start tag.

codecov-commenter · 2026-06-23T03:36:42Z

Codecov Report

❌ Patch coverage is 61.22449% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.41%. Comparing base (e00ae5c) to head (f49756b).
⚠️ Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
benches/issue971.rs	0.00%	35 Missing ⚠️
src/events/attributes.rs	95.23%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #971      +/-   ##
==========================================
+ Coverage   57.31%   57.41%   +0.10%     
==========================================
  Files          46       47       +1     
  Lines       18197    18340     +143     
==========================================
+ Hits        10429    10530     +101     
- Misses       7768     7810      +42

Flag	Coverage Δ
unittests	`57.41% <61.22%> (+0.10%)`	⬆️

dralley · 2026-06-23T03:37:05Z

-                .find(|r| slice[(*r).clone()] == slice[key.clone()])
-            {
-                return Err(AttrError::Duplicated(key.start, prev.start));
+            let mut h = DefaultHasher::new();


SipHash (the Rust default) is fairly expensive in computational terms, though I suppose if the goal is DoS resistance it is the most "safe" option.

Nonetheless, would you mind running some benchmarks (e.g. https://github.com/tafia/quick-xml/blob/master/benches/microbenches.rs#L177) to compare the existing implementation against this implementation with SipHash and with aHash for normal inputs? Or at least the former two.

I would just like an idea of what impact this will have for the standard case.

dralley · 2026-06-23T03:38:09Z

Please run cargo fmt

dralley · 2026-06-23T03:59:02Z

+            /// so a start tag with many distinct attribute names cost O(N²) byte
+            /// comparisons. With the hash pre-filter the same input is O(N).
+            #[test]
+            fn many_distinct_attributes_is_linear() {


I don't love the idea of having something like this in a unit test. I suppose if it's reliable enough, it could be OK. On the other hand maybe it's best to just have a criterion microbenchmark, those tend to get run for any important implementation changes.

@Mingun, thoughts?

Mingun

Please move the test into the benchmark directory, and please look into addressing the other comments. You may force-push the changes, because this PR is small enough (and anyway, we prefer to have a clean history).

Mingun · 2026-06-23T09:33:29Z

+            /// so a start tag with many distinct attribute names cost O(N²) byte
+            /// comparisons. With the hash pre-filter the same input is O(N).
+            #[test]
+            fn many_distinct_attributes_is_linear() {


Yes, I agree, this should be placed in the ./benches directory. I suggest placing it in ./benches/issue971.rs. Benchmarks run as tests on CI, so we will get the same results as for a unit test:

quick-xml/.github/workflows/rust.yml, lines 64 to 65 in 9aaea92:

- name: Run tests + benchmarks
  run: cargo test --all-features --benches --tests

Mingun · 2026-06-23T09:36:50Z

+    /// the duplicate check is amortised O(N) over the whole start tag instead of
+    /// O(N²). Lazily allocated on first use so that [`IterState::new`] can stay
+    /// `const`.
+    key_hashes: Option<HashSet<u64>>,


It seems we should use a NoHasher here, because the values we put into the set are already hashes.
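A minimal sketch of what such a pass-through hasher could look like, using only the standard library. `IdentityHasher` and `HashOfHashes` are hypothetical names chosen for illustration; the type actually used in the PR may differ.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Pass-through hasher for keys that are already 64-bit hashes.
/// Illustrative only; not the type from the PR.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        // Fallback for non-u64 keys; not expected to run for a set of u64 values.
        for &b in bytes {
            self.0 = (self.0 << 8) | u64::from(b);
        }
    }

    fn write_u64(&mut self, n: u64) {
        // `u64`'s `Hash` impl calls this, so the stored value is used as-is.
        self.0 = n;
    }
}

type HashOfHashes = HashSet<u64, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let mut set = HashOfHashes::default();
    assert!(set.insert(0xDEAD_BEEF)); // first time: newly inserted
    assert!(!set.insert(0xDEAD_BEEF)); // second time: already present
    assert!(set.insert(42));
}
```

The point of the design is that the set no longer runs SipHash a second time over a value that is itself a SipHash output.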

Mingun · 2026-06-23T09:39:01Z

+  pre-filter so the no-duplicate path is amortised O(1) per attribute; the
+  exact `AttrError::Duplicated(new, prev)` positions are unchanged.
+
+[#969]: https://github.com/tafia/quick-xml/issues/969


We put links at the end of the version section (here you put it at the end of the Bug Fixes section). Could you please move it down (keeping the 2 blank lines before the next ## section)?

@qifan-sailboat Please rebase your PR and fix both this note and the link that was added in the other PR

dralley · 2026-06-27T22:36:04Z

@qifan-sailboat Do you plan to pick this back up?

`IterState::check_for_duplicates` did a linear scan of every previously seen attribute name for each new attribute, so a start tag with N distinct names cost O(N²) byte comparisons -- a CPU-exhaustion vector on untrusted XML (tafia#969).

Small tags keep the linear scan: for the handful of attributes a real start tag carries it is faster than hashing and needs no allocation (the busiest element in `players.xml` has 22). Once a tag declares more than `SMALL_ATTRIBUTE_COUNT` (32) attributes it switches to a 64-bit hash pre-filter, making the whole tag O(N). The set is seeded from the names already collected, so a duplicate that spans the switch is still caught, and on a hit it falls back to the linear scan to report the exact previous position -- `AttrError::Duplicated` is unchanged.

The pre-filter stores SipHash name hashes in a `HashSet` keyed by an identity hasher, since the values are already hashes (no re-hashing).

Exercised by a new `benches/issue971.rs` and a unit test covering a duplicate past the hash threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

qifan-sailboat · 2026-06-28T00:50:10Z

Thanks @dralley and @Mingun for the reviews. Rebased onto master (so this now sits on top of #970) and force-pushed a single clean commit. Summary of the changes plus the benchmarks @dralley asked for.

Design: keep the linear scan for small tags, hash only for large ones

@dralley's point about SipHash being expensive is the right one, so I leaned into it: the O(N²) only bites for tags with many attributes, and real start tags carry a handful — where the linear scan is faster than any hashing and needs no allocation. So this revision keeps the existing linear scan for tags with up to SMALL_ATTRIBUTE_COUNT (= 32) attributes and only switches to a hash pre-filter above that. On the switch it seeds the set from the names already collected, and on a hit it still falls back to the linear scan to recover the exact previous position, so AttrError::Duplicated(new, prev) is unchanged.
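The threshold design can be sketched as below. This is an approximation under stated assumptions: `DupCheck`, `hash_name`, `find_prev` and the tuple error are illustrative names, the exact boundary comparison is a guess at "more than 32", and for brevity the set uses the default hasher rather than an identity hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::Hasher;
use std::ops::Range;

const SMALL_ATTRIBUTE_COUNT: usize = 32; // threshold named in the PR

fn hash_name(name: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(name);
    h.finish()
}

/// Illustrative stand-in for the iterator state.
struct DupCheck {
    keys: Vec<Range<usize>>,
    hashes: Option<HashSet<u64>>,
}

impl DupCheck {
    const fn new() -> Self {
        DupCheck { keys: Vec::new(), hashes: None }
    }

    /// Exact linear scan; returns the start of the earlier occurrence.
    fn find_prev(&self, buf: &[u8], key: &Range<usize>) -> Option<usize> {
        self.keys
            .iter()
            .find(|r| buf[(*r).clone()] == buf[key.clone()])
            .map(|r| r.start)
    }

    fn check(&mut self, buf: &[u8], key: Range<usize>) -> Result<(), (usize, usize)> {
        if self.keys.len() < SMALL_ATTRIBUTE_COUNT {
            // Small tag: plain linear scan, no hashing, no set allocation.
            if let Some(prev) = self.find_prev(buf, &key) {
                return Err((key.start, prev));
            }
        } else {
            if self.hashes.is_none() {
                // First attribute past the threshold: seed from names collected so far.
                let seeded: HashSet<u64> =
                    self.keys.iter().map(|r| hash_name(&buf[r.clone()])).collect();
                self.hashes = Some(seeded);
            }
            let fresh = self.hashes.as_mut().unwrap().insert(hash_name(&buf[key.clone()]));
            if !fresh {
                // Hash hit: confirm with the exact scan to get the previous position.
                if let Some(prev) = self.find_prev(buf, &key) {
                    return Err((key.start, prev));
                }
            }
        }
        self.keys.push(key);
        Ok(())
    }
}

fn main() {
    let (mut buf, mut keys) = (Vec::new(), Vec::new());
    for i in 0..40 {
        let start = buf.len();
        buf.extend_from_slice(format!("attr{}", i).as_bytes());
        keys.push(start..buf.len());
    }
    // Repeat the very first name after the switch to hashing.
    let start = buf.len();
    buf.extend_from_slice(b"attr0");
    let dup = start..buf.len();

    let mut state = DupCheck::new();
    for key in keys {
        assert_eq!(state.check(&buf, key), Ok(()));
    }
    assert_eq!(state.check(&buf, dup.clone()), Err((dup.start, 0)));
}
```

The `main` function exercises the case the seeding exists for: the first name is collected on the linear path and its duplicate arrives on the hash path.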

Benchmarks

Cost of the duplicate check for a single start tag of N distinct attributes (--release, standalone harness so the numbers isolate the check itself):

N (attrs / tag)	existing (linear, O(N²))	SipHash → default set	SipHash → NoHasher set	aHash	this PR (threshold)
64	3.5 µs	2.2 µs	1.4 µs	1.4 µs	1.7 µs
256	54 µs	8.7 µs	4.8 µs	4.8 µs	4.9 µs
1024	904 µs	34 µs	18 µs	18 µs	18 µs

The linear column is cleanly quadratic (~16× per 4× of N); by 8192 attributes it is into the tens-to-hundreds of milliseconds, while every hash variant stays ~linear (~0.15 ms).
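As a quick sanity check on the scaling claim, the growth factors can be computed directly from the table. The helper below is a hypothetical illustration using the "existing" and "this PR" columns at N = 64, 256 and 1024; a quadratic algorithm should grow about 16× per 4× of N, a linear one about 4×.

```rust
/// Growth factor between each pair of consecutive measurements.
fn growth_factors(times_us: &[f64]) -> Vec<f64> {
    times_us.windows(2).map(|w| w[1] / w[0]).collect()
}

fn main() {
    // "existing (linear, O(N²))" column at N = 64, 256, 1024, in microseconds
    let linear = [3.5, 54.0, 904.0];
    // "this PR (threshold)" column at the same N, in microseconds
    let threshold = [1.7, 4.9, 18.0];
    println!("linear:    {:?}", growth_factors(&linear));
    println!("threshold: {:?}", growth_factors(&threshold));
}
```

The linear-scan column grows by roughly 15× and 17× per step, while the thresholded version grows by roughly 3× and 4×.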

For the standard case I used the existing attributes micro-benchmark (players.xml, ≤ 22 attributes per element, so it never leaves the linear path):

benchmark	`master`	this PR
`attributes/with_checks = true`	32.8 µs	32.9 µs

That delta is within run-to-run noise — the with_checks(false) path, which this change never touches, drifts by ±1–3.5 % between runs on this machine, i.e. more than the true delta. So there's no measurable standard-case impact.

Takeaways:

- Standard case: at parity with master, because small tags run the identical linear scan. The earlier all-SipHash revision added a ~2–4× constant factor (plus a per-tag allocation) here, which is the regression you were worried about.
- NoHasher (@Mingun): done. The HashSet<u64> keys on an identity hasher now, since the values are already hashes. ~2× faster than the default HashSet<u64> on the hash path (e.g. 146 µs vs 271 µs at 8192 attrs).
- aHash vs SipHash: within ~2 % of each other once the set uses NoHasher (142 vs 146 µs at 8192), so aHash buys nothing for short attribute names; I kept SipHash and added no new dependency. With the threshold the hash never runs on normal documents anyway.
- Pathological tags: O(N) instead of O(N²).

Other review comments

- Move the test to a benchmark (@Mingun / @dralley): the timing assertion is gone from the unit tests; there is now benches/issue971.rs (criterion, exercised by cargo test --benches on CI). I kept one small correctness unit test (duplicate_past_hash_threshold) covering a duplicate that straddles the linear→hash switch, since that path wasn't otherwise exercised.
- Changelog (@Mingun): moved the [#969] link reference (and the [#970] one added by the other PR) to the end of the version section, after ### Misc Changes.
- cargo fmt + rebase (@dralley): applied and rebased onto current master.

Mingun

Thanks, this is wonderful!

…DoS)

quick-xml < 0.41.0: the default duplicate-attribute-name check in the `Attributes` iterator scanned all previously seen names for every attribute, so a start tag with N distinct names cost O(N^2) byte comparisons -- a remote, unauthenticated CPU-exhaustion DoS on untrusted XML. Fixed in 0.41.0 (tafia/quick-xml#971).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dralley reviewed Jun 23, 2026

Mingun requested changes Jun 23, 2026

qifan-sailboat force-pushed the fix/969-attributes-dup-check-quadratic branch from 703c47b to f49756b Compare June 28, 2026 00:50

Mingun approved these changes Jun 28, 2026

dralley approved these changes Jun 28, 2026

dralley merged commit 07f3db8 into tafia:master Jun 28, 2026
7 checks passed

qifan-sailboat mentioned this pull request Jul 1, 2026

Add advisory for quick-xml: quadratic attribute duplicate-check (CPU DoS) rustsec/advisory-db#3020

Merged

mikbry mentioned this pull request Jul 2, 2026

track upstream sctk-adwaita chain bumps to remove quick-xml (rustsec-2026-0194/0195) + ttf-parser (rustsec-2026-0192) advisory ignores mikbry/ui#132

Open

10 tasks
