Fix O(N²) duplicate-attribute check in Attributes iterator#971

Merged
dralley merged 1 commit into tafia:master from qifan-sailboat:fix/969-attributes-dup-check-quadratic on Jun 28, 2026

Conversation

@qifan-sailboat

Fixes #969.

IterState::check_for_duplicates did a linear Vec scan of all already-seen attribute name ranges for every new attribute, so a start tag with N distinct attribute names cost about N²/2 name comparisons, i.e. O(N²). On untrusted XML this is a CPU-exhaustion DoS: see #969 for measurements (N=80,000 ≈ 6.1 s release; N=800,000 ≈ 10 min) and the demonstrated downstream impact on NLnet Labs Routinator.
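
As a rough worked count (an illustration, not a figure taken from #969): if the i-th attribute is compared against all i names before it, a tag with N distinct names performs

$$
\sum_{i=0}^{N-1} i = \frac{N(N-1)}{2}
$$

name comparisons. For N = 80,000 that is 80,000 × 79,999 / 2 = 3,199,960,000, about 3.2 × 10⁹ comparisons for a single start tag.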

This adds a HashSet<u64> of DefaultHasher hashes of the key bytes as an O(1) pre-filter:

  • a fresh hash means the key cannot be a duplicate → push and return Ok (the no-duplicate path is now amortised O(1) per attribute / O(N) per start tag);
  • on a hash hit (a real duplicate, or an astronomically rare 64-bit collision) fall back to the existing linear scan to recover the exact previous position for AttrError::Duplicated(new, prev) — error semantics are unchanged.

The set is lazily allocated (Option<HashSet<u64>>) so IterState::new, Attributes::new and Attributes::html stay const fn.
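
The pre-filter described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in this PR: `DupChecker`, its fields and the `(new, prev)` tuple error are hypothetical stand-ins for `IterState` and `AttrError::Duplicated`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::Hasher;
use std::ops::Range;

/// Hypothetical stand-in for the iterator state; not quick-xml's real type.
struct DupChecker {
    /// Byte ranges of the attribute names seen so far in the current tag.
    keys: Vec<Range<usize>>,
    /// Hashes of those names. Lazily allocated so `new` can stay `const`.
    key_hashes: Option<HashSet<u64>>,
}

impl DupChecker {
    const fn new() -> Self {
        Self { keys: Vec::new(), key_hashes: None }
    }

    /// Returns `Err((new, prev))` with the start offsets of both names
    /// when `key` repeats an earlier attribute name.
    fn check(&mut self, slice: &[u8], key: Range<usize>) -> Result<(), (usize, usize)> {
        let mut hasher = DefaultHasher::new();
        hasher.write(&slice[key.clone()]);
        let hash = hasher.finish();

        // `insert` returns true when the hash was not in the set yet.
        let fresh = self.key_hashes.get_or_insert_with(HashSet::new).insert(hash);
        if !fresh {
            // Hash hit: a real duplicate or a collision. Only now run the
            // exact linear scan to find the previous position.
            if let Some(prev) = self
                .keys
                .iter()
                .find(|r| slice[(*r).clone()] == slice[key.clone()])
            {
                return Err((key.start, prev.start));
            }
        }
        self.keys.push(key);
        Ok(())
    }
}
```

A fresh hash skips the scan entirely, so only hash hits pay the linear cost, and the reported positions come from the same scan as before.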

| N (distinct attrs) | before | after |
|---|---|---|
| 10,000 | 75 ms | 0 ms |
| 20,000 | 303 ms | 1 ms |
| 40,000 | 1,251 ms | 2 ms |
| 80,000 | 6,109 ms | 4 ms |

A timing regression test is included in events::attributes::xml::duplicated::with_check.

@dralley

dralley commented Jun 22, 2026

Collaborator

And the distinct attrs column, I assume that represents N attributes within one element, as opposed to N attributes interspersed across many elements?

@qifan-sailboat

Author

Yes. It means N distinct attributes on a single start tag.
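
For concreteness, an input of that shape could be generated like this. It is a hypothetical helper, not the generator used for the measurements; the element name `e` and the `a0`, `a1`, ... attribute names are arbitrary choices.

```rust
/// Builds `<e a0="" a1="" ... />`: a single start tag carrying `n`
/// distinct attribute names, each with an empty value.
fn tag_with_distinct_attrs(n: usize) -> String {
    let mut xml = String::from("<e");
    for i in 0..n {
        xml.push_str(&format!(" a{}=\"\"", i));
    }
    xml.push_str("/>");
    xml
}
```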

@codecov-commenter

codecov-commenter commented Jun 23, 2026

Codecov Report

❌ Patch coverage is 61.22449% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.41%. Comparing base (e00ae5c) to head (f49756b).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| benches/issue971.rs | 0.00% | 35 Missing ⚠️ |
| src/events/attributes.rs | 95.23% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #971      +/-   ##
==========================================
+ Coverage   57.31%   57.41%   +0.10%     
==========================================
  Files          46       47       +1     
  Lines       18197    18340     +143     
==========================================
+ Hits        10429    10530     +101     
- Misses       7768     7810      +42     
| Flag | Coverage Δ |
|---|---|
| unittests | 57.41% <61.22%> (+0.10%) ⬆️ |

Comment thread src/events/attributes.rs Outdated
.find(|r| slice[(*r).clone()] == slice[key.clone()])
{
return Err(AttrError::Duplicated(key.start, prev.start));
let mut h = DefaultHasher::new();

Collaborator

SipHash (the Rust default) is fairly expensive computationally, though I suppose if the goal is DoS resistance it is the "safest" option.

Nonetheless, would you mind running some benchmarks (e.g. https://github.com/tafia/quick-xml/blob/master/benches/microbenches.rs#L177) to see how the existing implementation compares with this implementation using SipHash and using aHash on normal inputs? Or at least the first two.

I would just like an idea of what impact this will have for the standard case.

@dralley

dralley commented Jun 23, 2026

Collaborator

Please run cargo fmt

Comment thread src/events/attributes.rs Outdated
/// so a start tag with many distinct attribute names cost O(N²) byte
/// comparisons. With the hash pre-filter the same input is O(N).
#[test]
fn many_distinct_attributes_is_linear() {

Collaborator

I don't love the idea of having something like this in a unit test. I suppose if it's reliable enough, it could be OK. On the other hand maybe it's best to just have a criterion microbenchmark, those tend to get run for any important implementation changes.

@Mingun, thoughts?

Collaborator

Yes, I agree, this should be placed in the ./benches directory. I suggest putting it in ./benches/issue971.rs. Benchmarks run as tests on CI, so we will get the same results as for a unit test:

- name: Run tests + benchmarks
  run: cargo test --all-features --benches --tests

@Mingun Mingun left a comment

Collaborator

Please move the test into the benchmark directory, and could you look into addressing the other comments? You may force-push the changes, because this PR is small enough (and anyway, we prefer to have a clean history).

Comment thread src/events/attributes.rs Outdated
/// the duplicate check is amortised O(N) over the whole start tag instead of
/// O(N²). Lazily allocated on first use so that [`IterState::new`] can stay
/// `const`.
key_hashes: Option<HashSet<u64>>,

Collaborator

It seems we should use NoHasher here, because what we put into the set is already a hash.
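
A pass-through hasher of the kind suggested here could look roughly like this. It is a sketch under assumptions: `IdentityHasher` and `PrehashedSet` are illustrative names, not the types that ended up in quick-xml.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical pass-through hasher for keys that are already 64-bit hashes.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Fallback for keys that are not `u64`; unused in this sketch.
        for &b in bytes {
            self.0 = (self.0 << 8) | u64::from(b);
        }
    }

    fn write_u64(&mut self, n: u64) {
        // `u64`'s `Hash` impl calls this, so the key becomes the hash as-is.
        self.0 = n;
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A set of pre-computed hashes that does not hash its keys a second time.
type PrehashedSet = HashSet<u64, BuildHasherDefault<IdentityHasher>>;
```

`PrehashedSet::default()` then behaves like an ordinary `HashSet<u64>`, minus the second round of hashing on every insert and lookup.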

Comment thread Changelog.md Outdated
pre-filter so the no-duplicate path is amortised O(1) per attribute; the
exact `AttrError::Duplicated(new, prev)` positions are unchanged.

[#969]: https://github.com/tafia/quick-xml/issues/969

Collaborator

We put links at the end of the version section (here you put it at the end of the Bug Fixes section). Could you please move it down (keeping the 2 blank lines before the next ## section)?

Collaborator

@qifan-sailboat Please rebase your PR and fix both this note and the link that was added in the other PR.

@dralley

dralley commented Jun 27, 2026

Collaborator

@qifan-sailboat Do you plan to pick this back up?

`IterState::check_for_duplicates` did a linear scan of every previously
seen attribute name for each new attribute, so a start tag with N
distinct names cost O(N²) byte comparisons -- a CPU-exhaustion vector on
untrusted XML (tafia#969).

Small tags keep the linear scan: for the handful of attributes a real
start tag carries it is faster than hashing and needs no allocation (the
busiest element in `players.xml` has 22). Once a tag declares more than
`SMALL_ATTRIBUTE_COUNT` (32) attributes it switches to a 64-bit hash
pre-filter, making the whole tag O(N). The set is seeded from the names
already collected, so a duplicate that spans the switch is still caught,
and on a hit it falls back to the linear scan to report the exact
previous position -- `AttrError::Duplicated` is unchanged.

The pre-filter stores SipHash name hashes in a `HashSet` keyed by an
identity hasher, since the values are already hashes (no re-hashing).

Exercised by a new `benches/issue971.rs` and a unit test covering a
duplicate past the hash threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@qifan-sailboat qifan-sailboat force-pushed the fix/969-attributes-dup-check-quadratic branch from 703c47b to f49756b Compare June 28, 2026 00:50
@qifan-sailboat

Author

Thanks @dralley and @Mingun for the reviews. Rebased onto master (so this now sits on top of #970) and force-pushed a single clean commit. Summary of the changes plus the benchmarks @dralley asked for.

Design: keep the linear scan for small tags, hash only for large ones

@dralley's point about SipHash being expensive is the right one, so I leaned into it: the O(N²) only bites for tags with many attributes, and real start tags carry a handful — where the linear scan is faster than any hashing and needs no allocation. So this revision keeps the existing linear scan for tags with up to SMALL_ATTRIBUTE_COUNT (= 32) attributes and only switches to a hash pre-filter above that. On the switch it seeds the set from the names already collected, and on a hit it still falls back to the linear scan to recover the exact previous position, so AttrError::Duplicated(new, prev) is unchanged.
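
The switch can be sketched roughly like this. It is a simplified illustration, not the merged code: `DupChecker`, `hash_name` and the `(new, prev)` tuple error are hypothetical, and the sketch uses a plain `HashSet<u64>` rather than the identity-hasher set.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::Hasher;
use std::ops::Range;

/// Tags with at most this many attributes never leave the linear path.
const SMALL_ATTRIBUTE_COUNT: usize = 32;

/// Hypothetical helper: 64-bit hash of one attribute name.
fn hash_name(name: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(name);
    h.finish()
}

/// Hypothetical stand-in for the iterator state.
struct DupChecker {
    keys: Vec<Range<usize>>,
    key_hashes: Option<HashSet<u64>>,
}

impl DupChecker {
    const fn new() -> Self {
        Self { keys: Vec::new(), key_hashes: None }
    }

    fn check(&mut self, slice: &[u8], key: Range<usize>) -> Result<(), (usize, usize)> {
        let needs_scan = if self.keys.len() < SMALL_ATTRIBUTE_COUNT {
            // Small tag: always scan, no hashing and no allocation.
            true
        } else {
            let keys = &self.keys;
            // On the first large-tag check, seed the set from the names
            // collected so far so a duplicate spanning the switch is caught.
            let set = self.key_hashes.get_or_insert_with(|| {
                keys.iter().map(|r| hash_name(&slice[r.clone()])).collect()
            });
            // Scan only on a hash hit (real duplicate or collision).
            !set.insert(hash_name(&slice[key.clone()]))
        };
        if needs_scan {
            if let Some(prev) = self
                .keys
                .iter()
                .find(|r| slice[(*r).clone()] == slice[key.clone()])
            {
                return Err((key.start, prev.start));
            }
        }
        self.keys.push(key);
        Ok(())
    }
}
```

Because the error position always comes from the linear scan, both paths report the same `(new, prev)` pair.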

Benchmarks

Cost of the duplicate check for a single start tag of N distinct attributes (--release, standalone harness so the numbers isolate the check itself):

| N (attrs / tag) | existing (linear, O(N²)) | SipHash → default set | SipHash → NoHasher set | aHash | this PR (threshold) |
|---|---|---|---|---|---|
| 64 | 3.5 µs | 2.2 µs | 1.4 µs | 1.4 µs | 1.7 µs |
| 256 | 54 µs | 8.7 µs | 4.8 µs | 4.8 µs | 4.9 µs |
| 1024 | 904 µs | 34 µs | 18 µs | 18 µs | 18 µs |

The linear column is cleanly quadratic (~16× per 4× of N); by 8192 attributes it is into the tens-to-hundreds of milliseconds, while every hash variant stays ~linear (~0.15 ms).
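
The ~16× step can be cross-checked without timing anything by counting comparisons. This is a standalone sketch, not the harness used for the table above; `distinct_names` and `linear_scan_comparisons` are hypothetical helpers.

```rust
/// Builds `n` distinct attribute names: "a0", "a1", ...
fn distinct_names(n: usize) -> Vec<String> {
    (0..n).map(|i| format!("a{}", i)).collect()
}

/// Counts the name comparisons made when every new name is checked
/// against all names seen before it (the pre-fix strategy).
fn linear_scan_comparisons(names: &[String]) -> u64 {
    let mut seen: Vec<&str> = Vec::with_capacity(names.len());
    let mut comparisons = 0u64;
    for name in names {
        for prev in &seen {
            comparisons += 1;
            if *prev == name.as_str() {
                break; // duplicate found; never taken for distinct names
            }
        }
        seen.push(name.as_str());
    }
    comparisons
}
```

For 64, 256 and 1024 distinct names this counts 2,016, 32,640 and 523,776 comparisons, a factor of about 16 per step, in line with the measured 3.5 µs, 54 µs and 904 µs.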

For the standard case I used the existing attributes micro-benchmark (players.xml, ≤ 22 attributes per element, so it never leaves the linear path):

| benchmark | master | this PR |
|---|---|---|
| attributes/with_checks = true | 32.8 µs | 32.9 µs |

That delta is within run-to-run noise — the with_checks(false) path, which this change never touches, drifts by ±1–3.5 % between runs on this machine, i.e. more than the true delta. So there's no measurable standard-case impact.

Takeaways:

  • Standard case: at parity with master, because small tags run the identical linear scan. The earlier all-SipHash revision added a ~2–4× constant factor (plus a per-tag allocation) here — the regression you were worried about.
  • NoHasher (@Mingun): done. The HashSet<u64> keys on an identity hasher now, since the values are already hashes. ~2× faster than the default HashSet<u64> on the hash path (e.g. 146 µs vs 271 µs at 8192 attrs).
  • aHash vs SipHash: within ~3 % of each other once the set uses NoHasher (142 vs 146 µs at 8192), so aHash buys nothing for short attribute names; I kept SipHash and added no new dependency. With the threshold the hash never runs on normal documents anyway.
  • Pathological tags: O(N) instead of O(N²).

Other review comments

  • Move the test to a benchmark (@Mingun / @dralley): the timing assertion is gone from the unit tests; there is now benches/issue971.rs (criterion, exercised by cargo test --benches on CI). I kept one small correctness unit test (duplicate_past_hash_threshold) covering a duplicate that straddles the linear→hash switch, since that path wasn't otherwise exercised.
  • Changelog (@Mingun): moved the [#969] link reference (and the [#970] one added by the other PR) to the end of the version section, after ### Misc Changes.
  • cargo fmt + rebase (@dralley): applied and rebased onto current master.

@Mingun Mingun left a comment

Collaborator

Thanks, this is wonderful!

@dralley dralley merged commit 07f3db8 into tafia:master Jun 28, 2026
7 checks passed
djc pushed a commit to rustsec/advisory-db that referenced this pull request Jul 2, 2026
…DoS)

quick-xml < 0.41.0: the default duplicate-attribute-name check in the `Attributes` iterator scanned all previously seen names for every attribute, so a start tag with N distinct names cost O(N^2) byte comparisons -- a remote, unauthenticated CPU-exhaustion DoS on untrusted XML. Fixed in 0.41.0 (tafia/quick-xml#971).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

IterState::check_for_duplicates — O(N²) attribute-key linear scan → CPU DoS on untrusted XML