Fix O(N²) duplicate-attribute check in Attributes iterator#971
Conversation
|
And the distinct attrs column, I assume that represents N attributes within one element, as opposed to N attributes interspersed across many elements? |
|
Yes. It means N distinct attributes on a single start tag. |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #971 +/- ##
==========================================
+ Coverage 57.31% 57.41% +0.10%
==========================================
Files 46 47 +1
Lines 18197 18340 +143
==========================================
+ Hits 10429 10530 +101
- Misses 7768 7810 +42
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| .find(|r| slice[(*r).clone()] == slice[key.clone()]) | ||
| { | ||
| return Err(AttrError::Duplicated(key.start, prev.start)); | ||
| let mut h = DefaultHasher::new(); |
There was a problem hiding this comment.
SipHash (the Rust default) is fairly expensive in computational terms, though I suppose if the goal is DoS-resistance is the most "safe" option.
Nonetheless would you mind running some benchmarks (e.g. https://github.com/tafia/quick-xml/blob/master/benches/microbenches.rs#L177) to compare how the existing implementation compares to this implementation w/ SipHash and aHash for normal inputs? Or at least the former two.
I would just like an idea of what impact this will have for the standard case.
|
Please run |
| /// so a start tag with many distinct attribute names cost O(N²) byte | ||
| /// comparisons. With the hash pre-filter the same input is O(N). | ||
| #[test] | ||
| fn many_distinct_attributes_is_linear() { |
There was a problem hiding this comment.
I don't love the idea of having something like this in a unit test. I suppose if it's reliable enough, it could be OK. On the other hand maybe it's best to just have a criterion microbenchmark, those tend to get run for any important implementation changes.
@Mingun , thoughts
There was a problem hiding this comment.
Yes, I agree, this should be placed in ./benches directory. I suggest to place it into ./benches/issue971.rs. Benchmarks runs as tests on CI, so we will get the same results as for unit test:
quick-xml/.github/workflows/rust.yml
Lines 64 to 65 in 9aaea92
Mingun
left a comment
There was a problem hiding this comment.
Please move the test into the benchmark directory and could you investigate the possibility to address other comments. You may force-push the changes, because that PR is small enough (and anyway, we prefer to have clean history).
| /// so a start tag with many distinct attribute names cost O(N²) byte | ||
| /// comparisons. With the hash pre-filter the same input is O(N). | ||
| #[test] | ||
| fn many_distinct_attributes_is_linear() { |
There was a problem hiding this comment.
Yes, I agree, this should be placed in ./benches directory. I suggest to place it into ./benches/issue971.rs. Benchmarks runs as tests on CI, so we will get the same results as for unit test:
quick-xml/.github/workflows/rust.yml
Lines 64 to 65 in 9aaea92
| /// the duplicate check is amortised O(N) over the whole start tag instead of | ||
| /// O(N²). Lazily allocated on first use so that [`IterState::new`] can stay | ||
| /// `const`. | ||
| key_hashes: Option<HashSet<u64>>, |
There was a problem hiding this comment.
It seems, we should use NoHasher here because we already put the hash to the set.
| pre-filter so the no-duplicate path is amortised O(1) per attribute; the | ||
| exact `AttrError::Duplicated(new, prev)` positions are unchanged. | ||
|
|
||
| [#969]: https://github.com/tafia/quick-xml/issues/969 |
There was a problem hiding this comment.
We put links at the end of version section (here you put it at the end of Bug Fixes section). Could you please move it below (keep the 2 blank lines before the next ## section)
There was a problem hiding this comment.
@qifan-sailboat Please rebase your PR and fix both this note and the link that was added in the other PR
|
@qifan-sailboat Do you plan to pick this back up? |
`IterState::check_for_duplicates` did a linear scan of every previously seen attribute name for each new attribute, so a start tag with N distinct names cost O(N²) byte comparisons -- a CPU-exhaustion vector on untrusted XML (tafia#969). Small tags keep the linear scan: for the handful of attributes a real start tag carries it is faster than hashing and needs no allocation (the busiest element in `players.xml` has 22). Once a tag declares more than `SMALL_ATTRIBUTE_COUNT` (32) attributes it switches to a 64-bit hash pre-filter, making the whole tag O(N). The set is seeded from the names already collected, so a duplicate that spans the switch is still caught, and on a hit it falls back to the linear scan to report the exact previous position -- `AttrError::Duplicated` is unchanged. The pre-filter stores SipHash name hashes in a `HashSet` keyed by an identity hasher, since the values are already hashes (no re-hashing). Exercised by a new `benches/issue971.rs` and a unit test covering a duplicate past the hash threshold. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
703c47b to
f49756b
Compare
|
Thanks @dralley and @Mingun for the reviews. Rebased onto Design: keep the linear scan for small tags, hash only for large ones@dralley's point about SipHash being expensive is the right one, so I leaned into it: the O(N²) only bites for tags with many attributes, and real start tags carry a handful — where the linear scan is faster than any hashing and needs no allocation. So this revision keeps the existing linear scan for tags with up to BenchmarksCost of the duplicate check for a single start tag of N distinct attributes (
The linear column is cleanly quadratic (~16× per 4× of N); by 8192 attributes it is into the tens-to-hundreds of milliseconds, while every hash variant stays ~linear (~0.15 ms). For the standard case I used the existing
That delta is within run-to-run noise — the Takeaways:
Other review comments
|
Mingun
left a comment
There was a problem hiding this comment.
Thanks, this is wonderful!
…DoS) quick-xml < 0.41.0: the default duplicate-attribute-name check in the `Attributes` iterator scanned all previously seen names for every attribute, so a start tag with N distinct names cost O(N^2) byte comparisons -- a remote, unauthenticated CPU-exhaustion DoS on untrusted XML. Fixed in 0.41.0 (tafia/quick-xml#971). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #969.
IterState::check_for_duplicatesdid a linearVecscan of all already-seen attribute name ranges for every new attribute, so a start tag with N distinct attribute names cost O(N²/2) byte comparisons. On untrusted XML this is a CPU-exhaustion DoS — see #969 for measurements (N=80,000 ≈ 6.1 s release; N=800,000 ≈ 10 min) and the demonstrated downstream impact on NLnet Labs Routinator.This adds a
HashSet<u64>ofDefaultHasherhashes of the key bytes as an O(1) pre-filter:Ok(the no-duplicate path is now amortised O(1) per attribute / O(N) per start tag);AttrError::Duplicated(new, prev)— error semantics are unchanged.The set is lazily allocated (
Option<HashSet<u64>>) soIterState::new,Attributes::newandAttributes::htmlstayconst fn.A timing regression test is included in
events::attributes::xml::duplicated::with_check.