Streamline built-in speech dictionary rules #20433
Pull request overview
This PR updates NVDA’s built-in speech dictionary rules to use zero-width regex boundaries with literal space insertion, aiming to avoid pathological regex performance on long word-character runs while preserving the existing spoken output behavior.
Changes:
- Replaced three built-in speech dictionary regex rules with lookaround-based boundary matches that insert a space.
- Added a changelog entry noting improved speech responsiveness for long mixed-case / digit-heavy text.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| user_docs/en/changes.md | Documents the user-facing responsiveness improvement. |
| source/builtin.dic | Updates built-in speech dictionary regex rules to a more performant boundary-based approach. |
Can you provide the same figures for the modern regex engine? Even if it is opt-in for now, it may become the default in the future (e.g. 2027.1). More specifically, we have to confirm that the use case which is worse than master (the number-suffix dense benchmark) does not degrade performance too much. If the degradation is minor, this branch remains interesting given that it fixes a very bad case (a single long word with no digits).
Link to issue number:
Supersedes #19518.
Summary of the issue:
The built-in speech dictionary includes regular expressions that split text such as `NVDAObject`, `HTMLParser`, and `issue123` for speech. The previous digit-suffix expression could perform very poorly on long word-character runs that do not contain trailing digits. This can hurt speech responsiveness for pathological text, long mixed-case text, and text containing many digit boundaries.
Description of how this pull request fixes the issue:
Replaced the three built-in dictionary regexes with zero-width boundary matches and literal space insertion. This avoids capture groups and, most importantly, replaces the old digit-suffix regex with a direct word-to-digit boundary check.
The lookarounds are ordered so the forward assertion is checked first. Most positions fail based on the next character, so this avoids unnecessary previous-character checks and improves performance further.
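As a rough illustration of the approach described above, the following Python sketch applies three zero-width boundary rules with the default `re` engine and inserts a literal space at each match. The patterns, the ASCII-only character classes, and the `split_for_speech` helper are assumptions made for this example; they are not copied from `source/builtin.dic`, whose actual rules may differ.

```python
import re

# Hypothetical boundary rules for illustration only; the real entries in
# source/builtin.dic may use different patterns or flags.
# Each pattern matches a zero-width position. The lookahead comes first so
# that most positions are rejected by looking at the next character alone.
BOUNDARY_RULES = [
    # lowercase letter followed by an uppercase letter: speech|Dict
    re.compile(r"(?=[A-Z])(?<=[a-z])"),
    # uppercase run followed by a capitalised word: NVDA|Object
    re.compile(r"(?=[A-Z][a-z])(?<=[A-Z])"),
    # non-digit word character followed by a digit: issue|123
    re.compile(r"(?=\d)(?<=[^\W\d])"),
]


def split_for_speech(text: str) -> str:
    """Insert a space at every boundary matched by the rules above."""
    for rule in BOUNDARY_RULES:
        # The match is empty, so the replacement is a pure insertion and no
        # capture groups or backreferences are needed.
        text = rule.sub(" ", text)
    return text


if __name__ == "__main__":
    for sample in ("NVDAObject", "HTMLParser", "issue123"):
        print(sample, "->", split_for_speech(sample))
```

Under these assumed patterns, the three examples from the summary come out as `NVDA Object`, `HTML Parser`, and `issue 123`. A long run of word characters with no digits never satisfies the `(?=\d)` lookahead, so each position is rejected after a single character test.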
Testing strategy:
Verified equivalent output between `master` and this branch on:
Tested with this benchmarking script (LLM generated).
Default NVDA regex engine (`re`):
Also tested with the optional modern regular expression engine. The same main improved scenarios held.
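An output-equivalence check of the kind mentioned in the testing strategy could be sketched as follows. This is not the benchmarking script referenced above: the rule pairs, sample strings, and the `find_mismatches` helper are hypothetical, and the capture-group patterns are only assumed stand-ins for the older style of rule.

```python
import re

# Hypothetical rule pairs: (capture-group pattern, its replacement,
# lookaround pattern). None of these are copied from builtin.dic.
RULE_PAIRS = [
    (re.compile(r"([a-z])([A-Z])"), r"\1 \2",
     re.compile(r"(?=[A-Z])(?<=[a-z])")),
    (re.compile(r"([A-Z])([A-Z][a-z])"), r"\1 \2",
     re.compile(r"(?=[A-Z][a-z])(?<=[A-Z])")),
]

# Assumed sample inputs, including long runs and dense boundaries.
SAMPLES = [
    "NVDAObject",
    "HTMLParser",
    "speechDictHandler",
    "plain text",
    "x" * 5000,
    "aB" * 2000,
]


def find_mismatches(samples=SAMPLES):
    """Return (pattern, text) pairs where the two rule styles disagree."""
    mismatches = []
    for old, replacement, new in RULE_PAIRS:
        for text in samples:
            # Old style rebuilds the text from groups; new style inserts a
            # space at a zero-width match.
            if old.sub(replacement, text) != new.sub(" ", text):
                mismatches.append((old.pattern, text))
    return mismatches


if __name__ == "__main__":
    print(find_mismatches())
```

An empty list means the two styles agreed on every sample. A fuller check would also cover the digit-suffix rule and the optional `regex` engine.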
Known issues with pull request:
With the default NVDA regex engine, all benchmark scenarios are now faster than `master`.
With the optional modern `regex` engine, the number-suffix dense benchmark remains slower than `master`. That engine is opt-in, and the main intended improvement is for no-match and mixed-case text, especially long word-character runs without digits, where the old expression was pathological.
Code Review Checklist: