Streamline built-in speech dictionary rules #20433

Open
codeofdusk wants to merge 2 commits into nvaccess:master from codeofdusk:better-builtindict-regex
Conversation

@codeofdusk

Contributor

Link to issue number:

Supersedes #19518.

Summary of the issue:

The built-in speech dictionary includes regular expressions that split text such as NVDAObject, HTMLParser, and issue123 for speech. The previous digit-suffix expression could perform very poorly on long word-character runs that do not contain trailing digits. This can hurt speech responsiveness for pathological text, long mixed-case text, and text containing many digit boundaries.

Description of how this pull request fixes the issue:

Replaced the three built-in dictionary regexes with zero-width boundary matches and literal space insertion. This avoids capture groups and, most importantly, replaces the old digit-suffix regex with a direct word-to-digit boundary check.
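
As a rough illustration of the approach, here is a minimal Python sketch of boundary-based rules. The three patterns are hypothetical stand-ins written for this example; the actual expressions in source/builtin.dic may differ.

```python
import re

# Hypothetical stand-ins for the three built-in rules; the real
# patterns in source/builtin.dic may differ.
BOUNDARY_RULES = [
    # lowercase letter followed by an uppercase letter: fooBar -> foo Bar
    re.compile(r"(?=[A-Z])(?<=[a-z])"),
    # end of an acronym followed by a capitalised word: HTMLParser -> HTML Parser
    re.compile(r"(?=[A-Z][a-z])(?<=[A-Z])"),
    # non-digit word character followed by a digit: issue123 -> issue 123
    re.compile(r"(?=\d)(?<=[^\W\d])"),
]


def split_for_speech(text: str) -> str:
    """Insert a literal space at every zero-width boundary match."""
    for rule in BOUNDARY_RULES:
        text = rule.sub(" ", text)
    return text


for sample in ("NVDAObject", "HTMLParser", "issue123"):
    print(sample, "->", split_for_speech(sample))
```

Each pattern matches a position rather than characters, so the replacement is a plain space and no capture groups or backreferences are needed.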

The lookarounds are ordered so the forward assertion is checked first. Most positions fail based on the next character, so this avoids unnecessary previous-character checks and improves performance further.
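
Both assertions are zero-width and test the same position, so their order should affect only how much work is done per position, not which positions match. A small sketch of that property, using a hypothetical word-to-digit pattern rather than the one in this PR:

```python
import re

# Hypothetical word-to-digit boundary, written in both orders.
LOOKAHEAD_FIRST = re.compile(r"(?=\d)(?<=[^\W\d])")
LOOKBEHIND_FIRST = re.compile(r"(?<=[^\W\d])(?=\d)")

text = "issue123 build42 plain_words_only x9y8"

# Both orders match exactly the same positions ...
positions_a = [m.start() for m in LOOKAHEAD_FIRST.finditer(text)]
positions_b = [m.start() for m in LOOKBEHIND_FIRST.finditer(text)]
assert positions_a == positions_b

# ... so the substituted output is identical as well.
print(LOOKAHEAD_FIRST.sub(" ", text))
```

In text that is mostly letters, the lookahead for a digit fails at most positions, so with the lookahead first the lookbehind is rarely evaluated.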

Testing strategy:

Verified equivalent output between master and this branch on:

  • The benchmark scenarios below.
  • 10,000 generated ASCII samples containing letters, digits, underscores, spaces, and punctuation.
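
An equivalence check of this kind could look roughly like the sketch below. It is not the script used for this PR: the two rules are hypothetical (a capture-group rule and a boundary-based counterpart), and the generator, seed, and sample length are assumptions.

```python
import random
import re
import string

# Hypothetical capture-group rule and its boundary-based counterpart;
# neither is copied from NVDA's builtin.dic.
CAPTURE_RULE = (re.compile(r"([a-z])([A-Z])"), r"\1 \2")
BOUNDARY_RULE = (re.compile(r"(?=[A-Z])(?<=[a-z])"), " ")

# Letters, digits, underscores, spaces, and punctuation.
ALPHABET = string.ascii_letters + string.digits + "_ " + string.punctuation


def generate_samples(count: int, seed: int = 0, max_len: int = 40) -> list[str]:
    """Build reproducible random ASCII samples of varying length."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_len)))
        for _ in range(count)
    ]


def find_mismatches(samples: list[str]) -> list[tuple[str, str, str]]:
    """Return (input, old output, new output) for every sample that differs."""
    mismatches = []
    for sample in samples:
        old = CAPTURE_RULE[0].sub(CAPTURE_RULE[1], sample)
        new = BOUNDARY_RULE[0].sub(BOUNDARY_RULE[1], sample)
        if old != new:
            mismatches.append((sample, old, new))
    return mismatches


mismatches = find_mismatches(generate_samples(10_000))
print(f"{len(mismatches)} mismatches")
```

Seeding the generator keeps the sample set reproducible, so any mismatch can be re-run and inspected.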

Tested with this benchmarking script (LLM generated).

Default NVDA regex engine (re):

| Scenario | master median | branch median | Difference |
| --- | --- | --- | --- |
| short mixed phrase | 0.007265 ms | 0.002236 ms | 3.25 times faster |
| camel/acronym dense | 1.291176 ms | 0.296330 ms | 4.36 times faster |
| number suffix dense | 0.369622 ms | 0.335921 ms | 1.10 times faster |
| mixed identifier log | 1.772693 ms | 0.430847 ms | 4.11 times faster |
| plain lowercase words | 3.372053 ms | 0.292519 ms | 11.53 times faster |
| single long word no digits | 240.882400 ms | 0.088500 ms | 2721.84 times faster |
| long identifier no digits | 1.689099 ms | 0.141683 ms | 11.92 times faster |
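
To see why a long run without digits can be so costly, consider the sketch below. It is illustrative only: the backtracking pattern is a hypothetical example of a capture-group digit-suffix rule, not the expression previously in builtin.dic, and the input length and repeat count are arbitrary.

```python
import re
import statistics
import timeit

# Illustrative only: a capture-group pattern prone to backtracking and a
# boundary-based equivalent. Neither is taken from NVDA's builtin.dic.
BACKTRACKING = re.compile(r"([^\W\d]+)(\d+)")
BOUNDARY = re.compile(r"(?=\d)(?<=[^\W\d])")


def median_ms(func, repeat: int = 5) -> float:
    """Median wall-clock time of one call, in milliseconds."""
    timings = timeit.repeat(func, number=1, repeat=repeat)
    return statistics.median(timings) * 1000


long_word = "a" * 2000  # one long word-character run with no digits

slow = median_ms(lambda: BACKTRACKING.sub(r"\1 \2", long_word))
fast = median_ms(lambda: BOUNDARY.sub(" ", long_word))
print(f"capture groups: {slow:.3f} ms, boundary: {fast:.3f} ms")
```

With the capture-group pattern, the greedy run consumes the whole word from each start position and then backtracks one character at a time looking for a digit, so the work can grow roughly quadratically with the length of the run. The boundary pattern does a constant amount of work per position.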

Also tested with the optional modern regular expression engine. The main improvements held in the same scenarios.

Known issues with pull request:

With the default NVDA regex engine, all benchmark scenarios are now faster than master.

With the optional modern regex engine, the number-suffix dense benchmark remains slower than master. That engine is opt-in, and the main intended improvement is for no-match and mixed-case text, especially long word-character runs without digits, where the old expression was pathological.

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

Copilot AI review requested due to automatic review settings July 1, 2026 04:30
@codeofdusk codeofdusk requested a review from a team as a code owner July 1, 2026 04:30
@codeofdusk codeofdusk requested a review from SaschaCowley July 1, 2026 04:30

Copilot AI left a comment

Contributor

Pull request overview

This PR updates NVDA’s built-in speech dictionary rules to use zero-width regex boundaries with literal space insertion, aiming to avoid pathological regex performance on long word-character runs while preserving the existing spoken output behavior.

Changes:

  • Replaced three built-in speech dictionary regex rules with lookaround-based boundary matches that insert a space.
  • Added a changelog entry noting improved speech responsiveness for long mixed-case / digit-heavy text.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| user_docs/en/changes.md | Documents the user-facing responsiveness improvement. |
| source/builtin.dic | Updates built-in speech dictionary regex rules to a more performant boundary-based approach. |

Comment thread source/builtin.dic Outdated
Comment thread source/builtin.dic Outdated
@codeofdusk codeofdusk force-pushed the better-builtindict-regex branch from 739be67 to 508a4a4 Compare July 1, 2026 04:37
@codeofdusk codeofdusk force-pushed the better-builtindict-regex branch from 508a4a4 to 45201d2 Compare July 1, 2026 04:38
Comment thread user_docs/en/changes.md Outdated
@CyrilleB79

Contributor

Can you provide the same figures for the modern regex engine? Even if it is opt-in for now, it may become the default in the future (e.g. 2027.1). More specifically, we have to confirm that the one case that is worse than master (the number-suffix dense benchmark) does not degrade performance too much. If the degradation is minor, this branch remains worthwhile, given that it fixes a very bad case (single long word no digits).
It would also allow us to check whether the single long word no digits case was also bad with the modern engine, and whether this PR improves it.
Thanks.

3 participants