Streamline built-in speech dictionary rules by codeofdusk · Pull Request #20433 · nvaccess/nvda

codeofdusk · 2026-07-01T04:30:04Z

Link to issue number:

Supersedes #19518.

Summary of the issue:

The built-in speech dictionary includes regular expressions that split text such as NVDAObject, HTMLParser, and issue123 for speech. The previous digit-suffix expression could perform very poorly on long word-character runs that do not contain trailing digits. This can hurt speech responsiveness for pathological text, long mixed-case text, and text containing many digit boundaries.

Description of how this pull request fixes the issue:

Replaced the three built-in dictionary regexes with zero-width boundary matches and literal space insertion. This avoids capture groups and, most importantly, replaces the old digit-suffix regex with a direct word-to-digit boundary check.

The lookarounds are ordered so the forward assertion is checked first. Most positions fail based on the next character, so this avoids unnecessary previous-character checks and improves performance further.
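
As a rough illustration of the approach described above, the sketch below uses zero-width lookarounds with the lookahead written first and inserts a literal space at each match. The three patterns and the name `insert_boundary_spaces` are assumptions made for this example; they are not the expressions from `source/builtin.dic`, which are not quoted in this description.

```python
import re

# Hypothetical boundary rules, written for illustration only; the actual
# expressions in source/builtin.dic are not quoted in this description.
# Each pattern is zero-width: the lookahead (next character) is written
# first, followed by the lookbehind (previous character).
BOUNDARY_RULES = [
    re.compile(r"(?=[A-Z])(?<=[a-z])"),       # lower -> Upper, e.g. camel|Case
    re.compile(r"(?=[A-Z][a-z])(?<=[A-Z])"),  # ACRONYM -> Word, e.g. HTML|Parser
    re.compile(r"(?=[0-9])(?<=[A-Za-z])"),    # letter -> digit, e.g. issue|123
]


def insert_boundary_spaces(text: str) -> str:
    """Insert a literal space at every position matched by a boundary rule."""
    for rule in BOUNDARY_RULES:
        # The match is empty, so sub() only inserts; no capture groups needed.
        text = rule.sub(" ", text)
    return text
```

With these assumed patterns, `insert_boundary_spaces("HTMLParser")` returns `"HTML Parser"` and `insert_boundary_spaces("issue123")` returns `"issue 123"`. Because each match is empty, the replacement is a plain string rather than a template with backreferences.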

Testing strategy:

Verified equivalent output between master and this branch on:

- The benchmark scenarios below.
- 10,000 generated ASCII samples containing letters, digits, underscores, spaces, and punctuation.
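
A randomized equivalence check of this kind could be sketched as follows. Both rule sets here are hypothetical stand-ins (a capture-group style and a zero-width boundary style), not the rules on master or on this branch, and `find_mismatches` is a name invented for the example.

```python
import random
import re
import string

# Hypothetical rule sets for illustration; neither list is copied from NVDA.
CAPTURE_RULES = [
    (re.compile(r"([a-z])([A-Z])"), r"\1 \2"),
    (re.compile(r"([A-Z])([A-Z][a-z])"), r"\1 \2"),
    (re.compile(r"([A-Za-z])([0-9])"), r"\1 \2"),
]
BOUNDARY_RULES = [
    (re.compile(r"(?=[A-Z])(?<=[a-z])"), " "),
    (re.compile(r"(?=[A-Z][a-z])(?<=[A-Z])"), " "),
    (re.compile(r"(?=[0-9])(?<=[A-Za-z])"), " "),
]

# Letters, digits, underscore, space, and punctuation.
ALPHABET = string.ascii_letters + string.digits + "_ " + string.punctuation


def apply_rules(rules, text):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def find_mismatches(sample_count=10_000, max_length=40, seed=0):
    """Return every generated sample on which the two rule sets disagree."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    mismatches = []
    for _ in range(sample_count):
        length = rng.randint(0, max_length)
        sample = "".join(rng.choice(ALPHABET) for _ in range(length))
        if apply_rules(CAPTURE_RULES, sample) != apply_rules(BOUNDARY_RULES, sample):
            mismatches.append(sample)
    return mismatches
```

An empty list from `find_mismatches()` means the two assumed rule sets produced identical output on every generated sample; any returned string is a counterexample worth inspecting by hand.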

Tested with this benchmarking script (LLM generated).

Default NVDA regex engine (re):

| Scenario | master median | branch median | Difference |
|---|---|---|---|
| short mixed phrase | 0.007265 ms | 0.002236 ms | 3.25 times faster |
| camel/acronym dense | 1.291176 ms | 0.296330 ms | 4.36 times faster |
| number suffix dense | 0.369622 ms | 0.335921 ms | 1.10 times faster |
| mixed identifier log | 1.772693 ms | 0.430847 ms | 4.11 times faster |
| plain lowercase words | 3.372053 ms | 0.292519 ms | 11.53 times faster |
| single long word no digits | 240.882400 ms | 0.088500 ms | 2721.84 times faster |
| long identifier no digits | 1.689099 ms | 0.141683 ms | 11.92 times faster |
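
For readers who want to reproduce median timings of this shape, a minimal sketch using only the standard library is shown below. It is not the linked benchmarking script: the pattern, the scenario inputs, and the names `median_ms` and `run_benchmark` are all assumptions made for illustration, and the numbers it produces will not match the table.

```python
import re
import statistics
import timeit

# Hypothetical zero-width rule, used only as something to time.
WORD_TO_DIGIT = re.compile(r"(?=[0-9])(?<=[A-Za-z])")

# Assumed inputs loosely named after two scenarios in the table; the real
# script's inputs are not shown in this description.
SCENARIOS = {
    "single long word no digits": "a" * 20_000,
    "number suffix dense": "issue123 " * 500,
}


def median_ms(func, repeats=15, number=5):
    """Return the median duration of one call to func, in milliseconds."""
    totals = timeit.repeat(func, repeat=repeats, number=number)
    return statistics.median(totals) / number * 1000.0


def run_benchmark():
    """Time the rule on every scenario and return {name: median ms}."""
    return {
        name: median_ms(lambda text=text: WORD_TO_DIGIT.sub(" ", text))
        for name, text in SCENARIOS.items()
    }
```

Reporting the median rather than the mean reduces the influence of occasional slow runs caused by unrelated system activity.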

Also tested with the optional modern regular expression engine. The main improvements held in the same scenarios.

Known issues with pull request:

With the default NVDA regex engine, all benchmark scenarios are now faster than master.

With the optional modern regex engine, the number-suffix dense benchmark remains slower than master. That engine is opt-in, and the main intended improvement is for no-match and mixed-case text, especially long word-character runs without digits, where the old expression was pathological.

Code Review Checklist:

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

Copilot

Pull request overview

This PR updates NVDA’s built-in speech dictionary rules to use zero-width regex boundaries with literal space insertion, aiming to avoid pathological regex performance on long word-character runs while preserving the existing spoken output behavior.

Changes:

Replaced three built-in speech dictionary regex rules with lookaround-based boundary matches that insert a space.
Added a changelog entry noting improved speech responsiveness for long mixed-case / digit-heavy text.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `user_docs/en/changes.md` | Documents the user-facing responsiveness improvement. |
| `source/builtin.dic` | Updates built-in speech dictionary regex rules to a more performant boundary-based approach. |

CyrilleB79 · 2026-07-01T06:31:31Z

Can you provide the same figures for the modern regex engine? Even though it is opt-in for now, it may become the default in the future (e.g. 2027.1). More specifically, we have to confirm that performance does not degrade too much in the one case that is worse than master (the number-suffix dense benchmark). If the degradation is small, this branch remains worthwhile, given that it fixes a very bad case (single long word no digits).
It would also let us check whether the single long word no digits case was also bad with the modern engine, and whether this PR improves it.
Thanks.

Copilot AI review requested due to automatic review settings July 1, 2026 04:30

codeofdusk requested a review from a team as a code owner July 1, 2026 04:30

codeofdusk requested a review from SaschaCowley July 1, 2026 04:30

Copilot started reviewing on behalf of codeofdusk July 1, 2026 04:30

codeofdusk mentioned this pull request Jul 1, 2026

Streamline built-in speech dictionary rules #19518

Closed

5 tasks

Copilot AI reviewed Jul 1, 2026

Comment thread source/builtin.dic Outdated

Comment thread source/builtin.dic Outdated

codeofdusk force-pushed the better-builtindict-regex branch from 739be67 to 508a4a4 Compare July 1, 2026 04:37

Streamline built-in speech dictionary rules

45201d2

codeofdusk force-pushed the better-builtindict-regex branch from 508a4a4 to 45201d2 Compare July 1, 2026 04:38

CyrilleB79 reviewed Jul 1, 2026

Comment thread user_docs/en/changes.md Outdated

Review action

ebee96e

codeofdusk wants to merge 2 commits into nvaccess:master from codeofdusk:better-builtindict-regex

3 participants
