Add ICU word segmentation backend for browse mode word navigation by LeonarddeR · Pull Request #20379 · nvaccess/nvda

LeonarddeR · 2026-06-22T09:11:57Z

Link to issue number:

Closes #20343

Summary of the issue:

NVDA's word navigation in browse mode uses Windows Uniscribe (ScriptBreak), which has no dictionary-based segmentation for scripts that don't separate words with spaces. As a result, word navigation steps through Japanese text (and other complex scripts) one character at a time instead of moving by linguistic word. Multi-character emoji (ZWJ sequences) are likewise split.
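As a small illustration of why character-level stepping breaks emoji apart, the snippet below counts the pieces inside one ZWJ sequence. It uses only the Python standard library and does not call ScriptBreak or ICU; the chosen emoji is just an example.

```python
# Family emoji: MAN + ZERO WIDTH JOINER + WOMAN + ZERO WIDTH JOINER + GIRL.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# One user-perceived character, but several code points and UTF-16 code units.
codePoints = len(family)
utf16Units = len(family.encode("utf-16-le")) // 2

print(f"{codePoints} code points, {utf16Units} UTF-16 code units")
# prints "5 code points, 8 UTF-16 code units"
```

A segmenter that places a boundary at every code point or code unit therefore stops several times inside what a reader sees as one symbol.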

Description of user facing changes:

A new "Windows Unicode (ICU)" option is added to the Word Segmentation Standard setting in the "Document Navigation" panel.
Under "Auto", word navigation now prefers ICU over the legacy Windows (Uniscribe) segmentation wherever ICU is available. Chinese word segmentation (cppjieba) continues to take precedence for Chinese text.
The existing "Standard" option is relabelled "Windows (legacy)" to make clear it is the older Uniscribe path.
Word navigation now works correctly for Japanese, Khmer and other complex scripts, and for multi-character emoji sequences, where the legacy segmentation previously fell back to character-level boundaries.

Description of developer facing changes:

New WordSegFlag.ICU flag and WordNavigationUnitFlag.ICU feature-flag enum value.
New IcuWordSegmentationStrategy in the _wordSeg strategy framework, backed by new low-level modules winBindings/icu.py (ctypes bindings to the Windows built-in ICU ubrk_* BreakIterator API) and textUtils/icu.py (calculateWordOffsets). Word boundaries follow Unicode Standard Annex #29 plus automatic dictionary-based segmentation selected by the script of the text.
WordSegmenter._chooseStrategy reworked into an explicit fallback chain: Chinese (cppjieba) → ICU → Uniscribe. ICU is selected for the AUTO and ICU flags and as the fallback when cppjieba is unavailable; Uniscribe remains the final fallback and the only strategy for the explicit UNISCRIBE flag (it stays pinned where strictly required, e.g. EditTextInfo).
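The fallback chain described above can be pictured with the following minimal sketch. It is not the code in wordSegmenter.py: the `Flag` enum, the `chooseStrategy` function, its boolean parameters and the returned strings are all hypothetical stand-ins, and how Chinese text is detected is not covered here.

```python
from enum import Enum, auto


class Flag(Enum):
    """Hypothetical stand-in for WordSegFlag (member names are assumptions)."""

    AUTO = auto()
    CHINESE = auto()
    ICU = auto()
    UNISCRIBE = auto()


def chooseStrategy(flag: Flag, textIsChinese: bool, jiebaAvailable: bool, icuAvailable: bool) -> str:
    """Return the name of the backend picked by a Chinese -> ICU -> Uniscribe chain."""
    # An explicit UNISCRIBE request is never upgraded to another backend.
    if flag is Flag.UNISCRIBE:
        return "uniscribe"
    # Chinese segmentation takes precedence for Chinese text under AUTO.
    wantsChinese = flag is Flag.CHINESE or (flag is Flag.AUTO and textIsChinese)
    if wantsChinese and jiebaAvailable:
        return "cppjieba"
    # ICU serves AUTO and ICU, and is the fallback when cppjieba is missing.
    if icuAvailable:
        return "icu"
    # Uniscribe is the final fallback.
    return "uniscribe"


print(chooseStrategy(Flag.AUTO, textIsChinese=True, jiebaAvailable=False, icuAvailable=True))
# prints "icu"
```

Keeping the decision in one function mirrors the stated goal of having strategy selection live in a single place.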

Description of development approach:

ICU was integrated into the existing _wordSeg strategy framework introduced by the cppjieba PR (#20183), so that strategy selection lives in one place. The ICU layer is offset-only: IcuWordSegmentationStrategy.segmentedText returns the text unchanged (no braille separator insertion), so braille output is unaffected. Offsets are converted to/from UTF-16 for ICU. The ICU primitives use the root locale unconditionally because word boundaries are script-driven, not locale-driven. Trailing whitespace is attached to the preceding word to match NVDA's existing Uniscribe behaviour.
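Two of the steps mentioned above, UTF-16 offset conversion and attaching trailing whitespace to the preceding word, can be sketched in plain Python. The function names below are hypothetical and this is not the implementation in textUtils/icu.py; the segment list is hand-written rather than produced by ICU.

```python
def strToUtf16Offset(text: str, offset: int) -> int:
    """Convert a Python str offset (code points) to a UTF-16 code unit offset."""
    return len(text[:offset].encode("utf-16-le", "surrogatepass")) // 2


def utf16ToStrOffset(text: str, utf16Offset: int) -> int:
    """Convert a UTF-16 code unit offset back to a Python str offset.

    An offset that falls inside a surrogate pair is rounded up to the next character.
    """
    units = 0
    for index, char in enumerate(text):
        if units >= utf16Offset:
            return index
        # Characters outside the Basic Multilingual Plane take two code units.
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


def attachTrailingWhitespace(segments: list[str]) -> list[str]:
    """Merge each whitespace-only segment into the segment before it."""
    merged: list[str] = []
    for segment in segments:
        if merged and segment.isspace():
            merged[-1] += segment
        else:
            merged.append(segment)
    return merged


sample = "a\U0001F600b"  # "a", an emoji outside the BMP, "b"
print(strToUtf16Offset(sample, 2))  # prints 3
print(utf16ToStrOffset(sample, 3))  # prints 2
print(attachTrailingWhitespace(["Hello", " ", "world", " "]))  # prints ['Hello ', 'world ']
```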

This PR scopes ICU to word segmentation only. ICU integration can be broadened in follow-ups to also drive character, line and sentence boundary detection, which would benefit from the same UAX#29 handling (e.g. grapheme clusters for character navigation).

Testing strategy:

Unit tests for the ICU word offset calculation (test_wordSegIcu.py) covering UAX#29 boundaries, dictionary-segmented scripts, whitespace attachment, surrogate pairs and offset round-tripping.
A backend comparison test (test_textUtils_backendComparison.py) asserting ICU vs Uniscribe divergence on Japanese/Khmer and on a multi-person emoji ZWJ sequence with skin-tone modifiers, plus parity on common cases.
Manual testing of word navigation in browse mode across Japanese, Khmer, emoji sequences and Chinese (cppjieba precedence preserved).

Known issues with pull request:

ICU requires Windows 10 version 1703 (Creators Update) or later; on older systems NVDA falls back to Uniscribe.
ICU coalesces a run of identical whitespace into one segment but splits mixed whitespace (space + tab) into separate segments. Not special-cased — legacy Uniscribe behaviour for mixed runs is itself inconsistent.
ICU splits some tokens that Uniscribe keeps whole (e.g. well-known → well/-/known, a@b.com). Tradeoff of UAX#29 default rules.
ICU treats trailing punctuation as a separate word, so word navigation stops on it independently (e.g. logo. → logo then .), whereas Uniscribe kept the punctuation attached to the preceding word. This matches the word-navigation behaviour of modern Windows edit controls such as the Start menu search field.
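The fallback mentioned in the first known issue can be pictured as an availability probe. This is a sketch only: the function names are hypothetical, the library names `icu` and `icuuc` are taken from the review discussion on this PR, and the real detection in winBindings/icu.py may differ.

```python
import ctypes
import sys


def loadWindowsIcu():
    """Return a private ctypes handle to the Windows ICU library, or None if unavailable."""
    if sys.platform != "win32":
        return None
    # Try the library names in order; loading fails with OSError when a DLL is absent.
    for name in ("icu", "icuuc"):
        try:
            return ctypes.WinDLL(name)
        except OSError:
            continue
    return None


def pickBackend() -> str:
    """Prefer ICU when the library loads, otherwise fall back to Uniscribe."""
    return "icu" if loadWindowsIcu() is not None else "uniscribe"


print(pickBackend())
```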

Code Review Checklist:

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

Add the Windows ICU ctypes bindings (winBindings/icu.py) and the textUtils.icu word-offset primitive (calculateWordOffsets), wire them into the word segmentation strategy framework via IcuWordSegmentationStrategy, and expose an ICU option through WordSegFlag and the WordNavigationUnitFlag feature flag. Word AUTO now prefers ICU whenever the ICU library is available (Chinese word segmentation still takes precedence for Chinese text), with Uniscribe as the fallback. ICU follows Unicode Standard Annex #29 and provides dictionary-based, locale-aware segmentation for complex scripts such as Thai, Lao and Khmer. Character segmentation is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

LeonarddeR · 2026-06-22T09:52:26Z

@nishimotz Could you have a look at the current pr, especially regarding Japanese?

Copilot

Pull request overview

This PR adds a new ICU-based word segmentation backend (using Windows’ built-in ICU BreakIterator API) and integrates it into NVDA’s existing word segmentation strategy framework to improve browse mode word navigation for complex scripts (e.g. Japanese, Khmer) and multi-codepoint emoji sequences.

Changes:

Introduces a new ICU segmentation strategy with Windows ICU ctypes bindings and offset conversion utilities.
Updates strategy selection to prefer Chinese segmentation when appropriate, otherwise ICU (when available), and finally Uniscribe as a fallback.
Adds user-facing configuration/UI labels and documentation updates, plus new unit tests comparing ICU vs Uniscribe behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| user_docs/en/userGuide.md | Documents the new “Windows Unicode (ICU)” option and updated Auto/legacy behavior. |
| user_docs/en/changes.md | Adds release notes entries for ICU word segmentation and default preference behavior. |
| tests/unit/test_wordSegIcu.py | Adds unit tests for ICU strategy selection and primitive call behavior. |
| tests/unit/test_textUtils_backendComparison.py | Adds comparison tests documenting ICU vs Uniscribe parity/divergence (skipped when ICU absent). |
| source/winBindings/icu.py | Adds ctypes bindings for Windows’ built-in ICU `ubrk_*` APIs and availability detection. |
| source/textUtils/segFlag.py | Adds `WordSegFlag.ICU`. |
| source/textUtils/icu.py | Adds ICU-backed `calculateWordOffsets` helper with whitespace-attachment behavior. |
| source/textUtils/_wordSeg/wordSegStrategy.py | Adds `IcuWordSegmentationStrategy` and makes `segmentedText` default to identity. |
| source/textUtils/_wordSeg/wordSegmenter.py | Reworks strategy selection into Chinese → ICU → Uniscribe fallback chain. |
| source/textInfos/offsets.py | Maps the new config enum to `WordSegFlag.ICU`. |
| source/config/featureFlagEnums.py | Adds the ICU enum value and updates the legacy label to “Windows (legacy)”. |

SaschaCowley

Mostly superficial/documentation things

SaschaCowley · 2026-06-26T04:55:52Z

Could you add a note somewhere in this file explaining that it's safe to use _lib.*.restype/argtypes, because WinDLL calls aren't globally cached?

The reason we use WINFUNCTYPE elsewhere is because we use windll.* to obtain DLL handles, which are cached by ctypes. Since direct function access on WinDLL objects is cached by the object, dll = windll.library; func = dll.func; func.argtypes = ... is global to NVDA, which is particularly problematic because our declarations can break add-ons, and add-ons can break us. Since you're using WinDLL directly, the handle is internal to winBindings.icu, so the cache issue doesn't matter.
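The caching difference described here can be checked with a short, standard-library-only sketch. It is an illustration rather than code from winBindings; `compareHandles` is a hypothetical name, and on non-Windows systems it substitutes `cdll`/`CDLL` and the C library, which go through the same ctypes caching code as `windll`/`WinDLL`.

```python
import ctypes
import ctypes.util
import sys


def compareHandles():
    """Compare loader-cached handles with a directly constructed one.

    Returns None when no suitable library can be located.
    """
    if sys.platform == "win32":
        loader, dllType = ctypes.windll, ctypes.WinDLL
        libName, funcName = "kernel32", "GetTickCount"
    else:
        # Stand-in so the sketch also runs outside Windows.
        loader, dllType = ctypes.cdll, ctypes.CDLL
        libName, funcName = ctypes.util.find_library("c"), "abs"
        if libName is None:
            return None
    # Attribute access on the loader (e.g. windll.kernel32) caches the library object.
    sharedA = getattr(loader, libName)
    sharedB = getattr(loader, libName)
    # Constructing the DLL class directly yields a handle private to the caller.
    private = dllType(libName)
    return {
        "loaderReturnsSameLibrary": sharedA is sharedB,
        "functionObjectIsShared": getattr(sharedA, funcName) is getattr(sharedB, funcName),
        "privateFunctionIsDistinct": getattr(private, funcName) is not getattr(sharedA, funcName),
    }


print(compareHandles())
```

When all three values are True, setting `argtypes` or `restype` on a function reached through the loader affects every other user of that loader in the process, while the same assignment on the private handle stays local.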

Alternatively, you could write the library loading code to do something like the following, switch the function declarations to use WINFUNCTYPE, and skip adding the note:

    try:
        _lib = windll.icu
    except OSError:
        try:
            _lib = dll.icuuc
        except OSError:
            pass

Ultimately I don't think it matters which route you take. Using windll and WINFUNCTYPE is more in line with the rest of winBindings, but this way is slightly tidier to read. That being said, if we decide to rename _lib to dll, we should probably go with WINFUNCTYPE so that you can't accidentally override functions' restype and argtypes "indirectly".

I'm not asking for pure pedantry; I only learned about this (seemingly undocumented?) difference when checking my assumptions before asking you to switch to WINFUNCTYPE because of the safety issue we discovered with windll.library.function described above.

Also wow, sorry this turned out to be super rambly!

I hope I changed it to something you like better. I think it is along the lines you suggested.

Co-authored-by: Sascha Cowley <16543535+SaschaCowley@users.noreply.github.com>

SaschaCowley

Thanks, @LeonarddeR

LeonarddeR and others added 2 commits June 22, 2026 10:52

Add emoji test

4ab9acc

Simplification

6369257

LeonarddeR force-pushed the icu-word branch from 7e60a98 to 6369257 June 22, 2026 13:57

LeonarddeR added 4 commits June 22, 2026 16:01

Fix copyright

0392c08

Cleanup

fc2173c

Update changes

c398a1a

Fix user guide

2c43e83

LeonarddeR marked this pull request as ready for review June 22, 2026 14:43

LeonarddeR requested review from a team as code owners June 22, 2026 14:43

LeonarddeR requested review from Qchristensen, SaschaCowley and Copilot June 22, 2026 14:43

Copilot started reviewing on behalf of LeonarddeR June 22, 2026 14:44

Copilot AI reviewed Jun 22, 2026


Comment thread user_docs/en/changes.md Outdated

Comment thread source/textUtils/icu.py

Comment thread source/textUtils/icu.py Outdated

Copilot review actions

939d489

seanbudd added the conceptApproved label (Similar 'triaged' for issues, PR accepted in theory, implementation needs review) Jun 23, 2026

Merge remote-tracking branch 'origin/master' into icu-word

f8952d8

SaschaCowley requested changes Jun 26, 2026


SaschaCowley marked this pull request as draft June 26, 2026 05:45

LeonarddeR and others added 4 commits June 26, 2026 18:27

Apply suggestions from code review

98a1f38

Co-authored-by: Sascha Cowley <16543535+SaschaCowley@users.noreply.github.com>

Review actions

86d4bca

Fix user guide

75c068e

Merge remote-tracking branch 'origin/master' into icu-word

f4612ba

LeonarddeR marked this pull request as ready for review June 26, 2026 17:55

Fix system tests

cd06242

LeonarddeR requested a review from SaschaCowley July 2, 2026 05:31

SaschaCowley approved these changes Jul 3, 2026


4 participants
