Add ICU word segmentation backend for browse mode word navigation #20379

Open
LeonarddeR wants to merge 14 commits into nvaccess:master from LeonarddeR:icu-word
@LeonarddeR LeonarddeR commented Jun 22, 2026

Collaborator

Link to issue number:

Closes #20343

Summary of the issue:

NVDA's word navigation in browse mode uses Windows Uniscribe (ScriptBreak), which has no dictionary-based segmentation for scripts that don't separate words with spaces. As a result, word navigation steps through Japanese text (and other complex scripts) one character at a time instead of moving by linguistic word. Multi-character emoji (ZWJ sequences) are likewise split.

Description of user facing changes:

  • A new "Windows Unicode (ICU)" option is added to the Word Segmentation Standard setting in the "Document Navigation" panel.
  • Under "Auto", word navigation now prefers ICU over the legacy Windows (Uniscribe) segmentation wherever ICU is available. Chinese word segmentation (cppjieba) continues to take precedence for Chinese text.
  • The existing "Standard" option is relabelled "Windows (legacy)" to make clear it is the older Uniscribe path.
  • Word navigation now works correctly for Japanese, Khmer and other complex scripts, and for multi-character emoji sequences, where the legacy segmentation previously fell back to character-level boundaries.

Description of developer facing changes:

  • New WordSegFlag.ICU flag and WordNavigationUnitFlag.ICU feature-flag enum value.
  • New IcuWordSegmentationStrategy in the _wordSeg strategy framework, backed by new low-level modules winBindings/icu.py (ctypes bindings to the Windows built-in ICU ubrk_* BreakIterator API) and textUtils/icu.py (calculateWordOffsets). Word boundaries follow Unicode Standard Annex #29 (UAX #29) plus automatic dictionary-based segmentation selected by the script of the text.
  • WordSegmenter._chooseStrategy reworked into an explicit fallback chain: Chinese (cppjieba) → ICU → Uniscribe. ICU is selected for the AUTO and ICU flags and as the fallback when cppjieba is unavailable; Uniscribe remains the final fallback and the only strategy for the explicit UNISCRIBE flag (it stays pinned where strictly required, e.g. EditTextInfo).
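The fallback chain above can be sketched roughly as follows. This is an illustrative stand-in, not the code in WordSegmenter._chooseStrategy: the function name, its boolean parameters, the returned strings and the CHINESE member are all hypothetical, and only the ordering Chinese → ICU → Uniscribe is taken from the description.

```python
from enum import Enum, auto


class WordSegFlag(Enum):
	"""Stand-in for the real flag enum; the CHINESE member name is an assumption."""

	AUTO = auto()
	CHINESE = auto()
	ICU = auto()
	UNISCRIBE = auto()


def choose_strategy(flag, text_is_chinese, jieba_available, icu_available):
	"""Return the name of the backend to use: Chinese, then ICU, then Uniscribe."""
	# The explicit UNISCRIBE flag stays pinned and is never upgraded.
	if flag is WordSegFlag.UNISCRIBE:
		return "uniscribe"
	# Chinese segmentation takes precedence for Chinese text when it is usable.
	wants_chinese = flag is WordSegFlag.CHINESE or (
		flag is WordSegFlag.AUTO and text_is_chinese
	)
	if wants_chinese and jieba_available:
		return "chinese"
	# ICU serves AUTO and ICU, and is the fallback when cppjieba is unavailable.
	if icu_available:
		return "icu"
	# Uniscribe is the final fallback.
	return "uniscribe"
```

For example, `choose_strategy(WordSegFlag.AUTO, True, False, True)` falls through to ICU because cppjieba is reported as unavailable, and any call with ICU unavailable ends at Uniscribe.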

Description of development approach:

ICU was integrated into the existing _wordSeg strategy framework introduced by the cppjieba PR (#20183), so that strategy selection lives in one place. The ICU layer is offset-only: IcuWordSegmentationStrategy.segmentedText returns the text unchanged (no braille separator insertion), so braille output is unaffected. Offsets are converted to/from UTF-16 for ICU. The ICU primitives use the root locale unconditionally because word boundaries are script-driven, not locale-driven. Trailing whitespace is attached to the preceding word to match NVDA's existing Uniscribe behaviour.
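As a rough illustration of the offset handling described above, the sketch below converts between Python string offsets and UTF-16 code unit offsets and extends a word over trailing whitespace. The three helper names are hypothetical and are not the functions in textUtils/icu.py; the conversion assumes the UTF-16 offset falls on a code point boundary.

```python
def str_to_utf16_offset(text, offset):
	"""Convert a Python str (code point) offset into a UTF-16 code unit offset."""
	encoded = text[:offset].encode("utf-16-le", errors="surrogatepass")
	return len(encoded) // 2


def utf16_to_str_offset(text, utf16_offset):
	"""Convert a UTF-16 code unit offset back into a Python str offset.

	Assumes utf16_offset does not point into the middle of a surrogate pair.
	"""
	encoded = text.encode("utf-16-le", errors="surrogatepass")
	prefix = encoded[: utf16_offset * 2]
	return len(prefix.decode("utf-16-le", errors="surrogatepass"))


def attach_trailing_whitespace(text, start, end):
	"""Extend a word's end offset over the whitespace that directly follows it."""
	while end < len(text) and text[end].isspace():
		end += 1
	return start, end


# U+1F600 needs two UTF-16 code units, so offsets after it differ by one.
sample = "\U0001F600 word"
utf16_offset = str_to_utf16_offset(sample, 2)
round_tripped = utf16_to_str_offset(sample, utf16_offset)
```

The surrogate-pair case is the one where the two offset systems diverge, which is why the unit tests mentioned below cover surrogate pairs and round-tripping.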

This PR scopes ICU to word segmentation only. ICU integration can be broadened in follow-ups to also drive character, line and sentence boundary detection, which would benefit from the same UAX#29 handling (e.g. grapheme clusters for character navigation).

Testing strategy:

  • Unit tests for the ICU word offset calculation (test_wordSegIcu.py) covering UAX#29 boundaries, dictionary-segmented scripts, whitespace attachment, surrogate pairs and offset round-tripping.
  • A backend comparison test (test_textUtils_backendComparison.py) asserting ICU vs Uniscribe divergence on Japanese/Khmer and on a multi-person emoji ZWJ sequence with skin-tone modifiers, plus parity on common cases.
  • Manual testing of word navigation in browse mode across Japanese, Khmer, emoji sequences and Chinese (cppjieba precedence preserved).

Known issues with pull request:

  • ICU requires Windows 10 version 1703 (Creators Update) or later; on older systems NVDA falls back to Uniscribe.
  • ICU coalesces a run of identical whitespace into one segment but splits mixed whitespace (space + tab) into separate segments. This is not special-cased, because legacy Uniscribe behaviour for mixed runs is itself inconsistent.
  • ICU splits some tokens that Uniscribe keeps whole (e.g. well-known → well/-/known, a@b.com). Tradeoff of UAX#29 default rules.
  • ICU treats trailing punctuation as a separate word, so word navigation stops on it independently (e.g. logo. → logo then .), whereas Uniscribe kept the punctuation attached to the preceding word. This matches the word-navigation behaviour of modern Windows edit controls such as the Start menu search field.

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

LeonarddeR and others added 2 commits June 22, 2026 10:52
Add the Windows ICU ctypes bindings (winBindings/icu.py) and the
textUtils.icu word-offset primitive (calculateWordOffsets), wire them into
the word segmentation strategy framework via IcuWordSegmentationStrategy, and
expose an ICU option through WordSegFlag and the WordNavigationUnitFlag
feature flag.

Word AUTO now prefers ICU whenever the ICU library is available (Chinese word
segmentation still takes precedence for Chinese text), with Uniscribe as the
fallback. ICU follows Unicode Standard Annex #29 and provides dictionary-based,
locale-aware segmentation for complex scripts such as Thai, Lao and Khmer.

Character segmentation is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LeonarddeR

Collaborator Author

@nishimotz Could you have a look at the current pr, especially regarding Japanese?

@LeonarddeR LeonarddeR marked this pull request as ready for review June 22, 2026 14:43
@LeonarddeR LeonarddeR requested review from a team as code owners June 22, 2026 14:43

Copilot AI left a comment

Contributor

Pull request overview

This PR adds a new ICU-based word segmentation backend (using Windows’ built-in ICU BreakIterator API) and integrates it into NVDA’s existing word segmentation strategy framework to improve browse mode word navigation for complex scripts (e.g. Japanese, Khmer) and multi-codepoint emoji sequences.

Changes:

  • Introduces a new ICU segmentation strategy with Windows ICU ctypes bindings and offset conversion utilities.
  • Updates strategy selection to prefer Chinese segmentation when appropriate, otherwise ICU (when available), and finally Uniscribe as a fallback.
  • Adds user-facing configuration/UI labels and documentation updates, plus new unit tests comparing ICU vs Uniscribe behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| user_docs/en/userGuide.md | Documents the new “Windows Unicode (ICU)” option and updated Auto/legacy behavior. |
| user_docs/en/changes.md | Adds release notes entries for ICU word segmentation and default preference behavior. |
| tests/unit/test_wordSegIcu.py | Adds unit tests for ICU strategy selection and primitive call behavior. |
| tests/unit/test_textUtils_backendComparison.py | Adds comparison tests documenting ICU vs Uniscribe parity/divergence (skipped when ICU absent). |
| source/winBindings/icu.py | Adds ctypes bindings for Windows’ built-in ICU ubrk_* APIs and availability detection. |
| source/textUtils/segFlag.py | Adds WordSegFlag.ICU. |
| source/textUtils/icu.py | Adds ICU-backed calculateWordOffsets helper with whitespace-attachment behavior. |
| source/textUtils/_wordSeg/wordSegStrategy.py | Adds IcuWordSegmentationStrategy and makes segmentedText default to identity. |
| source/textUtils/_wordSeg/wordSegmenter.py | Reworks strategy selection into Chinese → ICU → Uniscribe fallback chain. |
| source/textInfos/offsets.py | Maps the new config enum to WordSegFlag.ICU. |
| source/config/featureFlagEnums.py | Adds the ICU enum value and updates the legacy label to “Windows (legacy)”. |

Comment thread user_docs/en/changes.md Outdated
Comment thread source/textUtils/icu.py
Comment thread source/textUtils/icu.py Outdated
@seanbudd seanbudd added the conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review. label Jun 23, 2026

@SaschaCowley SaschaCowley left a comment

Member

Mostly superficial/documentation things

Comment thread source/winBindings/icu.py Outdated
Comment thread source/winBindings/icu.py Outdated
Comment thread source/winBindings/icu.py Outdated
Comment thread source/winBindings/icu.py Outdated
Comment thread source/winBindings/icu.py

Member

Could you add a note somewhere in this file explaining that it's safe to use _lib.*.restype/argtypes, because WinDLL calls aren't globally cached?

The reason we use WINFUNCTYPE elsewhere is because we use windll.* to obtain DLL handles, which are cached by ctypes. Since direct function access on WinDLL objects is cached by the object, dll = windll.library; func = dll.func; func.argtypes = ... is global to NVDA, which is particularly problematic because our declarations can break add-ons, and add-ons can break us. Since you're using WinDLL directly, the handle is internal to winBindings.icu, so the cache issue doesn't matter.
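A small sketch of the per-object caching described above. It uses ctypes.CDLL rather than WinDLL so that it also runs outside Windows; WinDLL subclasses CDLL, so the attribute caching is assumed to behave the same way. The library and symbol names are picked purely for illustration and have nothing to do with ICU.

```python
import ctypes
import sys

# Pick a library and an exported symbol that exist on the current platform.
if sys.platform == "win32":
	lib_name, func_name = "kernel32", "GetTickCount"
else:
	lib_name, func_name = None, "getpid"  # None loads the running process itself

# Two handles created directly from the class are independent objects.
first = ctypes.CDLL(lib_name)
second = ctypes.CDLL(lib_name)

first_func = getattr(first, func_name)
# Attribute access caches the function pointer on that one handle...
same_handle_is_cached = getattr(first, func_name) is first_func

# ...so a prototype set through one handle is not visible through the other.
first_func.argtypes = []
second_func = getattr(second, func_name)
prototype_leaked = second_func.argtypes is not None
```

Here `same_handle_is_cached` is true and `prototype_leaked` is false. With windll.library the library object itself is cached on the shared loader, so every caller gets the same handle and therefore the same cached function objects.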

Alternatively, you could write the library loading code to do something like the following, switch the function declarations to use WINFUNCTYPE, and skip adding the note:

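from ctypes import windll

try:
	# Prefer the combined ICU library (icu.dll).
	_lib = windll.icu
except OSError:
	try:
		# Fall back to the older split library (icuuc.dll).
		_lib = windll.icuuc
	except OSError:
		# ICU is unavailable on this system.
		_lib = None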

Ultimately I don't think it matters which route you take. Using windll and WINFUNCTYPE is more in line with the rest of winBindings, but this way is slightly tidier to read. That being said, if we decide to rename _lib to dll, we should probably go with WINFUNCTYPE so that you can't accidentally override functions' restype and argtypes "indirectly".

I'm not asking for pure pedantry; I only learned about this (seemingly undocumented?) difference when checking my assumptions before asking you to switch to WINFUNCTYPE because of the safety issue we discovered with windll.library.function described above.

Also wow, sorry this turned out to be super rambly!

Collaborator Author

I hope I changed it to something you like better. I think it is in the lines you suggested.

Comment thread source/config/featureFlagEnums.py Outdated
Comment thread source/config/featureFlagEnums.py
Comment thread source/textUtils/_wordSeg/wordSegmenter.py Outdated
Comment thread source/winBindings/icu.py
Comment thread source/textUtils/icu.py Outdated
@SaschaCowley SaschaCowley marked this pull request as draft June 26, 2026 05:45
@LeonarddeR LeonarddeR marked this pull request as ready for review June 26, 2026 17:55
@LeonarddeR LeonarddeR requested a review from SaschaCowley July 2, 2026 05:31

@SaschaCowley SaschaCowley left a comment

Member

Thanks, @LeonarddeR

Labels

conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review.

Development

Successfully merging this pull request may close these issues.

Word navigation shortcomings in Browse mode ICU can fix

4 participants