Add ICU word segmentation backend for browse mode word navigation #20379
LeonarddeR wants to merge 14 commits
Add the Windows ICU ctypes bindings (winBindings/icu.py) and the textUtils.icu word-offset primitive (calculateWordOffsets), wire them into the word segmentation strategy framework via IcuWordSegmentationStrategy, and expose an ICU option through WordSegFlag and the WordNavigationUnitFlag feature flag. Word AUTO now prefers ICU whenever the ICU library is available (Chinese word segmentation still takes precedence for Chinese text), with Uniscribe as the fallback. ICU follows Unicode Standard Annex #29 and provides dictionary-based, locale-aware segmentation for complex scripts such as Thai, Lao and Khmer. Character segmentation is unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nishimotz Could you have a look at the current PR, especially regarding Japanese?
Pull request overview
This PR adds a new ICU-based word segmentation backend (using Windows’ built-in ICU BreakIterator API) and integrates it into NVDA’s existing word segmentation strategy framework to improve browse mode word navigation for complex scripts (e.g. Japanese, Khmer) and multi-codepoint emoji sequences.
Changes:
- Introduces a new ICU segmentation strategy with Windows ICU ctypes bindings and offset conversion utilities.
- Updates strategy selection to prefer Chinese segmentation when appropriate, otherwise ICU (when available), and finally Uniscribe as a fallback.
- Adds user-facing configuration/UI labels and documentation updates, plus new unit tests comparing ICU vs Uniscribe behavior.
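The selection order described in the second bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not NVDA's implementation: `SegFlag`, `choose_strategy` and its keyword arguments are hypothetical stand-ins, the backends are represented by plain strings, and the handling of an explicit Uniscribe request is taken from the PR description.

```python
from enum import Enum, auto


class SegFlag(Enum):
    """Hypothetical stand-in for NVDA's WordSegFlag."""

    AUTO = auto()
    UNISCRIBE = auto()
    CHINESE = auto()
    ICU = auto()


def choose_strategy(flag, *, text_is_chinese, chinese_available, icu_available):
    """Return the name of the backend to use (assumed logic, for illustration only)."""
    # An explicit Uniscribe request is never overridden.
    if flag is SegFlag.UNISCRIBE:
        return "uniscribe"
    # Chinese segmentation takes precedence for Chinese text, if its backend loaded.
    wants_chinese = flag is SegFlag.CHINESE or (
        flag is SegFlag.AUTO and text_is_chinese
    )
    if wants_chinese and chinese_available:
        return "chinese"
    # Otherwise prefer ICU whenever the library is present.
    if icu_available:
        return "icu"
    # Uniscribe is the final fallback.
    return "uniscribe"
```

Keeping the chain in a single function means each backend's availability check is evaluated in one place, which is the property the PR attributes to the reworked selection code.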
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| user_docs/en/userGuide.md | Documents the new “Windows Unicode (ICU)” option and updated Auto/legacy behavior. |
| user_docs/en/changes.md | Adds release notes entries for ICU word segmentation and default preference behavior. |
| tests/unit/test_wordSegIcu.py | Adds unit tests for ICU strategy selection and primitive call behavior. |
| tests/unit/test_textUtils_backendComparison.py | Adds comparison tests documenting ICU vs Uniscribe parity/divergence (skipped when ICU absent). |
| source/winBindings/icu.py | Adds ctypes bindings for Windows’ built-in ICU ubrk_* APIs and availability detection. |
| source/textUtils/segFlag.py | Adds WordSegFlag.ICU. |
| source/textUtils/icu.py | Adds ICU-backed calculateWordOffsets helper with whitespace-attachment behavior. |
| source/textUtils/_wordSeg/wordSegStrategy.py | Adds IcuWordSegmentationStrategy and makes segmentedText default to identity. |
| source/textUtils/_wordSeg/wordSegmenter.py | Reworks strategy selection into Chinese → ICU → Uniscribe fallback chain. |
| source/textInfos/offsets.py | Maps the new config enum to WordSegFlag.ICU. |
| source/config/featureFlagEnums.py | Adds the ICU enum value and updates the legacy label to “Windows (legacy)”. |
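The whitespace-attachment behaviour listed for source/textUtils/icu.py could look something like the sketch below. It assumes the break iterator reports whitespace runs as separate segments and that the boundary list starts at 0 and ends at the text length; `attach_trailing_whitespace` is a hypothetical name, and the real calculateWordOffsets may be structured quite differently.

```python
import bisect


def attach_trailing_whitespace(text, boundaries, offset):
    """Return (start, end) of the word at ``offset``, trailing whitespace included.

    ``boundaries`` is assumed to be a sorted list of break offsets that starts
    at 0 and ends at len(text), as a word break iterator would report them.
    """
    if len(boundaries) < 2:
        return 0, len(text)
    last = len(boundaries) - 2  # index of the final segment
    # Find the segment that contains the offset, clamped to valid segments.
    seg = min(max(bisect.bisect_right(boundaries, offset) - 1, 0), last)
    # If the offset sits on whitespace, step back to the word it belongs to.
    while seg > 0 and text[boundaries[seg]:boundaries[seg + 1]].isspace():
        seg -= 1
    # Extend the end over any whitespace-only segments that follow.
    end = seg + 1
    while end <= last and text[boundaries[end]:boundaries[end + 1]].isspace():
        end += 1
    return boundaries[seg], boundaries[end]
```

With the text "hello world" and boundaries [0, 5, 6, 11], an offset inside "hello" or on the space resolves to the same span, so moving by word never stops on the space itself.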
SaschaCowley left a comment
Mostly superficial/documentation things
Could you add a note somewhere in this file explaining that it's safe to use _lib.*.restype/argtypes, because WinDLL calls aren't globally cached?
The reason we use WINFUNCTYPE elsewhere is because we use windll.* to obtain DLL handles, which are cached by ctypes. Since direct function access on WinDLL objects is cached by the object, dll = windll.library; func = dll.func; func.argtypes = ... is global to NVDA, which is particularly problematic because our declarations can break add-ons, and add-ons can break us. Since you're using WinDLL directly, the handle is internal to winBindings.icu, so the cache issue doesn't matter.
Alternatively, you could write the library loading code to do something like the following, switch the function declarations to use WINFUNCTYPE, and skip adding the note:

    try:
        _lib = windll.icu
    except OSError:
        try:
            _lib = windll.icuuc
        except OSError:
            pass

Ultimately I don't think it matters which route you take. Using windll and WINFUNCTYPE is more in line with the rest of winBindings, but this way is slightly tidier to read. That being said, if we decide to rename _lib to dll, we should probably go with WINFUNCTYPE so that you can't accidentally override functions' restype and argtypes "indirectly".
I'm not asking for pure pedantry; I only learned about this (seemingly undocumented?) difference when checking my assumptions before asking you to switch to WINFUNCTYPE because of the safety issue we discovered with windll.library.function described above.
Also wow, sorry this turned out to be super rambly!
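The per-object caching described in this comment can be observed with a small, portable experiment. The sketch below is only an analogy: it uses the running Python runtime library rather than ICU, `ctypes.pythonapi` plays the role of a process-wide handle such as `windll.library`, and `open_private_handle` is a hypothetical helper that plays the role of a directly constructed `WinDLL` object.

```python
import ctypes
import sys


def open_private_handle():
    """Open a new library object for the running Python runtime."""
    if sys.platform == "win32":
        # Mirrors how ctypes itself builds ctypes.pythonapi on Windows.
        return ctypes.PyDLL("python dll", None, sys.dllhandle)
    return ctypes.PyDLL(None)


shared = ctypes.pythonapi        # process-wide object that any module can reach
private = open_private_handle()  # object known only to this module

# Function lookups are cached on the library object itself ...
cached_on_object = private.Py_GetVersion is private.Py_GetVersion
# ... so a separately created object has its own function objects.
separate_per_object = private.Py_GetVersion is not shared.Py_GetVersion

# Declaring a prototype on the private object leaves the shared one untouched.
private.Py_GetVersion.restype = ctypes.c_char_p
private.Py_GetVersion.argtypes = []
shared_unaffected = shared.Py_GetVersion.restype is not ctypes.c_char_p

version = private.Py_GetVersion().decode("ascii")
```

Had the prototype been set through `shared` instead, every other user of that object would have seen the change, which is the hazard the comment describes for handles obtained via `windll.*`.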
I hope I changed it to something you like better. I think it is along the lines you suggested.
Co-authored-by: Sascha Cowley <16543535+SaschaCowley@users.noreply.github.com>
Link to issue number:
Closes #20343
Summary of the issue:
NVDA's word navigation in browse mode uses Windows Uniscribe (ScriptBreak), which has no dictionary-based segmentation for scripts that don't separate words with spaces. As a result, word navigation steps through Japanese text (and other complex scripts) one character at a time instead of moving by linguistic word. Multi-character emoji (ZWJ sequences) are likewise split.
Description of user facing changes:
Description of developer facing changes:
- WordSegFlag.ICU flag and WordNavigationUnitFlag.ICU feature-flag enum value.
- IcuWordSegmentationStrategy in the _wordSeg strategy framework, backed by new low-level modules winBindings/icu.py (ctypes bindings to the Windows built-in ICU ubrk_* BreakIterator API) and textUtils/icu.py (calculateWordOffsets). Word boundaries follow Unicode Standard Annex #29 plus automatic dictionary-based segmentation selected by the script of the text.
- WordSegmenter._chooseStrategy reworked into an explicit fallback chain: Chinese (cppjieba) → ICU → Uniscribe. ICU is selected for the AUTO and ICU flags and as the fallback when cppjieba is unavailable; Uniscribe remains the final fallback and the only strategy for the explicit UNISCRIBE flag (it stays pinned where strictly required, e.g. EditTextInfo).
Description of development approach:
ICU was integrated into the existing _wordSeg strategy framework introduced by the cppjieba PR (#20183), so that strategy selection lives in one place. The ICU layer is offset-only: IcuWordSegmentationStrategy.segmentedText returns the text unchanged (no braille separator insertion), so braille output is unaffected. Offsets are converted to/from UTF-16 for ICU. The ICU primitives use the root locale unconditionally because word boundaries are script-driven, not locale-driven. Trailing whitespace is attached to the preceding word to match NVDA's existing Uniscribe behaviour.
This PR scopes ICU to word segmentation only. ICU integration can be broadened in follow-ups to also drive character, line and sentence boundary detection, which would benefit from the same UAX#29 handling (e.g. grapheme clusters for character navigation).
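The UTF-16 offset conversion mentioned above is needed because ICU counts 16-bit code units while Python strings are indexed by code point, so any character outside the Basic Multilingual Plane shifts later offsets by one. A minimal sketch of the idea follows; the function names are hypothetical and this is not the conversion code used in the PR.

```python
def str_offset_to_utf16(text, offset):
    """Convert a Python str offset (code points) to a UTF-16 code unit offset."""
    # Each UTF-16 code unit is two bytes, so the byte length halves to units.
    return len(text[:offset].encode("utf-16-le", errors="surrogatepass")) // 2


def utf16_offset_to_str(text, utf16_offset):
    """Convert a UTF-16 code unit offset back to a Python str offset."""
    encoded = text.encode("utf-16-le", errors="surrogatepass")
    prefix = encoded[: utf16_offset * 2]
    return len(prefix.decode("utf-16-le", errors="surrogatepass"))
```

For a string such as "a😀b", the emoji occupies one position in Python but two code units in UTF-16, so the offset of "b" is 2 on the Python side and 3 on the ICU side.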
Testing strategy:
- Unit tests (test_wordSegIcu.py) covering UAX#29 boundaries, dictionary-segmented scripts, whitespace attachment, surrogate pairs and offset round-tripping.
- Comparison tests (test_textUtils_backendComparison.py) asserting ICU vs Uniscribe divergence on Japanese/Khmer and on a multi-person emoji ZWJ sequence with skin-tone modifiers, plus parity on common cases.
Known issues with pull request:
- well-known → well/-/known, a@b.com). Tradeoff of UAX#29 default rules.
- logo. → logo then .), whereas Uniscribe kept the punctuation attached to the preceding word. This matches the word-navigation behaviour of modern Windows edit controls such as the Start menu search field.
Code Review Checklist: