Skip to content

Grapheme-cluster aware cursor, hit-testing and soft wrap on top of #5922#5988

Open
xyos wants to merge 20 commits into
ajaxorg:masterfrom
xyos:varchar3-graphemes
Open

Grapheme-cluster aware cursor, hit-testing and soft wrap on top of #5922#5988
xyos wants to merge 20 commits into
ajaxorg:masterfrom
xyos:varchar3-graphemes

Conversation

@xyos

@xyos xyos commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Builds on #5922 (initial range.getClientRects()-based text measurement). This adds the two commits on top of that branch making the cursor/column model and soft wrap grapheme-cluster aware, and fixing related rendering issues found while testing against a real browser.

Changes on top of #5922

Grapheme-cluster cursor model & pixel-accurate coordinates (50aff8d)

  • New lang.getGraphemeCluster() / lang.getGraphemeBoundaries() helpers using Intl.Segmenter (grapheme granularity), with a surrogate-pair fallback when unavailable.
  • Selection.moveCursorTo snaps to grapheme cluster boundaries instead of only surrogate pairs, so arrow keys traverse ZWJ emoji (👨‍👩‍👧‍👦) and combining marks (ã̤) in one step.
  • VirtualRenderer.textToScreenCoordinates uses fontMetrics.textWidth() instead of column × characterWidth, so forward and inverse coordinate mappings agree.
  • FontMetrics.$pixelToColumn hit-testing iterates real grapheme clusters, treats span seams inclusively (fixes clicks with showInvisibles falling between spans), and clamps clicks left of line start to column 0.
  • U+200D removed from the invalid-invisible regex so ZWJ emoji render intact (fixes Emojis consisting of surrogate pairs are displayed incorrectly #5813).
  • Line-start RLE (U+202B) renders as plain text instead of a red invalid dot when rtl/rtlText is enabled (fixes ace.js: placing a red dot when clicking "enter" #5423, same approach as Fix: make RTL new line a valid character #5434); Trojan-Source highlighting still applies outside RTL mode.

Grapheme-aware soft wrap (5d2bc1d)

  • $getDisplayTokens marks continuation code units of grapheme clusters with a new CHAR_CONT display token (only for lines containing chars ≥ U+0300).
  • $computeWrapSplits never splits inside a cluster: a split landing mid-cluster moves back to the cluster start, or past the cluster when it is wider than the wrap limit. Before this, a wrap limit crossing 👨‍👩‍👧‍👦 split its 11 code units across screen rows.

Issues this addresses

#460, #4142, #5431, #5813, #4602, #5423, #5436, #3753, #3866, #3617

Testing

  • experiments/widthchar.html — a browser harness with monospace/proportional/RTL/wrapped editors over pathological lines (BMP & ZWJ emoji, combining marks, CJK, tabs, Hebrew, U+FE0F, APL fallback glyphs). It verifies cursor DOM position against Range ground truth, textToScreen→screenToText round-trips at every grapheme boundary, arrow-walk stops, selection-marker coverage, wrap-split boundaries, and the RTL red-dot/backspace scenarios: 2,400+ assertions passing in Chrome, 0 failures (baseline before these commits: ~280 failures).
  • New unit test wrapLine split never breaks grapheme clusters in edit_session_test.js; full suite unchanged otherwise (1,480 passing).

Testing Page

Open kitchen-sink @ 1d8fb139b0e049a066490d7ebf07015e4565e088

Open kitchen-sink @ 3d5b9198934ec623b161b2d55faa5b3390c0dbd7

mkslanc and others added 17 commits May 5, 2026 18:07
… calculations

# Conflicts:
#	src/virtual_renderer.js
… rendering

- add getGraphemeCluster/getGraphemeBoundaries helpers (Intl.Segmenter) to lang
- snap cursor to grapheme cluster boundaries in moveCursorTo (ZWJ, combining marks)
- use fontMetrics.textWidth in textToScreenCoordinates instead of column * charWidth
- iterate grapheme clusters in $pixelToColumn hit-testing; treat span seams inclusively;
  clamp clicks left of line start to column 0
- stop rendering U+200D as invalid invisible so ZWJ emoji stay intact (ajaxorg#5813)
- render line-start RLE as plain text when rtl/rtlText enabled instead of red dot (ajaxorg#5423)
- add experiments/widthchar.html browser test harness for variable-width rendering
  covering ajaxorg#460 ajaxorg#4142 ajaxorg#5431 ajaxorg#5813 ajaxorg#4602 ajaxorg#5436 ajaxorg#5423 ajaxorg#3753 ajaxorg#3866 ajaxorg#3617
- $getDisplayTokens marks continuation code units of grapheme clusters
  (surrogate pairs, combining marks, ZWJ sequences) with CHAR_CONT so
  wrap-split candidates can see cluster boundaries
- $computeWrapSplits/addSplit moves any split falling inside a cluster
  back to the cluster start, or past it when the cluster is wider than
  the wrap limit, so soft wrap never tears a grapheme apart
- unit tests for ZWJ/combining-mark/surrogate wrap splits
- widthchar harness: wrapped editor with emoji/CJK/hebrew lines checking
  splits land on grapheme boundaries, caret round-trips, and arrow walks
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown

One of the public type files has been updated, plase make sure there are no backwards incompatible changes done in the PR.

@codecov

codecov Bot commented Jul 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.49897% with 139 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.90%. Comparing base (c89c0d5) to head (3d5b919).
⚠️ Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/text.js 55.05% 40 Missing ⚠️
src/layer/font_metrics.js 93.91% 32 Missing ⚠️
src/edit_session.js 74.11% 22 Missing ⚠️
src/lib/lang.js 83.82% 22 Missing ⚠️
src/test/mockdom.js 93.54% 18 Missing ⚠️
src/selection.js 55.55% 4 Missing ⚠️
src/layer/marker.js 93.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5988      +/-   ##
==========================================
+ Coverage   93.73%   93.90%   +0.17%     
==========================================
  Files         640      641       +1     
  Lines      137743   138873    +1130     
  Branches    14487    14802     +315     
==========================================
+ Hits       129113   130411    +1298     
+ Misses       8630     8462     -168     
Flag Coverage Δ
unittests 93.90% <90.49%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Screen columns now count grapheme clusters instead of UTF-16 code units,
so a ZWJ emoji occupies one logical column instead of eleven:

- lang.mayContainGraphemeClusters: cheap regexp gate (marks, ZWJ,
  variation selectors, astral code points, conjoining jamo) so plain
  ASCII/CJK/hebrew lines skip segmentation entirely
- $getStringScreenWidth walks grapheme boundaries; doc<->screen mapping
  (documentToScreenPosition/screenToDocumentPosition) follows since both
  funnel through it
- $getDisplayTokens emits one display token per cluster and carries a
  parallel docLengths array; $computeWrapSplits consumes it so wrap
  budget is measured in visual columns and doc split offsets stay in
  code units; CHAR_CONT hack removed
- text layer $renderToken returns grapheme column counts and computes
  tab stops from grapheme columns
- font_metrics $findColumnPosition/$pixelToColumn map screen columns
  to text-node offsets via grapheme boundaries
- widthchar harness: graphemeColumns check; updated wrap/surrogate test
  expectations to the new column semantics
@xyos

xyos commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

Pushed a third commit (423a67e) completing the column model rework: 1 grapheme cluster = 1 screen column.

Screen columns now count grapheme clusters instead of UTF-16 code units, addressing the "column must be column" model discussed in #5431:

  • $getStringScreenWidth walks grapheme boundaries (both documentToScreenPosition and screenToDocumentPosition funnel through it), gated by a cheap regexp so plain text skips segmentation.
  • $getDisplayTokens emits one display token per cluster with a parallel docLengths array; $computeWrapSplits consumes it, so wrap budget is now measured in visual columns — a line of ZWJ emoji no longer wraps 11× too early — while wrap split offsets remain code-unit doc columns.
  • Text layer and font-metrics DOM walkers translate between grapheme screen columns and code-unit text-node offsets.

Verified in Chrome via the experiments/widthchar.html harness (2,800+ assertions incl. new graphemeColumns and vertical-navigation desired-column checks, 0 failures) and the node suite (1,480 passing; the 2 failures are pre-existing on master-adjacent branches).

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown

One of the public type files has been updated, plase make sure there are no backwards incompatible changes done in the PR.

xyos added 2 commits July 4, 2026 16:53
Found by benchmarking against the pre-grapheme baseline:

- $getStringScreenWidth: iterate the segmenter lazily (lang.forEachGrapheme)
  when maxScreenColumn is bounded so the early exit stays O(maxScreenColumn);
  bounded query on a 54k-unit emoji line: 3.7ms -> 0.03ms per call. Without
  a bound use countGraphemes/getGraphemeBoundaries (no per-cluster callback).
- lang.getGraphemeBoundaries: memoize the last few segmentations; cursor
  rendering and doc<->screen mapping repeatedly segment equal line prefixes
  (documentToScreenPosition on a 54k-unit emoji line: 8.1ms -> 0.2ms per call)
- $renderToken: segment the token value once and reuse a monotonic boundary
  cursor for tab stops instead of re-segmenting the prefix per tab
  (9.6k-unit emoji TSV line render: 356ms -> 3ms)
- $pixelToColumn: gate per-text-node segmentation on mayContainGraphemeClusters
- document why line-start RLE renders as plain text only in rtl modes
- mayContainGraphemeClusters missed cluster-forming characters outside
  \p{M}: prepend chars (arabic number signs U+0600-0605 U+06DD U+070F
  U+0890 U+0891 U+08E2, malayalam dot reph U+0D4E), spacing marks
  (thai/lao AM U+0E33 U+0EB3), ZWNJ and halfwidth katakana voicing marks.
  Gated fast paths counted these as code units while ungated paths counted
  clusters, giving two different column systems for the same line.
  Verified against an exhaustive BMP scan of Intl.Segmenter behavior.
- text layer: align token boundaries to grapheme cluster boundaries before
  rendering () so a tokenizer splitting inside a
  cluster (keycap emoji digit+FE0F+20E3) cannot desync rendered spans
  from the session's whole-line segmentation
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

One of the public type files has been updated, plase make sure there are no backwards incompatible changes done in the PR.

@xyos

xyos commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

Ran a multi-agent adversarial review of the branch (4 review dimensions — correctness, consumer regressions, performance, edge cases — each finding verified by re-running executable repros against both this branch and the pre-grapheme baseline 6138c2d). Two new commits address everything confirmed.

Confirmed and fixed (5dc32fb, 1d8fb13)

Performance (measured, worst-case scenarios):

  • $getStringScreenWidth materialized boundaries for the whole line even when maxScreenColumn bounded the walk — up to ~10,000× slower on a 54k-code-unit emoji line during mousemove/selection drag (3.7ms per call). Now iterates the segmenter lazily and keeps the early exit at O(maxScreenColumn): 0.03ms per call.
  • documentToScreenPosition on long clustered lines: 8.1ms → 0.2ms per call (memoized boundary segmentation for repeated content-equal prefixes).
  • $renderToken re-segmented the token prefix per tab (quadratic) — a 9.6k-unit emoji TSV line took 356ms to render once. Now segments once with a monotonic cursor: ~3ms.
  • Plain-ASCII documents are unaffected in all scenarios (the mayContainGraphemeClusters gate costs ~20ns/call; ASCII macro benchmarks at parity with the base branch).

Correctness:

  • The cluster-detection gate missed cluster-forming characters outside \p{M}: Thai/Lao AM (U+0E33/U+0EB3), Arabic prepend signs (U+0600–0605 etc.), Malayalam dot reph, ZWNJ, halfwidth katakana voicing marks. Fast paths counted code units while other paths counted clusters — two different column systems for the same line (click-to-cursor off-by-one after e.g. กำ). Fixed and validated with an exhaustive BMP scan against Intl.Segmenter behavior.
  • Tokenizers can split inside a cluster (keycap 1️⃣ = digit + FE0F + 20E3 splits after the ASCII digit), desyncing rendered spans from whole-line segmentation. The text layer now aligns token boundaries to cluster boundaries before rendering.

Verified as sound by the review (with executable repros)

  • $getStringScreenWidth early-exit semantics match the old per-code-unit loop's contract.
  • Fold + wrap + wide fold placeholders: wrap splits always land on cluster boundaries of the display line; exhaustive screen↔doc round-trip scans found zero mid-surrogate positions.
  • Nasty-string corpus (lone surrogates, ZWJ-only lines, defective clusters, mark-first lines): no crashes, no infinite loops, sum(docLengths) === line.length throughout; 128k-case wrap-split fuzz found zero mid-cluster splits.
  • Cache/wrap-recompute consistency after emoji edits.

Known trade-off (documented in code, feedback welcome)

Rendering a line-start RLE (U+202B) as plain text when rtl/rtlText is enabled means a leading RLE smuggled onto an LTR line is not flagged with the red invalid dot in those modes (it still is in default mode). Since the rtl extension right-aligns and visually reverses any RLE-prefixed line, the manipulation remains clearly visible — but if maintainers prefer, the plain-text rendering could be limited to lines whose content is actually RTL.

Remaining pre-existing gaps (unchanged from base): moveCursorBy bidi branch still uses charWidths; block-selection with mixed-width lines untested beyond the harness.

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

One of the public type files has been updated, plase make sure there are no backwards incompatible changes done in the PR.

@xyos

xyos commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

Note: the last commit here (3d5b919, crash restoring a session serialized without undo history) is an independent bug that also reproduces on master — split it out as #5989 with a regression test so it can land separately. If #5989 merges first this branch will rebase cleanly.

@xyos xyos force-pushed the varchar3-graphemes branch from 3d5b919 to 1d8fb13 Compare July 4, 2026 21:05
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

One of the public type files has been updated, plase make sure there are no backwards incompatible changes done in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Emojis consisting of surrogate pairs are displayed incorrectly ace.js: placing a red dot when clicking "enter"

3 participants