
fix(utf32): reassemble split codepoint from overflow buffer, not source index #393

Open
spokodev wants to merge 1 commit into pillarjs:master from spokodev:fix-utf32-streaming-overflow

@spokodev

Problem

When decoding UTF-32 from a stream, any 4-byte unit that straddles a chunk boundary is corrupted:

const d = iconv.getDecoder('utf-32le');
d.write(Buffer.from([0x41, 0x00, 0x00])); // 'A' (U+0041), first 3 bytes
d.write(Buffer.from([0x00]));             // last byte
// → "\u0000"   (expected "A")

In big-endian the same split yields a byte-shifted character instead of A. Whole-buffer iconv.decode(buf, 'utf-32le') is unaffected — only the streaming / decodeStream path, where chunk boundaries are arbitrary (sockets, files), hits this.

Cause

encodings/utf32.js fills this.overflow to four bytes, then reassembles the codepoint using the source index i:

codepoint = overflow[i] | (overflow[i + 1] << 8) | (overflow[i + 2] << 16) | (overflow[i + 3] << 24)

overflow only holds indices 0–3, but after the fill loop i is the offset into src, so at least overflow[i + 3] falls out of range whenever i > 0. Each out-of-range read yields undefined, which the bitwise operators coerce to 0. The code comment notes this block was copied from the main loop (which correctly uses src[i]); the index was just never adjusted for the overflow buffer.
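The effect of the stale index can be reproduced in isolation. The snippet below is a minimal stand-alone sketch, not the code in encodings/utf32.js: `readAt` is a hypothetical helper that mirrors the two bit-assembly expressions, and `i = 1` models the 3 + 1 byte split from the Problem section (one byte consumed from the new chunk).

```javascript
// Hypothetical helper mirroring the reassembly expressions; `start` plays the role of `i`.
function readAt(overflow, start, isLE) {
  return isLE
    ? overflow[start] | (overflow[start + 1] << 8) | (overflow[start + 2] << 16) | (overflow[start + 3] << 24)
    : overflow[start + 3] | (overflow[start + 2] << 8) | (overflow[start + 1] << 16) | (overflow[start] << 24);
}

const le = [0x41, 0x00, 0x00, 0x00]; // 'A' as UTF-32LE, fully reassembled in overflow
const be = [0x00, 0x00, 0x00, 0x41]; // 'A' as UTF-32BE
const i = 1;                         // source index after taking 1 byte from the new chunk

console.log(le[i + 3]);                        // undefined (index 4 is past the end)
console.log(readAt(le, i, true).toString(16)); // 0    -> U+0000
console.log(readAt(be, i, false).toString(16)); // 4100 -> byte-shifted character
console.log(readAt(le, 0, true).toString(16)); // 41   -> 'A'
console.log(readAt(be, 0, false).toString(16)); // 41   -> 'A'
```

Reading from index 0 recovers U+0041 in both byte orders, which is what the fix below does.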

Fix

Read the reassembled bytes from overflow[0..3]:

if (isLE) {
  codepoint = overflow[0] | (overflow[1] << 8) | (overflow[2] << 16) | (overflow[3] << 24)
} else {
  codepoint = overflow[3] | (overflow[2] << 8) | (overflow[1] << 16) | (overflow[0] << 24)
}

Verification

  • New tests in test/utf32-test.js decode utf32leBuf / utf32beBuf split at every byte offset; they fail before, pass after.
  • Full suite: 320 passing, 0 failing.
  • Fuzz: streaming decode at random split points vs whole-buffer decode over 120,000 random strings (LE + BE) — 0 mismatches.
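For readers who want to see the shape of the split-at-every-offset check without the test harness, here is a rough sketch. It does not use iconv-lite or the tests in test/utf32-test.js: `makeDecoder`, `encodeUtf32`, and `countSplitMismatches` are hypothetical names, and the toy decoder only models the overflow handling described above, assuming invalid values map to U+FFFD.

```javascript
// Toy UTF-32 streaming decoder written only for this sketch (not iconv-lite's code).
function makeDecoder(isLE) {
  const overflow = []; // up to 3 leftover bytes from the previous chunk
  const readUnit = (b, at) => (isLE
    ? b[at] | (b[at + 1] << 8) | (b[at + 2] << 16) | (b[at + 3] << 24)
    : b[at + 3] | (b[at + 2] << 8) | (b[at + 1] << 16) | (b[at] << 24)) >>> 0;
  // Assumption for the sketch: out-of-range values and surrogates become U+FFFD.
  const toChar = (cp) =>
    cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff) ? '\ufffd' : String.fromCodePoint(cp);
  return {
    write(chunk) {
      let out = '';
      let i = 0;
      if (overflow.length > 0) {
        while (i < chunk.length && overflow.length < 4) overflow.push(chunk[i++]);
        if (overflow.length === 4) {
          out += toChar(readUnit(overflow, 0)); // fixed indices 0..3, not the source index i
          overflow.length = 0;
        }
      }
      for (; i + 4 <= chunk.length; i += 4) out += toChar(readUnit(chunk, i));
      for (; i < chunk.length; i++) overflow.push(chunk[i]); // stash the incomplete tail
      return out;
    },
  };
}

// Encode a string as UTF-32 in the requested byte order.
function encodeUtf32(str, isLE) {
  const cps = Array.from(str, (ch) => ch.codePointAt(0));
  const buf = Buffer.alloc(cps.length * 4);
  cps.forEach((cp, k) => (isLE ? buf.writeUInt32LE(cp, k * 4) : buf.writeUInt32BE(cp, k * 4)));
  return buf;
}

// Split the encoded buffer at every byte offset and compare with the original string.
function countSplitMismatches(str) {
  let mismatches = 0;
  for (const isLE of [true, false]) {
    const buf = encodeUtf32(str, isLE);
    for (let split = 0; split <= buf.length; split++) {
      const d = makeDecoder(isLE);
      const got = d.write(buf.subarray(0, split)) + d.write(buf.subarray(split));
      if (got !== str) mismatches++;
    }
  }
  return mismatches;
}

console.log(countSplitMismatches('A\u00e9\u{1f600}')); // 0
```

Changing `readUnit(overflow, 0)` to `readUnit(overflow, i)` in the sketch reintroduces the mismatches at every split that is not a multiple of four bytes.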

When a 4-byte UTF-32 unit is split across two stream chunks, the decoder
filled `this.overflow` to four bytes and then read it back with the source
index `i` (`overflow[i]`...`overflow[i + 3]`) instead of `overflow[0]`...
`overflow[3]`. Since `overflow` only holds indices 0-3, the read landed
out of range whenever `i > 0`, so every codepoint straddling a chunk
boundary decoded to U+0000 (LE) or a byte-shifted character (BE).

This block was copied from the main loop (which correctly uses `src[i]`);
the index just was not adjusted for the overflow buffer. Whole-buffer
decode was unaffected, which is why existing tests passed.