fix(llmo): encode dataFolder path boundary Helix-safe (LLMO-5859)#2688
fix(llmo): encode dataFolder path boundary Helix-safe (LLMO-5859)#2688rainer-friederich wants to merge 4 commits into
Conversation
generateDataFolder joined the host and URL path segments with `--`, which Helix/AEM reserves for its `ref--repo--owner` host convention and rejects with HTTP 400 on bulk-preview and Admin API /status -- breaking the Brand Presence live-website publish for any path-bearing site (the SharePoint sheets and S3 objects still land; only the publish path is rejected). Replace the `--` join with a self-escaped `zs` boundary marker (`z` -> `zz`, `/` -> `zs`): Helix-safe, unambiguous, and reversible. It keeps the `/` boundary distinguishable from a sanitized `.`/`-`, so subpath sites on the same host stay distinct (e.g. `/us/kings` != `/us-kings`). Sanitization is still lossy within a single segment, unchanged from before. Forward-only: the derivation runs only at onboarding, existing dataFolders are not re-derived, and prod has no `--` folders remaining (LLMO-5770 already remediated those). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
This PR will trigger a patch release when merged. |
There was a problem hiding this comment.
Hey @rainer-friederich,
Verdict: Approve - clean encoding fix, well-documented, strong test coverage.
Complexity: HIGH - medium diff; API surface (controllers/).
Changes: Replaces the Helix-unsafe -- path-boundary delimiter in generateDataFolder with a z-encoded zs token that self-escapes literal z characters (2 files).
Non-blocking (6): minor issues and suggestions
- suggestion: The JSDoc claims the encoding is "reversible" but a naive
String.split('zs')would produce wrong results for parts ending in escapedz. Note in the comment that decoding requires a left-to-right character scanner, not a simple split -src/controllers/llmo/llmo-onboarding.js:335 - nit:
escapeMarkerdoes not communicate what is being escaped;escapeZorselfEscapeZwould be clearer to a reader unfamiliar with the scheme -src/controllers/llmo/llmo-onboarding.js:359 - suggestion: The
forEachloop in the "should never emit--" test has Assertion Roulette - if it fails, Chai will not identify which URL triggered it. Add assertion context:expect(..., \prod: ${url}`).to.not.include('--')-test/controllers/llmo/llmo-onboarding.test.js:~440` - suggestion: Add a test for a hostname starting with
zs(e.g.https://zsecurity.com/page->zzsecurity-comzspage) to document the boundary-vs-escaped-z edge case explicitly -test/controllers/llmo/llmo-onboarding.test.js - suggestion: The malformed percent-encoded test (
%FF) only asserts non-throw. Add an assertion on the actual return value to catch behavioral regressions -test/controllers/llmo/llmo-onboarding.test.js:468 - suggestion: The offboarding fallback path (which re-derives folder name when
dataFolderis missing from config) will now produce different output for z-containing hostnames. The PR description confirms all existing sites have stored values (prod scan), but a defensive comment orlog.warnat the fallback site would protect future maintainers -src/controllers/llmo/llmo-onboarding.js:~1695
Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 2m 28s | Cost: $4.80 | Commit: fd092651030bb5bda44441873ec86e2312924e07
If this code review was useful, please react with 👍. Otherwise, react with 👎.
- docstring: note decoding needs a left-to-right scanner, not split('zs')
- rename escapeMarker -> escapeZ for clarity
- test: add assertion context to the never-emit-`--` loop
- test: add host-starting-with-`z` edge case (zsecurity.com)
- test: assert the concrete return value for the malformed `%FF` segment
- offboarding fallback: defensive comment on the older-scheme mismatch risk
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed all six non-blocking suggestions in MysticatBot (all non-blocking):
|
1. Abstract
generateDataFolder()now encodes the host/path boundary with a Helix-safezsmarker instead of the--delimiter that Helix rejects, so newly-onboarded path-bearing sites can no longer produce adataFolder(or resource path) that fails Helix publish.2. Reasoning
For any baseURL carrying a path,
generateDataFolderjoined the sanitized host and path segments with a literal double-dash (${host}--${segments.join('--')}). Helix/AEM reserves--for itsref--repo--ownerhost convention and returns HTTP 400 for any resource path containing it, on bulk-preview and on Admin API/status. The result was that the Brand Presence live-website publish failed for path-bearing sites (live-widget plus report-noise) — the SharePoint sheets and S3 objects still landed; only the publish path was rejected.This is the prevention follow-up to LLMO-5770 bucket A. The
--join was introduced deliberately in PR 2315 (LLMO-4186) to disambiguate same-host subpaths (nba.com/kingsvsnba.com/lakers, which previously both collapsed to a host-onlynba-com); a naive flatten to a single-would re-open that collision, so the fix has to stay both Helix-safe and boundary-preserving.3. High-level overview of the changes
Before: a path-bearing baseURL derived
host--seg1--seg2, which Helix 400s. After: the host and each path segment are joined withzs, a Helix-safe marker for the/boundary, after self-escaping the marker letter in each part (z->zz). Because a literalzis always doubled, a lonezcan only ever introduce thezstoken, so the encoding is unambiguous and reversible (zz->z,zs->/).Behaviour delta:
--any more, so Helix no longer 400s path-bearing sites./boundary stays distinguishable from a sanitized./-:nba.com/com->nba-comzscomis distinct fromnba.com.com->nba-com-com, andnba.com/us/kings(nba-comzsuszskings) is distinct fromnba.com/us-kings(nba-comzsus-kings). This preserves the LLMO-4186 disambiguation guarantee that the single-dash alternative would have lost.us-kingsvsus_kings) still collapse to the same folder — an inherent limitation unchanged from the previous scheme.business.adobe.com/products->business-adobe-comzsproducts;nba.com/kings->nba-comzskings; root-domain sites are unchanged (nba.com->nba-com).Scope is forward-only. The derivation runs only at onboarding time; existing sites read their stored
dataFolderand are never re-derived. A prod scan shows zerodataFoldervalues containing--or/(the nine sites that LLMO-5770 found were already remediated, with their SharePoint renames done then), so no site is currently broken and nothing needs renaming for this fix to take effect.Optional retroactive migration (NOT done here, and not required): the roughly twelve existing path-bearing LLMO sites carry older-scheme names (flat single-dash or host-only, e.g.
business-adobe-com-products). They are already Helix-safe and working, so they are left as-is. If folder-name consistency with the new scheme is ever wanted, each such site could be migrated by renaming its SharePoint folder, updatinghelix-query.yaml, updating the storedconfig.llmo.dataFolder, and re-previewing/publishing. There is no functional reason to do so; it is purely cosmetic and can be decided separately.4. Required information
5. Affected / used mysticat-workspace projects
dataFolderderivation. The sole writer ofconfig.llmo.dataFolder; no other repo derives it.dataFolderto build SharePoint/Helix paths via opaque string interpolation, and writes into folders onboarding already created. Newzs-encoded names flow through transparently; the recently-added post-upload verification would surface any folder-missing upload as a failure rather than silent loss.dataFolderfor path construction (mystique fetches sheets bysite_idover the LLMO HTTP API), so no change.6. Additional information outside the code
dataFolder, zero contain--or/. The existing path-bearing LLMO sites already hold flat single-dash or host-only names (e.g.business.adobe.com/products->business-adobe-com-products), confirming the change is forward-only with no migration backlog.--join originated in PR 2315 (5f817e27f, "include subpath in generateDataFolder to prevent SharePoint folder collisions", fixes LLMO-4186); before it, the path was dropped and only the hostname was used.7. Test plan
--, percent-encoded andz-containing segments) and confirmed the output never contains--, the/boundary stays distinct from a sanitized dash, and the marker letter is correctly self-escaped.https://<host>/<segment>) and confirmconfig.llmo.dataFoldercontains no--, the SharePoint folder is created under the encoded name, and the Helix bulk-status / preview / publish for that folder returns 200 rather than 400. Applies to dev and prod; no special data setup beyond a path-bearing test site.