fix(llmo): encode dataFolder path boundary Helix-safe (LLMO-5859) #2688

Open: rainer-friederich wants to merge 4 commits into main from fix/llmo-5859-datafolder

@rainer-friederich

1. Abstract

generateDataFolder() now encodes the host/path boundary with a Helix-safe zs marker instead of the -- delimiter that Helix rejects, so newly-onboarded path-bearing sites can no longer produce a dataFolder (or resource path) that fails Helix publish.

2. Reasoning

For any baseURL carrying a path, generateDataFolder joined the sanitized host and path segments with a literal double-dash (${host}--${segments.join('--')}). Helix/AEM reserves -- for its ref--repo--owner host convention and returns HTTP 400 for any resource path containing it, on bulk-preview and on Admin API /status. The result was that the Brand Presence live-website publish failed for path-bearing sites (live-widget plus report-noise) — the SharePoint sheets and S3 objects still landed; only the publish path was rejected.

This is the prevention follow-up to LLMO-5770 bucket A. The -- join was introduced deliberately in PR 2315 (LLMO-4186) to disambiguate same-host subpaths (nba.com/kings vs nba.com/lakers, which previously both collapsed to a host-only nba-com); a naive flatten to a single - would re-open that collision, so the fix has to stay both Helix-safe and boundary-preserving.
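To make the trade-off concrete, here is a minimal sketch comparing the old `--` join with a naive single-dash flatten. The `sanitize`, `splitParts`, `joinDoubleDash`, and `joinSingleDash` names are hypothetical stand-ins, and the sanitization rule shown is an assumption; the real rules inside `generateDataFolder` may differ.

```javascript
// Hypothetical sanitizer (an assumption, not the repository's code):
// lowercase, collapse each run of non-alphanumerics to one dash, trim dashes.
function sanitize(part) {
  return part.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Split a baseURL into sanitized host and path parts, dropping empty ones.
function splitParts(baseURL) {
  const url = new URL(baseURL);
  return [url.hostname, ...url.pathname.split('/')].map(sanitize).filter(Boolean);
}

// Old scheme: boundary-preserving, but emits the `--` that Helix rejects.
const joinDoubleDash = (baseURL) => splitParts(baseURL).join('--');

// Naive flatten: Helix-safe, but the `/` boundary is lost.
const joinSingleDash = (baseURL) => splitParts(baseURL).join('-');

const a = 'https://nba.com/us/kings';
const b = 'https://nba.com/us-kings';
console.log(joinDoubleDash(a)); // nba-com--us--kings
console.log(joinSingleDash(a) === joinSingleDash(b)); // true: both become nba-com-us-kings
```

Under these assumptions, neither simple join satisfies both constraints at once, which is why a dedicated boundary marker is needed.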

3. High-level overview of the changes

Before: a path-bearing baseURL derived host--seg1--seg2, which Helix 400s. After: the host and each path segment are joined with zs, a Helix-safe marker for the / boundary, after self-escaping the marker letter in each part (z -> zz). Because a literal z is always doubled, a lone z can only ever introduce the zs token, so the encoding is unambiguous and reversible (zz -> z, zs -> /).
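The encoding can be sketched as follows. This is an illustration of the scheme as described, not the code from `llmo-onboarding.js`: `encodeDataFolder` is a hypothetical name, and the `sanitize` rule is an assumption chosen only to reproduce the folder names quoted in this PR.

```javascript
// Hypothetical sanitizer (an assumption, not the repository's code):
// lowercase, collapse each run of non-alphanumerics to one dash, trim dashes.
function sanitize(part) {
  return part.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Self-escape the marker letter so a lone `z` can only introduce the `zs` token.
function escapeZ(part) {
  return part.replace(/z/g, 'zz');
}

// Join the sanitized, escaped host and path segments with the `zs` marker.
function encodeDataFolder(baseURL) {
  const url = new URL(baseURL);
  return [url.hostname, ...url.pathname.split('/')]
    .map(sanitize)
    .filter(Boolean)
    .map(escapeZ)
    .join('zs');
}

console.log(encodeDataFolder('https://nba.com'));            // nba-com
console.log(encodeDataFolder('https://nba.com/kings'));      // nba-comzskings
console.log(encodeDataFolder('https://nba.com/us/kings'));   // nba-comzsuszskings
console.log(encodeDataFolder('https://zsecurity.com/page')); // zzsecurity-comzspage
```

Escaping happens after sanitization in this sketch, so the `z` doubling is applied to the exact characters that end up in the folder name.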

Behaviour delta:

  • No derived folder name can contain -- any more, so Helix no longer 400s path-bearing sites.
  • The / boundary stays distinguishable from a sanitized . or - character: nba.com/com -> nba-comzscom is distinct from nba.com.com -> nba-com-com, and nba.com/us/kings (nba-comzsuszskings) is distinct from nba.com/us-kings (nba-comzsus-kings). This preserves the LLMO-4186 disambiguation guarantee that the single-dash alternative would have lost.
  • Sanitization is still lossy within a single segment, so segments differing only in punctuation (e.g. us-kings vs us_kings) still collapse to the same folder — an inherent limitation unchanged from the previous scheme.
  • Folder-name shape examples for new onboardings: business.adobe.com/products -> business-adobe-comzsproducts; nba.com/kings -> nba-comzskings; root-domain sites are unchanged (nba.com -> nba-com).

Scope is forward-only. The derivation runs only at onboarding time; existing sites read their stored dataFolder and are never re-derived. A prod scan shows zero dataFolder values containing -- or / (the nine sites that LLMO-5770 found were already remediated, with their SharePoint renames done then), so no site is currently broken and nothing needs renaming for this fix to take effect.

Optional retroactive migration (NOT done here, and not required): the roughly twelve existing path-bearing LLMO sites carry older-scheme names (flat single-dash or host-only, e.g. business-adobe-com-products). They are already Helix-safe and working, so they are left as-is. If folder-name consistency with the new scheme is ever wanted, each such site could be migrated by renaming its SharePoint folder, updating helix-query.yaml, updating the stored config.llmo.dataFolder, and re-previewing/publishing. There is no functional reason to do so; it is purely cosmetic and can be decided separately.

4. Required information

  • Jira / issue: LLMO-5859 (https://jira.corp.adobe.com/browse/LLMO-5859)
  • Other: encoding approach proposed by David Aurelio on the ticket (z-encoding, after facebook/flow); origin context LLMO-5770 and PR 2315 / LLMO-4186.

5. Affected / used mysticat-workspace projects

  • spacecat-api-service (changed): the onboarding dataFolder derivation. The sole writer of config.llmo.dataFolder; no other repo derives it.
  • llmo-data-retrieval-service (consumed, no change needed): reads dataFolder to build SharePoint/Helix paths via opaque string interpolation, and writes into folders onboarding already created. New zs-encoded names flow through transparently; the recently-added post-upload verification would surface any folder-missing upload as a failure rather than silent loss.
  • mystique / mysticat-projector-service: not consumers of dataFolder for path construction (mystique fetches sheets by site_id over the LLMO HTTP API), so no change.

6. Additional information outside the code

  • Prod Postgres (mysticat-data-service) observations used to scope the fix: of 12,551 sites, 333 have a baseURL carrying a path; of 7,559 sites with a dataFolder, zero contain -- or /. The existing path-bearing LLMO sites already hold flat single-dash or host-only names (e.g. business.adobe.com/products -> business-adobe-com-products), confirming the change is forward-only with no migration backlog.
  • Git history confirms the -- join originated in PR 2315 (5f817e27f, "include subpath in generateDataFolder to prevent SharePoint folder collisions", fixes LLMO-4186); before it, the path was dropped and only the hostname was used.

7. Test plan

  • Local: exercised the derivation against concrete URL shapes (root domain, single and nested subpaths, separator-only variants, hostnames containing --, percent-encoded and z-containing segments) and confirmed the output never contains --, the / boundary stays distinct from a sanitized dash, and the marker letter is correctly self-escaped.
  • Verify on an environment after deploy: onboard a path-bearing baseURL (e.g. https://<host>/<segment>) and confirm config.llmo.dataFolder contains no --, the SharePoint folder is created under the encoded name, and the Helix bulk-status / preview / publish for that folder returns 200 rather than 400. Applies to dev and prod; no special data setup beyond a path-bearing test site.
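A local check along the lines of the first bullet might look like the sketch below. The `encodeDataFolder` stand-in and its sanitization rule are assumptions for illustration, the `check` helper is hypothetical, and the URL list is invented sample input rather than the cases used in the actual test suite.

```javascript
// Hypothetical stand-in for the derivation under test; the sanitization rule is assumed.
const sanitize = (part) => part.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
const encodeDataFolder = (baseURL) => {
  const url = new URL(baseURL);
  return [url.hostname, ...url.pathname.split('/')]
    .map(sanitize)
    .filter(Boolean)
    .map((part) => part.replace(/z/g, 'zz'))
    .join('zs');
};

// Tiny assertion helper that throws with a descriptive message.
function check(condition, message) {
  if (!condition) throw new Error(message);
}

const urls = [
  'https://nba.com',
  'https://nba.com/kings',
  'https://nba.com/us/kings',
  'https://my--host.example.com/a',
  'https://a.com/%FF',
  'https://jazz.example.com/pizza',
];

for (const url of urls) {
  const folder = encodeDataFolder(url);
  // Include the URL in the message so a failure names the offending input.
  check(!folder.includes('--'), `contains "--": ${url} -> ${folder}`);
  check(!folder.includes('/'), `contains "/": ${url} -> ${folder}`);
}

check(
  encodeDataFolder('https://nba.com/us/kings') !== encodeDataFolder('https://nba.com/us-kings'),
  'boundary collapsed into a sanitized dash',
);
```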

generateDataFolder joined the host and URL path segments with `--`, which
Helix/AEM reserves for its `ref--repo--owner` host convention and rejects
with HTTP 400 on bulk-preview and Admin API /status -- breaking the Brand
Presence live-website publish for any path-bearing site (the SharePoint
sheets and S3 objects still land; only the publish path is rejected).

Replace the `--` join with a self-escaped `zs` boundary marker (`z` -> `zz`,
`/` -> `zs`): Helix-safe, unambiguous, and reversible. It keeps the `/`
boundary distinguishable from a sanitized `.`/`-`, so subpath sites on the
same host stay distinct (e.g. `/us/kings` != `/us-kings`). Sanitization is
still lossy within a single segment, unchanged from before.

Forward-only: the derivation runs only at onboarding, existing dataFolders
are not re-derived, and prod has no `--` folders remaining (LLMO-5770
already remediated those).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@github-actions

This PR will trigger a patch release when merged.

@MysticatBot MysticatBot left a comment

Hey @rainer-friederich,

Verdict: Approve - clean encoding fix, well-documented, strong test coverage.
Complexity: HIGH - medium diff; API surface (controllers/).
Changes: Replaces the Helix-unsafe -- path-boundary delimiter in generateDataFolder with a z-encoded zs token that self-escapes literal z characters (2 files).

Non-blocking (6): minor issues and suggestions
  • suggestion: The JSDoc claims the encoding is "reversible" but a naive String.split('zs') would produce wrong results for parts ending in escaped z. Note in the comment that decoding requires a left-to-right character scanner, not a simple split - src/controllers/llmo/llmo-onboarding.js:335
  • nit: escapeMarker does not communicate what is being escaped; escapeZ or selfEscapeZ would be clearer to a reader unfamiliar with the scheme - src/controllers/llmo/llmo-onboarding.js:359
  • suggestion: The forEach loop in the "should never emit --" test has Assertion Roulette - if it fails, Chai will not identify which URL triggered it. Add assertion context: expect(..., `prod: ${url}`).to.not.include('--') - test/controllers/llmo/llmo-onboarding.test.js:~440
  • suggestion: Add a test for a hostname starting with zs (e.g. https://zsecurity.com/page -> zzsecurity-comzspage) to document the boundary-vs-escaped-z edge case explicitly - test/controllers/llmo/llmo-onboarding.test.js
  • suggestion: The malformed percent-encoded test (%FF) only asserts non-throw. Add an assertion on the actual return value to catch behavioral regressions - test/controllers/llmo/llmo-onboarding.test.js:468
  • suggestion: The offboarding fallback path (which re-derives folder name when dataFolder is missing from config) will now produce different output for z-containing hostnames. The PR description confirms all existing sites have stored values (prod scan), but a defensive comment or log.warn at the fallback site would protect future maintainers - src/controllers/llmo/llmo-onboarding.js:~1695
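The first suggestion can be illustrated with a small decoder. This is a sketch under the encoding rules stated in the PR description (`zz` -> `z`, `zs` -> `/`); `decodeDataFolder` is a hypothetical name and no such function is known to exist in the repository. It recovers the sanitized parts only, since sanitization itself is lossy.

```javascript
// Decode left to right: after a `z`, the next character decides the meaning.
function decodeDataFolder(encoded) {
  const parts = [''];
  for (let i = 0; i < encoded.length; i += 1) {
    const ch = encoded[i];
    if (ch !== 'z') {
      parts[parts.length - 1] += ch;
      continue;
    }
    const next = encoded[i + 1];
    if (next === 'z') {
      parts[parts.length - 1] += 'z'; // `zz` is an escaped literal z
    } else if (next === 's') {
      parts.push(''); // `zs` is the `/` boundary
    } else {
      throw new Error(`Invalid encoding: lone "z" at index ${i}`);
    }
    i += 1; // skip the second character of the two-character token
  }
  return parts;
}

// The scanner recovers the parts; a naive split cuts inside the escaped z.
const scanned = decodeDataFolder('zzsecurity-comzspage'); // ['zsecurity-com', 'page']
const naive = 'zzsecurity-comzspage'.split('zs');         // ['z', 'ecurity-com', 'page']
console.log(scanned, naive);
```

The naive split fails here because the `zs` it finds at index 1 is really the tail of `zz` followed by a literal `s`.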

Skill: pr-review | Model: us.anthropic.claude-opus-4-6-v1[1m] | Duration: 2m 28s | Cost: $4.80 | Commit: fd092651030bb5bda44441873ec86e2312924e07

MysticatBot added the ai-reviewed (Reviewed by AI) and complexity:high (AI-assessed PR complexity: HIGH) labels on Jun 25, 2026
- docstring: note decoding needs a left-to-right scanner, not split('zs')
- rename escapeMarker -> escapeZ for clarity
- test: add assertion context to the never-emit-`--` loop
- test: add host-starting-with-`z` edge case (zsecurity.com)
- test: assert the concrete return value for the malformed `%FF` segment
- offboarding fallback: defensive comment on the older-scheme mismatch risk

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rainer-friederich

Addressed all six non-blocking suggestions in 5894b6169 (also merged latest main).

MysticatBot (all non-blocking):

  • suggestion (reversibility vs naive split) — done. The JSDoc now states decoding requires a left-to-right scanner and that a naive String.split('zs') is NOT a correct decoder, since a segment containing zs is stored as zzs.
  • nit (escapeMarker naming) — done. Renamed escapeMarker -> escapeZ.
  • suggestion (Assertion Roulette in the loop) — done. Both assertions in the never-emit-`--` test now carry a `prod: ${url}` / `dev: ${url}` context message.
  • suggestion (hostname starting with zs) — done. Added https://zsecurity.com/page -> zzsecurity-comzspage to the self-escape test, documenting the boundary-vs-escaped-z edge.
  • suggestion (%FF return value) — done. The malformed-percent test now also asserts the concrete result (a-comzsff), not just non-throw.
  • suggestion (offboarding fallback) — done as a defensive comment rather than a log change: the fallback site now documents that re-deriving assumes the current scheme and may not match a folder created under an older one, and notes onboarded sites always persist dataFolder (prod scan confirmed zero affected). Kept the existing log.debug as-is because an existing test asserts that exact message; the comment satisfies the protective intent without a behavioral change.
