88 changes: 88 additions & 0 deletions docs/decisions/006-sites-base-url-search.md
@@ -0,0 +1,88 @@
# ADR-006: Substring Base-URL Search on `GET /sites`

## Status
Accepted

## Context
The Experience Success Studio back-office UI ("Backoffice") Sites page let operators find a site
to manage. Its only mechanism was to **load every site** (`GET /sites`, cursor-paginated at 500/page)
into the browser and filter client-side. With ~18k sites that meant ~36 **sequential** cursor
requests (each page's cursor is only known after the previous response resolves) — a 15–22s blank
load before the table was usable. See SITES-47203.

`GET /sites` had no server-side search: only cursor pagination, an exact `GET /sites/by-base-url/:baseURL`
lookup, and an exact `GET /sites/:siteId` lookup. So "find the site whose URL contains `icici`"
forced the full client-side bulk load.

Two facts shaped the decision:

- **Cursor pagination cannot be parallelized.** You cannot issue page _N+1_ without page _N_'s
cursor, and `GET /sites` exposes neither offset nor a total count. So the sequential walk is
inherent — the fix is to *not load everything*, not to page faster.
- **The data layer migrated from DynamoDB to PostgreSQL (via PostgREST).** `spacecat-shared-data-access`
now backs `Site` with PostgREST, and its collection query API supports `ilike`/`like`/`contains`
filters and offset-based pagination (through an offset-encoded cursor). On DynamoDB a substring
search would have been a full-partition scan (an anti-pattern); on Postgres `ILIKE '%…%'` is an
ordinary query that stays cheap at this table size (see Consequences).
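
The sequential dependency in the first bullet can be illustrated with a small sketch. It is not the Backoffice client: `fetchPage` is a hypothetical in-memory stand-in for `GET /sites`, and the totals are the approximate figures quoted in this Context section.

```javascript
// Illustrative only: `fetchPage` is a hypothetical in-memory stand-in for
// `GET /sites`, not the real client. Numbers come from the Context above.
const TOTAL_SITES = 18000;
const PAGE_SIZE = 500;

// Returns one page plus the cursor for the next page. The cursor is treated
// as opaque by the caller; here it simply encodes the next offset.
async function fetchPage(cursor) {
  const offset = cursor === null ? 0 : Number(cursor);
  const count = Math.min(PAGE_SIZE, TOTAL_SITES - offset);
  const sites = Array.from({ length: count }, (_, i) => ({ id: offset + i }));
  const next = offset + count;
  const hasMore = next < TOTAL_SITES;
  return { sites, pagination: { hasMore, cursor: hasMore ? String(next) : null } };
}

// Each request needs the cursor from the previous response, so the awaits
// cannot be issued concurrently.
async function walkAllSites() {
  let cursor = null;
  let hasMore = true;
  let requests = 0;
  let total = 0;
  while (hasMore) {
    const page = await fetchPage(cursor);
    requests += 1;
    total += page.sites.length;
    ({ hasMore, cursor } = page.pagination);
  }
  return { requests, total };
}
```

Under these assumptions `walkAllSites()` resolves only after 36 awaited round trips (18,000 / 500), which is the request count described above.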

## Decision
Add an optional **`baseUrlContains`** query parameter to `GET /sites`:

`GET /sites?baseUrlContains=<substring>&limit=<N>&offset=<M>`

- Maps to `Site.all({}, { where: (attr, op) => op.ilike(attr.baseURL, '%<escaped>%'), limit: N+1, cursor: <offset-cursor>, order: 'asc' })`.
The data-access `where` builder passes `(attrs, op)`: `attrs` maps model fields to DB columns and
`op` carries the operators. No `spacecat-shared-data-access` change was required — the `ilike`
operator already exists. (`order: 'asc'` sorts by the index's order fields with the primary key as a
deterministic tiebreaker — see `base.collection`'s `#getOrderFields`.) The data-access layer exposes
no public `offset` option — it paginates via an opaque, offset-encoded cursor (`postgrest.utils`
`encodeCursor`, which is not exported). The controller therefore builds the same
`base64(JSON.stringify({ offset }))` cursor inline to reach the requested offset; if a direct
`offset` option is added upstream, switch to it.
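- **Validation:** `baseUrlContains` must be 3 to 256 characters after trimming (otherwise 400). LIKE
special characters (`%`, `_`, `\`) in user input are escaped so callers cannot inject wildcards.
`offset` defaults to 0 and must be a non-negative integer (otherwise 400). Combining `cursor` with
`baseUrlContains` is rejected with 400, because this path paginates by `offset`.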
- **Top-N + "more exists":** `limit` defaults to 50, capped at `MAX_LIMIT` (500). We fetch `N+1` rows
at the requested `offset` and trim to `N`; the extra row drives `pagination.hasMore`, which the UI
surfaces as a "refine your search" hint / next-page affordance. Response shape:
`{ sites: [...], pagination: { limit, offset, hasMore, baseUrlContains } }`
— the `baseUrlContains` echo is the deploy-ordering discriminator (see Consequences).
- **Authorization is unchanged** — the new branch runs after the existing admin / S2S `site:readAll`
check. Non-admin (org-scoped) callers continue to receive `403` on `GET /sites`; the Backoffice
client falls back to the org-scoped sites endpoint (a small, bounded set) and filters it
client-side. The complex org/delegated-sites endpoint was intentionally left untouched.
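
The escaping, offset cursor, and `N+1` trimming described in the bullets above can be sketched as follows. This is a minimal sketch under stated assumptions, not the controller itself: `escapeLike`, `encodeOffsetCursor`, `buildSearchOptions`, and `toPage` are hypothetical helper names, and the cursor shape is the one this ADR describes rather than a documented public API.

```javascript
const SEARCH_DEFAULT_LIMIT = 50;
const MAX_LIMIT = 500;

// Escape LIKE special characters (backslash, %, _) so they match literally.
function escapeLike(input) {
  return input.replace(/([\\%_])/g, '\\$1');
}

// Cursor shape assumed from this ADR: base64(JSON.stringify({ offset })).
function encodeOffsetCursor(offset) {
  return Buffer.from(JSON.stringify({ offset }), 'utf-8').toString('base64');
}

// Hypothetical helper: turn already-validated inputs into query options.
function buildSearchOptions(query, limit = SEARCH_DEFAULT_LIMIT, offset = 0) {
  const effectiveLimit = Math.min(limit, MAX_LIMIT);
  return {
    pattern: `%${escapeLike(query.trim())}%`,
    limit: effectiveLimit + 1, // one extra row to detect a further page
    cursor: encodeOffsetCursor(offset),
    effectiveLimit,
  };
}

// Trim the N+1 rows back to N and derive hasMore from the surplus row.
function toPage(rows, effectiveLimit) {
  return {
    sites: rows.slice(0, effectiveLimit),
    hasMore: rows.length > effectiveLimit,
  };
}
```

For example, `encodeOffsetCursor(100)` yields `eyJvZmZzZXQiOjEwMH0=`, and a request for `limit=1000` is clamped so that 501 rows are fetched and at most 500 are returned.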

## Alternatives considered

- **Client-side progressive rendering** (render pages as they stream in). Rejected: it only traded
the blank spinner for ~15s of a churning, re-sorting table, and never addressed the root cause —
shipping ~18k rows to the browser. (This was an earlier PR, since abandoned.)
- **Parallel page fetching.** Impossible: cursor pagination has no offset/total, so pages must be
sequential. Even hypothetically, ~36 concurrent 500-row reads carry 429 / DB-load risk for
negligible benefit.
- **Prefix-only search (`begins_with`).** This was the *DynamoDB-idiomatic* option (efficient on the
`baseURL` sort key). It is moot now that the backend is Postgres, and substring is the better UX
(matches anywhere, so the stored `https://`/`www.` prefix doesn't get in the way).
- **Dedicated search index (OpenSearch).** Correct for large-scale fuzzy/multi-field search, but
heavy infrastructure and unjustified for an internal tool at this scale.

## Consequences
- The Backoffice **Sites page** drops the bulk-load (and its two rarely-used dropdown filters): it now
searches by base-URL substring or looks a site up by exact ID. See OneAdobe/experience-success-studio-backoffice#332.
(The legacy `getSites` bulk walk still backs `LLMOptimizerData.js` — eliminating that is tracked as a
separate follow-up; this ADR does not address it.)
- **Deploy ordering.** The Backoffice client always sends `limit`, so an *older* API deployment would
ignore `baseUrlContains` and return unfiltered cursor results. To avoid silent wrong results, the search
response echoes `pagination.baseUrlContains`; the client treats a missing/mismatched echo as "search
unsupported" and surfaces an error. Deploy the API before (or with) the Backoffice change.
- **No trigram index yet.** `base_url` has a UNIQUE btree but no `pg_trgm` GIN index, so a leading-wildcard
`ILIKE '%…%'` is a sequential scan. At ~18k small rows this is single-digit-ms in Postgres and only
matches cross the wire, so it is acceptable for now. **Deferred follow-up:** add
`CREATE EXTENSION pg_trgm` + a GIN trigram index on `sites.base_url` (owned by `mysticat-data-service`)
if/when table growth or latency warrants index-accelerated substring search.
- The contract is additive and backward-compatible: existing cursor-paginated and legacy flat-array
behavior of `GET /sites` is unchanged.
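
The deploy-ordering check described above might look like this on the client side. It is a sketch under stated assumptions: `buildSearchUrl` and `searchWasApplied` are hypothetical names, the host in the usage note is a placeholder, and the actual Backoffice implementation may differ.

```javascript
// Hypothetical client helper; `apiBase` is supplied by the caller.
function buildSearchUrl(apiBase, query, limit = 50, offset = 0) {
  const url = new URL('/sites', apiBase);
  url.searchParams.set('baseUrlContains', query.trim());
  url.searchParams.set('limit', String(limit));
  url.searchParams.set('offset', String(offset));
  return url.toString();
}

// True only when the server echoed the trimmed query back. An older deployment
// returns the cursor envelope (or the legacy array) without the echo, so the
// caller should treat `false` as "search unsupported" and surface an error.
function searchWasApplied(body, query) {
  return body?.pagination?.baseUrlContains === query.trim();
}
```

With a placeholder host, `buildSearchUrl('https://backoffice-api.example.test', ' icici ')` produces `https://backoffice-api.example.test/sites?baseUrlContains=icici&limit=50&offset=0`; a response whose `pagination` lacks `baseUrlContains` makes `searchWasApplied` return `false`.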

## References
- SITES-47203
- API change: this PR (adobe/spacecat-api-service)
- Backoffice consumer: OneAdobe/experience-success-studio-backoffice#332
43 changes: 43 additions & 0 deletions docs/openapi/schemas.yaml
@@ -833,6 +833,49 @@ SitePagedResponse:
  required:
    - sites
    - pagination
SiteSearchResponse:
  type: object
  description: |
    Response for the `baseUrlContains` substring search on `GET /sites`. Unlike
    `SitePagedResponse`, this path is not cursor-iterable: `pagination` carries no
    `cursor`, paginates by `offset`, and echoes the trimmed `baseUrlContains` query so a
    client can confirm the search was actually applied (an older deployment that ignores
    `baseUrlContains` but honors `limit` would return `SitePagedResponse` with no echo).
  properties:
    sites:
      type: array
      items:
        $ref: './schemas.yaml#/SiteListItem'
    pagination:
      type: object
      properties:
        limit:
          description: The maximum number of items returned (default 50, clamped to 500)
          type: integer
          example: 50
        offset:
          description: The zero-based offset into the search results for this page (default 0)
          type: integer
          example: 0
        hasMore:
          description: Indicates whether more matching sites exist beyond this result
          type: boolean
          example: false
        baseUrlContains:
          description: |
            The trimmed `baseUrlContains` query that was applied (3-256 chars). Echoed so
            clients can confirm the search ran; absent on deployments that ignore the
            param.
          type: string
          example: "adobe"
      required:
        - limit
        - offset
        - hasMore
        - baseUrlContains
  required:
    - sites
    - pagination
SiteWithLatestAuditList:
  type: array
  items:
57 changes: 51 additions & 6 deletions docs/openapi/sites-api.yaml
@@ -130,6 +130,12 @@ sites:
- **Paginated (with `limit` and/or `cursor`):** returns `{ sites, pagination }` with
the full unfiltered result set. Iterate using `pagination.cursor` until
`pagination.hasMore` is `false` to fetch all sites.
- **Substring search (with `baseUrlContains`):** returns `{ sites, pagination }` where
`sites` are those whose `baseURL` contains the (case-insensitive) substring.
`pagination` is `{ limit, offset, hasMore, baseUrlContains }` (no `cursor`; this
branch paginates by `offset`). Defaults to a limit of 50 (clamped to 500) and an
offset of 0. The search term must be at least 3 characters after trimming
(otherwise 400).

Required capabilities: admin access (legacy admin path) or S2S `site:readAll`.
operationId: getSites
@@ -160,6 +166,39 @@ sites:
type: string
maxLength: 256
example: "eyJvZmZzZXQiOjEwMH0="
- name: baseUrlContains
in: query
required: false
description: |
Case-insensitive substring to match against each site's `baseURL`. When
provided, the response uses a non-cursor paginated envelope
(`{ sites, pagination: { limit, offset, hasMore, baseUrlContains } }`), where
`pagination.baseUrlContains` echoes the trimmed query so a client can confirm
the search was applied (an older deployment that ignores this param would
omit the echo). Length is enforced **after trimming**: must be at least 3
and at most 256 characters (otherwise 400). LIKE wildcards (`%`, `_`, `\`)
in the value are escaped and treated literally. The `limit` parameter
(default 50, clamped to 500) bounds the page size and `offset` (default 0)
selects the page. `cursor` is not supported on this path: sending it together
with `baseUrlContains` returns 400.
schema:
type: string
# Bounds are enforced server-side AFTER trimming surrounding whitespace, so
# minLength/maxLength here are documentation of the effective post-trim limits.
minLength: 3
maxLength: 256
example: "adobe"
- name: offset
in: query
required: false
description: |
Zero-based offset into the `baseUrlContains` search results, used to page
through matches (e.g. `offset=2&limit=2` returns the second page of 2).
Only honored on the `baseUrlContains` search path. Defaults to 0. Negative
or non-integer values return 400.
schema:
type: integer
minimum: 0
example: 0
responses:
'200':
description: |
@@ -168,13 +207,18 @@ sites:
application/json:
schema:
description: |
Two response shapes are possible, selected by the request and
unambiguous by JSON type, so no `discriminator` is used (nor can one
be expressed — the legacy branch is a top-level array, which has no
Three response shapes are possible, selected by the request and
unambiguous by JSON type/shape, so no `discriminator` is used (nor can
one be expressed — the legacy branch is a top-level array, which has no
property to discriminate on):
- When `limit` and/or `cursor` is provided → **`SitePagedResponse`**,
a JSON object with `sites` and `pagination`.
- When neither is provided → **`SiteList`**, the legacy top-level
- When `baseUrlContains` is provided → **`SiteSearchResponse`**, a JSON
object with `sites` and a non-cursor `pagination`
(`{ limit, offset, hasMore, baseUrlContains }`). The `baseUrlContains`
echo lets a client confirm the search ran on the deployment it hit.
- When `limit` and/or `cursor` is provided (without `baseUrlContains`) →
**`SitePagedResponse`**, a JSON object with `sites` and a
cursor-based `pagination`.
- When none are provided → **`SiteList`**, the legacy top-level
JSON array.

This resolves to a single shape once the legacy path is sunset.
@@ -186,6 +230,7 @@ sites:
- $ref: './schemas.yaml#/SiteList'
deprecated: true
- $ref: './schemas.yaml#/SitePagedResponse'
- $ref: './schemas.yaml#/SiteSearchResponse'
'400':
$ref: './responses.yaml#/400'
'401':
96 changes: 96 additions & 0 deletions src/controllers/sites.js
@@ -139,6 +139,7 @@ const MONTH_DAYS = 30;
const TOTAL_METRICS = 'totalMetrics';
const BRAND_PROFILE_AGENT_ID = 'brand-profile';
const DEFAULT_LIMIT = 100;
const SEARCH_DEFAULT_LIMIT = 50;
const MAX_LIMIT = 500;

/**
@@ -430,6 +431,13 @@ function SitesController(ctx, log, env) {
* Gets all sites with cursor-based pagination. Accessible to admin callers (legacy admin path)
* and to S2S consumers that hold the `site:readAll` capability - see
* `docs/s2s/READALL_CAPABILITY_DESIGN.md`.
*
* Optional `baseUrlContains` query param: when provided (3-256 chars after trim),
* performs a case-insensitive substring search on `baseURL` and returns a non-cursor
* `{ sites, pagination: { limit, offset, hasMore, baseUrlContains } }` response. The
* trimmed query is echoed back in `pagination.baseUrlContains` so a client can confirm
* its search was applied even if it hits an older deployment that ignores the param.
* LIKE wildcards in the input are escaped so callers cannot inject their own wildcards.
* @returns {Promise<Response>} Paginated sites response
*/
const getAll = async (context) => {
@@ -447,6 +455,94 @@ function SitesController(ctx, log, env) {

    const limitParam = context?.data?.limit;
    const cursor = context?.data?.cursor || null;

    // Optional substring search by base URL. Runs after the authz check (so
    // unauthorized callers still get 403) and before the cursor/legacy branches.
    const baseUrlContains = context?.data?.baseUrlContains;
    if (hasText(baseUrlContains) && hasText(cursor)) {
      // The public search path paginates via offset, not the client cursor;
      // accepting both would silently discard the cursor and mislead the client
      // into thinking cursor pagination is active. Reject the combination explicitly.
      return badRequest('cursor is not supported with baseUrlContains; use offset');
    }
    if (hasText(baseUrlContains)) {
      const q = baseUrlContains.trim();
      if (q.length < 3) {
        return badRequest('baseUrlContains must be at least 3 characters');
      }
      if (q.length > 256) {
        return badRequest('baseUrlContains exceeds maximum length');
      }

      const parsedLimit = hasText(limitParam) ? parseInt(limitParam, 10) : SEARCH_DEFAULT_LIMIT;
      if (!Number.isInteger(parsedLimit) || parsedLimit <= 0) {
        return badRequest('limit must be a positive integer');
      }
      const effectiveLimit = Math.min(parsedLimit, MAX_LIMIT);

      const offsetParam = context?.data?.offset;
      const offset = hasText(offsetParam) ? parseInt(offsetParam, 10) : 0;
      if (!Number.isInteger(offset) || offset < 0) {
        return badRequest('offset must be a non-negative integer');
      }

      // Escape LIKE special chars so user input cannot inject its own wildcards.
      const escaped = q.replace(/([\\%_])/g, '\\$1');

      // The data-access layer paginates by an offset-encoded cursor (postgrest.utils
      // encodeCursor); it exposes no public `offset` option, so we build the same
      // shape here. If a direct offset option is ever added upstream, switch to it.
      const offsetCursor = Buffer.from(JSON.stringify({ offset }), 'utf-8').toString('base64');

      // Fetch one extra row to detect whether more results exist beyond the limit.
      // The data-access `where` builder passes (attrs, op): `attrs` maps model
      // fields to DB columns, `op` carries the operators. (NOT `s => s.ilike(...)`.)
      let rows;
      try {
        rows = await Site.all({}, {
          where: (attr, op) => op.ilike(attr.baseURL, `%${escaped}%`),
          limit: effectiveLimit + 1,
          cursor: offsetCursor,
          order: 'asc',
        });
      } catch (e) {
        // Re-throw so the framework still returns a 500; the point here is a
        // searchable, prefixed log line, not swallowing the error.
        log.error(`[sites][baseUrlContains] query failed requestId=${requestId}`, e);
        throw e;
      }
      let list;
      if (Array.isArray(rows)) {
        list = rows;
      } else if (Array.isArray(rows?.data)) {
        list = rows.data;
      } else {
        log.warn(`[sites][baseUrlContains] unexpected Site.all shape; returning empty requestId=${requestId}`);
        list = [];
      }
      const hasMore = list.length > effectiveLimit;
      const sites = list.slice(0, effectiveLimit).map((site) => SiteDto.toListJSON(site));

      if (s2sResult.allowed) {
        log.info(`[s2s-readall] GET /sites (baseUrlContains) granted clientId=${s2sResult.clientId} consumerId=${s2sResult.consumerId} capability=${CAP_SITE_READ_ALL} count=${sites.length} requestId=${requestId}`);
      }

      // Unconditional observability for both admin and S2S paths. Never log the raw
      // query value (URLs may be sensitive), only its length and result counts.
      log.info(`[sites][baseUrlContains] qlen=${q.length} count=${sites.length} hasMore=${hasMore} requestId=${requestId}`);

      // Echo the trimmed query in the pagination so a new client can confirm its
      // search was actually applied. An older deployment that ignores `baseUrlContains`
      // but still honors `limit` would return the cursor envelope with unfiltered
      // sites and no `baseUrlContains` echo, letting clients detect the version skew.
      return ok({
        sites,
        pagination: {
          limit: effectiveLimit, offset, hasMore, baseUrlContains: q,
        },
      });
    }

    const paginated = hasText(limitParam) || hasText(cursor);

    if (cursor !== null) {