Skip to content

Fix LLM translation output pollution and refresh translations on body edits#283

Merged
dahlia merged 18 commits into
hackers-pub:mainfrom
dahlia:bugfix/translation
May 4, 2026
Merged

Fix LLM translation output pollution and refresh translations on body edits#283
dahlia merged 18 commits into
hackers-pub:mainfrom
dahlia:bugfix/translation

Conversation

@dahlia

@dahlia dahlia commented May 4, 2026

Copy link
Copy Markdown
Member

Summary

Two related fixes to the LLM-powered article translation feature, shipped together because they share the same surface in models/article.ts and the second one surfaces failure modes the first could amplify.

1. Translation output pollution

The translation pipeline passed Vertana's fetchLinkedPages as a required context source, which fetches up to ten linked pages with their full Readability-extracted body and dumps everything into the translator's system prompt as Additional context: …. For articles that link to long technical posts on the same topic, the system-prompt context easily ballooned to multiples of the source-text size and the model began confusing the context for the text to translate. The concrete failure observed on @nebuleto/2026/swe-with-coding-agent-in-2026—a Korean-original article linking to Anthropic's engineering blog and JetBrains research—is that the en-US translation came back as the linked Anthropic blog post in English (~3.5x the original length) and the ja translation came back entirely in Korean, paraphrasing the linked content rather than translating the source.

The fix in ai/translate.ts swaps fetchLinkedPages (mode "required") for fetchWebPage (mode "passive"). Vertana exposes passive sources as tools the model can call when it needs them, rather than pre-fetching everything into the prompt, so the model sees a clean translation context and only invokes a fetch when it decides to. Other context sources (author info, article tags) and Vertana options (refinement, dynamic glossary, technical tone) are unchanged.

2. Stale translations after edit

Per #95, editing an article only updated the original-language article_content row; existing translation rows kept their previously translated text and silently drifted out of sync with the original. Vertana doesn't support incremental translation yet, so the semantics here is "edit→reset each translation row to a placeholder pointing at the new original→re-run the full translator." Title-only edits don't trigger retranslation, mirroring the existing summary-invalidation gate in models/article.ts. Language changes can't reach the trigger when translations exist (the self-FK on article_content rethrows 23503 as LanguageChangeWithTranslationsError).

The trigger sits in two pieces. updateArticleSource now returns an extra originalContentChanged flag set when the original-language row's body actually changed; the flag is captured inside the existing transaction alongside the existing resummarizeTarget. A new exported restartArticleContentTranslations(fedCtx, articleSource) does the actual work: re-fetches the latest original-language content, finds all translation rows, atomically resets each to placeholder state (copying the new original title/content into it, flipping beingTranslated back to true, clearing summary state), and then schedules runArticleContentTranslation(fedCtx, resetRow) against each. updateArticle reads the flag and awaits the synchronous claim-and-reset step so placeholders are visible by the time the mutation responds; the actual translate() calls run in the background, same as the initial translation flow. Each completed retranslation sends its own federation Update activity—the right ActivityPub semantics: peers see the original-language change first and then each translation's refresh as it becomes available.

Architecture notes

  • runArticleContentTranslation is extracted out of startArticleContentTranslation so it can be shared between the initial-queue path and the restart-on-edit path. The split is a pure refactor; the original return timing (sync setup awaited, translate chain in the background) is preserved.
  • The async translation chain is now CAS-guarded on a JS-side Date "claim" stamped onto the row at the top of runArticleContentTranslation. Both the success UPDATE and the failure DELETE require the row's updated to still equal that claim, so a concurrent retranslation that reset the placeholder under us turns our writes into no-ops instead of clobbering its fresher placeholder. A JS Date (rather than CURRENT_TIMESTAMP) is what makes the CAS reliable: PG timestamptz keeps µs precision while the postgres driver hands back JS Date truncated to ms, so a CAS against a round-tripped server-side timestamp would never match. This mirrors the claim pattern already used by startArticleContentSummary.
  • The claim step also reads the row back via RETURNING, and the helper uses claimed.title/claimed.content for the translate input rather than the caller-provided queued snapshot, so a concurrent restart that just wrote a newer placeholder under us still gets translated against the newer body.
  • Federation Update activity ids for translation completions now include the target language and the translated row's own (just-bumped) updated so multiple per-locale completions don't collide on article.updated, which translation completions don't bump.
  • restartArticleContentTranslations runs the read-original-then-reset-translations sequence inside a transaction whose first statement is SELECT 1 FROM article_source WHERE id=… FOR UPDATE. Concurrent restarts (driven by back-to-back edits to the same article) would otherwise each read their own snapshot of the original and overwrite each other's placeholder writes; the row lock—the same one updateArticleSource takes during its UPDATE—queues them up cleanly. The actual translate() calls are scheduled after the transaction commits so the LLM round-trip doesn't extend the lock window.

Tests added

  • models/article.source.test.ts splits the existing "preserves translations" test into two: one asserts the new originalContentChanged flag flips on body changes and the existing translation row is preserved at this layer (the actual reset lives in the updateArticle step, not here); one asserts the flag stays false on title-only edits.
  • models/article.background.test.ts gains a direct restartArticleContentTranslations() test that asserts the placeholder reset happens with the new original title/content and the failing stub-translator subsequently cleans the row up, plus a no-op test for articles with no translation rows.
  • models/article.lifecycle.test.ts gains an end-to-end updateArticle() body-change retranslation test plus a title-only no-op test.

Verification

deno task test models/article.background.test.ts models/article.source.test.ts models/article.lifecycle.test.ts models/article.summary.test.ts graphql/post.more.test.ts passes clean (49 tests). Seven pre-existing failures in graphql/admin.test.ts, graphql/timeline.test.ts, models/post.delete.test.ts, and models/timeline.query.test.ts were verified against the merge base and are unrelated to this work.

The new code went through external code review (Codex GPT-5.5) before opening this PR; the last two commits address the issues that surfaced.

Out of scope

  • Cleanup of translation rows generated under the previous fetchLinkedPages bug. Site operators can re-request from the UI by re-editing the article, since the new code path triggers a fresh translation on body change.
  • Filing the equivalent issue/PR upstream against Vertana for stronger demarcation of context vs. translation input and size caps on fetchLinkedPages. Tracked separately.
  • Vertana incremental-translation API, which would supersede the reset-and-re-run approach used here.
  • Editor auto-save debouncing. Worth a follow-up if the editor fires updateArticle on every keystroke, since each call would now fan out N LLM translations.

Fixes

Closes #95.

dahlia added 6 commits May 4, 2026 13:39
The translator used fetchLinkedPages as a required context source,
which fetched up to 10 linked pages with their full Readability-
extracted body and dumped everything into the system prompt as
"Additional context: ...".  For articles that link to several long
technical posts the context easily ballooned to multiples of the
source text size, and Claude began confusing the context for the
text to translate.  Concrete failures observed on
@nebuleto/2026/swe-with-coding-agent-in-2026: the en-US row came
back as the linked Anthropic blog post in English (~3.5x the
original length), and the ja row came back entirely in Korean,
paraphrasing the linked content.

Switch to the passive fetchWebPage source instead.  Vertana exposes
passive sources as tools rather than pre-fetching them into the
prompt, so the model only invokes a fetch when it actually decides
to, which sidesteps the up-front pollution while still leaving a
fetch capability on the table.  The other context sources
(author info, article tags) and Vertana options (refinement,
dynamic glossary, technical tone) are unchanged.

Existing translation rows that were generated under the old
behavior remain in the database; they should be deleted so the
auto-request path on /lang regenerates them with the new code.

Assisted-by: Claude Code:claude-opus-4-7
Pure refactor of startArticleContentTranslation in models/article.ts.
The on-conflict / staleness gate at the front (insert-or-find the
placeholder row, decide whether work is needed) stays in the public
function.  The post-row-creation work — fetching author/tag context,
running translate(), writing the result back, sending the federation
Update activity, and kicking off post-translation summarization — moves
into a new private async helper runArticleContentTranslation(fedCtx,
queued).

No behavior change.  The synchronous setup phase (article-source
fetch) is still awaited before the translate() chain is scheduled, so
callers see exactly the same return timing as before, and the
translate-then/catch chain still runs in the background.

A defensive log-and-return guard is added when the helper is called
with a row that has originalLanguage IS NULL (the original-language
row itself); the existing public entry point never produces such a
row, but a forthcoming caller will pass already-existing translation
rows through this path.

Assisted-by: Claude Code:claude-opus-4-7
Inside the new runArticleContentTranslation helper, take a JS-side
claim Date as ownership stamp on the placeholder row before kicking
off the translate() chain.  The success UPDATE and the failure
DELETE both gate on the row's `updated` still equalling that claim
— a concurrent re-translation that resets the row out from under us
bumps `updated` past the claim, and our writes turn into no-ops
instead of clobbering the fresher placeholder with stale text or
deleting it on a stale failure.

A JS Date is used (rather than PG `CURRENT_TIMESTAMP`) because the
postgres driver hands back JS `Date` values truncated to ms while
PG `timestamptz` keeps µs precision; a CAS against a round-tripped
`CURRENT_TIMESTAMP` write would never match its own claim.  This
mirrors the claim pattern already used by startArticleContentSummary
in this file.

The race this prevents only becomes reachable in practice with the
forthcoming retranslation-on-edit code path, but the guard is
self-contained and the existing failure-cleanup test still passes,
so it lands ahead of that change as a pure correctness fix.

Assisted-by: Claude Code:claude-opus-4-7
Edits to an article's body now invalidate every existing translation
row for that article and re-runs the translator against the fresh
original.  Previously, translations stayed pinned to whatever text
they were generated from, so /ja, /en, etc. silently drifted out of
sync with the edited original.

The trigger lives in two pieces:

- updateArticleSource now returns an extra `originalContentChanged`
  flag set when the original-language `article_content` row's body
  actually changed.  The flag mirrors the existing summary
  invalidation gate: title-only edits do not set it, so a typo fix
  on the title doesn't fan out N+1 LLM calls; the body-changed gate
  also catches the case the summary path already cared about.  A
  successful `languageChanged` cannot reach this branch when
  translations exist (the self-FK on `article_content` re-throws
  23503 as `LanguageChangeWithTranslationsError`), so the flag is
  effectively `contentChanged`.

- A new restartArticleContentTranslations(fedCtx, articleSource)
  resets each translation row to placeholder state — copying the
  new original title/content into it, flipping `beingTranslated`
  back to true, and clearing summary state — then fires the
  extracted runArticleContentTranslation helper against each reset
  row in the background.

updateArticle reads the new flag and awaits the synchronous
claim-and-reset step so placeholders are visible by the time the
mutation responds; the actual translate() calls run in the
background, same as the initial translation flow.  Each
re-translation that completes sends its own federation Update so
peers see the original-language change first and then each
translation's refresh as it becomes available — the right
ActivityPub semantics.  The CAS guard added in the previous commit
keeps an in-flight stale translation from clobbering the new
placeholder with stale text on completion.

The summary-test callers of updateArticleSource only assert
truthiness on the return value, so the new return shape passes
through them unchanged.  The dedicated source-level test that
previously asserted "preserves translations" splits into two: one
for the body-change case (now asserting the flag flips and the
existing row is preserved here for the outer caller to invalidate)
and one for the title-only carve-out.  New end-to-end tests in
article.lifecycle.test.ts cover updateArticle's full body-change
and title-only paths.

Fixes hackers-pub#95

Assisted-by: Claude Code:claude-opus-4-7
Three fixes uncovered by an external code review of commits
302c242..422ee87.

1. Translate the freshest body, not the queued snapshot
   (models/article.ts).  runArticleContentTranslation's claim
   step now reads the row back via RETURNING and uses
   `claimed.title` / `claimed.content` for the translate input.
   This closes a race where two restartArticleContentTranslations
   calls interleave: the older restart's runArticleContentTranslation
   would otherwise translate the older body even after the newer
   restart wrote the newer body into the placeholder.  The claim
   WHERE also now requires `beingTranslated=true`, so the helper
   bails silently if the row has been completed (or deleted)
   between the caller queueing this run and the claim landing,
   instead of stamping a claim onto a no-longer-relevant row.

2. Make translation Update activity ids unique
   (models/article.ts).  Previously every translation completion
   reused `#update/${article.updated.toISOString()}` as the
   activity id.  `article_source.updated` is not bumped by
   translation completions, so the original-language Update and
   every per-locale translation Update for the same edit all
   collided.  The id now incorporates the target language and
   the translated row's own (just-bumped) `updated` so each
   completion mints a distinct id, regardless of how many
   locales finish in close succession.

3. Catch background retranslation kickoff failures
   (models/article.ts).  restartArticleContentTranslations did
   `void runArticleContentTranslation(...)`, which loses any
   rejection thrown by the synchronous setup before the internal
   `.then().catch()` chain is installed (e.g., a transient DB
   error during the claim UPDATE or the article-source fetch).
   Such rejections would surface as unhandled promise rejections
   in the runtime; they now go through a `.catch()` that logs
   the failed (sourceId, language) and the error.

All 29 article-domain tests still pass.

Assisted-by: Codex:gpt-5.5
Assisted-by: Claude Code:claude-opus-4-7
Wrap the read-original-then-reset-translations sequence in
restartArticleContentTranslations in a transaction that takes
`SELECT … FOR UPDATE` on the article_source row.  Concurrent
restartArticleContentTranslations calls (driven by back-to-back
edits to the same article) would otherwise each read their own
snapshot of the original-language body and then overwrite each
other's placeholder writes — the older restart's iteration could
land after the newer restart's, leaving the placeholder pointing at
the older body and the eventual translation reflecting that older
body even though a newer edit had committed.

The article_source row lock is the same row-level write lock that
updateArticleSource takes on its UPDATE, so concurrent edits and
restarts queue up cleanly without any new lock graph.  The actual
translate() round-trip runs after the transaction commits — the
runArticleContentTranslation calls are scheduled outside the tx
block — so the LLM call doesn't extend the lock window.

Follow-up to the previous commit (Address Codex review of the
retranslation flow), which addressed three of four issues; this
one closes the remaining stale-snapshot race that survived the
freshest-body claim refactor in runArticleContentTranslation.

Assisted-by: Codex:gpt-5.5
Assisted-by: Claude Code:claude-opus-4-7
@coderabbitai

coderabbitai Bot commented May 4, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@dahlia has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 23 minutes and 28 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0cfccfaa-6132-486c-a0c1-c35c98fb366f

📥 Commits

Reviewing files that changed from the base of the PR and between ff61bc2 and 2dd4480.

📒 Files selected for processing (2)
  • models/article.translation-split.test.ts
  • models/article.ts
📝 Walkthrough

Walkthrough

updateArticleSource now reports when the original body actually changed; updateArticle uses that flag to call restartArticleContentTranslations which resets LLM-requested translation rows to placeholders and re-queues translations with CAS ownership. runArticleContentTranslation uses JS-side claim stamps. ai/translate.ts stops eagerly fetching linked pages and exposes fetchWebPage as a callable tool.

Changes

Article translation restart and CAS

Layer / File(s) Summary
Data Shape
models/article.ts
Adds exported UpdateArticleSourceResult with originalContentChanged: boolean.
Core: Source Update
models/article.ts
updateArticleSource returns { source, originalContentChanged }, sets originalContentChanged=true only when original-language body is updated.
Core: Update Flow
models/article.ts
updateArticle consumes originalContentChanged and, if true and allowLlmTranslation, awaits restartArticleContentTranslations(...) after federating the original-language update.
Core: Restart Logic
models/article.ts
New exported restartArticleContentTranslations(fedCtx, articleSource): locks article_source (FOR UPDATE), resets LLM-requested translation rows to placeholder state (copying current original title/content, clearing summary/ogImageKey, stamping updated), and fire-and-forgets runArticleContentTranslation per reset with error logging.
Core: Translation Queue / CAS
models/article.ts
startArticleContentTranslation stamps inserted placeholders with a JS queueStamp, re-stamps stale placeholders for reclaim. runArticleContentTranslation claims placeholders by updating updated to a JS claim and only persists success/failure actions when article_content.updated === claim. Activity Update id includes the translated row’s updated and targetLanguage.
Translation Context Tooling
ai/translate.ts
contextSources typed as ContextSource[]; replaces eager fetchLinkedPages web-context building with adding fetchWebPage as a passive callable tool in the context sources array.
Tests: Lifecycle
models/article.lifecycle.test.ts
Adds waitFor helper and tests: body edits trigger translation reset, title-only edits do not, gating behavior for allowLlmTranslation and human-curated translations preserved.
Tests: Background
models/article.background.test.ts
Adds tests for restartArticleContentTranslations() resetting stale translation rows to placeholders and being a no-op when no translations exist.
Tests: Source Update
models/article.source.test.ts
Adds tests asserting originalContentChanged true for body edits and false for title-only edits; verifies preservation or update of original/translation rows accordingly.
Tests: Split Helper
models/article.translation-split.test.ts
Adds unit tests for splitTranslationTitleAndContent() across Markdown scenarios.
Test Utilities
test/wait.ts
Adds exported waitFor(predicate, timeoutMs) async polling helper used by the new integration tests.

Sequence Diagram

sequenceDiagram
  participant User
  participant API
  participant DB
  participant Translator
  participant Summarizer

  User->>API: updateArticle(newTitle?, newContent?)
  API->>DB: updateArticleSource(new data)
  DB-->>API: { source, originalContentChanged }
  alt originalContentChanged == true and allowLlmTranslation
    API->>DB: restartArticleContentTranslations(source)
    DB->>DB: LOCK article_source FOR UPDATE
    DB->>DB: SELECT translation rows (LLM-requested)
    loop per translation row
      DB->>DB: UPDATE row -> placeholder (copy source title/content, clear summary, set beingTranslated, stamp updated)
      DB-->>API: queued
    end
    par async per queued row
      API->>Translator: runArticleContentTranslation(rowId)
      Translator->>DB: claim placeholder (updated := claim)
      Translator->>Translator: translate from claimed.title/claimed.content
      alt claim still valid
        Translator->>DB: CAS UPDATE success (only if updated === claim)
        DB-->>Translator: persisted
        Translator->>Summarizer: schedule post-translation summary
      else claim stale
        Translator-->>DB: log & skip persistence
      end
    end
  end
  API-->>User: updateArticle response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

bug

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the two main changes: fixing translation output pollution by swapping context sources and refreshing translations on body edits.
Description check ✅ Passed The description comprehensively documents both fixes, including architecture notes and verification, relating directly to the changeset in multiple file changes.
Linked Issues check ✅ Passed The PR fully addresses issue #95 by implementing body-edit detection, translation reset/refresh logic, filtering for LLM-generated rows only, respecting allowLlmTranslation gating, and preserving human-curated translations.
Out of Scope Changes check ✅ Passed All changes are in-scope: core translation and article logic updates, test coverage for new behavior, a test helper utility, and a new translation-parsing test suite directly supporting the stated fixes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 23 minutes and 28 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements automatic re-translation of articles when the original content is updated. It introduces a restartArticleContentTranslations function that resets existing translation rows to a placeholder state and triggers background re-translation using a new runArticleContentTranslation helper. The implementation includes a concurrency control mechanism using a Check-and-Set (CAS) pattern on the updated timestamp and ensures unique ActivityPub update IDs for translation completions. Additionally, the translation logic was optimized to fetch linked pages lazily as tools. Feedback was provided regarding the duplication of the waitFor test helper across multiple test files.

Comment thread models/article.lifecycle.test.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f649b60f05

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread models/article.ts Outdated
dahlia added 2 commits May 4, 2026 15:04
updateArticle() called restartArticleContentTranslations() on every
body change without checking the article-level allowLlmTranslation
flag.  When an edit toggles the flag off and changes the body in
the same update, the previous behavior would still enqueue
background translate() runs for every existing translation row,
overriding the author's just-expressed wish and incurring model
usage / cost / data-egress for translations they no longer want.

Gate the call on articleSource.allowLlmTranslation.  Existing
translation rows from before the flag was flipped off are left
alone (stale, not refreshed); flipping the flag back on and
editing the body again brings them back into sync.

Adds a regression test covering the gate.

hackers-pub#283 (comment)

Assisted-by: Codex:gpt-5.5
Assisted-by: Claude Code:claude-opus-4-7
Both models/article.background.test.ts and
models/article.lifecycle.test.ts had a private copy of the same
async-polling helper that watches for background-promise side
effects (translation completion, placeholder reset, summary
cleanup) without exposing the underlying promise to the test.
Move it to test/wait.ts so future tests don't grow a third copy
when they hit the same observation pattern.

Behavior is unchanged: same 50 ms poll cadence, same default
10 s deadline, same error message on timeout.

hackers-pub#283 (comment)

Assisted-by: Claude Code:claude-opus-4-7
@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to automatically re-translate articles when the original body content is updated. Key changes include the addition of restartArticleContentTranslations to reset translation rows and runArticleContentTranslation to handle LLM interactions using a Compare-And-Swap pattern for concurrency safety. The translation pipeline was also refined to provide web context as a passive tool, preventing model confusion. Feedback focuses on optimizing the database interaction in restartArticleContentTranslations by replacing the loop of individual updates with a single batch UPDATE statement to improve performance when multiple translations exist.

Comment thread models/article.ts Outdated
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Delightful!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

restartArticleContentTranslations() previously listed every
translation row for the article via a relational findMany and then
issued a per-row UPDATE inside the transaction.  Each row gets the
same placeholder values, so the loop is N+1 round trips against
the same row set.

Replace the findMany + per-row UPDATE loop with a single UPDATE
filtered on (sourceId = articleSource.id AND
originalLanguage IS NOT NULL).  The filter targets exactly the
same rows the previous implementation iterated; the schema check
article_content_being_translated_check requires
originalLanguage IS NOT NULL whenever beingTranslated=true so the
filter doubles as the schema-safety guard the per-iteration
comment used to explain.  RETURNING gives us the reset rows for
the post-commit fan-out into runArticleContentTranslation, the
same shape callers depended on.

Behavior is otherwise unchanged: still inside the same transaction
that holds the article_source FOR UPDATE row lock, still preserves
originalLanguage and translationRequesterId by omitting them from
set, still bumps updated to CURRENT_TIMESTAMP.  The "no rows to
restart" early return is now implicit in the UPDATE returning an
empty array; the surrounding logger calls are kept and reworded
slightly for the new shape.

All 30 article-domain tests still pass.

hackers-pub#283 (comment)

Assisted-by: Claude Code:claude-opus-4-7
@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements automatic retranslation of articles when the original body is updated by introducing a mechanism to reset translation rows and trigger background processing. It includes a Compare-And-Swap (CAS) logic to handle concurrent updates and ensures unique ActivityPub Update IDs for translations. Feedback suggests clearing ogImageKey during resets to prevent stale assets and refining the CAS claim logic to avoid redundant LLM calls during rapid successive edits.

Comment thread models/article.ts
Comment thread models/article.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 59503306de

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread models/article.ts
dahlia added 2 commits May 4, 2026 15:32
article_content.ogImageKey caches the rendered OG image for a row.
The image is rendered from the row's title and content, so any
write that replaces those fields makes the cached image stale: the
crawler that follows the cached URL will see an image whose
heading or excerpt no longer matches the live body.

Add ogImageKey: null to the two write paths that replace a
translation row's title/content:

- restartArticleContentTranslations()'s reset UPDATE, which copies
  the new original title/content into every translation placeholder
  on a body edit.
- runArticleContentTranslation()'s success write, which lands the
  newly translated title/content for one row.

Both paths already clear the matching summary state (summary,
summaryStarted, summaryUnnecessary) for the same reason; the OG
image is the same kind of derived asset and gets cleared next to
them.  Lazy regeneration on the next OG-image request rebuilds
from the freshly written body.

The original-language UPDATE in updateArticleSource() is left
alone for now; that staleness predates this PR and is the same
question for the original row, not specifically a retranslation
concern.

hackers-pub#283 (comment)

Assisted-by: Claude Code:claude-opus-4-7
runArticleContentTranslation()'s claim UPDATE only required
beingTranslated=true on the row, so two background runs scheduled
for the same (sourceId, language) by back-to-back edits would both
pass the claim check and both call translate().  The success /
failure CAS at the end still ensured only one write landed, but
the second translate() round trip was wasted LLM cost.  A related
concern is that JS Date is millisecond-precision, so two claim
values created in the same millisecond compare equal; the
existing CAS happens to remain correct because the first success
bumps `updated` to a new value that the second can no longer
match, but the protocol felt easier to reason about with a
stronger uniqueness story up front.

Add `updated = queued.updated` to the claim WHERE so each run only
claims the placeholder it was queued for.  When a later restart
re-stamps the row, the older run's claim turns into a no-op
instead of going through to translate().

For this CAS to be lossless, every writer that produces a row
the helper later treats as `queued` now stamps `updated` with a
JS Date (which round-trips through the postgres driver at the
same ms precision the JS side stores) instead of letting it
default to PG's CURRENT_TIMESTAMP (which keeps µs and would lose
precision on the round trip):

- startArticleContentTranslation()'s INSERT: explicit
  updated: queueStamp.
- startArticleContentTranslation()'s stuck-row recovery: an
  explicit re-stamp UPDATE before handing the row off to the
  helper, so the helper's CAS has a value to match.  Returns
  early without scheduling translate() if a concurrent writer
  took the row out from under us between the SELECT and the
  re-stamp.
- restartArticleContentTranslations()'s reset UPDATE: explicit
  updated: restartStamp instead of sql`CURRENT_TIMESTAMP`.

The claim WHERE in runArticleContentTranslation() then matches
this stamp exactly.

All 30 article-domain tests still pass; the existing fixtures
already insert with JS Date stamps, so no test changes are
needed.

hackers-pub#283 (comment)
hackers-pub#283 (comment)

Assisted-by: Codex:gpt-5.5
Assisted-by: Gemini Code Assist:gemini-2.5
Assisted-by: Claude Code:claude-opus-4-7
@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements automatic retranslation of articles when the original content body is updated. It introduces a background retranslation pipeline using a Check-and-Set (CAS) mechanism with millisecond-precision timestamps to handle concurrent updates safely. The changes also include logic to skip retranslation for title-only edits and respect the allowLlmTranslation setting. Feedback was provided regarding the robustness of the Markdown title extraction logic, suggesting a more resilient parsing approach or stricter prompt instructions to handle potential variations in LLM output format.

Comment thread models/article.ts Outdated
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The success path of runArticleContentTranslation() split the LLM
output into title and body with two strict regexes anchored at
^\s*#\s+, so any model output that didn't begin exactly with a `# `
H1 left title="" and dumped the entire raw output into content.
The most likely cause in production was a model that prefaced the
translation with a brief commentary line (a translator's note,
hedging about preserved terms, etc.) before the H1.  The previous
commit's fetchWebPage switch made this much less common, but the
parser was still fragile.

Replace the inline regexes with a small splitTranslationTitleAndContent
helper that:

- Accepts the first H1 line anywhere in the output as the title,
  not just at offset 0.  Any text before that H1 is dropped on the
  assumption it is model commentary, since the prompt did not
  contain anything before `# Title`.
- Falls back to the first non-empty line as the title (with the
  rest as the body) if there is no H1 at all.
- Restricts H1 detection to a single `#` followed by whitespace, so
  later `## Section` headings inside the body are not mis-promoted
  to the title.
- Trims surrounding whitespace from both halves.

The helper is exported and covered by a dedicated unit test file
exercising clean H1 + body, H1 with preamble, no H1 at all,
H2-only input, H1-only input, empty / whitespace-only input, and
title-line whitespace trimming.

hackers-pub#283 (comment)

Assisted-by: Gemini Code Assist:gemini-2.5
Assisted-by: Claude Code:claude-opus-4-7
@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements automatic retranslation of articles when their original content is updated. Key changes include the introduction of a restartArticleContentTranslations function, a more robust LLM output parser in splitTranslationTitleAndContent, and a concurrency-safe claim mechanism for background tasks using JS Date timestamps to handle database precision differences. The feedback identifies a redundant null-coalescing fallback for the source language in the translation routine that can be simplified due to existing guards and schema constraints.

Comment thread models/article.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f59757a238

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread models/article.ts
dahlia added 2 commits May 4, 2026 16:04
restartArticleContentTranslations()'s reset filter selected every
row with `originalLanguage IS NOT NULL`, which catches both
LLM-requested translation rows (carrying `translationRequesterId`)
and human-curated translation rows (carrying `translatorId`).  On
a body edit, a human translation would therefore be reset to the
new source-language placeholder and re-run through the LLM,
silently destroying the translator's work and leaving incorrect
attribution metadata on machine-generated output.

Add `translatorId IS NULL` to the WHERE so the restart only
touches LLM-requested rows.  The schema check
`article_content_translator_translation_requester_id_check`
already makes `translatorId` and `translationRequesterId`
mutually exclusive, so the new predicate cleanly partitions human
vs LLM translations without needing to also test
`translationRequesterId IS NOT NULL`.

Adds a regression test that creates one human translation and
one LLM translation alongside it on the same article, edits the
body, and asserts the human row is untouched while the LLM row
is picked up by the restart.

hackers-pub#283 (comment)

Assisted-by: Codex:gpt-5.5
Assisted-by: Claude Code:claude-opus-4-7
The translate() input in runArticleContentTranslation() passed
sourceLanguage as `claimed.originalLanguage ?? originalLanguage`,
where the right-hand side was the destructured local from the
caller-provided `queued`.  The fallback was dead code: the claim
WHERE required `beingTranslated=true`, and the schema check
`article_content_being_translated_check` makes that imply
`originalLanguage IS NOT NULL`, so `claimed.originalLanguage` is
always set on a row that came back from the claim's RETURNING.

Drop the fallback and use a non-null assertion on
`claimed.originalLanguage` (drizzle types the column as nullable
because it is nullable in general, but in this position the
schema guarantees otherwise).  A short comment captures the chain
of guarantees so a future reader doesn't reach for the same
defensive `??` again.

hackers-pub#283 (comment)

Assisted-by: Gemini Code Assist:gemini-2.5
Assisted-by: Claude Code:claude-opus-4-7
@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@models/article.ts`:
- Around line 815-824: The reclaim update is not CAS-safe: the
db.update(articleContentTable).set({...}).where(...) only checks
articleContentTable.beingTranslated and can overwrite a fresh claim by another
worker; change the WHERE in the reclaim/update to require that
articleContentTable.updated (or the exact previous timestamp/value you read into
e.g. content.updated or currentUpdated) still equals the stale value you
observed, so the UPDATE only succeeds if no other worker changed the row, and
then check the returning() result (zero rows means another worker claimed it)
before proceeding with translate(); reference the reclaimed variable, the update
call on articleContentTable, and the articleContentTable.beingTranslated and
articleContentTable.updated fields.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 95f1839f-0218-44d4-9bbe-d147c553cec8

📥 Commits

Reviewing files that changed from the base of the PR and between f649b60 and ff61bc2.

📒 Files selected for processing (5)
  • models/article.background.test.ts
  • models/article.lifecycle.test.ts
  • models/article.translation-split.test.ts
  • models/article.ts
  • test/wait.ts
✅ Files skipped from review due to trivial changes (1)
  • models/article.translation-split.test.ts

Comment thread models/article.ts

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements automatic re-translation of articles when the original content is updated. It introduces a background processing pipeline that invalidates existing LLM-generated translations while preserving human-curated ones. Key changes include a new restartArticleContentTranslations function, a robust title/content splitter for LLM outputs, and a concurrency-safe claim system using timestamps. Feedback focuses on using idiomatic Drizzle methods for row locking and simplifying the logic for parsing translated content.

Comment thread models/article.ts Outdated
Comment thread models/article.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff61bc24db

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread models/article.ts Outdated
dahlia added 3 commits May 4, 2026 16:25
The stuck-row reclaim branch in startArticleContentTranslation()
SELECTed the row, decided it was stale (>30 min old and still
beingTranslated), then UPDATEd it with only beingTranslated=true
in the WHERE.  A concurrent worker that landed its own reclaim in
the gap between our SELECT and our UPDATE would have its fresh
claim overwritten by ours, and both workers would then go on to
call runArticleContentTranslation(), wasting one LLM round trip.

Repeat the same staleness check inside the UPDATE itself
(`updated < CURRENT_TIMESTAMP - INTERVAL '30 minutes'`).  Under
PG read-committed isolation, the second UPDATE waits for the
first's row lock, then re-evaluates its WHERE on the now-fresh
row state and drops the row from the candidate set, so the second
worker's reclaim returns 0 rows and bails cleanly.  Same
predicate as the SELECT, so it stays in lockstep with the
visibility logic above.

hackers-pub#283 (comment)

Assisted-by: CodeRabbit:cr-3
Assisted-by: Claude Code:claude-opus-4-7
restartArticleContentTranslations() acquired the article_source
row lock with raw SQL via tx.execute(sql\`SELECT 1 FROM ... FOR
UPDATE\`).  Drizzle exposes the same primitive idiomatically as
.for("update") on a select query, which keeps column references
and parameter binding consistent with the rest of this file's
query style.  Same lock, same semantics, no behavior change.

hackers-pub#283 (comment)

Assisted-by: Gemini Code Assist:gemini-2.5
Assisted-by: Claude Code:claude-opus-4-7
splitTranslationTitleAndContent() previously took the first H1
line anywhere in the output as the title, dropping any text
before it on the assumption that it was model commentary.  That
handled a "preamble before # Title" case nicely but introduced a
data-loss path: when the model omitted the article-title H1 and
the body happened to contain its own H1 section heading, the
parser silently truncated the article (mis-promoted body H1 to
title, dropped the intro paragraphs above it).  Silent truncation
is a worse failure mode than a visible-but-ugly title.

Anchor parsing to the first line of the trimmed output:

- The first line becomes the title.
- If it begins with `# `, the marker is stripped; otherwise the
  whole line is the title verbatim.
- Everything after the first line becomes the body, untouched.

The strict implementation also folds in the related cleanup of
the prior fallback branch.  Because `trimmed` is the result of
`String#trim()`, it is guaranteed to either be empty or begin
with a non-whitespace character, so the previous `findIndex` for
the first non-empty line was redundant and is now removed.  The
H1-marker regex is restricted to a single `#` followed by
whitespace so a first-line `## Section` is taken verbatim
instead of being mis-stripped.

Tests in models/article.translation-split.test.ts are updated to
match the new behavior: the preamble case now keeps the preamble
visible as the title with the body's `# Title` framing
preserved, and the rest of the cases (clean H1+body, no-H1
fallback, first-line H2, body H2 sections, empty input,
H1-only input, title trimming) continue to pass unchanged.

hackers-pub#283 (comment)
hackers-pub#283 (comment)

Assisted-by: Codex:gpt-5.5
Assisted-by: Gemini Code Assist:gemini-2.5
Assisted-by: Claude Code:claude-opus-4-7
@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a mechanism to automatically re-trigger article translations when the original content's body is updated. It introduces a restartArticleContentTranslations function that resets existing LLM-generated translation rows to a placeholder state and re-runs the translation pipeline. Key improvements include a concurrency-safe Check-and-Set (CAS) mechanism using JavaScript Date objects for database updates, a more robust title/body splitting logic for LLM outputs, and updated ActivityPub activity IDs to ensure uniqueness across multiple translation completions. The PR also refactors test utilities into a shared waitFor helper and adds comprehensive test coverage for the new lifecycle logic. I have no feedback to provide.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5fd961d6e9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread models/article.ts
The stuck-row reclaim branch in startArticleContentTranslation()
only bumped `updated` and left the row's `title` / `content`
alone.  That was fine when the helper translated from the
caller-provided original content, but
runArticleContentTranslation() now translates from
`claimed.title` / `claimed.content` (the row's own state read
back via RETURNING), so a placeholder that had been stuck since
before a body edit would be retranslated using the OLD source
text and publish a translation that no longer matches the
article.

Concrete failure path: an editor disables allowLlmTranslation,
edits the body (no restart fires because of the gate), waits
out the 30-min stale window, re-enables allowLlmTranslation,
and a later requestArticleTranslation hits the stuck-recovery
branch.  Without this fix, the resulting translation reflects
the body from before the edit.

Extend the reclaim UPDATE to copy the caller-provided original
title / content (and the matching language) into the
placeholder, and clear the `summary` / `summaryStarted` /
`summaryUnnecessary` / `ogImageKey` derived state for the same
reason as restartArticleContentTranslations() does.  The
caller of startArticleContentTranslation() always passes the
article's current original-language content row, which is the
right source of truth at request time.

hackers-pub#283 (comment)

Assisted-by: Codex:gpt-5.5
Assisted-by: Claude Code:claude-opus-4-7
@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented May 4, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a mechanism to automatically invalidate and re-run translations for articles when their original-language content is updated. It introduces a new restartArticleContentTranslations function, updates updateArticleSource to track content changes, and adds robust testing with a new waitFor utility to handle asynchronous background translation state. I have no feedback to provide as the changes are well-structured and properly tested.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Bravo.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@dahlia dahlia merged commit b4d8ad6 into hackers-pub:main May 4, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Translated articles are not updated when the original article is modified

1 participant