Skip to content

fix: link email-only maintainers on incremental runs (DE-980)#4277

Merged
joanagmaia merged 9 commits into
mainfrom
fix/DE-980-email-only-maintainer-link
Jul 2, 2026
Merged

fix: link email-only maintainers on incremental runs (DE-980)#4277
joanagmaia merged 9 commits into
mainfrom
fix/DE-980-email-only-maintainer-link

Conversation

@joanagmaia

@joanagmaia joanagmaia commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Email-only maintainers in large MAINTAINERS files — most visibly the Linux kernel repo (torvalds/linux) — were not being linked in maintainersInternal.identityId on incremental runs, even when the corresponding identity already existed in memberIdentities. Concrete example that motivated the ticket: Joshua Crofts <joshua.crofts1@gmail.com> (VEML3328 driver) — his identity was created 2026-05-30, but the kernel repo's incremental sync has been running daily since 2026-01-13 and never linked him.

Measured impact on the kernel repo: only 1980 of 2303 maintainers whose identities already exist in memberIdentities are actually linked in maintainersInternal — i.e. 323 resolvable identities (~14%) are silently dropped on every incremental run.

Root cause

Two independent bugs in the incremental code path. Both were invisible on first-run repos because insert_new_maintainers doesn't share any of this logic.

Bug 1 — compare_and_update_maintainers keyed by github_username

compare_and_update_maintainers builds dicts of current vs. new maintainers using github_username as the key:

current_maintainers_dict = {m[\"github_username\"]: m for m in current_maintainers}
new_maintainers_dict     = {m.github_username:     m for m in maintainers}

The AI extractor frequently emits github_username=\"unknown\" when the source only carries an email — for the kernel MAINTAINERS file that's 3959 of 4216 entries. All 3959 collapse into a single \"unknown\" slot in each dict, so at most one of them survives per run; the other 3958 are silently dropped. (The remaining 257 entries with real github usernames are processed normally.)

Bug 2 — get_maintainers_for_repo filtered out email-linked identities

The current-state query restricted the join with:

mem.platform = 'github' AND mem.type = 'username' AND mem.verified = TRUE

But the write path uses find_maintainer_identity_by_email which matches platform IN ('github','git','gitlab'). The kernel email identities resolve to platform='git' rows, so they were excluded from the comparison set entirely — leading to spurious re-inserts (idempotent via ON CONFLICT, so harmless on its own) and a broken view of "what's currently linked".

The general rule: the read filter must not be stricter than the write filter, or the diff loses sight of rows that were legitimately persisted.

Why incremental-only

insert_new_maintainers was already iterating maintainers directly without keying by username, and used find_github_identity → email fallback per maintainer. So on first run every email-only maintainer whose identity existed at that moment was linked. The bug only manifested on subsequent runs — any identity created after a repo's first run (Joshua's case: identity created 2026-05-30, repo's first run was 2026-01-13) could never be picked up, and any incremental run on a repo with many "unknown" extractions silently collapsed them down to one.

Fix

  • Centralised identity resolution_resolve_identity (github → email fallback) and _resolve_maintainers (concurrent batch resolver capped at MAX_CONCURRENT_CHUNKS). Both first-run (insert_new_maintainers) and incremental (compare_and_update_maintainers) paths share identical lookup semantics.
  • Identity-based diff keycompare_and_update_maintainers keys current and new maintainer sets by (identityId, role) instead of github_username. Matches the unique index on maintainersInternal (repoId, identityId, role) and stops \"unknown\" entries from collapsing into one slot.
  • get_maintainers_for_repo aligned with the write path — replaced platform='github' AND type='username' AND verified=TRUE with endDate IS NULL. The write path (find_github_identity / find_maintainer_identity_by_email) is the real gatekeeper for what gets persisted. endDate IS NULL restricts the comparison set to active rows so a reappearing maintainer falls into the "new" branch and gets reactivated via upsert_maintainer's ON CONFLICT … SET \"endDate\" = NULL clause (this restores the reactivation behaviour the old platform='github' filter incidentally provided for email-linked maintainers).
  • End-date safety guard — under the new identity-based key, a transient resolution failure could wrongly end-date a maintainer still listed in the source. Added a guard that skips end-dating when the row's identifying value still appears in the extracted file. Restores the invariant the old github_username-keyed new_maintainers_dict provided for free (every extracted entry was unconditionally in the "new" set regardless of lookup outcome).
  • Concurrent upserts preserved in insert_new_maintainers — restored asyncio.gather + Semaphore(MAX_CONCURRENT_CHUNKS) so first-run processing of large MAINTAINERS files doesn't serialise the thousands of writes.

Expected outcome

After the next scheduled run of the kernel repo, the gap between resolvable identities (memberIdentities rows that match a maintainer entry) and linked rows in maintainersInternal should close — currently 1980/2303, expected to approach 2303/2303 (modulo any new identities created between this PR landing and the next sync).

Test plan

  • Local extraction sanity-checked end-to-end against the full kernel MAINTAINERS file (4216 maintainers extracted; Joshua Crofts <joshua.crofts1@gmail.com> correctly returned as github_username=\"unknown\", email=\"joshua.crofts1@gmail.com\").
  • Ruff lint + format clean on both modified files.
  • After deploy, verify on the next scheduled run of the kernel repo that the 1980/2303 ratio moves toward parity and that Joshua Crofts specifically appears in maintainersInternal.identityId.
  • Verify no regressions on a typical github-username-driven repo (the most common case): existing rows remain linked, no spurious end-dates.

JIRA

DE-980

🤖 Generated with Claude Code


Note

Medium Risk
Changes maintainer persistence and end-dating logic on recurring syncs; incorrect guards could leave stale rows or skip legitimate end-dates, though the PR adds explicit safety around unresolved identities.

Overview
Fixes incremental maintainer sync so email-only entries (e.g. Linux MAINTAINERS with github_username="unknown") are linked instead of collapsed or ignored.

Shared resolution_resolve_identity / _resolve_maintainers (GitHub username then email, concurrent) are used by both first-run insert_new_maintainers and incremental compare_and_update_maintainers. find_github_identity now requires verified = TRUE.

Incremental diff — Comparison keys (identityId, role) instead of github_username, so thousands of "unknown" rows no longer overwrite each other. get_maintainers_for_repo returns active rows (endDate IS NULL) with identity value / platform / type (not GitHub-username-only), aligned with how rows are written.

End-dating — Rows missing from the resolved set are end-dated unless a kind-aware guard sees the same GitHub username or email still present in the file but identity lookup failed this run (avoids false removals on transient lookup failure).

Fork filtering — Parent-repo dedup uses new get_github_maintainer_usernames_for_repo because the broadened maintainer query includes email-linked identities.

Reviewed by Cursor Bugbot for commit 9c35de2. Bugbot is set up for automated code reviews on this repo. Configure here.

joanagmaia and others added 2 commits June 29, 2026 17:37
Two related bugs caused email-only maintainers (e.g. Joshua Crofts in the
linux MAINTAINERS file) to never get linked to memberIdentities after the
first run for a repo:

- compare_and_update_maintainers keyed both current and new maintainers by
  github_username. Extractions where the AI returns "unknown" (~3959 entries
  in the kernel file) collapse into a single dict slot, silently dropping
  the rest. Now keyed by (identityId, role), matching the unique index on
  maintainersInternal.
- get_maintainers_for_repo filtered current maintainers to
  platform='github' AND type='username', excluding maintainers linked via
  email. The comparison set was therefore incomplete. Filter relaxed to
  only deletedAt IS NULL.

Identity resolution is also unified: a single _resolve_identity helper
falls back from github username to email when the username is missing or
"unknown", so the email path is exercised on both first and incremental
runs.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
…E-980)

Adds inline comments to the fix from the previous commit so future readers
understand:
- why get_maintainers_for_repo intentionally has no platform/type/verified
  filters (the read must mirror what the write path persists),
- why _resolve_identity falls back from github username to email,
- why _resolve_maintainers is a shared helper, and
- why compare_and_update_maintainers keys by (identityId, role) instead of
  github_username.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Copilot AI review requested due to automatic review settings June 29, 2026 16:58

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes incremental maintainer syncing in services/apps/git_integration so email-only maintainers (common in large MAINTAINERS files like the Linux kernel’s) can be linked to existing identities on subsequent runs, instead of being silently skipped due to username-based key collisions and overly strict read-side filtering.

Changes:

  • Centralizes identity lookup with a GitHub-username-first, email-fallback resolver and reuses it across first-run and incremental paths.
  • Reworks incremental diffing to compare maintainers by (identityId, role) rather than github_username to avoid collapsing "unknown" entries.
  • Relaxes the current-state query to include email-linked identities in the read set for incremental comparisons.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py Adds centralized identity resolution + switches incremental comparison to identity-based keys; refactors first-run insertion to share the same resolution semantics.
services/apps/git_integration/src/crowdgit/database/crud.py Relaxes get_maintainers_for_repo filtering so incremental reads can “see” maintainers stored via email-linked identities.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
joanagmaia and others added 3 commits June 29, 2026 18:17
Four follow-ups from PR #4277 review, all aimed at preserving prior
behaviour rather than changing the feature:

- compare_and_update_maintainers: guard the end-date pass with a "still
  mentioned in source" check. The old github_username-keyed dict always
  included extracted maintainers regardless of identity-lookup outcome,
  so a transient resolution failure never wrongly end-dated a maintainer
  still present in the file. Restore that invariant under the new
  identity-based keying.
- get_maintainers_for_repo: add mi."endDate" IS NULL. Reinstates the
  reactivation path the old platform='github' filter incidentally
  provided for email-linked maintainers (end-dated rows fell out of the
  comparison set, hit the "new maintainer" branch, and got reactivated
  via upsert's ON CONFLICT ... SET "endDate" = NULL).
- insert_new_maintainers: restore concurrent upserts via
  asyncio.gather + Semaphore(MAX_CONCURRENT_CHUNKS). The earlier refactor
  accidentally serialised them, regressing first-run performance on
  large MAINTAINERS files.
- _resolve_maintainers: use self.MAX_CONCURRENT_CHUNKS instead of the
  hardcoded literal so concurrency tuning stays in one place.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Trims the explanatory comments added in the earlier docs commit down to
one-line whys, dropping the restatement of code behaviour. No behaviour
change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
Comment thread services/apps/git_integration/src/crowdgit/database/crud.py Outdated
- Scope the safety guard to extractor entries whose identity resolution
  failed this run, and match by identifier kind (github-username vs email)
  so role changes correctly end-date the stale (identityId, role) row and
  a same-named handle on another platform cannot shield an unrelated
  identity from end-dating.
- Restore verified=TRUE on the read path and add it to find_github_identity
  so reads and writes are symmetric; return mem.platform/mem.type so the
  guard can disambiguate identifier kinds.
- Rename the misleading github_username alias to identity_value and add a
  dedicated github-username query for the fork/parent-repo filter path.

Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Copilot AI review requested due to automatic review settings June 30, 2026 11:37

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4299076. Configure here.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings June 30, 2026 11:41

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@joanagmaia joanagmaia requested a review from mbani01 June 30, 2026 13:19
@joanagmaia joanagmaia merged commit d4833ae into main Jul 2, 2026
15 checks passed
@joanagmaia joanagmaia deleted the fix/DE-980-email-only-maintainer-link branch July 2, 2026 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants