fix: link email-only maintainers on incremental runs (DE-980)#4277
Conversation
Two related bugs caused email-only maintainers (e.g. Joshua Crofts in the linux MAINTAINERS file) to never get linked to memberIdentities after the first run for a repo: - compare_and_update_maintainers keyed both current and new maintainers by github_username. Extractions where the AI returns "unknown" (~3959 entries in the kernel file) collapse into a single dict slot, silently dropping the rest. Now keyed by (identityId, role), matching the unique index on maintainersInternal. - get_maintainers_for_repo filtered current maintainers to platform='github' AND type='username', excluding maintainers linked via email. The comparison set was therefore incomplete. Filter relaxed to only deletedAt IS NULL. Identity resolution is also unified: a single _resolve_identity helper falls back from github username to email when the username is missing or "unknown", so the email path is exercised on both first and incremental runs. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
…E-980) Adds inline comments to the fix from the previous commit so future readers understand: - why get_maintainers_for_repo intentionally has no platform/type/verified filters (the read must mirror what the write path persists), - why _resolve_identity falls back from github username to email, - why _resolve_maintainers is a shared helper, and - why compare_and_update_maintainers keys by (identityId, role) instead of github_username. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
There was a problem hiding this comment.
Pull request overview
This PR fixes incremental maintainer syncing in services/apps/git_integration so email-only maintainers (common in large MAINTAINERS files like the Linux kernel’s) can be linked to existing identities on subsequent runs, instead of being silently skipped due to username-based key collisions and overly strict read-side filtering.
Changes:
- Centralizes identity lookup with a GitHub-username-first, email-fallback resolver and reuses it across first-run and incremental paths.
- Reworks incremental diffing to compare maintainers by
(identityId, role)rather thangithub_usernameto avoid collapsing"unknown"entries. - Relaxes the current-state query to include email-linked identities in the read set for incremental comparisons.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py | Adds centralized identity resolution + switches incremental comparison to identity-based keys; refactors first-run insertion to share the same resolution semantics. |
| services/apps/git_integration/src/crowdgit/database/crud.py | Relaxes get_maintainers_for_repo filtering so incremental reads can “see” maintainers stored via email-linked identities. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Four follow-ups from PR #4277 review, all aimed at preserving prior behaviour rather than changing the feature: - compare_and_update_maintainers: guard the end-date pass with a "still mentioned in source" check. The old github_username-keyed dict always included extracted maintainers regardless of identity-lookup outcome, so a transient resolution failure never wrongly end-dated a maintainer still present in the file. Restore that invariant under the new identity-based keying. - get_maintainers_for_repo: add mi."endDate" IS NULL. Reinstates the reactivation path the old platform='github' filter incidentally provided for email-linked maintainers (end-dated rows fell out of the comparison set, hit the "new maintainer" branch, and got reactivated via upsert's ON CONFLICT ... SET "endDate" = NULL). - insert_new_maintainers: restore concurrent upserts via asyncio.gather + Semaphore(MAX_CONCURRENT_CHUNKS). The earlier refactor accidentally serialised them, regressing first-run performance on large MAINTAINERS files. - _resolve_maintainers: use self.MAX_CONCURRENT_CHUNKS instead of the hardcoded literal so concurrency tuning stays in one place. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Trims the explanatory comments added in the earlier docs commit down to one-line whys, dropping the restatement of code behaviour. No behaviour change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
- Scope the safety guard to extractor entries whose identity resolution failed this run, and match by identifier kind (github-username vs email) so role changes correctly end-date the stale (identityId, role) row and a same-named handle on another platform cannot shield an unrelated identity from end-dating. - Restore verified=TRUE on the read path and add it to find_github_identity so reads and writes are symmetric; return mem.platform/mem.type so the guard can disambiguate identifier kinds. - Rename the misleading github_username alias to identity_value and add a dedicated github-username query for the fork/parent-repo filter path. Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
Signed-off-by: Joana Maia <jmaia@contractor.linuxfoundation.org>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 4299076. Configure here.

Summary
Email-only maintainers in large
MAINTAINERSfiles — most visibly the Linux kernel repo (torvalds/linux) — were not being linked inmaintainersInternal.identityIdon incremental runs, even when the corresponding identity already existed inmemberIdentities. Concrete example that motivated the ticket:Joshua Crofts <joshua.crofts1@gmail.com>(VEML3328 driver) — his identity was created2026-05-30, but the kernel repo's incremental sync has been running daily since 2026-01-13 and never linked him.Measured impact on the kernel repo: only 1980 of 2303 maintainers whose identities already exist in
memberIdentitiesare actually linked inmaintainersInternal— i.e. 323 resolvable identities (~14%) are silently dropped on every incremental run.Root cause
Two independent bugs in the incremental code path. Both were invisible on first-run repos because
insert_new_maintainersdoesn't share any of this logic.Bug 1 —
compare_and_update_maintainerskeyed bygithub_usernamecompare_and_update_maintainersbuilds dicts of current vs. new maintainers usinggithub_usernameas the key:The AI extractor frequently emits
github_username=\"unknown\"when the source only carries an email — for the kernelMAINTAINERSfile that's 3959 of 4216 entries. All 3959 collapse into a single\"unknown\"slot in each dict, so at most one of them survives per run; the other 3958 are silently dropped. (The remaining 257 entries with real github usernames are processed normally.)Bug 2 —
get_maintainers_for_repofiltered out email-linked identitiesThe current-state query restricted the join with:
But the write path uses
find_maintainer_identity_by_emailwhich matchesplatform IN ('github','git','gitlab'). The kernel email identities resolve toplatform='git'rows, so they were excluded from the comparison set entirely — leading to spurious re-inserts (idempotent viaON CONFLICT, so harmless on its own) and a broken view of "what's currently linked".The general rule: the read filter must not be stricter than the write filter, or the diff loses sight of rows that were legitimately persisted.
Why incremental-only
insert_new_maintainerswas already iterating maintainers directly without keying by username, and usedfind_github_identity→ email fallback per maintainer. So on first run every email-only maintainer whose identity existed at that moment was linked. The bug only manifested on subsequent runs — any identity created after a repo's first run (Joshua's case: identity created 2026-05-30, repo's first run was 2026-01-13) could never be picked up, and any incremental run on a repo with many "unknown" extractions silently collapsed them down to one.Fix
_resolve_identity(github → email fallback) and_resolve_maintainers(concurrent batch resolver capped atMAX_CONCURRENT_CHUNKS). Both first-run (insert_new_maintainers) and incremental (compare_and_update_maintainers) paths share identical lookup semantics.compare_and_update_maintainerskeys current and new maintainer sets by(identityId, role)instead ofgithub_username. Matches the unique index onmaintainersInternal (repoId, identityId, role)and stops\"unknown\"entries from collapsing into one slot.get_maintainers_for_repoaligned with the write path — replacedplatform='github' AND type='username' AND verified=TRUEwithendDate IS NULL. The write path (find_github_identity/find_maintainer_identity_by_email) is the real gatekeeper for what gets persisted.endDate IS NULLrestricts the comparison set to active rows so a reappearing maintainer falls into the "new" branch and gets reactivated viaupsert_maintainer'sON CONFLICT … SET \"endDate\" = NULLclause (this restores the reactivation behaviour the oldplatform='github'filter incidentally provided for email-linked maintainers).github_username-keyednew_maintainers_dictprovided for free (every extracted entry was unconditionally in the "new" set regardless of lookup outcome).insert_new_maintainers— restoredasyncio.gather+Semaphore(MAX_CONCURRENT_CHUNKS)so first-run processing of large MAINTAINERS files doesn't serialise the thousands of writes.Expected outcome
After the next scheduled run of the kernel repo, the gap between resolvable identities (
memberIdentitiesrows that match a maintainer entry) and linked rows inmaintainersInternalshould close — currently 1980/2303, expected to approach 2303/2303 (modulo any new identities created between this PR landing and the next sync).Test plan
MAINTAINERSfile (4216 maintainers extracted;Joshua Crofts <joshua.crofts1@gmail.com>correctly returned asgithub_username=\"unknown\",email=\"joshua.crofts1@gmail.com\").Joshua Croftsspecifically appears inmaintainersInternal.identityId.JIRA
DE-980
🤖 Generated with Claude Code
Note
Medium Risk
Changes maintainer persistence and end-dating logic on recurring syncs; incorrect guards could leave stale rows or skip legitimate end-dates, though the PR adds explicit safety around unresolved identities.
Overview
Fixes incremental maintainer sync so email-only entries (e.g. Linux
MAINTAINERSwithgithub_username="unknown") are linked instead of collapsed or ignored.Shared resolution —
_resolve_identity/_resolve_maintainers(GitHub username then email, concurrent) are used by both first-runinsert_new_maintainersand incrementalcompare_and_update_maintainers.find_github_identitynow requiresverified = TRUE.Incremental diff — Comparison keys
(identityId, role)instead ofgithub_username, so thousands of"unknown"rows no longer overwrite each other.get_maintainers_for_reporeturns active rows (endDate IS NULL) with identityvalue/platform/type(not GitHub-username-only), aligned with how rows are written.End-dating — Rows missing from the resolved set are end-dated unless a kind-aware guard sees the same GitHub username or email still present in the file but identity lookup failed this run (avoids false removals on transient lookup failure).
Fork filtering — Parent-repo dedup uses new
get_github_maintainer_usernames_for_repobecause the broadened maintainer query includes email-linked identities.Reviewed by Cursor Bugbot for commit 9c35de2. Bugbot is set up for automated code reviews on this repo. Configure here.