events collection could be more efficient #422

Description

@MoralCode

We might be able to squeeze a lot more performance out of our API calls in GitHub events collection.

Sonnet 4.6 has identified that we are not being efficient for large repos:

events.py already has a good bulk path (BulkGithubEventCollection), but it falls back to ThoroughGithubEventCollection when a repo has >300 pages of events. In that fallback:

Lines 312–333: loops over every issue in the DB → one individual GET .../issues/{issue_number}/events per issue
Lines 375–396: the same, once per PR
The trigger condition (lines 49–60, the 300-page cutoff) means the largest, most active repos hit the worst path. GraphQL timelineItems would solve this.
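As a rough illustration of the timelineItems idea, here is a minimal sketch that fetches timeline items for a whole page of issues in one GraphQL request instead of one REST call per issue. The connection and field names (`issues`, `timelineItems`, `pageInfo`, `ClosedEvent`, `LabeledEvent`) come from GitHub's public GraphQL schema. The page sizes, the selected event types, and the helper names `build_payload` and `flatten_events` are assumptions made for this example and are not part of events.py.

```python
# Hypothetical query: the fields exist in GitHub's GraphQL schema, but which
# event types to select and the page sizes (50 issues, 100 items) are assumptions.
TIMELINE_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 50, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        number
        timelineItems(first: 100) {
          pageInfo { hasNextPage endCursor }
          nodes {
            __typename
            ... on ClosedEvent { createdAt actor { login } }
            ... on LabeledEvent { createdAt actor { login } label { name } }
          }
        }
      }
    }
  }
}
"""


def build_payload(owner, name, cursor=None):
    """Build the JSON body for one POST to the GraphQL endpoint (hypothetical helper)."""
    return {
        "query": TIMELINE_QUERY,
        "variables": {"owner": owner, "name": name, "cursor": cursor},
    }


def flatten_events(response):
    """Turn one GraphQL response into flat event rows (hypothetical helper).

    Returns (events, needs_followup, next_cursor):
      events         - one dict per timeline node
      needs_followup - issue numbers whose timeline has more than one page
      next_cursor    - cursor for the next page of issues, or None when done
    """
    issues = response["data"]["repository"]["issues"]
    events = []
    needs_followup = []
    for issue in issues["nodes"]:
        timeline = issue["timelineItems"]
        for node in timeline["nodes"]:
            # Nodes of types not selected in the query carry only __typename,
            # and actor can be null, so read both defensively.
            events.append({
                "issue_number": issue["number"],
                "event": node["__typename"],
                "created_at": node.get("createdAt"),
                "actor": (node.get("actor") or {}).get("login"),
            })
        if timeline["pageInfo"]["hasNextPage"]:
            needs_followup.append(issue["number"])
    page = issues["pageInfo"]
    next_cursor = page["endCursor"] if page["hasNextPage"] else None
    return events, needs_followup, next_cursor
```

With this shape, only issues listed in `needs_followup` would need extra per-issue requests. Whether the GraphQL event types map cleanly onto what extract_issue_event_data expects is an open question this sketch does not answer.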

https://github.com/chaoss/CollectOSS/blob/96adf3a4d68725db21622673ee6613693c0f5ace/collectoss/tasks/github/events.py#L312-L321
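For a rough sense of why the per-issue loop linked above is expensive, the sketch below compares request counts for the two paths. It assumes the bulk path fetches 100 events per page (GitHub's REST maximum, not confirmed for events.py), and that the thorough path makes at least one request per issue and per PR. The repo sizes in the usage lines are made-up numbers, not measurements.

```python
import math


def bulk_requests(total_events, per_page=100):
    """Requests needed to page through all repo events (per_page is an assumption)."""
    return math.ceil(total_events / per_page)


def thorough_requests_lower_bound(issue_count, pr_count):
    """At least one request per issue and per PR; more if any timeline spans pages."""
    return issue_count + pr_count


# Hypothetical repo just over the 300-page cutoff.
print(bulk_requests(30_100))                          # pages the bulk path would need
print(thorough_requests_lower_bound(20_000, 15_000))  # minimum calls on the fallback path
```

Under these assumptions, a repo that barely crosses the cutoff goes from a few hundred requests to tens of thousands.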

This feeds the same extract_issue_event_data function as the Bulk collection path, so it seems unlikely that we are getting different or more thorough data from the call-by-call method. (If we were, why would we be skipping it for repos with fewer events?)

(Note: this issue is very similar to several others. Please make sure you are talking about the same issue in the comments.)
