We might be able to squeeze a lot more performance out of our API calls in GitHub events collection.
Sonnet 4.6 has identified that we are not being efficient for large repos:
events.py already has a good bulk path (BulkGithubEventCollection), but it falls back to ThoroughGithubEventCollection when a repo has more than 300 pages of events. In that fallback:
Lines 312–333: loops over every issue in the DB and makes an individual GET .../issues/{issue_number}/events request for each one
Lines 375–396: does the same for every PR
The trigger condition (lines 49–60, the 300-page cutoff) means the largest, most active repos hit the worst path. GraphQL timelineItems would solve this.
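To make the GraphQL suggestion concrete, here is a minimal sketch of what a batched timelineItems query could look like. It is not taken from events.py: the function names (build_request, parse_page, collect_issue_events), the batch sizes, and the three event types selected are all assumptions chosen for illustration. The GraphQL node shape also differs from the REST event shape, so an adapter would probably be needed before passing results to extract_issue_event_data.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative subset of event types; a real collector would list every type it stores.
WANTED_TYPES = {"ClosedEvent", "ReopenedEvent", "LabeledEvent"}

# One request covers up to 50 issues with up to 100 timeline items each,
# instead of one REST request per issue. Batch sizes are assumptions.
ISSUE_TIMELINE_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 50, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        number
        timelineItems(first: 100) {
          pageInfo { hasNextPage endCursor }
          nodes {
            __typename
            ... on ClosedEvent { createdAt actor { login } }
            ... on ReopenedEvent { createdAt actor { login } }
            ... on LabeledEvent { createdAt actor { login } label { name } }
          }
        }
      }
    }
  }
}
"""


def build_request(owner, repo, token, cursor=None):
    """Build (but do not send) the POST request for one page of issues."""
    payload = {
        "query": ISSUE_TIMELINE_QUERY,
        "variables": {"owner": owner, "name": repo, "cursor": cursor},
    }
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_page(response):
    """Return (events, truncated_issue_numbers, next_cursor) for one response.

    events is a list of (issue_number, timeline_node) pairs. An issue is
    'truncated' when it has more timeline items than fit in this page.
    """
    issues = response["data"]["repository"]["issues"]
    events, truncated = [], []
    for issue in issues["nodes"]:
        timeline = issue["timelineItems"]
        for node in timeline["nodes"]:
            # Nodes of other types come back with only __typename; skip them.
            if node["__typename"] in WANTED_TYPES:
                events.append((issue["number"], node))
        if timeline["pageInfo"]["hasNextPage"]:
            truncated.append(issue["number"])
    page_info = issues["pageInfo"]
    next_cursor = page_info["endCursor"] if page_info["hasNextPage"] else None
    return events, truncated, next_cursor


def collect_issue_events(owner, repo, token):
    """Yield (issue_number, timeline_node) pairs for every issue in the repo."""
    cursor = None
    while True:
        request = build_request(owner, repo, token, cursor)
        with urllib.request.urlopen(request) as resp:
            page = json.load(resp)
        events, truncated, cursor = parse_page(page)
        yield from events
        # Issues listed in `truncated` would need a follow-up query that
        # pages their timelineItems; that step is omitted from this sketch.
        if cursor is None:
            break
```

With these assumed batch sizes, a repo with 10,000 issues would need roughly 200 requests for the first pass rather than 10,000, plus follow-ups for any issue with more than 100 timeline items. The same pattern should apply to PRs via the pullRequests connection.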
https://github.com/chaoss/CollectOSS/blob/96adf3a4d68725db21622673ee6613693c0f5ace/collectoss/tasks/github/events.py#L312-L321
This feeds the same extract_issue_event_data function as the Bulk collection path, so it seems unlikely that we are getting different or more thorough data from the call-by-call method (if we were, then why would we be skipping it for repos with fewer events?)
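One way to check this would be to regroup the bulk results by issue and compare them with what the per-issue calls return. The sketch below assumes that each event from the repository-wide endpoint carries an embedded issue object with a number field, as GitHub's REST docs describe for that endpoint; group_events_by_issue is a hypothetical helper, not something that exists in events.py.

```python
from collections import defaultdict


def group_events_by_issue(bulk_events):
    """Group repository-wide issue events by the issue number they belong to.

    Assumes each event dict has an "issue" object with a "number" key;
    events without one are skipped.
    """
    grouped = defaultdict(list)
    for event in bulk_events:
        issue = event.get("issue")
        if issue is None:
            continue
        grouped[issue["number"]].append(event)
    return dict(grouped)
```

If the grouped bulk events for a sample of issues match the per-issue responses, that would support dropping the call-by-call path for large repos too.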
(Note: this issue is very similar to several others. Be careful to make sure you are talking about the same issue in the comments.)