Skip to content

Replace TrinoResult generator with class-based iterator for error recovery#612

Open
srchilukoori wants to merge 1 commit into
trinodb:masterfrom
srchilukoori:fix/trino-result-iterator-recovery
Open

Replace TrinoResult generator with class-based iterator for error recovery#612
srchilukoori wants to merge 1 commit into
trinodb:masterfrom
srchilukoori:fix/trino-result-iterator-recovery

Conversation

@srchilukoori

Copy link
Copy Markdown
Contributor

Problem

Fixes #598.

TrinoResult.__iter__ uses a generator (yield). Python generators are permanently finalized after an unhandled exception propagates through the yield point — all subsequent next() calls raise StopIteration. Since #597 introduced lazy segment downloads during iteration, a transient I/O error (e.g. S3 timeout) mid-iteration kills the generator. fetchone() then returns None as if the query completed, silently dropping remaining rows.

Root cause

Generator frame finalization is a Python language constraint, not a bug in the iteration logic itself. Once an exception propagates through yield, the generator's frame is deallocated and cannot be resumed.

Fix

Replace the generator with a class-based iterator (__iter__ returning self, __next__ holding state in instance fields). The iteration logic is unchanged — batch prefetch via fetch() before exposing rows, same _rownumber tracking. The difference is that instance fields (_row_iter, _next_rows) survive exceptions, so callers can resume iteration after catching a transient error.

Testing

Added test_trino_result_iterator_survives_transient_error — injects a FailOnceIterator that raises IOError on the 3rd element, verifies the iterator resumes and returns remaining rows after the caller catches the exception.

Full unit test suite passes (123 tests, 0 failures).

Python generators are permanently finalized after an unhandled exception
propagates through yield. With lazy spooled segment iteration (PR trinodb#597),
a transient I/O error (e.g. S3 timeout) during iteration kills the
generator. Subsequent fetchone() calls return None as if the query
completed normally, silently dropping remaining rows.

Replace the generator-based __iter__ with __iter__/__next__ on the class
itself. Instance fields (_row_iter, _next_rows) survive exceptions, so
the iterator can resume on the next next() call.

Fixes trinodb#598
@cla-bot cla-bot Bot added the cla-signed label Jun 27, 2026
@wendigo wendigo requested a review from azawlocki-sbdt June 29, 2026 17:37
Comment thread trino/client.py
Comment on lines +857 to +871
if self._row_iter is None:
self._row_iter = iter(self._rows)
self._next_rows = self._query.fetch() if not self._query.finished else None

self._rows = next_rows
while True:
try:
row = next(self._row_iter)
self._rownumber += 1
return row
except StopIteration:
if self._next_rows is None:
raise
self._rows = self._next_rows
self._row_iter = iter(self._rows)
self._next_rows = self._query.fetch() if not self._query.finished else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to work correctly for the spooling case, when I/O errors originate from next(self._row_iter).

But what if the error is raised by self._query.fetch() instead?

As I understand, in the spooling case it is not an issue, since self._query.fetch() does not occur once the list of segments is fetched right at the beginning. Except maybe when it does not fit in a single response page? But it can occur if the direct protocol is used and self._query.fetch() is called to get the next result page.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately, what contract do we want the iterator to have?

Do we want to guarantee that it can be reused after an I/O error, and return more data without duplication, possibly skipping over some rows? It works that way for spooling but I'm not sure if it works that way for the direct protocol. For example, an error in self._query.fetch() in line 871 will leave self._next_rows unchanged and I think the iterator would go over them second time after the error.

@azawlocki-sbdt azawlocki-sbdt Jun 30, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought: if we assume s3 communication is more unreliable than coordinator communication, as #598 seems to assume, maybe we should just have a separate retry/backoff policy for s3 downloading?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hashhar, you're the reporter of #598, WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

TrinoResult generator cannot recover from errors during lazy segment iteration

2 participants