Replace TrinoResult generator with class-based iterator for error recovery#612
Replace TrinoResult generator with class-based iterator for error recovery#612srchilukoori wants to merge 1 commit into
Conversation
Python generators are permanently finalized after an unhandled exception propagates through yield. With lazy spooled segment iteration (PR trinodb#597), a transient I/O error (e.g. S3 timeout) during iteration kills the generator. Subsequent fetchone() calls return None as if the query completed normally, silently dropping remaining rows. Replace the generator-based __iter__ with __iter__/__next__ on the class itself. Instance fields (_row_iter, _next_rows) survive exceptions, so the iterator can resume on the next next() call. Fixes trinodb#598
| if self._row_iter is None: | ||
| self._row_iter = iter(self._rows) | ||
| self._next_rows = self._query.fetch() if not self._query.finished else None | ||
|
|
||
| self._rows = next_rows | ||
| while True: | ||
| try: | ||
| row = next(self._row_iter) | ||
| self._rownumber += 1 | ||
| return row | ||
| except StopIteration: | ||
| if self._next_rows is None: | ||
| raise | ||
| self._rows = self._next_rows | ||
| self._row_iter = iter(self._rows) | ||
| self._next_rows = self._query.fetch() if not self._query.finished else None |
There was a problem hiding this comment.
Seems to work correctly for the spooling case, when I/O errors originate from next(self._row_iter).
But what if the error is raised by self._query.fetch() instead?
As I understand, in the spooling case it is not an issue, since self._query.fetch() does not occur once the list of segments is fetched right at the beginning. Except maybe when it does not fit in a single response page? But it can occur if the direct protocol is used and self._query.fetch() is called to get the next result page.
There was a problem hiding this comment.
Ultimately, what contract do we want the iterator to have?
Do we want to guarantee that it can be reused after an I/O error, and return more data without duplication, possibly skipping over some rows? It works that way for spooling but I'm not sure if it works that way for the direct protocol. For example, an error in self._query.fetch() in line 871 will leave self._next_rows unchanged and I think the iterator would go over them second time after the error.
There was a problem hiding this comment.
Another thought: if we assume s3 communication is more unreliable than coordinator communication, as #598 seems to assume, maybe we should just have a separate retry/backoff policy for s3 downloading?
Problem
Fixes #598.
TrinoResult.__iter__uses a generator (yield). Python generators are permanently finalized after an unhandled exception propagates through the yield point — all subsequentnext()calls raiseStopIteration. Since #597 introduced lazy segment downloads during iteration, a transient I/O error (e.g. S3 timeout) mid-iteration kills the generator.fetchone()then returnsNoneas if the query completed, silently dropping remaining rows.Root cause
Generator frame finalization is a Python language constraint, not a bug in the iteration logic itself. Once an exception propagates through
yield, the generator's frame is deallocated and cannot be resumed.Fix
Replace the generator with a class-based iterator (
__iter__returningself,__next__holding state in instance fields). The iteration logic is unchanged — batch prefetch viafetch()before exposing rows, same_rownumbertracking. The difference is that instance fields (_row_iter,_next_rows) survive exceptions, so callers can resume iteration after catching a transient error.Testing
Added
test_trino_result_iterator_survives_transient_error— injects aFailOnceIteratorthat raisesIOErroron the 3rd element, verifies the iterator resumes and returns remaining rows after the caller catches the exception.Full unit test suite passes (123 tests, 0 failures).