Add Config.HardStopTimeout to perform a "hard stop" setting jobs errored#1289
Add Config.HardStopTimeout to perform a "hard stop" setting jobs errored#1289brandur wants to merge 1 commit into
Config.HardStopTimeout to perform a "hard stop" setting jobs errored#1289Conversation
1234376 to
c5151af
Compare
|
|
||
| var setStateParams *riverdriver.JobSetStateIfRunningParams | ||
| if job.Attempt >= job.MaxAttempts { | ||
| setStateParams = riverdriver.JobSetStateDiscarded(job.ID, now, errData, nil) |
There was a problem hiding this comment.
This mirrors existing behavior where a soft stop will set an error and potentially send the job to discarded, but looking at this again, this existing behavior does seem potentially wrong.
c5151af to
7342765
Compare
While working on #1289, I realized that jobs which are "soft stopped" via context cancellation are still prone to the same side effects as if they errored in any other way: * Their number of attempts is incremented. * They may be discarded if reaching max attempts. * They'll have to wait to be retried according to retry policy. This doesn't really seem right because these jobs didn't actually misbehave in any way, but were rather just slow-to-run jobs that couldn't finish cleanly inside the default stop allowance while a client was restarting or being deployed. The proper behavior should probably be more like a snooze. i.e. The soft timeout cancellation doesn't count and the jobs get a chance to be retried immediately. Here, make that change.
While working on #1289, I realized that jobs which are "soft stopped" via context cancellation are still prone to the same side effects as if they errored in any other way: * Their number of attempts is incremented. * They may be discarded if reaching max attempts. * They'll have to wait to be retried according to retry policy. This doesn't really seem right because these jobs didn't actually misbehave in any way, but were rather just slow-to-run jobs that couldn't finish cleanly inside the default stop allowance while a client was restarting or being deployed. The proper behavior should probably be more like a snooze. i.e. The soft timeout cancellation doesn't count and the jobs get a chance to be retried immediately. Here, make that change.
bgentry
left a comment
There was a problem hiding this comment.
Looks good aside form ordering nit and missing changelog ![]()
| // HardStopTimeout is the maximum amount of time that the client will wait | ||
| // after job contexts are cancelled during shutdown before forcing jobs still | ||
| // running to an errored state. This hard stop phase lets jobs be retried | ||
| // immediately on the next client start instead of waiting for rescue. | ||
| // | ||
| // The timer starts only after a soft stop has begun by cancelling job | ||
| // contexts, like after SoftStopTimeout elapses, StopAndCancel is called, or | ||
| // the Start context is cancelled without SoftStopTimeout configured. | ||
| // | ||
| // Defaults to no timeout (hard stop disabled). | ||
| HardStopTimeout time.Duration |
There was a problem hiding this comment.
Missed the alphabetical sort order here
There was a problem hiding this comment.
Oops, quite right. I renamed this a couple times which is probably what happened.
7342765 to
7530cde
Compare
…rored Here, add a new `Config.HardStopTimeout` on top of the existing `SoftStopTimeout` whose job it is to recover badly behaving job as much as possible before coming to a full stop. Currently, if a client is stopping and is running jobs that don't respond to context cancellation, those jobs end up getting left in a `running` state, which means that they won't be recoverable again until they're rescued an hour later. `HardStopTimeout` engages after soft stop, and has each producer perform a "hard stop", which means to have it set any jobs still running to an error state. Because they're errored, they'll get to run immediately the next time a client starts up. Ideally, users don't need to depend on this functionality since the "correct" behavior would be to make sure that all jobs are able to respond to context cancellation, so we make this new feature optional.
7530cde to
8a7f786
Compare
|
@brandur did you want to get this one into the release? |
While working on #1289, I realized that jobs which are "soft stopped" via context cancellation are still prone to the same side effects as if they errored in any other way: * Their number of attempts is incremented. * They may be discarded if reaching max attempts. * They'll have to wait to be retried according to retry policy. This doesn't really seem right because these jobs didn't actually misbehave in any way, but were rather just slow-to-run jobs that couldn't finish cleanly inside the default stop allowance while a client was restarting or being deployed. The proper behavior should probably be more like a snooze. i.e. The soft timeout cancellation doesn't count and the jobs get a chance to be retried immediately. Here, make that change.
|
I would, but I guess it's prone to the same considerations you brought up in #1290. Probably easier to do a fast follow up with a couple stopping improvements after 0.40.0 is out. |
Here, add a new
Config.HardStopTimeouton top of the existingSoftStopTimeoutwhose job it is to recover badly behaving job as muchas possible before coming to a full stop. Currently, if a client is
stopping and is running jobs that don't respond to context cancellation,
those jobs end up getting left in a
runningstate, which means thatthey won't be recoverable again until they're rescued an hour later.
HardStopTimeoutengages after soft stop, and has each producer performa "hard stop", which means to have it set any jobs still running to an
error state. Because they're errored, they'll get to run immediately the
next time a client starts up.
Ideally, users don't need to depend on this functionality since the
"correct" behavior would be to make sure that all jobs are able to
respond to context cancellation, so we make this new feature optional.