Add `Config.HardStopTimeout` to perform a "hard stop" setting jobs errored by brandur · Pull Request #1289 · riverqueue/river

brandur · 2026-06-19T15:07:07Z

Here, add a new Config.HardStopTimeout on top of the existing
SoftStopTimeout whose job it is to recover badly behaving jobs as much
as possible before coming to a full stop. Currently, if a client is
stopping and is running jobs that don't respond to context cancellation,
those jobs end up getting left in a running state, which means that
they won't be recoverable again until they're rescued an hour later.

HardStopTimeout engages after soft stop, and has each producer perform
a "hard stop", which means to have it set any jobs still running to an
error state. Because they're errored, they'll get to run immediately the
next time a client starts up.

Ideally, users don't need to depend on this functionality since the
"correct" behavior would be to make sure that all jobs are able to
respond to context cancellation, so we make this new feature optional.

brandur · 2026-06-19T17:43:33Z

+
+		var setStateParams *riverdriver.JobSetStateIfRunningParams
+		if job.Attempt >= job.MaxAttempts {
+			setStateParams = riverdriver.JobSetStateDiscarded(job.ID, now, errData, nil)


This mirrors existing behavior where a soft stop will set an error and potentially send the job to discarded, but looking at this again, this existing behavior does seem potentially wrong.

While working on #1289, I realized that jobs which are "soft stopped" via context cancellation are still prone to the same side effects as if they errored in any other way:

* Their number of attempts is incremented.
* They may be discarded if reaching max attempts.
* They'll have to wait to be retried according to retry policy.

This doesn't really seem right because these jobs didn't actually misbehave in any way, but were rather just slow-to-run jobs that couldn't finish cleanly inside the default stop allowance while a client was restarting or being deployed. The proper behavior should probably be more like a snooze, i.e. the soft timeout cancellation doesn't count and the jobs get a chance to be retried immediately. Here, make that change.

bgentry

Looks good aside from ordering nit and missing changelog

bgentry · 2026-06-28T22:52:16Z

+	// HardStopTimeout is the maximum amount of time that the client will wait
+	// after job contexts are cancelled during shutdown before forcing jobs still
+	// running to an errored state. This hard stop phase lets jobs be retried
+	// immediately on the next client start instead of waiting for rescue.
+	//
+	// The timer starts only after a soft stop has begun by cancelling job
+	// contexts, like after SoftStopTimeout elapses, StopAndCancel is called, or
+	// the Start context is cancelled without SoftStopTimeout configured.
+	//
+	// Defaults to no timeout (hard stop disabled).
+	HardStopTimeout time.Duration


Missed the alphabetical sort order here

Oops, quite right. I renamed this a couple times which is probably what happened.

bgentry · 2026-07-02T14:21:47Z

@brandur did you want to get this one into the release?

brandur · 2026-07-02T20:12:45Z

I would, but I guess it's prone to the same considerations you brought up in #1290. Probably easier to do a fast follow up with a couple stopping improvements after 0.40.0 is out.

brandur marked this pull request as draft June 19, 2026 15:08

brandur force-pushed the brandur-hard-stop-timeout branch 2 times, most recently from 1234376 to c5151af Compare June 19, 2026 15:16

brandur force-pushed the brandur-hard-stop-timeout branch from c5151af to 7342765 Compare June 19, 2026 17:44

brandur mentioned this pull request Jun 19, 2026

Make jobs cancelled due to a soft stop immediately available #1290

Open

brandur marked this pull request as ready for review June 20, 2026 03:46

brandur requested a review from bgentry June 20, 2026 03:46

brandur mentioned this pull request Jun 23, 2026

River job stuck at running #1258

Open

bgentry approved these changes Jun 29, 2026

brandur force-pushed the brandur-hard-stop-timeout branch from 7342765 to 7530cde Compare June 30, 2026 19:21

brandur force-pushed the brandur-hard-stop-timeout branch from 7530cde to 8a7f786 Compare June 30, 2026 19:34

brandur wants to merge 1 commit into master from brandur-hard-stop-timeout
