Error Handling Patterns for Long-Running Video Jobs

Transient failures, cold-starts, and timeouts need different responses. A decision tree for what to retry and what to bail on.


Not every failure is the same. A 503 from a warming worker should not be retried the same way as a prompt-level content refusal. A decision tree for what to do with each class keeps your pipeline from either burning credits on futile retries or bailing on jobs that would have worked.

The three classes

There are only three failure shapes worth naming. Transient failures are network blips and transient 5xx from the platform. Cold-starts show up as a long IN_QUEUE time when your model has not been used recently. Timeouts are jobs that stay IN_PROGRESS past your expected wall clock.

A decision tree with three error classes

Transient, retry fast

Retry twice, with exponential backoff starting at 500ms. If it fails three times in a row, your problem is not transient. Stop retrying and log.

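```javascript
// Assumes the @fal-ai/client package; adjust the import to match your setup.
import { fal } from "@fal-ai/client";

// Run fn up to `attempts` times, sleeping 500ms, then 1000ms, between tries.
async function withRetry(fn, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      // Out of attempts: surface the last error to the caller.
      if (i === attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, 500 * Math.pow(2, i)));
    }
  }
}

// `prompt` is assumed to be defined by the caller.
const result = await withRetry(() =>
  fal.subscribe("fal-ai/wan/v2.7/text-to-video", { input: { prompt, duration: 5 } })
);
```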
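
Note that `withRetry` retries every error, including ones that can never succeed. A minimal sketch of a filtered variant follows, assuming the thrown error exposes a numeric HTTP `status` or a Node-style `code`. The names `isTransient` and `withTransientRetry` are hypothetical, and the real error shape depends on your client, so check it before relying on this.

```javascript
// Hypothetical predicate: 5xx responses and connection resets count as transient.
// Assumes the error carries a numeric `status` (HTTP) or a Node-style `code`.
function isTransient(err) {
  if (typeof err?.status === "number") {
    return err.status >= 500 && err.status <= 599;
  }
  return err?.code === "ECONNRESET";
}

// Same backoff as withRetry, but non-transient errors are rethrown immediately.
async function withTransientRetry(fn, attempts = 3, baseMs = 500) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransient(err) || i === attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
    }
  }
}
```

With this shape, a 4xx response fails on the first attempt instead of consuming all three.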

Cold-start, wait, do not retry

If your status shows IN_QUEUE for longer than you expect, do not cancel. Canceling a warming job and resubmitting starts a new warm from scratch. Wait. Wan 2.7 and Veo 3.1 typically warm within 45 seconds of first submission. If you run no jobs for hours, plan for the first one to take longer.
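
The wait-and-do-not-cancel rule can be sketched as a polling loop that reports a slow queue but has no cancel path at all. This is an illustration under assumptions: `waitThroughQueue` and `getStatus` are hypothetical names, `getStatus` is any async function you supply that resolves to an object with a `status` string, and the 45-second default simply mirrors the typical warm time mentioned above.

```javascript
// Polls a job until it leaves the queue. It never cancels: a long IN_QUEUE
// is reported once through onSlow, and polling simply continues.
// getStatus is a hypothetical async function resolving to { status: "IN_QUEUE" | ... }.
async function waitThroughQueue(
  getStatus,
  { pollMs = 2000, warnAfterMs = 45000, onSlow = () => {} } = {}
) {
  const startedAt = Date.now();
  let warned = false;
  for (;;) {
    const { status } = await getStatus();
    const queuedMs = Date.now() - startedAt;
    if (status !== "IN_QUEUE") return { status, queuedMs };
    if (!warned && queuedMs > warnAfterMs) {
      warned = true;
      onSlow(queuedMs); // log or alert, but keep waiting
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
}
```

With the fal client, `getStatus` would presumably wrap its queue status call for your request ID; check the client documentation for the exact signature.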

Timeout, classify before retry

A job that has been IN_PROGRESS for more than 3x your normal wall clock is stuck. Cancel it. Then check your prompt. If your prompt has no obvious issue, resubmit. If your prompt has a long unusual phrase or an unsupported aspect, rewrite before you retry.
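
As a rough sketch, the stuck check and the follow-up decision can be kept as two small pure functions. Both names are hypothetical, `promptLooksFine` stands in for whatever prompt inspection you do by hand or in code, and the `"stop-and-log"` outcome after one failed resubmit is an assumption rather than something stated above.

```javascript
// A job counts as stuck once it has been IN_PROGRESS for more than
// `factor` times the normal wall clock (3x, per the rule above).
function isStuck(status, elapsedMs, normalMs, factor = 3) {
  return status === "IN_PROGRESS" && elapsedMs > factor * normalMs;
}

// What to do after cancelling a stuck job.
// promptLooksFine: result of your own prompt inspection (hypothetical input).
// resubmits: how many times this job has already been resubmitted.
function afterCancel({ promptLooksFine, resubmits }) {
  if (!promptLooksFine) return "rewrite-prompt";
  // Assumption: allow a single resubmit, then stop and log.
  return resubmits < 1 ? "resubmit" : "stop-and-log";
}
```

For a job that normally takes 60 seconds, `isStuck("IN_PROGRESS", 200000, 60000)` is true because 200 seconds exceeds the 180-second threshold.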

A stopwatch with a stuck progress bar

Content refusals are not transient

If the platform returns a content-moderation-style error, your prompt is the problem, not the runtime. Never put this in a retry loop. It will never work, and you will get rate-limited on top of it.

A decision table you can drop in

| Status or error | Action |
| --- | --- |
| HTTP 5xx, connection reset | Retry with backoff, 3 attempts max |
| HTTP 429 | Back off 10 seconds, reduce concurrency |
| IN_QUEUE > 120s on first call | Wait, do not cancel |
| IN_PROGRESS > 3x normal | Cancel, inspect prompt, resubmit once |
| Content refusal | Log, surface to user, do not retry |
| Invalid input | Log, fix schema, do not retry |
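
One way to carry the table into code is an ordered list of rules, each a predicate plus an action label. This is only a sketch: the field names on the observation object (`httpStatus`, `connectionReset`, `jobStatus`, `firstCall`, `elapsedMs`, `normalMs`, `contentRefusal`, `invalidInput`) and the action strings are hypothetical, and how you detect a refusal or invalid input depends on the errors your platform actually returns.

```javascript
// Each table row as a predicate plus an action label. Non-retryable cases
// come first so they can never fall through to a retry rule.
const RULES = [
  { when: (o) => o.contentRefusal === true, action: "log-surface-no-retry" },
  { when: (o) => o.invalidInput === true, action: "log-fix-schema-no-retry" },
  { when: (o) => o.httpStatus === 429, action: "backoff-10s-reduce-concurrency" },
  {
    when: (o) => (o.httpStatus >= 500 && o.httpStatus <= 599) || o.connectionReset === true,
    action: "retry-with-backoff-max-3",
  },
  {
    when: (o) => o.jobStatus === "IN_QUEUE" && o.firstCall === true && o.elapsedMs > 120000,
    action: "wait-do-not-cancel",
  },
  {
    when: (o) => o.jobStatus === "IN_PROGRESS" && o.elapsedMs > 3 * o.normalMs,
    action: "cancel-inspect-resubmit-once",
  },
];

// Returns the action for the first matching rule, or "no-action" if none match.
function decideAction(obs) {
  const rule = RULES.find((r) => r.when(obs));
  return rule ? rule.action : "no-action";
}
```

Keeping the rules as data means a change to a threshold or to the rule order is a one-line edit rather than a rewrite of nested conditionals.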

The quiet failure

The worst failure is the one where the job returns a URL but the video is wrong, too short, or off-prompt. That is not an error class. It is a review problem. Do not try to catch it in your retry logic. Catch it in your review pass with a short human check.