Troubleshooting · 3 min read

Debugging IN_PROGRESS Forever Jobs

When a fal job never completes, there are five checks to run before you cancel and resubmit. Cancel is always last.


IN_PROGRESS does not mean broken

A job sitting at IN_PROGRESS longer than you expect is not automatically stuck. Before you cancel and resubmit, run five checks. Cancel is last.

Five step status ladder with magnifier

Check one: logs at the endpoint

fal.queue.status with logs: true returns whatever the worker has printed. Ninety percent of "stuck" jobs tell you exactly what they are waiting on.

TYPESCRIPT
const status = await fal.queue.status("fal-ai/veo3.1", { requestId, logs: true });

console.log(status.status);
status.logs?.forEach((l) => console.log(l.timestamp, l.message));

Recent timestamps mean it is alive. Last log 30 seconds old at 90 percent? Wait. Last log five minutes old at 10 percent? Continue.
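That staleness heuristic can be sketched as a tiny helper. The 60 second and 80 percent thresholds below are assumptions extrapolated from the examples above, not fal guidance; tune them per endpoint.

```typescript
// Decide whether a job with recent-ish logs deserves more patience.
// Thresholds are assumptions based on the examples above, not fal guidance.
type LogVerdict = "wait" | "investigate";

function logVerdict(lastLogMs: number, nowMs: number, progressPct: number): LogVerdict {
  const staleSec = (nowMs - lastLogMs) / 1000;
  if (staleSec <= 60) return "wait"; // fresh logs: the worker is alive
  if (progressPct >= 80) return "wait"; // nearly done: give it a moment
  return "investigate"; // stale and early: continue down the checklist
}
```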

Check two: the endpoint status

A single slow endpoint during a traffic burst is common for Veo 3.1 4K, Kling v3 Pro, or Wan 2.7 at 15 seconds. Queue times of 45 seconds to a minute on those endpoints during peak are not a bug.

If the community is quiet and the endpoint looks healthy, the issue is your payload. Continue.

Check three: the input payload

Two classes of silent-ish failure.

  • Oversized inputs: image-to-video with a 12 MB reference, or audio with a 20 MB WAV. The worker is downloading your asset. Slow storage = long IN_PROGRESS.
  • Ambiguous prompts: extremely long prompts or content that triggers Veo 3.1's auto_fix=true rewrites. Result finishes, but slower.

Look for log messages like "downloading image_url" or "prompt expanded".
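Those markers can be grepped out of the status logs mechanically. A sketch; the marker strings are just the two examples above, not an exhaustive list:

```typescript
// Pull payload-related hints out of fal status logs.
// Marker strings are the two examples from this post; extend as you find more.
function payloadHints(logs: { message: string }[]): string[] {
  const markers = ["downloading image_url", "prompt expanded"];
  return logs.map((l) => l.message).filter((m) => markers.some((k) => m.includes(k)));
}
```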

Check four: region and account concurrency

If you just fired 30 jobs and this is the 31st, you are queued behind a hidden worker limit. Status says IN_QUEUE or IN_PROGRESS but nothing is actively computing.

Count your own in-flight jobs:

SQL
SELECT count(*) FROM generations WHERE status IN ('IN_QUEUE','IN_PROGRESS');

Near your ceiling? That is your answer. Otherwise, submit a tiny Pixverse v6 five second draft (around $0.15 at 360p, no audio). If that flies through, your infra is fine and the original is just waiting its turn.
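The decision in this check reduces to comparing your in-flight count against your plan's ceiling. A sketch, where the ceiling value and the 80 percent canary cutoff are both assumptions:

```typescript
// Classify check four: are you queued behind your own concurrency limit?
// The ceiling comes from your fal plan; the 0.8 canary cutoff is an assumption.
type ConcurrencyVerdict = "queued-behind-self" | "run-canary" | "not-concurrency";

function concurrencyVerdict(inFlight: number, ceiling: number): ConcurrencyVerdict {
  if (inFlight >= ceiling) return "queued-behind-self";
  if (inFlight >= ceiling * 0.8) return "run-canary"; // a cheap draft job settles it
  return "not-concurrency";
}
```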

Check five: the webhook

If you submitted with webhookUrl and your DB still says IN_QUEUE, the job may have completed and you missed the notification. Two ways that breaks.

  • Handler returned a non-2xx. fal retries, then gives up: the job is COMPLETED upstream, but your DB never gets the update.
  • Webhook URL changed during a deploy. Delivery fires at a 404.

Poll directly.

TYPESCRIPT
const final = await fal.queue.result("fal-ai/veo3.1", { requestId });

If this succeeds, the job is done. Fix the webhook, update your DB with the result, move on. No re-generation.
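A reconciliation pass over locally stuck rows can backfill results the webhook dropped. This is a sketch with the queue client and DB injected as minimal interfaces, modeled on the snippets in this post rather than the real SDK types:

```typescript
// Backfill COMPLETED jobs whose webhook delivery was lost.
// QueueClient/Db are hypothetical minimal interfaces, not the real SDK types.
interface QueueClient {
  result(endpoint: string, opts: { requestId: string }): Promise<unknown>;
}
interface Db {
  query(sql: string, params: unknown[]): Promise<void>;
}

async function reconcileStuckRows(
  queue: QueueClient,
  db: Db,
  rows: { endpoint: string; requestId: string }[],
): Promise<string[]> {
  const fixed: string[] = [];
  for (const row of rows) {
    try {
      // If result() resolves, the job finished; only the notification was lost.
      await queue.result(row.endpoint, { requestId: row.requestId });
      await db.query(
        "UPDATE generations SET status='COMPLETED' WHERE request_id=$1",
        [row.requestId],
      );
      fixed.push(row.requestId);
    } catch {
      // result() throwing is treated here as "genuinely still running".
    }
  }
  return fixed;
}
```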

Now, and only now, cancel

Cancel as last resort, new request id stamped

If the logs are stale, the endpoint is healthy, the payload is sane, concurrency is not the culprit, and no webhook result is pending, then cancel.

TYPESCRIPT
await fal.queue.cancel("fal-ai/veo3.1", { requestId });
await db.query("UPDATE generations SET status='CANCELLED' WHERE request_id=$1", [requestId]);

Resubmit with a different strategy:

  • Veo 3.1 at 4K: draft first on Veo 3.1 Lite at $0.05/sec.
  • Wan 2.7 at 15s: split into two 8 second clips.
  • Image-to-video with a large asset: upload to fal.storage first.
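When you automate resubmission, the three strategies above can be folded into one helper. A sketch; the 10 MB cutoff and the endpoint-name substring matching are assumptions for illustration:

```typescript
// Pick a resubmit strategy from the failed job's shape.
// The 10 MB cutoff and substring matching are assumptions for illustration.
function resubmitStrategy(job: {
  endpoint: string;
  durationSec?: number;
  assetBytes?: number;
}): string {
  if ((job.assetBytes ?? 0) > 10 * 1024 * 1024) {
    return "upload the asset to fal.storage first, resubmit with the hosted URL";
  }
  if (job.endpoint.includes("veo3.1")) return "draft first on Veo 3.1 Lite";
  if (job.endpoint.includes("wan") && (job.durationSec ?? 0) >= 15) {
    return "split into two 8 second clips";
  }
  return "resubmit as-is";
}
```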

A debug helper you will reuse

TYPESCRIPT
async function debugJob(endpoint: string, requestId: string) {
  const s = await fal.queue.status(endpoint, { requestId, logs: true });
  return {
    status: s.status,
    lastLogTime: s.logs?.at(-1)?.timestamp,
    lastLogMsg: s.logs?.at(-1)?.message,
    inFlight: await db.oneOrNone(
      "SELECT count(*)::int AS c FROM generations WHERE status IN ('IN_QUEUE','IN_PROGRESS')",
    ),
  };
}

Four data points. Most of the time, one obvious cause.