Grok Imagine: Prompting a Model With No Seed and No Negatives
Grok Imagine has no negative prompt, no user-settable seed, and no audio generation. At $0.05/sec the tradeoff is prompt discipline, your words are the entire control surface.
The stripped-down parameter set
What Grok Imagine Video does not have: negative_prompt, a user-settable seed, audio output, style presets, thinking mode. What it has: prompt, aspect_ratio, duration (integer 4-8), resolution (480p or 720p). That is the whole control surface.
Narrow levers, predictable prompt response. Every word has to carry weight, you cannot compensate with parameters after the fact.

Scenes, not tags
The Aurora engine reads scenes holistically. Keyword-list prompts underperform.
Bad: woman walking, city, rain, cinematic, 4k, bokeh, six floating attributes with no relationship between them. The model picks up about half and guesses the rest.
Good:
1A woman walks alone through a rain-soaked Tokyo street at night, soft neon reflections shimmering on wet pavement, shallow depth of field, slow tracking shot following at shoulder height
Same idea, written as a scene. Subject, context, lighting, camera, motion, one coherent sentence.
The five elements to cover
- Subject and action, active verb, not a noun phrase
- Camera framing and movement, wide, close-up, over-the-shoulder, aerial pull-back, slow push-in, handheld, static locked
- Lighting and time of day, golden hour, overcast diffuse, harsh midday, candlelit, neon-lit night
- Visual style, photorealistic, anime, claymation, film grain, cinematic anamorphic
- Motion character, secondary motion: leaves drifting, crowd rushing past, camera drifting imperceptibly left

Duration changes how you write
Durations are integers 4 to 8. Narrower than most models:
- 4s: one clean action, resolved in the clip. No two-beat sequences.
- 6s (default): two-beat works, establishing moment, then motion development.
A hawk perches on a fence post in sharp focus, then spreads its wings and launches into a blue sky, camera tilting upward to follow. - 7-8s: three-beat possible, but each transition must be explicit.
Stuffing three beats into 4 seconds produces rushed, incoherent motion.
No negatives means no exclusion field
Exclusions go inside the prompt, paired with positive clauses: clean frame, no text overlay, single subject, no crowd, sharp focus, no motion blur. no text overlay alone does nothing. The positive clause gives the model something to render; the exclusion narrows it.
No seed and no audio
No seed means every generation is a fresh roll. When a clip is almost right, you rewrite the prompt and accept a new roll. Tighten harder than you would for Wan 2.7 or Pixverse, framing, lighting, subject description all explicit.
Output is silent MP4. Audio is ignored. Do not spend prompt budget on sound.
Aspect ratio matches composition cues
Text-to-video supports seven ratios; image-to-video is always auto. Match composition to ratio, figure rising from the bottom of frame for 9:16, car driving left to right for 16:9, face centered in the frame for 1:1. A 9:16 prompt describing horizontal motion fights the frame.
Image-to-video: motion, not content
In I2V mode, the prompt is motion direction, not scene description. The image already has the scene. Re-describing the image tends to produce static output, the model reads description as confirmation rather than instruction.
Bad: A woman in a red coat stands in a forest clearing (the image already shows this).
Good: The figure in the foreground walks slowly toward the camera, coat fabric rippling in a gentle wind, background trees swaying, camera drifting imperceptibly forward, what moves, direction, speed, camera behavior.
A full call
1const result = await fal.subscribe("xai/grok-imagine-video/text-to-video", {2 input: {3 prompt: "A golden retriever leaps off a sun-drenched wooden dock into a mountain lake, droplets scattering in slow motion, slow tracking shot from the side at shoulder height, late-afternoon light with long low shadows, photorealistic, secondary motion of ripples expanding on the water",4 duration: 6,5 resolution: "720p",6 aspect_ratio: "16:9",7 },8});
What to strip
4K,8K,DSLR,RAW photo, not used as quality cues- Mood adjectives without grounding (
emotional,profound,epic), ground the feeling in an action - Conflicting style cues:
photorealistic claymation,anime documentary - Audio descriptions, ignored
- Long time scales at 6-8 seconds:
a day in the life,time-lapse over weeks