Prompting4 min readMar 11, 2026

Grok Imagine: Prompting a Model With No Seed and No Negatives

Grok Imagine has no negative prompt, no user-settable seed, and no audio generation. At $0.05/sec the tradeoff is prompt discipline, your words are the entire control surface.

The stripped-down parameter set

What Grok Imagine Video does not have: negative_prompt, a user-settable seed, audio output, style presets, thinking mode. What it has: prompt, aspect_ratio, duration (integer 4-8), resolution (480p or 720p). That is the whole control surface.

Narrow levers, predictable prompt response. Every word has to carry weight, you cannot compensate with parameters after the fact.

Bare API parameter list with crossed out fields

Scenes, not tags

The Aurora engine reads scenes holistically. Keyword-list prompts underperform.

Bad: woman walking, city, rain, cinematic, 4k, bokeh, six floating attributes with no relationship between them. The model picks up about half and guesses the rest.

Good:

CODE

1A woman walks alone through a rain-soaked Tokyo street at night, soft neon reflections shimmering on wet pavement, shallow depth of field, slow tracking shot following at shoulder height

Same idea, written as a scene. Subject, context, lighting, camera, motion, one coherent sentence.

The five elements to cover

Subject and action, active verb, not a noun phrase
Camera framing and movement, wide, close-up, over-the-shoulder, aerial pull-back, slow push-in, handheld, static locked
Lighting and time of day, golden hour, overcast diffuse, harsh midday, candlelit, neon-lit night
Visual style, photorealistic, anime, claymation, film grain, cinematic anamorphic
Motion character, secondary motion: leaves drifting, crowd rushing past, camera drifting imperceptibly left

Duration changes how you write

Durations are integers 4 to 8. Narrower than most models:

4s: one clean action, resolved in the clip. No two-beat sequences.
6s (default): two-beat works, establishing moment, then motion development. A hawk perches on a fence post in sharp focus, then spreads its wings and launches into a blue sky, camera tilting upward to follow.
7-8s: three-beat possible, but each transition must be explicit.

Stuffing three beats into 4 seconds produces rushed, incoherent motion.

No negatives means no exclusion field

Exclusions go inside the prompt, paired with positive clauses: clean frame, no text overlay, single subject, no crowd, sharp focus, no motion blur. no text overlay alone does nothing. The positive clause gives the model something to render; the exclusion narrows it.

No seed and no audio

No seed means every generation is a fresh roll. When a clip is almost right, you rewrite the prompt and accept a new roll. Tighten harder than you would for Wan 2.7 or Pixverse, framing, lighting, subject description all explicit.

Output is silent MP4. Audio is ignored. Do not spend prompt budget on sound.

Aspect ratio matches composition cues

Text-to-video supports seven ratios; image-to-video is always auto. Match composition to ratio, figure rising from the bottom of frame for 9:16, car driving left to right for 16:9, face centered in the frame for 1:1. A 9:16 prompt describing horizontal motion fights the frame.

Image-to-video: motion, not content

In I2V mode, the prompt is motion direction, not scene description. The image already has the scene. Re-describing the image tends to produce static output, the model reads description as confirmation rather than instruction.

Bad: A woman in a red coat stands in a forest clearing (the image already shows this).

Good: The figure in the foreground walks slowly toward the camera, coat fabric rippling in a gentle wind, background trees swaying, camera drifting imperceptibly forward, what moves, direction, speed, camera behavior.

A full call

TYPESCRIPT

1const result = await fal.subscribe("xai/grok-imagine-video/text-to-video", {
2 input: {
3 prompt: "A golden retriever leaps off a sun-drenched wooden dock into a mountain lake, droplets scattering in slow motion, slow tracking shot from the side at shoulder height, late-afternoon light with long low shadows, photorealistic, secondary motion of ripples expanding on the water",
4 duration: 6,
5 resolution: "720p",
6 aspect_ratio: "16:9",
7 },
8});

What to strip

4K, 8K, DSLR, RAW photo, not used as quality cues
Mood adjectives without grounding (emotional, profound, epic), ground the feeling in an action
Conflicting style cues: photorealistic claymation, anime documentary
Audio descriptions, ignored
Long time scales at 6-8 seconds: a day in the life, time-lapse over weeks

Back to all posts

Blog