Blog

Prompting4 min read

LTX 2.3: Prompts That Survive a 4K Render

LTX 2.3 is one of the few models that renders natively at 2160p and generates synchronized audio in the same pass. Prompts that get away with being loose at 720p fall apart when the detail ceiling rises.


The detail ceiling changes what matters

Most video models max out at 1080p. LTX 2.3 renders natively at 1080p, 1440p, and 2160p, not upscaled. Small errors you would never see at 720p become visible when the VAE has more room to work. Loose prompts expose themselves. A prompt that says "a craftsman working" at 720p becomes "a craftsman with slightly wrong fingers and an ambiguous tool" at 2160p.

The fix is not more adjectives. It is more concrete nouns.

The four-part structure

LTX 2.3 has a 4x larger text connector than previous versions. The model tracks more clauses across longer durations without losing thread. Write in four sections, in order, no labels:

  1. Subject and action, who, what they are doing, specific physical detail
  2. Environment and lighting, where, quality of light, atmospheric conditions
  3. Camera behavior, movement, lens, target state
  4. Audio and atmosphere, since audio is native, what should be heard

Specificity test

Loose (fine at 720p, ugly at 2160p):

CODE
1An old man fixing a watch in a workshop, cinematic

Fingers will warp. The tool will be generic. The watch internals will be noise. "Cinematic" adds nothing.

Specific (holds at 2160p):

CODE
1Close-up of an elderly craftsman's hands repairing a pocket watch, extreme detail on the gear teeth and his calloused fingertips, shallow depth of field, warm workshop lamp as key light from frame left, static camera with minimal lens breathe, soft tick of mechanisms

Specific hands, object, light direction, camera behavior, audio. The higher resolution now has detail to serve.

Loose versus specific prompt comparison
Loose versus specific prompt comparison

Camera language

Use the industry verbs, LTX 2.3 treats them as instructions, not flavor. push in / pull back, pan across, tracks, circles around, handheld, locked off, overhead, over-the-shoulder all resolve reliably. Add a target state after the movement: camera pushes in until the subject fills the frame. End condition plus direction.

Audio is part of the prompt

LTX 2.3 generates audio in the same forward pass as video. Cross-attention keeps them aligned frame by frame. That only works if audio intent is in your prompt.

  • Environmental sound: rain on glass, distant traffic, wind through grass, crackling fire
  • Absence of sound: silence broken only by footsteps on hardwood
  • Space characterization: reverberant stone chamber, muffled indoor dampness

At $0.08/sec this is free quality. Ignoring it is strictly a loss.

End-frame interpolation prompts

When using fal-ai/ltx-2.3/image-to-video with both image_url and end_image_url, the prompt should describe the transition, not the content. The images already describe the content:

CODE
1The subject turns from looking left to looking right, lighting shifts from warm afternoon to cool blue evening, smooth continuous motion between the two states

Re-describing either end frame tends to produce lower-motion output, the model reads the description as confirmation rather than instruction to change.

FPS is a creative decision

24, 25, 48, 50. 24/25 reads as film. 48/50 reads as broadcast or slow-motion source material. If your sequence cuts between LTX clips at different FPS, the transitions have a perceptual speed mismatch that is very hard to fix in post. Pick one per project.

FPS dial pointing at 24 with broadcast labels
FPS dial pointing at 24 with broadcast labels

A full call

TYPESCRIPT
1const result = await fal.subscribe("fal-ai/ltx-2.3/text-to-video", {
2 input: {
3 prompt: "A Tokyo convenience store at 3am seen through the rain-streaked window from outside, a single employee restocking shelves under fluorescent light, camera holds completely still, rain on glass in the foreground, distant sound of passing cars and rain on the awning",
4 resolution: "1080p",
5 aspect_ratio: "16:9",
6 duration: 8,
7 fps: 24,
8 generate_audio: true,
9 },
10});

What to strip

  • 4K or 8K inside the prompt text, the resolution parameter handles output size
  • cinematic as a solo adjective, empty without a specific lens or lighting choice
  • Contradictory camera instructions (static handheld locked tracking), pick one
  • Temporal vagueness like sunset, prefer the last ten minutes of golden hour

LTX 2.3 text-to-video is $0.08/sec, image-to-video $0.06/sec. At 2160p generation time rises, not per-second price. Draft at 1080p and commit to 2160p only when the prompt is stable.