LTX 2.3: Prompts That Survive a 4K Render
LTX 2.3 is one of the few models that renders natively at 2160p and generates synchronized audio in the same pass. Prompts that get away with being loose at 720p fall apart when the detail ceiling rises.
The detail ceiling changes what matters
Most video models max out at 1080p. LTX 2.3 renders natively at 1080p, 1440p, and 2160p, not upscaled. Small errors you would never see at 720p become visible when the VAE has more room to work. Loose prompts expose themselves. A prompt that says "a craftsman working" at 720p becomes "a craftsman with slightly wrong fingers and an ambiguous tool" at 2160p.
The fix is not more adjectives. It is more concrete nouns.
The four-part structure
LTX 2.3 has a 4x larger text connector than previous versions. The model tracks more clauses across longer durations without losing thread. Write in four sections, in order, no labels:
- Subject and action, who, what they are doing, specific physical detail
- Environment and lighting, where, quality of light, atmospheric conditions
- Camera behavior, movement, lens, target state
- Audio and atmosphere, since audio is native, what should be heard
Specificity test
Loose (fine at 720p, ugly at 2160p):
1An old man fixing a watch in a workshop, cinematic
Fingers will warp. The tool will be generic. The watch internals will be noise. "Cinematic" adds nothing.
Specific (holds at 2160p):
1Close-up of an elderly craftsman's hands repairing a pocket watch, extreme detail on the gear teeth and his calloused fingertips, shallow depth of field, warm workshop lamp as key light from frame left, static camera with minimal lens breathe, soft tick of mechanisms
Specific hands, object, light direction, camera behavior, audio. The higher resolution now has detail to serve.

Camera language
Use the industry verbs, LTX 2.3 treats them as instructions, not flavor. push in / pull back, pan across, tracks, circles around, handheld, locked off, overhead, over-the-shoulder all resolve reliably. Add a target state after the movement: camera pushes in until the subject fills the frame. End condition plus direction.
Audio is part of the prompt
LTX 2.3 generates audio in the same forward pass as video. Cross-attention keeps them aligned frame by frame. That only works if audio intent is in your prompt.
- Environmental sound:
rain on glass, distant traffic,wind through grass,crackling fire - Absence of sound:
silence broken only by footsteps on hardwood - Space characterization:
reverberant stone chamber,muffled indoor dampness
At $0.08/sec this is free quality. Ignoring it is strictly a loss.
End-frame interpolation prompts
When using fal-ai/ltx-2.3/image-to-video with both image_url and end_image_url, the prompt should describe the transition, not the content. The images already describe the content:
1The subject turns from looking left to looking right, lighting shifts from warm afternoon to cool blue evening, smooth continuous motion between the two states
Re-describing either end frame tends to produce lower-motion output, the model reads the description as confirmation rather than instruction to change.
FPS is a creative decision
24, 25, 48, 50. 24/25 reads as film. 48/50 reads as broadcast or slow-motion source material. If your sequence cuts between LTX clips at different FPS, the transitions have a perceptual speed mismatch that is very hard to fix in post. Pick one per project.

A full call
1const result = await fal.subscribe("fal-ai/ltx-2.3/text-to-video", {2 input: {3 prompt: "A Tokyo convenience store at 3am seen through the rain-streaked window from outside, a single employee restocking shelves under fluorescent light, camera holds completely still, rain on glass in the foreground, distant sound of passing cars and rain on the awning",4 resolution: "1080p",5 aspect_ratio: "16:9",6 duration: 8,7 fps: 24,8 generate_audio: true,9 },10});
What to strip
4Kor8Kinside the prompt text, the resolution parameter handles output sizecinematicas a solo adjective, empty without a specific lens or lighting choice- Contradictory camera instructions (
static handheld locked tracking), pick one - Temporal vagueness like
sunset, preferthe last ten minutes of golden hour
LTX 2.3 text-to-video is $0.08/sec, image-to-video $0.06/sec. At 2160p generation time rises, not per-second price. Draft at 1080p and commit to 2160p only when the prompt is stable.