FlowLong: Inference-time Long Video Generation
via Manifold-constrained Tweedie Matching


1KAIST   2Amazon
*Equal contribution   Co-corresponding authors


30-Second Long-form Audio-Video Generation

30-second videos with synchronized audio, from LTX2.
Extending LTX2's base 5-second generation budget by ×6.

16-Second Audio-Video Samples

Additional 16-second samples with synchronized audio, from LTX2.
Extending LTX2's base 5-second generation budget by roughly ×3.

30-Second Long Video Samples

30-second videos from HunyuanVideo-1.5.
Extending HunyuanVideo-1.5's base 5-second generation budget by ×6.

Comparison with Bidirectional Models

30-second generation against bidirectional long-video baselines (RIFLEx, UltraVico) on the same prompts.
RIFLEx is built on CogVideoX (5B); UltraVico and FlowLong (Ours) both use Wan2.1 (1.3B).

RIFLEx: CogVideoX (5B)
UltraVico: Wan2.1 (1.3B)
FlowLong (Ours): Wan2.1 (1.3B)
Comparison with Autoregressive Models

30-second generation against autoregressive long-video baselines on 10 diverse prompts.
All baselines and FlowLong (Ours) are built on Wan2.1 (1.3B).

CausVid
Self-Forcing
Deep-Forcing
LongLive
Infinity-RoPE
FlowLong (Ours)
Application: Long Text-to-3D Gaussian Splatting

FlowLong's long-generation idea applied to text-to-3DGS: VIST3A runs out of frames mid-trajectory while FlowLong (Ours) continues seamlessly.

VIST3A
FlowLong (Ours)