
Point-to-Point (P2P) Video Generation. Given a pair of start (orange) and end (red) frames in the video and 3D skeleton domains, our method generates videos with smooth transitional frames of various lengths. This controllability makes p2p generation a natural fit for the modern video editing process.


While image manipulation has achieved tremendous breakthroughs in recent years (e.g., generating realistic faces), video generation is much less explored and harder to control, which limits its real-world applications. For instance, video editing requires temporal coherence across multiple clips and thus imposes both start and end constraints within a video sequence. We introduce point-to-point video generation, which controls the generation process with two control points: the targeted start- and end-frames. The task is challenging because the model must not only generate a smooth transition of frames, but also plan ahead to ensure that the generated end-frame conforms to the targeted end-frame for videos of various lengths. We propose to maximize the modified variational lower bound of conditional data likelihood under a skip-frame training strategy. Our model can generate sequences whose end-frame is consistent with the targeted end-frame without loss of quality or diversity. Extensive experiments are conducted on Stochastic Moving MNIST, Weizmann Human Action, and Human3.6M to evaluate the effectiveness of the proposed method. We demonstrate our method under a series of scenarios (e.g., dynamic length generation), and the qualitative results showcase the potential and merits of point-to-point generation.

Video Overview


Overview. We describe the novel components that allow our model to achieve reliable p2p generation. In Panel (a), our model is a VAE consisting of a posterior q_φ, a prior p_ψ, and a generator p_θ. We use the KL-divergence to encourage the posterior to be similar to the prior, so that the generated frames preserve a smooth transition. To control the generation process, we encode the targeted end-frame x_T into a global descriptor. Both the posterior and the prior are computed by an LSTM that considers not only the input frame (x_t or x_{t-1}), but also the global descriptor and a time counter. We further use an alignment loss to align the encoder and decoder latent spaces, which reinforces end-frame consistency. In Panel (b), skip-frame training skips the current frame with some probability at each timestamp; when a frame is skipped, its inputs are ignored completely and the hidden state is not propagated at all (indicated by the dashed line). In Panel (c), control point consistency is achieved by imposing the CPC loss on p_ψ without deteriorating the reconstruction objective of the posterior (highlighted in bold).
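The skip-frame behaviour in Panel (b) can be illustrated with a small sketch. This is not the authors' implementation: toy_cell is a hypothetical stand-in for the LSTM step, frames are reduced to single floats, and the way the time counter is normalized (index divided by the last index of the full sequence) is an assumption made only for illustration. What the sketch does show is the stated rule: a skipped timestamp reads no input and leaves the hidden state untouched.

```python
import random


def toy_cell(hidden, frame, end_descriptor, time_counter):
    # Hypothetical stand-in for one LSTM step; a real model uses learned weights.
    return 0.5 * hidden + 0.3 * frame + 0.1 * end_descriptor + 0.1 * time_counter


def run_with_skips(frames, end_descriptor, skip_prob, rng):
    """Roll a recurrent cell over frames, skipping each step with skip_prob.

    Returns the final hidden state and the list of timestamps that were
    actually processed.
    """
    hidden = 0.0
    processed = []
    last = max(len(frames) - 1, 1)  # avoid division by zero for 1-frame input
    for t, frame in enumerate(frames):
        if rng.random() < skip_prob:
            # Skipped: the frame is never read and the hidden state is
            # carried over unchanged to the next processed timestamp.
            continue
        time_counter = t / last  # assumed normalization, for illustration only
        hidden = toy_cell(hidden, frame, end_descriptor, time_counter)
        processed.append(t)
    return hidden, processed


frames = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
_, kept_all = run_with_skips(frames, 1.0, skip_prob=0.0, rng=random.Random(0))
state_none, kept_none = run_with_skips(frames, 1.0, skip_prob=1.0, rng=random.Random(0))
_, kept_some = run_with_skips(frames, 1.0, skip_prob=0.5, rng=random.Random(0))
```

With skip_prob=0.0 every timestamp is processed, and with skip_prob=1.0 none is, so the hidden state stays at its initial value. Intermediate probabilities yield an ordered subset of timestamps, so the cell sees effectively shorter sequences during training.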


Generation with Various Lengths

Given a pair of start (orange) and end (red) frames, we show generation results with various lengths on SM-MNIST, Weizmann, and Human3.6M. The number beneath each frame indicates the timestamp. Our model achieves high diversity in the intermediate frames and consistency with the targeted end-frame, while handling generation of various lengths. More results in .gif format are shown at the bottom.

Multiple Control Points Generation

Given multiple pairs of start (orange) and end (red) frames, we can merge multiple generated clips into a longer video, which is similar to the modern video editing process. The number beneath each frame indicates the timestamp. More results in .gif format are shown at the bottom.
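The merging of clips described above can be sketched as follows. The function generate_clip is a hypothetical stand-in for a p2p generator: it simply interpolates linearly between two scalar "frames" and is not the model from the paper. The sketch also assumes that consecutive clips share their boundary frame, so the end-frame of one clip is the start-frame of the next and is kept only once.

```python
def generate_clip(start, end, length):
    """Hypothetical stand-in for a p2p generator.

    Returns `length` frames that begin at `start` and finish at `end`,
    using plain linear interpolation instead of a learned model.
    """
    if length < 2:
        raise ValueError("a clip needs at least a start- and an end-frame")
    step = (end - start) / (length - 1)
    return [start + step * i for i in range(length)]


def stitch(control_points, lengths):
    """Chain clips between consecutive control points into one video."""
    if len(lengths) != len(control_points) - 1:
        raise ValueError("need one length per pair of consecutive control points")
    video = [control_points[0]]
    pairs = zip(control_points, control_points[1:], lengths)
    for start, end, length in pairs:
        clip = generate_clip(start, end, length)
        # Drop the clip's first frame: it duplicates the frame already
        # at the end of the video (the shared control point).
        video.extend(clip[1:])
    return video


# Three control points, two clips of different lengths.
video = stitch([0.0, 1.0, 0.5], lengths=[5, 3])
```

Each clip can be given its own length, so the control points can sit at arbitrary timestamps in the merged video.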

Loop Generation

We set the start (orange) and end (red) frames to the same frame to achieve loop generation. We show that our model can generate videos that form infinite loops while still preserving content diversity. More results in .gif format are shown at the bottom.


Point-to-Point Video Generation.

Tsun-Hsuan Wang*, Yen-Chi Cheng*, Chieh Hubert Lin, Hwann-Tzong Chen, Min Sun
Paper (arXiv)  Source Code
    title={Point-to-Point Video Generation},
    author={Wang, Tsun-Hsuan and Cheng, Yen-Chi and Lin, Chieh Hubert and Chen, Hwann-Tzong and Sun, Min},
    journal={arXiv preprint arXiv:1904.02912},