Unified Video Editing with Temporal Reasoner
👁️ See → 🧠 Reason → ✂️ Edit
🚀 A Chain-of-Frames editing method enables temporal reasoning and 4× video-length generalization with just 50k training pairs!
¹University of Technology Sydney, ²Zhejiang University
VideoCoF: Unified Video Editing with Temporal Reasoner
VideoCoF is a unified video editing model that bridges the gap between expert models (precise but restricted) and unified in-context models (flexible but spatially inaccurate). By introducing "See → Reason → Edit", a Chain-of-Frames paradigm, VideoCoF predicts reasoning tokens before generating the target video tokens, thereby removing the need for user-provided masks while achieving precise instruction-to-region alignment.
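For intuition, here is a minimal conceptual sketch of that layout in Python (hypothetical names and shapes, not the released implementation): the model conditions on the source-video latents, first predicts a handful of reasoning latents that localize the edit, and only then generates the target-video latents.

```python
# Conceptual "See -> Reason -> Edit" latent layout (illustrative only; names,
# shapes, and channel counts are assumptions, not the released implementation).
import torch

def build_cof_sequence(source_latents: torch.Tensor,
                       num_reasoning: int = 4) -> torch.Tensor:
    """source_latents: (frames, channels, height, width) latents of the input video."""
    f, c, h, w = source_latents.shape
    reasoning = torch.zeros(num_reasoning, c, h, w)  # predicted first: where/how to edit
    target = torch.zeros(f, c, h, w)                 # predicted last: the edited video
    # See (condition on source) -> Reason (localize) -> Edit (generate)
    return torch.cat([source_latents, reasoning, target], dim=0)

seq = build_cof_sequence(torch.randn(33, 16, 60, 104))
print(seq.shape)  # torch.Size([70, 16, 60, 104])
```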
🌟 Key Capabilities
- Temporal Reasoning: Adopts a unique approach where the model first identifies where and how to edit (Reasoning) before predicting the target video tokens.
- Data Efficiency: Achieves SOTA performance with only 50k training pairs (33 frames each).
- Length Extrapolation: Demonstrates robust multi-shot editing and can generalize to videos 4× longer than training samples.
- Versatile Editing: Supports the following tasks (see the example prompts after this list):
  - Object Removal
  - Object Addition
  - Object Swap
  - Local Style Transfer
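For concreteness, the snippet below gives one illustrative instruction per task. These are hypothetical prompts, except the removal prompt, which mirrors the Inference example further down.

```python
# Illustrative instructions per editing task (hypothetical examples).
EXAMPLE_PROMPTS = {
    "object_removal": "Remove the young man with short black hair wearing black shirt on the left.",
    "object_addition": "Add a red balloon floating above the table.",
    "object_swap": "Replace the dog with a gray cat.",
    "local_style_transfer": "Turn the car into a watercolor painting, keeping the background unchanged.",
}
```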
🔧 Quick Start
To use these weights, please refer to the official GitHub Repository for inference code and environment setup.
Installation

```bash
git clone https://github.com/knightyxp/VideoCoF
cd VideoCoF

# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof

# 2. Install PyTorch (choose the version compatible with your CUDA)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For Hopper GPUs (e.g., H100/H800) requiring fast inference:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 3. Install other dependencies
pip install -r requirements.txt
```
Note on Flash Attention: We recommend using FlashAttention-3 (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. If you are using these GPUs, please follow the official FlashAttention-3 installation guide after installing the compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
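Before downloading models, it can help to verify the environment. A minimal sanity check (the FlashAttention-3 module name below is an assumption and may vary across beta releases):

```python
# Quick environment sanity check (a sketch, not part of the official repo).
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn_interface  # assumed FlashAttention-3 (Hopper) beta package name
    print("FlashAttention-3 import OK")
except ImportError:
    print("FlashAttention-3 not found; falling back to default attention")
```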
Download Models
Wan2.1-T2V-14B Pretrained Weights:

```bash
git lfs install
git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
# Or using huggingface-cli:
# hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
```

VideoCoF Checkpoint:

```bash
git lfs install
git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight
# Or using huggingface-cli:
# hf download XiangpengYang/VideoCoF --local-dir videocof_weight
```
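Alternatively, both repositories can be fetched from Python with huggingface_hub, equivalent to the commands above:

```python
# Programmatic alternative to the git/CLI downloads above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Wan-AI/Wan2.1-T2V-14B", local_dir="Wan2.1-T2V-14B")
snapshot_download(repo_id="XiangpengYang/VideoCoF", local_dir="videocof_weight")
```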
Inference
```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 inference.py \
  --video_path assets/two_man.mp4 \
  --prompt "Remove the young man with short black hair wearing black shirt on the left." \
  --output_dir results/obj_rem \
  --model_name Wan2.1-T2V-14B \
  --seed 0 \
  --num_frames 33 \
  --source_frames 33 \
  --reasoning_frames 4 \
  --repeat_rope \
  --videocof_path videocof_weight/videocof.safetensors
```
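To edit several clips sequentially with the same settings, a small wrapper can loop over the CLI (a sketch; every entry besides the first is a placeholder):

```python
# Minimal sequential batching wrapper around the inference CLI above.
import subprocess

JOBS = [
    ("assets/two_man.mp4",
     "Remove the young man with short black hair wearing black shirt on the left."),
    # ("path/to/another.mp4", "Another edit instruction."),
]

for video, prompt in JOBS:
    subprocess.run([
        "torchrun", "--nproc_per_node=1", "inference.py",
        "--video_path", video,
        "--prompt", prompt,
        "--output_dir", "results/batch",
        "--model_name", "Wan2.1-T2V-14B",
        "--seed", "0",
        "--num_frames", "33",
        "--source_frames", "33",
        "--reasoning_frames", "4",
        "--repeat_rope",
        "--videocof_path", "videocof_weight/videocof.safetensors",
    ], check=True)
```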
For parallel inference:

```bash
sh scripts/parallel_infer.sh
```
🙏 Acknowledgments
We thank the authors of related works and the open-source projects VideoX-Fun and Wan for their contributions.
📄 License
This project is licensed under the Apache License 2.0.
📮 Contact
For any questions, please feel free to reach out to the author, Xiangpeng Yang (@knightyxp), at [email protected] or [email protected].
📝 Citation
If you find this work useful for your research, please consider citing:
```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```
