The system requires substantial VRAM due to its dual-model architecture.
Video generation uses the Wan2.2-I2V-A14B model with FP8 quantization. The model weights take approximately 36GB and inference overhead adds another 4-6GB, for a total of 40-42GB, so 40GB of VRAM is the minimum for stable operation.
Background generation uses Stable Diffusion XL alongside OpenCLIP and segmentation models, which together consume approximately 14-17GB including inference overhead. A 24GB card is sufficient in principle, though 28-32GB is recommended for reliability.
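The arithmetic behind these figures can be sketched as a small pre-flight check. This is an illustrative sketch only: the names `PEAK_VRAM_GB` and `vram_status` are hypothetical and not part of the project, and the numbers are simply the estimates quoted above.

```python
# Estimated peak VRAM per feature in GB, as (low, high) bounds.
# Values come from the estimates in the text; the names are hypothetical.
PEAK_VRAM_GB = {
    "video": (36 + 4, 36 + 6),   # FP8 weights plus 4-6GB inference overhead
    "background": (14, 17),      # SDXL + OpenCLIP + segmentation, overhead included
}


def vram_status(feature: str, available_gb: float) -> str:
    """Classify a GPU's VRAM against the estimated peak for one feature."""
    low, high = PEAK_VRAM_GB[feature]
    if available_gb >= high:
        return "ok"            # covers the upper estimate
    if available_gb >= low:
        return "marginal"      # covers only the lower estimate
    return "insufficient"


if __name__ == "__main__":
    for feature in PEAK_VRAM_GB:
        for gb in (24, 40, 48):
            print(f"{feature:>10} on {gb}GB: {vram_status(feature, gb)}")
```

On a machine with PyTorch and a CUDA device, the available figure could be read with `torch.cuda.get_device_properties(0).total_memory / 1024**3`, although other processes using the GPU will reduce what is actually free.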
The dual-tab architecture ensures only one feature is loaded at a time, so you can provision VRAM for whichever feature you primarily use rather than for both combined.
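One way to picture the one-feature-at-a-time behavior is a loader that releases the current feature before loading the next. The sketch below is an assumption about how such a mechanism might look, not the project's actual code; `ExclusiveLoader` and the loader callables are hypothetical.

```python
from typing import Callable, Dict, Optional


class ExclusiveLoader:
    """Keeps at most one feature's models resident at a time (hypothetical sketch)."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]) -> None:
        self._loaders = loaders
        self.active_name: Optional[str] = None
        self._active_model: Optional[object] = None

    def activate(self, name: str) -> object:
        """Return the models for `name`, unloading any other feature first."""
        if name not in self._loaders:
            raise KeyError(f"unknown feature: {name}")
        if name == self.active_name:
            return self._active_model  # already loaded, nothing to do
        # Drop the reference to the previous feature before loading the new one,
        # so both are never held in memory together.
        self._active_model = None
        self.active_name = None
        self._active_model = self._loaders[name]()
        self.active_name = name
        return self._active_model


if __name__ == "__main__":
    # Placeholder loaders stand in for the real model-loading functions.
    manager = ExclusiveLoader({
        "video": lambda: "video models",
        "background": lambda: "background models",
    })
    print(manager.activate("video"))
    print(manager.activate("background"))
    print(manager.active_name)
```

In a real PyTorch application, dropping the reference would typically be followed by `gc.collect()` and `torch.cuda.empty_cache()` so the freed memory is returned to the GPU before the next load.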