Generating 360-degree panoramas from narrow field of view (NFoV) images is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet- or CLIP-based metrics, which mainly capture image quality and are not suited to evaluating distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate panorama distortion, and with it we discover the "visual cheating" phenomenon in previous works (i.e., improving visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn panorama distortion and content completion, two distinct tasks, at once, which leads the model to prioritize optimizing the latter. To address this, we propose PanoDecouple, a decoupled diffusion model framework that decouples panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing panorama-specific distortion priors and a modified condition registration mechanism, and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function built on Distort-CLIP is introduced to constrain the distortion explicitly. Extensive experiments validate that PanoDecouple surpasses existing methods in both distortion and visual metrics.
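To make the distortion correction loss concrete, below is a minimal PyTorch sketch of one plausible form, assuming the loss encourages the generated panorama's Distort-CLIP image embedding to match a fixed embedding of the panoramic-distortion class. The interface `distort_clip.encode_image`, the tensor names, and the cosine-similarity form are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def distortion_correction_loss(generated_pano, distort_clip, target_embed):
    """Hedged sketch: pull the generated panorama's Distort-CLIP embedding
    toward a fixed panoramic-distortion class embedding.

    generated_pano: (B, 3, H, W) panoramas decoded from the diffusion model.
    distort_clip:   frozen Distort-CLIP image encoder (illustrative interface).
    target_embed:   (D,) embedding representing correct panoramic distortion.
    """
    with torch.no_grad():
        target = F.normalize(target_embed, dim=-1)                    # fixed target
    img_feat = F.normalize(distort_clip.encode_image(generated_pano), dim=-1)
    # Maximize cosine similarity to the correct distortion class;
    # gradients flow through the frozen encoder to the generator only.
    return (1.0 - img_feat @ target).mean()
```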
Image quality and distortion accuracy of existing methods and ours, measured by FID and our Distort-FID, respectively. We project two regions of the panorama (marked in the corresponding colors) into perspective images to show the distortion accuracy of existing methods (i.e., an undistorted, natural layout in the perspective image indicates a good result). Recent methods improve image quality while significantly degrading distortion. We name this the "visual cheating" phenomenon.
The training pipeline of our Distort-CLIP. The image features of the three distortion types are compared via cosine similarity with each other and with the text features of the three distortion types. "-" means the corresponding element does not participate in the computation because it is meaningless. Blue boxes indicate that the target similarity of the corresponding elements is 1, otherwise 0.
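The similarity targets in this figure can be summarized by the sketch below. It assumes pre-extracted image and text features for the three distortion types; the MSE form of the objective, the batching, and the variable names are placeholders for illustration rather than the released training code.

```python
import torch
import torch.nn.functional as F

def distort_clip_loss(img_feats, txt_feats, labels):
    """Hedged sketch of the similarity targets in the Distort-CLIP figure.

    img_feats: (N, D) image features, one distortion label per image.
    txt_feats: (3, D) text features, one per distortion type.
    labels:    (N,) distortion type index in {0, 1, 2} for each image.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)

    # Image-image similarities: target 1 for the same distortion type, else 0.
    ii_sim = img @ img.t()                                            # (N, N)
    ii_tgt = (labels[:, None] == labels[None, :]).float()
    # "-" entries: a feature's similarity with itself is skipped as meaningless.
    ii_mask = ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    loss_ii = F.mse_loss(ii_sim[ii_mask], ii_tgt[ii_mask])

    # Image-text similarities: target 1 for the matching distortion type, else 0.
    it_sim = img @ txt.t()                                            # (N, 3)
    it_tgt = F.one_hot(labels, num_classes=3).float()
    loss_it = F.mse_loss(it_sim, it_tgt)

    return loss_ii + loss_it
```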
The pipeline of the proposed PanoDecouple, a decoupled diffusion model. The DistortNet focuses on distortion guidance via the proposed distortion map. To make full use of the position-encoding-like distortion map, we modify the condition registration mechanism of ControlNet from the first block only to all blocks. The ContentNet is devoted to content completion by taking the partial panorama image and perspective information as input. The U-Net remains frozen, coordinating the fusion of information from the content completion and distortion guidance branches while fully leveraging its powerful pre-trained knowledge. Note that we omit the text input of the DistortNet and U-Net for simplicity, while the text input of the ContentNet is replaced by a perspective image embedding.
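To illustrate the modified condition registration, here is a schematic PyTorch sketch of a ControlNet-style branch in which the distortion-map condition is injected at every block rather than only the first; the module structure, zero-initialized convolutions, and names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AllBlockConditionRegistration(nn.Module):
    """Hedged sketch: ControlNet-like branch where the distortion-map
    condition is registered at every block instead of only the first."""

    def __init__(self, blocks, cond_channels):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)          # trainable copies of U-Net encoder blocks
        # One zero-initialized 1x1 conv per block, so training starts as identity.
        self.zero_convs = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=1) for c in cond_channels
        )
        for conv in self.zero_convs:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, x, cond_feats):
        """x: latent features; cond_feats: per-block features derived from the
        position-encoding-like distortion map, matched to each block's resolution."""
        residuals = []
        for block, zero_conv, cond in zip(self.blocks, self.zero_convs, cond_feats):
            x = block(x + zero_conv(cond))           # register the condition at every block
            residuals.append(x)                      # passed to the frozen U-Net for fusion
        return residuals
```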
Comparison with SOTA methods. † means re-implemented in our setting for a fair comparison. Note that the bottom region of Laval consists entirely of black edges; we crop 20% of it when testing image quality and keep the full image when testing distortion, as distortion evaluation requires the complete panorama. (·) denotes the crop setting of PanoDiff (cropping 20% of the top and bottom regions). The best and second-best results are in bold and underlined, respectively.
Visual results with raw image input. Note that the images we use are for academic purposes only. If any copyright infringement occurs, we will promptly remove them.
If you find our work useful, please consider citing our paper:
@InProceedings{zheng2025panorama,
title={Panorama Generation From NFoV Image Done Right},
author={Zheng, Dian and Zhang, Cheng and Wu, Xiao-Ming and Li, Cao and Lv, Chengfei and Hu, Jian-Fang and Zheng, Wei-Shi},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}