Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy relies too heavily on the source information, which degrades the intended edits in the target image (e.g., failing to change the subject's attributes such as pose, number, or color as instructed). In this work, we propose ProEdit to address this issue from both the attention and the latent perspectives. On the attention side, we introduce KV-mix, which mixes the KV features of the source and the target in the edited region, mitigating the influence of the source image on the edited region while maintaining background consistency. On the latent side, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods such as RF-Solver, FireFlow, and UniEdit.
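A minimal PyTorch sketch of such a mask-gated KV blend is given below; the function name, tensor shapes, and blending coefficient `alpha` are illustrative assumptions, not the exact released implementation.

```python
def kv_mix(k_src, v_src, k_tgt, v_tgt, mask, alpha=0.5):
    """Blend source/target keys and values under an edited-region mask.

    `k_*`, `v_*`: torch key/value features of shape [batch, tokens, dim] from
    the source (inversion) and target (editing) branches; `mask`: shape
    [batch, tokens, 1], 1 inside the edited region and 0 elsewhere;
    `alpha`: assumed mixing weight (illustrative only).
    """
    # Edited region: mix source and target KV so the target prompt can change
    # the subject's attributes without being dominated by the source image.
    k_edit = alpha * k_tgt + (1.0 - alpha) * k_src
    v_edit = alpha * v_tgt + (1.0 - alpha) * v_src
    # Non-edited region: inject the source KV directly to keep the background
    # consistent with the source image.
    k = mask * k_edit + (1.0 - mask) * k_src
    v = mask * v_edit + (1.0 - mask) * v_src
    return k, v
```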
We validate this by visualizing the attention from the source and target text tokens to the visual tokens at the initial and sampling stages. In RF-Solver, the attention from the source text token to the visual tokens remains higher than that from the target text token. However, after removing attention injection, the attention from "black" and "orange" to the visual tokens returns to similar levels, but some subject attributes (e.g., pose) change accordingly.
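For reference, a minimal sketch of how the attention from a single text token to the visual tokens can be aggregated from a captured attention map; the token layout (text tokens first, visual tokens after) is an assumption for illustration.

```python
def text_to_visual_attention(attn, token_idx, num_text_tokens):
    """Aggregate attention from one text token to all visual tokens.

    `attn`: attention probabilities of shape [heads, query_tokens, key_tokens]
    captured from a transformer block, assuming the first `num_text_tokens`
    positions are text tokens and the remaining positions are visual tokens.
    """
    attn = attn.mean(dim=0)  # average over attention heads
    # Row `token_idx` holds that text token's (query) attention over all key
    # positions; keep only the visual-token columns.
    return attn[token_idx, num_text_tokens:]

# Example: scores = text_to_visual_attention(attn, token_idx=5, num_text_tokens=77)
# `scores` can be reshaped to the latent grid and overlaid on the image as a heatmap.
```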
Pipeline of our ProEdit. The mask extraction module identifies the edited region from the source and target prompts during the first inversion step. After obtaining the inverted noise, we apply Latents-Shift to perturb the initial distribution in the edited region, reducing the retained source image information. In selected sampling steps, we fuse source and target attention features in the edited region while directly injecting source features in the non-edited region, achieving accurate attribute editing and background preservation simultaneously.
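A minimal sketch of the latent-side perturbation described above; the blending rule and the `strength` value are illustrative assumptions, not the exact released implementation.

```python
import torch

def latents_shift(z_inv, mask, strength=0.7, generator=None):
    """Perturb the edited region of the inverted latent.

    `z_inv`: inverted latent, e.g. [batch, channels, height, width];
    `mask`: 1 inside the edited region, 0 elsewhere, broadcastable to `z_inv`;
    `strength`: how far the edited region is pushed towards fresh Gaussian
    noise (assumed blending rule and default value).
    """
    noise = torch.randn(z_inv.shape, generator=generator,
                        dtype=z_inv.dtype, device=z_inv.device)
    # Blend towards fresh noise only inside the edited region so the inverted
    # latent no longer pins that region to the source image; the background
    # latent is left untouched for consistency.
    z_edit = (1.0 - strength) * z_inv + strength * noise
    return mask * z_edit + (1.0 - mask) * z_inv
```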
Qualitative editing results with FLUX 🎨 (flow-based image generation model): Background: Trees ➡ Mountain, − Mint Leaves, + Reading Book, Umbrella ➡ Umbrella, Bench ➡ Sofa, Tiger ➡ Cat, Cat ➡ Fox, Shirt ➡ Sweater, Cat ➡ Dog.
Qualitative editing results with HunyuanVideo 🎥 (flow-based video generation model): + Roof Rack, + Crown, Red Car ➡ Black Car, Deer ➡ Cow.
Qualitative results of image editing based on editing instructions. To lower the barrier to using our method and make it more user-friendly, we introduce a large language model, Qwen3-8B, to enable editing driven by a single editing instruction. The actual input editing instructions are shown above each source image and its corresponding edited image.
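A minimal sketch of turning a free-form editing instruction into a source/target prompt pair with Qwen3-8B via Hugging Face Transformers; the system prompt and output format are illustrative assumptions, not the exact prompts used in our pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto",
                                             device_map="auto")

instruction = "Change the cat into a dog."
messages = [
    {"role": "system",
     "content": "Given an image editing instruction, output a source prompt "
                "describing the original image and a target prompt describing "
                "the edited image, one per line."},
    {"role": "user", "content": instruction},
]
# enable_thinking=False skips Qwen3's thinking trace for a direct answer.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       enable_thinking=False,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens: the source/target prompt pair.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```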
If you find our work useful, please consider citing our paper:
@misc{ouyang2025proedit,
title={ProEdit: Inversion-based Editing From Prompts Done Right},
author={Ouyang, Zhi and Zheng, Dian and Wu, Xiao-Ming and Jiang, Jian-Jian and Lin, Kun-Yu and Meng, Jingke and Zheng, Wei-Shi},
year={2025},
eprint={2512.22118},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.22118}
}