ProEdit : Inversion-based Editing From Prompts Done Right

Dec 2025
(* equal contributions, † corresponding authors)
1 Sun Yat-sen University    2 CUHK MMLab   
3 College of Computing and Data Science, Nanyang Technological University   
4 The University of Hong Kong   
5 Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Teaser.

Overview of ProEdit. We propose a highly accurate, plug-and-play editing method for flow inversion that addresses the problem of excessive source image information injection, which prevents proper modification of attributes such as pose, number, and color. Our method demonstrates strong performance on both image editing and video editing tasks.

Video

Abstract

Inversion-based visual editing provides an effective and training-free way to edit an image or a video according to user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy relies too heavily on the source information, which degrades the edits in the target image (e.g., failing to change the subject's attributes such as pose, number, or color as instructed). In this work, we propose ProEdit to address this issue at both the attention and the latent levels. At the attention level, we introduce KV-mix, which mixes the KV features of the source and the target within the edited region, mitigating the influence of the source image on the edited region while maintaining background consistency. At the latent level, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods such as RF-Solver, FireFlow, and UniEdit.
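To make the KV-mix idea concrete, here is a minimal PyTorch sketch, assuming keys/values laid out as (batch, tokens, dim), a per-token boolean edit mask, and a hypothetical mixing weight `alpha`; it illustrates the mechanism described above, not the paper's exact implementation.

```python
import torch

def kv_mix(k_src, v_src, k_tgt, v_tgt, edit_mask, alpha=0.5):
    """Illustrative KV-mix (assumed semantics): inside the edited region,
    blend source and target key/value features; outside it, inject the
    source features unchanged so the background stays consistent.

    k_*/v_*: (batch, tokens, dim) attention keys/values
    edit_mask: (tokens,) bool, True where the edit applies
    alpha: assumed weight of the target features in the edited region
    """
    m = edit_mask.view(1, -1, 1).to(k_src.dtype)  # broadcast to (1, tokens, 1)
    k = m * (alpha * k_tgt + (1 - alpha) * k_src) + (1 - m) * k_src
    v = m * (alpha * v_tgt + (1 - alpha) * v_src) + (1 - m) * v_src
    return k, v
```

Blending rather than fully replacing the source KV in the edited region is what lets the target prompt take effect there, while the untouched source features elsewhere preserve the background.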

Excessive source image information injection phenomenon in RF-Solver

We validate this phenomenon by visualizing the attention from the source and target text tokens to the visual tokens at the initial step and during the sampling stage. In RF-Solver, the attention from the source text token to the visual tokens remains higher than that from the target text token. After removing the attention injection, however, the attention from "black" and "orange" to the visual tokens returns to similar levels, but some subject attributes (e.g., pose) change accordingly.
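As a rough sketch of the kind of probe behind this visualization: assuming access to the softmaxed attention weights of a joint text+visual block as a (heads, query, key) tensor with visual tokens occupying the last key positions (both assumptions of ours), the per-token map could be read out as follows.

```python
import torch

def text_to_visual_attention(attn, text_token_idx, num_visual_tokens):
    # attn: (heads, q_len, k_len) softmaxed attention weights from a
    # joint text+visual attention block (layout assumed, see above).
    # Select the query row of one text token (e.g., "black" or "orange")
    # restricted to the visual-token keys, then average over heads.
    row = attn[:, text_token_idx, -num_visual_tokens:]  # (heads, vis)
    return row.mean(dim=0)  # head-averaged attention map over visual tokens
```

Comparing this map for the source-prompt token against the target-prompt token across steps is what surfaces the imbalance described above.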

Pipeline of ProEdit

Pipeline of our ProEdit. The mask extraction module identifies the edited region based on source and target prompts during the first inversion step. After obtaining the inverted noise, we apply Latents-Shift to perturb the initial distribution in the edited region, reducing source image information. In selected sampling steps, we fuse source and target attention features in the edited region while directly injecting source features in non-edited regions to achieve accurate attribute editing and background preservation simultaneously.
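For the Latents-Shift step, a hedged sketch follows, under our assumption that it interpolates the inverted latent toward fresh Gaussian noise inside the edit mask; the `strength` parameter and the function signature are illustrative, not the paper's exact formulation.

```python
import torch

def latents_shift(z_inv, edit_mask, strength=0.5, generator=None):
    """Illustrative Latents-Shift (assumed semantics): perturb the inverted
    latent only in the edited region so sampling there is less anchored to
    the source image, while the rest of the latent is left untouched.

    z_inv: inverted latent, e.g. (batch, tokens, dim)
    edit_mask: bool mask broadcastable to z_inv, True in the edited region
    strength: in [0, 1]; 0 keeps the inverted latent, 1 replaces it with noise
    """
    noise = torch.randn(z_inv.shape, dtype=z_inv.dtype,
                        device=z_inv.device, generator=generator)
    m = edit_mask.to(z_inv.dtype)
    shifted = (1 - strength) * z_inv + strength * noise
    return m * shifted + (1 - m) * z_inv
```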

Text-driven Image / Video Editing

FLUX 🎨 Flow-based Image Generation Model


Background: Trees ➡ Mountain

− Mint Leaves

+ Reading Book

Umbrella

Bench ➡ Sofa

Tiger ➡ Cat

Cat ➡ Fox

Shirt ➡ Sweater

Cat ➡ Dog

HunyuanVideo 🎥 Flow-based Video Generation Model


+ Roof Rack

+ Crown

Red Car ➡ Black Car

Deer ➡ Cow


Editing By Instruction

Qualitative results of image editing driven by editing instructions. To lower the barrier to using our method and make it more user-friendly, we introduce a large language model, Qwen3-8B, to enable editing from natural-language instructions. The actual input instruction is shown above each source image and its corresponding edited image.

BibTeX

If you find our work useful, please consider citing our paper:

@misc{ouyang2025proedit,
  title={ProEdit: Inversion-based Editing From Prompts Done Right},
  author={Ouyang, Zhi and Zheng, Dian and Wu, Xiao-Ming and Jiang, Jian-Jian and Lin, Kun-Yu and Meng, Jingke and Zheng, Wei-Shi},
  year={2025},
  eprint={2512.22118},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.22118}
}