Addressing Text Embedding Leakage in Diffusion-based Image Editing

Sunung Mun*, Jinhwan Nam*, Sunghyun Cho, Jungseul Ok
Pohang University of Science and Technology (POSTECH)
ICCV 2025

Abstract

Text-based image editing, powered by generative diffusion models, lets users modify images through natural-language prompts and has dramatically simplified traditional workflows. Despite these advances, current methods still suffer from a critical problem: attribute leakage, where edits meant for specific objects unintentionally affect unrelated regions or other target objects. Our analysis identifies the root cause as the semantic entanglement inherent in End-of-Sequence (EOS) embeddings generated by autoregressive text encoders, which indiscriminately aggregate attributes across prompts. To address this issue, we introduce Attribute-Leakage-free Editing (ALE), a framework that tackles attribute leakage at its source. ALE combines Object-Restricted Embeddings (ORE) to disentangle text embeddings, Region-Guided Blending for Cross-Attention Masking (RGB-CAM) for spatially precise attention, and Background Blending (BB) to preserve non-edited content. To quantitatively evaluate attribute leakage across various editing methods, we propose the Attribute-Leakage Evaluation Benchmark (ALE-Bench), featuring comprehensive editing scenarios and new metrics. Extensive experiments show that ALE substantially reduces attribute leakage, enabling accurate, multi-object, text-driven image editing while faithfully preserving non-target content.

Comparison of ALE with Other Methods

floor → cream-colored carpet
flowerpot → blue-colored bucket

sun → balloon
parasol → mushroom

table → black-colored table
chair → crimson-colored chair

Proposed Method: Attribute-Leakage-free Editing

The framework consists of two branches: the upper branch (source branch) processes the source latent \( z_{\tau}^{\text{src}} \), while the lower branch (target branch) processes the target latent \( z_{\tau}^{\text{tgt}} \) at each timestep \( \tau \). The method comprises three key components: Object-Restricted Embeddings (ORE), Region-Guided Blending for Cross-Attention Masking (RGB-CAM), and Background Blending (BB). ORE generates attribute-leakage-free embeddings to minimize interference between unrelated objects. RGB-CAM leverages segmentation masks from Grounded-SAM to refine cross-attention activations so that attention for each target object is confined to its own region. BB composites the source latent in background regions with the target latent in edited regions.
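The sketch below illustrates the two spatial components in a simplified, PyTorch-like form: restricting cross-attention so that each object's tokens attend only inside its segmentation mask, and blending source and target latents by the edit mask. This is a minimal sketch of the general idea, not the official implementation; all tensor shapes, helper names (e.g., rgb_cam_mask, token_to_object), and the additive-bias masking scheme are assumptions.

# Minimal sketch of RGB-CAM-style attention masking and Background Blending.
# Shapes, names, and the biasing scheme are assumptions, not the paper's code.
import torch

def rgb_cam_mask(attn_scores: torch.Tensor,
                 object_masks: torch.Tensor,
                 token_to_object: torch.Tensor) -> torch.Tensor:
    """Restrict each object's tokens to attend within its segmentation mask.

    attn_scores:     (batch, heads, hw, tokens) raw cross-attention logits
    object_masks:    (num_objects, hw) binary masks (e.g., from Grounded-SAM)
    token_to_object: (tokens,) object index per token, -1 for shared tokens
    """
    bias = torch.zeros_like(attn_scores)
    for obj_idx in range(object_masks.shape[0]):
        token_sel = token_to_object == obj_idx        # tokens tied to this object
        outside = 1.0 - object_masks[obj_idx]         # spatial positions outside its mask
        # Penalize attention of this object's tokens at positions outside its mask.
        bias[..., token_sel] += (-1e4 * outside)[None, None, :, None]
    return torch.softmax(attn_scores + bias, dim=-1)

def background_blend(z_src: torch.Tensor,
                     z_tgt: torch.Tensor,
                     edit_mask: torch.Tensor) -> torch.Tensor:
    """Keep the source latent in non-edited regions and the target latent elsewhere.

    z_src, z_tgt: (batch, c, h, w) latents at the current timestep
    edit_mask:    (1, 1, h, w) union of object masks, 1 inside edited regions
    """
    return edit_mask * z_tgt + (1.0 - edit_mask) * z_src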

Proposed Benchmark: Attribute Leakage Evaluation Benchmark

Attribute Leakage Evaluation Benchmark (ALE-Bench) is a new benchmark specifically designed to evaluate attribute leakage. It simulates real-world editing scenarios with succinct prompts and comprises 3,000 editing cases covering five editing types and varying numbers of edited objects.
To quantify attribute leakage, we introduce two key metrics:

  • Target-Internal Leakage Score (TILS) measures unintended cross-influence between edited objects. Lower TILS values indicate edits are well-isolated within the intended objects:
    \( \text{TILS} = \frac{1}{N(N-1)}\sum_{i \neq j} \text{CLIP}\left( x_{\text{tgt}} \odot m_j,\; y_i^{\text{tgt}} \right). \)
  • Target-External Leakage Score (TELS) quantifies unintended changes in non-edited areas, such as the background. Lower TELS values indicate that background regions remain unaffected by the edits:
    \( \text{TELS} = \frac{1}{N}\sum_{i=1}^{N} \text{CLIP}\left( x_{\text{tgt}} \odot \left( \mathbf{1} - \bigcup_{j=1}^{N} m_j \right),\; y_i^{\text{tgt}} \right), \)

where \(\text{CLIP}(\cdot,\cdot)\) denotes the CLIP image-text similarity score, \(N\) is the number of objects to be edited, \(x_{\text{tgt}}\) is the edited image, \(m_j\) is the \(j\)-th object mask, and \(y_i^{\text{tgt}}\) is the target prompt for the \(i\)-th object.
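As a concrete reading of the two formulas, the sketch below computes TILS and TELS for a single edited image. It is an illustrative sketch, not the official ALE-Bench code; clip_score is a placeholder for any CLIP image-text similarity function, and masks are assumed to be binary arrays at image resolution.

# Illustrative computation of TILS and TELS (a sketch, not the ALE-Bench code).
# clip_score(image, text) -> float is a placeholder CLIP similarity function.
import numpy as np

def tils(x_tgt: np.ndarray, masks: list, target_prompts: list, clip_score) -> float:
    """Target-Internal Leakage: prompt i evaluated inside another object's mask j."""
    n = len(masks)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += clip_score(x_tgt * masks[j][..., None], target_prompts[i])
    return total / (n * (n - 1))

def tels(x_tgt: np.ndarray, masks: list, target_prompts: list, clip_score) -> float:
    """Target-External Leakage: target prompts evaluated on the non-edited background."""
    n = len(masks)
    background = 1 - np.clip(np.sum(masks, axis=0), 0, 1)   # complement of the mask union
    bg_image = x_tgt * background[..., None]
    return sum(clip_score(bg_image, target_prompts[i]) for i in range(n)) / n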

Example editing scenarios in ALE-Bench

Various Editing Results of ALE

hair → braided hair

glasses → black sunglass

flower → cotton candy

turtle → tank
green lettuce → fire

yellow paprika → red apple
red paprika → yellow cheese

cake → chocolate cake
plate → red plastic plate

blue globe → gold globe
orange globe → red moon
box → steel box

wolf → eagle
moon → blue moon
mountain → glacier

tree → Christmas tree
couch → couch made of snow
floor → ice rink

Results

Our proposed method, ALE, achieves superior attribute-leakage control and editing quality, outperforming existing methods as summarized below.
Comparison of methods on ALE-Bench
Performance of ALE with varying numbers of editing objects
Performance of ALE across different editing types

BibTeX

@InProceedings{mun2025ale,
    author    = {Mun, Sunung and Nam, Jinhwan and Cho, Sunghyun and Ok, Jungseul},
    title     = {Addressing Text Embedding Leakage in Diffusion-based Image Editing},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2025},
    url       = {https://arxiv.org/abs/2412.04715},
}