Addressing Attribute Leakages in Diffusion-based Image Editing without Training

Sunung Mun*, Jinhwan Nam*, Sunghyun Cho, Jungseul Ok
Pohang University of Science and Technology (POSTECH)
* Equal Contribution

Abstract

Diffusion models have become a cornerstone in image editing, offering flexibility with language prompts and source images. However, a key challenge is attribute leakage, where unintended modifications occur in non-target regions or within target regions due to attribute interference. Existing methods often suffer from leakage due to naive text embeddings and inadequate handling of End-of-Sequence (EOS) token embeddings. To address this, we propose ALE-Edit (Attribute-leakage-free editing), a novel framework to minimize attribute leakage with three components: (1) Object-Restricted Embeddings (ORE) to localize object-specific attributes in text embeddings, (2) Region-Guided Blending for Cross-Attention Masking (RGB-CAM) to align attention with target regions, and (3) Background Blending (BB) to preserve non-edited regions. Additionally, we introduce ALE-Bench, a benchmark for evaluating attribute leakage with new metrics for target-external and target-internal leakage. Experiments demonstrate that our framework significantly reduces attribute leakage while maintaining high editing quality, providing an efficient and tuning-free solution for multi-object image editing. We will release the source code and benchmark.

Comparison of ALE-Edit with Other Methods

floor → cream-colored carpet
flowerpot → blue-colored bucket

sun → balloon
parasol → mushroom

table → black-colored table
chair → crimson-colored chair

Proposed Method: Attribute-Leakage-free Editing

The framework consists of two branches: the upper part (source branch) processes the source latent \( z_{\tau}^S \), and the lower part (target branch) processes the target latent \( z_{\tau}^T \) at each timestep \( \tau \). The method comprises three key components: Object-Restricted Embeddings (ORE), Region-Guided Blending for Cross-Attention Masking (RGB-CAM), and Background Blending (BB). ORE generates attribute-leakage-free embeddings to minimize interference between unrelated objects. RGB-CAM leverages segmentation masks from Grounded-SAM to refine cross-attention activations, aligning attention to specific regions for each target object. BB integrates the source latent for background regions and the target latent for edited regions.
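The Background Blending step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `background_blend`, the `(C, H, W)` latent layout, and the use of a binary mask already resized to latent resolution are all assumptions made for illustration.

```python
import numpy as np

def background_blend(z_source: np.ndarray, z_target: np.ndarray,
                     edit_mask: np.ndarray) -> np.ndarray:
    """Keep the target latent inside the edited region and the source latent elsewhere.

    z_source, z_target: latents of shape (C, H, W) (assumed layout).
    edit_mask: binary mask of shape (H, W), 1 where an object is edited (assumed input).
    """
    # Add a leading channel axis so the mask broadcasts over all channels.
    m = edit_mask.astype(z_source.dtype)[None, ...]
    return m * z_target + (1.0 - m) * z_source

# Toy example: the source latent is all zeros, the target latent is all ones.
z_s = np.zeros((4, 2, 2), dtype=np.float32)
z_t = np.ones((4, 2, 2), dtype=np.float32)
mask = np.array([[1, 0], [0, 0]], dtype=bool)  # only the top-left position is edited
blended = background_blend(z_s, z_t, mask)
```

In this toy case only the top-left latent position takes values from the target branch; every other position is copied from the source branch, which is what keeps the background unchanged.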

Proposed Benchmark: Attribute Leakage Evaluation Benchmark

ALE-Bench (Attribute Leakage Evaluation Benchmark) is a novel benchmark designed to evaluate attribute leakage. It simulates real-world editing scenarios using succinct prompts and includes a total of 3,000 editing cases with five editing types and varying numbers of objects.
To quantify attribute leakage, we introduce two key metrics.

  • Target-Internal (TI) Leakage measures unintended cross-influence between edited objects. Lower TI values indicate edits are well-isolated to the intended objects:
    \( \text{TI} = \frac{1}{N(N-1)}\sum_{i \neq j}^{N} \text{CLIP} \left( I_e \odot m_j, y_i^T \right). \)
  • Target-External (TE) Leakage quantifies unintended changes in non-edited areas, such as the background. Lower TE values suggest that external areas remain unaffected by the edits:
    \( \text{TE} = \frac{1}{N}\sum_{i=1}^{N} \text{CLIP}\left( I_e \odot \left( \mathbf{1} - \bigcup_{j=1}^{N} m_j\right), y_i^T \right), \)

where CLIP is the CLIP similarity score, \(N\) is the number of objects to be edited, \(I_e\) is the edited image, \(m_j\) is the \(j\)-th object mask, and \(y_i^T\) is the target prompt for the \(i\)-th object.
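The two metrics can be computed as in the sketch below, which follows the formulas term by term. The CLIP score is passed in as a callable (`clip_score`, a hypothetical placeholder) so that the example stays self-contained; the function names, the `(H, W, 3)` image layout, and the choice to return 0 for TI when \(N < 2\) are assumptions, not part of ALE-Bench.

```python
from typing import Callable, Sequence
import numpy as np

ScoreFn = Callable[[np.ndarray, str], float]  # stand-in for a real CLIP similarity

def ti_leakage(image: np.ndarray, masks: Sequence[np.ndarray],
               prompts: Sequence[str], clip_score: ScoreFn) -> float:
    """Average similarity between object j's region and prompt i over all i != j."""
    n = len(masks)
    if n < 2:
        return 0.0  # assumption: TI is undefined for one object, so report 0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += clip_score(image * masks[j][..., None], prompts[i])
    return total / (n * (n - 1))

def te_leakage(image: np.ndarray, masks: Sequence[np.ndarray],
               prompts: Sequence[str], clip_score: ScoreFn) -> float:
    """Average similarity between the non-edited region and every target prompt."""
    union = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        union |= m.astype(bool)
    background = image * (~union)[..., None]  # I_e masked by (1 - union of masks)
    return sum(clip_score(background, p) for p in prompts) / len(prompts)

# Toy check with a fake score (NOT CLIP): pixel sum times prompt length.
def toy_score(img: np.ndarray, text: str) -> float:
    return float(img.sum()) * len(text)

image = np.ones((4, 4, 3))
m0 = np.zeros((4, 4), dtype=bool); m0[:2, :2] = True
m1 = np.zeros((4, 4), dtype=bool); m1[2:, 2:] = True
ti = ti_leakage(image, [m0, m1], ["a", "bb"], toy_score)
te = te_leakage(image, [m0, m1], ["a", "bb"], toy_score)
```

In practice `toy_score` would be replaced by a function that embeds the masked image and the prompt with a CLIP model and returns their similarity.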

Examples of editing scenarios in ALE-Bench

Various Editing Results of ALE-Edit

hair → braided hair

glasses → black sunglass

flower → cotton candy

turtle → tank
green lettuce → fire

yellow paprika → red apple
red paprika → yellow cheese

cake → chocolate cake
plate → red plastic plate

blue globe → gold globe
orange globe → red moon
box → steel box

wolf → eagle
moon → blue moon
mountain → glacier

tree → Christmas tree
couch → couch made of snow
floor → ice rink

Results

ALE-Edit outperforms existing methods in both attribute leakage control and image editing quality, as shown in the results below. Specifically, it achieves the lowest Target-External (TE) and Target-Internal (TI) leakage scores, indicating that edits stay confined to the specified target regions. CS denotes CLIP Similarity and SD denotes Structure Distance.
Performance comparison on ALE-Bench
Performance of ALE-Edit across the number of editing objects
Performance of ALE-Edit across editing types

BibTeX

@article{mun2024leakage,
        title={Addressing Attribute Leakages in Diffusion-based Image Editing without Training}, 
        author={Mun, Sunung and Nam, Jinhwan and Cho, Sunghyun and Ok, Jungseul},
        year={2024},
        eprint={2412.04715},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2412.04715}, 
  }