Addressing Text Embedding Leakage in Diffusion-based Image Editing

Sunung Mun*, Jinhwan Nam*, Sunghyun Cho, Jungseul Ok
Pohang University of Science and Technology (POSTECH)
ICCV 2025

Abstract

Text-based image editing, powered by generative diffusion models, lets users modify images through natural-language prompts and has dramatically simplified traditional workflows. Despite these advances, current methods still suffer from a critical problem: attribute leakage, where edits meant for specific objects unintentionally affect unrelated regions or other target objects. Our analysis identifies the root cause as the semantic entanglement inherent in End-of-Sequence (EOS) embeddings generated by autoregressive text encoders, which indiscriminately aggregate attributes across prompts. To address this issue, we introduce Attribute-Leakage-free Editing (ALE), a framework that tackles attribute leakage at its source. ALE combines Object-Restricted Embeddings (ORE) to disentangle text embeddings, Region-Guided Blending for Cross-Attention Masking (RGB-CAM) for spatially precise attention, and Background Blending (BB) to preserve non-edited content. To quantitatively evaluate attribute leakage across various editing methods, we propose the Attribute-Leakage Evaluation Benchmark (ALE-Bench), featuring comprehensive editing scenarios and new metrics. Extensive experiments show that ALE substantially reduces attribute leakage, enabling accurate, multi-object, text-driven image editing while faithfully preserving non-target content.

Comparison of ALE with Other Methods

floor → cream-colored carpet
flowerpot → blue-colored bucket

sun → balloon
parasol → mushroom

table → black-colored table
chair → crimson-colored chair

Proposed Method: Attribute-Leakage-free Editing

The framework consists of two branches: the upper branch (source branch) processes the source latent \( z_{\tau}^{\text{src}} \), while the lower branch (target branch) processes the target latent \( z_{\tau}^{\text{tgt}} \) at each timestep \( \tau \). The method comprises three key components: Object-Restricted Embeddings (ORE), Region-Guided Blending for Cross-Attention Masking (RGB-CAM), and Background Blending (BB). ORE generates attribute-leakage-free embeddings to minimize interference between unrelated objects. RGB-CAM leverages segmentation masks from Grounded-SAM to refine cross-attention activations so that attention for each target object is confined to its own region. BB composites the source latent in background regions with the target latent in edited regions.
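The sketch below illustrates the two spatial components in a simplified, PyTorch-like form: restricting cross-attention so that each object's tokens attend only inside its segmentation mask, and blending source and target latents by the edit mask. This is a minimal sketch of the general idea, not the official implementation; all tensor shapes, helper names (e.g., rgb_cam_mask, token_to_object), and the additive-bias masking scheme are assumptions.

# Minimal sketch of RGB-CAM-style attention masking and Background Blending.
# Shapes, names, and the biasing scheme are assumptions, not the paper's code.
import torch

def rgb_cam_mask(attn_scores: torch.Tensor,
                 object_masks: torch.Tensor,
                 token_to_object: torch.Tensor) -> torch.Tensor:
    """Restrict each object's tokens to attend within its segmentation mask.

    attn_scores:     (batch, heads, hw, tokens) raw cross-attention logits
    object_masks:    (num_objects, hw) binary masks (e.g., from Grounded-SAM)
    token_to_object: (tokens,) object index per token, -1 for shared tokens
    """
    bias = torch.zeros_like(attn_scores)
    for obj_idx in range(object_masks.shape[0]):
        token_sel = token_to_object == obj_idx        # tokens tied to this object
        outside = 1.0 - object_masks[obj_idx]         # spatial positions outside its mask
        # Penalize attention of this object's tokens at positions outside its mask.
        bias[..., token_sel] += (-1e4 * outside)[None, None, :, None]
    return torch.softmax(attn_scores + bias, dim=-1)

def background_blend(z_src: torch.Tensor,
                     z_tgt: torch.Tensor,
                     edit_mask: torch.Tensor) -> torch.Tensor:
    """Keep the source latent in non-edited regions and the target latent elsewhere.

    z_src, z_tgt: (batch, c, h, w) latents at the current timestep
    edit_mask:    (1, 1, h, w) union of object masks, 1 inside edited regions
    """
    return edit_mask * z_tgt + (1.0 - edit_mask) * z_src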

Proposed Benchmark: Attribute Leakage Evaluation Benchmark

Attribute Leakage Evaluation Benchmark (ALE-Bench) is a new benchmark specifically designed to evaluate attribute leakage. It simulates real-world editing scenarios with succinct prompts and comprises 3,000 editing cases covering five editing types and varying numbers of edited objects.
To quantify attribute leakage, we introduce two key metrics:

  • Target-Internal Leakage Score (TILS) measures unintended cross-influence between edited objects. Lower TILS values indicate edits are well-isolated within the intended objects:
    \( \text{TILS} = \frac{1}{N(N-1)}\sum_{i \neq j} \text{CLIP}\left( x_{\text{tgt}} \odot m_j,\; y_i^{\text{tgt}} \right). \)
  • Target-External Leakage Score (TELS) quantifies unintended changes in non-edited areas, such as the background. Lower TELS values indicate that background regions remain unaffected by the edits:
    \( \text{TELS} = \frac{1}{N}\sum_{i=1}^{N} \text{CLIP}\left( x_{\text{tgt}} \odot \left( \mathbf{1} - \bigcup_{j=1}^{N} m_j \right),\; y_i^{\text{tgt}} \right), \)

where \(\text{CLIP}(\cdot,\cdot)\) denotes the CLIP image-text similarity score, \(N\) is the number of objects to be edited, \(x_{\text{tgt}}\) is the edited image, \(m_j\) is the \(j\)-th object mask, and \(y_i^{\text{tgt}}\) is the target prompt for the \(i\)-th object.
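As a concrete reading of the two formulas, the sketch below computes TILS and TELS for a single edited image. It is an illustrative sketch, not the official ALE-Bench code; clip_score is a placeholder for any CLIP image-text similarity function, and masks are assumed to be binary arrays at image resolution.

# Illustrative computation of TILS and TELS (a sketch, not the ALE-Bench code).
# clip_score(image, text) -> float is a placeholder CLIP similarity function.
import numpy as np

def tils(x_tgt: np.ndarray, masks: list, target_prompts: list, clip_score) -> float:
    """Target-Internal Leakage: prompt i evaluated inside another object's mask j."""
    n = len(masks)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += clip_score(x_tgt * masks[j][..., None], target_prompts[i])
    return total / (n * (n - 1))

def tels(x_tgt: np.ndarray, masks: list, target_prompts: list, clip_score) -> float:
    """Target-External Leakage: target prompts evaluated on the non-edited background."""
    n = len(masks)
    background = 1 - np.clip(np.sum(masks, axis=0), 0, 1)   # complement of the mask union
    bg_image = x_tgt * background[..., None]
    return sum(clip_score(bg_image, target_prompts[i]) for i in range(n)) / n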

Example editing scenarios in ALE-Bench

Various Editing Results of ALE

hair → braided hair

glasses → black sunglass

flower → cotton candy

turtle → tank
green lettuce → fire

yellow paprika → red apple
red paprika → yellow cheese

cake → chocolate cake
plate → red plastic plate

blue globe → gold globe
orange globe → red moon
box → steel box

wolf → eagle
moon → blue moon
mountain → glacier

tree → Christmas tree
couch → couch made of snow
floor → ice rink

Results

Our proposed method, ALE, achieves superior attribute-leakage control and editing quality, outperforming existing methods as summarized below.
Comparison of methods on ALE-Bench
Performance of ALE with varying numbers of editing objects
Performance of ALE across different editing types

BibTeX

@InProceedings{mun2025ale,
    author    = {Mun, Sunung and Nam, Jinhwan and Cho, Sunghyun and Ok, Jungseul},
    title     = {Addressing Text Embedding Leakage in Diffusion-based Image Editing},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2025},
    url       = {https://arxiv.org/abs/2412.04715},
}