Inpainting focuses on filling missing or corrupted regions of an image so that the result blends seamlessly with the surrounding content and style. While conditional diffusion models have proven effective for text-guided inpainting, we introduce the novel task of multi-mask inpainting, where multiple regions are inpainted simultaneously using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically from corrupted images. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, enforcing prompts onto their designated regions for filling. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results.
We propose a pipeline for text-guided multi-mask inpainting, comprising two key stages: prompt generation and inpainting. During the prompt generation stage, we fine-tune LLaVA with QLoRA to automatically generate multi-mask prompts. In the inpainting stage, Stable Diffusion is fine-tuned for multi-mask inpainting using an adaptation of the rectified cross-attention (RCA) algorithm, where LoRA adapters target all cross-attention layers. The rectified cross-attention module ensures that prompts are accurately applied to their designated regions for inpainting. The pipeline undergoes a two-stage training process and is evaluated on an automatically annotated dataset of art images as well as the Densely Captioned Images dataset.
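The core of the rectified cross-attention (RCA) idea can be sketched in a few lines: each spatial location in the image latent is only allowed to attend to the tokens of the prompt assigned to its region, which is enforced by masking the attention logits before the softmax. The following is a minimal, framework-agnostic sketch in NumPy; the function name, tensor shapes, and the single-head formulation are illustrative assumptions, not the paper's actual implementation (which operates inside Stable Diffusion's cross-attention layers with LoRA adapters).

```python
import numpy as np

def rectified_cross_attention(Q, K, V, region_mask):
    """Illustrative single-head sketch of rectified cross-attention.

    Q:           (L, d) queries for L spatial locations of the image latent.
    K, V:        (T, d) keys/values for the T tokens of all prompts, concatenated.
    region_mask: (L, T) binary matrix; region_mask[i, t] = 1 iff token t belongs
                 to the prompt assigned to the region containing location i.
                 Each row is assumed to have at least one allowed token
                 (e.g., a background prompt covers unmasked areas).
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    # Rectification: forbid attention between a location and tokens of
    # prompts not assigned to its region.
    logits = np.where(region_mask > 0, logits, -np.inf)
    # Numerically stable softmax over the allowed tokens only.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With this masking, a location whose row in `region_mask` allows a single token receives exactly that token's value vector, and locations covered by a multi-token prompt take a convex combination of that prompt's values only, so prompts cannot leak across regions.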
Qualitative examples.
Qualitative examples from the Densely Captioned Images dataset.
Qualitative examples for 1-mask inpainting.
Qualitative examples for 2-mask inpainting.
Qualitative examples for 3-mask inpainting.
Qualitative examples for 4-mask inpainting.
Qualitative examples for 5-mask inpainting.
The following is a non-exhaustive list of related works:
Cohen et al. enhance inpainting diversity by sampling rare yet plausible concepts from the posterior distribution of generated objects.
Similar to our approach, Brush2Prompt introduces the task of prompt generation for text-guided inpainting but focuses on single-mask inpainting.
FreestyleNet introduces the rectified cross-attention mechanism for freestyle layout-to-image synthesis, which inspired our rectified cross-attention module for multi-mask inpainting.
@inproceedings{fanelli2025idream,
title = {I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting},
author = {Fanelli, Nicola and Vessio, Gennaro and Castellano, Giovanna},
year = {2025},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision}
}