I Dream My Painting

Abstract

Inpainting fills missing or corrupted regions of an image so that they blend seamlessly with the surrounding content and style. Conditional diffusion models have proven effective for text-guided inpainting; building on them, we introduce the novel task of multi-mask inpainting, where multiple regions are inpainted simultaneously using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically from corrupted images. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, which enforces each prompt onto its designated region. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results.

Method

We propose a pipeline for text-guided multi-mask inpainting with two stages: prompt generation and inpainting. In the prompt generation stage, we fine-tune LLaVA with QLoRA to generate multi-mask prompts automatically. In the inpainting stage, Stable Diffusion is fine-tuned for multi-mask inpainting using an adaptation of the rectified cross-attention (RCA) algorithm, with LoRA adapters targeting all cross-attention layers. The RCA module ensures that each prompt is accurately applied to its designated region. The pipeline is trained in two stages and is evaluated on an automatically annotated dataset of art images as well as the Densely Captioned Images dataset.
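
To make the idea of restricting prompts to regions concrete, the following is a minimal sketch of a rectified cross-attention step, written in plain Python rather than as a Stable Diffusion layer. It is not the authors' implementation. The function name `rectified_cross_attention`, the region-id encoding (`-1` for unmasked pixels), and the choice to let unmasked pixels attend to every token are all assumptions made for illustration.

```python
import math


def rectified_cross_attention(logits, pixel_region, token_region):
    """Softmax over tokens after blocking tokens from other regions.

    logits[i][j]    : raw attention score between pixel i and prompt token j
    pixel_region[i] : id of the mask covering pixel i, or -1 if unmasked (assumed encoding)
    token_region[j] : id of the mask whose prompt contains token j
    """
    weights = []
    for row, p_reg in zip(logits, pixel_region):
        # Assumption: unmasked pixels see all tokens; masked pixels see
        # only the tokens of the prompt assigned to their own mask.
        allowed = [p_reg == -1 or t_reg == p_reg for t_reg in token_region]
        if not any(allowed):
            raise ValueError("pixel region has no prompt tokens")
        # Blocked tokens get -inf so they receive exactly zero weight.
        masked = [x if ok else -math.inf for x, ok in zip(row, allowed)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three pixels: one in mask 0, one in mask 1, one unmasked.
# Three tokens: two from the prompt for mask 0, one from the prompt for mask 1.
weights = rectified_cross_attention(
    logits=[[0.0, 0.0, 5.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.0]],
    pixel_region=[0, 1, -1],
    token_region=[0, 0, 1],
)
```

In this toy example the pixel in mask 0 ignores the high-scoring token from the other prompt, and the pixel in mask 1 places all of its attention on its own prompt's single token. A real layer would apply the same masking per attention head on batched tensors.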

Gallery