Consistent123: Improve Consistency for One Image to 3D Object Synthesis

South China University of Technology · International Digital Economy Academy (IDEA)

Abstract

Large image diffusion models enable novel view synthesis with high quality and excellent zero-shot capability. However, such models, built on image-to-image translation, offer no guarantee of view consistency, which limits performance on downstream tasks such as 3D reconstruction and image-to-3D generation. To enforce consistency, we propose Consistent123, which synthesizes novel views simultaneously by incorporating additional cross-view attention layers and a shared self-attention mechanism. The proposed attention mechanism improves the interaction across all synthesized views, as well as the alignment between the condition view and the novel views. At the sampling stage, this architecture can generate an arbitrary number of views simultaneously, even though it is trained on a fixed number of views. We also introduce a progressive classifier-free guidance strategy to trade off texture against geometry in the synthesized object views. Qualitative and quantitative experiments show that Consistent123 outperforms baselines in view consistency by a large margin. Furthermore, we demonstrate significant improvements from Consistent123 on a range of downstream tasks, showing its great potential in the 3D generation field.

Method

(a) At the training stage, multiple noisy views, each concatenated (denoted as ⊕) with the input view, are fed into the denoising U-Net simultaneously, conditioned on the CLIP embedding of the input view and the corresponding camera poses. At sampling time, views are denoised iteratively through the U-Net, starting from Gaussian noise. (b) In the shared self-attention layer, all views query the same key and value computed from the input view, which provides detailed spatial-layout information for novel view synthesis. The input view and the associated poses are injected into the model through the cross-attention layer, and the synthesized views are further aligned via the cross-view attention layer.
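To make the two attention mechanisms concrete, below is a minimal PyTorch sketch of how shared self-attention and cross-view attention could be realized. The class names, tensor shapes, and head counts are illustrative assumptions, not the authors' implementation: `views` holds U-Net features of the N noisy views, and `input_view` holds features of the condition view.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSelfAttention(nn.Module):
    """All N views use their own queries but share keys/values computed
    from the input (condition) view, so every synthesized view can read
    off the condition view's spatial layout. Illustrative sketch only."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, views: torch.Tensor, input_view: torch.Tensor) -> torch.Tensor:
        # views:      (B, N, L, C) -- N noisy views, L tokens per view
        # input_view: (B, L, C)    -- tokens of the condition view
        B, N, L, C = views.shape
        h, d = self.num_heads, C // self.num_heads
        q = self.to_q(views.reshape(B * N, L, C))
        # Keys/values come from the input view, repeated once per view.
        k = self.to_k(input_view).repeat_interleave(N, dim=0)
        v = self.to_v(input_view).repeat_interleave(N, dim=0)
        q, k, v = (t.view(B * N, L, h, d).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B * N, L, C)
        return self.to_out(out).view(B, N, L, C)

class CrossViewAttention(nn.Module):
    """Tokens of all N views attend to one another as a single joint
    sequence; this is the step that aligns the synthesized views."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        B, N, L, C = views.shape
        x = views.reshape(B, N * L, C)  # flatten all views into one sequence
        out, _ = self.attn(x, x, x)
        return out.reshape(B, N, L, C)
```

Note that cross-view attention treats the views as one variable-length sequence, so N at sampling time need not match the N used during training; this is what lets the model generate an arbitrary number of views despite being trained on a fixed number.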
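The abstract also mentions a progressive classifier-free guidance strategy. The paper's exact schedule is not reproduced here; the sketch below only illustrates the general idea under an assumed linear ramp of the guidance scale across denoising steps. The names `progressive_cfg_scale`, `eps_uncond`, and `eps_cond` are hypothetical; the latter two stand for the U-Net's unconditional and conditional noise predictions.

```python
import torch

def progressive_cfg_scale(step: int, num_steps: int,
                          w_start: float = 3.0, w_end: float = 7.5) -> float:
    """Linearly ramp the guidance scale over the sampling trajectory.
    Consistent123's actual schedule may differ; this is an assumed
    example of a progressive (step-dependent) schedule."""
    t = step / max(num_steps - 1, 1)
    return w_start + t * (w_end - w_start)

def guided_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
                 step: int, num_steps: int) -> torch.Tensor:
    """Standard classifier-free guidance combination, but with a
    step-dependent scale instead of a fixed one."""
    w = progressive_cfg_scale(step, num_steps)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Varying the guidance weight over the trajectory lets early steps (which settle coarse geometry) and late steps (which refine texture) receive different amounts of conditioning, which is the texture-geometry trade-off the abstract refers to.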

Novel View Synthesis

Objaverse Test Set

Google Scanned Objects (GSO)

Downstream Tasks

3D Reconstruction with NeuS

DreamFusion

One-2-3-45

BibTeX


      @misc{weng2023consistent123,
            title={Consistent123: Improve Consistency for One Image to 3D Object Synthesis}, 
            author={Haohan Weng and Tianyu Yang and Jianan Wang and Yu Li and Tong Zhang and C. L. Philip Chen and Lei Zhang},
            year={2023},
            eprint={2310.08092},
            archivePrefix={arXiv},
            primaryClass={cs.CV}
      }