LaRa: Efficient Large-Baseline Radiance Fields (2024)

1 University of Tübingen, Tübingen AI Center  2 ETH Zürich
https://apchenstu.github.io/LaRa/

Anpei Chen1,2  Haofei Xu1,2  Stefano Esposito1
Siyu Tang2  Andreas Geiger1

Abstract

Radiance field methods have achieved photorealistic novel view synthesis and geometry reconstruction. But they are mostly applied in per-scene optimization or small-baseline settings. While several recent works investigate feed-forward reconstruction with large baselines by utilizing transformers, they all operate with a standard global attention mechanism and hence ignore the local nature of 3D reconstruction. We propose a method that unifies local and global reasoning in transformer layers, resulting in improved quality and faster convergence. Our model represents scenes as Gaussian Volumes and combines this with an image encoder and Group Attention Layers for efficient feed-forward reconstruction. Experimental results show that our model, trained for two days on four GPUs, achieves high fidelity in reconstructing 360° radiance fields and is robust to zero-shot and out-of-domain testing.

Keywords: 3D Reconstruction · 3D Transformer · Radiance Fields

1 Introduction

The ability to reconstruct the shape and appearance of objects from multi-view images has long been one of the core challenges for computer vision and graphics. Modern 3D reconstruction techniques achieve impressive results with various applications in visual effects, e-commerce, virtual and augmented reality, and robotics. However, they are limited to small camera baselines or dense image captures [66, 8, 42, 32]. In recent years, the computer vision community has made great strides towards high-quality scene reconstruction. In particular, Structure-from-Motion [51, 55] and multi-view stereo [72, 23] emerged as powerful 3D reconstruction methods. They identify surface points by aggregating similarities between point features queried from source images, and are able to reconstruct highly accurate surface and texture maps.

Despite these successes, geometry with view-consistent textures is not the only aspect required in applications of 3D reconstruction. The reconstruction process should also be able to recover view-dependent appearance. To this end, neural radiance fields [42] and neural implicit surfaces [47, 74] investigate volumetric representations that can be learned from multi-view captures without explicit feature matching. Their follow-ups [64, 76, 44, 21, 5, 32, 57, 2, 77] improve efficiency and quality, but mostly require per-scene optimization and dense multi-view supervision.

Several recent works thus investigate feed-forward models for radiance field reconstruction while relaxing the dense input view requirement. While feed-forward designs vary, they commonly utilize local feature matching [8, 13, 30, 37, 70], which however limits them to small-baseline reconstruction, since feature matching generally relies on substantial image overlap and reasonably similar viewpoints. Geometry-aware transformers [49, 34, 62, 43] have also been adapted to address large-baseline problems, but they often suffer from blurry reconstructions due to the lack of 3D inductive biases. Recent large reconstruction models [26, 36] learn the internal perspective relationships through context attention, enabling large-baseline reconstruction. However, the transformers are unaware of epipolar constraints, and instead are tasked to implicitly learn spatial relationships, which requires substantial data and GPU resources.

In this work, we present LaRa, a feed-forward reconstruction model that reconstructs 360° bounded radiance fields from a few unstructured views without requiring heavy training resources. The core idea of our work is to progressively and implicitly perform feature matching through a novel volume transformer. We propose a Gaussian volume as 3D representation, in which each voxel comprises a set of learnable Gaussian primitives. To obtain the Gaussian volume from image conditions, we progressively update a learnable embedding volume by querying features in 3D. Specifically, we utilize a DINO image feature encoder to obtain image tokens and lift 2D tokens to 3D by unprojecting them to a shared canonical space. Next, we propose a novel Group Attention Layer architecture to enable local and global feature aggregation. Specifically, we divide dense volumes into local groups and only apply attention within each group, inspired by standard feature point matching. The grouped features and embeddings are fed to a cross-attention sub-layer to implicitly match features between feature groups of the feature volume and embedding volume, which is followed by a 3D CNN layer to efficiently share information across neighboring groups. After passing through all attention layers, the volume transformer outputs a Gaussian volume, which is then decoded into 2D Gaussian [27] parameters using a coarse-to-fine decoding process. By incorporating efficient rasterization, our method achieves high-resolution renderings.

We demonstrate our method’s efficiency and robustness for providing photorealistic, 360° novel view synthesis results using only four input images. We find that our model achieves zero-shot generalization to significantly out-of-distribution inputs. Moreover, our reconstructed radiance fields allow high-quality mesh reconstruction using off-the-shelf depth-map fusion algorithms. Finally, our model achieves high-quality reconstruction results using only 4 A100-40G GPUs within a span of 2 days.

2 Related Work

Multi-view stereo. Multi-view stereo reconstruction aims to generate detailed 3D models by reasoning from images captured from multiple viewpoints, which has been studied for decades [14, 35, 33, 25, 52, 22, 50]. In recent years, multi-view stereo networks [72, 28] have been proposed to address MVS problems. MVSNet [72] utilizes a 3D Convolutional Neural Network for processing a cost volume. This cost volume is created by aggregating features from a set of adjacent views, employing the plane-sweeping technique from a reference viewpoint, facilitating depth estimation and enabling superior 3D reconstructions. Subsequent research has built on top of this foundation, incorporating strategies such as iterative plane sweeping [73], point cloud enhancement [9], confidence-driven fusion [41], and the usage of multiple cost volumes [12, 24] to further refine reconstruction accuracy. However, all of these works require large image overlap for faithful feature matching.

Few-shot Radiance fields. The radiance field representation [42] has revolutionized the reconstruction field, emerging as a promising replacement for traditional reconstruction methods. Despite the promising achievement in per-scene sparse view reconstruction [46, 63, 16, 56, 60, 6, 7, 53, 68], training a feed-forward radiance field predictor [76, 66, 8, 13] has gained popularity. MVSNeRF [8] proposed to combine a cost volume with volume rendering, allowing appearance and geometry reconstruction only using a photometric loss. The following works [30, 37, 10, 70] are proposed to advance reconstruction quality and efficiency. Similarly to standard MVS methods, they are limited to small camera baselines.

Recently, several works have explored feed-forward models for few-shot [1, 36, 19, 43, 4, 11, 67] input by capitalizing on large-scale training datasets and model sizes. They leverage cross-view attention to globally reason about 3D scenes and output a 3D representation (e.g., tri-plane, IB-planes) for radiance field reconstruction. Concurrent work by LGM [59] and GRM [75] introduces few-shot 3D reconstruction models that produce high-resolution 3D Gaussians using a transformer framework. While these methods achieve impressive visual results, training becomes expensive and less practical for the academic community. Unlike some recent single view reconstruction methods [58, 26], our work focuses on few-shot ($>1$) reconstruction since single-view input can be efficiently lifted to multi-view by multi-view generative models [38, 71, 54, 39].

3 LaRa: Large-baseline Radiance Fields

Our goal is to reconstruct the geometry and view-dependent appearance of bounded scenes from sparse input views using limited training resources. Given $M$ images $\mathbf{I} = (I_1, \ldots, I_M)$ with camera parameters $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_M)$, our method reconstructs radiance fields as a collection of 2D Gaussians, which is used to synthesize novel views and extract meshes. Our model is a function $\mathbf{f}$ that takes the voxel positions $\mathbf{v}$ of a discrete radiance field and outputs a Gaussian volume $\mathbf{V}_{\mathcal{G}}$,

\begin{equation}
\mathbf{V}_{\mathcal{G}} = \{\mathcal{G}^{k}_{i}\}_{k=1}^{K} = \mathbf{f}(\mathbf{v}; \mathbf{I}, \boldsymbol{\pi})\text{,} \tag{1}
\end{equation}

where $\mathcal{G}^{k}_{i}$ represents the primitives within the $i$th voxel, and $k$ indexes the $K$ primitives. The output Gaussian volume $\mathbf{V}_{\mathcal{G}}$ can be utilized for decoding into radiance fields. Our work considers sparse input views, in which the camera rotates around a bounded region within a hemisphere. Our approach is designed to handle unstructured views and is flexible to accommodate various numbers of views (see supplementary material). Figure 1 shows an overview of our method.

In the following, we first describe how we model objects using Gaussian Volumes, in which each voxel stores multiple Gaussian primitives (Section 3.1). Next, we introduce how to infer the primitive parameters from multi-view inputs (Section 3.2). For rendering, we explore a coarse-fine decoding process to enable efficient rendering with rich texture details (Section 3.3). Finally, we discuss how we train our model from large-scale image collections (Section 3.4).

3.1 3D Representation

We utilize a 3D voxel grid as our 3D representation, consisting of three volumes: an image feature volume $\mathbf{V}_{\text{f}}$ that models image conditions, an embedding volume $\mathbf{V}_{\text{e}}$ that describes the 3D prior learned from data, and a Gaussian volume $\mathbf{V}_{\mathcal{G}}$ that represents the radiance field.

Image feature volume. We construct a feature volume for each input view by lifting the 2D image features to a canonical volume defined in the center of the scene. We use the DINO [3] image encoder to extract per-view image features, and inject Plücker ray directions into the features via adaptive layer norm [48]. Unlike previous works that modulate image features with camera poses using extrinsic and intrinsic matrices [38, 26], we use Plücker rays, which are defined by the cross product between the camera location and the ray direction and offer a unique ray parameterization independent of object scale, camera position and focal length. After modulation, we obtain $M$ per-view image feature maps. We further lift the 2D maps to 3D by back-projecting the feature maps to a canonical volume, resulting in $M$ feature volumes $\mathbf{V}_{\text{f}} \in \mathbb{R}^{W \times W \times W \times O}$ with $O$ channels.
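The Plücker parameterization mentioned above can be illustrated with a minimal NumPy sketch. The function name `plucker_rays`, the 6D layout (direction followed by moment), and the normalization of directions are assumptions made for illustration, not details given in the text.

```python
import numpy as np

def plucker_rays(origin, directions):
    """Return 6D Plücker coordinates (d, o x d) for rays sharing one origin.

    origin:     (3,) camera centre in world space.
    directions: (N, 3) ray directions (need not be normalised).
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    moment = np.cross(origin, d)  # o x d, broadcast over all rays
    return np.concatenate([d, moment], axis=-1)

origin = np.array([0.0, 0.0, 2.0])
dirs = np.array([[0.0, 0.0, -1.0], [1.0, 0.0, -1.0]])
rays = plucker_rays(origin, dirs)

# Sliding the origin along a ray leaves that ray's coordinates unchanged,
# so each ray has a single representation regardless of where it starts.
shifted = plucker_rays(origin + 0.5 * dirs[1], dirs[1:2])
```

In this sketch, `shifted[0]` matches `rays[1]`, which is the sense in which the parameterization is unique per ray.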

Embedding volume. Inspired by prior works [31, 45, 26], we construct a learnable embedding volume $\mathbf{V}_{\text{e}} \in \mathbb{R}^{W \times W \times W \times C}$ for modeling prior knowledge. 3D reconstruction is generally under-constrained in sparse view settings, hence prior knowledge is critical for faithful reconstructions. We propose to leverage a 3D embedding volume to model and learn prior information across objects, which acts as a 3D object template that greatly reduces the solution space. The embedding volume is aligned with the image feature volume, allowing for efficient cross attention (see Section 3.2).

Gaussian volume. To achieve efficient rendering, we propose to use dense primitives as an object representation and output a set of 2D Gaussians from the image feature volume and embedding volume. However, predicting a set of dense unordered point sets without 3D supervision is always a challenge for neural networks. To this end, we introduce a dense Gaussian volume representation that can effectively model points densely near the object’s surface, while being suitable for modern network architectures by facilitating prediction and generation.

Specifically, our Gaussian volume comprises $K$ learnable Gaussian primitives per voxel, where each primitive can move freely within a constrained spherical region centered at the voxel’s center. For primitive modeling, we borrow the shape and appearance parametrization from 2D Gaussian splatting [27] for better surface modeling. Each Gaussian has an opacity $\alpha$, tangent vectors $\mathbf{t} = [\mathbf{t}_{\text{u}}, \mathbf{t}_{\text{v}}]$, a scaling vector $\mathbf{S} = (s_{\text{u}}, s_{\text{v}})$ controlling the shape of the 2D Gaussian, and spherical harmonics coefficients for view-dependent appearance. Furthermore, we substitute the primitive’s position with an offset vector $\Delta \in [-1, 1]^{3}$, incorporating a scaled sigmoid activation function. Consequently, the position of Gaussian primitive $k$ in voxel $\mathbf{v}_{i}$ is expressed as $\mathbf{p}^{k}_{i} = \mathbf{v}_{i} + r \cdot \Delta^{k}_{i}$, where $r$ signifies the maximum displacement range of the primitive. In this way, primitives are restricted to neighborhoods of uniformly distributed local centers. The inclusion of offset modeling allows each voxel to effectively represent adjacent regions that require it. This reduces unnecessary capacity in empty space and enhances the representational capacity compared to the standard dense volume. We refer the reader to the supplementary material for more details on 2D Gaussian splatting.
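The offset parameterization $\mathbf{p}^{k}_{i} = \mathbf{v}_{i} + r \cdot \Delta^{k}_{i}$ can be sketched as follows. The grid extent $[-1, 1]^3$, the values of `W`, `K` and `r`, and the specific form $2\sigma(x) - 1$ of the scaled sigmoid are illustrative assumptions; the text only states that $\Delta$ lies in $[-1, 1]^3$.

```python
import numpy as np

def primitive_positions(voxel_centers, raw_offsets, r):
    """Map unbounded network outputs to bounded primitive positions.

    voxel_centers: (N, 3) voxel centres v_i.
    raw_offsets:   (N, K, 3) unbounded per-primitive outputs.
    r:             maximum displacement per axis.
    """
    delta = 2.0 / (1.0 + np.exp(-raw_offsets)) - 1.0  # scaled sigmoid in (-1, 1)
    return voxel_centers[:, None, :] + r * delta       # (N, K, 3)

W, K, r = 4, 2, 0.1  # illustrative sizes, not the paper's settings
axis = (np.arange(W) + 0.5) / W * 2.0 - 1.0            # voxel centres per axis
centers = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
centers = centers.reshape(-1, 3)                       # (W**3, 3)

rng = np.random.default_rng(0)
raw = rng.normal(size=(W**3, K, 3))
positions = primitive_positions(centers, raw, r)
```

With a per-axis bound as written here, every primitive stays within `r` of its voxel centre along each coordinate, and a zero raw offset places the primitive exactly at the centre.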

3.2 Volume Transformer

To predict the Gaussian volume, we propose a volume transformer architecture to perform attention between volumes. Self-attention and cross-attention modules, as commonly used in transformers [17], are inefficient for volumes, since the number of tokens grows cubically with the resolution of the 3D representation. Naïve applications thus result in long training times and large GPU memory requirements. In addition, geometry constraints and regional matching play crucial roles in the context of 3D reconstruction, which should be considered in the attention design.

We now present our novel volume transformer containing a set of group attention layers that progressively update the embedding volume. Our group attention layers contain three sublayers (see Figure 2): group cross-attention, a multi-layer perceptron (MLP), and 3D convolution. Given the image feature volume and embedding volume, we first unfold these 3D volumes (i.e., $\mathbf{V}_{\text{f}}$ and $\mathbf{V}_{\text{e}}$) into $G$ local token groups along each axis. We then apply a cross-attention layer between the corresponding groups of embedding tokens $\mathbf{V}_{\text{e}}^{g,j}$ and image feature tokens $\mathbf{V}_{\text{f}}^{g}$, where $j$ denotes the index of the layer starting from 1, and $\{\mathbf{V}_{\text{e}}^{g,1}\}_{g} = \mathbf{V}_{\text{e}}$. Figure 1 illustrates the unfolding for $G = 4$ and highlights the corresponding groups.
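The unfolding step can be sketched with plain array reshapes. The sketch assumes that "$G$ groups along each axis" yields $G^3$ groups of $(W/G)^3$ tokens each and that $W$ is divisible by $G$; the helper names `unfold_volume` and `fold_volume` are hypothetical.

```python
import numpy as np

def unfold_volume(volume, G):
    """Split a (W, W, W, C) volume into G**3 groups of (W // G)**3 tokens."""
    W, C = volume.shape[0], volume.shape[-1]
    s = W // G                                   # side length of one group
    v = volume.reshape(G, s, G, s, G, s, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)         # group indices first
    return v.reshape(G**3, s**3, C)

def fold_volume(groups, G):
    """Inverse of unfold_volume: reassemble groups into a dense volume."""
    s = round(groups.shape[1] ** (1 / 3))
    C = groups.shape[-1]
    v = groups.reshape(G, G, G, s, s, s, C)
    v = v.transpose(0, 3, 1, 4, 2, 5, 6)
    return v.reshape(G * s, G * s, G * s, C)

W, G, C = 8, 4, 3  # illustrative sizes
volume = np.arange(W**3 * C, dtype=float).reshape(W, W, W, C)
groups = unfold_volume(volume, G)                # (64, 8, 3)
restored = fold_volume(groups, G)

# Query-key pairs for one view: global attention vs. attention within groups.
s = W // G
pairs_global = (W**3) ** 2
pairs_grouped = G**3 * (s**3) ** 2
```

Under these assumptions the number of attended token pairs drops by a factor of $G^3$ (here `pairs_global // pairs_grouped` is 64), which is the saving that restricting attention to local groups buys.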

The next sublayer is an MLP, similar to the original transformer [61, 48, 29]. The updated embedding groups $\{\ddot{\mathbf{V}}_{\text{e}}^{g,j}\}_{g=1}^{G}$ are reassembled into the original volume shape, resulting in $\ddot{\mathbf{V}}_{\text{e}}^{j}$, which is subsequently processed by a 3D convolutional layer to share information between groups and enable the intra-model connections within the spatially organized voxels. In summary,

\begin{align}
\dot{\mathbf{V}}_{\text{e}}^{g,j} &= \text{GroupCrossAttn}\left(\text{LN}\left(\mathbf{V}_{\text{e}}^{g,j}\right), \mathbf{V}_{\text{f}}^{g}\right) + \mathbf{V}_{\text{e}}^{g,j}\text{,} \tag{2} \\
\ddot{\mathbf{V}}_{\text{e}}^{g,j} &= \text{MLP}\left(\text{LN}\left(\dot{\mathbf{V}}_{\text{e}}^{g,j}\right)\right) + \dot{\mathbf{V}}_{\text{e}}^{g,j}\text{,} \tag{3} \\
\mathbf{V}_{\text{e}}^{j+1} &= \text{3DCNN}\left(\text{LN}\left(\ddot{\mathbf{V}}_{\text{e}}^{j}\right)\right) + \ddot{\mathbf{V}}_{\text{e}}^{j}\text{.} \tag{4}
\end{align}

To incorporate information from multiple views, we flatten and concatenate the image feature tokens from multi-view feature volumes. It is important to note that different groups are processed simultaneously by the group attention layer across the batch dimension. This parallel processing allows for a larger training batch size within the attention sublayer, reducing the number of training steps required. In addition, using a 3D convolution layer increases inference efficiency compared to the popular self-attention layer. We also apply layer norms $\text{LN}(\cdot)$ between the sub-layers. Finally, the output embedding volume $\mathbf{V}_{\text{e}}^{j+1}$ serves as input for the subsequent $(j+1)$th group attention layer.
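A minimal sketch of the group cross-attention sub-layer of Eq. (2), with groups placed on the batch axis and multi-view feature tokens concatenated along the token axis. It uses a single attention head, random projection matrices, and the same channel count for features and embeddings; all of these are simplifying assumptions, and the MLP and 3D convolution sub-layers are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_cross_attn(emb_groups, feat_groups, Wq, Wk, Wv):
    """Cross-attention within each group, plus a residual connection.

    emb_groups:  (B, Ne, C) embedding tokens; B indexes the groups.
    feat_groups: (B, Nf, C) image feature tokens of the same groups.
    """
    q = layer_norm(emb_groups) @ Wq
    k = feat_groups @ Wk
    v = feat_groups @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (B, Ne, Nf)
    return softmax(scores) @ v + emb_groups

rng = np.random.default_rng(0)
G, s, C, M = 2, 2, 8, 4  # illustrative sizes
emb = rng.normal(size=(G**3, s**3, C))
# One set of feature groups per input view, concatenated along the token axis.
views = [rng.normal(size=(G**3, s**3, C)) for _ in range(M)]
feat = np.concatenate(views, axis=1)             # (G**3, M * s**3, C)
Wq, Wk, Wv = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))
out = group_cross_attn(emb, feat, Wq, Wk, Wv)    # same shape as emb
```

Because the batched matrix products never mix entries across the first axis, each embedding group only sees image features from its own spatial region; in the layer described above, exchange between groups is left to the subsequent 3D convolution.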

After passing through all (12 in our experiments) group attention layers, we use a 3D transposed CNN to scale up the updated embedding volume $\dot{\mathbf{V}}_{\text{e}}$, i.e., $\mathbf{V}_{\mathcal{G}} = \text{Transpose-3DCNN}\left(\dot{\mathbf{V}}_{\text{e}}\right)$. We now have a Gaussian volume $\mathbf{V}_{\mathcal{G}}$ in which each Gaussian voxel is a 1-D feature vector $\mathbf{V}^{i}_{\mathcal{G}} \in \mathbb{R}^{1 \times B}$ representing the primitives associated with the voxel.

3.3 Coarse-Fine Decoding

To obtain 2D Gaussian primitive shape and appearance parameters from the Gaussian volume, we introduce a coarse-fine decoding process that better recovers texture details. Instead of using a single network and sampling scheme to reason about the scene, we simultaneously optimize two decoding modules: one “coarse” and one “fine”.

For the “coarse” decoding module, we feed Gaussian volume features to a lightweight MLP and output a set of $K$ Gaussian parameters per voxel. We employ the efficient 2D splatting technique [27] to form high-resolution renderings, including RGB, depth, opacity, and normal maps. During training, we render $M$ input views and $M$ novel views for supervision.

Although the coarse renderings can already provide faithful depths/geometries, the appearance tends to be blurred, as shown in (e) of Figure 6. This is because the image texture can easily be lost after the DINO encoder and the Group Attention layers. To address this problem, we propose a “fine” decoding module to guide fine texture prediction.

Specifically, we project the primitive centers $\mathbf{p}^{k}_{i}$ onto the coarse renderings (i.e., RGB image $\hat{\mathbf{I}}$, depth image $\hat{\mathbf{D}}$, and accumulation alpha map $\hat{\mathbf{A}}$) using the camera poses $\boldsymbol{\pi}$ to gather the coarse rendering values for each primitive,

\begin{equation}
\mathcal{X}_{\mathbf{p}^{k}_{i}} = \left(\mathbf{I}_{\mathbf{p}^{k}_{i}}, \hat{\mathbf{I}}_{\mathbf{p}^{k}_{i}}, \hat{\mathbf{D}}_{\mathbf{p}^{k}_{i}}, \hat{\mathbf{A}}_{\mathbf{p}^{k}_{i}}\right) = \boldsymbol{\Phi}\left(\mathcal{P}\left(\mathbf{p}^{k}_{i}, \boldsymbol{\pi}\right), \oplus\left[\mathbf{I}, \hat{\mathbf{I}}, \hat{\mathbf{D}}, \hat{\mathbf{A}}\right]\right)\text{,} \tag{5}
\end{equation}

where $\mathcal{P}$ denotes the point projection, $\oplus$ is a concatenation operation along the channel dimension, and $\boldsymbol{\Phi}$ is a bilinear interpolation in 2D space.

In practice, the depth features can change significantly across different scenes. To mitigate scaling discrepancies, we replace the rendering depth $\hat{\mathbf{D}}_{\mathbf{p}^{k}_{i}}$ with a displacement feature $\left\lvert \hat{\mathbf{D}}_{\mathbf{p}^{k}_{i}} - z_{\mathbf{p}^{k}_{i}} \right\rvert$ that compares the rendered depth for input views and the depth $z_{\mathbf{p}^{k}_{i}}$ of a primitive, allowing for occlusion-aware reasoning.
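The projection, bilinear sampling, and displacement feature can be sketched for a single input view as follows. The pinhole camera model, the convention that pixel centres sit at integer coordinates, the channel layout of the concatenated buffers, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def project(points, intrinsics, R, t):
    """Pinhole projection of world points (N, 3) to pixels (N, 2) and depth (N,)."""
    cam = points @ R.T + t
    z = cam[:, 2]
    uv = (cam @ intrinsics.T)[:, :2] / z[:, None]
    return uv, z

def bilinear(image, uv):
    """Sample an (H, W, C) image at continuous pixel coordinates uv = (x, y)."""
    H, W = image.shape[:2]
    x = np.clip(uv[:, 0], 0, W - 1)
    y = np.clip(uv[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bottom = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

H = W = 4
intrinsics = np.array([[2.0, 0.0, 1.5], [0.0, 2.0, 1.5], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)

# Input image, coarse RGB, coarse depth, coarse alpha, concatenated by channel.
rgb_in = np.zeros((H, W, 3))
rgb_coarse = np.ones((H, W, 3))
depth = np.full((H, W, 1), 2.0)
alpha = np.ones((H, W, 1))
buffers = np.concatenate([rgb_in, rgb_coarse, depth, alpha], axis=-1)  # (H, W, 8)

points = np.array([[0.0, 0.0, 2.0], [0.5, 0.25, 3.0]])  # primitive centres
uv, z = project(points, intrinsics, R, t)
feats = bilinear(buffers, uv)                 # (N, 8) per-primitive samples
feats[:, 6] = np.abs(feats[:, 6] - z)         # depth channel -> |D_hat - z|
```

In this toy setup the first point lies on the rendered surface (displacement 0), while the second sits one unit behind it (displacement 1), which is the kind of signal that lets a decoder discount occluded primitives.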

We then apply a point-based cross-attention layer to establish relationships between the features of a point $\mathcal{X}_{\mathbf{p}^{k}_{i}}$ and the primitive voxel. The results of this cross-attention process are then fed into an MLP, which is tasked with predicting the residual spherical harmonics

\begin{align}
\text{SH}_{i,k}^{\textit{residuals}} &= \text{MLP}\left(\text{CrossAttn}\left(\mathcal{X}_{\mathbf{p}^{k}_{i}}, \mathbf{V}^{i}_{\text{e}}\right)\right)\text{,} \tag{6} \\
\text{SH}_{i,k}^{\textit{fine}} &= \text{SH}_{i,k}^{\textit{coarse}} + \text{SH}_{i,k}^{\textit{residuals}}\text{.} \tag{7}
\end{align}

Intuitively, the “fine” decoding module attempts to learn a geometry-aware texture blending process based on multi-view images, primitive features, and rendering buffers from the coarse module. Furthermore, both coarse and fine modules are differentiable and updated simultaneously. Thus, the fine renderings can further regularize the coarse predictions.

Splatting. Our work takes advantage of Gaussian splatting [32, 27] to facilitate efficient high-resolution image rendering. We follow the original rasterization process and further output depth and normal maps by integrating the $z$ value and the normal of the primitives.

3.4 Training

LaRa is optimized across scenes via gradient descent, minimizing simple image reconstruction objectives between the coarse and fine renderings (i.e., $\hat{\mathcal{I}}$) and the ground-truth images (i.e., $\mathcal{I}$),

\begin{equation}
\mathcal{L} = \mathcal{L}_{\text{MSE}}(\mathcal{I}, \hat{\mathcal{I}}) + \mathcal{L}_{\text{SSIM}}(\mathcal{I}, \hat{\mathcal{I}}) + \mathcal{L}_{\text{Reg}}\text{,} \tag{8}
\end{equation}

where $\mathcal{L}_{\text{MSE}}$ is the pixel-wise L2 loss and $\mathcal{L}_{\text{SSIM}}$ is the structural similarity loss; both are applied to the coarse and fine RGB outputs.

Regularization terms. We find that applying only the photometric reconstruction losses is adequate for rendering. However, the consistency across views is low because of the strong flexibility of the discrete Gaussian primitives. To encourage the primitives to be constructed on the surface, we follow 2D Gaussian splatting [27], which utilizes a self-supervised distortion loss $\mathcal{L}_{\text{d}}$ and a normal consistency loss $\mathcal{L}_{\text{n}}$ to regularize the training.

Specifically, we concentrate the weight distribution along the rays by minimizing the distance between the ray-primitive intersections, inspired by Mip-NeRF [2]. Given a ray $\mathbf{u}(\mathbf{x})$ of pixel $\mathbf{x}$, we obtain its distortion loss by

$$\mathcal{L}_{\text{d}} = \sum_{i,j} \omega_i \omega_j |z_i - z_j|, \tag{9}$$

where $\omega_i = \alpha_i\,\mathcal{G}_i(\mathbf{u}(\mathbf{x})) \prod_{j=1}^{i-1} \left(1 - \alpha_j\,\mathcal{G}_j(\mathbf{u}(\mathbf{x}))\right)$ is the blending weight of the $i$-th intersection and $z_i$ is the depth of the intersection point.
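The distortion loss of Eq. 9 can be sketched for a single ray as follows. This is a direct quadratic-time evaluation in NumPy for illustration only, not the rasterizer implementation; the function names are hypothetical, and the input `alpha_g` is assumed to hold the products $\alpha_i\,\mathcal{G}_i(\mathbf{u}(\mathbf{x}))$ sorted front to back.

```python
import numpy as np

def blending_weights(alpha_g):
    """omega_i = a_i * prod_{j<i} (1 - a_j), with a_i = alpha_i * G_i(u(x))."""
    alpha_g = np.asarray(alpha_g, dtype=np.float64)
    # Transmittance before each intersection: 1 for the first, then a running product.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alpha_g)[:-1]))
    return alpha_g * transmittance

def distortion_loss(alpha_g, z):
    """L_d = sum_{i,j} omega_i * omega_j * |z_i - z_j| for one ray."""
    w = blending_weights(alpha_g)
    z = np.asarray(z, dtype=np.float64)
    pairwise_dist = np.abs(z[:, None] - z[None, :])
    return float(np.sum(w[:, None] * w[None, :] * pairwise_dist))

# Two intersections with opacity 0.5 each, at depths 1 and 2:
# weights are [0.5, 0.25], so L_d = 2 * 0.5 * 0.25 * 1 = 0.25.
example_loss = distortion_loss([0.5, 0.5], [1.0, 2.0])
```

The loss vanishes when all weighted intersections share the same depth, which is what pushes the primitives along a ray to concentrate at a single surface.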

As 2D Gaussians explicitly model the primitive normals, we can align their normals $\mathbf{n}_i$ with the normals $\mathbf{N}$ derived from the depth maps via the loss

$$\mathcal{L}_{\text{n}} = \sum_{i} \omega_i \left(1 - \mathbf{n}_i^{\top}\mathbf{N}\right). \tag{10}$$

Therefore, our regularization term for the ray $\mathbf{u}(\mathbf{x})$ is given by $\mathcal{L}_{\text{Reg}} = \gamma_{\text{d}}\mathcal{L}_{\text{d}} + \gamma_{\text{n}}\mathcal{L}_{\text{n}}$. We set $\gamma_{\text{d}} = 1000$ and $\gamma_{\text{n}} = 0.2$ in our experiments.
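A minimal per-ray sketch of the normal consistency loss (Eq. 10) and the weighted regularization term is given below. It assumes unit-length normals and precomputed blending weights; the function names are hypothetical, while the default weights are the values stated above.

```python
import numpy as np

def normal_consistency_loss(weights, normals, depth_normal):
    """L_n = sum_i omega_i * (1 - n_i^T N) for one ray.

    weights:      (n,) blending weights omega_i
    normals:      (n, 3) unit primitive normals n_i
    depth_normal: (3,) unit normal N derived from the depth map
    """
    w = np.asarray(weights, dtype=np.float64)
    n = np.asarray(normals, dtype=np.float64)
    big_n = np.asarray(depth_normal, dtype=np.float64)
    return float(np.sum(w * (1.0 - n @ big_n)))

def regularization(loss_d, loss_n, gamma_d=1000.0, gamma_n=0.2):
    """L_Reg = gamma_d * L_d + gamma_n * L_n."""
    return gamma_d * loss_d + gamma_n * loss_n

# Aligned normals give zero loss; orthogonal normals contribute their full weight.
aligned = normal_consistency_loss([0.5, 0.25], [[0, 0, 1], [0, 0, 1]], [0, 0, 1])
orthogonal = normal_consistency_loss([0.5, 0.25], [[1, 0, 0], [0, 1, 0]], [0, 0, 1])
```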

4 Implementation Details

We briefly discuss our implementation, including the training and evaluation dataset, network design, optimizer, and mesh extraction.

Datasets. We train our model on multi-view synthetic renderings of objects [69], based on the Objaverse dataset [15], which includes 264,775 scenes with a train/test split of 10:1. Each scene contains 38 circular views with an image resolution of $512 \times 512$. To ensure sufficient angular coverage of the input views, we employ the classical K-means algorithm to cluster the cameras into 4 clusters. During training, we randomly choose two views from each cluster for every iteration; the first 4 images share the same camera poses as the input views, while the remaining 4 images are novel view outputs. We employ the eight output images for supervision and leverage the loss objectives outlined in Eq. 8 to update the network.
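The view selection described above can be sketched as follows. This is an illustrative NumPy version under several assumptions not stated in the text: cameras are clustered by their 3D positions, K-means is initialized from evenly spaced camera indices, and one of the two sampled views per cluster serves as an input view while the other serves as a novel view. All function names are hypothetical.

```python
import numpy as np

def kmeans(points, k=4, iters=20):
    """Plain Lloyd's K-means; returns a cluster label for every point."""
    points = np.asarray(points, dtype=np.float64)
    # Assumption: deterministic initialization from evenly spaced indices.
    init_idx = np.linspace(0, len(points), k, endpoint=False).astype(int)
    centers = points[init_idx].copy()
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels

def sample_views(labels, k=4, rng=None):
    """Pick two distinct views per cluster: one input view, one novel view."""
    if rng is None:
        rng = np.random.default_rng()
    inputs, novel = [], []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        pick = rng.choice(idx, size=2, replace=False)
        inputs.append(int(pick[0]))
        novel.append(int(pick[1]))
    return inputs, novel

# Example: 38 cameras on a unit circle around the object.
angles = 2.0 * np.pi * np.arange(38) / 38
cameras = np.stack([np.cos(angles), np.sin(angles), np.zeros(38)], axis=1)
labels = kmeans(cameras)
input_ids, novel_ids = sample_views(labels, rng=np.random.default_rng(0))
```

All eight selected views (`input_ids` followed by `novel_ids`) would then be rendered and supervised.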

We present our in-domain evaluation using the Objaverse dataset’s test set, consisting of 26,478 scenes. To assess our model’s cross-domain applicability, we conducted tests on the Google Scanned Objects dataset [18], which contains 1,030 scans of real objects, and on the 46 hydrants and 90 teddy bears from the Co3D test set [49], totaling 136 objects. To examine our model’s performance on the zero-shot reconstruction task, we use the generative multi-view dataset from Instant3D [36], which comprises 122 scenes generated from text prompts.

Network. We developed LaRa using PyTorch Lightning [20] and conducted our training on 4 NVIDIA A100-40G GPUs over a period of 2 days for the fast model and 3.5 days for the base model, with a batch size of 2 per GPU. We use DINO-base for encoding $M = 4$ multi-view images at a resolution of $512 \times 512$. We use a volume resolution of $W = 16$ with $C = 768$ channels for the image feature volume, and a resolution of $W = 32$ with $C = 256$ channels for the embedding volume, dividing both into $G = 16$ groups for the group attention layers. Our group attention network consists of 12 layers, producing a Gaussian volume of size $64 \times 64 \times 64 \times 80$. We choose $K = 2$ primitives for each voxel, and constrain the offset radius to $r = 1/32$ in our experiments. The total number of trainable parameters is 125 million.

Training. The optimization is carried out using the AdamW optimizer [40], starting with a learning rate of $2 \times 10^{-4}$ and following a cosine annealing schedule with a period of 10 epochs. Our final model is trained for 50 epochs, comprising 50,000 iterations for each epoch. We observe that applying the regularization loss from the start can slow down the convergence regarding the shape. This is because regularization objectives tend to encourage thinner surfaces, which may result in premature convergence to local minima if the shapes are noisy. In our experiments, we thus enable regularization after the first 15 epochs.
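The schedule described above can be sketched in plain Python. The text does not say whether the cosine schedule restarts after each period or what the minimum learning rate is, so this sketch assumes warm restarts every 10 epochs, a minimum of zero, epoch-level granularity, and 0-indexed epochs. The function names are hypothetical.

```python
import math

def learning_rate(epoch, base_lr=2e-4, period=10, min_lr=0.0):
    """Cosine annealing; assumes a restart to base_lr at every period boundary."""
    t = (epoch % period) / period
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def regularization_enabled(epoch, warmup_epochs=15):
    """The regularization loss is switched on after the first 15 epochs."""
    return epoch >= warmup_epochs

# Per-epoch schedule for a 50-epoch run.
schedule = [(e, learning_rate(e), regularization_enabled(e)) for e in range(50)]
```

With 50 epochs of 50,000 iterations each, such a run covers 2.5 million iterations in total.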

Mesh extraction. To obtain a mesh from reconstructed 2D primitives, we generate RGBD maps by rendering along three circular video trajectories at elevations of 30°, 0°, and −30°. Inside the scene bounding box, we construct a signed distance function volume and apply truncated SDF (TSDF) fusion to integrate the reconstructed RGB and depth maps, allowing for efficient textured mesh extraction. In our experiments, we use a resolution of $256^3$ and set a truncation threshold of 0.02 for TSDF fusion.
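The camera placement for the three circular trajectories can be sketched as below; the TSDF fusion step itself is not shown. The sketch assumes a z-up coordinate frame, a unit camera distance, and 16 evenly spaced views per ring, none of which is specified in the text. The function name is hypothetical.

```python
import numpy as np

def circular_trajectory(elevation_deg, num_views, radius=1.0):
    """Camera centers on a circle at a fixed elevation, looking at the origin."""
    elev = np.deg2rad(elevation_deg)
    azim = np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False)
    x = radius * np.cos(elev) * np.cos(azim)
    y = radius * np.cos(elev) * np.sin(azim)
    z = np.full(num_views, radius * np.sin(elev))
    return np.stack([x, y, z], axis=1)

# Three rings at elevations of 30, 0, and -30 degrees (16 views each is an assumption).
camera_centers = np.concatenate(
    [circular_trajectory(e, 16) for e in (30.0, 0.0, -30.0)], axis=0
)
```

Each camera center would be paired with a look-at rotation towards the object center before rendering the RGBD maps that are fused into the TSDF volume.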

5 Experiments

We now present an extensive evaluation of LaRa, our large-baseline radiance field. We first compare with previous and concurrent works on in-domain and zero-shot generalization settings. We then analyze the effect of local attention, regularization term, and renderer.

5.1 Comparison

We compare our method against MVSNeRF [8], MuRF [70], and the concurrent work LGM [59]. The first two methods are key representatives of feature matching-based methods, and the latter shares a conceptually similar approach of using Gaussian primitives for large-baseline settings. It is worth noting that while existing feed-forward radiance field reconstruction methods are capable of being evaluated in large-baseline settings, retraining these methods to establish a new large-baseline benchmark on the Objaverse dataset is both time and GPU intensive. Here, we retrain MVSNeRF and the current state-of-the-art feed-forward radiance field reconstruction method MuRF [70].

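Table 1: Quantitative comparison of rendering quality on Gobjaverse [69], GSO [18], and Co3D [49].

| Method | Gobjaverse PSNR ↑ | Gobjaverse SSIM ↑ | Gobjaverse LPIPS ↓ | GSO PSNR ↑ | GSO SSIM ↑ | GSO LPIPS ↓ | Co3D PSNR ↑ | Co3D SSIM ↑ | Co3D LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MVSNeRF [8] | 14.48 | 0.896 | 0.1856 | 15.21 | 0.912 | 0.1544 | 12.94 | 0.841 | 0.2412 |
| MuRF [70] | 14.05 | 0.877 | 0.3018 | 12.89 | 0.885 | 0.2797 | 11.60 | 0.815 | 0.3933 |
| LGM [59] | 19.67 | 0.867 | 0.1576 | 23.67 | 0.917 | 0.0637 | 13.81 | 0.739 | 0.4142 |
| Ours-fast | 25.30 | 0.925 | 0.1027 | 26.79 | 0.946 | 0.0683 | 21.56 | 0.870 | 0.2079 |
| Ours | 26.14 | 0.931 | 0.0932 | 27.65 | 0.951 | 0.0616 | 21.64 | 0.871 | 0.2026 |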

Appearance. Table 1 shows quantitative comparisons (PSNR, SSIM, and LPIPS). Our method achieves clearly improved rendering quality for both in-domain evaluation (Gobjaverse testing set) and zero-shot generalization (GSO and Co3D datasets). As shown in Figure 4, MVSNeRF fails to provide faithful reconstructions in the large-baseline setting and tends to produce floaters within the reconstruction regions, since the cost volume is extremely noisy in sparse-view scenarios, which makes it challenging for its convolutional matching network to distinguish the surface. MuRF [70] quickly overfits the white background and produces empty predictions for all inputs. Instead of predefining and constructing the feature similarity as network input, our method injects volume features into the intermediate attention layers and implicitly and progressively matches them through the attention mechanism between the volume features and the updated embeddings, achieving clearer and overall better reconstructions.

Our approach is robust to scene scale and can generalize to real captured images, such as those in the Co3D dataset, thanks to our canonical modeling and projection-based feature lifting. In contrast, LGM [59] leverages a monocular prediction and fusion technique that requires a reference scene scale and a constant camera-object distance to avoid focal length and distance ambiguity. This requirement significantly limits its generalizability to real data. As shown in Table 1 and Figure 4, LGM provides faithful reconstructions on datasets with a strictly constant camera-object distance, such as GSO, but fails to generalize to unconstrained multi-view data, such as the Objaverse and Co3D datasets, where it exhibits serious distortions. Our model, trained on 4 A100-40G GPUs for 2 days, demonstrates superior results compared to the LGM model trained on 32 A100-80G GPUs (8× GPUs, 16× RAM, 32× GPU hours) on the same synthetic Objaverse dataset [15].

Furthermore, our approach also performs well for generative multi-view images where textures are not consistent across views. In this comparison, we only present a qualitative analysis due to the absence of ground truths, as illustrated in the bottom rows of Figure 4. Our method offers detailed texture and smooth surface reconstruction. We refer the reader to our Appendix for more results.

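Table 2: Quantitative comparison of geometry (depth) reconstruction.

| Method | Abs err ↓ | Acc (0.005) ↑ | Acc (0.01) ↑ | Acc (0.02) ↑ |
|---|---|---|---|---|
| MVSNeRF [8] | 0.0993 | 6.2 | 12.4 | 24.0 |
| LGM [59] | 0.1121 | 13.4 | 26.2 | 49.6 |
| Ours-fast | 0.0695 | 32.7 | 52.2 | 70.7 |
| Ours | 0.0654 | 36.6 | 57.4 | 75.4 |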

Geometry. We evaluate the quality of our geometry reconstruction by comparing the depth reconstructions on novel views, generated by a weighted sum of the $z$ values of the primitives. As shown in Table 2, our approach achieves significantly lower L1 errors and higher geometry accuracy than the other baselines. In Figure 5, we also visualize geometry reconstruction by extracting meshes using TSDF fusion. In addition, our trajectory video rendering (48 views at a resolution of $512 \times 512$) together with mesh extraction is highly efficient, as it does not require fine-tuning and can be performed in just 2 seconds.
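The depth evaluation can be sketched as follows. The exact metric definitions are not given in the text, so this sketch assumes that the absolute error is the mean L1 depth difference over valid pixels and that Acc(τ) is the percentage of valid pixels whose absolute depth error is below τ. The function names and the optional validity mask are hypothetical.

```python
import numpy as np

def rendered_depth(weights, z):
    """Depth of one pixel as the weighted sum of primitive z values along its ray."""
    return float(np.sum(np.asarray(weights, dtype=np.float64) *
                        np.asarray(z, dtype=np.float64)))

def depth_metrics(pred, gt, thresholds=(0.005, 0.01, 0.02), mask=None):
    """Return (mean absolute error, {threshold: accuracy in percent})."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if mask is None:
        mask = np.ones(gt.shape, dtype=bool)
    err = np.abs(pred - gt)[mask]
    abs_err = float(err.mean())
    acc = {t: float(100.0 * np.mean(err < t)) for t in thresholds}
    return abs_err, acc

# Toy example with four pixels and a constant ground-truth depth of 1.
abs_err, acc = depth_metrics([1.0, 1.004, 1.015, 1.1], [1.0, 1.0, 1.0, 1.0])
```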

5.2 Ablation Study

We now analyze the contributions of individual elements of our model design. To reduce the training cost, we reduce the training from 50 to 30 epochs for ablations.

Table 3: Ablation study on Gobjaverse [69] and GSO [18].

| Design | Gobjaverse PSNR ↑ | Gobjaverse SSIM ↑ | Gobjaverse LPIPS ↓ | GSO PSNR ↑ | GSO SSIM ↑ | GSO LPIPS ↓ | Geo (%) ↑ |
|---|---|---|---|---|---|---|---|
| a) $G = 4$ | 22.27 | 0.900 | 0.1558 | 23.06 | 0.920 | 0.1113 | 17.3/31.0/48.1 |
| b) $G = 8$ | 23.80 | 0.914 | 0.1256 | 25.30 | 0.936 | 0.0849 | 25.1/42.8/61.1 |
| c) w/o $\mathcal{L}_{\textit{Reg}}$ | 26.16 | 0.930 | 0.1006 | 27.71 | 0.950 | 0.0668 | 22.8/45.6/71.2 |
| d) 3DGS | 26.04 | 0.929 | 0.1021 | 27.45 | 0.950 | 0.0666 | 23.3/45.0/69.2 |
| e) coarse | 25.06 | 0.922 | 0.1239 | 26.28 | 0.934 | 0.1017 | 32.7/52.2/70.7 |
| f) SH order-0 | 24.93 | 0.923 | 0.1097 | 26.71 | 0.945 | 0.0743 | 32.0/51.7/70.5 |
| g) full model | 25.30 | 0.925 | 0.1027 | 26.79 | 0.946 | 0.0683 | 32.7/52.2/70.7 |
| h) 2 views | 19.91 | 0.891 | 0.1806 | 19.11 | 0.901 | 0.1566 | 15.1/26.8/41.3 |
| i) 3 views | 23.05 | 0.913 | 0.1272 | 24.19 | 0.933 | 0.0880 | 25.9/43.2/61.4 |

Effect of local attention. We first evaluate the contribution of our group partition using different group numbers. Here, $G = 1$ is equivalent to the standard cross-attention layer; however, using such a group size leads to much higher compute time for the same number of iterations, i.e., 22 days on 4 A100s for 30 epochs. Therefore, our ablation starts with 4 groups for acceptable training time. As shown in ablations (a), (b), and (g) in Table 3 and Figure 6, the image synthesis and geometry quality are consistently improved with a larger group number, thanks to the local attention mechanism.

Effect of regularization term. We further evaluate the regularization term introduced in Eq. 9 and Eq. 10. We observe a marked improvement in the average rendering score when disabling the regularization. Although this provides a stronger model capability for modeling details, it may cause floaters near the surfaces, as shown in (c) and (d) of Figure 6, which leads to inconsistent free-viewpoint video rendering (see Appendix video). In contrast, our approach is able to reconstruct hard surfaces.

Effect of renderer. We also compare 2D Gaussian splatting with 3D Gaussian splatting in our framework, as shown in (c) and (d). They achieve similar rendering quality and we choose 2DGS to facilitate surface regularization and mesh extraction. Furthermore, to evaluate the effectiveness of the coarse-fine decoding, we conduct an evaluation of the coarse outputs, shown in row (e). Our fine decoding is able to provide richer texture details.

Effect of input views. Our approach is highly efficient and compatible with different numbers of input views. In prior experiments, we utilize 4 views for both training and inference as our standard configuration. We evaluate our method with 2 to 4 testing views (as shown in rows (h), (i), and (g)) using the full model in row (g).

6 Conclusion

We have presented LaRa, a novel method for 360° bounded radiance field reconstruction from large-baseline inputs. Our central idea is to match image features and embedding volume through unified local and global attention layers. By integrating this with a coarse-fine decoding and splatting process, we achieve high efficiency for both training and inference. In future work, we plan to explore how to enlarge the per-GPU batch size and the volume resolution without increasing GPU usage. In addition, we hope to investigate how to extend the method to handle unbounded 360° scenes and to decompose the radiance field into its constituent physical material and lighting components.

Limitations and Discussions. LaRa is a remarkably efficient feed-forward model that achieves high-fidelity all-around novel-view synthesis and surface reconstruction from sparse large-baseline images. However, our approach struggles to recover high-frequency geometry and texture details, mainly due to the low volume resolution. Enhancing our approach with techniques such as gradient checkpointing or mixed-precision training can potentially increase the training batch size as well as the volume resolution. We have also noticed that our method can yield inconsistent rendering results when the geometry is incorrectly estimated or when reconstructing multi-view inconsistent inputs, as demonstrated in the comparison video. This occurs because our method utilizes second-order Spherical Harmonic appearance modeling. While such modeling can capture view-dependent effects, it also introduces a stronger ambiguity between geometry and appearance. We believe that combining our method with a physically-based rendering process can potentially address this issue. In addition, our work assumes posed inputs, but estimating precise camera poses for sparse views is a challenge in practice. Incorporating a pose estimation module [65] into the feed-forward setting is an orthogonal direction to our work.

7 Acknowledgements

We thank Bozidar Antic for pointing out a bug that resulted in an improvement of about 1dB. Special thanks to BinBin Huang and Zehao Yu for their helpful discussion and suggestions. We would like to thank Bi Sai, Jiahao Li, Zexiang Xu for providing us with the testing examples of Instant3D, and Jiaxiang Tang for helping us to construct a comparison with LGM.

References

  • [1]Anciukevičius, T., Manhardt, F., Tombari, F., Henderson, P.: Denoising diffusion via image-based rendering. In: ICLR (2024)
  • [2]Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In: ICCV (2021)
  • [3]Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
  • [4]Charatan, D., Li, S., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction (2024)
  • [5]Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: TensoRF: Tensorial radiance fields. In: ECCV (2022)
  • [6]Chen, A., Xu, Z., Wei, X., Tang, S., Su, H., Geiger, A.: Dictionary fields: Learning a neural basis decomposition. ACM Trans. Graph. (2023)
  • [7]Chen, A., Xu, Z., Wei, X., Tang, S., Su, H., Geiger, A.: Factor fields: A unified framework for neural fields and beyond. arXiv.org (2023)
  • [8]Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., Su, H.: MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In: ICCV (2021)
  • [9]Chen, R., Han, S., Xu, J., Su, H.: Point-based multi-view stereo network. In: ICCV (2019)
  • [10]Chen, Y., Xu, H., Wu, Q., Zheng, C., Cham, T.J., Cai, J.: Explicit correspondence matching for generalizable neural radiance fields. arXiv.org (2023)
  • [11]Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv.org (2024)
  • [12]Cheng, S., Xu, Z., Zhu, S., Li, Z., Li, L.E., Ramamoorthi, R., Su, H.: Deep stereo using adaptive thin volume representation with uncertainty awareness. In: CVPR (2020)
  • [13]Chibane, J., Bansal, A., Lazova, V., Pons-Moll, G.: Stereo radiance fields (SRF): Learning view synthesis for sparse views of novel scenes. In: CVPR (2021)
  • [14]De Bonet, J.S., Viola, P.: Poxels: Probabilistic voxelized volume reconstruction. In: ICCV (1999)
  • [15]Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3D objects. In: CVPR (2023)
  • [16]Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised NeRF: Fewer views and faster training for free. In: CVPR (2022)
  • [17]Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16×16 words: Transformers for image recognition at scale. In: ICLR (2021)
  • [18]Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T.B., Vanhoucke, V.: Google scanned objects: A high-quality dataset of 3D scanned household items. In: ICRA (2022)
  • [19]Du, Y., Smith, C., Tewari, A., Sitzmann, V.: Learning to render novel views from wide-baseline stereo pairs. In: CVPR (2023)
  • [20]Falcon, W., The PyTorch Lightning team: PyTorch Lightning (2019), https://github.com/Lightning-AI/lightning
  • [21]Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: Radiance fields without neural networks. In: CVPR (2022)
  • [22]Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. PAMI (2010)
  • [23]Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M.: Multi-view stereo for community photo collections. In: ICCV (2007)
  • [24]Gu, X., Fan, Z., Zhu, S., Dai, Z., Tan, F., Tan, P.: Cascade cost volume for high-resolution multi-view stereo and stereo matching. In: CVPR (2020)
  • [25]Hernández Esteban, C., Schmitt, F.: Silhouette and stereo fusion for 3D object modeling. Computer Vision and Image Understanding (2004)
  • [26]Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: LRM: Large reconstruction model for single image to 3D. In: ICLR (2024)
  • [27]Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S.: 2d gaussian splatting for geometrically accurate radiance fields. ACM SIGGRAPH (2024)
  • [28]Im, S., Jeon, H.G., Lin, S., Kweon, I.S.: DPSNet: End-to-end deep plane sweep stereo. In: ICLR (2019)
  • [29]Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: General perception with iterative attention. In: Meila, M., Zhang, T. (eds.) ICML (2021)
  • [30]Johari, M.M., Lepoittevin, Y., Fleuret, F.: GeoNeRF: Generalizing NeRF with geometry priors. In: CVPR (2022)
  • [31]Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: CVPR (2019)
  • [32]Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. on Graphics (2023)
  • [33]Kolmogorov, V., Zabih, R.: Multi-camera scene reconstruction via graph cuts. In: ECCV (2002)
  • [34]Kulhánek, J., Derner, E., Sattler, T., Babuška, R.: ViewFormer: NeRF-free neural rendering from few images using transformers. In: ECCV (2022)
  • [35]Kutulakos, K.N., Seitz, S.M.: A theory of shape by space carving. International journal of computer vision (2000)
  • [36]Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., Hong, Y., Sunkavalli, K., Shakhnarovich, G., Bi, S.: Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. In: ICLR (2024)
  • [37]Lin, H., Peng, S., Xu, Z., Yan, Y., Shuai, Q., Bao, H., Zhou, X.: Efficient neural radiance fields for interactive free-viewpoint video. In: SIGGRAPH Asia (2022)
  • [38]Liu, R., Wu, R., Hoorick, B.V., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3D object. In: ICCV (2023)
  • [39]Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., et al.: Wonder3D: Single image to 3D using cross-domain diffusion. arXiv.org (2023)
  • [40]Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  • [41]Luo, K., Guan, T., Ju, L., Huang, H., Luo, Y.: P-MVSNet: Learning patch-wise matching confidence aggregation for multi-view stereo. In: ICCV (2019)
  • [42]Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
  • [43]Miyato, T., Jaeger, B., Welling, M., Geiger, A.: GTA: A geometry-aware attention mechanism for multi-view transformers. In: ICLR (2024)
  • [44]Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. on Graphics (2022)
  • [45]Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., Yang, Y.: HoloGAN: Unsupervised learning of 3D representations from natural images. In: ICCV (2019)
  • [46]Niemeyer, M., Barron, J., Mildenhall, B., Sajjadi, M.S.M., Geiger, A., Radwan, N.: RegNeRF: Regularizing neural radiance fields for view synthesis from sparse inputs. In: CVPR (2022)
  • [47]Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In: CVPR (2020)
  • [48]Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)
  • [49]Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In: ICCV (2021)
  • [50]Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV (2016)
  • [51]Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)
  • [52]Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR (2006)
  • [53]Shi, R., Wei, X., Wang, C., Su, H.: ZeroRF: Fast sparse view 360° reconstruction with zero pretraining (2023), arXiv:2312.09249
  • [54]Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., Yang, X.: MVDream: Multi-view diffusion for 3D generation. In: ICLR (2024)
  • [55]Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3D. ACM Trans. on Graphics (2006)
  • [56]Somraj, N., Karanayil, A., Soundararajan, R.: SimpleNeRF: Regularizing sparse input neural radiance fields with simpler solutions. In: SIGGRAPH Asia (2023)
  • [57]Sun, C., Sun, M., Chen, H.: Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In: CVPR (2022)
  • [58]Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-view 3d reconstruction. CVPR (2024)
  • [59]Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: LGM: Large multi-view Gaussian model for high-resolution 3D content creation. arXiv.org (2024)
  • [60]Truong, P., Rakotosaona, M.J., Manhardt, F., Tombari, F.: SPARF: Neural radiance fields from sparse and noisy poses. In: CVPR (2023)
  • [61]Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
  • [62]Venkat, N., Agarwal, M., Singh, M., Tulsiani, S.: Geometry-biased transformers for novel view synthesis. arXiv.org (2023)
  • [63]Wang, G., Chen, Z., Loy, C.C., Liu, Z.: SparseNeRF: Distilling depth ranking for few-shot novel view synthesis. In: ICCV (2023)
  • [64]Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., Wang, W.: NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In: NeurIPS (2021)
  • [65]Wang, P., Tan, H., Bi, S., Xu, Y., Luan, F., Sunkavalli, K., Wang, W., Xu, Z., Zhang, K.: PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction. In: ICLR (2024)
  • [66]Wang, Q., Wang, Z., Genova, K., Srinivasan, P., Zhou, H., Barron, J.T., Martin-Brualla, R., Snavely, N., Funkhouser, T.: IBRNet: Learning multi-view image-based rendering. In: CVPR (2021)
  • [67]Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. CVPR (2024)
  • [68]Wu, R., Mildenhall, B., Henzler, P., Park, K., Gao, R., Watson, D., Srinivasan, P.P., Verbin, D., Barron, J.T., Poole, B., Holynski, A.: ReconFusion: 3D reconstruction with diffusion priors (2023), arXiv:2312.02981
  • [69]Xu, C., Dong, Y., Zuo, Q., Zhang, J., Ye, X., Geng, W., Zhang, Y., Gu, X., Qiu, L., Zhao, Z., Qing, R., Jiayi, J., Dong, Z., Bo, L.: G-buffer Objaverse: High-quality rendering dataset of Objaverse, https://aigc3d.github.io/gobjaverse/
  • [70]Xu, H., Chen, A., Chen, Y., Sakaridis, C., Zhang, Y., Pollefeys, M., Geiger, A., Yu, F.: MuRF: Multi-baseline radiance fields. In: CVPR (2024)
  • [71]Xu, Y., Tan, H., Luan, F., Bi, S., Wang, P., Li, J., Shi, Z., Sunkavalli, K., Wetzstein, G., Xu, Z., Zhang, K.: DMV3D: Denoising multi-view diffusion using 3D large reconstruction model. In: ICLR (2024)
  • [72]Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: MVSNet: Depth inference for unstructured multi-view stereo. In: ECCV (2018)
  • [73]Yao, Y., Luo, Z., Li, S., Shen, T., Fang, T., Quan, L.: Recurrent MVSNet for high-resolution multi-view stereo depth inference. In: CVPR (2019)
  • [74]Yariv, L., Kasten, Y., Moran, D., Galun, M., Atzmon, M., Ronen, B., Lipman, Y.: Multiview neural surface reconstruction by disentangling geometry and appearance. In: NIPS (2020)
  • [75]Yinghao, X., Zifan, S., Wang, Y., Hansheng, C., Ceyuan, Y., Sida, P., Yujun, S., Gordon, W.: Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation (2024)
  • [76]Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: Neural radiance fields from one or few images. In: CVPR (2021)
  • [77]Yu, Z., Chen, A., Huang, B., Sattler, T., Geiger, A.: Mip-splatting: Alias-free 3d gaussian splatting. CVPR (2024)