Learning Unified Representation of 3D Gaussian Splatting
Abstract
A well-designed vectorized representation is crucial for learning systems built natively on 3D Gaussian Splatting (3DGS). While 3DGS enables efficient and explicit 3D reconstruction, its parameter-based representation is hard to consume directly as a learned feature, especially for neural-network-based models. Directly feeding raw Gaussian parameters into learning frameworks ignores the non-unique and heterogeneous nature of the Gaussian parameterization, yielding highly data-dependent models. This challenge motivates us to seek a more principled way to represent 3DGS in neural networks, one that preserves the underlying color and geometric structure while enforcing a unique mapping and channel homogeneity. In this paper, we propose an embedding representation of 3DGS based on continuous submanifold fields that encapsulate the intrinsic information of Gaussian primitives, thereby benefiting the learning of 3DGS.
Motivation
3D Gaussian Splatting (3DGS) enables high-fidelity, real-time view synthesis by parameterizing each primitive with \( \boldsymbol{\theta} = \{\boldsymbol{\mu}, \mathbf{q}, \mathbf{s}, \mathbf{c}, o\}\), where \(\boldsymbol{\mu}\) is the position, \(\mathbf{q}\) the rotation quaternion, \(\mathbf{s}\) the scale, \(\mathbf{c}\) the spherical harmonics (SH) coefficients, and \(o\) the opacity. However, this parametric space is ill-suited for learning:
- Non-uniqueness: distinct parameter sets (e.g., \(\mathbf{q}\) vs. \(-\mathbf{q}\)) produce identical renderings; see the sketch after this list.
- Numerical heterogeneity: parameters span different ranges and reside on distinct manifolds (\(\mathbb{R}^3\), \(SO(3)\), SH space).
- Instability: autoencoders trained on parameters collapse under domain shifts and noise.
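The non-uniqueness in the first item is easy to verify numerically. The following minimal NumPy sketch (our own illustration, not code from the paper) builds the 3DGS covariance \(\Sigma = R S S^\top R^\top\) from a quaternion and a scale vector and confirms that \(\mathbf{q}\) and \(-\mathbf{q}\) yield the same \(\Sigma\), and hence the same rendered Gaussian.

```python
# Illustrative sketch: a quaternion q and its negation -q define the same
# rotation, hence the same covariance Sigma = R diag(s)^2 R^T in 3DGS.
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(q, s):
    """Sigma = R S S^T R^T with S = diag(s), as in 3DGS."""
    R, S = quat_to_rotmat(q), np.diag(s)
    return R @ S @ S.T @ R.T

q = np.array([0.7, 0.1, -0.5, 0.3])   # arbitrary quaternion
s = np.array([0.2, 0.05, 0.4])        # anisotropic scales

# Identical covariances from distinct parameter vectors.
print(np.allclose(covariance(q, s), covariance(-q, s)))  # True
```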

Submanifold Field Representation of 3DGS
Each Gaussian primitive is mapped to a submanifold field defined on its iso-probability ellipsoid: \( \mathcal{M} = \{\mathbf{x} \in \mathbb{R}^3 \mid (\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = r^2 \}\), with color field \(F(\mathbf{x}) = \sigma(o)\cdot \text{Color}(\mathbf{d}_\mathbf{x})\), where \(\mathbf{d}_\mathbf{x} = (\mathbf{x}-\boldsymbol{\mu})/\|\mathbf{x}-\boldsymbol{\mu}\|\). Unlike the raw parameterization, this mapping is provably unique (rendering-equivalent parameter sets yield the same field) and injective.
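A minimal sketch of this discretization is given below, assuming our own naming conventions (`sample_submanifold_field`, `color_fn`); the paper does not publish this exact routine. It samples directions on the unit sphere, maps them onto the iso-probability ellipsoid via a Cholesky factor of \(\Sigma\), and evaluates the color field at each sample.

```python
# Hedged sketch of discretizing one Gaussian into a colored point cloud on
# its iso-probability ellipsoid; names are ours, not the paper's.
import numpy as np

def sample_submanifold_field(mu, Sigma, opacity, color_fn, r=1.0, n=1024):
    """Sample x on {(x-mu)^T Sigma^{-1} (x-mu) = r^2} and evaluate
    F(x) = sigmoid(opacity) * color_fn(d_x)."""
    # Random directions on the unit sphere.
    u = np.random.normal(size=(n, 3))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    # Map the sphere onto the ellipsoid: with Sigma = L L^T and
    # x = mu + r * L u, we get (x-mu)^T Sigma^{-1} (x-mu) = r^2.
    L = np.linalg.cholesky(Sigma)
    x = mu + r * (u @ L.T)
    # Direction from the Gaussian center to each surface point.
    d = (x - mu) / np.linalg.norm(x - mu, axis=1, keepdims=True)
    sigma_o = 1.0 / (1.0 + np.exp(-opacity))   # sigmoid(opacity)
    F = sigma_o * color_fn(d)                  # per-point RGB field values
    return x, F

# Toy usage with a direction-independent color (a degree-0 SH would look like this).
mu = np.zeros(3)
Sigma = np.diag([0.04, 0.01, 0.09])
pts, colors = sample_submanifold_field(
    mu, Sigma, opacity=2.0,
    color_fn=lambda d: np.tile([0.8, 0.2, 0.1], (len(d), 1)))
```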
We discretize \((\mathcal{M},F)\) as a colored point cloud and encode it using a Submanifold Field VAE (SF-VAE):
- Encoder: PointNet maps sampled points to a 32-D latent code.
- Decoder: reconstructs the field and recovers Gaussian parameters via PCA + SH fitting.
- Loss: a Wasserstein-2-based Manifold Distance (M-Dist) that jointly aligns spatial and color similarity: \( d^2((\mathbf{x},c_x),(\mathbf{y},c_y)) = \|\mathbf{x}-\mathbf{y}\|^2 + \lambda \|c_x - c_y\|^2 \); see the sketch after this list.
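To make the M-Dist concrete, the sketch below computes an exact Wasserstein-2 distance between two equal-size, uniformly weighted colored point clouds under the joint spatial/color ground cost above. The function name `m_dist`, the weight `lam`, and the assignment-based solver are our assumptions; the paper's implementation may use a different approximation.

```python
# Hedged sketch of an M-Dist-style loss: for equal-size point sets with
# uniform weights, Wasserstein-2 reduces to an optimal assignment over the
# pairwise ground cost d^2 = ||x - y||^2 + lam * ||c_x - c_y||^2.
import numpy as np
from scipy.optimize import linear_sum_assignment

def m_dist(x, cx, y, cy, lam=0.1):
    """W2^2 between colored point clouds {(x_i, cx_i)} and {(y_j, cy_j)}."""
    # Pairwise squared ground costs: spatial term + weighted color term.
    cost = (np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** 2
            + lam * np.linalg.norm(cx[:, None, :] - cy[None, :, :], axis=-1) ** 2)
    rows, cols = linear_sum_assignment(cost)  # optimal plan is a permutation
    return cost[rows, cols].mean()

# Toy usage: distance between a colored point cloud and a jittered copy.
rng = np.random.default_rng(0)
x, cx = rng.normal(size=(256, 3)), rng.uniform(size=(256, 3))
y, cy = x + 0.01 * rng.normal(size=(256, 3)), cx
print(m_dist(x, cx, y, cy))
```

Because the optimal assignment is a hard permutation, a smoothed transport (e.g., an entropic/Sinkhorn approximation) is the usual choice when the loss must be differentiated end-to-end during VAE training; whether the paper does this is not stated here.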

Evaluation
We evaluate on ShapeSplat and Mip-NeRF 360, comparing submanifold fields with parametric baselines under matched capacity. Key results:
- Higher fidelity: SF-VAE reaches a reconstruction PSNR of 63.4 dB, versus 44.7/37.5 dB for the parametric baselines on ShapeSplat.
- Generalization: superior cross-domain transfer (object-to-scene PSNR of 19.2 dB vs. 9.8 dB for the parametric baseline).
- Robustness: embeddings resilient to noise; interpolations smooth and semantically consistent.
- Better latent structure: unsupervised clustering on SF embeddings shows clearer semantic separation.


