¹ State Key Laboratory of CAD&CG, Zhejiang University    ² City University of Hong Kong
³ The Chinese University of Hong Kong, Shenzhen    ⁴ Point Spread
Immersive user experiences in live VR/AR performances require fast and accurate free-view rendering of the performers. Existing methods are mainly based on Pixel-aligned Implicit Functions (PIFu) or Neural Radiance Fields (NeRF). However, PIFu-based methods usually fail to produce photorealistic view-dependent textures, while NeRF-based methods typically lack local geometric accuracy and are computationally heavy (e.g., they require dense sampling of 3D points, additional fine-tuning, or pose estimation). In this work, we propose a novel generalizable method, named SAILOR, to create high-quality human free-view videos from very sparse RGBD live streams. To produce view-dependent textures while preserving locally accurate geometry, we integrate PIFu and NeRF so that they work synergistically: the PIFu is conditioned on depth, and view-dependent textures are then rendered through NeRF. Specifically, we propose a novel network, named SRONet, for this hybrid representation. SRONet can handle unseen performers without fine-tuning. In addition, a neural blending-based ray interpolation approach, a tree-based voxel-denoising scheme, and a parallel computing pipeline are incorporated to reconstruct and render live free-view videos at 10 fps on average. To evaluate the rendering performance, we construct a real-captured RGBD benchmark of 40 performers. Experimental results show that SAILOR outperforms existing human reconstruction and performance capture methods.
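As a rough illustration of the hybrid representation, the PyTorch-style sketch below shows how a single network might couple an occupancy head (geometry, conditioned on depth) with a radiance head (view-dependent color) over shared pixel-aligned RGBD features. All class, layer, and argument names here are illustrative placeholders under our own assumptions, not the released SRONet implementation.

```python
# A minimal sketch (assumed, not the released SRONet code): one network that
# predicts occupancy from depth-conditioned pixel-aligned features and a
# view-dependent color from the same features.
import torch
import torch.nn as nn

class HybridOccupancyRadiance(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Occupancy head: features + signed point-to-depth distance -> occupancy in [0, 1].
        self.occupancy_head = nn.Sequential(
            nn.Linear(feat_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())
        # Radiance head: the same features + viewing direction -> RGB in [0, 1].
        self.radiance_head = nn.Sequential(
            nn.Linear(feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid())

    def forward(self, feats, psdf, view_dir):
        # feats:    (N, feat_dim) pixel-aligned RGBD features per sampled 3D point
        # psdf:     (N, 1) signed distance of the point to the observed depth (depth conditioning)
        # view_dir: (N, 3) unit viewing direction of the target ray
        occ = self.occupancy_head(torch.cat([feats, psdf], dim=-1))
        rgb = self.radiance_head(torch.cat([feats, view_dir], dim=-1))
        return occ, rgb
```

Colors along each ray can then be composited with the occupancy values acting as opacity, which is what ties the view-dependent appearance to locally accurate geometry.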
Fig. 1. Overview of SAILOR. Given RGBD streams captured by 4 Azure Kinect sensors as input, (1) a Depth Denoising Module first removes noise from and fills holes in the raw depths, conditioned on the RGB images. (2) With the denoised depths, a Two-Layer Tree Structure is constructed to store a discretized representation of the global geometry. (3) Efficient Ray-Voxel Intersection and Point Sampling are performed for rendering. (4) A novel network, SRONet, is proposed to synergize the radiance and occupancy fields for 3D reconstruction and free-view rendering. (5) The outputs of SRONet are then upsampled via Ray Upsampling and Neural Blending to produce the final results at 1K resolution.
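To make step (3) more concrete, here is a small, hedged sketch of a slab-test ray-voxel intersection followed by sampling only inside the hit interval. The function names and the assumption of axis-aligned leaf voxels are ours, not taken from the paper's implementation.

```python
# Illustrative sketch (assumed): intersect rays with an axis-aligned voxel, then
# place the per-ray samples only inside the hit interval instead of along the
# whole ray, which is what makes sparse-voxel sampling cheap.
import torch

def ray_aabb(origins, dirs, box_min, box_max, eps=1e-8):
    """Entry/exit distances of rays (N, 3) against one axis-aligned box."""
    inv_d = 1.0 / (dirs + eps)
    t0 = (box_min - origins) * inv_d
    t1 = (box_max - origins) * inv_d
    t_near = torch.minimum(t0, t1).amax(dim=-1)   # latest entry over the 3 slabs
    t_far = torch.maximum(t0, t1).amin(dim=-1)    # earliest exit over the 3 slabs
    return t_near, t_far

def sample_in_interval(origins, dirs, t_near, t_far, n_samples=8):
    """Uniform samples restricted to the [t_near, t_far] hit interval of each ray."""
    hit = t_far > torch.clamp(t_near, min=0.0)
    steps = torch.linspace(0.0, 1.0, n_samples, device=origins.device)
    t = t_near[..., None] + (t_far - t_near)[..., None] * steps
    points = origins[..., None, :] + dirs[..., None, :] * t[..., None]
    return points, t, hit
```

Presumably the two-layer tree supplies the candidate occupied voxels per ray, and tests like the one above are only run against those candidates.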
Fig. 2. Depth Denoising Module: visualization of our depth-denoising results on real-captured RGBD images.
Fig. 3. Static 3D human reconstruction and free-view rendering results using four RGBD images.
Fig. 4. Dynamic human free-view rendering results and bullet-time effects from four-view RGBD videos.
Fig. 5. Visualization of rendering comparisons on our real-captured dataset (rows 1-5) and the THuman2.0 dataset (row 6).
Results of interactive rendering are obtained on a PC with two Nvidia RTX 3090 GPUs, an Intel i9-13900K CPU, and an MSI Z790 Godlike motherboard.
Results of the live demo are obtained using four (for full-body) or three (for portrait) Azure Kinect (V4) cameras and the PC configured as above.
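As a loose sketch of how two GPUs could be used for the live pipeline, the snippet below stages geometry inference and rendering on separate devices so consecutive frames can overlap. This is an assumption about the parallel pipeline, not the released code; reconstruct_fn and render_fn are hypothetical callables standing in for the SRONet pass and the upsampling/blending pass.

```python
# Assumed two-GPU staging (not the released pipeline): frame t is reconstructed
# on cuda:0 while the previous frame's geometry is rendered on cuda:1. PyTorch
# launches CUDA work asynchronously, so the two devices can overlap in practice.
import torch

def live_loop(frames, reconstruct_fn, render_fn):
    prev_geometry = None
    for rgbd in frames:  # rgbd: CPU tensor produced by the capture thread
        geometry = reconstruct_fn(rgbd.to('cuda:0', non_blocking=True))
        if prev_geometry is not None:
            yield render_fn(prev_geometry.to('cuda:1', non_blocking=True))
        prev_geometry = geometry
    if prev_geometry is not None:
        yield render_fn(prev_geometry.to('cuda:1', non_blocking=True))
```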
@article{dong2023sailor,
  author    = {Zheng Dong and Ke Xu and Yaoan Gao and Qilin Sun and Hujun Bao and Weiwei Xu and Rynson W.H. Lau},
  title     = {SAILOR: Synergizing Radiance and Occupancy Fields for Live Human Performance Capture},
  journal   = {ACM Transactions on Graphics (TOG)},
  year      = {2023},
  volume    = {42},
  number    = {6},
  doi       = {10.1145/3618370},
  publisher = {ACM}
}