Abstract
Reconstruction of real-world scenes from a set of multiple images is a topic in Computer Vision and 3D Computer Graphics with many interesting applications. Space Carving is a powerful algorithm for shape reconstruction from arbitrary viewpoints, but it is computationally expensive and hence cannot be used in applications such as 3D video, computer-supported cooperative work (CSCW), or interactive 3D model creation. Attempts have been made to achieve real-time frame rates using PC cluster systems; while these provide sufficient performance, they are also expensive and less flexible. Approaches that use GPU hardware acceleration on single workstations achieve interactive frame rates for novel-view synthesis, but do not provide an explicit volumetric representation of the whole scene. This thesis presents a GPU hardware-accelerated framework for obtaining the volumetric photo hull of a dynamic 3D scene as seen from multiple calibrated cameras. High performance is achieved by first employing a shape-from-silhouette technique to obtain a tight initial volume for Space Carving, and several further speed-up techniques are presented to increase efficiency. Since the entire processing is done on a single PC, the framework can be applied to mobile setups, enabling a wide range of further applications. The approach is implemented on current hardware using programmable vertex and fragment processors and compared to highly optimized CPU implementations. It is shown that the new approach can outperform the latter by more than one order of magnitude.
Contents
1 Introduction
1.1 Application
1.2 Classification
1.3 Performance
1.4 Contribution
1.5 Overview
2 Related Work
2.1 Shape from Silhouette
2.1.1 Image Segmentation
2.1.2 Foundations
2.1.3 Performance of View-Independent Reconstruction
2.1.3.1 CPU
2.1.3.2 GPU Acceleration
2.1.4 Performance of View-Dependent Reconstruction
2.1.4.1 CPU
2.1.4.2 GPU Acceleration
2.1.5 Conclusion
2.2 Shape from Photo-Consistency
2.2.1 Foundations
2.2.2 Performance of View-Independent Reconstruction
2.2.2.1 CPU
2.2.2.2 GPU Acceleration
2.2.3 Performance of View-Dependent Reconstruction
2.2.3.1 CPU
2.2.3.2 GPU Acceleration
2.2.4 Conclusion
3 Fundamentals
3.1 Camera Geometry
3.1.1 Pinhole Camera Model
3.1.2 Camera Parameters
3.1.2.1 Intrinsic Parameters
3.1.2.2 Extrinsic Parameters
3.1.2.3 Radial Lens Distortion
3.1.3 Camera Calibration
3.2 Light and Color
3.2.1 Light in Space
3.2.1.1 Radiance
3.2.2 Light at a Surface
3.2.2.1 Irradiance
3.2.2.2 Radiosity
3.2.2.3 Lambertian and Specular Surfaces
3.2.3 Occlusion and Shadows
3.2.4 Light at a Camera
3.2.5 Color
3.2.6 Color Representation
3.2.6.1 Linear Color Spaces
3.2.6.2 Non-linear Color Spaces
3.2.6.3 Color Metric
3.2.7 CCD Camera Color Imaging
3.3 3D Reconstruction from Multiple Views
3.3.1 Visual Hull Reconstruction by Shape from Silhouette
3.3.1.1 Shape from Silhouette
3.3.1.2 Discussion
3.3.1.3 The Visual Hull
3.3.1.4 Silhouette-Equivalency
3.3.1.5 Number of Viewpoints
3.3.1.6 Conclusion
3.3.2 Photo Hull Reconstruction by Shape from Photo-Consistency
3.3.2.1 Shape from Photo-Consistency
3.3.2.2 Discussion
3.3.2.3 Photo-Consistency
3.3.2.4 The Photo Hull
3.3.2.5 The Space Carving Algorithm
3.3.2.6 Voxel Visibility
3.3.2.7 Conclusion
4 Basic Algorithm
4.1 Data
4.1.1 Camera Parameters
4.1.2 Image Data
4.2 Reconstruction
4.2.1 3D Data Representation
4.2.2 Volumetric Bounding Box
4.2.3 Maximal Volume Intersection
4.2.4 Visual Hull Approximation
4.2.5 Photo-Consistent Surface
4.2.5.1 Active Source Camera Test
4.2.5.2 Photo-Consistency Test
5 Advanced Algorithm
5.1 Overview
5.1.1 Deployment
5.1.2 Process Flow
5.2 Texture Processing
5.2.1 Lookup Table for Projection Coordinates
5.2.2 Mapping Image Data into Textures
5.2.3 Texture Upload and Processing Performance
5.2.4 GPU Image Processing
5.3 Destination Cameras
5.3.1 Discussion
5.3.1.1 Ray Casting vs. Multi Plane Sweeping
5.3.1.2 Virtual vs. Natural Views
5.3.2 Interleaved Depth Sampling
5.3.3 Active Destination Cameras
5.3.3.1 Source Camera Viewing Ray
5.3.3.2 Intersection of Volume and Source Camera Viewing Ray
5.3.3.3 Activity Decision
5.4 Reconstruction
5.4.1 Vertex Data
5.4.2 Vertex Shader Visual Hull Approximation
5.4.2.1 Decreasing the Sampling Error for Interleaved Sampling
5.4.2.2 Early Ray Carving
5.4.3 Fragment Shader Photo-Consistent Surface
5.4.3.1 Filling Holes
5.4.3.2 Modified Active Source Camera Decision
5.4.4 Fragment Shader Color Blending
5.4.5 Fragment Shader Render to Texture
5.5 Postprocessing
5.5.1 Extracting Texture Data
5.5.2 Filling Interior Volume Data
5.5.2.1 Ambiguities
5.5.2.2 Performance
6 Experiments
6.1 System Setup
6.2 Implementation
6.3 Datasets
6.4 Performance
6.4.1 Abstract Data Performance Experiments
6.4.1.1 CPU-GPU Texture Upload
6.4.1.2 Interleaved Sampling
6.4.1.3 Early Ray Carving
6.4.1.4 Fragment Shader CIELab-RGB Conversion
6.4.1.5 Porting all Load to the Fragment Processor
6.4.1.6 GPU-CPU Texture Read-back
6.4.1.7 FBO Texture Size
6.4.1.8 Impact of CPU-GPU Texture Upload on overall Performance
6.4.1.9 Number of Source Cameras
6.4.1.10 Number of Destination Cameras
6.4.2 Concrete Data Performance Experiments
6.4.2.1 Algorithmic Features
6.4.2.2 Destination Cameras
6.4.2.3 Volumetric Resolution
6.4.2.4 Volumetric Bounding Box
6.4.2.5 PCS Increments
6.4.3 Conclusion
6.4.3.1 Algorithmic Features
6.4.3.2 Parameters
6.4.3.3 GPU/CPU Comparison
6.5 Quality
6.5.1 Concrete Data Quality Experiments
6.5.1.1 Volumetric Resolution
6.5.1.2 Volumetric Bounding Box
6.5.1.3 PCS Increments
6.5.2 Visual Experiments
6.5.2.1 Image Segmentation
6.5.2.2 Interleaved Sampling and MVI
6.5.2.3 Camera Viewing Cone Intersection
6.5.2.4 Reconstruction of VHA and PCS
6.5.2.5 Volumetric Resolution
6.5.2.6 PCS Increments
6.5.2.7 Geometrical Score for Active Source Camera Computation
6.5.2.8 Range of Color Distances for Active Source Camera Computation
6.5.2.9 Labeling of Interior Space
6.5.3 Conclusion
6.5.3.1 Image Segmentation
6.5.3.2 BB, MVI and Viewing Cone Intersection
6.5.3.3 VHA and PCS
6.5.3.4 PCS Parameters
6.5.3.5 Labeling of Interior Space
7 Discussion and Enhancements
7.1 Summary
7.2 Limitations
7.3 Future Work
7.3.1 Online System
7.3.2 Performance
7.3.3 Quality
7.4 Annotation
Research Objectives and Focus Areas
The research focuses on developing a high-performance framework for real-time 3D reconstruction of dynamic scenes using graphics hardware on a single PC. The primary objective is to create an explicit volumetric representation of a scene captured from multiple calibrated cameras, overcoming the computational bottlenecks that typically restrict such reconstruction to cluster systems.
- GPU-accelerated Space Carving techniques for real-time performance.
- Effective use of the Visual Hull as an initial volume for the Photo Hull (a minimal sketch of this two-stage pipeline follows this list).
- Innovative heuristic methods for Active Source Camera selection to implicitly handle visibility.
- Implementation strategies for interleaved depth sampling and efficient data mapping.
- Integration of a comprehensive post-processing pipeline for 3D data refinement.
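As referenced in the list above, a rough illustration of the two-stage design is the following CPU-side C++ sketch: the volume is initialized by a silhouette test and then iteratively carved wherever the photo-consistency test fails. The callbacks `insideAllSilhouettes` and `isPhotoConsistent` are hypothetical stand-ins for what the thesis maps onto vertex and fragment shader passes; this is a minimal sketch, not the thesis implementation.

```cpp
// Minimal two-stage sketch: visual hull initialization, then Space Carving.
// Helper callbacks are hypothetical; the thesis runs these tests on the GPU
// and restricts carving to surface voxels with heuristic visibility handling.
#include <functional>
#include <vector>

struct Voxel { float x, y, z; bool occupied = false; };
using VoxelTest = std::function<bool(const Voxel&)>;

void reconstruct(std::vector<Voxel>& grid,
                 const VoxelTest& insideAllSilhouettes,   // shape from silhouette
                 const VoxelTest& isPhotoConsistent) {    // shape from photo-consistency
    // Stage 1: a tight initial volume from the silhouettes.
    for (auto& v : grid) v.occupied = insideAllSilhouettes(v);
    // Stage 2: carve inconsistent voxels until the volume stabilizes.
    for (bool carved = true; carved; ) {
        carved = false;
        for (auto& v : grid)
            if (v.occupied && !isPhotoConsistent(v)) {
                v.occupied = false;
                carved = true;
            }
    }
}

int main() {
    std::vector<Voxel> grid{{0.f, 0.f, 0.f}, {1.f, 0.f, 0.f}};
    reconstruct(grid,
                [](const Voxel& v) { return v.x < 0.5f; },  // toy silhouette test
                [](const Voxel&)   { return true; });       // toy consistency test
}
```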
Excerpt from the Book
1.1 Application
The ability to reconstruct dynamic real-world scenes from multiple video streams enables a wide range of applications. 3D video extends common 2D video in that it is view-independent. Hence, the choice of the camera from which the scene is viewed is shifted from the time and place of recording to the time and place of consumption. 3D video can be used in the context of personal and social human activities such as entertainment (e.g. 3D games, 3D events [24], 3D TV [37], the 3D video recorder [70]) and education (e.g. 3D books, 3D-based training), or for the preservation of common knowledge (e.g. ethnic dancing performances [39], special skills).
3D video is obtained by analyzing properties such as scene geometry and color from multi-viewpoint video streams. This generally involves a high computational effort. Nevertheless, if the reconstruction can be computed in real time, a broad field of further applications is facilitated. In the field of CSCW, more realistic and immersive telepresence and conferencing systems are created by using 3D cues [44]. In this context, PRINCE et al. [52] propose a framework in which a user can interact with a dynamic model of a remotely captured collaborator, overlaid with the real world in a video see-through HMD. Instead of inserting a dynamic scene into a real environment (AR), it is also possible to embed it into a virtual world (MR). This enables seamless integration of real and virtual content with appropriate visual cues such as reflections, shading, shadowing, occlusion, and collision detection. Interactive systems for the reconstruction and insertion of actors into virtual worlds are proposed by HASENFRATZ et al. [21] and, more recently, POMI and SLUSALLEK [49], contributing to the field of virtual television studio techniques.
Dynamic 3D data may also be used as an intermediate result for further processing and information extraction. Scene analysis allows for object recognition or motion tracking. The latter is often applied as a non-intrusive approach to human motion capturing [43].
Summary of Chapters
1 Introduction: Provides an overview of the motivation for 3D reconstruction and outlines the scope, contributions, and structure of the thesis.
2 Related Work: Reviews existing literature on shape-from-silhouette and shape-from-photo-consistency techniques, specifically focusing on GPU-accelerated methods.
3 Fundamentals: Establishes the theoretical framework for camera geometry, radiometry, color representation, and the mathematical principles behind Visual Hull and Photo Hull reconstruction.
4 Basic Algorithm: Introduces the abstract, hardware-independent algorithms for data handling and space occupancy testing required for 3D reconstruction.
5 Advanced Algorithm: Details the GPU-mapped framework, utilizing multi-pass rendering and specialized heuristics to achieve real-time performance on a single workstation.
6 Experiments: Presents the practical system setup and evaluates the proposed method through abstract and concrete data experiments, comparing performance and quality.
7 Discussion and Enhancements: Summarizes the thesis findings, discusses limitations, and suggests future improvements, such as implementing an online system.
Keywords
3D Reconstruction, Real-time Rendering, Space Carving, GPU Acceleration, Shape from Silhouette, Photo-Consistency, Visual Hull, Photo Hull, Computer Vision, Volumetric Representation, 3D Video, Implicit Visibility, Graphics Hardware, Voxel Reconstruction, Multi-viewpoint
Frequently Asked Questions
What is the core focus of this research?
The work focuses on creating a real-time, hardware-accelerated framework for 3D scene reconstruction by performing Space Carving on a single PC using Graphics Processing Units (GPUs).
What are the primary technical contributions?
The main contributions include a GPU-based implementation of the Space Carving algorithm, a novel implicit visibility computation method, and a robust hybrid reconstruction pipeline that uses the Visual Hull as the initial volume from which the Photo Hull is carved.
What primary reconstruction method is utilized?
The framework employs shape-from-photo-consistency as the primary reconstruction technique, while utilizing shape-from-silhouette to efficiently generate an initial, conservative volume.
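The silhouette stage can be pictured as a conservative occupancy test: a voxel survives only if it projects to foreground in every segmented camera image. The following sketch assumes a pinhole model with a row-major 3x4 projection matrix and a binary foreground mask per camera; the struct layout and function names are illustrative, not the thesis API.

```cpp
// Conservative visual hull test: a 3D point must project onto the foreground
// silhouette in all views. Assumes row-major 3x4 matrices (x ~ P * X).
#include <array>
#include <vector>

struct Camera {
    std::array<float, 12> P;           // 3x4 projection matrix, row-major
    int width, height;
    std::vector<unsigned char> mask;   // 1 = foreground, 0 = background
};

bool projectsToForeground(const Camera& c, float X, float Y, float Z) {
    const auto& P = c.P;
    float u = P[0]*X + P[1]*Y + P[2]*Z  + P[3];
    float v = P[4]*X + P[5]*Y + P[6]*Z  + P[7];
    float w = P[8]*X + P[9]*Y + P[10]*Z + P[11];
    if (w <= 0.0f) return false;       // point is behind the camera
    int px = static_cast<int>(u / w), py = static_cast<int>(v / w);
    if (px < 0 || px >= c.width || py < 0 || py >= c.height) return false;
    return c.mask[py * c.width + px] != 0;
}

bool insideVisualHull(const std::vector<Camera>& cams,
                      float X, float Y, float Z) {
    for (const auto& c : cams)
        if (!projectsToForeground(c, X, Y, Z)) return false;
    return true;
}
```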
How is real-time performance achieved?
Performance is achieved by mapping the reconstruction algorithm to the GPU using multi-pass rendering, implementing efficient texture data processing, and avoiding the need for a PC cluster by using a single, high-performance workstation.
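One of the texture-related optimizations listed in the contents (Section 5.2.1) is a lookup table for projection coordinates: since the cameras and the voxel grid are static, per-voxel image coordinates can be precomputed once and reused every frame. The sketch below shows the idea for a unit-cube grid; the row-major 3x4 matrix and the assumption that all voxel centers lie in front of the camera are simplifications of my own.

```cpp
// Precompute projected (u, v) coordinates for every voxel center of an n^3
// grid spanning [0,1]^3, for one camera with row-major 3x4 matrix P.
#include <cstddef>
#include <vector>

struct Vec2 { float u, v; };

std::vector<Vec2> buildProjectionLUT(const float P[12], std::size_t n) {
    std::vector<Vec2> lut;
    lut.reserve(n * n * n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i) {
                float X = (i + 0.5f) / n, Y = (j + 0.5f) / n, Z = (k + 0.5f) / n;
                float u = P[0]*X + P[1]*Y + P[2]*Z  + P[3];
                float v = P[4]*X + P[5]*Y + P[6]*Z  + P[7];
                float w = P[8]*X + P[9]*Y + P[10]*Z + P[11];
                lut.push_back({u / w, v / w});  // assumes w > 0 inside the box
            }
    return lut;
}
```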
What is the main challenge addressed regarding visibility?
The work addresses the computational expense of explicit visibility handling by using an implicit visibility heuristic based on per-camera scores, which avoids the need for complex, memory-intensive ray-tracing structures.
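One plausible form of such a score-based selection is sketched below: each camera is scored by how well its viewing ray toward the voxel agrees with a reference direction, and only the k best-scoring cameras become active. The angular criterion and all names here are illustrative; the thesis combines a geometrical score with color-distance ranges (cf. Sections 6.5.2.7 and 6.5.2.8).

```cpp
// Score cameras by the angle between their ray toward the voxel and a
// reference direction; keep the k best. Purely illustrative heuristic.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Vec3 { float x, y, z; };

float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
Vec3 normalize(Vec3 v) {
    float len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

std::vector<std::size_t> selectActiveCameras(
        const std::vector<Vec3>& camCenters, Vec3 voxel, Vec3 refDir,
        std::size_t k) {
    std::vector<std::pair<float, std::size_t>> scored;  // (cosine score, index)
    Vec3 ref = normalize(refDir);
    for (std::size_t i = 0; i < camCenters.size(); ++i) {
        Vec3 ray = normalize({voxel.x - camCenters[i].x,
                              voxel.y - camCenters[i].y,
                              voxel.z - camCenters[i].z});
        scored.push_back({dot(ray, ref), i});
    }
    std::sort(scored.begin(), scored.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });
    std::vector<std::size_t> active;
    for (std::size_t i = 0; i < k && i < scored.size(); ++i)
        active.push_back(scored[i].second);
    return active;
}
```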
Which datasets were used to validate the approach?
The framework was validated using two datasets featuring a dancing person (Dancer1 and Dancer2) and a static Paper Houses dataset, which together cover different scene complexities and concavities.
How does this approach differ from previous GPU-accelerated work?
Unlike previous methods that either compute implicit view-dependent models or require massive PC clusters for explicit volume models, this approach achieves an explicit, view-independent volumetric reconstruction on a single GPU-accelerated server.
What is the significance of the "Active Source Camera" decision?
This heuristic is critical for real-time performance: it filters which source cameras contribute to the photo-consistency test for a given voxel, thereby implicitly handling visibility and specularities without heavy computational overhead.
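Once the active cameras are chosen, the consistency check itself reduces to a color-agreement test. The sketch below compares the RMS deviation of plain RGB samples against a threshold; this is a simplified stand-in, as the thesis evaluates color distances in a perceptual space such as CIELab (cf. Section 6.4.1.4).

```cpp
// A voxel is photo-consistent if the colors it projects to in the active
// cameras agree within a threshold. RGB RMS deviation is used for brevity.
#include <cmath>
#include <vector>

struct Color { float r, g, b; };

bool isPhotoConsistent(const std::vector<Color>& samples, float threshold) {
    if (samples.size() < 2) return true;   // a single sample cannot disagree
    Color mean{0.f, 0.f, 0.f};
    for (const auto& s : samples) { mean.r += s.r; mean.g += s.g; mean.b += s.b; }
    float n = static_cast<float>(samples.size());
    mean.r /= n; mean.g /= n; mean.b /= n;
    float var = 0.f;
    for (const auto& s : samples)
        var += (s.r - mean.r) * (s.r - mean.r)
             + (s.g - mean.g) * (s.g - mean.g)
             + (s.b - mean.b) * (s.b - mean.b);
    return std::sqrt(var / n) <= threshold;
}
```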
Quote paper
Christian Nitschke (Author), 2006, A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware, Munich, GRIN Verlag, https://www.grin.com/document/186283