A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware

Diploma Thesis, 2006

146 Pages, Grade: 1.0



1 Introduction
1.1 Application
1.2 Classification
1.3 Performance
1.4 Contribution
1.5 Overview

2 Related Work
2.1 Shape from Silhouette
2.1.1 Image Segmentation
2.1.2 Foundations
2.1.3 Performance of View-Independent Reconstruction CPU GPU Acceleration
2.1.4 Performance of View-Dependent Reconstruction CPU GPU Acceleration
2.1.5 Conclusion
2.2 Shape from Photo-Consistency
2.2.1 Foundations
2.2.2 Performance of View-Independent Reconstruction CPU GPU Acceleration
2.2.3 Performance of View-Dependent Reconstruction CPU GPU Acceleration
2.2.4 Conclusion

3 Fundamentals
3.1 Camera Geometry
3.1.1 Pinhole Camera Model
3.1.2 Camera Parameters Intrinsic Parameters Extrinsic Parameters Radial Lens Distortion
3.1.3 Camera Calibration
3.2 Light and Color
3.2.1 Light in Space Radiance
3.2.2 Light at a Surface Irradiance Radiosity Lambertian and Specular Surfaces
3.2.3 Occlusion and Shadows
3.2.4 Light at a Camera
3.2.5 Color
3.2.6 Color Representation Linear Color Spaces Non-linear Color Spaces Color Metric
3.2.7 CCD Camera Color Imaging
3.3 3D Reconstruction from Multiple Views
3.3.1 Visual Hull Reconstruction by Shape from Silhouette Shape from Silhouette Discussion The Visual Hull Silhouette-Equivalency Number of Viewpoints Conclusion
3.3.2 Photo Hull Reconstruction by Shape from Photo-Consistency Shape from Photo-Consistency Discussion Photo-Consistency The Photo Hull The Space Carving Algorithm Voxel Visibility Conclusion

4 Basic Algorithm
4.1 Data
4.1.1 Camera Parameters
4.1.2 Image Data
4.2 Reconstruction
4.2.1 3D Data Representation
4.2.2 Volumetric Bounding Box
4.2.3 Maximal Volume Intersection
4.2.4 Visual Hull Approximation
4.2.5 Photo-Consistent Surface Active Source Camera Test Photo Consistency Test

5 Advanced Algorithm
5.1 Overview
5.1.1 Deployment
5.1.2 Process Flow
5.2 Texture Processing
5.2.1 Lookup Table for Projection Coordinates
5.2.2 Mapping Image Data into Textures
5.2.3 Texture Upload and Processing Performance
5.2.4 GPU Image Processing
5.3 Destination Cameras
5.3.1 Discussion Ray Casting vs. Multi Plane Sweeping Virtual vs. Natural Views
5.3.2 Interleaved Depth Sampling
5.3.3 Active Destination Cameras Source Camera Viewing Ray Intersection of Volume and Source Camera Viewing Ray Activity Decision
5.4 Reconstruction
5.4.1 Vertex Data
5.4.2 Vertex Shader Visual Hull Approximation Decreasing the Sampling Error for Interleaved Sampling Early Ray Carving
5.4.3 Fragment Shader Photo-Consistent Surface Filling Holes Modified Active Source Camera Decision
5.4.4 Fragment Shader Color Blending
5.4.5 Fragment Shader Render to Texture
5.5 Postprocessing
5.5.1 Extracting Texture Data
5.5.2 Filling Interior Volume Data Ambiguities Performance

6 Experiments
6.1 System Setup
6.2 Implementation
6.3 Datasets
6.4 Performance
6.4.1 Abstract Data Performance Experiments CPU-GPU Texture Upload Interleaved Sampling Early Ray Carving Fragment Shader CIELab-RGB Conversion Porting all Load to the Fragment Processor GPU-CPU Texture Read-back FBO Texture Size Impact of CPU-GPU Texture Upload on overall Performance Number of Source Cameras Number of Destination Cameras
6.4.2 Concrete Data Performance Experiments Algorithmic Features Destination Cameras Volumetric Resolution Volumetric Bounding Box PCS Increments
6.4.3 Conclusion Algorithmic Features Parameters GPU/CPU Comparison
6.5 Quality
6.5.1 Concrete Data Quality Experiments Volumetric Resolution Volumetric Bounding Box PCS Increments
6.5.2 Visual Experiments Image Segmentation Interleaved Sampling and MVI Camera Viewing Cone Intersection Reconstruction of VHA and PCS Volumetric Resolution PCS Increments Geometrical Score for Active Source Camera Computation Range of Color Distances for Active Source Camera Computation ... Labeling of Interior Space
6.5.3 Conclusion Image Segmentation BB, MVI and Viewing Cone Intersection VHA and PCS PCS Parameters Labeling of Interior Space

7 Discussion and Enhancements
7.1 Summary
7.2 Limitations
7.3 Future Work
7.3.1 Online System
7.3.2 Performance
7.3.3 Quality
7.4 Annotation

A Abstract Data Performance Experiments

B Concrete Data Performance Experiments

C Concrete Data Quality Experiments

D Visual Experiments


I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgment has been made in the text.

Christian Nitschke, Weimar 12/06/2006


Reconstruction of real-world scenes from a set of multiple images is a topic in Computer Vision and 3D Computer Graphics with many interesting applications. A powerful algorithm for shape reconstruction from arbitrary viewpoints exists, called Space Carving. However, it is computationally expensive and hence cannot be used in applications such as 3D video, CSCW or interactive 3D model creation. Attempts have been made to achieve real-time framerates using PC cluster systems. While these provide enough performance, they are also expensive and less flexible. Approaches that use GPU hardware acceleration on single workstations achieve interactive framerates for novel-view synthesis, but do not provide an explicit volumetric representation of the whole scene. This thesis presents the development of a GPU hardware-accelerated framework for obtaining the volumetric photo hull of a dynamic 3D scene as seen from multiple calibrated cameras. High performance is achieved by employing a shape from silhouette technique in advance to obtain a tight initial volume for Space Carving. Several further speed-up techniques are presented to increase efficiency. Since the entire processing is done on a single PC, the framework can be applied to mobile setups, enabling a wide range of further applications. The approach is implemented using programmable vertex and fragment processors on current hardware and compared to highly optimized CPU implementations. It is shown that the new approach can outperform the latter by more than one order of magnitude.


First, I want to thank my supervisor, Oliver Bimber, for providing me with the opportunity to spend a one-year research program at Takemura Lab at Osaka University, Japan. This work would not exist in its current form without his great support. I also want to thank my second supervisor, Kiyoshi Kiyokawa, for coming to Germany just to attend my defense, for all the good discussions and comments when I got stuck in the daily details, for the wonderful time in Japan and for his efforts in extending my stay at the last minute. This is also a good place to thank both for reading the thesis in one week; I know a lot of pages came together. I would also like to thank Atsushi Nakazawa for directly supervising my work with interesting discussions and background knowledge.

My parents and my family always supported me and backed me on my way. I want to thank you for all the support and for never saying "no" to my ideas.

"The stone unhewn and cold becomes a living mould, The more the marble wastes the more the statue grows." Michelangelo Buonarroti

List of Figures

1.1 A Taxonomy of 3D Shape Acquisition Techniques

2.1 Image-based Ray Sampling of the Visual Hull (IBVH)

2.2 Multi Plane-Sweep Implementation of Space Carving

3.1 Perspective Projection in Pinhole Camera

3.2 Relation of Coordinate Systems for Image Formation

3.3 Point Correspondences for Linear Camera Calibration

3.4 Light at a Camera

3.5 Volume Intersection

3.6 Limitations of Volume Intersection

3.7 Silhouette-Equivalency

3.8 Photo-Consistency

3.9 Visibility and Non-Photo-consistency

4.1 Nested Volumes

5.1 Hardware Deployment of Reconstruction Framework

5.2 Process Flow in Reconstruction Framework

5.3 Destination Camera Setup

5.4 Interleaved Sampling in 2D

5.5 Active Destination Camera Test

5.6 Early Ray Carving

5.7 Ray-based Sampling

5.8 Geometrical Score for Active Source Camera Decision

5.9 FBO Rendertexture

5.10 Labeling Interior Volume

6.1 Multiple Viewpoint Video Capturing Setup

6.2 Concrete Hardware Deployment of Reconstruction Framework

1.1 Performance CPU-GPU Texture Upload

1.2 Performance Interleaved Sampling

1.3 Performance Early Ray Carving

1.4 Performance Raw CIELab-RGB Conversion

1.5 Performance Vertex/Fragment Shader Balancing

1.6 Performance GPU-CPU Texture Read-back

1.7 FBO Texture Size Performance Impact 1

1.8 FBO Texture Size Performance Impact 2

1.9 Impact of CPU-GPU Texture Upload on System Performance

1.10 Impact of Source Cameras on Rendering Performance

1.11 Impact of Destination Cameras on Rendering Performance

2.1 Performance of Algorithmic Features (Dancer1)

2.2 Performance of Algorithmic Features (Dancer2)

2.3 Performance of Algorithmic Features (Paper Houses)

2.4 Destination Camera Rendering Performance (Dancer1)

2.5 Destination Camera Rendering Performance (Dancer2)

2.6 Destination Camera Rendering Performance (Paper Houses)

2.7 Performance Impact of Volumetric Resolution (Dancer1)

2.8 Performance Impact of Volumetric Resolution (Dancer2)

2.9 Performance Impact of Volumetric Resolution (Paper Houses)

2.10 Performance Impact of Volumetric Bounding Box (Dancer1)

2.11 Performance Impact of Volumetric Bounding Box (Dancer2)

2.12 Performance Impact of Volumetric Bounding Box (Paper Houses)

2.13 Performance Impact of PCS Increments (Dancer1)

2.14 Performance Impact of PCS Increments (Dancer2)

2.15 Performance Impact of PCS Increments (Paper Houses)

3.1 Quality Impact of Volumetric Resolution (Dancer1)

3.2 Quality Impact of Volumetric Resolution (Dancer2)

3.3 Quality Impact of Volumetric Resolution (Paper Houses)

3.4 Quality Impact of Volumetric Bounding Box (Dancer1)

3.5 Quality Impact of Volumetric Bounding Box (Dancer2)

3.6 Quality Impact of Volumetric Bounding Box (Paper Houses)

3.7 Quality Impact of PCS Increments (Dancer1)

3.8 Quality Impact of PCS Increments (Dancer2)

3.9 Quality Impact of PCS Increments (Paper Houses)

4.1 Image Segmentation Experiment (Dancer2)

4.2 Image Segmentation Experiment (Dancer1)

4.3 Conventional/Interleaved Sampling Visualization (Dancer1) 1

4.4 Conventional/Interleaved Sampling Visualization (Dancer1) 2

4.5 Camera Viewing Cone Intersection Visualization (Dancer1) 1

4.6 Camera Viewing Cone Intersection Visualization (Dancer1) 2

4.7 PCS Reconstruction Experiment (Dancer1) 1

4.8 PCS Reconstruction Experiment (Dancer1) 2

4.9 PCS Reconstruction Experiment (Dancer1) 3

4.10 PCS Reconstruction Experiment (Dancer1) 4

4.11 PCS Reconstruction Experiment (Dancer2) 1

4.12 PCS Reconstruction Experiment (Dancer2) 2

4.13 PCS Increments Experiment (Paper Houses) 1

4.14 PCS Increments Experiment (Paper Houses) 2

4.15 PCS Increments Experiment (Dancer1)

4.16 Impact of Geometrical Score s4 Experiment (Dancer1)

4.17 Impact of Range r (k) on Color Similarity Score scck Experiment (Dancer1)

4.18 Interior Volume Labeling Experiment (Dancer1)

List of Tables

1.1 Feature Comparison with Closely Related Works

1.1 Parameter Setup for Abstract Data Performance Experiments

2.1 Parameter Setup for Concrete Data Performance Experiments

3.1 Parameter Setup for Concrete Data Quality Experiments

4.1 Parameter Setup for Visual Experiments

1 Introduction

As computational machinery and methods are subject to ever increasing progress, traditional applications change and new ones emerge. Years ago, computer graphics and computer vision were two distinct fields, each with its respective area of application. Recently, however, a convergence between both can be observed. Photorealistic models of real-world scenes are not only obtained by artificially creating shape and appearance, but also by reconstructing these models from photographic or video data of the real world. This results in a powerful contribution to our multimedia-influenced environment.

1.1 Application

The ability to reconstruct dynamic real-world scenes from multiple video streams enables a wide range of applications. 3D video extends common 2D video in that it is view-independent. Hence, the decision about the fixed camera from which the scene is viewed is shifted from the time and place of recording to the time and place of consumption. 3D video can be used in the context of personal and social human activities like entertainment (e.g. 3D games, 3D events [24], 3D TV [37], 3D video recorders [70]) and education (e.g. 3D books, 3D-based training) or for the preservation of common knowledge (e.g. ethnic dancing performances [39], special skills).

3D video is obtained by analyzing properties like scene geometry and color from multi-viewpoint video streams. This generally involves a high computational effort. Nevertheless, if the reconstruction can be computed in real-time, a broad field of further applications is facilitated. In the field of CSCW (Computer-Supported Cooperative Work), more realistic and immersive telepresence and conferencing systems are created by using 3D cues [44]. In this context, Prince et al. [52] propose a framework where a user can interact with a dynamic model of a remotely captured collaborator, which is overlaid with the real world in a video-see-through HMD. Instead of inserting a dynamic scene into a real environment (Augmented Reality, AR), it is also possible to combine it with a virtual world (Mixed Reality, MR). This enables seamless integration of real and virtual content with appropriate visual cues like reflections, shading, shadowing, occlusion and collision detection. Interactive systems for reconstruction and insertion of actors into virtual worlds are proposed by Hasenfratz et al. [21] and more recently Pomi and Slusallek [49], contributing to the field of virtual television studio techniques.

Dynamic 3D data may also be used as an intermediate result for further processing and information extraction. Scene analysis allows for object recognition or motion tracking. The latter is often applied as a non-intrusive approach to human motion capturing [43].

Supporting tools for complex 3D model creation are powerful, but also expensive and difficult to use.

[Figure not included in this excerpt]

Figure 1.1: A Taxonomy of 3D Shape Acquisition Techniques (Curless [9]).

Obtaining realistic models is still a time-consuming process. As mentioned earlier, 3D models can be created from photographic data of the real world. This is generally possible by performing an automatic reconstruction. However, better performance, quality and handling of scenes with difficult properties are achieved using techniques that require user interaction [76]. Here, real-time computation allows for immediate feedback on actions and parameter adjustments.

1.2 Classification

The proposed approach performs 3D shape reconstruction using images or videos from multiple viewpoints around the scene. This is one of many different techniques that can be used to accomplish the task of 3D shape acquisition (Figure 1.1). It belongs to the class of non-contact, reflective, optical, passive techniques. Passive refers to the fact that only light already present in the scene is captured. Active-light methods, in contrast, perform controlled illumination, e.g. by projecting a coded pattern, to accomplish a more robust extraction of 3D features. Passive-light techniques are also referred to as Shape from X, where X denotes the particular image cue used to infer shape. Common cues are stereo, shading, focus/defocus, motion, silhouettes and scene photo-consistency. While there also exist methods using uncalibrated cameras for shape recovery, the proposed approach assumes that the position and orientation of each camera relative to the world coordinate frame is known. General surveys of passive-light techniques are given by Dyer [12] and Slabaugh et al. [60].
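The assumption of calibrated cameras can be made concrete with a minimal sketch of the pinhole projection that relates a world point to a pixel. The matrices below (intrinsics K, rotation R, translation t) and their toy values are purely illustrative, not taken from the thesis:

```python
def project(point, K, R, t):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    K: 3x3 intrinsic matrix, R: 3x3 rotation, t: translation vector
    (extrinsics), all given as nested lists. Returns (u, v) pixels.
    """
    # Transform into the camera frame: x_cam = R * X + t
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Apply intrinsics, then divide by depth (perspective division)
    pix = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return pix[0] / pix[2], pix[1] / pix[2]

# Identity pose, focal length 500 px, principal point (320, 240)
K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 0]

u, v = project([0.2, -0.1, 2.0], K, R, t)
print(u, v)  # → 370.0 215.0
```

Each source camera in a multi-view setup carries such a mapping, obtained by the calibration procedure discussed in chapter 3.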

As several shape from X approaches rely on special image cues being present, three general classes for 3D shape reconstruction and texturing from multiple photographs have developed: (1) image-based stereo (IBS), (2) volume intersection (VI) and (3) photo-consistency (PC), each with its own advantages and drawbacks. There is a general difference between IBS and VI/PC. IBS performs an image-based search to find corresponding pixels and generates 3D points by triangulation. The result is a depth map or a partial surface. Several disadvantages are inherent in this. (1) For the correspondence search to be efficient, the views must be close together (small baseline). (2) If more than two views are used, correspondences must be searched in each pair of views, resulting in a high complexity (multi-baseline). (3) For reconstructing a whole 360 degree scene model, many partial surfaces have to be computed for a set of reference viewpoints. Integrating these distinct surfaces into a globally consistent model can be complex, since the alignment is often carried out in a least-squares sense as proposed by Curless and Levoy [10]. (4) Only a sparse point cloud is obtained when pixel correspondences or image features are hard to find. In general, there is no neighborhood relation between the 3D points, as they are not located on a regular grid. Hence, generating a mesh-based surface representation implies further computation. (5) There is no handling of occlusion between the different views, which might lead to errors throughout the correspondence search.
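The triangulation step of IBS can be illustrated with the textbook midpoint method: given two viewing rays through corresponding pixels, recover the 3D point closest to both. This is a generic sketch, not the formulation used by any of the cited systems:

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Two-view triangulation by the midpoint method: find the point
    closest to both viewing rays c_i + s * d_i (camera center c, ray
    direction d, each a 3-element list)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w0 = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b          # zero only for parallel rays
    s = (b * e - c * d) / denom    # ray parameters of the closest points
    t = (a * e - b * d) / denom
    p1 = [c1[i] + s * d1[i] for i in range(3)]
    p2 = [c2[i] + t * d2[i] for i in range(3)]
    return [(p1[i] + p2[i]) / 2 for i in range(3)]

# Two cameras 2 units apart, both seeing a point 2 units in front
print(triangulate_midpoint([-1, 0, 0], [1, 0, 2], [1, 0, 0], [-1, 0, 2]))
# → [0.0, 0.0, 2.0]
```

The hard part in practice is not this closed-form solve but finding the pixel correspondences that define the rays, which is exactly where the small-baseline and occlusion problems listed above arise.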

Instead of the image-based search in IBS, VI/PC methods perform volumetric scene modeling and thus compensate for these disadvantages. The methods operate directly in the world coordinate frame and test, for a particular voxel, whether it belongs to the scene or not. The images act as a set of constraints that the unknown scene has to fulfill. To limit the search space, some initial volume is defined in which the scene is known to be contained. The theoretical foundations of shape from silhouette (VI) and shape from photo-consistency (PC) are laid by Laurentini [29] and Kutulakos and Seitz [28] respectively. Both approaches have advantages and drawbacks. Shape from silhouette relies on a high quality image segmentation and generally cannot recover smooth convex and concave surface patches. The reconstructed shape is known as the visual hull of the scene. Nevertheless, silhouette-based methods are often applied where performance matters, as they are simple, robust and fast to compute. Shape from photo-consistency does not have these disadvantages. It enables high quality shape reconstruction and offers a straightforward way to approximate a view-independent voxel color. The resulting shape is known as the photo hull of the scene. However, the method suffers from a high computational cost. Interestingly, both techniques are related: shape from photo-consistency can be seen as a generalization of shape from silhouette in which further constraints are added. Due to this, the visual hull is proven to contain the photo hull. The proposed approach exploits this relation by computing the visual hull as the initial volume for the reconstruction of the photo hull.
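The photo-consistency constraint can be illustrated with a simple variance-based test over the colors that a voxel projects to in the cameras that see it. This is a generic textbook sketch; the thesis defines its own measure (in CIELab space, accounting implicitly for visibility), and the threshold value here is arbitrary:

```python
def photo_consistent(colors, threshold=30.0):
    """Generic variance-based photo-consistency test (illustrative only).

    colors: list of (r, g, b) samples, one per camera that sees the voxel.
    A voxel is kept if the per-channel standard deviation stays below the
    threshold, i.e. all cameras observe roughly the same surface color.
    """
    n = len(colors)
    if n < 2:          # seen by fewer than two cameras: cannot be rejected
        return True
    for ch in range(3):
        mean = sum(c[ch] for c in colors) / n
        var = sum((c[ch] - mean) ** 2 for c in colors) / n
        if var ** 0.5 > threshold:
            return False
    return True

# Consistent: all cameras see nearly the same red surface
print(photo_consistent([(200, 10, 10), (195, 12, 14), (205, 8, 11)]))   # True
# Inconsistent: one camera sees the background instead of the surface
print(photo_consistent([(200, 10, 10), (30, 180, 40), (205, 8, 11)]))   # False
```

Carving away every voxel that fails such a test, in a visibility-respecting order, is the essence of the Space Carving algorithm discussed in chapter 3.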

1.3 Performance

The extraction of the 3D geometry as a set of voxels is only the first part of a possible processing pipeline. Further tasks may include isosurface extraction, photorealistic texturing, insertion into virtual and real environments for rendering, as well as higher-level feature extraction for scene recognition. To perform the entire processing pipeline in real-time, attempts have been made to develop scalable parallel algorithms that are executed on PC cluster systems [39][21]. While these commonly provide the scalability and processing power to handle a high number of input views with good quality, they also introduce several drawbacks, including high hardware cost and a complex, bulky setup which restricts their usage to professional and static environments.

Focusing on real-time processing on a single PC, both algorithmic complexity and computational hardware have to be improved. Regarding the latter, besides the rapid progress in CPU clock speed, there exist assembly-level optimizations for the CPU, such as the MMX and SSE extensions from Intel. However, while this may speed up 3D shape reconstruction, it also entirely occupies the CPU, so that no other pipeline tasks can be performed.

A way to increase the performance of 3D shape recovery on a single PC while saving CPU cycles for other tasks is to leverage the power of current off-the-shelf Graphics Processing Units (GPUs). The GPU is a distinct processor beside the CPU, originally introduced to free the CPU from an increasing amount of 3D graphics processing. GPUs are optimized to achieve a high throughput (streaming) on a special arithmetic instruction set (SIMD). For comparison, an Intel Pentium4 CPU at 3.6GHz achieves a peak of 14.4 GFLOPs for its SSE3 unit. For the Nvidia GPUs GeForce 7800GTX/512MB (G70), GeForce 7900GTX (G71) and GeForce 8800GTX (G80) it is 165, 250 and 518 GFLOPs respectively. GPUs are not only very fast processors, they are also developing at a higher rate of growth than CPUs. With an average yearly rate of 1.7x/2.3x for pixels/vertices per second compared to a rate of 1.4x for CPU performance, GPU growth outpaces Moore's Law for traditional microprocessors [47]. This divergence of the performance curves is related to fundamental architectural differences. CPUs aim at achieving high performance in sequential computations. As these have a very generic nature, CPUs spend much of their processing power on non-computational tasks like branch prediction and caching. GPUs, in contrast, are specialized to achieve high arithmetic intensity by performing parallelized graphics computations. On the other hand, this specialized nature makes it difficult to employ a GPU as a processor for generic computations; algorithms have to be mapped to emulate rendering tasks. The proposed approach takes this route and employs the GPU for the task of real-time 3D reconstruction by shape from photo-consistency on a single PC. A recent overview of GPU architecture and its related programming paradigms is given by Owens et al. [47].

1.4 Contribution

Most closely related to the proposed approach are the works on GPU-accelerated Space Carving by Sainz et al. [55] and Zach et al. [73]. These restore a full, explicit and view-independent volumetric scene model. However, since they explicitly compute visibility and process interior voxels, they are not real-time capable.

GPU-accelerated image-based methods for computing the photo hull, such as Li et al. [35], allow for real-time execution but lack many other features. As these algorithms are devoted to the generation of novel viewpoints, they only recover a partial, implicit and view-dependent scene model. If GPU acceleration is not used, the model is naturally explicitly available on the CPU, as in Slabaugh et al. [61]. Nevertheless, the performance decreases about seven times compared to [35], and real-time processing is only achieved by employing a coarse space sampling, which affects quality.

Multi-baseline image-based stereo (IBS) can also be computed in real-time by employing GPU acceleration, as in Yang et al. [72] and Yang and Pollefeys [71]. These approaches provide an explicit representation as a disparity or depth map, but also suffer from the general drawbacks of IBS and hence do not generate an entire, view-independent scene model.

[Table not included in this excerpt]

Table 1.1: Feature Comparison with Closely Related Works.

The proposed approach describes a framework which employs GPU hardware-acceleration for shape from photo-consistency to generate an explicit volumetric representation of a scene captured from multiple calibrated viewpoints. Several works deal with this topic, all performing 3D reconstruction on a single PC. However, none fulfills all of the features proposed with this approach (Table 1.1). In particular, combining real-time processing with the ability to generate an entire volumetric model has not been achieved. While realizing this, the proposed approach defines a novel photo-consistency measure implicitly accounting for visibility and specular highlights. Moreover, the entire image processing is also executed GPU-accelerated on the single reconstruction PC and not on a PC cluster, as is generally done.

1.5 Overview

The remainder of this thesis is organized as follows. Chapter 2 surveys related work in the field of dynamic scene reconstruction by shape from silhouette and shape from photo-consistency. The focus lies on high performance reconstruction and hardware-acceleration. Chapter 3 introduces the theoretical basis for the proposed approach in three main parts. (1) Camera geometry is important to relate images captured from multiple viewpoints to an unknown 3D scene. (2) When taking a photograph of a scene, the light emitted towards the camera is captured; therefore light and color are discussed as necessary for this work. (3) Finally, the theory behind shape from silhouette and shape from photo-consistency is explained. Chapter 4 continues by defining the Basic Algorithm as a hybrid approach to scene reconstruction, introducing features like (1) a robust and fast image segmentation to account for shadows, (2) a set of nested volumes that are sequentially computed to speed up computation and (3) an implicit visibility computation. The Basic Algorithm is extended and mapped onto the GPU in chapter 5. The corresponding Advanced Algorithm enables efficient real-time computation by employing a multipass-rendering strategy. Chapter 6 explains and discusses several experiments to analyze the proposed system in terms of performance and reconstruction quality. Chapter 7 concludes with a summary and a discussion of limitations and future work. Due to their number, diagrams and figures relating to experiments are grouped into the additional appendix chapters A, B, C and D.

2 Related Work

The following chapter discusses related research in the field of passive 3D reconstruction using images from multiple calibrated views. The aim is not only to present the different works, but also to connect the particular areas and lay the foundation for the proposed approach.

The objective of the proposed approach is to explicitly reconstruct the entire 3D shape of a dynamic scene in real-time on a single PC. Explicit refers to the fact that a 3D scene description is obtained as output. Implicit methods only perform an internal 3D reconstruction as an intermediate processing step, e.g. in combination with generating novel views of the scene. Making data explicit is generally no problem if the CPU is used for reconstruction. However, the proposed approach employs the GPU, which involves an expensive read-back operation.

An important property is view-dependency, which refers to the fact that distinct views of the same scene part also result in distinct 3D shapes. View-dependency may occur for techniques that do not compute the entire scene model, including techniques that generate novel views or depth maps from existing views. The proposed approach aims at restoring an entire, view-independent scene model. Here, further performance and quality issues come into play. First, reconstructing the entire scene needs more computation. Second, the resulting model should be globally consistent, which is generally not easily achieved by merging multiple partial surfaces from view-dependent approaches.

Using GPU acceleration, real-time reconstruction in the mentioned way is possible for the fast shape from silhouette technique. However, this imposes several limitations on the quality of the result. To compensate for this, several hybrid approaches have been suggested to generate a refined shape using shape from photo-consistency. These methods, however, do not achieve real-time or interactive framerates, which are only obtained with GPU-accelerated view-dependent reconstruction.

The following review is separated into two parts. First, works using the fast and efficient shape from silhouette technique are discussed in terms of their performance, relating either to algorithmic improvements or to the mapping onto a PC cluster or the GPU. These techniques are chosen as they are directly related to this work. The second part reviews approaches that generate a better quality shape approximation using photo-consistency or image-based stereo.

2.1 Shape from Silhouette

Given a set of images from different viewpoints in which the scene is entirely included, volume intersection can be used to compute a shape approximation which is itself a subset of the convex hull and known to contain the real object. The approach is rather simple and based on the assumption that any ray from the camera center through a pixel from which the original scene is visible must intersect the (unknown) scene. If all such rays are constructed for the whole set of images, their intersection in 3D is referred to as the volume intersection.

2.1.1 Image Segmentation

To determine the pixels from which an object is visible, image segmentation is performed. The resulting set of foreground pixels in an image is called the silhouette of the 3D scene and is represented as a binary mask. Therefore, another common synonym for volume intersection methods is shape from silhouette. For the segmentation, knowledge about the background is obtained in advance by taking images from all viewpoints, this time without any scene present. In the simplest case, the color of a background pixel is compared with the corresponding pixel in the real image. If the difference exceeds some threshold, the pixel is assumed to belong to the foreground. More robust but complex approaches, such as Horprasert et al. [22], generate a color distribution function for each background pixel. Moreover, background and foreground can be expressed by a non-parametric model which is adapted at runtime to compensate for various effects, as in Toyama et al. [66] and more recently Elgammal et al. [13]. While these approaches run on the CPU in real-time, image segmentation on the GPU achieves an even higher performance. Griesser [20] presents a robust adaptive implementation achieving 4ms for a VGA image. A theoretical analysis of image segmentation related to the color of the background is done by Smith and Blinn [62]. Finally, a complete and recent overview of image segmentation techniques is given by Zhang [75].
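The simplest case described above, per-pixel background differencing with a threshold, can be sketched as follows. The data layout and the threshold value are illustrative assumptions, not the thesis's actual segmentation:

```python
def segment(image, background, threshold=40):
    """Per-pixel background differencing: a pixel is foreground when its
    Euclidean color distance to the pre-recorded background image
    exceeds a threshold.

    image, background: 2D grids of (r, g, b) tuples of equal size.
    Returns a binary silhouette mask (1 = foreground).
    """
    silhouette = []
    for img_row, bg_row in zip(image, background):
        row = []
        for (r, g, b), (br, bgc, bb) in zip(img_row, bg_row):
            dist = ((r - br) ** 2 + (g - bgc) ** 2 + (b - bb) ** 2) ** 0.5
            row.append(1 if dist > threshold else 0)
        silhouette.append(row)
    return silhouette

# A uniform gray background; the scene adds two reddish pixels on the right
background = [[(100, 100, 100)] * 3] * 2
image = [[(100, 100, 100), (102, 99, 101), (200, 50, 50)],
         [(100, 100, 100), (100, 100, 100), (210, 60, 40)]]
print(segment(image, background))
# only the two reddish pixels are labeled foreground
```

The noise-tolerant second pixel shows why a threshold is needed; the more robust statistical models cited above essentially replace the fixed threshold with a learned per-pixel distribution.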

2.1.2 Foundations

When a set of silhouette images is available, volume intersection can be performed. It is first described by Martin and Aggarwal [38] and often employed in subsequent works. The theoretical foundation is established by Laurentini [29], proposing a problem definition and introducing the visual hull as the best possible shape approximation obtainable from volume intersection. Laurentini extends the theory in [30] and discusses practical estimations for the necessary number of silhouettes depending on scene complexity in [31]. A more recent survey of visual hull related works is found in [32]. An introduction and discussion of the corresponding theory is found in (ref section).

The volume intersection is represented either as an exact geometrical object or as a discrete volume. A geometric representation is obtained by intersecting the polyhedra of the backprojected silhouette cones in 3D or by performing image-based polygon intersection in 2D. For a discrete representation, different strategies exist for volumetric and image-based approaches. Volumetric approaches project and intersect the silhouettes on slices of voxels. Image-based methods perform ray casting to find the 3D intersections with the visual hull.
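The volumetric strategy can be sketched in a few lines: a voxel survives only if its center projects into the silhouette of every view. The camera interface below (a projection function plus a binary silhouette grid, with toy orthographic cameras) is a deliberately simplified stand-in for the GPU implementation developed later:

```python
def carve_visual_hull(voxels, cameras):
    """Discrete volume intersection: keep a voxel only if its center
    projects to a foreground pixel in every silhouette.

    voxels: list of (x, y, z) voxel centers.
    cameras: list of (project, silhouette) pairs, where project maps a 3D
    point to integer pixel coordinates (u, v) and silhouette is a binary
    grid indexed as silhouette[v][u] (illustrative interface only).
    """
    hull = []
    for voxel in voxels:
        keep = True
        for project, silhouette in cameras:
            u, v = project(voxel)
            h, w = len(silhouette), len(silhouette[0])
            # Projecting outside the image or onto background carves the voxel
            if not (0 <= u < w and 0 <= v < h) or silhouette[v][u] == 0:
                keep = False
                break
        if keep:
            hull.append(voxel)
    return hull

# Toy orthographic cameras over a 3x3x3 grid: one views along z, one along x
sil_z = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # only the center column x=1, y=1
sil_x = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]   # the full row y=1
cams = [(lambda p: (p[0], p[1]), sil_z),
        (lambda p: (p[2], p[1]), sil_x)]
voxels = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
print(carve_visual_hull(voxels, cams))  # → the three voxels along x=1, y=1
```

The intersection of the two silhouette cones leaves exactly the bar of voxels consistent with both views, illustrating why the visual hull can only ever over-approximate the true shape.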

2.1.3 Performance of View-Independent Reconstruction

This section discusses attempts at increasing the performance of view-independent reconstruction, where a globally consistent 3D model is generated.

CPU

Performance can be improved in several ways. Many discrete approaches realize a LOD6 reconstruction using an octree representation. The space is sampled in a coarse-to-fine hierarchy depending on scene complexity. An important work in this area is presented by Potmesil [50] and significantly improved by Szeliski [64].
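
The coarse-to-fine idea can be sketched with a generic classification oracle. The `half_space` demo object and the 'inside'/'outside'/'boundary' labels are illustrative stand-ins for the silhouette tests used by the cited works:

```python
def carve_octree(cube, depth, classify):
    """Coarse-to-fine octree carving. `classify(cube)` is a hypothetical
    oracle returning 'outside', 'inside' or 'boundary' for an axis-aligned
    cube (origin, size). Boundary cubes are subdivided until depth 0."""
    label = classify(cube)
    if label == 'outside':
        return []                          # carved away
    if label == 'inside' or depth == 0:
        return [cube]                      # kept at this resolution
    (x, y, z), s = cube
    h = s / 2.0
    kept = []
    for dx in (0, h):                      # visit the eight octants
        for dy in (0, h):
            for dz in (0, h):
                kept += carve_octree(((x + dx, y + dy, z + dz), h),
                                     depth - 1, classify)
    return kept

# Demo oracle: the object occupies the half-space x < 0.5.
def half_space(cube):
    (x, _, _), s = cube
    if x >= 0.5:
        return 'outside'
    if x + s <= 0.5:
        return 'inside'
    return 'boundary'

kept = carve_octree(((0.0, 0.0, 0.0), 1.0), 1, half_space)
```

Only cubes straddling the object boundary are subdivided, so the work concentrates where the surface actually is.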

Fast and scalable systems propose to use a PC cluster for reconstruction. It is obvious that these are able to achieve real-time performance. The drawback lies in a static system setup involving a lot of expensive hardware. The experimental results for the following works are all achieved for capturing a dynamic dancer scene. Borovikov and Davis [3] utilize 14 cameras with 16 PentiumII 450MHz CPUs, connected via 100Mbit Ethernet. Reconstruction is done with an octree approach in a volume of 2 x 2 x 2m. At voxel sizes of 31, 16 and 8mm, a performance of 10, 8 and 4fps is obtained respectively. While this system depends mainly on octree depth and corresponding transmission time, Wu and Matsuyama [69] use a 10Gbit Myrinet network to increase bandwidth and scalability. Their framework handles a varying number of VGA cameras, each connected with a dual PentiumIII 1GHz CPU. When using six cameras in a volume of 2 x 2 x 2m and a voxel size of 20mm, 26fps are obtained. The framerate can be further increased with the number of PCs. Using 10 PCs, the system runs at 43fps. The work of Ueda et al. [67] introduces a complete PC cluster based reconstruction and rendering architecture. It uses nine VGA cameras and 21 Pentium4 3GHz CPUs, also connected via 10Gbit Myrinet network. For a volume of 2.56 x 2.56 x 2.56m and a voxel size of 20mm they report an overall performance of 10fps. This framerate is measured for the entire processing pipeline, where the single processes are executed in parallel.

GPU Acceleration

The work of Hasenfratz et al. [21] is important, as it is the only known system making the GPU-reconstructed view-independent volume intersection explicit. It proposes a real-time framework for reconstruction and insertion of actors into virtual worlds. The reconstruction is done by sweeping a plane through the volume. At each position, the silhouette textures are projected onto the plane and combined by multitexturing. If all silhouettes project onto a particular voxel, it is contained in the volume intersection. The data is made explicit by reading back the slices from the framebuffer to the CPU. The system uses four VGA cameras with respective PCs performing the 2D processing. The images are then transferred via 1Gbit network to an Onyx 3000-IR3 server with eight R12000 400MHz CPUs. For the reconstruction, a volume of 2 x 2 x 2m with a rather large voxel size of 30mm is used. With a texture upload of 2ms and a reconstruction time of 13ms, a framerate of 67fps is achieved. The read-back implies a further performance impact, as interior voxels also have to be transferred.

2.1.4 Performance of View-Dependent Reconstruction

The fundamental difference of the following works is that they aim at rendering novel views. That means the 3D data is usually only view-dependent and partially reconstructed. This is related to IBR7 (Buehler et al. [6]) in the way that additional 3D information is leveraged.

CPU

Matusik et al. [42] describe an efficient image-based sampling approach for computing and texturing visual hulls, called Image-Based Visual Hulls (IBVH). A ray-based reconstruction is done for the pixels from where the scene is visible in the destination image (Figure 2.1). Considering [42], the performance depends mostly on the number of pixels and lies at about 8fps for an average VGA image using a quad CPU 550MHz server. For the test, four cameras are used at a resolution of 256x256 with corresponding PCs for image processing. Matusik et al. present an improved adaptive version in [40]. Depending on the scene, a speed-up of 4-9x is achieved. Matusik et al. continue their work on visual hull rendering and present a different approach in [41]. For the novel view, an exact geometrical polyhedral representation is computed by employing image-based polygon intersections in 2D. View-dependent texturing is done on the GPU when rendering, using the approach of Buehler et al. [6]. Four cameras are employed at QVGA resolution, each with a respective PC. Using a dual PentiumIII 933MHz CPU, framerates of 15fps are reported for pure geometry construction and 30fps for the rendering. Prince et al. [52] propose a framework for real-time reconstruction and novel-view rendering in an AR telepresence environment. Here, a remotely captured human is rendered on an ARToolkit [25] marker for the viewpoint of a person wearing video-see-through glasses. Interaction with the 3D video is done by adjusting the marker's position and orientation. The image-based reconstruction is done similarly to Matusik et al. [42]. Image acquisition is performed using 14 VGA cameras, connected to five PCs doing the image processing. Images are distributed to a rendering server over a 1Gbit network. With a rendered resolution of 384x288, the system achieves a framerate of 25fps.

GPU Acceleration

The proposed works of Matusik et al. perform image-based volume intersection and can thus decrease algorithmic complexity for novel-view rendering on the CPU. In contrast to this, Li et al. [?] propose an approach for novel-view rendering entirely on the GPU. For a particular silhouette, all faces of the corresponding 3D silhouette cone are rendered sequentially and projectively textured from all other views.

Illustration not included in this excerpt.

Figure 2.1: Image-based Ray Sampling of the Visual Hull (IBVH) (Matusik et al. [42]).

A desired novel view of the visual hull is rendered by ray-based sampling. (1) A 3D ray is projected into the reference views to (2) determine the intersections with the visual hull. (3) Backprojection of the intersections from all reference views leads to the 3D intersection with the visual hull. Sampling for the novel view is done in scanlines to determine which rays have to be sampled.

For these textures, the alpha-channel encodes the silhouette occupancy information. Parts of the face where not all silhouettes are projecting obtain an alpha-value of zero and are stenciled out. This is done for all faces of all silhouette cones respectively. The framework is tested by using four QVGA input cameras with corresponding image processing PCs. The rendering server is a dual core Pentium4 1.7GHz PC with a GeForce3 GPU. Together with view-dependent texturing, a framerate of 84fps is achieved.

Lok [36] proposes a completely different approach to novel-view generation. It uses a plane-sweep technique like in Hasenfratz et al. [21], this time for view-dependent surface generation instead of volumetric sampling. The slices are generated with non-uniform distance and increasing size according to the frustum of the viewing camera. Unlike [21], a particular slice is rendered N times, where N is the number of silhouettes. The volume intersection is computed by accumulating the projected silhouettes using the stencil buffer. Since stencil and color buffer are not cleared after each slice, the depth buffer finally contains the view-dependent surface of the visual hull. The data resides on the GPU and is not made explicit. From the depth buffer a coarse mesh is generated and view-dependent textured. The system uses five NTSC cameras with respective PCs for image processing. The reconstruction is obtained at 12-15fps for a volume of 2.4 x 1.8 x 1.8m and a voxel size of 10mm.
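
The per-ray computation underlying IBVH (Figure 2.1) can be reduced to intersecting 1D parameter intervals: each reference view contributes the intervals in which the viewing ray lies inside the backprojected silhouette cone, and the ray is inside the visual hull only where all views agree. A sketch of that intersection step, with hand-made interval lists standing in for real projections:

```python
def intersect_intervals(per_view_intervals):
    """IBVH-style step (3): intersect, for one viewing ray, the
    parameter intervals [(t_in, t_out), ...] in which the ray lies
    inside each backprojected silhouette cone. The result is the set
    of intervals where the ray is inside the visual hull."""
    def pair_intersect(a, b):
        out = []
        for a0, a1 in a:
            for b0, b1 in b:
                lo, hi = max(a0, b0), min(a1, b1)
                if lo < hi:
                    out.append((lo, hi))    # keep non-empty overlaps only
        return out
    result = per_view_intervals[0]
    for ivs in per_view_intervals[1:]:
        result = pair_intersect(result, ivs)
    return result

# Three views constrain the ray to (2, 3) and (4, 5).
hull = intersect_intervals([[(1.0, 5.0)],
                            [(2.0, 6.0)],
                            [(0.0, 3.0), (4.0, 10.0)]])
```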

2.1.5 Conclusion

The works introduced so far compute and use knowledge about a captured scene by employing shape from silhouette. Hence, a robust and fast image segmentation technique is essential. For the proposed approach, a simple variant of color distance thresholding is used. Several modifications are introduced to increase robustness.

By distributing the reconstruction over a PC cluster, the entire processing pipeline can be performed in real-time. However, this results in a complex distributed algorithm and an expensive static hardware setup. Using a single PC instead, the performance can be increased by using a LOD representation. If GPU acceleration is leveraged, the reconstruction result has to be read back to the CPU. However, this operation has a severe impact on performance, as interior voxels are also reconstructed, even if a low volumetric resolution of 64 x 64 x 64 voxels is used. This problem does not occur for the proposed approach, as only surface voxels are read back. A labeling algorithm is proposed to restore interior voxels later on the CPU if necessary.

If the goal is not an explicit 3D reconstruction but the rendering of novel views, interactive framerates are achieved using a single PC. The corresponding techniques perform either exact geometric intersection or discrete space sampling. Both approaches are done either image- or scene-based. The main issue is that only a partial view-dependent surface is reconstructed. The discrete image-based methods perform a ray-based space sampling. This strategy is similar to the proposed approach. Both works are related to sampling a CSG8 scene by ray-casting [54], as the volume intersection of the generalized silhouette cones can be modeled using CSG.

The proposed approach is actually related to both discrete view-independent and view-dependent techniques. It is view-independent in the sense that an entire volumetric 3D model is generated. On the other hand, this is achieved by ray-casting from multiple views around the scene. However, the difference is that the proposed approach uses parallel rays to sample a regular grid and applies an interleaved sampling technique to exploit coherence between all virtual views.

Similar to the proposed approach, all works handle a scene space of around 2 x 2 x 2m. However, the volumetric resolution ranges from 8 to 30mm. The input images are usually captured at QVGA resolution. This depends either on the bandwidth of the network or, for the GPU approaches, on the bandwidth of the texture upload. Moreover, a maximum number of five input cameras is used. This relates either to the mentioned bandwidth issue or to limitations of the GPU architecture. The proposed approach employs eight cameras and experiments with image resolutions up to XVGA. Considering image processing, all systems use a PC cluster for this task. Usually, each camera is connected to a corresponding PC. Instead, the proposed approach uses the cluster only for image capturing, while image processing is entirely performed on a single server. This is a big difference, since the algorithm can also be applied in a setup where all cameras are connected directly to the server.

2.2 Shape from Photo-Consistency

Techniques that reconstruct a 3D scene by shape from silhouette employ information about foreground and background pixels. The 3D shape is known as the volume intersection, since it represents the 3D intersection of the backprojected 2D silhouettes. Shape from photo-consistency is a generalization of this approach using additional greyscale or color information. Applying further constraints, these methods can achieve a tighter approximation of the true shape. The property of photo-consistency relates to the fact that the radiance from a particular (unknown) scene point should be consistent with the irradiance measured in the corresponding image pixels from where the point is visible. For each scene point it can thus be determined whether it is consistent with the set of images. This is the inverse of the approach taken by image-based stereo algorithms. Instead of evaluating the pixels that correspond to a particular scene point, they start from the images and try to find pixels that may correspond to the same scene point. The coordinates of the point are then obtained by triangulation. Due to this, shape from photo-consistency is also referred to as scene-based stereo. Similar to the term photo-consistency, Leclerc et al. [34] define self-consistency as a pixel correspondence measure for evaluating image-based stereo algorithms. When comparing both approaches, image-based stereo shows several drawbacks, in particular related to the reconstruction of a globally consistent scene. Also, the recovered points are arbitrarily located within point clouds and hence do not correspond to a voxel representation on a regular grid. However, since image-based stereo algorithms (1) are related to the ray-based sampling employed for the proposed approach and (2) achieve real-time framerates for the generation of novel views, they are considered for this review. A survey of two-view stereo algorithms is provided by Scharstein and Szeliski in [57].

2.2.1 Foundations

The property of photo-consistency for restoring an unknown scene from a set of photographs is first introduced by Seitz and Dyer [58]. This work defines an algorithm for the reconstruction of a photo-consistent shape from a sufficiently textured scene, called Voxel Coloring. Starting from an initial opaque volume, incremental carving of inconsistent voxels yields a shape that converges towards the real scene. However, due to voxel visibility considerations there exist restrictions on the camera placement. Kutulakos and Seitz [28] extend this work by defining the Space Carving algorithm, which generalizes Voxel Coloring towards arbitrary camera placement. Furthermore, they provide the theoretical foundation for the problem of scene reconstruction from multiple images and define the photo hull as the maximal photo-consistent shape that can be achieved using this technique. An introduction and discussion of the corresponding theory is found in (ref section).

Similar research is carried out by Culbertson et al. [8], proposing the Generalized Voxel Coloring (GVC) algorithm, which differs from Space Carving (SC) in that another approach is taken to maintain visibility9. Two versions are proposed, using different volumetric data structures and reducing either (1) the number of consistency checks or (2) the necessary memory footprint. While SC is realized by sweeping a plane through the volume, GVC operates on a surface voxel representation. In comparison with SC, GVC is reported to achieve better reconstruction results.

2.2.2 Performance of View-Independent Reconstruction

This section discusses approaches to increase the performance of view-independent reconstruction techniques generating a full discrete 3D model.

CPU

To compensate for an inaccurate camera calibration, Kutulakos [27] extends the research of Kutulakos and Seitz [28] by proposing the Approximate Space Carving (ASC) algorithm. SC evaluates photo-consistency for a voxel by testing the consistency of the image pixels from where the voxel is visible. ASC extends this by incorporating a disc of pixels with radius r around the projection pixel. The so-called r-consistency is achieved if there exists at least one pixel in each disc making the entire set of pixels photo-consistent. This defines a nested set of shapes, converging to the photo hull for r → 0. [27] suggests to implement a multi-resolution coarse-to-fine reconstruction using r-consistency to increase performance. In this case, r corresponds to the size of the voxel's footprint at a particular LOD. A coarse-to-fine approach for the similar Voxel Coloring algorithm is already suggested earlier by Prock and Dyer [53]. This is also performed by analyzing the entire set of pixels where a voxel projects in each image. Achieving equal quality, the speed-up of this method increases with volumetric resolution, from 1.23x for 32³ to 41.1x for 512³ respectively.
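
A minimal form of the consistency test at the core of these algorithms checks the spread of the colors observed by the cameras that see a voxel. The standard-deviation criterion and the threshold value here are illustrative simplifications, not the exact test of the cited works:

```python
import numpy as np

def photo_consistent(pixel_colors, threshold=20.0):
    """Simple photo-consistency test: a voxel is consistent if the
    combined per-channel standard deviation of the colors observed
    in the cameras that see it stays below a threshold.
    `pixel_colors` is an (n, 3) sequence of RGB samples."""
    colors = np.asarray(pixel_colors, dtype=np.float64)
    deviation = np.linalg.norm(colors.std(axis=0))
    return bool(deviation < threshold)

# Nearly identical observations pass; clashing colors are carved.
consistent = photo_consistent([(100, 50, 50), (102, 49, 51), (99, 52, 50)])
inconsistent = photo_consistent([(255, 0, 0), (0, 255, 0)])
```

An r-consistency variant would evaluate the same criterion over discs of pixels around each projection instead of single pixels.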

Prock and Dyer [53] also propose to use a multi-resolution approach to exploit time-coherency for dynamic scenes. This is done by employing a coarse level of the shape in frame t as the initial volume for frame t + 1. It is obvious that this strategy does not compensate for discontinuities between frames. However, this can be ignored when capturing a human performance, as movement is always continuous10. By using this approach, Prock and Dyer achieve a speed-up of 2x.

GPU Acceleration

Sainz et al. [55] present an approach to GPU-accelerated reconstruction using a simplified version of the multi plane-sweep implementation of the Space Carving algorithm from Kutulakos and Seitz [28]. Here, a plane moves through a volumetric bounding box in six stages along each direction of the three coordinate axes ±x, ±y and ±z. At each iteration, only the voxels that intersect with the current plane are considered. As the goal lies in the volumetric sampling of a regular grid, the destination camera is set up for orthographic projection. When rendering a particular plane there are two main issues, namely (1) increasing performance by computing the visual hull and (2) accounting for visibility. (1) When projecting the silhouettes from all cameras onto a particular plane, voxels outside the visual hull can be determined and ignored. (2) To account for visibility when sweeping along the viewing direction, already reconstructed voxels are rendered and illuminated from the direction of the input cameras in front of the plane. The corresponding shadows are projected onto the current plane. For a distinct voxel, all visible light sources

Illustration not included in this excerpt.

Figure 2.2: Multi Plane-Sweep Implementation of Space Carving (Sainz et al. [55]).

A plane sweeps in each direction along the three coordinate axes ±x, ±y and ±z to test voxels for photo-consistency. To account for visibility, the already reconstructed scene is illuminated from the reference views to project a shadow on the current sweep plane.

and hence cameras are used to test photo-consistency (Figure 2.2). Experiments are carried out using a Pentium4 2.2GHz server with a Nvidia Quadro Pro GPU. Using images from five cameras, reconstruction takes 33s and 94s for volumetric resolutions of 64³ and 128³ respectively.

A similar system is proposed by Zach et al. [73]. While Sainz et al. [55] maintain a volumetric data structure on the CPU and massively perform expensive upload and read-back operations, Zach et al. keep all necessary data on the GPU. When sweeping the plane along a particular viewing direction, visibility is obtained by updating depthmaps for the already reconstructed volume in the reference views. The work differs from Sainz et al. [55] in that the sweeps are carried out independently of each other. This means that even if a voxel is visible from different subsets of cameras, it is not evaluated again. Performance is tested using an AMD Athlon XP2000 server with an ATI Radeon 9700 Pro GPU. A synthetic scene is captured from 36 cameras, and the images are transferred to the GPU in advance. Four sweeps are applied, each time considering a subset of nine cameras. For volumetric resolutions of 64³ and 128³ voxels, framerates of 1.4fps and 0.7fps are obtained respectively.
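
Stripped of all GPU details, the sweep shared by both systems has the following structure. The dictionary volume and the pluggable consistency test are simplifications for illustration:

```python
def plane_sweep_carve(occupied, is_consistent):
    """Skeleton of a single plane sweep along +z: visit voxels slice
    by slice and carve those failing the consistency test.
    `occupied` is a dict {(x, y, z): True}; `is_consistent` is a
    stand-in for the photo-consistency test, which may inspect the
    voxels still occupied in nearer slices to resolve visibility."""
    zs = sorted({z for (_, _, z) in occupied})
    for z in zs:                                  # sweep front to back
        layer = [v for v in occupied if v[2] == z]
        for v in layer:
            if not is_consistent(v, occupied):
                del occupied[v]                   # carved: frees visibility
    return occupied

# Toy volume: one voxel fails the (dummy) consistency test.
voxels = {(0, 0, 0): True, (1, 0, 0): True, (0, 0, 1): True}
result = plane_sweep_carve(voxels, lambda v, occ: v != (1, 0, 0))
```

Carving a voxel immediately changes what the cameras see at greater depth, which is why the sweep order matters.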

2.2.3 Performance of View-Dependent Reconstruction

CPU

Like for shape from silhouette, real-time performance can be achieved for shape from photo-consistency when generating novel views. Similar to the former, a partial view-dependent representation of the scene is obtained. In this spirit, Slabaugh et al. [61] propose an image-based algorithm for reconstructing the photo hull, called Image-Based Photo Hulls (IBPH). This is a direct extension of the Image-Based Visual Hulls from Matusik et al. [42]. IBPH uses a hybrid approach to increase performance. First, IBVH is applied to quickly sample along the viewing rays until the visual hull is reached. A second step continues ray-sampling with an adaptation of Generalized Voxel Coloring to determine the photo hull. To further increase performance, a lower-resolution image sampling is applied. The color value for each ray is determined by blending the pixels from where the corresponding 3D point is visible. Experiments are performed with five QVGA cameras, each connected to a PC for image segmentation. The data is sent over a 100Mbit network to a dual processor 2GHz HP x4000 server where a multithreaded reconstruction is done. IBPH outperforms IBVH in quality but also leads to a high decrease in performance. For rendering a QVGA image with sampling of each 16th ray, IBPH achieves a framerate of 6fps while IBVH alone runs at 25fps.

GPU Acceleration

Slabaugh et al. [61] perform the entire shape computation on the CPU. Li et al. [35] propose a system for GPU-accelerated rendering of novel views. Their approach can be seen as a view-dependent mapping of Voxel Coloring (Seitz and Dyer [58]) onto the GPU. Reconstruction is done by sweeping a plane through the frustum of the viewing camera and performing photo-consistency testing in the fragment shader. Visibility is handled by backprojecting a particular plane into the input views to mark occluded pixels. To speed up computation, only a bounding box around the visual hull is processed. For this, the fast visual hull rendering from Li et al. [?] is invoked at a very low resolution for the novel view. Eight QVGA cameras are used for experiments. After image processing on the corresponding PCs, the data is sent via 1Gbit network to a Pentium4 1.7GHz server with a Nvidia GeForce FX 5800 Ultra GPU. Novel views are rendered at QVGA, where the bounding rectangle occupies about one third of the image. 2fps are achieved for reconstructing a dancer using 60 depth planes. A speed-up of 7x is obtained in comparison to an implementation of IBPH on a dual processor 2GHz server.

GPU-accelerated image-based stereo (IBS) approaches differ in their goal, which is usually either (1) the generation of a depthmap or disparity map for an existing view or (2) the construction of a novel view from at least two nearby cameras. (2) is similar to the explained photo-consistency (PC) approaches, because reconstruction is also done by view-dependent sampling and consideration of color distances between pixels from multiple views. The difference lies in the fact that IBS does not explicitly handle scene-space occlusion. Hence, the sampling approaches differ in the way space-occupancy decisions are performed. For the comparison, it can be assumed that PC and IBS both step along a viewing ray for a particular destination pixel. In case of PC, non-surface points are carved at each depth step. This immediately updates the visibility of points at increased depth. Instead, IBS first steps along the whole ray and just determines color distances between the corresponding pixels for each respective 3D point. Finally, the 3D point with the minimal color distance for its corresponding pixels is chosen.
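
The two per-ray decision rules just contrasted can be written side by side. Both functions are schematic, with matching costs and the consistency predicate supplied from outside:

```python
def winner_take_all(color_distances):
    """Image-based stereo decision per viewing ray: step the whole
    ray, record one color distance per depth sample, then pick the
    depth with the minimal distance (no explicit occlusion handling).
    `color_distances` maps depth -> matching cost."""
    return min(color_distances, key=color_distances.get)

def first_consistent(depths, consistent):
    """Photo-consistency decision per ray: step front to back and
    stop at the first depth whose point passes the consistency test;
    nearer points are carved, which updates visibility."""
    for d in depths:
        if consistent(d):
            return d
    return None                              # ray misses the surface

best = winner_take_all({1.0: 30.0, 2.0: 5.0, 3.0: 12.0})
surface = first_consistent([1.0, 2.0, 3.0], lambda d: d >= 2.0)
```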

In Yang et al. [72] a novel view is generated from multiple nearby cameras (multi-baseline). This is done similarly to Li et al. [?] by sweeping a plane along the viewing direction. For each viewing ray, the best match is chosen from all intersections with the depth planes. Yang et al. [72] use five QVGA cameras. The images are sent from the respective image processing PCs via 100Mbit network to a server with a Nvidia GeForce3 GPU. The rendering performance varies from 62.5fps for 20 x 128² planes to 2.5fps for 100 x 512² planes. While this system aims at novel-view generation, Yang and Pollefeys [71] propose a plane-sweep implementation for computing disparity maps between pairs of existing views with read-back to the CPU. For more than two cameras, the resulting pairs are used to improve each other by cross-checking. Furthermore, several features to enhance quality and robustness are explained. Experiments are performed on a 2.8GHz server with an ATI Radeon 9800 XT GPU. Using 32 x 512² depth planes, 21.5fps and 10fps are achieved in case of two and eight input cameras respectively.

2.2.4 Conclusion

Reconstruction by shape from photo-consistency generally leads to a better approximation of shape and color. However, (1) increased space sampling, (2) visibility handling and (3) photo-consistency testing result in a low performance compared to shape from silhouette. Like the proposed approach, many works exploit the fact that the photo hull is a subset of the visual hull and thus perform shape from silhouette in advance.

View-independent approaches reconstruct the entire 3D scene. By either performing a multi-resolution reconstruction or exploiting time-coherency in dynamic scenes, the performance of CPU techniques can be increased. Nevertheless, these systems are far from real-time. Using GPU acceleration increases reconstruction performance but makes expensive data transfers necessary. Especially the approach of Sainz et al. [55] suffers from this issue. To reduce data transfers throughout the reconstruction, a 3D texture is maintained, which has a considerable performance impact. Finally, all view-independent GPU-accelerated techniques perform a read-back of interior voxels to the CPU. This is not necessary for the proposed approach, as only surface voxels are restored.

Real-time performance is obtained for view-dependent methods. When reconstructing a view at QVGA resolution with sampling of each 16th ray, the CPU algorithm IBPH achieves a framerate of 6fps, while IBVH alone runs at 25fps. However, this good performance results from the undersampling. Li et al. [35] propose a GPU-accelerated version of Voxel Coloring and achieve 2fps. Moreover, an equal configuration using IBPH on the CPU performs about 7x slower.

Like IBVH, IBPH is also similar to the proposed work, as it performs a ray-casting to find the intersection with the shape to be reconstructed. It is equal in the way that a two-step approach is taken to find the two nested shapes along each ray, visual hull and photo hull. The difference is that the proposed approach performs ray-casting to sample a regular volumetric space. This makes it similar to both view-dependent and view-independent techniques, but in fact it is view-independent. Compared to plane-sweeping, ray-casting has the advantage that interior voxels are not evaluated and thus not read back to the CPU. Further, all systems suffer from an expensive explicit visibility handling. This is different in the proposed work, where visibility is computed implicitly using camera geometry and the color distribution of the entire set of pixels.

3 Fundamentals

This chapter lays the theoretical foundation for the proposed approach. It is subdivided into three parts. (1) A 3D scene is reconstructed from a set of images from known viewpoints. Hence, camera geometry introduces the camera parameters and the corresponding process of calibration. (2) An image represents a measurement of the light emitted from the scene. Thus, the properties of light, its interaction with surfaces and its measurement by the camera are discussed. As color is necessary to describe the proposed approach, the notion of light is extended by spectral properties, leading to color and color distances. (3) The third part introduces the theoretical foundation of the two reconstruction techniques applied for the proposed approach, namely shape from silhouette and shape from photo-consistency. Both are related to each other in order to build the foundation for the proposed approach.

3.1 Camera Geometry

3.1.1 Pinhole Camera Model

The principle of common cameras can be simply described by the pinhole model. On the front side of a closed box, a small hole is pricked with a pin. Light rays radiating from an illuminated scene pass through the hole and arrive at the interior backside of the box. There, the rays form a flipped image of the scene. As the pinhole is ideally very small, just a single light ray hits each image point. To capture the image, a photoactive chemical or electronic material is attached to the backside, which acts as the image plane. In reality, cameras are equipped with lenses and have a finite-sized hole. However, the simple model of pinhole or central perspective projection provides a good approximation of the imaging process.

The camera model is described with a coordinate frame (O, i, j, k) (Figure 3.1), where the origin O corresponds to the pinhole. The basis vectors i and j form a plane which is located parallel to the image plane Π' at a positive distance f from the origin O along the basis vector k. The line perpendicular to Π' with direction k passing through O is called the optical axis. The point C where the line intersects Π' is referred to as image center or principal point. Each scene point P = (x, y, z)T from where a radiating light ray is passing through the pinhole has a corresponding image point P' = (x', y', z')T. Since P' is located in Π', z' = f holds. The three points P, O and P' are connected by the light ray and thus are collinear with the relation OP' = λOP for some scalar λ. This leads to the perspective projection equations x' = f x/z and y' = f y/z (3.1).
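
Numerically, the collinearity of P, O and P' amounts to scaling by f/z. The focal length and sample point below are arbitrary:

```python
def pinhole_project(P, f):
    """Central perspective projection of a scene point P = (x, y, z)
    in the camera frame onto the image plane at distance f:
    collinearity of O, P and P' gives x' = f*x/z and y' = f*y/z."""
    x, y, z = P
    if z <= 0:
        raise ValueError("point must lie in front of the pinhole")
    return (f * x / z, f * y / z)

p = pinhole_project((2.0, 1.0, 4.0), f=2.0)  # -> (1.0, 0.5)
```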

3.1.2 Camera Parameters

The relation between scene points P and their images P' in Eq. (3.1) is only valid when all distances relate to the camera coordinate frame. Also, image coordinates have their origin at the principal point where the optical axis intersects the physical retina of the camera. Generally, the camera coordinate frame (C) is different from a common world coordinate frame (W). The relation between both is defined by a set of physical parameters, namely the focal length of the lens, the size of the pixels, the position of the principal point on the retina as well as the position and orientation of the camera. These are distinguished into intrinsic and extrinsic parameters. Extrinsic parameters define a rigid body transformation relating the camera to the world coordinate frame. Intrinsic parameters relate the physical camera coordinate frame to an idealized one (Figure 3.2). Moreover, a number of non-linear aberrations are added due to the utilization of real lenses. Typically, the one causing the highest impact is radial distortion.

Intrinsic Parameters

As an intermediate step, the camera can be associated with a normalized image plane Π̂ located at a unit distance f = 1, parallel to its physical retina Π'. This plane has its own coordinate frame attached. Its origin Ĉ is located at the intersection with the optical axis. The corresponding homogeneous perspective projection relating a scene point P = (x, y, z)T in the camera coordinate frame to its projection p̂ = (û, v̂, 1)T in Π̂ by using the identity matrix I[3,3] is defined as

Illustration not included in this excerpt.

However, this is just an intermediate step to obtain the projection p = (u, v, 1)T on the physical retina Π'. Thus, by introducing the intrinsic parameters, the entire perspective projection is expressed using the camera matrix K[3,3].

Illustration not included in this excerpt.

Figure 3.2: Relation of Coordinate Systems for Image Formation.

u0 and v0 are the coordinates of the principal point C0 in pixel units, α and β are horizontal and vertical scaling parameters accounting for a non-square pixel size, and θ is the skew angle. Similar to Eq. (3.2), one can now define the homogeneous perspective projection relating a scene point P = (x, y, z)T in the camera coordinate frame to its projection p = (u, v, 1)T using the matrix M[3,4].

While most of the intrinsic parameters remain fixed for a given camera, some may vary over time. Adjusting the zoom factor and focus has an effect on the focal length f and also on the principal point C0, if the optical axis is not located perpendicular to the physical retina.

Extrinsic Parameters

So far, the intrinsic parameters of a camera are explained. The scene point P is defined in the camera coordinate frame (C). Generally, scene points like P are defined in terms of a common world coordinate frame (W) which is distinct from (C). Particularly, (C) ≠ (W) holds for configurations where multiple cameras at different viewpoints are used. Hence, it is necessary to introduce a transformation from world coordinates WP to camera coordinates CP. Such a transformation consists of a rotation R[3,3] and a translation t[3,1].

Using this relation, one can now update the perspective projection equation from (3.4) to relate a scene point WP = (Wx, Wy, Wz)T to its image p = (u, v, 1)T.

$$p = \frac{1}{z}\,\mathbf{M}\,{}^{W}\!P,\qquad \mathbf{M} = \mathbf{K}\begin{pmatrix}\mathbf{R} & \mathbf{t}\end{pmatrix}$$

Radial Lens Distortion

The intrinsic and extrinsic parameters introduced so far describe the pinhole camera model. In reality, cameras are equipped with a (non-perfect) lens suffering from a number of aberrations. Generally, the one with the highest impact on image formation is radial distortion, producing either (a) a barrel (k1 < 0) or (b) a pincushion (k1 > 0) distortion effect. For a given pixel p in the physical retina Π, the non-linear distortion depends on the distance d between p and the principal point C0. If λ is a polynomial function of the squared distance d², the perspective projection equation from (3.6) is updated to account for radial distortion as follows.

$$p = \frac{1}{z}\,\mathbf{K}\begin{pmatrix}\lambda & 0 & 0\\ 0 & \lambda & 0\\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}\mathbf{R} & \mathbf{t}\end{pmatrix}{}^{W}\!P$$

For most situations it is sufficient to use a low-degree polynomial with the distortion coefficients k_p, p = 1, …, 3.
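To make the combined model concrete, the following sketch (illustrative parameter values, not taken from the thesis) projects a world point through K, R and t and then applies the radial distortion polynomial λ = 1 + k1·d² + k2·d⁴ + k3·d⁶ in normalized image coordinates:

```python
import numpy as np

# Intrinsic matrix K: focal-length scalings alpha/beta, principal point
# (u0, v0), zero skew for simplicity. All values are illustrative.
alpha, beta, u0, v0 = 800.0, 800.0, 320.0, 240.0
K = np.array([[alpha, 0.0, u0],
              [0.0, beta, v0],
              [0.0, 0.0, 1.0]])

# Extrinsic parameters: rotation R and translation t (world -> camera).
R = np.eye(3)                       # camera aligned with the world axes
t = np.array([0.0, 0.0, 5.0])       # world origin 5 units in front of camera

def project(P_world, k=(0.0, 0.0, 0.0)):
    """Project a 3D world point to pixel coordinates, with radial distortion
    coefficients k = (k1, k2, k3) applied in normalized image coordinates."""
    P_cam = R @ P_world + t
    x, y = P_cam[0] / P_cam[2], P_cam[1] / P_cam[2]   # normalized image plane
    d2 = x * x + y * y                                # squared distance to axis
    lam = 1.0 + k[0] * d2 + k[1] * d2**2 + k[2] * d2**3
    p = K @ np.array([lam * x, lam * y, 1.0])
    return p[:2]

print(project(np.array([1.0, 0.5, 0.0])))                      # undistorted
print(project(np.array([1.0, 0.5, 0.0]), k=(-0.1, 0.0, 0.0)))  # barrel (k1 < 0)
```

With k1 < 0 the scaling λ is below 1, so the distorted pixel is pulled toward the principal point, which is exactly the barrel effect described above.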

3.1.3 Camera Calibration

So far, the parameters necessary to model a camera system in terms of its perspective projection are known. This section deals with estimating these parameters for an unknown camera system, a process known as geometric camera calibration.

There are different approaches to camera calibration. A review of methods is given by Salvi et al. [56]. In this context, the perspective projection matrix M is computed using a set of n known correspondence pairs between scene points Pi and their images pi. The correspondences are established either manually or automatically, utilizing a calibration rig which is placed into the scene and visible in the camera image

[Figure not included in this excerpt]

Figure 3.3: Point Correspondences for Linear Camera Calibration.

A calibration rig is placed into the scene. It defines the common world coordinate frame with known points on its surface. The camera calibration is performed by solving a linear equation system to determine the perspective projection matrix M. This can be done after determining correspondences for n points from three perpendicular faces of the rig in the camera image.

(Figure 3.3). The rig defines the common world coordinate frame with known coordinates for distinct points on its surface. There exists a linear and a non-linear technique, the latter accounting for radial distortion. Due to the limited scope, only the linear technique, which is actually used for the proposed approach, is explained here. It is assumed that radial distortion is known and compensated when establishing the correspondence pairs.

Eq. (3.6) defines the perspective projection relating a scene point P = (Wx, Wy, Wz, 1)T to its image p = (u, v, 1)T. Applying some algebraic transformations leads to the following equations, with m1, m2 and m3 denoting the rows of M.

$$u = \frac{\mathbf{m}_1 \cdot P}{\mathbf{m}_3 \cdot P},\qquad v = \frac{\mathbf{m}_2 \cdot P}{\mathbf{m}_3 \cdot P}
\quad\Longleftrightarrow\quad
(\mathbf{m}_1 - u\,\mathbf{m}_3)\cdot P = 0,\qquad (\mathbf{m}_2 - v\,\mathbf{m}_3)\cdot P = 0$$

Utilizing the correspondence pairs (Pi, pi), i = 1, …, n with n ≥ 6, the associated constraints yield a system of 2n homogeneous linear equations with the matrix of coefficients A[2n,12], built from the points Pi and pi, and the vector of unknowns x[12,1] representing the elements of M.

$$\mathbf{A}\,\mathbf{x} = \mathbf{0}$$

A non-trivial solution x is determined up to scale, e.g. as the right singular vector of A associated with its smallest singular value.

If necessary, M can be decomposed into the set of intrinsic and extrinsic parameters. An explanation is found in [17].

3.2 Light and Color

Light is electromagnetic radiation with a wavelength that can be perceived by the human eye. The visible spectrum corresponds to the interval from around 400 to 700 nm. Some non-visible radiation below the lower and beyond the upper boundary, referred to as ultraviolet and infrared waves, is also considered as light. The elementary particle that defines light is the photon. The three basic dimensions of light are (1) intensity/amplitude, related to the perception of brightness, (2) frequency/wavelength, related to the perception of color, and (3) polarization/angle of vibration, which is only weakly perceptible.

Several aspects concerning light and light interaction will be treated. It is explained (1) how to measure light emitted from a surface patch and travelling in space, (2) how light interacts with surface patches, (3) how shadows occur, and (4) how light emitted from a surface patch relates to the light at the camera image plane. After this, the spectral properties of light are taken into account to explain (5) color, (6) color representation and (7) color image formation in a CCD camera. However, the generation of light is beyond the scope of this work; this relates to the topic of light sources and can be studied further in [17]. In the following, the term "light" refers to electromagnetic radiation without consideration of particular wavelengths, i.e. to the sum of all radiation in the interval of visible wavelengths.

3.2.1 Light in Space

Taking an image of a scene can be thought of as gathering the set of light rays that arrive at a single point, the lens of a camera. It is necessary to know important properties of light, how it moves through space and interacts with surfaces, and what effect the arriving light has on the final camera image. Therefore, a short introduction to the field of measuring light, known as radiometry, is given.

Radiance

It is assumed that light radiates from a light-emitting surface patch. Such a patch can belong either to an active light source generating light itself or to a surface where incoming light interacts and leaves again. Radiance is the quantity that describes the distribution of light in space radiating from a surface patch. In [17] it is defined as the amount of energy, travelling at some point in a specified direction, per unit time, per unit area perpendicular to the direction of travel, per unit solid angle.

Its unit is watts per square meter per steradian (W/m²·sr). The radiance at a measuring point x, which can be located either on a surface or in free space, is denoted by

$$L(\mathbf{x}, \theta, \phi)$$

where (θ, φ) corresponds to the viewing direction in terms of spherical coordinates. The radiance L is constant along a particular ray in space, if one assumes that light does not interact with the medium through which it travels and hence no energy is lost.

3.2.2 Light at a Surface

When light arrives at a surface, a complex interaction may occur through a combination of optical effects like absorption, transmission and scattering. The remaining light is reflected and radiates from the surface.

Irradiance

Incoming light is measured in terms of irradiance, which is defined in [17] as

the incident power at a point x on a surface per unit area not foreshortened.

Its unit is watts per square meter (W/m²). The irradiance at x resulting from the incoming radiance Li(x, θi, φi), intersecting the unit sphere around x at (θi, φi) with the solid angle dω, is given by

$$dE(\mathbf{x}) = L_i(\mathbf{x}, \theta_i, \phi_i)\,\cos\theta_i\, d\omega$$

The entire power incident at x is obtained by accumulating the irradiance over the whole hemisphere.

Radiosity

For some surfaces, the outgoing light does not depend on the exit angle. As radiance describes the emitted light in terms of direction, the appropriate quantity in this case is radiosity. It is defined in [17] as

the total amount of energy leaving a point on a surface per unit area on the surface.

It is measured in watts per square meter (W/m²) and expressed by

$$B(\mathbf{x}) = \int_{\Omega} L(\mathbf{x}, \theta, \phi)\,\cos\theta\, d\omega$$

where Ω denotes the hemisphere and cos θ compensates for the foreshortening introduced by the radiance L.

Lambertian and Specular Surfaces

For some surface materials like cotton and matte paper, the emitted radiance is independent of the direction of incoming illumination. Such a surface is called an ideal diffuse or Lambertian surface[12]. Due to this property, radiosity is utilized instead of radiance to describe the outgoing energy. As human perception of brightness corresponds roughly to radiance, a Lambertian surface appears equally bright from any viewing direction.

Glossy and shiny materials lead to mirror-like reflectance properties and form the class of specular surfaces. An ideal specular surface radiates light only along the specular direction, determined by the reflection of incident light coming from a particular direction. A surface is called ideal specular if it can be used as a mirror. Generally, specular surfaces reflect incoming light into a lobe of directions around the specular direction. When a high radiance from a light source arrives at the surface and leaves into a lobe, it is perceived as a white distorted blob called a specularity.

Typical surfaces are neither ideal diffuse nor perfectly specular. Depending on the angle of incoming illumination, the reflectance property varies between the two extremes.
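Since a Lambertian surface emits a constant radiance L in all directions, evaluating the radiosity integral over the hemisphere (with dω = sin θ dθ dφ) yields the closed form B = πL. The following sketch, with an arbitrary illustrative value for L, verifies this numerically:

```python
import numpy as np

L = 2.0                                   # constant radiance, W/(m^2·sr)
n = 1000
dtheta = (np.pi / 2) / n
theta = (np.arange(n) + 0.5) * dtheta     # midpoint samples of θ in [0, π/2]

# Integrand: L cosθ, with dω = sinθ dθ dφ. It does not depend on φ,
# so the φ-integral contributes a constant factor of 2π.
B = 2 * np.pi * np.sum(L * np.cos(theta) * np.sin(theta)) * dtheta

print(B, np.pi * L)   # both values agree: B = πL
```

The factor π (rather than 2π) is exactly the cos θ foreshortening term: without it, integrating a constant radiance over the hemisphere would give 2πL.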

3.2.3 Occlusion and Shadows

Shadows occur when objects are located within the path from a light source to a surface patch, so that the light source is not visible from the patch. A single point light source produces shadows with crisp boundaries. The pattern may be affected by light reflected from nearby surfaces. Moreover, shadows become less dark as more light sources are added. This results in less crisp boundaries and areas of varying darkness, depending on how many light sources are visible from a particular part of the patch. The effect can be observed at sport events where illumination comes from many bright point light sources around the stadium: for a moving player, the shadow at his feet changes its darkness together with its position and size.

Area light sources may also produce less crisp boundaries for a shadow. One usually distinguishes between points in the shadow area on the patch lying either in the umbra[13] or in the penumbra[14]. While the former relates to points from which the light source cannot be seen at all, the latter relates to points from which it is partially visible.

[Figure not included in this excerpt]

Figure 3.4: Light at a Camera.

The light that is emitted by a surface patch δA into the direction of the pinhole Oc is projected onto and measured at a corresponding patch δA' in the physical retina.

3.2.4 Light at a Camera

Using a camera, scene illumination is captured from a particular viewpoint, i.e. the center O of a thin lens. Light radiating from a surface patch δA centered at a point P is concentrated by the lens into a power δP at its center O. The power δP is then radiated onto the corresponding area δA' on the image plane centered at a point P' (Figure 3.4). It is assumed that light can only reach the image plane by passing through the lens, and thus only the power δP is concentrated on the patch δA', leading to the image irradiance E in δA'. For a detailed description refer to [17].

Some conclusions can be inferred about the relationship between the radiance L emitted from the surface patch δA and the irradiance E arriving at its image δA'. The irradiance E is proportional to the radiance L, stating a direct relation between camera image and captured scene. Further, the irradiance E is proportional to the square of the relative aperture, the ratio between lens diameter d and focal length f. Concerning the image plane, the irradiance decreases as the projection P' moves away from the optical axis, since it is proportional to cos⁴ α, where α is the angle between the optical axis and the ray from P to O. The resulting fall-off in brightness is known as vignetting.
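These proportionalities can be collected in the standard thin-lens irradiance relation E = L (π/4)(d/f)² cos⁴α. The sketch below (illustrative values, not from the thesis) shows the cos⁴ fall-off toward the image periphery:

```python
import numpy as np

def image_irradiance(L, d, f, alpha):
    """Image irradiance for scene radiance L, lens diameter d, focal
    length f and off-axis angle alpha (thin-lens relation)."""
    return L * (np.pi / 4.0) * (d / f) ** 2 * np.cos(alpha) ** 4

L, d, f = 10.0, 0.01, 0.02            # radiance, lens diameter, focal length
E_center = image_irradiance(L, d, f, 0.0)
E_corner = image_irradiance(L, d, f, np.deg2rad(30.0))

print(E_corner / E_center)            # cos^4(30°) = 0.5625
```

Even at a moderate off-axis angle of 30° the irradiance drops to roughly 56% of its on-axis value, which is why vignetting is clearly visible in wide-angle images.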

3.2.5 Color

Color relates to the spectral properties of light and can be seen as the perceptual quantity that changes for light at different wavelengths. In the previous sections, light is treated as some particular amount of energy radiating from light sources, interacting with surfaces and being captured by a camera. The definition of energy is now expanded to energy at different wavelengths. Moreover, this also refines the way light interacts with surfaces: depending on the color of the surface, light at some wavelengths is reflected while light at other wavelengths is absorbed.

All physical quantities and units defined in the context of light can be extended by "per unit wavelength", yielding spectral quantities and corresponding spectral units to model the behavior of light at particular wavelengths. Spectral radiance is denoted as Lλ(x, θ, φ) and the radiance in a particular wavelength interval [λ, λ + dλ] as Lλ(x, θ, φ) dλ respectively. The unit changes to watts per cubic meter per steradian (W/m³·sr), where the cubic meter results from the additional per-wavelength dimension.

3.2.6 Color Representation

Human perception of color can be studied by color matching. For this, a test light T is presented to a test person. To match T, the person adjusts the weights wi for a set of primaries Pi. A match is expressed as

$$T \equiv w_1 P_1 + w_2 P_2 + \dots + w_k P_k$$

For most people without a dysfunction in color perception, k = 3 primaries are sufficient to mix the colors. The primaries must be independent, meaning that no weighted sum of two primaries may match the third. This phenomenon is known as the principle of trichromacy, relating to the fact that the eye contains three types of color receptors, sensitive at short (blue), middle (green) and long (red) wavelengths.

Colors are represented as vectors in a color space[15]. There exist different types of color spaces, mainly distinguished into (1) linear and (2) non-linear color spaces. Non-linear color spaces are used either (1) to emphasize particular properties or (2) to make color distances perceptually uniform. In a uniform color space, color distances of equal value are perceived with equal magnitude, independent of the position in space.

Linear and non-linear color spaces are discussed to the extent required by the scope of this work. For a general introduction to color representation the reader may refer to [63].

Linear Color Spaces

The spectral radiance of light representing a particular color can be described as a weighted sum of single-wavelength sources. Given a set of primaries Pi, a single-wavelength source U(λ) can be matched by knowing the respective set of weights, the color matching functions fi(λ).

$$U(\lambda) \equiv f_1(\lambda)\,P_1 + f_2(\lambda)\,P_2 + f_3(\lambda)\,P_3$$

U(λ) is a single-wavelength source of unit radiance at the particular wavelength λ. An interesting property is the linearity of color matching: an arbitrary spectral radiance S(λ) can be seen as a weighted sum of single-wavelength sources, so the weights matching S(λ) are obtained by combining the matches of its single-wavelength components.
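This linearity can be illustrated numerically: the weights matching an arbitrary spectrum S(λ) follow by summing S against the color matching functions over the wavelength grid. The matching functions below are Gaussian stand-ins, not real CIE data:

```python
import numpy as np

lam = np.arange(400.0, 701.0, 1.0)           # wavelength grid in nm

def gauss(mu, sigma):
    """Gaussian bump on the wavelength grid (toy matching function)."""
    return np.exp(-0.5 * ((lam - mu) / sigma) ** 2)

# Hypothetical "R, G, B" color matching functions.
f = [gauss(600.0, 40.0), gauss(550.0, 40.0), gauss(450.0, 40.0)]

S = gauss(520.0, 30.0)                       # some test spectrum

# By linearity: w_i = ∫ S(λ) f_i(λ) dλ, discretized as a sum.
w = [float(np.sum(S * fi)) for fi in f]
print(w)   # the "green" weight dominates for a 520 nm-centered spectrum
```

The same discretized integral, applied with the real CIE matching functions, is how tristimulus values are computed from measured spectra in practice.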

RGB color spaces are mostly used to describe a color for reproduction by some physical light-emitting device. Colors in an RGB color space are constructed by superimposing weighted primaries in the form of a red, green and blue color component. Each primary has a distinct wavelength, namely 645.16 nm for R, 526.32 nm for G and 444.44 nm for B. As the primaries are generated by a particular physical device, RGB is device-dependent and cannot be used to describe an absolute color. The first RGB color space was defined in 1931 by the CIE[16] as CIE RGB, developed as an RGB model of human vision. CIE RGB, sRGB and Adobe RGB are examples of absolute color spaces based on the RGB model.

Even if constant primaries are used with RGB, another problem occurs in describing colors: for some single wavelengths, the color matching functions have negative values. The CIE XYZ color space, also developed in 1931, overcomes this issue; its color matching functions are everywhere positive. However, the artificial primaries X, Y and Z cannot be physically realized, as they are sums of different wavelengths with some having negative spectral radiance[17]. CIE XYZ is an absolute color space, as its components are device-independent. The Y component describes luminance, which is radiant power weighted by a spectral sensitivity function, related to the perception of brightness. Chromaticity information is obtained by transforming CIE XYZ into the space CIE xy, where the luminance information is eliminated and xy corresponds to pure chromaticity[18]. CIE xy represents a projection of CIE XYZ onto an area of uniform luminance on the plane X + Y + Z = 1.

Non-linear Color Spaces

Non-linear color spaces are designed to emphasize special properties. While RGB is most useful for the technical reproduction of colors, it is less applicable to the task of specifying a color in computer graphics. Color spaces that model human experience more closely are HSV[19] and HSL[20]. They are obtained by a non-linear transformation of an RGB color space.

Linear and several non-linear color spaces like HSV/HSL are not perceptually uniform. This relates to the definition of color distances: in a space that is not perceptually uniform, the same color distance is perceived with different magnitude. Corresponding to the logarithmic response of the eye, perception depends on the position in color space. This behavior was first observed by MacAdam. The related MacAdam ellipses correspond to areas surrounding a particular test light within which human observers do not perceive a difference to the test light itself. In a non-uniform color space the ellipses have varying size.

The CIE L*a*b* color space is defined as a non-linear transformation of CIE XYZ to linearize perception and equalize the MacAdam ellipses. The parameters a* and b* correspond to pure chromaticity components. L* denotes lightness and is an attempt to perceptually linearize changes of brightness, similar to adjusting the brightness of a TV set. L* is defined by the CIE as a modified cube root of luminance. More recent work on this topic proposes the use of a square root instead [45, 19].
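The XYZ to L*a*b* transformation itself is compact. The following minimal sketch uses the standard CIE formulas with the D65 reference white; only the choice of white point and the test value are illustrative:

```python
import numpy as np

XN, YN, ZN = 95.047, 100.0, 108.883   # D65 reference white

def _f(t):
    """CIE compression function: cube root above a small threshold,
    linear segment below it to avoid an infinite slope at zero."""
    delta = 6.0 / 29.0
    return np.where(t > delta ** 3,
                    np.cbrt(t),
                    t / (3 * delta ** 2) + 4.0 / 29.0)

def xyz_to_lab(X, Y, Z):
    fx, fy, fz = _f(np.array([X / XN, Y / YN, Z / ZN]))
    L = 116.0 * fy - 16.0             # lightness from compressed luminance
    a = 500.0 * (fx - fy)             # red-green chromaticity axis
    b = 200.0 * (fy - fz)             # yellow-blue chromaticity axis
    return L, a, b

print(xyz_to_lab(95.047, 100.0, 108.883))   # reference white -> (100, 0, 0)
```

The cube root in `_f` is exactly the "modified cube root of luminance" mentioned above; the reference white maps to L* = 100 with zero chromaticity.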

RGB, CIE XYZ and CIE L*a*b* are introduced here since they are employed by the proposed approach. Many more color spaces exist; each can be transformed into another by applying a set of transformations. For further information about color spaces and their conversion refer to [16] and [51].

Color Metric

With the introduction of color spaces, the term color distance has been mentioned several times. It has a somewhat ambiguous meaning [51]. In this scope, color distance relates to numerical differences between distinct color specifications. These are especially important for the proposed approach and lead to the definition of a color metric. A simple metric over a three-dimensional color space with dimensions x, y and z is given by the Euclidean distance ‖c1 − c2‖ between two colors c1 and c2. To apply this metric, it is important that the basis vectors of the color space are orthogonal. This generally holds, since the color components are independent. Here, the issue lies not in the distance calculation, but in the validity of the obtained values with respect to the perception of color differences. To compensate for this, perceptually uniform color spaces like CIE L*a*b* are applied.

At this point, one might think that only uniform color spaces can be used to compare two colors. In fact, any color space can be used, as each is a transformation of another. The issue is that the distance metric has to be adjusted as well and can thus become much more complex than a Euclidean distance. Depending on the actual application, it should be determined whether it is better to transform color spaces and employ a simple metric or to take the opposite way. Since distances are computed within several tasks, the proposed approach uses a simple metric. Moreover, as monotonicity is preserved for squared Euclidean distances, an expensive square-root computation becomes obsolete when distances are only compared.

$$\|c_1 - c_2\| = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$

Assuming the Euclidean distance is used as a metric over a perceptually uniform color space, equal color distances are perceived with equal magnitude. However, this applies only if color distances are small. For example, it is very difficult to decide without any further knowledge whether blue/yellow or red/green are more different.
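The comparison-only shortcut can be sketched as follows: since x → x² is monotone for non-negative x, ranking candidate colors by squared Euclidean distance gives the same order as ranking by distance, so the square root is skipped entirely (the L*a*b* triples are illustrative):

```python
import numpy as np

def dist2(c1, c2):
    """Squared Euclidean distance between two color vectors — no sqrt."""
    d = np.asarray(c1, float) - np.asarray(c2, float)
    return float(d @ d)

reference = (50.0, 10.0, 10.0)
candidates = [(52.0, 12.0, 9.0), (60.0, -5.0, 30.0), (50.0, 11.0, 10.0)]

# Nearest-color search by squared distance: same winner as with sqrt.
best = min(candidates, key=lambda c: dist2(reference, c))
print(best)   # (50.0, 11.0, 10.0), the closest candidate
```

In a per-voxel consistency test this saves one square root per color comparison, which matters when millions of comparisons are performed per frame.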

3.2.7 CCD Camera Color Imaging

Section 3.2.4 describes how the light on the camera image plane relates to the light radiating from a corresponding surface patch. The spectral radiance ρ(λ)E(λ) emitted by a particular surface patch depends on (1) the spectral irradiance E(λ) of the arriving light and (2) the spectral reflectance ρ(λ) of the surface patch. To record this light using a CCD[21] camera, a matrix of sensor elements is placed in the image plane. Each element corresponds to a single pixel and consists of three sensors, one for each of the RGB primaries. If Λ denotes the range of relevant wavelengths for a sensor of type k with spectral sensitivity σk(λ), its response pk is given by

$$p_k = \int_{\Lambda} \sigma_k(\lambda)\,\rho(\lambda)\,E(\lambda)\,d\lambda$$

A desirable property of a camera system would be that the measured spectral intensities relate linearly to the CIE XYZ color matching functions. This means that the camera shows the same response as a human observer. Interestingly, CCD sensors have this property, but due to various reasons related to perceived image quality, the output is usually non-linearly transformed. For example, gamma correction is applied to account for non-linearities in the reproduction of intensities on CRT monitors.
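The sensor response integral can be illustrated with a discretized sketch; Gaussian sensitivities and a flat illuminant stand in for real measured data:

```python
import numpy as np

lam = np.arange(400.0, 701.0, 1.0)            # wavelength grid, nm

def gauss(mu, sigma):
    """Gaussian bump on the wavelength grid (toy sensitivity curve)."""
    return np.exp(-0.5 * ((lam - mu) / sigma) ** 2)

sigma_rgb = {"R": gauss(600.0, 35.0),          # sensor sensitivities σ_k(λ)
             "G": gauss(540.0, 35.0),
             "B": gauss(460.0, 35.0)}
E = np.ones_like(lam)                          # flat spectral irradiance
rho = gauss(620.0, 25.0)                       # reddish surface reflectance

# Discretized p_k = ∫ σ_k(λ) ρ(λ) E(λ) dλ, one value per sensor type.
response = {k: float(np.sum(s * rho * E)) for k, s in sigma_rgb.items()}
print(response)    # R-response dominates for a reflectance peaked at 620 nm
```

The recorded pixel value is thus a projection of the full spectrum onto three sensitivity curves, which is why two different spectra can produce identical RGB triples (metamerism).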

3.3 3D Reconstruction from Multiple Views

This section introduces the theoretical foundations of the techniques shape from silhouette and shape from photo-consistency.

3.3.1 Visual Hull Reconstruction by Shape from Silhouette

Based on the work of Laurentini, who first formalized a geometrical model for shape from silhouette in [29], this section describes the technique, discusses shapes that cannot be reconstructed, and defines the visual hull as the best possible approximation.

Shape from Silhouette

As the name implies, shape from silhouette uses 2D silhouettes of a 3D shape S[22] as the cue to infer its geometry. A silhouette of a shape corresponds to its perspective projection into the image plane of a particular view C[23]. Silhouette images encode for each pixel the decision whether S is visible from it or not, and are obtained by image segmentation.
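A minimal sketch of this cue (one toy view with a hand-made binary mask and an assumed projection matrix — not the thesis pipeline): a scene point is kept only if its projection falls on a foreground pixel of every silhouette image.

```python
import numpy as np

# Assumed intrinsics/extrinsics for a single toy view (illustrative values).
K = np.array([[500.0, 0.0, 160.0],
              [0.0, 500.0, 120.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
M = K @ np.hstack([R, t.reshape(3, 1)])       # 3x4 projection matrix

silhouette = np.zeros((240, 320), dtype=bool) # H x W binary mask
silhouette[100:140, 140:180] = True           # foreground blob

def inside_silhouette(P):
    """Project a 3D point and test the silhouette pixel it lands on."""
    p = M @ np.append(P, 1.0)                 # homogeneous projection
    u = int(round(p[0] / p[2]))
    v = int(round(p[1] / p[2]))
    if not (0 <= v < silhouette.shape[0] and 0 <= u < silhouette.shape[1]):
        return False                          # projects outside the image
    return bool(silhouette[v, u])

print(inside_silhouette(np.array([0.0, 0.0, 0.0])))   # lands inside the blob
print(inside_silhouette(np.array([1.0, 1.0, 0.0])))   # lands off the blob
```

Repeating this test for every voxel and every view, and carving away voxels that fail in any view, yields the volumetric visual hull approximation discussed in the following.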


[1] Computer Supported Cooperative/Collaborative Work

[2] Augmented Reality

[3] Mixed Reality

[4] In this scope, GPU acceleration relates to the acceleration of any shape modifying operation. Thus, CPU-based reconstruction with GPU-accelerated rendering (and clipping) does not fall under this category.

[5] The technique is also referred to as silhouette extraction or background subtraction.

[6] Level Of Detail

[7] Image-Based Rendering

[8] Constructive Solid Geometry

[9] Actually this work uses the term "Space Carving" somewhat ambiguously. In [28] it is defined as a generic concept without accounting for visibility. Hence, "Multi-Sweep Space Carving" is meant here instead.

[10] Of course, this depends on the speed of movement and the framerate for image capturing.

[11] Since distinct coordinate frames are involved, the respective frame is attached to a particular entity. For a point, it is written in superscript. For a transformation matrix relating two frames, source and destination frames are written in subscript and superscript respectively.

[12] After Johann Heinrich Lambert

[13] Latin word for "shadow"

[14] Compound of Latin words meaning "almost shadow"

[15] Typically there are three or four color components.

[16] Commission Internationale de l'Éclairage

[17] This does not lead to any practical problems as only the weights are necessary to describe a color.

[18] Usually, chromaticity components in color spaces are denoted by lower case letters.

[19] Hue, Saturation, Value

[20] Hue, Saturation, Lightness

[21] Charge-coupled device

[22] Denotes a 3D scene.

[23] Since perspective projection is applied, it is assumed that intrinsic and extrinsic parameters of the camera at C are known.

Excerpt out of 146 pages


A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware
University of Weimar  (Faculty of Media)
Quote paper
Christian Nitschke (Author), 2006, A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware, Munich, GRIN Verlag, https://www.grin.com/document/69735

