A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware

Details

Title: A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware
Author: Christian Nitschke
Subject: Computer Science - Applied
Institution/College: University of Weimar (Faculty of Media)
Category: Diploma Thesis
Year: 2006
Pages: 146
Grade: 1.0
Bibliography: ~ 76  Entries
Language: English
File size: 15731 KB
Archive No.: V69735
ISBN (E-book): 978-3-638-60755-1

Excerpt (computer-generated)

Bauhaus-Universität Weimar
Faculty of Media
Department of Media System Science

Osaka University, Japan

A Framework for Real-time 3D Reconstruction
by Space Carving using Graphics Hardware

Diploma Thesis
in partial fulfillment of the requirements for the degree of
Diplom-Mediensystemwissenschaftler

by Christian Nitschke

Date of Submission 12/06/2006
Date of Defence 12/15/2006

 

Abstract

Reconstruction of real-world scenes from multiple images is a topic in Computer Vision and 3D Computer Graphics with many interesting applications. A powerful algorithm for shape reconstruction from arbitrary viewpoints exists, called Space Carving. However, it is computationally expensive and hence cannot be used in applications such as 3D video, CSCW or interactive 3D model creation. Attempts have been made to achieve real-time frame rates using PC cluster systems. While these provide enough performance, they are also expensive and less flexible. Approaches that use GPU hardware acceleration on single workstations achieve interactive frame rates for novel-view synthesis, but do not provide an explicit volumetric representation of the whole scene. This thesis presents a GPU hardware-accelerated framework for obtaining the volumetric photo hull of a dynamic 3D scene as seen from multiple calibrated cameras. High performance is achieved by first applying a shape from silhouette technique to obtain a tight initial volume for Space Carving. In addition, several speed-up techniques are presented to increase efficiency. Since the entire processing is done on a single PC, the framework can be applied to mobile setups, enabling a wide range of further applications. The approach is explained using programmable vertex and fragment processors on current hardware and compared to highly optimized CPU implementations. It is shown that the new approach can outperform the latter by more than one order of magnitude.
 


“The stone unhewn and cold becomes a living mould,
The more the marble wastes the more the statue grows.”
Michelangelo Buonarroti
 

Contents

1 Introduction

1.1 Application
1.2 Classification
1.3 Performance
1.4 Contribution
1.5 Overview

2 Related Work

2.1 Shape from Silhouette
2.1.1 Image Segmentation
2.1.2 Foundations
2.1.3 Performance of View-Independent Reconstruction
2.1.3.1 CPU
2.1.3.2 GPU Acceleration
2.1.4 Performance of View-Dependent Reconstruction
2.1.4.1 CPU
2.1.4.2 GPU Acceleration
2.1.5 Conclusion

2.2 Shape from Photo-Consistency
2.2.1 Foundations
2.2.2 Performance of View-Independent Reconstruction
2.2.2.1 CPU
2.2.2.2 GPU Acceleration
2.2.3 Performance of View-Dependent Reconstruction
2.2.3.1 CPU
2.2.3.2 GPU Acceleration
2.2.4 Conclusion

3 Fundamentals

3.1 Camera Geometry
3.1.1 Pinhole Camera Model
3.1.2 Camera Parameters
3.1.2.1 Intrinsic Parameters
3.1.2.2 Extrinsic Parameters
3.1.2.3 Radial Lens Distortion
3.1.3 Camera Calibration

3.2 Light and Color
3.2.1 Light in Space
3.2.1.1 Radiance
3.2.2 Light at a Surface
3.2.2.1 Irradiance
3.2.2.2 Radiosity
3.2.2.3 Lambertian and Specular Surfaces
3.2.3 Occlusion and Shadows
3.2.4 Light at a Camera
3.2.5 Color
3.2.6 Color Representation
3.2.6.1 Linear Color Spaces
3.2.6.2 Non-linear Color Spaces
3.2.6.3 Color Metric
3.2.7 CCD Camera Color Imaging

3.3 3D Reconstruction from Multiple Views
3.3.1 Visual Hull Reconstruction by Shape from Silhouette
3.3.1.1 Shape from Silhouette
3.3.1.2 Discussion
3.3.1.3 The Visual Hull
3.3.1.4 Silhouette-Equivalency
3.3.1.5 Number of Viewpoints
3.3.1.6 Conclusion
3.3.2 Photo Hull Reconstruction by Shape from Photo-Consistency
3.3.2.1 Shape from Photo-Consistency
3.3.2.2 Discussion
3.3.2.3 Photo-Consistency
3.3.2.4 The Photo Hull
3.3.2.5 The Space Carving Algorithm
3.3.2.6 Voxel Visibility
3.3.2.7 Conclusion

4 Basic Algorithm

4.1 Data
4.1.1 Camera Parameters
4.1.2 Image Data

4.2 Reconstruction
4.2.1 3D Data Representation
4.2.2 Volumetric Bounding Box
4.2.3 Maximal Volume Intersection
4.2.4 Visual Hull Approximation
4.2.5 Photo-Consistent Surface
4.2.5.1 Active Source Camera Test
4.2.5.2 Photo Consistency Test

5 Advanced Algorithm

5.1 Overview
5.1.1 Deployment
5.1.2 Process Flow

5.2 Texture Processing
5.2.1 Lookup Table for Projection Coordinates
5.2.2 Mapping Image Data into Textures
5.2.3 Texture Upload and Processing Performance
5.2.4 GPU Image Processing

5.3 Destination Cameras
5.3.1 Discussion
5.3.1.1 Ray Casting vs. Multi Plane Sweeping
5.3.1.2 Virtual vs. Natural Views
5.3.2 Interleaved Depth Sampling
5.3.3 Active Destination Cameras
5.3.3.1 Source Camera Viewing Ray
5.3.3.2 Intersection of Volume and Source Camera Viewing Ray
5.3.3.3 Activity Decision

5.4 Reconstruction
5.4.1 Vertex Data
5.4.2 Vertex Shader Visual Hull Approximation
5.4.2.1 Decreasing the Sampling Error for Interleaved Sampling
5.4.2.2 Early Ray Carving
5.4.3 Fragment Shader Photo-Consistent Surface
5.4.3.1 Filling Holes
5.4.3.2 Modified Active Source Camera Decision
5.4.4 Fragment Shader Color Blending
5.4.5 Fragment Shader Render to Texture

5.5 Postprocessing
5.5.1 Extracting Texture Data
5.5.2 Filling Interior Volume Data
5.5.2.1 Ambiguities
5.5.2.2 Performance

6 Experiments

6.1 System Setup

6.2 Implementation

6.3 Datasets

6.4 Performance
6.4.1 Abstract Data Performance Experiments
6.4.1.1 CPU-GPU Texture Upload
6.4.1.2 Interleaved Sampling
6.4.1.3 Early Ray Carving
6.4.1.4 Fragment Shader CIELab-RGB Conversion
6.4.1.5 Porting all Load to the Fragment Processor
6.4.1.6 GPU-CPU Texture Read-back
6.4.1.7 FBO Texture Size
6.4.1.8 Impact of CPU-GPU Texture Upload on overall Performance
6.4.1.9 Number of Source Cameras
6.4.1.10 Number of Destination Cameras
6.4.2 Concrete Data Performance Experiments
6.4.2.1 Algorithmic Features
6.4.2.2 Destination Cameras
6.4.2.3 Volumetric Resolution
6.4.2.4 Volumetric Bounding Box
6.4.2.5 PCS Increments
6.4.3 Conclusion
6.4.3.1 Algorithmic Features
6.4.3.2 Parameters
6.4.3.3 GPU/CPU Comparison

6.5 Quality
6.5.1 Concrete Data Quality Experiments
6.5.1.1 Volumetric Resolution
6.5.1.2 Volumetric Bounding Box
6.5.1.3 PCS Increments
6.5.2 Visual Experiments
6.5.2.1 Image Segmentation
6.5.2.2 Interleaved Sampling and MVI
6.5.2.3 Camera Viewing Cone Intersection
6.5.2.4 Reconstruction of VHA and PCS
6.5.2.5 Volumetric Resolution
6.5.2.6 PCS Increments
6.5.2.7 Geometrical Score for Active Source Camera Computation
6.5.2.8 Range of Color Distances for Active Source Camera Computation
6.5.2.9 Labeling of Interior Space
6.5.3 Conclusion
6.5.3.1 Image Segmentation
6.5.3.2 BB, MVI and Viewing Cone Intersection
6.5.3.3 VHA and PCS
6.5.3.4 PCS Parameters
6.5.3.5 Labeling of Interior Space

7 Discussion and Enhancements

7.1 Summary

7.2 Limitations

7.3 Future Work
7.3.1 Online System
7.3.2 Performance
7.3.3 Quality

7.4 Annotation

A Abstract Data Performance Experiments

B Concrete Data Performance Experiments

C Concrete Data Quality Experiments

D Visual Experiments

References

 

1 Introduction

As computational machinery and methods progress at an ever-increasing pace, traditional applications change and new ones emerge. Years ago, computer graphics and computer vision were two distinct fields, each with its own area of application. Recently, however, a convergence of the two can be observed. Photorealistic models of real-world scenes are not only obtained by artificially creating shape and appearance, but also by reconstructing these models from photographic or video data of the real world. This is a powerful contribution to our multimedia-influenced environment.
 

1.1 Application

The ability to reconstruct dynamic real-world scenes from multiple video streams enables a wide range of applications. 3D video extends common 2D video in that it is view-independent. Hence, the decision about the fixed camera from which the scene is viewed is shifted from the time and place of recording to the time and place of consumption. 3D video can be used in the context of personal and social human activities like entertainment (e.g. 3D games, 3D events [24], 3D TV [37], 3D video recorder [70]) and education (e.g. 3D books, 3D-based training) or for the preservation of common knowledge (e.g. ethnic dancing performances [39], special skills).

3D video is obtained by analyzing properties like scene geometry and color from multi-viewpoint video streams. This generally involves a high computational effort. Nevertheless, if the reconstruction can be computed in real-time, a broad field of further applications becomes possible. In the field of CSCW¹, more realistic and immersive telepresence and conferencing systems are created by using 3D cues [44]. In this context, PRINCE et al. [52] propose a framework in which a user can interact with a dynamic model of a remotely captured collaborator that is overlaid on the real world in a video see-through HMD. Instead of inserting a dynamic scene into a real environment (AR²), it is also possible to integrate it into a virtual world (MR³). This enables seamless integration of real and virtual content with appropriate visual cues like reflections, shading, shadowing, occlusion and collision detection. Interactive systems for reconstruction and insertion of actors into virtual worlds are proposed by HASENFRATZ et al. [21] and, more recently, POMI and SLUSALLEK [49], contributing to the field of virtual television studio techniques.

Dynamic 3D data may also be used as an intermediate result for further processing and information extraction. Scene analysis allows for object recognition or motion tracking. The latter is often applied as a non-intrusive approach to human motion capturing [43].

Supporting tools for complex 3D model creation are powerful, but also expensive and difficult to use. Obtaining realistic models is still a time-consuming process. As mentioned earlier, 3D models can be created from photographic data of the real world. This is generally possible by performing an automatic reconstruction. However, better performance, quality and handling of scenes with difficult properties are achieved using techniques that require user interaction [76]. Here, real-time computation allows for immediate feedback on actions and parameter adjustment.

(Figure 1.1: A Taxonomy of 3D Shape Acquisition Techniques (CURLESS [9]). Contained in the download version.)
 

1.2 Classification

The proposed approach performs 3D shape reconstruction using images or videos from multiple viewpoints around the scene. This is one of many different techniques that can be used to accomplish the task of 3D shape acquisition (Figure 1.1). It belongs to the class of non-contact, reflective, optical, passive techniques. Passive refers to the fact that only light already present in the scene is captured. In contrast, active-light methods perform a controlled illumination, e.g. by projecting a coded pattern, to accomplish a more robust extraction of the 3D features. Passive-light techniques are also referred to as Shape from X, where X relates to the special image cue that is used to infer the shape. Common cues are e.g. stereo, shading, focus/defocus, motion, silhouettes and scene photo-consistency. While there also exist methods using uncalibrated cameras for shape recovery, the proposed approach assumes that the position and orientation of each camera relative to the world coordinate frame are known. General surveys of passive-light techniques are given by DYER [12] and SLABAUGH et al. [60].
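The calibration assumption can be illustrated with a minimal sketch. With known position and orientation, any 3D point can be mapped to a pixel in every view, which is what silhouette-based and photo-consistency-based tests rely on. The code below is not taken from the thesis: the function names, the dictionary layout of a camera and the simplified pinhole model (single focal length, no lens distortion) are assumptions chosen for illustration only.

```python
# Hypothetical sketch: names and the simplified pinhole model are assumptions,
# not the implementation described in the thesis.

def project(point, rotation, translation, focal, center):
    """Project a world point to pixel coordinates with a pinhole camera.

    rotation is a 3x3 row-major matrix R and translation a 3-vector t, so that
    camera coordinates are R * X + t. Returns None if the point lies behind
    the camera.
    """
    cam = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
           for r in range(3)]
    if cam[2] <= 0.0:
        return None
    u = focal * cam[0] / cam[2] + center[0]
    v = focal * cam[1] / cam[2] + center[1]
    return (u, v)


def inside_all_silhouettes(point, cameras):
    """Return True if the point projects onto a foreground pixel in every view."""
    for cam in cameras:
        pixel = project(point, cam["R"], cam["t"], cam["f"], cam["c"])
        if pixel is None:
            return False
        col, row = int(round(pixel[0])), int(round(pixel[1]))
        mask = cam["mask"]  # binary silhouette image, indexed as mask[row][col]
        if not (0 <= row < len(mask) and 0 <= col < len(mask[0])):
            return False
        if not mask[row][col]:
            return False
    return True
```

A point that fails this test in any view cannot belong to the object, so it can be discarded before a more expensive color comparison is made.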

As several shape from X approaches rely on special image cues being present, three general classes for 3D shape reconstruction and texturing from multiple photographs have been developed, namely (1) image-based stereo (IBS), (2) volumetric intersection (VI) and (3) photo-consistency (PC), each with its own advantages and drawbacks. There is a general difference between IBS and VI/PC. IBS performs an image-based search to find corresponding pixels and generates 3D points from them by triangulation. The result is a depth map or a partial surface. This has several inherent disadvantages. (1) For the correspondence search to be efficient, the views must be close together (small baseline). (2) If more than two views are used, correspondences must be searched in each pair of views, resulting in a high complexity (multi-baseline). (3) For reconstructing a whole 360 degree scene model, many partial surfaces have to be computed for a set of reference viewpoints. Integration of these distinct surfaces into a globally consistent model can be complex, since the alignment is often carried out in a least-squares sense as proposed by CURLESS and LEVOY [10]. (4) Only a sparse point cloud is obtained when pixel correspondences or image features are hard to find. In general, there is no neighborhood relation between the 3D points, as they are not located on a regular grid. Hence, generating a mesh-based surface representation requires further computation. (5) Occlusion between the different views is not handled, which might lead to errors in the correspondence search.
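The triangulation step used by image-based stereo can be sketched as follows. This is one common variant (the midpoint of the shortest segment between two viewing rays), chosen here purely for illustration. The text does not state which triangulation method is meant, and all function names are hypothetical.

```python
# Hypothetical sketch: the midpoint method and all names are illustrative
# assumptions, not a method prescribed by the thesis.

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(origin1, dir1, origin2, dir2, eps=1e-12):
    """Return the midpoint of the shortest segment between two viewing rays.

    Each ray starts at a camera center (origin) and passes through the pixel
    matched in that view (direction). Returns None for (near-)parallel rays,
    where no stable 3D point can be computed.
    """
    w = [p - q for p, q in zip(origin1, origin2)]
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    q1 = [p + s * v for p, v in zip(origin1, dir1)]
    q2 = [p + t * v for p, v in zip(origin2, dir2)]
    return [(x + y) / 2.0 for x, y in zip(q1, q2)]
```

Each call yields one isolated 3D point per pixel correspondence, which is consistent with disadvantage (4): the output is a point cloud without grid structure.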

[...]


¹ Computer Supported Cooperative/Collaborative Work

² Augmented Reality

³ Mixed Reality

This text can be quoted and accessed from this url:

http://www.grin.com/e-book/69735/