# A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware

## 146 Pages, Grade: 1

Excerpt

A Framework for Real-time 3D
Reconstruction by Space Carving using
Graphics Hardware
by
Christian Nitschke
Submitted to the Department of Media System Science
in partial fulfillment of the requirements for the degree of
Diplom-Mediensystemwissenschaftler
at the
B
AUHAUS
-U
NIVERSIT
¨
AT
W
EIMAR
Date of Submission
12/06/2006
Date of Defence
12/15/2006

To my Parents

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it
contains no material previously published or written by another person nor material which to a
substantial extent has been accepted for the award of any other degree or diploma of the university or
other institute of higher learning, except where due acknowledgment has been made in the text.
Christian Nitschke, Weimar 12/06/2006

A
BSTRACT
Reconstruction of real-world scenes from a set of multiple images is a topic in Computer Vision and 3D
Computer Graphics with many interesting applications. There exists a powerful algorithm for shape recon-
struction from arbitrary viewpoints, called Space Carving. However, it is computationally expensive and
hence can not be used with applications in the field of 3D video or CSCW as well as interactive 3D model
creation. Attempts have been made to achieve real-time framerates using PC cluster systems. While these
provide enough performance they are also expensive and less flexible. Approaches that use GPU hard-
ware acceleration on single workstations achieve interactive framerates for novel-view synthesis, but do
not provide an explicit volumetric representation of the whole scene. The proposed approach shows the
efforts in developing a GPU hardware-accelerated framework for obtaining the volumetric photo hull of
a dynamic 3D scene as seen from multiple calibrated cameras. High performance is achieved by employ-
ing a shape from silhouette technique in advance to obtain a tight initial volume for Space Carving. Also
several speed-up techniques are presented to increase efficiency. Since the entire processing is done on a
single PC the framework can be applied to mobile setups, enabling a wide range of further applications.
The approach is explained using programmable vertex and fragment processors with current hardware and
compared to highly optimized CPU implementations. It is shown that the new approach can outperform
the latter by more than one magnitude.

A
CKNOWLEDGMENTS
At first, I want to thank my supervisor, Oliver Bimber for providing me with the opportunity to spend a
one-year research program at Takemura Lab in Osaka University, Japan. This work would not exist in
the current form without his great support. I also want to thank my second supervisor, Kiyoshi Kiyokawa
for coming to Germany just to attend my defense, for all the good discussions and comments when I
got stuck in the daily details, for the wonderful time in Japan and the efforts in extending my stay in the
last minute. It is a good place to thank both for reading the thesis in one week. I know myself, a lot of
pages came together. I would also like to thank Atsushi Nakazawa for directly supervising my work with
interesting discussions and background knowledge.
My parents and my family always supported me and and gave me a backing on my way. I want to thank
you for all the support and that you never said "no" to my ideas.

"The stone unhewn and cold becomes a living mould,
The more the marble wastes the more the statue grows."
Michelangelo Buonarroti

Contents
C
ONTENTS
1
Introduction
1
1.1
Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
1.2
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2
1.3
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
1.4
Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
1.5
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
2
Related Work
7
2.1
Shape from Silhouette . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
2.1.1
Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
2.1.2
Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
2.1.3
Performance of View-Independent Reconstruction . . . . . . . . . . . . . . . .
9
2.1.3.1 CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
2.1.3.2 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
2.1.4
Performance of View-Dependent Reconstruction . . . . . . . . . . . . . . . . .
10
2.1.4.1 CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
2.1.4.2 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
2.1.5
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
2.2
Shape from Photo-Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12
2.2.1
Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
2.2.2
Performance of View-Independent Reconstruction . . . . . . . . . . . . . . . .
14
2.2.2.1 CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
2.2.2.2 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
2.2.3
Performance of View-Dependent Reconstruction . . . . . . . . . . . . . . . . .
15
2.2.3.1 CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
2.2.3.2 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16
2.2.4
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
3
Fundamentals
18
3.1
Camera Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
3.1.1
Pinhole Camera Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
3.1.2
Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
3.1.2.1 Intrinsic Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
3.1.2.2 Extrinsic Parameters
. . . . . . . . . . . . . . . . . . . . . . . . . . .
20
3.1.2.3 Radial Lens Distortion . . . . . . . . . . . . . . . . . . . . . . . . . .
21
3.1.3
Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
3.2
Light and Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
3.2.1
Light in Space
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
3.2.1.1 Radiance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
- i -

Contents
3.2.2
Light at a Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
3.2.2.1 Irradiance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
3.2.2.3 Lambertian and Specular Surfaces . . . . . . . . . . . . . . . . . . . .
25
3.2.3
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
3.2.4
Light at a Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
3.2.5
Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
3.2.6
Color Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
3.2.6.1 Linear Color Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
3.2.6.2 Non-linear Color Spaces . . . . . . . . . . . . . . . . . . . . . . . . .
28
3.2.6.3 Color Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
29
3.2.7
CCD Camera Color Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.3
3D Reconstruction from Multiple Views . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.3.1
Visual Hull Reconstruction by Shape from Silhouette . . . . . . . . . . . . . . .
30
3.3.1.1 Shape from Silhouette . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.3.1.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
3.3.1.3 The Visual Hull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32
3.3.1.4 Silhouette-Equivalency . . . . . . . . . . . . . . . . . . . . . . . . . .
33
3.3.1.5 Number of Viewpoints . . . . . . . . . . . . . . . . . . . . . . . . . .
34
3.3.1.6 Conclusion
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
34
3.3.2
Photo Hull Reconstruction by Shape from Photo-Consistency
. . . . . . . . . .
35
3.3.2.1 Shape from Photo-Consistency . . . . . . . . . . . . . . . . . . . . . .
36
3.3.2.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
3.3.2.3 Photo-Consistency
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
37
3.3.2.4 The Photo Hull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
3.3.2.5 The Space Carving Algorithm
. . . . . . . . . . . . . . . . . . . . . .
39
3.3.2.6 Voxel Visibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
40
3.3.2.7 Conclusion
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
4
Basic Algorithm
42
4.1
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
42
4.1.1
Camera Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
42
4.1.2
Image Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
4.2
Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
44
4.2.1
3D Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
44
4.2.2
Volumetric Bounding Box . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
4.2.3
Maximal Volume Intersection . . . . . . . . . . . . . . . . . . . . . . . . . . .
46
4.2.4
Visual Hull Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
46
4.2.5
Photo-Consistent Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
46
4.2.5.1 Active Source Camera Test . . . . . . . . . . . . . . . . . . . . . . . .
47
4.2.5.2 Photo Consistency Test . . . . . . . . . . . . . . . . . . . . . . . . . .
48
- ii -

Contents
5
49
5.1
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
49
5.1.1
Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
49
5.1.2
Process Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
49
5.2
Texture Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
50
5.2.1
Lookup Table for Projection Coordinates . . . . . . . . . . . . . . . . . . . . .
51
5.2.2
Mapping Image Data into Textures . . . . . . . . . . . . . . . . . . . . . . . . .
52
5.2.3
Texture Upload and Processing Performance . . . . . . . . . . . . . . . . . . .
52
5.2.4
GPU Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54
5.3
Destination Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54
5.3.1
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
55
5.3.1.1 Ray Casting vs. Multi Plane Sweeping . . . . . . . . . . . . . . . . . .
56
5.3.1.2 Virtual vs. Natural Views . . . . . . . . . . . . . . . . . . . . . . . . .
57
5.3.2
Interleaved Depth Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
5.3.3
Active Destination Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . .
60
5.3.3.1 Source Camera Viewing Ray . . . . . . . . . . . . . . . . . . . . . . .
60
5.3.3.2 Intersection of Volume and Source Camera Viewing Ray . . . . . . . .
61
5.3.3.3 Activity Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
62
5.4
Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
62
5.4.1
Vertex Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
63
5.4.2
Vertex Shader Visual Hull Approximation . . . . . . . . . . . . . . . . . . . . .
64
5.4.2.1 Decreasing the Sampling Error for Interleaved Sampling . . . . . . . .
64
5.4.2.2 Early Ray Carving . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64
5.4.3
Fragment Shader Photo-Consistent Surface . . . . . . . . . . . . . . . . . . . .
65
5.4.3.1 Filling Holes
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
5.4.3.2 Modified Active Source Camera Decision . . . . . . . . . . . . . . . .
66
5.4.4
Fragment Shader Color Blending
. . . . . . . . . . . . . . . . . . . . . . . . .
67
5.4.5
Fragment Shader Render to Texture . . . . . . . . . . . . . . . . . . . . . . . .
68
5.5
Postprocessing
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
70
5.5.1
Extracting Texture Data
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
70
5.5.2
Filling Interior Volume Data . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
5.5.2.1 Ambiguities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
5.5.2.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
72
6
Experiments
73
6.1
System Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
73
6.2
Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
73
6.3
Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
74
6.4
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
75
6.4.1
Abstract Data Performance Experiments . . . . . . . . . . . . . . . . . . . . . .
76
6.4.1.1 CPU-GPU Texture Upload . . . . . . . . . . . . . . . . . . . . . . . .
76
- iii -

Contents
6.4.1.2 Interleaved Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . .
76
6.4.1.3 Early Ray Carving . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
76
6.4.1.4 Fragment Shader CIELab-RGB Conversion . . . . . . . . . . . . . . .
77
6.4.1.5 Porting all Load to the Fragment Processor . . . . . . . . . . . . . . . .
77
6.4.1.6 GPU-CPU Texture Read-back
. . . . . . . . . . . . . . . . . . . . . .
77
6.4.1.7 FBO Texture Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
78
6.4.1.8 Impact of CPU-GPU Texture Upload on overall Performance . . . . . .
78
6.4.1.9 Number of Source Cameras . . . . . . . . . . . . . . . . . . . . . . . .
78
6.4.1.10 Number of Destination Cameras . . . . . . . . . . . . . . . . . . . . .
79
6.4.2
Concrete Data Performance Experiments . . . . . . . . . . . . . . . . . . . . .
79
6.4.2.1 Algorithmic Features . . . . . . . . . . . . . . . . . . . . . . . . . . .
80
6.4.2.2 Destination Cameras . . . . . . . . . . . . . . . . . . . . . . . . . . .
80
6.4.2.3 Volumetric Resolution
. . . . . . . . . . . . . . . . . . . . . . . . . .
81
6.4.2.4 Volumetric Bounding Box
. . . . . . . . . . . . . . . . . . . . . . . .
81
6.4.2.5 PCS Increments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
81
6.4.3
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
82
6.4.3.1 Algorithmic Features . . . . . . . . . . . . . . . . . . . . . . . . . . .
82
6.4.3.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
82
6.4.3.3 GPU/CPU Comparison . . . . . . . . . . . . . . . . . . . . . . . . . .
82
6.5
Quality
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
6.5.1
Concrete Data Quality Experiments . . . . . . . . . . . . . . . . . . . . . . . .
83
6.5.1.1 Volumetric Resolution
. . . . . . . . . . . . . . . . . . . . . . . . . .
83
6.5.1.2 Volumetric Bounding Box
. . . . . . . . . . . . . . . . . . . . . . . .
83
6.5.1.3 PCS Increments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
84
6.5.2
Visual Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
84
6.5.2.1 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . .
84
6.5.2.2 Interleaved Sampling and MV I . . . . . . . . . . . . . . . . . . . . . .
84
6.5.2.3 Camera Viewing Cone Intersection . . . . . . . . . . . . . . . . . . . .
84
6.5.2.4 Reconstruction of V HA and PCS . . . . . . . . . . . . . . . . . . . . .
85
6.5.2.5 Volumetric Resolution
. . . . . . . . . . . . . . . . . . . . . . . . . .
85
6.5.2.6 PCS Increments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
85
6.5.2.7 Geometrical Score for Active Source Camera Computation . . . . . . .
86
6.5.2.8 Range of Color Distances for Active Source Camera Computation . . .
86
6.5.2.9 Labeling of Interior Space
. . . . . . . . . . . . . . . . . . . . . . . .
86
6.5.3
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
86
6.5.3.1 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . .
87
6.5.3.2 BB, MV I and Viewing Cone Intersection . . . . . . . . . . . . . . . . .
87
6.5.3.3 V HA and PCS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
87
6.5.3.4 PCS Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
87
6.5.3.5 Labeling of Interior Space
. . . . . . . . . . . . . . . . . . . . . . . .
88
- iv -

Contents
7
Discussion and Enhancements
89
7.1
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
89
7.2
Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
90
7.3
Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
90
7.3.1
Online System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
90
7.3.2
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
91
7.3.3
Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
91
7.4
Annotation
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
92
A Abstract Data Performance Experiments
93
B Concrete Data Performance Experiments
98
C Concrete Data Quality Experiments
104
D Visual Experiments
108
References
127
- v -

List of Figures
L
IST OF
F
IGURES
1.1
A Taxonomy of 3D Shape Acquisition Techniques . . . . . . . . . . . . . . . . . . . . .
2
2.1
Image-based Ray Sampling of the Visual Hull (IBVH)
. . . . . . . . . . . . . . . . . .
11
2.2
Multi Plane-Sweep Implementation of Space Carving . . . . . . . . . . . . . . . . . . .
15
3.1
Perspective Projection in Pinhole Camera . . . . . . . . . . . . . . . . . . . . . . . . .
19
3.2
Relation of Coordinate Systems for Image Formation . . . . . . . . . . . . . . . . . . .
20
3.3
Point Correspondences for Linear Camera Calibration . . . . . . . . . . . . . . . . . . .
22
3.4
Light at a Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
3.5
Volume Intersection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
3.6
Limitations of Volume Intersection . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32
3.7
Silhouette-Equivalency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
34
3.8
Photo-Consistency
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
3.9
Visibility and Non-Photo-consistency
. . . . . . . . . . . . . . . . . . . . . . . . . . .
39
4.1
Nested Volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
5.1
Hardware Deployment of Reconstruction Framework . . . . . . . . . . . . . . . . . . .
50
5.2
Process Flow in Reconstruction Framework . . . . . . . . . . . . . . . . . . . . . . . .
51
5.3
Destination Camera Setup
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
56
5.4
Interleaved Sampling in 2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
59
5.5
Active Destination Camera Test
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
62
5.6
Early Ray Carving
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64
5.7
Ray-based Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
5.8
Geometrical Score for Active Source Camera Decision . . . . . . . . . . . . . . . . . .
66
5.9
FBO Rendertexture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
69
5.10 Labeling Interior Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
6.1
Multiple Viewpoint Video Capturing Setup
. . . . . . . . . . . . . . . . . . . . . . . .
74
6.2
Concrete Hardware Deployment of Reconstruction Framework . . . . . . . . . . . . . .
75
1.1
Performance CPU-GPU Texture Upload . . . . . . . . . . . . . . . . . . . . . . . . . .
94
1.2
Performance Interleaved Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
94
1.3
Performance Early Ray Carving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
94
1.4
Performance Raw CIELab-RGB Conversion . . . . . . . . . . . . . . . . . . . . . . . .
95
1.5
Performance Vertex/Fragment Shader Balancing . . . . . . . . . . . . . . . . . . . . . .
95
1.6
Performance GPU-CPU Texture Read-back . . . . . . . . . . . . . . . . . . . . . . . .
95
1.7
FBO Texture Size Performance Impact 1 . . . . . . . . . . . . . . . . . . . . . . . . . .
96
1.8
FBO Texture Size Performance Impact 2 . . . . . . . . . . . . . . . . . . . . . . . . . .
96
1.9
Impact of CPU-GPU Texture Upload on System Performance . . . . . . . . . . . . . . .
97
1.10 Impact of Source Cameras on Rendering Performance . . . . . . . . . . . . . . . . . . .
97
1.11 Impact of Destination Cameras on Rendering Performance . . . . . . . . . . . . . . . .
97
2.1
Performance of Algorithmic Features (Dancer1) . . . . . . . . . . . . . . . . . . . . . .
99
2.2
Performance of Algorithmic Features (Dancer2) . . . . . . . . . . . . . . . . . . . . . .
99
2.3
Performance of Algorithmic Features (Paper Houses) . . . . . . . . . . . . . . . . . . .
99
- vi -

List of Figures
2.4
Destination Camera Rendering Performance (Dancer1) . . . . . . . . . . . . . . . . . . 100
2.5
Destination Camera Rendering Performance (Dancer2) . . . . . . . . . . . . . . . . . . 100
2.6
Destination Camera Rendering Performance (Paper Houses) . . . . . . . . . . . . . . . 100
2.7
Performance Impact of Volumetric Resolution (Dancer1) . . . . . . . . . . . . . . . . . 101
2.8
Performance Impact of Volumetric Resolution (Dancer2) . . . . . . . . . . . . . . . . . 101
2.9
Performance Impact of Volumetric Resolution (Paper Houses) . . . . . . . . . . . . . . 101
2.10 Performance Impact of Volumetric Bounding Box (Dancer1) . . . . . . . . . . . . . . . 102
2.11 Performance Impact of Volumetric Bounding Box (Dancer2) . . . . . . . . . . . . . . . 102
2.12 Performance Impact of Volumetric Bounding Box (Paper Houses) . . . . . . . . . . . . 102
2.13 Performance Impact of PCS Increments (Dancer1)
. . . . . . . . . . . . . . . . . . . . 103
2.14 Performance Impact of PCS Increments (Dancer2)
. . . . . . . . . . . . . . . . . . . . 103
2.15 Performance Impact of PCS Increments (Paper Houses) . . . . . . . . . . . . . . . . . . 103
3.1
Quality Impact of Volumetric Resolution (Dancer1) . . . . . . . . . . . . . . . . . . . . 105
3.2
Quality Impact of Volumetric Resolution (Dancer2) . . . . . . . . . . . . . . . . . . . . 105
3.3
Quality Impact of Volumetric Resolution (Paper Houses) . . . . . . . . . . . . . . . . . 105
3.4
Quality Impact of Volumetric Bounding Box (Dancer1) . . . . . . . . . . . . . . . . . . 106
3.5
Quality Impact of Volumetric Bounding Box (Dancer2) . . . . . . . . . . . . . . . . . . 106
3.6
Quality Impact of Volumetric Bounding Box (Paper Houses) . . . . . . . . . . . . . . . 106
3.7
Quality Impact of PCS Increments (Dancer1) . . . . . . . . . . . . . . . . . . . . . . . 107
3.8
Quality Impact of PCS Increments (Dancer2) . . . . . . . . . . . . . . . . . . . . . . . 107
3.9
Quality Impact of PCS Increments (Paper Houses) . . . . . . . . . . . . . . . . . . . . . 107
4.1
Image Segmentation Experiment (Dancer2) . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2
Image Segmentation Experiment (Dancer1) . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3
Conventional/Interleaved Sampling Visualization (Dancer1) 1
. . . . . . . . . . . . . . 111
4.4
Conventional/Interleaved Sampling Visualization (Dancer1) 2
. . . . . . . . . . . . . . 112
4.5
Camera Viewing Cone Intersection Visualization (Dancer1) 1 . . . . . . . . . . . . . . . 113
4.6
Camera Viewing Cone Intersection Visualization (Dancer1) 2 . . . . . . . . . . . . . . . 114
4.7
PCS Reconstruction Experiment (Dancer1) 1
. . . . . . . . . . . . . . . . . . . . . . . 115
4.8
PCS Reconstruction Experiment (Dancer1) 2
. . . . . . . . . . . . . . . . . . . . . . . 116
4.9
PCS Reconstruction Experiment (Dancer1) 3
. . . . . . . . . . . . . . . . . . . . . . . 117
4.10 PCS Reconstruction Experiment (Dancer1) 4
. . . . . . . . . . . . . . . . . . . . . . . 118
4.11 PCS Reconstruction Experiment (Dancer2) 1
. . . . . . . . . . . . . . . . . . . . . . . 119
4.12 PCS Reconstruction Experiment (Dancer2) 2
. . . . . . . . . . . . . . . . . . . . . . . 120
4.13 PCS Increments Experiment (Paper Houses) 1 . . . . . . . . . . . . . . . . . . . . . . . 121
4.14 PCS Increments Experiment (Paper Houses) 2 . . . . . . . . . . . . . . . . . . . . . . . 122
4.15 PCS Increments Experiment (Dancer1) . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.16 Impact of Geometrical Score sc
g
k
Experiment (Dancer1) . . . . . . . . . . . . . . . . . . 124
4.17 Impact of Range r (k) on Color Similarity Score sc
c
k
Experiment (Dancer1) . . . . . . . . 125
4.18 Interior Volume Labeling Experiment (Dancer1) . . . . . . . . . . . . . . . . . . . . . . 126
- vii -

List of Tables
L
IST OF
T
ABLES
1.1
Feature Comparison with Closely Related Works . . . . . . . . . . . . . . . . . . . . .
5
1.1
Parameter Setup for Abstract Data Performance Experiments . . . . . . . . . . . . . . .
93
2.1
Parameter Setup for Concrete Data Performance Experiments . . . . . . . . . . . . . . .
98
3.1
Parameter Setup for Concrete Data Quality Experiments
. . . . . . . . . . . . . . . . . 104
4.1
Parameter Setup for Visual Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 108
- viii -

Chapter 1 - Introduction
1
I
NTRODUCTION
As computational machinery and methods are subject to an ever increasing progress, traditional applica-
tions change and new ones emerge. Years ago, computer graphics and computer vision were two distinct
fields having their respective area of application. However, recently a convergence between both can be
observed. Photorealistic models of real-world scenes are not only obtained by artificially creating shape
and appearance, but also by reconstructing these models from photographic or video data of the real
world. This results in a powerful contribution to our multimedial influenced environment.
1.1
Application
The ability to reconstruct dynamic realworld scenes from multiple video streams enables a wide range
of applications. 3D video extends the common 2D video in the way, that it is view-independent. Hence,
the decision about a fixed camera from where the scene is viewed is shifted from the time and place of
recording to the time and place of consumption. 3D video can be used in the context of personal and so-
cial human activities like entertainment (e.g. 3D games, 3D events [24], 3D TV [37], 3D video recorder
[70]) and education (e.g. 3D books, 3D-based training) or for the preservation of common knowledge
(e.g. ethnic dancing performances [39], special skills).
3D video is obtained by analyzing properties like scene geometry and color from multi-viewpoint video
streams. This generally involves a high computational effort. Nevertheless, if the reconstruction can be
computed in real-time, a broad field of further applications is facilitated. In the field of CSCW
1
, more
realistic and immersive telepresence and conferencing systems are created by using 3D cues [44]. In
this context, P
RINCE
et al. [52] propose a framework, where a user can interact with a dynamic model
of a remotely captured collaborator which is overlaid with the real world in a video-see-through HMD.
Instead of inserting a dynamic scene into a real environment (AR
2
), its also possible to combine it into
a virtual world (MR
3
). This enables for seamless integration of real and virtual content with appropriate
visual cues like reflections, shading, shadowing, occlusion and collision detection. Interactive systems for
reconstruction and insertion of actors into virtual worlds are proposed by H
ASENFRATZ
et al. [21] and
more recently P
OMI
and S
LUSALLEK
[49], contributing to the field of virtual television studio techniques.
Dynamic 3D data may be also used as an intermediate result for further processing and information ex-
traction. Scene analysis allows for object recognition or motion tracking. Latter is often applied as a
non-intrusive approach to human motion capturing [43].
Supporting tools for complex 3D model creation are powerful, but also expensive and difficult to use.
1
Computer Supported Cooperative/Collaborative Work
2
Augmented Reality
3
Mixed Reality
- 1 -

Chapter 1 - Introduction
2
A taxonomy
Shape acquisition
Non-contact
Contact
Non-destructive
Destructive
CMM
Jointed arms
Slicing
Reflective
Transmissive
Non-optical
Industrial CT
Optical
A taxonomy
Optical
Passive
Stereo
Shape from
Shape from
silhouettes
Depth from
focus/defocus
Active
Triangulation
Interferometry
Moire
Holography
Active stereo
Active depth
from defocus
2
A taxonomy
Shape acquisition
Non-contact
Contact
Non-destructive
Destructive
CMM
Jointed arms
Slicing
Reflective
Transmissive
Non-optical
Industrial CT
Optical
A taxonomy
Optical
Passive
Stereo
Shape from
Shape from
silhouettes
Depth from
focus/defocus
Active
Triangulation
Interferometry
Moire
Holography
Active stereo
Active depth
from defocus
Figure 1.1: A Taxonomy of 3D Shape Acquisition Techniques (C
URLESS
[9]).
Obtaining realistic models is still a time consuming process. As mentioned earlier, 3D models can be
created from photographic data of the real world. This is generally possible by performing an automatic
reconstruction. However, better performance, quality and handling of scenes with difficult properties
is achieved using techniques that require user interaction [76]. Here, real-time computation allows for
immediate feedback on actions and parameter adjustment.
1.2
Classification
This proposed approach performs 3D shape reconstruction using images or videos from multiple view-
points around the scene. This is one of many different techniques that can be used to accomplish the task
of 3D shape acquisition (Figure 1.1). It belongs to the class of non-contact, reflective, optical, passive
techniques. Passive refers to the fact, that only light is captured which is already existent in the scene.
Instead, active-light methods perform a controlled illumination, e.g. by projecting a coded pattern to
accomplish a more robust extraction of the 3D features. Passive-light techniques are also referred to as
Shape from X, where X relates to the special image cue that is used to infer the shape. Common cues are
e.g. stereo, shading, focus/defocus, motion, silhouettes and scene photo-consistency. While there also
exist methods using uncalibrated cameras for shape recovery, the proposed approach assumes that the
position and orientation of each camera to the world coordinate frame is known. General surveys about
techniques in the field of passive-light are given by D
YER
[12] and S
LABAUGH
et al. [60].
As several shape from X approaches rely on special image cues to be present, three general classes for 3D
shape reconstruction and texturing from multiple photographs have been developed. Namely (1) image-
based stereo (IBS), (2) volumetric intersection (VI) and (3) photo-consistency (PC). Each of them having
advantages and drawbacks. There is a general difference between IBS and VI/PC. IBS performs an image-
based search to find corresponding pixels to generate 3D points by triangulation. The result is a depth map
or a partial surface. There are some disadvantages inherent with this. (1) For the correspondence search
to be efficient, the views must be close together (small baseline). (2) If more then two views are used,
correspondences must be searched in each pair of views resulting in a high complexity (multi-baseline).
(3) For reconstructing a whole 360 degree scene model, many partial surfaces have to be computed for a
set of reference viewpoints. Integration of these distinct surfaces into a global consistent model can be
- 2 -

Chapter 1 - Introduction
complex, since the alignment is often carried out in a least-square sense as proposed by C
URLESS
and
L
EVOY
[10]. (4) Only a sparse point cloud is obtained when pixel correspondences or image features are
hard to find. In general, there is no neighborship relation between the 3D points, as they are not located
on a regular grid. Hence, generating a mesh-based surface representation implies further computation.
(5) There is no handling of occlusion between the different views. This might lead to errors throughout
the correspondence search.
Instead of the image-based search in IBS, VI/PC methods perform a volumetric scene modelling and
thus compensate for the disadvantages. The methods operate directly in the world coordinate frame and
test for a particular voxel, whether it belongs to the scene or not. The images act as set of constraints
that the unknown scene has to fulfill. To limit the search space, some initial volume is defined where
the scene is known to be contained. The theoretical foundation of shape from silhouette (VI) and shape
from photo-consistency (PC) is laid by L
AURENTINI
[29] and K
UTULAKOS
and S
EITZ
[28] respectively.
Both approaches have advantages and drawbacks. Shape from Silhouette relies on a high quality image
segmentation and can generally not recover several smooth convex and concave surface patches. The
reconstructed shape is known as the visual hull of the scene. Nevertheless, silhouette-based methods are
often applied where performance matters, as they are simple, robust and fast to compute. Shape from
photo-consistency does not have these disadvantages. It enables a high quality shape reconstruction and
gives the possibility to approximate a view-independent voxel color in a straightforward way. The result-
ing shape is known as the photo hull of the scene. However, the method suffers from a high computational
cost. An interesting fact is, that both techniques are related. Shape from photo-consistency can be seen
as a generalization of shape from silhouette where further constraints are added. Due to this, the visual
hull is proven to contain the photo hull. The proposed approach exploits this relation by computing the
visual hull as initial volume for the reconstruction of the photo hull.
1.3
Performance
The extraction of the 3D geometry as a set of voxels is only the first part of a possible whole processing
pipeline. Further tasks may include isosurface extraction, photorealistic texturing, insertion into virtual
and real environments for rendering as well as higher-level feature extraction for scene recognition. To
perform the entire processing pipeline in real-time, attempts have been made to develop scalable parallel
algorithms that are executed on PC cluster systems [39][21]. While these commonly provide the scalabil-
ity and processing power to handle a high number of input views with good quality, they also introduce
several drawbacks. These include a high hardware cost and a complex bulky setup which restricts their
usage to professional and static environments.
Focusing towards real-time processing on a single PC, algorithmical complexity and computational hard-
ware have to be improved. Regarding the latter, beside a rapid progress in CPU clock speed, there exist
assembly level optimizations for the CPU, such as the MMX and SSE extensions from Intel. However,
this may speed up 3D shape reconstruction but also entirely occupies the CPU, so that no other pipeline
- 3 -

Chapter 1 - Introduction
tasks may be performed.
A way to increase the performance of 3D shape recovery on a single PC while saving CPU cycles for
other tasks is to leverage the power of current off-the-shelf Graphics Processing Units (GPU). The GPU
is a unique processor beside the CPU, originally introduced to free the CPU from an increasing amount
of 3D graphics processing. GPUs are optimized to achieve a high throughput (streaming) on a special
arithmetic instruction set (SIMD). For comparision, an Intel Pentium4 CPU at 3.6GHz achieves a peak
of 14.4 GFLOPs for its SSE3 unit. For the Nvidia GPUs GeForce 7800GTX/512MB (G70), GeForce
7900GTX (G71) and GeForce 8800GTX (G80) it is 165, 250 and 518 GFLOPs respectively. GPUs are
not only very fast processors, they are also developing with a higher rate of growth then CPUs. With
an average yearly rate of 1.7×/2.3× for pixels/vertices per second compared to a rate of 1.4× for CPU
performance, GPU growth outpaces Moore's Law for traditional microprocessors [47]. This divergence
of the performance curves is related to the fundamental architectural differences. CPUs aim at achieving
a high performance in sequential computations. As these have a very generic nature, CPUs use several
of their processing power for non-computational tasks like branch prediction and caching. Instead, GPUs
are specialized to achieve high arithmetic intensity with performing parallelized graphics computations.
On the other side, its specialized nature makes it difficult to employ a GPU as processor for generic
computations. Thus, algorithms have to be mapped to emulate rendering tasks. The proposed approach
takes this way and employs the GPU for the task of real-time 3D reconstruction by shape from photo-
consistency on a single PC. A recent overview of the GPU architecture and its related programming
paradigms is found from O
WENS
et al. [47].
1.4
Contribution
Most closely related to the proposed approach are the works on GPU-accelerated Space Carving from
S
AINZ
et al. [55] and Z
ACH
et al. [73]. These restore a full, explicit and view-independent volumetric
scene model. However, since they explicitly compute visibility and process interior voxels they are not
real-time capable.
GPU-accelerated image-based methods for computing the photo hull such as L
I
et al. [35] allow for real-
time execution but lack many other features. As these algorithms are denoted to the generation of novel
viewpoints, they only recover a partial, implicit and view-dependent scene model. If GPU acceleration
is not used, it is obvious that the model is explicitly available on the CPU as in S
LABAUGH
et al. [61].
Nevertheless, the performance decreases about seven times compared to [35] and real-time processing is
only achieved by employing a coarse space sampling which influences quality.
Multi-baseline image-based stereo (IBS) can also be computed in real-time by employing GPU acceler-
ation as in Y
ANG
et al. [72] and Y
ANG
and P
OLLEFEYS
[71]. These approaches provide an explicit
representation as disparity or depth map, but also suffer from the general drawbacks of IBS and hence do
- 4 -

Chapter 1 - Introduction
Feature
Proposed
Approach
Volumetric
Carving (GPU)
[56,74]
Novel-View
Rendering (GPU)
[35]
Novel-View
Rendering (CPU)
[62]
Image-based
Stereo (GPU)
[72,73]
Real-time Photo Hull
()
View-Independent
Entire 3D Model
Explicit Representation
Photo-consistency Measure
with Implicit Visibility
2D Processing on Single PC
3D Processing on Single PC
Table 1.1: Feature Comparision with Closely Related Works.
not generate an entire, view-independent scene model.
The proposed approach explains a framework which employs GPU hardware-acceleration for shape from
photo-consistency to generate an explicit volumetric representation of a scene that is captured from mul-
tiple calibrated viewpoints. Several works are dealing in this topic, all performing 3D reconstruction
on a single PC. However, none fulfills all of the features proposed with this approach (Table 1.1). Es-
pecially combining real-time processing with the ability to generate an entire volumetric model can not
be achieved. While realizing this, the proposed approach defines a novel photo-consistency measure im-
plicitly accounting for visibility and specular highligths. Moreover, the entire image processing is also
executed GPU-accelerated on the single reconstruction PC and not on a PC cluster as generally done.
1.5
Overview
The remainder of this thesis is organized as follows. Chapter 2 surveys related work in the field of dy-
namic scene reconstruction by shape from silhouette and shape from photo-consistency. The focus lies
on high performance reconstruction and hardware-acceleration. Chapter 3 introduces the theoretical basis
for the proposed approach within three main parts. (1) Camera geometry is important to relate images
captured from multiple viewpoints to an unknown 3D scene. (2) When taking a photograph of a scene,
the light emitted towards the camera is captured. Therefore light and color is discussed as necessary
for this work. (3) At last, the theory behind shape from silhouette and shape from photo-consistency
needs to be explained. Chapter 4 continues with defining the Basic Algorithm as a hybrid approach to
scene reconstruction with introducing features like (1) a robust and fast image-segmentation to account
for shadows, (2) a set of nested volumes that are sequentially computed to speed-up computation and
(3) an implicit visibility computation. The Basic Algorithm is extended and mapped onto the GPU in
chapter 5. The corresponding Advanced Algorithm enables efficient real-time computation by employ-
ing a multipass-rendering strategy. Chapter 6 explains and discusses several experiments to analyze the
proposed system in terms of performance and reconstruction quality. Chapter 7 concludes with giving a
summary and discussing limitations and future works. Due to their number, diagrams and figures relating
- 5 -

Chapter 1 - Introduction
to experiments are grouped into additional appendix chapters A, B, C and D.
- 6 -

Chapter 2 - Related Work
2
R
ELATED
W
ORK
The following chapter discusses related research in the field of passive 3D reconstruction using images
from multiple calibrated views. The aim is not only to present the different works, but also to connect the
particular areas and set up the foundation for the proposed approach.
The objective of the proposed approach is to explicitly reconstruct the entire 3D shape of a dynamic scene
in real-time on a single PC. Explicit relates to the fact, that a 3D scene description is obtained as an output.
Implicit methods only perform an internal 3D reconstruction as an intermediate processing step, e.g. in
combination with generating novel views of the scene. Making data explicit is generally no problem if
the CPU is used for reconstruction. However, the proposed approach employs the GPU which involves
an expensive read-back operation.
An important property is view-dependency, which relates to the fact that distinct views of the same scene
part result also in distinct 3D shapes. It seems obvious that view-dependency may occur for techniques,
that do not compute the entire scene model. These include techniques that generate of novel-views or
depth maps from existing views. The proposed approach aims in restoring an entire view-independent
scene model. Here, further performance and quality issues come into play. First, reconstructing the entire
scene needs more computation. Second, the resulting model should be global consistent. This is gener-
ally not easily achieved by merging multiple partial surfaces from view-dependent approaches.
Using GPU acceleration
4
, real-time reconstruction in the mentioned way is possible for the fast shape
from silhouette technique. However, this has several limitations on the quality of the result. To compen-
sate for this, several hybrid approaches have been suggested to generate a refined shape using shape from
photo-consistency. Though, these methods do not achieve real-time or interactive framerates, which is
only obtained with GPU accelerated view-dependent reconstruction.
The following review is separated into two parts. At first, works using the fast and efficient shape from
silhouette technique are discussed in terms of their performance. This is related to either algorithmical
improvement or to the mapping on a CPU cluster or the GPU. The techniques are chosen, as they are
directly related to this work. The second part reviews approaches that generate a better quality shape
approximation using photo-consistency or image-based stereo.
2.1
Shape from Silhouette
If there exists a set of images from different viewpoints, where the scene is entirely included, volume
intersection can be used to compute a shape approximation which is itself a subset of the convex hull and
4
In this scope, GPU acceleration relates to the acceleration of any shape modifying operation. Thus, CPU-based reconstruc-
tion with GPU-accelerated rendering (and clipping) does not fall under this category.
- 7 -

Chapter 2 - Related Work
known to contain the real object. The approach is rather simple and based on the assumption, that any
ray from the camera center through a pixel from where the original scene is visible, must intersect the
(unknown) scene. If all such rays are constructed for the whole set of images, their intersection in 3D is
referred to as volume intersection.
2.1.1
Image Segmentation
To determine the pixels from where an object is visible, image segmentation is performed
5
. The result-
ing set of foreground pixels in an image is called silhouette of the 3D scene and is represented binary.
Therefore, another common synonym for volume intersection methods is shape-from-silhouette. For the
segmentation, knowlegde about background is obtained in advance by taking images from all viewpoints,
this time without any scene present. In the simplest case, the color of a background pixel is compared
with the corresponding pixel in the real image. If this exceeds some threshold, the pixel is assumed to
belong to foreground. More robust but complex approaches as H
ORPRASERT
et al. [22] generate a color
distribution function for a background pixel. Moreover, background and foreground can be expressed by
a non-parametric model which is adapted at runtime to compensate for various effects as in T
OYAMA
et al.
[66] and more recently in E
LGAMMAL
et al. [13]. While these approaches work on the CPU in real-time,
image segmentation on the GPU achieves an even higher performance. G
RIESSER
[20] presents a robust
adaptive implementation achieving 4ms for a VGA image. A theoretical analysis for image segmentation
related to the color of the background is done by S
MITH
and B
LINN
[62]. At last, a complete and recent
overview of image segmentation techniques is given by Z
HANG
[75].
2.1.2
Foundations
When a set of silhouette images is available, volume intersection can be performed. This is first described
by M
ARTIN
and A
GGARWAL
[38] and often employed in subsequent works. The theoretical foundation
is first established by L
AURENTINI
[29] proposing a problem definition and introducing the visual hull
as the best possible shape approximation which can be obtained from volume intersection. L
AURENTINI
[30] extends the theory and discusses practical estimations for the necessary number of silhouettes de-
pending on scene complexity in [31]. A more recent survey of visual hull related works is found in [32].
An introduction and discussion about the corresponding theory is found in (ref section).
The volume intersection is either represented as an exact geometrical object or as a discrete volume. A
geometric representation is obtained by intersecting the polyhedra of the backprojected silhouette cones in
3D or by performing image-based polygone intersection in 2D. In case of a discrete representation, there
exist different strategies for volumetric and image-based approaches. Volumetric approaches project and
intersect the silhouettes on slices of voxels. Image-based methods perform a ray-casting to find the 3D
intersections with the visual hull.
5
The technique is also referred to as silhouette extraction or background subtraction.
- 8 -

Chapter 2 - Related Work
2.1.3
Performance of View-Independent Reconstruction
This section discusses attempts for increasing the performance for view-independent reconstruction, where
a global consistent 3D model is generated.
2.1.3.1
CPU
Performance can be improved in several ways. Many discrete approaches realize a LOD
6
reconstruction
using an octree representation. The space is sampled in a coarse-to-fine hierarchy depending on scene
complexity. An important work in this area is presented by P
OTMESIL
[50] and significantly improved
by S
ZELISKI
[64].
Fast and scalable systems propose to use a PC cluster for reconstruction. It is obvious that these are
able to achieve real-time. The drawback lies in a static system setup involving a lot of expensive hard-
ware. The experimental results for the following works are all achieved for capturing a dynamic dancer
scene. B
OROVIKOV
and D
AVIS
[3] utilize 14 cameras with 16 PentiumII 450 MHz CPUs, connected via
100Mbit ethernet. Reconstruction is done with an octree approach in a volume of 2×2×2m. At voxelsizes
of 31, 16 and 8mm, a performance of 10, 8 and 4fps is obtained respectively. While this system depends
mainly on octree depth and corresponding transmission time, W
U
and M
ATSUYAMA
[69] use a 10Gbit
myrinet network to increase bandwidth and scalability. Their framework handles a varying number of
VGA cameras, each connected with a dual PentiumIII 1GHz CPU. When using six cameras in a volume
of 2×2×2m and a voxelsize of 20mm, 26fps are obtained. The framerate can be further increased with
the number of PCs. Using 10 PCs, the system runs at 43fps. The work from U
EDA
et al. [67] introduces
a whole PC cluster based reconstruction and rendering architecture. It uses nine VGA cameras and 21
Pentium4 3GHz CPUs, connected also via 10Gbit myrinet network. For a volume of 2.56×2.56×2.56m
and a voxelsize of 20mm they report an overall performance of 10fps. This framerate is measured for the
entire processing pipeline where the single processes are executed in parallel.
2.1.3.2
GPU Acceleration
The work of H
ASENFRATZ
et al. [21] is important, as it is the only known system making the GPU-
reconstructed view-independent volume intersection explicit. It proposes a real-time framework for re-
construction and insertion of actors into virtual worlds. The reconstruction is done by sweeping a plane
through the volume. At each position, the silhouette textures are projected onto the plane and combined by
multitexturing. If all silhouettes project onto a particular voxel, it is contained in the volume intersection.
The data is made explicit by reading back the slices from the framebuffer to the CPU. The system uses
four VGA cameras with respective PCs performing the 2D processing. The images are then transfered
via 1Gbit network to an Onyx 3000-IR3 server with eight R12000 400MHz CPUs. For the reconstruction,
6
Level Of Detail
- 9 -

Chapter 2 - Related Work
a volume of 2×2×2m with a rather big voxelsize of 30mm is used. With a texture upload of 2ms and
reconstruction of 13ms, a framerate of 67fps is achieved. The read-back implies a further performance
impact, as also interior voxels have to be transfered.
2.1.4
Performance of View-Dependent Reconstruction
The fundamental difference to the following works is, that they aim at rendering novel views. That means,
the 3D data is usually only view-dependent and partially reconstructed. This is related to IBR
7
(B
UEHLER
et al. [6]) in the way, that additional 3D information is leveraged.
2.1.4.1
CPU
M
ATUSIK
et al. [42] describe an efficient image-based sampling approach for computing and texturing
visual hulls, called Image-Based Visual Hulls (IBVH). A ray-based reconstruction is done for the pixels
from where the scene is visible in the destination image (Figure 2.1). Considering [42], the performance is
mostly depending on the number of pixels and lies at about 8fps for an average VGA image using a quad
CPU 550MHz server. For the test, four cameras are used at a resolution of 256×256 with corresponding
PCs for image processing. M
ATUSIK
et al. present an improved adaptive version in [40]. Depending on
the scene, a speed-up of 4-9× is achieved. M
ATUSIK
et al. continue their work on visual hull rendering
and present a different approach in [41]. For the novel view, an exact geometrical polyhedral represen-
tation is computed by employing image-based polygone intersections in 2D. View-dependent texturing
is done on the GPU when rendering by using the approach of B
UEHLER
et al. [6]. Four cameras are
employed at QVGA resolution, each with respective PC. Using a dual PentiumIII 933MHz CPU, fram-
erates of 15fps are reported for pure geometry construction and 30fps for the rendering. P
RINCE
et al.
[52] propose a framework for real-time reconstruction and novel-view rendering in an AR telepresence
environment. Herefore, a remotely captured human is rendered on an ARToolkit [25] marker for the view-
point of a person wearing video-see-through glasses. Interaction with the 3D video is done by adjusting
the marker's position and orientation. The image-based reconstruction is done similar to M
ATUSIK
et al.
[42]. Image acquisition is performend using 14 VGA cameras, connected to five PCs doing the image
processing. Images are distributed to a rendering server over 1Gbit network. With a rendered resolution
of 384×288, the system achieves a framerate of 25fps.
2.1.4.2
GPU Acceleration
The proposed works of M
ATUSIK
et al. perform image-based volume intersection and can thus decrease
algorithmical complexity for novel-view rendering on the CPU. In contrast to this, L
I
et al. [?] propose
an approach for novel-view rendering entirely on the GPU. For a particular silhouette, all faces of the
corresponding 3D silhouette cone are rendered sequentially and projective textured from all other views.
7
Image-Based Rendering
- 10 -

Chapter 2 - Related Work
2
Background and Previous Work
Kanade's virtualized reality system [20] [23] [13] is perhaps clos-
est in spirit to the rendering system that we envision. Their initial
implementations have used a collection of cameras in conjunction
with multi-baseline stereo techniques to extract models of dy-
namic scenes. These methods require significant off-line
processing, but they are exploring special-purpose hardware for
this task. Recently, they have begun exploring volume-carving
methods, which are closer to the approach that we use [26] [30].
Pollard's and Hayes' [21] immersive video objects allow
rendering of real-time scenes by morphing live video streams to
simulate three-dimensional camera motion. Their representation
also uses silhouettes, but in a different manner. They match sil-
houette edges across pairs of views, and use these
correspondences to compute morphs to novel views. This ap-
proach has some limitations, since silhouette edges are generally
not consistent between views.
Visual Hull.
Many researchers have used silhouette infor-
mation to distinguish regions of 3D space where an object is and
is not present [22] [8] [19]. The ultimate result of this carving is a
shape called the object's visual hull [14]. A visual hull always
contains the object. Moreover, it is an equal or tighter fit than the
object's convex hull. Our algorithm computes a view-dependent,
sampled version of an object's visual hull each rendered frame.
Suppose that some original 3D object is viewed from a set of
reference views R. Each reference view r has the silhouette s
r
with
interior pixels covered by the object. For view r one creates the
cone-like volume vh
r
defined by all the rays starting at the image's
point of view p
r
and passing through these interior points on its
image plane. It is guaranteed that the actual object must be con-
tained in vh
r
. This statement is true for all r; thus, the object must
be contained in the volume vh
R
=
r
R
vh
r
. As the size of R goes to
infinity, and includes all possible views, vh
R
converges to a shape
known as the visual hull vh
of the original geometry. The visual
hull is not guaranteed to be the same as the original object since
concave surface regions can never be distinguished using silhou-
ette information alone.
In practice, one must construct approximate visual hulls us-
ing only a finite number of views. Given the set of views R, the
approximation vh
R
is the best conservative geometric description
that one can achieve based on silhouette information alone (see
Figure 1). If a conservative estimate is not required, then alterna-
tive representations are achievable by fitting higher order surface
approximations to the observed data [2].
Volume Carving.
Computing high-resolution visual hulls
can be tricky matter. The intersection of the volumes vh
r
requires
some form of CSG. If the silhouettes are described with a polygo-
nal mesh, then the CSG can be done using polyhedral CSG, but
this is very hard to do in a robust manner.
A more common method used to convert silhouette contours
into visual hulls is volume carving [22] [8] [29] [19] [5] [27].
This method removes unoccupied regions from an explicit volu-
metric representation. All voxels falling outside of the projected
silhouette cone of a given view are eliminated from the volume.
This process is repeated for each reference image. The resulting
volume is a quantized representation of the visual hull according
to the given volumetric grid. A major advantage of our view-
dependent method is that it minimizes artifacts resulting from this
quantization.
CSG Rendering.
A number of algorithms have been de-
veloped for the fast rendering of CSG models, but most are ill
suited for our task. The algorithm described by Rappoport [24],
1
2
3
Figure 2 ­ Computing the IBVH involves three steps. First, the
desired ray is projected onto a reference image. Next, the intervals
where the projected ray crosses the silhouette are determined.
Finally, these intervals are lifted back onto the desired ray where
they can be intersected with intervals from other reference images.
requires that each solid be first decomposed to a union of convex
primitives. This decomposition can prove expensive for compli-
cated silhouettes. Similarly, the algorithm described in [11]
requires a rendering pass for each layer of depth complexity. Our
method does not require preprocessing the silhouette cones. In
fact, there is no explicit data structure used to represent the sil-
houette volumes other than the reference images.
Using ray tracing, one can render an object defined by a tree
of CSG operations without explicitly computing the resulting
solid [25]. This is done by considering each ray independently
and computing the interval along the ray occupied by each object.
The CSG operations can then be applied in 1D over the sets of
intervals. This approach requires computing a 3D ray-solid inter-
section. In our system, the solids in question are a special class of
cone-like shapes with a constant cross section in projection. This
special form allows us to compute the equivalent of 3D ray inter-
sections in 2D using the reference images.
Image-Based Rendering.
Many different image-based
rendering techniques have been proposed in recent years
[3] [4] [15] [6] [12]. One advantage of image-based rendering
techniques is their stunning realism, which is largely derived from
the acquired images they use. However, a common limitation of
these methods is an inability to model dynamic scenes. This is
mainly due to data acquisition difficulties and preprocessing re-
quirements. Our system generates image-based models in real-
time, using the same images to construct the IBHV and to shade
the final rendering.
3 Visual-Hull
Computation
Our approach to computing the visual hull has two distinct char-
acteristics: it is computed in the image space of the reference
images and the resulting representation is viewpoint dependent.
The advantage of performing geometric computations in image
space is that it eliminates the resampling and quantization artifacts
that plague volumetric approaches. We limit our sampling to the
pixels of the desired image, resulting in a view-dependent visual-
hull representation. In fact, our IBVH representation is equivalent
to computing exact 3D silhouette cone intersections and rendering
the result with traditional rendering methods.
Our technique for computing the visual hull is analogous to
finding CSG intersections using a ray-casting approach [25].
Given a desired view, we compute each viewing ray's intersection
with the visual hull. Since computing a visual hull involves only
intersection operations, we can perform the CSG calculations in
any order. Furthermore, in the visual hull context, every CSG
primitive is a generalized cone (a projective extrusion of a 2D
image silhouette). Because the cone has a fixed (scaled) cross
section, the 3D ray intersections can be reduced to cheaper 2D ray
intersections. As shown in Figure 2 we perform the following
steps: 1) We project a 3D viewing ray into a reference image. 2)
We perform the intersection of the projected ray with the 2D sil-
houette. These intersections result in a list of intervals along the
ray that are interior to the cone's cross-section. 3) Each interval is
then lifted back into 3D using a simple projective mapping, and
then intersected with the results of the ray-cone intersections from
other reference images. A naïve algorithm for computing these
IBVH ray intersections follows:
IBVHisect (intervalImage &d, refImList R){
for each referenceImage r in R
computeSilhouetteEdges (r)
for each pixel p in desiredImage d do
p.intervals = {0..inf}
for each referenceImage r in R
for each scanline s in d
for each pixel p in s
ray3D ry3 = compute3Dray(p,d.camInfo)
lineSegment2D l2 = project3Dray(ry3,r.camInfo)
intervals int2D = calcIntervals(l2,r.silEdges)
intervals int3D = liftIntervals(int2D,r.camInfo,ry3)
p.intervals = p.intervals ISECT int3D
}
To analyze the efficiency of this algorithm, let n be the num-
ber of pixels in a scanline. The number of pixels in the image d is
O(n
2
). Let k be the number of reference images. Then, the above
algorithm has an asymptotic running time O(ikn
2
), where i is the
time complexity of the
calcIntervals
routine. If we test for the
intersection of each projected ray with each of the e edges of the
silhouette, the running time of
calcIntervals
is O(e). Given
that l is the average number of times that a projected ray intersects
the silhouette
1
, the number of silhouette edges will be O(ln).
Thus, the running time of
IBVHisect
to compute all of the 2D
intersections for a desired view is O(lkn
3
).
The performance of this naïve algorithm can be improved by
taking advantage of incremental computations that are enabled by
the epipolar geometry relating the reference and desired images.
These improvements will allow us to reduce the amortized cost of
1D ray intersections to O(l) per desired pixel, resulting in an im-
plementation of
IBVHisect
that takes O(lkn
2
).
Given two camera views, a reference view r and a desired
view d, we consider the set of planes that share the line connect-
ing the cameras' centers. These planes are called epipolar planes.
Each epipolar plane projects to a line in each of the two images,
called an epipolar line. In each image, all such lines intersect at a
common point, called the epipole, which is the projection of one
of the camera's center onto the other camera's view plane [9].
As a scanline of the desired view is traversed, each pixel
projects to an epipolar line segment in r. These line segments
emanate from the epipole e
dr
, the image of d's center of projection
onto r's image plane (see Figure 3), and trace out a "pencil" of
epipolar lines in r. The slopes of these epipolar line segments will
either increase or decrease monotonically depending on the direc-
tion of traversal (Green arc in Figure 3). We take advantage of this
monotonicity to compute silhouette intersections for the whole
scanline incrementally.
1
We assume reference images also have O(n
2
) pixels.
r
1
r
2
r
3
r
4
r
5
r
6
r
pr1
r
pr2
r
pr3
r
pr4
r
pr5
r
pr6
Desired Image
Reference Image
Figure 3 ­ The pixels of a scanline in the desired image trace out
a pencil of line segments in the reference image. An ordered tra-
versal of the scanline will sweep out these segments such that their
slope about the epipole varies monotonically.
The silhouette contour of each reference view is represented
as a list of edges enclosing the silhouette's boundary pixels. These
edges are generated using a 2D variant of the marching cubes
approach [16]. Next, we sort the O(nl) contour vertices in in-
creasing order by the slope of the line connecting each vertex to
the epipole. These sorted vertex slopes divide the reference image
domain into O(nl) bins. Bin B
i
has an extent spanning between the
slopes of the ith and i+1st vertex in the sorted list. In each bin B
i
we place all edges that are intersected by epipolar lines with a
slope falling within the bin's extent
2
. During
IBVHisect
as we
traverse the pixels along a scanline in the desired view, the pro-
jected corresponding view rays fan across the epipolar pencil in
the reference view with either increasing or decreasing slope.
Concurrently, we step through the list of bins. The appropriate bin
for each epipolar line is found and it is intersected with the edges
in that bin. This procedure is analogous to merging two sorted
lists, which can be done in a time proportional to the length of the
lists (O(nl) in our case).
For each scanline in the desired image we evaluate n viewing
rays. For each viewing ray we compute its intersection with edges
in a single bin. Each bin contains on average O(l) silhouette
edges. Thus, this step takes O(l) time per ray. Simultaneously we
traverse the sorted set of O(nl) bins as we traverse the scanline.
Therefore, one scanline is computed in O(nl) time. Over n scanli-
nes of the desired image, and over k reference images, this gives a
running time of O(lkn
2
). Pseudocode for the improved algorithm
follows.
IBVHisect (intervalImage &d, refImList R){
for each referenceImage r in R
computeSilhouetteEdges (r)
for each pixel p in desiredImage d do
p.intervals = {0..inf}
for each referenceImage r in R
bins b = constructBins(r.caminfo, r.silEdges, d.caminfo)
for each scanline s in d
incDec order = traversalOrder(r.caminfo,d.caminfo,s)
resetBinPositon(b)
for each pixel p in s according to order
ray3D ry3 = compute3Dray(p,d.camInfo)
lineSegment2D l2 = project3Dray(ry3,r.camInfo)
slope m = ComputeSlope(l2,r.caminfo,d.caminfo)
updateBinPosition(b,m)
intervals int2D = calcIntervals(l2,b.currentbin)
intervals int3D = liftIntervals(int2D,r.camInfo,ry3)
p.intervals = p.intervals ISECT int3D
}
2
Sorting the contour vertices takes O(nl log(nl)) and binning takes O(nl
2
).
Sorting and binning over k reference views takes O(knl log(nl)) and
O(knl
2
) correspondingly. In our setting, l << n so we view this preproc-
essing stage as negligible.
Figure 2.1: Image-based Ray Sampling of the Visual Hull (IBVH) (M
ATUSIK
et al. [42]).
A desired novel view of the visual hull is rendered by ray-based sampling. (1) A 3D ray is projected into the reference views
to (2) determine the intersections with the visual hull. (3) Backprojection of the intersections from all reference views leads to
the 3D intersection with the visual hull. Sampling for the novel view is done in scanlines to determine which rays have to be
sampled.
For these textures, the alpha-channel encodes the silhouette occupancy information. Parts of the face
where not all silhouettes are projecting obtain an alpha-value of zero and stenciled out. This is done for
all faces of all silhouette cones respectively. The framework is tested by using four QVGA input cameras
with corresponding image processing PCs. The rendering server is a dual core Pentium4 1.7GHz PC with
a GeForce3 GPU. Together with view-dependent texturing, a framerate of 84fps is achieved.
L
OK
[36] proposes a completely different approach to novel-view generation. It uses a plane-sweep
technique like in H
ASENFRATZ
et al. [21]. This time, for view-dependent surface generation instead of
volumetric sampling. The slices are generated with non-uniform distance and increasing size according
to the frustum of the viewing camera. Unlike [21], a particular slice is rendered N times, where N is the
number of silhouettes. The volume intersection is computed by accumulating the projected silhouettes
using the stencil buffer. Since stencil and color buffer are not cleared after each slice, the depth buffer
finally contains the view-dependent surface of the visual hull. The data resides on the GPU and is not
made explicit. From the depth buffer a coarse mesh is generated and view-dependent textured. The system
uses five NTSC cameras with respective PCs for image processing. The reconstruction is obtained at 12-
15fps for a volume of 2.4×1.8×1.8m and a voxelsize of 10mm.
2.1.5
Conclusion
The works introduced so far are computing and using knowledge about a captured scene by employ-
ing shape from silhouette. Hence, a robust and fast image segmentation technique is essential. For the
proposed approach, a simple variant of color distance thresholding is used. Several modifications are
introduced to increase robustness.
PC cluster systems allow highly parallelized scalable implementations, where a whole 3D video process-
- 11 -

Chapter 2 - Related Work
ing pipeline can be performed in real-time. However, this results in a complex distributed algorithm and
an expensive static hardware setup. Using a single PC instead, the performance can be increased by using
a LOD representation. If GPU acceleration is leveraged, the reconstruction result has to be read back
to the CPU. However, this operation becomes a hard impact on performance as interior voxels are also
reconstructed. Even if a low volumetric resolution of 64×64×64 voxels is used. This problem does not
occur for the proposed approach as only surface voxels are read back. A labeling algorithm is proposed
to restore interior voxels later on the CPU if necessary.
If the goal is not an explicit 3D reconstruction but rendering of novel views, interactive framerates are
achieved using a single PC. The corresponding techniques perform either exact geometric intersection or
discrete space sampling. Both approaches are done either image- or scene-based. The main issue is, that
only a partial view-dependent surface is reconstructed. The discrete image-based methods are performing
a ray-based space sampling. This strategy is similar to the proposed approach. Both works are related
to sampling a CSG
8
scene by ray-casting [54], as the volume intersection of the generalized silhouette
cones can be modeled using CSG.
The proposed approach is actually related to both, discrete view-independent and view-dependent tech-
niques. View-independent is the fact, that an entire volumetric 3D model is generated. On the other hand,
this is achieved by ray-casting from multiple views around the scene. However, the difference is that
the proposed approach uses parallel rays to sample a regular grid and applies an interleaved sampling
technique to exploit coherence between all virtual views.
Similar to the proposed approach, all works handle a scene space of around 2×2×2m. However, the
volumetric resolution reaches from 8 to 30mm. The input images are usually captured with QVGA reso-
lution. This depends either on the bandwidth of the network or on the bandwidth of texture upload for the
GPU approaches. Moreover, a maximum number of five input cameras is used. This relates to either the
mentioned bandwidth issue or to limitations from the GPU architecture. The proposed approach employs
eight cameras and experiments with image resolutions up to XVGA. Considering image processing, all
systems use a PC cluster for this task. Usually, each camera is connected to a corresponding PC. Instead,
the proposed approach uses the cluster only for image capturing, while image processing is entirely per-
formed on the single server. This is a big difference, since the algorithm can also be applied in a setup,
where all cameras are connected directly with the server.
2.2
Shape from Photo-Consistency
Techniques, that reconstruct a 3D scene by shape from silhouette employ information about foreground
and background pixels. The 3D shape is known as volume intersection, since it represents the 3D in-
8
Constructive Solid Geometry
- 12 -

Chapter 2 - Related Work
tersection of the backprojected 2D silhouettes. Shape from photo-consistency is a generalization of this
approach with using additional greyscale or color information. Applying further constraints, these meth-
ods can achieve a tighter approximation of the true shape. The property of photo-consistency relates to the
fact, that the radiance from a particular (unknown) scene point should be consistent with the irradiance
measured in the corresponding image pixels from where the point is visible. For each scene point can
thus be determined, if it is consistent with the set of images. This is the inverse approach of the one taken
by image-based stereo algorithms. Instead of evaluating the pixels that are corresponding to a particular
scene point, they start from the images and try to find pixels that may correspond to the same scene point.
The coordinates of the point are then obtained by triangulation. Due to this, shape from photo-consistency
is also referred to as scene-based stereo. Similar to the term of photo-consistency, L
ECLERC
et al. [34]
define self-consistency as a pixel correspondence measure for evaluating image-based stereo algorithms.
When comparing both approaches, image-based stereo shows several drawbacks. In particular related
to the reconstruction of a global consistent scene. Also, the recovered points are arbitray located within
point clouds and hence, do not correspond to a voxel representation on a regular grid. However, since
image-based stereo algorithms (1) are related to the ray-based sampling employed for the proposed ap-
proach and (2) achieve real-time framerates for generation of novel views, they are considered for this
review. A survey of two-view stereo algorithms is provided by S
CHARSTEIN
and S
ZELISKI
in [57].
2.2.1
Foundations
The property of photo-consistency for restoring an unknown scene from a set of photographs is first intro-
duced by S
EITZ
and D
YER
[58]. This work defines an algorithm for reconstruction of a photo-consistent
shape from a sufficiently textured scene, called Voxel Coloring. Starting from an initial opaque volume,
incremental carving of inconsistent voxels yields a shape that converges towards the real scene. However,
due to voxel visibility considerations there exist restrictions on the camera placement. K
UTULAKOS
and
S
EITZ
[28] extend this work by defining the Space Carving algorithm which generalizes Voxel Coloring
towards arbitrary camera placement. Furthermore, they provide the theoretical foundation for the problem
of scene reconstruction from multiple images and define the photo hull as the maximal photo-consistent
shape that can be achieved using this technique. An introduction and discussion about the corresponding
theory is found in (ref section).
Similar research is carried out by C
ULBERTSON
et al. [8] proposing the Generalized Voxel Coloring
(GVC) algorithm which differs to Space Carving (SC), as another approach is taken to maintain visibility
9
.
Two versions are proposed using different volumetric data structures and reducing either (1) the number
of consistency checks or (2) the necessary memory footprint. While SC is realized by sweeping a plane
through the volume, GVC operates a on a surface voxel representation. In comparision with SC, GVC is
reported to achieve better reconstruction results.
9
Actually this work uses the term "Space Carving" somehow ambiguously. In [28] it is defined as a generic concept without
accounting for visibility. Hence, "Multi-Sweep Space Carving" is meant here instead.
- 13 -

Chapter 2 - Related Work
2.2.2
Performance of View-Independent Reconstruction
This section discusses approaches to increase the performance for view-independent reconstruction tech-
niques generating a full discrete 3D model.
2.2.2.1
CPU
To compensate for an inaccurate camera calibration, K
UTULAKOS
[27] extends the research of K
UTU
-
LAKOS
and S
EITZ
[28] by proposing the Approximate Space Carving (ASC) algorithm. SC evaluates
photo-consistency for a voxel by testing the consistency of the image pixels, from where the voxel is
visible. ASC extends this by incorporating a disc of pixels with radius r around the projection pixel. The
so called r-consistency is achieved, if there exists at least one pixel in each disc, making the entire set of
pixels photo-consistent. This defines a nested set of shapes, converging to the photo hull for r 0. [27]
suggests to implement a multi-resolution coarse-to-fine reconstruction using r-consistency to increase
performance. In this case, r corresponds to the size of the voxels footprint at a particular LOD. A coarse-
to-fine approach for the similar Voxel Coloring algorithm is already suggested earlier by P
ROCK
and
D
YER
[53]. This is also performed by analyzing the entire set of pixels where a voxel projects in each
image. Achieving equal quality, the speed-up of this method increases with volumetric resolution from
1.23× for 32
3
to 41.1× for 512
3
respectively.
P
ROCK
and D
YER
[53] also propose to use a multi-resolution approach to exploit time-coherency for
dynamic scenes. This is done by employing a coarse level of the shape in frame t as the initial volume
for frame t + 1. Its obvious that this strategy does not compensate for discontinuities between frames.
However, this can be ignored when capturing a human performance as movement is always continuous
10
.
By using this approach, P
ROCK
and D
YER
achieve a speed-up of 2×.
2.2.2.2
GPU Acceleration
S
AINZ
et al. [55] present an approach to GPU-accelerated reconstruction using a simplified version of the
multi plane-sweep implementation of the Space Carving algorithm from K
UTULAKOS
and S
EITZ
[28].
Here, a plane moves through a volumetric bounding box in six stages along each direction of the three
coordinate axes ±x, ±y and ±z. At each iteration, only the voxels that intersect with the current plane are
considered. As the goal lies in volumetric sampling of a regular grid, the destination camera is setup for
orthographic projection. When rendering a particular plane there are two main issues, namely (1) increas-
ing performance by computing the visual hull and (2) accounting for visibility. (1) When projecting the
silhouettes from all cameras on a particular plane, voxels outside the visual hull can be determined and
ignored. (2) To account for visibility when sweeping along the viewing direction, already reconstructed
voxels are rendered and illuminated from the direction of the input cameras in front of the plane. The
corresponding shadows are projected onto the current plane. For a distinct voxel, all visible light sources
10
Of course, this depends on the speed of movement and the framerate for image capturing.
- 14 -

Chapter 2 - Related Work
The following subsections will provide more detail about
the primary components of the algorithm.
3.1 Image Projection
The consistency check criterion evaluates the similarity of
the color values of a voxel footprints for each of the
views. A common way to determine how a voxel projects
to an image is to rasterize the voxel in the image plane.
Since this is a basic operation supported by the OpenGL
API, it may seem straight forward to render the voxel
from the camera's point of view. However, when the angle
of the camera in respect to the normal of the carving plane
is large and the voxel resolution is high, some voxels
might be projected to the same pixel locations due to
rounding from 3D floating point to integer pixel coordi-
nates. This may result in some missing projected voxels,
or some voxels with smaller footprints than they should
have, contributing erroneously to the consistency check.
The presented approach overcomes this by projecting the
images onto the carving plane using texture mapping
techniques.
The front faces of the voxels on the carving plane are ren-
dered from a virtual viewpoint perpendicular to the plane.
To maximize the rendered area of the carving plane on the
framebuffer, the view frustum is set to 90 degrees and the
virtual view point is located such that the carving plane
rendering fills the entire image plane. The projection of
the image is then performed with projective texture map-
ping (Figure 3).
During the initialization process all images are stored as
textures in the video card's memory. For each image that
has to be evaluated, the voxels of the plane are then ren-
dered with texturing enabled. To assign the proper texture
coordinates to a voxel, first the texture matrix stack is set
to the full reference camera projection matrix, that is the
intrinsic*extrinsic matrices of the camera. This generates
texture coordinates that are a perspective projection of
image coordinates from the camera's location. By assign-
ing the same 3D coordinates of the voxel corners as tex-
ture coordinates, the projection of the texture onto the
voxel surface is achieved.
This approach has two advantages, (1) each voxel pre-
sents a constant footprint for each view, facilitating the
registration of the footprints and the statistical calcula-
tions and (2) since OpenGL texture mapping interpolates
the textures, subpixel accuracy can be reached at no extra
cost.
To simplify the computation and avoid evaluating voxels
that are occluded by consistent voxels of previous itera-
tions of the same sweep, Kutulakos and Seitz used bit-
masks for each image to mark which pixels were already
assigned to a voxel footprint.
In our approach textures are projected onto the voxels of
the carving plane. Textures can then be modified to reflect
which voxels were already assigned by coloring them
black or creating a transparency mask. Unfortunately,
modifying the textures in OpenGL can be costly as it
implies rasterizing the voxels on each image, rescaling the
images to the right texture size and transferring them back
to texture memory. Furthermore, if an assigned voxel is
deleted later on in another sweep direction, the original
colors in the texture have to be restored. We have found a
better approach that draws the assigned voxels as shadow
polygons onto the carving plane (view Figure 4) instead.
For a specific view, the camera location is considered as a
light source, and the set of accumulated voxels for that
specific carving sweep is rendered in black using the pla-
nar projected shadow method described in [Blinn88].
Given the ground plane equation (our carving plane) and
the homogenous position of the light (the camera location
in 3D) a 4x4 matrix, called planar projected shadow
matrix, can be constructed that projects any 3D polygon
onto the ground plane based on the light position. If this
matrix is concatenated with OpenGL's modelview matrix,
the shadows will be rasterized on top of the ground plane.
The complete projection of a reference view onto the
carving plane can then be achieved as follows:
1) Render the color coded carving plane and store the
2) Render the projective texture onto the plane.
Figure 3: Texture projection onto carving plane
Figure 4: Shadow voxels projection
The following subsections will provide more detail about
the primary components of the algorithm.
3.1 Image Projection
The consistency check criterion evaluates the similarity of
the color values of a voxel footprints for each of the
views. A common way to determine how a voxel projects
to an image is to rasterize the voxel in the image plane.
Since this is a basic operation supported by the OpenGL
API, it may seem straight forward to render the voxel
from the camera's point of view. However, when the angle
of the camera in respect to the normal of the carving plane
is large and the voxel resolution is high, some voxels
might be projected to the same pixel locations due to
rounding from 3D floating point to integer pixel coordi-
nates. This may result in some missing projected voxels,
or some voxels with smaller footprints than they should
have, contributing erroneously to the consistency check.
The presented approach overcomes this by projecting the
images onto the carving plane using texture mapping
techniques.
The front faces of the voxels on the carving plane are ren-
dered from a virtual viewpoint perpendicular to the plane.
To maximize the rendered area of the carving plane on the
framebuffer, the view frustum is set to 90 degrees and the
virtual view point is located such that the carving plane
rendering fills the entire image plane. The projection of
the image is then performed with projective texture map-
ping (Figure 3).
During the initialization process all images are stored as
textures in the video card's memory. For each image that
has to be evaluated, the voxels of the plane are then ren-
dered with texturing enabled. To assign the proper texture
coordinates to a voxel, first the texture matrix stack is set
to the full reference camera projection matrix, that is the
intrinsic*extrinsic matrices of the camera. This generates
texture coordinates that are a perspective projection of
image coordinates from the camera's location. By assign-
ing the same 3D coordinates of the voxel corners as tex-
ture coordinates, the projection of the texture onto the
voxel surface is achieved.
This approach has two advantages, (1) each voxel pre-
sents a constant footprint for each view, facilitating the
registration of the footprints and the statistical calcula-
tions and (2) since OpenGL texture mapping interpolates
the textures, subpixel accuracy can be reached at no extra
cost.
To simplify the computation and avoid evaluating voxels
that are occluded by consistent voxels of previous itera-
tions of the same sweep, Kutulakos and Seitz used bit-
masks for each image to mark which pixels were already
assigned to a voxel footprint.
In our approach textures are projected onto the voxels of
the carving plane. Textures can then be modified to reflect
which voxels were already assigned by coloring them
black or creating a transparency mask. Unfortunately,
modifying the textures in OpenGL can be costly as it
implies rasterizing the voxels on each image, rescaling the
images to the right texture size and transferring them back
to texture memory. Furthermore, if an assigned voxel is
deleted later on in another sweep direction, the original
colors in the texture have to be restored. We have found a
better approach that draws the assigned voxels as shadow
polygons onto the carving plane (view Figure 4) instead.
For a specific view, the camera location is considered as a
light source, and the set of accumulated voxels for that
specific carving sweep is rendered in black using the pla-
nar projected shadow method described in [Blinn88].
Given the ground plane equation (our carving plane) and
the homogenous position of the light (the camera location
in 3D) a 4x4 matrix, called planar projected shadow
matrix, can be constructed that projects any 3D polygon
onto the ground plane based on the light position. If this
matrix is concatenated with OpenGL's modelview matrix,
the shadows will be rasterized on top of the ground plane.
The complete projection of a reference view onto the
carving plane can then be achieved as follows:
1) Render the color coded carving plane and store the
2) Render the projective texture onto the plane.
Figure 3: Texture projection onto carving plane
Figure 4: Shadow voxels projection
Figure 2.2: Multi Plane-Sweep Implementation of Space Carving (S
AINZ
et al. [55]).
A plane sweeps in each direction along the three coordinate axes ±x, ±y and ±z to test voxels for photo-consistency. To account
for visibility, the already reconstructed scene is illuminated from the reference views to project a shadow on the current sweep
plane.
and hence cameras are used to test photo-consistency (Figure 2.2). Experiments are carried out using a
Pentium4 2.2GHz server with a Nvidia Quadro Pro GPU. Using images from five cameras, reconstruction
takes 33s and 94s for volumetric resolutions of 64
3
and 128
3
respectively.
A similar system is proposed by Z
ACH
et al. [73]. While S
AINZ
et al. [55] maintain a volumetric
data structure on the CPU and massively perform expensive upload and read-back operations, Z
ACH
et
al. keep all necessary data on the GPU. When sweeping the plane along a particular viewing direction,
visibility is obtained by updating depthmaps for the already reconstructed volume in the reference views.
The work differs to S
AINZ
et al. [55] as the sweeps are carried out independent of each other. This
means, even if a voxel is visible from different subsets of cameras, it is not evaluated again. Performance
is tested using an AMD Athlon XP2000 server with ATI Radeon 9700 Pro GPU. A synthetic scene is
captured from 36 cameras, and images are transfered to the GPU in advance. Four sweeps are applied,
each time considering a subset of nine cameras. For a volumetric resolution of 64
3
and 128
3
voxels, a
framerate of 1.4fps and 0.7fps is obtained respectively.
2.2.3
Performance of View-Dependent Reconstruction
2.2.3.1
CPU
Like for shape from silhouette, real-time performance can be achieved for shape from photo-consistency
when generating novel-views. Similar to the former, a partial view-dependent representation of the scene
is obtained. In this spirit, S
LABAUGH
et al. [61] propose an image-based algorithm for reconstructing
the photo hull, called Image-Based Photo Hulls (IBPH). This is a direct extension of the Image-Based
Visual Hulls from M
ATUSIK
et al. [42]. IBPH uses a hybrid approach to increase performance. First,
IBVH is applied to fastly sample along the viewing rays until the visual hull is reached. A second step
continues ray-sampling with an adaption of Generalized Voxel Coloring to determine the photo-hull. To
further increase performance, a lower-resolution image sampling is applied. The color value for each ray
is determined by blending the pixels from where the corresonding 3D point is visible. Experiments are
- 15 -
Excerpt out of 146 pages

Details

Title
A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware
College
University of Weimar
1
Author
Year
2006
Pages
146
Catalog Number
V186283
ISBN (eBook)
9783869438023
ISBN (Book)
9783869431000
File size
14385 KB
Language
English
Tags
framework, real-time, reconstruction, space, carving, graphics, hardware
Quote paper
Christian Nitschke (Author), 2006, A Framework for Real-time 3D Reconstruction by Space Carving using Graphics Hardware, Munich, GRIN Verlag, https://www.grin.com/document/186283

• No comments yet.