Diploma Thesis, 2006
146 Pages, Grade: 1
Chapter 1 - Introduction
1 INTRODUCTION
As computational machinery and methods progress ever faster, traditional applications change and new ones emerge. Years ago, computer graphics and computer vision were two distinct fields, each with its own area of application. Recently, however, the two have been converging. Photorealistic models of real-world scenes are obtained not only by artificially creating shape and appearance, but also by reconstructing these models from photographic or video data of the real world. This makes a powerful contribution to our multimedia-influenced environment.
The ability to reconstruct dynamic real-world scenes from multiple video streams enables a wide range of applications. 3D video extends common 2D video in that it is view-independent. Hence, the choice of a fixed camera from which the scene is viewed shifts from the time and place of recording to the time and place of consumption. 3D video can be used in the context of personal and social human activities like entertainment (e.g. 3D games, 3D events [24], 3D TV [37], 3D video recorder [70]) and education (e.g. 3D books, 3D-based training) or for the preservation of common knowledge (e.g. ethnic dancing performances [39], special skills).
3D video is obtained by analyzing properties like scene geometry and color from multi-viewpoint video streams. This generally involves a high computational effort. Nevertheless, if the reconstruction can be computed in real-time, a broad field of further applications opens up. In the field of CSCW ^{1} , more realistic and immersive telepresence and conferencing systems are created by using 3D cues [44]. In this context, PRINCE et al. [52] propose a framework in which a user can interact with a dynamic model of a remotely captured collaborator, which is overlaid on the real world in a video-see-through HMD. Instead of inserting a dynamic scene into a real environment (AR ^{2} ), it is also possible to integrate it into a virtual world (MR ^{3} ). This enables seamless integration of real and virtual content with appropriate visual cues like reflections, shading, shadowing, occlusion and collision detection. Interactive systems for the reconstruction and insertion of actors into virtual worlds are proposed by HASENFRATZ et al. [21] and, more recently, POMI and SLUSALLEK [49], contributing to the field of virtual television studio techniques.
Dynamic 3D data may also be used as an intermediate result for further processing and information extraction. Scene analysis allows for object recognition or motion tracking. The latter is often applied as a non-intrusive approach to human motion capture [43].
Supporting tools for complex 3D model creation are powerful, but also expensive and difﬁcult to use.
^{1} Computer Supported Cooperative/Collaborative Work
^{2} Augmented Reality
^{3} Mixed Reality
Instead, active-light methods perform a controlled illumination, e.g. by projecting a coded pattern, to accomplish a more robust extraction of the 3D features. Passive-light techniques are also referred to as Shape from X, where X relates to the particular image cue that is used to infer the shape. Common cues are, e.g., stereo, shading, focus/defocus, motion, silhouettes and scene photo-consistency. While there are also methods that use uncalibrated cameras for shape recovery, the proposed approach assumes that the position and orientation of each camera relative to the world coordinate frame are known. General surveys of passive-light techniques are given by DYER [12] and SLABAUGH et al. [60].
As several shape from X approaches rely on special image cues being present, three general classes of 3D shape reconstruction and texturing from multiple photographs have been developed: (1) image-based stereo (IBS), (2) volumetric intersection (VI) and (3) photo-consistency (PC). Each has advantages and drawbacks. There is a general difference between IBS and VI/PC. IBS performs an image-based search to find corresponding pixels and generates 3D points by triangulation. The result is a depth map or a partial surface. This has some inherent disadvantages. (1) For the correspondence search to be efficient, the views must be close together (small baseline). (2) If more than two views are used, correspondences must be searched in each pair of views, resulting in a high complexity (multi-baseline). (3) For reconstructing a whole 360 degree scene model, many partial surfaces have to be computed for a set of reference viewpoints. Integrating these distinct surfaces into a globally consistent model can be complex, since the alignment is often carried out in a least-squares sense as proposed by CURLESS and LEVOY [10]. (4) Only a sparse point cloud is obtained when pixel correspondences or image features are hard to find. In general, there is no neighborhood relation between the 3D points, as they are not located on a regular grid. Hence, generating a mesh-based surface representation requires further computation. (5) Occlusion between the different views is not handled. This might lead to errors throughout the correspondence search.
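The triangulation step that IBS relies on can be illustrated with a minimal sketch. It assumes that each matched pixel has already been converted into a viewing ray (camera center plus direction) in world coordinates, and it uses the midpoint of the shortest segment between the two rays, which is one common choice and not necessarily the one used in the cited works. The function name and argument layout are hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return the 3D point closest to two viewing rays.

    c1, c2: camera centers; d1, d2: ray directions (3-tuples, need not be unit length).
    The result is the midpoint of the shortest segment connecting the two rays.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel, no stable triangulation")
    s = (b * e - c * d) / denom  # parameter along the first ray
    t = (a * e - b * d) / denom  # parameter along the second ray
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))
```

The denominator shrinks as the two rays become parallel, which makes the result increasingly sensitive to small errors in the ray directions. This is one reason why a small baseline, although convenient for the correspondence search, makes the recovered depth less reliable.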
Instead of the image-based search in IBS, VI/PC methods perform a volumetric scene modelling and thus compensate for these disadvantages. The methods operate directly in the world coordinate frame and test for a particular voxel whether it belongs to the scene or not. The images act as a set of constraints that the unknown scene has to fulfill. To limit the search space, an initial volume is defined in which the scene is known to be contained. The theoretical foundation of shape from silhouette (VI) and shape from photo-consistency (PC) is laid by LAURENTINI [29] and KUTULAKOS and SEITZ [28] respectively. Both approaches have advantages and drawbacks. Shape from silhouette relies on a high-quality image segmentation and generally cannot recover certain smooth convex and concave surface patches. The reconstructed shape is known as the visual hull of the scene. Nevertheless, silhouette-based methods are often applied where performance matters, as they are simple, robust and fast to compute. Shape from photo-consistency does not have these disadvantages. It enables a high-quality shape reconstruction and makes it possible to approximate a view-independent voxel color in a straightforward way. The resulting shape is known as the photo hull of the scene. However, the method suffers from a high computational cost. Interestingly, both techniques are related. Shape from photo-consistency can be seen as a generalization of shape from silhouette in which further constraints are added. Because of this, the visual hull is known to contain the photo hull. The proposed approach exploits this relation by computing the visual hull as the initial volume for the reconstruction of the photo hull.
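To make the notion of a photo-consistency test concrete, the following sketch checks whether the colors that one voxel projects to in several views agree. It assumes a simple per-channel standard deviation threshold, which is a textbook simplification and not the measure defined later in this thesis. The function name, the threshold value and the input layout are hypothetical, and visibility is assumed to be resolved beforehand.

```python
from statistics import pstdev

def is_photo_consistent(colors, threshold=15.0):
    """Decide whether one voxel is photo-consistent.

    colors: list of (r, g, b) samples, one per view in which the voxel is visible.
    The voxel is accepted if every color channel varies by at most `threshold`
    (population standard deviation) across the views.
    """
    if len(colors) < 2:
        # A single observation cannot contradict itself.
        return True
    return all(pstdev(channel) <= threshold for channel in zip(*colors))
```

Because the visual hull contains the photo hull, such a test only needs to be evaluated for voxels that already passed the silhouette test; voxels outside the visual hull can be discarded without looking at their colors.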
The extraction of the 3D geometry as a set of voxels is only the first part of a possible complete processing pipeline. Further tasks may include isosurface extraction, photorealistic texturing, insertion into virtual and real environments for rendering, as well as higher-level feature extraction for scene recognition. To perform the entire processing pipeline in real-time, attempts have been made to develop scalable parallel algorithms that are executed on PC cluster systems [39][21]. While these commonly provide the scalability and processing power to handle a high number of input views with good quality, they also introduce several drawbacks. These include a high hardware cost and a complex, bulky setup which restricts their usage to professional and static environments.
When focusing on real-time processing on a single PC, both algorithmic complexity and computational hardware have to be improved. Regarding the latter, besides rapid progress in CPU clock speed, there are assembly-level optimizations for the CPU, such as the MMX and SSE extensions from Intel. However, while these may speed up 3D shape reconstruction, they also occupy the CPU entirely, so that no other pipeline tasks may be performed.
A way to increase the performance of 3D shape recovery on a single PC while saving CPU cycles for other tasks is to leverage the power of current off-the-shelf Graphics Processing Units (GPU). The GPU is a separate processor alongside the CPU, originally introduced to free the CPU from an increasing amount of 3D graphics processing. GPUs are optimized to achieve a high throughput (streaming) on a special arithmetic instruction set (SIMD). For comparison, an Intel Pentium4 CPU at 3.6GHz achieves a peak of 14.4 GFLOPs for its SSE3 unit. For the Nvidia GPUs GeForce 7800GTX/512MB (G70), GeForce 7900GTX (G71) and GeForce 8800GTX (G80) it is 165, 250 and 518 GFLOPs respectively. GPUs are not only very fast processors, they are also developing at a higher rate of growth than CPUs. With an average yearly rate of 1.7×/2.3× for pixels/vertices per second compared to a rate of 1.4× for CPU performance, GPU growth outpaces Moore's Law for traditional microprocessors [47]. This divergence of the performance curves is related to fundamental architectural differences. CPUs aim at achieving a high performance in sequential computations. As these are very generic in nature, CPUs spend a good part of their processing power on non-computational tasks like branch prediction and caching. GPUs, in contrast, are specialized to achieve high arithmetic intensity by performing parallelized graphics computations. On the other hand, this specialized nature makes it difficult to employ a GPU as a processor for generic computations. Thus, algorithms have to be mapped to emulate rendering tasks. The proposed approach takes this route and employs the GPU for the task of real-time 3D reconstruction by shape from photo-consistency on a single PC. A recent overview of the GPU architecture and its related programming paradigms is given by OWENS et al. [47].
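The peak figures quoted above can be put into proportion with a few lines of arithmetic. The sketch below only restates the numbers from the text; the function names are hypothetical, and the compounding of the yearly growth rates over several years is an illustrative assumption rather than a measured result.

```python
# Peak figures as quoted in the text (GFLOPs); the keys are only labels.
CPU_PEAK_GFLOPS = 14.4  # Intel Pentium4 at 3.6GHz, SSE3 unit
GPU_PEAK_GFLOPS = {
    "GeForce 7800GTX/512MB (G70)": 165.0,
    "GeForce 7900GTX (G71)": 250.0,
    "GeForce 8800GTX (G80)": 518.0,
}

def peak_ratio(gpu_gflops, cpu_gflops=CPU_PEAK_GFLOPS):
    """How many times the GPU peak exceeds the CPU peak."""
    return gpu_gflops / cpu_gflops

def compounded_growth(yearly_rate, years):
    """Growth factor after `years` if the yearly rate stayed constant (an assumption)."""
    return yearly_rate ** years
```

With these numbers, `peak_ratio(518.0)` is roughly 36, so the G80 peak is about 36 times the quoted CPU peak. Peak values are upper bounds; the ratio achieved by a real reconstruction algorithm depends on how well it maps to the rendering pipeline.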
Most closely related to the proposed approach are the works on GPU-accelerated Space Carving by SAINZ et al. [55] and ZACH et al. [73]. These recover a full, explicit and view-independent volumetric scene model. However, since they explicitly compute visibility and process interior voxels, they are not real-time capable.
GPU-accelerated image-based methods for computing the photo hull such as LI et al. [35] allow for real-time execution but lack many other features. As these algorithms are dedicated to the generation of novel viewpoints, they only recover a partial, implicit and view-dependent scene model. If GPU acceleration is not used, the model is explicitly available on the CPU, as in SLABAUGH et al. [61]. Nevertheless, performance drops by a factor of about seven compared to [35], and real-time processing is only achieved by employing a coarse space sampling, which affects quality.
Multi-baseline image-based stereo (IBS) can also be computed in real-time by employing GPU acceleration as in YANG et al. [72] and YANG and POLLEFEYS [71]. These approaches provide an explicit representation as a disparity or depth map, but also suffer from the general drawbacks of IBS and hence do not generate an entire, view-independent scene model.
The proposed approach describes a framework which employs GPU hardware acceleration for shape from photo-consistency to generate an explicit volumetric representation of a scene that is captured from multiple calibrated viewpoints. Several works deal with this topic, all performing 3D reconstruction on a single PC. However, none provides all of the features proposed with this approach (Table 1.1). In particular, none of them combines real-time processing with the ability to generate an entire volumetric model. In realizing this, the proposed approach defines a novel photo-consistency measure that implicitly accounts for visibility and specular highlights. Moreover, the entire image processing is also executed GPU-accelerated on the single reconstruction PC and not on a PC cluster as generally done.
The remainder of this thesis is organized as follows. Chapter 2 surveys related work in the field of dynamic scene reconstruction by shape from silhouette and shape from photo-consistency. The focus lies on high-performance reconstruction and hardware acceleration. Chapter 3 introduces the theoretical basis for the proposed approach in three main parts. (1) Camera geometry is needed to relate images captured from multiple viewpoints to an unknown 3D scene. (2) When a photograph of a scene is taken, the light emitted towards the camera is captured. Therefore, light and color are discussed as far as necessary for this work. (3) Finally, the theory behind shape from silhouette and shape from photo-consistency is explained. Chapter 4 continues by defining the Basic Algorithm as a hybrid approach to scene reconstruction, introducing features like (1) a robust and fast image segmentation that accounts for shadows, (2) a set of nested volumes that are computed sequentially to speed up computation and (3) an implicit visibility computation. The Basic Algorithm is extended and mapped onto the GPU in chapter 5. The corresponding Advanced Algorithm enables efficient real-time computation by employing a multipass-rendering strategy. Chapter 6 explains and discusses several experiments that analyze the proposed system in terms of performance and reconstruction quality. Chapter 7 concludes with a summary and a discussion of limitations and future work. Due to their number, diagrams and figures relating
Chapter 2 - Related Work
2 RELATED WORK
The following chapter discusses related research in the ﬁeld of passive 3D reconstruction using images from multiple calibrated views. The aim is not only to present the different works, but also to connect the particular areas and set up the foundation for the proposed approach.
The objective of the proposed approach is to explicitly reconstruct the entire 3D shape of a dynamic scene in real-time on a single PC. Explicit means that a 3D scene description is obtained as output. Implicit methods only perform an internal 3D reconstruction as an intermediate processing step, e.g. in combination with generating novel views of the scene. Making data explicit is generally no problem if the CPU is used for reconstruction. However, the proposed approach employs the GPU, which involves an expensive read-back operation.
An important property is view-dependency, which means that distinct views of the same scene part also result in distinct 3D shapes. View-dependency may occur for techniques that do not compute the entire scene model. These include techniques that generate novel views or depth maps from existing views. The proposed approach aims at recovering an entire view-independent scene model. Here, further performance and quality issues come into play. First, reconstructing the entire scene needs more computation. Second, the resulting model should be globally consistent. This is generally not easily achieved by merging multiple partial surfaces from view-dependent approaches.
Using GPU acceleration ^{4} , real-time reconstruction in the mentioned way is possible for the fast shape from silhouette technique. However, this limits the quality of the result in several ways. To compensate for this, several hybrid approaches have been suggested that generate a refined shape using shape from photo-consistency. However, these methods do not achieve real-time or interactive framerates, which are only obtained with GPU-accelerated view-dependent reconstruction.
The following review is separated into two parts. First, works using the fast and efficient shape from silhouette technique are discussed in terms of their performance. This is related either to algorithmic improvement or to the mapping onto a CPU cluster or the GPU. These techniques are chosen as they are directly related to this work. The second part reviews approaches that generate a better-quality shape approximation using photo-consistency or image-based stereo.
^{4} In this scope, GPU acceleration relates to the acceleration of any shape-modifying operation. Thus, CPU-based reconstruction with GPU-accelerated rendering (and clipping) does not fall under this category.
If there exists a set of images from different viewpoints in which the scene is entirely included, volume intersection can be used to compute a shape approximation which is itself a subset of the convex hull and known to contain the real object. The approach is rather simple and based on the assumption that any ray from the camera center through a pixel in which the original scene is visible must intersect the (unknown) scene. If all such rays are constructed for the whole set of images, their intersection in 3D is referred to as volume intersection.
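In a discrete setting, the intersection of all silhouette rays can be approximated by testing voxels against the silhouettes. The sketch below assumes that each view is given as a pair of a projection function and a binary silhouette image; both the function names and this data layout are hypothetical and chosen only for illustration, not taken from any of the cited systems.

```python
def inside_silhouette(silhouette, pixel):
    """True if the pixel lies within the image bounds and on the silhouette."""
    u, v = pixel
    return 0 <= v < len(silhouette) and 0 <= u < len(silhouette[0]) and silhouette[v][u]

def carve_visual_hull(voxels, views):
    """Keep the voxels whose projection falls on the silhouette in every view.

    voxels: iterable of (x, y, z) voxel centers.
    views:  list of (project, silhouette) pairs, where project maps a voxel
            center to an integer pixel (u, v) and silhouette is a 2D list of
            booleans indexed as silhouette[v][u].
    """
    return [
        voxel
        for voxel in voxels
        if all(inside_silhouette(sil, project(voxel)) for project, sil in views)
    ]
```

A small usage example with two orthographic views of a 3×3×3 grid:

```python
grid = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
column = [[False, True, False]] * 3            # only the middle column is foreground
front = (lambda p: (p[0], p[1]), column)       # looks along the z axis
side = (lambda p: (p[2], p[1]), column)        # looks along the x axis
hull = carve_visual_hull(grid, [front, side])  # voxels with x == 1 and z == 1
```

A single view outside whose silhouette the voxel projects is enough to discard it, which is why the test can stop at the first failing view.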
To determine the pixels in which an object is visible, image segmentation is performed ^{5} . The resulting set of foreground pixels in an image is called the silhouette of the 3D scene and is represented as a binary image. Therefore, another common synonym for volume intersection methods is shape-from-silhouette. For the segmentation, knowledge about the background is obtained in advance by taking images from all viewpoints, this time without any scene present. In the simplest case, the color of a background pixel is compared with the corresponding pixel in the real image. If the difference exceeds some threshold, the pixel is assumed to belong to the foreground. More robust but more complex approaches such as HORPRASERT et al. [22] generate a color distribution function for each background pixel. Moreover, background and foreground can be expressed by a non-parametric model which is adapted at runtime to compensate for various effects, as in TOYAMA et al. [66] and more recently in ELGAMMAL et al. [13]. While these approaches work on the CPU in real-time, image segmentation on the GPU achieves an even higher performance. GRIESSER [20] presents a robust adaptive implementation achieving 4ms for a VGA image. A theoretical analysis of image segmentation related to the color of the background is done by SMITH and BLINN [62]. Finally, a complete and recent overview of image segmentation techniques is given by ZHANG [75].
When a set of silhouette images is available, volume intersection can be performed. This is first described by MARTIN and AGGARWAL [38] and often employed in subsequent works. The theoretical foundation is first established by LAURENTINI [29], who proposes a problem definition and introduces the visual hull as the best possible shape approximation that can be obtained from volume intersection. LAURENTINI [30] extends the theory and, in [31], discusses practical estimates for the necessary number of silhouettes depending on scene complexity. A more recent survey of visual hull related works is found in [32]. An introduction to and discussion of the corresponding theory is found in (ref section).
The volume intersection is represented either as an exact geometrical object or as a discrete volume. A geometric representation is obtained by intersecting the polyhedra of the backprojected silhouette cones in 3D or by performing image-based polygon intersection in 2D. In the case of a discrete representation, there exist different strategies for volumetric and image-based approaches. Volumetric approaches project and intersect the silhouettes on slices of voxels. Image-based methods perform ray casting to find the 3D intersections with the visual hull.
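The simplest case described above, a per-pixel comparison against a previously captured background image, might look as follows. The distance measure (largest per-channel difference) and the threshold value are assumptions made for this sketch; the cited works use more robust statistical models.

```python
def color_distance(a, b):
    """Largest absolute per-channel difference between two (r, g, b) colors."""
    return max(abs(x - y) for x, y in zip(a, b))

def segment_foreground(image, background, threshold=30):
    """Return a binary silhouette: True where a pixel differs from the background.

    image, background: 2D lists of (r, g, b) tuples with identical dimensions.
    """
    return [
        [color_distance(pixel, reference) > threshold
         for pixel, reference in zip(image_row, background_row)]
        for image_row, background_row in zip(image, background)
    ]
```

A fixed threshold of this kind reacts to shadows and illumination changes just as it reacts to real foreground, which motivates the more elaborate color models mentioned in the text.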
^{5} The technique is also referred to as silhouette extraction or background subtraction.
This section discusses attempts at increasing the performance of view-independent reconstruction, where a globally consistent 3D model is generated.
Performance can be improved in several ways. Many discrete approaches realize a LOD ^{6} reconstruction using an octree representation. The space is sampled in a coarse-to-ﬁne hierarchy depending on scene complexity. An important work in this area is presented by POTMESIL [50] and signiﬁcantly improved by SZELISKI [64].
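The coarse-to-fine idea behind an octree reconstruction can be sketched as a recursive subdivision that only refines cells on the boundary of the shape. The classifier callback, its three return values and the function name are hypothetical; in a silhouette-based system the classifier would project the cell into all views, which is omitted here.

```python
def build_octree(origin, size, classify, max_depth):
    """Return the occupied cells as a list of (origin, size) cubes.

    classify(origin, size) must return "inside", "outside" or "boundary".
    Boundary cells are subdivided until max_depth is reached; at that depth
    they are kept as occupied, which is a conservative assumption.
    """
    state = classify(origin, size)
    if state == "outside":
        return []
    if state == "inside" or max_depth == 0:
        return [(origin, size)]
    half = size / 2.0
    cells = []
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                child = (origin[0] + dx * half,
                         origin[1] + dy * half,
                         origin[2] + dz * half)
                cells.extend(build_octree(child, half, classify, max_depth - 1))
    return cells
```

Large homogeneous regions are resolved by a single classification, so the number of tests grows with the surface area of the scene rather than with its volume. In the worst case, where every cell is a boundary cell, a depth of 2 already yields 64 leaf cells.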
Fast and scalable systems propose to use a PC cluster for reconstruction. These readily achieve real-time performance. The drawback lies in a static system setup involving a lot of expensive hardware. The experimental results for the following works are all achieved for capturing a dynamic dancer scene. BOROVIKOV and DAVIS [3] utilize 14 cameras with 16 PentiumII 450 MHz CPUs, connected via 100Mbit Ethernet. Reconstruction is done with an octree approach in a volume of 2×2×2m. At voxel sizes of 31, 16 and 8mm, a performance of 10, 8 and 4fps is obtained respectively. While this system depends mainly on octree depth and the corresponding transmission time, WU and MATSUYAMA [69] use a 10Gbit Myrinet network to increase bandwidth and scalability. Their framework handles a varying number of VGA cameras, each connected to a dual PentiumIII 1GHz CPU. When using six cameras in a volume of 2×2×2m and a voxel size of 20mm, 26fps are obtained. The framerate can be further increased with the number of PCs. Using 10 PCs, the system runs at 43fps. The work of UEDA et al. [67] introduces a complete PC cluster based reconstruction and rendering architecture. It uses nine VGA cameras and 21 Pentium4 3GHz CPUs, also connected via a 10Gbit Myrinet network. For a volume of 2.56×2.56×2.56m and a voxel size of 20mm, they report an overall performance of 10fps. This framerate is measured for the entire processing pipeline, where the single processes are executed in parallel.
^{6} Level Of Detail
The work of HASENFRATZ et al. [21] is important, as it is the only known system making the GPU-reconstructed view-independent volume intersection explicit. It proposes a real-time framework for the reconstruction and insertion of actors into virtual worlds. The reconstruction is done by sweeping a plane through the volume. At each position, the silhouette textures are projected onto the plane and combined by multitexturing. If all silhouettes project onto a particular voxel, it is contained in the volume intersection. The data is made explicit by reading back the slices from the framebuffer to the CPU. The system uses four VGA cameras with respective PCs performing the 2D processing. The images are then transferred via a 1Gbit network to an Onyx 3000-IR3 server with eight R12000 400MHz CPUs. For the reconstruction, a volume of 2×2×2m with a rather large voxel size of 30mm is used. With a texture upload of 2ms and a reconstruction of 13ms, a framerate of 67fps is achieved. The read-back implies a further performance impact, as interior voxels also have to be transferred.
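The reported framerate follows directly from the two stage timings if the stages run one after another for each frame. The helper below is a hypothetical illustration of that arithmetic only; it does not model the additional read-back cost mentioned in the text.

```python
def frames_per_second(*stage_times_ms):
    """Framerate of a pipeline whose stages run sequentially for each frame."""
    total_ms = sum(stage_times_ms)
    if total_ms <= 0:
        raise ValueError("total stage time must be positive")
    return 1000.0 / total_ms

# Texture upload (2ms) plus reconstruction (13ms) give 15ms per frame,
# i.e. 1000 / 15 = 66.7, which rounds to the reported 67fps.
rate = frames_per_second(2, 13)
```

Any additional per-frame stage, such as the read-back of the slices, adds to the sum and lowers the framerate accordingly.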
The fundamental difference of the following works is that they aim at rendering novel views. That means the 3D data is usually only view-dependent and partially reconstructed. This is related to IBR ^{7} (BUEHLER et al. [6]) in that additional 3D information is leveraged.
MATUSIK et al. [42] describe an efficient image-based sampling approach for computing and texturing visual hulls, called Image-Based Visual Hulls (IBVH). A ray-based reconstruction is done for the pixels in which the scene is visible in the destination image (Figure 2.1). In [42], performance depends mostly on the number of pixels and lies at about 8fps for an average VGA image using a quad-CPU 550MHz server. For the test, four cameras are used at a resolution of 256×256 with corresponding PCs for image processing. MATUSIK et al. present an improved adaptive version in [40]. Depending on the scene, a speed-up of 4-9× is achieved. MATUSIK et al. continue their work on visual hull rendering and present a different approach in [41]. For the novel view, an exact geometrical polyhedral representation is computed by employing image-based polygon intersections in 2D. View-dependent texturing is done on the GPU when rendering, using the approach of BUEHLER et al. [6]. Four cameras are employed at QVGA resolution, each with a respective PC. Using a dual PentiumIII 933MHz CPU, framerates of 15fps are reported for pure geometry construction and 30fps for the rendering. PRINCE et al. [52] propose a framework for real-time reconstruction and novel-view rendering in an AR telepresence environment. Here, a remotely captured human is rendered on an ARToolkit [25] marker for the viewpoint of a person wearing video-see-through glasses. Interaction with the 3D video is done by adjusting the marker's position and orientation. The image-based reconstruction is done similarly to MATUSIK et al. [42]. Image acquisition is performed using 14 VGA cameras, connected to five PCs doing the image processing. Images are distributed to a rendering server over a 1Gbit network. With a rendered resolution of 384×288, the system achieves a framerate of 25fps.
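At the core of the ray-based sampling lies a one-dimensional operation: each reference view contributes a list of intervals along the desired viewing ray, and the ray's intersection with the visual hull is the intersection of all these lists. The sketch below shows only this interval step; it assumes that the per-view intervals have already been computed and lifted back onto the ray, and the function names are hypothetical.

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of disjoint (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance the list whose current interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

def ray_hull_intervals(per_view_intervals):
    """Intersect the interval lists of all reference views along one ray."""
    intervals = [(0.0, float("inf"))]  # the whole ray before any constraint
    for view_intervals in per_view_intervals:
        intervals = intersect_intervals(intervals, view_intervals)
    return intervals
```

For example, `ray_hull_intervals([[(1, 4), (6, 9)], [(2, 7)], [(0, 3), (6.5, 10)]])` returns `[(2, 3), (6.5, 7)]`. The start of the first remaining interval is the visible surface point of the visual hull along that ray.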
The works of MATUSIK et al. perform image-based volume intersection and can thus decrease algorithmic complexity for novel-view rendering on the CPU. In contrast to this, LI et al. [?] propose an approach for novel-view rendering entirely on the GPU. For a particular silhouette, all faces of the corresponding 3D silhouette cone are rendered sequentially and projectively textured from all other views.
^{7} Image-Based Rendering
Figure 2.1: Image-based Ray Sampling of the Visual Hull (IBVH) (MATUSIK et al. [42]). A desired novel view of the visual hull is rendered by ray-based sampling. (1) A 3D ray is projected into the reference views to (2) determine the intersections with the visual hull. (3) Backprojection of the intersections from all reference views leads to the 3D intersection with the visual hull. Sampling for the novel view is done in scanlines to determine which rays have to be sampled.
For these textures, the alpha-channel encodes the silhouette occupancy information. Parts of the face where not all silhouettes are projecting obtain an alpha-value of zero and stenciled out. This is done for all faces of all silhouette cones respectively. The framework is tested by using four QVGA input cameras with corresponding image processing PCs. The rendering server is a dual core Pentium4 1.7GHz PC with a GeForce3 GPU. Together with view-dependent texturing, a framerate of 84fps is achieved.
LOK [36] proposes a completely different approach to novel-view generation. It uses a plane-sweep technique like in HASENFRATZ et al. [21]. This time, for view-dependent surface generation instead of volumetric sampling. The slices are generated with non-uniform distance and increasing size according to the frustum of the viewing camera. Unlike [21], a particular slice is rendered N times, where N is the number of silhouettes. The volume intersection is computed by accumulating the projected silhouettes using the stencil buffer. Since stencil and color buffer are not cleared after each slice, the depth buffer finally contains the view-dependent surface of the visual hull. The data resides on the GPU and is not
plementation of IBVHisect that takes O(lkn 2 ). made explicit. From the depth buffer a coarse mesh is generated and view-dependent textured. The system ^{n explicit volu-Our approach to computing the visual hull has two distinct char-} follows.
^{of the projected acteristics: it is computed in the image space of the reference} Given two camera views, a reference view r and a desired
uses ﬁve NTSC cameras with respective PCs for image processing. The reconstruction is obtained at 12- omthe volume. images and the resulting representation is viewpoint dependent. ^{IBVHisect (intervalImage &d, refImList R){} view d, we consider the set of planes that share the line connect- ^{for referenceImage r in R} ^{e. The resulting The advantage of performing geometric computations in image} ing the cameras’ centers. These planes are called epipolar planes. 15fps for a volume of 2.4×1.8×1.8m and a voxelsize of 10mm. ^{computeSilhouetteEdges (r)} ^{l hull according space is that it eliminates the resampling and quantization artifacts} ^{for each pixel p in desiredImage d do} Each epipolar plane projects to a line in each of the two images,
^{e of our viewthat plague volumetric approaches. We limit our sampling to the} p.intervals = {0..inf}
called an epipolar line. In each image, all such lines intersect at a ^{for each referenceImage r in R} ^{ulting from this pixels of the desired image, resulting in a view-dependent visual-} ^{bins = constructBins(r.caminfo, r.silEdges, d.caminfo)} common point, called the epipole, which is the projection of one ^{hull representation. In fact, our IBVH representation is equivalent} ^{for each scanline s in d} 2.1.5 Conclusion of the camera's center onto the other camera's view plane [9]. ^{have been de- to computing exact 3D silhouette cone intersections and rendering} incDec order = traversalOrder(r.caminfo,d.caminfo,s)
^{resetBinPositon(b)} ^{but most are ill} As a scanline of the desired view is traversed, each pixel the result with traditional rendering methods.
^{for each pixel p in s according to order} ^{Rappoport [24], Our technique for computing the visual hull is analogous to} projects to an epipolar line segment in r. These line segments ray3D ry3 = compute3Dray(p,d.camInfo)
^{finding CSG intersections using a ray-casting approach [25].} The works introduced so far are computing and using knowledge about a captured scene by employ- ^{lineSegment2Dl2 = project3Dray(ry3,r.camInfo)} emanate from the epipole e dr , the image of d’s center of projection
slope m = ComputeSlope(l2,r.caminfo,d.caminfo)
onto r’s image plane (see Figure 3), and trace out a “pencil” of ^{updateBinPosition(b,m)} ing shape from silhouette. Hence, a robust and fast image segmentation technique is essential. For the
^{intervals int2D = calcIntervals(l2,b.currentbin)} epipolar lines in r. The slopes of these epipolar line segments will
intervals int3D = liftIntervals(int2D,r.camInfo,ry3)
either increase or decrease monotonically depending on the direc- proposedapproach, a simple variant of color distance thresholding is used. Several modiﬁcations are p.intervals = p.intervals ISECT int3D
^{}} tion of traversal (Green arc in Figure 3). We take advantage of this
introduced to increase robustness. monotonicity to compute silhouette intersections for the whole
scanline incrementally.
^{2} Sorting the contour vertices takes O(nl log(nl)) and binning takes O(nl 2 ).
Sorting and binning over k reference views takes O(knl ^{log(nl)) and} PC cluster systems allow highly parallelized scalable implementations, where a whole 3D video process-
O(knl
2
)
correspondingly.In our setting,
l << n
^{so we view this preproc-}
^{1}
We reference images also have
O(n
2
)
pixels. essing stage as negligible.
Page 12
Chapter 2 - Related Work
ing pipeline can be performed in real-time. However, this results in a complex distributed algorithm and an expensive static hardware setup. Using a single PC instead, the performance can be increased by using a LOD representation. If GPU acceleration is leveraged, the reconstruction result has to be read back to the CPU. However, this operation becomes a hard impact on performance as interior voxels are also reconstructed. Even if a low volumetric resolution of 64×64×64 voxels is used. This problem does not occur for the proposed approach as only surface voxels are read back. A labeling algorithm is proposed to restore interior voxels later on the CPU if necessary.
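The restoration of interior voxels can be illustrated with a small sketch. It assumes that the surface voxels form a closed shell and that the empty space is flood-filled from the grid boundary, so that everything which is neither surface nor reachable from outside is labeled interior. The function name `restore_interior` and the 6-connected flood fill are illustrative assumptions, not the labeling algorithm of the proposed approach.

```python
from collections import deque

def restore_interior(surface, size):
    """Return the set of interior voxels enclosed by `surface`.

    surface: set of (x, y, z) surface voxels, assumed to form a closed shell.
    size:    (nx, ny, nz) grid dimensions.
    """
    nx, ny, nz = size
    outside = set()
    queue = deque()
    # Seed the flood fill with every non-surface voxel on the grid boundary.
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                on_border = x in (0, nx - 1) or y in (0, ny - 1) or z in (0, nz - 1)
                if on_border and (x, y, z) not in surface:
                    outside.add((x, y, z))
                    queue.append((x, y, z))
    # 6-connected breadth-first fill of the exterior empty space.
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in steps:
            n = (x + dx, y + dy, z + dz)
            if (0 <= n[0] < nx and 0 <= n[1] < ny and 0 <= n[2] < nz
                    and n not in surface and n not in outside):
                outside.add(n)
                queue.append(n)
    # Whatever is neither surface nor exterior must be enclosed.
    return {(x, y, z)
            for x in range(nx) for y in range(ny) for z in range(nz)
            if (x, y, z) not in surface and (x, y, z) not in outside}
```

Only the surface set has to be transferred from the GPU in this scheme; the fill itself touches each voxel a constant number of times.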
If the goal is not an explicit 3D reconstruction but the rendering of novel views, interactive framerates are achieved using a single PC. The corresponding techniques perform either exact geometric intersection or discrete space sampling, and both can be carried out image-based or scene-based. The main limitation is that only a partial, view-dependent surface is reconstructed. The discrete image-based methods perform a ray-based space sampling, a strategy similar to the proposed approach. Both are related to sampling a CSG ^{8} scene by ray-casting [54], as the volume intersection of the generalized silhouette cones can be modeled using CSG.
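The relation to CSG ray-casting can be made concrete with a minimal sketch. It assumes that, for one viewing ray, every silhouette cone has already been reduced to a sorted list of disjoint depth intervals in which the ray is inside the cone; the volume intersection along the ray is then a 1D interval intersection. The function names and the interval representation are hypothetical.

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of disjoint (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            result.append((lo, hi))
        # Advance the list whose current interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

def ray_hull_intervals(per_cone_intervals, far=float("inf")):
    """Intersect the ray intervals of all silhouette cones."""
    occupied = [(0.0, far)]
    for intervals in per_cone_intervals:
        occupied = intersect_intervals(occupied, intervals)
    return occupied
```

For example, the cone intervals `[(1, 4), (6, 9)]`, `[(2, 7)]` and `[(0, 3), (6.5, 8)]` leave the ray inside the volume intersection on `[(2, 3), (6.5, 7)]`.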
The proposed approach is actually related to both discrete view-independent and view-dependent techniques. It is view-independent in that an entire volumetric 3D model is generated. On the other hand, this model is obtained by ray-casting from multiple views around the scene, as in view-dependent techniques. The difference is that the proposed approach uses parallel rays to sample a regular grid and applies an interleaved sampling technique to exploit coherence between all virtual views.
Similar to the proposed approach, all works handle a scene space of around 2×2×2m. However, the volumetric resolution ranges from 8 to 30mm. The input images are usually captured at QVGA resolution, which is dictated either by the bandwidth of the network or, for the GPU approaches, by the bandwidth of texture upload. Moreover, at most five input cameras are used. This relates either to the mentioned bandwidth issue or to limitations of the GPU architecture. The proposed approach employs eight cameras and experiments with image resolutions up to XVGA. Regarding image processing, all systems use a PC cluster for this task, where usually each camera is connected to a corresponding PC. Instead, the proposed approach uses the cluster only for image capturing, while image processing is performed entirely on the single server. This is a significant difference, since the algorithm can also be applied in a setup where all cameras are connected directly to the server.
^{8} Constructive Solid Geometry
Techniques that reconstruct a 3D scene by shape from silhouette employ information about foreground and background pixels. The 3D shape is known as volume intersection, since it represents the 3D intersection of the backprojected 2D silhouettes. Shape from photo-consistency is a generalization of this approach that uses additional greyscale or color information. By applying further constraints, these methods can achieve a tighter approximation of the true shape. The property of photo-consistency relates to the fact that the radiance from a particular (unknown) scene point should be consistent with the irradiance measured in the corresponding image pixels from where the point is visible. For each scene point it can thus be determined whether it is consistent with the set of images. This is the inverse of the approach taken by image-based stereo algorithms. Instead of evaluating the pixels that correspond to a particular scene point, they start from the images and try to find pixels that may correspond to the same scene point. The coordinates of the point are then obtained by triangulation. For this reason, shape from photo-consistency is also referred to as scene-based stereo. Similar to the term photo-consistency, LECLERC et al. [34] define self-consistency as a pixel correspondence measure for evaluating image-based stereo algorithms. When comparing both approaches, image-based stereo shows several drawbacks, in particular regarding the reconstruction of a globally consistent scene. Also, the recovered points are arbitrarily located within point clouds and hence do not correspond to a voxel representation on a regular grid. However, since image-based stereo algorithms (1) are related to the ray-based sampling employed for the proposed approach and (2) achieve real-time framerates for the generation of novel views, they are considered for this review. A survey of two-view stereo algorithms is provided by SCHARSTEIN and SZELISKI in [57].
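A photo-consistency decision for a single scene point might be sketched as follows. The sketch assumes that the colors of the pixels from which the point is visible have already been collected, and it uses the RMS deviation from the mean color with a fixed threshold as the consistency measure. Both the measure and the default threshold are assumptions chosen for illustration; the reviewed works use their own criteria.

```python
import math

def is_photo_consistent(colors, threshold=10.0):
    """Check whether the colors observed for one scene point agree.

    colors:    list of (r, g, b) samples from the views that see the point.
    threshold: maximum allowed RMS deviation from the mean color (assumed).
    """
    if len(colors) < 2:
        return True  # nothing to contradict a single observation
    n = len(colors)
    mean = [sum(c[k] for c in colors) / n for k in range(3)]
    variance = sum((c[k] - mean[k]) ** 2 for c in colors for k in range(3)) / n
    return math.sqrt(variance) <= threshold
```

A point seen as nearly the same red in three views passes the test, while a point that appears red in one view and green in another is rejected.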
The property of photo-consistency for restoring an unknown scene from a set of photographs is first introduced by SEITZ and DYER [58]. This work defines an algorithm for reconstruction of a photo-consistent shape from a sufficiently textured scene, called Voxel Coloring. Starting from an initial opaque volume, incremental carving of inconsistent voxels yields a shape that converges towards the real scene. However, due to voxel visibility considerations there exist restrictions on the camera placement. KUTULAKOS and SEITZ [28] extend this work by defining the Space Carving algorithm, which generalizes Voxel Coloring to arbitrary camera placement. Furthermore, they provide the theoretical foundation for the problem of scene reconstruction from multiple images and define the photo hull as the maximal photo-consistent shape that can be achieved using this technique. An introduction to and discussion of the corresponding theory is found in (ref section).
Similar research is carried out by CULBERTSON et al. [8], who propose the Generalized Voxel Coloring (GVC) algorithm. It differs from Space Carving (SC) in the approach taken to maintain visibility ^{9} . Two versions are proposed, using different volumetric data structures and reducing either (1) the number of consistency checks or (2) the necessary memory footprint. While SC is realized by sweeping a plane through the volume, GVC operates on a surface voxel representation. In comparison with SC, GVC is reported to achieve better reconstruction results.
^{9} Actually, this work uses the term "Space Carving" somewhat ambiguously. In [28] it is defined as a generic concept without accounting for visibility. Hence, "Multi-Sweep Space Carving" is meant here instead.
This section discusses approaches to increase the performance for view-independent reconstruction techniques generating a full discrete 3D model.
To compensate for an inaccurate camera calibration, KUTULAKOS [27] extends the research of KUTULAKOS and SEITZ [28] by proposing the Approximate Space Carving (ASC) algorithm. SC evaluates photo-consistency for a voxel by testing the consistency of the image pixels from where the voxel is visible. ASC extends this by incorporating a disc of pixels with radius r around the projection pixel. The so-called r-consistency is achieved if there exists at least one pixel in each disc such that the resulting set of pixels is photo-consistent. This defines a nested set of shapes, converging to the photo hull for r → 0. [27] suggests implementing a multi-resolution coarse-to-fine reconstruction using r-consistency to increase performance. In this case, r corresponds to the size of the voxel's footprint at a particular LOD. A coarse-to-fine approach for the similar Voxel Coloring algorithm was already suggested earlier by PROCK and DYER [53]. It also analyzes the entire set of pixels to which a voxel projects in each image. At equal quality, the speed-up of this method increases with volumetric resolution from 1.23× for 32 ^{3} to 41.1× for 512 ^{3} .
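The definition of r-consistency can be illustrated with a brute-force sketch. It assumes grayscale images stored as lists of rows, an integer radius r, and a simple agreement test (maximum minus minimum intensity below a threshold); these choices, as well as the function names, are assumptions for illustration and not the formulation used in [27].

```python
from itertools import product

def disc_pixels(image, cx, cy, r):
    """Collect the values of all pixels within distance r of (cx, cy)."""
    h, w = len(image), len(image[0])
    values = []
    for y in range(max(0, cy - r), min(h, cy + r + 1)):
        for x in range(max(0, cx - r), min(w, cx + r + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                values.append(image[y][x])
    return values

def is_r_consistent(images, projections, r, threshold=10):
    """True if one pixel can be picked from every disc so that all picks agree.

    images:      list of 2D grayscale images (lists of rows).
    projections: list of (x, y) pixel positions of the voxel, one per image.
    Agreement is assumed to mean max - min <= threshold.
    """
    discs = [disc_pixels(img, x, y, r) for img, (x, y) in zip(images, projections)]
    # Brute force over all combinations; fine for a sketch, exponential in views.
    return any(max(pick) - min(pick) <= threshold for pick in product(*discs))
```

With r = 0 each disc contains only the projection pixel, so the test reduces to the plain consistency check; enlarging r can only turn an inconsistent voxel into a consistent one, which is what makes the resulting shapes nested.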
PROCK and DYER [53] also propose to use a multi-resolution approach to exploit time-coherency for dynamic scenes. This is done by employing a coarse level of the shape in frame t as the initial volume for frame t + 1. Obviously, this strategy does not compensate for discontinuities between frames. However, this can be ignored when capturing a human performance, as movement is always continuous ^{10} . By using this approach, PROCK and DYER achieve a speed-up of 2×.
^{10} Of course, this depends on the speed of movement and the framerate for image capturing.
SAINZ et al. [55] present an approach to GPU-accelerated reconstruction using a simplified version of the multi plane-sweep implementation of the Space Carving algorithm from KUTULAKOS and SEITZ [28]. Here, a plane moves through a volumetric bounding box in six stages, along each direction of the three coordinate axes ±x, ±y and ±z. At each iteration, only the voxels that intersect the current plane are considered. As the goal lies in volumetric sampling of a regular grid, the destination camera is set up for orthographic projection. When rendering a particular plane there are two main issues, namely (1) increasing performance by computing the visual hull and (2) accounting for visibility. (1) When projecting the silhouettes from all cameras onto a particular plane, voxels outside the visual hull can be determined and ignored. (2) To account for visibility when sweeping along the viewing direction, already reconstructed voxels are rendered and illuminated from the direction of the input cameras in front of the plane. The corresponding shadows are projected onto the current plane. For a distinct voxel, all visible light sources, and hence cameras, are used to test photo-consistency (Figure 2.2). Experiments are carried out using a Pentium4 2.2GHz server with an Nvidia Quadro Pro GPU. Using images from five cameras, reconstruction takes 33s and 94s for volumetric resolutions of 64 ^{3} and 128 ^{3} respectively.
Figure 2.2: Multi Plane-Sweep Implementation of Space Carving (SAINZ et al. [55]). A plane sweeps in each direction along the three coordinate axes ±x, ±y and ±z to test voxels for photo-consistency. To account for visibility, the already reconstructed scene is illuminated from the reference views to project a shadow on the current sweep plane.
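The shadow projection used for visibility can be sketched with the standard planar projected shadow matrix, assuming that each camera center acts as a point light and that the sweep plane is given by its plane equation. The matrix is S = (P·L) I − L Pᵀ for plane P and homogeneous light position L. The helper names are hypothetical and the sketch works on single points on the CPU, whereas the reviewed system renders polygons on the GPU.

```python
def planar_shadow_matrix(plane, light):
    """Build the 4x4 matrix projecting points onto `plane` away from `light`.

    plane: (a, b, c, d) with a*x + b*y + c*z + d = 0.
    light: homogeneous position (x, y, z, w) of the light, here a camera center.
    """
    dot = sum(p * l for p, l in zip(plane, light))
    return [[(dot if i == j else 0.0) - light[i] * plane[j] for j in range(4)]
            for i in range(4)]

def project_point(matrix, point):
    """Apply the matrix to a 3D point and dehomogenize the result."""
    v = (point[0], point[1], point[2], 1.0)
    x, y, z, w = (sum(matrix[i][j] * v[j] for j in range(4)) for i in range(4))
    return (x / w, y / w, z / w)
```

For a camera at (0, 0, 10) and the sweep plane z = 0, the point (1, 1, 5) casts its shadow at (2, 2, 0); a voxel of the current plane lying there would be treated as occluded for that camera.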
A similar system is proposed by ZACH et al. [73]. While SAINZ et al. [55] maintain a volumetric data structure on the CPU and perform many expensive upload and read-back operations, ZACH et al. keep all necessary data on the GPU. When sweeping the plane along a particular viewing direction, visibility is obtained by updating depthmaps for the already reconstructed volume in the reference views. The work differs from SAINZ et al. [55] in that the sweeps are carried out independently of each other. This means that a voxel is not evaluated again, even if it is visible from different subsets of cameras. Performance is tested using an AMD Athlon XP2000 server with an ATI Radeon 9700 Pro GPU. A synthetic scene is captured from 36 cameras, and images are transferred to the GPU in advance. Four sweeps are applied, each time considering a subset of nine cameras. For a volumetric resolution of 64 ^{3} and 128 ^{3} voxels, a framerate of 1.4fps and 0.7fps is obtained respectively.
Like for shape from silhouette, real-time performance can be achieved for shape from photo-consistency when generating novel views. Similar to the former, a partial view-dependent representation of the scene is obtained. In this spirit, SLABAUGH et al. [61] propose an image-based algorithm for reconstructing the photo hull, called Image-Based Photo Hulls (IBPH). This is a direct extension of the Image-Based Visual Hulls (IBVH) from MATUSIK et al. [42]. IBPH uses a hybrid approach to increase performance. First, IBVH is applied to quickly sample along the viewing rays until the visual hull is reached. A second step continues ray-sampling with an adaptation of Generalized Voxel Coloring to determine the photo hull. To further increase performance, a lower-resolution image sampling is applied. The color value for each ray is determined by blending the pixels from where the corresponding 3D point is visible. Experiments are performed with five QVGA cameras, each connected to a PC for image segmentation. The data is sent over a 100Mbit network to a dual processor 2GHz HP x4000 server where a multithreaded reconstruction is done. IBPH outperforms IBVH in quality but also leads to a considerable decrease in performance. For rendering a QVGA image while sampling every 16th ray, IBPH achieves a framerate of 6fps while IBVH alone runs at 25fps.
SLABAUGH et al. [61] perform the entire shape computation on the CPU. LI et al. [35] propose a system for GPU-accelerated rendering of novel views. Their approach can be seen as a view-dependent mapping of Voxel Coloring (SEITZ and DYER [58]) onto the GPU. Reconstruction is done by sweeping a plane through the frustum of the viewing camera and performing photo-consistency testing in the fragment shader. Visibility is handled by backprojecting a particular plane into the input views to mark occluded pixels. To speed up computation, only a bounding box around the visual hull is processed. To accomplish this, the fast visual hull rendering from LI et al. [?] is invoked at a very low resolution for the novel view. Eight QVGA cameras are used for experiments. After image processing on the corresponding PCs, the data is sent via 1Gbit network to a Pentium4 1.7GHz server with an Nvidia Geforce FX 5800 Ultra GPU. Novel views are rendered as QVGA, where the bounding rectangle occupies about one third of the image. 2fps are achieved for reconstructing a dancer by using 60 depth planes. A speed-up of 7× is obtained in comparison to an implementation of IBPH on a dual processor 2GHz server.
GPU-accelerated image-based stereo (IBS) approaches differ in their goal, which is usually either (1) the generation of a depthmap or disparity map for an existing view or (2) the construction of a novel view from at least two nearby cameras. (2) is similar to the photo-consistency (PC) approaches explained above, because reconstruction is also done by view-dependent sampling and by considering color distances between pixels from multiple views. The difference lies in the fact that IBS does not explicitly handle scene-space occlusion. Hence, the sampling approaches differ in the way space occupancy decisions are made. For the comparison, it can be assumed that PC and IBS both step along a viewing ray for a particular destination pixel. In case of PC, non-surface points are carved at each depth step. This immediately updates the visibility of points at increased depth. Instead, IBS first steps along the whole ray and just determines color distances between the corresponding pixels for each respective 3D point. Finally, the 3D point with minimal color distance for its corresponding pixels is chosen.
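The two decision strategies can be contrasted in a small sketch. It assumes that the color distance for every depth step along one viewing ray has already been computed and is given as a list; the visibility update that PC performs after each carved point is not modeled. Function names and the threshold are hypothetical.

```python
def first_consistent_depth(costs, threshold):
    """Photo-consistency style: carve depth steps until one is consistent."""
    for index, cost in enumerate(costs):
        if cost <= threshold:
            return index  # first surface point along the ray
    return None  # every sample was carved

def best_matching_depth(costs):
    """Image-based stereo style: evaluate the whole ray, keep the best match."""
    return min(range(len(costs)), key=lambda index: costs[index])
```

For the color distances `[40, 9, 30, 3, 25]` and a threshold of 10, the first strategy stops at depth step 1, whereas the second selects depth step 3 as the global minimum.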
In YANG et al. [72] a novel view is generated from multiple nearby cameras (multi-baseline). This is done similar to LI et al. [?] by sweeping a plane along the viewing direction. For each viewing ray, the best match is chosen from all intersections with the depth planes. YANG et al. [72] use five QVGA cameras. The images are sent from the respective image processing PCs via 100Mbit network to a server with an Nvidia Geforce3 GPU. The rendering performance varies from 62.5fps to 2.5fps for 20×128 ^{2} to 100×512 ^{2} planes. While this system aims at novel-view generation, YANG and POLLEFEYS [71] propose a plane-sweep implementation for computing disparity maps between pairs of existing views with read-back to the CPU. For more than two cameras, the resulting pairs are used to improve each other by cross-checking. Furthermore, several features to enhance quality and robustness are explained. Experiments are performed on a 2.8GHz server with an ATI Radeon 9800 XT GPU. Using 32×512 ^{2} depth planes, 21.5fps and 10fps are achieved in case of two and eight input cameras respectively.
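Cross-checking can be illustrated in its common left-right form. The sketch assumes a single rectified scanline with integer disparities, where a left pixel x corresponds to the right pixel x − d; a disparity is kept only if the disparity map of the other view agrees within a tolerance. This is a generic consistency check under those assumptions and not necessarily the exact procedure of [71].

```python
def cross_check(disp_left, disp_right, tolerance=1):
    """Invalidate left disparities that the right disparity map contradicts.

    disp_left[x] maps left pixel x to right pixel x - disp_left[x];
    disp_right holds the disparities estimated from the right view.
    Returns a copy of disp_left with inconsistent entries set to None.
    """
    checked = []
    for x, d in enumerate(disp_left):
        xr = x - d
        if 0 <= xr < len(disp_right) and abs(disp_right[xr] - d) <= tolerance:
            checked.append(d)
        else:
            checked.append(None)
    return checked
```

Entries set to `None` mark pixels where the two views disagree, typically because of occlusion or a mismatch; they can then be filled from another camera pair.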
Reconstructing by shape from photo-consistency generally leads to a better approximation of shape and color. However, (1) increased space sampling, (2) visibility handling and (3) photo-consistency testing result in lower performance than shape from silhouette. Like the proposed approach, many works exploit the fact that the photo hull is a subset of the visual hull and thus perform shape from silhouette in advance.