Indoor Location Retrieval with Depth Images using 3D Shape Features

Bachelor Thesis, 2014

72 Pages, Grade: 1,0



1 Introduction

2 Background and Methods
2.1 Image and Location Retrieval
2.2 Database of Range Images
2.3 Feature Extraction
2.4 Matching of Features
2.5 Evaluation of Matching
2.6 Depth Image to Point Cloud Projection
2.7 Keypoints Detection Methods
2.7.1 NARF Keypoint Detector
2.7.2 ISS Keypoint Detector
2.7.3 HARRIS3D Keypoint Detector
2.8 Feature Description Methods
2.8.1 NARF Feature Descriptor
2.8.2 Spin Image Feature Descriptor
2.8.3 SHOT Feature Descriptor
2.8.4 USC Feature Descriptor

3 Implementation
3.1 Overall Design
3.2 Filtering
3.3 Normal Estimation
3.4 Keypoint Detectors
3.4.1 NARF Implementation
3.4.2 ISS Implementation
3.4.3 HARRIS3D Implementation
3.4.4 NARF and ISS Combination
3.5 Feature Descriptors
3.5.1 NARF Feature Descriptor Implementation
3.5.2 Spin Image, SHOT and USC Feature Descriptor Implementation
3.6 Saving Files

4 Optimization and Analysis
4.1 Parameters
4.2 ISS Border Estimation
4.3 Preparing Cloud for HARRIS3D
4.4 Computing Times
4.5 Keypoints with ISS vs. NARF

5 Evaluation and Results
5.1 Suitability of 3D Features for CBIR Location Recognition
5.1.1 Query Sets
5.1.2 HARRIS3D and SHOT
5.1.3 Intermediate Result
5.1.4 Further Results of Feature Matching
5.1.5 Explanation
5.2 Combining ISS and NARF

6 Summary and Outlook for Future Work

A Parameter List

B Test Configurations

C Computing Times

D Values for ISS and NARF

E Evaluation Results

List of Figures

List of Tables



Content-based image retrieval (CBIR) for location recognition allows more precise indoor navigation than state-of-the-art methods. Using range images and matching a query im- age to a dataset of geo-tagged images is current research. This thesis investigates the prospects of applying 3D Shape feature detectors and descriptors to a point cloud projec- tion of the range image. Therefor at first the keypoint detection methods Normal Aligned Radial Feature (NARF), Intrinsic Shape Signatures (ISS) and HARRIS3D detector are described, followed by the shape feature descriptors Spin Images, Signatures of Histograms of Orientations (SHOT) and Unique Shape Context (USC). Special attention is paid to the parameters. Varying radii, border estimation methods, preset filters and computing times are analysed in order to determine, how to set those parameters to obtain good results. The results exhibit the shortcomings of the state-of-the-art 3D feature algorithms, in application of indoor navigation. Finally suggestions for improvement are made.

Chapter 1 Introduction

An interesting and current topic in computer vision is image retrieval. A specific approach is content-based image retrieval (CBIR) with 3D data to recognise a range image in a dataset. CBIR is a procedure to extract the features of an image, describe them with their features and compare the features to find matching images.

The driving force in this field and in the associated communities of 3D-object recognition are the novel and cheap depth sensors such as the Microsoft Kinect sensor, new smartphones like the HTC One M8 or the recently announced Google smartphone developed in the ”Project Tango”. Particular mobile applications are a promising field with a growing number of mobile devices equipped with depth-sensors.

The main focus of this thesis is location recognition. As described by Al-Nuaimi et al. [ANHT+13], the basic idea is to retrieve an image, e.g. taken by a mobile device in a database. This database contains captured photos at different locations associated with geo-tag representing the location-information. The most similar photo defines the location of the taken picture. This application enables indoor-navigation without GPS reception or augmented reality with interactive information, e.g. for mobile devices, but also enables orientation for autonomous robots.

To calculate the similarity of images, feature detectors and descriptors are needed. In order to obtain reliable results, the suitability of feature extracting algorithms has to be observed. For this purpose Al-Nuaimi et al. [ANHT+13] already compared the 3D feature Normal Aligned Radial Feature (NARF) with the RGB features from the Speeded-Up Robust Features (SURF) detector, the Maximally Stable Extremal Regions (MSER) detector and the Hessian-Affine detector algorithms. The result was to combine the 3D features of NARF with the RGB features of SURF to a novel feature named NURF. The achievement led to 15 percent improvement in absolute location recognition performance.

The referenced research has compared standard RGB features to range image specific features. However, as yet no investigation has been conducted into the usability of 3D shape features for retrieval using range images has been conducted. Building on that, this bachelor thesis aims to determine the suitability of the 3D shape features for the indoor location retrieval scenario. For this purpose the state-of-the-art range image algorithm NARF, the new algorithms signature of histograms of orientation (SHOT) and the Unique Shape Context (USC) as described by Tombari et al. [TSDS10b, TSDS10a] are introduced. Furthermore, an existing experimental setup of extensive and challenging images is used. The images have been rendered from a point cloud of the TUM main campus. The results are analysed and assessed in order to draw conclusions from the calculations. Furthermore, to enhance the keypoint detection, a combination of two state-of-the-art keypoint detectors is considered.

For this bachelor thesis, a test software is implemented in C++ code. The shape feature extractors are implemented in the Point Cloud Library (PCL). With the usage of the PCLfunctions this software loads a range image, projects it into a point cloud and extracts the selected features to finally save them in a pre-defined format. With this files the standard LMT CBIR engine can be used for retrieval tests and evaluation.

An important step is to analyse the parameters for the keypoint detectors and feature descriptors. This is frequently neglected in current papers but has a huge influence on the outcome. This thesis analyses how the parameters should be set and discusses how filters need to be applied on the data.

Relying on the deliberations an advanced evaluation is carried out. Real query range images, obtained using the Kinect sensor, are used for the evaluation. These images show different locations of the TUM campus. With this set of query images a real-world scenario is created to obtain application-related results.

Chapter 2 Background and Methods

In the fields of Content Based Image Retrieval (CBIR) and object recognition some mile- stones are already achieved. To understand the methods and their backgrounds, this chap- ter explains the work related with the approach of indoor location retrieval with 3D shape features. It starts with image and location retrieval in Section 2.1, explains the database of the used range images in Section 2.2 and how the features are extracted in Section 2.3. Afterwards the matching of the features is described in Section 2.4, followed by the used evaluation of matching in Section 2.5. For the sake of completeness this chapter explains the projection from a depth image to a point cloud in Section 2.6, the keypoint detection methods in Section 2.7 and the feature description methods in Section 2.8.

2.1 Image and Location Retrieval

In order to recognise an image in a database Content Based Image Retrieval (CBIR) is used. The research on CBIR already started decades ago with the main focus on RGB- images. This approach is based on extracting content information of an image using a descriptor and to save it in a database. The link between descriptor and the original image is retained. For a search request the content information of the query image is extracted and matched to the database. The similarity of the descriptors are calculated and the biggest accordance determines which image belongs to the request. The basic procedure is illustrated in Figure 2.1. The CBIR pipeline can be divided into three main components. The first component, the database of the images, is explained in Section 2.2. Section 2.3 explains the feature extraction unit and the third component matches the features as described in 2.4. CBIR is a state-of-the-art procedure for RGB-images and will now be applied to depth-images as well.

illustration not visible in this excerpt

Figure 2.1: Content Based Image Retrieval

Location recognition with images has multiple advantages. The usually used Global Posi- tioning System (GPS) has an accuracy of currently 7.8 m as declared by the United States Department of Defense [Pen08]. But in an indoor scenario the GPS reception is poor or not existing at all. Hence GPS does not work indoors. A more precise method for deter- mining the location is the WLAN localization solution. This method is analysed by Liu et al. [LDBL07]. The accuracy is estimated at 6 m but strongly depends on the amount of WLAN networks in the environment. Recent research has shifted towards CBIR-based location recognition to get even more precise results with the advantage of low setup costs. To achieve location recognition with an image, a database of images must first be created. Therefore, each image in that database has to be linked to a geo-tag, representing the information where it has been taken. The CBIR procedure is applied for each image, so that a query image now returns the location, that represents the information where it was taken. This enables different applications such as interactive information to the current location or indoor-navigation. The latter is the application where the results of this thesis will be integrated. The location retrieval scenario needs a big database. The used database is explained in the next section.

2.2 Database of Range Images

The database used in this thesis consists of a set of range images. The range is represented by the grey value of each image-pixel. More precisely the grey-value indicates the distance from the sensor to the pixel point. In order to work with the range images, they are converted to point clouds representing the 3D coordinates of each point. The focal length and the center point must be considered for the conversion, otherwise the point cloud would be distorted.

To obtain representative results especially for the application of indoor navigation, the 1st floor from the TUMindoor dataset presented by Huitl et al. [HSH+12] is used. Two horizontally mounted laser range scanners and six CCD cameras on a trolley recorded the images of that dataset. The images show the corridors of the 1st floor of the Technical University in Munich (TUM) with a length of 1 169 m. This equals 12 584 images and about 59.7 million points in the point cloud. The range images corresponding to a subset of the captured photos are rendered as described by Al-Nuaimi et al. [ANHT+13]. From this follows that a database of geo-tagged range images, representing the TUM campus, is generated and serves as a reference to perform CBIR-based location retrieval. The challenge is that the amount of information is very variable for each range image. In this thesis, the therein described poseset p=0 is specifically used as database. Figure 2.2 shows an example: A range image of a wall (left pictures) only contains a surface regardless the texture e.g. images on that wall. A stairwell on the other hand (right pictures) contains many details with lots of depth changes.

illustration not visible in this excerpt

Figure 2.2: Range Image with Corresponding RGB Picture. The left images show a wall with pictures but almost no depth information. The images on the right side show a stairwell with a detailed range image

2.3 Feature Extraction

In order to get a numeric and comparable representation of an image, two consecutive steps are necessary. The first step is to extract interesting points out of the point cloud, so-called keypoints. The requirements for keypoints are, that they are stable, distinctive and reproducible. A typical keypoint would be close to a corner or an edge. Given the importance of the keypoints, three different keypoint detection algorithms are introduced: Normal Aligned Radial Feature (NARF) by Steder et al. [SRKB10], Intrinsic Shape Signatures (ISS) by Zhong [Zho09] and the HARRIS3D by Gedikli et al. [GH12]. The functionality of these keypoint detection methods is described in Section 2.7.

The second step is the feature description. Here the surrounding of each keypoint is de- scribed by a vector. The description must be invariant to transformation and comparable. A typical description to a corner point contains information about the surrounding edges in the description vector. In this thesis, 4 description methods are introduced: Nor- mal Aligned Radial Feature (NARF) by Steder et al. [SRKB10], Spin Image by Johnson [Joh97], Signature of Histograms of Orientation (SHOT) and Unique Shape Context (USC) by Tombari et al. [TSDS10b, TSDS10a]. The differences are described in Section 2.8.

Once each image is described by a collection of feature descriptors, they can now be matched against one another. This process is called feature matching and described in the following section.

2.4 Matching of Features

The extracted features of a query image have to be compared to the features of the database images. Therefor the Vocabulary Tree concept (VT) by Nister & Stewenius [NS06] is used. This model is based on the Bag-of-Features (BoF) model by Sivic & Zissermann [SZ03].

The features describing the images are mapped to a feature space. This feature space is divided into N clusters. Each clusters now combines multiple features. The features of an image can now be assigned to their clusters. The N clusters are all visual words and described in a tree. To each visual word in the tree, a rank of the images is calculated. This rank is the amount of features that each image has assigned to respective word. An image can also have similar features assigned to the same word, therefore the rank of that image is higher for this word. It is also possible that an image has no features assigned to a word. Each rank needs to be weighted as Sivic & Zissermann [SZ03] describes. The weight ensures that visual words which occur often in all images are taken less into account than distinctive words. Equation 2.1 is the weight w with the amount of all features in all images N and the respective frequency Ni of each word. A new rank is the result, the highest score is the image with the highest similarity. In this thesis, the CBIR engine from the LMT is used, which implements the ”Randomized Approximate K-Menas Forest” variant of BoF (cite Dipl.-Ing. Robert Huitl). The evaluation of the matches is explained in the next section.

illustration not visible in this excerpt

2.5 Evaluation of Matching

For each query image a rank was created as described in Section 2.4. For a distinguishing and stable evaluation of the rank the Mean Average Precision (MAP) is used. Multiple images may capture the same location. The ideal case would be, if all of these images are at the top of the rank. The top k-rank is evaluated. Therefore, the precision Pr,q for each rank r of the k-ranks is calculated for the query q. As Figure 2.3 shows, a mask can be set on top of the k-rank because it is known whether the result is true or false. Pr,q is calculated with the indicator function rel (i, q), which is 1 for the right results and 0 otherwise. Thus the calculation is to summ up the amount of right results from the top rank to rank r and dividing by r:


illustration not visible in this excerpt

The Average Precision (AP) only considers the right results with the indicator function rel (k):

illustration not visible in this excerpt

Table 2.1 shows the results of the precision Pk,q and the average precision APk,q for the example in Figure 2.3.

The last step for the MAP is to combine all results from all query images. For the top k-ranks the mean average precision is:

illustration not visible in this excerpt

The precision curve matters because the intended goal is that the first ranks are right and therefore the curve should be high at the beginning. The dimension of the k-ranks needs to be adapted to the amount of possible right matches. When k is higher, the MAP can be worse because too many ranks are taken into account. On the other hand, choosing k too small would not reflect good results close to the top-rank.

Figure 2.3: Example of k-rank Results

illustration not visible in this excerpt

Table 2.1: Average Precision Example

illustration not visible in this excerpt

2.6 Depth Image to Point Cloud Projection

The focus of the thesis is to apply all calculations in the 3D space. A depth image, e.g. taken by a Kinect camera, is stored in a png file. This png image has the dimensions dwidth and dheight. That means that the image has an resolution of dwidth times dheight pixels. Each pixel has a grey value representing the depth. To reproduce the 3D space, these points can not only be set to their depth, but also the recording procedure must be taken into account. As Figure 2.4 shows, the focal length f has an important role. The camera or sensor C is set centered in front of the image. Taking beams from C through the points p (u,v) with the length of the depth d behind the image, indicates the 3D coordinates. Equation 2.6 is the mathematical approach, wherein uc is the horizontal center and vc the vertical center of the image, which in our case is set to:

illustration not visible in this excerpt

Now, using the pin-hole camera model, each pixel p (u, v) is projected to 3D point P (x, y, z) as follows:

illustration not visible in this excerpt

The grey image has a certain grey value resolution. The images in the database as well as those captured by the Kinect are stored in 16 bit precision. But the images are loaded in 8 bit. This causes inaccuracy given by the quantization. The point cloud has steps and gaps on the z axis which causes artefacts. An example is given in Figure 2.6, showing the artefacts of the point cloud from Figure 2.5.

Note that the used point clouds are only a projection. These point clouds are not complete because that would require a recording from each angle for each object. Overlap of objects causes gaps in the point clouds. As this is a projection, no additional information is added, only the representation has changed.

Figure 2.4: Depth Image to Point Cloud Projection

illustration not visible in this excerpt

Figure 2.5: Depth Image and Corresponding Point Cloud. From left to right: depth image, RGB image, back-projected point cloud

illustration not visible in this excerpt

Figure 2.6: Artefacts in Point Cloud. This point cloud is the same point cloud as Figure

illustration not visible in this excerpt

2.5 shows, only from a different perspectiv. As one can see, there are circular ”grooves” in the point cloud. Those are artefacts caused by the quantization along the z-axis

2.7 Keypoints Detection Methods

The point clouds generated from the first floor dataset each consists of 240 000 points. To calculate a descriptor for each of the points would be costly in terms of computing effort. Furthermore it would not lead to the desired result because non unique points like points on a wall would be extracted on most of the images and would make a comparison very difficult and inaccurate.

To reduce the number of points to a small number of distinctive points, different approaches of state-of-the-art detection methods are compared. In Section 2.7.1 the NARF (Normal Aligned Radial Feature) by Steder et al. [SRKB10], in Section 2.7.2 the ISS (Intrinsic Shape Signatures) by Zhong [Zho09] and in Section 2.7.3 the HARRIS3D by Gedikli et al. [GH12] are described.

2.7.1 NARF Keypoint Detector

As described by Steder et al. in their paper [SRKB10] the NARF algorithm can be divided into five steps. The first step is to find borders in the range image. In a range image, borders are spots where the depth of two neighboring points differ significantly. The second step is to determine a score for each point, that describes how much the surface changes and into which direction. Also the border information taken into account, e.g. in terms of shadow effects. The third step is to calculate an interest value to the dominant directions originating from each point. The value represents the amount of how much the directions differ from each other and how much the surface of the point changes. The latter defines how stable a point is. To determine the sphere of those interest values the support size parameter indicates the diameter. The fourth step is to smooth this interest values and the last step performs a non-maximum suppression to get the keypoints.

2.7.2 ISS Keypoint Detector

The ISS keypoint detection algorithm by Zhong [Zho09] proceeds as follows. Each point of the point cloud is weighted, inverse depending on the number of points in the spherical neighbourhood, which is defined by the radius rdensity. In this way points with a detailed neighbourhood are taken less into account than points with a sparsely neighbourhood. Afterwards the weighted scatter matrix for each point within a distance of rframe is calculated. With the three eigenvalues of this matrix, λ 1, λ 2 and λ 3, the magnitude and the eigenvectors are decreased. The result is an intrinsic reference frame for each point, indicated by a 3D coordinate system. This coordinate system has the point itself as origin and uses the eigenvectors and the cross product of the eigenvectors as the x-, y- and z- axes.

To get keypoints using intrinsic shape signatures a set of salient basis points is selected.

The intrinsic reference frame is computed for each point with large dimensional variations in the neighbourhood. This is measured by the eigenvalue of the point scatter matrix of the spherical neighbourhood, defined by the radius rsalient. Frames with ambiguous axis at points of local symmetries are excluded. The parameter dvoxel defines a cubic volume in which at most one basis point is extracted. As a filter the parameter threshold 12 and threshold 23 can be set. They set the upper bound on the ratio between the eigenvalues λ 2 and λ 1 or λ 3 and λ 2. The final result is a reduced point cloud. Those points are used as keypoints.

2.7.3 HARRIS3D Keypoint Detector

The HARRIS3D detector by Gedikli et al. [GH12] is based on the Harris operator. The covariance matrix Σ j is calculated for the neighbourhood of each point pj. The neighbourhood is defined by the keypoint radius (KPR). The matrix Σ j regards the normals nj of the points. This is comparable with the gradient calculation by the Harris operator. Let < pi, ni > be the point and its normal in 3D coordinates, m the amount of the points in the neighbourhood and n the mean over all normals:

illustration not visible in this excerpt

This matrix allows first conclusions. The number of eigenvalues which are greater than a certain threshold define the shape. Two great eigenvalues indicate an edge and three eigenvalues indicate a corner. The corners are interesting for the keypoint extraction. Different response methods are proposed, such as HARRIS H, NOBLE N and TOMASI T. With:

illustration not visible in this excerpt

The implemented response in the test software is TOMASI. This response is fast, because it only regards the minimal eigenvalue λ min. If this eigenvalue is greater than a given threshold, all other eigenvalues have are also greater than the threshold. Therefore this point is a corner and are considered becoming a keypoint. The final selection is done by the non maximum suppression. Remaining points are the extracted keypoints.

2.8 Feature Description Methods

For the feature description methods a numeric representation of the scene is calculated. This representation must be repeatable and invariant to view point changes like rotation and distance. Also noise must be taken into account. The goal of a good description is to be as accurate as possible and still be reproducible.

In the next sections different approaches are presented. The first state-of-the-art method described in 2.8.1 is NARF (Normal Aligned Radial Feature) by Steder et al. [SRKB10] building on the NARF keypoint extraction. 2.8.2 presents the Spin Image concept by Johnson [Joh97], the eldest algorithm used. In 2.8.3 SHOT (Signatures of Histograms of Orientations) by Tombari et al. [TSDS10b] and in 2.8.4 the USC (Unique Shape Context) also by Tombari et al. [TSDS10a] are described.

2.8.1 NARF Feature Descriptor

Steder et al. also described the NARF feature descriptor in their paper [SRKB10]. For each keypoint the algorithm calculates a small range image, so called normal aligned range value patch. The diameter of this patch is the same parameter used for the NARF keypoint detection, support size. The observer of the small range image is looking at the point along the normal. Away from the center of this patch the final descriptor is fed with the pixel changes under the radial beams. The length of each beam is half of the support size. An example of the method is shown in Figure 2.7. Afterwards a unique orientation of the descriptor is extracted and finally the descriptor is shifted according to this value. This makes the descriptor invariant to rotations.

illustration not visible in this excerpt

Figure 2.7: Example of NARF Feature Descriptor. The left image is a range image of an office with a chair. The cross marks the keypoint, which is shown in the right image in detail. The green radial beams are the description vector to the keypoint. They are filled with the depth values which are under the beams. Image from original paper [SRKB10]

2.8.2 Spin Image Feature Descriptor

Already in 1997 Johnson introduced the Spin-Images [Joh97], as representation for 3D Surface Matching. The Spin Image concept represents 3D Shape in an object oriented coordinate system. It uses oriented points, which are 3D surface points with a direction. These points are surrounded by a square surface and in the position of the surface vertex. The direction of the oriented points is defined by the surface normal. With a tangent plane through these points with a right angle to its direction, surrounding points can be represented in a cylindrical coordinate system.

Building on that, a spin map can be produced. This map is a projection of 3D points to 2D coordinates associated with the oriented point. To compare those maps they need to be converted to a 2D array. This is done by overlaying the map with a grid to assign the points into a discrete cell of the array. In order to regard the noise of the data the points are spread to four surrounding points using bilinear interpolation. This array is a comparable descriptor for object recognition.

2.8.3 SHOT Feature Descriptor

The SHOT (Signatures of Histograms of Orientations) description method by Tombari et al. [TSDS10b] is based on a novel local Reference Frame (RF) to overcome problems like sign and not uniqueness. This RF builds on the technique by Hoppe et al. [HDD+92] and Mitra et al. [MNG04]. With Eigenvalue Decomposition (EVD) of the covariance matrix of the k-nearest neighbours of the keypoint the Total Least Squares (TLS) estimation of the normal direction is obtained. The new proposal weights distant points smaller and the covariance matrix is calculated by all points laying in the sphere of the support radius. This should increase repeatability of cluttered material and improve the robustness. Different from Hoppe et al. [HDD+92] and Mitra et al. [MNG04] the third eigenvector does not represent the TLS estimation of the normal direction. To overcome the problem of sign disambiguation for EVD, Tombari et al. used the proposal of Bro et al. [BAK08]. It makes the sign coherent with the major part to the representing vectors by reorienting the sign of each singular or eigenvector. The x- and z- axis are computed in that way and the y axis is the cross product of x and z.

Building on their own RF and according to the descriptor advantages of the use of histograms over point coordinates, Tombari et al. introduces the SHOT descriptor. This 3D descriptor encodes histograms with first-order differential entries. This means it represents data like normals of the points instead of coordinates. The algorithm computes multiple local histograms over a 3D grid in the support and combines all local histograms for the descriptor. This descriptor got his name because it represents geometric information of the location of points (in the support) like a signature. Therefore the name of the method is ”Signature of Histograms of Orientations”.


Excerpt out of 72 pages


Indoor Location Retrieval with Depth Images using 3D Shape Features
Technical University of Munich  (Media Technology)
Catalog Number
ISBN (eBook)
ISBN (Book)
File size
2176 KB
Image and Location Retrieval, Feature Extraction, Matching of Features, Evaluation of Matching, Depth Image to Point Cloud Projection, NARF, ISS, HARRIS3D, Spin Image Feature Descriptor, SHOT, USC, Implementation, Optimization, Analysis, suitability of 3D Features for CBIR Location Recognition, CBIR
Quote paper
Konrad Vowinckel (Author), 2014, Indoor Location Retrieval with Depth Images using 3D Shape Features, Munich, GRIN Verlag,


  • No comments yet.
Read the ebook
Title: Indoor Location Retrieval with Depth Images using 3D Shape Features

Upload papers

Your term paper / thesis:

- Publication as eBook and book
- High royalties for the sales
- Completely free - with ISBN
- It only takes five minutes
- Every paper finds readers

Publish now - it's free