Rolling Shutter Bundle Adjustment

Master's Thesis, 2014

55 Pages, Grade: 5.5



1 Introduction
1.1 Structure-from-Motion
1.2 Bundle Adjustment
1.3 Rolling shutter cameras

2 Structure-from-motion on rolling shutter cameras
2.1 Related work
2.1.1 Klingner et al. 2013
2.1.2 Hedborg et al. 2012
2.1.3 Oth et al. 2013
2.2 Proposed model

3 SfM Pipeline
3.1 Feature extraction
3.2 Feature matching
3.3 Feature tracking
3.3.1 Triangulation
3.4 GS initialization
3.4.1 RS correction
3.5 Rolling shutter bundle adjustment
3.6 Interpolation of the angle-axis vector
3.6.1 Slerp & NLerp
3.6.2 Angle-axis vector interpolation
3.7 Least Squares RS PnP
3.7.1 Motion priors

4 Experiments
4.1 Improving GPS/INS prior
4.1.1 Effects on dense reconstructions
4.2 Measurements on synthetic data
4.2.1 Influence of noise/outliers
4.3 Full pipeline
4.3.1 Street View dataset
4.3.2 Comparing to Bundler SfM
4.4 Windowed BA

5 Future Work
5.1 Loop Closure
5.2 Large scale and portability
5.4 Alternative models

6 Conclusion

Chapter 1 Introduction

Structure-from-motion (SfM) is the estimation of three-dimensional structure from two-dimensional image sequences taken by one or more cameras while the camera and/or the scene is moving. The camera motion has to be estimated simultaneously during this process. SfM has already found many real-world applications: it is now commonly used to add visual effects to video, e.g. in the movie industry, TV and augmented reality, and it has been successfully employed to build 3D models from both unordered photo collections and video.

Until a few years ago, virtually all SfM implementations assumed that each image is captured statically and instantaneously as a whole. This assumption can in fact be close to reality, to pixel precision, for good-quality still pictures taken with a global shutter camera. Cameras fitted with a mechanical shutter or a charge-coupled device (CCD) sensor can capture the whole image in a very short period of time, provided the shutter speed is configured correctly and the scene is static or slow-moving, and/or with the help of a good optical image stabilization system. However, an overwhelming majority of camera sensors sold today use complementary metal-oxide-semiconductor (CMOS) circuits: nearly all mobile video recording devices, compact cameras and, since around 2010, also most high-end digital single-lens reflex (DSLR) cameras have them.

In contrast to classical CCD sensors, the image rows of standard CMOS sensors are read line by line in rapid succession over a readout time of 10-60 ms, depending on the refresh rate. This delayed and fragmented image formation is known as an electronic rolling shutter (RS). Combined with motion of either the camera or the subject, it leads to RS distortion. Stretching, squeezing and shearing are common artefacts produced by the RS.

As said before, most work on SfM is based on the global shutter camera model. If classical structure-from-motion is applied to rolling shutter video, the result is unpredictable, as shown by Hedborg et al. [19] and by Saurer et al. [37]. In recent years, many alternative solutions that take RS into consideration have appeared in the literature. Very good results were shown and some implementations were even taken to global scale [24]. Meilland et al. [31] also model motion blur together with the RS effects on dense 3D structure, but require an RGB-D input. Apart from the latter and [37], there seem to be no other references handling dense models and RS simultaneously.

Therefore, the objective of this project is to merge some of the current efforts into a general SfM pipeline, including all its steps, that can serve as a base for large-scale 3D reconstructions. This platform should deliver a high-performance, precise scene reconstruction and camera motion estimation from video input, which can be used to bootstrap denser reconstructions or camera tracking applications. The implementation will focus on dealing with the rolling shutter aberrations of commonly available CMOS videos. Nevertheless, the model should be flexible enough to also cope with global shutter video cameras or any ordered set of images.

The estimated camera motion and the sparse SfM 3D model should have a precision comparable or superior to other standard methods. For this reason, the work will start by testing popular SfM implementations like Bundler SfM [42], which will serve as a baseline for the further tests. Bundler assumes global shutter images, so modeling the rolling shutter should considerably improve the result on video sequences, especially when dealing with relatively fast rotation and translation.

1.1 Structure-from-Motion

As mentioned before, SfM is the simultaneous estimation of 3D structure and camera poses. Finding structure from motion presents a problem similar to finding three-dimensional structure from stereo vision. In both instances, the correspondence between images and the reconstruction of the 3D object need to be found [7].

To find correspondences between images, 2D features such as corner points (edges with gradients in multiple directions) need to be tracked from one image to the next. The feature trajectories over time are then used to reconstruct their 3D positions and the camera’s motion. There is a wide variety of 2D feature packages, such as SIFT [28], SURF [4], ORB [36] and KLT [38], which allow the user to extract and track 2D interest points between consecutive images.

illustration not visible in this excerpt

Figure 1.1: Typical structure-from-motion pipeline. Raw data (white boxes) is normally compressed in the form of sparse 2D point features before usage. The blue boxes depict standard methods using epipolar geometry and simpler linear systems to find an initial estimate of the next poses and features. Bundle adjustment BA (red) is used to refine the previous steps to an optimal estimate, normally taking a larger part of the problem into consideration.

Once visual matches have been found, they can be used to initialize the reconstruction pipeline as shown in figure 1.1. Epipolar geometry can be used to estimate the relative pose between the first two cameras. However, this technique cannot be used directly on the additional images, since each of its pose estimates is only defined up to scale and keeping the scale consistent between independent estimates is not trivial. Nevertheless, the 2D matches can already be triangulated using the first camera pose estimates, and their 3D coordinates can then be used to acquire a scale-consistent pose estimate for the following images. The latter technique is typically known in the computer vision community as Perspective-n-Point (PnP). Figure 1.1 also shows an intermediate bundle adjustment (BA) step after each of the aforementioned steps. BA is a nonlinear refinement step, which can be used repeatedly to improve the estimate, as we see next.

1.2 Bundle Adjustment

Bundle adjustment is the optimal joint estimation of 3D structure and camera parameters (camera pose and/or calibration). Optimal means that the parameters are refined by minimizing some cost function that quantifies the model fitting error, and joint means that the solution is simultaneously optimal with respect to both structure and camera variations.

The name refers to the bundles of light rays tracing from their 3D features to each one of the camera centers, which are optimally adjusted with respect to both feature and camera parameters. Equivalently, the whole 3D structure and camera parameters are adjusted together in one bundle. Bundle adjustment is simply a large sparse geometric parameter estimation problem, the parameters being the combined 3D feature coordinates, camera intrinsics and extrinsics.

This mathematical method has a long history in the photogrammetry and geodesy literature and has typically been formulated as a nonlinear least squares problem [8, 17, 43, 2, 46]. In computer vision, BA is often used as a refinement step in SfM pipelines, as it generally requires a reasonably good initialization of the parameters, especially when working with robust loss functions, as we will see in chapter 3.

y_n = K P x_n    (1.1)

Classically, a reprojection error is used as the cost function to be optimized. The reprojection of a homogeneous 3D feature point x_n is linear and given by equation 1.1, where P is the perspective transformation from the world frame to the camera frame of reference, composed as a 3x4 matrix [R|t] of the rotation matrix R and a translation vector t. K denotes the camera intrinsics as a 3x3 matrix [18]. The resulting y_n is the reprojection of the keypoint x_n in the image plane in homogeneous coordinates.

Therefore, assuming a quadratic BA cost function, the solution of the bundle adjustment optimization lies at the minimum of the sum of squared reprojection errors as in equation 1.2, where y'_n are the original, fixed 2D observations. Since K P x_n uses homogeneous coordinates, a dehomogenizing operator g: R^3 → R^2 must be used, as defined in equation 1.3. K is normally fixed in a calibrated setup, but can also be jointly optimized with the other parameters P and x_n.

illustration not visible in this excerpt
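
Equations 1.2 and 1.3 are not reproduced in this excerpt. Based only on the definitions in the surrounding text, they plausibly take the following form; the summation index, the Euclidean norm and the component names u, v, w are assumptions of this sketch rather than the thesis' exact notation:

```latex
% Assumed form of equation 1.2: minimum of the sum of squared reprojection errors
\min_{P,\,x_n} \sum_{n} \left\| y'_n - g\!\left(K P x_n\right) \right\|^2

% Assumed form of equation 1.3: dehomogenization from R^3 to R^2
g\!\left( \begin{bmatrix} u \\ v \\ w \end{bmatrix} \right)
  = \begin{bmatrix} u / w \\ v / w \end{bmatrix}
```

With several cameras, the sum would additionally run over every camera in which a point is observed.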

As we saw, P is composed of a 3x3 rotation matrix and a 3D translation vector. The rotation matrix is over-parametrized, as a rotation has only 3 degrees of freedom (DoF). It should be substituted by a lower-dimensional parametrization such as Euler angles, angle-axis or quaternions. Note that quaternions are still over-parametrized, as their representation includes 4 parameters, and that Euler angles suffer from the well-known gimbal lock [39]. The angle-axis vector is therefore the simplest and most direct minimal representation for 3D rotations and has been used in several bundle adjustment setups, including [20, 19, 33, 42, 6].
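
To illustrate how a 3-parameter angle-axis vector enters the reprojection error, the following minimal sketch converts it to a rotation matrix with Rodrigues' formula and evaluates one residual. The function names and the intrinsics `K_demo` are hypothetical and chosen for illustration only; this is not the thesis implementation.

```python
import numpy as np

def angle_axis_to_matrix(w):
    """Rodrigues' formula: 3-vector (unit axis times angle in radians) -> 3x3 rotation."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)  # (near) zero rotation
    k = w / theta
    # Skew-symmetric cross-product matrix of the unit axis
    k_cross = np.array([[0.0, -k[2], k[1]],
                        [k[2], 0.0, -k[0]],
                        [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * k_cross + (1.0 - np.cos(theta)) * (k_cross @ k_cross)

def project(K_intr, w, t, X):
    """Pinhole projection g(K [R|t] x) of a non-homogeneous 3D point X."""
    R = angle_axis_to_matrix(w)
    y = K_intr @ (R @ np.asarray(X, dtype=float) + np.asarray(t, dtype=float))
    return y[:2] / y[2]  # dehomogenize

def reprojection_residual(K_intr, w, t, X, y_obs):
    """2D difference between the projected point and the fixed observation."""
    return project(K_intr, w, t, X) - np.asarray(y_obs, dtype=float)

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240)
K_demo = np.array([[500.0, 0.0, 320.0],
                   [0.0, 500.0, 240.0],
                   [0.0, 0.0, 1.0]])

# A point on the optical axis projects onto the principal point: (320, 240)
center = project(K_demo, [0, 0, 0], [0, 0, 0], [0, 0, 2])
```

A nonlinear least squares solver would stack such residuals over all observations and refine the 3 rotation and 3 translation parameters of each camera together with the 3D points.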

1.3 Rolling shutter cameras

Rolling shutter cameras are the most common type of digital camera nowadays due to the relatively low cost, small size and quality of CMOS sensors. However, they impose an additional difficulty on traditional computer vision techniques. RS cameras capture images not by taking a snapshot of the entire scene at a single instant in time, but rather by quickly scanning across the scene, either horizontally or vertically. Figure 1.2 shows a typical readout sequence of an RS camera with vertical shutter. First, reset signals (blue) are used to clear the scanlines line by line. At the same pace, each scanline is read (red and green) after its exposure time. Note that the total frame exposure time, also called scanning time, is considerably longer than the scanline exposure time.

illustration not visible in this excerpt

Figure 1.2: Rolling shutter cameras acquire images or video by sequentially resetting, exposing and transferring visual data. Source: Matrix Vision

The images obtained during this sequential scanning are equivalent to an instantaneous snapshot, i.e. global shutter (GS), if the camera and scene stay completely static during the whole exposure. However, sequential exposure during motion on either side (camera or scene) leads to visual deformations, e.g. stretching, squeezing, smearing or skewing. Figure 1.3 shows typical visual distortions suffered during camera motion in different directions.

These RS deformations are visible in pictures because all scanlines are generally shown simultaneously, ignoring the fact that they actually represent different points in time. An RS video could potentially be synchronized with a non-interlaced screen to show the real sequence of the readout. However, this would be difficult in practice, as the scanning direction and timings can vary from camera to camera.

The same problem appears in computer vision techniques, including structure-from-motion. Holding on to the global shutter assumption during motion can easily cause SfM pipelines to fail [20]. The camera’s rolling shutter needs to be taken into consideration in order to correctly estimate the poses and feature reprojections. However, the full camera model is an underdetermined system, as it would have 6 DoF per scanline, 3 for translation and 3 for rotation.
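
To make the counting argument concrete, consider a hypothetical frame with H = 1080 scanlines in which M = 500 features are observed. Both numbers are illustrative assumptions, not values taken from the experiments:

```latex
\underbrace{6H}_{\text{pose unknowns per frame}} = 6 \cdot 1080 = 6480
\qquad \text{vs.} \qquad
\underbrace{2M}_{\text{image measurements per frame}} = 2 \cdot 500 = 1000
```

Even before counting the 3 unknowns of each 3D point, a per-scanline pose model has far more unknowns than measurements, so the scanline poses have to be tied together by a motion model.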

illustration not visible in this excerpt

Figure 1.3: Rolling shutter visual deformations during translation movement. Stretching, squeezing or skewing are common in RS images taken in the presence of motion. Source: Saurer et al. 2013 [37]

Chapter 2 Structure-from-motion on rolling shutter cameras

As mentioned in the last chapter, independently estimating a pose for every scanline would lead to a heavily underdetermined system. Nevertheless, coping with rolling shutters is a requirement for any SfM framework that aims to work with today’s off-the-shelf digital cameras and mobile devices.

Several publications in recent years have tackled the problem of rolling shutter cameras. Klingner et al. [24], Hedborg et al. [19] and Oth et al. [33] successfully applied an RS model to the typical bundle adjustment, as we will see below. Based on these three works, we propose a general camera model that can cope with real-world datasets, from video to general ordered imagery.

2.1 Related work

2.1.1 Street View Motion-from-Structure-from-Motion, Klingner et al. 2013

Klingner et al. 2013 [24] used RS BA to optimize the position of the car in Google Street View. The data already had very good pose estimates from the in-car GPS/INS system. However, inertial measurement units (IMU) are known to drift and GPS localization is not precise, especially in metropolitan areas. Nevertheless, they opted to keep the relative GPS/INS estimate fixed within each image frame of their 15-camera rig and only optimize the pose of the whole car at the beginning of each aggregated frame (with all 15 cameras).

This setup reduced the freedom of the system back to only 6 DoF, as in a GS BA, yet it could explain the motion during exposure by trusting the relatively good low-frequency estimate of the GPS/INS system. This model was successfully used in Google Street View at a global scale. However, it relies on high-quality GPS/INS hardware, which is not always available. Figure 2.1 depicts their typical camera setup; it clearly shows the positions of the scanlines of each camera spreading over more than a meter on every shot.

illustration not visible in this excerpt

Figure 2.1: Left: Light rays converging to the center of each one of the 15 cameras of the vehicle’s rosette, one color per camera. Right: The same rays at a typical velocity of 30 km/h. Source: Klingner et al. 2013 [24]

2.1.2 Rolling Shutter Bundle Adjustment, Hedborg et al. 2012

illustration not visible in this excerpt

Figure 2.2: Traditional global shutter SfM pipelines can fail on today’s rolling shutter cameras (bottom right). Source: Hedborg et al. 2012 [19]

Hedborg et al. 2012 [19] presented what is, as far as we know, the first RS BA for video. They assume a continuous exposure of the sensor with no gaps between frames. Moreover, they assume linear motion within the exposure time and interpolate the poses of consecutive frames to determine the position and orientation of each scanline, while maintaining a system with 6 DoF per frame. Their 6 DoF pose represents the initial scanline of a frame, and all the following scanlines are obtained by interpolating between this initial pose and the pose of the next frame. They used linear interpolation for the translation vectors and spherical linear interpolation (Slerp) for the rotation.
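
The interpolation scheme described above can be sketched as follows. This is an illustrative sketch, not the code of Hedborg et al.: the quaternion layout (w, x, y, z), the function names and the fraction `row / num_rows` (which encodes the no-gap assumption) are choices made for this example.

```python
import numpy as np

def slerp(q0, q1, s):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:        # q and -q encode the same rotation: take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:     # nearly identical: fall back to normalized linear interpolation
        q = q0 + s * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - s) * theta) * q0 + np.sin(s * theta) * q1) / np.sin(theta)

def scanline_pose(q_k, t_k, q_next, t_next, row, num_rows):
    """Pose of scanline `row` of frame k, interpolated towards frame k+1.

    With no gap between frames, row == num_rows coincides with the first
    scanline of the next frame, so the fraction is row / num_rows.
    """
    s = row / num_rows
    q = slerp(q_k, q_next, s)
    t = (1.0 - s) * np.asarray(t_k, dtype=float) + s * np.asarray(t_next, dtype=float)
    return q, t

# Example: the camera turns 90 degrees about z and moves 1 unit along x between two frames
q_a = np.array([1.0, 0.0, 0.0, 0.0])
q_b = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
q_mid, t_mid = scanline_pose(q_a, [0, 0, 0], q_b, [1, 0, 0], row=240, num_rows=480)
```

In this example the middle scanline is rotated by 45 degrees about z and translated by half a unit, while each frame still contributes only one 6 DoF pose to the optimization.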

In addition to the rolling shutter bundle adjustment step, they also proposed rolling shutter versions of the initialization and PnP steps of the SfM pipeline. A rotation-only rectification was applied to the initialization images before proceeding with the standard methods. Their approach to a general PnP will be discussed in detail in section 3.7.

Their setup could successfully model several datasets with natural motion captured on a smartphone, where the GS SfM pipeline would previously fail, as shown in figure 2.2.

2.1.3 Rolling Shutter Camera Calibration, Oth et al. 2013

In order to precisely calculate the line delay, i.e. the time delay between scanlines in an RS video, Oth et al. 2013 [33] devised a smooth 4th-order B-spline motion model. For this calibration, they used an RS BA to fit the model to videos of a known structure, which is partially equivalent to an SfM pipeline.

However, in order to obtain a high-precision estimate, they iteratively repeat the process to also estimate the optimal knot placement on the B-spline curve. This process is not only extremely time-consuming, but could also lead to overfitting, as they show themselves in figure 2.3. We strongly believe this problem would be even more pronounced in noisy environments, where the positions of the 3D points are not known and cannot be precisely estimated.

2.2 Proposed model

In this work, we model a general camera which can cope transparently not only with global shutters, but also with the most common rolling shutter digital cameras. This model should be able to incorporate the extra degrees of freedom caused by the sequential readouts without overparameterizing the system.

illustration not visible in this excerpt

Figure 2.3: Although the uniform B-spline knot placement continuously reduces the reprojection error as more knots are used (top), the camera position estimate error (bottom) and the deviation in their line delay prediction (middle) start increasing again after a certain point. This indicates overfitting to the noisy data on the right side of the grayed out area. Source: Oth et al. 2013 [33]

As reported earlier [33], most video cameras do have an additional delay between consecutive frames. In addition, we would like to be able to work with non-video datasets of ordered but more widely spaced images, similar to Google Street View data [24], without requiring any non-visual information. For these datasets, the RS BA proposed by Hedborg et al. [19] might not be the best fit, because it assumes that the pose of the first scanline in a frame is the same as that of the last scanline in the previous frame. That model would not be able to explain the large gap between frame acquisitions without compromising the pose estimation.

Therefore, we propose a model with additional freedom for non-continuous sets of images. Like Hedborg et al. [19], we assume a linear motion with constant velocity within a single image frame, as the exposure time does not normally exceed a few tens of milliseconds - around 30 ms on modern smartphones [33]. This linear motion can be easily represented by the position and orientation of the first and last scanlines of each frame, for a total of 12 DoF, i.e. 3 DoF for position and another 3 for orientation at the first scanline pose, and the same again for the last scanline. As we assume a constant linear motion throughout the frame exposure time, the intermediate poses can be found by linear interpolation between the two extremes.
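
A minimal sketch of this 12 DoF frame parametrization is given below. It assumes, for illustration only, that each pose is stored as an angle-axis vector followed by a translation and that both are interpolated linearly; the excerpt does not state which rotation interpolation is finally used, and all names are hypothetical.

```python
import numpy as np

def rs_scanline_pose(frame_params, row, num_rows):
    """Interpolated pose of one scanline from a 12-parameter frame.

    frame_params: [first scanline pose (6), last scanline pose (6)], where each
    pose is (rx, ry, rz, tx, ty, tz) with an angle-axis rotation vector.
    """
    p = np.asarray(frame_params, dtype=float)
    if p.shape != (12,):
        raise ValueError("expected 12 parameters per frame")
    first, last = p[:6], p[6:]
    s = row / (num_rows - 1)            # 0 at the first scanline, 1 at the last
    pose = (1.0 - s) * first + s * last
    return pose[:3], pose[3:]           # angle-axis vector, translation

# Rolling shutter frame: small rotation about z and 10 cm motion along x during readout
rs_frame = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                     0.0, 0.0, 0.02, 0.1, 0.0, 0.0])
w_mid, t_mid = rs_scanline_pose(rs_frame, row=240, num_rows=481)

# Global shutter frame as a special case: first and last poses coincide
gs_pose = np.array([0.1, -0.2, 0.3, 1.0, 2.0, 3.0])
gs_frame = np.concatenate([gs_pose, gs_pose])
w_gs, t_gs = rs_scanline_pose(gs_frame, row=123, num_rows=481)
```

Because the first and last scanline poses are independent of the neighbouring frames, a gap of any length between two frames does not constrain the motion within either of them. Linear interpolation of angle-axis vectors approximates a constant angular velocity only for the small rotations expected within a single readout.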

We will go into more detail on the whole proposed RS pipeline in the next chapter, and specifically on the latter model in section 3.5.

Excerpt out of 55 pages


University of Zurich  (Department of Informatics)
Computer Vision
Note: a grade of 5.5 (Switzerland) corresponds to approximately 1.5 in the German grading system
Keywords: Rolling Shutter, Bundle Adjustment, structure from motion, SfM, computer vision, OpenCV, ceres-solver, non-linear optimization
Quote paper
Henrique Mendonça (Author), 2014, Rolling Shutter Bundle Adjustment, Munich, GRIN Verlag,

