
## Table of Contents

**1 Introduction**

1.1 Related Work

1.2 Objective and Contribution

**2 Fundamentals**

2.1 Video Coding

2.1.1 History of Video Codecs

2.1.2 Structural Overview of HEVC and VVC Coding

2.1.3 Temporal Coding Structure

2.1.4 Spatial Coding Structure

2.1.5 Quantization

2.1.6 Intra and Inter Prediction

2.1.7 In-Loop Filtering

2.1.8 Coding Schemes

2.2 Video Streaming

2.2.1 Omnidirectional Projection

2.2.2 Viewport Adaptive Tiled Streaming

2.2.3 Rate Assignment for Encoder Control

2.3 Regression

2.3.1 Regression Model Overview

2.3.2 Decision Trees

2.3.3 Model Optimization

2.3.4 Model Validation

**3 Methodology**

3.1 Aggregation of the Trial-Encodings Dataset

3.2 Regression Features and Quality Measures

3.3 Rate Assignment Models

3.3.1 Non-Linear Reference Model

3.3.2 Length of the Training Sequence

3.3.3 Ratio Features

3.3.4 Direct Ratio Assignment

3.3.5 Training of Multiple QP

3.3.6 Training of Frame-Subsets

3.3.7 Training of Multiple Resolutions

3.3.8 Prediction of Multiple Outputs

3.4 Rate Assignment Evaluation Framework

3.4.1 Structure of the Framework

3.4.2 Parameters to Configure the Framework

3.4.3 Modes as Preset Parameter Combinations

3.5 Regression Model Optimization

3.5.1 Feature Selection

3.5.2 Hyperparameter Optimization

**4 Viewport-Dependent Streaming Simulation**

4.1 Analysis of the Features

4.2 Induced Error of the Evaluation Framework

4.3 Analysis of the Rate Assignment Models

4.3.1 Prediction Accuracy

4.3.2 Quality Distribution

4.3.3 Network Utilization

4.3.4 Model Overview

4.3.5 Non-Deterministic Variance

4.3.6 Model Optimization

4.4 Variation of the Evaluation Parameters

4.4.1 Resample Factor

4.4.2 Target Bitrate

4.4.3 Chunk Length

4.4.4 Streaming Scenario Dataset

**5 Conclusion and Prospect**

List of Abbreviations

List of Symbols

List of Figures

List of Tables

List of Listings

List of References

## 1 Introduction

By 2018, video content reached 58 percent of the world's downstream internet traffic [1] and is estimated to grow to 82 percent by 2022 [2]. While YouTube, Facebook Video, and Netflix alone made up 42 percent of the mobile downstream traffic in 2019 [3], virtual and augmented reality traffic is predicted to increase 12-fold by 2022 [2]. One of the main drivers for this growth is the introduction of mass-produced VR displays like Oculus' popular Rift virtual reality headset, which was first brought to the developer market in 2013 and has been available to consumers since 2016 [4].

As part of the standardization of new video transmission and coding procedures, new areas of application are moving into focus. An example of these new applications is omnidirectional video, also known as 360-degree video, for viewing through Virtual Reality (VR) glasses. The first specifications for VR services have already been developed by the responsible standardization organizations, such as the Moving Picture Experts Group (MPEG) and the 3rd Generation Partnership Project (3GPP). The VR Industry Forum issued detailed guidelines for the consumer experience [5]. In order to supply this high-quality content, the high bandwidth requirement of omnidirectional video becomes a fundamental problem, since its resolution can quickly exceed what is available in conventional video applications [6, 7], especially for mobile content streaming.

One solution is a new transmission approach based on splitting the omnidirectional video into independently coded video tiles of different resolutions. These video tiles are stored on the server side. During streaming, the client device requests the tiles in the current viewing direction in higher resolution than the tiles outside the field of vision. The goal is to comply with the bandwidth capacity of the streaming system while allocating bitrate between tiles such that the resulting quality distribution contributes to an optimal user experience, i.e. a consistent quality across tiles is the desired outcome of such bitrate allocations. All requested tiles are packed into one constraint-compliant video stream and transferred to the client device.

For a service provider, providing the video data appropriately is a challenge. Video tiles are typically encoded by parallel independent encoder instances. A proper rate assignment of the total available bandwidth to these instances is therefore necessary to achieve a uniform quality across all video tiles. The proprietary encoder implementations in development do not necessarily implement a constant quality mode. Instead, the constant target bitrate (CBR) or rate-controlled mode must be used. Finding the proper bitrates which correspond to a uniform quality could be done by a trial-and-error search. However, such an approach comes with an enormous overhead in encoding complexity and is not practicable for newer and computationally more complex video encoding standards or on a large scale. Therefore, a rate assignment must be made based on simple complexity features derived from the pictures before encoding.

### 1.1 Related Work

In the following, the results of the literature research are listed. Further results of the literature research are distributed throughout this work and referenced in the proper context.

As early as 2010, Mavlankar researched segmented streaming using multiple parallel encoders while meeting the requirements of a standard video stream format for an online lecture viewing system [8]. Van Brandenburg investigated network delivery mechanisms of immersive media for mobile devices and introduced a framework to navigate through tiled streaming video in 2011 [9].

Besides the still existing task of coordinating encoders in such a scenario, e.g. concerning the reference structure, a suitable rate assignment is essential, and bitrate adaptation in applications like this is an active area of research [10]. Concerning the streaming application, extensive research has been done, including the right choice of tiles to include in the stream, where tile borders should end [11], and how to access spatial areas in a bitstream [12]. Related research included the mathematical investigation of the optimal composition of video tiles for omnidirectional streaming [13]. The tiles which shall be included in the final streams and where the cuts should be made were investigated [14]. A testing scenario for omnidirectional video streaming was introduced [15]. Lastly, an HTML5 omnidirectional tiled video streaming player was implemented [16].

Viewport-dependent tile-based streaming was shown to be suitable for alleviating the bandwidth shortage by splitting the omnidirectional video plane of panoramic video into these tiles [17, 18]. Only the sections in the current client viewport are supplied in high resolution, while additional tiles are sent at lower quality and resolution to prevent wasting decoder resources on content the user rarely looks at.

The authors of [6] proposed omnidirectional video that is split into tiles and transferred in high resolution inside the user's viewport and low resolution outside of it. Three tiling configurations with different percentages of high-resolution content and the corresponding streaming system structure were proposed.

A bitrate estimation scheme based on spatio-temporal content complexity metrics of uncompressed video was proposed, utilizing two metrics and a generalized linear model [19]. In recent work, the authors adapted the work of [19] and proposed a non-linear regression model for tile rate assignment focused on the spatio-temporal activity of omnidirectional video [20]. The authors reported a decrease in the variation of the quality distribution between the tiles of the encoded panoramic video. For this model, the spatio-temporal activity of the video is determined from trial-encodings of a High Efficiency Video Coding (HEVC) based video set.

### 1.2 Objective and Contribution

The objective is to create a rate assignment for the tiles of an omnidirectional video in such a way that equal quality is guaranteed over the picture plane. Hardware decoders with a capacity of 4k resolution are common in recent mobile hardware and for broadcast applications. The main purpose of this work is to stream a 6k video to devices with a decoder capacity of less than 6k by means of partial subsampling of the video picture to address 4k hardware decoders. However, the investigations can be adapted to other resolutions.

This work introduces new rate assignment models in a distributed tile encoding method for multi-resolution tiled video services based on the currently evolving Versatile Video Coding (VVC) standard. Progress is made on earlier work by

- using more sophisticated features defining the spatio-temporal behavior of video data in model learning,

- implementing a better converging regression model, namely a Random Forest (RF) regression model with cross-validation,

- and the evaluation in a viewport-dependent streaming simulator, which can determine the influence of system parameters on rate assignment model performance.

The variation of video length, i.e. chunk length in video streaming, and the alignment of inter-frame coding based on different encoding parameters, without the computational overhead of encoding each tile combination, are described in the following sections. All work is designed, implemented, and evaluated in the context of the MPEG standard VVC, which is currently in development.

## 2 Fundamentals

The fundamentals for this work are described as follows. First, the relevant aspects of video coding are described. After a short structural overview is given, aspects related to omnidirectional tiled video streaming are explained. Subsequently, the basics of video streaming of omnidirectional content are discussed, concerning the projection of the 3D sphere onto a 2D plane and the viewport adaptation. In a third section, the basics of statistical regression are introduced, with a focus on the methods of Random Forests.

### 2.1 Video Coding

In order to understand the methodology of this work, an introduction to the video coding fundamentals is given here. The explanation gives a short overview of video coding techniques, such as still picture compression, motion estimation, prediction types, quantization, and especially the main component of this work: the quantization parameter (QP). The specialties of the temporal and the spatial coding structure of recent video codecs are introduced, including a fundamental of this work: the partitioning of a video frame into tiles. Thereafter, the omnidirectional video projections, the basics of streaming and viewport adaptivity in omnidirectional video, as well as basic quality measures, are introduced. As the Versatile Video Coding standard is still in development, this section is mainly derived from the extensive explanations of High Efficiency Video Coding by Wien, Sze, Budagavi, and Sullivan [21, 22].

#### 2.1.1 History of Video Codecs

Over time, two different applications moved into the focus of the development of video coding standards: real-time video communication and the distribution of video content, i.e. files. Since the 1990s, the International Telecommunications Union (ITU) and the International Standardization Organization/International Electrotechnical Commission (ISO/IEC) have published specifications for these two applications. This work tackles the latter.

Many of the fundamental technologies used in video coding today were already introduced in the pioneering standards of the last 30 years. While advanced since, these technologies are still relevant today. As Wien [21] and [23] explain, the first digital video coding standard was ITU-T H.120 in 1984, which used fundamental technologies of video compression: intra-frame Differential Pulse-Code Modulation (DPCM) coding and scalar quantization, as explained in the next section, followed by Variable-Length Coding (VLC). With ITU-T H.261 in 1988, motion compensation on 16x16 macroblocks was introduced, including chrominance subsampling, utilizing the 8x8 Discrete Cosine Transform (DCT) followed by a zigzag-scan and Huffman VLC. The combination of temporal prediction between successive pictures and transform coding techniques is called the hybrid video coding scheme, which has been the structure of all video coding standards since H.261.

The MPEG released its MPEG-1 (ISO/IEC 11172-2) standard in 1993. While it is a hybrid coding scheme like H.261, it added bidirectionality and half-sample precision for motion compensation, slice-structured coding, quantization weight matrices, and an all-intra mode. This standard was advanced by MPEG-2 (ISO/IEC 13818-2) in 1995, adding support for interlaced-scan pictures for television, increased DC quantization precision, and the general concept of profiles and levels. Many of these specifications were later adopted by the ITU as H.262; the ITU also published H.263, H.263+, and H.263++ in 1996, 1997, and 2000, respectively. These standards introduced a variety of coding tools and design patterns, and the profiles and levels configuration schemes still used today.

In 1999, the ISO/IEC reached greater coding efficiency and packet-loss resilience, coding of still textures and synthetic content, and 10-bit and 12-bit sampling for studio applications with the successor MPEG-4 (ISO/IEC 14496-2). The most widely used standard as of 2019, at 90 percent usage [24], is Advanced Video Coding (AVC), which was developed by the Joint Video Team (JVT) and approved as ITU-T H.264 and MPEG-4 Part 10 (ISO/IEC 14496-10) in 2003. AVC introduced improved coding performance, a bit-exact decoder specification, 16-bit integer arithmetic, and a network-friendly design with a clear distinction between the video coding layer and the network abstraction layer for integration into different types of transport environments. Over time, H.264 AVC was extended by Scalable Video Coding (SVC) in 2007 and Multiview Video Coding (MVC) in 2009.

Through continuous research on new algorithms and increasing computing power, new application scenarios can be addressed. The Joint Collaborative Team on Video Coding (JCT-VC) set requirements for the new standard by 2010. The High Efficiency Video Coding standard, finalized in 2013, provides with its Motion-Constrained Tile Sets (MCTS) the possibility of tiled streaming, in which independent tiles can be encoded in parallel by applying certain constraints on the encoder side and re-written into one video stream [25, 26] on the client side. The data encoding itself must be performed systematically. Usually, the video content is split up and synchronized encoders process each segment independently in parallel for network scalability, while ensuring that the selected encoder parameters match to a degree that enables low-complexity tile merging.

The next-generation codec is called Versatile Video Coding (VVC) [27]. VVC is developed by the Joint Video Experts Team (JVET). It is intended to substantially reduce the workload when generating and combining separate portions of omnidirectional videos. This is achieved by incorporating new concepts, including normative boundary treatment [28] and an addressing scheme built on so-called subpictures within a coded picture [29].

The standardization of this new codec began in October 2017. While drafts for the standard have recently been released [27], it is still in development and will prospectively be finalized in July 2020. It recently reached up to a 38 percent bitrate reduction at the same Peak Signal-to-Noise Ratio (PSNR) for high-definition content objectively, compared to the latest HEVC test model, which relates to an average bitrate saving of 50 percent at the same perceptual quality subjectively. Figure 2.1 depicts the timeline of the video coding standards mentioned above.

[Figure not included in this excerpt]

**Figure 2.1** Timeline of the Video Coding History

#### 2.1.2 Structural Overview of HEVC and VVC Coding

Due to the years of development of video coding standards, a comprehensive description of all principles would go beyond the scope of this work. However, the most relevant parts are described as follows. Considering natural picture content, the spatial and temporal information usually does not change quickly over time. The information in the picture dimension, i.e. the spatial plane, and the time dimension, i.e. the temporal plane, is therefore autocorrelated to a large degree. Intra prediction for spatial decorrelation and inter prediction with motion compensation for temporal decorrelation were developed to profit from these correlations. Furthermore, the quantization in the coding process is discussed, as well as the quantization parameter, which is of high importance in this work.

Modern encoders use far more advanced techniques to further compress the video content to a minimum. However, the main principles are still relevant. An input picture is processed by a transform and quantized. The differences between sequential pictures are calculated and encoded in an inter-dependent fashion. Figure 2.2 depicts a typical coding structure in HEVC, on which newer coding standards like VVC are based.

[Figure not included in this excerpt]

A color picture is usually defined by a luminance (luma) channel, i.e. brightness, and two chrominance (chroma) channels; the corresponding YUV format can be used for subsampling of the less-relevant chroma channels, as opposed to the RGB format. In this work, the luma channel is analyzed. Essentially, residual video coding, the process of reducing redundancy in video data, is based on a linear transformation of the picture signal from the spatial to the frequency domain using the DCT, as defined by [30]. Unlike the Discrete Fourier Transform (DFT), the DCT produces a purely real spectrum; like the DFT, it can be implemented efficiently for power-of-two block sizes. A block size of 8x8 is mostly chosen as a tradeoff between calculation effort and quality. The frequency coefficients are the transformation result of the spatial coefficients, with the Direct Component (DC) as the first component. The coefficients are quantized with minimized perceptual loss. For an all-intra prediction mode, i.e. a prediction based on local correlations, this information is directly entropy coded by Context-Adaptive Binary Arithmetic Coding (CABAC). CABAC removes redundant information that was not already eliminated by the former prediction and generates the bitstream. For inter prediction, i.e. prediction also considering the correlation in the time domain, the scaling and transformation are inverted and the results are written to a picture buffer. This buffer is processed by motion-estimation-controlled motion compensation and an in-loop filter, as described in the following sections. The residual signal is entropy coded together with the motion vector information and other relevant encoder signals. Entropy encoding using arithmetic coding is usually complex and hard to parallelize; it is one of the bottlenecks of an encoding process [21].
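To give an intuition for the transform step, the following is a reference floating-point 2-D DCT-II for one 8x8 block. Note that HEVC and VVC actually specify fast integer approximations of this transform, so this sketch is only illustrative:

```python
from math import cos, pi, sqrt

def dct2_8x8(block):
    """Orthonormal 2-D DCT-II of an 8x8 block (pure-Python reference,
    not the integer approximation used in HEVC/VVC)."""
    N = 8
    def alpha(k):
        return sqrt(1.0 / N) if k == 0 else sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * cos((2 * x + 1) * u * pi / (2 * N))
                          * cos((2 * y + 1) * v * pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

# A flat block concentrates all energy in the DC coefficient:
flat = [[100.0] * 8 for _ in range(8)]
coeffs = dct2_8x8(flat)
# coeffs[0][0] is 800.0; every other coefficient is numerically zero.
```

The concentration of energy in few low-frequency coefficients for smooth content is exactly what makes the subsequent quantization and entropy coding effective.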

#### 2.1.3 Temporal Coding Structure

The temporal coding structure includes everything related to the temporal dimension. A partition of a sequence of pictures which are coded together is called a Coded Video Sequence (CVS). A CVS is encoded independently from other sequences that may be included in the same bitstream. The following explanation focuses on the organization of the input pictures of a CVS, the coding order, and the output order.

The order in which pictures have been recorded is referred to as the output order. In this order, the encoded video should be displayed again after decoding. To reconstruct the output order of the temporally successive pictures from the re-ordered coded pictures in the bitstream, the Picture Order Count (POC) is defined. The POC internally denotes the output order inside the CVS. The distance between two successive pictures in the output order throughout the CVS can vary.

On the other hand, the order in which the decoder reconstructs the pictures internally, which can differ from the output order, is referred to as the coding order. Only already decoded pictures are available as references in the prediction structure. The coding structure defines the relationship between successive pictures in a sequence, including their dependencies. Over the sequence, this coding structure can be periodically repeated. The pictures included in this structure are typically referred to as a Group of Pictures (GOP).

[Figure not included in this excerpt]

The first picture of a GOP is usually referred to as the key picture. A typical GOP is depicted in Figure 2.3. As can be seen, there are different types of references between single pictures. Base pictures which do not refer to any other picture are called Intra-Frames (I-Frames); frames referenced in a single time direction, i.e. uni-predicted, are historically called P-Frames, as in *prediction*; and pictures referenced in both time directions are called B-Frames, as in *bidirectional*. The frames are ordered in a hierarchy such that certain pictures can be discarded without making it impossible to decode other pictures. This structure is referred to as the hierarchical coding structure.
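The resulting coding order of a hierarchical GOP can be sketched as follows: the key picture is coded first, then recursively the midpoint pictures of each remaining interval. This is one common hierarchical-B arrangement, not the only one the standards permit:

```python
def hierarchical_coding_order(gop_size: int) -> list[int]:
    """Coding order (as POC values) of one hierarchical-B GOP:
    the key picture first, then recursively the midpoints of the
    remaining intervals, level by level."""
    order = [gop_size]  # key picture of the GOP
    def visit(lo, hi):
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append(mid)   # B-picture referencing pictures at lo and hi
        visit(lo, mid)
        visit(mid, hi)
    visit(0, gop_size)
    return order

print(hierarchical_coding_order(8))  # [8, 4, 2, 1, 3, 6, 5, 7]
```

Dropping the pictures appended last (the highest temporal layer) never breaks a reference, which is exactly the temporal scalability property of the hierarchy.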

The role of a single picture within the coded video sequence is defined by the picture type, usually indicated by the Network Abstraction Layer (NAL) unit type, i.e. the network package type of encoded streams. In HEVC there are four classes of pictures, with further subclasses, as follows:

- Random-Access Point (RAP) Pictures

- Leading Pictures

- Temporal Sub-Level Access Pictures

- Trailing Pictures

These picture types define constraints for the slice types and prediction methods. As relevant for this work, the first type, the RAP, shall be explained as follows. As the name suggests, decoding can be started at this type of picture. RAP pictures can only be I-Frames, since only intra prediction does not reference other pictures in the sequence. These frames are referred to as Intra Random-Access Points (IRAP).

#### 2.1.4 Spatial Coding Structure

The spatial coding structure concerns the local dimension, i.e. the picture plane, of the video. This is important since the coding structure of the so-called tiles is the main subject of investigation. Since the coding structure of HEVC, and VVC, includes many sub-structures, only the structures relevant for this work are described in this section, and the underlying partitioning is listed.

In general, a video is split into Coding Tree Units (CTU) for encoding. A CTU holds three Coding Tree Blocks (CTB), one for the luma samples and two for the chroma samples, as well as additional syntax elements. This corresponds to the macroblocks used in previous standards. CTUs can be arranged into blocks, slices, and tiles.

Slices are a way of partitioning a picture so that every slice is individually decodable. Each CTU is thereby contained in only one slice. In the extreme case, a slice can hold the full picture. The intra prediction method can be set per slice, and the CABAC is terminated at the end of each slice. Since slices are traditionally arranged in the scan order of CTBs, tiles were introduced to split the picture plane more flexibly. Tiles introduce extra horizontal and vertical boundaries at which intra and motion prediction are limited. A slice must either contain complete tiles, or a tile must contain complete slices. Therefore, instead of ordering the CTUs in a picture-wide raster scan, the stream is constructed partially vertically first, tile by tile, as depicted in Figure 2.4. The configuration of the tiles can be changed from picture to picture.
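The tile-wise CTU ordering described above can be sketched as follows; the uniform, evenly divisible tile grid is an illustrative assumption (real tile grids may have uneven column widths and row heights):

```python
def ctu_scan_order(pic_w_ctus, pic_h_ctus, tile_cols, tile_rows):
    """CTU raster addresses in bitstream order for a uniform tile grid:
    tiles are traversed in raster order, and the CTUs inside each tile
    in raster order, instead of a plain picture-wide raster scan."""
    col_w = pic_w_ctus // tile_cols
    row_h = pic_h_ctus // tile_rows
    order = []
    for tr in range(tile_rows):
        for tc in range(tile_cols):
            for y in range(tr * row_h, (tr + 1) * row_h):
                for x in range(tc * col_w, (tc + 1) * col_w):
                    order.append(y * pic_w_ctus + x)
    return order

# A 4x2-CTU picture split into a 2x1 tile grid: the left tile is coded
# completely before the right tile starts.
print(ctu_scan_order(4, 2, 2, 1))  # [0, 1, 4, 5, 2, 3, 6, 7]
```

Compared to the plain raster scan `[0, 1, 2, 3, 4, 5, 6, 7]`, the reordering keeps each tile's CTUs contiguous in the bitstream, which is what enables independent tile extraction.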

[Figure not included in this excerpt]

The structure below CTUs is as follows: CTUs and CTBs can be split into Coding Units (CU) and Coding Blocks (CB), respectively. Each CU can consist of Prediction Units (PU), to which the same prediction type, i.e. intra or inter, is applied. For transform coding, a CU or CB can also be partitioned into Transform Units (TU) or Transform Blocks (TB), respectively. For the sake of completeness: a slice consists of a slice header and slice segments, which encapsulate the CTUs. In contrast to HEVC, VVC CTUs can be arranged into slices, bricks, tiles, and sub-pictures more flexibly. This is beneficial for viewport-based video, as single areas of the full frame can be decoded.

#### 2.1.5 Quantization

As described in the previous section, quantization is performed after the linear transformation. This is one of the points at which the perceived quality of the video can be set. For this purpose, the quantization step size is selected. This step size controls the quantization noise and thus the data rate. In general, the quantization provides an irrelevance reduction at this point, introducing hardly perceivable information losses. The values in the transformation matrix are scaled with the corresponding elements of a quantization matrix, which compresses high diagonal frequencies the most. This is because in natural scenes horizontal and vertical structures are more common than diagonal structures. The human visual system has adapted to these conditions and perceives oblique structures more weakly, which can therefore be quantized more coarsely. This is called the oblique effect.

A step size is defined by the quantization parameter (QP), acting as an index into predefined step sizes. A low QP is equivalent to a fine quantization, resulting in precise rate control at additional processing cost, while a high QP results in a rather coarse step size, resulting in a less complex encoding at lower quality [21]. The QP can take values from 0 to 51 and thus has 52 possible integer levels for 8-bit video. A step of one in QP increases the quantization step size by approximately 12 percent; thus, an increase of the QP by six doubles the quantization step size [22].
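Concretely, the step size follows approximately Qstep = 2^((QP-4)/6), which reproduces both numbers above, as a short sketch verifies:

```python
def q_step(qp: int) -> float:
    """Approximate quantization step size for a given QP,
    following Qstep = 2 ** ((QP - 4) / 6)."""
    return 2.0 ** ((qp - 4) / 6.0)

# One QP step scales the step size by 2 ** (1/6), about +12.2 percent:
ratio_one_step = q_step(23) / q_step(22)
# An increase of six doubles the step size exactly:
ratio_six_steps = q_step(28) / q_step(22)
```

This exponential relationship is why rate assignment models in later chapters can treat QP differences as multiplicative bitrate factors rather than additive offsets.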

Although it is not further discussed here, different QPs for the luma and the chrominance channels are defined during quantization. Furthermore, for standards since AVC, quantization groups can be formed on the macroblock level, where the QP for each macroblock can be changed inside a frame and stored as a residual [22]. Additionally, as relevant for this work, a QP cascading approach is used in the encoding scheme. The QP is internally adjusted in a hierarchical manner, related to the hierarchical GOP structure with temporal layers [31]. However, this procedure can lead to block artifacts in pictures of a higher QP hierarchy when encoding very random movements like fire, water, or snow [22]. For a bitrate prediction using the encoder outputs, this poses the problem that bitrates using the same input QP parameter are encoded with different QPs internally, and are thus scaled differently in terms of quality and bitrate.

After the quantization, a scan is used to prepare the linear transformation coefficients for entropy coding. In HEVC the traditional diagonal zigzag-scan was replaced by a horizontal scan for performance reasons [32]. In VVC, Dependent Quantization (DQ) was introduced. The rate-distortion-to-calculation-complexity ratio is a measure to evaluate the goodness of new coding tools. As one of the coding tools with the best such ratios, the DQ sets the available levels for a transform coefficient dependent on previously coded levels. Furthermore, the QP range in VVC was increased from 0 to 63 [27].

In previous work [20], the authors used the QP chosen by the encoder, as reported in the encoder output, and introduced the QP-variance as a measure of quality across tiles: the variance of the reported QP distribution over the spatial plane, i.e. over all tiles, is calculated.
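A minimal sketch of this QP-variance measure, with hypothetical per-tile QP values, could look as follows:

```python
from statistics import pvariance

def qp_variance(tile_qps: list[float]) -> float:
    """Population variance of the per-tile QP values reported by the
    encoder. A small variance indicates a uniform quality distribution
    across the spatial plane, i.e. across all tiles."""
    return pvariance(tile_qps)

# Hypothetical reported QPs for a 3x2 tile grid:
qps = [30, 31, 30, 29, 30, 32]
print(qp_variance(qps))
```

A rate assignment that drives this variance toward zero yields the uniform tile quality targeted in this work.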

#### 2.1.6 Intra and Inter Prediction

The coding process of intra and inter prediction is also called the basic hybrid video coding approach, as it combines video frame prediction with the transformation of the prediction error. To remove redundancy inside one picture, i.e. in the spatial domain, intra coding is used. The picture frame is processed by the linear transformation and quantization, as described before, prior to entropy coding. This can also be used in an all-intra mode. The all-intra mode is useful at scene breaks or to prevent error propagation.

Inter prediction is one of the main components of video standards since H.261 and tries to eliminate redundancy in the temporal domain, i.e. from frame to frame. This Motion-Compensated Prediction (MCP) is used to find references for the current picture in the neighboring pictures and generates motion vectors. MCP is typically based on the translational motion model, which represents the movement of picture content over time. MCP operates on a prediction block level and is therefore also called block matching. The blocks can be searched in available reference pictures in a full search, looking at all possible blocks, or a fast search, where only the most probable blocks are considered. The fast search often leads to a relevant speedup at a marginal quality loss. Finally, the residual signal is calculated as the difference between the prediction and the current frame. After the transform, it is given to the entropy coder. If both a past and a future picture are referenced from the picture buffer, this is called bi-directional prediction and the resulting frames are called B-Frames. For this task, the transmission order of the frames must be changed. Therefore, a prediction or coding order and an output order are defined, as explained in 2.1.3.
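A toy version of the full search described above can be sketched with the Sum of Absolute Differences (SAD) as the matching cost; the block size, search range, and sample values are illustrative assumptions:

```python
def full_search(ref, cur_block, cx, cy, search_range):
    """Full-search block matching: find the motion vector (dx, dy)
    minimizing the Sum of Absolute Differences (SAD) within
    +/- search_range samples around position (cx, cy)."""
    bh, bw = len(cur_block), len(cur_block[0])
    h, w = len(ref), len(ref[0])
    best = (None, float("inf"))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = cx + dx, cy + dy
            if x < 0 or y < 0 or x + bw > w or y + bh > h:
                continue  # candidate block outside the reference picture
            sad = sum(abs(ref[y + i][x + j] - cur_block[i][j])
                      for i in range(bh) for j in range(bw))
            if sad < best[1]:
                best = ((dx, dy), sad)
    return best  # ((dx, dy), SAD of the best match)

# Toy example: a 2x2 block of value 9 that moved one sample to the right.
ref = [[0, 0, 0, 0],
       [0, 9, 9, 0],
       [0, 9, 9, 0],
       [0, 0, 0, 0]]
cur = [[9, 9], [9, 9]]
print(full_search(ref, cur, 0, 1, 1))  # ((1, 0), 0)
```

A fast search would evaluate only a subset of these candidates, e.g. along a diamond pattern, trading a small risk of missing the optimum for a large speedup.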

In HEVC the intra coding consists of three steps: reference sample array construction, sample prediction, and post-processing. The Advanced Motion Vector Prediction (AMVP) was introduced, which sends the best motion-based prediction to the decoder. Since motion compensation is one of the most computationally costly tools in a video encoder [21], HEVC implements two prediction modes while keeping computational costs low. The angular prediction methods model structures with directional edges, leading to a gain of approximately 10 percent compared to AVC [33]. There are also the planar and DC predictions, which smoothen picture content [22].

In VVC, one of the coding tools with the highest gain in rate-distortion-to-computational-complexity is the Affine Motion Compensation (AMC). AMC supports zoom, rotation, and other transformations on top of the traditional translational motion techniques. It may use triangle partitioning of coding units for a more precise representation of in-picture borders. The Adaptive Motion Vector Resolution (AMVR) is a new well-performing coding tool. AMVR allows encoding with different precisions for each Motion Vector Difference (MVD) within the DPCM encoding procedure of the vectors. Another new coding tool is Reference Picture Resampling (RPR). RPR allows a change of picture size inside a GOP, without the need of starting from a new intra picture, which would usually cause a spike in bitrate, by resampling decoded pictures from the picture buffer to the new picture size [27].

#### 2.1.7 In-Loop Filtering

In-loop filtering is an extra step introduced in AVC and refined in HEVC and VVC. The purpose of the technology is to remove block artifacts which result from the prediction block structure in the coding process (deblocking). In HEVC this technology was advanced by the Sample Adaptive Offset (SAO) filter, while the deblocking filter was simplified. The in-loop filter is applied after reconstructing the coding blocks through the inverse linear transformation [21]. The deblocking filter acts in two passes, one for each of the two rectangular directions, with adaptive filter slope and length of the filter impulse response. The SAO acts sample-oriented and can be set to Edge Offset (EO), which filters local directional structures, to Band Offset (BO), which filters intensity values on a per-pixel basis, or not at all (OFF).

For each sample, differences in the spatial neighborhood or the spatial value-range of intensities are analyzed. Since the SAO needs both the original and the reconstructed CTUs available in the offset derivation stage, two additional CTU buffers are needed. For efficiency reasons, the deblocking and the SAO search, where the SAO gathers statistics from the original and reconstructed CTUs, are processed in parallel. The SAO then offset-filters in a second step, the filtering stage [22]. Because the SAO thus uses non-deblocked data, a small error occurs. However, this error is considered a good trade-off between memory consumption and the error induced. In general, the in-loop filter can operate across slice and tile boundaries, whereas intra coded pictures are excluded since they already provide a faithful representation of the original picture.

In VVC, the Adaptive Loop Filter (ALF) reduces blurring and ringing artifacts, as well as artifacts that are not removed by the previous filtering tools, by optimizing filter kernels on the encoder side based on the actual coded content. The adapted filters are transmitted as filter parameters. Another notable gain has been achieved by the Luma Mapping with Chroma Scaling (LMCS) tool. This tool reshapes the luma components before the in-loop filter.

#### 2.1.8 Coding Schemes

The Common Testing Conditions (CTC) [34] define common coding schemes for several application scenarios. There are four coding schemes defined by the JVET for VVC: All Intra (AI), Random-Access (RA), Low-Delay P (LDP), and Low-Delay B (LDB). The RA coding scheme is mainly used for video-on-demand streaming applications, whereas the Low-Delay coding schemes are used for real-time communication applications. The Low-Delay coding schemes do not use reordering of the GOP, as opposed to the RA coding scheme. The LDP uses only P-frames and the LDB only B-frames. Furthermore, temporal layers are defined, which also define a QP offset for the different frames inside one GOP. For LDB this is 5, 4, 5, and 1 for a GOP size of four without reordering. For RA the offsets are 1, 1, 4, 5, 6, 6, 5, 6, 6, 4, 5, 6, 6, 5, 6, and 6 for the picture order counts 16, 8, 4, 2, 1, 3, 6, 5, 7, 12, 10, 9, 11, 14, 13, and 15, respectively. The GOP size of RA is consequently 16.
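As an illustration, the RA offset list above can be expressed as a lookup table. The sketch below is simplified (real encoder configurations carry further per-frame parameters such as QP-offset model factors), and the helper name `frame_qp` is our own:

```python
# QP offset per picture order count (POC) within one RA GOP of size 16,
# following the offsets listed above; POC 0 (the first intra picture)
# receives no offset in this simplified sketch.
RA_QP_OFFSET = {
    16: 1, 8: 1, 4: 4, 2: 5, 1: 6, 3: 6, 6: 5, 5: 6,
    7: 6, 12: 4, 10: 5, 9: 6, 11: 6, 14: 5, 13: 6, 15: 6,
}

def frame_qp(base_qp, poc, gop_size=16):
    """Return the QP of a frame from its POC (simplified RA cascading)."""
    if poc == 0:
        return base_qp
    poc_in_gop = poc % gop_size or gop_size  # GOP boundaries use the POC-16 entry
    return base_qp + RA_QP_OFFSET[poc_in_gop]
```

For example, a frame at POC 2 with a base QP of 32 would be encoded at QP 37 under this table.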

### 2.2 Video Streaming

Since this work mainly deals with omnidirectional video, the basics of the transmission of panoramic video content will be introduced in the following. First, the representation of omnidirectional scenes on a 2d plane will be described. Subsequently, video streaming processes are explained and the peculiarities of tiled streaming are highlighted.

#### 2.2.1 Omnidirectional Projection

To store omnidirectional video in the usual rectangular form of typical 2d video content, the projection plane of the content must be defined. For that task, the JVET investigated and specified projection formats, their conversions, and content quality measures for omnidirectional video in [35] for their 360Lib software.

First, the 2d video plane and the 3d plane must be defined. In Figure 2.5, a 2d surface mapping onto a 3d hull is depicted. The vertical and horizontal axes of the plane are defined as *m* and *n* (orange circles). The center is defined as φ = 0 and θ = 0, whereas the borders are defined as φ = ±180° and θ = ±90°. To ensure symmetry between the sampling points, a shift between the origin of *u* and *v* and the origin of *m* and *n* is defined.


**Figure 2.5**2d sampling coordinate definition, Source: [35]

The Cartesian coordinate system *XYZ* is depicted in Figure 2.6. The X-axis is directed to the front, the Y-axis to the top, and the lateral Z-axis to the right of the user, respectively. The longitude *φ* ranges counter-clockwise from *−π* to *π*, and the latitude *θ*, measured from the equator towards the Y-axis, ranges from *−π/2* to *π/2*.


**Figure 2.6**3d XYZ system, Source: [35]


**Figure 2.7**Viewport generation with rectilinear projection, Source: [35]

The viewport is the visible area that is looked at by a user watching in a certain direction, and therefore the set of sample positions inside that area. The viewport usually spans more than one face of the projection format. To calculate the viewport, the XYZ-coordinates must be found and projected onto the corresponding 2d plane. The letters A, B, C, and D denote the borders of the viewport, as depicted in Figure 2.7. The Equirectangular Projection Format (ERP) shall be defined as follows, since it is by now common in 360-degree video coding and the source files used in this work are provided in this format. The ERP has only one face and its uv-plane ranges from zero to one. For conversion from 2d to 3d, a sampling position, defined by *m* and *n*, is first transformed into *u* and *v*:

*u* = (*m* + 0.5) / *W* and *v* = (*n* + 0.5) / *H*, with *W* and *H* the width and height of the ERP picture in samples.

Subsequently, *u* and *v* are transformed into longitude *φ* and latitude *θ* by:

*φ* = (*u* − 0.5) · 2*π* and *θ* = (0.5 − *v*) · *π*.

For conversion from 3d to 2d, the transformation is processed inversely. A padding of eight luma samples on each side of the *uv*-plane can be applied to reduce seam artifacts for viewports crossing the boundaries. Furthermore, in VVC, motion compensation can wrap around the picture borders for omnidirectional video in case of equirectangular projection [27].
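The two conversion steps can be sketched as follows. This is a minimal illustration of the sample-to-sphere mapping described above, with angles in radians; the function names are our own, not the 360Lib implementation:

```python
import math

def erp_to_sphere(m, n, width, height):
    """ERP sample position (m, n) -> (longitude, latitude) in radians."""
    u = (m + 0.5) / width
    v = (n + 0.5) / height
    phi = (u - 0.5) * 2.0 * math.pi   # longitude in [-pi, pi)
    theta = (0.5 - v) * math.pi       # latitude in (-pi/2, pi/2)
    return phi, theta

def sphere_to_erp(phi, theta, width, height):
    """Inverse mapping: (longitude, latitude) -> continuous ERP position."""
    u = phi / (2.0 * math.pi) + 0.5
    v = 0.5 - theta / math.pi
    return u * width - 0.5, v * height - 0.5
```

Round-tripping a sample position through both functions returns the original position, which is a quick sanity check of the inverse relation.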

The Cubemap Projection Format (CMP) is beneficial for the encoding process, as the picture content is less warped towards the poles compared to the ERP [36]. The CMP format consists of six faces, which are labeled by a combination of P for positive or N for negative and an XYZ-direction, as depicted in Figure 2.8. Each *uv*-plane is a 2x2 square, where *u* and *v* range from -1 to 1. Table 1 lists the face index of each face in the first column.


**Figure 2.8**Coordinates definition for CMP, Source: [35]


**Figure 2.9**Frame packing for CMP, Source: [35]

For conversion from 2d to 3d, any square face is denoted as A×A and a sampling position, defined by *m* and *n*, is first transformed into *u* and *v* by:

*u* = (*m* + 0.5) · 2/*A* − 1 and *v* = (*n* + 0.5) · 2/*A* − 1.

The 3d XYZ-coordinates are derived for each face index from the columns *X* and *Y* of Table 1. The inverse transformation is calculated accordingly under the condition in the column Condition and the columns *u* and *v* of Table 1. Then *u* and *v* are solved through equations 2.8 and 2.9.
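A minimal sketch of the face-local conversion following the equations above; note that the axis directions and signs for the example face are an assumption for illustration only, since the actual per-face derivation is given by Table 1:

```python
def cmp_sample_to_uv(m, n, A):
    """CMP sample (m, n) on an A x A face -> (u, v) in [-1, 1]."""
    u = (m + 0.5) * 2.0 / A - 1.0
    v = (n + 0.5) * 2.0 / A - 1.0
    return u, v

def cmp_front_to_xyz(u, v):
    """Example 3d derivation for one face only; the (X, Y, Z) sign
    convention used here for the front face is assumed, the real
    per-face mapping is defined by Table 1."""
    return 1.0, -v, -u
```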

For the CMP projection format, or generally for formats that have multiple faces, the faces are put into a single 2d picture plane. Since there are different ways of placing and rotating the faces, the 3x2 packing method depicted in Figure 2.9 was proposed. With this packing, there is only one boundary in the middle of the picture where the motion prediction must be interrupted.

**Table 1**Face indexes of CMP and 2d-3d-conversion, Source: derived from [35]

[Table not included in this excerpt]

#### 2.2.2 Viewport Adaptive Tiled Streaming

In general, the streaming process, without any concern for tiles, uses technologies like MPEG Dynamic Adaptive Streaming over HTTP (DASH) from 2011. A video is split into multiple temporal segments, i.e. chunks. Chunks have a certain duration, typically between 0.5 and 10 seconds. Each of these chunks is then encoded multiple times, ideally in parallel, each time with different quality and bitrate settings. The results are held ready at the server. A streaming client can now download these segments at the best possible quality, limited by the bandwidth. The client may use an adaptation algorithm in case of sudden changes in the network capacity while ensuring continuous playback and as few buffering events (playout stalling) as possible.
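A throughput-based client adaptation step of the kind described above can be sketched as follows. This is a simplified heuristic, not a specific DASH player algorithm; the safety margin is an assumed parameter:

```python
def select_representation(bitrates_kbps, throughput_kbps, safety=0.8):
    """Choose the highest-bitrate representation of a chunk that still
    fits into a safety margin of the measured network throughput; fall
    back to the lowest representation when none fits."""
    affordable = [b for b in sorted(bitrates_kbps)
                  if b <= throughput_kbps * safety]
    return affordable[-1] if affordable else min(bitrates_kbps)
```

With representations at 500, 1000, 2000, and 4000 kbps and a measured throughput of 3000 kbps, the 2000 kbps representation would be selected.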

For tiled omnidirectional streaming, the decoder capacity is usually limited to a 4k resolution, as is common in mobile devices and broadcasting. When transferring the whole 3d sphere to the user, a large amount of data and decoder resources is wasted on regions that are outside the viewport. Therefore, a more intelligent approach is to use high-resolution tiles for the visible area and low-resolution tiles for the non-visible area. These are merged together into a 4k resolution video. Consequently, for omnidirectional tiled streaming, the video is again encoded with different quality settings, but additionally in a high and a low resolution. By design of the codec, single tiles from the high-resolution bitstream inside the viewport and low-resolution tiles outside the viewport can be extracted separately and merged through a low-complexity rearrangement, as depicted in Figure 2.10.

Figure 2.10 shows a high-resolution and a low-resolution video frame on the server side, which is packed by the Region Wise Packing (RWP) and transferred to the client device. The blue tiles depict the high-resolution tiles and the red tiles the low-resolution tiles. The packed frame is then rendered to the projection space, shown in tiles (left) and in the cube projection (right). The figure depicts an arbitrary viewport marked in red. For the depicted tiling of 24 tiles, each cube face of the projection format, as shown in Figure 2.9, is split into four tiles.

It is therefore important for tiled streaming that the inter prediction and intra prediction are terminated at the borders of tiles. While the CMP format allows for a full video that has only one visible interruption in the middle of the picture plane, the motion prediction must end at tile borders to prevent the usage of wrong references when tiles are newly arranged by the RWP.

The MPEG Omnidirectional Media Format (OMAF) standard [37] specifies a tile-based viewport-dependent profile for HEVC coded video [38]. In this profile, the users can constantly change their tile collection, i.e. the tile configuration, according to their viewport. This feature is also implemented in 3GPP TS26 [39]. The bitstream is ultimately given to a single hardware decoder on the user device, as is common on mobile devices.


**Figure 2.10**From two resolutions to projected frame, Source: derived from Podborski, Fraunhofer HHI

An important limitation is the bandwidth or capacity of the network the video is streamed over. Since the individual tiles have different content complexities, the bitrate of each tile decreases non-linearly when resampling it to a lower resolution. This generates a deviation in quality for each of the 24 possible viewport combinations for which a certain tile is the main tile or center of the viewport, i.e. has the largest surface in the viewport. The total bitrate for the same quality will differ depending on which tile combination is selected. The resample factor represents the ratio of the resolution difference between the high-resolution and the low-resolution tiles and is individual for every tile.

When streaming such a video configuration, the limiting capacity usually must not be violated. Therefore, the tile combination *s* with the highest total bitrate, denoted as *s_max*, is found, and the quality under which the capacity is not violated is set as the target quality for all tiles. Thus, every other tile combination will lead to a lower bitrate than the target bitrate.
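The search for *s_max* and the resulting target quality can be sketched as follows. The data layout (a bitrate table per quality index and tile combination) is an assumption for illustration:

```python
def worst_case_bitrate(comb_bitrates):
    """comb_bitrates: total bitrate per viewport-centered tile combination,
    e.g. one entry per each of the 24 possible main tiles, for one quality.
    Returns (s_max, bitrate) of the most expensive combination."""
    s_max = max(comb_bitrates, key=comb_bitrates.get)
    return s_max, comb_bitrates[s_max]

def highest_feasible_quality(bitrate_per_quality, capacity):
    """bitrate_per_quality: {quality_index: {combination: total_bitrate}},
    with a higher quality index meaning better quality. Pick the best
    quality whose worst-case combination still fits the capacity."""
    feasible = [q for q, combs in bitrate_per_quality.items()
                if max(combs.values()) <= capacity]
    return max(feasible) if feasible else None
```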

For optimal tiled streaming, i.e. showing only the current viewport, a prediction of the user's head movement would be desirable to remove content outside the user's viewport prior to the transmission. However, the head movement cannot be predicted with total accuracy. Due to the inter prediction of the encoded video stream, the tile configuration can only be changed at each successive random-access point, which is in the most trivial case the next I-frame. If the user changes the viewport inside a RAP, the content would not be available immediately, which leads to undefined portions within viewports and should be avoided. Therefore, the areas outside the current viewport of the user must also be transferred.

However, much work on the prediction of the user's head movement and the viewport has been done in [40, 13, 14, 11, 41]. In [42] the authors inspected the upper limits of such head movement prediction. The impact of the tile-based approach on the delay the client device needs to adapt the high-quality viewport to the new viewing direction after a head movement was analyzed by [43].

#### 2.2.3 Rate Assignment for Encoder Control

If video content is encoded using tiles, a bitrate must be assigned to each of the individual tiles, as is usual and necessary as an encoder input parameter. The bitrate of a tile is, therefore, a part of the total bitrate. To distribute the available bitrate, different strategies are available. For example, the rate can be distributed according to the *uniform rate* approach, where each tile is assigned the share of the total bitrate that corresponds to the number of pixels of the tile in relation to the total number of pixels. However, due to the strong dependency of the bitrate on the picture content of every video, this approach leads to unwanted variations in visual quality for a user over the entire omnidirectional video. It is, however, desirable to achieve a rate assignment that leads to *uniform quality*. Here, each part is assigned as much bitrate as needed so that the quality impression is the same over the entire omnidirectional sphere. For this equal quality distribution, exact knowledge of the bitrate of each tile after encoding is required. An estimation of the total bitrate and the bitrates of the individual tiles is therefore indispensable.
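The *uniform rate* strategy can be stated in a few lines; a minimal sketch with our own helper name:

```python
def uniform_rate(total_bitrate, tile_pixels):
    """Assign each tile a share of the total bitrate proportional to its
    pixel count (the 'uniform rate' strategy described above)."""
    total_pixels = sum(tile_pixels)
    return [total_bitrate * p / total_pixels for p in tile_pixels]
```

For equally sized tiles this reduces to an even split; a tile with twice the pixels of another receives twice the bitrate, regardless of its content complexity.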

For an estimation of a rate assignment, measures are necessary which represent the complexity of a picture derived from its image content. Since video has a spatial and a temporal domain, both can be analyzed. These metrics are called spatio-temporal content complexity metrics.

In video coding, there are quality measures. These are usually divided into subjective and objective video quality or distortion measures, for example, the perceptual similarity indices. This work utilizes the Haar Wavelet-Based Perceptual Similarity Index (haarPSI) [44] as a content complexity feature. HaarPSI is used to approximate the subjective difference in quality between an uncompressed and a compressed picture, as experienced by a user. To measure quality across omnidirectional video, however, the projection on the 3d sphere must be considered. Spherical objective quality features were defined by the JVET [35]. The PSNR and the closely related Mean Squared Error (MSE), which can be used to measure distortion on pictures, are defined by the JVET [45] as well. Given a distortion-free *O*×*P* monochrome picture *I* and its noisy approximation *K*, the MSE is defined as:

MSE = 1/(*O*·*P*) · Σᵢ Σⱼ [*I*(*i*, *j*) − *K*(*i*, *j*)]², with *i* = 0, …, *O*−1 and *j* = 0, …, *P*−1.
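A direct transcription of the MSE and the PSNR derived from it, assuming 8-bit content with a peak sample value of 255:

```python
import math

def mse(original, noisy):
    """Mean squared error between two equally sized 2d sample grids."""
    num = sum((o - k) ** 2
              for row_o, row_k in zip(original, noisy)
              for o, k in zip(row_o, row_k))
    return num / (len(original) * len(original[0]))

def psnr(original, noisy, max_value=255):
    """PSNR in dB for a given peak sample value (255 for 8-bit)."""
    m = mse(original, noisy)
    return float('inf') if m == 0 else 10 * math.log10(max_value ** 2 / m)
```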

### 2.3 Regression

Regression is a term from statistics and data science, describing the process of predicting data by estimating the relationship between a dependent variable (output) and a set of independent variables (input, predictor, regressor, feature). This chapter gives a detailed overview of the regression model of Random Forests (RF) and is based on the work of Gilles Louppe [46] and Bradley Boehmke [47].

#### 2.3.1 Regression Model Overview

There exist several well-performing regression models. The class of linear methods is one of the oldest in supervised learning. It is derived from simple linear methods from statistics. In this kind of method, the central assumption made on the model is that the predictor can be described as a linear combination of the input variables. The input variables thus form a hyperplane in the m-dimensional space.

The class of Support Vector Machines (SVM) acts similarly and tries to find a hyperplane which linearly separates the data points with maximum distance. However, it does not have to consider all features for finding this hyperplane. The support vectors are the vectors closest to the hyperplane.

The recently very common machine learning approach of the family of Neural Networks (NN) is considered as well. Based on information processes in brains (“neural”), it utilizes neurons arranged in layers. The biggest drawbacks are the computation time for training and the missing knowledge about the learned data. Due to the immense extent of this topic, the reader is referred to Goodfellow on deep learning [48].

The Random Forest (RF) and Extremely Randomized Trees (ET) are ensembles of decision trees. Decision trees use a tree graph to model decisions and can, therefore, be displayed easily in a tree diagram. To derive the RF and ET models, decision trees are explained in the following.

The class of non-linear models as used by [20] has, by its nature, the disadvantage that optimal variables must be found. Thus, it is also necessary to determine the input parameters for the search of these variables for the reference regression model to ensure convergence of the model. Support vector machines, on the other hand, are not capable of learning the approximately 200,000 data points and cannot be parallelized easily [49].

A NN might represent the data but needs a huge dataset to achieve good results, which is not an option because of the limited set of test sequences available. Furthermore, to receive accurate results, the training process takes a long time compared with the decision tree models [50]. The neural network approach is therefore not considered further. Ideas on the implementation of neural networks for this problem are given in the prospect.

However, the decision to use RFs and the related ETs is made for reasons derived from the use case in video coding. The following lists the advantages of decision trees and other tree-based models [46]:

1. Decision trees are non-parametric, i.e. without any a priori assumptions, a model of complex relations between features and predictors can be achieved

2. Decision trees handle heterogeneous data i.e. ordered or categorical variables, or both

3. Decision trees inherently implement a feature selection i.e. they are robust to irrelevance or noise in data

4. Decision trees are robust to outliers or errors in labels

5. Decision trees are interpretable for non-statistically oriented users

Concluding, the decision towards the RF model and the ET model is made for performance and handling reasons. The RF model and the ET model are fundamentally similar models. First, the RF model is used for the evaluations, since it has the capability of analyzing the OOB results. Secondly, its performance is then compared to the ET model. The hyperparameters number of trees, maximum number of leaves, and number of parameters per split are set to reasonable values leading to reasonable regression performance on the dataset and are later optimized. Consequently, in the following sections, only the Random Forest (RF) and Extremely Randomized Trees (ET) as ensembles of decision trees are explained.

#### 2.3.2 Decision Trees

The first proposal of a decision tree, to solve the multi-variate non-additive effects in survey data, was made by Morgan and Sonquist in 1963, further developed by Sonquist in 1970, Messenger and Mandell in 1972, Gillon in 1972, and Sonquist et al. in 1974. However, fundamental work was done by Breiman in 1978, Friedman in 1977 and 1979, and Quinlan in 1979 and 1986, who proposed induction methods for decision trees. The work by Breiman in 1984, often quoted as the source for RFs and later completed by Quinlan in 1993, combines these contributions [46].

The principle of tree-based models is that the training data is split according to certain split values in the predictor data. At first, all training data belongs to one subset, which is then split into usually, but not necessarily, two subsets, i.e. a binary tree. The leaf nodes of the tree are the final subsets without any following split; every other node is called a split node, except for the first, which is called the root. While both regression and classification can be tackled by decision trees, for regression the average over the training data points at each leaf node is used as the prediction.

[Illustration not included in this excerpt]

The Classification and Regression Trees (CART) algorithm is the most common algorithm for tree induction, although there exist other algorithms which differ in the number of splits at each node, the splitting criterion, or the procedure for finding splits. To train the model for continuous output, i.e. a regression, often a variance reduction method is used, as introduced in the CART work by Breiman [51]. This variance reduction is defined as the total reduction of the variance of the output variable induced by a split at that node.

When building a tree with this method until each sample results in a leaf node, the tree will overfit, i.e. it does not generalize to other data, since it learns every data sample including noise. A deeper decision tree, and therefore a smaller measured training error, does not necessarily lead to a better model. On the other side, when the tree is too shallow, the data cannot be learned finely enough and the model might diverge. Therefore, it is necessary to find the right depth of a tree, as depicted in Figure 2.11. The blue line is the function the samples were drawn from, the gray dots are the samples, and the red line is the split by the decision tree. The figure shows underfit trees in the right column and an overfit tree in the upper-right. The lower-left shows a well-performing model.

The problem is to know when the tree is deep enough so that the iterative splitting can be stopped. Regardless of overfitting, a node is a leaf node when fewer than two data points are left for a split. While there exist different formulations of additional stopping criteria to prevent overfitting, the minimum-samples-per-leaf criterion as used in this work is defined as follows: set the current node as a leaf node if there is no split such that the left and the right child node both count at least N samples. The choice of this so-called hyperparameter is included in a dedicated heuristic, which can lead to expensive computation costs, but usually increases the generalization performance. The procedure will be discussed in the later section on hyperparameter optimization. Figure 2.11 depicts in its rows the influence of the minimum-leaf parameter for different tree depths.

When the stopping criterion is not met, a split must be found. The chosen split maximizes the decrease of the impurity measure, which for regression is usually the Sum of Squared Errors (SSE):

SSE = Σᵢ (*yᵢ* − *ȳ*)², where the sum runs over the samples of a node and *ȳ* is the mean output value of these samples.

When the output variable is quantitative, it can be shown that any split reduces the squared error loss. There are therefore n−1 candidate splits for n data points. This loss reduction is positive whenever the values at the child nodes differ from the parent node, i.e. the values are unique. Therefore, a process of pruning is defined, which trims the tree by minimizing the SSE while penalizing deep trees. The objective of pruning is, thus, to find the shallowest tree with the lowest penalized SSE.
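The exhaustive scan over the n−1 candidate splits of a single feature, choosing the threshold with the lowest summed child SSE, can be sketched as follows (a minimal single-feature illustration, not a full CART implementation):

```python
def sse(ys):
    """Sum of squared errors of a node around its mean output."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Scan the n-1 candidate thresholds of one feature and return the
    (threshold, cost) pair minimizing the summed SSE of both children."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = (None, float('inf'))
    for k in range(1, len(xs)):
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        thr = (xs[order[k - 1]] + xs[order[k]]) / 2.0  # midpoint threshold
        cost = sse(left) + sse(right)
        if cost < best[1]:
            best = (thr, cost)
    return best
```

For two clearly separated clusters, the midpoint between them is found with zero residual SSE.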

Instead of modeling the output as a single variable, another way is to define a vector as output, for example, when predicting multiple tile bitrates from a single set of features. The intuitive approach of training multiple separate decision trees leads to more computational complexity compared to a single multi-output regression. Moreover, the correlations between the output variables can be considered when training a multi-output model, which may improve the overall accuracy of the model.

Since RFs usually have low bias but suffer from high variance, the bias-variance decomposition can be used to reduce variance over an ensemble of trees. This is the reason the predictions of the single trees of an ensemble are merged into one final model. Bias and variance are explained as follows: while the bias is the expected deviation from the actual sample, the variance is the spread of the guessed value, as depicted in Figure 2.12. Because randomization in the training process increases the bias but enables a reduction in variance, finding a well-performing tradeoff between bias and variance is challenging. The maximum number of features is, therefore, a hyperparameter and can be used to tune the bias-variance tradeoff.

After decision trees have been introduced, the focus is now on the process of building ensembles of decision trees, the so-called forests. A forest consists of a larger set of trees, e.g. up to thousands, with a randomized tree induction. Because decision trees usually have low bias and a high variance [46], they can benefit from building ensembles, which can drive down the variance while keeping the bias low. For regression, random changes of the training data and variable selection for each tree can lead to lower variance in the final model when averaging the decision thresholds over the whole forest.


**Figure 2.12**Bias and variance, ground-truth corresponds to the center of a circle, Source: derived from [46]

The generation of separate models on bootstraps, i.e. sample subsets, is called bagging or bootstrap aggregation and was introduced by Breiman in 1994. For bootstrap aggregation, as the name suggests, bootstrap samples are drawn from the dataset and models are trained on these sets. The results are then averaged to form the final model. While this procedure has a high probability that the final model will perform with higher accuracy, it may happen that the bootstrap set, which on average contains only about two-thirds of the unique original samples, is so small that the bias increase cannot be compensated by the achieved variance reduction and therefore the overall model will perform worse. Irrespective of this, bagging has proven to be a well-performing approach in many applications. In 2001, Breiman proposed a combination of a random feature selection at each node, as suggested by Amit et al. in 1997, with his bagging approach [46].
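Bootstrap aggregation in its plainest form can be sketched as follows; `fit` and `predict` are placeholder callbacks standing in for an arbitrary base learner, so the sketch shows only the resampling and averaging:

```python
import random

def bootstrap(data, rng):
    """Draw a bootstrap sample: n draws with replacement from n points."""
    return [rng.choice(data) for _ in data]

def bagged_predict(train, fit, predict, x, n_models=25, seed=0):
    """Bootstrap aggregation: fit n models on bootstrap samples of the
    training data and average their predictions for input x."""
    rng = random.Random(seed)
    models = [fit(bootstrap(train, rng)) for _ in range(n_models)]
    preds = [predict(m, x) for m in models]
    return sum(preds) / len(preds)
```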

Since RFs consist of bootstrap samples, the so-called out-of-bag (OOB) methods are available whose measures are calculated for each bag and then averaged over all trees in the forest. These can be used to, for example, calculate a prediction error on the training data or estimate the feature importance. Thus, for each tree, an OOB prediction error can be calculated.
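The per-tree OOB error estimate described above can be sketched as follows. This is a pure-Python illustration with placeholder `fit`/`predict` callbacks standing in for a tree learner:

```python
import random

def oob_indices(n, sample_indices):
    """Indices not drawn into a bootstrap sample (the out-of-bag set)."""
    drawn = set(sample_indices)
    return [i for i in range(n) if i not in drawn]

def oob_error(data, fit, predict, n_trees=50, seed=0):
    """Average squared OOB prediction error over a bagged ensemble: each
    model is evaluated only on the samples it was not trained on."""
    rng = random.Random(seed)
    n = len(data)
    errs = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        model = fit([data[i] for i in idx])
        for i in oob_indices(n, idx):
            x, y = data[i]
            errs.append((predict(model, x) - y) ** 2)
    return sum(errs) / len(errs)
```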

Feature importance can be used to reveal correlations in the data and possibly remove features with low relevance to reduce the model complexity and increase the probability of model convergence. As has been shown in [46], irrelevant variables do not change the resulting variable importance and do not decrease the model accuracy. Nonetheless, redundant variables can lead to a bias in importance towards one of these redundant variables. Masking effects are automatically reduced in RFs, as a random subset of the available features is used for each split. This holds as long as the number of removed features is smaller than the number of redundant variables. Because of this approach, Breiman proposed a measure of feature importance by summing up the impurity decrease at all nodes where a feature is used and averaging over the forest.

The variance in decision trees has a significant influence on the generalization error, i.e. the error on unseen data, as shown by Geurts. Research by Geurts and Wehenkel indicated that the random splitting of the learning set in the training phase can contribute to the high variance of decision trees, and even the optimal split point can still induce significant variance. Therefore, Geurts proposed the Extremely Randomized Trees (ET), which, instead of finding the best decision threshold at each node, use a random decision threshold in a certain range, in addition to the random variable subset selection of RFs. Also, the ET does not use bagging, as bagging often reduces performance for large datasets [46].

#### 2.3.3 Model Optimization

In order to optimize the hyperparameters, i.e. the user-defined parameters of the RF model, there are several approaches. In this work, we focus on Bayesian Optimization, as it was used, for example, in the optimization of the AlphaGo system [52]. Objective functions that take a long time to compute, are simply very costly, or are limited by physical resources are often optimized by the Bayesian Optimization approach. Bayesian Optimization is well suited for continuous domains of less than 20 dimensions while tolerating noise in each function evaluation. Usually, a Bayesian Optimization algorithm consists of two main steps: a Bayesian statistical model is derived for the objective function, and an acquisition function is defined, which determines the hyperparameter combination to be sampled in the next step. Formally, the approximation error measures the distance between the optimal model and its approximate learned model [53]. With the goal of finding a more generalized prediction output, the Bayesian Optimization calculates an error for the non-deterministic prediction output of the decision tree models. For details, the reader is referred to [53].

The RF has the internal capability of a feature selection from the out-of-bag feature importance measure and makes use of a feature selection when training [46]. It was stated before that redundant and strongly correlated features lead to a misinterpretation of the feature importance. Feature selection, also known as variable subset selection, is therefore important to ensure successful training. The goal of a feature selection is to remove either redundant or irrelevant features [54] to reduce the variance of the model. Note that redundant features are not necessarily irrelevant, and irrelevant features can still be redundant.

#### 2.3.4 Model Validation

For any simulation or experiment, validation is necessary. This is also true for regression models, where the trained model is supposed to be tested against a dataset it was not trained on. This process is called cross-validation. The LOOCV is a special case of the k-fold cross-validation [48], where a model is trained on *k* random subsets of the original dataset and a user-defined percentage is held out for testing; the overall process is repeated *k* times. The LOOCV is an exhaustive method, where *k* is equal to the original data sample size, so that each set is held out once. Since in this work the training is done on sequences, holding out a full sequence and training on the other sequences is a LOOCV. However, this is a computationally expensive approach.
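The leave-one-sequence-out procedure can be sketched as follows; the callback names and the data layout (samples grouped by sequence name) are our own for illustration:

```python
def leave_one_sequence_out(samples_by_sequence, train_fn, eval_fn):
    """Leave-one-sequence-out cross-validation: hold out each sequence in
    turn, train on all remaining sequences, and collect the held-out
    evaluation score per sequence."""
    scores = {}
    for held_out, test_set in samples_by_sequence.items():
        train_set = [s for name, seq in samples_by_sequence.items()
                     if name != held_out for s in seq]
        model = train_fn(train_set)
        scores[held_out] = eval_fn(model, test_set)
    return scores
```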

## 3 Methodology

In this chapter, the basic methodology of this work is described. First, the data aggregation is described, which starts by collecting sequences from different sources, converting them into one format and encoding them by the VVC Test Model (VTM) for different encoding parameters. In the process of splitting the video sequences into tiles, the calculation of activity features is accomplished. The video tiles are processed frame-wise in parallel and for each frame, spatio-temporal activity features are calculated. After this collection of the ground-truth dataset, rate assignment models are explained, which are segmented into two groups, the bitrate prediction models, and the direct ratio assignment models. A Viewport-Dependent Streaming Simulator (VDSS) is designed and implemented to evaluate the results of the rate assignment models under different streaming scenarios. Finally, an optimization of the rate assignment models is executed, both by the regression model itself and by using the output of the VDSS.

Since a variety of rate assignment models are discussed in this work, the investigations are separated into four steps. Firstly, the models are benchmarked by their prediction performance using the VDSS in a standard mode. Secondly, if a model does not lead to reasonable results, it is discarded, as it offers no gain in the uniformity of quality. Thirdly, for a selection of well-performing models, the influence of different VDSS parameters is analyzed. Finally, the best performing models are evaluated in a real-world streaming simulation.

### 3.1 Aggregation of the Trial-Encodings Dataset

For the rate assignment models, ground-truth encoded trial-data is needed for training and for validation. In the context of video coding research and development, research groups have captured high-resolution omnidirectional videos of different natural scenes. These video sequences are provided in an uncompressed YUV format using the ERP projection. However, as this work tackles tiled video streaming on CMP projections and the sequences can differ in their framerate, resolution, and bit depth, they are first transformed into one uniform format and split into tiles. Subsequently, each tile must be encoded by the VTM.

The reference dataset consists of twenty-five omnidirectional video sequences from the JVET common test conditions dataset [34], sequences from Shanghai Jiao Tong University (SJTU) [55], and other proprietary sources, as listed in Table 2. The common test conditions (CTC) are a collection of encoding configurations defined by the JVET for typical application scenarios of VVC and used during the development phase of a video coding standard. These CTCs include a list of sequences to be used in the standardization process. Furthermore, the CTCs define the coding schemes, which in turn define the hierarchical GOP structure and the referencing setup.

All sequences are cut to a length of 300 frames and converted by the 360Lib software published by JVET [35] to CMP at 6k resolution, i.e. 4608x3072 pixels, for the high-resolution version and at 3k resolution, i.e. 2304x1536 pixels, for the low-resolution version. The resolution is halved because for higher resample factors the portion of high-resolution content in the bitstream grows and might exceed the capacity of the 4k decoder, while the low-resolution tiles become inadequate in quality. All sequences are converted to 30 frames per second and 8-bit depth. The sequences are then divided into 24 tiles, i.e. four square tiles for each of the six faces of the cubemap projection.

**Table 2**Sequences of the test-dataset

[Table not included in this excerpt]

Each tile is encoded by the VVC reference encoder VTM 4.2 [45] at commit 780cae85 using the RA and the LDB configuration of the common test conditions. The intra-period of LDB is set to 64. The frame size, quality parameter, frame type (I, B, P), and mean square error of each encoded frame are recorded at the encoder output. The encoding process is completed for different QP (1, 17, 22, 27, 32, 37, and 51) to cover a large range of possible QP, with more samples in the typical quality range of 360-degree video sequences (17 to 37). Each tile combination *s* amongst which a streaming client can choose includes 12 high-resolution and 12 low-resolution tiles, resulting in an overall resolution equivalent to 4k, i.e. 3456x1536, as depicted in Figure 2.10. For each *s*, the resulting bitrate is calculated as the sum of the spatial sizes of the high-resolution and low-resolution tiles, multiplied by the 30 frames per second. Four other sequences, namely *AerialCity*, *DrivingInCity*, *DrivingInCountry*, and *PoleVault*, were not considered, as they are only available in a relatively low resolution of 3840x1920 and therefore would have had to be upscaled to 8k resolution first. This, however, would introduce scaling influences on the features described below. Concluding, trial-encoding data for 25 equally split and encoded sequences is available for further analysis.
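The bitrate calculation for one tile combination can be sketched as follows. The sizes, tile counts, and the helper name `combination_bitrate` are hypothetical (only 2+2 tiles instead of 12+12 are shown for brevity):

```python
# Hypothetical per-tile encoded sizes in bytes per frame: frames x tiles.
hi_res_sizes = [[1200, 800], [1100, 900]]   # high-resolution tiles of s
lo_res_sizes = [[300, 250], [280, 260]]     # low-resolution tiles of s
fps = 30

def combination_bitrate(hi, lo, fps):
    """Bitrate of one combination s: mean frame size of all its tiles,
    multiplied by the frames per second, converted to bits per second."""
    n_frames = len(hi)
    total_bytes = sum(sum(f) for f in hi) + sum(sum(f) for f in lo)
    return total_bytes / n_frames * fps * 8

print(combination_bitrate(hi_res_sizes, lo_res_sizes, fps))  # → 610800.0
```

Repeating this over all combinations yields the per-combination bitrates from which the maximum bitrate combination can later be determined.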

### 3.2 Regression Features and Quality Measures

In general, it is beneficial for regression models when more individual features are available. For this reason, the set of the two spatio-temporal content complexity features defined in former work is extended by five more features. Furthermore, it is assumed that higher content complexity results in higher bitrates of the encoded video.

After the data is collected, features are examined to determine the content complexity of the tiles, as follows. As a starting point, the ITU-R recommendation BT.500 [56], which provides a framework for measuring the quality of television pictures, and the ITU-T recommendation P.910 [57] for multimedia applications are considered. Both describe subjective video quality assessment methods. In this work, the activity measures defined therein are used as spatio-temporal content complexity features.

To measure the complexity of each frame per tile in the picture domain, two spatio-temporal activity measures, namely the spatial perceptual information (SA) and the temporal perceptual information (TA), are extracted. For the *SA*, a Sobel filter is applied to each luminance plane *F*. The standard deviation over the filtered plane is then determined to compute the *SA*. The *TA* is measured using the difference of co-located pixel values of the luminance plane over two successive frames, followed by the computation of the standard deviation over space. The *TA* is defined as zero for the first frame of a sequence, as no previous frame exists and setting it as undefined would lead to an exclusion in the regression models. The measures are defined as follows:

[Equations not included in this excerpt]
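The two measures can be sketched in code along the lines of the SI/TI-style definitions of ITU-T P.910 referenced above; the exact filter and normalization details of this work are not reproduced, so treat this as an assumption-laden illustration:

```python
import numpy as np
from scipy import ndimage

def spatial_activity(frame):
    """SA per frame: std of the Sobel gradient magnitude of the luma plane."""
    sx = ndimage.sobel(frame.astype(float), axis=0)
    sy = ndimage.sobel(frame.astype(float), axis=1)
    return np.hypot(sx, sy).std()

def temporal_activity(frame, prev):
    """TA per frame: std of the luma difference to the previous frame,
    defined as zero for the first frame, as in the text."""
    if prev is None:
        return 0.0
    return (frame.astype(float) - prev.astype(float)).std()

rng = np.random.default_rng(0)
f0 = rng.integers(0, 256, (64, 64))
f1 = rng.integers(0, 256, (64, 64))
print(spatial_activity(f1), temporal_activity(f1, f0), temporal_activity(f0, None))
```

Applied per tile and per frame, these values form the frame-wise feature series that the later models average or use directly.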

To obtain additional prediction features for the bitrate adaptation model, a variety of spatio-temporal complexity measures are assessed, namely the standard deviation of the horizontal and vertical first derivative *(Dx, Dy)*, the entropy *(H)* of the picture, and the Hausdorff dimension *(HD)*, each of which is obtained from the luminance plane of the picture to calculate spatial activity. The derivatives are chosen in analogy to the spatial activity feature (SA). The *HD* feature is chosen based on a literature survey of spatio-temporal complexity features in different contexts. The features are defined as follows:

[Equations not included in this excerpt]

where *N(e)* is the number of hyper-cubes of dimension *E* and edge length *e* needed to cover the object, calculated by the box-counting algorithm [58].
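A minimal box-counting sketch for a binary 2-D mask illustrates the idea: count the occupied boxes *N(e)* for a series of box sizes *e* and fit the slope of log *N(e)* over log(1/*e*). The function name and the halving schedule of the box sizes are illustrative choices, not the algorithm of [58] verbatim:

```python
import numpy as np

def box_count_dimension(mask):
    """Estimate the box-counting dimension of a binary 2-D mask."""
    sizes, counts = [], []
    e = mask.shape[0] // 2
    while e >= 1:
        # Number of boxes of edge length e containing at least one set pixel.
        n = 0
        for i in range(0, mask.shape[0], e):
            for j in range(0, mask.shape[1], e):
                if mask[i:i + e, j:j + e].any():
                    n += 1
        sizes.append(e)
        counts.append(n)
        e //= 2
    # Slope of log N(e) over log(1/e) is the dimension estimate.
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope

# A completely filled square has dimension 2; the estimate matches that.
mask = np.ones((64, 64), dtype=bool)
print(box_count_dimension(mask))  # → 2.0 (up to floating-point error)
```

For a luminance plane, a binarization step (e.g. thresholding an edge map) would precede the count; that step is omitted here.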

The HaarPSI *(HA)* quality measure between two successive frames, averaged over time, is calculated as a temporal activity feature. Equivalent to the experiments in [20, 19], all variables are averaged across the time dimension to represent a video composed of a sequence of pictures. The intermediate results inside the braces, before the average over time is calculated, are saved for each frame.

When the bitrates are predicted, it might be beneficial to include the QP offset applied by the coding scheme into the regression features, to enable the model to learn the QP cascading of the hierarchical GOP structure. However, this offset QP must not be confused with the QP parameter. Therefore, the offset QP is denoted as encoder-QP and the QP set as an input parameter is denoted as input-QP. Furthermore, in the evaluation, the QP is also used as a measure for the uniform quality distribution over the picture. The input-QP can only be an integer, while the quality measure QP is a floating-point value because of its interpolated nature. The latter is simply denoted as QP.

In general, the output parameters of the encoder, i.e. the frame type, the encoder-QP, and the resample factor between the high-resolution and low-resolution tiles, can be considered as features for the regression models, as explained later. Concluding, a dataset of 25 sequences, split into tiles, is available. It consists of the frame-wise trial-encoding output data and seven features per frame. At this point, all data for this work has been extracted from the sequences, and the sequences do not have to be considered again.

[Figure not included in this excerpt]

**Figure 3.1** Interpolated bitrates over MSE for constant QP encodings of sequence *Trolley*; each color corresponds to a tile

In Figure 3.1 the interpolated bitrates of the tiles of an example sequence are depicted over the MSE. Each circle represents a QP in the range from 17 to 37, from left to right. The curves of the single tiles vary, indicating a variance in content complexity. To reach a certain MSE for different tiles of a video sequence, different bitrates are necessary. Furthermore, the same QP results in different bitrates, as the corresponding circles are located at different positions on the y-axis. To reach the same MSE for all tiles, different input-QPs are necessary, which might not directly correspond to the user's perception of quality variance.

For this reason, the quality measure used in this work is the QP-variation, as introduced by [20]. The QP-variation describes the QP at which the full picture would have to be encoded so that the tile combination with the highest total bitrate *smax* results in the target bitrate. Because of the continuity of the bitrate, the QP-variation is continuous as well, as opposed to the input parameter QP, which only takes integer steps.

When *variation of quality* is referred to in this work, the standard deviation of the continuous encoder output QP is meant, because of its interpretability. For a comparison of the variation of quality at different training-QPs, the Coefficient of Variation (CV), defined as the standard deviation divided by the mean, is used.
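The two measures reduce to a few lines of numpy; the QP values below are hypothetical examples, not measured data:

```python
import numpy as np

# Hypothetical continuous output-QP values of the tiles of one chunk.
qp = np.array([22.1, 22.9, 21.7, 23.2, 22.4, 22.0])

qp_std = qp.std()        # "variation of quality": std of the continuous QP
cv = qp_std / qp.mean()  # Coefficient of Variation, comparable across QP levels

print(qp_std, cv)
```

The CV normalizes the spread by the mean, which makes variation values at different training-QP operating points comparable.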

### 3.3 Rate Assignment Models

In this section, the different rate assignment methods analyzed in this work are discussed. In general, rate assignment is defined as the process of using content-complexity based features to assign each tile a share of the total normalized bitrate, which lies in the range from zero to one and whose sum over all tiles is one. For most models, a bitrate is predicted as an intermediate step. However, the bitrates are not passed to the encoder directly, since the maximum bitrate combination must be found first and each combination must be scaled so that the maximum bitrate tile combination *smax* does not violate the target bitrate. Therefore, scaling is included in the process anyway, and only ratios are derived from the intermediate bitrate prediction. The rate assignment is the resulting scaled bitrate portion for each tile.

Two rate assignment methods are discussed in this work, which differ in the intermediate variable they predict. Firstly, the bitrate regression models, which predict a bitrate as an intermediate step, for example by using the content complexity features for a regression. This is the approach proposed by [20], where the model predicts the bitrate of each tile for the full sequence from two content complexity features. Secondly, the direct ratio assignment models, which predict a ratio from features that are already ratios. This can be achieved by using the features directly in relation to each other or by building a regression model. The direct ratio assignment models are listed in 3.3.3; the other sub-sections explain the first type, the bitrate predictions. The ratios calculated by both methods are then scaled to the target bitrate and the maximum bitrate configuration *smax* is found. Afterward, the bitrates are scaled so that *smax* corresponds to the target bitrate. For a perfect prediction, the predicted bitrates are equal to the bitrates the encoder reports when encoding the dataset, i.e. the real size of the final bitstream.
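The ratio-and-scale step common to both methods can be sketched as follows. The function name `scale_to_target` and the toy combinations are hypothetical; real combinations contain 12 high-resolution plus 12 low-resolution tile indices:

```python
import numpy as np

def scale_to_target(pred_bitrates, combinations, target):
    """Turn per-tile bitrate predictions into ratios and scale them so the
    maximum-bitrate combination s_max exactly meets the target bitrate.
    `combinations` is a list of tile-index lists, one list per combination s."""
    pred = np.asarray(pred_bitrates, dtype=float)
    ratios = pred / pred.sum()                    # only the ratios matter
    totals = [ratios[idx].sum() for idx in combinations]
    s_max = int(np.argmax(totals))                # most expensive combination
    return ratios * (target / totals[s_max]), s_max

pred = [100.0, 300.0, 200.0, 400.0]               # predicted tile bitrates
combos = [[0, 1], [2, 3], [1, 3]]                 # toy tile combinations
scaled, s_max = scale_to_target(pred, combos, target=1_000_000.0)
print(s_max, scaled[combos[s_max]].sum())         # s_max sums to the target
```

Any constant prediction error cancels out in this step, which is exactly the robustness property discussed below.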

The training of the regression models uses a sequence-based LOOCV on the trial-encodings dataset. Thus, in each iteration, the sequence to be evaluated is removed from the learning process, until each sequence has been left out once. The resulting error measures are then averaged over the cross-validation.

In the following, several bitrate predictions are explained. All models are first compared by their prediction accuracy and discarded when their performance does not lead to any gain compared to the non-linear reference model. Models with reasonable performance are evaluated by the Viewport-Dependent Streaming Simulator.

When a bitrate is predicted by a bitrate regression model, it most likely induces a prediction error. Because of the bitrate scaling, it is assumed that this absolute error has no influence on the performance of the regression model in terms of uniformity of quality across the picture plane, since only the ratios of the predictions of the tiles to each other are important. In cases where the bitrate prediction has an absolute error, the ratios between tiles might still be accurate because of the scaling to the target bitrate. It shall be confirmed by measurements whether the bitrate prediction models are robust against prediction errors.

#### 3.3.1 Non-Linear Reference Model

As a baseline, besides the uniform quality and uniform rate models described in the fundamentals, the sequence-wise predicting reference model by [20] is used. It is a non-linear model with the formula:

[Equation not included in this excerpt]

A gradient descent method optimizes the parameters *a, b, c, d, e*, starting from the initial value *θ0* for each parameter. To implement this model in the context of the evaluation framework, *θ0* must be determined first to ensure convergence of the model. Because a non-linear model is used, it is possible that the applied gradient descent method does not converge. The *θ0* values, as reported by the authors of [20], are:

[Values not included in this excerpt]

The results of the implementation in this work were confirmed by the authors of the reference implementation.
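The fitting procedure with start values *θ0* can be sketched with `scipy.optimize.curve_fit`. The power-law form below is a hypothetical placeholder, since the actual formula of [20] is not reproduced in this excerpt; only the role of the start values is illustrated:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical non-linear model: bitrate as a power law of SA and TA.
def model(X, a, b, c):
    sa, ta = X
    return a * np.power(sa, b) * np.power(ta, c)

rng = np.random.default_rng(0)
sa = rng.uniform(10, 100, 50)
ta = rng.uniform(1, 30, 50)
y = 2.0 * sa**0.8 * ta**0.5       # synthetic, noiseless ground truth

theta_0 = (1.0, 1.0, 1.0)         # start values theta_0 to aid convergence
params, _ = curve_fit(model, (sa, ta), y, p0=theta_0)
residual = np.mean(np.abs(model((sa, ta), *params) - y) / y)
print(params, residual)           # fitted parameters and mean relative error
```

With an ill-chosen *θ0*, the least-squares iteration of such a non-linear model may diverge or land in a poor local minimum, which is exactly why the start values had to be obtained from the authors.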

#### 3.3.2 Length of the Training Sequence

The bitrate regression can be calculated either chunk-wise or frame-wise. As described, the reference model is a non-linear regression model that uses the features calculated over the full sequence to fit the model. This corresponds to a chunk-wise calculation when using shorter chunks instead of full sequences. For chunk-wise models, the average of the features is calculated by the given formulas, and the total bitrate, i.e. the sum of the frame sizes in a sequence, is predicted. A second method, however, is to calculate each feature per frame, as is done in the calculation process anyway. For a frame-wise model, the average is thus calculated over one frame, i.e. it is equal to the feature of this frame. Consequently, frame sizes are predicted instead of a bitrate. These frame sizes can nonetheless be scaled by the frames per second to receive bitrates. Because no information is averaged over time in the frame-wise calculation of the features, they might be more accurate in terms of bitrate prediction. A comparison of the two approaches is to be made.

#### 3.3.3 Ratio Features

The ratio of a feature to the sum of this feature over the frame is tested as an assignment model. This is derived from the fact that no absolute bitrate is assigned to the single tiles, but rather a ratio of each tile to the other tiles in terms of bitrate. Therefore, it is possible to derive ratios from the features, as the features are all calculated per tile and also have relations to the neighboring tiles in the same picture. The ratio is calculated by dividing the feature of the tile by the sum of the features over all tiles, i.e. the ratio of the tile to the full picture, defined by:

[Equation not included in this excerpt]

Of course, the sum of the measured features does not necessarily correspond to the feature measured on the full picture. This means that the ratios do not scale linearly with the size of the image, just as the bitrate does not. The latter remains in the same range as the tile-based features.
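The normalization described above is a one-liner; the SA values below are hypothetical examples for the tiles of one frame:

```python
import numpy as np

# Hypothetical SA values of four tiles of one frame.
sa_per_tile = np.array([40.0, 10.0, 30.0, 20.0])

# Ratio feature: each tile's value divided by the sum over all tiles.
sa_ratio = sa_per_tile / sa_per_tile.sum()
print(sa_ratio)  # sums to one by construction
```

The same division is applied per frame for every feature, so each ratio feature directly encodes a tile's share of the full picture.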

Ratio features can also be used as input features for a bitrate regression, which combines the ratio models with the bitrate models described here. In theory, the regression model can then learn not only the absolute bitrate values, which are relevant since the bitrate varies depending on the complexity of the tile itself; for more complex tiles at a higher quality and, therefore, a higher bitrate, the bitrate increases faster than for less complex tiles. It could also learn the ratio to neighboring tiles from the ratio features. However, the number of features doubles in this approach and creates a much bigger model.

#### 3.3.4 Direct Ratio Assignment

In this work, the term ratio is used for the relation of features to each other, and the term rate is used for the rate assignment. The direct ratio assignment methods are based on the correlation between the features and the ratio of the bitrates of the encoded tiles. Without first predicting a bitrate from the features, a ratio is calculated. No additional calculation step is required, and the available features per tile are used in their relation to the other tiles. Therefore, the processing complexity should be reduced, since for the direct ratio assignment methods no model must be trained. This approach serves as a calculation complexity baseline against more complex models, as its prediction complexity is similar to that of the uniform rate model explained in the basics. However, contrary to the uniform rate assignment, features must be calculated. In this work, besides the seven single-feature ratio models, combinations, as proposed by [19], are considered.

Besides the linear combinations, a regression model on ratio features is implemented. This is an advancement of the first model using only single ratio features. Still, the regression model predicts a ratio which is then assigned to the tiles. All seven features are given to the regression model. These models are referred to by the used regression model, i.e. Random Forest or Extremely Randomized Trees, and Ratio Features Only, i.e. the regression using only ratio features as input. For direct ratio regression assignments, it is not ensured that the sum of all predicted ratios is one after prediction, so they might have to be scaled to one. However, this is already done during the scaling to the target bitrate and is therefore not necessary immediately.

#### 3.3.5 Training of Multiple QP

In general, the frame-wise models are trained on one training-QP, which results in a dataset size of approximately 200000 samples. However, it is also considered to add the encoder-QP as an input feature, since the model could learn the influence of the hierarchical GOP structure, which corresponds to a normalization of the bitrates on the encoder-QP. Furthermore, the training of multiple quality parameters enables the model to be more flexible in the choice of its reference QP when adapting to a target bitrate. Whenever the target bitrate is set, a certain input-QP exists which is closest to the corresponding bitrate. Since the bitrates of the individual tiles are increasing differently for lower QPs, the model has to adapt to this fact.

Later in this work, the influence of reference-QPs, i.e. target bitrates, different from the training-QP shall be analyzed. In addition, the training of multiple qualities could enable a second model that makes an initial guess of the training-QP. The reference-QP, i.e. the target bitrate, can then be used depending on this initial training-QP, reducing the error induced by a wrong training-QP.

#### 3.3.6 Training of Frame-Subsets

Until now, training on every frame type has been considered for frame-wise models. As pointed out earlier, the hierarchical GOP structure leads to a huge difference in frame sizes. Additionally, I-frames without inter prediction are large by themselves and therefore make up a large proportion of the total bitrate. Setting the frame type as an input parameter could lead to an enhancement of the model. However, the frame type corresponds to the temporal layers and therefore to the training of multiple QP. Another method is to predict on subsets of frames, e.g. only on the I-frames. For this, fewer features must be calculated. Furthermore, the bitrate portion of the I-frames in the total bitrate of the uniform quality rate assignment is large. Therefore, predicting only on I-frames leads to less computational complexity, while still considering a large portion of the total bitrate. Three frame-subsets are analyzed:

- only I-frames are considered, to set a baseline for this approach.

- only I-frames and the following frame, since the second frame is also large in terms of bitrate in the evaluated RA encodings.

- only the first four frames of a chunk. This set is not necessarily aligned to the GOP structure and is used to show that arbitrary combinations are not beneficial.

The methods are denoted by the bitrate assignment model followed by *-I*, *-I+*, and *-4*, respectively. When a subset of frames leads to the same or comparable accuracy of the model, the calculation time is reduced without loss of accuracy. However, the model must be trained on all frames instead of the frame-subset, as the subset dataset is probably too small for the model to converge. Since the intra period for RA is set to 32 and for LDB to 64, as usual in video encoding, the dataset would shrink by this factor.

#### 3.3.7 Training of Multiple Resolutions

In the other models, only the high-resolution tiles are used to train the regression model, as adding both resolutions would lead to a highly correlated dataset. However, as both resolutions must be available in the bitrate calculation process, a scaling by the resample factor must be applied. To avoid this scaling, one way is to create two separate prediction models, one for the high-resolution and one for the low-resolution tiles. Another method is to train one model with the resample factor as an input parameter, similarly to including the encoder-QP as a feature. This multi-resolution model might benefit from the larger training set and can then be applied to more resample factors than the half-resolution resampling analyzed in this work.

#### 3.3.8 Prediction of Multiple Outputs

As described in the fundamentals, the RF and the ET can also be trained in a multi-output way. In this model, the multi-output regression tree approach is used to include correlations between tiles in the RF calculation. In general, a multi-output regression as described in the fundamentals is possible for the prediction of the bitrates per tile, where all tiles are predicted in parallel, which might increase the model accuracy through the training of correlations between the tiles in one picture. However, training on a set of tiles in parallel decreases the size of the available dataset by the number of tiles, i.e. 24. To enlarge the dataset again, a shuffling of the tiles can be used to ensure that enough samples are available.

Tiles are chosen at random out of all available tiles of all sequences and frames. Every 24 tiles are then defined as a new frame. This is done for 300 frames, which are set as a new sequence. For this new sequence, the total bitrate is calculated. Since there are 24 tiles and 25 sequences, there is a myriad (600 choose 24) of combinations from which a new dataset can be drawn. However, this leads to repeated training of the same tiles in different combinations. This model can be advanced by adding the ratio features as well. By this approach, the number of predictors is also multiplied by the number of tiles, which leads to a huge increase in computational complexity, because more trees are needed in the training process for the model to converge. A benefit is that no cross-validation on the training dataset is necessary, as the ratios trained on are very unlikely to be the same as those randomly drawn by the shuffle approach, or they can be excluded manually.
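The shuffle augmentation can be sketched as follows; the pool sizes match the 25 sequences x 300 frames x 24 tiles of the dataset, but the feature values and the function name `draw_synthetic_sequence` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tile pool: per-tile feature vectors and frame sizes,
# flattened over 25 sequences x 300 frames x 24 tiles.
n_tiles_pool, n_features = 25 * 300 * 24, 7
pool_features = rng.random((n_tiles_pool, n_features))
pool_sizes = rng.random(n_tiles_pool)

def draw_synthetic_sequence(n_frames=300, tiles_per_frame=24):
    """Draw random tiles from the pool and regroup every 24 of them into a
    new 'frame', forming one synthetic sequence for multi-output training."""
    idx = rng.choice(n_tiles_pool, size=n_frames * tiles_per_frame, replace=False)
    X = pool_features[idx].reshape(n_frames, tiles_per_frame * n_features)
    y = pool_sizes[idx].reshape(n_frames, tiles_per_frame)  # one target per tile
    return X, y

X, y = draw_synthetic_sequence()
print(X.shape, y.shape)  # → (300, 168) (300, 24)
```

Each synthetic frame yields one multi-output training sample with 24 targets, so the feature dimension grows by the tile count, as noted in the text.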

### 3.4 Rate Assignment Evaluation Framework

An evaluation framework, the Viewport-Dependent Streaming Simulator (VDSS), is designed and implemented as part of this work to evaluate rate assignments of tiled omnidirectional video sequences for a variety of streaming scenarios, defined, for example, by the sequence length, chunk length, or target bitrate. In the following, the structure of the simulator and the functions it provides are described. Afterward, the modifiable parameters are explained and two evaluation modes are defined.

#### 3.4.1 Structure of the Framework

The VDSS consists of two parts: the inner part, a reverse lookup, which calculates the QP based on interpolated trial-encodings, and the outer part, which executes the inner part for a variety of streaming scenarios, i.e. VDSS parameters. Initially, a reference dataset of trial-encodings in high resolution (6k) and low resolution (3k) is imported. This dataset is used as the training dataset for the rate assignment models and as the ground-truth dataset for the measurement of uniform quality after prediction.

The inner part calls a cross-validated regression model to receive a prediction of the uniform quality rate assignment for a certain set of frames, i.e. a chunk. This ratio is then scaled to a target bitrate, and the maximum bitrate tile configuration *s* is calculated. Afterward, the QP for each configuration is calculated using a pseudo-continuous relationship between the quality parameter and the bitrate obtained through interpolation. From these, the variation of quality and the average quality over the picture plane can be derived.

The outer part calls the above process for each chunk into which the provided sequence can be split, with chunk lengths ranging from one frame to the whole sequence; training on the whole sequence corresponds to a chunk length equal to the sequence length. It also loops over the following parameters:

- The trial-encoding datasets, RA and LDB,

- The rate assignment models,

- The reference-QP,

- The target bitrates,

- The resample factor.

The reverse lookup process is the center of the VDSS. Basically, it searches for the minimum absolute distance between the trial-encodings for a certain configuration *s* and the corresponding bitrates predicted by the rate assignment model. The index of this minimization is the resulting QP. Because of the reverse lookup, an error is made which consists of two parts: firstly, the interpolation error, as explained before, and secondly, the error caused by the rounding to discrete QP steps, as no continuous function is created. Before analyzing the results of the streaming simulation, the error induced by the interpolation is estimated.

To enable the reverse lookup, a continuous relationship between the input-QP and the resulting bitrate must be known. One method is to create a regression model for every tile; however, this leads to a huge computational load and inaccuracy. Therefore, an interpolation of the QP over the bitrate is considered adequate. The interpolation is adapted from the video coding analysis process of calculating rate-distortion curves, which are interpolated by a cubic or piecewise cubic algorithm, i.e. the Piecewise Cubic Hermite Interpolating Polynomial (PCHIP). After loading the reference dataset into the simulator, the bitrates of the high-resolution and the low-resolution tiles are interpolated with a user-set step size for the QP.
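The interpolation and the subsequent reverse lookup can be sketched with scipy's `PchipInterpolator`. The QP/bitrate points of one tile are hypothetical, as is the step size of 0.1:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical trial-encoding points of one tile: input-QP vs. bitrate.
qp_points = np.array([17, 22, 27, 32, 37], dtype=float)
bitrates = np.array([8.0e6, 4.0e6, 2.0e6, 1.0e6, 0.5e6])  # monotone decreasing

# PCHIP interpolation of bitrate over QP, sampled on a fine QP grid.
curve = PchipInterpolator(qp_points, bitrates)
qp_grid = np.arange(17.0, 37.01, 0.1)   # user-set step size of 0.1
rate_grid = curve(qp_grid)

def reverse_lookup(target_bitrate):
    """Pseudo-continuous QP whose interpolated bitrate is closest to target."""
    return qp_grid[np.argmin(np.abs(rate_grid - target_bitrate))]

print(reverse_lookup(3.0e6))  # a QP between 22 and 27
```

PCHIP is shape-preserving, so the monotone decrease of the bitrate over QP is retained, and the nearest-grid-point lookup introduces only the QP-step rounding error discussed above.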

Each viewport, which can be defined pixel-wise, corresponds to the tile that has the largest number of pixels inside that viewport. This tile is considered the main tile of the tile configuration *s*. Depending on this main tile, the surrounding tiles with the smallest angle from the center of the main tile are defined as high-resolution tiles, the others as low-resolution tiles. For a 24-tile partitioning of the cubemap faces, Table 3 lists the high-resolution tiles for a certain main tile. The configurations are therefore labeled by their main tile, where the tiles are counted row-wise in picture coordinates, i.e. first in x-direction and then in y-direction.

As explained in the fundamentals, there exist different tile combinations *s* with different total bitrates. The maximum bitrate combination *smax* is the configuration *s* for which the sum over all tiles, i.e. low-resolution and high-resolution, is the largest. In general, the target bitrate is the bitrate to which the predicted ratios are scaled and which shall not exceed the available bandwidth. However, depending on the content complexity of a picture, the quality which results in the same bitrate can vary immensely. For very low-complexity content it is even possible that the highest possible quality, i.e. an input-QP equal to one, results in a lower bitrate than the lowest possible quality for a very complex sequence.

Because the regression models are trained on a certain training-QP, chosen as QP equal to 22 in this work, a prediction error in terms of uniformity of quality is introduced whenever a bitrate, and therefore implicitly an input-QP, is chosen which does not correspond to the training-QP. This error occurs because, depending on the content complexity of the tiles, the bitrate required to encode the video at a uniform quality does not increase equally across tiles with different content complexity. This is also the case if the target bitrate is so low that the lowest quality of all tiles already exceeds the target.

However, the complexity of single tiles of one video can vary to such an extent that for a certain target bitrate some tiles are already out of the possible QP range, i.e. a uniform quality cannot be reached with this target bitrate. Furthermore, since the bitrate predictions still induce a variation of quality, the target bitrate must be chosen in a way that the QP range is not violated by outliers. One approach to solve this problem is to choose the target bitrate in such a way that the corresponding input-QP is equal to the training-QP, which is, if not noted differently, 22 in this work. This target is adapted for every chunk in the calculation process.

**Table 3** Number of the high-resolution tiles for a tile configuration *s*, main tile bold

[Table not included in this excerpt]

A second method is to train the rate assignment model again for each target bitrate operation point. This is the usual approach for a streaming scenario, as the speed-up of not retraining the model might not be as relevant for a streaming provider compared to the computational power necessary for the video encoding itself. However, it cannot be known which training-QP corresponds to an arbitrary target bitrate.

A third method is to include the encoder-QP in the training model and train the model for each target bitrate. One way to get the right training-QP would be to predict all QPs for a tile and then minimize the target training-QP over all tiles. Alternatively, an initial training-QP guess based on the content complexity features can be made, i.e. by a second regression model. This initial training-QP guess can then be given as an input parameter to the bitrate regression model, which then only has to predict within the training-QP range defined by the variance of the second regression model. Both can be used in the final streaming scenario, where it is expected that training the model for each target bitrate leads to more accurate results and can still be feasible as long as there is a limited number of target bitrates to serve.

In the calculation process of the VDSS, after the rate assignments are calculated, they must be scaled in such a way that *smax* corresponds to the target bitrate. However, at that point *smax* is not yet known. Therefore, an approximate scaling must be made first, before the bitrates for the high-resolution tiles and the low-resolution tiles can be calculated using the resample factor and *smax* can be found. After *smax* is determined, the bitrates can be scaled to the real target bitrate. If the ratios are not scaled to the target bitrate before *smax* is calculated, the relative differences between the combinations deviate largely from the real differences. With this procedure, the error should be smaller than the error which occurs when the ratios are not scaled at all. This assumption is made in this work.

The rate assignment model predicts the bitrate of the high-resolution sequences. Therefore, in a second step, these bitrates must be scaled by the resample factor to obtain the low-resolution bitrates. The resample factor is the ratio between the bitrates of the high-resolution and the low-resolution tiles. A better prediction of the resample factor should lead to a more accurate representation of the low-resolution tiles and a lower QP-variation. Derived from the resolutions, if the bitrate increased in a linear fashion, the resample factor would be equivalent to 37.5 percent, i.e. 2.70. In [20] a resample factor of 2.28 was derived for HEVC. Consequently, the bitrate decreases by approximately 40 percent for half the resolution, i.e. for four times fewer pixels.

To measure the uniformity and the average quality of the prediction models, including the references uniform quality and uniform rate, output measures must be defined. As explained in 2.2.2, the QP standard deviation is used as the uniformity measure. As intermediate results, the VDSS outputs these quality measures in a vector over all tiles and sequences for each evaluated VDSS parameter. When the predicted rate assignment results in a total bitrate lower than the target bitrate, the overall QP and therefore the variation of the QP decrease as well; however, such a model is not considered a better rate assignment. Besides the QP-variation, the average QP is calculated. The effective bitrate and the difference to the target bitrate are therefore output by the VDSS as well.
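The two output measures can be computed with the standard library; the tile QPs below are made-up example values:

```python
import statistics

tile_qps = [22.1, 22.8, 21.9, 23.4, 22.0, 22.6]   # QPs of one chunk's tiles

qp_std = statistics.stdev(tile_qps)   # uniformity measure: lower is more uniform
qp_avg = statistics.mean(tile_qps)    # cost measure: the average quality level
print(round(qp_std, 2), round(qp_avg, 2))  # → 0.58 22.47
```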

Summarizing, a bitrate is estimated by a bitrate regression model for each tile. These bitrates are scaled in such a way that their sum corresponds to the target bitrate. This bitrate is assigned to a constant bitrate encoder. Since no such encoder is available, a QP-bitrate curve is interpolated from trial-encodings for each tile, and the QP corresponding to the assigned bitrate is derived for each tile. Finally, the output quality of each tile is analyzed in comparison to the neighboring tiles. The goal is a uniform quality for all tiles.
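The QP lookup from the interpolated curve can be sketched as follows; a piecewise-linear inverse interpolation stands in for the PCHIP interpolation used later, and the trial-encoding points are invented:

```python
def qp_for_bitrate(trial, bitrate):
    """Return the QP at which the interpolated QP-bitrate curve of one tile
    meets the assigned bitrate (linear stand-in for PCHIP)."""
    pts = sorted(trial)                        # bitrate falls as QP rises
    for (q0, b0), (q1, b1) in zip(pts, pts[1:]):
        if b1 <= bitrate <= b0:
            t = (bitrate - b0) / (b1 - b0)     # inverse linear interpolation
            return round(q0 + t * (q1 - q0), 2)
    raise ValueError("assigned bitrate outside the trial-encoding range")

trial = [(22, 8000.0), (27, 4500.0), (32, 2400.0), (37, 1300.0)]
print(qp_for_bitrate(trial, 6250.0))  # → 24.5
```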

For efficiency reasons, the implementation of the VDSS is constructed such that expensive calculations, i.e. the training of the RF models, are not repeated unnecessarily often. The structure of the VDSS is given by the pseudocode in Listing 1:

Figure not included in this reading sample

**Listing 1** Pseudocode of the Viewport-Dependent Streaming Simulator
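The listing itself is not reproduced in this excerpt; based on the description above, the loop structure can be sketched roughly as follows (a reconstruction from the surrounding text, not the original listing):

```
load trial-encodings, features and ground-truth bitrates of the dataset
for each cross-validation fold:
    train the rate assignment model on the training sequences   // expensive, done once per fold
    for each held-out sequence:
        for each chunk:
            predict per-tile bitrates with the trained model
            scale the prediction so that s_max meets the target bitrate
            derive the low-resolution bitrates via the resample factor
            look up the QP per tile from the interpolated QP-bitrate curve
output QP standard deviation, average QP and effective bitrate
per tile, sequence and evaluated parameter combination
```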

#### 3.4.2 Parameters to Configure the Framework

The parameters of the VDSS are explained in the following. Parameters can be set in the configuration. Later, modes are defined as combinations of the following parameters:

- Dataset

- Resample Factor

- Target Bitrate

- Chunk Length

- Rate Assignment Model

The *dataset* parameter describes the set of sequences loaded into the VDSS for training and analysis purposes. The datasets of RA encodings and of LDB encodings from the test dataset are used. Furthermore, concatenated sequences from these sets are used later for the streaming scenario described in 3.4.3.

For reference, as real encodings for the high resolution and the low resolution are available, the ground-truth resample factor is set as a baseline. For any approximation of the resample factor, the error introduced into the uniform quality measure is analyzed. First, an error measurement between the ground-truth resample factor and the constant resample factor is made. The constant resample factor is considered as the average of all resample factors across the trial-encodings, as introduced by [20]. As all of these use the same resample factor without taking the content complexity of the video content into account, they build the class of constant resample factors. Then the *resample factor* is predicted by a regression model and its performance is compared to the reference approach. Instead of using a constant resample factor, the ground-truth resample factor of the trial-encodings is used to train a regression model with the same features as the rate assignment models, but for high resolution and low resolution. As mentioned in 3.3.7, the rate assignment model can also use the resample factor as an input parameter. From this, the low-resolution and the high-resolution tiles are both predicted.
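A small numeric sketch of the baseline (the bitrates are invented): the ground-truth resample factor is the per-tile ratio of high- to low-resolution bitrate, and the constant resample factor is its average over the trial-encodings:

```python
hi = [5200.0, 3100.0, 4400.0]   # high-resolution tile bitrates (made up)
lo = [2300.0, 1350.0, 1900.0]   # low-resolution tile bitrates (made up)

ground_truth = [h / l for h, l in zip(hi, lo)]    # per-tile resample factors
constant = sum(ground_truth) / len(ground_truth)  # constant-factor baseline
print(round(constant, 2))  # → 2.29
```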

In general, the VDSS provides two methods for setting the *target bitrates* for calculating the uniform quality measurements. One method uses a variable bitrate corresponding to the initial training-QP as target bitrate, which is set for each chunk individually, as the uniform quality rate assignment changes for each tile. Additionally, a QP-offset can be set, and a range of target bitrates in a predefined step size can be used for that task. The rate assignment model does not have to be trained again and different target bitrates are evaluated for one model. The input-QP necessary to reach the target bitrate differs from the training-QP the model is trained on; therefore, an error is introduced into the uniformity of quality for the predicted tiles. It is thus possible to analyze a trend in the uniformity of quality as the distance of the input-QP corresponding to the target bitrate from the initial training-QP grows.

In a streaming scenario, however, all sequences which a service provider offers should be encoded such that they are available at equal bitrate gradations. Therefore, the VDSS includes a switch to calculate the average bitrate for a certain QP or all QP among all tiles and set this as a fixed target bitrate for all sequences. In the previous method, in contrast, the bitrate for each chunk was adjusted so that the uniform quality corresponds to the reference-QP. When the VDSS uses the uniform quality model for a uniform target bitrate, it still uses the input-QP equal to the training-QP for evaluation, and an error is introduced. The analysis of this error is not in the scope of this work, as it is comparable to the variable bitrate case.

In a streaming scenario, the video is not sent at once but is split into chunks, as explained in 2.2.2. Starting from the idea that smaller *chunk lengths* should lead to a more accurate prediction for a shorter video, because less variation of the features over the whole video is averaged, a chunk length variation is implemented in the VDSS. The training on

- chunk lengths smaller than one sequence,

- chunk lengths larger than one sequence,

- chunk lengths equal to one sequence

is investigated. The results of multiple chunks that fit into one sequence are averaged for a comparison with other chunk lengths. To investigate the influence of different chunk lengths on the prediction in a scenario where the chunk length does not necessarily align with scene cuts, the measurements are repeated for long sequences, as described in 3.4.2.

For the reference model, there are two chunk implementations. In the first, the chunk length used for training is kept at the full sequence length of 300 frames and only the chunk length for prediction is adapted. In the second, the training chunk length is equal to the chunk length on which the model will later predict. Both are evaluated in this work. For the frame-wise models, both are equal, as the training chunk length is already at the minimum possible value of one.

For the sake of completeness, it should be mentioned that the *rate assignment model* itself can be selected as a parameter in the VDSS. It is also possible to evaluate several models in parallel.

#### 3.4.3 Modes as Preset Parameter Combinations

For the VDSS, two settings or modes are defined. These are used to investigate the rate assignment models. However, as the rate assignment model is a parameter of the VDSS which is under investigation in this work, the following modes do not set the rate assignment model.

A *standard mode* is defined for the first analysis in the VDSS. The VDSS is executed with:

- **Dataset:** Random-Access (RA), 24 tiles,

- **Chunk length:** None, 300 frame long sequences,

- **Target bitrate:** One fixed target bitrate equal to the bitrate under the training-QP,

- **Resample factor:** Average of the resample factors of all tiles.

The dataset is chosen as RA, as it uses the random-access coding scheme which is usually chosen in the streaming scenario. The chunk length corresponds to the full length of the sequence. The resample factor is chosen as the average of all tiles for compatibility with the reference model. This mode is, for example, used for the optimization process and the initial investigation of the rate assignment model performance.

In a real-world *streaming scenario*, a service provider usually streams the same content at different bitrates to the users. It is assumed that with a near-infinite number of users, each sequence in each quality is viewed equally often, and each viewport and therefore each tile configuration s is viewed equally often as well.

While the assumption about the bitrate distribution for an infinite number of users might be derived from the central limit theorem, the head movement is presumably not equally distributed, as it is based on the video content, where the front face might be viewed more often. When a head movement prediction model is available, a prioritization of the tile configurations is easy to implement by scaling the occurrences in the output vector of the VDSS accordingly. However, this is beyond the scope of this work. Consequently, the average over the tile configurations s is taken. The simulation of a video stream of multiple hours would require a long calculation time. The VDSS makes parallel processing of a long sequence possible through a concatenation of shorter sequences.

An important detail of this process is that, for the sequence length of 300 frames and typical chunk lengths, the last chunk of a sequence spans two successive sequences. To process sequences in parallel, it must be ensured that at the end of each parallel process the end of a sequence and the end of a chunk meet. To create such long sequences, the sequences are concatenated into groups of eight, because for the chunk lengths of 32, 96, 160, 480, 800 and 2400 this leads to the required alignment. As long as the chunk lengths are aligned to the sequence length and to the length of an arbitrary concatenated sequence, these long sequences can be processed in parallel and the results can be evaluated together and averaged over time.
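The alignment argument can be verified quickly:

```python
# Eight concatenated sequences of 300 frames form one parallel work unit.
group_frames = 8 * 300   # 2400 frames per concatenated sequence
chunk_lengths = [32, 96, 160, 480, 800, 2400]

# Every listed chunk length divides 2400, so a chunk boundary always
# coincides with the end of a concatenated sequence.
assert all(group_frames % c == 0 for c in chunk_lengths)
```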

By concatenating eight sequences, all eight must be excluded from the training process. Therefore, one block of eight sequences is held out during training, resulting in a 17-out-of-25 cross-validation. A total of 50 concatenated sequences is created, an arbitrarily high number, which results in a total of 120000 frames, i.e. 4000 seconds or 66 minutes and 40 seconds of video streaming. The sequences are combined in a way that each sequence is used equally often, i.e. sixteen times, in the final bitstream. Concluding, the configuration is:

- **Dataset:** Random-Access (RA), 24 tiles,

- **Chunk length:** Equal to the random-access period,

- **Target bitrate:** One fixed target for all sequences, the average of all sequences,

- **Resample factor:** Average of the resample factors of all tiles.

### 3.5 Regression Model Optimization

Both the feature selection and the hyperparameters of the RF model are optimized using the internal out-of-bag predictor importance of the RF regressor and the output of the Viewport-Dependent Streaming Simulator in the standard mode. The standard mode was defined with 300 frames per sequence.

#### 3.5.1 Feature Selection

To avoid highly correlated predictors, redundant features are removed. Two different feature selections are compared: a correlation-based feature selection and an exhaustive approach based on the VDSS. Both are used to reduce the number of predictors individually. The correlation-based approach is executed on an OOB basis and is fast to run. For the exhaustive approach, the VDSS needs to be executed 127 times for seven features. If the optimization on an OOB basis leads to the same results as the exhaustive approach, new features can be investigated more quickly in future work.

For the OOB-based approach, the correlation between the features must be estimated, as highly correlated features might decrease the model performance, as described in 2.3. To avoid iterating over all 127 combinations of the seven features, each set of correlated features forms a subset of the total feature set. All subsets of correlated features are then examined iteratively as follows. While keeping the other subsets untouched, every combination of features within a subset is removed from the training once, up to removing the entire subset. For all subsets, the cross-validated OOB error is determined, and in each iteration the feature subset with the least OOB error is assigned to the final set of features. Since this procedure is performed on an OOB level, the VDSS does not have to be used.
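The per-subset search can be sketched as follows; the feature names, the toy error function, and `best_after_removal` are illustrative stand-ins for the cross-validated OOB error of the RF:

```python
from itertools import combinations

def best_after_removal(group, all_feats, oob_error):
    """Try removing every combination of features from one correlated
    subset (including removing none or all of them) and keep the feature
    set with the lowest error."""
    removals = [set()]
    for r in range(1, len(group) + 1):
        removals += [set(c) for c in combinations(sorted(group), r)]
    return min((all_feats - rem for rem in removals), key=oob_error)

all_feats = {'SA', 'TA', 'Dx', 'Dy'}
group = {'SA', 'Dx', 'Dy'}            # assumed correlated subset
# toy error: three features are ideal and 'Dx' is assumed redundant
oob = lambda feats: abs(len(feats) - 3) + (1 if 'Dx' in feats else 0)
print(sorted(best_after_removal(group, all_feats, oob)))  # → ['Dy', 'SA', 'TA']
```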

The exhaustive approach, in contrast, uses the VDSS. All 127 feature combinations are tested and the feature combination with the lowest variation in quality is chosen. This allows analyzing whether the output of the VDSS is correlated to the out-of-bag predictor performance. Going through all combinations is called exhaustive, since all combinations of the seven predictors without repetition are calculated once and no other combinations are possible. The VDSS configuration used for this task is the RA dataset with 300 frames length and fixed-median-QP turned off.
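With seven features, the exhaustive search enumerates every non-empty feature combination once (2^7 − 1 = 127 VDSS runs); the selection call is only indicated, since it needs the full simulator:

```python
from itertools import combinations

features = ['SA', 'TA', 'H', 'HD', 'HA', 'Dx', 'Dy']   # the seven features
subsets = [c for r in range(1, len(features) + 1)
           for c in combinations(features, r)]
print(len(subsets))  # → 127

# hypothetical selection step, one VDSS run per subset:
# best = min(subsets, key=lambda s: vdss_qp_variation(s))
```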

#### 3.5.2 Hyperparameter Optimization

The hyperparameters of the RF are optimized empirically using the out-of-bag error for a first estimation. Then its expressiveness is compared to an optimization approach using the VDSS. In contrast to the feature selection, which only leads to 448 possible combinations, the hyperparameters span a greater range. The hyperparameter "number of leaves" can go from one to the number of data points. To limit the range for the optimization, the upper bound is estimated by an initial search up to the point where the model starts to diverge. The number of features per split is limited to a range from one to the number of features. The number of trees must be chosen high enough, approximated as ten times the number of features, and is therefore set to 75.

Because of the limited resources, Bayes Optimization is used to search for the best combinations of hyperparameters. For this purpose, a wrapper function is implemented around the VDSS which calls the VDSS for different hyperparameters of the RF. The minimum number of leaves and the number of features per split are then optimized using the cross-validated out-of-bag error as a measure. Using the Bayes Optimization algorithm, the optimum is expected to be found within a few iteration steps. Before the actual production phase, several iterations of an initial search over a larger hyperparameter range are performed to gain insight into the performance of the optimization.

## 4 Viewport-Dependent Streaming Simulation

In this chapter, the results of the investigations carried out are presented and evaluated, structured along interim results. First, insights into the dataset, including the quality measures, are given. Then the VDSS is investigated concerning its internal error. The now known error is used for the analysis of the rate assignment models. Well-performing models are optimized concerning their hyperparameters. These models are then evaluated by a variation of the VDSS parameters and investigated in a streaming scenario. All results are based on the standard mode of the VDSS, as defined in 3.4.3, or are derived from the standard mode when a parameter is explicitly changed. The RA encoding scheme is used if not specified otherwise. First, the results of the feature and quality measure investigation are presented to give an overview of their behavior.

### 4.1 Analysis of the Features

Figure 4.1 shows measurements of the *SA* and the *TA* of the omnidirectional video sequence set. The dots represent measurements for each tile of a sequence, each sequence is depicted in a unique color, and the squares represent the average over the tiles of each sequence. As can be seen, the *TA* and the *SA* measurements vary over a large range and show both low and high variation around the average of the sequence. It can be derived that the spatial and the temporal activity vary within the sequence set, which can be confirmed subjectively by looking at the video sequences.

Figure not included in this reading sample

**Figure 4.1** Scatter plot of *SA* and *TA* for the sequences of the test dataset; each color corresponds to a sequence, each dot corresponds to a tile, and each square corresponds to the mean of the tiles per sequence

Figure 4.2 displays the correlation chart of the features and the bitrate *(BR)* reported by the encoder for the high-resolution RA encodings on a sequence basis, i.e. the average is calculated over the 300 frames. As can be seen in Figure 4.2, the *TA* and the *HA* measures have the highest correlation with the bitrate of the sequence. Furthermore, a strong correlation between *SA*, *Dx*, and *Dy*, as well as between *TA* and *HA*, is visible. While *H* is correlated to all features, *HD* is not correlated to any other feature.

Figure not included in this reading sample

**Figure 4.2** Heatmap of the features and the output (bitrate, *BR*)

Figure not included in this reading sample

**Figure 4.3** Normalized features *SA*, *TA* and scaled bitrate *(BR)* over the frames of the second tile of the sequence *Trolley*. Screenshots are taken at the frames marked with an arrow.

Figure 4.3 depicts the course over time of the normalized *SA*, the normalized *TA*, and the scaled bitrate *(BR)* for the second tile of the sequence *Trolley*, chosen for illustrative purposes. The bitrate was scaled to compensate for the hierarchical GOP structure. An overall correspondence in the trends of the features can be seen, albeit with different amplitudes. The other features behave similarly. Concluding, the features are correlated to the bitrate over time and can be used for a prediction; however, the hierarchical GOP structure must be equalized.

### 4.2 Induced Error of the Evaluation Framework

At first, the error which is induced by simulating instead of encoding all video sequences is determined. The interpolation error is analyzed for the sequence *Trolley*, which was chosen empirically. For that purpose, the sequence is encoded from input-QP 17 to 37 in integer steps. The PCHIP interpolation leads to the lowest average and maximum relative bitrate error with 0.022 and 0.086, respectively. A spline interpolation leads to 0.034 and 0.112, and the cubic interpolation to 0.036 and 0.139, respectively.

The relative error between the real encodings and the interpolation for the 24 tiles is depicted in Figure 4.4. As can be seen, for the most part the interpolation error is below the average, while its maximum lies at an input-QP of 23. However, because this interpolation error is made for the test as well as the training dataset, only a deviation from the VTM encodings must be considered, but no error is induced into the simulation results of the VDSS. The interpolations are made per dataset; therefore, they do not have to correspond to the VTM encodings, as future encoders can lead to different bitrates as well and the model is trained for each dataset. The interpolation error must be accepted, since no constant bitrate encoders are available for VVC at the moment. When constant bitrate encoders become available in the future, the model performance can be evaluated in detail.

Further analysis has shown that the QP interpolation to two decimal places results in a negligibly small error at reasonable performance. A tradeoff between speed and accuracy is apparent: the lower the interpolation step size, the smaller the interpolation error. However, already integer QP steps lead to an interpolation error, measured in the uniformity of quality, in the same range as the variance of the non-deterministic regression. Therefore, a step size of 0.01, i.e. two decimal places, is chosen for the following analysis, as the error is then considered negligible.

Figure not included in this reading sample

**Figure 4.4** PCHIP relative interpolation error of the sequence *Trolley* for constant QP encodings; each color corresponds to a tile

### 4.3 Analysis of the Rate Assignment Models

The models introduced in 3.3 are evaluated with the help of the Viewport-Dependent Streaming Simulator in its standard mode, i.e. 300 frames, one chunk, and a variable target bitrate. The models are compared to the uniform rate and the non-linear regression model and benchmarked against the uniform quality assignment model for a variation of the VDSS parameters, i.e. streaming scenarios. At first, the RF model with frame-wise training and prediction is evaluated. The number of trees was set to 300, the minimum leaf size per split to 50, and the number of parameters to the number of available features, as these values were empirically shown to be feasible initial values. The target bitrate is adapted for each chunk to the equivalent of uniform quality, as reached by the reference-QP of this chunk.

#### 4.3.1 Prediction Accuracy

Figure 4.5 shows the normalized tile bitrate of the *Library* sequence according to the RF model, the non-linear regression model, the uniform rate, and the uniform quality distribution. The sequence was chosen for illustrative purposes. As shown in Figure 4.5, the RF model adapts best to the uniform distribution of quality in terms of the normalized bitrate, relative to the uniform rate model and the non-linear regression model.

Figure not included in this reading sample

**Figure 4.5** Exemplary distribution of bitrate over the tiles of the *Library* sequence

All bitrate assignment models are better than a uniform rate assignment. The RF model performs best in terms of bitrate prediction accuracy. However, it has not yet been shown that a more precise prediction leads to better uniformity of quality.

#### 4.3.2 Quality Distribution

The main objective of the proposed RF model is to minimize the QP-variation across the picture plane and to obtain a bitrate distribution close to that of the uniform quality approach. Figure 4.6 displays the corresponding QP-variation averaged across sequences and configurations, derived as results of the LOOCV. The figure shows that the RF model leads to lower values than the uniform rate model and the non-linear regression model in terms of the deviation of the QP over the tiles of each sequence.

In comparison, the cost evaluation is shown in Figure 4.7, where the corresponding average QP across sequences and tiles is recorded. It is evident from the diagram that the RF model results in a lower deviation from the average QP of the uniform quality model, and thus from the target bitrate, than the non-linear regression model. In conclusion, the RF model provides a more uniform distribution of quality.

**Figure 4.8** Histogram of the QP distribution for 300 frame sequences

#### 4.3.3 Network Utilization

In Figure 4.9, the normalized bitrate error between the target bitrate and the effective bitrate of the rate assignment models, as reached by the maximum bitrate tile combination *smax*, is depicted for the sequences of the test dataset. The uniform quality model leads to an accurate assignment of the bitrate. Apparently, the error induced by a less accurate rate assignment in combination with a scaling of this rate assignment to the target bitrate leads to an error of less than 0.06 percent for all models. Therefore, the target bitrate is considered not relevantly exceeded by the scaling method. The target bitrate is met for all configurations smaller than *smax*.

Figure not included in this reading sample

**Figure 4.9** Normalized bitrate error between the target bitrate and the effective bitrate of the rate assignment models as reached by the maximum bitrate tile combination *smax*

In comparison, the normalized error between the target bitrate and the bitrate defined by the assignment models as reached by the minimum bitrate tile combination *smin* is depicted in Figure 4.10. It can be seen that the difference between the utilized bitrate and the target bitrate lies between 8 and 33 percent of the target bitrate. The uniform rate model leads to the same utilization for *smin* as for *smax*, as all tiles receive the same bitrate portions.

Figure not included in this reading sample

**Figure 4.10** Normalized error between the target bitrate and the bitrate of the assignment models as reached by the minimum bitrate tile combination *smin*

Certainly, this underutilization could be bypassed by encoding each tile for each combination s. However, this would change the application case: for the video-on-demand streaming scenario, the storage capacity would be multiplied by the number of tiles, and the computation complexity and therefore the costs for the streaming provider would multiply as well. Concluding, the scaling of the rate assignment defined by the rate assignment models is feasible for the application case, as it does not violate the target bitrate.

#### 4.3.4 Model Overview

The rate assignment models described in 3.3 are evaluated in the following. Table 4 lists the rate assignment models for the Random-Access (RA) and the Low-Delay B (LDB) encodings. The rate assignment models are separated into three groups: the reference models, including the direct ratio assignment as a performance reference, the RF models, and the ET models. RF and ET models with the same name correspond in their calculation, except for the regression algorithm. For each model, the standard deviation (SD) and the average (AVG) of the QP are listed. The color represents the deviation from the uniform quality model, where green stands for high accuracy and red for low accuracy; the intermediate colors are interpolated.

**Table 4** Model overview of the VDSS standard mode. The best non-uniform model per column is set in bold. The color represents the deviation from the uniform quality model in QP steps: green for high accuracy, red for low accuracy.

Figure not included in this reading sample

The best values, i.e. low for SD and close to 22 for AVG, are marked bold. In selective experiments, the ratio regression and multi-output models were not able to learn the relation between the features and the bitrate and therefore performed worse than the uniform rate model in all cases. The multi-resolution model using the resample factor as an input was able to learn the data but performed worse than the reference model. Training two different models for the two resolutions leads to a higher variance in quality, as the bitrates predicted for the low resolution and the high resolution no longer correlate; therefore, the rate assignment differs largely for each resolution and the absolute bitrate error cannot be relativized. These models are, therefore, not explicitly listed with QP results in the following.

The best performing direct ratio assignment model is the one with *Dy* as a feature, which is therefore the only one listed in the table. The direct ratio assignment with the *Dx* feature performs equally well. Other ratio feature models do not perform well, some worse than the uniform rate model, and are therefore not listed. The *Dy* direct ratio assignment reaches a QP-variation that is 0.7 lower at a 0.4 lower average QP than the non-linear regression model and therefore already performs better. The result of Lottermann [19] that the logarithmic relation between the bitrate and the *SA* and *TA* features leads to a better prediction of the bitrate cannot be confirmed.

In general, it can be seen that the ET models perform better than the RF models, as this area of the table is greener. The basic frame-wise ET model already reaches a 0.3 lower QP-variation at the same average QP. For the chunk-wise models, the chunk length is set to the full sequence length of 300, so the fixed and the variable training mode are equal. The chunk-wise ET model is also 0.1 lower in QP-variation at the same average QP than its RF counterpart but leads to an overall higher QP-variation than the frame-wise ET. Thus, the frame-wise models perform better than the chunk-wise models. The non-linear regression model was not tested frame-wise, as no initial values for which the model converges could be found.

The multi-quality models lead to a similar QP-variation at a lower average QP and therefore do not perform better. This may be because the input-QP does not relate equally to the quality of content of different complexity. The QP could be normalized by dividing the bitrate by the QP and a scale factor, but since the bitrates vary significantly for one QP, a simple scaling factor without integrating the complexity of the picture, i.e. the complexity features, cannot be achieved. However, adding the QP to the model leads to a strong correlation between the QP and the features. The complexity features are presumably correlated to the input-QP feature of the model, which influences the RF more than the ET, as the RF makes biased decisions towards correlated features at each split, whereas the ET decides randomly.

Apparently, adding the ratio features to a model leads to a lower variation in quality but decreases the average QP, relativizing this gain considering the more than doubled computation complexity. The three tested frame-subset RF models perform worse than the direct ratio assignment model, as do the first-four-frames (ET-4) models. The models using the I-frames and the successive frame (ET-I+ and RF-I+) perform equally to the multi-quality models, while only using a fraction of the available frames, i.e. the I-frames, for the prediction. A model that comprises the frames with the largest bitrates in the corresponding GOP therefore performs well.

The best tradeoff between the deviation of the average quality from the average of the uniform quality and the variation of quality is reached by the chunk-wise ET and RF models and the frame-wise ET for the random-access encodings. For the low-delay encodings, only the *Dy* direct ratio assignment performs better than the uniform rate assignment. All models overshoot the target bitrate, as their average quality exceeds the average quality of the uniform quality assignment. It can be derived that the data could not be learned well.

For the following work, only the ET models are considered, as they are largely equal to the RF models at better performance. For the frame-subset models, the ET-I+ model is also analyzed, because its rate assignment can already be calculated after the features of only two frames are accessed, without losing much prediction performance. However, for the OOB-to-VDSS optimization comparison, the RF is used, since the ET does not use bagging and therefore provides no OOB measures.

However, the results of the RF and ET regressions are non-deterministic and depend on the hyperparameters of the models. Additionally, the ET-I+ model with multi-quality features was tested. It resulted in a QP standard deviation of 2.34 and an average QP of 21.17 and therefore did not perform better than the other models.

#### 4.3.5 Non-Deterministic Variance

Since the training of the RF models and the ET models is non-deterministic, the prediction performance varies randomly for each cross-validated training, and the values stated above are considered random samples. The standard deviation of the QP-variation of the VDSS results of the ET model is 0.015, while the corresponding standard deviation of the mean QP is 0.010. The range of the values was 2.37 to 2.44. Any influence on the VDSS results which is lower than this deviation, e.g. the interpolation error, can therefore be neglected.

#### 4.3.6 Model Optimization

This chapter lists the results of the hyperparameter optimization by the out-of-bag approach and the out-of-bag feature selection. These methods are then compared to the results obtained by the Bayesian Optimization for the hyperparameter optimization and the extensive feature selection, both using the VDSS.

[Figure not included in this excerpt]

**Figure 4.11** Cross-validated OOB error in bit for different minimum leaf sizes over the sequences

First, the hyperparameters are investigated by both methods. As Figure 4.11 depicts, the mean square error between the predicted bitrate and the ground-truth bitrate is not reduced significantly for larger tree sizes. Figure 4.11 also depicts the mean square prediction error for different minimum leaf sizes, i.e. the minimum number of observations per terminal node of each tree. In general, a greater minimum leaf size results in faster convergence of the model. The figure indicates that a minimum leaf size between 50 and 200 is sufficient. For low computation costs, the number of trees could also be reduced to 15.
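The OOB-based sweep can be reproduced in a few lines: a bootstrap-aggregated forest predicts each training sample only with the trees that did not see it, so no separate validation set is needed. A hedged sketch with scikit-learn on synthetic data (the candidate leaf sizes and the data are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(600, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, 600)

oob_mse = {}
for min_ls in (1, 20, 50, 200):                  # candidate minimum leaf sizes (minLS)
    rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=min_ls,
                               oob_score=True, random_state=0)
    rf.fit(X, y)
    # OOB prediction: each sample is predicted only by trees it was not drawn for
    oob_mse[min_ls] = float(np.mean((rf.oob_prediction_ - y) ** 2))

for min_ls, mse in sorted(oob_mse.items()):
    print(f"minLS={min_ls:3d}  OOB MSE={mse:.4f}")
```

The same loop can be extended to a grid over *numPTS* (`max_features` in scikit-learn) to mimic the two-dimensional search discussed below.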

The hyperparameters of the ET are optimized by the Bayesian Optimization. In Figure 4.12, the Bayesian Optimization for the RF model of the QP is depicted. The hyperparameters are the minimum leaf size (*minLS*) and the number of features to sample (*numPTS*). The latter is the maximum number of features available at each split. Blue dots mark the evaluation points and the red plane marks the estimated optimization function.

[Figure not included in this excerpt]

In general, Figure 4.12 shows that the highest possible *numPTS* results in the lowest error (green arrow) for all *minLS*. As prediction performance increases with lower *minLS*, the model becomes less accurate for greater values of *minLS*, presumably because of underfitting. However, for minimum leaf sizes below approximately 20, the model performs worse on the dataset and the error increases, i.e. it starts to overfit. For a low-complexity calculation, *minLS* can be chosen high, i.e. 500, without losing much accuracy, as long as *numPTS* is set to the number of available features.

For the ET model, a *minLS* of less than 20 is feasible. The ET model does not start to overfit and generalizes well on the given dataset. Considering the noise of the non-deterministic output, the QP-variation does not decrease significantly below 20. The lowest measured standard deviation of QP is 2.327. Including all seven features leads to a minimum QP-variance of 2.333. The optimized ET model leads to the same results as the ET-I and ET-I+ models with a QP-variation of 2.292.

Figure 4.13 shows the Bayesian Optimization for the low-delay dataset. As can be seen, the results vary strongly between neighboring evaluation points and even for the same evaluation point. Thus, the regression model apparently cannot learn the data well and the optimization scheme cannot find an optimal hyperparameter combination.

[Figure not included in this excerpt]

Evidently, the optimization of the hyperparameters with the VDSS leads to results equal to the out-of-bag optimization of the RF. A Bayesian Optimization can, indeed, also be used on the OOB error. While the OOB optimization takes approximately the time of one training run in the available implementations, the VDSS needs less time than a single training run while providing the quality distribution as output. Compared to the time needed for video encodings, the VDSS leads to better results in a reasonable time.

For the feature selection, the OOB feature importance of the RF model is investigated. The features *Dx*, *Dy*, and *SA*, as well as *HA* and *TA*, turn out to be highly correlated, as Figure 4.2 suggests. *H* and *HD* are not correlated to other features. The iterative optimization method from 3.5.1 is implemented to obtain a set of less correlated predictors for the RF. Beginning with the complete set of features, each variation focuses on a series of correlated features, e.g. beginning with the first set of *Dx*, *Dy*, and *SA*, then *HA* and *TA*, and finally *H* and *HD*. The feature configuration with the minimum calculated OOB error is: *Dy*, *HA*, *HD*, *SA*. In general, the HaarPSI feature (*HA*) has a major influence on the predictive value, as shown in Figure 4.14.
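A sketch of this importance-driven selection, using scikit-learn's impurity-based importances as a stand-in for the OOB predictor importance: fit the forest, then drop the less important member of every strongly correlated feature pair. The feature names mirror the thesis symbols, but the data and the 0.95 correlation threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

names = ["Dx", "Dy", "SA", "TA", "HA", "H", "HD"]
rng = np.random.default_rng(2)
base = rng.uniform(size=(500, 1))
X = np.hstack([base + rng.normal(0, 0.02, (500, 1)),   # Dx
               base + rng.normal(0, 0.02, (500, 1)),   # Dy (correlated with Dx)
               base + rng.normal(0, 0.05, (500, 1)),   # SA (correlated as well)
               rng.uniform(size=(500, 4))])            # TA, HA, H, HD (independent here)
y = 2 * X[:, 1] + X[:, 4] + rng.normal(0, 0.05, 500)   # target driven by Dy and HA

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
imp = dict(zip(names, rf.feature_importances_))

# for each pair with |correlation| > 0.95, discard the less important feature
corr = np.corrcoef(X, rowvar=False)
drop = set()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.95:
            drop.add(names[j] if imp[names[j]] <= imp[names[i]] else names[i])
keep = [n for n in names if n not in drop]
print("kept features:", keep)
```

On this toy data, the uncorrelated features always survive and exactly one member of the correlated triple is kept, mirroring the idea of the iterative method from 3.5.1.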

[Figure not included in this excerpt]

Table 5 lists the VDSS results of the feature selection above. The frame-wise and the chunk-wise models decrease the variation of quality minimally. On the contrary, the variation of the frame-subset models decreases significantly, although these models still do not perform well overall. The OOB feature importance varies from sequence to sequence, so that features relevant for some sequences might be discarded with this approach. Another drawback of this OOB-based optimization is the same as for the OOB hyperparameter optimization: the calculation of the OOB error takes as long as training the model.

**Table 5** Random Forest with the four best features of the OOB optimization

[Table not included in this excerpt]

Table 6 lists the quality variation for the VDSS-optimized RF and ET models with all features and with the best-performing feature selection. For the ET, no decrease in the variation of QP is apparent, while the deviation from the uniform quality QP decreased. Concluding, the ET model works well with all features and predicts the mean square error of the uniform quality. The same is true for the ET models on frame-subsets. The RF performs worse with the features *H* and *HD* included; removing them leads to the overall best-performing RF model. The feature *TA* can also be removed while increasing the error only minimally. The removal of other combinations of features leads to a higher variation in quality.

**Table 6** VDSS results for ET and RF feature selection

[Table not included in this excerpt]

Concluding, the extensive feature selection by the VDSS leads to better results than the feature selection based on the OOB feature importance. Furthermore, the ET performs better in general, while not providing the possibility of an OOB optimization. The lowest overall QP-variation was measured for the ET, ET-I and ET-I+ models as 2.29.

### 4.4 Variation of the Evaluation Parameters

In this section, the results of changing the three main parameters, i.e. the resample factor, the target bitrate, and the chunk length, are listed. Derived from the insights of 4.3.6, the best-performing ET model was used, besides the non-linear regression model, the uniform rate, and the uniform quality assignment.

#### 4.4.1 Resample Factor

First, the variation of the resample factor was examined. Figure 4.15 depicts the boxplot of the resample factor as the ratio of the high-resolution and low-resolution RA encodings over all sequences, all tiles, and all frames. As can be seen, the resample factor ranges between 1.5 and 3.5 if the outliers are omitted. The outliers, marked in red, lie in a range from 3.5 to 5.8. The variance is calculated as 0.611 and the median is 2.3.
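The boxplot statistics behind Figure 4.15 follow the usual quartile convention; a minimal, self-contained computation on made-up per-tile resample factors (the values are illustrative, not measurements from this work):

```python
import statistics

ratios = [2.1, 2.3, 1.8, 2.9, 3.1, 2.2, 2.4, 1.6, 3.4, 2.3, 5.2, 2.0]  # hypothetical

q1, q2, q3 = statistics.quantiles(ratios, n=4)   # quartiles (exclusive method)
iqr = q3 - q1                                    # interquartile range, the box height
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # whisker fences
outliers = [r for r in ratios if not lo <= r <= hi]

print(f"median={q2:.2f}, IQR={iqr:.2f}, outliers={outliers}")
```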

The resample factor was trained on the features by a cross-validated ET model, both for high-resolution and low-resolution, with a total of 2000 trees. The relative error was 45 percent (0.4578), indicating that the model could not learn the resample factor from the picture complexity features. The average predicted resample factor was 2.477, while the relative error of using the mean as an estimate is 44 percent (0.4381). A prediction with this error does not lead to an improvement in the uniformity of quality, as the QP-variation increases on the trial dataset.

Furthermore, a model with the resolution as a feature, i.e. multi-resolution regression, was trained. However, because the prediction of the high-resolution and the low-resolution tiles differs significantly, the resample factor changes and the QP-variation increases. Adding the resolution factor in a categorical (low, high) and a scalar manner into the regression model does not lead to an improvement in QP-variation.

However, a prediction of the optimized ET model using the average resample factor already leads to a low QP-variation, as the optimal resample factor is only 0.02 lower for the same model without retraining. The general influence of the resample factor is therefore marginal. The statistical variance of the non-deterministic RF prediction model, however, is in the same range as the benefit of a perfect resample prediction. A prediction of the resample factor is therefore considered fine-tuning of the model.

[Figure not included in this excerpt]

**Figure 4.15** Boxplot of the resample factor

Furthermore, the low QP-variance induced by the estimated resample factor allows investigating other resample factors without re-encoding the video sequences. An estimation can be made by scaling the trial-encoding low-resolution bitrates properly. The content complexity features of the high-resolution tiles are all linearly correlated with a coefficient above 0.99 to their low-resolution equivalents, except *SA*, which is correlated with a coefficient of 0.97. Therefore, the features can be considered constant over the resample factor.

#### 4.4.2 Target Bitrate

First, the influence of changing the target bitrate without retraining the model was investigated. Then the fixed median target bitrate, as used in the streaming scenario mode of the VDSS, was used. The uniform quality model is defined at a certain training-QP, i.e. 22. Figure 4.16 depicts the variation in quality, measured by QP, when choosing a target bitrate corresponding to a different QP than the QP on which the model was trained. On the very left and very right of the x-axis, the values are considered inaccurate, as only the QPs in the range of 17 to 37 are densely sampled. The values on the y-axis are given by the Coefficient of Variation (CV, see 3.2) of quality, making the variation of quality comparable over different average qualities.
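The CV is simply the standard deviation normalized by the mean, which makes quality spreads comparable across different average qualities; a minimal computation on hypothetical per-tile QP values:

```python
import statistics

def coefficient_of_variation(values):
    """CV = population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

qps = [22, 24, 21, 23, 22, 25]          # hypothetical per-tile QPs
print(round(coefficient_of_variation(qps), 4))
```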

[Figure not included in this excerpt]

**Figure 4.16** Coefficient of Variation (CV) of QP over training-QP

As can be seen in Figure 4.16, the uniform quality bitrate model leads to a rate assignment without variation in quality for the training-QP of 22. However, when operating at a different target bitrate, a variation of quality is induced. In general, the performance of the different models becomes apparent. Nevertheless, in a training-QP range from approximately 15 to 30, the variation of quality does not change considerably, becoming at most as large as the error made by the uniform quality model. The models can therefore be used in a certain range around the trained reference bitrate without making a greater error.

Table 7 lists the quality results using the same fixed median bitrate for all sequences of the random-access dataset. A QP-variation of 1.61 is induced when using the uniform quality model for the different target bitrate. For this dataset, the average QP is reduced by 4.5 to 17.5. However, the differences in QP for the ET model and the ET model using only the I-frames (ET-I) are 0.15 and 0.10, respectively; the scaling therefore compensates for the different training-QP. Training the model for each target bitrate can thus be traded off against a QP-variation of 0.1.
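The scaling used to compensate the training-QP can be sketched as a common rescaling of the predicted per-tile rates so that their sum meets the new target bitrate. The linear rescaling and the example rates are an illustrative simplification, not the exact procedure of this work:

```python
def rescale_to_target(tile_rates, target_bitrate):
    """Scale all per-tile rates by a common factor so they sum to the target."""
    factor = target_bitrate / sum(tile_rates)
    return [r * factor for r in tile_rates]

predicted = [300.0, 500.0, 200.0]        # hypothetical per-tile rates, kbit/s
scaled = rescale_to_target(predicted, 1500.0)
print(scaled, sum(scaled))               # the scaled rates sum to the target bitrate
```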

**Table 7** VDSS results for a fixed target bitrate

[Table not included in this excerpt]

Furthermore, a model was trained on the content complexity features to predict the input-QP necessary for a certain target bitrate. This model leads to an average error of two in input-QP. It can therefore be used to find the initial QP necessary for a certain bitrate in the range where the ET models perform equally well.

#### 4.4.3 Chunk Length

The results of varying the chunk length to lengths other than the full sequence length of 300 frames are given as follows. Table 8 lists the VDSS results with a chunk length of 32 frames, i.e. equal to the RAP for the random-access dataset. The uniform quality model is not influenced by a shorter chunk length. Likewise, all other models, except the ET-I+ model, are not influenced by a shorter chunk length. On the contrary, this modified ET model performs better, as the deviation of the average quality decreases significantly, without a change in the variation of quality.

**Table 8** VDSS results for different chunk lengths

[Table not included in this excerpt]

The investigation of the training chunk length, i.e. equal to the sequence length or to the chunk length, leads to the following results. While no difference in the variation of QP is apparent for the chunk-wise ET model when training on the sequence length or on the chunk length (both 2.44), the non-linear regression model performs worse when adapting the training for each chunk (3.70).

In order to also investigate the influence of different chunk lengths on the prediction in a scenario where the chunk length does not necessarily align with scene cuts, the above measurements were repeated for the concatenated one-hour sequence. In Figure 4.17, the standard deviation of the QP over different chunk lengths is depicted. As can be seen, the chunk-wise non-linear reference model produces a higher variation of quality for lower chunk lengths. The lowest non-uniform variation of quality is achieved by the ET model, which is lower than that of the RF model, and in total lower than one for smaller chunk lengths. The QP-variation of the non-linear model and of the uniform rate model increases for shorter chunk lengths.

In Figure 4.18, the average QP over different chunk lengths is depicted. As can be seen, the error made by the uniform rate model gets smaller when the uniform rate assignment is averaged over a long period instead of short periods. For the non-linear reference model, the RF, and the ET model, a chunk length below 160 frames is beneficial. The RF and ET models perform equally well, and better than the reference model.

Abbildung in dieser Leseprobe nicht enthalten

**Figure 4.18** VDSS results of long sequences, calculated by the standard streaming mode, average QP

Additionally, the above plots are depicted zoomed in for a comparison of the ET model with the ET-I+ model. The ET model performs better than the ET-I+ model for every chunk length longer than the RAP, as shown in Figure 4.19.

[Figure not included in this excerpt]

**Figure 4.19** VDSS results of long sequences, calculated by the standard streaming mode, SD of QP

[Figure not included in this excerpt]

**Figure 4.20** VDSS results of long sequences, calculated by the standard streaming mode, average of QP

Yet, the overall lowest QP-variation of 0.7 is achieved by the ET-I model, while reaching an equal average QP of 21.8, as shown in Figure 4.20. The figure also shows that the ET-I model does not perform equally to the ET model for chunk lengths other than 32. This can be derived from the fact that for other chunk lengths the number of I-frames in a chunk may differ.

In Figure 4.21, the histograms with 75 bins of the QP distributions for the uniform rate, the non-linear reference model, and the ET model are depicted for a 2400-frame chunk length, together with the histogram of the ET model for a chunk length of 32 frames. For all of them, a Student's t distribution curve is fitted on top.

Apparently, the non-linear reference model leads to a more uniformly distributed quality, i.e. QP, around the training-QP than the uniform rate model. Still, the ET model leads to a much narrower distribution. In a comparison of the chunk lengths, the ET model with a chunk length of 32 frames leads to a similar distribution, but with lighter tails. Therefore, the ET model on the chunk length equal to the RAP is to be preferred. Further investigations have shown that prediction below the RAP leads to great inaccuracies.

#### 4.4.4 Streaming Scenario Dataset

For the combined long-term simulation, the concatenated sequences were used with a fixed target bitrate for all chunks, defined by the median of the bitrate corresponding to the training-QP, as defined by the streaming scenario mode in 3.4.3. As Figure 4.22 shows, the average QP of the prediction decreases with shorter chunk lengths. The average QP for the non-fixed bitrate over the whole sequence was 17.42.

As can be seen in Figure 4.23, the ET model leads to the lowest deviation from the uniform quality model. The ET-I model leads to a marginally higher deviation from the uniform quality average. The depicted uniform quality model does not lead to a constant quality over different chunk lengths, as the non-optimal reference-QP is used. Additionally, the ET and the ET-I models trained on the content complexity metrics and the encoder-QP are depicted. While these models led to a low deviation of average quality in the standard VDSS mode, both lead to a higher but equal deviation of average quality, higher than that of the non-linear reference model.

[Figure not included in this excerpt]

**Figure 4.22** VDSS results of the streaming scenario mode, average QP

[Figure not included in this excerpt]

**Figure 4.23** VDSS results of the streaming scenario mode, SD of QP

Concerning the uniformity of quality over the picture plane, the ET model performs best for longer chunk lengths, as in the standard VDSS mode. At a chunk length of 32, the ET model and the ET-I model lead to a lower QP-variation than the uniform quality model using an inaccurate reference-QP. The ET and the ET-I model with additional QP, however, lead to a higher variance in quality, although lower than that of the non-linear reference model.

Figure 4.24 depicts the histograms of the resulting QP distribution for the long-term simulation with a bar for each integer QP. For the chunk length of 2400, the histogram of the non-linear reference (P) model has no visual similarity with the uniform quality (UQ) model. On the contrary, the ET model has a similar shape. However, for the UQ and the ET model, the distribution is widely spread. For the chunk length of 32, the P model shows a tendency to partly result in low QP. The ET model and the UQ model look largely identical.

In conclusion, the prediction with a chunk length equal to the RAP leads to a more central distribution of the resulting quality over the whole video than a chunk length of 2400. Furthermore, the ET model does not perform worse than the UQ model, which was derived from a certain reference-QP unequal to the partition necessary for a uniform rate assignment.

## 5 Conclusion and Prospect

This work tackled the problem of uniform quality assignment in tiled omnidirectional video.

A variety of spatio-temporal content complexity features was assessed and analyzed. These features were integrated into a new regression model, which was shown to reduce the variation in quality over the picture plane for omnidirectional video streaming.

Various rate assignment models were designed and analyzed by a Viewport-Dependent Streaming Simulator (VDSS) implemented in this work. The error induced by the VDSS was analyzed and its approximation was considered suitable for further evaluations. The output QP used in related work was also adopted as a quality measure.

While some models performed weakly on the created dataset of 25 trial-encodings, the Random Forest (RF) and Extremely Randomized Trees (ET) regression algorithms with frame-wise training decreased the variance of quality by 60 percent compared to the reference model of former works and by 79 percent compared to a uniform rate assignment. For the best results, the hyperparameters of the regression model were optimized on the uniformity of quality as output by the VDSS. Integrating the hierarchical-GOP structure into the regression model led to a reduction of the generalization error. Furthermore, training on all available frames and predicting only on I-frames and additional frames that are high in the hierarchical-GOP structure leads to equal or better uniformity of quality while reducing the prediction complexity significantly.

The calculation approach of scaling rate assignments to the desired target bitrate was shown not to violate this objective. The minimum bitrate tile configuration, on the other hand, needs up to 33 percent less than the target bitrate, even for the uniform quality assignment. Thus, up to this degree, the network capacity is underutilized, and a better scaling method might be tackled in future work.

It was shown that the random-access encodings are easier to predict than the low-delay encodings, presumably because of the more complicated inter-frame reference structure of the latter. Further analysis showed that the prediction on a subset of pictures leads to a reasonable uniformity of quality while decreasing the number of frames for which the calculation of content complexity features is necessary.

The prediction of the resample factor, i.e. the ratio between the high-resolution and low-resolution tiles, was shown to be non-trivial. However, a precise prediction has only minimal influence on the uniformity of quality.

It was shown as well that the reduction of the chunk length leads to a similar uniformity of quality compared to the prediction of full sequences. Therefore, shorter chunk lengths equal to the random-access period of the video stream can be preferred, as they enable faster parallel computation of the rate assignment with less memory consumption.

Finally, the available video sequence dataset was concatenated so that a one-hour sequence was available for a viewport-dependent streaming simulation by the implemented framework. The simulation confirmed the aforementioned findings for chunk lengths not aligned to scene cuts. The best-performing model was the ET regressor trained frame-wise on the whole dataset, predicting on the I-frames and the successive frame to compensate for the hierarchical GOP structure of the random-access encodings.

Future work might extend the model to more streaming scenarios with more strongly varying content. The usage of a subset of frames to predict the rate assignment could be extended to include only the relevant pictures of the prediction chain in the hierarchical GOP structure. Alternatively, the model could be trained on the dataset encoded in All-Intra mode at one QP and then used for prediction on the Random-Access dataset. The rate assignment model might be advanced by training a neural network on the full pictures instead of extracted features.

Furthermore, the rate assignment models could be trained on the VDSS output instead of the bitrates. More generally, a frame-drop approach could be examined, where the subjective effect of skipping single frames in the non-visible tiles is compared to a resolution or bitrate reduction. Additionally, better quality measures for the uniformity of quality could be investigated. For this, the Haar Wavelet-Based Perceptual Similarity Index (HaarPSI) could be considered through the generalization of Haar wavelets on the sphere.

Concluding, this work investigated the rate assignment of tiled omnidirectional video streaming, and a model was proposed which outperforms the state-of-the-art assignment models by 59 percent in the common uniformity-of-quality measures. Furthermore, an evaluation framework was developed which makes the encoding of video sequences unnecessary for the evaluation of rate assignment models. Thanks to this framework, meta parameters for tiled streaming rate assignment models were proposed.

## List of Abbreviations

[List not included in this excerpt]

## List of Symbols

[List not included in this excerpt]

## List of Figures

Figure 2.1 Timeline of the Video Coding History

Figure 2.2 Hybrid Video Encoder Scheme, Source: Wiegand, Fraunhofer HHI

Figure 2.3 Two sample GOPs of length 8 with temporal layers

Figure 2.4 CTUs, partitioned into three tiles and six slices, Source: [21]

Figure 2.5 2d sampling coordinate definition, Source: [35]

Figure 2.6 3d XYZ system, Source: [35]

Figure 2.7 Viewport generation with rectilinear projection, Source: [35]

Figure 2.8 Coordinates definition for CMP, Source: [35]

Figure 2.9 Frame packing for CMP, Source: [35]

Figure 2.10 From two resolutions to projected frame, Source: derived from Podborski, Fraunhofer HHI

Figure 2.11 Overfit and underfit decision trees under different hyperparameters, Source: derived from [47]

Figure 2.12 Bias and variance, ground-truth corresponds to the center of a circle, Source: derived from [46]

Figure 3.1 Interpolated bitrates over MSE for constant QP encodings of sequence *Trolley*, each color corresponds to a tile

Figure 4.1 Scatter plot of *SA* and *TA* for the sequences of the test dataset, each color corresponds to a sequence, each dot corresponds to a tile, and each square corresponds to the mean of the tiles per sequence

Figure 4.2 Heatmap of the features and the output (bitrate, *BR*)

Figure 4.3 Normalized features *SA*, *TA* and scaled bitrate (*BR*) over frames for the second tile of sequence *Trolley*. Screenshots are taken at the frames marked with an arrow

Figure 4.4 PCHIP relative interpolation error of sequence *Trolley*, for constant QP encodings, each color corresponds to a tile

Figure 4.5 Exemplary distribution of bitrate over the *Library* sequence tiles

Figure 4.6 Average QP-variation over sequences and configurations

Figure 4.7 Average QP over sequences and configurations

Figure 4.8 Histogram of QP distribution for 300-frame sequences

Figure 4.9 Normalized bitrate error between the target bitrate and the effective bitrate by the rate assignment models as reached by the maximum bitrate tile combination *smax*

Figure 4.10 Normalized error between the target bitrate and the bitrate by the assignment models as reached by the minimum bitrate tile combination *smin*

Figure 4.11 Cross-validated OOB error in bit for different minimum leaf sizes over the sequences

Figure 4.12 Bayesian Optimization of the feature-selected RF model on SD of continuous QP, minimum at 2.36, Random-Access dataset

Figure 4.13 Bayesian Optimization on SD of continuous QP, Low-Delay B dataset

Figure 4.14 Cross-validated average OOB predictor importance over the sequences

Figure 4.15 Boxplot of the resample factor

Figure 4.16 Coefficient of Variation (CV) of QP over training-QP

Figure 4.17 VDSS results of long sequences, calculated by the standard streaming mode, SD of QP

Figure 4.18 VDSS results of long sequences, calculated by the standard streaming mode, average QP

Figure 4.19 VDSS results of long sequences, calculated by the standard streaming mode, SD of QP

Figure 4.20 VDSS results of long sequences, calculated by the standard streaming mode, average of QP

Figure 4.21 Histograms (blue) with 75 bins of the ET, non-linear reference (P), and uniform rate (UR) model for 2400 frames and for 32 frames for the ET model, Student's t distribution fitted on top (red)

Figure 4.22 VDSS results of the streaming scenario mode, average QP

Figure 4.23 VDSS results of the streaming scenario mode, SD of QP

Figure 4.24 Histograms (blue) with one bin for each integer QP of the ET, non-linear reference (P), and uniform quality (UQ) model for 2400 frames and for 32

## List of Tables

Table 1 Face indexes of CMP and 2d-3d-conversion, Source: derived from [35]

Table 2 Sequences of the test-dataset

Table 3 Number of the high-resolution tiles for a tile configuration s, main tile bold

Table 4 Model Overview of VDSS standard mode. The best non-uniform model per column is bold. The color represents the deviation from the uniform quality model in QP steps. Green for high accuracy, red for low accuracy

Table 5 Random Forest with the four best features of the OOB optimization

Table 6 VDSS results for ET and RF feature selection

Table 7 VDSS results for a fixed target bitrate

Table 8 VDSS results for different chunk lengths

## List of Listings

Listing 1 Pseudocode of the Viewport-Dependent Streaming Simulator

## List of References

[1] Sandvine, "The Global Internet Phenomena Report," 2018.

[2] Cisco Newsroom, "Estimation," 11 2018. [Online]. Available: https://newsroom.cisco.com/press-release- content?type=webcontent&articleld=1955935.

[3] Sandvine, "The Mobile Internet Phenomena Report," 2019.

[4] Oculus, "Rift," [Online]. Available: https://www.oculus.com/rift.

[5] VR-IF, "Guidelines 2.0," 25 06 2019. [Online]. Available: https://www.vr-if.org/wp- content/uploads/VRIF_Guidelines2.0.pdf.

[6] R. Skupin, Y. Sanchez, C. Hellge and T. Schierl, "Tile based HEVC video for head mounted displays," in*2016 IEEE International Symposium on Multimedia (ISM),*2016.

[7] K. K. Sreedhar, A. Aminlou, M. M. Hannuksela and M. Gabbouj, "Viewport-adaptive encoding and streaming of 360-degree video for virtual reality applications," in*2016 IEEE International Symposium on Multimedia (ISM),*2016.

[8] A. Mavlankar, P. Agrawal, D. Pang, S. Halawa, N.-M. Cheung and B. Girod, "An interactive region-of-interest video streaming system for online lecture viewing," in*2010 18th International Packet Video Workshop,*2010.

[9] R. Van Brandenburg, O. Niamut, M. Prins and H. Stokking, "Spatial segmentation for immersive media delivery," in*2011 15th International Conference on Intelligence in Next Generation Networks,*2011.

[10] C. Ozcinar, A. De Abreu, S. Knorr and A. Smolic, "Estimation of optimal encoding ladders for tiled 360 VR video in adaptive streaming systems," in*2017 IEEE International Symposium on Multimedia (ISM),*2017.

[11] M. Xiao, C. Zhou, Y. Liu and S. Chen, "Optile: Toward optimal tiling in 360-degree video streaming," in*Proceedings ofthe 25th ACM international conference on Multimedia,*2017.

[12] J. Le Feuvre and C. Concolato, "Tiled-based adaptive streaming using MPEG-DASH," in*Proceedings ofthe 7th International Conference on Multimedia Systems,*2016.

[13] X. Corbillon, A. Devlic, G. Simon and J. Chakareski, "Optimal set of 360-degree videos for viewport-adaptive streaming," in*Proceedings ofthe 25th ACM international conference on Multimedia,*2017.

[14] L. Xie, Z. Xu, Y. Ban, X. Zhang and Z. Guo, "360probdash: Improving qoe of 360 video streaming using tile-based http adaptive streaming," in*Proceedings ofthe 25th ACM international conference on Multimedia,*2017.

[15] M. Graf, C. Timmererand C. Mueller, "Towards bandwidth efficient adaptive streaming of omnidirectional video over http: Design, implementation, and evaluation," in*Proceedings of the 8th ACM on Multimedia Systems Conference,*2017.

[16] D. Podborski, J. Son, G. S. Bhullar, C. Hellge and T. Schierl, "HTML5 MSE Playback of MPEG 360 VR Tiled Streaming,"*arXiv preprint arXiv:1903.02971,*2019.

[17] Y. Sanchez, R. Skupin and T. Schierl, "Compressed domain video processing for tile based panoramic streaming using HEVC," in*2015 IEEE International Conference on Image Processing (ICIP),*2015.

[18] A. Zare, A. Aminlou, M. M. Hannuksela and M. Gabbouj, "HEVC-compliant tile-based streaming of panoramic video for virtual reality applications," in*Proceedings of the 24th ACM international conference on Multimedia,*2016.

[19] C. Lottermann, A. Machado, D. Schroeder, Y. Peng and E. Steinbach, "Bit rate estimation for H.264/AVC video encoding based on temporal and spatial activities," in*2014 IEEE International Conference on Image Processing (ICIP),*2014.

[20] R. Skupin, Y. Sanchez, L. Jiao, C. Hellge and T. Schierl, "Tile-Based Rate Assignment for 360-Degree Video Based on Spatio-Temporal Activity Metrics," in*2018 IEEE International Symposium on Multimedia (ISM),*2018.

[21] M. Wien, High Efficiency Video Coding: Coding Tools and Specification, 1 ed., Springer-Verlag Berlin Heidelberg, 2015.

[22] V. Sze, M. Budagavi and G. J. Sullivan, "High efficiency video coding (HEVC)," in *Integrated Circuit and Systems, Algorithms and Architectures,* vol. 39, Springer, 2014, p. 40.

[23] M. Jacobs and J. Probell, "A brief history of video coding," *ARC International,* 2007.

[24] Bitmovin, "Video Developer Report 2019," 2019.

[25] G. J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," *IEEE Transactions on Circuits and Systems for Video Technology,* vol. 22, pp. 1649-1668, 2012.

[26] K. Misra, A. Segall, M. Horowitz, S. Xu, A. Fuldseth and M. Zhou, "An overview of tiles in HEVC," *IEEE Journal of Selected Topics in Signal Processing,* vol. 7, pp. 969-977, 2013.

[27] B. Bross, J. Chen, S. Liu and Y.-K. Wang, *JVET-2001-v9: Versatile Video Coding (Draft 7),* 2019.

[28] Hendry, Skupin and Wan, *JVET-N0032: CE12: Summary Report on Tile Set Boundary Handling,* 2019.

[29] Hannuksela, Wang and Hendry, *JVET-N0126: AHG12: Signalling of subpicture IDs and layout,* 2019.

[30] N. Ahmed, T. Natarajan and K. R. Rao, "Discrete cosine transform," *IEEE Transactions on Computers,* vol. 100, pp. 90-93, 1974.

[31] H. Schwarz, D. Marpe and T. Wiegand, "Analysis of Hierarchical B Pictures and MCTF," in *ICME,* 2006.

[32] J. Sole, R. Joshi, N. Nguyen, T. Ji, M. Karczewicz, G. Clare, F. Henry and A. Duenas, "Transform coefficient coding in HEVC," *IEEE Transactions on Circuits and Systems for Video Technology,* vol. 22, pp. 1765-1777, 2012.

[33] J. Lainema and K. Ugur, "Angular intra prediction in high efficiency video coding (HEVC)," in *2011 IEEE 13th International Workshop on Multimedia Signal Processing,* 2011.

[34] P. Hanhart et al., *JVET-L1012: JVET common test conditions and evaluation procedures for 360° video,* 2018.

[35] Y. Ye, E. Alshina and J. Boyce, *JVET-1004: Algorithm descriptions of projection format conversion and video quality metrics in 360Lib Version 5,* 2019.

[36] Lee, Lin, Chang and Ju, *JVET-K0131: CE13: Modified Cubemap Projection in JVET-J0019,* 2018.

[37] ISO/IEC, *ISO/IEC 23090-2:2019: Information technology — Coded representation of immersive media — Part 2: Omnidirectional media format,* 2019.

[38] ISO/IEC, *ISO/IEC 23008-2:2013: Information technology — High efficiency coding and media delivery in heterogeneous environments — Part 2: High efficiency video coding,* 2013.

[39] M. Pope, *ETSI TS 26.118: 3GPP Virtual Reality Profiles for Streaming Applications,* 2018.

[40] X. Corbillon, F. De Simone and G. Simon, "360-degree video head movement dataset," in *Proceedings of the 8th ACM on Multimedia Systems Conference,* 2017.

[41] J. Vielhaben, H. Camalan, W. Samek and M. Wenzel, "Viewport Forecasting in 360° Virtual Reality Videos with Machine Learning," in *Proceedings of the 2nd International Conference on Artificial Intelligence & Virtual Reality,* 2019.

[42] S. Gul, G. Bhullar, R. Skupin, T. Ajaj, S. Bosse, C. Hellge and T. Schierl, "Upper Limits of Head Orientation Prediction for 360-Degree Video Streaming," *preprint,* 2019.

[43] Y. Sanchez, G. S. Bhullar, R. Skupin, C. Hellge and T. Schierl, "Delay Impact on MPEG OMAF's tile-based viewport-dependent 360° video streaming," *IEEE Journal on Emerging and Selected Topics in Circuits and Systems,* 2019.

[44] R. Reisenhofer, S. Bosse, G. Kutyniok and T. Wiegand, "A Haar wavelet-based perceptual similarity index for image quality assessment," *Signal Processing: Image Communication,* vol. 61, pp. 33-43, 2018.

[45] Fraunhofer HHI, "VVC Test Model (VTM) Reference Software," [Online]. Available: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM.

[46] G. Louppe, "Understanding random forests: From theory to practice," *arXiv preprint arXiv:1407.7502,* 2014.

[47] B. Boehmke and B. M. Greenwell, Hands-On Machine Learning with R, Taylor & Francis Group, 2019.

[48] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, 2016.

[49] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brücher, M. Perrot and E. Duchesnay, "Scikit-learn: Machine Learning in Python," *Journal of Machine Learning Research,* vol. 12, pp. 2825-2830, 2011.

[50] P. Roßbach, "Neural Networks vs. Random Forests - Does it always have to be Deep Learning?".

[51] L. Breiman, Classification and regression trees, Routledge, 2017.

[52] Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver and N. de Freitas, "Bayesian optimization in AlphaGo," *arXiv preprint arXiv:1812.06855,* 2018.

[53] P. I. Frazier, "A tutorial on Bayesian optimization," *arXiv preprint arXiv:1807.02811,* 2018.

[54] M. L. Bermingham, R. Pong-Wong, A. Spiliopoulou, C. Hayward, I. Rudan, H. Campbell, A. F. Wright, J. F. Wilson, F. Agakov, P. Navarro and others, "Application of high-dimensional feature selection: evaluation for genomic prediction in man," *Scientific Reports,* vol. 5, p. 10312, 2015.

[55] X. Liu, Y. Huang, L. Song, R. Xie and X. Yang, "The SJTU UHD 360-degree immersive video sequence dataset," in *2017 International Conference on Virtual Reality and Visualization (ICRV),* 2017.

[56] ITU, *ITU-R BT.500: Methodologies for the subjective assessment of the quality of television images,* 2019.

[57] ITU, *ITU-T P.910: Subjective video quality assessment methods for multimedia applications,* 2008.

[58] A. F. Costa, G. Humpire-Mamani and A. J. M. Traina, "An efficient algorithm for fractal analysis of textures," in *2012 25th SIBGRAPI Conference on Graphics, Patterns and Images,* 2012.

- Kai Bitterschulte (Author), 2019, Content Complexity based Rate Assignment for 360 degree Video Streaming, Munich, GRIN Verlag, https://www.grin.com/document/541615
