GPGPU Computing with OpenCL

Bachelor Thesis, 2013

127 Pages, Grade: 1.0



1.1 Motivation
1.2 Goal
1.3 History of GPGPU Computing
1.4 Chapter overview

2 OpenCL
2.1 What is OpenCL?
2.2 Components
2.3 Hardware architectures
2.3.1 Central Processing Unit (CPU)
2.3.2 Graphics Processing Unit (GPU)
2.4 API Overview
2.5 Platform model
2.6 Execution model
2.6.1 Creating kernels
2.6.2 Kernel execution
2.6.3 Kernel code
2.7 Memory
2.7.1 Buffers
2.7.2 Images
2.7.3 Memory model
2.8 Code sample

3 Matrix multiplication
3.1 CPU Implementation
3.2 Naive GPU implementation
3.3 Optimization using local memory
3.4 Optimization using vectorization
3.5 Combining both approaches
3.6 Further implementations
3.7 Summary and conclusion

4 Prefix sum
4.1 CPU Implementation
4.2 Naive GPU implementation
4.3 Work efficient GPU implementation
4.4 Recursively scanning blocks in local memory
4.5 Optimization using vector types
4.6 Further implementations
4.7 Summary and conclusion

5 Sorting
5.1 CPU Implementations
5.1.1 C/C++ standard library routines
5.1.2 Radix sort
5.2 GPU Implementations
5.2.1 Bitonic Sort
5.2.2 Bitonic Sort using kernel fusion
5.2.3 Radix Sort
5.2.4 Radix Sort using local memory
5.2.5 Radix Sort using local memory and vector loads
5.3 Further implementations
5.4 Summary and conclusion

6 Conclusion
6.1 Summary
6.2 Conclusion
6.3 Personal experiences
6.3.1 Progress
6.3.2 Positive and negative aspects

List of Figures

List of Listings


A Utility functions

B Benchmark environment


This thesis offers an introduction to programming GPUs using OpenCL. After a historical overview of the development of graphics cards, the peculiarities of GPU and CPU hardware are explained. Based on this knowledge, OpenCL is introduced as an API that supports programming primarily these types of hardware. A closer look at OpenCL's execution and memory model, which enables the support of heterogeneous hardware, is rounded off by a simple code example.

The thesis then continues with the implementation of several standard algorithms, which are brought to the GPU by means of OpenCL. The chosen problems begin with matrix multiplication and continue with the all-prefix sum and sorting algorithms. Because of the parallelism already inherent in the nature of the first problem, the corresponding chapter places a particular focus on performance analysis and optimization. The all-prefix sum and sorting are harder to divide into independent units of work; suitable methods for dealing with this kind of problem are presented.

Every GPU implementation is subjected to runtime measurements and compared with corresponding CPU variants. Since CPUs and GPUs have very different hardware architectures, algorithms for solving each problem were chosen so as to exploit the potential of the underlying platform as well as possible.


This thesis provides an introduction to programming for GPUs using OpenCL. After a historical overview of how graphics cards have evolved, the peculiarities of GPU and CPU hardware are discussed. Based on this knowledge, OpenCL is introduced as an API supporting all kinds of processing hardware. A deeper look into OpenCL's execution and memory model, which allows handling heterogeneous hardware, is rounded off by a simple, yet complete example code.

The thesis then continues with several implementations of standard algorithms for the GPU. The chosen problems start with matrix multiplication and continue with the all-prefix sum and sorting. As the first problem naturally offers parallelism, the first implementation chapter focuses on performance analysis and optimization. The all-prefix sum and sorting are both problems that are more difficult to split into independent pieces of work; techniques to tackle such kinds of problems will be discussed.

Each GPU implementation is benchmarked and compared with one or more traditional CPU approaches. As GPUs and CPUs have different hardware architectures, appropriate algorithms and optimizations have been chosen to solve the problems while exploiting the underlying platform as well as possible.

1 Introduction

1.1 Motivation

For a long time, programs became faster simply through improved processor hardware. Intel co-founder Gordon E. Moore was one of the first to describe this trend, in 1965. His description is well known as Moore's law, which initially stated that the number of transistors on an integrated circuit doubles every year (Moore 1965). This trend slowed down, and a decade later Moore had to change the interval to two years (Kanellos 2003). The most important consequence, however, was realized by one of Moore's co-workers at Intel, David House, who predicted that the performance of processors would double every 18 months (Kanellos 2003). As a result, an increase in computing power could be obtained simply by running the same programs on newer hardware; hardware thus dominated the growth of programs' performance for a long period of time.

Unfortunately, processing hardware technologies hit a limit at the beginning of the third millennium. Physical bounds, like the size of atoms, prevented a further shrinking of integrated circuits that would have allowed an increase in clock speed. Therefore, processor vendors like Intel and AMD started to place multiple CPUs on a single chip, which still led to an increase in computational power, but in a different way than in the previous 40 years. This change is sometimes referred to as the multicore crisis (Warfield 2007). The consequence from a software developer's perspective is that traditional algorithms stopped gaining speed simply by being run on newer hardware. In fact, improving performance is now up to the programmer, who is responsible for writing concurrent and parallel software that can make full use of the underlying hardware's capabilities.

But multicores were not the only major hardware change that recently had an impact on the way software is developed. Graphics cards, which were initially intended to offload and accelerate 2D and 3D graphics operations from the CPU onto a separate hardware device, started to gain popularity in non-graphical areas of programming. A GPU's immense floating-point processing power and inherently parallel design make it ideal for use in several areas of science, business and engineering (General Purpose GPU, GPGPU). When used in the right place, a GPU may compute results several orders of magnitude faster than the CPU (McClanahan 2010). However, due to its very different hardware architecture, the graphics card is still not suitable for efficiently solving the same problems as a CPU does.

Today's software engineers working in performance-focused domains have a hard time designing their applications. With two different types of processors (CPU and GPU), both excellent in their own ways, and a lot of APIs and frameworks around them (OpenMP, CUDA, OpenCL, DirectCompute, C++ AMP, ...), the available hardware and corresponding software is more heterogeneous than ever. Thus, profound knowledge of the available technologies, their performance and restrictions, as well as the consequences for a development team, is essential for choosing the optimal strategy to tackle today's computational needs.

1.2 Goal

The goal of this thesis is to provide the reader with a state-of-the-art comparison of modern GPU and CPU performance in solving traditional problems, such as sorting arrays and multiplying matrices, using various approaches. The GPU algorithms will be implemented using the Open Computing Language (OpenCL). Therefore, the reader is given a short introduction to OpenCL in order to understand the provided code samples and how programming for a GPU works. These OpenCL implementations are benchmarked against algorithms implemented in C/C++ solving the same problem. It is important to note that the GPU and CPU versions do not necessarily have to use the same algorithm; they only have to output the same result. This decision was inevitable due to the highly diverse hardware properties of the CPU and the GPU. The problems chosen for this comparison are sorting an array, multiplying two large matrices and calculating the parallel prefix sum of an array (explained in chapter 4). These algorithms cover several different aspects that play an important role when implementing software for a GPU, such as runtime complexity, memory footprint and memory transfer time vs. computation time.

Eventually, the results of benchmarking the chosen algorithms should give the reader an idea of how much a GPU can accelerate different kinds of algorithms and what a programmer has to pay attention to when developing GPGPU accelerated software. This knowledge is not only valuable during development but also useful when choosing the right technology for a given problem. This thesis should aid all software engineers in understanding how GPU computing works and where it should be used.

1.3 History of GPGPU Computing

Before we go into details about OpenCL and several GPU implementations, some background information about how GPUs have evolved is provided, which may help in understanding the design and peculiarities of graphics hardware. The following information is taken from a survey paper from 2010 (McClanahan 2010).

In the early 1980s a "GPU" was nothing more than an integrated frame buffer, an additional chip that relied on the CPU for drawing.

One of the first dedicated video cards was the IBM Professional Graphics Controller (PGA), released in 1984. It used an on-board Intel 8088 microprocessor to take care of graphical calculations in order to offload the main processor (CPU). Although the Intel 8088 was in fact a conventional CPU, the PGA's separate on-board processing unit marked an important step towards the development of GPUs as co-processors.

In the following years, more graphics-oriented features were added, like shading, lighting, rasterization, the depth buffer and color blending. With the introduction of the Open Graphics Library (OpenGL) in 1989 by Silicon Graphics Inc. (SGI), the world's first application programming interface (API) for 2D and 3D graphics was released. OpenGL was designed upon a concept called the graphics pipeline, which depicts video processing as data traveling through multiple stages of operations. The pipeline begins at the CPU by sending geometry data together with colors, textures, etc., which is then transformed from the initial 3D coordinate space to pixel coordinates. Lighting operations use material colors and textures to shade the incoming triangles, which are then rasterized into the frame buffer for display. As the programmer was not able to alter the functionality of the pipeline, it is also known under the term "fixed function pipeline". This processing model determined the design of graphics hardware for more than a decade.

In the mid-1990s NVIDIA, ATI and Matrox started to provide graphics controllers for consumers. Famous computer games like Quake and Doom became popular, spurring the gaming industry and the interest in GPUs. However, GPUs at this time were only able to output one pixel per clock cycle, and CPUs were able to send more triangles than the GPU could handle. This problem led graphics hardware vendors to add more pipelines and eventually more cores to their GPUs in order to enable parallel pixel processing and to increase throughput.

NVIDIA's release of the GeForce 3 in 2001 marked an important step in the evolution of GPUs by loosening the restrictions of the fixed function pipeline. With the ability to write small programs for the GPU, which could operate on the vertices traveling through the pipeline, programmers were given a tool to make limited changes to the behavior of the pipeline. These programs were called vertex shaders and were written in an assembly-like language. One year later, the pixel shader followed, running on a separate part of the GPU hardware.


Figure 1: The graphics pipeline with programmable vertex and fragment processor. (Fernando and Kilgard 2003)

With the introduction of the High Level Shading Language (HLSL) with DirectX 9 in 2003, programming GPU hardware became easier than with the previous shaders written in assembly. The first developers started to use the available programmability for non-graphical tasks; the first attempts at GPU computing emerged. A year later, in 2004, Brook and Sh appeared, representing the first languages targeting GPGPU computing. Although the workflow of the GPU was still based on a programmable graphics pipeline, the hardware mainly consisted of strong, highly parallel floating-point processors with fast memory access.

In 2006 the GeForce 8800 featured the first GPU with a unified programmable processor, called a streaming multiprocessor, which is used to execute the vertex, pixel and the new geometry shader. The graphics pipeline has thus become only a software concept used by graphics APIs such as OpenGL and DirectX.

Software GPGPU support was introduced by NVIDIA with the Compute Unified Device Architecture (CUDA). It offers a C-like language to create general purpose programs (called kernels) that can be executed on the GPU. ATI followed with the ATI Stream technology, and Microsoft introduced compute shaders with DirectX 11.

In 2010 NVIDIA released the first GPU based on their Fermi architecture which was explicitly designed for GPGPU computing. The GTX580 Fermi GPU released later that year contained 16 streaming multiprocessors with 512 CUDA cores and accessed 1.5 GiB GDDR5 RAM with a bandwidth of 192.4 GB/s (NVIDIA Corporation 2013b).

1.4 Chapter overview

After this short introduction, chapter 2 will continue with a comprehensive coverage of OpenCL. Besides general information about the API, this chapter focuses on the knowledge required to understand how OpenCL executes kernels on the GPU and what a developer has to pay attention to when programming for graphics hardware. This information is vital for understanding the implemented algorithms in the subsequent chapters.

Chapter 3 will present the first OpenCL implementation of a standard algorithm by tackling multiplication of large floating-point matrices. Besides a simple and naive approach, several possibilities for optimization are discussed, introducing graphics hardware features like shared memory, texture memory and vector types.
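As a point of reference, a naive CPU matrix multiplication, the kind of baseline such a comparison typically starts from, can be sketched as follows (a hypothetical listing, not the thesis' actual implementation):

```c
#include <stddef.h>

/* Naive CPU matrix multiplication: C = A * B for square n x n matrices
 * stored in row-major order. A hypothetical baseline sketch; real
 * benchmark code would additionally consider cache blocking etc. */
void matmul(const float *A, const float *B, float *C, size_t n)
{
    for (size_t row = 0; row < n; ++row)
        for (size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            /* dot product of row of A and column of B */
            for (size_t k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
}
```

The triple loop performs O(n^3) multiply-add operations, which is exactly the kind of regular, data-parallel workload the later GPU variants distribute across many threads.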

Chapter 4 continues with implementations of a prefix sum algorithm (also known as scan), a ridiculously simple problem for a CPU due to its sequential nature. However, the linearity of the algorithm does not fit the architecture of a GPU. A tree-based approach is discussed, which is commonly used to partly parallelize linear algorithms.
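The sequential nature mentioned above is easy to see in code: a CPU computes the prefix sum in a single pass, with each output depending on its predecessor (a hypothetical sketch of the inclusive variant, not the thesis' listing):

```c
#include <stddef.h>

/* Sequential inclusive prefix sum (scan): out[i] = in[0] + ... + in[i].
 * Hypothetical sketch; trivial on a CPU because each element depends on
 * the previous running sum, so a single loop suffices. */
void prefix_sum(const int *in, int *out, size_t n)
{
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];   /* carry the sum forward */
        out[i] = running;
    }
}
```

This chain of dependencies is precisely what the tree-based GPU approach has to break up into independent pieces of work.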

Chapter 5 focuses on sorting, one of the most famous problems in computer science. Several well-performing CPU implementations, such as C's qsort and C++'s std::sort, will be compared with GPU sorting techniques using less popular algorithms such as the bitonic sorting network and radix sort.
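On the CPU side, using C's qsort merely requires supplying a comparison function (a hypothetical sketch; the thesis' actual benchmark code may differ):

```c
#include <stdlib.h>

/* Comparator for ints, as required by the C standard library's qsort.
 * Hypothetical helper; returns <0, 0 or >0 for less, equal, greater. */
static int cmp_int(const void *lhs, const void *rhs)
{
    int a = *(const int *)lhs;
    int b = *(const int *)rhs;
    return (a > b) - (a < b);   /* avoids overflow of a - b */
}

/* Sort an int array in ascending order using qsort. */
void sort_ints(int *values, size_t count)
{
    qsort(values, count, sizeof values[0], cmp_int);
}
```

Such library routines make a strong, convenient baseline, which is why beating them on a GPU requires the less conventional algorithms named above.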

Chapter 6 rounds off the thesis with a summary of the covered information, a final conclusion and outlook as well as personal experiences made during the creation of this document.


2 OpenCL

2.1 What is OpenCL?

OpenCL is specified by the Khronos Group in the OpenCL 1.2 Specification as follows:

OpenCL (Open Computing Language) is an open royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms. (Munshi 2012)

The Khronos Group is an industry consortium that maintains OpenCL as an open standard. This means that the Khronos Group offers a specification of the OpenCL API and a detailed description of its functionality and behavior for free. This specification can be downloaded from their website (Munshi 2012). Maintaining the OpenCL standard consequently means that the Khronos Group provides neither software development kits (SDKs), nor drivers that implement OpenCL, nor hardware that can make use of it. These concerns are the subject of other companies called vendors, which are typically hardware manufacturers providing the necessary development resources and including OpenCL support in their drivers. Examples of such companies are the two famous graphics card vendors NVIDIA and AMD, as well as the renowned processor manufacturer Intel.

Unlike other famous APIs of the Khronos Group, which are very specific in their usage or targeted hardware (like the famous 3D API OpenGL used to drive graphics hardware), OpenCL is a general purpose programming framework. It was conceived with universality in mind, imposing almost no restrictions on the fields of application it may be used in. One of OpenCL's most powerful strengths is its portability, resulting from a well designed hardware abstraction model, which enables it to run on many different kinds of devices, even ones which may not have been created yet. Such devices may be classical CPUs and GPUs, but also more uncommon types of hardware like FPGAs (field programmable gate arrays), DSPs (digital signal processors) or Intel's MICs (Many Integrated Core) (Merritt 2011). Moreover, OpenCL may be used to combine multiple available devices into a single heterogeneous platform, extending an application's processing resources beyond the bounds of individual pieces of hardware.

In addition to being independent of a specific purpose and decoupled from the underlying hardware, OpenCL is also available across all major operating systems including Windows, Linux and Mac OS X.

With the upcoming specification of WebCL (currently available as a working draft), OpenCL will eventually even find its way into web browsers and the HTML5 technology, thus making it independent even of the operating system and bringing high performance computing into web applications.

2.2 Components

OpenCL is not only an API. As it allows programs to run on hardware that may have certain restrictions or offer different features than a classical CPU, traditional languages like C++, Java or C# cannot be used to write those programs. Therefore, the OpenCL standard includes the specification of a separate language that is used to write small programs that are executed on a hardware device. These programs are called kernels and are written in the OpenCL C language, which is a restricted version of C99 with extensions for vectorization and for handling OpenCL's abstract memory model.

To allow OpenCL to support many different hardware platforms, it consists of the following components:


API

The application programming interface (API) is specified by the Khronos Group, ensuring that every vendor implementation of OpenCL is compatible and exchangeable. The API is provided as a set of C header files that a software developer can download from Khronos' website. These headers are typically also shipped with an SDK. Khronos additionally specifies a C++ wrapper around the C API; bindings for other languages exist, but they are not maintained by Khronos. The API is used by a conventional program (e.g., written in C++) to run kernels on an OpenCL capable device. This program is called the host application.


Software development kit (SDK)

The software development kit is the collection of tools and resources a software developer needs to write applications using OpenCL. An SDK is usually provided by a vendor, i.e., an implementor of OpenCL. Examples of such SDKs are the NVIDIA CUDA Toolkit and AMD's Accelerated Parallel Processing (APP) SDK. These SDKs typically contain header files to be included by a C/C++ program and static libraries for the linker, which are responsible for binding to the OpenCL driver at runtime. Furthermore, code examples, tutorials, documentation, development tools, etc. may be provided depending on the vendor. With headers and static libraries, an SDK contains all resources necessary to write and compile an OpenCL application.

OpenCL C language

Kernels are typically written in separate source files using the OpenCL C language. An OpenCL application reads these source files at run time and sends them to the OpenCL compiler to create a binary for one or more available hardware devices where it may be executed. A source file written in the OpenCL C language may consist of several functions, variable definitions, control flow statements, comments, etc., but must contain at least one function prefixed with the kernel attribute. Such a function may be called by the host application through the OpenCL API and serves as an entry point into a kernel.
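A minimal kernel source file might look as follows. The vector_add kernel is a hypothetical example, not taken from the thesis; the guarded block at the top defines stubs for OpenCL's built-in qualifiers and the get_global_id function only so that the listing is also readable (and compilable) as plain C — a real kernel file omits it, as the OpenCL compiler provides these identifiers:

```c
/* Stubs so this sketch is also valid plain C; an OpenCL compiler
 * defines __OPENCL_VERSION__ and provides these identifiers itself. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t stub_global_id;                 /* simulated work-item index */
static size_t get_global_id(int dim) { (void)dim; return stub_global_id; }
#endif

/* Entry point: each work-item adds one pair of elements. */
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *out)
{
    size_t i = get_global_id(0);   /* this work-item's global index */
    out[i] = a[i] + b[i];
}
```

The host application would pass this file's text to the OpenCL compiler and enqueue vector_add with one work-item per array element, so the loop over elements is implicit in the execution model.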

OpenCL C Compiler

OpenCL kernels must be compiled for a specific platform before they can be executed. This compilation is initiated and controlled by the host application through the API. The separate compilation at run time is required to retain OpenCL’s portability, as an OpenCL application may be deployed on any kind of machine with a suitable driver. Consequently, the available OpenCL implementation (providing the compiler) and the used hardware device (affecting the compiler’s output) may only be determined at runtime.


Driver

Finally, the driver is OpenCL's core. It implements the OpenCL API and maps the functionality specified in the standard to a vendor's hardware. The host application uses the driver through the API to initialize OpenCL, compile kernels, allocate memory resources, initiate computations and communicate with kernels running on a device. A driver is always specific to dedicated hardware and must therefore be provided by a vendor. The driver is sometimes also referred to as an Installable Client Driver (ICD).

2.3 Hardware architectures

One of the greatest advantages of OpenCL and also one of the biggest influences on its design is OpenCL’s ability to support devices of many different hardware architectures. These devices can be used through OpenCL by the same common API and may even be used together in a single application or computation, which is sometimes referred to as heterogeneous computing. A closer look into the architectures of the two most common hardware devices, namely CPUs and GPUs, may help the reader to better understand design decisions made by the Khronos Group when OpenCL was conceived. Furthermore, a deeper understanding of graphics hardware will be needed for the implementation chapters of this thesis.

2.3.1 Central Processing Unit (CPU)

The CPU is the kind of processor developers are most used to. Software written in common languages like C++, Java or C# is directly or indirectly (using a virtual machine) executed on this kind of hardware. Figure 2 shows the design of a modern processor by the example of an Intel Ivy Bridge quad-core. One can see the four physical cores of the processor. Thanks to hardware multithreading (Hyper-Threading in Intel terms), each physical core can process a pair of threads, which are seen as two virtual CPUs by the operating system, resulting in 8 available cores for an application. Note the on-chip graphics processor in the Ivy Bridge architecture, which can be used as a replacement for a graphics card. It is, however, not used as a general processing device such as the CPU's cores.


Figure 2: Architectural overview of an Intel Ivy Bridge processor. (Shimpi and Smith 2012)

When it comes down to calculation throughput, and the tackled problem offers some degree of data parallelism (several chunks of data are processed equally), the SIMD (Single Instruction Multiple Data) features of the CPU can be used to increase the number of input elements an instruction can process. Intel's Ivy Bridge, for example, allows the parallel execution (vectorization) of up to eight single precision floating point operations using AVX (Advanced Vector Extensions, the successor of SSE, the Streaming SIMD Extensions) instructions.
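As an illustration of this idea (a hypothetical sketch, not code from the thesis), the following C function multiplies two arrays element-wise: compiled with AVX support it processes eight floats per instruction using intrinsics, and otherwise it falls back to plain scalar code.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>   /* AVX intrinsics */
#endif

/* Element-wise product out[i] = a[i] * b[i]. With AVX enabled, eight
 * single precision floats are handled per instruction; the scalar loop
 * covers the remainder (or everything when AVX is unavailable). */
void mul_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#ifdef __AVX__
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* load 8 floats */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(va, vb));
    }
#endif
    for (; i < n; ++i)   /* scalar tail */
        out[i] = a[i] * b[i];
}
```

Modern compilers often generate such vectorized loops automatically, but writing the intrinsics explicitly makes the eight-wide data parallelism visible.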

Considering memory access, the CPU uses the standard data bus to communicate with the external main memory. As this communication usually takes a while (up to hundreds or even thousands of CPU cycles (Gaster et al. 2011, p. 54)), the CPU uses a hierarchical cache system. Each core has its own level one and level two cache, which have to be kept synchronized with the other cores. All cores share a common level three cache that is further connected to the memory controller.

From a performance-concerned developer's perspective, all one has to pay attention to is having enough parallelism in an application to utilize all cores of the CPU (typically 2, 4 or 8 on consumer hardware). If an application has to process large amounts of data using the same operations on elements of the input (e.g., multimedia applications like image or video processing), vector instructions may be used to speed up execution. Regarding memory, beside small adjustments to improve cache effects, the developer has only limited possibilities to optimize, as the entire cache system is controlled by the CPU and the operating system.

2.3.2 Graphics Processing Unit (GPU)

When we think of a GPU, we think of a highly parallel, graphics-oriented processing device. From its introduction with OpenGL in 1989 until the creation of the first GPGPU languages in 2004 (cf. section 1.3), the architecture of a GPU's hardware followed the principles of the graphics pipeline. To fit this purpose, GPUs developed from original desktop processors to (as written in the book Heterogeneous Computing with OpenCL (Gaster et al. 2011)) heavily multithreaded devices employing sophisticated task management directly in hardware. This was necessary to cope with complex task graphs processing vertices, geometry and pixels. These tasks and the data they process are highly parallel and represent an extensive amount of independent work, which is therefore most suitably handled by a device with a great number of cores employing latency tolerant multithreading[1].


Figure 3: Full chip block diagram of NVIDIA’s Kepler GK110 architecture containing 15 streaming multiprocessors (SMX). (NVIDIA Corporation 2013c)

Figure 3 shows the design of a modern GPU by the example of NVIDIA's Kepler architecture. At the top of the block diagram, the PCI Express controller is placed. This interface is responsible for the communication of the graphics card with the CPU through the mainboard's PCI Express bus. After data has arrived at the graphics device, it has to be scheduled for processing to one or more of the streaming multiprocessors (SM, or SMX in NVIDIA's Kepler architecture terms). An SM is an aggregation of many cores with additional scheduling logic, therefore representing an extra hierarchical step in a GPU's workflow. This is one of the first, bigger differences in the architecture of modern GPUs when compared with traditional CPUs. SMs are called compute units in OpenCL terms.

Figure 4 shows a detailed view of the structure of a streaming multiprocessor.

Figure 4: Architecture of a Streaming Multiprocessor (SMX) of NVIDIA's Kepler GK110 architecture containing 192 single precision CUDA cores, 64 double precision units, 32 special function units (SFU), and 32 load/store units (LD/ST) (NVIDIA Corporation 2013c).

Once a package of work is scheduled to an SM, it is prepared for execution by the Warp Schedulers. A Warp is a block of threads (32 on NVIDIA hardware) executing the same instructions in parallel across the cores managed by the Warp Scheduler. This concept is sometimes called SIMT (Single Instruction Multiple Threads). The difference to SIMD (Single Instruction Multiple Data) is that SIMD requires a vector instruction in each thread, whereas SIMT only uses scalar code; the vectorization is handled by the hardware (Micikevicius 2012, p. 99). A Warp Scheduler is able to manage multiple Warps at the same time (up to 64 on the Kepler GK110 architecture (NVIDIA Corporation 2013c, p. 7)), which may be executed interleaved on the cores (similar to the two threads on a hyper-threading Intel core). This allows hiding memory latency (latency tolerance): when a Warp has to stall because of a memory request, the Warp Scheduler can execute other Warps until the memory request is satisfied.

Another important aspect to notice is that all cores within an SM share the same registers. This plays an important role when choosing the right number of Warps to run concurrently on the SM. Too few Warps may result in unused core cycles, whereas too large a number of Warps may request more registers than are available on the SM. The additionally needed registers are then allocated in the slow global memory. This concern will be discussed later in section 2.6.

The actual work is then performed on the cores themselves. The majority of cores (labeled "Core" in green in figure 4) are used for single precision floating point and integer arithmetic. In contrast to older GPUs, modern graphics processing units also include support for double precision floating point arithmetic, which was not needed for graphical calculations in the past, but becomes increasingly important in GPGPU computing. The Kepler architecture serves this need with dedicated cores (labeled "DP Unit"). Additional hardware resources include the load and store units for register and memory access and the special function units (labeled "SFU"). At the bottom of figure 4 we find the texture units, which are heavily used in 3D graphics. They can also be used by OpenCL via images (more in section 2.7.2).

Finally, beside the way of executing threads on its cores, the memory system is the second component of a GPU that shows significant differences to a CPU. Although using a hierarchical cache system, the GPU lacks caches for each individual core. The first level cache resides within the SM and is used by all cores inside the SM. A specialty of the GPU is the shared memory (also called scratch pad memory, or local memory in OpenCL terms), which can be seen as a programmer-controllable cache. An OpenCL kernel may use this small block of memory to cache data from the global memory and, to a limited extent, to share data with other threads (more on this in section 2.6.2). Furthermore, shared memory is implemented in hardware using multiple memory banks. Special access patterns are required to avoid so-called bank conflicts, which result in slow memory request serialization between threads. Outside the SM (cf. figure 3), the GPU offers a larger level two cache, which is then connected to multiple memory controllers to access the global memory. The global memory resides on the graphics card outside the GPU chip (just as the RAM is outside a CPU) and has a larger size of up to several GiB. An important difference of the global memory when compared with the RAM on the mainboard is that RAM can be accessed in almost any pattern[2] without significant performance penalties (apart from cache effects). This is significantly different for a GPU, where memory requests should be coalesced[3] and blocked across the threads of a warp to achieve optimal memory system utilization.

Concerning the performance of a GPGPU accelerated application, there is a huge number of concerns to worry about. Two of them have already been discussed, namely choosing the right number of warps concurrently executing on the SMs and paying attention to memory access. For the latter, the memory access pattern should be coalesced, and frequently needed or shared data should be cached in shared memory. Further optimizations exist but would exceed the scope of this thesis. For further reading, NVIDIA has a detailed paper on this subject (Micikevicius 2012).

AMD GPUs are structured similarly to the NVIDIA Kepler architecture presented here. A small difference is that AMD's Ultra-Threaded Dispatch Processor replaces the Warp Schedulers and schedules Wavefronts of 64 threads instead of 32-thread Warps. However, the most fundamental difference is found on the instruction level. Whereas NVIDIA's cores only process one instruction at a time across a Warp, AMD processes a full packet of instructions in parallel. These packets are referred to as Very Long Instruction Words (VLIWs). VLIWs are packets of four (or five on older GPUs) instructions that can be executed in parallel (instruction level parallelism). These packets are generated by the compiler. Due to dependencies between instructions, a VLIW may not always be completely filled and may therefore not utilize a VLIW processor completely. Moving instruction level parallelization into the compiler is advantageous for the hardware, as it becomes simpler compared to hardware which tries to parallelize the instruction stream at run time via superscalar or out-of-order execution. Kepler's Warp Schedulers, for example, can issue two instructions in parallel but have to resolve data dependencies between them at runtime via scoreboarding[4] (Kanter 2010, p. 4). However, static scheduling might not be as efficient as scheduling at runtime, as the compiler cannot make any assumptions about the occupancy of different pieces of hardware and the delays resulting from it. For further details about AMD's GPU architectures, David Kanter wrote an article on Real World Technologies about AMD's Cayman architecture from 2010 with several comparisons to NVIDIA's Fermi architecture (Kanter 2010).

2.4 API Overview

The OpenCL API is specified using the C programming language. The Khronos Group defines a set of C functions, types and constants that make up the API a developer can use to write applications using OpenCL. These declarations are collected in the cl.h C header file, which can be downloaded from the Khronos website and can typically also be found in the include directory of any OpenCL SDK. Although specified in C, OpenCL has an object-oriented design. The UML class diagram in figure 5 shows the objects which can be created, queried and manipulated using corresponding API calls. With OpenCL 1.1, the Khronos Group also specified a C++ wrapper built atop the C API, which will not be covered in this thesis.

The following sections give a brief introduction to OpenCL's design and the relevant API functions. They are based on chapter 2 of the book Heterogeneous Computing with OpenCL (Gaster et al. 2011, p.15-31).

illustration not visible in this excerpt

Figure 5: OpenCL class diagram from the OpenCL 1.2 specification. (Munshi 2012)

2.5 Platform model

As OpenCL is an open standard, it can be adopted by any hardware vendor manufacturing processors powerful enough to support the features specified by the Khronos Group. Many computers contain more than one such processor, typically at least a CPU and a graphics card, both of which are well suited for use with OpenCL. However, these devices do not have to share the same OpenCL implementation. Someone may, for example, have an NVIDIA graphics card using NVIDIA's driver and an Intel Core CPU using Intel's OpenCL CPU driver. With an AMD Radeon graphics card and AMD's Catalyst driver, on the other hand, the CPU can also be accessed using the same driver. This situation of multiple OpenCL implementations being available on the same machine is handled by the platform model. Each OpenCL implementation on the system (a vendor-specific driver) represents a platform. The available platforms on the system, which in this context is often called the host, can be queried using the clGetPlatformIDs function (for a source code sample see section 2.8). Each platform supports one or more devices (the hardware supported by the driver). Reconsidering the previous scenario, the Intel platform would support one device, the CPU, whereas the AMD platform would support both the GPU and the CPU. The available devices of a platform can be retrieved by calling clGetDeviceIDs with the desired platform. Devices available on the same platform (e.g., CPU and AMD GPU using the AMD platform) can be used together, which is referred to as heterogeneous computing. A device itself consists of multiple compute units which are functionally independent of each other (cf. the SMs of a GPU). Each compute unit is in turn divided into processing elements (cf. the cores inside an SM). The number of available compute units on a device and other information like memory capacities can be queried using clGetDeviceInfo.
Figure 6 visualizes the relation of platforms, devices, compute units and processing elements. (Gaster et al. 2011, pp.19)

illustration not visible in this excerpt

Figure 6: OpenCL’s platform model.

2.6 Execution model

The central components of OpenCL's execution model are kernels. But before a kernel can be executed on a device, a number of additional resources have to be set up. After a device has been selected, a context has to be created that manages objects like memory buffers or kernels associated with the device. A context is created by calling clCreateContext with one or more devices of the same platform as arguments. Additionally, properties may be passed to the function, like a platform to restrict the context to this platform (allowing multiple contexts for different platforms to coexist), or a handle to an OpenGL (Open Graphics Library, a 3D API) context to enable graphics interoperability. (Gaster et al. 2011, p.22)

While a context holds objects shared between the host and one or more devices, OpenCL offers an additional mechanism for host-device communication. To request action from a device, a command queue needs to be created within a context and associated with this device. This is done by calling clCreateCommandQueue with the context and device as arguments. Once a command queue is created, the host can request actions from the device by enqueuing operations like memory buffer writes/reads or kernel executions. By default these operations are executed asynchronously to the host application, although explicit synchronization can be enforced by calling clFinish on the command queue, which blocks until all pending operations have been completed. (Gaster et al. 2011, p.22, 23, 26)

2.6.1 Creating kernels

After the OpenCL execution environment has been successfully prepared, all that is left is the actual piece of code to execute. OpenCL C code is usually written into text files just like any other code. But in contrast to the host application, OpenCL C code is not compiled before an OpenCL application is deployed. It is compiled at runtime via corresponding API functions. This peculiarity is required as different vendors use individual binary formats for the code that may be executed on their devices. Therefore, an application has to read the OpenCL C code from the source files at runtime and pass it to the clCreateProgramWithSource function, which returns a program object representing the given piece of code. This program can be compiled and linked for one or more devices by calling clBuildProgram. As compile errors can occur (the return value is not equal to CL_SUCCESS), the build log may be retrieved by a subsequent call to clGetProgramBuildInfo. After the program has been successfully built, kernels can be extracted (a program may contain multiple kernels). This is done by calling clCreateKernel with the compiled program and the name of a function with the __kernel attribute, which will be used as entry point into the kernel. The obtained kernel is ready to be enqueued on a command queue of a device the kernel was compiled for. (Gaster et al. 2011, p.26, 27)

2.6.2 Kernel execution

A kernel can be seen as a C function that is executed in parallel on the device's processing resources by the OpenCL runtime. As elaborated in section 2.3, these resources may be organized very differently. Therefore, OpenCL uses an abstraction model describing the dimensions of the queued work and how it is split up into small pieces. (Gaster et al. 2011, p.16)

When the host application wants to enqueue a kernel, it has to specify the size of the problem as a so-called n-dimensional range (NDRange). This range can be seen as a one-, two- or three-dimensional grid of indexes which typically represents the size of the input or output of a kernel. Each element of the grid represents a single kernel invocation and is called a work item. Work items are identified by their position inside the grid, which is called the global id. The size of the grid is called the global work size. (Gaster et al. 2011, p.18)

For example: A matrix multiplication with an output matrix of M * N elements might enqueue a kernel with a two-dimensional range of M * N work items. Each work item would then use its indices to determine the row and column of the input matrices and the position of the calculated output value.

To adapt more closely to GPU hardware architectures, the NDRange can be equally divided into work groups. Work items inside a work group benefit from having a dedicated shared memory block available. Access to this memory can be synchronized between the work items of a work group, providing a basic method of communication. To address a work item inside a work group, each work item has an additional local id and a group id. The extents of a work group must have the same dimension as the NDRange, and the sizes in each dimension, called local work sizes, must evenly divide the global work sizes (powers of two are typical choices on GPUs). (Gaster et al. 2011, p.18) The maximum number of work items in a work group is implementation dependent and can be queried by a call to clGetDeviceInfo with the CL_DEVICE_MAX_WORK_GROUP_SIZE constant as argument. Figure 7 shows the relation of the NDRange, work groups and work items.

A kernel is executed on a command queue by calling clEnqueueNDRangeKernel. Additional important arguments besides the kernel and the command queue are the NDRange's dimension, the global work size and the local work size; the local work size is optional. Before a kernel is executed, arguments may be specified for the kernel function according to the function's signature. These arguments are passed to the device when the kernel starts executing and can be accessed via the kernel function's parameters. Arguments are set from the host by calling clSetKernelArg with the argument's index in the kernel function's signature as well as the size and value of the argument.

illustration not visible in this excerpt

Figure 7: Relation of the NDRange, work groups and work items.

2.6.3 Kernel code

The language used to write kernels is OpenCL C, which is derived from C99 and extended by OpenCL specifics like vector data types, built-in functions and additional keywords. A kernel function must have the __kernel attribute and must have void as its return type. The function's name identifies the kernel to the host. A kernel may have parameters, which are declared in the function's signature. An example kernel function for a matrix multiplication is given in the listing below.

illustration not visible in this excerpt

2.7 Memory

Many applications have to work with large amounts of data. As the data that can be passed to a kernel function by value is restricted to a small size, OpenCL offers the possibility to allocate larger blocks of memory, represented as memory objects which are valid inside a context and moved to a device when necessary. At the top level, OpenCL defines two basic types of memory objects: buffers and images. A buffer can be seen as a contiguous block of memory, equivalent to an array in C allocated using malloc. Images, in contrast, are abstract memory objects which can make use of device-specific hardware acceleration and optimization when being read from or written to (e.g., texture units on a GPU). Images do not have to be supported by a device. (Gaster et al. 2011, pp.23)

2.7.1 Buffers

Buffers are created by the host application via a call to clCreateBuffer. In addition to the context and buffer size, several flags may be specified. With these flags (besides further options) the buffer can be declared as read or write only, or OpenCL can be instructed to use or copy a provided block of host memory. Once a buffer is created, read and write operations may be enqueued on a command queue to transfer a buffer's content (or parts of it) between the host and a device. clEnqueueReadBuffer and clEnqueueWriteBuffer are the corresponding API functions for transferring memory. When needed by a kernel, buffers have to be set as arguments to the kernel and can be accessed via a pointer from inside the kernel function. (Gaster et al. 2011, p.24)

2.7.2 Images

Images abstract the underlying storage implementation to allow a device to use hardware optimizations. GPUs, for example, might use their texture units for improved read and write access as well as for providing hardware accelerated interpolation between pixel values of the image. As images do not need to be represented as contiguous arrays of a specified data type, a format descriptor is used to determine the size and data type of a pixel value of the image. Images are created by a call to clCreateImage2D or clCreateImage3D (unified to clCreateImage in OpenCL 1.2) with similar parameters as the buffer creating functions. Additionally, the image format descriptor has to be specified, and an optional padding between the rows of the image can be set. Images are read and written by the host application via calls to clEnqueueReadImage and clEnqueueWriteImage. Like buffers, images can be set as arguments to a kernel, but cannot be accessed via pointers inside the kernel function. Special built-in functions like read_imagef have to be used in combination with a sampler object which determines out-of-bounds image access, interpolation, and access coordinate handling. (Gaster et al. 2011, pp.25)

2.7.3 Memory model

As memory subsystems are highly diverse between different kinds of hardware devices, OpenCL defines an abstract memory model giving vendors the possibility to map OpenCL’s abstract types of memory to physical memory units on a device. OpenCL defines four types of memory which are shown in figure 8:[5]

illustration not visible in this excerpt

Figure 8: Memory model as defined by the OpenCL 1.2 specification. (Munshi 2012)

1. Global memory

Global memory is the largest region of memory on a device and is accessible by all work items of all work groups. On a GPU it is typically mapped to the device's DRAM (the graphics memory). All pointer variables pointing into global memory have to be prefixed with the __global attribute inside a kernel.

2. Constant memory

Constant memory is typically implemented as part of the global memory and is used to store read-only data which is needed by all work items. All pointer variables pointing into constant memory have to be prefixed with the __constant attribute inside a kernel. (Gaster et al. 2011, p.30) Newer types of hardware may provide special caches for this kind of memory (cf. the read-only data cache in NVIDIA's Kepler architecture in figure 3).

3. Local memory

Local memory (or shared memory) is a small block of memory which is shared between all work items inside a work group. As access to local memory is very fast, it can be seen as a programmer-controlled cache for global memory. Local memory is often used for preloading data needed by all work items of a work group or for synchronizing work items across a work group. All pointer variables pointing into local memory have to be prefixed with the __local attribute inside a kernel. Local memory can be allocated statically inside a kernel by declaring a fixed-size array like __local float sharedArray[32]; or dynamically with a call to clSetKernelArg providing the size of the requested local memory block and nullptr as pointer to the argument's value. (Gaster et al. 2011, p.30)

4. Private memory

Private memory belongs to each work item itself and is typically stored in registers (on a GPU), unless a work group requests more memory than the available register file provides. In that case, the spilled private memory is mapped to global memory. Private memory cannot be allocated explicitly, but pointers to local variables may be used to pass values by reference to functions called by the kernel function. Pointers to private memory can be prefixed with the __private attribute inside kernels, but they do not have to be, as private is the default address space. (Gaster et al. 2011, p.30)

2.8 Code sample

The following piece of code in listing 2 shows a minimal OpenCL application with a kernel that calculates the sum of two input vectors. To keep the code simple, no error handling is performed; only the error code is retrieved. The code in the subsequent implementation chapters will completely go without error handling or error code retrieval.

illustration not visible in this excerpt

Listing 2: A minimalistic working OpenCL application which calculates the sum vector of two input vectors.

The first step in every OpenCL program is to choose a platform. As platforms correspond to the available OpenCL implementations on the system, multiple platforms can be available. The example code in listing 2 queries the first available OpenCL platform by calling clGetPlatformIDs with 1 and a pointer to a variable able to hold one platform id as arguments. The third parameter could be used to retrieve the number of actually available platforms.

After a platform has been chosen, we can continue by choosing a device for this platform, which works analogously to selecting a platform. In addition to specifying the platform to query devices for, OpenCL also allows us to define the type(s) of devices we would like to get. In this case, we would like to have the first available GPU on the first available platform by specifying CL_DEVICE_TYPE_GPU when calling clGetDeviceIDs[5]. To be able to allocate OpenCL resources and enqueue operations on a device, we now have to create a context and a command queue using clCreateContext and clCreateCommandQueue. Additional arguments to both API functions allow specifying further properties, which can be looked up in the corresponding documentation.

The program is now ready to create a kernel. OpenCL kernel source code is typically placed in a file and read at runtime. For simplicity, the kernel code of this example is embedded into the host code as a string literal. The OpenCL kernel simply queries the id of the current work item, loads the values from the buffers a and b at this index and writes their sum to buffer c at the same index.

To create an actual kernel object within the context, we first have to create an OpenCL program from the source code, which is done by calling clCreateProgramWithSource and passing the source code as argument. Furthermore, the program has to be compiled for the chosen device. By calling clBuildProgram the source code of the program object is compiled and linked into an executable for the specified device. This step can take up to several seconds. As with building programs in other languages, compiling may fail. In this case, an error code is returned and[6] a compiler log could be retrieved using clGetProgramBuildInfo. On a successful build, the kernel object (which can be seen as the entry point into the program) can finally be retrieved by calling clCreateKernel and specifying the compiled program as well as the name of the kernel function.

Before we can execute the kernel, we have to allocate storage for the input and output data. Three arrays are allocated on the host with a size of N. The first two are filled with data. To move the data to the GPU, three buffer objects have to be created using clCreateBuffer, having the same size as the corresponding host arrays. Note that the two input buffers are created as read only and the output buffer as write only by setting appropriate flags. To initiate the copy operations from the host to the GPU device, buffer write operations have to be enqueued on the command queue using clEnqueueWriteBuffer. Arguments are the command queue, the buffer to write to, a boolean specifying whether the write operation should be blocking, the offset and length of the region to be written to, a pointer to the host memory from which should be read, and several further arguments concerning events which will not be covered. By specifying false for the blocking write parameter, the write operation is executed asynchronously to the host program. Therefore, the provided host pointers (a and b) have to remain valid at least until the first blocking OpenCL call. Asynchronous memory transfers are advantageous, as OpenCL may decide itself when to perform the actual copy. Furthermore, multiple copies may be executed concurrently and even alongside kernel executions.

Now it is time to set up and execute the kernel. The three buffer objects, on which the kernel will operate, are set as the kernel's arguments via clSetKernelArg. The second parameter specifies the index of the argument in the order they appear in the kernel function's signature. Finally, the kernel can be enqueued for execution using clEnqueueNDRangeKernel. The problem is one-dimensional with a global work size of N. The local work size is not specified and is therefore determined by OpenCL. Similar to the buffer writes, the kernel enqueue operation is also executed asynchronously. However, the arguments do not have to remain valid after clSetKernelArg returns, as they are copied by the OpenCL implementation. Furthermore, as the kernel depends on the buffer objects as inputs, the kernel is not executed until all buffer operations (memory transfers) on the inputs have completed.

The last step is to read the results back to host memory via clEnqueueReadBuffer having the same arguments as the corresponding write function. This time however, the synchronization boolean is set to true. Therefore, the function call blocks until the read operation has finished. Afterwards the result on the host may be further processed (e.g., printed).

When OpenCL resources are no longer needed, they should be released using their corresponding clRelease* API functions.

3 Matrix multiplication

Multiplying large matrices is often part of heavy scientific calculations and therefore an important building block in every mathematics library. Unfortunately, with increasing size the computation of the product becomes desperately slow, due to an asymptotic runtime complexity of O(n^3) for a naive attempt. Therefore, several improvements have been tried to speed up the calculation. Strassen's algorithm, for example, achieves a runtime complexity of O(n^(log2 7)) ≈ O(n^2.807) (Strassen 1969), which is faster than the naive version but still does not compute results within a satisfying time for larger matrices. Nonetheless, modern computing systems can take advantage of one important aspect of the matrix multiplication: its embarrassingly parallel nature. Each element of the output matrix can be computed completely independently (cf. figure 9). This potential may not only be used by today's CPUs' vector extensions like SSE [1] or AVX [2]. Especially a GPU can make excellent use of the large number of independent pieces of work to perfectly utilize its hundreds or thousands of cores.

illustration not visible in this excerpt

Figure 9: Principle of the matrix multiplication (StefanVanDerWalt 2010).

Besides the implementation and optimization of the matrix multiplication on GPUs, this chapter will also present performance data of GPUs collected by an appropriate profiling tool (AMD's CodeXL will be used). This allows a better understanding of an algorithm's performance and of how bottlenecks can be found and eliminated.

The provided benchmarks were created on a notebook with mid-range hardware components. Details can be found in appendix section B. Furthermore, the host code may use the utility functions roundToMultiple and roundToPowerOfTwo for aligning input data to an appropriate size. Their respective definitions can be found in appendix chapter A.

This chapter will take a look at different implementations of the matrix multiplication for both, the CPU and the GPU. For simplicity, square matrices are used and stored as one dimensional arrays in row major order.

illustration not visible in this excerpt

Figure 10: Benchmark of several square matrix multiplication implementations.

3.1 CPU Implementation

Although lots of optimized algorithms and their implementations exist, we will first have a look at a simple and naive implementation as the one given in listing 3.

illustration not visible in this excerpt

Listing 3: A simple C++ implementation of a square matrix multiplication for the CPU.

The naiveCPU function takes two pointers to the memory where the two input matrices are stored, another pointer to already allocated memory where the output is written to, and a final parameter n giving the size of the input and output matrices. The algorithm is simple. Every element of the output matrix has to be calculated, therefore the outer two for loops run through all of these elements. For each output element, the dot product of the row in matrix a and the col in matrix b is calculated.

The performance of this implementation is decidedly slow, as one can see in figure 10. Multiplying matrices with an edge length of up to 700 elements can be done in a short time (below one second), but larger matrices require a huge amount of time, which is unsatisfying in most cases. Even a multithreaded version of the code provided in listing 3 (e.g., by using OpenMP and placing a #pragma omp parallel for above the outer for loop) does not bring substantial improvements. The optimized Fortran77 BLAS library (BLAS (Basic Linear Algebra Subprograms) 2011) achieves significantly better performance when compared with the simple CPU implementation provided here.

3.2 Naive GPU implementation

A working GPU implementation can be directly derived from the simple CPU implementation given in the previous section 3.1. As we can see in figure 9, each output element is calculated independently and the calculation of an output element corresponds to the loop body of the second for loop in the CPU implementation in listing 3. Therefore, both outer for loops offer a good starting point for parallelization.

The naive GPU implementation uses the same principle as the initial CPU version. But before we can jump into GPU code, some setup on the host is required which is shown in listing 4. To simplify the code, context and command queue have already been created and are provided as input to the multNaiveGPU function.

illustration not visible in this excerpt

Listing 4: Host code for a matrix multiplication implementation using OpenCL.

For the three pointers to arrays holding the two input matrices and the output matrix, OpenCL buffers have to be created. On creation, additional flags may be provided to allow the underlying OpenCL implementation to optimize memory access to these buffers. Therefore, the two input matrix buffers are set to CL_MEM_READ_ONLY and the output buffer to CL_MEM_WRITE_ONLY. These flags only affect access to the memory objects from the kernel code and do not restrict the host application from reading or writing the buffers. After the buffers have been created, a write operation is enqueued on the command queue to the GPU for both input matrices. Note that the third parameter of the calls to clEnqueueWriteBuffer specifies whether the function blocks until the write operation has completed. This parameter is set to false, allowing OpenCL to transfer the memory blocks asynchronously to the running host application and even in parallel. Before the kernel can be executed, the two input buffers, the output buffer and the size of the matrix are set as arguments to the kernel function. Furthermore, the global and local work size for the kernel invocation have to be determined. The global work size must be a multiple of the used work group size (local work size). The chosen size of the work groups affects GPU utilization, as a too small size may lead to wasted processing power on the SMs and decreased latency tolerance. A too large work group size may cause the kernel to exhaust the available register file on the SM. In this case, private data is spilled to global memory, causing a serious drop in performance. Finding the optimal work group size is hardware dependent and often boils down to experimentation. If the kernels are small, it is usually a good idea to try the highest possible work group size (256 on the used GPU).
Note that choosing a work group size which does not evenly divide the global work size, and therefore rounding the global work size up to the next multiple of the work group size, causes the GPU to execute the kernel for additional, unneeded work items. Concerning the matrix multiplication example, the kernel would be executed for a bigger matrix whose size is evenly divisible by the work group size. This has to be kept in mind when, e.g., accessing buffers inside the kernel. After global and local work sizes have been chosen, the kernel can be enqueued on the command queue for execution by calling clEnqueueNDRangeKernel. The call returns immediately, as the kernel is executed asynchronously once the command queue is flushed (clFlush) or a blocking command is enqueued, which is the case for the subsequent read operation. The call to clEnqueueReadBuffer reads the result from the device into the host application's array. Note that the third parameter is set to true, indicating a blocking operation. All previously enqueued commands are guaranteed to have executed and finished before the read operation takes place, which returns after all data has been successfully copied to host memory.

The last missing component of the naive GPU implementation is the OpenCL kernel itself, shown in listing 5.

illustration not visible in this excerpt

Listing 5: OpenCL Kernel code calculating a single element of the output matrix per work item.

The __kernel function closely resembles the loop body of the second for loop of the CPU implementation in listing 3. The purpose of a single invocation of this kernel is to calculate one element of the output matrix. By using OpenCL's built-in function get_global_id with the dimension as argument, a work item can retrieve its position inside the NDRange, which equals the size of the output matrix (plus extra space due to work group size rounding). Therefore, the retrieved position has to be checked against the matrices' size, as the NDRange may be larger than the original matrix. If the coordinates identify a valid matrix position, the output element is determined by calculating the dot product of the row in matrix a and the col in matrix b and is eventually written to the output matrix buffer c.

When run on the GPU, the performance of this naive OpenCL implementation is amazingly good, as the benchmark in figure 10 shows. For 800 x 800 elements, the size at which the CPU implementation's run time started to exceed several seconds, the GPU variant still runs in around 200 milliseconds. Additionally, the curve rises far more slowly than the CPU's.

However, when we have a look at the measured performance counters of the naive approach in table 1, we notice several things: Despite the perfect 100 % occupancy of the kernel (a result of no local memory usage and a low vector register footprint), the algorithm performs a relatively high number of fetch instructions (cf. the ALU fetch ratio of 2.5). This can also be seen from the time the fetch unit was busy, which was 99.9 %. This means the algorithm is highly memory bound. Furthermore, elements of the input matrices are accessed multiple times (as often as the edge length of the output matrix). Although this explains the nice cache behavior of 57.5 % hits, there is definitely room for improvement. Another approach to reduce the total number of memory requests would be to fetch larger elements per work item. This would lead to fewer but larger memory requests, which might be handled more efficiently by the memory system. Furthermore, fewer work items would be required.

3.3 Optimization using local memory

Based on the performance analysis of the naive GPU approach, the first and most obvious optimization is to reduce the number of memory requests to global memory. Ideally, each input matrix element should be accessed only once. This ideal cannot be achieved, but the number of memory requests to a single element can be reduced to one inside smaller subregions by subdividing the input matrices into tiles, as shown in figure 11. Instead of each work item accessing all needed input elements like in the naive version, each work item of a work group (which has the same size as a tile) copies one element of a tile from each input matrix to local memory. After the two tiles have been copied completely, the work items then access the corresponding elements from the tiles in local memory (avoiding redundant global memory requests), multiply and add the values to the running sum of each work item (an element of the output matrix). Then the next two tiles are loaded. The algorithm repeats until all tiles have been processed.

The host code for this algorithm looks very similar to the one of the naive version. The major difference is that the size of the input matrices must now be a multiple of the tile size. To avoid complex handling of smaller matrices in the kernel, we now create fitting matrices beforehand. This can be achieved by allocating slightly bigger buffers (whose size is a multiple of the tile size) and initializing them with zeros. The actual input matrices can then be copied into the upper left rectangle of the larger buffers. OpenCL offers corresponding functions to achieve this task. Listing 6 presents the host code for the optimized matrix multiplication using tiles in local memory.

illustration not visible in this excerpt

Listing 6: Host code for a matrix multiplication implementation using OpenCL, where the input and output matrices’ sizes are rounded up to be a multiple of tile_size.

At first and most importantly, a TILE_SIZE has to be defined. This parameter defines the work group size and the size of the square tiles that are allocated in local memory and used to cache values from global memory. As the maximum allowed work group size is 256 on the used GPU, tiles of 16 x 16 elements fit perfectly. The buffers are again allocated with clCreateBuffer. If the adjusted buffer size is larger than the actual matrix size, the input buffers are initialized to zero using clEnqueueFillBuffer. Afterwards, each input matrix is copied to a rectangular region (as large as the input matrix) in the top left corner of the allocated buffer using clEnqueueWriteBufferRect. For a detailed explanation of the arguments have a look at the corresponding entry in the OpenCL specification (Munshi 2012, p.76). Apart from the kernel's arguments, which equal the ones of the naive version, the work group size must be equal to the defined TILE_SIZE. Finally, the results are retrieved from the output buffer following the same principle as when the inputs were copied to the input buffers.

Listing 7 presents the kernel code of the optimized matrix multiplication using tiles in local mem­ory.

illustration not visible in this excerpt

Listing 7: OpenCL Kernel code calculating a single element of the output matrix per work item, where the input matrices are split into tiles which are cached in local memory. (Tompson and Schlachter 2012)

When compared with the first, naive OpenCL kernel, the optimization using locally cached tiles achieves a speedup of 2.63 for a problem size of 4000, as we can see in figure 10. We can also see further improvements in the performance counters in table 1. Although the general thread management overhead remains constant (equal number of work items, work groups and therefore wavefronts) and the kernel occupancy still stays at perfect 100 %, we notice a major decrease in general ALU instructions. Moreover and most importantly, fetch instructions to global memory have been reduced to 1/16 (due to the used TILE_SIZE of 16). Despite this reduction, the total amount of data fetched from global memory could only be reduced by a factor of 2.4 due to worse cache behavior. In addition to fewer memory requests and ALU instructions, a third and probably unnoticed optimization occurred: as TILE_SIZE is a compile time constant, the compiler is now able to unroll and vectorize the inner for loop, which results in more independent instructions being available to the compiler to generate better filled VLIWs. This can be seen by an almost doubling of the ALU packing counter's value.

3.4 Optimization using vectorization

Instead of using local memory to decrease the number of redundant memory requests, this approach focuses on a better utilization of the memory system by using vector types and multiple fetches and writes per work item. Therefore, one work item does not only fetch one element but a block of elements. The access pattern of the algorithm itself works conceptually in the same way as the local tile version of the previous section 3.3. The tiles are called blocks and each work item fetches two whole blocks itself. No local memory is used, as the fetched blocks are stored in each work item's registers. Each work item then also produces an output block which is written to global memory.

The host code for this matrix multiplication version is almost identical to the one using tiles cached in local memory in listing 6. The differences are shown in listing 8.

illustration not visible in this excerpt

Listing 8: Differences of the host code for the matrix multiplication using vector types when compared with the one using tiles cached in local memory from listing 6.

Most importantly, a BLOCK_SIZE is defined that sets the size of the block of elements that is fetched from global memory per work item. This parameter must be one of the OpenCL supported vector type widths (which are 2, 4, 8 or 16). A larger BLOCK_SIZE means more and larger elements are fetched from global memory per work item (better bus utilization) and fewer work items are needed to run the calculation (less thread management overhead). Furthermore, multiple independent global memory fetches can be pipelined. Additionally, as the number of elements a work item calculates increases, the number of independent instructions (instruction level parallelism) of the kernel increases, allowing better ALU utilization through mostly full VLIWs (super scalar execution of the kernel). However, as a consequence, the required number of registers per work item rises, which allows fewer work groups to run concurrently on a SM. For further reading, Vasily Volkov held a great talk about this kind of tweaking (Volkov 2010). The optimal BLOCK_SIZE probably differs between GPUs and is subject to corresponding benchmarks. The implementation in listing 9 will use a value of four.
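Sequentially, the per-work-item blocking can be sketched as follows (a plain C emulation, not the thesis' kernel; on the GPU, each row of a block would be one float4 load/store and each iteration of the two outer loops would be one work item):

```c
#include <stddef.h>

#define BLOCK_SIZE 4  /* must match an OpenCL vector width: 2, 4, 8 or 16 */

/* Multiply two n x n matrices (n a multiple of BLOCK_SIZE), computing a
 * BLOCK_SIZE x BLOCK_SIZE output block per (bi, bj) iteration. The acc
 * array models the work item's registers holding the output block. */
void matmul_blocked(const float *a, const float *b, float *c, size_t n)
{
    for (size_t bi = 0; bi < n; bi += BLOCK_SIZE)
        for (size_t bj = 0; bj < n; bj += BLOCK_SIZE) {
            float acc[BLOCK_SIZE][BLOCK_SIZE] = {{0.0f}};
            for (size_t k = 0; k < n; ++k)           /* walk the K dim */
                for (size_t i = 0; i < BLOCK_SIZE; ++i)
                    for (size_t j = 0; j < BLOCK_SIZE; ++j)
                        acc[i][j] += a[(bi + i) * n + k]
                                   * b[k * n + bj + j];
            for (size_t i = 0; i < BLOCK_SIZE; ++i)  /* write the block */
                for (size_t j = 0; j < BLOCK_SIZE; ++j)
                    c[(bi + i) * n + bj + j] = acc[i][j];
        }
}
```

The 16 accumulators per iteration are the source of the independent instructions the text attributes to this version.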

Listing 9 presents the kernel code of the optimized matrix multiplication using vector types and multiple reads/writes to global memory per work item.

illustration not visible in this excerpt

Listing 9: OpenCL Kernel code calculating a 4 x 4 block of elements of the output matrix per work item using four float4 vectors, where the input matrices are accessed directly from global memory. (Advanced Micro Devices, Inc. 2013a)

Compared with the naive GPU version, the block version achieves a speedup of 11.95 at a problem size of 4000 when looking at the chart in figure 10. The kernel also performs 4.54 times faster than the optimized version using locally cached tiles in section 3.3. Also, the performance counters in table 1 show interesting changes. Most notable is the change of global work size from 4000 x 4000 to 1008 x 1008 which accounts mostly for the achieved speedup. Although the kernel now processes 16 elements instead of one, the number of instructions executed per work item only increased by a factor of 1.19 (from 20019 to 23782). Furthermore, the number of total fetch instructions per work item remained constant (a work item now issues four instructions but fetches four times more data per instruction) resulting in equal management work for the fetch unit (which is still busy at 99.4 %) but higher throughput. Additionally, as each work item processes four independent output elements, the amount of independent instructions increased leading to an almost doubling in ALU packing performance. The major drawback of this version is that the high number of needed registers (VGPRs) limits the amount of wavefronts that can be concurrently executed on a SM. This can be seen at the lower occupancy of only 37.5 %. As a result, memory latency may not be hidden well enough (by scheduling wavefronts not waiting for memory requests). However, this problem is specific to the number of available registers which is likely to increase as GPGPU computing becomes an important design factor for GPU hardware vendors.

3.5 Combining both approaches

The optimization attempts presented in sections 3.3 and 3.4 focus on two aspects.

1. The first one tries to reduce the number of redundant global memory requests by subdividing the input matrices into tiles which are read once by each work group and cached in local memory.
2. The second approach tries to improve the general memory system and ALU utilization by fetching larger chunks of memory per work item and processing them using vector types.

The final kernel presented in this chapter will combine both ideas. As a consequence, the matrices will be subdivided on two levels. On the lower level, elements are organized into small blocks which are always loaded completely by a work item. On the higher level, the blocks are grouped into tiles which can be cached in local memory and are always retrieved by a whole work group.

The host code for this matrix multiplication version is almost identical to the one using tiles cached in local memory in listing 6. The only differences are shown in listing 10.

illustration not visible in this excerpt

Listing 10: Differences of the host code for the matrix multiplication using vector types and locally cached tiles when compared with the one using only tiles cached in local memory in listing 6.

Rounding the matrix sizes up to a multiple of BLOCK_SIZE * TILE_SIZE is required to ensure coalesced memory access and to avoid complicated bounds checking (resulting in branch divergence) inside the kernel. This multiple is significantly higher than in the previous approaches and may account for a notable memory overhead on larger matrices (e.g., a matrix of size 4033 is rounded up to 4096, creating 512127 padding elements, which are 3 % of the processed data).
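The rounding itself is the usual round-up-to-multiple arithmetic; a small helper (the name is illustrative) reproduces the quoted numbers:

```c
#include <stddef.h>

/* Round n up to the next multiple of m (m > 0). */
size_t round_up(size_t n, size_t m)
{
    return ((n + m - 1) / m) * m;
}
```

With BLOCK_SIZE 4 and TILE_SIZE 16, sizes are rounded to multiples of 64: round_up(4033, 64) yields 4096, and 4096 * 4096 - 4033 * 4033 is exactly the 512127 padding elements mentioned above.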

Listing 11 shows the kernel code of the optimized matrix multiplication using vector types, multiple reads/writes to global memory per work item and locally cached tiles.

illustration not visible in this excerpt

Listing 11: OpenCL Kernel code calculating a 4 x 4 block of elements of the output matrix per work item using float4 vectors, where the input matrices are split into tiles of blocks which are cached in local memory. (Advanced Micro Devices, Inc. 2013a)

Although this approach can be seen as an optimization of the block-only version from section 3.4 by using locally cached tiles, this kernel performs, probably contrary to expectations, slower than the original block-only version (cf. benchmark chart, figure 10). The reason for this at first glance confusing behavior is easily detected when we have a look at the corresponding performance counters in table 1. First of all, the number of work items is equal to the block-only approach and therefore the number of resulting wavefronts the hardware has to process stays the same. The higher number of index calculations of the kernel shows in the huge increase of processed ALU instructions per work item, which is almost twice as many as in the block-only variant (44214 vs 23782). However, the use of local memory as a cache works successfully, as can be seen at the number of fetches to global memory, which has been reduced by a factor of TILE_SIZE, and the total number of fetched KiB, which has been decreased to a fourth. So far, the counter values indicate an increase in performance. The big penalty, however, comes from the high register and, moreover, the local memory consumption of the kernel, which only allows one work group (= four wavefronts) to run concurrently on a SM (the local memory usage of 32 KiB fills the entire available scratch pad memory). As a result, the kernel can only run 12.5 % of the hardware's supported number of wavefronts concurrently (occupancy). Memory request latency will therefore account for most of the consumed runtime. By reducing the work group size, the local memory consumption can be decreased. This would allow more wavefronts to run concurrently at the cost of more redundant global memory requests. However, this idea has not proven to be beneficial during benchmarks. Nevertheless, as hardware advances, more registers and local memory will be available in the future, maybe allowing this kernel to develop its full potential.

illustration not visible in this excerpt

Table 1: Selected performance counters of all matrix multiplication kernels gathered using AMD CodeXL’s GPU profiler (Advanced Micro Devices, Inc. 2013c). Some values have been rounded. The descriptions of the counters are based on the tool tips provided by the CodeXL Visual Studio integration.

3.6 Further implementations

Besides the presented kernels, further approaches have been implemented and benchmarked which have not been covered in detail. However, some of the experiences gained may help future developers.

1D vs 2D problem dimensions

For the naive approach from section 3.2, it does not make a difference whether the kernel is run with a 1D or 2D problem dimension. 2D seems more natural for matrices but does not affect performance.

Using images and the texture units

Using images instead of buffers as memory objects (without further specialization of access pat­terns) does not increase performance. Images may benefit from special hardware capabilities (e.g. texture units on a GPU provide different cache mechanisms).

Vector types vs arrays

Using vector types over an array (e.g. float4 a over float a[4]) does make a big difference. The blocked approach from section 3.4 has been implemented using arrays instead of vector types. Performance dropped by a factor of four. The reason for this is the size of a GPU's vector registers, which is exactly large enough to hold a float4 value. Components of the vector type can only be addressed at compile time (via, e.g., .y). Arrays, however, can be addressed at runtime and therefore the compiler has to ensure that every array element is addressable. As a result, an array of float values is placed into a consecutive block of registers where each element occupies only the first vector component of each register. This ensures fast array element access but at the cost of a quadrupled variable size.

Furthermore, several libraries exist supporting GPGPU accelerated matrix multiplication.

clAmdBlas (Advanced Micro Devices, Inc. 2013b)

A BLAS (Basic Linear Algebra Subprograms) library provided by AMD achieving similar results when compared with the blocks kernel of section 3.4.

cuBlas (NVIDIA Corporation 2013a)

A BLAS (Basic Linear Algebra Subprograms) library provided by NVIDIA shipped with their CUDA Toolkit. It uses CUDA as GPU Computing language and runs on NVIDIA GPUs only.

3.7 Summary and conclusion

Multiplying matrices is an important part of many scientific applications, and performance is therefore a major concern. Although several extremely tuned CPU libraries have established themselves, there are still very few GPU accelerated routines available. We have seen that matrix multiplication is embarrassingly parallel and therefore easy to implement for the GPU, as the parallelism is already part of the problem's nature. However, there is still as much room for improvement on the GPU as there is on the CPU's side. With the help of profiling information (performance counters), bottlenecks of the presented implementations could be detected and tackled.

We have seen the importance of avoiding redundant global memory requests by using local memory as a programmer controlled cache. Furthermore, independent instructions are beneficial as they allow for better ALU utilization by instruction level parallelism. Additionally, vector types, larger memory fetches and more processed elements per work item reduce the general thread management overhead and may achieve a better utilization of the global memory subsystem. Finally, a fast GPU accelerated matrix multiplication could be implemented, beating the already optimized BLAS library (CPU) by a factor of 36 (43.49 vs 1.21 seconds). This result definitely shows the potential of modern GPU hardware.

4 Prefix sum

The all-prefix-sums operation (also referred to as prefix sum or simply scan) is a simple calculation which is most often found as part of larger and more complex routines. It serves as a fundamental building block for implementing well-known algorithms such as radix sort (covered in section 5.2.3), stream compaction, minimum spanning tree and many more (cf. Blelloch's papers on prefix sum (Blelloch 1989) (Blelloch 1990) and the GPU Gems 3 book section 39.1 (Harris, Sengupta, and Owens 2008)).

The all-prefix-sums operation uses a binary, associative operator ⊕ with identity element I to transform an input array of n elements

[a_0, a_1, ..., a_(n-1)]

into an output array of n elements where

a) each output element is the sum of all elements preceding the corresponding input element. This is known as an exclusive scan. (Harris, Sengupta, and Owens 2008)

[I, a_0, a_0 ⊕ a_1, ..., a_0 ⊕ a_1 ⊕ ... ⊕ a_(n-2)]

b) each output element is the sum of all elements preceding the corresponding input element and the input element itself. This is known as an inclusive scan. (Harris, Sengupta, and Owens 2008)

[a_0, a_0 ⊕ a_1, ..., a_0 ⊕ a_1 ⊕ ... ⊕ a_(n-1)]

Contrary to the matrix multiplication of chapter 3, the all-prefix-sums operation does not offer similarly trivial parallelism. Scanning an array is naturally sequential with a complexity of O(n). Although each output element could be calculated independently to gain parallelism, a lot of redundant work would be necessary, raising the overall complexity to O(n^2) with the last element still taking O(n) time to calculate. This chapter will focus on efficient and parallel implementations of exclusive scan (unless otherwise noted) using addition as operator on an input array of signed 32 bit integers. This chapter is oriented towards the Parallel Prefix Sum article from GPU Gems 3 (Harris, Sengupta, and Owens 2008).

4.1 CPU Implementation

Implementing a sequential, single threaded scan for the CPU is simple. The first output element is initialized to zero. We then loop over the remaining output elements and set each one to the value of its predecessor plus the corresponding input element. Listing 12 presents an example of a scan implementation.

illustration not visible in this excerpt

Figure 12: Benchmark of several prefix sum implementations, where both axes are of logarithmic scale.
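Since the listing is not reproduced in this excerpt, the described sequential scan corresponds to a few lines of C; the function name is an assumption for illustration:

```c
#include <stddef.h>

/* Sequential exclusive scan: out[0] = 0, out[i] = out[i-1] + in[i-1]. */
void scan_exclusive(const int *in, int *out, size_t n)
{
    if (n == 0)
        return;
    out[0] = 0;  /* identity element of addition */
    for (size_t i = 1; i < n; ++i)
        out[i] = out[i - 1] + in[i - 1];
}
```

For the example input {3, 1, 7, 0, 4, 1, 6, 3} this produces {0, 3, 4, 11, 11, 15, 16, 22}.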

This code performs exactly n - 1 additions, which is the minimum number of additions required to produce an exclusive scan of an array with n elements. Concerning the parallel scan implementations later in this chapter, we would like them to be work-efficient. This means that a parallel implementation should have the same work complexity of O(n) as the sequential one.

A parallel computation is work-efficient if it does asymptotically no more work (add operations, in this case) than the sequential version (Harris, Sengupta, and Owens 2008).

The benchmark of this algorithm in figure 12 confirms the linearity of scan. Furthermore, we can also see that scanning is a quite fast operation (when, e.g., compared to the matrix multiplication of chapter 3). The CPU implementation manages to scan 2^26 elements (256 MiB of data) in 225 ms.

4.2 Naive GPU implementation

The first GPU implementation is based on the article Data Parallel Algorithms (Hillis and Guy L. Steele 1986). The discussed approach, which computes an inclusive (!) scan, is shown in figure 13. The algorithm uses several passes to compute the final output array in place. In each pass the value of a predecessor is added to an element. The offset from each element to its predecessor is determined by the pass index and is 2^(d-1), where d is the number of the pass starting with 1.

illustration not visible in this excerpt

Figure 13: A naive approach for a parallel scan. (Harris, Sengupta, and Owens 2008)

So far the algorithm assumes that in each pass all input elements are read before any output elements are written. This can only be guaranteed if the algorithm is run on a device with as many cores as input elements to ensure correct read and write ordering. This is usually not the case for larger arrays (current GPUs have around a few thousand cores, cf. NVIDIA Kepler GK110 in section 2.3.2). A solution to this problem is double buffering. Instead of computing the partial sums in place inside the input array, a second, equally sized buffer is created. In each pass input data is read from one of the buffers and written to the other. Before the next pass the buffers are swapped. Listing 13 shows an example host code implementing this approach.

illustration not visible in this excerpt

Listing 13: Host code for the naive scan algorithm.

At first, two buffers with the size of the input are created. Both of them have to be readable and writable, as they are read from and written to alternately while executing the passes. The source buffer is filled with the input data. Although this algorithm is independent of the chosen work group size, we have to round the number of enqueued work items (one for each input element) up to be a multiple of the work group size, which will be the size of the enqueued ND range. After this short setup the passes are executed. Each pass corresponds to a power of two (loop variable offset, cf. figure 13), which is the offset of an element to the predecessor that should be added to it. This offset is raised to the next power of two each pass until it is larger than the problem size. The kernel is executed once for each pass, given the source and destination buffer, the offset and the original problem size as arguments. At the end of a pass the source and destination buffers are swapped (only the handles, not the actual contents). After the last pass has been executed, the result is read from the source buffer (the last pass wrote to the destination buffer, which was swapped with the source buffer at the end of the loop). Listing 14 shows the kernel code corresponding to the host code from listing 13.

illustration not visible in this excerpt

Listing 14: OpenCL Kernel code for the naive scan algorithm.

The kernel starts by querying the id of the current element. If this id is larger than the actual problem size, the kernel returns. This case can happen when the problem size has been rounded up to be a multiple of the chosen work group size. If the id addresses a valid input element, we determine whether this element has a predecessor at the current pass' offset. If this is the case, the input element is read from the source buffer, added to its predecessor (also read from the source buffer) and written to the destination buffer. If the predecessor offset is too large, the input element remains the same, but has to be copied if it was just calculated in the last pass (to keep the buffers consistent).
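The pass structure, including the double buffering and the copy-through for elements without a predecessor, can be emulated sequentially in plain C (an illustration of the algorithm, not the OpenCL code; it computes an inclusive scan, as in the original article):

```c
#include <stddef.h>
#include <string.h>

/* Naive (Hillis-Steele) inclusive scan emulated on the host.
 * src/dst play the role of the two OpenCL buffers and are swapped
 * after every pass; scratch must hold n ints. */
void scan_naive(int *data, int *scratch, size_t n)
{
    int *src = data, *dst = scratch;
    for (size_t offset = 1; offset < n; offset *= 2) {
        for (size_t id = 0; id < n; ++id) {
            if (id >= offset)
                dst[id] = src[id] + src[id - offset]; /* add predecessor */
            else
                dst[id] = src[id];                    /* copy through    */
        }
        int *tmp = src; src = dst; dst = tmp;         /* swap handles    */
    }
    if (src != data)                                  /* result -> data  */
        memcpy(data, src, n * sizeof(int));
}
```

The outer loop runs ceil(log2 n) times, which is where the O(n log n) work complexity discussed below comes from.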

When we have a look at the benchmark of this algorithm and compare the results with the CPU implementation, we can clearly see that this approach does not benefit much from the large computational power GPUs offer. With 1132 ms at 2^26 elements the naive GPU version is five times slower than the CPU version. The low performance has basically two reasons. The first is the high number of kernel invocations necessary to compute the final result. For 2^26 input elements to scan, 26 passes are necessary, each consisting of 2^26 work items mostly performing one addition. This leads to a runtime/work complexity of O(n log n). Compared with the complexity of the original CPU implementation, which was O(n), this algorithm is not work-efficient. The second flaw of this implementation is the high rate of global memory accesses. Both buffers are accessed multiple times at the same locations throughout the passes. As the scan operation using a simple addition is memory bound rather than compute bound, a lot of time is wasted on waiting for global memory transactions. Fortunately, both problems can be tackled, which will be the subject of the following sections.

4.3 Work efficient GPU implementation

In 1990, Blelloch presented a more efficient version of the all-prefix-sums operation in his article Prefix Sums and Their Applications (Blelloch 1990). It describes a tree-based approach consisting of three phases.

illustration not visible in this excerpt

Figure 14: The up-sweep phase of a work efficient parallel scan (Harris, Sengupta, and Owens 2008)

The first phase is called the reduce or up-sweep phase and is illustrated in figure 14. It takes an input array whose length must be a power of two. All elements of the input array are leaves of the tree. The algorithm then takes adjacent pairs of elements and adds them together, forming a node holding the sum of its two child nodes. This step forms a pass of the up-sweep phase and is repeated for the created parent nodes until the root of the tree, which then holds the sum of all values. As the values of the right child nodes are not needed anymore after their parent nodes have been calculated, the tree can be built in place.

illustration not visible in this excerpt

Figure 15: The down-sweep phase of a work efficient parallel scan (Harris, Sengupta, and Owens 2008)

The second phase sets the value of the root node to zero to prepare for the third phase, the down-sweep phase, which is illustrated in figure 15. The down-sweep phase again consists of several passes, starting at the root node and repeating along all levels of the tree until the leaves. In each pass the values of the left child nodes are read and temporarily stored. The left child nodes are then replaced by the values of their parent nodes. The previous, temporarily stored values of the left child nodes are then added to the parent nodes' values and written to the right child nodes. As the parent nodes are not needed anymore after the sum for the right child has been calculated, the tree can again be traversed downwards in place. Overall, for an input array of n elements, this algorithm performs n - 1 adds during the up-sweep phase and the same amount of adds during the down-sweep phase. Although almost twice as much work is performed when compared with the sequential CPU implementation, this algorithm can still be considered work-efficient according to the definition given in section 4.1. Listing 15 shows the host code implementing the work-efficient tree based approach after Blelloch (Blelloch 1990).
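As a compact reference, the up-sweep, set-root-zero and down-sweep phases can be emulated sequentially in C (in place on a power-of-two array; the function name is an assumption for illustration):

```c
#include <stddef.h>

/* Work-efficient exclusive scan after Blelloch, emulated sequentially.
 * n must be a power of two. */
void scan_blelloch(int *a, size_t n)
{
    /* up-sweep: build partial sums bottom-up */
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];       /* right child += left child */

    a[n - 1] = 0;                        /* clear the root */

    /* down-sweep: push the sums back down the tree */
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int t = a[i - stride];       /* save left child         */
            a[i - stride] = a[i];        /* left  <- parent         */
            a[i] += t;                   /* right <- parent + left  */
        }
}
```

Both inner loops together perform the 2(n - 1) additions counted above.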

illustration not visible in this excerpt

Listing 15: Host code for the work-efficient scan algorithm.

Contrary to the previous algorithms, this implementation will use two kernels, one for the up-sweep and one for the down-sweep phase. To start with, the number of input elements has to be rounded up to be a power of two. This size is then used to create a buffer which is filled with the input array. After these initialization steps we can begin with the up-sweep phase. The loop variable for the passes is the offset of the buffer index between two adjacent nodes on the same tree level. This value starts with one for the first pass and is raised to the next power of two for each subsequent pass until the size of the input buffer has been reached. For each pass the up-sweep kernel is enqueued with the buffer and the current pass' offset as arguments. The number of work items required for a pass is equal to the number of parent nodes calculated on the current pass' level. This value is half the buffer size for the first pass and halves after each pass. Although this algorithm is still independent of the chosen work group size, we have to make sure that it is not larger than the global number of work items.

After the up-sweep phase has been completed, the last element corresponding to the tree’s root node will be set to zero. This is easily accomplished by enqueuing a write operation writing the value zero to the last buffer element.

Finally, the down-sweep phase can finish the scan computation. Similar to the up-sweep phase, several passes are executed, given the offset between two adjacent tree nodes on the same level and the buffer as arguments. In contrast to the up-sweep phase, the tree is now traversed top-down, meaning the offset starts at half the buffer size and is set to the next lower power of two in each pass until one has been reached. Analogously, the number of nodes to process (the global work size) starts with one and doubles every pass. When the down-sweep phase has completed, the buffer holds the final scan result, which can then be read back to host memory. Listing 16 shows the corresponding OpenCL kernel code for the work-efficient scan implementation.

illustration not visible in this excerpt

Listing 16: OpenCL Kernel code for the work efficient scan algorithm.

The UpSweep kernel starts by computing the value of stride, which is the distance of two parent nodes in the current pass. This value is used to calculate the buffer index (id) of the current parent node (equal to the right child node) from the work item's global id. The input value at id is read (right child) together with the corresponding neighbor node at the given offset (left child). The computed sum is then written over the right child's location.

The DownSweep kernel initializes the same way as its predecessor by determining the current pass' stride and the parent node's index. Then, the value of the parent node is read and temporarily stored. The value of the left child node (given by offset) is added to the parent node (which becomes the right child node). Finally, the previous, temporarily stored value of the parent node is passed down to the left child node.

When having a look at the benchmark results for this algorithm in figure 12, we can see that our efforts have paid off. The initial 1132 ms of the naive GPU implementation have shrunk to 604 ms on an array of 2^26 elements. Furthermore, this implementation is now work-efficient and therefore avoids unnecessary additions. Additionally, as all intermediate results are stored in place, the algorithm does not waste memory on double buffering, as the naive approach does. However, the algorithm requires the input size to be a power of two, which becomes more disadvantageous with larger input sizes. This can be clearly seen at the stepped shape of the runtime curve, which rises at every new power of two, as the time required for the calculation is equal for all problem sizes that are rounded up to the same power of two.

4.4 Recursively scanning blocks in local memory

The previous work-efficient implementation in section 4.3 already performs quite well for an initially sequential problem. However, the input buffer size restriction is undesirable. Furthermore, no local memory is used, which might be useful for computing and storing intermediate results.

illustration not visible in this excerpt

Figure 16: Scanning larger arrays of values recursively (Harris, Sengupta, and Owens 2008).

GPU Gems 3 Chapter 39 (Harris, Sengupta, and Owens 2008) shows a different approach for computing all prefix sums of an input array. The method is shown in figure 16. The input data is rounded up to be a multiple of a block size (which is two times the chosen work group size in their implementation). Each work group then loads a block from global memory to local memory. Each block is then scanned using Blelloch's work-efficient scan algorithm (Blelloch 1990), but in local memory inside a work group. After the up-sweep phase is finished, the root element of each tree in a block is copied into a smaller, temporary buffer. This buffer is then recursively scanned again. The scan result of the temporary buffer can then be used to offset the scanned blocks of the original buffer to build the final scan. Although this algorithm is not tied to a specific work group size, it performs better the larger the work group size is chosen, as the work group size determines the factor by which the input data is reduced in each recursion. Listing 17 shows the host code of this recursive scan algorithm.
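The recursion scheme can be emulated on the host with plain C (BLOCK and the helper names are illustrative; scan_block stands in for the work group's local memory scan):

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCK 4  /* illustrative; the GPU uses 2 * work group size */

/* Exclusive scan of one block; returns the block's total sum
 * (the root value a work group would write to the sums buffer). */
static int scan_block(int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int v = a[i];
        a[i] = sum;
        sum += v;
    }
    return sum;
}

/* Recursive scan: scan each block, scan the block sums recursively,
 * then add each scanned sum back as a per-block offset.
 * n must be a multiple of BLOCK (pad with zeros otherwise). */
void scan_recursive(int *a, size_t n)
{
    size_t nblocks = n / BLOCK;
    if (nblocks <= 1) {                     /* single block: base case */
        scan_block(a, n);
        return;
    }
    /* round the sums array up so it can be scanned recursively, too */
    size_t padded = ((nblocks + BLOCK - 1) / BLOCK) * BLOCK;
    int *sums = calloc(padded, sizeof(int));
    for (size_t b = 0; b < nblocks; ++b)    /* "ScanBlocks" step */
        sums[b] = scan_block(a + b * BLOCK, BLOCK);
    scan_recursive(sums, padded);           /* scan the block sums */
    for (size_t b = 0; b < nblocks; ++b)    /* "AddSums" step */
        for (size_t i = 0; i < BLOCK; ++i)
            a[b * BLOCK + i] += sums[b];
    free(sums);
}
```

Each recursion level shrinks the problem by a factor of BLOCK, which is why a larger work group size reduces the number of recursions on the GPU.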

illustration not visible in this excerpt

Listing 17: Host code for the recursive scan algorithm.

This implementation again uses two kernels, one for scanning the blocks in local memory and one for applying the offsets from the scanned temporary buffer to the original one. The host code consists of two parts, the setup code and the actual recursion. The setup is performed in recursiveGPU which starts by rounding the input data size up to be a multiple of twice the size of a work group, because each work item processes two values. Then a buffer is created and the input data written to it. This buffer is then passed to the recursive scan procedure recursiveGPU_r. The recursion starts by computing the size of the temporary buffer sums that will hold the values of the root nodes (the sums over each block). As each thread processes two input elements and each work group computes one sum over a block, this temporary buffer’s size is the problem size divided by two times the work group size. The result is then rounded up to be a multiple of this value, so it can be recursively scanned again with this algorithm. After the size has been determined, the sums buffer can be created. No initialization or memory transfer is required as the buffer is entirely accessed by the kernels. Afterwards, the block scan kernel can be set up. It takes the original input buffer as well as the buffer for the sums as arguments. Furthermore local memory is allocated to cache the block calculated by each work group. As each work item reads two input elements, the global work size is half the number of input elements. After the kernel has been enqueued, we have to check whether the input data for this recursion consisted of more than one block. If this is the case we scan the temporary buffer (containing more than 1 value) by recursively calling recursiveGPU_r on this buffer. After scanning the temporary sums buffer has completed, the sums can be applied to the original buffer. Therefore another kernel is enqueued given both buffers as arguments. 
The global and local work sizes are equal to those of the previous kernel on the same recursion level. Listing 18 shows the kernel code for the recursive scan implementation.

illustration not visible in this excerpt

Listing 18: OpenCL Kernel code for the recursive scan algorithm.

The ScanBlocks kernel starts by querying some values from OpenCL. Besides the global and local id, the size n of the current work group's block is determined. Each work item then loads two values of the block into local memory. The following code implements the tree-based scan approach by Blelloch, which has already been discussed in the previous section 4.3. The three phases (up-sweep, set-last-zero and down-sweep) are executed across all work items of the work group on the block in local memory. The only difference is that the computed value of the root node after the up-sweep phase is moved to the sums buffer at the index of the current work group. After the block has been scanned completely, it is copied back from local to global memory.

The AddSums kernel is executed with the same global and local sizes as passed to the ScanBlocks kernel on the same recursion level. The sums buffer now contains a scan of the sums of all blocks scanned in this recursion. Therefore, the position where the ScanBlocks kernel wrote the root node’s value, now contains the sum of all elements (of all blocks) preceding the block of the current work group. This value is retrieved and added to all values of the current block finishing the scan.
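Putting both kernels together, the recursive structure described above can be sketched on the CPU as follows. All names are hypothetical and the block scan is written sequentially for brevity; BLOCK plays the role of two times the work group size:

```cpp
#include <algorithm>
#include <vector>

static const size_t BLOCK = 4; // tiny block size to keep the example readable

// Recursive scan: scan each block ("ScanBlocks"), recursively scan the
// block sums, then offset every block by the scanned sums ("AddSums").
void recursiveScan(std::vector<int>& data) {
    const size_t numBlocks = (data.size() + BLOCK - 1) / BLOCK;
    std::vector<int> sums(numBlocks, 0);
    for (size_t b = 0; b < numBlocks; ++b) {           // "ScanBlocks"
        int running = 0;
        size_t end = std::min((b + 1) * BLOCK, data.size());
        for (size_t i = b * BLOCK; i < end; ++i) {
            int v = data[i];
            data[i] = running;                         // exclusive scan
            running += v;
        }
        sums[b] = running;                             // block sum -> sums buffer
    }
    if (numBlocks > 1) {
        recursiveScan(sums);                           // scan the sums buffer
        for (size_t b = 1; b < numBlocks; ++b) {       // "AddSums"
            size_t end = std::min((b + 1) * BLOCK, data.size());
            for (size_t i = b * BLOCK; i < end; ++i)
                data[i] += sums[b];
        }
    }
}
```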

Concerning the performance benchmark in figure 12, this implementation scales better with the problem size, although it is not faster than the global work-efficient scan of the previous section (606 vs. 604 ms on 2^26 elements). Furthermore, the recursive implementation's kernels are enqueued far less often than the ones of the work-efficient implementation. This can be explained by the reduction factor of each algorithm. The work-efficient implementation reduces the number of work items by a factor of two each iteration, while the recursive algorithm reduces by two times the work group size (which was 256 for the benchmarks). In return, a lot more work is done inside the recursive algorithm's kernels. Also, the number of additions required to compute the final result increases, as extra additions are needed to apply the sums from the temporary buffer to the input buffer. However, this implementation showed a different concept of how a problem can be broken down into smaller parallel pieces of work. The next implementation will follow up on this idea.

4.5 Optimization using vector types

The final implementation shown in this chapter takes the idea of the recursive scan approach from the previous section one step further (Harris, Sengupta, and Owens 2008, ch. 39.2.5). Instead of loading only two elements in each work item, we will load two vectors of elements. These are then scanned inside each work item. The sums of the vectors (root node after the up-sweep phase) are then copied to local memory. The algorithm then continues just as the previous section's version by scanning the blocks in local memory. The results can then be used to offset the vectors of each work item, just as the AddSums kernel does on a higher level. The remaining part of the implementation basically stays the same. With this modification, each work group does not only load a block of two times the work group size of elements, but a block of two times the work group size times the chosen vector width. As a result, the input array is reduced faster by a factor of the chosen vector width in each recursion. The consumed local memory stays the same, as only the sums of both vectors are placed into local memory. Only the consumed registers will increase. Listing 19 shows the differences of the host code of this vector type implementation compared with the previous one.
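The extra per-work-item level can be illustrated with a small sequential sketch (hypothetical names): each work item scans its private vector in registers and hands only the total upwards, which is exactly the value that enters the block scan in local memory.

```cpp
#include <array>

// Exclusive scan of one private 8-wide "vector" held in registers.
// The returned total is what the kernel places into local memory for
// the subsequent block scan.
int scanVec8(std::array<int, 8>& v) {
    int running = 0;
    for (int& x : v) {
        int t = x;
        x = running;      // each element receives the sum of its predecessors
        running += t;
    }
    return running;       // vector sum, pushed up to the next level
}
```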

illustration not visible in this excerpt

Listing 19: Differences to the host code shown in listing 17 for the recursive scan algorithm using vector types.

The first and most important step is to define the width of the vectors used. OpenCL supports vector types with lengths of 2, 4, 8 or 16. The larger this VECTOR_WIDTH is chosen, the larger is the reduction of the input buffer in each recursion. However, as a side effect the number of consumed registers per work item increases, leading to a lower occupancy of the kernel. The ideal vector width may be hardware specific and is subject to corresponding benchmarks. This implementation will use a VECTOR_WIDTH of 8. The remaining changes to the host code are straightforward. As the blocks get bigger by a factor of VECTOR_WIDTH, the size of the temporary sums buffer decreases by this factor and has to be rounded up to a multiple of it. Furthermore, the global work size shrinks. Finally, the check whether the input array consisted of more than one block has to be adapted to the new block size. Although the changes to the host code are quite small, the kernel needs a few more extensions, as shown in listing 20.

illustration not visible in this excerpt

Listing 20: OpenCL Kernel code for the recursive scan algorithm using vector types with a width of eight.

First of all, several macros are defined for the ScanBlocksVec kernel to avoid redundant code for the up-sweep and down-sweep implementation on the vector types. After querying topological information, the kernel begins by copying two int8 vectors into registers. The up-sweep phase is completely executed on the work item's registers on both vectors, resulting in a lot of independent instructions. After the up-sweep phase, the last vector elements hold the sum over both vectors, which is then copied into the local memory block. After the last elements are reset to zero, the down-sweep phase continues, finalizing the scan on the two vectors stored in registers. The remaining part of the kernel is equal to the one of the previous implementation in listing 18, which calculates the scan of the block in local memory and copies the block's sum into the sums buffer. A final small addition is to offset the vectors by the sum of all previous vectors of preceding work items in this block. Notice that the value loaded from local memory is added to each vector element. Finally, the vectors are written back to global memory.

The AddSums kernel only changes in the type of the buffer, which now consists of int8s. Also notice here that the value val loaded from the sums buffer is added to all elements of the two vectors of buffer.

Looking at the benchmark results in figure 12, we can clearly see a difference to the previous, non-vector implementation. The 604 ms of the non-vector implementation could be reduced by 37% to 383 ms for a problem size of 2^26 input elements. This is already a remarkable result for a sequential problem such as scan. Nevertheless, the initial CPU scan time of 225 ms is still out of reach. However, a closer look at the benchmark data shows that the final recursive vector scan implementation actually beat the CPU implementation when it comes down to pure run time. It took only 87 ms to scan the 2^26 elements on the GPU. Upload and download have the larger shares of the total time required by the GPU scan, contributing 154 and 141 ms for transferring the 256 MiB to and from the GPU's main memory. This means that 77 % of the total run time is wasted on memory transfers, which is a huge problem for many memory-intensive algorithms like scan. Figure 17 shows a detailed view of the upload, run and download phases of the recursive vector scan implementation.

illustration not visible in this excerpt

Figure 17: A closer look at the benchmark of the recursive vector scan from section 4.5. It points out how much time is wasted by memory transfers. Note that both axes are of logarithmic scale.

4.6 Further implementations

Besides the presented kernels, a further approach has been implemented and benchmarked which has not been covered in detail. However, some of the experiences gathered may help future developers.

Bank conflict avoidance

The GPU Gems 3 chapter about scan (Harris, Sengupta, and Owens 2008) also discusses avoiding the bank conflicts occurring in the kernel from section 4.3, where blocks are scanned in local memory. Both recursive scan implementations (sections 4.4 and 4.5) have been implemented using the presented approach to avoid bank conflicts. However, the time required to calculate the conflict-free offsets into local memory cancels out the time gained by faster local memory access.

Furthermore, several libraries exist supporting GPGPU accelerated scans.

clpp (2011)

The OpenCL Data Parallel Primitives Library is an open source and freely available library offering a few GPU primitives such as a scan and sort implementation. However, the project seems to be inactive (last release in July 2011).

ArrayFire (AccelerEyes 2013)

ArrayFire is a commercial GPU software acceleration library provided by AccelerEyes. It provides a lot of GPGPU accelerated algorithms using CUDA and OpenCL including scan.

4.7 Summary and conclusion

In this chapter we have seen how a sequential algorithm like the all-prefix-sum can be parallelized for GPUs. A tree-based approach using several passes delivered good results. Also, the idea of solving the problem in smaller subgroups which are then merged into the global result proved to be successful. Both concepts are common patterns when trying to port sequential CPU code to a GPU. However, we also learned that using a GPU does not always result in a performance boost, despite their enormous power. Thus, the final scan only performs three times faster than the CPU implementation when memory transfers are excluded. If the time for copying the data to and from the GPU's main memory is included, we would be better off staying with the CPU. This is a problem for many fast algorithms operating on larger chunks of memory. Nevertheless, a fast GPU implementation of such primitives can still be needed in cases where the input and output data is already present or consumed on the GPU. This is the case when input data is produced on the GPU (e.g., scanning image data from a rendering process) or present as input for other purposes (other algorithms, data for rendering). Also, the output may be directly used by the GPU (e.g., using scan as part of a filtering routine generating a list of objects to render). Even if a GPU algorithm were slower in pure run time than a CPU version, it might still outperform the CPU variant once the latter's memory transfers to and from the system's main memory are included.

5 Sorting

One of the most fundamental operations in computer science is sorting a sequence of elements. The problem is old, and the algorithms to solve it are numerous. These sorting algorithms differ in various aspects such as best, average and worst-case runtime or memory complexity, stability, the number of comparisons and swaps, and whether the algorithm is comparison based or not.

The sort implementations provided by today's standard libraries are highly tuned and achieve an optimal asymptotic runtime complexity of O(n log n) for comparison based sorts. Examples are variations of quick sort (C), merge sort (C++), intro sort (C++, .NET) or Timsort (Java, Python). All these algorithms are comparison based sorts, thus requiring a method of comparing two elements of the input sequence. This comparison is often provided either by the language or standard library (e.g., the < operator) or by the programmer via a custom comparison function, giving the flexibility to compare and sort any kind of elements.

Another class of sorting methods are integer sorting algorithms. These algorithms do not use comparisons to determine the order of elements. Instead, they rely on integer arithmetic applied to the keys which have to be sorted. Therefore, they can achieve a better asymptotic runtime complexity than comparison based ones. Popular algorithms of this kind are radix sort, counting sort and bucket sort, all of which run in linear time, O(n + k) or O(n * k) (where k is a constant). However, despite this restriction on the sort key, integer sorting algorithms also work on other kinds of types as long as they can be represented as integers in binary (e.g., strings can be seen as a byte array forming a (larger) integer).

Considering parallelizability and an eventual GPU implementation, sorting lies between matrix multiplication and prefix sum, offering some degree of parallelism depending on the chosen algorithm. This chapter will focus on the implementation of two widely used algorithms for GPU sorting. These are the comparison based bitonic sorting network and the integer sorting algorithm radix sort. To reduce code complexity (especially of the latter), the input array consists of unsigned 32 bit integers.

5.1 CPU Implementations

Before stepping into the details of the two chosen GPU algorithms, bitonic and radix sort, a set of CPU sorting algorithms is discussed. Their resulting run times are used as a reference and compared with the GPU implementations in the benchmarks.


Figure 18: Benchmark of several sort implementations. Both axes are of logarithmic scale. Integer sorting algorithms (radix sort) are drawn in dashed lines.

5.1.1 C/C++ standard library routines

As the later GPU implementations should be compared with well-implemented and widespread CPU ones, the first step is to measure the performance of the standard library's routines. Listing 21 shows a typical usage of the qsort function provided by the stdlib header from C.

illustration not visible in this excerpt

Listing 21: Sorting an array of unsigned integers using qsort from the C stdlib header. The provided comparison lambda function uses comparisons instead of simply subtracting the input values (a - b) as this may cause unwanted behaviour due to overflows (e.g., 1u - max_uint is 2 instead of a negative number). Subtraction does work for signed types.

As qsort is usually precompiled and part of the runtime library, a compare function has to be provided which is called for every comparison. This will probably slow down the run time when compared with the C++ std::sort which can use an existing overload of the < operator or inline a provided compare function. Listing 22 shows a typical call to the std::sort function template from the C++ algorithm header.

illustration not visible in this excerpt

Listing 22: Sorting an array of unsigned integers using std::sort from the C++ algorithm header.

As the < operator is defined for unsigned integer types, no custom compare function has to be provided and no additional overhead is created for comparing elements. The difference in the benchmark chart in figure 18 is evident. It takes qsort 14.746 seconds to sort a sequence of 2^26 elements while std::sort only requires 9.039 seconds.
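As listings 21 and 22 are not visible in this excerpt, the two calls might be reconstructed roughly as follows (a hypothetical sketch, not the thesis' original code; the qsort comparator follows the overflow caveat from listing 21's caption):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Comparator for qsort: compares instead of returning a - b, which would
// wrap around for unsigned types (1u - UINT_MAX yields 2, not a negative value).
int compareUint(const void* x, const void* y) {
    unsigned int l = *static_cast<const unsigned int*>(x);
    unsigned int r = *static_cast<const unsigned int*>(y);
    return (l < r) ? -1 : (l > r) ? 1 : 0;
}

// C: qsort with an explicit comparison function called for every comparison.
void sortC(std::vector<unsigned int>& v) {
    std::qsort(v.data(), v.size(), sizeof(unsigned int), compareUint);
}

// C++: std::sort uses the built-in < operator directly, no callback overhead.
void sortCpp(std::vector<unsigned int>& v) {
    std::sort(v.begin(), v.end());
}
```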

5.1.2 Radix sort

In contrast to the comparison based qsort and std::sort, radix sort operates on the binary representation of the input elements in several passes. The input values are therefore seen as numbers of a numeral system of a chosen base (= radix). In each pass, a digit at the same position (starting with the least significant) is selected from all input elements. These input elements are then split into buckets according to the value of the current pass’ digit. Elements with the same digit value retain their order.

If the radix is two, in each pass a bit (digit) of each input element (number represented in the binary system) is selected (starting with the least significant bit). The input elements are then split into two sequences (buckets) according to the selected bit. This is done by creating a histogram holding the number of elements having the same bit value for each bit combination (0 or 1). After the histogram has been created, it is exclusively scanned. The scanned histogram now holds the start index for each bucket. The input elements are permuted and moved into their corresponding buckets according to the selected bit. Elements with an equal bit are written into the same bucket in the same order as they appeared in the pass’ input (each pass is stable). An auxiliary array is usually needed for this permutation step. This procedure is repeated for every bit of the input elements.

By only selecting one bit in each pass, 32 passes are required to sort the input sequence (elements are 32 bit unsigned integers). In each pass the input array is iterated over twice and the histogram has to be scanned once. This accumulates to 64 iterations over the input sequence and 32 scans of a two-element histogram, making radix sort a linear algorithm, but with a quite significant constant factor. This factor can be reduced by selecting several bits of each input element at once in each pass (increasing the radix to a larger power of two). As a result, the number of passes decreases linearly with the number of selected bits (which determines the radix and gives the algorithm its name). As a trade-off, the size of the histogram increases by a power of two. The CPU radix sort implementation used in the benchmarks is provided in listing 23.

illustration not visible in this excerpt

Listing 23: Sorting an array of unsigned integers using radix sort. The algorithm uses two passes analyzing 16 bits each. The implementation is based on the radix sort sample shipped with AMD APP SDK (Advanced Micro Devices, Inc. 2013a).

The implementation uses a relatively high radix of 16 bits (RADIX), which allows radix sort to finish in two passes. As a consequence, the histogram consists of 2^16 elements (BUCKETS) occupying 256 KiB of memory, but has to be scanned only twice.

The performance difference of radix sort to the previously presented comparison based algorithms is significant, as seen in figure 18. Although the high radix causes a large overhead on small input sizes (as the full 2^16 element histogram has to be scanned independently of the input size), radix sort shows its strength on larger inputs. It catches up to std::sort at approximately 10,000 elements and then clearly outperforms the library routine, sorting 2^26 elements in 0.884 seconds (compared to 9.039 seconds for std::sort).

5.2 GPU Implementations

After the presented CPU sorting routines in the previous sections, the following sections focus on creating efficient bitonic and radix sorting implementations for the GPU using OpenCL.

5.2.1 Bitonic Sort

Many regular sorting algorithms do not fit into a GPU's massively parallel programming model, offering insufficient parallelism, requiring non-array data structures or being based on recursion. However, a small subgroup of sorting approaches exists that fits the requirements of parallel hardware perfectly: sorting networks. A sorting network is a network of wires (one for each input) and comparators along these wires. Comparators connect two wires, compare their values and might swap them. Finding optimal sorting networks for a given number of inputs is difficult and still subject to research. However, in 1968 Ken Batcher presented (among others) an approach to create sorting networks achieving reasonable results (Batcher 1968). His idea is based on efficiently merging a bitonic sequence into a sorted sequence, hence the name bitonic sorter. A bitonic sequence is a sequence of elements which consists of two equally long subsequences, where one is sorted ascending and the other descending. As two arbitrary elements form a bitonic sequence, they can be merged into a sorted sequence. As two sequences sorted in opposite order also form a bitonic sequence, they can again be merged into a sorted sequence. This allows us to create sorting networks for any power of two sized input.

illustration not visible in this excerpt

Figure 19: Example of a bitonic sorting network for 16 inputs (Bitonic 2012).

Figure 19 shows a bitonic sorting network for 16 input elements. The input elements start on the left side of the network and travel through it to the right. Each arrow represents a comparator which compares the elements of the connected wires and swaps them if necessary, so that the larger element is swapped to the wire the arrow points at. Each blue or green box is a separate bitonic merge receiving a bitonic sequence as input and outputting a sorted one (blue = ascending, green = descending). The red boxes contain the comparators and are equally structured. They compare the top half of the input (bitonic) sequence against the bottom half and create two bitonic sequences where each element of the upper bitonic sequence is smaller than or equal to every element in the bottom one (vice versa in green boxes). Further red boxes are then recursively applied to the two outputs until the sequence is sorted.

The created networks consist of O(n log^2 n) comparators. As n/2 comparators can be executed in parallel (cf. figure 19), the runtime complexity of a bitonic sorter would be O(log^2 n) on a computer with at least n/2 processing elements.

Sorting inputs of any length is also possible with a bitonic sorting network. The network is con­structed for the next larger power of two and all comparators affecting a wire of a (possible) input element larger than the actual input are omitted. This approach and a sample implementation is discussed in more detail in an online article by Hans Lang (Lang 1998). To keep the code simple, the bitonic sorter implementation presented in this chapter will only focus on inputs with a power of two length. Listing 24 shows the host code of a bitonic sorter implementation.

illustration not visible in this excerpt

Listing 24: Host code for a bitonic sort implementation (Bainville 2011).

At first the input’s size is rounded up to a power of two (adjustedSize). A read- and writable buffer is created with this size and the input data is written to it. If the input was smaller than the created buffer, the remaining part is filled with the maximum representable value of the input data type. This values should remain at the back side of the buffer during the sorting procedure and not disturb the actual input. As sorting networks are compare-and-swap-based no additional buffer is required. The input can be sorted in place. After the buffer is set up we can begin sorting the elements.

As we can see in figure 19, each column of red boxes contains half as many comparisons as inputs, which can be executed in parallel. Therefore, a kernel will be enqueued for every column of red boxes, consisting of n/2 work items. The distance between two wires of a comparator is equal across all red boxes of a column and will be called inc. The red boxes themselves are combined in columns of alternating blue and green boxes. The width of these boxes (the number of wires they span) will be called boxWidth and determines the initial inc (half the width) of their contained columns of red boxes. Furthermore, the boxWidth allows each comparator to derive the sorting direction (ascending or descending) from the wires'/work items' indexes.

The approach is implemented using two nested loops. The outer loop iterates over the boxWidths, which are powers of two (starting with two) up to the last blue box, which spans all inputs. The inner loop iterates over the distances between two wires of the comparators inside the red box columns (inc). This value starts with half the boxWidth and decreases to the next lower power of two in each iteration until it reaches one. In each iteration of the inner loop a kernel is enqueued. Arguments are the inner loop's inc, the outer loop's boxWidth and the buffer with the values. The global work size is the number of comparisons of the current red box column, which is half the input size. The local work size can be chosen freely, except that it must not be larger than the global one. After all passes have been executed, the sorted values can be read back from device memory. Listing 25 shows the corresponding kernel code for the described bitonic sorter implementation.

illustration not visible in this excerpt

Listing 25: OpenCL Kernel code for one iteration of a bitonic sort. The implementation is based on an article about experiences with OpenCL and sorting by Eric Bainville (Bainville 2011).

The kernel begins by calculating the index of the first wire of the comparator corresponding to the current work item. This is achieved by modifying the global id. The bits below the inc bit (the lower part) stay the same. This value determines the wire index within the upper half of a red box. The bit at position inc of the global id corresponds to the upper (0) or lower (1) half of the red box. This bit shall be zero to get the upper wire of the comparator. The remaining (upper) part of the global id above the inc bit is a multiple of the width of a red box and therefore determines the index of the red box inside the column. As only n/2 work items have been enqueued, the upper part has to be doubled to obtain box indexes for the full range of wires. The combined value is stored in i. The next step is to decide upon the sorting order of the current comparator. This can be done by looking at the bit at position boxWidth of the upper wire's index. If this bit is zero, the comparator belongs to a blue box and the sorting order is ascending. Otherwise we are in a green box and have to sort descending. Afterwards, the two values at the input wires of the comparator are loaded from global memory. The positions are i and i + inc. If the values have to be swapped, they are written back to global memory in opposite order.
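The host loops and the index arithmetic just described can be reproduced on the CPU; the following hypothetical sketch replaces each kernel launch with a loop over n/2 virtual work items:

```cpp
#include <algorithm>
#include <vector>

// Bitonic sort of a power-of-two sized array. i is the comparator's upper
// wire, derived from the global id gid as in the kernel, and the boxWidth
// bit of i selects the sorting direction.
void bitonicSort(std::vector<unsigned int>& v) {
    const size_t n = v.size();                       // must be a power of two
    for (size_t boxWidth = 2; boxWidth <= n; boxWidth *= 2)
        for (size_t inc = boxWidth / 2; inc >= 1; inc /= 2)
            for (size_t gid = 0; gid < n / 2; ++gid) {   // one kernel launch
                size_t low = gid & (inc - 1);        // bits below inc stay
                size_t i = ((gid - low) << 1) | low; // upper wire, inc bit = 0
                bool asc = (i & boxWidth) == 0;      // blue (asc) or green box
                if (asc ? (v[i] > v[i + inc]) : (v[i] < v[i + inc]))
                    std::swap(v[i], v[i + inc]);     // compare-and-swap
            }
}
```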

Looking at the performance data in figure 18, we can see that the GPU bitonic sorter performs about as well as C++'s std::sort. If we look very closely, we can even see that the bitonic sorter's runtime rises a bit faster than the one of the CPU sorting algorithm. This can be explained simply by their runtime complexities, which are O(n log^2 n) for the bitonic network and O(n log n) for std::sort. When profiling the code, the first problem that can be noticed is the number of enqueued kernels, which is 1053 for an input of 2^26 elements. Furthermore, the many work items are quite lightweight. The latter is not a problem as such, but it is important to notice that there is still room for optimization. Fortunately, both issues can be addressed by combining several separate kernel invocations into a larger, heavier kernel. This is also known as kernel fusion and is the topic of the next section.

5.2.2 Bitonic Sort using kernel fusion

The basic concept of kernel fusion is to reduce the amount of redundant global memory loads and stores to the same locations between two or more separate kernel invocations. Using the example of the bitonic sorting network for 16 inputs in figure 19, kernel fusion could be used to combine the kernel invocations for the second and third column of red boxes. Considering only the upper blue box, work items zero and one would load the values of wires zero to three. After the first red box, all four values would be stored back to global memory. On the next kernel invocation, work items zero and one would again load the same values for the following two red boxes of the same column. By combining the two separate kernel invocations into one, the redundant global memory store and load between the two kernels can be avoided. This idea can be taken further to combine even more invocations into a single kernel. However, fusing kernels tends to make the work items heavier in resource consumption. Considering the bitonic sorting network, where each work item initially had to load two values from global memory, the fused kernel (combining two invocations) would require storing four values per work item (and reduces the global work size to n/4). This value increases by powers of two and therefore sets a limit to the number of kernels that can be fused together. Listing 26 shows the changes to the previous bitonic sorter host implementation.

illustration not visible in this excerpt

Listing 26: Changes to the host code from listing 24 for a bitonic sort implementation using kernel fusion (Bainville 2011).

The outer and inner loop enqueuing the kernels stay the same. However, if the increment inc is large enough, a fused kernel can be used which processes more than one column of red boxes with multiple increments (decreasing powers of two). The largest fused kernel being used is kernel16, which processes 16 input values and fuses four invocations with four different increments. Larger kernels are still possible, but turned out to consume too many registers to increase performance (register spilling). Nevertheless, larger kernels may be beneficial on newer hardware with more available registers. By using kernel16, the inner loop's inc variable can be decreased to the fourth lower power of two instead of the next one, which is controlled by the ninc variable. Kernels with eight, four and two inputs follow analogously. The kernel with two inputs is actually the original one used before kernel fusion. After the appropriate kernel has been chosen, it is enqueued with the same parameters as an unfused kernel. Only the number of enqueued work items decreases, as each work item now processes more values. The corresponding kernel code is based on the previous section's in listing 25. As several very similar kernels have to be created, the preprocessor is used to reduce redundant code.

illustration not visible in this excerpt

Listing 27: OpenCL code of several (fused) kernels performing 1, 2, 3 or 4 iterations (2, 4, 8 or 16 inputs per work item) of a bitonic sort (Bainville 2011).

The kernel code starts with the order procedure, which orders two values (at indexes a and b) of a given array x either ascending or descending depending on the value of asc. The following BITONIC_SORT_FUSION macro declares the actual kernel plus a merge procedure. Arguments to the macro are the number of input wires the expanded kernel should process (lvl) as well as the logarithm of base two of this value (logLvl). Finally, half of the lvl argument has to be specified as well. This value cannot be calculated, as it is used in token pasting and macro expansion.

The macro begins by declaring a merge procedure for the current number of input wires. Arguments are a pointer to the array holding the values of the input wires (x) and a boolean determining the sort order (asc). For half the number of input wires, comparisons have to be made (cf. a larger red box in figure 19). Then both halves of the input array are merged using two invocations of the merge procedure of the next lower level (half the current input wires). This is the actual fused part of the kernel, which now processes the next two smaller red boxes sharing the same inputs as the initial red box. Although the merge procedure is recursive, it cannot be implemented as such, as OpenCL does not allow recursive function calls. The reason for this is that OpenCL programs do not have a runtime stack for storing local variables (local in the C sense, not OpenCL's __local address space). They have to be stored in registers, and therefore the maximum amount of register space needed by the program has to be determinable at compile time (and will be fully allocated when the kernel is executed). The final merge2 procedure does not have to call any recursive merges any more. Therefore, the macro merge1 is declared empty (the recursion anchor).
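On the CPU, the merge can be written with compile-time recursion; the following hypothetical C++ sketch uses templates to unroll the mergeN hierarchy, achieving the same effect as the thesis' macro expansion (OpenCL C allows neither recursion nor templates):

```cpp
#include <algorithm>

// Bitonic merge of LVL wires: one column of comparators between the two
// halves (a "red box"), then the two smaller merges sharing those wires --
// the fused part. Template instantiation unrolls the recursion at compile
// time, just like the macro-generated merge16/merge8/merge4/merge2 chain.
template <int LVL>
void merge(unsigned int* x, bool asc) {
    for (int i = 0; i < LVL / 2; ++i)
        if (asc ? (x[i] > x[i + LVL / 2]) : (x[i] < x[i + LVL / 2]))
            std::swap(x[i], x[i + LVL / 2]);
    merge<LVL / 2>(x, asc);            // upper half
    merge<LVL / 2>(x + LVL / 2, asc);  // lower half
}
template <> inline void merge<1>(unsigned int*, bool) {} // recursion anchor
```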

After the merge procedure, the actual kernel for lvl inputs follows. The code is basically taken from the previous, non-fused version from the last section. The first adjustment made is to decrease the kernel's increment to the one of the last fused original kernel invocation. Furthermore, when constructing the wire index of the first value to load, not only the bit at inc's position is set to zero but logLvl bits, as lvl values will be loaded instead of two. These values are loaded from global memory starting at the computed index i with a stride of inc. The core is the invocation of the bitonic merge on the loaded array. The sorted values are then written back to global memory. Finally, the macro is expanded to produce a merge procedure and a kernel for the levels 2, 4, 8 and 16.

The difference in performance can be seen clearly when comparing the fused bitonic sorter against the previous implementation in figure 18. For sorting 2^26 elements the fused variant only requires 3.808 seconds, which is roughly two times faster than the original version (7.678 seconds). This is mostly due to reduced global memory traffic achieved by cutting down the number of enqueued kernels, which has shrunk to 294 (compared with the initial 1053). In comparison to std::sort, which requires 9.039 seconds, this is a speedup of 2.37.

The implementation can still be improved by taking advantage of local memory. Eric Bainville shows a version of the fused level four kernel using local memory in his excellent article, from which most of these bitonic sort approaches are taken (Bainville 2011). The benchmarks, however, only show an insignificant boost (several milliseconds) in performance at the cost of a relatively complex kernel. Therefore, this idea will not be covered.

Another idea would be to place the loaded values from the wires into local memory, as local memory is accessible on a 4-byte boundary (1-byte with the OpenCL extension cl_khr_byte_addressable_store). Arrays in registers can only be accessed by index at full register boundaries (16 bytes), forcing the compiler to allocate more register space than the initial array requires. By placing arrays in local memory, the register usage of work items could be cut down, allowing higher occupancy at the cost of slower memory access. With a work group size of 256 (used in all benchmarks and the maximum of the used GPU), a level 32 kernel would already consume the full 32 KiB of local memory on the used GPU. When placing the array in registers, however, a level 32 kernel would already spill registers into global memory, which is incredibly slow. Therefore, a level 32 kernel might only be beneficial when using local memory. Nevertheless, this approach has not been tested with the bitonic sorter. The idea is reconsidered when discussing radix sort in section 5.2.4.

5.2.3 Radix Sort

We have already discussed radix sort in the CPU implementation section 5.1.2. Implementing radix sort for the GPU again comes down to three steps which are executed in passes. The first step is to create a histogram for the radix of the current pass. Secondly, the histogram has to be scanned, and finally, the input values are permuted according to the scanned histogram.
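Sequentially, one such pass can be sketched as follows. This hypothetical C++ function mirrors the three steps (histogram, exclusive scan, stable permute) on the host and is not the OpenCL implementation itself:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One sequential radix pass over `bits` bits starting at bit_offset:
// histogram the keys, exclusive-scan the bucket counts, then permute
// the values stably into the destination buffer.
std::vector<uint32_t> radix_pass(const std::vector<uint32_t>& src,
                                 unsigned bit_offset, unsigned bits) {
    const uint32_t mask = (1u << bits) - 1;
    std::vector<uint32_t> hist(1u << bits, 0);
    for (uint32_t v : src)                        // step 1: histogram
        ++hist[(v >> bit_offset) & mask];
    uint32_t sum = 0;                             // step 2: exclusive scan
    for (uint32_t& h : hist) { uint32_t t = h; h = sum; sum += t; }
    std::vector<uint32_t> dst(src.size());
    for (uint32_t v : src)                        // step 3: stable permute
        dst[hist[(v >> bit_offset) & mask]++] = v;
    return dst;
}
```

Running the pass once for every group of bits (e.g. eight passes with a radix of four bits for 32-bit keys) yields a fully sorted sequence, matching the pass loop of the CPU variant in section 5.1.2.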

Creating the histogram is straightforward. For each input element a work item is created which determines the histogram element (bucket) that should be incremented. By using atomic operations, which are available since OpenCL 1.1 (an extension in OpenCL 1.0), the histogram can simply be stored in global memory and concurrently updated from all threads without errors. As global memory atomics are relatively slow, each work group should first compute a histogram in local memory, which is then reduced into the one in global memory.

However, this kind of histogram is of little to no use during the fully parallel permute stage. The reason for this is that every work item must be able to determine the destination address for its value independently of other threads. Although it would be possible for all work items of the same histogram bucket to obtain distinct destination addresses by again using atomic increments on the histogram (similar to the CPU implementation, where each bucket's histogram value is incremented after writing a value to this bucket), this does not work on a GPU. Because OpenCL does not guarantee the order in which work items are executed, work items with higher global id might write values to a bucket before work items with lower id do. As a consequence, the permute pass would no longer be stable, which is crucial to not ruin the permutation of a previous pass.

[1] As the loaded values of the input wires simply have to be sorted, it is possible to replace the bitonic merge by any other sorting algorithm.

A solution to this problem can be found by thinking about the information a work item needs to find its correct destination address. In addition to the start address of the region where all values of the same bucket should be written to (as in a single histogram), each work item also needs to know how many work items with a value belonging to the same bucket have a smaller global id than the current work item. This can be solved by spreading the histogram down to all individual work items. As each work item now owns a separate histogram, memory consumption rises enormously. Therefore, multiple input values are processed by each work item. After the histograms have been created by the work items, they are moved to global memory such that the histogram entries of the same bucket across all work items lie consecutively and in ascending order of global id in memory, followed by the next bucket's group of entries and so forth. As a result, the scan step can simply scan the whole block of histogram memory without any special cases. Therefore, the vector scan kernel developed in section 4.5 can be fully reused as a building block for this radix sort implementation. Listing 28 shows the host code for an OpenCL radix sort implementation.
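The described bucket-major layout can be sketched in plain C++ (hypothetical host-side helper, not the kernel code): the count of work item i for bucket b goes to flat index b * items + i, so a single exclusive scan over the flat buffer yields every work item's private base offset for every bucket:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Flatten per-work-item histograms bucket-major (all entries of bucket 0
// first, ordered by global id, then bucket 1, ...) and exclusive-scan the
// result. Afterwards, element b * items + i is the global write position at
// which work item i may place its first value belonging to bucket b.
std::vector<uint32_t> flatten_and_scan(
        const std::vector<std::vector<uint32_t>>& per_item_hist) {
    const std::size_t items = per_item_hist.size();
    const std::size_t buckets = per_item_hist[0].size();
    std::vector<uint32_t> flat(items * buckets);
    for (std::size_t b = 0; b < buckets; ++b)
        for (std::size_t i = 0; i < items; ++i)
            flat[b * items + i] = per_item_hist[i][b]; // bucket-major layout
    uint32_t sum = 0;                                   // exclusive scan
    for (uint32_t& e : flat) { uint32_t t = e; e = sum; sum += t; }
    return flat;
}
```

Because the layout needs no special cases, the scan step on the GPU can treat the whole histogram buffer as one flat input sequence.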

illustration not visible in this excerpt

Listing 28: Host code for a radix sort implementation. (Advanced Micro Devices, Inc. 2013a).

At the top of the source code several macros define constants. The first three should already be known from the previous CPU implementation in section 5.1.2. A major difference is the relatively small radix used with the GPU implementation. This is justified by the huge amount of memory required to store a histogram of 2^RADIX buckets for each work item. New is the BLOCK_SIZE constant, which defines how many elements each work item processes. Its value of 32 has been determined by benchmarks and may be specific to the used GPU. The VECTOR_WIDTH macro defines the width of the vector types used in the scan kernel (cf. vector scan implementation in chapter 4.5).

The host code starts by creating two buffers with the size of the input rounded up to a multiple of the work group size times the BLOCK_SIZE. One will be used to hold the input values at the beginning of each pass (source) and the other one will be used by the permute step to write the output values to (destination). The source buffer is initialized with the unsorted input sequence from host memory. Any free space after the written sequence is filled up with the largest representable value of the input elements' data type. Besides the source and destination buffer, a further histogram buffer is needed to store a histogram for each work item. The size of this buffer has to be rounded up to fit the input size requirements of the vector scan. The global work size is the number of input elements divided by BLOCK_SIZE, as each work item processes BLOCK_SIZE values. The local work size can be chosen at will, as the algorithm does not depend on local memory or work group synchronization.

After the setup is complete, we can start executing several sort passes identically to the CPU variant. In each pass a histogram kernel is enqueued which calculates the per work item histograms and stores them into the histogram buffer.
Arguments of the kernel are the source buffer containing the input values, the histogram buffer for writing the histograms to, and the bit offset of the current pass to select the corresponding bits of the input keys. After the histograms are created and stored in global memory, the histogram buffer can be scanned. The vector scan algorithm (host and kernel) from section 4.5 is used directly for this job. The scanned histograms are then input to the permute kernel, which reorders the values from the source buffer into the destination buffer. Finally, the references to the source and destination buffer are swapped for the next pass. After all passes have been executed, the resulting sorted sequence can be read from the source buffer (the destination buffer of the last pass). Listing 29 shows the corresponding kernel code.

illustration not visible in this excerpt

Listing 29: OpenCL code for the histogram and permute kernels of a radix sort, where each work item stores its histogram in registers. (Advanced Micro Devices, Inc. 2013a)

The Histogram kernel begins by querying the global id of the work item and setting up a zero-initialized array for holding the histogram values. Each work item then processes BLOCK_SIZE elements of the input buffer. For each element, the bits corresponding to the current pass' bit offset are extracted to increment the appropriate bucket of the work item's histogram. After all input elements have been processed, the histogram is split into its buckets and moved to global memory.

The Permute kernel begins with copying the scanned histogram back from global memory into registers. Each histogram bucket now contains the number of elements belonging to this bucket from work items with a lower global id, plus the sum of all elements of preceding buckets. Then the work item again reads the BLOCK_SIZE input elements for which the scanned histogram is now available. Each value is written to the destination buffer at the start address of the corresponding bucket for this work item. The address is incremented afterwards.
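The permute logic can again be emulated sequentially (hypothetical C++ names, not the kernel source). On the GPU the iterations of the outer loop run as parallel work items, which is safe precisely because every write address has been precomputed by the scan:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential emulation of the Permute kernel: each "work item" i owns `block`
// consecutive input elements and its slice of the scanned histogram
// (scanned[b * items + i] is its start address for bucket b). Each value is
// written to the destination and the address incremented afterwards, which
// keeps the pass stable within a work item.
void permute(const std::vector<uint32_t>& src, std::vector<uint32_t>& dst,
             std::vector<uint32_t>& scanned,
             std::size_t items, std::size_t block,
             unsigned bit_offset, uint32_t mask) {
    for (std::size_t i = 0; i < items; ++i)      // one iteration = one work item
        for (std::size_t k = 0; k < block; ++k) {
            uint32_t v = src[i * block + k];
            uint32_t b = (v >> bit_offset) & mask;
            dst[scanned[b * items + i]++] = v;   // write, then bump the address
        }
}
```

With the scanned histogram of the earlier layout, no two work items ever write to the same address, so no atomics are needed in this step.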

Considering performance (cf. figure 18), the radix sort GPU implementation is roughly as fast as the fused bitonic sort. On closer inspection, one can even see that radix sort's runtime increases a little more slowly than that of the bitonic sorter. However, the GPU radix sort cuts off at the last few input sizes, as the GPU's global memory is exhausted and the driver starts to swap GPU memory into the system's main memory. Nevertheless, the GPU implementation is far behind the CPU variant with 1.884 seconds at 2^25 (!) elements compared to 0.449 seconds on the CPU. When profiling the kernel, one can see that most of the time is spent executing and waiting for global memory requests. If we have a look at the kernel code in listing 29 again, we can see that the Histogram kernel performs BLOCK_SIZE read operations of four bytes (the size of uint). Furthermore, adjacent work items access global memory with a stride of BLOCK_SIZE times four bytes. As a result, each four byte value from each work item is loaded using a full memory transaction. However, on NVIDIA GPUs memory transactions always load a full memory segment of 64 or 128 words (32-bit values) (NVIDIA Corporation 2009, p.13). Global memory access should therefore be coalesced, meaning that adjacent work items should access adjacent elements in global memory. In this case, all work items can be serviced in as few memory transactions as possible. However, this is not possible in our radix sort implementation unless the BLOCK_SIZE is reduced to one. Nevertheless, another way of increasing memory bandwidth utilization is to request fewer but larger pieces of memory. This approach also improves performance on AMD GPUs and will be covered in section 5.2.5.

A further but more subtle issue can be seen in the disassembly of both kernels. The local (in the C sense, not OpenCL local memory) array uint hist[BUCKETS] is placed in a work item's registers. However, a register's size is four words (equal to a uint4). To allow indexing the array (runtime indexing into registers is not possible), the array either has to be placed strided into registers (each array element occupies only the first component of each vector register) or the compiler has to generate appropriate code to transform an index into the corresponding register address and vector component. An unorthodox but working solution is to put the array into local memory, which can be addressed at a 4-byte boundary (1-byte with the OpenCL extension cl_khr_byte_addressable_store). This idea is covered in the following section.

5.2.4 Radix Sort using local memory

Instead of placing each work item's histogram in its own registers, local memory is used to store the array. Thus, wasted register space and the instruction overhead necessary to access an array placed in registers are avoided. However, accessing local memory is usually slower, and accesses might suffer from bank conflicts. This implementation takes the radix sort of the previous section 5.2.3 and moves the histogram from registers to local memory. Listing 30 shows the required changes to the host code from listing 28.

illustration not visible in this excerpt

Listing 30: Changes to the host code of listing 28 for a radix sort implementation using local memory to store each thread’s histogram. (Advanced Micro Devices, Inc. 2013a).

As local memory is shared inside a work group, we have to allocate a block of local memory large enough to hold all histograms of all work items of a work group. The local memory needs to be allocated in both kernels. Listing 31 shows the required changes to the kernel code from listing 29.

illustration not visible in this excerpt

Listing 31: Changes to the OpenCL kernel code of listing 29 for a radix sort implementation using local memory to store each thread’s histogram. (Advanced Micro Devices, Inc. 2013a)

Instead of declaring hist as a local variable inside the kernel, it is now passed as a pointer to local memory as an argument to both kernels. The pointer then has to be offset to the current work item's histogram slice. Except for the initialization to zero in the Histogram kernel, the remaining code stays the same.
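The offsetting can be sketched as follows, with a plain C++ function standing in for the kernel's pointer arithmetic on the __local block (names hypothetical): every work item advances the shared base pointer by local_id * BUCKETS and then treats its slice like a private array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for partitioning one shared local-memory block among work items:
// offset the base pointer by local_id * buckets, then zero the slice, as the
// Histogram kernel must do before counting.
uint32_t* item_histogram(uint32_t* lmem, std::size_t local_id, std::size_t buckets) {
    uint32_t* hist = lmem + local_id * buckets;
    for (std::size_t b = 0; b < buckets; ++b)
        hist[b] = 0; // per-item initialization replaces the zero-initialized C array
    return hist;
}
```

On the device, the total allocation must be work group size times BUCKETS entries, which is why the host code has to pass the local memory size explicitly.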

Benchmarking the changed code shows a subtle but noticeable difference in performance. The initial 1.884 seconds of the original implementation from section 5.2.3 could be reduced to 1.565 seconds, simply by adjusting the code to better fit the GPU's hardware features. Although this code executed faster in the benchmarks, newer GPUs might already provide scalar registers that have no trouble storing an array of scalar data types.

5.2.5 Radix Sort using local memory and vector loads

A further optimization of the presented radix sort implementation is to replace the many small global memory reads in both radix sort kernels by fewer but larger fetches. This can be achieved by declaring the pointers to the input buffers in both kernels to be of a vector type. Thereby the number of iterations of the fetching loops (and therefore the number of fetches) is reduced by a factor of the width of the used vector type. Disadvantages of using vector types may be increased register consumption and increased source code length (vector components cannot be iterated over). As this only concerns the kernel itself, no changes to the host code are required. Listing 32 shows the changes to the previous radix sort implementation using local memory from listing 31.

illustration not visible in this excerpt

Listing 32: Changes to the OpenCL kernel code of listing 31 for a radix sort implementation using local memory to store each thread’s histogram and vector types for loading values from global memory. (Advanced Micro Devices, Inc. 2013a)

The first difference is the increase of the BLOCK_SIZE constant, which has been raised from 32 to 128. It turned out that the larger memory requests are less of a bottleneck for the total runtime, so more of them can be made inside each work item (this may be specific to a GPU and should therefore be tested individually). To maximize memory throughput, the largest supported vector type is chosen for the input buffer, which is uint16. As a result, the fetching loops only need to iterate over a sixteenth of BLOCK_SIZE input elements. The number of required iterations is placed in the macro BLOCK_SIZE_16. In each loop iteration a full uint16 (64 byte) value is fetched from global memory and the target bucket in the histogram is calculated. Now comes the drawback of using vector types: as we cannot loop over the components of the fetched value, we have to write the code for each histogram access using a different component separately.
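The effect of the vector loads on the loop structure can be emulated in C++ (hypothetical names; std::array<uint32_t, 16> stands in for OpenCL's uint16, and the inner component loop stands in for the 16 manually unrolled accesses that the kernel has to spell out):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Instead of BLOCK_SIZE scalar loads, only BLOCK_SIZE / 16 loads of 16 packed
// keys each are issued; each load triggers 16 histogram updates, one per
// component.
constexpr std::size_t BLOCK_SIZE = 128;
constexpr std::size_t BLOCK_SIZE_16 = BLOCK_SIZE / 16; // iterations of the fetch loop

void histogram_vec(const std::vector<std::array<uint32_t, 16>>& src16,
                   std::vector<uint32_t>& hist,
                   unsigned bit_offset, uint32_t mask) {
    for (std::size_t i = 0; i < BLOCK_SIZE_16; ++i) {
        const auto& v = src16[i];             // one 64-byte "uint16" fetch
        for (std::size_t c = 0; c < 16; ++c)  // written out component by component in OpenCL
            ++hist[(v[c] >> bit_offset) & mask];
    }
}
```

The emulation counts exactly the same keys as the scalar version; only the number and size of the memory requests change, which is what the GPU benefits from.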

When we look at the benchmark chart in figure 18, we can see that our effort paid off. Compared with the version without vector fetches, we achieve a speedup of 1.5 (1.019 vs. 1.565 seconds). Besides a far better memory bus utilization, the number of independent instructions also increased, leading to better ALU utilization. However, despite the optimizations, the GPU radix sort implementation still runs slower than the original CPU code. Reasons are the slow global memory write accesses in both radix sort kernels and the relatively small RADIX of four compared to 16 in the CPU variant, which leads to eight passes on the GPU compared to only two on the CPU.

5.3 Further implementations

Besides the presented kernels, a further approach has been implemented and benchmarked which has not been covered in detail. However, some experiences may help future developers.

Odd-Even Transition (Kipfer and Westermann 2005)

The odd-even transition sorting network from GPU Gems 2, chapter 46 (Kipfer and Westermann 2005) has been implemented. However, its performance was significantly worse than that of the other covered algorithms.

Furthermore, several libraries exist supporting GPGPU accelerated sorting routines.

clpp (2011)

The OpenCL Data Parallel Primitives Library is an open source, freely available library offering a few GPU primitives, such as scan and sort implementations. However, the project seems to be inactive (last release in July 2011).

libCL (Stier 2012)

libCL is an open-source library intended to provide a wide variety of parallel algorithms. Although the list of functions is still small, it provides an implementation of radix sort.

ArrayFire (AccelerEyes 2013)

ArrayFire is a commercial GPU software acceleration library provided by AccelerEyes. It provides a lot of GPGPU accelerated algorithms using CUDA and OpenCL including a sorting routine.

5.4 Summary and conclusion

Sorting an array of elements is one of the most used routines of a programmer. Numerous algorithms and libraries (including standard libraries) exist targeting the CPU, providing highly optimized and ready-to-use sorting routines. However, the availability of GPU accelerated routines is still limited and mainly consists of some experimental libraries. In this chapter we have seen how to implement both a comparison-based and an integer-based sorting algorithm using OpenCL. Contrary to traditional concepts, sorting networks were introduced, offering a higher amount of parallelism at the cost of runtime complexity. The bitonic sorter has been discussed as an example of how a sorting network can be constructed and operated for a given input size. Kernel fusion has been presented as a further way of optimizing algorithms consisting of multiple OpenCL kernels or kernel invocations. Radix sort has been introduced as an example of a fast integer-based sorting routine. Besides noticing the bad handling of arrays in registers, the final radix sort implementation once again showed the importance of global memory utilization. In conclusion, a fast comparison-based sorting routine could be presented beating C++'s std::sort with a speedup of 2.37 (9.039 vs. 3.808 seconds on 2^26 elements). Unfortunately, radix sorting is still faster on the CPU, with the OpenCL implementation being more than twice as slow as the simple C++ variant.

illustration not visible in this excerpt

6 Conclusion

6.1 Summary

OpenCL is a free and open standard for general purpose parallel programming across various hardware devices. It is maintained by the Khronos Group and implemented by hardware vendors like NVIDIA, AMD and Intel. OpenCL is natively available for C and C++, although many wrappers for other languages exist. To use OpenCL in a C/C++ application, an SDK is needed which contains the required header files (also available from Khronos) and static libraries. For running an application using OpenCL, a corresponding driver has to be installed, offering the API functions as well as a compiler to create kernels. Furthermore, one or more appropriate hardware devices supporting OpenCL are necessary and may even be used together.

The hardware architectures for which OpenCL can be used are quite different, especially concerning the differences between CPUs and GPUs. Modern consumer CPUs have a small number of independent, high power cores (mostly four or eight) which can be used independently by an application using threads. Synchronization between the threads is easy, and threads may be spawned at any time. High throughput can be achieved by distributing the work across all available cores and making use of vector instruction sets like SSE or AVX. Memory is provided as a large block behind a hierarchical cache system.

GPUs, however, employ a massive amount of less powerful cores (up to several thousand) packed together in multiprocessors. Work is processed in smaller work groups and larger n-dimensional ranges which often occupy hundreds or thousands of threads (whose number often has to be a multiple of hardware specific resources) executing the same code step by step. Synchronization is limited, and the number of threads used has to be predetermined every time work is offloaded to the GPU. High throughput is achieved by utilizing all components of the GPU (ALUs, read and write units, etc.) as much as possible. Branching and synchronization should be avoided, and vector types/operations should be used wherever possible to maximize ALU throughput. The large global memory is cached but very slow and sensitive concerning access patterns. Local memory is available as a programmer-controlled cache but again suffers from misaligned accesses (bank conflicts). However, GPUs can achieve a much higher computational throughput than CPUs when used correctly.

Using OpenCL in an application starts by choosing an available platform/implementation and device. Additionally, a context and a command queue have to be created. Kernels are mostly created directly from their source code which has to be provided at run time. The source code is then compiled and linked into a program for the selected device and the kernels can be queried by name. To provide larger input and to retrieve output from a kernel, buffer objects have to be created. These may be read from and written to, either asynchronously or synchronously to the main application, by enqueuing operations on the command queue. Kernels are also enqueued and executed asynchronously after their arguments have been set.

To show the strengths and weaknesses of OpenCL and GPGPU computing on a few real world examples, three kinds of problems have been chosen for which various approaches have been implemented and benchmarked.[10]

1. The first problem tackled was matrix multiplication. Even a simple OpenCL implementation already beats the efficient BLAS sgemm routine. With some optimizations, making the code more complicated, we could push performance even further, achieving a speedup of 36. Using the example of matrix multiplication, we also took a look into performance analysis using profiler information from AMD's CodeXL.

2. The second problem tackled was the all-prefix sum, a linear algorithm. The CPU implementation can be coded in a minute using a simple loop in a few lines of code. A naive, also quite easy to understand GPU approach performed fairly poorly. By trying a more sophisticated tree-based reduce and scatter algorithm, small improvements could be made. Adding local memory, finally also using vector types, and applying the algorithm on several layers in the end yielded significant reductions in run time (at the cost of hundreds of lines of code). However, the runtime of the CPU could not be beaten.

3. The last implementation chapter focused on one of the most famous topics in computer science: sorting. The two standard library routines qsort and std::sort have been benchmarked together with a radix sort implementation, the latter delivering amazing results. For sorting on the GPU, sorting networks have been introduced, because they are easy to parallelize. In particular, the bitonic sorter has been implemented and optimized. Despite the nasty input size restriction, the GPU implementation runs more than twice as fast as the comparison-based variants on the CPU, which can clearly be seen as a success. Radix sort, however, suffers from the GPU's highly parallel nature. Despite several optimizations, it could not catch up and remains far behind the CPU implementation.

6.2 Conclusion

In conclusion, it can be said that GPUs are different from CPUs in the way they process work. This originates from their quite dissimilar hardware architectures, as discussed in the beginning. But not only the hardware itself, also the way algorithms have to be designed and programs have to be written differs largely from traditional single-thread-oriented CPU algorithms. Although it is already quite common to assign several tasks to several CPU cores (e.g., GUI and worker thread), bringing parallelism down to small routines is still far from common. When a program is designed to run on massively parallel hardware, we tend to spend more time fighting the peculiarities of the hardware than thinking about the actual problem we try to solve. This was one of the hardest challenges when implementing scan. The scan operation imposes high data dependency between the processed elements and is therefore best suited to as few processing units as possible. The GPU, however, benefits from a maximum of parallelism, on data and instruction level. Most of the time required to develop an efficient GPU scan was spent on trying to get enough parallelism at the cost of minimal redundancy in additions. Although the tree-based approach worked well in the end, a lot of computing power is wasted during the up-sweep and down-sweep phases inside each work group. The same problem affected the radix sort GPU implementation. The per work item histogram was only required because the GPU does not ensure the order in which work items are processed. And although the histogram buffer was aligned for coalesced access to improve memory reading performance in the scan step, the scattered, unblocked writes of the histograms at the end of the permute step tore down performance, because GPUs profit enormously from reading and writing larger, consecutive blocks of memory. On the CPU, apart from worse caching, scattered accesses are less of a problem.
The final argument that should arise from these observations is that programmers have to worry about far more issues than when coding for a conventional processor. Imagine Java or C# programmers, who are used to object-oriented design, dynamic memory allocation, huge standard libraries, containers and other complex data structures and all that candy of modern languages, finding themselves porting one of their well-written, readable algorithms to the GPU, where they have to break everything down into well-laid-out arrays, think about the vectorizability of the produced instructions, repeatedly check the API documentation for a small number of built-in functions that might be of use, and hunt down performance issues to stalled memory reads and bank conflicts. GPGPU computing is far different from modern software design and programming. Although the API is easily understood and the kernel language quickly learned, it takes months and probably years of practice to get a good feeling for designing GPU accelerated algorithms. GPUs are technological marvels offering immense computational power if used correctly. But it takes time to learn and study them in order to unfold their full potential.

Furthermore, it has to be said that not all algorithms fit the requirements of a GPU. In fact, only a small subset really benefits from the massively parallel hardware. In most cases, it is easier to focus on spreading a workload across several cores of the CPU instead of dealing with thousands of threads. The required memory transfer to and from the GPU is also a crucial factor when deciding whether or not to offload a piece of work. If an algorithm does not fit the GPU, one should not execute it there. However, the small subset of algorithms which are often parallel in nature may benefit tremendously from graphics hardware. We have discussed such an algorithm using the example of multiplying matrices. The final speedup of 36 is amazing and shows the potential of GPU hardware when used in the right place.

So, when should OpenCL be used? Apart from having the right kind of algorithm to take advantage of GPGPU acceleration, time and budget also play an important role. Developing an algorithm for the GPU can be tedious and cumbersome and therefore time consuming. Furthermore, GPUs also impose a higher risk of failing to achieve the desired performance boost. The scan algorithms with optimizations may well have taken two weeks to develop, with the final outcome that the CPU variant is still faster. Experimenting with GPUs in a real world project should therefore be considered carefully, also concerning the aspect of system requirements. From a developer's perspective, appropriate GPUs have to be available for implementing and testing, and additional tools have to be installed for debugging and profiling. Concerning customers, appropriate hardware has to be available, drivers have to be installed and maybe the case of missing compatible hardware or drivers has to be handled (fallback CPU implementation). Talking about tools, the currently available suites from AMD (CodeXL), NVIDIA (Nsight, Visual Profiler) and Intel (SDK for OpenCL Applications) already provide decent help when developing software for their respective GPUs. However, they are still quite far from being as usable and feature-rich as the corresponding CPU debugger integrations and profilers. Last but not least, maintainability and reusability also play a vital role in modern software companies. Both aspects suffer in most cases, as OpenCL algorithms tend to be strongly optimized for a specific kind of hardware and problem.

Finally, we have to see how long GPGPU computing will be done in this fashion. With the increasing power of on-chip GPUs on modern Intel processors, using GPUs for smaller tasks might become more attractive, as no memory transfers will be required. AMD is also working intensively on a new generation of processing devices called Accelerated Processing Units (APUs). By using a so-called heterogeneous system architecture (HSA), GPU and CPU are combined tightly on a single chip. The GPU will have direct access to the system's main memory, cache-coherent with the CPU and using the same address space (theoretically allowing pointers to be passed). Furthermore, the GPU will allow multitasking via context switches between running tasks. As a result, long-running calculations will no longer freeze the display. Meanwhile, in another corner of the high performance hardware sector, Intel is working heavily on their new Many Integrated Core (MIC) architecture. The Intel Xeon Phi (previously named Knights Corner) is a co-processor card with up to 61 individual Intel processor cores and up to 16 GiB of dedicated on-card memory. From an OpenCL programmer's perspective, the Xeon Phi is programmed in the same way as a GPU. But the card is composed of modified x86 processors and can therefore also execute conventional software written in traditional languages. As a result, language integration is also easy. In fact, Intel's compiler already offers appropriate #pragmas to offload parts of a C++ program to the co-processor card in a similar fashion as OpenMP. In addition, almost all drawbacks of GPUs vanish, as all threads (four per core) can be fully synchronized, memory can be allocated dynamically on demand, and all existing language features including libraries can be used. However, a Xeon Phi is still financially out of reach for consumers and more expensive than decent GPUs, with prices of several thousand dollars. But as time passes, they may become affordable.
Intel has already announced that the next generation of the Xeon Phi (codename Knights Landing) will be capable of being used as the system's main processor. So maybe in ten years we will find ourselves with a hundred main processors in our notebooks, capable of handling even computationally expensive software with ease. Who needs GPU computing then?

6.3 Personal experiences

6.3.1 Progress

I began working on this thesis on 2012-07-23 by creating the code base for the sort project. Since then, 415 days have passed (today is 2013-09-11) and 436 commits have been made. The individual commit dates clearly mirror certain situations in my life at the time they were made. Figure 20 presents how the lines of code have evolved over time, and figure 21 presents the commits made in this period.

illustration not visible in this excerpt

Figure 20: Lines of code of all files in the repository over time. Chart generated using StatSVN.

illustration not visible in this excerpt

Figure 21: My commits over time. Chart generated using StatSVN.

During the first days, I worked intensively on setting up a comprehensive testing environment that should allow easy integration of new algorithms. At the beginning of my internship during the summer holidays in 2012, the commit dates started to occur in the morning (I tried to work two hours on this thesis before going to work), some at lunch time when I had come up with something during my work, but most of them late at night. I soon noticed the impact of all the day and night programming on my well-being and stopped working on my thesis several weeks later (cf. mid to late August in figure 21). However, my motivation for the internship also dropped, and I had a bad conscience about showing no progress on my thesis. I quit my internship three weeks early to enjoy the remaining summer holidays and concentrate on the thesis. Fortunately, I was able to collect and integrate a lot of algorithms in the time from the beginning of September, into the first weeks of university, until the middle of October (cf. figure 21). Although I found less time from October to November due to my studies, I was able to finish most of the code base and collect a lot of benchmark data. My supervisor, FH-Prof. DI Dr. Heinz Dobler, and I were confident that I should easily be able to finish the thesis by February 2013, the first official date for handing it in.

However, I made the mistake of accepting two tutoring positions (Programming and Project Engineering). Together with the increasing workload of my studies (mostly the large software projects by FH-Prof. DI Johann Heinzelreiter), I found almost no time to start writing. The two groups of commits around 2012-11-20 and 2013-01-09 in figure 21 were for the two presentations about the thesis in the bachelor seminar course. Apart from them, development was halted until the end of March, which can be clearly seen in the advancement of the lines of code in figure 20. After the end of the sixth semester at the end of March, I again found time to continue with my writing. I started to work on the introduction and OpenCL chapters.

I began my mandatory internship, which is part of the Software Engineering course, on 2013-04-04. From the times of my commits during April in figure 21, one can see that I was again working on my thesis after work, sometimes until the morning hours. I found myself in the same situation as during my first internship in the summer, except there was no option to quit the internship prematurely. I was even working with the same technologies at work and at home. I lost most of my motivation and interest after a month and stopped working on my thesis after finishing the OpenCL chapter, which I sent to my supervisor. I also fought with bad health, both physical and mental.

I had the chance to learn a lot during my internship. However, as I was working with the same technologies as in my thesis, I discovered more and more design mistakes and came up with ideas for improving my algorithms. During the last month of my internship, I felt the need to refactor most of the code produced during September and October 2012. This was the time when most of the final algorithms and sophisticated charts were created. By further tweaking the algorithms, I gathered a lot of in-depth knowledge about my implementations (which has mostly found its way into the matrix multiplication chapters). I finished my internship by the end of June, fortunately with amazing results and satisfied colleagues.

During the remaining time until the final deadline for the second bachelor exam (in mid-September), I was working hard on the implementation chapters. One of the big problems was the huge amount of information I had gathered about my implementations (mostly performance aspects). I was proud of all the peculiarities I had found out about my GPU and how it behaves when processing my algorithms. However, time was of the essence and I had to decide carefully which algorithms and analyses to include. For example, only the matrix multiplication chapter contains a detailed view of performance counters. There was simply no time to include this kind of information in the other implementation chapters (although I would have loved to). Apart from a week in France (the small gap around the end of July in figure 21), I used every day of my summer holidays for writing my thesis. I finished the "Release Candidate" on 2013-08-25.

During the remaining two weeks until the deadline, I was fully occupied with creating the second part of my thesis, which I surprisingly managed to finish during this time, producing an acceptable result. (A funny observation: during this time I also managed to change my sleep-wake cycle to more normal conditions, which can be seen in the commit times in figure 21.) The last few days I spent applying the feedback to my "Release Candidate".

6.3.2 Positive and negative aspects

Below is a list of several positive (+) and negative (-) aspects encountered during the creation of this thesis:

(+) I started early with the implementation of my thesis, at the beginning of my summer holidays in 2012.

(+) My supervisor gave me feedback on the first part of the thesis (introduction and OpenCL chapters) within several hours. Amazing!

(+) Although I handed in the "Release Candidate" version of the thesis only two weeks before the deadline, my supervisor still took the time to give me excellent, detailed feedback and moved the deadline a few days to provide me enough time to incorporate the feedback into the thesis.

(+) I was allowed to write the thesis and give my presentations in English, which helped me improve my command of this language.

(+) Writing the thesis in LaTeX gave me the chance to learn a new way of creating professional looking documents.

(+) I deepened my knowledge about GPUs and OpenCL tremendously, which will hopefully be beneficial for my future career.

(+) Although I was running late with handing in my thesis, my supervisor did not put additional pressure on me. He was understanding, helpful and diplomatic.

(+) The initial knowledge about OpenCL which I gathered during the summer holidays in 2012 qualified me for an interesting internship.

(+) I learned how to set up a cross-platform and cross-language (C++ and Fortran) build system with Code::Blocks and make.

(+) I learned about several differences between the compilers used (GNU g++, Microsoft Visual Studio, Intel C++).

(-) I selected too many subtopics/problems to cover in my thesis. It would probably have been a better idea to concentrate on only one problem (like matrix multiplication) and rename the thesis accordingly.

(-) Although LaTeX does an amazing job when used correctly, trying to achieve extraordinary formatting (like having dots after a chapter number but not after a section number) is often tedious or impossible without hacking into internal mechanisms.

(-) The curriculum of the fifth semester does not include any time for writing the bachelor thesis. Other degree programs provide time and even award ECTS credits for writing the thesis. Furthermore, the additional month of studies before the internship in the sixth semester also consumes a lot of time which would be much needed for writing the bachelor thesis. I cannot provide a solution (there is probably no easy one without reducing the amount of covered material), but I want to point this out.

(-) I should have focused less on improving the performance of my algorithms, as it took a lot of time and was not required for the thesis (many improved versions were included though).

(-) I should have started earlier with writing the actual document. This would have helped me to select the relevant algorithms sooner, focusing my work on code which would eventually become part of the final product.

List of Figures

1 The graphics pipeline with programmable vertex and fragment processor. (Fernando and Kilgard 2003)

2 Architectural overview of an Intel Ivy Bridge processor. (Shimpi and Smith 2012)

3 Full chip block diagram of NVIDIA's Kepler GK110 architecture containing streaming multiprocessors (SMX). (NVIDIA Corporation 2013c)

4 Architecture of a Streaming Multiprocessor (SMX) of NVIDIA's Kepler GK110 architecture containing 192 single precision CUDA cores, 64 double precision units, 32 special function units (SFU), and 32 load/store units (LD/ST). (NVIDIA Corporation 2013c)

5 OpenCL class diagram from the OpenCL 1.2 specification. (Munshi 2012)

6 OpenCL’s platform model

7 The two-dimensional NDRange consisting of GSx * GSy work groups. Each work group consists of LSx * LSy work items. A work item has two indices in each dimension, the local index (relative to the work group) and a global index (relative to the NDRange), as well as the indices of the group it is part of. The local work sizes are LSx and LSy, whereas the global work sizes are GSx * LSx and GSy * LSy

8 Memory model as defined by the OpenCL 1.2 specification. (Munshi 2012)

9 Principle of the matrix multiplication (StefanVanDerWalt 2010)

10 Benchmark of several square matrix multiplication implementations

11 Optimization of the matrix multiplication by subdivision into tiles. Here 4 x 4 matrices are divided into tiles of 2 x 2 work items

12 Benchmark of several prefix sum implementations, where both axes are of logarithmic scale

13 A naive approach for a parallel scan. (Harris, Sengupta, and Owens 2008)

14 The up-sweep phase of a work efficient parallel scan (Harris, Sengupta, and Owens 2008)

15 The down-sweep phase of a work efficient parallel scan (Harris, Sengupta, and Owens 2008)

16 Scanning larger arrays of values recursively (Harris, Sengupta, and Owens 2008)

17 A closer look at the benchmark of the recursive vector scan from section 4.5. It points out how much time is wasted by memory transfers. Note that both axes are of logarithmic scale

18 Benchmark of several sort implementations. Both axes are of logarithmic scale. Integer sorting algorithms (radix sort) are drawn in dashed lines

19 Example of a bitonic sorting network for 16 inputs (Bitonic 2012)

20 Lines of code of all files in the repository over time. Chart generated using StatSVN

21 My commits over time. Chart generated using StatSVN

List of Listings

1 An example of a kernel function’s signature

2 A minimalistic working OpenCL application which calculates the sum vector of two input vectors

3 A simple C++ implementation of a square matrix multiplication for the CPU

4 Host code for a matrix multiplication implementation using OpenCL

5 OpenCL Kernel code calculating a single element of the output matrix per work item

6 Host code for a matrix multiplication implementation using OpenCL, where the input and output matrices' sizes are rounded up to be a multiple of TILE_SIZE

7 OpenCL Kernel code calculating a single element of the output matrix per work item, where the input matrices are split into tiles which are cached in local memory. (Tompson and Schlachter 2012)

8 Differences of the host code for the matrix multiplication using vector types when compared with the one using tiles cached in local memory from listing

9 OpenCL Kernel code calculating a 4 x 4 block of elements of the output matrix per work item using four float4 vectors, where the input matrices are accessed directly from global memory. (Advanced Micro Devices, Inc. 2013a)

10 Differences of the host code for the matrix multiplication using vector types and locally cached tiles when compared with the one using only tiles cached in local memory in listing

11 OpenCL Kernel code calculating a 4 x 4 block of elements of the output matrix per work item using float4 vectors, where the input matrices are split into tiles of blocks which are cached in local memory. (Advanced Micro Devices, Inc. 2013a)

12 A simple C++ implementation of an exclusive scan for the CPU

13 Host code for the naive scan algorithm

14 OpenCL Kernel code for the naive scan algorithm

15 Host code for the work-efficient scan algorithm

16 OpenCL Kernel code for the work efficient scan algorithm

17 Host code for the recursive scan algorithm

18 OpenCL Kernel code for the recursive scan algorithm

19 Differences to the host code shown in listing 18 for the recursive scan algorithm using vector types

20 OpenCL Kernel code for the recursive scan algorithm using vector types with a width of eight

21 Sorting an array of unsigned integers using qsort from the C stdlib header. The provided comparison lambda function uses comparisons instead of simply subtracting the input values (a - b), as this may cause unwanted behaviour due to overflows (e.g., 1u - MAX_UINT is 2 instead of a negative number). Subtraction does work for signed types

22 Sorting an array of unsigned integers using std::sort from the C++ algorithm header

23 Sorting an array of unsigned integers using radix sort. The algorithm uses two passes analyzing 16 bits each. The implementation is based on the radix sort sample shipped with AMD APP SDK (Advanced Micro Devices, Inc. 2013a)

24 Host code for a bitonic sort implementation. (Bainville 2011)

25 OpenCL Kernel code for one iteration of a bitonic sort. The implementation is based on an article about experiences with OpenCL and sorting by Eric Bainville (Bainville 2011)

26 Changes to the host code from listing 24 for a bitonic sort implementation using kernel fusion. (Bainville 2011)

27 OpenCL code of several (fused) kernels performing 1, 2, 3 or 4 iterations (2, 4, 8 or 16 inputs per work item) of a bitonic sort. (Bainville 2011)

28 Host code for a radix sort implementation. (Advanced Micro Devices, Inc. 2013a)

29 OpenCL code for the histogram and permute kernels of a radix sort, where each work item stores its histogram in registers. (Advanced Micro Devices, Inc. 2013a)

30 Changes to the host code of listing 28 for a radix sort implementation using local memory to store each thread’s histogram. (Advanced Micro Devices, Inc. 2013a).

31 Changes to the OpenCL kernel code of listing 29 for a radix sort implementation using local memory to store each thread’s histogram. (Advanced Micro Devices, Inc. 2013a)

32 Changes to the OpenCL kernel code of listing 31 for a radix sort implementation using local memory to store each thread’s histogram and vector types for loading values from global memory. (Advanced Micro Devices, Inc. 2013a)

33 Utility functions used in host codes throughout the thesis


AccelerEyes (2013). ArrayFire. Version 1.9. URL: re_tour (visited on 2013-08-06).

Advanced Micro Devices, Inc. (2013a). Accelerated Parallel Processing (APP) SDK. Version 2.8. URL: /amd-accelerated-parallel-processing-app-sdk/ (visited on 2013-07-09).

— (Apr. 2, 2013b). Accelerated Parallel Processing Math Libraries (APPML). Version 1.10. URL: -accelerated-parallel-processing-math-libraries/ (visited on 2013-07-17).

— (2013c). CodeXL. Version 1.2.3897.0. URL: sdks/heterogeneous-computing/codexl/ (visited on 2013-07-09).

BLAS (Basic Linear Algebra Subprograms) (2011). URL: http://www.netlib.org/blas/ (visited on 2013-07-10).

Bainville, Eric (June 25, 2011). OpenCL Sorting. URL: http://www.bealto.com/gpu-sorting_intro.html (visited on 2013-08-14).

Batcher, Ken E. (1968). “Sorting networks and their applications”. In: Proceedings of the April 30-May 2, 1968, spring joint computer conference. AFIPS ’68 (Spring), pp. 307-314.

Bitonic (Oct. 8, 2012). File:BitonicSort1.svg. Wikipedia. URL: wiki/File:BitonicSort1.svg (visited on 2013-08-07).

Blelloch, Guy E. (1989). "Scans as primitive parallel operations". In: Computers, IEEE Transactions on 38.11, pp. 1526-1538. ISSN: 0018-9340.

— (1990). "Prefix Sums and Their Applications". In: Synthesis of parallel algorithms. Morgan Kaufmann Publishers Inc., pp. 35-60. URL: Ble93.pdf (visited on 2013-07-31).

Fernando, Randima and Mark J. Kilgard (2003). The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc. ISBN: 0321194969.

Gaster, Benedict et al. (2011). Heterogeneous Computing With OpenCL. Morgan Kaufmann Publishers Inc. ISBN: 9780123877666.

Harris, Mark, Shubhabrata Sengupta, and John D. Owens (2008). GPU Gems 3. GPU Gems. Addison Wesley Professional. Chap. 39. Parallel Prefix Sum (Scan) with CUDA. URL: http://

Hillis, W. Daniel and Guy L. Steele, Jr. (Dec. 1986). "Data parallel algorithms". In: Commun. ACM 29.12, pp. 1170-1183. ISSN: 0001-0782. URL: http://cva.stanford.edu/classes/cs99s/papers/hillis-steele-data-parallel-algorithms.pdf (visited on 2013-07-31).

Kanellos, Michael (Feb. 10, 2003). Moore’s Law to roll on for another decade. CNET. URL: http: // 0-10 01-984 051 .html (visited on 2013-03-25).

Kanter, David (Dec. 14, 2010). AMD’s Cayman GPU Architecture. URL: http://www.realwo (visited on 2013-08-20).

Kipfer, Peter and Rüdiger Westermann (2005). GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (GPU Gems). GPU Gems. Addison-Wesley Professional. Chap. 46. Improved GPU Sorting. ISBN: 0321335597. URL: http://

Lang, Hans W. (Dec. 7, 1998). Bitonic sorting network for n not a power of 2. URL: http:// (visited on 2013-08-16).

McClanahan, Chris (2010). History and Evolution of GPU Architecture. Georgia Tech College of Computing. URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf (visited on 2013-03-23).

Merritt, Rick (June 20, 2011). OEMs show systems with Intel MIC chips. URL: http: //www. MlC-chips (visited on 2013-04-09).

Micikevicius, Paulius (June 6, 2012). GPU Performance Analysis and Optimization. NVIDIA Corporation. URL: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0514-GTC2012-GPU-Performance-Analysis.pdf (visited on 2013-04-15).

Moore, Gordon E. (Apr. 19, 1965). "Cramming more components onto integrated circuits". In: Electronics. URL: http://download.intel.com/museum/Moores_Law/Articles-Press_Releases/Gordon_Moore_1965_Article.pdf (visited on 2013-03-25).

Munshi, Aaftab, ed. (Nov. 14, 2012). The OpenCL Specification. Version 1.2. Khronos Group. URL: (visited on 2013-04-08).

NVIDIA Corporation (July 2009). NVIDIA OpenCL Best Practices Guide. URL: http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf (visited on 2013-08-19).

— (2013a). CUBLAS (CUDA Basic Linear Algebra Subroutines). URL: https://developer. (visited on 2013-07-17).

— (2013b). GeForce GTX 580. Specifications. URL: desktop-gpus/geforce-gtx-580/specifications (visited on 2013-04-04).

— (Jan. 28, 2013c). NVIDIA's Next Generation CUDA™ Compute Architecture: Kepler™ GK110. The Fastest, Most Efficient HPC Architecture Ever Built. Version 1.0. URL: http:// Whitepaper.pdf (visited on 2013-04-13).

Shimpi, Anand Lal and Ryan Smith (Apr. 23, 2012). The Intel Ivy Bridge (Core i7 3770K) Review. URL: http://www.anandtech.com/show/5771/the-intel-ivy-bridge-core-i7-3770k-review (visited on 2013-04-21).

StefanVanDerWalt (Oct. 4, 2010). File:Matrix multiplication diagram 2.svg. Wikipedia. URL: http:// (visited on 2013-04-28).

Stier, Jochen (July 2012). libCL. URL: (visited on 2013-08-19).

Strassen, Volker (1969). "Gaussian Elimination is not Optimal". In: Numerische Mathematik 13, pp. 354-356.

Tompson, Jonathan and Kristofer Schlachter (Apr. 24, 2012). An Introduction to the OpenCL Programming Model. New York University. URL: ~lerner/spring12/Preso07-OpenCL.pdf (visited on 2013-07-08).

Volkov, Vasily (Sept. 22, 2010). Better Performance at Lower Occupancy. UC Berkeley. URL: http:// /~volkov/volkov10-GTC.pdf (visited on 2013-07-15).

Warfield, Bob (Sept. 6, 2007). A Picture of the Multicore Crisis. URL: http://smoothspan.wordpress.com/2007/09/06/a-picture-of-the-multicore-crisis/ (visited on 2013-03-25).

OpenCL Data Parallel Primitives Library (July 2011). Version V1.0 beta 3. URL: (visited on 2013-08-06).

A Utility functions

Listing 33 shows the definition of frequently used utility functions.

illustration not visible in this excerpt

Listing 33: Utility functions used in host codes throughout the thesis

B Benchmark environment

All benchmarks have been created using the following software and hardware configuration:

- Windows 7 x64

- Intel Core i5 M 480

- 2 physical cores (4 virtual with Hyper Threading)

- 2.67 GHz (2.93 GHz with Turbo Boost)

- 4 GiB DDR3 RAM system memory

- AMD Radeon HD 5650

— 400 shader cores

— 1 GiB memory

OpenCL device info of GPU

The following information has been retrieved by calling clGetDeviceInfo with all constants defined in OpenCL 1.2. For a complete list and further explanation, visit the corresponding API documentation on the Khronos website.

illustration not visible in this excerpt


I want to thank my professor Herwig Mayr for putting in a good word for me at RISC Software GmbH. On the technical side, I want to thank my supervisor at RISC, Alexander Leutgeb, for the excellent support on all subjects of the project. Personally, also a thank you for the several interesting discussions we held after work. Furthermore, I want to thank the RISC company itself for providing me with high quality equipment and an amazing workstation during my three month internship, which made working a pleasure. Finally, a thumbs-up to Michael Hava for having the most in-depth C++ knowledge of anyone I have ever met. I learned a great deal by reading your code.


Diese Arbeit stellt eine detaillierte Dokumentation über das Berufspraktikum des Autors bei der RISC Software GmbH dar. Nach einer kurzen Einführung in das Unternehmen sowie das Projekt Enlight werden die Ziele des Praktikums erläutert, welche verschiedene Aspekte der Implementierung eines Raycasters mit OpenCL betreffen, der in den drei Praktikumsmonaten erstellt wurde.

Im Anschluss an die Einführung werden grundlegende Themen sowie Basiswissen behandelt, das in späteren Kapiteln benötigt wird. Dazu gehört ein Grundverständnis über Raycasting, reguläre Gitter als Beschleunigungsstruktur, Raycasting implizit beschriebener Geometrie über boolesche Subtraktion und OpenCL als Technologie für GPU-Beschleunigung.

Bevor sich die Arbeit in die eigentliche Implementierung vertieft, wird der zu Praktikumsbeginn bestehende Prototyp detailliert analysiert. Dabei wird vor allem auf fortgeschrittene Algorithmen zur Optimierung des eingesetzten Raycasting-Verfahrens eingegangen.

Anschließend fokussiert sich die Arbeit auf verschiedene OpenCL-Programme, deren Ziel es ist, GPU-beschleunigt Bilder in ähnlicher Qualität wie die existierende Implementierung zu erzeugen. Auf Vorteile und Schwierigkeiten während der Implementierung der OpenCL-Programme wird ebenfalls eingegangen.

Die abschließenden Laufzeitvergleiche der funktionierenden OpenCL-Raycaster mit der existierenden CPU-Implementierung werden noch durch Erfahrungen mit Entwicklungswerkzeugen rund um OpenCL und eine kurze Diskussion über noch offene Probleme abgerundet.


This thesis provides a detailed coverage of the author's internship at RISC Software GmbH. After a short introduction to the company and the Enlight project, the goals of the internship are discussed, which address various aspects of an OpenCL ray caster implementation created during the three month working period.

In succession to the introduction, fundamental topics required in further chapters of the thesis are covered. These include the principle of ray casting, regular grids as acceleration structure, ray casting implicitly described geometry using boolean subtraction and OpenCL as technology for GPU acceleration.

Before the thesis deepens into the actual implementation, the existing prototype at the start of the internship is analyzed in detail. This includes advanced algorithms to optimize the used ray casting approach.

The primary focus then lies on several OpenCL programs with the goal of reproducing the visual output of the existing CPU implementation using GPU acceleration. Advantages and difficulties of developing with OpenCL are discussed along with the explanations of the implementations.

The final benchmarks of the working OpenCL ray casters are then rounded off by experiences made with various development tools around OpenCL and a discussion of the still remaining problems.


1 Introduction
1.1 About RISC Software GmbH
1.2 About Regio 13 Enlight
1.3 Initial situation and goals of the internship
1.4 Chapter overview

2 Fundamentals
2.1 Raycasting
2.2 Regular grids
2.3 Boolean raycasting
2.4 Packet casting
2.5 OpenCL

3 Existing prototype
3.1 Structure
3.2 Ray casting data structure
3.3 Classification
3.4 Improved counting algorithm

4 OpenCL ray casters
4.1 OpenCL ray casting driver
4.2 TestGradientCaster
4.3 TestCameraCaster
4.4 SelfTestCaster
4.5 SingleRayCaster
4.6 SingleBooleanRayCaster
4.7 SingleBooleanCellRayCaster
4.8 Double support in calculations
4.9 OpenCL source embedding
4.10 SinglePrimaryBooleanCellRayCaster
4.11 PacketRayCaster
4.12 Migration to new host application
4.13 Out-of-core ray casting

5 Results
5.1 Benchmarks and comparison with existing casters
5.2 Development
5.3 Existing problems

6 Summary and conclusion

List of Figures


A Test environment

1 Introduction

1.1 About RISC Software GmbH

The RISC Software GmbH is a limited liability company and part of the Softwarepark Hagenberg located in Upper Austria. It was founded in 1992 by Bruno Buchberger as part of RISC, which was founded five years earlier and stands for Research Institute for Symbolic Computation. While the institute conducts basic research in the fields of mathematics and computer science, the RISC Software GmbH dedicates itself to the application of mathematical and computational research. Besides a small set of software products, the RISC Software GmbH's main focus lies on providing customized software solutions using state-of-the-art and even cutting-edge technologies.

The company is divided into four departments (called units) designating its primary competences:

- Logistics Informatics

- Applied Scientific Computing

- Medical Informatics

- Advanced Computing Technologies

The company operates profit-oriented with a non-profit character as, according to the company agreement, all profits are reinvested into research and development [4].

1.2 About Regio 13 Enlight

Regio 13 is a political program to sustainably improve the competitiveness of regional companies, economic growth and employment within Upper Austria. The program is co-funded by the European Union with a total budget of 95.5 million euros and runs from 2007 to 2013. Within this time, companies may apply for financial subsidies from the regional government (Land Oberösterreich) for projects meeting a defined list of requirements.

The Enlight project is conducted by the Applied Scientific Computing unit of the RISC Software GmbH and funded entirely by the Regio 13 program. The project is based on a ray casting prototype developed by Michael Hava during his three month internship in 2011 [6]. The project officially started on 2011-07-01 and is scheduled to be finished at the end of 2013. The project proposal from 2011 defines the following goals for Enlight [5]:

- Development and exploration of new algorithms for the visualization and modification of very complex and detailed geometries with file sizes of several gigabytes. For this, an optimized ray casting method will be used in favor of the traditional and wide-spread rasterization, as the latter is by design inefficient for geometries of this size.

- The design of a software system architecture capable of using modern many-core processors and GPUs as efficiently as possible with regard to the parallel scalability of the visualization.

- Development of a prototype and base library for demonstrating the performance potential on the use case of subtractive manufacturing.

- Additionally, the base library should be the foundation for the further development of a general and industrially applicable software library.

Besides the goals from the initial proposal, several derived goals and restrictions have been defined:

- The visualized model always consists of an initial volume (stock) from which further volumes are subtracted. Every volume must be water-tight [11] and consist only of triangles.

- To allow visualizing frequent geometry changes, such as in subtractive manufacturing, the software should be capable of handling at least ten scene updates (adding ten subtraction volumes) per second. This has a significant impact on the chosen data structure used to represent the scene.

- Furthermore, the internal data structure should be able to handle up to 100,000 subtraction volumes.

- The visualization should be interactive. Interactivity is defined in the proposal, by reference to an external article, as at least ten frames per second [7].

- The visualization should be as precise as possible.

- The ray casted image should also contain depth values in order to merge it with OpenGL rendered geometries.

1.3 Initial situation and goals of the internship

The internship started on 2013-04-04 and lasted until 2013-06-30. As the project had already been running for two years at the time of joining the RISC Software GmbH, the major work of the project had already been done. The created application is console based, offering a variety of commands. Several mesh formats (STL, PLY and PTM) have been implemented to allow loading geometry from the file system. The triangles are processed and organized in a special kind of data structure (discussed in chapter 3.2). An additional window is used to show a preview of the loaded meshes using OpenGL (without subtraction of the volumes). Rotating and zooming the preview is also possible using cursor dragging and mouse wheel scrolling. Via corresponding commands on the console, ray casting can be enabled, which opens a second window showing the ray casted result. The underlying ray casting implementation runs on the CPU and already performs well. A further CUDA implementation was in progress.

According to the job description of the internship agreement, the primary objective is to implement a ray caster based on the existing code infrastructure and data structures which utilizes modern hardware architectures (multicore CPUs and GPUs) as effectively as possible concerning parallelism using OpenCL. Starting from this statement, several smaller subgoals were defined during the internship:

- Implementation of an OpenCL infrastructure which allows running GPU accelerated ray casting algorithms and integrating it into the existing code base. Thus, corresponding commands have to be added to the console application and connected with the OpenCL infrastructure.

- Implementation of a ray caster processing single rays independently. Therefore, the provided data structures, geometries, camera settings, etc. from the application core have to be processed to generate a valid output which can be handed back for successful display.

- Implementation of a ray caster processing rays grouped together in ray packets. This variant showed significant performance improvements on the CPU. Therefore, it should also be evaluated on the GPU.

- Porting the implemented OpenCL infrastructure as well as the ray caster algorithms to a new prototype which will serve as the base of the final product.

- Implement out of core ray casting for geometries larger than the available GPU memory.

- Implement selective anti-aliasing of the calculated image (canceled).

1.4 Chapter overview

After this short introduction to the company and the organizational part of Enlight, chapter 2 continues with technical fundamentals of the fields Enlight encompasses. After a short overview of ray casting and acceleration structures, a specialized ray casting algorithm is discussed which uses object entry and exit counting to determine the visible surface. A final coverage of GPGPU computing with OpenCL finishes the chapter, providing the information required to understand the subsequent algorithmic designs and presented implementations.

In chapter 3, a general overview of the existing application infrastructure is given. After the big-picture design is shown, the data structure used for ray casting is explained in more detail. Furthermore, the algorithm for adding newly loaded subtraction volumes to the data structure is presented, as the chosen strategy has a great impact on the ray casting algorithms, which are discussed last.

The main chapter 4 then presents the implemented OpenCL infrastructure as well as several ray caster implementations. This chapter is organized chronologically, showing the progress and evolution of the algorithms during the internship. It starts with the creation of an OpenCL environment and continues with several ray caster implementations and versions of them. The chapter ends with porting the created code to a new prototype and integrating out of core capability.

The final chapter 5 rounds off by showing several benchmark results on different scenes and compares the OpenCL implementation with the preexisting and finished CPU ray caster. Furthermore, experiences during development are shared and existing problems discussed.

2 Fundamentals

2.1 Raycasting

Ray casting is primarily an imaging technique and was first described by Scott Roth in 1982 as an effective method of rendering Constructive Solid Geometry (CSG) objects. Although ray casting may be used for different purposes, we will focus on generating a two dimensional image from a three dimensional scene. The basic principle is shown in Figure 1.

illustration not visible in this excerpt

Figure 1: Principle of ray casting.

Generating an image of a three dimensional scene of objects is done by sending rays from a defined eye point through the image plane into the scene. The image plane represents the two dimensional output image which should be generated. Each output pixel is transformed into a three dimensional position on the image plane which is passed through by one ray originating from the eye point. The eye point itself is usually positioned centrally in front of the image plane to achieve a perspective view of the scene. Each ray is then guided through the three dimensional space containing the scene objects, trying to find an intersection. In case an object is hit, a color value is calculated for the intersection point on the object using e.g. material parameters or textures of the surface. The pixel on the image plane, through which the ray was shot initially, is then set to this color.

Ray casting differs from the widespread rasterization used for most 3D applications and computer games today, as it approaches the problem from the image side instead of from the objects the scene is composed of. Rasterization takes all the objects placed in the scene, which have to be composed of primitive shapes like triangles, and transforms them from the three dimensional space of the scene to the two dimensional space of the output image, where each object is then rasterized onto the pixels of the output image.

Apart from many technical differences, both methods also differ in their complexity. Mark Kilgard describes the asymptotic complexity of rasterization as O(b · m), whereas an accelerated ray tracing implementation using a favorable data structure achieves O(p · log b), where b is the number of objects in the scene, m is the average number of pixels per object and p is the number of pixels in the scene [8, p.48]. The Enlight proposal additionally points out that the size of computer screens has grown far slower over the past decades than the size of computer memory, computational power and the size of industrial Computer-aided design (CAD) and Computer-aided manufacturing (CAM) models [5]. Therefore, ray casting might be beneficial when used to render large scenes composed of huge amounts of triangles.

2.2 Regular grids

Regular grids are a type of data structure frequently used in ray casting/tracing to accelerate the traversal of rays through the scene. Alternative approaches are often tree based, such as kd trees, octrees, binary space partitioning (BSP) or bounding volume hierarchies (BVH). Each of these techniques has its own strengths and weaknesses depending on the scene and application. However, especially kd trees and grids (including multilevel grids) have been the primary focus of recent work on interactive ray casting/tracing [12, ch.1]. While kd trees usually perform amazingly well on static scenes, dynamic or animated objects still present a challenge, as the used data structure has to be rebuilt after every scene change, which can take up to several seconds for larger sets of geometries [12, ch.1]. Regular grids are simpler structures for partitioning space and accelerating ray casting. They can be created quicker than a (balanced) tree, thus allowing more frequent scene changes. Nevertheless, kd trees may be up to an order of magnitude faster when it comes to the actual ray traversal [12, ch.1]. As Enlight focuses on visualizing dynamic scenes (for subtractive manufacturing), allowing at least ten subtraction volumes to be added to the scene per second, regular grids appear promising for this use case.

Building a regular grid data structure is simple. Initially, a bounding box is created around the scene. This box is then divided into equally sized cubes, which may force the bounding box to be enlarged a bit to fit a whole number of cubes in each dimension. These cubes are called cells. For each triangle of the scene, all cells it spans are calculated, and each of these cells stores a reference to the triangle.

illustration not visible in this excerpt

Figure 2: Traversing the cells of the grid using a DDA variant until an intersection has been found.

During ray casting, the rays have to traverse the cells of the grid in order to find potential geometry they might intersect with, cf. Figure 2. A fast algorithm for traversing regular grids with a single ray is given by John Amanatides and Andrew Woo [1]. Their algorithm is a slight modification of the digital differential analyzer (DDA), which is used for the rasterization of lines. As finding the pixels covered by a line, which are essentially squares inside a regular 2D grid, is actually the same problem as finding the voxels hit by a ray, this adapted 3D DDA algorithm can be used for efficient traversal of regular grids. Furthermore, as the algorithm's start point can be determined and the voxels (grid cells) are traversed incrementally, no depth sorting has to be performed across the hit cells and the traversal can be stopped at the first found intersection. Inside each cell, all triangles have to be intersected and the intersections have to be sorted by depth in order to find the correct hit and visible surface.

2.3 Boolean raycasting

Enlight focuses on subtractive manufacturing. Therefore, the scene is composed of an initial stock volume from which further volumes are subtracted, cf. goals in chapter 1.2. This scenario is typically found in CSG modeling, where complex models are built from boolean combinations of simpler models. CSG models can be efficiently rendered using ray casting by using a tree with the volumes as leaves and the boolean operations applied to them as nodes. The rays are then traversed through the scene and have to resolve the tree on every volume entry or exit. Enlight only supports a subset of CSG by only allowing boolean subtraction from the initial stock volume. However, by inverting the stock volume via inversion of the surface normals and defining the world space as being filled instead of being empty, the problem can be simplified even further, as the scene now only consists of subtraction volumes.

Figure 3: Ray traversing through a scene containing one stock and two subtraction volumes. By inverting the stock, which is shown in the left figure, into a subtraction volume, as shown in the figure on the right, the counting algorithm is simplified. The inversion is accomplished by inverting the normal vectors of the stock's surface. The ray starts inside the inverted stock volume with a counter value of minus one, is decremented on every volume entry and incremented on every volume exit. The entries are shown as yellow dots and the exits as blue dots in the right figure. If the counter reaches the value zero, the surface has been found.

The algorithm for ray casting a scene composed of subtraction volumes is illustrated in Figure 3. Every ray starts outside the initial scene and maintains a counter which determines the number of subtraction volumes the ray has entered and is currently inside of. As the outside of the original scene lies inside the inverted stock, this value is initialized with minus one, corresponding to "inside one volume". The ray is then traversed through the scene. On every intersection where the ray enters a volume, the counter is further decremented, as the ray is now inside one more volume. On every exit, the counter is incremented, as the ray is now inside one less volume, and checked whether it has become zero. A counter value of zero on a volume exit corresponds to an intersection with the surface of the resulting model.

2.4 Packet casting

A common technique for accelerating ray casting and tracing is to accumulate squares of adjacent rays into ray packets and to traverse whole packets through the scene instead of individual rays. This technique has several advantages. Most importantly, a part of the parallelism between individual rays, which can be processed completely independently, is moved down into the traversal and intersection algorithms. As adjacent rays are likely to traverse the same parts of the acceleration structure (the same cells of a grid in the case of Enlight) and hit the same triangles, the corresponding routines can be executed in a data parallel fashion on multiple rays. Modern CPUs and GPUs benefit from these scenarios and offer corresponding SIMD instructions (CPU) or an appropriate hardware architecture (GPU). Furthermore, coherent traversal and intersection improve caching behavior via temporal locality [12]. Additionally, the traversal overhead is reduced by a factor of the number of rays inside a packet. However, traversing packets through acceleration data structures such as grids is more difficult and cannot be done using the 3D DDA algorithm presented in Chapter 2.2. Fortunately, an efficient solution is presented in the paper "Ray Tracing Animated Scenes using Coherent Grid Traversal" [12]. The algorithm is illustrated in Figure 4.

illustration not visible in this excerpt

Figure 4: A ray packet traversing a regular grid slice by slice. The blue region is the axis aligned part of a slice (column) which is actually occupied by the packet and is determined by the corner rays of the packet. The yellow region is the blue region extended to full cell boundaries. Furthermore, the packet is always fully traversed until all rays have hit, although no intersections have to be calculated for rays which have already hit.

As the ray packet may now span multiple cells, it cannot traverse the grid cell by cell anymore. The next bigger aggregation of cells is a slice of the grid. To iterate the slices of a grid in a favorable order, the main/primary traversal axis is determined by finding the largest component of the direction vector of one of the innermost rays of the packet. The two other axes are called secondary axes. For each slice of the grid along this main traversal axis, starting at the side of the grid where the packet enters, the four corner rays are intersected with the front and back plane of the slice. The maximum and minimum values of the intersection points along the secondary axes describe the axis aligned region of the slice which is hit by the packet. This region is shown in blue in Figure 4 and further extended to full cell boundaries, shown in yellow. Then, all objects in the cells of this extended region have to be intersected with all rays of the packet. This routine can be easily vectorized by accumulating several rays into SIMD registers and processing the intersection tests using vector instructions like SSE or AVX. Rays which have already hit a triangle may be masked out. A further optimization of this algorithm is to cull all triangles of the selected cell region of the slice against the frustum spanned by the corner rays.

2.5 OpenCL

OpenCL is an open and free standard for parallel, general purpose programming targeting mainly CPUs and GPUs. This chapter will provide a quick introduction into using OpenCL for executing programs on the GPU. For detailed information, please have a look at the OpenCL chapter in the first part of this thesis - GPGPU Computing with OpenCL.

OpenCL is specified by the Khronos Group [11]. Implementations of OpenCL, however, have to be provided by hardware vendors such as NVIDIA, AMD or Intel. These vendors provide Software Development Kits (SDKs) containing C header files and static libraries which can be used to write applications using OpenCL.

To allow OpenCL to support various kinds of hardware, a separate language is used to write programs to be run on devices supporting OpenCL. These programs are called kernels and have to be shipped with the application which wants to use them, which is called the host application. This host application uses the OpenCL API to communicate with an OpenCL implementation present in the system.

An application using OpenCL typically begins by selecting one of the available OpenCL platforms on the system, which correspond to installed OpenCL drivers. A system may have different drivers for different devices, from which the application has to choose one. Then, a device supported by this platform is selected, e.g. the system's GPU on the NVIDIA platform. Furthermore, a context is required, which provides a container for various OpenCL objects like buffers or kernels, as well as a command queue, which is used to enqueue operations for execution on a device. To execute an OpenCL program on the device, the source code of the program, written in OpenCL C, has to be passed to the OpenCL driver for compilation and linking at runtime. On success, kernels can be extracted, providing entry points to the OpenCL program. A kernel can be any function of the written OpenCL program which returns void and is prefixed with the kernel attribute. Kernels may also have arguments which can be set by the host application. On the host side, kernels can be enqueued on the command queue for asynchronous execution on a device after their parameters have been set. For transferring larger amounts of data and for retrieving results from a kernel, buffer objects can be created. These buffers can be read and written asynchronously (default) or synchronously from the host application by enqueuing corresponding operations on the command queue. Buffers can be passed as arguments to a kernel, which can access them via a pointer.

Although OpenCL is not tied to a hardware architecture, it does fit GPUs very well. A modern GPU typically consists of several hundreds or even thousands of cores able to process work in parallel. These cores are usually part of larger hardware compounds called streaming multiprocessors (SM). Each of these SMs has a small local memory which is shared between all cores of the SM and a large amount of registers which are allocated by the cores for executing kernels. The cores themselves do not have separate registers. Outside the SMs is the memory system connecting to the off-chip video memory on the graphics card, which can be accessed by all cores.


illustration not visible in this excerpt

Figure 5: A two-dimensional NDRange consisting of GSx * GSy work groups. Each work group consists of LSx * LSy work items. A work item has two indexes in each dimension, the local index (relative to the work group) and a global index (relative to the NDRange), as well as the indexes of the group it is part of. The local work sizes are LSx and LSy whereas the global work sizes are GSx * LSx and GSy * LSy.

OpenCL abstracts such massively parallel hardware architectures using several concepts. When the host application enqueues a kernel, it has to specify a so-called n-dimensional range (NDRange) which determines the dimension and size of the problem, e.g. a two-dimensional range with 800 x 600 elements. According to this range, work items are created and grouped together into work groups. The total number of work items in each dimension is called the global work size. The size of the work groups in each dimension is called the local work size. Each dimension of the global work size must be a multiple of the same dimension of the local one. A work item corresponds to a single execution of a kernel inside a single thread on the GPU. Each work item has several coordinates (ids) it can use to determine its position inside the NDRange, which is illustrated in Figure 5.

Work items inside a work group are executed on the same SM, share the local memory residing there and may be synchronized. The off-chip video memory is called global memory in OpenCL; it is used to store buffers and can be accessed by all work items and the host application. Variables placed in and pointers to the local memory have to be prefixed with the local attribute, whereas pointers addressing global memory have to be prefixed with the global attribute.

3 Existing prototype

At the beginning of the internship, the existing code base, shortly described in Chapter 1.3, was analyzed in order to understand how the ray casting application has been organized and implemented. The application is written in C++ using Microsoft Visual Studio 2010 and 2012 in 64 bit mode. Besides the compilers provided by Microsoft, Intel's C++ compiler is also used during development to benefit from stronger optimization. The code heavily uses C++11 features and AVX SIMD instructions, thus limiting the application to rather up-to-date compilers supporting C++11 and newer hardware. AVX is supported since Intel's Sandy Bridge and AMD's Bulldozer series. Furthermore, OpenMP is used as a technology for high level parallelization, OpenGL for visualization and the Microsoft Foundation Classes (MFC), a C++ wrapper of the Win32 API, for window management and user interaction. Considering the implemented algorithms, single ray casting of scenes composed of subtraction volumes as shown in chapter 2.3, using the 3D DDA algorithm discussed in Chapter 2.2, was used in the beginning. However, this initial approach has later been replaced by a highly optimized and parallel packet ray caster using the slice traversing algorithm presented in Chapter 2.4. The single ray variant can still be found in the code but is not used anymore.

Figure 6 shows screenshots of the existing prototype at the beginning of the internship.

3.1 Structure

Figure 7 shows a simplified class diagram of the most important classes used in the existing ray casting implementation. Most of them had to be dealt with when extending the existing infrastructure with the new OpenCL ray caster.

The central class containing most of the application logic is DebugView. It inherits CWnd from the MFC and is the main window of the application containing the OpenGL visualization. Besides all graphical user interactions using the cursor, like zooming and rotating, also the console commands are parsed and processed by this class. It further contains an instance of Camera which is updated according to user interaction. Also the regular grid holding all loaded meshes is stored by DebugView. Finally, a ray caster wrapper class is also available (RayCasterWrapper), which delegates calls to the actual ray casting implementation. After the application has started, meshes can be loaded from the file system via corresponding console commands. The loaded meshes are processed by the classification algorithm and merged into the grid, cf. Chapter 3.3. Ray casting can be triggered either by issuing a command on the console or by changing a camera property via GUI input (zoom, rotation). In both cases, DebugView eventually invokes castWithRayPackets() on the ray casting implementation wrapper and passes a reference to the grid and the current camera settings. The wrapper delegates this call to the corresponding method in RayCaster where the actual ray casting code is executed. Initially, several camera values are retrieved and a RayPacketFactory is created. This factory relies on the older RayFactory used for the initial single ray casting algorithm. Furthermore, an instance of PixelGrid is created which will later be used to store the ray casting results and will be returned to the calling DebugView class, where it is used to render the casted image. After this setup, the requested output image area, sized according to the width and height of the camera, is divided into a grid of square packets which is iterated over by two nested for loops. The square size is adjustable and currently set to eight.
The outer loop iterates over the rows of the packets and is executed in parallel using an OpenMP directive. Within the inner loop, RayPackets are created using the corresponding factory. Each packet is traversed through the grid using an instance of

illustration not visible in this excerpt

Figure 6: Screenshots of the existing prototype. The images on the left show the OpenGL visual­izations of the loaded meshes. The images on the right show the ray casting result using the existing CPU packet ray caster. The first row only shows the loaded stock volume. The second row shows the stock volume and one subtraction volume. The OpenGL visualization also shows the grid. Note that the grid only has to cover the stock volume. The third row shows several subtraction volumes cut out of the stock forming a cylinder head.

Figure 7: Simplified class diagram of the most important classes involved in ray casting at the beginning of the internship.

RegularGridSliceTraverser. After the slice traverser has been initialized with the packet, which determines the traversal axis and grid entry point, a while loop retrieves cells from the slice traverser using its next method until end returns true. At every slice end, the traversal may be aborted early if all rays of the packet have already hit. For each cell the slice traverser returns, the ray packet has to be intersected with the referenced triangles of this cell. This is done using an instance of RayPacketIntersectionProcessor. This process consumes most of the required CPU time. Therefore, the intersect routine and all subroutines are highly optimized and consist mostly of AVX vector intrinsics. The intersection test for the packet is performed in parallel using SIMD after all triangles of the cell have been culled against the frustum defined by the corner rays of the packet. Details are discussed later in Chapter 3.4. For each intersection, the normal vector of the hit surface and the distance from the eye point are retrieved. The normal vector, together with the initial ray direction of every ray of the packet, is used by the SimpleFragmentPacketShader to calculate a color value for the ray. Currently, the scalar product of both vectors is used to compute a gray scale value, achieving simple shading. The distance of the intersection point to the eye point (camera position), from which the ray originated, is later translated into the normal distance of the intersection point to the image plane, which corresponds to the depth value that would have been generated by OpenGL if the scene had been rendered traditionally. After all ray packets have been processed, the final PixelGrid instance containing the color and depth values is returned to DebugView for display.

3.2 Ray casting data structure

The data structure maintaining the geometry and accelerating the ray casting algorithm is implemented by several classes. To start with, the RegularGrid class is a simple container for cells. It also stores some meta information such as the grid's bounding box, the number of cells in each dimension and the size of a cell. The cells are stored in a contiguous array. Therefore, 3-dimensional cell coordinates have to be mapped to the linear memory region.

The cells themselves are simple bounding boxes containing references (pointers) to the triangles (facets) contained within the cells. Although the bounding box is implicitly given by the cell’s coordinates inside the grid and the grid’s bounding box, the box is kept precalculated as it is often needed by the ray casting algorithm.

A Facet itself consists of its three vertices, a normal vector as well as a structure pointer. The latter references the mesh (subtraction volume) this facet belongs to. The importance of this value is elaborated later when discussing the details of the intersection routine in Chapter 3.4.

3.3 Classification

illustration not visible in this excerpt

Every time a mesh is added to the grid, the triangles of the mesh have to be mapped to the cells of the grid. When the mapping is complete, the cells are classified into one of three categories. Cells which are occupied by triangles of the mesh surface are surface cells. Cells inside the mesh are inside cells and contain no triangles. Cells outside the mesh are outside cells and also contain no triangles. For the ray casting algorithm, only surface cells are relevant. The left sketch in Figure 8 illustrates the classification of a mesh.

Figure 8: Principle of classifying cells of a grid according to the added mesh. Only surface cells are relevant for ray casting. The sketch on the left side shows the classified stock volume. On the right side the classification result, after a subtraction volume has been added, is shown.

When a new mesh is added to the grid, the mesh is again classified and merged into the existing cell classification. Limiting the modifiability of the scene by only allowing the addition of subtraction volumes simplifies the rules for merging a new mesh into the grid as shown in Table 1.

illustration not visible in this excerpt

Table 1: Table of different merge scenarios and their outcome.

Cells which are outside the added subtraction volume remain unchanged. Surface cells of the added volume become surface cells unless they were outside cells before, and cells inside a subtraction volume always become outside cells, as they lie outside the resulting geometry. The result of merging a subtraction volume into the grid is visualized in the right sketch in Figure 8.

As we can see, cells which were surface cells before and contained triangles have now changed to outside cells and are therefore no longer relevant for ray casting. As a result, the increase of triangles in the surface cells caused by adding new volumes is compensated by the exclusion of some surface cells. This reduction of potential triangles for intersection with an increasing number of subtraction volumes is vital, as it enables scalability, allowing even scenes with thousands of subtraction volumes to be ray casted efficiently. In fact, some scenes can even be casted faster with an increasing number of subtraction volumes, as the number of relevant surface cells and therefore triangles decreases.

However, this kind of reduction has a significant consequence on the ray casting algorithm. As subtraction volumes are divided into grid cells and some of them discarded, the volumes are no longer water-tight. This is a problem for entry/exit counting ray casting algorithms such as the one discussed in chapter 2.3. Therefore, an adapted version of this algorithm has to be developed, capable of handling open volumes. Chapter 3.4 discusses a method of getting around this problem.

3.4 Improved counting algorithm

The entry and exit counting ray casting algorithm for boolean subtraction volumes discussed in Chapter 2.3 maintains a counter value for each ray along the full traversal of the grid cells. However, by using the classification technique introduced in Chapter 3.3 to eliminate irrelevant triangles, it is no longer possible to keep states between multiple cells, as some of them might not contain relevant triangles for intersection but might be important for state changes. For example, an entry into a subtraction volume inside an outside cell. Consequently, all required states of the ray can only and must be determined upon entry of a surface cell.

This problem mainly concerns the counter of each ray used to count the volume entries and exits along the ray in order to find the surface hit. To put it differently, each ray has to know how many subtraction volumes it is inside of at the point where it enters a surface cell. However, if we have another look at the classified grid in Figure 8, we can see that each surface cell contains triangles from all subtraction volumes (called structures in code) inside which the ray can potentially be, depending on its entry point. If the ray were inside a subtraction volume which has no triangles inside the cell, the cell would have been classified as an inside cell of this volume and would be skipped during ray casting.

Determining the number of subtraction volumes the entry point lies inside of, which is called the inside counter, can be done by creating secondary rays. These rays are casted from the cell entry point of the initial ray, which is called the primary ray for distinction, to a reference point on one triangle of each subtraction volume. The secondary ray must not hit other triangles of the same volume between the entry point and the triangle reference point. The used reference points on the triangles have to be inside the cell's bounding box and should not lie on an edge of the triangle, as numerical instability increases at the edges. A good candidate for such a point is the centroid of the triangle clipped against the cell's bounding box. The angle between the secondary ray and the surface normal of the chosen triangle determines if the entry point is inside the volume the triangle belongs to. If the primary ray already hits a subtraction volume, no secondary ray is necessary for this volume and the primary ray's direction and the surface normal of the hit triangle can be used instead. Figure 9 shows the determination of the inside counter on an example cell.

illustration not visible in this excerpt

Figure 9: Determination of the inside counter upon cell entry. The primary ray hits the structures sub 0 and sub 2. Both surface normals enclose an angle with the ray’s direction smaller than 90°. Therefore, the ray’s entry point lies inside of sub 0 and sub 2. As the primary ray does not hit sub 1, a secondary ray is sent to a reference point on a triangle of sub 1. As the surface normal and the ray’s direction enclose an angle larger than 90°, the cell entry point lies outside of sub 1. Therefore, the inside counter of the ray is initialized with minus two.

4 OpenCL ray casters

Before changing the existing prototype, a separate branch in SVN was created to allow experimenting with the existing code without disturbing the other members of the team. Before development was started, further tools were installed. As the GPU provided by RISC was from NVIDIA, NVIDIA’s CUDA Toolkit (version 5.0) was installed, which contains the required OpenCL header files and libraries. Furthermore, the OpenCL C++ wrapper provided by Khronos (cl.hpp) was downloaded from their website. Several weeks later, Intel’s SDK for OpenCL Applications 2013 was added, which provides a limited OpenCL debugger and some smaller extensions to Visual Studio [2]. A separate project was added to the existing Visual Studio solution and the necessary include and library paths were configured in order to use OpenCL.

4.1 OpenCL ray casting driver

Creating a ray caster using OpenCL started by designing a small infrastructure which has to meet the following requirements:

- Only one class should be needed by DebugView (the main application class) in order to communicate with the OpenCL ray caster.

- As the OpenCL ray casters will be developed incrementally and several different ideas exist, a mechanism has to be designed to maintain multiple casters.

- A reload functionality has to be provided to reload an OpenCL caster at run time. As loading larger scenes may take up to several minutes and the OpenCL source code is read in and compiled at run time, this feature allows changing the OpenCL source code of a caster at run time and reloading it without having to relaunch the host application.

- Global options and important variables have to be exported to the main application where they can be adjusted using corresponding console commands. Changing the value of a variable may trigger a recompilation of the kernels or a reconfiguration of the OpenCL environment.

- If a ray caster crashes or OpenCL experiences any problems, the main application must be able to relaunch the complete OpenCL environment without having to be restarted.

Considering these requirements, a simple architecture has been designed, shown in the simplified class diagram in Figure 10.

The central class and entry point of the OpenCL ray casting environment from the main application is the OpenCLCasterWrapper class. An instance of this class is maintained by DebugView, similar to the RayCasterWrapper, cf. Figure 7. The OpenCLCasterWrapper offers several methods for interacting with an OpenCL caster. Most importantly, it offers an update method taking a reference to the RegularGrid maintained by the main application which holds all relevant data of the scene. This method has to be called at least once before starting ray casting. The raycast method performs the main work of generating an image. The created image with depth information is also stored inside an object of PixelGrid and passed back to the main application.

illustration not visible in this excerpt

Figure 10: Simplified class diagram of the OpenCL ray casting driver.

For managing the underlying ray casting implementations, each caster is assigned a name. This assignment is managed by the CasterFactory, which is used by the OpenCLCasterWrapper to provide a list of available implementations and to allow selection and creation of casters by their name. The currently loaded caster is pointed to by the caster member. Furthermore, a reload method is provided for recreating the currently loaded caster.

Additionally, several options concerning OpenCL and the implemented ray casters may be set by the main application using the console. The OpenCLCasterWrapper offers further methods for configuring several values such as the floating point precision, work group size, OpenCL platform and device, as well as enabling out of core casting and adjusting the out of core buffer size, cf. Chapter 4.13.

The SubGrid class is responsible for holding the full or a part of the grid and for transforming the grid’s representation into buffers pushed to the GPU. Details are discussed with the caster implementations requiring this functionality.

The Timer class is a small helper class for measuring time durations between events. It is used to print the time required for certain operations inside the driver such as ray casting a frame, compiling a kernel or ray casting an out of core mesh in multiple passes.

Another important class is AbstractCLCaster. It is the base of all ray caster classes and is responsible for setting up the OpenCL environment. Every ray caster implementation inherits from this class and must implement the pure virtual functions setupCaster and cast. While cast is called directly from the OpenCLCasterWrapper, setupCaster is a template method inside AbstractCLCaster and is called during the caster’s construction, after OpenCL has been initialized. The derived caster implementation creates its kernels and further resources inside this method. The updateGrid method is implemented in the base class, as it works equally for all ray casting implementations which require the grid on the GPU (all non-test casters). Furthermore, AbstractCLCaster offers a variety of utility methods which can be used by derived casters.

On the right side of the class diagram of Figure 10 are several structures. CameraValues is used to transport relevant camera information from the OpenCLCasterWrapper into the caster implementations. This avoids dragging the Camera class and its dependencies deeper into the OpenCL project. The grid is also converted into one or multiple instances of SubGrid before it is passed into the caster implementations in order to reduce avoidable dependencies. The remaining structures are all used as data transfer objects between the OpenCL ray casters on the host application and the GPU. Therefore, only types defined in the corresponding OpenCL header are used, e.g. cl_uint, cl_float3.

4.2 TestGradientCaster

The first caster is a proof of concept to test the functionality of the OpenCL environment. It does not use the grid, loaded geometries or the camera’s current values. The TestGradientCaster only loads the OpenCL kernel source code from a file and compiles it using corresponding helper methods from AbstractCLCaster inside the overridden setupCaster method. The updateGrid method is also overridden but empty, as no grid buffers have to be created and uploaded. Inside the cast method, an output buffer is created capable of holding a cl_float4 value for storing red, green, blue and depth information for each pixel of the requested output image. The size of the output image is determined by the fields width and height of CameraValues, of which an instance is passed to the cast method. Then, the loaded kernel is enqueued on the command queue for execution on the GPU. The global work size of the kernel is only one dimensional: the number of output pixels rounded up to be a multiple of the maximum allowed work group size. Each kernel invocation then derives the x and y coordinates of the output pixel from the global id of the work item and writes a color value to the output buffer. The red channel of the color is the x coordinate divided by the output image’s width, the green channel is the y coordinate divided by the height, and the blue and depth channels are set to zero. After the kernel has executed, the result is read back to the host into an allocated cl_float4 array, which is then returned by the cast method. The OpenCLCasterWrapper converts this array into an instance of PixelGrid which is then passed back to the main application for display.

The resulting gradient of the TestGradientCaster is shown in Figure 11.

illustration not visible in this excerpt

Figure 11: Screenshot of the OpenCL kernel output of the TestGradientCaster.

4.3 TestCameraCaster

The next test caster is based on the TestGradientCaster. The goal of this caster is to test the generation of rays according to the current camera settings. Therefore, the CameraValues instance passed to the cast method is converted into a RayCreationParameters structure, which is passed as an argument to the kernel. This structure contains selected fields of the model view and projection matrix as well as several precalculated values which are used to create the ray direction vectors for a given x and y coordinate on the image plane. The kernel itself stores the vertices of 12 triangles forming a simple cube placed at the coordinate space’s origin in constant memory, which is a special part of the global memory. The kernel is enqueued with a two-dimensional global work size of the output image’s width and height, both rounded up to be a multiple of the configured work group size. Each work item calculates the direction of the ray traveling through the pixel at the work item’s x and y coordinate on the image plane. With the origin at the camera position, this ray is intersected with the 12 triangles of the cube. The intersection routine used is an adapted version of the "Fast, Minimum Storage Ray/Triangle Intersection" approach presented by Tomas Möller and Ben Trumbore [10]. The intersections are depth sorted to find the correct intersection point and hit triangle. A color value for the pixel is then calculated using the scalar product of the ray’s direction vector and the surface normal of the hit triangle. After the kernel has run, the resulting buffer is processed as by the TestGradientCaster.

The resulting image of a cube which rotates synchronously with the OpenGL visualization of the loaded scene is shown in Figure 12.

illustration not visible in this excerpt

Figure 12: Screenshot of the OpenCL kernel output of the TestCameraCaster.

4.4 SelfTestCaster

The self test caster is a simple test of the structures used in host application and OpenCL interaction. More specifically, the SelfTestCaster calculates the size of all structures used in memory transfers between the host and the GPU on the host side and enqueues a kernel retrieving the corresponding values on the device side. These values are compared to each other and an error is yielded if any pair of them is not equal. The reason why problems might occur on this level is the varying struct member alignment and padding carried out by different compilers. Although the data alignment of OpenCL is based on that of ISO C99, some additions have to be made for OpenCL’s built-in vector types, cf. OpenCL specification 6.1.5 [11]. In particular, OpenCL’s built-in vector types have to be aligned to a memory address that is a multiple of their size, which is a power of two. Therefore, a float3 value for example, with a theoretical size of 12 bytes, is aligned to a 16 byte boundary. However, the corresponding cl_float3 typedef on the host application (defined in cl.h) is backed by a structure containing a four element float array. This ensures that both types, float3 inside OpenCL and cl_float3 on the host application, have the same size of 16 bytes, but it does not guarantee a proper alignment, as the OpenCL float3 type will be 16 byte aligned and the cl_float3 structure (an array of floats) only to a four byte boundary. This has significant consequences if such vector types are used within structures, as the size of the structs may be different (typically larger) inside an OpenCL kernel than within the host application. Consequently, enforcing proper alignment is partially up to the programmer and has to be declared for the affected structures (e.g. with __declspec(align(sizeof(cl_float3))) using Intel’s C++ compiler). As the kernel does not create an output image, an empty buffer is allocated on the host and returned.

4.5 SingleRayCaster

The SingleRayCaster is the first attempt at a full ray caster. It is based on the TestCameraCaster concerning the host code, the ray creation inside the work items and the triangle intersection routine. However, instead of using statically coded geometry inside the kernel, the grid containing the loaded geometries from the main application is used. Therefore, an appropriate memory layout has to be defined which fits the concept of OpenCL buffers well. To traverse the grid, the 3D DDA algorithm presented in Chapter 2.2 is implemented. No counting is performed and no distinction is made between triangles of different structures or their facing (normal vector).

As the data structure holding the grid on the host application is to some degree hierarchical (the grid uses a dynamically allocated array of cells, and cells hold arrays of pointers to facets), the first step before any transfer to the GPU can happen is to flatten the data structure into corresponding OpenCL buffers. Figure 13 visualizes how the hierarchical grid is flattened into linear buffers by substituting pointers with indexes.

illustration not visible in this excerpt
Three buffers are required to store the grid. The first one is rather small and only stores meta information about the grid. It contains a single instance of the SubGridCL struct which contains the cellCount member. This cell count holds the extents of the grid in all three dimensions and therefore determines the number of cells. The second buffer stores the cells as a large array of CellCL structures. Each of them may have multiple facets associated with it. Therefore, each cell has a facetStart index which points to the first of facetCount triangles belonging to this cell in the triangle buffer. The third buffer then holds a huge array of triangles which contains the geometric information of the scene.

Flattening the grid into these three buffers has to be done every time a surface cell changes, e.g. by loading a new subtraction volume. When an update of the grid occurs, the main application calls the OpenCLCasterWrapper’s update method and passes the reference to the new grid. The grid is then flattened by the SubGrid class, filling its members subGrid, cells and triangles, which are exactly the contents of the three required OpenCL buffers. The SubGrid instance is passed to the currently loaded caster by calling the updateGrid method. This method is implemented in AbstractCLCaster and creates the actual OpenCL buffer objects and transfers the content of the SubGrid to the GPU.

The kernel itself is again enqueued with a two-dimensional global work size of the size of the output image (one work item per pixel) rounded up to a multiple of the configured work group size. The kernel code for this first full ray casting approach starts in the same way as the TestCameraCaster: each work item initially determines its ray direction. Then, the grid has to be traversed using the adapted 3D DDA algorithm by John Amanatides and Andrew Woo presented in Chapter 2.2 [1]. This algorithm basically works by calculating the entry point of the ray into the grid as well as the corresponding cell of this entry point. Subsequently, increment values for each dimension are calculated which are used to determine the next cell from the current one. The collection of routines implementing this algorithm will be called the cell traverser. After initialization, the ray casting kernel polls cells from this cell traverser until an intersection has been found or the grid has been fully traversed. For each cell returned by the cell traverser, all triangles inside this cell, determined by the facetStart and facetCount members, are loaded from the triangle buffer and intersected using the same routine as in the TestCameraCaster. The intersection with the smallest distance to the ray origin (camera position) is selected, as triangles are stored in no particular order, so no depth sorted iteration is guaranteed. Cells, however, are returned in correct order by the cell traverser. Therefore, the traversal is aborted once an intersection has been found inside a cell. The color of the output pixel is set as in the TestCameraCaster by calculating the scalar product between the ray direction and the hit surface’s normal vector.

The output of the SingleRayCaster is shown in Figure 14. Although the generated image is far from being a correct visualization, one can clearly recognize the shape of the cylinder head used as ray casting input. For a correct output image cf. the existing CPU caster result in Figure 6.

illustration not visible in this excerpt

Figure 14: Screenshot of the OpenCL kernel output of the SingleRayCaster. The fringes sticking out of the cylinder head, especially visible at the top left, are due to hits on triangles of the subtraction volumes, as the nearest triangle hit, regardless of which volume, is taken for the pixel’s color. These errors will disappear when cell based counting is implemented.

4.6 SingleBooleanRayCaster

The SingleBooleanRayCaster extends the previous SingleRayCaster by using a counter for each ray which counts the number of volume entries and exits. The counting is done along the full traversal axis as discussed in Chapter 2.3. As parts of the geometry are missing, because only the triangles of surface cells are stored in the grid and transferred to the GPU, this caster will probably produce erroneous results. However, this implementation would work if the complete geometry was present in the grid. This caster can be seen as an evolutionary step between the already presented SingleRayCaster and the SingleBooleanCellRayCaster, which will be able to handle open geometries.

The SingleBooleanRayCaster class is identical to that of the previous SingleRayCaster. The kernel code also starts off equally by calculating the ray direction for each work item/pixel and traversing this ray through the grid. Besides declaring the counter and initializing it with minus one, the only code that has changed is the intersection routine of the ray with a cell polled from the cell traverser. The code still has to iterate over all triangles inside the cell and intersect the ray with each of them. However, instead of keeping only the nearest intersection to the ray origin so far, we have to intermediately store all intersections of the ray with any triangle inside the cell. And here we hit a big problem of GPGPU computing with OpenCL: the lack of dynamic memory allocation.

Every kind of memory which is needed during a kernel execution has to be allocated beforehand. This applies to both global and local memory. Registers are handled a bit differently. Before a new work group is scheduled for execution on a streaming multiprocessor (SM), all required registers for all work items of the work group have to be allocated in the register block on the SM. The number of registers required for a work item must therefore be determined in advance by the compiler for all possible branches of the program, including all local variables, and is a decisive factor for how many work groups may be executed concurrently on a SM.

Let’s consider the available options by an example. We take a common, small screen resolution of 1024 x 768 pixels. For each pixel a ray is created which needs a separate intersection buffer. As we do not know in advance the maximum number of intersections that may occur inside a cell, we have to define an allowed maximum. Let’s say this value is a hundred. We actually had a long discussion about this in the office and thought about an even higher value, but a hundred proved to be sufficient in all used test scenes. Finally, we have to determine the size of the information required to hold an intermediate intersection for further processing. This must be at least the index of the hit facet, an unsigned integer, the distance of the intersection point from the ray’s origin, a floating point value, and a boolean value, padded to four bytes, indicating whether the ray enters the volume of the hit triangle or not. These three values make up twelve bytes. Let’s calculate the total amount of memory required to hold a hundred intermediate intersection results per work item: 1024 * 768 * 100 * 12 = 943,718,400 B = 900 MiB, which is a huge amount of memory for a GPU just for storing temporary data of an algorithm. Furthermore, this value does not include the grid with its cells and triangles. Consequently, storing the intermediate intersections in global memory is not acceptable.

Using local (shared) memory for storing the intermediate intersections per work item would also be possible. However, local memory is quite small, with sizes of 16, 32 or 48 KiB per SM. As each work item requires roughly 1.2 KB (12 * 100 bytes), the largest local memory configuration would only allow 40 work items to run concurrently on a SM, which is an extremely poor occupancy considering the high number of potentially concurrently executed threads per SM, e.g. 2048 on a NVIDIA Kepler GPU [3, p.7].

The third option is using registers. Although 1.2 KB per work item does not fit most GPUs’ hardware limits, e.g. NVIDIA Kepler limits a thread to 63 (GK104) or 255 (GK110) 32 bit registers [3], the kernel can still be launched. This shortage of registers is handled by the compiler and driver by spilling the excess data into global memory. As global memory is accessed far more slowly, register spilling usually causes a loss in performance. However, as the spilled area per work group is rather small, several tens or hundreds of kilobytes, and the area is always accessed by the same SM, this part of global memory benefits strongly from the L1 cache on the SM and may still achieve good performance on reads and writes. Paulius Micikevicius from NVIDIA gave a detailed presentation on this subject in the context of CUDA [9].

As a result of this small analysis, it has been decided that the intersection buffer will be kept as an array inside the kernel. It is then up to the compiler to try to fit the array into registers or spill it into global memory. This buffer is used to store all intersections of the ray with triangles inside the current cell. Afterwards, the intersections inside the buffer have to be sorted according to their distance to the ray’s origin, as the counter has to be altered in the same order as the ray hits the triangles. OpenCL does not provide a sorting routine of any kind. Thus, a sorting algorithm has to be implemented. As the maximum size of the intersection buffer is defined as a hundred, a simple insertion sort will be used, as it performs well on small arrays and has very little overhead considering additional variables. After the buffer has been sorted, all intersections are iterated again, this time in the order the ray hits them, and the counter is incremented or decremented according to the hit triangle’s facing. If the counter reaches the value zero, the surface triangle has been found. The triangle id of the intermediately stored intersection is used to resolve the full triangle with normal vector and recalculate the exact intersection position. This information is then used to calculate the color value of the output pixel.

The output of the SingleBooleanRayCaster is shown in Figure 15. The image does not look much better than the one of the SingleRayCaster and is still incorrect as the grid contains open geometries leading to invalid counting. However, by implementing counting, an important step towards a working implementation has been made.

illustration not visible in this excerpt

Figure 15: Screenshot of the OpenCL kernel output of the SingleBooleanRayCaster.

4.7 SingleBooleanCellRayCaster

The SingleBooleanCellRayCaster takes the previously described SingleBooleanRayCaster from Chapter 4.6 and moves the counter initialization from before the cell traversal into the cell itself, as explained in Chapter 3.4. The main unknown is thus the number of subtraction volumes the ray, in particular its entry point, is inside of at its cell entry. Therefore, we have to decide for each subtraction volume whether the entry point lies inside the volume or not. A simple solution is to take any triangle of a subtraction volume and create a secondary ray from the entry point to a point on this triangle, which has to lie inside the current cell. According to the enclosed angle of the triangle’s normal and the created secondary ray, it can be decided whether the entry point is inside or outside of the subtraction volume the triangle belongs to. However, as multiple triangles of the same volume may potentially be hit by the secondary ray, it has to be ensured that the secondary ray is not intersected by another triangle of the same volume. Otherwise wrong decisions can be made, as the secondary ray may e.g. target a triangle on the back side of the volume while also intersecting its front. The initial counter value starts at zero and is decremented for each subtraction volume the entry point is inside of.

The concrete implementation of this approach makes use of additional data provided in the TriangleCL structure and is further based on an assumption guaranteed by the grid creation code of the main application. The clippedCentroid member of the triangle structure, cf. class diagram in Figure 10, contains the centroid of the resulting polygon after the corresponding triangle has been clipped to its cell’s bounding box. This value is calculated by the SubGrid class when the grid of the main application is flattened into the OpenCL buffers. Furthermore, each triangle has a structureId member, which is an identifier that is equal among all triangles inside a cell belonging to the same subtraction volume (structure). It is additionally guaranteed that triangles belonging to the same structure are stored contiguously in the triangle buffer. Based on these circumstances, the counter initialization algorithm is implemented as follows.

Inside each work item, for each cell returned by the cell traverser, all triangles of the cell are iterated. Additional variables keep track of the current structure id, the current distance to the nearest triangle of the current structure and its normal. In each loop iteration where the structure changes, a secondary ray is created from the primary ray’s entry point to the clipped centroid of the triangle which has caused the structure change. The distance from the entry point to the clipped centroid as well as this triangle’s normal is kept. All following triangles of the same structure are intersected with the created secondary ray. In case of an intersection, the distance to the nearest intersection of the secondary ray is updated, as well as the normal, with the values of the newly intersected triangle. Before the next structure change, the normal of the nearest triangle intersecting the created secondary ray can be used to determine whether the entry point lies inside or outside of the current structure. If it is inside, the counter is decremented. The triangle iteration continues with the next structure analogously. After all triangles of the cell have been iterated, the initial counter value holds the number of subtraction volumes the primary ray’s entry point is inside of. The algorithm can now continue in the same way as the SingleBooleanRayCaster by intersecting all triangles, sorting them by their depth and processing the counter along the ray to find the surface hit.

The SingleBooleanCellRayCaster is the first ray casting approach able to deliver correct results. The output image of the kernel is shown in Figure 16. Apart from a few wrong pixels (which occur because of numerical instability of the used float data type), the result already makes a satisfying impression. These errors do not occur with the CPU caster, which uses double precision. Considering the code however, there is still room for improvement, as the kernel iterates all triangles of the cell twice: once to determine the inside counter and once for the intersections. Furthermore, the few invalid pixels may be eliminated if double precision were used for the intersection routines on the GPU too. These two subjects will be addressed in subsequent chapters.

illustration not visible in this excerpt

Figure 16: Screenshot of the OpenCL kernel output of the SingleBooleanCellRayCaster.

4.8 Double support in calculations

Although the double data type is very common in conventional languages like C++, Java or C#, only recent GPUs support double precision floating point arithmetic. The reason for this is that the higher precision at the cost of speed did not fit the initial requirements of a graphics processing unit, which was optimized for maximum throughput. With the gaining popularity of GPGPU computing, hardware vendors have also integrated double precision support into their hardware. The NVIDIA Kepler architecture for example provides special double precision units in addition to the normal cores [3, p.8]. Some sensitive values of the RayCreationParameters already make use of double precision types, as the program is expected to be run on newer hardware. However, most of the calculations inside the kernel are done using single precision floating point arithmetic.

The anticipated benefit of using double precision for the grid traversal and intersection routines is a reduction in the number of erroneous pixels. These pixels are most often the result of rays which shoot between adjacent triangles, or of angles between secondary rays and normals of almost exactly 90°, where the scalar product is a tiny positive or negative number. As OpenCL 1.2 is still very close to C99, no language features exist which would allow writing generic kernels. AMD has already proposed its OpenCL C++ static kernel language for standardization, which would allow the use of classes, inheritance and especially templates, but it is not yet officially part of OpenCL and currently only supported on AMD systems. Therefore, the OpenCL preprocessor will be used together with a compiler argument to set the desired type. All floating point types involved in calculations inside the OpenCL code will be replaced by a corresponding token, e.g. T, which is common in C++ templates, and defined when the OpenCL compiler is invoked, e.g. using -DT=double. Furthermore, derived types must also be defined to allow switching vector types, like T3 for double3.

This approach works well in practice. By recompiling the OpenCL kernel at run time, the used precision can even be changed on the fly without having to relaunch the main application. A corresponding console command has been implemented.

The SingleBooleanCellRayCaster with double precision already gets rid of most wrong pixels. However, the runtime of the kernel increased to more than twice that of the single precision variant.

To also support older GPUs without double precision, an additional preprocessor option was introduced which also affects the main C++ application. It allows forcing the floating point precision to single. This also overrides the double values of the transfer structures.

4.9 OpenCL source embedding

One of the big strengths of OpenCL is its portability across many different hardware platforms. This big advantage is achieved by delaying the compilation of the OpenCL source code until the application’s actual execution on the target platform. However, this concept usually forces programmers who put their OpenCL code in dedicated files to ship parts of their source code with the compiled application. This scenario is undesired or even prohibited in commercial software development, where the customer should not gain access to a product’s source code. Furthermore, keeping the OpenCL source files in the correct place for the application to find them at run time requires additional deployment overhead. Fortunately, both problems can be solved by embedding, and optionally even encrypting, the OpenCL source code into the actual program. Although several easy solutions exist on different operating systems or technologies, like resource files on Windows, no cross platform or C++ standard compliant method for storing an application’s resources is available.

However, an OpenCL source file basically contains text and text can be statically stored inside an application using a string literal, which is an array of characters. It is therefore possible to store the required OpenCL code inside a large character array which is compiled into the executable. Fortunately, the corresponding OpenCL API call for creating an OpenCL program from source does not take the name of a file as argument but the actual source code. Therefore, the embedded character array can be directly passed to OpenCL.

Nevertheless, editing source code in a separate file is more comfortable for the programmer and better handled by IDEs, concerning syntax highlighting and debugging, than writing OpenCL code as string literals directly into the existing C++ code. Therefore, a separate conversion tool has been developed which reads a given input file and outputs a C++ header file with a character array containing the content of the input file. This header can then be included and used by the existing C++ application. Modern build systems like MSBuild and Visual Studio also support the configuration of custom build steps, which allows the creation of includable headers from the OpenCL source files as part of the build process in Visual Studio. Visual Studio therefore calls the external conversion tool for every OpenCL source file which is part of the project before it starts compiling the C++ files.
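A minimal version of such a conversion step could look like the following sketch; the function and array names are illustrative assumptions, not the project’s actual tool:

```cpp
#include <sstream>
#include <string>

// Convert OpenCL source text into a C++ header declaring a character
// array. Each source line becomes an escaped string literal; adjacent
// literals are concatenated by the C++ compiler. (Illustrative sketch,
// not the project's actual conversion tool.)
std::string makeHeader(const std::string& arrayName, const std::string& source) {
    std::ostringstream out;
    out << "static const char " << arrayName << "[] =\n";
    out << "    \"";
    for (char c : source) {
        switch (c) {
            case '\\': out << "\\\\"; break;            // escape backslashes
            case '"':  out << "\\\""; break;            // escape quotes
            case '\n': out << "\\n\"\n    \""; break;   // new literal per line
            default:   out << c;
        }
    }
    out << "\";\n";
    return out.str();
}
```

The generated header can then be included by the host application and the array passed directly to clCreateProgramWithSource.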

This solution works for simple OpenCL source files. However, the preprocessor’s #include directive allows an OpenCL source file to include other source files. If source code containing includes were embedded directly, the application would fail to include the referenced files at run time after deployment. Therefore, a little more logic has to be added to the external conversion tool. Specifically, the OpenCL source code has to be scanned for #include directives, which have to be resolved recursively, as included files can include further files, before the text can be converted into the output character array. Furthermore, as this process drops the original file names and line numbers, proper #line directives have to be inserted to allow the compiler to report errors at their correct origin. Although this is basically a small implementation of a part of a preprocessor, an existing one could not be used, as conditional compilation should still be possible after the conversion by keeping #ifdef and similar directives.
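The recursive include resolution with inserted #line directives could be sketched as follows; the in-memory file map and the function name are illustrative assumptions:

```cpp
#include <map>
#include <sstream>
#include <string>

// Recursively resolve #include "..." directives against an in-memory
// file map, inserting #line directives so the OpenCL compiler reports
// errors against the original files. Conditional directives such as
// #ifdef are passed through untouched. (Illustrative sketch.)
using FileMap = std::map<std::string, std::string>;

std::string resolve(const FileMap& files, const std::string& name) {
    std::ostringstream out;
    std::istringstream in(files.at(name));
    std::string line;
    int lineNo = 0;
    out << "#line 1 \"" << name << "\"\n";
    while (std::getline(in, line)) {
        ++lineNo;
        auto pos = line.find("#include \"");
        if (pos != std::string::npos) {
            auto start = pos + 10;
            auto end = line.find('"', start);
            out << resolve(files, line.substr(start, end - start));
            // restore position in the including file
            out << "#line " << lineNo + 1 << " \"" << name << "\"\n";
        } else {
            out << line << "\n";
        }
    }
    return out.str();
}
```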

4.10 SinglePrimaryBooleanCellRayCaster

The last single ray caster implementation takes the previous one, the SingleBooleanCellRayCaster from Chapter 4.7, and merges the two loops which iterate over the triangles of the current cell into one. Furthermore, secondary rays are only cast when the primary ray does not hit a structure.

The host code for this caster implementation is exactly the same as for the SingleBooleanCellRayCaster. The kernel starts equally by computing the ray direction and starting the grid traversal using the cell traverser. For each cell, the initial counter value has to be calculated as well as all intersections. These two steps were executed one after the other in the previous implementation and are now merged. Before iterating over the triangles of the current cell, the intersection buffer is declared together with a variable holding the current structure id, the start triangle index of the current structure, as well as the normal vector of and distance to the nearest intersection of the current structure. Then, the algorithm starts to poll the triangles of the current cell from global memory. The structureId member is again monitored inside the loop over the triangles. On a structure change, and on the first loop iteration, the index of the current triangle is saved as the start index of the new structure. While iterating over the triangles of this structure, the primary ray is intersected with each triangle. If an intersection is found, it is added to the intersection buffer, and the distance from the primary ray’s origin to the intersection point as well as the hit triangle’s normal vector are stored. Before the next structure change, and after the last triangle of a structure, the algorithm checks if the primary ray has intersected a triangle of the current structure. If this is the case, we can determine whether the entry point lies inside the current structure by using the normal vector of the hit triangle and the primary ray’s direction. If the primary ray did not hit any triangle of the structure, we have to use secondary rays. In this case, the triangles of the current structure are iterated again. The start is easily found, as we have stored the start triangle index on each structure change for exactly this reason.
A secondary ray is created towards the clipped centroid of the first triangle, and all subsequent triangles of the same structure are intersected with the secondary ray. The nearest hit of the secondary ray is then used to determine whether the primary ray is inside this structure or not, just as in the previous implementation for all structures. After all triangles have been iterated, we have the buffer of intersections as well as the initial inside counter value and can continue with depth sorting the intersections and processing the counter along them to find the surface hit.

This approach has two advantages over the previous one. Firstly, each triangle of a cell is only requested once from global memory instead of twice, which should improve performance. Secondly, using the primary ray for determining whether the entry is inside a volume or not produces more stable results, as many of the erroneous pixels have disappeared. This is the final single ray casting variant, which is also used as the main application’s default ray casting routine. The output of the ray caster is shown in Figure 17.

Further optimization of this implementation is surely possible but would require the help of advanced tools like a GPU profiler. However, NVIDIA dropped OpenCL support with CUDA Toolkit 5.0, which was used during the internship. This decision was not officially announced, but the topic is discussed in several internet forums [13].

4.11 PacketRayCaster

The PacketRayCaster takes a different approach than the single ray variants and is primarily motivated by the huge performance boost achieved by the packet-based CPU implementation compared with the previous single ray CPU version. The general performance consideration is to keep a group of adjacent rays as coherent as possible to benefit from better vectorization and caching behavior. Furthermore, traversing whole packets instead of each single ray reduces the overall control logic.

Although the PacketRayCaster shares some code with the final SinglePrimaryBooleanCellRayCaster, a lot of changes to the kernel code are necessary. Fortunately, the host code only requires a small adjustment, which is to define a packet width and height which are used as the kernel’s local work size.

illustration not visible in this excerpt

Figure 17: Screenshot of the OpenCL kernel output of the SinglePrimaryBooleanCellRayCaster.

The remaining host code stays the same, meaning the kernel is again enqueued using a two-dimensional range with the size of the output image, where each dimension is rounded up to the next multiple of the defined packet width or height.
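The rounding of the global work size to a multiple of the packet dimensions can be expressed with a small helper; the function name is an assumed example, shown as a sketch:

```cpp
#include <cstddef>

// Round a global work size dimension up to the next multiple of the
// local work size, as required when the packet width and height are
// used as the kernel's local work size.
std::size_t roundUpToMultiple(std::size_t global, std::size_t local) {
    return ((global + local - 1) / local) * local;
}
```

Work items enqueued beyond the actual image size simply return without writing a pixel, which is the usual pattern when the global range is padded this way.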

The kernel code then begins again by determining the ray direction for each work item using the RayCreationParameters. Afterwards, the four corner rays are marked by declaring an integer for each work item which is initialized to minus one and only set to a valid index, from zero to three, by the four work items in the corners of the work group, identified using their local ids. This integer can then be easily used to detect if the current work item is a corner ray or not. Furthermore, the four corner ray work items write their ray directions into a shared array in local memory.

Now the traversal algorithm can be set up. The packets are navigated through the grid slice by slice as described in Chapter 2.4. However, a traversal is only necessary if the ray packet actually hits the grid. This check is performed by trying to intersect the outer square of rays of the packet, called peripheral rays, with the bounding box of the grid. If any of them hits, a traversal is necessary. Then, a main traversal axis has to be determined, which is done by selecting the axis where the corresponding component of the central ray’s direction vector has the largest magnitude. All corner rays are then intersected with the front plane of the first slice of the grid. The main traversal axis as well as the front plane intersection points are stored in local memory. The slice traverser is now initialized.

The kernel then continues with the traversal loop, which polls cells from the slice traverser. The slice traverser keeps state information about the current slice and the current cell’s coordinates inside the current slice. Initially, when the first cell is requested, the slice traverser intersects the four corner rays with the back plane of the current (first) slice. Using the maximum and minimum values of the intersection points on the front and back plane of the slice along the non-traversal axes, the rectangle on the slice which has been hit by the packet can be determined. This rectangle is extended to full cell boundaries, and the slice traverser can return the first cell of this range. Subsequent cell requests can be satisfied by returning further cells of this range until all cells of the hit slice region have been iterated. Then the slice traverser has to move on to the next slice by intersecting the corner rays with the back plane of the next slice. Using the intersection points of the last slice’s back plane as intersection points of the now current slice’s front plane, the affected cell region can again be determined. This process is repeated until all cells of the last slice have been returned or the traversal is aborted.
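The determination of the hit cell range on a slice along one non-traversal axis can be sketched as follows; the names and parameters are illustrative, not the project’s actual code:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>

// Given the minimum and maximum coordinate covered by the packet's
// corner-ray intersections on the front and back plane of a slice
// (along one non-traversal axis), compute the inclusive range of cell
// indices, extended to full cell boundaries and clamped to the grid.
std::pair<int, int> cellRange(float minCoord, float maxCoord,
                              float gridOrigin, float cellSize, int numCells) {
    int first = static_cast<int>(std::floor((minCoord - gridOrigin) / cellSize));
    int last  = static_cast<int>(std::floor((maxCoord - gridOrigin) / cellSize));
    first = std::max(first, 0);             // clamp to grid bounds
    last  = std::min(last, numCells - 1);
    return {first, last};
}
```

Applying this per non-traversal axis yields the rectangle of cells on the slice which the traverser then iterates.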

For each cell returned by the slice traverser, the packet has to be intersected with the triangles inside the cell. However, before all rays inside the packet are intersected with all triangles, several optimizations are applied. Each corner ray work item creates a plane using the next clockwise-adjacent corner ray direction from shared memory. The result is four planes bounding the frustum of the ray packet, which allows culling the cell’s triangles against it. However, this frustum culling step can only be done by one thread of the packet, which has to actually hit the cell; a cell hit by the ray packet does not necessarily need to be hit by all rays of the packet.

The culling is done by iterating over all triangles of the current cell. Each triangle’s vertices are checked against all frustum planes. If all vertices lie outside of one plane, the triangle is culled. This approach does not cull all possible triangles, but it is fast and easy to implement, as it does not require any intersection tests. The indexes of all triangles which are not culled are accumulated in a buffer in local memory and are therefore visible to all work items afterwards. However, the culling routine has to keep an eye on the centroids of the unculled triangles. As triangles of a structure lying outside the packet’s frustum are culled, a secondary ray targeting a centroid outside the frustum may lead to wrong results, as a triangle potentially intersecting the secondary ray may have been removed during culling. Therefore, the culling routine has to ensure that the first triangle of each structure provides a valid centroid lying inside the packet’s frustum. A further optimization is also possible during the culling routine. By also monitoring the structure changes, we can detect the case in which all triangles of a structure are culled. If this is the case, a secondary ray can be used to determine whether the frustum lies inside this culled structure, which allows skipping the current cell entirely.
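The vertex-based culling test described above can be sketched as follows; the plane representation and names are assumptions for illustration:

```cpp
#include <array>

struct Vec3f { float x, y, z; };
struct Plane { Vec3f n; float d; };  // points with dot(n,p)+d < 0 lie outside

float dot(const Vec3f& a, const Vec3f& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// A triangle is culled if all three vertices lie outside the same
// frustum plane. This is conservative: some triangles outside the
// frustum survive, but no triangle intersecting the frustum is removed.
bool cullTriangle(const std::array<Vec3f, 3>& tri,
                  const std::array<Plane, 4>& frustum) {
    for (const Plane& p : frustum) {
        int outside = 0;
        for (const Vec3f& v : tri)
            if (dot(p.n, v) + p.d < 0.0f) ++outside;
        if (outside == 3) return true;  // fully outside one plane
    }
    return false;
}
```

The test requires only dot products per vertex and plane, which is why it is cheaper than full intersection tests.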

After the culling step has completed, a barrier operation is executed to ensure that the culling thread has finished writing the indexes of the unculled triangles into local memory before any work item starts to read them. After this synchronization point, each work item proceeds with the intersection routine, taking the array of unculled triangle indexes from shared memory. The intersection is done in the same way as in the SinglePrimaryBooleanCellRayCaster, by iterating all triangles and intermediately storing the intersections in a buffer. The initial value of the inside counter is also determined in this step. Secondary rays are used if the primary ray does not intersect a subtraction volume. The intersections are then sorted again and the counter is processed along them to find the surface hit.

Although the packet ray caster produces visually equivalent results to the SinglePrimaryBooleanCellRayCaster, cf. Figure 18, it is, despite the optimizations, slower than the single ray variant. The reason is probably the large number of synchronization points (barriers) encountered during kernel execution, which are needed every time shared state is written, and this happens a lot inside the slice traverser. However, a detailed analysis requires the use of professional tools like profilers, which were not used during the internship, cf. the end of Chapter 4.10 for the reason. Another, yet unsolved problem is that the packet ray caster crashes occasionally, especially when enqueued multiple times with short delay or when being compiled with double precision.

illustration not visible in this excerpt

Figure 18: Screenshot of the OpenCL kernel output of the PacketRayCaster.

4.12 Migration to new host application

While the main development of Enlight happened using the code base of an existing, RISC-internal framework designed for easy OpenGL integration and scene management, Michael Hava, one of the project members, was already working on a newer, leaner main application which would then also be used for the final product. The main design consideration of this new application was to get rid of legacy code, like the OpenGL framework and MFC, and to separate the application into components using a self-written C++ component framework by Michael Hava called CWC. Furthermore, by using the cross-platform GLFW library for window management and user interaction, the application can be run on all major platforms.

Although the complete main application was rewritten, the final port of the OpenCL caster suite to this new prototype only required an adaptation of the OpenCLCasterWrapper class. Of main concern were the RegularGrid class and the PixelGrid class, which were both replaced by other structures. Updates to the OpenCL driver now occur separately for each cell and may be executed concurrently by multiple threads. Therefore, thread safety of the OpenCLCasterWrapper’s methods had to be ensured, which was implemented using C++11’s new thread library. Furthermore, the concept of configurable driver options was generalized to key-value pairs settable by the main application. Examples include floating point precision, the currently used caster implementation, work group size, OpenCL platform or device, and out-of-core casting. The final and largest change was to the output format of the casters, which was changed from an image with red, green, blue and depth channels to the hit triangle id and distance as well as the barycentric coordinates of the intersection point on the triangle. The actual color image is then calculated by the main application, separating the shading step from the ray casting algorithms.

4.13 Out-of-core ray casting

The final implemented feature is optional out-of-core ray casting, which affects all ray casters. The basic concept of out-of-core algorithms is to allow the processing of data structures which are larger than the available system memory. Concerning the GPU ray casting implementation, scenes should be supported where the grid representation is larger than the available memory on the graphics card. The available main memory is still a limit for the scene’s size.

When designing an out-of-core approach for the grid based GPU ray casters, we are limited by the design of OpenCL. OpenCL requires that all global memory objects, like the three buffers where the grid, cells and triangles are stored, are created and fully initialized before a kernel is executed. Furthermore, these objects cannot be changed during the execution of a kernel. Additionally, communication between the host application and a running kernel is one-way; there is no way for a kernel to communicate back to the host application. These constraints limit the design of an out-of-core caster such that all required information for a kernel has to be set up before the kernel is enqueued, and the kernel cannot request further data during its execution. Therefore, the grid has to be divided into smaller sub grids which have to be ray casted individually and merged into a final output image. Fortunately, dividing the grid causes no problems, as the ray casting algorithms already operate on a cell basis. Also, merging the output images poses no problems, as every pixel contains depth information.

Out-of-core casting can be enabled on the console of the main application, and an optional out-of-core buffer size may be set for testing purposes. If this value is not specified, it is set to the maximum available buffer size. When the main application passes the grid to the OpenCLCasterWrapper by calling the update method, the wrapper checks if out-of-core casting is enabled. If this is the case, it passes the grid to the static SubGrid::divideGrid method, which recursively splits the grid in half, changing the split axis on each recursion, until each sub grid requires less memory than the current out-of-core buffer size. These sub grids are then flattened and stored in the subGrids member of the wrapper.
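The recursive splitting can be sketched on cell-index boxes; the structure and the per-cell memory estimate below are simplifying assumptions, not the actual SubGrid::divideGrid implementation, which accounts for cells and their triangles:

```cpp
#include <cstddef>
#include <vector>

// Axis-aligned cell-index box of a sub grid.
struct Box {
    int min[3];
    int max[3];  // exclusive
};

// Recursively split a box in half, alternating the split axis on each
// recursion level, until the estimated memory of each sub grid fits
// into the out-of-core buffer budget.
void divide(const Box& b, std::size_t bytesPerCell, std::size_t budget,
            int axis, std::vector<Box>& out) {
    std::size_t cells = 1;
    for (int i = 0; i < 3; ++i)
        cells *= static_cast<std::size_t>(b.max[i] - b.min[i]);
    if (cells * bytesPerCell <= budget || cells <= 1) {
        out.push_back(b);  // fits into the budget (or cannot split further)
        return;
    }
    // skip axes that cannot be split further
    while (b.max[axis] - b.min[axis] < 2) axis = (axis + 1) % 3;
    int mid = (b.min[axis] + b.max[axis]) / 2;
    Box lo = b, hi = b;
    lo.max[axis] = mid;
    hi.min[axis] = mid;
    int next = (axis + 1) % 3;  // change split axis on each recursion
    divide(lo, bytesPerCell, budget, next, out);
    divide(hi, bytesPerCell, budget, next, out);
}
```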

When the scene is ray casted, the sub grids are depth sorted depending on the current camera position. A loop then iterates over the sub grids, starting at the nearest. Each sub grid is transferred to the GPU and ray casted. Each work item of a subsequent ray cast (the second to the last one) then checks in the buffer containing the output image whether the corresponding pixel has already been written to. If this is the case, the work item can exit. This approach ensures that no pixels are ray casted twice. Therefore, apart from the memory transfers of the sub grids, which have to be done for every ray casted image, no additional overhead is produced.
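The front-to-back ordering of the sub grids can be sketched as a sort by squared distance of the sub grid centers to the camera; the types and names are illustrative:

```cpp
#include <algorithm>
#include <vector>

struct SubGridInfo {
    float cx, cy, cz;  // center of the sub grid's bounding box
    int id;
};

// Sort sub grids front to back by squared distance of their centers to
// the camera, so the nearest sub grid is ray casted first and farther
// ones can skip pixels that were already written.
void depthSort(std::vector<SubGridInfo>& grids,
               float camX, float camY, float camZ) {
    auto dist2 = [&](const SubGridInfo& g) {
        float dx = g.cx - camX, dy = g.cy - camY, dz = g.cz - camZ;
        return dx * dx + dy * dy + dz * dz;
    };
    std::sort(grids.begin(), grids.end(),
              [&](const SubGridInfo& a, const SubGridInfo& b) {
                  return dist2(a) < dist2(b);
              });
}
```

Using the squared distance avoids a square root per comparison without changing the ordering.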


5 Results

This chapter covers the final results of the ray casting development and provides appropriate benchmarks. Furthermore, experiences made during development, especially concerning the available tools, are shared. Finally, existing and unsolved problems are discussed.

5.1 Benchmarks and comparison with existing casters

The algorithms developed during the internship were finally tested on a few larger scenes which come closer to real world applications of Enlight. The cylinder head used throughout this thesis, cf. for example Figure 6, is often used for quick demonstration or debugging, as it is a rather small scene which can be loaded in roughly a second. Besides the cylinder head, a lot of other scenes, more or less fitting the purpose of Enlight, are available and can be loaded within the main application. From these scenes, two more have been chosen which show interesting behavior. The first additional scene is a Menger sponge, which is a fractal geometry. It can easily be generated programmatically. Due to their recursive nature, Menger sponges provide an easy way of generating complex meshes just by adjusting a parameter of the generation script. In the following benchmarks, a level five Menger sponge will be used. The second additional scene for ray casting is a multi impeller, which is a technical component of many centrifugal pumps. Figure 19 shows screenshots of the additional scenes.

illustration not visible in this excerpt

Figure 19: Screenshots of the OpenGL visualization and the OpenCL kernel output of the SinglePrimaryBooleanCellRayCaster of a level 5 Menger sponge and a multi impeller.

illustration not visible in this excerpt

Table 2: Geometric information about the used test scenes.

Table 2 shows various geometric information about the three scenes used for benchmarking. The cylinder head is rather small, with only 21 subtraction volumes and 28,000 triangles. The Menger sponge of level five has been chosen for its large number of subtraction volumes. A level six Menger sponge was also tested once, which is the first scene to exceed the 100,000 subtraction volume mark with roughly 110,000 cuboids. However, it consumes the full available system memory and forces the operating system to swap. As a consequence, the classification of the scene takes several hours to complete, during which the used workstation cannot be used for anything else. Therefore, the level five variant has been chosen. A further interesting observation is that although the initial geometry of the Menger sponge only consists of around 0.17 million triangles, this number multiplies by almost 57 to 9.58 million when the triangles are assigned to the cells they intersect. This is one of the large drawbacks of using a grid as acceleration structure. The multi impeller provides a real world example of subtractive manufacturing. It mainly impresses by its large triangle count of 12.5 million. Furthermore, the multi impeller profits tremendously from the cell classification, as only 11% of all cells are surface cells. The smaller number of cell triangles, when compared with the original, unique ones, is explained by the many subtraction volumes which partly stick out of the grid.

Each scene is loaded and then ray casted ten times by each chosen ray casting implementation using an output resolution of 1024 x 768 pixels. The average run time is shown in Table 3. All times are in milliseconds.

illustration not visible in this excerpt

Table 3: Run time of several casters, complete update time and memory transfer time on different test scenes. All times are in milliseconds.

The first thing one can notice about the benchmark results is that the single precision variant of the final SinglePrimaryBooleanCell caster is faster in all scenarios than the CPU implementation and produces visually equivalent results to the slower CPU variant, which runs in double precision. Furthermore, the cylinder head and the level five Menger sponge can both be casted interactively with 39 and 24 frames per second (FPS). The multi impeller is more calculation intensive, only achieving around 8 FPS. Nonetheless, these results are very pleasing, as they prove that scenes of this complexity can still be ray casted in reasonable time. The SingleBooleanCell caster is, surprisingly, a little bit faster on larger scenes. However, this small advantage is offset by more erroneous pixels. The packet ray caster is far behind all other ray casting approaches, which is mostly due to the high number of synchronization points inside the kernel. Furthermore, switching from single to double precision causes a serious performance loss in all tested kernels. Although a run time decrease by a factor of two to three was expected, as the Kepler architecture has only a third as many double precision units as normal cores, the performance drop was even worse on larger scenes. Nevertheless, single precision has proven to deliver good and stable results in most cases.

Apart from the casting times, one should also notice the time required for an update of the OpenCL buffers. Although the memory transfer times seem reasonable, the time required for the complete update is enormous. The most time consuming operations during the update are the flattening process and the calculation of the clipped centroids. The latter could be avoided if the centroids were precalculated when the triangles are added to the grid.

5.2 Development

During the internship, various software tools have been used for developing OpenCL applications. Some of the experiences made might be of value for future developers and are therefore shared here:

NVIDIA Visual Profiler

This separate tool from NVIDIA is shipped with their CUDA SDK and provides a standalone debugger and profiler for GPGPU applications. Although mainly developed for aiding CUDA developers, it should support OpenCL up to CUDA SDK version 4.1. However, this version was tried without success.


NVIDIA Nsight

Nsight is NVIDIA’s second GPGPU debugging and profiling tool and requires a free registered developer account before it can be downloaded. It provides excellent Visual Studio integration and supports OpenCL according to NVIDIA’s website. However, Nsight requires the presence of at least two GPUs, where one is used for debugging kernels step by step and the second one handles the display while the first one is busy. Nsight also allows remote debugging.

Intel SDK for OpenCL Applications

Intel also supports OpenCL for their CPUs and on-chip graphics controllers. By installing Intel’s SDK for OpenCL Applications, one gets additional Visual Studio plugins which allow creating dedicated OpenCL projects. Syntax highlighting for OpenCL source files is configured, and a kernel file can be compiled offline using a new context menu entry to check for compilation errors without having to start the actual application. Furthermore, a simple debugger is integrated which allows debugging a single work item, which has to be selected before the main application is started. However, this debugger is limited to OpenCL programs running on the CPU and cannot be used in the case of crashes, as other work items are processed in parallel to the debugging session of the selected one. Also, breakpoints do not work when set in files included by a kernel. In general, the tool, although useful, still leaves a somewhat immature impression.


gDEBugger

Graphic Remedy’s gDEBugger is a standalone debugger, profiler and memory analyzer for OpenGL and OpenCL applications. It is additionally independent of the used hardware platform and freely available. The tool shows basic information for a process it is attached to, like the memory consumption of various objects inside open OpenGL or OpenCL contexts. It also offers a time line showing enqueued OpenCL operations, like enqueued kernels or memory transfers, and highlights their start time and duration. This allows a developer to optimize a device’s occupancy. Although it offers the possibility of creating breakpoints at certain API calls and viewing the source code of kernel objects, stepping through kernel code is not supported.


AMD CodeXL

CodeXL is AMD’s flagship tool for OpenCL development and comes with excellent Visual Studio integration. The tool suite provides a sophisticated debugger which allows debugging all work items in parallel. For this purpose, special multi-watch windows can be opened for tracking variables across all work items, shown in a tabular view. Breakpoints and stepping are fully supported within Visual Studio. Furthermore, an application can be launched in profiling mode, where CodeXL reads important performance counters from the GPU for each executed kernel. This information is very useful in hunting down bottlenecks and optimizing kernels for specific GPUs. Additionally, a static kernel analyzer is provided which can show the disassembly of a kernel and provides general static information such as consumed registers or required local memory. Unfortunately, CodeXL is limited to AMD cards and has therefore not been used during the internship. However, it proved to be highly useful during the development of the algorithms presented in the first part of this thesis, “GPGPU Computing with OpenCL”.

5.3 Existing problems

Although the developed ray casters achieve good performance and provide visually satisfying results, there is still room for improvement. One of the biggest problems for the boolean counting kernels is the intersection buffer, which requires 1.2 KiB of memory for each work item. This (relatively) large amount of memory cannot be held in registers and is spilled into the slow global memory. Some time was spent during the internship on thinking about how this buffer could be avoided. A possible alternative is to organize the triangles inside a cell in an intelligent data structure which allows iterating the triangles in depth-sorted order, e.g. a BSP tree. However, this would add additional overhead to the grid management routines, as these data structures have to be maintained or rebuilt each time geometry is added to the grid. Furthermore, such data structures also add more complexity to the memory layout of the triangles. Finding a better solution to this problem would probably result in a large performance boost.

A second problem is the memory access pattern. GPUs are designed for so-called coalesced memory reads, which means that consecutive work items should read from consecutive addresses in global memory. This is definitely not the case in all caster implementations, as cells and triangles are accessed at almost random locations. There is probably also room for optimization here, e.g. by copying all triangles of a cell into shared memory using one large fetch across all work items instead of every work item requesting all triangles.
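The suggested cooperative, coalesced-friendly load can be illustrated by simulating the strided access pattern sequentially on the host; this is a sketch of the idea, not actual kernel code:

```cpp
#include <cstddef>
#include <vector>

// Simulates 'localSize' work items cooperatively copying 'global' into
// 'local' memory: work item 'lid' copies elements lid, lid+localSize, ...
// so that in each step consecutive work items touch consecutive
// addresses, which real GPU hardware can coalesce into few transactions.
std::vector<int> cooperativeCopy(const std::vector<int>& global, int localSize) {
    std::vector<int> local(global.size());
    for (int lid = 0; lid < localSize; ++lid)            // each work item...
        for (std::size_t i = lid; i < global.size(); i += localSize)
            local[i] = global[i];                        // ...strided copy
    return local;
}
```

On the device, the outer loop runs implicitly in parallel across the work group, and a barrier after the copy would ensure all triangles are in local memory before intersection starts.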

Finally, the results of a profiler would also be very valuable to detect bottlenecks which are not as obvious as the previous ones. However, profiling an OpenCL application on NVIDIA hardware seems to be tedious with the current tools provided by this hardware vendor. Unfortunately, no AMD graphics cards are available at RISC.

6 Summary and conclusion

The RISC Software GmbH is a limited liability company and part of the software park Hagenberg. It focuses on the practical application of research done in the corresponding RISC institute. In mid-2011, the project Enlight was started with financial help from the governmental Regio 13 program. The goal of Enlight is to create a ray casting solution for interactive visualization of complex geometries. Subtractive manufacturing is the main inspiration for the project. At the start of the internship in April 2013, most functionality was already implemented. The goal was therefore to try a different ray casting approach using GPGPU computing and OpenCL.

Ray casting is a widespread technique for creating a two-dimensional image of a three-dimensional scene by casting rays from a camera position through the pixels of an image plane in space onto which the final image is projected. Ray casting has a different run time complexity than traditional rasterization, from which especially large scenes benefit. Ray casting also allows easily generating images of implicit geometries like CSG models, where the scene is described by boolean combinations of volumes. In this case, counters are used to count volume entries and exits until the surface hit is found. To accelerate ray casting, several data structures are used to organize the scene more efficiently. One of them is the regular grid, where the scene is subdivided into equally sized cubes. Grids are advantageous over other widespread data structures like kd-trees for being easy and fast to construct, better facilitating dynamic scenes. During ray casting, grids are traversed by individual rays using a 3D variant of the DDA algorithm. Ray packets are guided through the grid slice by slice.

OpenCL is an open and free standard for parallel and general purpose programming targeting heterogeneous platforms. It is most often used to write programs, called kernels, for GPUs. To use OpenCL in an application, an SDK is required to provide the necessary header files and libraries. A typical OpenCL application starts by selecting an available platform and device as well as creating a context and a command queue. Kernels are written in OpenCL C and compiled at run time for the chosen device. Buffers may be created to pass larger blocks of data between the host application and a kernel. Kernels are executed in an n-dimensional range which determines the number of work items which should be executed. The memory model of OpenCL closely resembles modern GPUs by distinguishing between global, local, private (registers) and constant memory.

The existing prototype at the time the internship started is a C++ application built using Visual Studio and Intel’s C++ compiler. It heavily uses AVX intrinsics to process ray packets as fast as possible in a SIMD fashion, thus limiting the application to newer processor types. The acceleration structure used is a regular grid. By classifying the grid cells every time a subtraction volume is added, the relevant cells for ray casting can be detected, leading to a further reduction of intersection tests. By using subtraction volumes to express complex geometries, the ray casting algorithm has to use counters for volume entries and exits in order to find the implicit surface.

During the internship, several OpenCL ray casters were developed together with a small OpenCL driver running these casters. The focus was mainly on single-ray variants, as they involve no synchronization between individual rays. Several advanced features were added, such as double precision support, source file embedding and out-of-core ray casting. The resulting infrastructure was finally ported to a new prototype, which will be used for public demonstrations and provides the basis for a final product.

The benchmark results show that the OpenCL implementation can definitely compete with the existing CPU variant, achieving a speedup of two to four on different scenes using single precision. The visual quality of the output can be considered equal to that of the CPU double precision implementation. During development, several tools were tried and evaluated. However, a few open issues remain whose resolution could further improve the ray casting performance on GPUs.

In conclusion, the OpenCL port of the ray caster was successful. Ray casting is inherently parallel, which can be exploited to great advantage in GPGPU computing. However, we also saw that writing programs for the GPU is not as easy as developing an algorithm on the CPU. Commonly used and well-established concepts like dynamic allocation and communication between threads are suddenly unavailable when programming with OpenCL. Aspects like memory access patterns, branch divergence between threads and register footprint come into play which are usually neglected in traditional software development.

Although the graphics processing unit becomes more and more capable of executing non-graphical tasks and running algorithms very dissimilar to the ones it was initially designed for, GPUs are still not general purpose processors. A graphics card remains a highly specialized instrument whose main goal is accelerating computer graphics. Therefore, only a small subset of problems actually benefits from being run on a GPU. Ray casting is fortunately one of them. However, we have also seen that approaches with less independent parallelism, like the packet ray caster with its high amount of synchronization, lead to a drastic loss of throughput.

Finally, the developer tools available for OpenCL are far from being as satisfying as their CPU counterparts. During the internship, Intel’s VTune Amplifier was used for optimizing CPU code and was amazingly helpful. Debugging C++ code, even when run in multiple threads, is also well supported by today’s debuggers like the one integrated in Visual Studio. GPGPU computing is still a young discipline, also for vendors. Several of the provided tools, such as Intel’s OpenCL debugger or AMD’s CodeXL, proved immature and unstable. Some tools are also only available on a certain kind of hardware, like NVIDIA’s Visual Profiler or AMD’s CodeXL. Although consistent improvements and advances are being made, there is still a lot of work to do to make GPU development, especially with OpenCL, as convenient and easy as traditional CPU oriented software development.

Concerning the future of Enlight, the initial planning schedules the project’s end for December 2013. During this time, a lot of innovative work has been done and submitted to various conferences. Especially the idea of classification was new and proved to largely accelerate the ray casting of scenes described by subtractive volumes. RISC is currently discussing the integration of the ray casting technique developed during Enlight into other customer products, thus bringing the gained knowledge into application. Furthermore, a follow-up project is under consideration, which may evaluate ray casting on a different but promising new piece of hardware on the high performance market: Intel’s many integrated core (MIC) coprocessor cards. Being designed for massively parallel acceleration using existing software technologies and languages, Intel’s MIC architecture does not suffer from the design constraints of a GPU, as it consists of a huge array of conventional Intel x86 cores. Who wants to be annoyed by GPGPU computing then?

List of Figures

1 Principle of ray casting
2 Traversing the cells of the grid using a DDA variant until an intersection has been found
3 Boolean ray casting
4 A ray packet traversing a regular grid slice by slice
5 A two-dimensional NDRange
6 Screenshots of the existing prototype
7 Simplified class diagram of the most important classes involved in ray casting at the beginning of the internship
8 Principle of classifying cells of a grid according to the added mesh
9 Determination of the inside counter upon cell entry
10 Simplified class diagram of the OpenCL ray casting driver
11 Screenshot of the OpenCL kernel output of the TestGradientCaster
12 Screenshot of the OpenCL kernel output of the TestCameraCaster
13 Layout of the OpenCL buffers used to store the flattened grid
14 Screenshot of the OpenCL kernel output of the SingleRayCaster
15 Screenshot of the OpenCL kernel output of the SingleBooleanRayCaster
16 Screenshot of the OpenCL kernel output of the SingleBooleanCellRayCaster
17 Screenshot of the OpenCL kernel output of the SinglePrimaryBooleanCellRayCaster
18 Screenshot of the OpenCL kernel output of the PacketRayCaster
19 Screenshots of the OpenGL visualization and the OpenCL kernel output of the SinglePrimaryBooleanCellRayCaster of a level 5 Menger sponge and a multi impeller


References

[1] John Amanatides and Andrew Woo. “A fast voxel traversal algorithm for ray tracing”. In: Eurographics ’87. 1987, pp. 3-10.

[2] Intel Corporation. Intel® SDK for OpenCL* Applications 2013. 2013. URL: http://soft (visited on 2013-09-06).

[3] NVIDIA Corporation. NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110. The Fastest, Most Efficient HPC Architecture Ever Built. Version 1.0. Jan. 28, 2013. URL: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf (visited on 2013-04-13).

[4] Dipl.-Ing. Wolfgang Freiseisen. RISC Software GmbH. RISC Software GmbH. 2013. URL: (visited on 2013-08-27).

[5] RISC Software GmbH. “Enlight - Visualisierung und Modellierung sehr komplexer, detail- reicher Geometriemodelle”. ger. Project proposal. Mar. 28, 2011.

[6] Michael F. Hava. “Ray Casting von triangulierten Objekten. Bachelorarbeit Teil 2”. ger. Bachelor thesis. FH Oberösterreich, June 2011.

[7] David J. Kasik, Dinesh Manocha, and Philipp Slusallek. “Guest Editors’ Introduction: Real-Time Interaction with Complex Models”. In: Computer Graphics and Applications, IEEE 27.6 (2007), pp. 17-19. ISSN: 0272-1716.

[8] Mark Kilgard. Ray Casting & Tracing. University of Texas. Apr. 17, 2012. URL: http://w (visited on 2013-08-28).

[9] Paulius Micikevicius. Local Memory and Register Spilling. NVIDIA. 2011. URL: http: / / spilling.pdf (visited on 2013-09-03).

[10] Tomas Möller and Ben Trumbore. “Fast, minimum storage ray-triangle intersection”. In: J. Graph. Tools 2.1 (Oct. 1997), pp. 21-28. ISSN: 1086-7651. DOI: 10.1080/10867651.1997.10487468.

[11] Aaftab Munshi, ed. The OpenCL Specification. Version 1.2. Khronos Group. Nov. 14, 2012. URL: http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf (visited on 2013-04-08).

[12] Ingo Wald et al. “Ray tracing animated scenes using coherent grid traversal”. In: ACM Trans. Graph. 25.3 (July 2006), pp. 485-493. ISSN: 0730-0301. DOI: 10.1145/1141911.1141913.

[13] spectral. CUDA 5.0. Oct. 18, 2012. URL: viewtopic.php?f=8&t=1019 (visited on 2013-09-04).

A Test environment

The following hardware and software configuration has been used for development, testing and benchmarks:

- Windows 7 x64
- Intel Core i7 Quadcore (8 virtual cores) at 3.4 GHz
- 8 GiB DDR3 RAM system memory
- NVIDIA GeForce GTX660 with 2 GiB GDDR5 RAM and 1344 CUDA Cores


[1]Latency tolerant processing units are characterized by being insensitive to long-latency memory requests.

[2]A memory access pattern describes how a program or a part of it organizes multiple and often related memory requests. Examples are random accesses, requests to consecutive addresses (reading a memory block), strided accesses (e.g., reading only a specific member of each object inside an array) and accesses to the same address (which are typically cached on a CPU).

[3]A coalesced memory access occurs when all threads of a warp request data at consecutive memory addresses, resulting in one fast memory block transfer instead of individual transfers of smaller chunks. Most GPU hardware architectures have their global memory organized in segments of 64 or 128 bytes, where the full segment always has to be transferred regardless of the size of the data actually needed (NVIDIA Corporation 2009, p.13). Coalesced and aligned (guaranteed by the compiler) memory access ensures that only the minimum number of segments is transferred.

[4]Scoreboarding is a processor technique where the data dependencies of all instructions are tracked and an instruction is only issued for execution once all of its dependencies have been resolved. As a result, instructions may be issued and executed out of order.

[5] Global memory

Global memory is the largest of all memory regions and visible to all compute units on a device. This memory region is used whenever data is copied to or from a device. Buffers and images are allocated in this memory. All pointer variables pointing into global memory have to be prefixed with the global attribute inside a kernel. (Gaster et al. 2011, p.29)

[6]Calling clGetDeviceIDs with a device type as argument may return no devices. The Intel OpenCL platform for example only supports CPUs and therefore does not provide a GPU device. Appropriate error handling is necessary in real world applications.

[7]Streaming SIMD Extensions, a vector instruction set extension by Intel supporting 128 bit wide registers to process 4 single precision floats in parallel.

[8]Advanced Vector Extensions, Intel’s latest vector instruction set extension featuring 256 bit wide registers to process 8 single precision floats in parallel.

[9]A perfect binary tree with n leaves has a total of 2n − 1 nodes. As only the non-leaf nodes perform additions, the number of leaf nodes can be subtracted, resulting in 2n − 1 − n = n − 1 nodes which perform an addition.

[10]At the beginning, multiplying two square matrices was ported to OpenCL. The CPU im­plementation is trivial but performs badly. Fortunately, highly tuned BLAS libraries exist which can do the multiplication a lot faster on the CPU. However, the first and again quite

[11]A water-tight volume (sometimes also called solid) is a volume which is fully enclosed by a surface. In other words, the surface does not contain any leaks. If the volume was filled with water, no water could leave the volume.

[12]Temporal locality refers to accessing the same location in memory frequently within a short time.



GPGPU Computing with OpenCL
University of Applied Sciences Oberösterreich, Hagenberg
Software Engineering
Bernhard Manfred Gruber (Author), 2013, GPGPU Computing with OpenCL, Munich, GRIN Verlag,

