Efficient Data Input/Output (I/O) for Finite Difference Time Domain (FDTD) Computation on Graphics Processing Unit (GPU)

Master's Thesis, 2014

101 Pages, Grade: First







1 Introduction
1.1 Computation in Electromagnetism
1.1.1 Maxwell’s Equations
1.1.2 Finite-Difference Time-Domain (FDTD)
1.2 Computational Parallelization Techniques & GPGPU
1.2.1 Parallel Computer Architecture
1.2.2 Parallel Algorithms & Programs
1.2.3 Emerging Parallelization Techniques: GPGPU
1.3 The Problem and The Objective
1.4 Thesis Overview
1.5 Original Contribution

2 Electromagnetism & Finite-Difference Time-Domain - Overview
2.1 Maxwell’s Equations
2.2 Finite-Difference Time-Domain (FDTD)
2.2.1 Frequency Dependent Material Parameters & Frequency Dependent FDTD
2.2.2 Boundary Condition
2.3 Summary of Maxwell’s Equations and FDTD Method
2.4 Computer Implementation of FDTD Method
2.4.1 Basics of FORTRAN 90 Programming
2.4.2 Implementation of FDTD Method
2.5 Advantages and Limitations of FDTD Computation
2.6 Concluding Remarks

3 Computation of FDTD on GPGPU using CUDA Programming
3.1 GPGPU - The Parallelization Monster and Computation Techniques
3.2 CUDA and CUDA Fortran
3.3 CUDA Implementation of FDTD Method for GPGPU Computation
3.4 Computation on Nvidia’s General Purpose GPU
3.4.1 GPU Hardware and support for FDTD Computation
3.4.2 Memory Coalescing
3.5 Execution of FDTD Method on GPU Hardware
3.6 Concluding Remarks

4 The Solution to The Problem
4.1 The Problem - Revisited
4.2 The Solution
4.3 Programmatic Implementation of the Solution
4.3.1 Implementation
4.3.2 Invoking Buffer Kernel
4.4 Possible Limitations and their Solutions
4.5 Concluding Remarks

5 Evaluation and Validation of The Solution
5.1 Testing of the Implemented Solution
5.1.1 Input Parameters for FDTD Computation
5.1.2 Hardware Environment
5.1.3 Test Results
5.2 Critical Analysis & Evaluation of Test Results
5.2.1 Speed-Up Analysis
5.2.2 Evaluation and Comments

6 Conclusion and Future Scope
6.1 Future Scope
6.2 Conclusion


A Survey Questions posted
A.1 Survey Questions posted via email and Researchgate OSN platform

Word Count: 14,100

List of Tables

1.1 Major Abbreviations Used In Chapter 1

2.1 Major Abbreviations Used In Chapter 2

3.1 Major Abbreviations Used In Chapter 3

4.1 Major Abbreviations Used In Chapter 4

List of Figures

1.1 Block Diagram of CPU and Memory Architecture

1.2 Diagrammatic Representation of Shared Memory Parallel Computer Architecture consisting of two CPUs

1.3 Diagrammatic Representation of Split Shared Memory Parallel Computer Architecture consisting of 'P' number of CPUs connected to the Memory, which has 'N' number of Memory Banks, via Interconnect

1.4 Diagrammatic Representation of Distributed Memory Parallel Computer Architecture

1.5 Nvidia GPUs Fundamental Architecture with 16 Streaming Multiprocessors, each with 8 stream processors (SPs) (128 stream processors in total) and 8 parallel texture mapping units (TMUs), which are used to address and filter the textures of images on the graphics hardware

1.6 Parallel Execution on GPGPU

1.7 Result of survey question on ’Which one has more potential to be used in future?’

1.8 Communication between Memory-CPU-GPU via PCI-e Bus

1.9 View of 'Unified Memory' Concept of CUDA 6.0 Programming Language

2.1 Yee Cell, representing Yee's algorithm regarding electromagnetic fields in three dimensional space. Magnetic field is perpendicular to the surface on which electric field acts. The fields (electric and magnetic) follow each other alternatively and because of this interleaving, the magnetic field and the electric field are co-dependent

2.2 An Example of Orthogonal Mesh, with [i, j, k] indices where i = 0, 1, 2, ..., Nx; j = 0, 1, 2, ..., Ny; k = 0, 1, 2, ..., Nz

2.3 Yee Cell where H field component requires the values of four surrounding co-plane E field

2.4 Mesh where E field component requires the four surrounding H values on the two planes to which it belongs

2.5 Some components can not be calculated because of the boundary issues

3.1 Abstract view of the Tesla unified graphics and computing GPU architecture, where SM - Streaming Multiprocessor, SP - Streaming Processor and SFU - Special Function Unit

3.2 The CUDA grid organization in Tesla Architecture as mentioned in [1,2]

3.3 All threads of half-warp participate

3.4 Some Threads Do Not Participate

3.5 Unaligned Starting Address which is not a multiple of region size, i.e. 64 for this case

3.6 Permuted Access by Threads

3.7 Uncoalesced memory access for multiple threads (three in this case) in a block [6, b], where [a, b] means [blockIdx%x, blockIdx%y]

3.8 Coalesced memory access for multiple threads (three in this case) in a block [6, b], where [a, b] means [blockIdx%x, blockIdx%y]

4.1 Different types of memory access required for FDTD Computation on GPU

4.2 Three (A, B, C) Methods of Data-Output. Methods B and C implement a buffer to fetch output data to the CPU

4.3 Flowchart to show the use of Buffer in FDTD computation on GPU

4.4 Visual representation of data copied on Buffer based on code snippet 4.1: 2-D Buffer Implementation with un-coalesced memory access approach of FDTD computation on GPU

4.5 Excitation location at different axis and dimensions

5.1 Execution Time for Test 1 & 2 in seconds. *Less is better in performance*

5.2 Speedup result of Test 1 & 2. *More is better*

5.3 Speedup result of Test 3 & 4. *More is better*

A.1 Question Posted via Researchgate OSN platform


Due to recent advancements in technology, one of the popular ways of improving program execution time is to exploit the massive parallelism of GPU-based accelerator computing alongside CPU computing. In GPU-based accelerator computing, the data-intensive or computationally intensive parts of a program are computed on the GPU, whereas the remaining, largely serial instructions are computed on the CPU, in order to achieve a large speedup in the execution time of the program executed on the computer system.

In physics, and especially in electromagnetism, the Finite-Difference Time-Domain (FDTD) method is a popular numerical analysis technique used to solve the set of Maxwell's partial differential equations, which unify and relate the electric field with the magnetic field. Since the FDTD method is computationally intensive and exhibits a high level of parallelism in its computational implementation, researchers have for the past few years been computing the intensive parts of FDTD methods on the GPU instead of the CPU. Although computing the parallelized parts of FDTD algorithms on the GPU achieves very good performance, the overall speedup in execution time suffers because of the very high latency between the CPU and the GPU. The calculation results at each FDTD time step are supposed to be produced and saved to the hard disk of the system. This can be called the data output of the FDTD method, and this data output cannot be overlapped with the computation of the field values at the next time step. Because of this, and because of the latency gap between the CPU and the GPU, there is a bottleneck in the performance of the data output from the GPU. This problem can be regarded as the inefficient performance of the data input/output (I/O) of FDTD methods on the GPU.

Hence, this project focuses on the aforementioned problem and aims to find solutions that improve the efficiency of the data I/O of FDTD computation on the GPGPU (General Purpose Graphics Processing Unit).

Keywords: data I/O; buffer; finite difference methods; FDTD; time domain analysis; hardware; acceleration; high performance computing; parallel programming; parallel architectures; GPU; graphics processing unit; parallel computing; CUDA; OpenACC; multi-core computing


I would like to thank my Supervisor, Fumie Costen, for her endless support and patience in answering every question that I had while pursuing the project and preparing the thesis. Without her presence this thesis could not have been written properly.

I also want to mention my family, who have always supported me, especially my parents, Soma Dey and Sudip Dey, for giving me the blessings and boosts that drove me to success in life.

Finally, I would also like to thank Buraq Abdul Qader for providing valuable insights and technical knowledge that made it easier for me to pursue this project. A special thanks to Tanaya Gopal, Rashmi vivek Joshi, Madhuri Raju and Harshini M. Kumar for putting up with my nuisance and mischief, and for being there to support me at times when needed.

Chapter 1 Introduction

1.1 Computation in Electromagnetism

In the world of physics, there are four fundamental forces in nature: the strong force, the weak force, the gravitational force and the electromagnetic force. Electromagnetism is the field of study that reviews and relates different electromagnetic fields and studies the interactions between electrically charged particles, magnetic fields and electrical conductors. As the world advanced towards the age of digital technology, the techniques and methods related to electromagnetism also grew in complexity. It was no longer possible to evaluate the complex formulas associated with electromagnetism just by hand on paper. This created a demand for computing these complex techniques on computers by means of computer programs. Even though computers can execute and compute these methods efficiently and quickly, some of the techniques require far more computational resources and execution time.

1.1.1 Maxwell’s Equations

In the middle of the 19th century, the eminent mathematical physicist James C. Maxwell related the electric and magnetic field components and studied how these fields alter each other by means of charges and currents. He proposed a set of partial differential equations, which linked all electric and magnetic field components, and this set of partial differential equations is known as Maxwell's Equations. Maxwell's Equations, along with the Lorentz force[1] law, form the foundations of electromagnetism and classical electrodynamics. Maxwell's Equations unify and unite Faraday's Law, Ampère's Law, Gauss's Law for the electric field and for the magnetic field, Ohm's Law, etc.

Over the years many numerical analysis methods have been proposed and implemented to solve Maxwell's equations efficiently and accurately, and one such popular method is the Finite-Difference Time-Domain method.

1.1.2 Finite-Difference Time-Domain (FDTD)

The Finite-Difference Time-Domain (FDTD) method discretises Maxwell's partial differential equations using Yee's algorithm[3], proposed by Kane Yee, to solve both the electric and magnetic fields in time and space.

The abstraction performed by the FDTD method to solve Maxwell's equations makes it easy to compute and calculate the set of partial differential equations on a computer. The FDTD method consists of update equations, which update the values of the electric and magnetic fields based on the latest available values of the magnetic and electric fields respectively. For example, to calculate H (the magnetic field) at the (t + 0.5Δt) time step, the value of E (the electric field) at the t time step is required. In brief, the way in which FDTD is calculated is: calculate H from the most recently available E value, calculate D (the electric flux) from the most recently available H value, and calculate E from the most recently available D value. Because of the way the FDTD method is calculated, there is a high level of parallelism in the computation of the method.
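The update sequence described above can be sketched as a simple one-dimensional leapfrog loop. The following is only an illustrative Python sketch in assumed normalised units (the grid size, the Courant factor of 0.5 and the Gaussian source are arbitrary choices), not the Fortran/CUDA implementation discussed later in this thesis:

```python
import numpy as np

# Minimal 1-D FDTD sketch in normalised units (illustrative assumptions,
# not the thesis implementation).
nz, nt = 200, 100        # number of spatial cells and time steps
ez = np.zeros(nz)        # electric field at integer grid points
hy = np.zeros(nz - 1)    # magnetic field at half-integer grid points

for n in range(nt):
    # Update H from the most recently available E values (t -> t + 0.5*dt)
    hy += 0.5 * (ez[1:] - ez[:-1])
    # Update E from the freshly computed H values (t + 0.5*dt -> t + dt)
    ez[1:-1] += 0.5 * (hy[1:] - hy[:-1])
    # Excite the grid with a Gaussian pulse at the centre
    ez[nz // 2] += np.exp(-((n - 30.0) ** 2) / 100.0)
```

Each cell update needs only the immediately neighbouring field values, which is exactly the structural property that makes the method so amenable to parallelisation.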

Maxwell's Equations and the FDTD method are described comprehensively in chapter 2, which supports the motive and objective of this thesis.

1.2 Computational Parallelization Techniques & GPGPU

"For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution."

- Gene M. Amdahl (1967)[4]

Amdahl's Law, which is a de facto law of the parallel and multicore computing world, states that even if an algorithm is parallelized, its execution starts serially and ends serially, however large the parallelized portion is. Thus, even for a parallelized algorithm, the performance with respect to execution time depends on both the serial and the parallel parts. For the FDTD method, the major part of the calculations involved is parallelizable in nature, and hence very good performance with respect to execution time can be expected if the parallel execution part is correctly implemented.
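Amdahl's observation can be stated quantitatively: if a fraction p of a program is parallelisable and runs on n processors, the overall speedup is bounded by 1/((1 - p) + p/n). A small illustrative sketch (the fraction 0.95 below is an arbitrary example, not a measured value):

```python
def amdahl_speedup(p, n):
    """Upper bound on overall speedup when a fraction p of the work
    is parallelised across n processors and the rest stays serial."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelised, the serial 5% caps the gain:
print(round(amdahl_speedup(0.95, 8), 2))      # 8 processors
print(round(amdahl_speedup(0.95, 10**6), 2))  # practically unlimited processors
```

However many processors are added, the speedup can never exceed 1/(1 - p), here 20: in the limit, the serial part dominates.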

Now, the performance of a parallelized task depends on various factors, but the major ones are good parallel hardware support and good parallel algorithm and programming techniques.

1.2.1 Parallel Computer Architecture

According to study materials published by J. R. Gurd[5], a basic computer architecture consists of the CPU (Central Processing Unit) and the memory. Here, the memory consists of two parts, the fixed Code and the changeable Data, which define the program being executed. When execution of a program starts on a computer, the Code part of the memory holds the fixed portion of the program, which will not change during execution, such as the formulas required for calculation, while the initial data of the program, which defines the computation, such as the input data and the initial values of required variables, is held in the Data part of the memory. The CPU contains a Program Counter (PC), which starts by pointing to the first instruction to be executed. As the program progresses, the current instruction is fetched from the Code memory and the PC is incremented to point to the next instruction to be executed. All the CPU state such as registers, condition codes, etc., except the PC, is treated as part of the Data memory. The CPU follows an Instruction Execution Cycle at the program level, repeatedly fetching and executing instructions. The execution of a program may consist of the following steps:

- Reading data from the Data memory, considered as a 'Read Access'
- Performing the required operations
- Writing the result back to the Data memory, considered as a 'Write Access'
- Incrementing the program counter, or assigning a new value to the PC

During the execution, all of the steps mentioned above, or only a few of them, may be performed depending on the program. This is commonly known as the 'von Neumann Computational Model'[6,7], which consists of sequential computation.

With careful analysis it can be found that when a program is executed, the processor/CPU spends more of its time moving data to and from the memory than calculating. For example, suppose a processor has to perform the following tasks:

a = 6
b = 12
c = a + b

If the time taken to execute these tasks together is assumed to be T secs, then the processor first has to fetch the value '6' from the memory and assign it to 'a', then perform the same fetch-and-assign operations for variable 'b'. After that it calculates 'a+b' and assigns the resulting value to 'c'. It should be kept in mind that 'a', 'b' and 'c' are also variables that reside in the memory. Now, suppose the time taken to fetch the values and copy the values back to the memory is t1 secs, and the time taken to perform the mathematical operation of 'a+b', i.e. adding the values, is t2 secs, such that the total time taken by the operations performed is t1 + t2 = T. Then t2 << T, implying that t2 is very much smaller than T and most of the execution time is spent dealing with memory-related operations rather than mathematical operations. Therefore, it can be said that the performance of a program with respect to execution time actually depends on the performance of the memory in the computer system.

This performance delay is incurred due to the memory latency gap. Before any operation is executed, its operands first have to be fetched from the main memory, and only then can it be executed in the CPU. Because of the gap between the memory and the CPU[8,9], the time of execution is increased. Please refer to Fig. 1.1 for a basic block diagram of the CPU and memory, and the memory interface that connects the two. It should also be kept in mind that although the gap between the memory and the CPU increases the time of execution, this is hardly noticed because the total execution time may only be in seconds. But compared to the time taken by mathematical operations, or operations that do not include memory access, the time taken by operations that perform some kind of memory access is huge. In 1996, Hennessy and Patterson[7]

[Figure not included in this reading sample]

Figure 1.1: Block Diagram of CPU and Memory Architecture

introduced the concept of Cycles Per Instruction (CPI), which is the average number of clock cycles required to execute one instruction. Now, if it is assumed that performing a mathematical operation such as an addition, i.e. '+', takes 1 CPI, then fetching data from the memory can take around 10 CPI, which in comparison is very high. This gives a fair idea of how the execution time is affected by the memory latency gap.
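The effect of this imbalance on overall execution time can be seen through a weighted-average CPI. The instruction mix below is invented purely for illustration, reusing the 1-CPI arithmetic and 10-CPI memory-access figures assumed above:

```python
# Hypothetical instruction mix: (fraction of instructions, cycles per instruction)
mix = [
    (0.6, 1),   # arithmetic operations, 1 CPI as assumed above
    (0.3, 10),  # memory loads/stores, ~10 CPI as assumed above
    (0.1, 2),   # other instructions, e.g. branches
]

effective_cpi = sum(frac * cpi for frac, cpi in mix)
memory_share = (0.3 * 10) / effective_cpi

print(round(effective_cpi, 2))  # average cycles per instruction
print(round(memory_share, 2))   # fraction of cycles spent on memory access
```

Although only 30% of the instructions in this made-up mix touch memory, they account for roughly 79% of the clock cycles, which is exactly the memory-bound behaviour described above.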

To deal with this problem, many techniques have been introduced as technology advanced. One of the best known methods is to accommodate faster memory on the CPU chip, which is called cache memory[3]. Since cache memory is not the focus of this thesis, the performance gained by using caches is not discussed in detail here.

In the modern computing era (post 1960), due to the increase in complex workloads and parallelizable tasks, there has been a rise in parallel and multi-core computing. Nowadays computer systems hardly ever consist of a single-core CPU. Since gaining performance with respect to execution time is of utmost importance now, the main focus is on parallel computing (parallel computers and parallel algorithmic techniques).

Parallel computer systems[5] may have many different architectures, but all of these architectures can be generalized into the following:

- Shared Memory Multiprocessor Architecture

[Figure not included in this reading sample]

Figure 1.2: Diagrammatic Representation of Shared Memory Parallel Computer Architecture consisting of two CPUs

- Distributed Memory Multicomputer (Distributed Memory Parallel Computer Architecture)

Again, in general, the shared memory multiprocessor can be of two types:

- Shared Memory Parallel Computer Architecture

- Split Shared Memory Parallel Computer Architecture

Shared Memory Parallel Computer Architecture: In this architecture, the system consists of more than one CPU, but the memory interface is the same as in the sequential computer architecture. Write access to the memory is the same as in the sequential architecture, but read access to the memory is different: read access requests are tagged with the identity of the issuing CPU so that the data can be returned to the correct CPU. Refer to Fig. 1.2.

Split Shared Memory Parallel Computer Architecture: In this architecture, the memory is split into many memory banks, which are accessed by the many CPUs in the system. The CPUs are connected to the memory banks by an 'interconnect'. The interconnect directs each memory access from a CPU to the appropriate memory bank, and memory addresses are allocated across the memory banks in different ways, such as interleaved, in blocks, etc., as long as there is no clash or contention of memory accesses from different CPUs. The interconnect can be implemented in hardware in different ways, but the most popular method is to implement it as a bus, which is cheap to

[Figure not included in this reading sample]

Figure 1.3: Diagrammatic Representation of Split Shared Memory Parallel Computer Architecture consisting of 'P' number of CPUs connected to the Memory, which has 'N' number of Memory Banks, via Interconnect

implement. The bus interconnect has its own limitations, such as the limited number of CPUs and memory banks that can be connected using it. Refer to Fig. 1.3.

Distributed Memory Parallel Computer Architecture: In this architecture, the memory is distributed physically amongst 'P' number of CPUs. Each 'CPU with memory' pair thus resembles the von Neumann, i.e. sequential, computer architecture, and the system is called a Distributed Memory Multicomputer[4]. Refer to Fig. 1.4. This architecture is actually motivated by the following:

- Co-locating a CPU and a memory bank, and making them share the same bus interface, increases the effective capacity of the bus.
- During the execution of a program, many of the required variables are private to each thread[5]. Placing these private variables in the co-located memory therefore reduces the use of the bus interconnect.

There are many variants of the Distributed Memory Parallel Computer Architecture, such as Distributed Shared Memory (DSM), where the available addresses are shared across

[Figure not included in this reading sample]

Figure 1.4: Diagrammatic Representation of Distributed Memory Parallel Computer Architecture

the memory banks with a single address space, or Distributed Memory (DM) with multiple address spaces across the system, where each CPU is only allowed to issue addresses to its own local memory bank. Another popular trend is cluster computing, where the memory is shared at the computer level in sub-parts to make a parallel structure. Since these hardware techniques are not within the scope of this thesis, they are not discussed in detail.

In the next section, the different ways of parallelising algorithms and implementing them as programming techniques are described.

1.2.2 Parallel Algorithms & Programs

In the study materials provided by J. Gurd[5], it was noted that although different programs/applications in the world of High Performance Computing (HPC) have different structures, they still share some common characteristics, such as:

- Use of discrete approximation for calculation purposes
- A foundation of a mathematical model underneath the application for the calculation
- An algorithm to 'animate' the underpinning mathematical model so that it can be used for digital computation

Parallel Algorithms: With a closer analysis of parallel applications, it can be noticed that there are some evident patterns in which programs are parallelised, which can be classified as follows:

- Data parallelism: In this type, there exists some large data domain, and a similar kind of computation is to be done on the data of that domain. Sometimes the data can be computed in parallel very easily, and such cases are called embarrassingly parallel.
- Reduction parallelism: In this type, there exists some large data domain over which it is necessary to compute some kind of global function common to the entire domain. This type of parallelism is not always as obvious as data parallelism; it depends on the problem being computed.
- Divide-and-conquer parallelism: This type of parallelism is more of an algorithmic approach than a pattern. Many sequential tasks can be visualized as an aggregate of many smaller parallelisable tasks. In order to solve the problem as a whole, the whole task is divided into small tasks, which are then computed in parallel. To approach a problem like this, however, it is the programmer's responsibility to devise the parallelising technique algorithmically.
- Task parallelism: Here, the task consists of multiple distinct computations, each of which can be computed simultaneously. In this way, different procedures can be applied to parts of the data domain to create parallel tasks, and hence each computation may take a different execution time.

The aforementioned parallelising algorithmic techniques can be implemented as a computer program in three popular ways:

- Data Sharing (Thread-Based): When a process consists of smaller light-weight tasks, commonly known as threads (as mentioned earlier in sub-section 1.2.1), and these threads are executed on the same memory by using separate blocks/priorities for each thread, this parallel programming style can be called thread-based data sharing. One of the most popular application programming interfaces (APIs) in academia and industry that supports shared memory parallel programming is OpenMP[10]. The OpenMP API manages the threads implicitly. There are a few other libraries and APIs that support this methodology, but they are not as popular as OpenMP. OpenMP is multi-platform and supports programming languages such as C, C++ and Fortran.

- Message Passing (Process-Based): When a large problem consists of several processes, it is possible to execute more than one process at the same time. This is achieved by coordinating the execution of the different processes and by executing these processes on separate processors with separate memories. The processors coordinate the execution by passing messages, and hence the name. One of the most popular message passing programming techniques devised is the Message Passing Interface (MPI)[11]. The MPI library supports major programming languages such as C, C++, Fortran and Java.
- Hybrid Programming Technique: In a hybrid programming scheme, both of the techniques mentioned above can be implemented together in order to achieve even better performance, although the performance will depend entirely on the problem itself and on the way the problem is parallelised. One example of this technique is to use MPI to compute separate processes in parallel, while using OpenMP inside MPI at the same time to parallelise each process into multiple threads and compute them simultaneously.

It should be kept in mind that there are many other variants and features related to the aforementioned programming techniques, but those are not within the scope of this thesis and hence are not discussed further here.

Now, although a high performance system can consist of a mixture of the parallel hardware architectures and parallel programming techniques mentioned above (sub-sections 1.2.1, 1.2.2), the performance of the system cannot be evaluated unless there are firm methods to compare it against standardised measures.

Measuring Performance: The performance of a program in the HPC world is evaluated in two ways: either by assessing the floating point operations performed per second (flop/s), or by assessing the execution time. The term flop/s is mostly used to evaluate the performance of parallel architecture hardware, which is done by measuring how many floating point calculations are performed in a second. The other method is to measure the execution time of the program using P processors and then compare it with the execution time for 1 processor. The main concern is to assess the performance and success of a parallel program by evaluating its performance on a variety of different parallel configurations.
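The second method described above, comparing the execution time on P processors against one processor, is conventionally reported as speedup and parallel efficiency. A minimal sketch with made-up placeholder timings:

```python
def speedup(t1, tp):
    """Speedup S = T1 / TP: time on one processor over time on P processors."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Parallel efficiency E = S / P: how well the P processors are used."""
    return speedup(t1, tp) / p

# Hypothetical timings: 120 s on one processor, 20 s on eight processors
print(speedup(120.0, 20.0))        # prints 6.0
print(efficiency(120.0, 20.0, 8))  # prints 0.75
```

An efficiency below 1.0 reflects the serial fraction and the communication overheads discussed earlier in this section.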

1.2.3 Emerging Parallelization Techniques: GPGPU

The hardware and algorithmic techniques already available for parallelising complex tasks are more than sufficient to compute most algorithms efficiently, but the advancement of technology and the never ending thirst for developing more advanced technology have led parallelization down different routes.

Post 1980, the world of technology saw the development of a new technology which changed the way people visualise computer graphics. That innovative technology was the Graphics Processing Unit (GPU), used to accelerate graphics processing in order to enhance visual representation on a computer. In a GPU[12-14] the graphics are processed by multiple shader cores, which are the basic processing units of a GPU. These shader cores are used to shade an image (pixels) or to provide effects to an image by means of a computer programming technique called 'shaders'[12]. A shader calculates rendering[6] effects on the graphics hardware and produces the image as an output. A basic GPU architecture consists of multiple shader cores, which are capable of shading multiple pixels of an image. The GPU thus had the inherent property of parallel computing from the beginning, and this ideology of computing multiple pixels using shader cores gave rise to the idea of the General Purpose Graphics Processing Unit for computing multiple calculations at the same time. The notion of a General Purpose Graphics Processing Unit (GPGPU) means being able to compute general calculations on the GPU, instead of using the GPU just for graphical processing purposes.

Modern GPUs[15-17] have many processing cores, which are called streaming multiprocessors (SMs), and each SM may have one or multiple stream processors (SPs) (refer to Fig. 1.5). When a parallelisable workload is computed on GPGPUs, the CPU computes the serial parts of the program/application being executed and the parallelised (repetitive) parts are computed on the GPU (refer to Fig. 1.6). For this thesis, GPGPUs from Nvidia are used. CUDA[1,2,15,18,19] is a parallel computing platform and programming model invented by NVIDIA, which enables programmers and users to significantly increase computing performance by harnessing the parallelization power of GPUs. More details regarding parallel computing on GPGPU are provided in Chapter 3.

As a part of the research conducted for the purpose of this thesis, a survey (refer to appendix A) was pursued to evaluate the performance, ease of use and future scope of

[Figure not included in this reading sample]

Figure 1.5: Nvidia GPUs Fundamental Architecture with 16 Streaming Multiprocessors, each with 8 stream processors (SPs) (128 stream processors in total) and 8 parallel texture mapping units (TMUs), which are used to address and filter the textures of images on the graphics hardware

[Figure not included in this reading sample]

Figure 1.6: Parallel Execution on GPGPU

CUDA programming (GPGPU Computing) along with other popular parallel programming techniques such as OpenMP and MPI. The survey was distributed by email and over the ResearchGate website, which is a social networking site for scientists and researchers to share their research findings and ideas. The survey was completed by 39 participants, who are professionals from industry and academia related to High Performance Computing. The typical questions in the survey asked respondents to evaluate the performance gained, the ease of implementation and the future scope of wide usability of each of the OpenMP, MPI and CUDA parallel computing techniques. The outcome of the survey was:


OpenMP

- Ease of Use/Implementation: 72% agreed strongly
- Performance Gained: 14% agreed strongly
- Has potential to be used more in future: 9% agreed strongly


MPI

- Ease of Use/Implementation: 23% agreed strongly
- Performance Gained: 73% agreed strongly
- Has potential to be used more in future: 0% agreed strongly

CUDA Programming

- Ease of Use/Implementation: 73% agreed strongly
- Performance Gained: 91% agreed strongly
- Has potential to be used more in future: 91% agreed strongly (see Fig. 1.7)

[In the above survey results, the percentage (%) represents the percentage of the survey respondents/participants. It should also be kept in mind that the performance gained by the MPI technique is comparable with the performance gained by the CUDA GPGPU technique, but the future scope of CUDA is much more evident than MPI's according to the survey. The survey was based entirely on the study of the ease of implementation and the performance gained from that implementation.]

Although the pool of participants for this survey is too small to determine the actual trend in the industry, it still hints at where the trend is leading.

[Figure not included in this excerpt]

Figure 1.7: Result of the survey question 'Which one has more potential to be used in the future?'

CUDA programming for GPGPU is easy for programmers to implement, and the implementation yields very good performance; perhaps for these reasons more people will prefer CUDA as a parallel programming technique in the near future. In conclusion, this survey reassures that the topic pursued in this thesis is of high importance and will help programmers using the CUDA programming technique.

Since GPGPU computing, as part of high performance computing, is relatively new in the industry, it has both advantages and limitations.

1.3 The Problem and The Objective

GPGPU computing is also called accelerator-based computing, i.e. a GPGPU is called an accelerator, which accelerates the computation. When a computation is performed using a GPGPU, the CPU is regarded as the 'host' and the GPU as the 'device'. In the typical high performance computing world, the bottleneck in performance is mainly created by the latency gap between the memory and the CPU, as discussed in sub-section 1.2.1, and the same is true for GPGPU computing.

The GPU is connected to the chipset[7] of the computer system via the PCI Express bus, also known as Peripheral Component Interconnect Express (PCI-e). The GPU, the CPU and the main memory (RAM) all interact over this PCI-e bus, and because of the latency gap between these three, there is a bottleneck in the overall performance gain.

When a parallelized program is computed on the GPGPU, the data is first copied from the memory to the GPU, and after computation the data is written back to the

[Figure not included in this excerpt]

Figure 1.8: Communication between Memory-CPU-GPU via PCI-e Bus

memory from the GPU using the PCI-e bus (refer to Fig. 1.8). Thus for every computation, data has to be copied to and fro between device, host and memory[8]. Although the computation itself is very fast on the GPGPU, the communication between device, host and memory via PCI-e generates a bottleneck in the overall performance. So far no effective method has been devised to deal with this problem. Moreover, CUDA programming for GPGPU computing requires the programmer to manage the data copied to and from the GPU by invoking CUDA functions such as cudaMemcpy(), which makes it even more difficult to write a program in CUDA. To ease this, Nvidia introduced a new programming model in CUDA 6.0, named Unified Memory[20]. In the Unified Memory technique, the whole memory is viewed as a single unified block of memory (see Fig. 1.9), but this technique still does not remove the aforementioned memory latency gap. The Unified Memory concept was introduced to reduce development effort; to gain good performance, the programmer still needs to manage the data copied to and fro.
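To illustrate the two models just described, the following CUDA Fortran sketch contrasts explicit host-device transfers with a managed (Unified Memory) declaration. The array names are hypothetical and the kernel launches are elided; this is a sketch of the two programming models, not code from this thesis.

```fortran
! Sketch only: explicit copies versus Unified (managed) memory.
program memory_models
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real, dimension(n)          :: ex     ! host array
  real, device, dimension(n)  :: ex_d   ! device array (explicit model)
  real, managed, dimension(n) :: ey     ! managed array (Unified Memory)

  ex = 0.0
  ! Explicit model: each assignment below becomes a cudaMemcpy over
  ! PCI-e, and the programmer must issue it around every kernel launch.
  ex_d = ex            ! host -> device
  ! ... kernel launches operating on ex_d would go here ...
  ex = ex_d            ! device -> host

  ! Unified Memory model (CUDA 6.0): ey is addressable from both host
  ! and device; the runtime migrates its pages on demand. The PCI-e
  ! traffic still occurs, it is merely hidden from the programmer.
  ey = 1.0
end program memory_models
```

Note that in both models the bytes still cross the PCI-e bus; Unified Memory changes who issues the copies, not whether they happen.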

In the FDTD method, to calculate and update the H, E and D values, data have to be copied to and written back between the memory and the GPU (device). Although the performance of computing the FDTD algorithm on a GPGPU is very good, the bottleneck generated by the latency gap between the GPU and the memory hinders the true performance that might have been achieved with GPGPU computing in the absence of this problem.
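A rough sketch shows why this per-time-step traffic hurts: a naive CUDA Fortran time loop that copies the field arrays in and out on every iteration spends much of its wall-clock time on the PCI-e bus. The kernel and array names below are hypothetical placeholders, not the kernels developed in this thesis.

```fortran
! Hypothetical sketch of a naive FDTD time loop (CUDA Fortran).
! update_h / update_e are placeholder kernel names.
do n = 1, nsteps
   hx_d = hx                            ! host -> device over PCI-e
   ez_d = ez                            ! host -> device over PCI-e
   call update_h<<<grid, tblock>>>(hx_d, ez_d)
   call update_e<<<grid, tblock>>>(ez_d, hx_d)
   hx = hx_d                            ! device -> host over PCI-e
   ez = ez_d                            ! device -> host over PCI-e
end do
! The copies dominate the run time: the fields should instead stay
! resident on the device for the whole run, with only the sampled
! output points (the data I/O this thesis targets) moved each step.
```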

[Figure not included in this excerpt]

Figure 1.9: View of ’Unified Memory’ Concept of CUDA 6.0 Programming Language

The Objective: Therefore the main objective of this thesis is to develop, implement and evaluate a method/technique to reduce the bottleneck created by the latency gap between device, host and memory, in order to achieve the highest performance and efficiency from GPGPU computation for the data input/output (I/O) of the FDTD method. Another objective is that the proposed method should work properly when accommodated into the FDTD algorithm, even for random data points on the grid for the data I/O of the FDTD method.

In this thesis, a novel approach to this problem is proposed, and the evaluation results of the proposed technique are discussed later in Chapters 4 and 5 in order to evaluate and validate its effectiveness.

While preparing this thesis, there was no single academic manuscript that compiled all the fundamental logic of the FDTD method, its computation on GPGPU, and ways of increasing the efficiency of the data input/output of FDTD computation on GPGPU. Therefore, this thesis covers the following areas:

- Provide the fundamental logic behind the Finite-Difference Time-Domain (FDTD) method
- Provide a basic understanding of computing the FDTD method on the CPU and on the GPGPU
- Provide a solution to the throughput problem, i.e. the latency gap between Memory-CPU-GPU, in order to increase the efficiency of data input/output (I/O) of the FDTD method

1.4 Thesis Overview

In the Introduction (Chapter 1), a few of the basic concepts of parallel computing that are essential for this thesis are presented, along with a brief review of the problem addressed in this research project. In Chapter 2 the important equations related to the FDTD method are discussed, along with guidelines for implementing the FDTD algorithm on a computer. Chapter 3 explores the methods of computing the FDTD algorithm on the GPGPU using CUDA programming. Chapter 4 discusses the solution proposed to bridge the latency gap between the memory and the device (GPU) and to make the data input/output (I/O) of FDTD computation on the GPGPU more efficient. In Chapter 5, the proposed method is evaluated for performance and validated for accuracy, and the results are discussed. Finally, in Chapter 6 the conclusion for the overall thesis is provided, along with the future scope of the proposed methodology.

1.5 Original Contribution

This thesis presents the author's original contribution in the following areas:

- A survey to evaluate the future scope of CUDA technology and GPGPU computing in the parallel computing and high performance computing world
- Development and implementation of the solution to increase the efficiency of data input/output (I/O) of FDTD computation on GPGPU (main objective)
- Evaluation and validation of the implemented solution

Table 1.1: Major Abbreviations Used In Chapter 1

[Table not included in this excerpt]


[1] In electromagnetism, the Lorentz force combines the electric and magnetic forces on a point charge due to electromagnetic fields

[2] In sequential computing a processor and a CPU are considered to be the same thing. But with the advancement of technology and the incorporation of many CPUs in one system, i.e. multi-core, the terms CPU and processor have diverged. Some processor manufacturers now call each core a CPU and the chip containing these cores a processor. For this example, however, the CPU and the processor should be considered the same.

[3] Cache is a smaller, faster memory used by CPU to reduce the average time of accessing data from the main memory.

[4] Here the term computer in the word 'multicomputer' refers to a CPU with some memory attached to it.

[5] A thread is the smallest sequence of a program, which can be managed independently by a scheduler of an Operating System and run in a shared memory space. Whereas, a process has its own memory space and when many processes are executed together, they run in separate memory spaces.

[6] Rendering is the process of generating an image from a model

[7] A chipset is a set of electronic components in the integrated circuit of the motherboard, which manages communication between the CPU, memory and other peripheral devices.

[8] Device is the GPU, Host is the CPU

University of Manchester  (School of Computer Science)
Advanced Computer Science: Computer Systems Engineering
Keywords: HPC, GPU, FDTD, GPGPU, CPU, electromagnetics, Fortran, CUDA, finite difference methods, time domain analysis, parallel programming, parallel computing, OpenACC, multi-core computing, parallel architectures, high performance computing, data I/O, buffer
Somdip Dey (Author), 2014, Efficient Data Input/Output (I/O) for Finite Difference Time Domain (FDTD). Computation on Graphics Processing Unit (GPU), Munich, GRIN Verlag, https://www.grin.com/document/462250

