Convolutional Neural Networks (CNNs) are state-of-the-art neural networks used in many fields, such as video analysis, face detection, and image classification. Because of their high demands on computational resources and memory bandwidth, CNNs are mainly executed on special accelerator hardware that is more powerful and more energy-efficient than general-purpose processors. This paper gives an overview of the use of FPGAs for accelerating computation-intensive CNNs with OpenCL and proposes two implementation alternatives.

The first approach is based on nested loops, which are inspired by the mathematical formula of multidimensional convolutions. The second strategy transforms the computational problem into a matrix multiplication problem on the fly. The description of both approaches is followed by common optimization techniques for FPGA designs based on high-level synthesis (HLS). Afterwards, the proposed implementations are compared to a CNN implementation on an Intel Xeon CPU in order to demonstrate their advantages in terms of performance and energy efficiency.

Excerpt

1 Introduction

2 Related Work

3 Background

3.1 FPGA

3.2 CNNs

4 Implementation

4.1 OpenCL Stack on FPGA

4.2 Implementation as Nested Loops

4.3 Implementation as Matrix Multiplication

5 Optimizations Techniques

5.1 Computational Optimizations

5.1.1 Improved Pipelining

5.1.2 Loop-Unrolling

5.1.3 Vectorization

5.2 Datapath Optimizations

6 Comparison

7 Conclusion

Research Objectives and Core Topics

This paper aims to provide a comprehensive overview of utilizing FPGAs for accelerating computation-intensive Convolutional Neural Networks (CNNs) using the OpenCL framework, focusing on optimizing performance and energy efficiency compared to traditional CPU implementations.

Architectural implementation strategies for CNNs on FPGAs (Nested Loops vs. Matrix Multiplication).
Application of OpenCL for high-level synthesis and hardware acceleration.
Computational optimization techniques including loop pipelining, unrolling, and vectorization.
Datapath optimization and performance modeling using the machine balance metric.
Comparative analysis of FPGA-based accelerators against CPU-based systems regarding power and performance efficiency.

Excerpt from the Book

4 Implementation

The implementation of CNNs consumes a considerable amount of computational resources as well as a significant amount of memory. Large CNN-based architectures like AlexNet have more than 60 million model parameters. These parameters occupy about 250 MB of memory, assuming a 32-bit data type. Since FPGAs do not provide on-chip memory on this scale, the parameters have to be stored in external memory. The external memory bandwidth can therefore become a performance bottleneck, so the challenge in the implementation is to optimize the flow of data to and from the compute units.
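The memory estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the paper: the function name `parameter_memory_mb` is hypothetical, and 60 million is used as a lower bound because the text only states "more than 60 million" parameters.

```python
def parameter_memory_mb(num_params, bits_per_param=32):
    """Return the storage needed for the model parameters in MB (10^6 bytes)."""
    total_bytes = num_params * bits_per_param // 8
    return total_bytes / 1e6


# Lower bound from the text: "more than 60 million" parameters, 32 bit each.
alexnet_params_lower_bound = 60_000_000
print(parameter_memory_mb(alexnet_params_lower_bound))      # 240.0
# Halving the data type width (an assumption, not a result from the paper)
# halves the footprint.
print(parameter_memory_mb(alexnet_params_lower_bound, 16))  # 120.0
```

The lower bound of 240 MB is consistent with the figure of about 250 MB quoted for the full parameter count, and in either case far exceeds on-chip FPGA memory.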

In this chapter, two different approaches to CNN computation are presented. The first approach uses nested loops, which essentially corresponds to a direct implementation of equation 1. The second approach transforms the problem into matrix multiplications on the fly, so that standard matrix multiplication architectures can be used for the computation.
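The two approaches can be illustrated in software. The sketch below is not the paper's OpenCL code: equation 1 is not reproduced in this excerpt, so a standard multi-channel convolution with stride 1 and no padding is assumed, and all function names are hypothetical. Unlike the on-the-fly transformation described above, this sketch builds the unrolled input matrix explicitly so that the equivalence of both formulations is easy to see.

```python
def conv_nested_loops(inp, weights):
    """Direct convolution: inp[c][y][x], weights[m][c][ky][kx] -> out[m][y][x].

    Assumes stride 1, no padding, and square K x K filters.
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M, K = len(weights), len(weights[0][0])
    OH, OW = H - K + 1, W - K + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):                      # output feature maps
        for y in range(OH):                 # output rows
            for x in range(OW):             # output columns
                acc = 0.0
                for c in range(C):          # input channels
                    for ky in range(K):     # filter rows
                        for kx in range(K): # filter columns
                            acc += weights[m][c][ky][kx] * inp[c][y + ky][x + kx]
                out[m][y][x] = acc
    return out


def im2col(inp, K):
    """Unroll all K x K receptive fields into a (C*K*K) x (OH*OW) matrix."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    OH, OW = H - K + 1, W - K + 1
    return [[inp[c][y + ky][x + kx] for y in range(OH) for x in range(OW)]
            for c in range(C) for ky in range(K) for kx in range(K)]


def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m), both lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conv_as_matmul(inp, weights):
    """Same convolution, expressed as one matrix multiplication."""
    K = len(weights[0][0])
    OW = len(inp[0][0]) - K + 1
    # Flatten each filter into one row, in the same (c, ky, kx) order as im2col.
    w_mat = [[w for ch in filt for row in ch for w in row] for filt in weights]
    out_mat = matmul(w_mat, im2col(inp, K))          # M x (OH*OW)
    # Fold every output row back into an OH x OW feature map.
    return [[r[i:i + OW] for i in range(0, len(r), OW)] for r in out_mat]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3 x 3
filters = [[[[1, 0], [0, 1]]]]                # 1 filter, 1 channel, 2 x 2
assert conv_nested_loops(image, filters) == conv_as_matmul(image, filters)
```

The unrolled matrix repeats input values that belong to overlapping receptive fields, which is one reason a hardware design would generate it while streaming rather than store it in external memory.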

Summary of Chapters

1 Introduction: Provides an overview of CNNs, their importance in computer vision, and the motivation for using FPGAs as energy-efficient accelerators.

2 Related Work: Discusses existing research on accelerating CNNs, specifically focusing on nested loop approaches and matrix multiplication techniques.

3 Background: Introduces the fundamental architecture of FPGAs and the operational basics of Convolutional Neural Networks.

4 Implementation: Details two implementation strategies, the OpenCL stack usage, and the transformation of convolution tasks into matrix multiplications.

5 Optimizations Techniques: Covers methodologies to enhance performance through computational optimization and data path refinement.

6 Comparison: Presents experimental results comparing the implemented approaches against each other and a standard CPU execution.

7 Conclusion: Summarizes the findings and highlights the superiority of FPGA-based CNN accelerators regarding energy efficiency.

Keywords

FPGA, CNN, OpenCL, Hardware Acceleration, Nested Loops, Matrix Multiplication, High Level Synthesis, Pipelining, Loop Unrolling, Vectorization, Datapath Optimization, Performance Density, Energy Efficiency, Convolutional Neural Networks, Parallel Computing.

Frequently Asked Questions

What is the primary focus of this publication?

The paper explores how to effectively use FPGAs to accelerate Convolutional Neural Networks (CNNs) by leveraging the OpenCL framework for development and high-level synthesis.

What are the central thematic areas?

The central themes are hardware architecture, computational parallelization techniques (pipelining, unrolling), datapath optimization, and benchmarking performance/power efficiency of FPGA designs.

What is the main research objective?

The objective is to demonstrate that FPGA-based implementations of CNNs offer significantly higher performance and energy efficiency compared to traditional general-purpose CPU architectures.

Which scientific methods are employed?

The author uses a comparative analysis method, evaluating two different FPGA implementation strategies (nested loops vs. matrix multiplication) against an Intel Xeon CPU implementation.

What does the main body address?

It covers FPGA technology basics, the translation of CNN algorithms into hardware-efficient code using OpenCL, specific optimization strategies, and a quantitative comparison of results.

Which keywords define this work?

Key terms include FPGA, OpenCL, CNN, High Level Synthesis, Pipelining, Loop Unrolling, Datapath Optimization, and Performance Density.

How is the "machine balance" metric used in this study?

It is used as a tool to quantify the ratio between memory bandwidth and computational bandwidth, helping to identify and optimize performance bottlenecks in FPGA designs.
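As a rough illustration of this ratio, the sketch below computes a machine balance and compares it with the data demand of a kernel. It is an assumption-laden example, not the paper's model: the function names and all numbers are hypothetical, and the comparison against a per-operation data demand follows the common roofline-style reading of the metric.

```python
def machine_balance(mem_bandwidth_bytes_per_s, compute_ops_per_s):
    """Bytes the memory system can deliver per operation the device can execute."""
    return mem_bandwidth_bytes_per_s / compute_ops_per_s


def is_memory_bound(bytes_needed_per_op, balance):
    """A kernel needing more bytes per operation than the machine delivers
    is limited by memory bandwidth rather than by compute."""
    return bytes_needed_per_op > balance


# Hypothetical device: 25 GB/s external memory, 100 G operations per second.
balance = machine_balance(25e9, 100e9)
print(balance)                         # 0.25 bytes per operation
print(is_memory_bound(4.0, balance))   # True: no data reuse, 4 bytes per operation
print(is_memory_bound(0.1, balance))   # False: heavy on-chip data reuse
```

Under this reading, datapath optimizations aim to lower the bytes a kernel needs per operation, for example through on-chip data reuse, until the design is no longer limited by external memory bandwidth.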

Why are nested loops and matrix multiplication compared?

These two approaches represent different ways to handle the intensive computational requirements of convolution, allowing the author to test which method yields better parallelization and data reuse on FPGA hardware.

What role does the OpenCL stack play on FPGAs?

OpenCL serves as an interface for high-level language compilers, allowing developers to program FPGAs more efficiently and with faster time-to-market compared to traditional hardware description languages like VHDL.

What is the significance of the "Gain in efficiency" table?

It highlights the massive improvement in energy efficiency achieved by the FPGA accelerators compared to CPU execution, reaching up to 153 times higher efficiency in the tested scenarios.

Excerpt out of 16 pages

Details

Title: The usage of FPGAs for the acceleration of Convolutional Neuronal Nets (CNNs) with OpenCL. Two alternatives for implementation
College: University of Paderborn
Grade: 1.0
Author: Christian Lienen (Author)
Publication Year: 2018
Pages: 16
Catalog Number: V451366
ISBN (eBook): 9783668861299
ISBN (Book): 9783668861305
Language: English
Tags: fpgas convolutional neuronal nets cnns opencl
Product Safety: GRIN Publishing GmbH

Quote paper: Christian Lienen (Author), 2018, The usage of FPGAs for the acceleration of Convolutional Neuronal Nets (CNNs) with OpenCL. Two alternatives for implementation, Munich, GRIN Verlag, https://www.grin.com/document/451366
