Convolutional Neural Network in classifying scanned documents

Internship Report, 2016

33 Pages



1 Introduction
1.1 Context
1.1.1 About ICTLab
1.1.2 ARCHIVES project
1.1.3 Internship context
1.2 Report organization

2 State of the art
2.1 Artificial intelligence & machine learning
2.2 Artificial neural network (ANN)
2.2.1 History
2.2.2 Regular neural network
2.2.3 Convolutional neural network (LeNet)
2.2.4 Training and evaluating

3 Contribution
3.1 Data creation and augmentation
3.1.1 ARCHIVES dataset
3.1.2 Creating data
3.1.3 Augmenting the data
3.1.4 Summary and Result
3.2 Constructing the convolutional neural network (LeNet)
3.2.1 The model
3.2.2 Preparing data
3.2.3 Training
3.2.4 Validation and testing
3.3 Developing the network

4 Results
4.1 The basic network
4.1.1 Testing on the dataset
4.1.2 Testing on real images
4.2 The network modifications
4.2.1 Fully connected layer
4.2.2 Convolutional layers
4.3 The new network

5 Conclusion

A Transfer functions

List of Figures

2.1 A single-input neuron (left) and a multiple-input neuron (right)[5]

2.2 Three-layer neural network[5]

2.3 Logistic sigmoid (blue), hyperbolic tangent with recommended parameters (green) and ReLU (red)

2.4 Typical CNN architecture[2]

2.5 Convolution[10]

2.6 Dropout Neural Net Model[18]

2.7 Underfitting (left), good fit (middle) and overfitting (right)[3]

3.1 Sliding window

3.2 Sub-images generated by sliding window

3.3 A blank window (188.jpg on the left)

3.4 Choosing bounding box in matlab

3.5 An original image and its 5 degrees rotations

3.6 Combination of flipping and orthogonal rotating

3.7 Training accuracy and validation accuracy

4.1 Training accuracy and Validation accuracy

4.2 Test images and their output given by the network

4.3 Network accuracy under the effect of fully connected layer’s width

4.4 Network accuracy under the effect of convolution kernels

4.5 Training accuracy comparison between the original network and the new network

List of Tables

I Number of images used in each class

II Number of generated sub-images in each class

III Table of layers in the LeNet and their size

IV Number of sub-images in 5 classes participating in the dataset

V Number of sub-images in 5 classes participating in one data file

VI Testing accuracy of the network on classes (%)

VII Test images and their output given by the network

VIII Transfer functions[5]


This internship project consumed a huge amount of work, research and dedication. Still, the implementation would not have been possible if I had not had the support of many individuals and organizations. Therefore, I would like to extend my sincere gratitude to all of them.

The project was supported by the University of Science and Technology of Hanoi, especially the ICTLab of the university. I am thankful to my supervisor, Dr. Tung Hoang TRAN, who has kindly guided and supported me even before the project began.

I am thankful to Dr. Mai Chi LUONG, who spent time talking with me and gave me helpful advice on my academic orientation.

I am also grateful to all professors and researchers in ICTLab for providing me with such a good working environment, equipment and discussions.

I would also like to thank Marie Ballere and Romain Verset, two students from France, who spent time discussing with me and helped me to better understand my problems. I have to express my appreciation to all members of my family. They are my source of encouragement and wisdom.

Finally, I would like to give praise and thanks to God for leading me to the university and giving me strength to overcome all difficulties in my study and life.


In this project, I created and augmented a dataset from a number of given images to train and test a convolutional neural network that classifies images of scanned documents into five classes. To generate the dataset, several image processing techniques were applied, such as sliding windows, rotating, flipping and pyramid-sizing. The result of this phase is a set of images of the same size, 224x224x3. After being labeled, these images were divided into three datasets for training, validating and testing the network. The network is a simple convolutional neural network, also called LeNet. It has three convolutional layers and one fully connected layer. After the network was trained and validated, its best state was identified and tested on the testing dataset and on some real images. The results showed that the LeNet was able to classify images of documents with fairly high accuracy. At the end of the project, I modified the network and discussed the effect of those changes, with the purpose of creating a similar network that performs better than the original one. The results showed that the new network worked slightly better than the original version.

Chapter 1 Introduction

This chapter presents the context of the project, the lab where I worked during my internship and the motivation for this project. After that, the structure of this report is explained.

1.1 Context

This section introduces the ICTLab, where I did my internship, the ARCHIVES project, in which I had the chance to participate, and the context of my internship.

1.1.1 About ICTLab

During the three months of my M1 internship, I had the opportunity to work in the ICTLab[8] of the University of Science and Technology of Hanoi (USTH), an international joint laboratory between USTH and partners from Vietnam and France such as the Institute of Information Technology (IOIT) Hanoi, the Institut de Recherche pour le Développement (IRD) and the University of La Rochelle, France.

ICTLab was founded in 2014 with the support of USTH, the French Embassy in Vietnam, the Asian Development Bank (ADB) and several universities and institutes from France.

Currently, the lab is working on two main projects:

- SWARMS: Say and Watch: Automated image/sound Recognition for Mobile monitoring Systems.
- ARCHIVES: Analysis and Reconstruction of Catastrophes in History within Interactive Virtual Environments and Simulations. This is the project in which I participated.

1.1.2 ARCHIVES project

The aim of ARCHIVES is to extract data and information from collected historical documents to provide a virtual representation that helps researchers understand more about past natural disasters in the Red River area. By studying such a representation, researchers can improve the prediction and management of natural hazards that may happen in the future.

1.1.3 Internship context

One of the initial steps in the ARCHIVES project is extracting data from documents. However, these documents are not all of the same type. They are divided into five classes: graph, map, photo, hand-written text and printed text. To extract data from these documents effectively, each type of document must be processed with a suitable method. Thus, the very first step of the project is classifying all documents that have been collected, as well as new ones in the future.

At the moment, among the different methods that have been applied to image classification, the convolutional neural network (CNN) is one of the best. In this project, I studied CNNs and applied this method to create a classifier for the ARCHIVES data.

In detail, my work in this internship project was divided into two parts:

- Data creation and augmentation for training a classification model
- Training a classification model using a convolutional neural network

All the work I did in these two parts is explained in detail in Chapter 3: Contribution.

1.2 Report organization

This report is divided into five main chapters. The first chapter contains introductory information related to my internship. The next chapter, "State of the art", explains some concepts that should be understood before starting to work with convolutional neural networks. Chapter three, "Contribution", describes all the work that I did in the project; the results are then shown and discussed in the fourth chapter, "Results". Finally, everything is summed up in the last chapter, "Conclusion".

Chapter 2 State of the art

In this chapter, before explaining the convolutional neural network, I introduce some basic concepts of artificial intelligence and machine learning, and then describe the artificial neural network, the superset of the convolutional neural network, with its algorithm and components. After that, I explain the convolutional neural network and discuss its advantages over regular neural networks for learning from images.

At the end of this chapter, I also explain two important problems called underfitting and overfitting, which appear when training learning models like the ones in this project.

2.1 Artificial intelligence & machine learning

Machine learning is a sub-set of artificial intelligence (AI). However, due to the growth and extension of machine learning, the boundary between machine learning and artificial intelligence is becoming very blurred.

The goal of AI is to mimic some special functions of the human brain[16] such as speech recognition and problem solving. Many methods have been created to solve these problems, and machine learning is one of them. As the name suggests, machine learning focuses on making the computer "learn" from experience.

In 1997, Tom Mitchell introduced a formal definition of machine learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."[13]

Before starting to learn, a model has no prior knowledge of the problem, but it is allowed to try to solve the problem with a set of initial variables. Those values can be set or picked randomly and have no relation to the expected output of the model. As a consequence, the model will fail many times to give the right answer, but after each attempt it modifies itself to get a better answer. Finally, the machine will be able to solve not only the problem it was given but also similar problems, depending on how well the model generalizes.

2.2 Artificial neural network (ANN)

Inspired by the biological neural networks of human and animal brains, the artificial neural network inherits some advantages of this structure, so that it is able to perform some of the tasks that the brain does. At the moment, ANNs cannot work as well as the human brain in all fields, but the number of valuable applications of ANNs has been increasing considerably, not only in computer science but also in many other areas such as economics, finance and the environment.

2.2.1 History

Learning is a function of the human brain, which is made of a system of neurons. Inspired by this biological system, in 1943 McCulloch and Pitts[12] created a computational model named the Threshold Logic Unit (TLU), which is regarded as the beginning of the artificial neural network.

A few years later, Donald Hebb introduced his theory, called Hebb’s rule[6], which explained how the repeated activation of a connection between two cells in a neural network affects the strength of that connection. Siegrid Löwel summarized it briefly: "Neurons wire together if they fire together"[11]. This is considered the ’typical’ unsupervised learning rule.

In 1958, Frank Rosenblatt[15] described the perceptron algorithm for supervised learning of binary classifiers. The algorithm implemented simple mathematical operations on a learning network with two layers. Because of the limitations of the perceptron algorithm, circuits such as exclusive-or (XOR) could not be processed by neural networks until 1974, when Paul Werbos[20] invented the back-propagation algorithm. The back-propagation algorithm solved the XOR problem, which led to the age of fast training of multilayer neural networks.

Given the development of computing technology, the performance of ANNs can be expected to keep rising, and ANNs are likely to be applied in most areas of our lives.

2.2.2 Regular neural network

There are several types of neural network. This subsection explains the most traditional type, the fully connected neural network.

Neuron model

The neuron is the most basic unit of a neural network, and the single-input neuron is its simplest form.

The left side of Figure 2.1 shows a single-input neuron.

illustration not visible in this excerpt

Figure 2.1: A single-input neuron (left) and a multiple-input neuron (right)[5]

Two scalars p and w are the input and the weight of the neuron, respectively. Their product is summed with a second input, which is the constant 1 multiplied by a bias b. The output n of the summer is sent to the transfer function f, which produces a, the scalar output of the neuron. In short, the neuron output is computed as:

a = f(wp + b)

In practice, most neurons in artificial neural networks receive not one input but multiple inputs, as shown on the right side of Figure 2.1.

A multiple-input neuron receives more than one input, and each of these inputs is multiplied by its own weight. The weighted inputs are then sent to the summer to be added together with the bias b. Just as in the single-input neuron, the output n of the summer is passed to the transfer function, and the neuron finally gives an output a, which is also a scalar.
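The single-input and multiple-input neurons described above can be sketched in a few lines of Python. This is only an illustration: the function names, the input values and the choice of transfer functions are assumptions made for the example and do not come from the report.

```python
import math


def neuron_output(inputs, weights, bias, transfer):
    """Compute a = f(w . p + b) for a neuron with one or more inputs."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # The summer adds the weighted inputs and the bias to produce n.
    n = sum(w * p for w, p in zip(weights, inputs)) + bias
    # The transfer function turns n into the scalar output a.
    return transfer(n)


def logistic(n):
    # Standard logistic sigmoid, used here as an example transfer function.
    return 1.0 / (1.0 + math.exp(-n))


# Single-input neuron: p = 2.0, w = 0.5, b = -1.0, so n = 0.0.
single = neuron_output([2.0], [0.5], -1.0, logistic)

# Multiple-input neuron with three inputs and an identity transfer function.
multi = neuron_output([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], 0.5, lambda n: n)

print(single, multi)
```

The single-input case is simply the multiple-input case with lists of length one, which is why one function covers both.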

A layer of a neural network is constructed from one or more neurons, and a network is constructed from one or more layers. Figure 2.2 shows the structure of a three-layer artificial neural network.

illustration not visible in this excerpt

Figure 2.2: Three-layer neural network[5]

Transfer functions

Transfer functions, or activation functions, which appeared in the neuron figures above, are very important components of the learning network. They transform the result of each neuron into a more useful form; in particular, they introduce non-linearity into the model. (Appendix A)

Some of the most popular activation functions are:

- Logistic sigmoid:

illustration not visible in this excerpt

The logistic sigmoid function used to be the most popular activation function for introducing non-linearity into artificial neural networks because of its biological plausibility, even though the hyperbolic tangent function performed better in practical problems.

- Hyperbolic tangent

f (x) = tanh(x)

The hyperbolic tangent function performs better than the traditional logistic sigmoid when training multilayer neural networks in practice, especially with suitable parameters[9]:

illustration not visible in this excerpt

However, because the hyperbolic tangent function does not model biological neurons as well as the logistic sigmoid does, it could not replace the logistic sigmoid as the most popular choice.

- Rectified Linear Units (ReLU)

The story changed when Rectified Linear Units were introduced by Nair and Hinton[14] in 2010. Just one year later, ReLU was claimed to be a better model of biological neurons and to perform better than the previous functions[4].

The ReLU operation, in fact, is very simple:

f (x) = max(0, x)
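The three activation functions listed above can be compared numerically with a short Python sketch. The logistic sigmoid is written with its standard definition 1 / (1 + e^(-x)), which is an assumption here because the report's own formula is not visible in this excerpt; the hyperbolic tangent is the plain tanh(x), without the recommended parameters.

```python
import math


def logistic_sigmoid(x):
    # Standard definition (assumed); the output lies between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))


def hyperbolic_tangent(x):
    # Plain tanh; the output lies between -1 and 1.
    return math.tanh(x)


def relu(x):
    # Negative inputs are clipped to zero, positive inputs pass unchanged.
    return max(0.0, x)


# Print the three functions side by side for a few sample inputs.
for x in (-2.0, 0.0, 2.0):
    print(x, logistic_sigmoid(x), hyperbolic_tangent(x), relu(x))
```

The sigmoid and the hyperbolic tangent both saturate for large positive or negative inputs, while ReLU grows without bound for positive inputs.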

Figure 2.3 below compares the graphs of the three transfer functions:

illustration not visible in this excerpt

Figure 2.3: Logistic sigmoid (blue), hyperbolic tangent with recommended parameters (green) and ReLU (red)

Backpropagation

In pattern recognition[17], analyzing the features of input images is a very important task. When pattern recognition methods first appeared, these features were designed by human engineers. However, human-engineered features are very limited, and generating features by hand is exhausting work, so computer scientists have been trying to create multilayer networks that can learn features from data by themselves. The key to training multilayer networks is backpropagation.

Backpropagation, or backward propagation of errors, is usually used with gradient descent or another optimization method to train ANNs. The method computes the gradient of a loss function with respect to all weights of the multilayer network. This is an application of the chain rule for derivatives. The idea is that the derivative of the objective with respect to the input of a module can be computed from the derivative with respect to the output of that module by moving backwards[9] through the module. Applying the same operation to the whole network, starting from the output and going all the way to the input, gives the gradients with respect to the weights of each module.

Based on these gradients, the optimization method updates the weights in order to minimize the value of the loss function. After this process is repeated many times, the model has a set of weights that produces an output close to the expected result.
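As a minimal sketch of the procedure described above, the following Python code trains a tiny two-module network (one tanh hidden neuron followed by one linear output neuron) with backpropagation and plain gradient descent. The network shape, the squared-error loss, the initial weights, the learning rate and the single training sample are all assumptions chosen for illustration; they are not the network or the settings used in the report.

```python
import math


def forward(params, x):
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)  # hidden module
    y = w2 * h + b2             # output module
    return h, y


def loss_fn(y, target):
    # Squared-error loss (an illustrative choice).
    return 0.5 * (y - target) ** 2


def backward(params, x, target):
    """Return dLoss/d(w1, b1, w2, b2) using the chain rule, output to input."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = y - target            # dLoss/dy
    dw2, db2 = dy * h, dy      # gradients of the output module
    dh = dy * w2               # error passed backwards to the hidden module
    dn1 = dh * (1.0 - h * h)   # derivative of tanh is 1 - tanh^2
    dw1, db1 = dn1 * x, dn1    # gradients of the hidden module
    return [dw1, db1, dw2, db2]


def train(params, x, target, learning_rate=0.1, steps=200):
    params = list(params)
    for _ in range(steps):
        grads = backward(params, x, target)
        # Gradient descent: move each weight against its gradient.
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


initial = [0.3, 0.0, -0.2, 0.1]  # hypothetical starting weights
x, target = 1.0, 0.5             # a single hypothetical training sample
trained = train(initial, x, target)
initial_loss = loss_fn(forward(initial, x)[1], target)
final_loss = loss_fn(forward(trained, x)[1], target)
print(initial_loss, final_loss)
```

The backward function mirrors the forward function in reverse order, which is the "moving backwards through the module" idea: each module only needs the derivative arriving from the module after it.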

2.2.3 Convolutional neural network (LeNet)

A convolutional neural network (CNN, also called ConvNet) is a type of artificial neural network. While classical neural networks are inspired by the system of nodes (neurons) in the animal brain, convolutional neural networks are inspired by the structure of the animal visual cortex[7], in which neurons are organized as overlapping sub-regions tiling the visual field. Those sub-regions are called receptive fields[1].

The receptive fields have various sizes. Cells with large receptive fields work as local filters over the input space, while cells with smaller receptive fields tend to recognize edge-like patterns within their fields[7]. This property enables the ConvNet to detect features from its inputs automatically, rather than depending on features engineered by humans. This is a great advantage of CNNs over classical ANNs, which contain only fully connected layers.

Figure 2.4 displays the typical architecture of a CNN.

illustration not visible in this excerpt

Figure 2.4: Typical CNN architecture[2]

Properties

As explained previously, the classical multilayer perceptron (MLP) models of ANNs have a structure of fully connected nodes. This organization works very well with low-dimensional data but has a serious problem with high-dimensional data, known as the "curse of dimensionality"[19].

CNNs, on the other hand, which were designed to emulate the behaviour of a visual cortex, can overcome this problem of MLP networks because of their ability to detect the spatially local correlations that appear in images.
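To give a rough sense of the scale of the problem, the sketch below counts the trainable parameters of a fully connected layer and of a convolutional layer applied to a 224x224x3 image, the image size used in this report. The layer width of 100 neurons, the 3x3 kernel size and the 32 kernels are hypothetical values chosen only for this comparison.

```python
def fully_connected_params(input_shape, num_neurons):
    """Every neuron has one weight per input value, plus one bias."""
    height, width, depth = input_shape
    weights_per_neuron = height * width * depth
    return weights_per_neuron * num_neurons + num_neurons


def convolutional_params(kernel_size, input_depth, num_kernels):
    """Every kernel has kernel_size x kernel_size x input_depth weights, plus one bias."""
    weights_per_kernel = kernel_size * kernel_size * input_depth
    return weights_per_kernel * num_kernels + num_kernels


input_shape = (224, 224, 3)

# Hypothetical fully connected layer with 100 neurons.
fc_count = fully_connected_params(input_shape, num_neurons=100)

# Hypothetical convolutional layer with 32 kernels of size 3x3.
conv_count = convolutional_params(kernel_size=3, input_depth=3, num_kernels=32)

print(fc_count, conv_count)  # 15052900 896
```

Under these assumptions, the fully connected layer needs about fifteen million parameters, while the convolutional layer needs fewer than a thousand, because its kernels are small and reused across the whole image.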

There are three main differences between CNNs and regular ANNs:

- 3D volumes of neurons: Neurons in the layers of a CNN are organized in a 3D structure: width, height and depth.
- Local connectivity: In a CNN, neurons of adjacent layers are connected in local patterns, and this property enables the network to exploit the spatially local correlation in images.
- Shared weights: Every filter in a CNN is replicated across the entire visual field, and the replicas share the same weight vector and bias. Therefore, the same feature can be detected anywhere on a convolutional layer.

Main layers

A convolutional neural network is constructed of two parts: feature extraction and classification.

- Feature extraction: The feature extraction part of a CNN usually contains three main layers:

- Convolutional layer: Before the convolutional layer in a CNN can be explained, the convolution operation must be clearly understood. The convolutional layer itself is explained after that.

* Convolution: Convolution is a technique used widely in image processing. It is the foundation of many image filters such as blur, sharpen and edge detection. Each convolution has a kernel. The width and height of a kernel are normally odd numbers, so that the kernel has a center element, and they usually have the same value. Convolution is a per-pixel operation applied to a source image. Figure 2.5 explains how the convolution kernel operates on a pixel. Normally, the kernel starts from the top-left corner of the source image, then moves to the right and downward until it has covered the whole image. At each step, the kernel covers an area of the source image equal to its own size, multiplies its values element-wise with the values in that area, and sums the products to produce the value of the destination pixel. All the computed destination pixels together form a new image, which is the result of the convolution. The position of the destination pixel corresponds to the center of the kernel, so the result image is smaller than the source image. However, the missing border pixels are rarely a problem, because the important information in an image usually lies in the center area, not in the few pixels on the edge.

illustration not visible in this excerpt

Figure 2.5: Convolution[10]

* Convolutional layer in CNN: Convolutional layers are the key components of CNNs. A convolutional layer performs the convolution operation across the input volume and produces a cube of neurons. The depth of each kernel matches the number of channels in the input volume, while the depth of the output cube equals the number of kernels in the layer.

These neurons are the result of the convolution of the layer's kernels with the input volume. Each neuron is connected to a specific zone of the input, called its receptive field. For example, on the first convolutional layer of the network, the inputs are RGB images of size 224x224x3. The kernel size is 3x3, so the receptive field it covers on the input is 3x3, too. As a result, each neuron in the convolutional layer is connected to a region of 3x3x3 on the input, hence it has 27 weighted inputs, which are trained and updated after each training iteration.

- ReLU layer: The ReLU layer, as described in the previous section, acts as an activation for the output of the CNN neurons.

- Pooling layer: The pooling layer is usually placed between convolutional layers in a CNN to reduce the size of the input volume for the next convolutional layer. The pooling layer only affects the height and width dimensions of the volume, not the depth.

The most common pooling method used in CNNs is max-pooling. The operation is performed on each depth slice of the input volume. For example, suppose max-pooling with a 2x2 window is applied to an input volume of size 224x224x3. The input is treated as 3 slices of 224x224. Each slice is divided into non-overlapping 2x2 regions, and the highest value in each region becomes the max-pool value. The output of this max-pooling has size 112x112x3.

This property of the pooling layer helps to avoid overfitting, because it reduces the number of weights needed later in the network. It also reduces the computational complexity of the following layers.
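The convolution and max-pooling operations described above can be sketched in plain Python on a single 2D channel. This is a simplified illustration, not the implementation used in the project: the function names, the 4x4 sample image and the box-blur kernel are assumptions, and the kernel is applied exactly as described in the text (element-wise multiplication and summation, without flipping the kernel).

```python
def convolve2d_valid(image, kernel):
    """Slide a square kernel over a 2D image and return the smaller result image."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    result = []
    for row in range(out_height):
        out_row = []
        for col in range(out_width):
            # Multiply the kernel element-wise with the covered area and sum.
            total = 0.0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[row + i][col + j]
            out_row.append(total)
        result.append(out_row)
    return result


def max_pool_2x2(channel):
    """Keep the highest value of every non-overlapping 2x2 region."""
    pooled = []
    for row in range(0, len(channel) - 1, 2):
        pooled.append([
            max(channel[row][col], channel[row][col + 1],
                channel[row + 1][col], channel[row + 1][col + 1])
            for col in range(0, len(channel[0]) - 1, 2)
        ])
    return pooled


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
box_blur = [[1 / 9] * 3 for _ in range(3)]  # simple averaging kernel

blurred = convolve2d_valid(image, box_blur)  # 4x4 source gives a 2x2 result
pooled = max_pool_2x2(image)                 # 4x4 source gives a 2x2 result
print(blurred)
print(pooled)
```

With a 3x3 kernel the result loses one pixel on each border, which is the shrinking effect mentioned in the text. Applying max_pool_2x2 to each of the 3 slices of a 224x224x3 volume would give three 112x112 slices, matching the 112x112x3 output in the example above.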


Excerpt out of 33 pages


University of Science and Technology of Hanoi
File size
3330 KB
machine learning, deep learning, classification, internship, computer science, neural network, convolutional neural network, leNet
Quote paper
Tai Doan (Author), 2016, Convolutional Neural Network in classifying scanned documents, Munich, GRIN Verlag,

