What Fuels Transformers in Computer Vision? Unraveling ViT's Advantages

Master's Thesis , 2022 , 39 Pages , Grade: 7.50

Author: Tolga Topal

Computer Sciences - Artificial Intelligence
Summary

Vision Transformers (ViT) are neural architectures that compete with, and in some tasks exceed, classical convolutional neural networks (CNNs) in computer vision. ViT's versatility and performance are best understood through a backward analysis. In this study, we aim to identify, analyse and extract the key elements of ViT by tracing the origins of Transformer neural architectures (TNA). We highlight the benefits and constraints of the Transformer architecture, as well as the foundational role of the self-attention and multi-head attention mechanisms.

We now understand why self-attention might be all we need. Our interest in the TNA has led us to consider self-attention as a computational primitive. This generic computational framework provides flexibility in the tasks a Transformer can perform. After establishing a solid grasp of Transformers, we analyse their vision counterpart, ViT, which is essentially a transposition of the original Transformer architecture to an image-recognition and -processing context.
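To illustrate self-attention as a computational primitive, the following minimal NumPy sketch implements scaled dot-product self-attention. The shapes and variable names are illustrative, not taken from the thesis:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens
    scores = q @ k.T / np.sqrt(k.shape[-1])          # scaled pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                          # 4 tokens, d_model = 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Every output token is a mixture of all value vectors, with mixing weights computed from the input itself; this input-dependent routing is what makes the primitive so generic.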

When it comes to computer vision, convolutional neural networks are considered the go-to paradigm. Because of their proclivity for vision, we naturally seek to understand how ViT compares to CNNs. Their inner workings turn out to be rather different.
CNNs are built with a strong inductive bias, an engineering feature that equips them to perform well in vision tasks. ViTs have less inductive bias and must learn equivalents of these priors (such as convolutional filters) by ingesting enough data. This makes Transformer-based architectures more data-hungry but also more adaptable.

Finally, we describe potential enhancements to the Transformer, with a focus on possible architectural extensions, and discuss some promising learning approaches in machine learning. Our closing analysis leads us to reflect on the flexibility of Transformer-based neural architectures; we argue that this flexibility may be linked to their Turing-completeness.

Excerpt


Table of Contents

1 Chapter 1

1.1 Introduction

1.2 Purpose Statement

1.3 Approach

1.3.1 Natural Language Processing (NLP)

1.3.2 Computer Vision (CV)

2 Chapter 2

2.1 Transformer

2.2 Transformer - Building Blocks

2.2.1 Transformer - Workflow

2.2.2 Transformers - Digest

2.3 Vision Transformer (ViT)

2.3.1 Key Ideas

2.4 ViT in CNN Realm

2.4.1 ViT - State of the Art (SOTA)

2.4.2 ViT and CNN: A Shared Vision?

3 Chapter 3

3.1 Perspectives for Transformers and ViTs

3.2 Selected Learning Paradigms

3.2.1 Model Soups - Ensemble Learning

3.2.2 Multimodal Learning

3.2.3 Self-Supervised Learning

3.2.4 Other Approaches and Open Question

3.3 Beyond Transformers?

3.4 Personal Path Of Exploration

3.5 Conclusion

Objectives & Core Topics

The primary objective of this thesis is to identify, analyze, and extract the key elements that enable Transformer-based architectures to transition successfully from natural language processing to the domain of computer vision, and to evaluate how Vision Transformers (ViTs) compete with traditional convolutional neural networks.

  • Fundamentals of Transformer neural architecture and the self-attention mechanism.
  • Comparative analysis of Vision Transformers versus Convolutional Neural Networks (CNNs).
  • The operational shift from sequence-based NLP data to patch-based image processing.
  • Evaluation of architectural extensions and learning paradigms like self-supervised learning.
  • Assessment of model robustness and the impact of inductive bias in vision tasks.

Excerpt from the Book

2.4.2 ViT and CNN: A Shared Vision?

As mentioned earlier in this work, CNN-based models are the state of the art for computer vision tasks. In this section, we explore the commonalities, or lack thereof, between CNNs and ViTs.

In (Tuli et al. 2021), the authors ask which neural architecture, CNN-based or Transformer-based, is closer to the human vision process.

Their rationale is based on the following observation: in a classification context, there is one way to be right but many ways to be wrong. On this basis, they measure the error consistency of two categories of models: Transformer-based, i.e. ViT (ViT-B/16, ViT-B/32, ViT-L/16 and ViT-L/32), and CNN-based, i.e. BiT-M-R50x1. They perform their tests (Cohen's κ and Jensen-Shannon distance) on the Stylized ImageNet (SIN) dataset (Geirhos et al. 2019).

The results indicate that ViTs' errors are more consistent with human errors than those of the CNN-based approach.
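Error consistency with Cohen's κ can be sketched as follows: given boolean masks marking which trials each observer got wrong, κ compares the observed agreement of the error patterns with the agreement expected by chance from the two error rates alone. This is a simplified illustration; the function and variable names are our own, not from (Tuli et al. 2021):

```python
import numpy as np

def error_consistency_kappa(err_a, err_b):
    """err_a, err_b: boolean arrays, True where each observer erred."""
    err_a, err_b = np.asarray(err_a), np.asarray(err_b)
    p_obs = np.mean(err_a == err_b)              # observed agreement
    ea, eb = err_a.mean(), err_b.mean()          # marginal error rates
    p_exp = ea * eb + (1 - ea) * (1 - eb)        # chance-level agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Two observers that err on mostly the same trials -> kappa well above 0
a = np.array([1, 1, 0, 0, 1, 0, 0, 0], dtype=bool)
b = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
print(round(error_consistency_kappa(a, b), 3))  # 0.714
```

A κ of 0 means the two observers' errors overlap only as much as chance predicts; a κ near 1 means they fail on largely the same inputs, which is the sense in which ViT errors track human errors.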

Another possible point of comparison concerns model robustness in adversarial settings.

In (Benz et al. 2021), the authors conduct empirical tests of image classification with three architecture families: CNNs, ViTs and MLP-Mixer. We focus on the first two. The attack settings of concern are:

  • Robustness against white-box attacks
  • Robustness against black-box attacks
    – Query-based black-box attacks
    – Transfer-based black-box attacks
  • Robustness against common corruptions
  • Robustness against Universal Adversarial Perturbations (UAPs)

By applying high-pass and low-pass filtering, they propose a frequency-based explanation for why CNNs are less robust: CNN models rely more on high-frequency features, whereas ViTs rely on low-frequency ones. This reliance on low frequencies is one element of ViTs' increased robustness.
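A minimal sketch of this kind of frequency analysis, assuming an ideal hard-cutoff filter in the Fourier domain (the actual filters used by Benz et al. may differ):

```python
import numpy as np

def frequency_split(img, cutoff):
    """Split a grayscale image into low- and high-frequency components
    using an ideal (hard-cutoff) filter in the Fourier domain."""
    f = np.fft.fftshift(np.fft.fft2(img))        # spectrum, DC at center
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)    # distance from DC term
    low_mask = dist <= cutoff
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * ~low_mask)).real
    return low, high

img = np.random.default_rng(0).random((64, 64))
low, high = frequency_split(img, cutoff=8)
# The complementary bands sum back to the original (up to numerical error).
print(np.allclose(low + high, img))  # True
```

Feeding only `low` or only `high` to a trained classifier and comparing accuracies is the basic probe behind the claim that CNNs lean on high frequencies while ViTs lean on low ones.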

Summary of Chapters

Chapter 1: Provides an introduction to the research context, establishing the role of Transformers in both NLP and the evolving landscape of computer vision.

Chapter 2: Details the core Transformer architecture, its building blocks like self-attention, and specifically examines the adaptation of Vision Transformers (ViTs) and their performance relative to CNNs.

Chapter 3: Offers reflections on future extensions, discusses machine learning paradigms like self-supervised learning, and explores the theoretical limits and potential of Transformer-based designs.

Key Keywords

Transformer, Computer Vision, Vision Transformer, ViT, Self-Attention, Convolutional Neural Networks, CNN, Deep Learning, Natural Language Processing, Inductive Bias, Multi-Head Attention, Foundational Models, Artificial Intelligence, Model Robustness, Self-Supervised Learning

Frequently Asked Questions

What is the core subject of this thesis?

The thesis investigates the underlying mechanisms that allow the Transformer neural architecture to effectively transfer its success from natural language processing to computer vision.

What are the primary themes discussed in the work?

The main themes include the mechanics of self-attention, the architectural transition from sequence data to image patches, the comparative performance of ViTs versus CNNs, and the broader implications of inductive bias.

What is the specific research goal?

The goal is to identify and extract the "fuel"—the specific components or architectural features—that enables Vision Transformers to compete with or exceed established convolutional neural networks.

Which scientific methodology is applied?

The study utilizes a backward analysis approach, examining the origins of Transformer architectures in NLP and systematically reviewing influential research papers to assess their application and effectiveness in vision tasks.

What is covered in the main body (Chapter 2)?

The main body deconstructs the Transformer architecture, details the workflow of Vision Transformers (including, e.g., patch division), and analyzes empirical benchmarks comparing them against traditional CNN-based paradigms.

Which keywords best characterize this research?

The research is best characterized by terms such as Transformer, Vision Transformer, self-attention, computer vision, CNN, inductive bias, and model robustness.

How do Vision Transformers handle images compared to CNNs?

Unlike CNNs, which use convolutional filters with strong inductive spatial bias, ViTs divide images into non-overlapping patches, flattening them into sequences to leverage the Transformer's self-attention mechanism for capturing global dependencies.
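The patch pipeline described above can be sketched in a few lines of NumPy. The 224x224 image and 16x16 patch sizes follow the common ViT-B/16 configuration, but the helper itself is our own illustration (the real model then linearly projects each flattened patch and adds position embeddings):

```python
import numpy as np

def image_to_patches(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened into a vector, as in ViT's input pipeline."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image dims must be divisible by p"
    patches = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)        # (num_patches, patch_dim)

img = np.zeros((224, 224, 3))                    # dummy RGB image
tokens = image_to_patches(img, p=16)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows plays the role a word token plays in NLP, so the unmodified Transformer stack can attend across patches and capture global dependencies from the very first layer.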

What is the significance of the "Turing-completeness" argument?

The author discusses Turing-completeness to argue that the flexibility of the attention mechanism allows Transformers to serve as a general-purpose computational primitive capable of expressing any desired algorithm.

What is revealed about the robustness of ViTs versus CNNs?

The thesis highlights that ViTs show higher robustness against certain adversarial attacks than CNNs, likely because ViTs are more receptive to low-frequency features, whereas CNNs rely more heavily on high-frequency information.

What is the author's suggestion regarding Transformer visualization?

The author explores a novel visualization approach using binary analysis to interpret the internal structure of models like GPT-2, suggesting that such representations could offer new insights into how these architectures are organized in high-dimensional space.


Details

Title
What Fuels Transformers in Computer Vision? Unraveling ViT's Advantages
College
Universidad de Alcalá
Course
Artificial Intelligence and Deep Learning
Grade
7.50
Author
Tolga Topal (Author)
Publication Year
2022
Pages
39
Catalog Number
V1437625
ISBN (PDF)
9783346993304
ISBN (Book)
9783346993311
Language
English
Tags
Artificial intelligence, AI, deep learning, transformers, computer vision, Vision Transformers
Product Safety
GRIN Publishing GmbH
Quote paper
Tolga Topal (Author), 2022, What Fuels Transformers in Computer Vision? Unraveling ViT's Advantages, Munich, GRIN Verlag, https://www.grin.com/document/1437625