What Fuels Transformers in Computer Vision? Unraveling ViT's Advantages

Master's Thesis, 2022, 39 pages, Grade: 7.50

Author: Tolga Topal

Computer Science - Artificial Intelligence

Vision Transformers (ViT) are neural model architectures that compete with and exceed classical convolutional neural networks (CNNs) on computer vision tasks. ViT's versatility and performance are best understood through a backward analysis. In this study, we aim to identify, analyse and extract the key elements of ViT by backtracking to the origin of Transformer neural architectures (TNA). We highlight the benefits and constraints of the Transformer architecture, as well as the foundational role of self- and multi-head attention mechanisms.
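The self-attention mechanism named above can be sketched in a few lines. The following is a minimal NumPy illustration with toy dimensions, not code from the thesis:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise token similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy example: 4 tokens, embedding dimension 8 (illustrative sizes only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention simply runs several such maps in parallel on learned projections of the input and concatenates the results.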

We come to understand why self-attention might be all we need. Our interest in the TNA leads us to consider self-attention as a computational primitive. This generic computational framework provides flexibility in the tasks a Transformer can perform. Having gained a good grasp of Transformers, we analyse their vision-oriented counterpart, namely ViT, which is roughly a transposition of the original Transformer architecture to an image-recognition and -processing context.
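The transposition to images works by cutting an image into fixed-size patches and linearly embedding each one as a token, so the Transformer sees a sequence of "visual words". A minimal NumPy sketch of this patchification step (toy sizes, illustrative only):

```python
import numpy as np

def image_to_patches(img, patch):
    # Split an H x W x C image into non-overlapping (patch x patch) squares,
    # each flattened to a vector -- the tokens a ViT feeds to the Transformer.
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    grid = img.reshape(H // patch, patch, W // patch, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

# A 32x32 RGB "image" cut into 8x8 patches: 16 tokens of dimension 192.
img = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
tokens = image_to_patches(img, patch=8)

# Learned linear projection to the model width (here 64), as in ViT.
E = np.random.default_rng(0).normal(size=(8 * 8 * 3, 64))
embedded = tokens @ E
```

In the actual architecture, position embeddings and a class token are added before the sequence enters the Transformer encoder.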

When it comes to computer vision, convolutional neural networks are considered the go-to paradigm. Because of their proclivity for vision, we naturally seek to understand how ViT compares to CNNs. Their inner workings turn out to be rather different.
CNNs are built with a strong inductive bias, an engineering feature that equips them to perform well on vision tasks. ViT has less inductive bias and must learn equivalent structure (e.g. convolution-like filters) by ingesting enough data. This makes Transformer-based architectures rather data-hungry but also more adaptable.
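One way to see the inductive-bias trade-off is to compare parameter counts: a convolution reuses one small local kernel everywhere, so its cost is independent of image resolution, whereas a self-attention layer lets every patch interact with every other. A rough back-of-the-envelope sketch (the channel and width choices are hypothetical):

```python
def conv_params(c_in, c_out, k=3):
    # A k x k convolution: kernel weights plus one bias per output channel.
    # Same count for any image size -- locality and weight sharing built in.
    return c_in * c_out * k * k + c_out

def attention_params(d_model):
    # One self-attention layer: W_q, W_k, W_v, W_o projections (no biases).
    # No locality assumption -- every token attends to every other token.
    return 4 * d_model * d_model

conv = conv_params(64, 64)        # 36928, regardless of resolution
attn = attention_params(64)       # 16384, but interactions are global
```

The bias is not in the raw parameter count but in what the parameters are forced to encode: the convolution hard-wires locality and translation equivariance, while attention must discover such structure from data.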

Finally, we describe potential enhancements to the Transformer, with a focus on possible architectural extensions, and discuss some exciting learning approaches in machine learning. Our final analysis leads us to ponder the flexibility of Transformer-based neural architectures; we argue that this flexibility might be linked to their Turing-completeness.

Reading Sample


Table of Contents

  • Chapter 1
    • Introduction
    • Purpose Statement
    • Approach
      • Natural Language Processing (NLP)
      • Computer Vision (CV)
  • Chapter 2
    • Transformer
    • Transformer - Building Blocks
      • Transformer - Workflow
      • Transformers - Digest
    • Vision Transformer (ViT)
      • Key Ideas
    • ViT in CNN Realm
      • ViT - State of the Art (SOTA)
      • ViT and CNN: A Shared Vision?
  • Chapter 3
    • Perspectives for Transformers and ViTs
    • Selected Learning Paradigms
      • Model Soups - Ensemble Learning
      • Multimodal Learning
      • Self-Supervised Learning
      • Other Approaches and Open Question
    • Beyond Transformers?
    • Personal Path Of Exploration
    • Conclusion

Objectives and Key Themes

This master's thesis examines the architecture and functionality of Vision Transformers (ViT), a type of neural network that leverages the Transformer architecture to achieve state-of-the-art results in computer vision tasks. The study aims to understand the strengths and limitations of ViT by tracing their origins and analyzing key elements such as self-attention mechanisms. The objective is to demonstrate how these models compete with and outperform traditional convolutional neural networks (CNNs) in the domain of computer vision.

  • The role of Transformer neural architectures (TNA) in computer vision
  • Analysis of the strengths and limitations of ViT compared to CNNs
  • Exploration of key elements within ViT, including self-attention mechanisms
  • Discussion of potential future enhancements and research directions for Transformer-based architectures
  • Examination of the potential link between Transformer-based neural architectures and Turing-completeness

Chapter Summaries

  • Chapter 1 introduces the topic of computer vision and the role of deep learning. It highlights the limitations of traditional CNNs in achieving human-like generalization capabilities and explores the potential of Transformer architectures as a solution. The chapter introduces ViT, a Transformer-based neural network specifically designed for image processing.
  • Chapter 2 delves into the Transformer architecture, explaining its building blocks, workflow, and key concepts. It also analyzes the application of Transformer models in the field of computer vision, specifically discussing ViT and its performance compared to CNNs.
  • Chapter 3 explores various perspectives for Transformers and ViTs, including advanced learning paradigms like ensemble learning, multimodal learning, and self-supervised learning. It also discusses potential future directions for research in this area.

Keywords

The key terms and concepts explored in this thesis include Vision Transformers (ViT), Transformer neural architectures (TNA), self-attention mechanisms, computer vision, convolutional neural networks (CNNs), deep learning, state-of-the-art (SOTA), Turing-completeness, and advanced learning paradigms like ensemble learning, multimodal learning, and self-supervised learning.

End of reading sample from 39 pages

Details

Title
What Fuels Transformers in Computer Vision? Unraveling ViT's Advantages
University
Universidad de Alcalá
Course
Artificial Intelligence and Deep Learning
Grade
7.50
Author
Tolga Topal
Year of Publication
2022
Pages
39
Catalogue Number
V1437625
ISBN (PDF)
9783346993304
ISBN (Book)
9783346993311
Language
English
Keywords
Artificial intelligence, AI, deep learning, transformers, computer vision, Vision Transformers
Product Safety
GRIN Publishing GmbH
Cite this Work
Tolga Topal, 2022, What Fuels Transformers in Computer Vision? Unraveling ViT's Advantages, München, GRIN Verlag, https://www.grin.com/document/1437625