Phishing Detection with Modern NLP Approaches

Master's Thesis, 2020

59 pages, Grade: 1.3

Christian Schmid (Author)




List of Figures

List of Tables

List of Abbreviations

1 Introduction
1.1 Motivation
1.2 Research Objective
1.3 Research Approach

2 Related Work

3 Modern NLP Approaches
3.1 Transformers
3.2 BERT
3.3 RoBERTa
3.4 XLNet
3.6 DistilBERT
3.8 MobileBERT

4 Data and Methodology
4.1 The IWSPA-AP Dataset
4.2 Data Preprocessing for Classic NLP Methods
4.3 Splitting and Oversampling
4.4 Model Selection
4.5 Performance Metrics

5 Results
5.1 Results of classic machine learning models
5.2 Results of modern NLP models

6 Discussion

7 Conclusion



List of Figures

Figure 1: Transformer model architecture

Figure 2: BERT input representation

Figure 3: Training procedure of BERT

Figure 4: Permutation language modeling objective

Figure 5: Cross-layer parameter sharing

Figure 6: Model architecture of DistilBERT

Figure 7: Procedure of replaced token detection

Figure 8: Model architecture of MobileBERT

Figure 9: Model settings of MobileBERT

Figure 10: Progressive knowledge transfer

Figure 11: Datasets before and after oversampling

Figure 12: Classification architectures

Figure 13: Model performance dependencies

List of Tables

Table 1: Characteristics of the IWSPA-AP Dataset

Table 2: Results of classic machine learning models

Table 3: Runtimes of modern NLP models

Table 4: Number of training epochs of modern NLP models

Table 5: Results of modern NLP models

List of Abbreviations

[Not included in this reading sample]

1 Introduction

A few days after US president Donald Trump tested positive for COVID-19, a large number of citizens in the US and Canada received an email offering detailed insights into his coronavirus diagnosis and health condition. Interested recipients were encouraged to click on a link contained in the email to get access. The link led to a Google Docs landing page, where they were instructed to download an Excel file containing malicious macros. Enabling those macros resulted in the download of software that allowed the senders to deliver later payloads (Barth 2020). These emails are a typical example of a phishing attack.

1.1 Motivation

Phishing is a form of identity theft that combines social engineering techniques and sophisticated attack vectors to fraudulently obtain confidential information from unsuspecting consumers (Kumaraguru et al. 2010, Pienta et al. 2018). Typically, phishing messages imitate trustworthy sources and request information via some form of electronic communication. The most frequent attack route is email, where phishers try to persuade the recipients to perform an action (Pienta et al. 2018). This action may involve revealing confidential information (e.g. passwords) or inadvertently providing access to their computers or networks (e.g. through the installation of malicious software). The risks have increased with the surge in email use and the growing sophistication of phishing attacks. According to Verizon’s data breach investigation report, phishing is the most likely cause of security breaches, and 65% of organizations in the United States experienced a successful phishing attack, which is 10% higher than the global average (Verizon 2020). Furthermore, successful phishing attacks often entail severe consequences. For instance, a study by Pienta et al. (2018) shows that the average cost to a victimized mid-size company is $1.6 million. Such threats have led to the development of a large number of techniques for the detection and filtering of phishing emails.

From a high-level perspective, there are three commonly suggested solutions to prevent successful phishing attacks. The first is to educate users and enhance their ability to correctly identify phishing messages. However, this solution is not very effective since phishing attacks explicitly aim to exploit human weaknesses. For example, a study by Sheng et al. (2010) provides evidence that about 29% of malicious emails that bypass countermeasures are opened by users trained with the best-performing user awareness program.

A second possible solution is to use software classifiers that warn users about suspicious emails. Like the first solution, its effectiveness is impeded by human error. Security warnings may be ignored, or phishing emails and the links therein may still be opened by incautious or even malicious users. The last and possibly most promising solution is to filter out phishing emails directly with software classifiers, thereby preventing phishing emails from reaching their targets. Obviously, such classifiers should be able to distinguish between phishing and non-phishing emails with high accuracy.

Unfortunately, the characteristics of phishing attacks make it difficult to implement effective software classifiers for at least two reasons. First, since phishing emails deliver malicious content in natural language, the classifiers have to cope with the semantics and complexity of natural languages. Second, deriving generalized rules that classifiers can use for phishing email detection is difficult due to the variety and dynamics of phishing attacks. In the last two decades, numerous natural language processing (NLP) approaches have been developed that can be applied to phishing email detection to account for these challenges. NLP is a field of machine learning concerned with the ability of computers to understand, analyze, manipulate and potentially generate human language. The most recent major breakthrough in NLP came in 2018 with the development of BERT (Bidirectional Encoder Representations from Transformers). BERT is an NLP embedding approach that achieved state-of-the-art results in many NLP tasks such as question answering, sentiment analysis and language inference. In the following two years, researchers used BERT as a template to develop a wide range of similar NLP approaches, all using slightly different architectural designs and pre-training methodologies to generate numerical text representations.

In this thesis, we want to test the applicability of a number of these embedding approaches in phishing email detection.

1.2 Research Objective

Phishing email detection can be seen as a classification task in which phishing emails have to be distinguished from legitimate emails. As mentioned in the previous section, modern NLP embedding methods like BERT achieve remarkable results in many applications in which natural language processing is feasible and desirable. Therefore, similar results can be expected in phishing email detection. Based on this expectation, we formulate the following research questions:

- Do classification models based on modern NLP embedding models like BERT achieve higher performance in phishing email detection than classic machine learning models?
- How should these modern NLP models be designed and trained to precisely distinguish between genuine and phishing emails?

1.3 Research Approach

In this work, we generally differentiate between the following two model classes for phishing detection:

- Class 1: Classic machine learning models

The models of this class are built using classic text representation methods like TF-IDF and Doc2Vec together with traditional machine learning classifiers like logistic regression and support vector machines (SVM). A full list of the selected models is given in section 4.4. In this work, we use the best of them as baselines to test the applicability of the modern NLP models that we assign to class 2.

- Class 2: Modern NLP models

The models of this class use recent pre-trained text representation methods like BERT to embed text documents. All those models can easily be adapted to build classifiers that distinguish between phishing and legitimate emails. Section 4.4 again gives a full list of the selected models including their classification architectures. In this work, we refer to them as modern NLP models.

To answer the research questions, we first train all models of class 1 and class 2 on different training datasets and then test them on the same test datasets. All datasets contain phishing and legitimate emails. Based on the results, we select the best three models of class 1 as baselines to test the phishing detection skills of the models of class 2. Furthermore, we try to explain why some models perform better than others in this task. Here, we especially focus on the individual model architectures and training types.

2 Related Work

In the last two decades, many researchers have studied the phishing problem and proposed a variety of email filtering approaches to resist phishing attacks. Here, filtering can be defined as the automatic classification of messages into phishing and legitimate emails. One traditional way to detect phishing emails is to use blacklists. Blacklists are lists of URLs that were previously linked in phishing emails and led to fraudulent websites. Often, the senders’ IP addresses are also stored in blacklists. The lists are automatically downloaded to the user’s machine and updated at regular intervals. Generally, URLs or IP addresses are added to blacklists when they are reported as malicious by search engines or users (Xie et al. 2013). However, filters using blacklists are not very effective against new phishing sources since these take time to be reported. For instance, Sheng et al. (2009) found that blacklists were able to detect only 20% of zero-hour phishing attacks. Furthermore, URL blacklists rely on exact matches with blacklisted entries, which makes it easy for attackers to circumvent blacklist filters via simple URL modifications. Another list-based approach is to use whitelists. In contrast to blacklists, whitelists are collections of authorized URLs or IP addresses (Shastri 2008). Filters using whitelists classify all emails as phishing except those that come from trustworthy senders or contain allowed website links. Like blacklists, whitelists have disadvantages. With whitelists, large numbers of legitimate emails may be filtered out. Additionally, there is the risk of exploitation by cyber criminals. By stealing a legitimate email address or simply spoofing it, they can take advantage of whitelists to smuggle dangerous emails into mailboxes.
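The exact-match weakness of URL blacklists can be illustrated with a minimal sketch. The blacklist entry, the URLs and the function name `is_blacklisted` are made up for illustration and are not taken from the cited studies.

```python
# Hypothetical blacklist with a single made-up entry (illustration only).
BLACKLIST = {"http://phish.example.com/login"}


def is_blacklisted(url: str) -> bool:
    """Exact-match lookup, as performed by a simple URL blacklist filter."""
    return url in BLACKLIST


original = "http://phish.example.com/login"
modified = "http://phish.example.com/login?id=1"  # trivial URL modification

print(is_blacklisted(original))  # True
print(is_blacklisted(modified))  # False: the modified URL evades the filter
```

Because the lookup compares complete strings, any change to the URL yields a string that is no longer in the list.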

The main drawback of list-based email filters is that they are static and require human involvement to define and update the lists. Since phishing emails have become more and more sophisticated over the last decades and continue to do so, this lack of flexibility is a disadvantage, and list-based filtering does not achieve the desired performance. Hence, most of the currently proposed methods for phishing email detection use machine learning. Phishing email detection with machine learning usually involves two processing steps. In the first step, the emails are represented numerically using a natural language representation model. In the second step, the emails are classified using a classification model. Several studies use manually derived features to represent emails.

For instance, Shankhwar et al. (2019) proposed a URL-based machine learning approach to detect phishing emails. They manually derived 14 features from URLs and then used an SVM for classification. The idea of an SVM is to create a separating hyperplane between two classes in the feature space such that the margin between the hyperplane and the closest points of each class is maximized (Steinwart and Christmann 2018, Cortes and Vapnik 1995). Each point on one side of the hyperplane is then labelled as a phishing email, while each point on the opposite side is labelled as a genuine email. The model achieved an accuracy (ACC) of approximately 93%. Toolan and Carthy (2009) chose a small set of 5 features and used a recall-boosting ensemble technique called R-Boost. The approach of R-Boost was to classify emails with the C5.0 machine learning algorithm first and to re-classify those emails that the C5.0 algorithm classifies as legitimate. The reason for this approach was their observation that C5.0 achieved good precision, but the ensemble’s recall was much better. The re-classification is done by an ensemble classifier consisting of SVM, 3-nearest neighbor and 5-nearest neighbor. The R-Boost technique achieved an F1-Score of 99.31%.
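The two-stage idea behind R-Boost can be sketched as a small cascade. This is only an illustration under assumptions: the three stand-in classifiers, the feature names and the majority vote used to combine the ensemble are hypothetical and do not reproduce the models or features of Toolan and Carthy (2009).

```python
def cascade_predict(email, first_stage, ensemble):
    """Re-classify only the emails that the first stage labels as legitimate."""
    if first_stage(email) == "phishing":
        return "phishing"
    votes = [clf(email) for clf in ensemble]
    # Assumption: the ensemble members are combined by majority vote.
    if votes.count("phishing") > len(votes) / 2:
        return "phishing"
    return "legitimate"


# Hypothetical stand-in classifiers operating on made-up features.
first_stage = lambda e: "phishing" if e["num_links"] > 5 else "legitimate"
ensemble = [
    lambda e: "phishing" if e["has_ip_url"] else "legitimate",
    lambda e: "phishing" if e["num_links"] > 2 else "legitimate",
    lambda e: "phishing" if e["asks_for_password"] else "legitimate",
]

suspicious = {"num_links": 3, "has_ip_url": False, "asks_for_password": True}
benign = {"num_links": 0, "has_ip_url": False, "asks_for_password": False}

print(cascade_predict(suspicious, first_stage, ensemble))  # phishing
print(cascade_predict(benign, first_stage, ensemble))      # legitimate
```

The first stage misses the suspicious email, but two of the three ensemble members flag it, which is how a second stage can raise recall.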

Akinyelu and Adewumi (2014) used 15 prominent phishing features identified from the literature and applied a random forest classifier to distinguish between phishing and legitimate emails. Random forest is an ensemble learning technique that predicts class labels based on a set of decision tree classifiers (Breiman 2001). Each decision tree is constructed in the training phase and makes its own prediction for a new instance. The random forest then assigns the instance to the class that receives the largest share of votes from the individual decision trees. The proposed model achieved an ACC of 99.7%.

Smadi et al. (2015) used 23 hybrid features of email headers and bodies, applied the J48 classification algorithm to classify phishing and legitimate emails and reported an ACC of 98.11%.

Pandey and Ravi (2012) used 23 keywords extracted from the email body and tested a set of classification algorithms including multilayer perceptron, decision trees, SVM, probabilistic neural net, genetic programming and logistic regression. The best result was achieved using genetic programming with a classification ACC of 98.12%.

The presented papers provide evidence that machine learning can successfully be applied to phishing email detection based on manually derived features. However, there are two disadvantages of using manually derived features for phishing email detection. First, expert knowledge is required to determine suitable features. Second, important features may be missed. Hence, further studies tried to overcome these shortcomings using other natural language representation models.

For instance, Unnithan et al. (2018a) converted emails into numerical vectors using TF-IDF and augmented them with additional domain-level features. Afterwards, they applied naive Bayes, logistic regression and SVM for classification. TF-IDF (Salton and Yang 1973) is a numerical statistic that determines the importance of words in document collections. The importance of a word increases proportionally to the number of times it appears in a document and is offset by the number of documents in the collection that contain the word (Salton and Yang 1973, Zhang et al. 2011). Naive Bayes is a simple probabilistic classifier based on applying Bayes’ theorem with strong independence assumptions between the features. To classify a new instance, the classifier tries to find the most likely class label for the corresponding feature vector (Rish 2001). The idea of logistic regression is to map instances to values between zero and one by applying the logistic function to weighted vector representations (Hosmer et al. 2013). The results showed that logistic regression outperformed the other two classifiers and achieved an ACC of 95%.
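As a minimal sketch of the TF-IDF idea, the following snippet weights the terms of three made-up documents. It assumes one common variant (raw term count multiplied by log(N / df)); the exact weighting schemes used in the cited studies are not specified here, and the documents and the function name `tfidf` are purely illustrative.

```python
import math
from collections import Counter

# Three made-up example documents (illustration only).
docs = [
    "verify your account now",
    "meeting agenda for monday",
    "your account statement",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc_tokens):
    """Return {term: tf * idf} with raw counts and idf = log(N / df)."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}


weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in only one document ("verify", "now") receive a higher weight than terms shared by several documents ("your", "account").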

Harikrishnan et al. (2018) also used TF-IDF to represent emails and afterwards applied classical machine learning techniques like random forest, AdaBoost, naive Bayes, decision trees and SVM for classification. Logistic regression had the best performance, achieving an ACC of 88.95%. Vazhayil et al. (2018) built the same model, but used a term-document matrix instead of TF-IDF for email representation. The best classifier was SVM with an ACC of 91.41%.

Besides TF-IDF, Unnithan et al. (2018b) used Doc2Vec to represent emails as numeric vectors and trained several classification techniques such as naive Bayes, SVM, decision tree, k-nearest neighbor, logistic regression, AdaBoost and random forest on the respective vector representations. Doc2Vec (Le and Mikolov 2014) is an unsupervised learning algorithm that extends Word2Vec (Mikolov et al. 2013) by adding paragraph vectors that are trained to represent text documents. AdaBoost is an ensemble learning classifier that combines weak classifiers into a strong classifier (Freund and Schapire 1997). The results of the study showed that Doc2Vec in combination with SVM and AdaBoost achieved the highest accuracies with 88.4% and 83.4% respectively.

In general, it appears that classification models with Doc2Vec and TF-IDF as text representation models have slightly lower performance than classification models based on manually derived features. This stresses the need for alternative text representation models for phishing email detection. In this work, we want to test the applicability of BERT and other recently proposed natural language representation models. We further use a number of classic text classification models from the presented literature as baselines. These baseline classification models form class 1. In particular, we use TF-IDF and Doc2Vec as natural language representation methods since they do not require domain knowledge (see above). Based on the generated email representations, we classify emails using logistic regression, naive Bayes, SVM, decision tree, random forest and AdaBoost as classification methods.

3 Modern NLP Approaches

This chapter gives an overview of the modern NLP embedding approaches that we test in this work on the phishing detection task. In the first section, we introduce the transformer model (Vaswani et al. 2017), which is the architectural core component of all modern embedding models considered in this work. Afterwards, we explain the architectures and training procedures of the individual modern embedding models. Each modern embedding model will be used to represent the emails of a real-world phishing email dataset. In this work, we assign all classification models using those modern embeddings to class 2.

3.1 Transformers

The central component of all modern NLP embedding approaches presented in this work is the advanced processing of natural language through transformers (Vaswani et al. 2017). Transformers were originally developed to perform machine translation. The attention mechanism (Bahdanau et al. 2014) is a core element of transformers. The idea of attention is to output numerical vectors for each word that depend on the relevant context for that word. Figure 1 illustrates the architecture of the transformer model.

[Figure not included in this reading sample]

Figure 1: The Transformer model architecture contains encoder and decoder stacks that process input using a variety of layers (Vaswani et al. 2017).

Transformers are based on an encoder-decoder architecture. The encoder’s role is to generate encodings of the input sequence that contain information about which parts of the input are relevant to each other. The decoder, in contrast, takes all the encodings and uses their incorporated contextual information to generate an output sequence. The original transformer consists of a stack of six encoders and a stack of six decoders. Each encoder contains two sublayers: one multi-head self-attention layer and one fully connected feed-forward network (FFN). Each decoder contains three sublayers: one masked multi-head self-attention layer, one additional layer that performs multi-head attention over the encoder outputs and one fully connected FFN. Each sublayer in encoder and decoder has a residual connection followed by layer normalization. The individual components of the transformer model are explained as follows:

- Encoder/Decoder Input representation: All input and output tokens of the encoder/decoder are converted to vectors using learnt embeddings. These input embeddings are passed to the positional encoding.
- Positional Encoding: To understand the meaning of a sentence, the notion of word order is essential. As the architecture of transformers has no component that captures word order, additional information about the position of each token in a sequence is injected. That is the task of the positional encoding module. The module adds to each input embedding a vector of the same dimension that stores the positional information of the token.
- Multi-Head Attention: This layer aims to encode a word based on all other words in the sequence to get a better understanding of its meaning and context. The multi-head attention layer simply runs the steps of self-attention multiple times, concatenates the individual results and combines them in a final operation. Each run is delegated to a separate self-attention layer. A self-attention layer connects all positions with the following constant number of sequentially executed operations:

1.) Assume that $x_i$ is the embedding vector of the $i$-th input token. The first operation is to compute the query, key and value vectors via the following multiplications:

\[ q_i = x_i W^Q, \qquad k_i = x_i W^K, \qquad v_i = x_i W^V \]

$W^Q$, $W^K$ and $W^V$ are pre-filled weight matrices that are tuned during the training phase. All resulting vectors $q_i$, $k_i$ and $v_i$ have the same dimensionality as the embedding vector.

2.) In the second operation, the attention weights $s_{ij}$ are calculated via the dot product of the query vector of token $i$ and the key vector of each token $j$ in the sequence:

\[ s_{ij} = q_i \cdot k_j \]

3.) To obtain more stable gradients, the attention weights are normalized by dividing them by the square root of the key vector length $d_k$:

\[ \tilde{s}_{ij} = \frac{s_{ij}}{\sqrt{d_k}} \]

4.) Afterwards, the softmax function is applied over all positions $j$ and the result is multiplied with the value vectors. As a result, words that are more important for the meaning of the input token are multiplied by a higher value than words that are less important:

\[ z_i = \sum_j \mathrm{softmax}_j\left(\tilde{s}_{ij}\right) v_j \]

Since the input is usually a whole sequence of tokens, the attention calculations above can also be expressed in one single formula:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \]

$Q$, $K$ and $V$ are defined as matrices in which the $i$-th rows are the vectors $q_i$, $k_i$ and $v_i$ respectively.
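The self-attention steps above can be sketched in plain Python for a single attention head. This is a minimal illustration, not an implementation from Vaswani et al. (2017): the query, key and value rows are given directly as made-up toy values instead of being computed from learned weight matrices, and the function names are chosen for this example only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention for matrices given as lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Steps 2 and 3: dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 4: softmax, then weighted sum of the value vectors.
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: three tokens with two-dimensional query, key and value vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

Z, A = self_attention(Q, K, V)
for weights in A:
    print([round(w, 3) for w in weights])
```

Each row of `A` sums to one, so every output vector in `Z` is a weighted average of the value vectors.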

- Masked Multi-Head Attention: In the encoder, a word at any position may depend on both the words before it and the words after it. Hence, the self-attention mechanism encodes each word using context from both sides. At decoding time, however, the model tries to predict the next word of the sentence (which is why the input sequence of the decoder is shifted) and must not know the words that follow the word to be predicted. Hence, these positions are masked so that their attention weights become 0, which in practice is achieved by setting the corresponding attention scores to negative infinity before the softmax is applied. The prediction is then based only on the embeddings of the words that came before. Decoders use masked multi-head attention layers to perform this adjustment in addition to the multi-head attention layer.
- FFN: Each encoder and decoder in a transformer contains a fully connected FFN. An FFN consists of an input layer, one or more hidden layers and an output layer. Each layer contains a number of nodes that are connected to all nodes of the previous layer. In the case of transformers, the output of the multi-head attention layer is fed into the network as input, and the network applies two linear transformations with a ReLU activation function in between. That is, assuming input $x$, the output of the fully connected layer can be computed by the following formula:

\[ \mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2 \]

where $W_1$ and $W_2$ are the weight matrices containing the connection weights of the two layers and $b_1$ and $b_2$ are the corresponding biases.

- Residual Connections: Residual connections are employed in encoders and decoders to allow gradients to flow through the network without passing through the non-linear activation function. Residual connections help to avoid vanishing or exploding gradients.
- Layer Normalization: This technique was originally introduced by Ba et al. (2016) to reduce the training time of deep neural networks. The key feature of layer normalization is that it normalizes the inputs across the features. Assuming vector $v$, the normalization is done by applying the following function to the FFN output:

\[ \mathrm{LayerNorm}(v) = \gamma \odot \frac{v - \mu}{\sigma} + \beta \]

where $\mu$ and $\sigma$ are the mean and the standard deviation of the elements of $v$, and $\gamma$ and $\beta$ are learned gain and bias parameters.

- Linear Layer: This layer applies linear transformations to the output vectors of the FFN to align input and output dimensions of decoders.
- Softmax Layer: The final layer of decoders converts the output values of the linear layer into a probability distribution (Vaswani et al. 2017).
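The interplay of FFN, residual connection and layer normalization described in the list above can be sketched for a single token vector. This is a simplified illustration with made-up toy weights: the learned gain and bias of layer normalization are omitted, and all function names are chosen for this example only.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def linear(x, W, b):
    """Compute xW + b for a row vector x and a weight matrix W given as rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


def layer_norm(v, eps=1e-6):
    """Normalize across the features of one vector (no learned gain or bias)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection around the FFN, followed by layer normalization."""
    y = ffn(x, W1, b1, W2, b2)
    return layer_norm([a + b for a, b in zip(x, y)])


# Toy weights (made up): model dimension 2, hidden dimension 3.
x = [1.0, -2.0]
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
b2 = [0.5, -0.5]

print(ffn(x, W1, b1, W2, b2))  # [1.5, 0.5]
out = ffn_sublayer(x, W1, b1, W2, b2)
print([round(o, 3) for o in out])
```

The residual connection adds the sublayer input to the FFN output before normalization, so the normalized vector has zero mean across its features.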

As already mentioned, the modern NLP embedding approaches that we want to test in this work use parts of the transformer model in their architectures. More precisely, the models use stacks of the encoder part of the transformer model to generate contextualized vector representations of the input.

3.2 BERT

BERT is a natural language representation model that was proposed by Devlin et al. (2018). The model achieved state-of-the-art results in 11 NLP tasks including sentiment classification, machine reading comprehension and natural language inference (Devlin et al. 2018). Due to this success, BERT has also become a template for a number of further NLP approaches (Lan et al. 2019, Liu et al. 2019, Sanh et al. 2019, Sun et al. 2020). The model, which is based on transformers, is pre-trained on large datasets and generates bidirectional word representations. That is, the model represents a word using context from both the left and the right side of that word.

BERT’s architecture builds on top of transformer encoders. More precisely, it contains a defined number of stacked transformer encoders, where the output of each encoder layer is fed into the next encoder layer. After the last encoder layer, additional layers can be added to perform different kinds of tasks. For instance, a pooling layer can be added to reduce the dimensionality of token representations in case of non-classification tasks. For classification tasks, a linear layer with a softmax activation function can be added after the last encoder layer to perform the classification.

Since the authors had shown that more layers and more attention heads can lead to higher model performance on downstream tasks, there are several variations of pre-trained BERT embedding models available (Devlin et al. 2018). The most commonly used model variants are the following:

- BERTbase with 12 encoder layers, 12 attention heads per multi-head attention layer and 110 million parameters in total
- BERTlarge with 24 encoder layers, 16 attention heads per multi-head attention layer and 340 million parameters in total
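The parameter total of the base variant can be approximated with a back-of-the-envelope count. The sketch below relies on configuration values that are not stated in the text above and are assumptions taken from the published base configuration: a hidden size of 768, an FFN inner size of 3,072, a vocabulary of 30,522 WordPiece tokens, 512 positions and 2 segment types. The count includes the pooling layer on top of the last encoder.

```python
# Assumed configuration of the base model (not stated in the text above).
VOCAB, POSITIONS, SEGMENTS = 30522, 512, 2
HIDDEN, FFN_INNER, LAYERS = 768, 3072, 12

# Token, position and segment embedding tables plus one layer normalization.
embeddings = (VOCAB + POSITIONS + SEGMENTS) * HIDDEN + 2 * HIDDEN

# Query, key, value and output projections (weights and biases) plus layer normalization.
attention = 4 * (HIDDEN * HIDDEN + HIDDEN) + 2 * HIDDEN

# Two linear transformations of the FFN plus layer normalization.
feed_forward = (HIDDEN * FFN_INNER + FFN_INNER) + (FFN_INNER * HIDDEN + HIDDEN) + 2 * HIDDEN

# Pooling layer applied to the first token.
pooler = HIDDEN * HIDDEN + HIDDEN

total = embeddings + LAYERS * (attention + feed_forward) + pooler
print(f"{total:,}")  # 109,482,240
```

Under these assumptions the count comes to roughly 110 million parameters. The number of attention heads does not change the total because the heads split the hidden dimension among themselves.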

Input/Output representations:

BERT is able to cope with various downstream tasks due to its special input and output representations. The model can handle single sentences as well as pairs of sentences as input. Here, a sentence can be an arbitrary span of contiguous text rather than an actual linguistic sentence. Every input is represented as one single token sequence that is constrained to a maximum length of 512 tokens. The input representation of a token is constructed by summing the corresponding token, segment and position embeddings. This construction is illustrated in figure 2.

[Figure not included in this reading sample]

Figure 2: The input representation of a token in a BERT model is the sum of token, segment and position embeddings (Devlin et al. 2018).

The first element of every input sequence is always the special token [CLS]. The final hidden state of this token is used in classification tasks as an aggregate representation of the entire sequence. It is ignored in non-classification tasks. In tasks like question answering, the [SEP] token helps the model to differentiate between the two input sentences, which are the question part and the information part that contains the answer to the question.

The following embeddings are summed token-wise to create the final input embedding:

- Token Embedding: The model uses WordPiece token representations (30,000 token vocabulary) (Wu et al. 2016) to embed the input sentences.
- Segment Embedding: This embedding is used in addition to the [SEP] token embeddings to distinguish between the sentences of a pair. To be precise, each token is assigned to one specific sentence. In figure 2, all tokens marked as EA belong to sentence A and all tokens marked as EB belong to sentence B. In case of non-classification tasks (e.g. question answering), the outputs corresponding to the paragraph sequence are used to derive the start and the end of the answer span.
- Position Embedding: The model uses positional embeddings for each token to indicate its position in the sequence. This embedding is especially important to remember the position of the masked tokens that are predicted during the training phase.
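The token-wise sum of the three embeddings can be sketched as follows. This is a toy illustration: the embedding tables are filled with random numbers instead of learned values, the embedding size of 4 is an arbitrary assumption, and the example sentence pair is made up.

```python
import random

random.seed(0)
DIM = 4  # toy embedding size (assumption; real models use much larger vectors)

tokens = ["[CLS]", "is", "this", "safe", "[SEP]", "click", "here", "[SEP]"]
segments = [0, 0, 0, 0, 0, 1, 1, 1]  # 0 = sentence A, 1 = sentence B


def table(keys):
    """Random stand-in for a learned embedding table."""
    return {k: [random.uniform(-1, 1) for _ in range(DIM)] for k in keys}


token_emb = table(sorted(set(tokens)))
segment_emb = table([0, 1])
position_emb = table(range(len(tokens)))

# Input representation: element-wise sum of token, segment and position embedding.
inputs = [
    [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
    for pos, (tok, seg) in enumerate(zip(tokens, segments))
]

print(len(inputs), len(inputs[0]))  # 8 4
```

Both [SEP] tokens share the same token embedding, but their input representations differ because they occupy different positions and belong to different segments.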


[Figure not included in this reading sample]

Figure 3: Overall pre-training and fine-tuning procedure for BERT (Devlin et al. 2018).

BERT’s training procedure consists of two stages: a semi-supervised pre-training stage on the two large-scale datasets BookCorpus (800M words) (Zhu et al. 2015) and English Wikipedia (2,500M words) (Devlin et al. 2018) and a supervised fine-tuning stage on a custom dataset. The idea of pre-training is to teach models a general understanding of natural language. Research has also shown that pre-training is effective in enhancing the performance of language models on many NLP tasks (Dai and Le 2015, Peters et al. 2018, Howard and Ruder 2018, Radford et al. 2018). The whole training process can be seen in figure 3.

In the pre-training stage, which is shown on the left side of figure 3, BERT is pre-trained using a masked language model (MLM) objective (Taylor 1953) and a next sentence prediction (NSP) task. The MLM objective is used to generate deep bidirectional representations of the input tokens. An MLM randomly masks a percentage of the input tokens, and the objective is to predict them based only on their context. In the case of BERT, the researchers selected 15% of all tokens in a sequence for prediction. However, the selected tokens are not always replaced by the [MASK] token: 80% of the time they are replaced with [MASK], 10% of the time with a random token, and 10% of the time they are left unchanged. The reason for this procedure is to mitigate a mismatch between pre-training and fine-tuning, namely that the [MASK] token does not appear as input during fine-tuning.
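The masking rule can be sketched in a few lines. This is a simplified illustration under assumptions: each token is selected independently with probability 0.15 instead of selecting a fixed share per sequence, the vocabulary and the token stream are made up, and the function name `mask_tokens` is hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None for unselected positions."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # the model has to predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


rng = random.Random(42)
vocab = ["account", "bank", "click", "link", "password", "verify"]
tokens = [rng.choice(vocab) for _ in range(10000)]  # made-up token stream

corrupted, labels = mask_tokens(tokens, vocab, rng)
selected = [i for i, label in enumerate(labels) if label is not None]
masked_share = sum(corrupted[i] == "[MASK]" for i in selected) / len(selected)

print(f"selected: {len(selected) / len(tokens):.3f}, of which [MASK]: {masked_share:.3f}")
```

On a long token stream, about 15% of the positions are selected and about 80% of those are replaced by [MASK].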

Besides the masked training, the model is trained using the NSP task. In this way, the model attempts to capture the relationship between consecutive sentences. That skill is important to perform well on question answering and natural language inference tasks. To perform NSP, sentence pairs are provided according to the following scheme:

- 50% of sentence pairs are labeled with ‘IsNext’. The second sentence is then truly the next sentence to the first sentence.
- 50% of sentence pairs are labeled with ‘NotNext’. The second sentence is then a random sentence from the corpus.
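The sampling scheme above can be sketched as follows. This is a simplified illustration: the random sentence is drawn from the same made-up list of sentences rather than from a large corpus, and the function name `make_nsp_pairs` is hypothetical.

```python
import random


def make_nsp_pairs(sentences, rng):
    """Build (first, second, label) triples for next sentence prediction."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Exclude the current sentence and its true successor so that the
            # label 'NotNext' is correct.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), "NotNext"))
    return pairs


# Made-up example document split into sentences.
sentences = [
    "Your account was locked.",
    "Please verify your identity.",
    "Click the link below.",
    "The meeting starts at ten.",
    "Lunch is provided.",
    "See you on Monday.",
]

pairs = make_nsp_pairs(sentences, random.Random(1))
for first, second, label in pairs:
    print(label, "|", first, "->", second)
```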

In the second stage, the pre-trained BERT model can be fine-tuned with a labelled dataset for custom NLP tasks. For this purpose, one or more additional layers are usually added after the last layer, and the whole network is jointly optimized with supervision. As illustrated in figure 3, in case of classification tasks, the additional layer is fully connected only with the [CLS] token representation. For non-classification tasks, the output representations corresponding to the paragraph sequence are used in the same way. During fine-tuning, all network parameters are updated. Since pre-training teaches the model a general understanding of natural language using large-scale data, fine-tuning requires only a few training epochs to yield the desired results (Devlin et al. 2018, Gao et al. 2018, Reddy et al. 2019, Sun et al. 2019).
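A classification layer that is connected only to the [CLS] representation can be sketched as a linear layer followed by a softmax. The hidden state, the weights and the class order below are made-up toy values; in practice the weights are learned during fine-tuning and the hidden state comes from the last encoder layer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(cls_state, W, b):
    """Linear layer plus softmax applied to the final [CLS] hidden state only."""
    logits = [sum(h * w for h, w in zip(cls_state, row)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


cls_state = [0.2, -1.0, 0.5]   # toy final hidden state of [CLS]
W = [[1.0, 0.0, 1.0],          # toy weights for class "legitimate"
     [0.0, -2.0, 0.0]]         # toy weights for class "phishing"
b = [0.0, 0.0]

probs = classification_head(cls_state, W, b)
print({"legitimate": round(probs[0], 3), "phishing": round(probs[1], 3)})
```

The hidden states of all other tokens do not enter the head directly; they influence the prediction only through the attention layers that produce the [CLS] state.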

3.3 RoBERTa

RoBERTa² was introduced by Liu et al. (2019) as the result of a detailed investigation of BERT’s parameter configuration and pre-training methodology. The Facebook AI research team found that BERT was significantly undertrained and suggested several training modifications. As a result, RoBERTa optimizes BERT by retraining it with an improved training methodology and considerably more training data.


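Overall, the architecture of RoBERTa is similar to the architecture of BERT. The authors proposed two versions: a base version with 12 encoder layers and a large version with 24 encoder layers. RoBERTa keeps the hidden and feed-forward layer sizes of the corresponding BERT models but uses a larger byte-level BPE vocabulary of about 50K subword units, which adds parameters to the embedding layer.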


Although the authors of RoBERTa fine-tune their model in the same way as BERT, they use a modified pre-training approach that substantially improves model performance. More precisely, Liu et al. (2019) made the following changes:

- Dynamic masking: BERT is pre-trained with static masking that is performed once during preprocessing. That is, the masking of each instance does not change during the training phase. In contrast, RoBERTa is trained in a dynamic way, where a new masking pattern is generated every time a sequence is fed to the model.
- NSP removal: The authors found that downstream task performance matches or slightly improves when the model is pre-trained without NSP. Hence, they pre-trained the model with the MLM objective only.
- Longer training: RoBERTa is pre-trained significantly longer. With the larger batches, the number of training steps is increased from 100K to 500K.
- Increased batch size: Research showed that increasing the batch size can improve both optimization speed and end-task performance (You et al. 2019, Ott et al. 2018). Hence, the model is pre-trained using larger batches of size 8K. In contrast, BERT is trained with batches of size 256.
- Longer training sequences: For both models, the maximum input sequence length is constrained to 512. Sequences with fewer tokens are fed in with padding. BERT is pre-trained with a reduced sequence length for the first 90% of training steps, whereas RoBERTa is pre-trained only with full-length sequences, which largely avoids padding.
- More training data: BERT is trained with 16 GB of uncompressed text. RoBERTa is pre-trained with over 160 GB of text from BookCorpus (Zhu et al. 2015) and English Wikipedia (16 GB), CC-News (76 GB) (News Dataset-Common Crawl 2020), OpenWebText (38 GB) (OpenWebTextCorpus 2020, Radford et al. 2019) and Stories (31 GB) (Trinh and Le 2018).
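The difference between static and dynamic masking in the list above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, masked positions are drawn independently per token, and the static variant reuses exactly one mask per sequence.

```python
import random


def mask_positions(num_tokens, mask_prob, rng):
    """Draw the set of token positions to be masked."""
    return {i for i in range(num_tokens) if rng.random() < mask_prob}


def static_masks(num_tokens, epochs, mask_prob=0.15, seed=0):
    """Static masking: one mask drawn during preprocessing, reused every epoch."""
    fixed = mask_positions(num_tokens, mask_prob, random.Random(seed))
    return [fixed for _ in range(epochs)]


def dynamic_masks(num_tokens, epochs, mask_prob=0.15, seed=0):
    """Dynamic masking: a fresh mask each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [mask_positions(num_tokens, mask_prob, rng) for _ in range(epochs)]


# The same 512-token sequence seen in four epochs.
static = static_masks(512, epochs=4)
dynamic = dynamic_masks(512, epochs=4)
```

With static masking the model sees the same masked positions of a sequence in every epoch, whereas dynamic masking exposes it to different prediction targets for the same text.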

3.4 XLNet

BERT is categorized as an autoencoder language model. Autoencoder language models aim to reconstruct the original data from corrupted input. To be precise, when BERT is pre-trained, the model corrupts the input by masking a percentage of the tokens and then tries to predict them using all remaining words. However, BERT neglects dependencies between the masked positions during this process and suffers from a discrepancy between pre-training and fine-tuning. Yang et al. (2019) proposed XLNet to remedy these deficiencies while using an improved training methodology. XLNet is a large generalized autoregressive pre-training model. Besides the modified training methodology, the model uses more training data (Yang et al. 2019).
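The contrast between the two model families can be written down compactly. The following formulas are a sketch in the notation of Yang et al. (2019). The symbols are introduced here for illustration: $\mathbf{x} = (x_1, \dots, x_T)$ is the original sequence, $\hat{\mathbf{x}}$ the corrupted input, $\bar{\mathbf{x}}$ the set of masked tokens, and $m_t = 1$ indicates that $x_t$ is masked.

```latex
% Autoregressive language modeling: exact factorization in one direction
\max_{\theta} \; \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid \mathbf{x}_{<t}\right)

% BERT-style denoising autoencoding: masked tokens are predicted
% independently of each other given the corrupted input
\max_{\theta} \; \log p_{\theta}\left(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}\right)
  \approx \sum_{t=1}^{T} m_t \, \log p_{\theta}\left(x_t \mid \hat{\mathbf{x}}\right)
```

The approximation sign in the second objective reflects the independence assumption: each masked token is reconstructed separately, so dependencies between masked positions are not modeled.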


1 The name BERT stands for “Bidirectional Encoder Representations from Transformers”.

2 The name RoBERTa stands for “Robustly optimized BERT pre-training approach”.

Phishing Detection with Modern NLP Approaches
Universität Ulm
Cite this work: Christian Schmid (author), 2020, Phishing Detection with Modern NLP Approaches, München, GRIN Verlag

