Excerpt

## Table of Contents

A STUDY OF DIFFERENT OUTLIER ANALYSIS TECHNIQUES

PREFACE

ACKNOWLEDGEMENT

LIST OF FIGURES

CHAPTER 1: WHAT IS AN OUTLIER & ITS TYPES

Types of Outliers

Global Outliers

Contextual Outliers

Collective Outlier

CHAPTER 2: OUTLIER DETECTION IMPORTANCE & ITS CONNECTION WITH DATA MODELS

Importance of Outlier Detection

Connection of Outliers with Data Models

CHAPTER 3: UNIVARIATE OUTLIER DETECTION

Standard Deviation Method

Z-Score method

Modified Z-Score method

Interquartile Range (IQR) Method

CHAPTER 4: MULTIVARIATE OUTLIER DETECTION

The Mahalanobis Distance

Outlier Detection using Isolation Forest

CHAPTER 5: OUTLIER DETECTION USING A DATASET

Dataset Details

Data Preprocessing

Results

CONCLUSION

REFERENCES

## PREFACE

Data mining is perhaps one of the most intriguing fields. The scope of data collection and analytics has risen tremendously as data is digitized and systems are networked and integrated. Most systems nowadays generate non-stationary data of enormous volume that arrives at high speed and changes rapidly. This makes large-scale data analytics difficult. In any application that involves data, outlier detection is critical. In the data mining and statistics literature, outliers are sometimes known as abnormalities, discordants, deviants, or anomalies. The data in most applications are generated by one or more generating processes, which may reflect system activity or observations about entities. Outliers are created when the generating process behaves in an unusual way. As a result, an outlier frequently provides useful information regarding anomalous system and entity features that influence the data creation process. Recognizing uncommon traits can lead to useful application-specific insights.

The first chapter of this monograph explains what an outlier is, shows where outliers matter in a variety of industries, and goes over the various types of outliers. Chapter 2 describes why outlier analysis is an important part of any research or industry that involves a large amount of data, and how outliers are related to different data models. Chapter 3 covers Univariate Outlier Detection and the methods used for this task. Multivariate Outlier Detection techniques such as Mahalanobis distance and isolation forest are covered in Chapter 4. Finally, in Chapter 5, the Python programming language is used to analyse and detect existing outliers in a public dataset. We hope this monograph will be useful to students and practitioners of statistics and other fields involving numerical data analytics.

**Priyabrata Mishra**

**Soubhik Chakraborty**

## ACKNOWLEDGEMENT

I express my deep regards to my project supervisor **Dr. Soubhik Chakraborty**, Professor and ex-Head, Department of Mathematics, Birla Institute of Technology, Mesra, Ranchi under whose guidance I was able to learn and apply the concepts presented in this project. His consistent supervision, constant inspiration and invaluable guidance have been of immense help in carrying out this project work with success.

I am very grateful to **Dr. S.K. Jain,** Professor and Head, Department of Mathematics, Birla Institute of Technology, Mesra, Ranchi for extending all the facilities, and giving valuable suggestions at all times for pursuing this course.

I am also thankful to all the faculty members of the department and other staff for their help and suggestions during the project.

Priyabrata Mishra

## LIST OF FIGURES

Fig 1: Flow chart – Outlier Detection in IOT use case

Fig 2: Global Outliers

Fig 3: Contextual Outlier

Fig 4: Collective Outlier

Fig 5: SD Method for outlier detection code

Fig 6: Z Score method code implementation

Fig 7: Z Score output

Fig 8: Modified Z Score method

Fig 9: Interquartile range method

Fig 10: Euclidean vs Mahalanobis distance

Fig 11: Code implementation for Mahalanobis distance classifier

Fig 12: RemoveComma method

Fig 13: Raw data

Fig 14: Data after all the preprocessing steps

Fig 15: Histogram representing outliers in the dataset

Fig 16: Certificate of Internship Completion – IASST

Fig 17: IIM Calcutta Internship Completion Certificate

Fig 18: IISER Kolkata Internship Completion Certificate

## CHAPTER 1: WHAT IS AN OUTLIER & ITS TYPES

Outliers are observations or measurements that are unusually small or large in comparison to the vast majority of observations. In the data mining and statistics literature, outliers are sometimes known as abnormalities, discordants, deviants, or anomalies (von Eye and Schuster, 1998). Outliers are created when the generating process behaves in an abnormal way. As a result, an outlier frequently provides useful information regarding anomalous system and entity features that influence the data creation process. The issue is that a few outliers can occasionally skew the group results (by altering the mean performance, by increasing variability, etc.).

Examples of Outliers causing problems:

- Various forms of data concerning operating system calls, network traffic, and other user actions are collected in many computer systems. Because of malicious activities, this data may indicate strange behavior. Intrusion detection is the process of detecting such activities.

- Credit card fraud has become more common as it has become easier to steal sensitive information such as a credit card number. Unauthorized credit card use can manifest itself in a variety of ways, such as shopping sprees in particular locations or extremely large transactions. Such patterns can be used to detect outliers in credit-card transaction data (von Eye and Schuster, 1998).

- Patient medical records contain many entries relating to diseases, treatments, and lab findings. These records usually involve a variety of data types and generate a large amount of data. Such databases can provide critical information for clinical decision-making and hospital management, and they have several unique characteristics that are rarely found in non-medical databases. Outlier detection techniques can be used in this context to detect anomalous trends in health records (for example, data quality issues), resulting in better data and information in the decision-making process (Gaspar *et al.*, 2011).

- Fig 1 depicts an outlier detection flowchart for an IoT use case. The yield of products is affected by various parameters and machines. When the outlier detection module receives the selected data, it therefore splits the log files into separate files, first by recipe number and then by tool number. The module then processes the separated files using the MapReduce technique to calculate the mean and standard deviation of each parameter, and uses these statistics to perform outlier detection.

[Figure not included in this excerpt]

(Fig 1: Flow chart – Outlier Detection in IOT use case)
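The flow described for Fig 1 can be illustrated with a minimal, single-machine sketch. It is not the MapReduce implementation from the use case: the record layout `(recipe, tool, value)`, the sample values, and the 3-standard-deviation cutoff `k` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical log records: (recipe number, tool number, parameter value).
records = (
    [("R1", "T1", 9.9 if i % 2 else 10.1) for i in range(20)]
    + [("R1", "T1", 25.0)]  # one suspicious reading on tool T1
    + [("R1", "T2", 49.8 if i % 2 else 50.2) for i in range(20)]
)

def group_stats(rows):
    """Group values by (recipe, tool), then reduce each group to (mean, std)."""
    groups = defaultdict(list)
    for recipe, tool, value in rows:
        groups[(recipe, tool)].append(value)
    # The sample standard deviation needs at least two values per group.
    return {key: (mean(vals), stdev(vals))
            for key, vals in groups.items() if len(vals) >= 2}

def flag_outliers(rows, stats, k=3.0):
    """Flag records lying more than k standard deviations from their group mean."""
    flagged = []
    for recipe, tool, value in rows:
        group = stats.get((recipe, tool))
        if group is None:
            continue  # group too small to estimate a spread
        mu, sigma = group
        if sigma > 0 and abs(value - mu) > k * sigma:
            flagged.append((recipe, tool, value))
    return flagged

flagged = flag_outliers(records, group_stats(records))
```

Grouping first matters here: a value is judged only against readings from the same recipe and tool, not against the pooled data.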

### Types of Outliers

- Type 1: Global Outliers

- Type 2: Contextual Outliers

- Type 3: Collective Outliers

### Global Outliers

Global outliers are also known as point outliers and are the most basic kind of outlier. A global outlier is a data point in a dataset that deviates significantly from all other data points. Outlier detection methods are mostly used to find global outliers.

In an Intrusion Detection System, for example, if a large number of packets are broadcast in a short period of time, this may be considered a global outlier and may indicate that the system has been compromised.

[Figure not included in this excerpt]

(Fig 2: Global Outliers)

In Fig 2, the red data point is an outlier in the dataset.
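As a rough illustration of the intrusion-detection example, the sketch below flags a burst in per-second packet counts. The counts are made up, and the scoring rule (a median-based modified z-score with the conventional constant 0.6745 and cutoff 3.5) is one possible choice assumed here, not a method prescribed by this chapter.

```python
from statistics import median

# Hypothetical packets-per-second counts observed on one network link.
packet_counts = [120, 130, 125, 118, 122, 127, 121, 5000, 124, 119]

def global_outliers(values, cutoff=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the cutoff."""
    med = median(values)
    # Median absolute deviation: a spread estimate the burst cannot distort.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to compare against
    return [(i, v) for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > cutoff]

bursts = global_outliers(packet_counts)
```

A median-based spread is used because a single huge value would inflate an ordinary standard deviation and could mask itself.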

### Contextual Outliers

Contextual outliers are also known as conditional outliers. A data object is a contextual outlier if it deviates significantly from the other data points only within a specific context or situation. The same data point may be an outlier in one context yet behave normally in another. To discover contextual outliers, the context must therefore be specified as part of the problem statement. Contextual outlier analysis lets users study outliers in different contexts, which is useful in a variety of applications. Both contextual (environmental) attributes and behavioral attributes are used to characterize each data point.

For example, in the context of a "winter season," a temperature reading of 40°C may act as an outlier, but in the context of a "summer season," it will behave as a normal data point.

[Figure not included in this excerpt]

(Fig 3: Contextual Outlier)

In Fig. 3, the low temperature in June is a contextual outlier, because the same value would not be considered an outlier for the month of December.
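The June versus December situation can be sketched as follows. The historical readings, the use of the month as the contextual attribute, and the 3-standard-deviation cutoff are assumptions chosen for illustration only.

```python
from statistics import mean, stdev

# Hypothetical historical temperatures (°C), keyed by the contextual attribute.
history = {
    "June": [29.0, 30.0, 31.0, 32.0, 30.0, 31.0],
    "December": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0],
}

def is_contextual_outlier(month, temperature, k=3.0):
    """Judge a reading (behavioral attribute) only against its own month (context)."""
    reference = history[month]
    mu, sigma = mean(reference), stdev(reference)
    return abs(temperature - mu) > k * sigma

june_verdict = is_contextual_outlier("June", 12.0)
december_verdict = is_contextual_outlier("December", 12.0)
```

The same reading of 12 °C is flagged in the June context and accepted in the December context, which is what distinguishes a contextual outlier from a global one.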

### Collective Outlier

Collective outliers, as the name implies, occur when a group of data points in a dataset deviates significantly from the remainder of the dataset. Individual data objects may not be outliers in this case, but when viewed as a group, they may act as outliers. We may require background knowledge about the link between the data objects exhibiting outlier behavior in order to discover these types of outliers.

For example, a DoS (denial-of-service) packet sent from one computer to another may be considered regular behavior in an Intrusion Detection System. However, if this occurs on numerous computers at the same time, it may be regarded as abnormal behavior, and the packets as a whole might be classified as collective outliers.

[Figure not included in this excerpt]

(Fig 4: Collective Outlier)

The red data points in Fig 4 as a whole are collective outliers.
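A minimal sketch of the denial-of-service example: each event alone looks normal, but many distinct hosts acting in the same time window are flagged together. The event format `(timestamp_in_seconds, host)`, the 60-second window, and the threshold of four hosts are hypothetical choices.

```python
from collections import defaultdict

# Hypothetical events: (timestamp in seconds, host that sent the packet).
events = [
    (3, "a"), (65, "b"),
    (121, "a"), (122, "b"), (123, "c"), (124, "d"), (125, "e"),
    (190, "c"),
]

def collective_outliers(rows, window=60, min_hosts=4):
    """Return {window index: sorted hosts} for windows with many distinct senders."""
    hosts_by_window = defaultdict(set)
    for timestamp, host in rows:
        hosts_by_window[timestamp // window].add(host)
    return {w: sorted(hosts)
            for w, hosts in hosts_by_window.items()
            if len(hosts) >= min_hosts}

suspicious = collective_outliers(events)
```

No single event is unusual here; only the group of events in the third minute is, which is why the check operates on windows rather than on individual points.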

## CHAPTER 2: OUTLIER DETECTION IMPORTANCE & ITS CONNECTION WITH DATA MODELS

### Importance of Outlier Detection

An extreme value of a distribution may be legitimate or illegitimate. Consider a perfectly balanced coin that lands on 'heads' 100 times out of 100. Removing such an observation from a planned study would be a mistake: if the coin really is balanced, the observation is genuine and should not be altered. If, on the other hand, the coin only appears balanced but is actually rigged, with a 0% probability of giving 'tails', then leaving the data alone is the wrong way to handle the outlier, because the value comes from a different distribution than the one of interest. Changing (e.g., excluding) the observation in the first case inappropriately reduces the variance, since a value from the distribution under study is deleted. Leaving the data alone in the second case inappropriately inflates the variance, because the observation does not come from the distribution underlying the experiment. In both cases, a poor decision can alter the test's Type I error (alpha error, i.e., the probability that a correct hypothesis is rejected) or Type II error (beta error, i.e., the probability that an incorrect hypothesis is not rejected). Making the correct choice does not bias the test's error rates.
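The effect on the variance can be made concrete with a small sketch. The sample values are invented for illustration, and the code cannot decide whether the extreme value is legitimate; it only shows how much the decision changes the estimate.

```python
from statistics import variance

# Hypothetical measurements; 9.0 is the extreme value under discussion.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 9.0]

var_with_extreme = variance(sample)
var_without_extreme = variance([x for x in sample if x != 9.0])

# How many times larger the sample variance is when the extreme value is kept.
inflation = var_with_extreme / var_without_extreme
```

If 9.0 belongs to the distribution of interest, dropping it understates the variance; if it does not, keeping it overstates the variance. Either mistake feeds into the error rates of any test built on that estimate.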

However, there is a significant difference between Grubbs' time and today. The volume of data, and the speed at which it is created and processed, are greater than ever before. Millions of social media posts, messages, transactions, and videos are created every second. As a result, outlier detection algorithms must be able to process data in near-real time and show potential outliers as soon as possible, because the information provided by outlier detection in Big Data is often time sensitive. This time-sensitive nature is demonstrated by the examples discussed earlier, such as credit card fraud detection and the detection of malicious activity.

As a result, it is no exaggeration to say that Big Data has revolutionized outlier detection. At the same time, Big Data has opened up a whole new world of possibilities for extracting (more) value from outlier detection techniques. As the size of the data set grows, outlier detection may become more valuable. Consider the (famous) task of finding a needle in a haystack as an example. The larger the haystack becomes, the more valuable an outlier detection algorithm becomes.

### Connection of Outliers with Data Models

Almost all outlier detection algorithms build a model of the typical patterns in the data and then calculate an outlier score for a given data point from its deviation from these patterns. This data model could, for instance, be a generative model such as a Gaussian-mixture model, a regression-based model, or a proximity-based model. Each of these models makes different assumptions about the "normal" behaviour of the data. The outlier score of a data point is then calculated by assessing how well the data point fits the model. The model may frequently be defined algorithmically. For instance, nearest neighbour-based algorithms for outlier detection model a data point's propensity for being an outlier in terms of the distribution of its k-nearest neighbour distance. The assumption in this situation is that outliers are located far away from the majority of the data.
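The nearest neighbour idea can be sketched in a few lines. This is a brute-force illustration under assumed inputs: the points, the choice `k=2`, and the use of Euclidean distance are not taken from the text.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D data: four points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_outlier_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
```

A larger score means the point sits farther from its neighbours. The score only ranks points; deciding where to cut the ranking is a separate choice left to the analyst.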

The choice of data model is very important. The results could be subpar if the data model is chosen incorrectly. A fully generative model, like the Gaussian mixture model, for instance, might not perform well if the data does not match the model's generative assumptions or if there are not enough data points to learn the parameters of the model. Similarly, if the underlying data is clustered arbitrarily, a model based on linear regression may not perform well. Because of the poor fit to the incorrect model assumptions in these situations, data points may be misreported as outliers. Unfortunately, learning the best model for a given data set for outlier detection is largely an unsupervised problem without examples of outliers. Outlier detection differs from many supervised data mining problems in that it lacks labelled examples, which makes it more difficult to solve. As a result, in real-world situations, the analyst's understanding of the types of deviations pertinent to a given application frequently determines the model to be used. For instance, in a spatial application that measures a behavioural attribute such as the location-specific temperature, it would be reasonable to assume that an unusual deviation of the temperature attribute within a spatial locality indicates abnormality. On the other hand, due to data sparsity in the case of high-dimensional data, even the definition of data locality may be unclear. As a result, an effective model for a given data domain can be built only after carefully examining the pertinent modelling properties of that domain.

The choice of a model involves many trade-offs; a highly complex model with an excessive number of parameters will probably overfit the data and also find a way to fit the outliers. A straightforward model that is built with a solid intuitive grasp of the data (and perhaps also a grasp of what the analyst is seeking) is likely to produce much better outcomes. On the other hand, a model that has been oversimplified and does a poor job of fitting the data is likely to label common patterns as outliers. Perhaps the most important step in outlier analysis is the initial choice of the data model. Throughout the book, the theme of data models' impact will be repeated with specific examples.

The outlier detection problem can be seen as a variation of the classification problem in which the class label (either "normal" or "anomaly") is not observed. Because the normal examples vastly outnumber the anomalous examples, one can "pretend" that the entire data set contains the normal class and create a (possibly noisy) model of the normal data. Deviations from this normal model are treated as outlier scores. Because a lot of the theory and techniques from classification generalise to outlier detection, the relationship between classification and outlier detection is crucial. Outlier detection methods are referred to as unsupervised, whereas classification methods are referred to as supervised, due to the unobserved nature of the labels (or outlier scores). When anomaly labels are present, the issue can be reduced to the imbalanced form of data classification.
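The "pretend everything is normal" idea can be sketched with a single Gaussian fitted to all of the data, anomalies included. The Gaussian assumption and the sample values are illustrative choices; the score used here is the negative log-density of each point under the fitted model.

```python
import math
from statistics import mean, stdev

def gaussian_outlier_scores(values):
    """Fit one Gaussian to all values and score each by its negative log-density."""
    mu, sigma = mean(values), stdev(values)  # a (possibly noisy) model of "normal"
    log_norm = math.log(sigma * math.sqrt(2 * math.pi))
    return [0.5 * ((x - mu) / sigma) ** 2 + log_norm for x in values]

# Hypothetical unlabelled measurements; no one has marked 15.0 as anomalous.
values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 15.0]
scores = gaussian_outlier_scores(values)
worst_fit = scores.index(max(scores))
```

The model is noisy because the anomaly itself shifts the fitted mean and widens the fitted spread, yet the point that fits worst still receives the highest score.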

A one-class analogue of the multi-class setting in classification may be thought of as the model of normal data for unsupervised outlier detection. However, from a modelling standpoint, the one-class setting can occasionally be much more nuanced because it is much simpler to tell apart examples of two classes than it is to determine whether a specific instance corresponds to examples of a single (normal) class. The accuracy of the model can be sharpened by learning the distinguishing traits between the two classes more readily when there are at least two classes available.

There is a natural division between explicit generalisation methods and instance-based learning methods in many types of predictive learning, including classification and recommendation. This dichotomy also applies to the unsupervised domain because outlier detection methods need to create a model of the normal data in order to make predictions. Instance-based methods do not create a training model in advance. Instead, for a given test instance, one finds the most pertinent (i.e., closest) instances of the training data and then makes predictions on the test instance based on these related instances. Instance-based methods are also known as lazy learners in the field of classification and as memory-based methods in the field of recommender systems.

**[...]**

- Quote paper
- Priyabrata Mishra (Author)Soubhik Chakraborty (Author), 2022, Outlier Analysis. A Study of Different Techniques, Munich, GRIN Verlag, https://www.grin.com/document/1254838
