Fraud Detection in White-Collar Crime

Bachelor Thesis, 2017

93 Pages, Grade: 1.3



Management Summary

List of figures

List of tables

List of abbreviations

1 Introduction
1.1 Motivation and problem statement
1.2 Research Methodology
1.3 Goal and structure of the thesis

2 White-Collar Crime
2.1 Fraud Management
2.1.1 Fraud Prevention
2.1.2 Fraud Detection
2.1.3 Fraud Investigation
2.2 Fraud Triangle
2.2.1 Opportunity
2.2.2 Incentive/Pressure
2.2.3 Rationalization/Attitude

3 Types of White-Collar-Crimes
3.1 Fraud
3.2 Credit Card Fraud
3.3 Healthcare Fraud
3.4 Embezzlement
3.5 Criminal Insolvency Offences
3.6 Corruption

4 Data Mining, Text Mining and Big Data
4.1 Introduction to Big Data
4.1.1 The 3 V’s of Big Data
4.1.2 Data Forms
4.2 Data Mining
4.2.1 Types of Machine Learning
4.2.2 Classification of Data Mining Applications
4.3 Text Mining
4.3.1 Practice areas of Text Mining
4.3.2 Example of feature extraction from unstructured data
4.4 Context of Data Mining and Text Mining in White-Collar Crime

5 A case study on Credit Card Fraud Detection
5.1 Overview
5.2 Data Exploration
5.3 Confusion Matrix Terminology
5.4 Algorithms and Techniques
5.4.1 Literature review on Data Mining Techniques
5.4.2 Selection of Data Mining Techniques
5.5 Sampling techniques
5.6 Train and Test Set
5.7 Imbalanced Data
5.7.1 Results on imbalanced Data
5.8 Undersampled Data
5.8.1 Results on undersampled Data
5.9 Oversampled Data
5.9.1 Results on oversampled Data
5.10 Oversampled Data with SMOTE
5.10.1 Results on Oversampled Data with SMOTE
5.11 Undersampled Data with Hyperparameters Optimization
5.11.1 Model Parameter and Hyperparameter
5.11.2 Hyperparameter optimization algorithms
5.11.3 Explanation of selected Hyperparameters
5.11.4 Cross-Validation
5.11.5 Selection of Hyperparameter Optimization Algorithm and k-fold CV
5.11.6 Results on Undersampled Data with Hyperparameter Optimization
5.12 Review of the case study: Credit Card Fraud Detection

6 Conclusion

7 Sources

Management Summary

White-collar crime is and has always been an urgent issue for society. In recent years, white-collar crime has increased dramatically as a result of technological advances. Studies show that companies are affected every year by corruption, balance-sheet manipulation, embezzlement, criminal insolvency and other economic crimes. Companies are usually unable to identify the damage caused by fraudulent activities. To prevent fraud, companies can use intelligent IT approaches. Data analysts and investigators can use the data that is stored digitally today to detect fraud.

In the age of Big Data, the volume of digital information is increasing enormously. Storage is cheap today and no longer a limited resource. Estimates assume that up to 80 percent of all operational information is stored today in the form of unstructured text documents. This bachelor thesis examines Data Mining and Text Mining as intelligent IT approaches for fraud detection in white-collar crime. Text Mining is related to Data Mining. The two are differentiated by the source and the structure of the information: Text Mining is mainly concerned with weakly structured or unstructured data, while Data Mining often relies on structured sources.

This bachelor thesis first gives an insight into white-collar crime. For this purpose, the three essential tasks of fraud management are discussed. Based on Cressey’s fraud triangle, it is shown which conditions need to come together for an offender to commit a fraudulent act. Subsequently, some well-known types of white-collar crime are considered in more detail.

A Text Mining approach is used to demonstrate how potentially useful knowledge can be extracted from unstructured text. For this purpose, two self-generated e-mails are converted into a structured format. Moreover, a case study on fraud detection is conducted on a credit card dataset. The dataset contains legitimate and fraudulent transactions. Based on a literature review, Data Mining techniques are selected and then applied to the dataset using various sampling techniques and hyperparameter optimization, with the goal of correctly identifying fraudulent transactions. The CRISP-DM reference model is used as the methodical procedure.

The results of the case study show that Naïve Bayes and Logistic Regression on small datasets, as well as Support Vector Machines and Neural Networks, are appropriate Data Mining techniques for detecting fraud. The results were measured using several evaluation metrics: precision, accuracy, recall and F1 score. The data analyst can improve the predictive accuracy by tuning the hyperparameters.

Text Mining can extract patterns and structures as well as useful information from text documents with the help of linguistic, statistical and mathematical methods. However, applying Text Mining to unstructured data is difficult and time-consuming.

List of figures

Figure 1: Procedure for literature analysis

Figure 2: CRISP-DM reference model

Figure 3: Prevention, Detection and Investigation of Fraud

Figure 4: Fraud-Triangle

Figure 5: Fraud classification overview

Figure 6: Classification of Big Data

Figure 7: A Venn diagram of the TM intersection with other fields

Figure 8: Example of a fraudulent e-mail

Figure 9: Example of a legitimate e-mail

Figure 10: Text classification procedure

Figure 11: Fraudulent e-mail after data pre-processing

Figure 12: Legitimate e-mail after data pre-processing

Figure 13: Import of credit card data in Jupyter notebook

Figure 14: Imbalanced class in credit card dataset

Figure 15: Plot of normal and fraudulent transaction

Figure 16: Credit card data after normalization

Figure 17: Undersampling of majority class

Figure 18: Oversampling of minority class

Figure 19: CM LR on imbalanced Data

Figure 20: CM RF on imbalanced Data

Figure 21: CM SVM on imbalanced Data

Figure 22: CM DT on imbalanced Data

Figure 23: CM NN on imbalanced Data

Figure 24: CM NB on imbalanced Data

Figure 25: CM LR on undersampled Data

Figure 26: CM RF on undersampled Data

Figure 27: CM SVM on undersampled Data

Figure 28: CM DT on undersampled Data

Figure 29: CM NN on undersampled Data

Figure 30: CM NB on undersampled Data

Figure 31: CM LR on oversampled (SMOTE) Data

Figure 32: CM RF on oversampled (SMOTE) Data

Figure 33: CM SVM on oversampled (SMOTE) Data

Figure 34: CM DT on oversampled (SMOTE) Data

Figure 35: CM NN on oversampled (SMOTE) Data

Figure 36: CM NB on oversampled (SMOTE) Data

List of tables

Table 1: Vector Space – Count TF

Table 2: Confusion Matrix

Table 3: Top five ranked Data Mining methods by Albashrawi

Table 4: Results for selection of the best split

Table 5: Results on imbalanced data using DM techniques

Table 6: Undersampled data frame (without shuffling)

Table 7: Undersampled data frame (after shuffling)

Table 8: Results on undersampled data using DM techniques

Table 9: Results on oversampled data using DM techniques

Table 10: Results on oversampled data with SMOTE using DM techniques

Table 11: Selected values of hyperparameters for optimization

Table 12: Model selection for hyperparameter optimization on LR

Table 13: Model selection for hyperparameter optimization on RF

Table 14: Results on undersampled data after parameter tuning

List of abbreviations

illustration not visible in this excerpt

1 Introduction

1.1 Motivation and problem statement

White-collar crime is a current topic in business. Due to economic change, it is attracting a lot of media attention. Headlines about fraudulent acts such as “Top Manager cause a major part of white-collar crime”[1], “Fraudsters capture over 100 million euros with CEO-Fraud”[2] and “Credit card fraud alerts are on the rise – save yourself”[3] are therefore not uncommon. The current KPMG white-collar crime study shows that every third company in Germany, and even every second large company, has been affected by white-collar crime in the last two years (KPMG, 2016: 7). According to the federal situation report of the Federal Criminal Police Office (Bundeskriminalamt, BKA), a total of 57,546 cases of white-collar crime were registered in 2016, causing losses of 2,970 million euros (BKA, 2016: 3-4).

Due to the continuously growing amount of data, the number of white-collar crime cases is increasing. Structural changes in different divisions and business reorganizations create new incentives and opportunities for white-collar crime. It is becoming increasingly difficult for managers and annual auditors to extract the information needed to detect fraud. Today, more than 30,000 gigabytes of data are generated every second, and the number is rising (Warren and Marz 2015: 1). Within a single company, too, large amounts of data are generated daily and stored in various forms. This includes not only data stored in relational databases, but also data available in semi-structured or unstructured form (e.g. PDF, XML, e-mails).

Data mining (DM) is one of the methods of business analytics and is used for pattern recognition in large datasets. Based on past fraud cases, patterns that indicate atypical behaviour and anomalies are identified automatically. These patterns are then applied to existing databases to identify fraud cases that have similar characteristics.

In contrast to DM, Text Mining (TM) is used on semi-structured and unstructured data to transfer this data into a structured form. TM is also used to unify and rationalize datasets, as well as to identify patterns and relationships in unstructured data.

1.2 Research Methodology

In this bachelor thesis, a literature analysis according to the procedure of Webster & Watson is carried out. The following literature databases were used during the literature search:

- Google Scholar
- SpringerLink
- IEEE Xplore Digital Library
- ScienceDirect

To find relevant literature, combinations of the following keywords were used: White-collar crime, Fraud Detection, Machine Learning, Big Data, structured Data, unstructured Data, Data Mining and Text Mining. To expand the search, another search was carried out with the equivalent German terms. It was noted that many publications use synonyms for the keyword White-collar crime. Therefore, the following two synonyms were included in the literature search: Fraud and economic crime.

To find the most relevant literature, the database search was restricted to abstract, title and keywords. In addition, only literature published since 2000 was selected, and only documents that were accessible were considered. Duplicates were removed from the search results. This was followed by a rough examination of the abstracts. Results that had no relevance to the topics covered in this thesis and results that did not conform to the formal standards of scientific work were removed. The remaining results were subjected to a more detailed substantive examination. Finally, the bibliographies of the results were searched for further relevant literature. The entire procedure is visualized as a process in Figure 1.

illustration not visible in this excerpt

Figure 1: Procedure for literature analysis

1.3 Goal and structure of the thesis

The aim of this work is to find an answer to the following research question:

Which Data Mining techniques are appropriate for detecting white-collar crimes in structured and unstructured data?

To find an answer to this research question, this bachelor thesis is divided into six chapters.

At the beginning of the thesis, the motivation and the problem statement are described and the relevance of the work is clarified.

Chapter 2 gives an overview of white-collar crime. To give the reader an insight, some possible components of Fraud Management are discussed first. Furthermore, significant explanatory approaches to fraud are presented, with special emphasis on Cressey’s Fraud Triangle. Here, the focus is on the offender and their motives, the emergence of white-collar crime and the damage caused by fraud. With the help of literature, studies and statistics, some well-known white-collar crimes related to financial fraud are pointed out and discussed in chapter 3.

The fourth chapter is divided into three parts: Big Data, Data Mining and Text Mining. After a brief introduction to Big Data and to various data formats, the importance of machine learning in fraud detection with Data Mining and Text Mining is presented. The subchapter on Data Mining deals with the types of machine learning and the classification of data mining applications. The subchapter on Text Mining covers the seven practice areas of Text Mining and shows by means of an example how unstructured data can be transformed into a structured form so that predictive Data Mining techniques can be applied to it. For this purpose, two self-generated e-mails (spam / ham) are converted into structured form with the help of the text classification procedure, a term matrix and term frequencies.
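
The conversion of unstructured e-mails into a term-frequency matrix can be illustrated with a minimal Python sketch. The two e-mail texts, the stop-word list and the function names below are hypothetical placeholders; they are not the e-mails or the pre-processing steps used in this thesis.

```python
import re
from collections import Counter

# Hypothetical example e-mails (assumption, not the thesis data).
EMAILS = {
    "spam": "Urgent: transfer the money today. Transfer it to the new account.",
    "ham": "The meeting about the new account is moved to Monday.",
}

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "to", "is", "it", "about"}


def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def term_frequency_matrix(documents):
    """Return the sorted vocabulary and one count vector per document."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, matrix


vocabulary, matrix = term_frequency_matrix(EMAILS)
print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each row of the resulting matrix is a fixed-length numeric vector, which is the structured form that a predictive Data Mining technique can take as input.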

Chapter 5 is the centrepiece of this thesis. It covers a case study on fraud detection in credit card data. To implement the Data Mining project, the CRISP-DM (Cross-Industry Standard Process for Data Mining) reference model, which consists of six iterative steps, is used.

CRISP-DM combines already well-established approaches such as the KDD (Knowledge Discovery in Databases) process and industrial approaches (e.g. SEMMA – Sample, Explore, Modify, Model and Assess) and is now regarded as a very popular method for increasing the success of DM projects (Moro and Laureano, 2011: 117; Wieland and Fischer, 2013: 48). CRISP-DM divides the DM process into six phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment (Shearer et al., 2000: 13). Each phase describes individual generic tasks, regardless of the area of application and the technologies used, to carry out DM projects systematically (Wieland and Fischer, 2013: 48).

illustration not visible in this excerpt

Figure 2: CRISP-DM reference model (Shafique and Qaiser, 2014: 219)

Business Understanding:

In the initial phase, the objective is to understand the project requirements and goals from a business perspective (Nadali, Kakhky and Nosratabadi, 2011: 162; Sharafi, 2013: 66). From this knowledge, a DM-problem is defined and a preliminary plan is created to achieve the objectives.

The case study in chapter 5 deals with fraud detection in a credit card dataset. The dataset contains transactions made by European cardholders over a period of two days in 2013[4]. The goal is to detect fraudulent transactions by using DM techniques. To achieve this goal, various DM techniques are used and compared with each other. It is important for banks to extract data from large datasets that can point to fraud. This gives banks an idea of how high the losses inflicted on customers and the bank through fraud are.

Data Understanding:

The second phase of the CRISP-DM is Data Understanding. In this phase, the data is collected, described, explored and the quality of the data is verified (Rocha and Júnior, 2010: 164).

Chapter 5.2 – Data Exploration covers the second phase of CRISP-DM – Data Understanding. For this purpose, the credit card dataset is loaded with Pandas in a Jupyter Notebook. The goal is to describe the data format, the amount of data, and the number of records and fields in each table. The data exploration addresses the data mining questions, which are answered by queries and visualization. This determines how many fraudulent and legitimate transactions are in the dataset and which features are relevant for further investigation.

Data Preparation:

The third phase in CRISP-DM is Data Preparation. This phase includes all activities for creating the final dataset, which is fed into the fourth phase of CRISP-DM – Modelling (Nadali, Kakhky and Nosratabadi, 2011: 162). The following activities can be assigned to this phase: select data, clean data, construct data, integrate data and format data (Shearer et al., 2000: 16-17).

To prepare the datasets for the modelling phase, chapter 5.5 describes different sampling techniques. In this case study, the following sampling techniques are applied to the imbalanced dataset: Imbalanced data – no sampling technique is used for the first modelling phase; Undersampling – “Instances are randomly removed from the majority training set till the desired balanced is achieved” (Dubey et al., 2014: 6); Oversampling – the data from the minority training set are duplicated until the desired balance is achieved (ibid.); SMOTE – the minority class is oversampled by creating synthetic examples in the neighborhood of the observed class (Dal Pozzolo, 2015: 37).
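
The three resampling ideas can be sketched in plain Python. This is only an illustration under stated assumptions: the toy data, the 1:1 target balance, the function names and the simplified SMOTE-style interpolation (a random minority point is moved towards one of its nearest minority neighbours) are not taken from the thesis and do not reproduce a reference SMOTE implementation.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Randomly drop majority instances until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, k=len(minority)), list(minority)


def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority instances until both classes have equal size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote_like(minority_points, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_points)
        # k nearest minority neighbours of the chosen point (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority_points if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the line segment between base and neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy data (assumption): 20 legitimate and 3 fraudulent transaction ids.
legit = [f"legit_{i}" for i in range(20)]
fraud = [f"fraud_{i}" for i in range(3)]

under_legit, under_fraud = random_undersample(legit, fraud)
over_legit, over_fraud = random_oversample(legit, fraud)
print(len(under_legit), len(under_fraud))
print(len(over_legit), len(over_fraud))

# Toy minority feature vectors (assumption) for the SMOTE-style sketch.
fraud_points = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
new_points = smote_like(fraud_points, n_new=4)
print(new_points)
```

Undersampling discards information from the majority class, while plain oversampling only repeats existing minority rows; the interpolation step is what distinguishes SMOTE from simple duplication.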

Furthermore, the data transformation step must be carried out in this phase to bring the data onto a uniform scale. Normalization is used for this.
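
As a minimal sketch of such a rescaling step, the following Python function applies min-max scaling to a single feature. This excerpt does not state which normalization the thesis uses, so min-max scaling, the function name and the example amounts are assumptions made purely for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no scale information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical transaction amounts (assumption, not thesis data).
amounts = [2.5, 10.0, 250.0, 1000.0]
scaled = min_max_scale(amounts)
print(scaled)
```

After scaling, features with large raw values such as a transaction amount no longer dominate features that are already on a small scale.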


Modelling:

In this phase, various modelling techniques (DM techniques) are selected and applied to the data prepared in the data preparation phase (Sharafi, 2013: 66-67).

For the selection of DM techniques, a literature review was carried out in chapter 5.4. Based on it, six DM techniques were selected and applied in the modelling phase: Logistic Regression, Random Forest, Support Vector Machine, Decision Trees, Neural Networks and Naïve Bayes. Sometimes it was necessary to step back into the data preparation phase to adjust the data. For example, after applying the sampling techniques, the rows were shuffled to mix the classes within the data frame. The train and test sets were also generated in this phase. In chapter 5.6, different split ratios are applied to the various DM techniques and the best one is selected.
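
Shuffling followed by a train/test split can be sketched as follows. The 30 percent test fraction, the fixed seed, the function name and the toy rows are assumptions for illustration; the thesis compares several split ratios in chapter 5.6.

```python
import random


def shuffle_and_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into a train and a test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data frame (assumption): after undersampling, the rows are grouped by class.
rows = [("fraud", i) for i in range(5)] + [("legit", i) for i in range(5)]
train, test = shuffle_and_split(rows)
print(len(train), len(test))
print(train)
```

Without the shuffle, a split by position would place almost all rows of one class in the training set and the other class in the test set, which is why the shuffling step comes first.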

Furthermore, hyperparameters were optimized on the undersampled dataset to improve the quality of the DM techniques. For that, two hyperparameter optimization algorithms – Grid Search and Randomized Search – were compared with each other in chapter 5.11. Moreover, k-fold cross-validation was applied to achieve an improvement (see chapter 5.11.4).
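
The difference between the two search strategies and the role of k-fold cross-validation can be sketched in plain Python. Everything concrete here is an assumption for illustration: the parameter names `C` and `penalty`, their candidate values, and especially `toy_score`, which merely stands in for fitting a model on the training folds and scoring it on the validation fold.

```python
import itertools
import random


def grid_candidates(space):
    """Grid search: enumerate every combination of the listed values."""
    names = sorted(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def random_candidates(space, n_iter, seed=0):
    """Randomized search: draw n_iter combinations at random."""
    rng = random.Random(seed)
    return [{n: rng.choice(v) for n, v in space.items()} for _ in range(n_iter)]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        yield indices[:start] + indices[start + size:], indices[start:start + size]
        start += size


def search(candidates, score_fn, n_samples, k):
    """Return the candidate with the best mean validation score over k folds."""
    best, best_score = None, float("-inf")
    for params in candidates:
        scores = [score_fn(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best, best_score = params, mean
    return best, best_score


def toy_score(params, train_idx, val_idx):
    """Hypothetical stand-in for 'fit on train_idx, evaluate on val_idx'."""
    return -abs(params["C"] - 1.0) - (0.1 if params["penalty"] == "l1" else 0.0)


space = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
best_grid, grid_score = search(grid_candidates(space), toy_score, n_samples=10, k=5)
best_random, random_score = search(random_candidates(space, n_iter=3), toy_score, 10, 5)
print(best_grid, grid_score)
print(best_random, random_score)
```

Grid search evaluates all eight combinations, whereas randomized search evaluates only the three sampled ones; it is cheaper but may miss the best combination.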


Evaluation:

Evaluation is the fifth phase of CRISP-DM, “…which focuses on evaluation of obtained models and deciding of how to use the results” (Shafique and Qaiser, 2014: 220). The interpretation of the model depends on the algorithm. Models are evaluated to verify whether the goals set in the business understanding phase have been achieved (Sharafi, 2013: 67; Shafique and Qaiser, 2014: 220).

Chapter 5.3 introduces the model evaluation metrics used in this case study to describe the performance of each classification model. The following metrics are included in a classification report: Recall, Precision, F1-Score and Accuracy. A confusion matrix is used to count the true positive, true negative, false positive and false negative cases (Akosa, 2017: 2).
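
How these metrics follow from the four cells of a confusion matrix can be shown with a short Python sketch. The counts below are hypothetical and are not results from the case study; the function name is likewise chosen only for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical classifier on 10,010 transactions, 100 of them fraudulent.
model = classification_metrics(tp=80, tn=9900, fp=10, fn=20)

# Hypothetical baseline that labels every transaction as legitimate.
baseline = classification_metrics(tp=0, tn=9910, fp=0, fn=100)

print(model)
print(baseline)
```

The baseline reaches an accuracy of about 99 percent although it finds no fraud at all, which illustrates why recall, precision and F1 score are reported alongside accuracy on imbalanced data.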


Deployment:

The last phase of the CRISP-DM process is Deployment. In this phase, the results and the knowledge gained are fed back into the organization as the output of the analysis (Sharafi, 2013: 67). The following activities are assigned to the deployment phase: presentation of gained knowledge, deployment plan, plan of monitoring and maintenance, production of final report and project review (Shearer et al., 2000: 18).

This bachelor thesis covers only a part of the deployment phase. The achieved goals of the case study as well as an evaluation summary and a conclusion on the performance of each DM technique are discussed in chapter 6.

The last chapter also includes a summary of the previous chapters, the answer to the research question, a critical review and an outlook on open issues that may be relevant for further research.

2 White-Collar Crime

The concept of white-collar crime was conceived by Edward Alsworth Ross as early as 1907 (Salinger, 2004: 4). But it was Edwin Sutherland who first published the term “White-Collar Crime” in 1937 (Stadler and Lovrich, 2012: 9). He defined the term as “a crime committed by a person of respectability and high social status in the course of his occupation” (Sutherland, 1983). Today his definition is somewhat outdated. There is still no standard definition of the term white-collar crime, but it is often used in the literature as a concept for various aspects of crime in the context of economic life (Techmeier, 2012). The concept is divided into two groups of crime: occupational crime, i.e. internal crime directed against one’s own company, and corporate crime (Techmeier 2012: 13). What they have in common is their reference to the company (ibid.). Sutherland’s concept of white-collar crime relates more to criminal behaviour in the economy for individual benefit (Sutherland, 1983). Occupational crime and corporate crime, on the other hand, assume that corporations themselves commit crimes to pursue economic intentions (Techmeier 2012: 13).

In Germany, white-collar crime denotes crimes that have an economic dimension (Schuchter, 2012: 44). As there is no legal definition of white-collar crime in Germany, the BKA reverts to the catalog of § 74c para. 1 no. 1 to 6b of the Judicature Act (Gerichtsverfassungsgesetz, GVG) when assigning criminal offenses (BKA, 2015: 3). The BKA defines white-collar crime as offenses committed in the context of an actual or feigned economic activity that exploit economic processes for profit and lead to a loss of assets or harm many persons, other companies or the state (ibid.).

The KPMG study (2016), based on a survey covering the period from 2014 to 2016, found that 36 percent of the companies and 45 percent of the large companies surveyed were affected by white-collar crime (KPMG, 2016: 6). It is difficult for companies to identify the damage caused during the period because preventive measures are missing (ibid.). To prevent white-collar crime, it is becoming increasingly important for companies to launch a compliance program (ibid.: 6-8). The survey found that, despite the high damage caused by white-collar crime, one third of the surveyed companies are not ready to invest more than 10,000 euros in a compliance program (ibid.: 7). To investigate white-collar crime, 57 percent of the surveyed companies use their stored data for data analysis (ibid.). In large and medium-sized companies, an e-mail review is also carried out to detect white-collar crime (ibid.). Small companies rely on classical education measures because they lack technical know-how and resources (ibid.).

2.1 Fraud Management

Establishing an effective internal control system (ICS) in a company is self-evident (Zawlilla 2012: 14). It allows regular checks to be carried out, so that the trust placed in employees is justified. It can reduce the risk that an employee considers committing fraud (ibid.). With an ICS, it is possible to supervise an employee’s work over years and to compensate them accordingly, so that they are satisfied. However, even a functioning and actively practised ICS is not sufficient to prevent fraud or to discover it very early (ibid.: 15). In fraud detection, the ICS serves to detect irregularities as soon as possible after the act (ibid.). It is also of interest how a fraudulent activity can be detected before the act. Fraud Prevention and Fraud Management are implemented in this context (ibid.). A better term for Fraud Management would be Anti-Fraud Management, because it is ultimately about defence, i.e. the fight against economic and corporate crime (Hofmann, 2008: 53-55). The AuditFactory (2013) defines an Anti-Fraud-Management system “… as the purposeful bundling of functions and processes in a company…”. A Fraud Management system includes three tasks: Fraud Prevention, Fraud Detection and Responding to Fraud (CIMA, 2009: 24 ff.). It is used in large companies as a company-wide system for the prevention and detection of, and adequate reaction to, fraudulent acts (ibid.). In much of the literature, the third task, “Responding to Fraud”, is reduced to the keyword “Fraud Investigation” (IPPF, 2009: 24 & CGMA, 2012: 19).

illustration not visible in this excerpt

Figure 3: Prevention, Detection and Investigation of Fraud (IPPF, 2009: 19)

2.1.1 Fraud Prevention

Fraud prevention comprises all measures concerned with preventing white-collar crime. Prevention aims to uncover the crime before the act happens (Zawlilla et al., 2012: 263). Future-oriented prevention of criminal offences must be at the centre of anti-fraud management (Hofmann, 2008: 81). Hofmann quotes Gisler: “Preventing fraud is infinitely better than detecting afterwards and then struggling to recover from financial losses and negative publicity.” (ibid.). Investment in prevention is costly for the company, and the benefits are difficult to measure (ibid.). Only when a company experiences a fraud case does the benefit of the investment become tangible (ibid.). It should be considered that the damage caused by white-collar crime is much higher than the investment in prevention and detection.

For companies, the question arises of how they can achieve the best prevention. A first step could be, for example, an efficient internal control system and a redesign of processes with the aim of achieving the highest level of security (Hofmann, 2008: 82). “An organization with effective internal controls deters fraudsters from the temptation to commit fraud” (IPPF, 2009: 20).

For this, companies should install effective and efficient instruments that can at least detect at an early stage that a criminal offence has been committed (Hofmann, 2008: 82). According to the International Professional Practices Framework (IPPF, 2009: 20), “Management is primarily responsible for establishing and maintaining internal controls in the organization”.

2.1.2 Fraud Detection

The second approach of Fraud Management – Fraud Detection – ensures that white-collar crimes are identified at an early stage (Zawlilla et al. 2012: 264). By implementing different measures, the detection rate can be increased and detection can take place as early as possible (ibid.). According to IPPF (2009: 21), “Detective controls are designed to provide warnings or evidence that fraud is occurring or has occurred”. Zawlilla et al. (2012: 265) mention some specific measures that can be relevant to fraud detection, for instance:

- Update Fraud Detection programs;
- Determine sample and standard analyses;
- Carry out up-to-date analyses of risk vulnerability;
- Examine and clarify striking facts;
- Derive consequences and measures;
- and others.

2.1.3 Fraud Investigation

“Organizations investigate for possible fraud when there is a concern or suspicion of wrongdoing within the organization.” (IPPF, 2009: 23). This is the third approach of a fraud management concept. Such a suspicion may result from a formal and informal complaint process, “… including an audit designed to test for fraud.” (ibid.). “A fraud investigation consists of gathering sufficient information about specific details and performing those procedures necessary to determine whether fraud has occurred, the loss or exposures associated with the fraud, who was involved, and how it happened.” (Gleim, 2008: 19). The investigation of fraud involves “… internal auditors, lawyers, investigators, security personnel, and other specialists…” (ibid.). The investigations must be prepared and documented so that they hold up in possible legal proceedings (ibid.).

A fraud response plan is “…an integral part of the organisation’s contingency plans.” (CGMA, 2012: 31). It defines the arrangements that apply when fraud is detected or suspected (ibid.: 44). “… (B)enefits arising from the publication of a corporate fraud response plan are its deterrence value of likelihood that it will reduce the tendency to panic.” (ibid.). Other benefits include minimising the organization’s losses and retaining market confidence (ibid.: 19). A fraud response plan is a formal means by which all information can be passed on to all employees and possibly to external persons, e.g. stakeholders and suppliers (CIMA, 2009: 44). Management decides whether a fraud has occurred or not (ibid.). It is also responsible for disclosing a fraud investigation to external organizations (IPPF, 2009: 23). A successfully published investigation may serve as a deterrent to a person who was, for example, about to commit a criminal offence (ibid.).

The investigation plan for Fraud Investigation is structured as follows: “The lead investigator determines the knowledge, skills, and other competencies needed to carry out the investigation effectively and assigns competent, appropriate people to the team.” (IPPF, 2009: 24). The plan for the investigation may include (ibid.):

- Activities such as documentation and storage of evidence,
- “Determining the extent of fraud”,
- “Determining the techniques used to perpetrate fraud”,
- “Evaluating the cause of the fraud”, and
- “Identifying the perpetrators” (ibid.).

2.2 Fraud Triangle

The concept of the “Fraud Triangle” was introduced into the professional literature in SAS No. 99, Consideration of Fraud in a Financial Statement Audit (2002), and is now used in various ways to explain the emergence of white-collar crime. The fraud triangle explains the conditions that must be met for white-collar crime to be committed. The approach goes back to the American criminologist and sociologist Donald R. Cressey (Scherp 2015: 85), who dealt with the causes of fraud in companies. The result of his work is a simple and convincing model: the “Fraud Triangle” (ibid.).

Scherp (2015: 86) identifies three conditions of the Fraud Triangle that must come together to enable fraud:

- Opportunity
- Incentive/Pressure
- Rationalization/Attitude

illustration not visible in this excerpt

Figure 4: Fraud-Triangle (cf. Scherp, 2015)

2.2.1 Opportunity

Opportunity is the basic possibility of committing a criminal act without being caught. This includes the offender’s position in the company, which gives him access to the object of the crime or lets him influence processes in order to abuse them on behalf of the company (Scherp 2015: 86). Two elements play a particular role here: on the one hand the potential offender’s basic know-how, on the other hand his technical skills (ibid.). The first element, knowledge, includes e.g. knowledge of the functionality and weaknesses of the ICS, knowledge of the deliberate actions of others, or the recognition that the employer’s trust in oneself can be used for one’s own benefit (ibid.). Technical skills are those necessary for the actual execution of the act, usually methodological or professional skills. It follows that the employee’s function generally predetermines the manner of the crime (ibid.).

2.2.2 Incentive/Pressure

To commit a fraudulent act, the offender must have an incentive or be under pressure. A fraudulent motivation exists if the financial stability of the company as a whole or the personal financial situation of the management is threatened, or if the pressure to meet the expectations of third parties is particularly pronounced (Hofmann 2008: 207). Cases of conscious and intentional harm to the employer without self-interest also occur (Scherp 2015: 87). In the banking sector, for example, an offender may defraud customers to achieve the best sales figures for his company (ibid.). In his early research, Cressey assumed that criminals are driven into fraudulent activities by a specific pressure situation (ibid.: 88). It quickly became clear, however, that there may be autonomous motives or intrinsic motivations for fraudulent activities, such as Financial Pressure and Perceived Pressure (ibid.). “Greed” is also a significant motivation for achieving indirect benefits such as bonuses, promotion, recognition and career advancement (ibid.: 87).

2.2.3 Rationalization/Attitude

Rationalization is the final component that completes the fraud triangle. “Rationalization is how the fraudster justifies inappropriate actions. It is ‘the provision of reasons to explain to oneself or others behavior for which one’s real motives are different and unknown or unconscious.’” (Biegelman and Bartow 2012: 35). Using internal justification, the fraudster can maintain his self-image as a valuable member of society, so that “the fraudster is convinced that what occurred is not bad or wrong. (…) Rather than consider themselves as criminals who just defrauded their company, they make themselves into victims.” (ibid.). Classic inner justifications are a self-image as an essentially non-criminal personality and the existence of subjective excuses. The fraudster tells himself, for example, ‘It is my money anyway’ or ‘I have only borrowed the money’. The troubling part is that rationalization takes place in the mind of the fraudster and is therefore not visible from outside.

3 Types of White-Collar-Crimes

3.1 Fraud

The term fraud can be understood and defined in different ways; there is no universal definition. According to Singleton (2010: 40) “Fraud is a generic term, and embraces all the multifarious means that human ingenuity can devise, which are resorted to by one individual, to get an advantage by false means or representations.” He also defines fraud as deception (ibid.): “One might say that fraud in the form of intentional deception (including lying and cheating) is the opposite of truth, justice, fairness, and equity.” (ibid.).

Ngai et al. (2011: 562) give two different definitions. The first follows The Oxford English Dictionary (2017), which defines fraud as a “Wrongful or criminal deception intended to result in financial or personal gain.” The second, also reported by Ngai et al. (2011: 562), describes “fraud as leading to abuse of a profit organization’s system without necessarily leading to direct legal consequences”.

Fraud can be classified as internal fraud or external fraud (Jans et al., 2009: 3, following Bologna and Lindquist, 1995). External fraud is committed by outside parties such as providers, suppliers or contractors (ibid.). Criminal offenses committed by members of the organization, such as managers, are referred to as internal fraud. A combination of internal and external fraud can also occur when, for example, an employee cooperates with an external party to gain financial advantages and harm the organization (ibid.). Furthermore, following Bologna and Lindquist (1995), other classifications of fraud are mentioned in the literature (Jans et al., 2009: 3):

- Transaction Fraud versus Statement Fraud
- Fraud for and against the organization
- Management and Non-Management Fraud

Figure 5: Fraud classification overview (Jans, Lybaert and Vanhoof, 2009: 5)

As shown in Figure 5, there is a distinction between internal and external fraud. All other classifications represented are assigned here to internal fraud. The two significant types of internal fraud are “statement fraud” and “transaction fraud”. On the second level, there is a distinction between the “… occupation level of the fraudulent employee: management versus non-management fraud” (ibid.: 5). Figure 5 also shows that a manager can commit both statement and transaction fraud, whereas a non-management employee is limited to transaction fraud. In the last classification, a distinction is made between fraud for and fraud against the company. Fraud for the company is assigned here to statement fraud, “(a)lthough fraud for the company does not necessarily need to be statement fraud (for example breaking environmental laws) …” (ibid.: 5-6).

The authors assume that “… only managers are in an advantageous position to commit fraud for the company…” (ibid.: 6). The figure therefore shows an overlap with management fraud, while fraud against the company can be committed by managers and non-managers alike (ibid.).

In the following sub-chapters, some types of fraud and white-collar crime are examined more closely. It should be noted beforehand that the field of fraud is so broad that it would merit a separate study. Financial fraud is therefore divided into three specific areas: Bank Fraud, Insurance Fraud and Securities Fraud. Credit card fraud and money laundering, for example, can be assigned to bank fraud, while health care fraud and automobile fraud can be attributed to insurance fraud. There are many types of financial fraud and other white-collar crimes. This thesis focuses on the most common of them.

3.2 Credit Card Fraud

Credit card fraud is the most common subcategory of Bank Fraud. The Legal Dictionary (n.d.) defines credit card fraud as “The unauthorized use of an individual’s credit card or card information to make purchases, or to remove funds from the cardholder’s account.”. Nowadays, almost every person owns a credit card. On the one hand, credit cards make life easier; on the other hand, they have given rise to frauds that would not otherwise exist and that are detected more and more frequently. Brause et al. (2010: 1) estimate that, for 400,000 transactions a day, a reduction in fraud of only 2.5 percent can save millions a year.

Credit card fraud increased in 2014 in Germany according to FICO (2015). They report that a sample of 7.5 million active cards issued in Germany showed losses to credit card fraud up by 17 percent in the year from October 2013 to September 2014, compared to the previous year.

According to a report by Buonaguidi (2017), published by the BBC on 12 July 2017, since the “…1980s, there has been an impressive increase in credit, debit and pre-paid cards internationally.” The Nilson Report (2016), to which Buonaguidi refers, gives some figures on card usage. In 2015, these payment systems generated more than 31 trillion dollars in total volume worldwide, up 7.3 percent from 2014 (ibid.). Fraud losses in 2015 amounted to 21.84 billion dollars, an increase of 20.6 percent compared to 2014 (Nilson, 2016: 1, 6). Buonaguidi (2017) also mentions the two most important categories of credit card fraud: card-not-present and card-present fraud. The first type is the most common. It occurs when the cardholder’s information is stolen and used illegally (ibid.). Such information is usually obtained through so-called “phishing” emails (ibid.), but telephone calls and social networks are also used to obtain financial information from the victim (Nilson, 2016: 6). Card-present fraud is less common today. It may occur, for example, when the cardholder loses his card and the fraudster uses it in a supermarket. A second known form is “skimming” (Buonaguidi, 2017): the fraudster inserts the consumer’s card into a device that stores all of its information. If the fraudster then uses the data to make a purchase, the victim’s account is debited.

3.3 Healthcare Fraud

Health care fraud, like other fraud, requires that false information be presented as truth. The Legal Dictionary (n.d.) defines health care fraud as “The knowing and wilful executing, or attempt to execute, a scheme or deceit to defraud a health care insurance or benefit program, or to obtain by fraudulent means any benefit or payment from the program.” There are many possibilities for billing fraud in the health care system. Forged prescriptions, double billing, unnecessary treatments and misuse of insurance cards are the most common types attributed to this kind of fraud (Lescher and Baldeweg, 2012: 5). The Legal Information Institute and the National Health Care Anti-Fraud Association (LII, n.d.; NHCAA, n.d.) mention other forms of health care fraud. Some of these are:

- Billing for services that were never rendered, or for medically unnecessary services performed purely for the purpose of generating insurance payments,
- Obtaining fully covered medicines that the patient does not need and selling them on the black market for profit,
- Billing for more expensive services or procedures than were actually provided, while patients were treated with cheaper resources,
- Intentional incorrect reporting of diagnoses or procedures to maximize payment,
- Falsification of diagnoses to justify tests or operations that are not medically necessary.

In a PwC study by Lescher and Baldeweg (2012: 5) on health care fraud, health policy makers, care providers, customers and experts agree that misconduct in the health care system causes losses every year. Insurance companies are therefore forced to increase their contributions (ibid.). The victims in this case are the citizens, who must pay more for their health insurance. The PwC study refers to the anti-corruption organization Transparency International Germany, which assumes that fraud, waste and corruption in health care cause damage in double-digit millions (in Europe) (ibid.). The European Network (EHFCN, 2010) reported a financial loss of 13 billion euros in 2010. This corresponds to around 5 % of current health expenditure in Germany (Lescher and Baldeweg, 2012: 12). German health care distinguishes between statutory health insurance and private health insurance. The PwC study (ibid.) transfers the above-mentioned loss totals to Germany and assumes that in 2010 a loss of 1 billion euros can be attributed to private health insurance. For statutory health insurance, damage of 8 billion euros through fraud can be assumed (ibid.). The PwC study (ibid.) therefore estimates that the damage from billing fraud alone may amount to 100 to 200 million euros annually.

The PwC survey reveals that almost every company surveyed has been a victim of billing fraud (ibid.). Approximately 64 percent of the fraud cases concerned statutory health insurance companies, which stated that one to ten cases of fraud occurred in 2011 (ibid.). Private health insurance companies, on the other hand, identified between 11 and 50 cases of billing fraud (ibid.). It should be noted that only the numbers from the so-called bright field, i.e. the detected cases, are mentioned here. The numbers in the large dark field are not known. Almost two-thirds of the companies surveyed, about 62 percent, estimate that the extent of unidentified fraud (dark field) is high or very high (ibid.).

3.4 Embezzlement

According to Böttner (n.d.), embezzlement is the second largest group of offenses in the law relating to economic crime. The accusation of embezzlement can affect anyone who makes decisions about the assets of others (ibid.). Besides managing directors, board members and politicians, this includes ordinary employees with their own decision-making powers (ibid.). In simple words, embezzlement means the personal use of money, property or some other thing of value that has been entrusted to the offender’s care or control. Section 266 of the German Criminal Code defines embezzlement as follows: “Whosoever abuses the power accorded him by statute, by commission of a public authority or legal transaction to dispose of assets of another or to make binding agreements for another, or violates his duty to safeguard the property interests of another incumbent upon him by reason of statute, commission of a public authority, legal transaction or fiduciary relationship, and thereby causes damage to the person, whose property interests he was responsible for ...” (Bohlander, 2016). Embezzlement of money is not always associated with white-collar crime. In the area of white-collar crime, it mostly occurs where the management uses company funds for private expenditure (Liebl, 2016: 5).

This type of offence frequently occurs in connection with investment fraud (BKA, 2016: 4). The police crime statistics show that in 2016 the number of fraud and embezzlement cases fell by 2.6 % to 7,815 cases. The resulting loss was 356 million euros (2015: 328 million euros) (ibid.: 14).

3.5 Criminal Insolvency Offences

Offenses involving the over-indebtedness and imminent insolvency of a debtor are referred to as criminal insolvency offences (Wolfhart Nitsch, 2014: 564). These crimes also break the faith of the creditors and cause, or at least attempt to cause, damage to the entire economy. In general usage, insolvency means the inability to pay (Diversy and Weyand, 2013: 21). Two types of insolvency proceedings must be distinguished: regular insolvency and consumer insolvency (ibid.: 19). Regular insolvency applies to persons with income from self-employment and to formerly self-employed persons with more than 20 creditors or with liabilities arising from employment relationships. Consumer insolvency, on the other hand, applies to all other persons, such as employees, unemployed persons and pensioners.

The high number of insolvencies has been an economic problem for years. Not only large companies but also midsize companies and start-ups are hit by business failures. In a press release, Creditreform (2016a) announced the number of insolvencies in Germany for the year 2016. It reports that the number of insolvencies has been declining for six years. In 2016, a total of 123,800 insolvency cases were registered, about 3 % fewer than in 2015 (127,500 cases). The number of consumer insolvencies fell by 2.5 % in 2016 to a total of 78,200 cases. Regular insolvencies declined more sharply, by 6.4 %: a total of 21,700 were registered in 2016 (2015: 23,180), the lowest level since 1999 (ibid.).

Furthermore, Creditreform (2016a) reports that total financial losses rose in 2016. A total loss of around 27.5 billion euros was incurred, about 40 % more than in the previous year and the highest figure in four years. In addition, about 221,000 workers were affected by the insolvency of their employers. The share of older companies (over 20 years old) among those reporting insolvency has risen to 16.4 % (ibid.).

The legal form of the private limited company is very popular in German economic life and is often encountered in investigations of company collapses (Diversy and Weyand, 2013: 21). The main reason for corporate crises is a low equity ratio, especially among companies in the services sector (ibid.: 22). Only a small amount of capital is required to incorporate in this legal form, and this capital can be fully committed to business use (ibid.: 23). According to another press release by Creditreform (2016b), the equity ratio of midsize companies has continued to decline. This is due to the low interest rates, which facilitate loan financing. A survey showed that only 29.3 % of respondents had an equity ratio of more than 30 %, compared with 31.6 % in the previous year. An equity ratio of less than 10 % is now reported by 29.8 % of respondents (2016: 28.5 %). In most cases, the cause of insolvency is the equity ratio. This usually applies to small companies that employ no more than 5 people (Diversy and Weyand, 2013: 23). New start-up companies are particularly vulnerable to crises, especially if start-up subsidies are eliminated or unforeseen conditions are added (ibid.). Further causes include management failures, particularly in smaller family businesses, that lead to corporate crises, as well as fixed costs and dependence on individual customers whose collapse damages the company itself (ibid.: 24). Those affected try to save their company by all possible means, shifting the risks to lenders and trying to save the remaining assets (ibid.). Moreover, the fear of losing one’s social position through insolvency creates an incentive to commit a crime, intentionally or unintentionally, in order to preserve the current standard of living (ibid.).

3.6 Corruption

As with the term white-collar crime, there is no legal definition of corruption in the German Criminal Code. It should also be noted that corruption is an economic offence only to a certain extent. Bribing an official, for example handing a banknote to a police officer to avoid losing one’s driving licence, is not an economic offence. Transparency International Deutschland e. V. (TID n.d.) describes corruption as the abuse of entrusted power for private benefit or advantage. Corruption can take the form of bribery or venality, in international business or in one’s own country. Likewise, the buying of political influence or the attempt to gain advantages through bribes is called corruption.

The German Criminal Code (Bohlander 2016) distinguishes between two criminal offences: bribery in the public sector (section 331ff, Criminal Code) and bribery in the commercial sector (section 299ff, Criminal Code). Criminal statistics distinguish between situational and structural corruption. Situational corruption refers to corrupt practices that occur spontaneously and are not planned or prepared (Hlavica et al., 2017: 227). Structural corruption is characterized by acts that are based on long-term corrupt relations and are knowingly planned (ibid.).

A typical corruption case in white-collar crime would be an official who receives grants or gifts from a company and, in return, awards this company the contract for a renovation.

A study by KPMG (2016: 11) showed that 48 % of the companies surveyed consider the risk of corruption to be high. However, only 16 % of companies were affected by corruption in 2015, which is 6 % less than in the previous year (ibid.).

4 Data Mining, Text Mining and Big Data

4.1 Introduction into Big Data

With the increasing use of electronic devices, especially smartphones, daily life is confronted with Big Data and new challenges. It is correspondingly difficult for data analysts to understand new problems and provide real-time solutions. To illustrate the scale, Sharma and Pandey (2015: 1) mention some well-known platforms that generate high data volumes: Twitter, for example, creates more than 12 terabytes daily, Facebook generates over 25 terabytes of log data every day, and at Google the figure is 24,000 terabytes of data each day.

A further problem is that Big Data has three essential attributes that make this massive data difficult to understand. Around 80 percent of data is in unstructured or semi-structured form (Talib et al., 2016: 414), so "... there is a huge need to understand these unstructured data and solve many business problems as possible" (Sharma and Pandey, 2015: 1). A problem that many companies share with regard to their unstructured data is the detection of financial fraud (ibid.: 4-6), which causes heavy losses every year (see Chapter 3: Types of White-Collar-Crimes). According to the literature review by Sharma and Pandey (2015: 2), some experts divide the characteristics of Big Data into four or five V's. In the following sub-chapters of this thesis, only the three main V's are discussed.

4.1.1 The 3 V’s of Big Data


Volume:

The large amount of data is considered an important characteristic of Big Data. Sharma and Pandey (2015: 2) define Volume as "...the amount of data one needs to process, to find meaningful information". McAfee and Brynjolfsson (2012: 61-68) estimated that, as of 2012, about 2.5 exabytes (circa 2,600,000 terabytes) of data were created each day and that this volume doubled every 40 months. Cano (2014) said that "Collecting and analyzing this data is clearly an engineering challenge of immense proportions". Because of its size, the data cannot be stored in one place. Distributed systems are therefore used, in which software brings the data together when the analyst needs it for analysis (ibid.).
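
The figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the binary convention (1 exabyte = 1024² terabytes), which is what the rounded value of roughly 2.6 million terabytes suggests; the function names and the 80-month projection are illustrative and not taken from the cited sources.

```python
# Assumption: binary units, 1 EB = 1024 PB = 1024**2 TB.
EXABYTE_IN_TERABYTES = 1024 ** 2


def exabytes_to_terabytes(exabytes: float) -> float:
    """Convert exabytes to terabytes using binary units."""
    return exabytes * EXABYTE_IN_TERABYTES


def projected_daily_volume(start_eb: float, months: float,
                           doubling_months: float = 40.0) -> float:
    """Daily volume after `months`, if it doubles every `doubling_months`."""
    return start_eb * 2 ** (months / doubling_months)


print(exabytes_to_terabytes(2.5))        # 2621440.0 terabytes per day
print(projected_daily_volume(2.5, 80))   # 10.0 exabytes per day after two doublings
```

With decimal units (1 exabyte = 1,000,000 terabytes) the same 2.5 exabytes would correspond to 2,500,000 terabytes, so the choice of convention explains the small difference.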


Velocity:

The high speed at which data is generated and processed is also a feature of Big Data. It is driven by social media platforms like Facebook, Twitter and YouTube, because any person with access to the internet can create data on these platforms (Sharma and Pandey, 2015: 2). With growing numbers of users, applications, networks and sensors, data becomes available ever faster and can also be processed in real time (ibid.). Not so long ago, batch processing was common. It was normal to get an update from the database every night or every week because data processing took a lot of time (Van Rijmenam, 2015). "Today the speed at which data is created is almost unimaginable" (ibid.). Every minute, people from all over the world upload a hundred hours of video to YouTube, send over 200 million emails and post almost 300,000 tweets on Twitter (ibid.). All of this requires databases with higher performance.


Variety:

Another important feature of Big Data is the heterogeneity of the data structure. Data is generated in various formats. "In the past, all data that was created was structured data..." and today "... 90% of the data that is generated by organisation is unstructured data." (Van Rijmenam, 2015). Since the data now comes from different sources, for example from social networks or sensors, it differs in format. A distinction is made between structured, semi-structured and unstructured data (ibid.).

4.1.2 Data Forms

Structured Data:

Structured data follows a fixed format and database schema. It is often generated during business transactions (Sharma and Pandey, 2015: 2-3). The results of the transactions are usually stored in a relational database, so that "... querying the data and then extracting relevant information is quite easy. This is how most of the organization keep their data with the well-defined database schema." (ibid.: 2).
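
To illustrate why querying structured data is comparatively easy, the following minimal sketch stores a few transactions in an in-memory relational database using Python's built-in `sqlite3` module. The table name, the columns and the sample rows are invented for illustration and do not come from the cited study.

```python
import sqlite3

# In-memory database with a fixed, well-defined schema (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    " id INTEGER PRIMARY KEY,"
    " card_id TEXT NOT NULL,"
    " amount REAL NOT NULL,"
    " country TEXT NOT NULL)"
)
rows = [
    (1, "card-A", 25.00, "DE"),
    (2, "card-A", 1890.50, "US"),
    (3, "card-B", 12.99, "DE"),
    (4, "card-B", 2400.00, "BR"),
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)


def large_transactions(connection, threshold):
    """Return (id, card_id, amount) of all transactions above `threshold`."""
    cursor = connection.execute(
        "SELECT id, card_id, amount FROM transactions"
        " WHERE amount > ? ORDER BY amount DESC",
        (threshold,),
    )
    return cursor.fetchall()


print(large_transactions(conn, 1000))
```

Because every record has the same typed columns, a single declarative query is enough to extract the relevant rows.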

Semi-Structured Data:

"These types of data also exist in a structured format, but data is not maintained in a database, rather flat files." (ibid.). The different definitions that exist for semi-structured data can be summarized as follows: semi-structured data is not strictly typed but has a certain structure, which cannot be recognized immediately. Examples of semi-structured data are XML data, JSON files and the HTML source code of a website (ibid.).

Unstructured Data:

Unstructured data has no uniform structure and is stored outside of a conventional database system. Today, most of the data generated is unstructured (ibid.: 2). A very high percentage of all data stored in a company's computer systems is in unstructured form. Examples of such data are images, texts, phone calls and notes, which are generated almost daily in every company (ibid.). Other examples are shown in the figure below.

Figure 6: Classification of Big Data (Blakehead, 2013: 1)

As shown in the figure above, data is available in different formats. The three forms, structured, semi-structured and unstructured data, have already been briefly explained. Structured and semi-structured data have an identifiable structure, whereas unstructured data comes in highly varied forms, as can be seen in the figure above. Analysing unstructured data is a new challenge today (Rashid Al-Azmi, 2013: 4).
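
The practical difference between structured and semi-structured data can be made concrete with a small example. The sketch below parses a JSON document whose records have an identifiable but not strictly typed structure: fields may be missing, nested, or stored as text instead of numbers. The field names and values are invented, and the flattening rules are only one possible choice.

```python
import json

# Semi-structured input: records share a rough shape, but not a fixed schema.
raw = """
[
  {"id": 1, "amount": 25.0, "merchant": {"name": "Bookshop", "country": "DE"}},
  {"id": 2, "amount": "1890.50", "note": "manual entry"},
  {"id": 3, "merchant": {"name": "Kiosk"}}
]
"""


def flatten(record):
    """Map one loosely structured record onto a fixed set of columns."""
    merchant = record.get("merchant", {})
    amount = record.get("amount")
    return {
        "id": record["id"],
        # Amounts may arrive as numbers or as strings, or be absent.
        "amount": float(amount) if amount is not None else None,
        "merchant_name": merchant.get("name"),
        "merchant_country": merchant.get("country"),
    }


table = [flatten(r) for r in json.loads(raw)]
for row in table:
    print(row)
```

Only after such a normalization step does the data have the uniform columns that a relational table provides from the start.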

These types of data play a very important role in fraud detection. While structured data can be analyzed relatively easily using different data mining techniques, unstructured data is a major problem in this area. As described in Chapter 3 “Types of White-Collar-Crimes”, there are many types of fraud that can cause severe damage to a company or the state. To reduce fraud, the data must be analyzed. However, since much of this data is kept in unstructured form, the first challenge is to analyze behavioral patterns in order to find the context. Text Mining (TM) is used for this purpose. “Text Mining is a process of extracting meaningful numeric indices (structured data) from unstructured text” (Gupta and Gill, 2012: 189) and “… can analyze words or cluster of words and can used for determining the relationship with other variables of interest such as fraud or non fraud.” (ibid.).
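
One simple way to obtain such numeric indices from text is a term-count (bag-of-words) matrix. The following is a minimal sketch using only the Python standard library; the tokenization rule and the two sample messages are assumptions made for illustration, not the method of the cited authors.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(documents):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = [
    "Urgent: please wire the refund today",
    "Invoice attached, refund not required",
]
vocab, matrix = term_counts(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each document becomes a numeric row, which can then be passed to data mining techniques together with a label such as fraud or non-fraud.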

4.2 Data Mining

Data Mining (DM) belongs to the machine learning techniques and is defined as the process of analysing large databases (Hashimi, Hafez and Mathkour, 2015: 729). DM, also known as Knowledge Discovery, is an automated process that analyses huge amounts of data to discover new information, hidden patterns and behaviours (ibid.). There is no single agreed definition of DM. According to Gartner (2017), “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

Like most machine learning algorithms, DM techniques learn models from data (Potamitis, 2013: 27). There are three approaches to learning data mining models: supervised learning, unsupervised learning and semi-supervised learning (Sorin, 2012: 110).

4.2.1 Types of Machine Learning

Supervised learning is the most frequently used method. Here the model is trained with predefined class labels (Lloyd, Mohseni and Rebentrost, 2013: 1), so it is essential that the labels are known. This method is most commonly used for credit card fraud detection, for example, where the class labels are fraudulent and non-fraudulent transaction (Bhattacharyya, Jha, Tharakunnel, & Westland, 2011). The prediction model is created with a training set. Each new transaction can then be compared to the model to predict its class. If the new transaction resembles fraudulent behaviour as described by the trained model, it is classified as a fraudulent transaction. An important challenge for the supervised learning approach is that the class distribution should be balanced for good predictions. For this purpose, various sampling techniques are mentioned in the literature, such as undersampling and oversampling. These sampling techniques are considered in more detail in chapter 5.5 “Sampling techniques”.
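
The train-then-predict cycle described above can be sketched with a deliberately simple classifier. The example below uses a nearest-centroid rule implemented in plain Python; the two features (amount and hour of day), the toy transactions and the function names are invented for illustration and are not the models evaluated in the case study.

```python
from math import dist


def fit_centroids(samples, labels):
    """Training step: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Prediction step: assign the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Labelled training set (hypothetical): (amount in euros, hour of day).
train_x = [(20.0, 14), (35.0, 10), (15.0, 18), (950.0, 3), (1200.0, 2)]
train_y = ["non-fraud", "non-fraud", "non-fraud", "fraud", "fraud"]

model = fit_centroids(train_x, train_y)
print(predict(model, (1000.0, 4)))
print(predict(model, (30.0, 12)))
```

In this toy setting the amount dominates the distance because the features are not scaled; a realistic model would normalize the features and would have to deal with the class imbalance mentioned above.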

In contrast to supervised learning, the goal of unsupervised learning is to find hidden structures in unlabeled data (Lloyd, Mohseni and Rebentrost, 2013: 1). A suitable DM approach for unsupervised learning is clustering.

The last approach of machine learning is semi-supervised learning. This method lies between supervised learning (all training data are available with labels) and unsupervised learning (all training data are available without labels) (Potamitis, 2013: 23). So, in this approach a small number of labelled samples and many unlabeled samples are required (ibid.).

4.2.2 Classification of Data Mining Applications

The following sections describe the most common classes of data mining applications, which are also mentioned in the study by Sorin (2012: 115). Each of them handles a different class of problems.


Classification is the most commonly applied DM technique. It employs a set of pre-classified examples to develop a model that can classify the population of records at large (Herbert, 1999: 10-11). According to Zhang and Zhou (2004: 34), classification or prediction is the process of identifying a set of common features and proposing models that describe and distinguish data classes and concepts based on examples. The most common DM techniques for fraud detection are Neural Networks, Decision Trees (DT) and Support Vector Machines (SVM) (Ngai et al., 2011: 563).
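
How a model is developed from pre-classified examples can be shown with the simplest possible decision tree, a so-called decision stump with a single split. The sketch below searches for the amount threshold that misclassifies the fewest training records; the data and the rule "fraud if amount exceeds the threshold" are assumptions made for illustration only.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that misclassifies the fewest records.

    The learned rule is: predict fraud if value > threshold.
    """
    candidates = sorted(set(values))
    best_threshold, best_errors = None, len(values) + 1
    # Try the midpoint between each pair of neighbouring feature values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        errors = sum((v > threshold) != label for v, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


# Pre-classified examples (hypothetical): transaction amounts and fraud labels.
amounts = [12.0, 40.0, 55.0, 300.0, 800.0, 1500.0]
is_fraud = [False, False, False, False, True, True]

threshold, errors = fit_stump(amounts, is_fraud)
print(threshold, errors)  # 550.0 0
```

A full decision tree repeats this search recursively on each resulting subset and over all available features.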


In clustering, also known as cluster analysis, groups of similar objects are identified (Ngai et al., 2011: 563). Clustering is chosen when, in some applications, the class affiliation is not available or is costly to identify (Zhang and Zhou, 2004: 34). The task of clustering is thus to assign unclassified records to a certain number of clusters on the basis of their features (ibid.).
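
A common way to perform such an assignment is the k-means procedure, which is used here purely as an illustrative example and is not named in the cited sources. The minimal sketch below alternates between assigning each record to its nearest centroid and recomputing the centroids; the points and the choice of initial centroids are assumptions.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Group unlabeled points around `len(centroids)` cluster centres."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

No class labels are used at any point; the groups emerge only from the similarity of the feature values.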


The goal of regression analysis is similar to that of the classification technique above. The only difference is that no classes are formed in regression. According to DMG, this function is used to determine the relationship between a dependent variable and one or more independent variables (Lemon et al., 2003: 173). Common DM tools for regression are linear regression and logistic regression (Ngai, Xiu and Chau, 2009: 2595).
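
For the simplest case of one independent variable, the relationship can be estimated with ordinary least squares. The following minimal sketch fits a straight line in plain Python; the data points are invented and chosen so that the exact relationship is y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]   # independent variable
ys = [3.0, 5.0, 7.0, 9.0]   # dependent variable, here exactly 2x + 1

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

Unlike the classification examples, the result is a numeric relationship rather than a class label.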


Prediction is also similar to classification. The difference is that in prediction the results lie in the future (Ngai et al., 2011: 562). One possible question for prediction analysis would be: “How will the dollar exchange rate develop in the future?”.

Neural networks and logistic model prediction are the most commonly used techniques in prediction analysis (ibid.).


Visualization refers to the presentation of data mining results in such a way that users can follow complex relationships in the data as visual objects with dimensions and colors (Shaw et al., 2001: 127-137). This makes it easier for users to understand complicated data as clear patterns and to make use of it. “Visualization helps business and data analysts to quickly and intuitively discover interesting patterns and effectively communicate these insights to other business and data analysts, as well as, decision makers” (Soukup, 2002: 5-6).

4.3 Text Mining

The basis for all reporting, planning, analysis and balanced scorecard applications for decision support in companies are the data warehouses, which receive their data from various operational and external sources and store it in structured form (Chaudhuri, Dayal and Narasayya, 2011: 90). The difference between structured, semi-structured and unstructured data has already been explained in chapter 4.1.2 “Data Forms”.

Due to the immense advances in hardware and software, the use of mobile devices and the spread of the internet, more and more semi-structured data (e.g. HTML and XML files) and unstructured data is being generated, such as text documents, emails, forum contributions, comments in social networks and free-text input in forms, but also audio, video and pictures. The development of communication technologies has made entering data simple, and the result is a huge repository.

For further understanding, the distinction between Data Mining (DM) and Text Mining (TM) is important. In theory, TM and DM have a common aim, namely exploiting information for knowledge discovery, but in practice the two techniques are used separately. While DM is a process based on algorithms for analysing and extracting useful information from data, TM is responsible for transforming weakly structured and unstructured textual data into a structured form and extracting meaningful information and knowledge (Gharehchopogh and Khalifelu, 2011: 2; Hashimi, Hafez and Mathkour, 2015: 1). Much of the data that has accumulated on computers over the years, e.g. in big companies, is so large that a human cannot read and analyse it manually, so TM techniques are used to deal with such data (Hashimi, Hafez and Mathkour, 2015: 1). Once the data is in a structured form, DM techniques can be applied to analyse it.

Miner et al. (2012: 31) divide TM into seven different practice areas and present them in a figure, see below.

Figure 7: A Venn diagram of the TM intersection with other fields (Miner, 2012: 32)

Figure 7 distinguishes between the practice areas of Information Retrieval, Document Clustering, Document Classification, Information Extraction, Natural Language Processing, Concept Extraction and Web Mining. It also shows the adjoining fields of research, namely Library and Information Science, Databases, Data Mining, Artificial Intelligence and Machine Learning, Statistics and Computational Linguistics, together with their overlaps or points of contact with the individual practice areas. The practice areas of Text Mining are explained in more detail in the following subchapter.

4.3.1 Practise areas of Text Mining

Information Retrieval (IR)

The main task of Information Retrieval (IR) is not to analyse the data, but to index, search and retrieve documents from large text databases by means of keyword queries (Miner et al., 2012: 36). Today, IR systems are built into a wide range of applications. The Internet search engine Google, for example, relies on this technology, but other applications such as email clients and text editors also use IR by letting the user find content through keyword queries. In summary, the goal of IR “…is to connect the right information with the right users at the right time…” (Aggarwal and Zhai, 2012: 2).
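The index-search-retrieve cycle described above can be illustrated with a minimal inverted index. This is only a sketch under simplifying assumptions: the three documents, the function names and the rule that all query keywords must match are invented for illustration and are not taken from the cited sources.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of all documents that contain every query keyword."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical mini corpus, invented for illustration only.
documents = {
    "d1": "Credit card fraud detected on customer account",
    "d2": "Quarterly report on customer satisfaction",
    "d3": "Suspected fraud in healthcare billing",
}

index = build_index(documents)
print(sorted(search(index, "fraud customer")))
```

The index is built once and then answers each keyword query by intersecting small sets, which is why the documents themselves do not have to be scanned again at query time.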

Information Extraction

Information Extraction (IE) is one of the more mature fields in text mining; its aim is to construct structured data from unstructured text (Miner et al., 2012: 37). With this technique, meaningful information can be extracted from large amounts of text (Talib et al., 2016: 415). This is not easy, however: it takes considerable effort and requires special algorithms and software (Miner et al., 2012: 37). “IE systems are used to extract specific attributes and entities from the document and establish their relationship. The extracted corpus is stored in the database for further processing.” (Talib et al., 2016: 415).
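A very reduced form of this idea is pattern-based extraction, in which selected attributes are pulled out of free text and stored as one structured record. The sketch below assumes three hand-written regular expressions and an invented sentence; real IE systems rely on far more robust methods, so the patterns are illustrative assumptions only.

```python
import re

# Simplified patterns, assumed for this example; they do not cover all valid forms.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
AMOUNT = re.compile(r"(?:EUR|USD)\s?\d+(?:[.,]\d{2})?")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_record(text):
    """Turn a free-text sentence into a structured record (one dict per text)."""
    return {
        "emails": EMAIL.findall(text),
        "amounts": AMOUNT.findall(text),
        "dates": DATE.findall(text),
    }


# Hypothetical input text, invented for illustration only.
text = ("On 2017-03-14 a transfer of EUR 2500.00 was requested by "
        "j.doe@example.com and approved on 2017-03-15.")

record = extract_record(text)
print(record)
```

Each record has the same keys, so a collection of such records could be written to a database table for further processing.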

Document Clustering

According to Miner (2012: 959), clustering or cluster analysis is the oldest technology of text mining and was used by the military for document retrieval systems during World War II. Today, document clustering uses DM algorithms to group similar documents into clusters (ibid.: 36). The goal of clustering is to sort text documents into groups by applying different clustering algorithms (Talib et al., 2016: 416). Clustering is a method of unsupervised learning: no labelled training data is required, unlike in supervised learning. Unsupervised learning is not as powerful as supervised learning, but it is more versatile. In Text Mining, clustering algorithms are used to find similar documents or similar words. If documents are analysed using clustering, the process is called Document Clustering. If words are the subject of the process, it is called Concept Extraction or topic modelling (Miner et al., 2012: 960). The two processes can be closely interrelated: after documents have been clustered, the clusters are often labelled with their most common words. Conversely, word clusters can be used to categorize documents so that they can be sorted by specific concepts (ibid.).
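The interplay of grouping documents and then labelling each group with its most common words can be sketched as follows. Note that the example uses a simple greedy "leader" clustering on Jaccard word overlap, which is a swapped-in simplification chosen for brevity and not one of the algorithms discussed in the cited literature. The documents, the stop-word list and the threshold of 0.2 are assumptions.

```python
import re
from collections import Counter

# Small stop-word list, assumed for this example.
STOPWORDS = {"the", "a", "of", "on", "in", "for", "and", "to", "as", "is"}


def tokens(text):
    """Lower-case word tokens without stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def jaccard(a, b):
    """Share of distinct words that two token lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(docs, threshold=0.2):
    """Put each document into the first cluster whose leader (first member)
    is similar enough; otherwise open a new cluster. No labels are needed."""
    clusters = []
    for i, doc in enumerate(docs):
        for members in clusters:
            if jaccard(tokens(doc), tokens(docs[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def top_words(docs, members, n=3):
    """Label a cluster with its n most frequent words."""
    counts = Counter(t for i in members for t in tokens(docs[i]))
    return [word for word, _ in counts.most_common(n)]


# Hypothetical documents, invented for illustration only.
docs = [
    "Stolen credit card used for online purchase",
    "Hospital billing for treatment never provided",
    "Credit card purchase flagged as suspicious",
    "Hospital billing for duplicate treatment",
]

for members in cluster(docs):
    print(members, top_words(docs, members))
```

The result depends on the order of the documents and on the threshold, which is one reason why established algorithms are preferred in practice.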

Document Classification

In classification, the goal is not to obtain information but to assign free-text documents to a category (Gaikwad, Y Patil and Patil, 2014: 44). In a categorization process, a free text can be assigned to one or more categories (ibid.). The goal is to train classifiers on known examples and then categorize unknown examples automatically (ibid.). Well-known DM classification techniques include Logistic Regression (LR), Decision Tree (DT) and Support Vector Machine (SVM) (ibid.). Assignment to a category basically proceeds as follows: first, characteristics (features) of the documents are selected that adequately describe the context under consideration. Then the documents are examined for these characteristics and classified into categories. A distinction is made between binary and multi-class classification. An example of binary classification is labelling credit card transactions as fraud or non-fraud. Later in this thesis, in chapter 5, a structured dataset of credit card transactions is used and the transactions are divided into these two categories, fraud and non-fraud.
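The two steps described above, selecting features and then training a classifier on known examples, can be sketched with a small Logistic Regression written in plain Python. The four training texts, their labels, the word-presence features and the learning rate are assumptions made for illustration; this is not the model or the data used in chapter 5.

```python
import math


def featurize(text, vocab):
    """Binary features: 1.0 if the vocabulary word occurs in the text, else 0.0."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lr=0.5, epochs=200):
    """Fit weights and bias with batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(text, vocab, w, b):
    """Estimated probability that the text belongs to the fraud class."""
    x = featurize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical labelled examples (1 = fraud, 0 = non-fraud), invented for illustration.
texts = [
    "urgent verify account password",
    "verify password immediately click",
    "monthly statement attached invoice",
    "statement invoice meeting schedule",
]
labels = [1, 1, 0, 0]

vocab = sorted({w for t in texts for w in t.lower().split()})
X = [featurize(t, vocab) for t in texts]
w, b = train(X, labels)

for unseen in ["please verify your password", "invoice for the monthly meeting"]:
    p = predict_proba(unseen, vocab, w, b)
    print(unseen, "->", "fraud" if p > 0.5 else "non-fraud")
```

Words that never occurred in the training texts are simply ignored at prediction time, which shows how strongly the result depends on the selected features.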

Natural Language Processing (NLP)

In contrast to the Text Mining techniques described so far, NLP pursues not a statistical but a linguistic approach in order to capture the meaning of the text under investigation (Miner et al., 2012: 32). A simple definition of NLP is given by Kao and Poteet (2007: 1):

“Natural language processing (NLP), is the attempt to extract a fuller meaning representation from free text. This can be put roughly as figuring out who did what to whom, when, where, how and why.”

To cope with this task, more complex algorithms (such as neural networks) must be used to achieve acceptable results. In the field of Text Mining, NLP is viewed as a powerful tool for this problem (Miner et al., 2012: 37). NLP rests on the basic requirement that language in any form, spoken or written, must first be recognized. “Natural Languages (NL) have lot of complexities as a text extracted from different sources don’t have identical words or abbreviation” (Talib et al., 2016: 416). It is important to identify not only a word but also its connection with other words, complete sentences or facts. For the automatic processing and analysis of unstructured information, NLP draws on various tools, such as Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging (Kao and Poteet, 2007: 1; Talib et al., 2016: 416). In NER, atomic text elements are located and classified into predefined categories, such as names of persons, places or firms (Pfeifer, 2014: 22). POS Tagging assigns the words and punctuation marks of a text to word classes (ibid.: 25).
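What the output of NER looks like can be shown with a deliberately naive lookup against a list of known names (a gazetteer). This is a toy sketch and an assumption of this example: practical NER systems use trained models and context rather than a fixed list, and the names, categories and sentence below are invented.

```python
import re

# Hypothetical gazetteer: a tiny lookup table invented for this example.
GAZETTEER = {
    "PERSON": ["John Smith", "Maria Keller"],
    "LOCATION": ["Heilbronn", "New York"],
    "ORGANIZATION": ["Example Bank"],
}


def find_entities(text):
    """Locate known names in the text and return
    (start, end, surface form, category) tuples sorted by position."""
    found = []
    for category, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((match.start(), match.end(), name, category))
    return sorted(found)


sentence = "John Smith transferred money from Example Bank in Heilbronn to New York."

for start, end, name, category in find_entities(sentence):
    print(f"{name!r} at {start}-{end}: {category}")
```

A lookup of this kind cannot recognise names that are missing from the list or resolve ambiguous words, which is where the more complex algorithms mentioned above come in.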

Web Mining

The last practice area of Text Mining named by Miner (2012: 32) is Web Mining (WM). “WM is defined as automatic crawling and extraction of relevant information from the artefacts, activities, and hidden patterns found in WWW” (Rashid Al-Azmi, 2013: 2). Although Miner lists WM together with the text mining techniques mentioned above, it is a practice area of its own because of the unique structure and enormous volume of data on the Internet (Miner et al., 2012: 37). With the spread of the Internet, WM has come to play an important role in everyday life. Many companies use this technique to monitor the online behaviour of users (Rashid Al-Azmi, 2013: 2). Compared to search engines, Web Mining agents are more intelligent, since they can, for example, forward a user to a competing website or recommend one (ibid.). WM technology is mainly used to search for hyperlinks, cookies and patterns (ibid.). With the knowledge gathered, companies can improve customer relationships and win potential buyers with exclusive offers.
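The search for hyperlinks, one of the tasks named above, can be sketched with the HTML parser from the Python standard library. The page content, the base URL and the class name `LinkExtractor` are assumptions made for illustration; a real crawler would additionally download pages and follow the extracted links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the targets of all <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page address.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical page, invented for illustration only.
page = """<html><body>
<a href="/offers">Offers</a>
<a href="https://example.org/about">About</a>
<a name="top">Anchor without link target</a>
</body></html>"""

extractor = LinkExtractor("https://example.com/")
extractor.feed(page)
print(extractor.links)
```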

4.3.2 Example of feature extraction from unstructured data

As already mentioned in the previous chapters, around 80 % of data is unstructured. The aim of this chapter is to show how such data can be brought into a structured form so that machine learning algorithms can be applied to it.

For this purpose, two self-generated emails are used in this thesis (see figure 7 and 8). The first email is a spam email and the second one is legitimate. With the first email, the fraudster attempts a phishing attack on the victim. The second email informs the online account holder about matters concerning their own security.
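Since the two emails themselves are not visible in this excerpt, the following sketch uses two invented stand-in texts to show one common way of bringing free text into a structured form: a bag-of-words matrix with one row per email and one column per word. The email texts and the choice of simple word counts as features are assumptions and do not reproduce the procedure of the thesis.

```python
import re
from collections import Counter

# Hypothetical stand-ins for a phishing email and a legitimate email.
emails = [
    "Dear customer, verify your account now or your account will be locked.",
    "Dear customer, your account statement for March is now available.",
]


def tokenize(text):
    """Lower-case the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


# The vocabulary defines the columns of the structured table.
vocabulary = sorted({token for email in emails for token in tokenize(email)})


def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


# One row per email, one column per vocabulary word.
matrix = [to_vector(email) for email in emails]

print(vocabulary)
for row in matrix:
    print(row)
```

Every email is now a numeric vector of the same length, which is the kind of input that the classification techniques named in the previous subchapter expect.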





Fraud Detection in White-Collar Crime
Heilbronn University
Rohan Ahmed (Author), 2017, Fraud Detection in White-Collar Crime, Munich, GRIN Verlag,

