Using Subsequence Mining to Identify Business Processes in Data Networks

Master's Thesis, 2016

63 Pages



Um Geschäftsprozesse zu verwalten, müssen Firmen diese zuvor definiert, gestaltet und eingeführt haben. Analysten versuchen daher, Business Prozesse in Firmen zu identifizieren. Konzerne haben jedoch komplexe Strukturen und viele Geschäftsbereiche. Klassische Modellierung von Geschäftsprozessen skaliert folglich in Konzernen in vielen Fällen nicht. Firmen und Analysten sind demzufolge an automatisierten Methoden interessiert, da diese Zeit und Geld einsparen. In den letzten Jahren wird vermehrt auf Process-Mining gesetzt, um Geschäftsprozesse zu identifizieren. Bei dieser Methode werden Ereignisprotokolle von Anwendungen analysiert. Allerdings hat diese Methode ihre Limitierung durch die Abhängigkeit der eingesetzten Applikationen und deren Ereignisprotokoll. Durch den Entwurf unseres auf polynomiellen Algorithmen basierenden Verfahrens MONA wird gezeigt, dass eine automatisierte Identifikation von Geschäftsprozessen möglich ist. Um abhängige Aufgaben, die zu Business Prozessen gruppiert werden, zu identifizieren, werden Muster in der netzwerkbasierten Kommunikation untersucht. Wir gehen davon aus, dass Geschäftsprozesse aus unterschiedlichen Aufgaben bestehen, die durch netzwerkgestützte Applikationen umgesetzt werden. Die Ausgabe der Implementierung unseres Modells basiert auf der Geschäftsprozess-Modellierungssprache (BPMN).


To manage business processes, companies must previously define, configure, implement and enact them. Analysts try to identify companies’ business processes. However, large companies might have complex business processs (BPs) and consist of many business units. Therefore, classical business process modelling hardly scales. Both, companies and analysts are interested in automated approaches for business process modelling, saving time and money. Today’s business process analysts often use process mining techniques to extract company’s business processes by analyzing event logs of applications. This technique has its limitations, and is strongly dependent on the kind of log files of deployed applications. By designing our mission oriented network analysis (MONA) approach using algorithms having polynomial complexity, we show that identification of business processes is tractable. Identification of related tasks which constitute business processes is based on analysis of communication patterns in network traffic. We assume that today’s business processes are based on network-aided applications. Our software presents identified business processes using business process modelling notation.


1. Introduction . . 1

2. Motivation . . 3

2.1. Background . . 3

2.2. RelatedWork . .. 4

2.3. Challenges . . 6

2.4. Requirements . . 7

3. Network Service Dependency Discovery . . 9

3.1. Technical Description . . 9

3.1.1. NetworkModel . . 9

3.1.2. Network Dependency Analysis . . 11

3.1.3. Business Processes . . 17

3.2. Implementation . . 19

3.2.1. Flow Extraction . . 19

3.2.2. Subnetwork Identifier . . 22

3.2.3. Potential Indirect Dependency Generator . . 24

3.2.4. Indirect Dependency Calculator . . 25

3.2.5. Stream-basedMONA . . 25

3.2.6. Representation of Tasks . . 28

4. Evaluation . . 31

4.1. Ground Truth . . 31

4.1.1. Threshold Estimation . . 32

4.1.2. Validation of Threshold . . 33

4.2. Comparative Evaluation . .. 34

4.2.1. Orion . . 35

4.2.2. Sherlock . . 35

4.2.3. NSDMiner . . 35

4.2.4. Sensitivity Analysis . . 36

4.2.5. MONA versus Orion . . 37

5. Conclusion . . 41

5.1. Summary . . 41

5.2. FutureWork . . 41

A. Appendix . . 47

Bibliography . . i

1. Introduction

Business process management is a discipline that combines knowledge from IT and management sciences and applies the knowledge to business processes [1; 2]. A BP is a collection of related tasks which are required to produce a specific service [3; 4]. Companies are interested in business process management (BPM) due to its potential for saving costs and increase productivity. The information technology research and advisory company Gartner estimated worldwide expenses for BPM software to 2.7 billion US dollar in year 2015 [5]. To better manage BPs, companies must previously define, configure, implement and enact them. Analysts try to identify companies’ BPs. They analyze workflows, business unites, and talk with decision makers. Large companies and organizations have often complex structures and comprise business units. Therefore, classical BP-modelling hardly scales; both companies and BP analysts are interested in automated approaches for saving time and money. Today’s BP analysts often use process mining techniques to extract BPs by analyzing low-level information; e.g., event logs of applications [6; 7]. Unfortunately, this technique has its limitations. The technique is dependent on kind of log files of deployed applications. However, some applications do not have any log files at all. Furthermore, the logs might contain many technical information and have to be analyzed by BP analyst without deeper knowledge in information technology (IT). By designing our MONA approach using algorithms having polynomial complexity, we show that identification of business processes is tractable. In our approach, identification of related tasks is based on communication pattern analysis of network traffic in companies’ networks. BPs of large companies and organizations are based on network-aided applications. Hence, our approach to identify BP is based on low-level information. Instead of analyzing log files, our approach uses header information transferred in data packets. In business processes, people use services of connected servers to receive or transmit information. If a person is sending a request to server S, S might not have the information locally available. Therefore, S sends a request to some other servers Ti to receive the information. After S receives the missing information, it can send a reply to the initial request. Only if S and Ti are available, the person can obtain its requested information. Therefore, a personwho sends a request toS is indirectly dependent on Ti and directly dependent on S. Our approach to identify BPs in networks is based on indirect dependencies between network services. As already suspected, the evaluation of our approach shows a low rate of false negative results and a robust threshold is used to decide if an indirect dependency exists. The rate of false positive results is only weakly connected to the networks. We present an application of estimating indirect dependencies between network services connected to BPs. The output of our application is a graphical representation of BPs in business process modeling notation (BPMN). Therefore, BP analysts can exploit our results for performing further analyses of business processes.

The structure of this work is as follows. We start with a motivation in Chapter 2. This chapter covers the importance of business process management and discusses related work to network service dependency discovery. Chapter 3 contains our network model and a technical description of network dependency analysis. We move to the implementation of our model in the second half of Chapter 3. Chapter 4 presents the evaluation of our model in different networks. Chapter 5 presents our conclusion and proposes fruitful topics for future work.

2. Motivation

To motivate our approach, we provide background on other BP identification methodologies and delineate their limitations.

2.1. Background

BPM is a discipline that combines knowledge from IT and management sciences and applies the knowledge to BPs [1; 2]. Companies are interested in BPM due to its potential in saving costs and increase productivity. Gartner forecasts worldwide expenses for BPM software to 2.7 billion USD in year 2015[5]. To be able to manage BPs, companies have to identify their BPs. BP analysts often use process mining techniques to extract company’s BPs by analyzing event logs [6]. Process mining is divided into three topics: discovery, conformance, enhancement [8].

Discovery. This technique extracts a BP model to generate a first model based on lowlevel events. Low-level events are often specified in event logs.

Conformance. This technique compares information of an a-priori model to new information extracted from low-level information and analyzes deviations between low-level events and the a-priori model.

Extension. An a-priori model is extended with new information extracted from event

logs. Process mining is a useful tool to analyze BPs of companies. Humans need a lot of time to analyze all details of event logs. Therefore, algorithms such as the α-algorithm are introduced with the goal to automate BP discovery [9]. However, with an increasing number of applications and complex structures in companies and organizations, process mining reaches its limit. There exist millions of applications which could be used by companies and organizations to perform their tasks. To extract BP models from low-level events, it is required that every application contains a well-defined event log. These event logs need specialized formats, so that algorithms can extract BP relevant information.

Companies and organizations are built with a higher-level purpose. Companies apply BPs to fulfill their purpose. Each business process could contain a different number of related tasks. Nowadays, most tasks are network-aided. So, why not analyze network traffic to extract BP models of companies instead of analyzing event logs of applications?

Both, network service dependency discovery and BP modelling are attractive topics in research. We introduce related work to to both topics in Section 2.2.

2.2. Related Work

Network traffic contains communication between network services hosted by network devices. These activities follow the higher-level purpose of companies and are involved in BPs. What are business processes? There exist different definitions for BPs. Most definitions agree with our following statement: A BP is a collection of related tasks which are required to produce a specific service or task [3; 4].

Business Process Modelling

BP modelling started in the early twentieth century using flow charts, control flow diagrams, Gantt charts et cetera. Flow charts are used to represent simple tasks. Control flow diagrams are optimized to represent complex tasks. They contain if-then-else conditions, repetition, and case conditions. Since the end of twenty’s century, BP modelling became a new productivity paradigm [10]. Companies and organizations are processes driven instead of function driven. The SAP reference model contains a list of over 600 BP models [3]. Process mining is dependent on the assumption that information of event logs is sufficient to generate BPs [11]. Therefore, there are BP discovery techniques which try to perform a more complex discovery to get more precise models [12; 13]. Three of them are listed below:

- [Symbol is omitted from this preview] - algorithm [2]: This algorithm searches for ordering-relations in event logs and creates a new Petri-net workflow. Each analyzed event log needs a process instance identifier.

- Inductive workflow acquisition [14]: The goal of this approach is to find a hidden markov model (HMM) representing the structure of a process. To get initial HMM, the event logs have to be analyzed.

- Instance graphs [15]: This approach is used to analyze an event log and represent each execution trace. In the end, several graphs are aggregated to get a complete model from an event log.

The analyzed logs are built from more low-level activities between network services in networks. Analyzing header information of data packets and extract dependencies between network services might be more efficient and complete compared to event log analysis. Now, we present different approaches to identify dependencies between network services in large networks.

Network Service Dependency Discovery

Enterprise networks consist of hundreds or even thousands of network services. Some network services are used in many other network services. We define them as supporting network services. A failure in a supporting network service leads to a failure in many other network services. Hence, IT administrators and IT managers are interested in knowledge about dependencies between network services. Recent efforts have lead to different approaches for analyzing network traffic to identify dependencies between network services.

[Table is omitted from this preview]

Table 2.1.: Classification of existing network dependency approaches

Often, different network services are required to perform a single task. If at least one service has failure, the whole task is vulnerable. Sherlock [16] introduced an inference graph to represent dependencies between network services. The dependencies are calculated using co-occurrences between services. Co-occurrence is in place when two services are used in a defined window. Orion [17] is another approach to identify dependencies between network services using spike detection analysis in delay distributions of flow pairs. Two other network-based dependency analysis approaches are NSDMiner [18] and Rippler [19]. All mentioned approaches are based on network traffic. The traffic is analyzed on a centralized server. Hence, the approaches do not need additional software on hosts in the network. Table 2.1 lists different approaches and classifies between active and passive, host-based and network-based approaches. Active approaches require changes in applications or systems, e.g. X-Trace requires modification of applications to tag special request. However, passive approaches do not need any changes of applications or systems. They are analyzing communication between network services. Sherlock and Orion are designed to be host-based approaches, but it is possible to implement them in a network-based manner.

2.3. Challenges

To put our research contributions in context of BP discovery using network traffic, we focus on reviewing of the following challenges: automatic dependency discovery, usability of information for BP analysts, and quality of ground truth.

Automatic dependency discovery. Communication activities in networks of companies and organizations follow a higher purpose. There are communication activities which are more frequent than others. Some communication activities have hidden patterns, which are difficult to identify and to interpret without any knowledge about the network services. There could be many reasons for hidden patterns: caching of information, changes in network services caused by software updates, or modification in configuration of switches and routers.

Reusability of information. Let us assume that we are able to solve the challenge to automatically discover dependencies. Then, another difficult challenge is in place, namely bridging from identified dependencies to business processes usable by business process analysts. Analysts are not familiar with models and notations used in computer science. Identified tasks and BPs needs to be represented by notations, easy to extend for analysts.

Quality of ground truth. Companies and organizations differ, e.g., in the number of employees and in network services. Getting ground truth information about used network services and dependencies between network services is a non trivial task. Whereas getting information about dependencies between well-known supporting network services is easy, these dependencies are rare and do not represent the complete set of dependencies between network services in large networks. To complete the set of dependencies between network services complete, specialists of different domains are necessary, e.g. IT managers, network architects, network administrators, database administrators, software developers et cetera. All approaches for dependency discovery known today use human knowledge in order to validate their own results. Researches implement their approach in their departments’ network. Administrators of that networks decide about correct identified indirect dependencies. Therefore, researchers of previously mentioned approaches cannot be validated for different networks. During our work, we aim at studying quality and limitations of other techniques for discovering dependencies by analyzing patterns in network traffic.

Previously introduced challenges lead to a set of requirements our approach to identify business processes in networks, needs to meet.

2.4. Requirements

Identifying BPs of companies by analyzing network traffic is not trivial. Before we introduce our approach to identify direct and indirect dependencies between network services, we specify requirements derived from previously presented challenges.

Scalability. The model has to scale with the number of network services, network devices as well as with data throughput. Furthermore, the number of dependencies should not influence our model’s precision.


[1] Mathias Weske. Business process management: concepts, languages, architectures. Springer Science Business Media, 2012. 1, 2.1

[2] Wil MP Van Der Aalst. Business process management demystified: A tutorial on models, systems and standards for workflow management. In Lectures on concurrency and Petri nets, pages 1–65. Springer, 2004. 1, 2.1, 2.2

[3] Remco Dijkman, Marlon Dumas, and Luciano García-Bañuelos. Graph matching algorithms for business process model similarity search. In Business Process Management, pages 48–63. Springer, 2009. 1, 2.2, 2.2

[4] Anna A Kalenkova, Wil MP van der Aalst, Irina A Lomazova, and Vladimir A Rubin. Process mining using bpmn: relating event logs and process models. SoftwareSystems Modeling, pages 1–30, 2015. 1, 2.2

[5] Inc. Gartner. Gartner says spending on business process management suites to reach 2.7 billion in 2015 as organizations digitalize processes, Mai 2015. 1, 2.1

[6] Wil Van Der Aalst. Process mining: discovery, conformance and enhancement of business processes . Springer Science Business Media, 2011. 1, 2.1

[7] Wil MP van der Aalst, Hajo A Reijers, Anton JMM Weijters, Boudewijn F van Dongen, AK Alves De Medeiros, Minseok Song, and HMW Verbeek. Business process mining: An industrial application. Information Systems, 32(5):713–732, 2007. 1

[8] Wil Van Der Aalst, Arya Adriansyah, Ana Karla Alves de Medeiros, Franco Arcieri, Thomas Baier, Tobias Blickle, Jagadeesh Chandra Bose, Peter van den Brand, Ronald Brandtjen, Joos Buijs, et al. Process mining manifesto. In Business process management workshops, pages 169–194. Springer, 2011. 2.1

[9] Wil Van Der Aalst. Process mining: overview and opportunities. ACM Transactions on Management Information Systems (TMIS), 3(2):7, 2012. 2.1

[10] Asbjorn Rolstadas. Performance management: A business process benchmarking approach. Springer Science Business Media, 2012. 2.2

[11] Wil Van der Aalst, Ton Weijters, and Laura Maruster. Workflow mining: Discovering process models from event logs. Knowledge and Data Engineering, IEEE Transactions on, 16(9):1128–1142, 2004. 2.2

[12] Xiang Gao. Towards the next generation intelligent bpm–in the era of big data. In Business Process Management, pages 4–9. Springer, 2013. 2.2 i Bibliography

[13] Diogo Ferreira, Marielba Zacarias, Miguel Malheiros, and Pedro Ferreira. Approaching process mining with sequence clustering: Experiments and findings. In Business Process Management, pages 360–374. Springer, 2007. 2.2

[14] Joachim Herbst and Dimitris Karagiannis. Integrating machine learning and workflow management to support acquisition and adaptation of workflow models. In Database and Expert Systems Applications, 1998. Proceedings. Ninth International Workshop on , pages 745–752. IEEE, 1998. 2.2

[15] Boudewijn F Van Dongen and Wil MP Van der Aalst. Multi-phase process mining: Building instance graphs. In Conceptual Modeling–ER 2004, pages 362–376. Springer, 2004. 2.2

[16] Paramvir Bahl, Paul Barham, Richard Black, Ranveer Chandra, Moises Goldszmidt, Rebecca Isaacs, Srikanth Kandula, Lun Li, John MacCormick, David A Maltz, et al. Discovering dependencies for network management. In ACM SIGCOMM 5th Workshop on Hot Topics in Networks (Hotnets-V), pages 97–102. ACM, 2006. 2.2

[17] Xu Chen, Ming Zhang, Zhuoqing Morley Mao, and Paramvir Bahl. Automating network application dependency discovery: Experiences, limitations, and new solutions. In OSDI, volume 8, pages 117–130, 2008. 2.2, 3.2.3, 4.2, 4.2.1, 4.2.5

[18] Arun Natarajan, Peng Ning, Yao Liu, Sushil Jajodia, and Steve E Hutchinson. NSDMiner: Automated discovery of network service dependencies. IEEE, 2012. 2.2, 4.2, 4.2.3

Excerpt out of 63 pages


Using Subsequence Mining to Identify Business Processes in Data Networks
Hamburg University of Technology  (TUHH; Universität zu Lübeck)
Catalog Number
ISBN (eBook)
File size
744 KB
using, subsequence, mining, identify, business, processes, data, networks
Quote paper
Felix Kuhr (Author), 2016, Using Subsequence Mining to Identify Business Processes in Data Networks, Munich, GRIN Verlag,


  • No comments yet.
Look inside the ebook
Title: Using Subsequence Mining to Identify Business Processes in Data Networks

Upload papers

Your term paper / thesis:

- Publication as eBook and book
- High royalties for the sales
- Completely free - with ISBN
- It only takes five minutes
- Every paper finds readers

Publish now - it's free