Process mining is the binding link between computational intelligence, data mining, and process modeling and analysis. This thesis shows how this research discipline can be applied to network protocols and what the rewards are. Process mining is based on event data, which is logged by almost every information system. This event data is extracted, transformed, and loaded into a process mining tool in order to discover the underlying process, check its conformance, or enhance it based on observed behavior. The main achievements are determining the significance of process mining in the field of network protocols and their control flow, identifying the best-suited algorithms and notation systems, clarifying the prerequisites, and providing a proof of concept. In addition, other reasonable and beneficial applications are investigated, such as mining an alternative protocol, dealing with large amounts of event data, and estimating the amount of event data required.
Table of Contents
1. Introduction
1.1. Process Mining
1.2. Business processes and network protocols
1.3. Vision
1.4. Idea, leading questions and strategy
1.5. Outcome
1.6. Structure of thesis
2. Process Mining and related topics
2.1. The BPM life-cycle
2.2. Process modeling notations
2.3. Positioning process mining
2.4. Process models, analysis and limitations
2.4.1. Model-based process analysis
2.4.2. Limitations
2.5. Perspectives of process mining
2.6. Types of process mining
2.6.1. Play-in
2.6.2. Play-out
2.6.3. Replay
2.7. Discussion
2.7.1. Discovery
2.7.2. Conformance
2.7.3. Enhancement
2.8. Findings
3. Properties and quality
3.1. Event data
3.1.1. Quality criteria and checks
3.1.2. Extensible event stream
3.2. Notation frameworks
3.3. Evaluation of algorithms
3.3.1. Problem statement
3.3.2. What “Disco” does
3.3.3. Challenges for algorithms and notation systems
3.3.4. Categorization of process mining algorithms
3.3.5. Algorithms and plug-ins for control-flow discovery
3.3.6. Fuzzy Miner
3.4. Process models
3.5. Findings - The weapons of choice
4. Prerequisites and pre-processing
4.1. Data extraction
4.2. Data transformation
4.3. Load data
4.4. Automating the ETL procedure for TCP
4.5. Findings
5. Proof of Concept
5.1. Mining TCP with Disco
5.1.1. Extracting relevant information
5.1.2. Results
5.2. Discussion
5.2.1. Recorded activities
5.2.2. Sequences
5.2.3. Limitations
5.3. Mining TCP with RapidMiner
5.3.1. Adjustments in the results perspective
5.4. Findings
6. Reasonable applications, adaptations and enhancements
6.1. Mining HTTP
6.1.1. Results
6.1.2. Discussion
6.2. Moving towards bigger captures
6.2.1. SplitCap
6.2.2. Adaptations to the ETL script
6.3. Protocol reverse engineering
6.3.1. Gathering data
6.3.2. Results
6.3.3. Discussion
6.4. Findings
7. Conclusion
A. Data
A.1. Example PNML file
A.2. Self-captured
A.2.1. tcpCapture.pcap
A.2.2. httpCapture.pcap
A.3. External
A.4. RapidMiner structure
B. Tools and software
B.1. Disco
B.2. Perl
B.3. ProM
B.4. R
B.5. RapidMiner and RapidProM
B.6. RStudio
B.7. Ruby
B.8. SplitCap
B.9. tshark
B.10. Wireshark
B.11. WoPeD
C. Source code
C.1. Script tcp_pcap2xes.rb
C.2. Script http_pcap2xes.rb
C.3. Script tcp_splitPcaps2xes.rb
D. Glossary
Research Objectives and Focus Areas
This thesis explores the application of process mining techniques to network protocols to gain fact-based insights into communication behavior, performance, and conformance. The primary research goal is to bridge the operational gap between network traffic captures and process mining tools, enabling the automatic discovery and analysis of protocol control flows.
- Application of process mining to network protocols (TCP, HTTP).
- Development of automated ETL (Extract, Transform, Load) procedures for network traffic.
- Evaluation of process mining algorithms and tools for protocol analysis.
- Statistical methods to estimate sufficient event data for robust process models.
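The automated ETL procedures mentioned above convert packet captures into the XES event-log format that process mining tools consume. The actual scripts are listed in Appendix C; the following is a minimal, hypothetical Ruby sketch of the transform/load step, assuming events have already been extracted per TCP stream (stream number as case id, TCP flag combination as activity — all names here are illustrative):

```ruby
# Minimal sketch: serialize extracted packet events as an XES event log.
# Each TCP stream becomes a trace; each packet's flag combination becomes
# an event. Illustrative only, not the thesis's actual script.
require "time"

def to_xes(events)
  # Group events into traces by their case identifier (the TCP stream).
  traces = events.group_by { |e| e[:stream] }
  body = traces.map do |stream, evts|
    entries = evts.map do |e|
      <<~EVENT
        <event>
          <string key="concept:name" value="#{e[:activity]}"/>
          <date key="time:timestamp" value="#{e[:time].iso8601}"/>
        </event>
      EVENT
    end.join
    %(<trace><string key="concept:name" value="#{stream}"/>\n#{entries}</trace>)
  end.join("\n")
  %(<?xml version="1.0" encoding="UTF-8"?>\n<log xes.version="1.0">\n#{body}\n</log>)
end

events = [
  { stream: 0, activity: "SYN",     time: Time.utc(2015, 1, 1, 12, 0, 0) },
  { stream: 0, activity: "SYN,ACK", time: Time.utc(2015, 1, 1, 12, 0, 1) },
  { stream: 0, activity: "ACK",     time: Time.utc(2015, 1, 1, 12, 0, 2) },
]
puts to_xes(events)
```

The `concept:name` and `time:timestamp` keys are the standard XES attributes for activity name and timestamp, which is what tools such as Disco and ProM expect on import.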
Excerpt from the Book
1.1. Process Mining
Van der Aalst et al. define process mining in their manifesto as follows:
“Process mining is a relatively young research discipline that sits between computational intelligence and data mining on the one hand, and process modeling and analysis on the other hand. The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today’s (information) systems.”[79, p. 1]
Whether an organization already practices BPM or not, process mining is the technology for discovering or enhancing processes, or for checking their conformance, based on event data. Several tools and algorithms support extracting and visualizing processes from event logs. Process mining can be used in a wide variety of application domains; its techniques are based on event data recorded by information systems.
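Discovery techniques generally start from the ordering of events within each case. Counting the directly-follows relation over an event log is the common first step from which control-flow models are derived. A small illustrative Ruby sketch (not the thesis's implementation):

```ruby
# Sketch: count the directly-follows relation over an event log, i.e.
# how often activity a is immediately followed by activity b within a
# trace. This relation is the raw material most discovery algorithms
# build on. Illustrative only.
def directly_follows(log)
  pairs = Hash.new(0)
  log.each do |trace|
    trace.each_cons(2) { |a, b| pairs[[a, b]] += 1 }
  end
  pairs
end

log = [
  %w[SYN SYN,ACK ACK FIN],
  %w[SYN SYN,ACK ACK RST],
]
p directly_follows(log)
# ["SYN", "SYN,ACK"] => 2 means SYN was directly followed by SYN,ACK in both traces
```

From such counts a discovery algorithm decides which orderings are significant enough to appear in the resulting process model.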
Summary of Chapters
1. Introduction: Presents the motivation, vision, and core research questions regarding the application of process mining to network protocols.
2. Process Mining and related topics: Explains fundamental concepts, perspectives, and types of process mining within the BPM lifecycle.
3. Properties and quality: Details the requirements for event data, discusses various mining algorithms, and evaluates their suitability for protocol discovery.
4. Prerequisites and pre-processing: Describes the technical ETL pipeline used to extract and transform network traffic data into a process-mineable format.
5. Proof of Concept: Demonstrates the practical application of process mining on TCP traffic using tools like Disco and RapidMiner.
6. Reasonable applications, adaptations and enhancements: Explores mining HTTP, handling large network captures, and introducing a metric for data adequacy.
7. Conclusion: Summarizes the findings and discusses the overall effectiveness of using process mining for network protocol analysis.
Keywords
Process Mining, Network Protocols, TCP, HTTP, ETL, Event Logs, Fuzzy Miner, Data Extraction, Process Discovery, Conformance Checking, Performance Analysis, Network Traffic, XES, RapidMiner, Disco
Frequently Asked Questions
What is the core focus of this thesis?
The work investigates the applicability of process mining techniques to the domain of network protocols, specifically aiming to discover, monitor, and improve protocol-based communication through event data analysis.
What are the central thematic fields?
The thesis covers the intersection of information security, process modeling, network traffic analysis, and the development of specialized ETL (Extract, Transform, Load) procedures for converting packet captures into process logs.
What is the primary research goal?
The primary goal is to determine if and how process mining perspectives and types can be successfully applied to network protocols to gain fact-based insights without relying on manual, error-prone analysis.
Which scientific method is utilized?
The author performs systematic literature research followed by an empirical proof of concept, where automated scripts are developed to process real network traffic data using mining tools like Disco and RapidMiner.
What topics are discussed in the main part?
The main part covers the theoretical foundations of process mining, requirements for log quality, an evaluation of various discovery algorithms, and the development of an automated pipeline to handle network protocols like TCP and HTTP.
What are the defining keywords for this work?
The most important keywords include Process Mining, Network Protocols, TCP, HTTP, ETL, Event Logs, Fuzzy Miner, and Process Discovery.
How does the author handle large network captures?
The thesis proposes using specialized tools like SplitCap to decompose large packet captures (e.g., 17GB) into individual streams in a single pass, significantly reducing processing time from days to hours.
Why is the "Fuzzy Miner" highlighted?
The Fuzzy Miner is identified as the algorithm of choice because it is implemented in most standard tools, handles noise effectively, and offers adjustable parameters that allow for granular control over process simplification.
What metric is introduced to measure data sufficiency?
The author introduces a metric based on the 'Information Value' (sum of activities and transitions) to calculate the 'Average Information Gain' (aig), which helps predict when enough event data has been collected to mine a stable process model.
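Following the description above (the exact definition is given in the thesis), the idea can be sketched as follows: take the information value of a log to be the number of distinct activities plus distinct transitions, compute it for successively larger logs, and average the gains between increments. When the average information gain approaches zero, additional event data no longer reveals new behavior. All names in this Ruby sketch are illustrative:

```ruby
# Sketch of the data-sufficiency metric described above. Information
# value = distinct activities + distinct transitions; the average
# information gain (aig) over successive log increments indicates when
# collecting more event data stops changing the model. Illustrative
# only; the thesis gives the exact definition.
def information_value(log)
  activities  = log.flatten.uniq.size
  transitions = log.flat_map { |t| t.each_cons(2).to_a }.uniq.size
  activities + transitions
end

def average_information_gain(values)
  gains = values.each_cons(2).map { |a, b| b - a }
  gains.sum.to_f / gains.size
end

# A growing log: the third increment adds no new behavior.
increments = [
  [%w[SYN SYN,ACK ACK]],
  [%w[SYN SYN,ACK ACK], %w[SYN RST]],
  [%w[SYN SYN,ACK ACK], %w[SYN RST], %w[SYN SYN,ACK ACK]],
]
values = increments.map { |log| information_value(log) }
puts values.inspect
puts average_information_gain(values)
```

Here the information values plateau at 7 once every activity and transition has been observed, so the average gain trends toward zero as capturing continues.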
Citation: Matthias Leeb (Author), 2015, Process Mining and Network Protocols, Munich, GRIN Verlag, https://www.grin.com/document/308134