The first goal of this thesis is to evaluate static, dynamic and hybrid approaches in order to draw conclusions about the domains named in its title. To this end, the characteristics that distinguish the three approaches are derived from the respective publications on the basis of qualitative and quantitative evaluation criteria, and the evaluation of hybrid approaches is intended to fill a knowledge gap in the comparative literature. The aim is to build a high-level understanding of the different methods and to identify differences and commonalities between them, based on research literature that presents new approaches within these domains. In particular, the strengths, weaknesses and special properties of the three domains are to be determined. The second goal is to develop a more comprehensive practical understanding of ML-based malware detection techniques, which is the purpose of the practical section. There, the ML workflow model is used to propose and implement a static malware detector step by step using the Python programming language and various ML algorithms.
Accordingly, the three primary research questions this thesis aims to address are as follows:
1. Which static, dynamic and hybrid ML-based approaches exist in both current and past research, and how do they work?
2. How do the underlying methodological domains (static, dynamic and hybrid) compare when multiple quantitative and qualitative evaluation criteria are considered?
3. How can a static malware detection model be implemented hands-on in practice, using the ML workflow process model as a guideline?

Excerpt

1 Introduction

1.1 Initial situation

1.2 Problem description

2 Scope of the thesis

3 Theory

3.1 Malware

3.1.1 Definition

3.1.2 Malware Evolution

3.1.3 Types of malware

3.2 Program architecture in Microsoft Windows (MW)

3.2.1 The Portable Executable file format

3.2.2 Relevant insights for malware analysis

3.3 Malware Detection

3.3.1 Methodologies

3.3.2 Evading detection by Obfuscation

3.4 Machine Learning (ML)

3.4.1 Definition

3.4.2 Features

3.4.3 ML-Workflow

3.4.4 ML-Paradigms

3.4.5 ML-Algorithms

3.4.6 Model accuracy and metrics

4 Literature Review: ML Approaches in Research

4.1 Review outline

4.1.1 Structure

4.1.2 Literature overview

4.1.3 Evaluation criteria

4.2 Malware Feature Taxonomy

4.3 Static ML approaches

4.3.1 Quantitative evaluation

4.3.2 Qualitative evaluation

4.4 Dynamic ML approaches

4.4.1 Quantitative evaluation

4.4.2 Qualitative evaluation

4.5 Hybrid ML approaches

4.5.1 Quantitative evaluation

4.5.2 Qualitative evaluation

4.6 Conclusive learning from literature

5 Practical Review: Implementing a Static ML-Based Malware Detector

5.1 Safety measures and disclaimer

5.2 Requirements and resources

5.2.1 Test-environment: Guest OS and host OS

5.2.2 PE file repository: VirusShare and EMBER

5.2.3 Feature extraction: Python PEpper

5.2.4 Model training and validation: WEKA

5.3 Implementation

5.3.1 Phase 1: Data gathering

5.3.2 Phase 2: Data preparation

5.3.3 Phase 3: Model training

5.3.4 Phase 4: Model validation

5.4 Conclusive learning from practical implementation

6 Conclusion and Outlook

6.1 Conclusion

6.2 Outlook

Objectives and Topics

The main objective of this thesis is to evaluate and compare static, dynamic, and hybrid machine learning approaches for detecting malware on Windows systems. The work addresses a gap in current research, where these methods are rarely compared comprehensively, and demonstrates the practical implementation of a static malware detector using common machine learning workflows.

Theoretical foundation of malware and its evolutionary evasion techniques.
Taxonomy of features used in static, dynamic, and hybrid malware detection.
Comparative analysis of machine learning methodologies based on quantitative and qualitative metrics.
Hands-on implementation of a static malware detector using Python and the WEKA environment.
Validation of detection models against unknown malware, including obfuscated and encrypted samples.

Excerpt from the Book

3.3.2 Evading detection by Obfuscation

The de-obfuscation of malware scripts, i.e. the reverse engineering and reconstruction of the actual intent of the code, is humorously described by Barker as "Putting the toothpaste back in the tube" (cf. Barker 2021, p. 293). In essence, this illustrates that malware developers have several methods at their disposal to modify their code (or even let it modify itself) in such a way that it is difficult to trace the original state and thus also to reveal the actual purpose of the code.

The following are some examples of common approaches to obfuscation:

Encryption: Malware sometimes employs encryption to hide malicious code blocks throughout its code (see figure 12). As a result, the malicious code contained in that malware may be undetectable by the host (cf. Wardle 2022, p. 285; cf. Aslan & Samet 2020, p. 6251). Therefore, methods such as measuring the "entropy" of a potentially dangerous file have been created, and will be applied in many forthcoming studies, to at least verify the existence of encrypted or compressed chunks of code inside executable files (cf. Gibert & Mateu & Planes 2020, p. 8).
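
To make the entropy idea concrete, the following is a minimal sketch in Python and not the method used in the cited studies. It computes the Shannon entropy of a byte sequence in bits per byte: values near 8 suggest encrypted or compressed content, values near 0 suggest highly repetitive content. The function names `shannon_entropy` and `windowed_entropy` and the window size of 256 bytes are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_entropy(data: bytes, window: int = 256):
    """Yield (offset, entropy) for consecutive, non-overlapping windows.

    The window size is an illustrative assumption; a high-entropy window
    only hints at encrypted or compressed content, it does not prove it.
    """
    for offset in range(0, len(data), window):
        yield offset, shannon_entropy(data[offset:offset + window])


if __name__ == "__main__":
    # Synthetic sample: 256 distinct byte values followed by 256 zero bytes.
    sample = bytes(range(256)) + b"\x00" * 256
    for offset, entropy in windowed_entropy(sample):
        print(f"offset {offset:4d}: {entropy:.2f} bits/byte")
```

Scanning in windows rather than over the whole file is a design choice: a small encrypted block inside an otherwise ordinary executable barely changes the file-wide entropy, but stands out in the window that contains it.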

Summary of Chapters

1 Introduction: This chapter introduces the ongoing conflict between security experts and malware developers, highlighting the limitations of traditional signature-based detection and the growing significance of machine learning.

2 Scope of the thesis: This section defines the primary goal of evaluating static, dynamic, and hybrid approaches and outlines the three core research questions guiding the analysis and practical implementation.

3 Theory: This chapter establishes the fundamental terminology regarding malware evolution, Windows file structures, detection methodologies (static, dynamic, hybrid), and foundational machine learning concepts.

4 Literature Review: ML Approaches in Research: This section provides a comprehensive, criteria-based evaluation of 35 diverse research papers, classifying them by methodology and analyzing their performance through quantitative and qualitative metrics.

5 Practical Review: Implementing a Static ML-Based Malware Detector: This chapter details the hands-on process of building a static detector, covering environment setup, feature extraction with Python, and model training/validation using WEKA.

6 Conclusion and Outlook: This final chapter synthesizes the main findings from the literature review and the practical implementation, providing insights into future developments for malware detection.

Keywords

Malware Detection, Machine Learning, Static Analysis, Dynamic Analysis, Hybrid Analysis, Windows Executables (PE), Feature Engineering, Malware Evolution, Obfuscation, Model Accuracy, WEKA, Classification, Cybersecurity, Neural Networks, Endpoint Protection

Frequently Asked Questions

What is the primary focus of this thesis?

This thesis examines the effectiveness and common characteristics of different machine learning-based malware detection methods on the Windows platform, specifically focusing on static, dynamic, and hybrid approaches.

What are the three main methodological fields analyzed?

The research is categorized into static analysis (analyzing files without execution), dynamic analysis (monitoring behavior at runtime), and hybrid analysis (combining both methods).

What is the central research question?

The work addresses how these three methodological domains differ in terms of quantitative and qualitative evaluation criteria and how a static detection model can be practically implemented using established ML workflows.

Which scientific method is used for the literature review?

The thesis utilizes the framework for literature reviewing developed by Brocke et al., which involves defined selection cycles and key performance indicators to ensure a representative and rigorous overview of existing research.

What does the practical section of the thesis cover?

The practical section describes the end-to-end development of a static malware detector, including data gathering from repositories like VirusShare and EMBER, feature extraction using Python, and model validation with the WEKA tool.

Which keywords best characterize the work?

The most important terms include Malware Detection, Machine Learning, Static Analysis, Dynamic Analysis, Hybrid Analysis, Malware Evolution, and Feature Engineering.

Why are hybrid approaches considered complex?

Hybrid approaches are complex because they require the integration of feature vectors from two distinct sources (static and dynamic), necessitating expert knowledge in feature fusion and additional effort during the feature extraction phase.
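
As a rough illustration of what integrating two feature sources can look like, the sketch below shows simple early fusion by concatenation in Python. It is an assumption-laden example, not the fusion technique of any surveyed paper: the function name `fuse_features` and all feature names (`section_entropy`, `import_count`, `api_calls`, `files_written`) are hypothetical.

```python
def fuse_features(static_features, dynamic_features):
    """Early fusion: merge two named feature sets into one ordered vector.

    Names are prefixed with their source so that a static and a dynamic
    feature with the same name cannot overwrite each other.
    """
    fused = {}
    for name, value in static_features.items():
        fused[f"static.{name}"] = float(value)
    for name, value in dynamic_features.items():
        fused[f"dynamic.{name}"] = float(value)
    names = sorted(fused)  # fixed order so every sample has the same layout
    return names, [fused[name] for name in names]


if __name__ == "__main__":
    # Hypothetical feature values for a single sample.
    static = {"section_entropy": 7.2, "import_count": 12}
    dynamic = {"api_calls": 340, "files_written": 3}
    names, vector = fuse_features(static, dynamic)
    for name, value in zip(names, vector):
        print(f"{name}: {value}")
```

Even this simple form hints at the extra effort involved: both extraction pipelines must run for every sample, and the two feature sets typically differ in scale and meaning, which is where feature-fusion expertise comes in.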

Did the practical test confirm the findings from the literature review?

Yes, the implementation confirmed that static models generally struggle to correctly classify obfuscated or encrypted malware samples, a limitation that was repeatedly noted in the surveyed research literature.

What is the significance of the "Feature Graph" developed in this work?

The custom feature graph, developed using the tool Gephi, visualizes the relationships between different methodology classes and applied features, providing a clearer understanding of how these elements relate to each other across different studies.

End of excerpt from 162 pages

Details

Title: Static and Dynamic Machine Learning Based Malware Detection Methods for Windows Programs
Subtitle: A Comparative Outlook on Alternative Hybrid Approaches
University: University of Applied Sciences Essen
Grade: 1.0
Author: Lars Kaiser (Author)
Year of publication: 2022
Pages: 162
Catalog number: V1323478
ISBN (PDF): 9783346809353
Language: English
Keywords: Machine Learning Malware Detection ML AI Artificial Intelligence Antivirus Virus Malicious Learning Static Dynamic Hybrid Research Implementing Programming Python WEKA Training Program Windows PE-File
Product safety: GRIN Publishing GmbH

Cite this text: Lars Kaiser (Author), 2022, Static and Dynamic Machine Learning Based Malware Detection Methods for Windows Programs, Munich, GRIN Verlag, https://www.grin.com/document/1323478
