The topic of this thesis is the development of a graph-based static analysis framework for Java code that tolerates incomplete or non-compiling source code. For this purpose, the concept of Code Property Graphs (CPGs) is to be researched and extended in order to provide information about more complex erroneous patterns in Java source code. Additionally, the resulting graph model is to be evaluated by searching for cryptographic vulnerabilities in publicly available Java projects. This evaluation needs to show whether the graph-based analysis approach is capable of finding security issues in Java code and how feasible the analysis is from a performance point of view.
Automatic code analysis is a widely used technique for finding and eliminating errors in software projects. Instead of executing the program and verifying that its behavior is correct, as dynamic analysis does, static analysis is applied to the source code. Here, we search for suspicious patterns that are likely to indicate erroneous behavior. A special class of software bugs consists of errors that lead to security vulnerabilities. In this case, attackers may be able to undermine fundamental security guarantees, for example by exfiltrating sensitive user data from server applications or by assuming control over the machine running the program in question.
Security vulnerabilities in code can have drastic consequences, which is why it is important to identify them as quickly as possible and to fix them immediately afterwards. This thesis extends the concept of Code Property Graphs (CPGs), which has been proposed for static analysis of C/C++ code, so that it can be applied to programs and incomplete code snippets written in Java. By unifying Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs) in a single data structure, this approach enables searching for vulnerabilities whose code patterns are spread across the boundaries of single methods and classes. These patterns are identified using the graph query language Cypher, which is provided by the graph database Neo4j. In an evaluation run on 100 public GitHub repositories that use cryptography, 135 findings of cryptographic API misuse were identified using this technique. These include the use of insecure algorithms and modes of operation, such as the Data Encryption Standard (DES) or Electronic Codebook (ECB) mode, and hardcoded passwords that are used for encryption purposes.
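To make the idea of a unified graph more concrete, the following is a minimal sketch, not the thesis's implementation. It assumes a toy in-memory graph with hypothetical node kinds ("Literal", "Variable", "Call") and three edge types, and it follows DFG edges to check whether a string literal naming DES or ECB reaches a call to `Cipher.getInstance`. The thesis itself expresses such patterns as Cypher queries against Neo4j; plain Java is used here only to keep the example self-contained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CpgSketch {
    /** Edge kinds that a CPG overlays on one shared set of nodes. */
    enum EdgeType { AST, CFG, DFG }

    /** Toy node; the kind names used below are hypothetical labels. */
    static final class Node {
        final String kind;
        final String code;
        final Map<EdgeType, List<Node>> out = new EnumMap<>(EdgeType.class);

        Node(String kind, String code) {
            this.kind = kind;
            this.code = code;
        }

        void connect(EdgeType type, Node target) {
            out.computeIfAbsent(type, t -> new ArrayList<>()).add(target);
        }
    }

    /** Crude check; note that it also flags "DESede" (Triple DES). */
    static boolean isWeak(String transformation) {
        String t = transformation.toUpperCase();
        return t.contains("DES") || t.contains("ECB");
    }

    /** Breadth-first search along DFG edges only. */
    static boolean reaches(Node start, Node target) {
        Deque<Node> work = new ArrayDeque<>();
        Set<Node> seen = new HashSet<>();
        work.add(start);
        while (!work.isEmpty()) {
            Node current = work.poll();
            if (current == target) {
                return true;
            }
            if (!seen.add(current)) {
                continue; // already visited, guards against cycles
            }
            work.addAll(current.out.getOrDefault(EdgeType.DFG, List.of()));
        }
        return false;
    }

    /** Returns weak literals whose value flows into a Cipher.getInstance call. */
    static List<Node> weakCipherLiterals(List<Node> nodes) {
        List<Node> findings = new ArrayList<>();
        for (Node literal : nodes) {
            if (!literal.kind.equals("Literal") || !isWeak(literal.code)) {
                continue;
            }
            for (Node call : nodes) {
                if (call.kind.equals("Call")
                        && call.code.startsWith("Cipher.getInstance")
                        && reaches(literal, call)) {
                    findings.add(literal);
                    break;
                }
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Models: String algo = "DES/ECB/PKCS5Padding"; Cipher.getInstance(algo);
        Node literal = new Node("Literal", "DES/ECB/PKCS5Padding");
        Node variable = new Node("Variable", "algo");
        Node call = new Node("Call", "Cipher.getInstance(algo)");
        call.connect(EdgeType.AST, variable);     // argument is a syntactic child
        variable.connect(EdgeType.CFG, call);     // declaration executes before the call
        literal.connect(EdgeType.DFG, variable);  // value flows literal -> variable
        variable.connect(EdgeType.DFG, call);     // value flows variable -> call
        for (Node finding : weakCipherLiterals(List.of(literal, variable, call))) {
            System.out.println("Weak transformation: " + finding.code);
        }
    }
}
```

Because the literal and the call are connected only through data flow edges, the same search would still work if the variable were assigned in one method and used in another, which is the kind of cross-boundary pattern the abstract refers to.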
This thesis was created in cooperation with Fraunhofer AISEC.
Table of Contents
- 1 Introduction
- 1.1 Problem Statement
- 1.2 Thesis Structure
- 2 Background and State of the Art
- 2.1 Static Code Analysis
- 2.2 Robust Analysis
- 2.2.1 Handling Incomplete Code
- 2.2.2 Handling Erroneous Code
- 2.2.3 Handling Inheritance and Interprocedural Dataflow
- 2.3 Code Property Graph
- 2.3.1 Abstract Syntax Tree
- 2.3.2 Control Flow Graph
- 2.3.3 Data Flow Graph
- 2.4 Graph Databases
- 2.5 Related Work
- 3 Approach and Implementation
- 3.1 Existing Setup
- 3.1.1 CPG Generation from Java Source Code
- 3.1.2 Graph Persistence with Neo4j-OGM
- 3.2 Improvements to CPG Generation for Robust Analysis
- 3.2.1 Wrapping Incomplete Code Snippets
- 3.2.2 Enhanced Analysis Passes
- 3.2.3 Data Flow Analysis
- 3.2.4 Type Propagation and the Type Listener System
Objectives and Key Themes
This thesis aims to extend the application of Code Property Graphs (CPGs) for static code analysis, specifically focusing on Java programs and incomplete code snippets. The goal is to create a more robust analysis method capable of handling incomplete or erroneous code, and to improve the identification of security vulnerabilities.
- Robust static code analysis of Java programs.
- Adaptation of CPGs for handling incomplete code.
- Detection of security vulnerabilities using graph query language.
- Improved data flow analysis techniques.
- Evaluation of the approach on real-world codebases.
Chapter Summaries
1 Introduction: This chapter introduces the problem of finding security vulnerabilities in software through static code analysis. It highlights the limitations of existing methods and introduces the thesis's approach of using Code Property Graphs (CPGs) for a more robust analysis, particularly focusing on Java. The chapter outlines the structure and objectives of the thesis, setting the stage for the subsequent detailed exploration of the methodology and results.
2 Background and State of the Art: This chapter provides a comprehensive overview of static code analysis, focusing on robust techniques to handle incomplete and erroneous code. It details the concept of Code Property Graphs (CPGs), explaining their composition from Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and Data Flow Graphs (DFGs). The chapter also explores the role of graph databases, such as Neo4j, in facilitating efficient graph-based queries for vulnerability detection. Existing related work in the field is reviewed, establishing the context and novelty of the thesis's contributions.
3 Approach and Implementation: This chapter presents the approach and implementation of the proposed robust graph-based static code analysis method. It describes the existing CPG generation process for Java code and how the resulting graph is persisted in Neo4j via Neo4j-OGM, an object graph mapping library. The core contribution lies in the improvements made to the CPG generation process, which focus on handling incomplete code snippets and on improving data flow analysis. These improvements, including a novel type listener system, are explained in detail, providing a complete picture of the system's architecture and functionality.
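The summary does not spell out how the type listener system works. One plausible reading is an observer pattern in which nodes subscribe to each other's type information, so that a type resolved late is pushed to every dependent node. The sketch below illustrates that reading only; `TypedNode`, `TypeListener` and all method names are hypothetical and not taken from the thesis.

```java
import java.util.ArrayList;
import java.util.List;

final class TypeListenerSketch {
    /** Callback invoked when the observed node's type changes. */
    interface TypeListener {
        void typeChanged(TypedNode source);
    }

    /** A node that has a type and can forward type changes to others. */
    static class TypedNode implements TypeListener {
        static final String UNKNOWN = "UNKNOWN";

        final String name;
        private String type = UNKNOWN;
        private final List<TypeListener> listeners = new ArrayList<>();

        TypedNode(String name) {
            this.name = name;
        }

        String getType() {
            return type;
        }

        void registerTypeListener(TypeListener listener) {
            listeners.add(listener);
            // A listener that registers late still learns an already known type.
            if (!type.equals(UNKNOWN)) {
                listener.typeChanged(this);
            }
        }

        void setType(String newType) {
            if (newType.equals(type)) {
                return; // nothing changed; this also ends propagation in cycles
            }
            type = newType;
            for (TypeListener listener : listeners) {
                listener.typeChanged(this);
            }
        }

        @Override
        public void typeChanged(TypedNode source) {
            // Simplest possible policy: adopt the source's type.
            setType(source.getType());
        }
    }

    public static void main(String[] args) {
        TypedNode declaration = new TypedNode("key");
        TypedNode reference = new TypedNode("key (use site)");
        declaration.registerTypeListener(reference);

        // The declaration's type becomes known only after the reference exists.
        declaration.setType("javax.crypto.SecretKey");
        System.out.println(reference.name + " : " + reference.getType());
    }
}
```

The appeal of such a design for incomplete code is that the order in which nodes are visited matters less: a reference can be created before its declaration has a resolved type and is updated once one becomes available.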
Keywords
Static code analysis, Code Property Graphs (CPGs), Java, security vulnerabilities, robust analysis, graph databases, Neo4j, data flow analysis, incomplete code, cryptographic API misuse.
Frequently Asked Questions
What is the main topic of this thesis?
This thesis focuses on extending the application of Code Property Graphs (CPGs) for robust static code analysis of Java programs, particularly addressing the challenges posed by incomplete or erroneous code snippets. The goal is to improve the detection of security vulnerabilities.
What are Code Property Graphs (CPGs)?
CPGs are a graph-based representation of code, combining information from Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and Data Flow Graphs (DFGs). They provide a comprehensive view of the code's structure and data flow, facilitating efficient analysis.
What are the key objectives of this thesis?
The key objectives include developing a robust static code analysis method for Java, adapting CPGs to handle incomplete code, detecting security vulnerabilities using graph query languages, improving data flow analysis techniques, and evaluating the approach on real-world codebases.
How does this thesis handle incomplete or erroneous code?
The thesis addresses this challenge by implementing improvements to the CPG generation process. These improvements include techniques for wrapping incomplete code snippets, enhanced analysis passes, improved data flow analysis, and a novel type listener system for better type propagation.
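As an illustration of the wrapping idea, the sketch below surrounds a bare snippet with a synthetic method and class so that a parser expecting a full compilation unit can process it. The regular expressions are deliberately crude heuristics, and `WrapperClass` and `wrapperMethod` are hypothetical names; the thesis's actual wrapping logic is not described in this overview. The code assumes Java 11 or later.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class SnippetWrapper {
    // Heuristic: the snippet already declares a type somewhere.
    private static final Pattern TYPE_DECLARATION =
            Pattern.compile("\\b(?:class|interface|enum)\\s+\\w+");

    // Heuristic: the snippet starts with something shaped like a method header.
    private static final Pattern METHOD_HEADER = Pattern.compile(
            "(?:(?:public|private|protected|static|final|synchronized|abstract)\\s+)*"
                    + "[\\w<>\\[\\],.?]+\\s+\\w+\\s*\\([^)]*\\)\\s*"
                    + "(?:throws\\s+[\\w.,\\s]+)?\\{");

    /** Wraps statements in a method and methods in a class, as needed. */
    static String wrap(String snippet) {
        String code = snippet.strip();
        if (TYPE_DECLARATION.matcher(code).find()) {
            return code; // looks like a complete type already
        }
        if (!METHOD_HEADER.matcher(code).lookingAt()) {
            // Bare statements: give them an enclosing method first.
            code = "void wrapperMethod() {\n" + indent(code) + "\n}";
        }
        return "class WrapperClass {\n" + indent(code) + "\n}";
    }

    private static String indent(String code) {
        return code.lines()
                .map(line -> "    " + line)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        System.out.println(wrap("Cipher c = Cipher.getInstance(\"DES\");"));
    }
}
```

A production version would need to handle more cases, for example snippets that mix fields and statements or string literals that happen to contain the word "class", which is why these checks should be read as placeholders rather than a complete classification.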
What tools and technologies are used in this thesis?
The thesis uses Java, Code Property Graphs (CPGs), the graph database Neo4j, and Neo4j-OGM for graph persistence. Vulnerability patterns are expressed in Cypher, the graph query language provided by Neo4j. The implementation generates CPGs from Java source code.
What are the key contributions of this thesis?
The core contributions lie in the improvements made to the CPG generation process for robust analysis, particularly the handling of incomplete code snippets and the enhanced data flow analysis techniques, including the novel type listener system.
What is the structure of the thesis?
The table of contents shown above covers three main chapters: an introduction outlining the problem and approach; a background chapter reviewing static code analysis, CPGs, and related work; and an implementation chapter detailing the approach and the improvements made to CPG generation for robust analysis.
What are the key themes explored in the thesis?
Key themes include robust static code analysis, CPGs for Java, handling incomplete code, security vulnerability detection, graph databases, data flow analysis, and the application of a graph query language.
What are the chapter summaries?
Chapter 1 introduces the problem and thesis structure. Chapter 2 provides background on static code analysis, CPGs, and related work. Chapter 3 details the implementation of a robust graph-based static code analysis method with improvements for handling incomplete code and enhanced data flow analysis.
What are the keywords associated with this thesis?
Keywords include static code analysis, Code Property Graphs (CPGs), Java, security vulnerabilities, robust analysis, graph databases, Neo4j, data flow analysis, incomplete code, and cryptographic API misuse.
- Cite this work
- Samuel Hopstock (Author), 2019, Robust Graph-Based Static Code Analysis, München, GRIN Verlag, https://www.grin.com/document/505779