The topic of this thesis is the development of a graph-based static analysis framework for Java code that tolerates incomplete or non-compiling source code. To this end, the concept of Code Property Graphs (CPGs) is researched and extended so that it can capture more complex erroneous patterns in Java source code. The resulting graph model is then evaluated by searching for cryptographic vulnerabilities in publicly available Java projects. This evaluation needs to show whether the graph-based approach is capable of finding security issues in Java code and how feasible the analysis is from a performance point of view.
Automatic code analysis is a widely used technique for finding and eliminating errors in software projects. Instead of executing the program and verifying that its behavior is correct, as dynamic analysis does, static analysis is applied to the source code. Here, we search for suspicious patterns that are likely to indicate erroneous behavior. A special class of software bugs consists of errors that lead to security vulnerabilities. In this case, attackers may be able to undermine fundamental security properties, for example by exfiltrating sensitive user data from server applications or by assuming control over the machine running the program in question.
Security vulnerabilities in code can have drastic consequences, which is why it is important to identify and fix them as quickly as possible. This thesis extends the concept of Code Property Graphs (CPGs), which was originally proposed for static analysis of C/C++ code, so that it can be applied to complete programs and incomplete code snippets written in Java. By unifying Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs) in a single data structure, this approach enables searching for vulnerabilities whose code patterns are spread across the boundaries of single methods and classes. These patterns are identified using the graph query language Cypher, which is provided by the graph database Neo4j. In an evaluation run on 100 public GitHub repositories that use cryptography, 135 cases of cryptographic API misuse were identified using this technique. These include the use of insecure algorithms and modes, such as the Data Encryption Standard (DES) or Electronic Codebook (ECB) mode, and hardcoded passwords that are used for encryption purposes.
This thesis was created in cooperation with Fraunhofer AISEC.
Table of Contents
1 Introduction
1.1 Problem Statement
1.2 Thesis Structure
2 Background and State of the Art
2.1 Static Code Analysis
2.2 Robust Analysis
2.2.1 Handling Incomplete Code
2.2.2 Handling Erroneous Code
2.2.3 Handling Inheritance and Interprocedural Dataflow
2.3 Code Property Graph
2.3.1 Abstract Syntax Tree
2.3.2 Control Flow Graph
2.3.3 Data Flow Graph
2.4 Graph Databases
2.5 Related Work
3 Approach and Implementation
3.1 Existing Setup
3.1.1 CPG Generation from Java Source Code
3.1.2 Graph Persistence with Neo4j-OGM
3.2 Improvements to CPG Generation for Robust Analysis
3.2.1 Wrapping Incomplete Code Snippets
3.2.2 Enhanced Analysis Passes
3.2.3 Data Flow Analysis
3.2.4 Type Propagation and the Type Listener System
3.3 Automated Code Crawler
3.3.1 Preparation: Collecting Java Files
3.3.2 CPG Generation
3.3.3 Analysis: Running Queries on the Graph
4 Analyzing Java Cryptography Extension API Misuse
4.1 Misusing Cryptography
4.2 Automated Detection with CPG Queries
4.2.1 Insecure Algorithm Usage
4.2.2 Constant Encryption Passwords
4.3 Analyzing GitHub Repositories
4.3.1 Discovering Java Repositories that use Cryptography
4.3.2 Experiment Setting
4.3.3 Detected Cryptography API Misuses
4.3.4 Performance of the Analysis Process
5 Conclusion
6 Future Work
Research Objectives and Topics
The primary goal of this thesis is to develop a robust, graph-based static analysis framework for Java code that can handle incomplete or non-compiling source code. The research investigates extending the Code Property Graph (CPG) concept to Java and evaluates its efficacy by automatically identifying cryptographic vulnerabilities in public software repositories.
- Extension of Code Property Graphs (CPGs) to Java source code.
- Methods for static analysis of incomplete or non-compiling code snippets.
- Utilization of graph databases (Neo4j) for persistent graph storage and query execution.
- Automated detection of cryptographic API misuse, such as insecure algorithms and hardcoded credentials.
- Performance evaluation and scalability of graph-based analysis on open-source repositories.
Excerpt from the Book
3.2.1 Wrapping Incomplete Code Snippets
JavaParser expects each parsed file to contain syntactically complete Java code. An example can be seen in Figure 3.2: the main functionality (printing "Hello world") needs to be contained inside a method of a class. If this is not the case, JavaParser refuses to produce an abstract syntax tree for the program.
However, we also want to be able to analyze incomplete code snippets, e.g. single methods taken from sources like StackOverflow, so we need a way to overcome this limitation of JavaParser. As a first step, we need to look at the different forms in which incomplete code (that programmers can still understand) can be provided. These are the kinds of code that appear on code sharing sites and are therefore the ones relevant for our analysis.
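The wrapping idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the thesis: the class name SnippetWrapper, the wrapper names Wrapper and wrapped, and the regular expressions are hypothetical, and the classification into class, method and statement level is a rough textual heuristic rather than a parser-based decision.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not from the thesis): lifts a snippet into a parseable compilation unit. */
final class SnippetWrapper {

    enum Level { CLASS, METHOD, STATEMENT }

    // Heuristic: a type keyword, a name, and later an opening brace.
    private static final Pattern TYPE_DECL =
            Pattern.compile("\\b(class|interface|enum)\\s+\\w+[^;{]*\\{");

    // Heuristic: optional modifiers, return type, name, parameter list and
    // an opening brace at the very start of the snippet.
    private static final Pattern METHOD_DECL = Pattern.compile(
            "^\\s*(?:(?:public|protected|private|static|final|synchronized|abstract)\\s+)*"
            + "[\\w<>\\[\\],.]+\\s+\\w+\\s*\\([^)]*\\)\\s*(?:throws\\s+[\\w.,\\s]+)?\\{");

    /** Guesses how complete the snippet is; anything unrecognized counts as statements. */
    static Level classify(String snippet) {
        if (TYPE_DECL.matcher(snippet).find()) return Level.CLASS;
        if (METHOD_DECL.matcher(snippet).find()) return Level.METHOD;
        return Level.STATEMENT;
    }

    /** Adds empty wrappers until the snippet forms a syntactically complete class. */
    static String wrap(String snippet) {
        switch (classify(snippet)) {
            case CLASS:
                return snippet; // already a full type declaration
            case METHOD:
                return "class Wrapper {\n" + snippet + "\n}";
            default:
                return "class Wrapper {\nvoid wrapped() {\n" + snippet + "\n}\n}";
        }
    }

    public static void main(String[] args) {
        // A bare statement, as it might appear in a forum answer.
        System.out.println(wrap("System.out.println(\"Hello world\");"));
    }
}
```

The wrapped result could then be handed to a parser in place of the raw snippet. A purely textual heuristic like this will misclassify some inputs (for example generic methods or type keywords inside string literals), so it only illustrates the principle.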
Summary of Chapters
1 Introduction: Discusses the inherent challenges of static analysis and defines the thesis goal: building a robust CPG framework for Java that handles non-compiling code.
2 Background and State of the Art: Provides the theoretical foundation of static analysis, Code Property Graphs (CPGs), and the application of graph databases in this domain.
3 Approach and Implementation: Details the technical implementation of the CPG generator, the pass system for graph refinement, and the automated crawler for GitHub repository analysis.
4 Analyzing Java Cryptography Extension API Misuse: Evaluates the framework by defining and running Cypher queries to detect insecure cryptographic practices like hardcoded passwords and weak algorithms.
5 Conclusion: Summarizes the effectiveness of the CPG model and provides practical recommendations for integrating the developed analysis tool into software development workflows.
6 Future Work: Analyzes the performance limitations of the current implementation and proposes enhancements, such as optimizing database persistence and integrating analysis servers.
Keywords
Static Code Analysis, Code Property Graph, CPG, Java, Security Vulnerabilities, Cryptography, Neo4j, Cypher, Abstract Syntax Tree, Control Flow Graph, Data Flow Graph, Automated Detection, Source Code Analysis, Java Cryptography Extension, JCE
Frequently Asked Questions
What is the core focus of this thesis?
The work focuses on creating a static analysis framework for Java that uses Code Property Graphs to detect security vulnerabilities, even in incomplete or non-compiling code.
What are the primary thematic areas covered?
Key areas include graph-based program representation (CPGs), robust static analysis techniques, Java-specific compilation challenges, and automated security auditing of cryptographic implementations.
What is the main research objective?
The primary objective is to prove that a CPG-based approach for Java can identify complex security vulnerabilities across method boundaries while remaining resilient to incomplete source code.
What scientific methods were employed?
The research uses AST parsing via JavaParser, graph-based transformations via a custom pass system, and pattern-matching analysis using the Cypher graph query language.
What is covered in the main part of the thesis?
The main section details the architecture of the CPG generation process, including AST construction, type hierarchy analysis, and the implementation of automated analysis passes to detect security flaws.
Which keywords best describe this research?
Important keywords include Code Property Graph (CPG), Static Code Analysis, Java, Cryptographic API Misuse, and Graph Databases.
How does the system handle incomplete code fragments?
The framework implements a wrapping mechanism that classifies code snippets by completeness level (class, method, or statement level) and applies empty wrappers to lift them into a syntactically correct state for parsing.
Why is polymorphism a challenge for this analysis?
Polymorphism makes it difficult to statically determine the actual target of method calls, requiring the system to compute possible subtypes and common ancestors to accurately identify execution paths.
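To make the subtype and common-ancestor computation more concrete, here is a minimal sketch. It assumes single inheritance and an acyclic hierarchy, ignores interfaces, and uses hypothetical names (TypeHierarchy, Animal, Dog, Cat); it is not the data structure used in the thesis.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: a class hierarchy stored as child -> direct supertype. */
final class TypeHierarchy {

    // Assumption: single inheritance only, no cycles.
    private final Map<String, String> parentOf = new HashMap<>();

    void addType(String type, String superType) {
        parentOf.put(type, superType);
    }

    /** The type itself followed by its ancestors, nearest first. */
    List<String> ancestors(String type) {
        List<String> chain = new ArrayList<>();
        for (String t = type; t != null; t = parentOf.get(t)) {
            chain.add(t);
        }
        return chain;
    }

    /** The given type plus every registered type inheriting from it: possible receivers of a call. */
    Set<String> possibleSubtypes(String type) {
        Set<String> result = new TreeSet<>();
        result.add(type);
        for (String candidate : parentOf.keySet()) {
            if (ancestors(candidate).contains(type)) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Nearest type that both arguments inherit from, or null if there is none. */
    String commonAncestor(String a, String b) {
        Set<String> ancestorsOfA = new HashSet<>(ancestors(a));
        for (String t : ancestors(b)) {
            if (ancestorsOfA.contains(t)) {
                return t;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TypeHierarchy hierarchy = new TypeHierarchy();
        hierarchy.addType("Animal", "Object");
        hierarchy.addType("Dog", "Animal");
        hierarchy.addType("Cat", "Animal");
        // A call on a variable declared as Animal may dispatch to any of these types.
        System.out.println(hierarchy.possibleSubtypes("Animal"));
        // A variable assigned both a Dog and a Cat can be typed by their common ancestor.
        System.out.println(hierarchy.commonAncestor("Dog", "Cat"));
    }
}
```

With such a structure, a call whose receiver is declared as Animal would be linked to the matching method in every possible subtype, which over-approximates the real execution paths but avoids missing one.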
What were the results of the repository evaluation?
The evaluation on 100 repositories identified 135 instances of cryptographic API misuse, with over 80% related to insecure algorithms like DES and ECB mode.
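For illustration, the kind of criterion behind the insecure-algorithm findings can be sketched on the transformation string passed to Cipher.getInstance, which has the form "algorithm/mode/padding". The thesis performs this detection with Cypher queries on the graph; the sketch below only shows the string-level check and is not that query. The class name TransformationCheck and the list of block ciphers are assumptions, as is the treatment of an omitted mode as ECB (the default of the SunJCE provider).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: flags JCE transformation strings that use DES or ECB mode. */
final class TransformationCheck {

    // Assumption: an illustrative, incomplete list of block ciphers for which ECB is a real weakness.
    private static final Set<String> BLOCK_CIPHERS =
            new HashSet<>(Arrays.asList("AES", "DES", "DESEDE", "BLOWFISH"));

    static boolean isInsecure(String transformation) {
        String[] parts = transformation.toUpperCase(Locale.ROOT).split("/");
        String algorithm = parts[0].trim();
        // Assumption: with no explicit mode, the provider falls back to ECB (as SunJCE does).
        String mode = parts.length > 1 ? parts[1].trim() : "ECB";

        boolean weakAlgorithm = algorithm.equals("DES");
        boolean ecbBlockCipher = BLOCK_CIPHERS.contains(algorithm) && mode.equals("ECB");
        return weakAlgorithm || ecbBlockCipher;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"DES/CBC/PKCS5Padding", "AES/ECB/PKCS5Padding", "AES/GCM/NoPadding", "AES"}) {
            System.out.println(t + " -> " + (isInsecure(t) ? "insecure" : "ok"));
        }
    }
}
```

A check on a string literal alone cannot follow a transformation that is assembled in one method and used in another, which is where a graph with data flow edges becomes useful.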
What are the performance limitations mentioned?
The current implementation faces bottlenecks during graph persistence and when executing long-path variable-length queries in Neo4j, limiting real-time application in large projects.
- Samuel Hopstock (Author), 2019, Robust Graph-Based Static Code Analysis, Munich, GRIN Verlag, https://www.grin.com/document/505779