Introduction

Greybox fuzzing (GF) is a powerful automated vulnerability detection technique. Traditional greybox fuzzers primarily rely on coverage-guided testing. While this approach has achieved some success in detecting vulnerabilities, it has certain limitations. It often wastes time in non-vulnerable code areas and can miss boundary conditions or complex vulnerabilities, potentially failing to identify critical risks.

In contrast, directed greybox fuzzing (DGF) predefines targets within the program under test and gradually approaches these target points by optimizing seed selection and energy allocation, focusing on testing specific target areas. This allows for more precise allocation of testing resources, concentrating on code sections that are more likely to contain vulnerabilities, such as complex, unpredictable, or error-prone sections. This targeted approach improves resource efficiency and accelerates vulnerability discovery. As a result, DGF has become a research hotspot in recent years, especially in areas such as patch testing, vulnerability reproduction, and static analysis report verification1. By optimizing the testing process for these specific scenarios, DGF helps developers validate patches and quickly re-detect known vulnerabilities. Some mainstream directed greybox fuzzers, such as AFLGo2, calculate the distance to the target area and prioritize mutations for seeds closer to it, enabling faster progress towards potential vulnerabilities.

Considerable progress has been made in the realm of DGF, but most research focuses on UNIX-like platforms, with relatively limited attention given to widely used mainstream operating systems like Windows. Successfully applying DGF to Windows could reduce manual auditing, lower labor costs, and support the development of the Windows security ecosystem. DGF relies on precise structural information, such as control-flow graphs (CFGs) and function call graphs (FCGs). However, due to the closed-source nature of Windows applications, the extensive use of dynamic link libraries (DLLs), the complexity of import tables, and platform-specific characteristics like user interaction requirements, achieving accurate and effective guidance is challenging. The specific challenges faced in implementing DGF on the Windows platform include:

Challenge 1: How to locate the target function in Windows applications? Persistent testing is a critical strategy on the Windows platform: it saves time by not terminating the process after each test case and instead repeatedly iterating over a specific target function. This approach makes fuzzing efficient in an environment that lacks a fork mechanism for rapid process duplication. Identifying the target function is therefore a crucial step. The complexity of this process stems from several aspects. First, DLLs can be dynamically loaded and unloaded at any time, making it more difficult to trace function calls across multiple modules. Second, while the Windows Portable Executable format employs import and export tables to manage inter-module calls, the lack of sufficient symbol information hampers the effectiveness of static analysis tools in tracing these calls. Finally, many high-level functions in the Windows API act as wrappers for deeper system calls. This abstraction leads fuzzing campaigns to target the surface-level wrapper functions rather than the underlying code that performs the actual logic.

Challenge 2: How to bypass the graphical user interface and idle states of mainstream Windows applications? Many Windows applications rely on user interactions to receive input, such as mouse clicks, keyboard inputs, or menu selections. This makes it challenging to automate fuzzing without manual intervention. Additionally, such applications follow an event-driven architecture: after processing an event, the program enters an idle state and waits for the next event to occur3. This idle state blocks the program flow and prevents iterative execution of the target function for fuzzing. Therefore, it is necessary to analyze the code structure and logic of the target program.

Challenge 3: How to extract CFGs and FCGs of Windows applications for fitness metric calculation? Unlike Linux systems, where instrumentation during compilation can be used to construct CFGs and FCGs, Windows applications are typically distributed as compiled binaries. The challenge with Windows applications lies in their heavy reliance on DLLs and external system interfaces, which not only creates extremely complex dependency structures in CFGs and FCGs but also undermines the determinism of control flow through mechanisms such as system calls, dynamically loaded libraries, and virtual function tables. Therefore, to effectively map the architecture of applications and perform fitness assessments, a lightweight solution tailored to the unique complexities of Windows is needed.

Proposal. In response to these challenges, we develop WinDGF, a directed greybox fuzzer for Windows. For Challenge 1, to locate target functions, the core innovation of WinDGF lies in the synergistic integration of dynamic binary instrumentation and static analysis. Traditional tools such as Dependency Walker rely solely on static parsing of PE file headers, which fails to capture DLLs that are dynamically loaded at runtime and cross-module call chains. WinDGF overcomes this limitation by employing dynamic instrumentation to monitor module loading and function call paths at runtime, integrated with static script-based analysis using IDA Pro to filter file I/O and parsing functions. By further evaluating these functions based on DLL coverage and call depth, WinDGF effectively eliminates high-level wrapper functions. For Challenge 2, to handle the GUI problem, we propose a hybrid dynamic-static solution. At the dynamic analysis layer, real-time monitoring of DLL loading events and runtime function call paths enables precise interception of critical parameters and return values. At the static analysis layer, IDA Pro supplies missing symbols and parameter specifications to reconstruct data dependency models for critical functions. This capability enables the generation of standalone test executables that directly invoke target functions via command-line interfaces, effectively bypassing GUI dependencies. Compared to Winnie, our optimized instrumentation workflow reduces redundant operations and integrates dynamic-static data fusion to precisely capture critical execution paths. For Challenge 3, to calculate fitness metrics, we focus on path distance and key-block coverage and design two modes: WinDGF_path and WinDGF_keyblock. Unlike AFLGo and similar approaches that require source code instrumentation and compile-time CFG construction (ill-suited for Windows binaries due to the source code dependency), WinDGF uses the IDA Python4 API to iteratively and recursively generate, in reverse order from the target points, the cross-referenced CFGs and FCGs. From these graphs, we identify and mark reachable target basic blocks as key blocks. We then use Dijkstra's algorithm to calculate the shortest path distance from each key basic block to the predefined target blocks. This approach focuses on reachable targets, effectively simplifies the graphs, and speeds up the generation of effective graphs, making it particularly suitable for Windows applications where compile-time instrumentation is not feasible.

Evaluation. We conducted overall performance evaluation and crash reproduction experiments with WinDGF, WinAFL5, and Winnie6 across 10 applications. Our results indicate that WinDGF surpasses both WinAFL and Winnie in several key metrics, including unique_crashes, paths_total, bitmap_coverage, and max_depth. Moreover, WinDGF successfully reproduced all of the given 11 crash points. In conclusion, our evaluation demonstrates the improved performance of WinDGF in terms of both exploration ability and vulnerability detection, making it a valuable tool for Windows platform applications.

Contributions. Specifically, our paper makes the following contributions:

  1. We design and implement the first (to the best of our knowledge) directed greybox fuzzer, WinDGF, for Windows applications, which is publicly available at https://github.com/mineechor/WinDGF.git.

  2. We design and implement two directed guidance modes for WinDGF, WinDGF_path and WinDGF_keyblock, based on distinct fitness metrics: path distance and key-block coverage. These designs meet different testing needs: WinDGF_path is optimized for rapid localization of specific targets, making it ideal for goal-oriented vulnerability hunting, while WinDGF_keyblock enhances deep coverage of critical areas, making it suitable for scenarios that require focused testing on specific regions.

  3. We propose a multi-phase power scheduling strategy that divides the fuzzing time into three phases. The strategy prioritizes distance in the mid-term phase and, in the later phase, allocates additional energy to low-frequency paths alongside the directed allocation, enhancing the comprehensiveness of path exploration.

  4. WinDGF is validated on a diverse set of real-world Windows applications. We compared WinDGF with the baseline fuzzers WinAFL and Winnie. The evaluation results demonstrate that WinDGF possesses directed capabilities and achieves higher coverage and exploration capabilities, leading to the discovery of more vulnerabilities.

The structure of our paper is as follows: the background is described in section “Background”. We present WinDGF’s strategies in section “Strategies of WinDGF” and its implementation in section “Implementation”. Section “Evaluation” presents the evaluation of WinDGF. We provide a discussion in section “Discussion”, introduce the related work in section “Related work”, and conclude in section “Conclusion”.

Background

This section provides foundational knowledge of the technologies relevant to this research, structured into two principal parts: directed greybox fuzzing and a motivating example.

Directed greybox fuzzing

DGF is a variant of greybox fuzzing (GF) that focuses on specific target locations, typically complex and error-prone code segments. DGF spends time locating these target locations and avoids wasting resources on irrelevant program modules. The core challenges of DGF primarily include the following two aspects:

  • How to determine target locations or vulnerability types: Before conducting DGF, it is crucial to identify the target locations or vulnerability types to focus on during the testing process. For example, BuzzFuzz7 may select library or system calls as target locations, while FishFuzz8 takes the potential vulnerabilities detected by Sanitizers as testing targets. Deep learning has proven to be useful in predicting potential attack targets9, and specific target sequences can detect memory errors like Use After Free (UAF) and Double Free (DF)10,11.

  • How to rapidly drive the fuzzer to reach target locations or trigger vulnerabilities: The challenge lies in selecting appropriate fitness metrics to measure the alignment between the current state and the target. This can guide seed selection, energy allocation, and mutation strategies. Current research exhibits notable platform heterogeneity: Windows-based fuzzers predominantly rely on traditional code coverage as the core evaluation metric5,6,12,13,14, while Unix-like platforms have seen a surge in research on directed fitness metrics for target-oriented guidance. Common metrics such as distance and similarity are used to prioritize seeds that are closer to the target location, as seen in AFLGo2 and WAFLGO15. FunFuzz16 introduces function significance as a fitness metric to guide seed selection and energy scheduling, prioritizing the testing of significant functions. RegionFuzz17 allocates more fuzzing resources to code regions that are more likely to be vulnerable based on four code metrics: sensitive, complex, deep, and rarely reachable.

Table 1 Fuzzer comparison table.

By overcoming these core challenges, DGF can improve the efficiency and accuracy of fuzzing, aiding in the discovery of more vulnerabilities and security issues. To address the technical gaps in directed greybox fuzzing research for Windows platforms (e.g., overreliance on code coverage as a single fitness metric, testing bottlenecks in complex GUI applications), this paper proposes WinDGF, a novel directed greybox fuzzing system based on hybrid synergistic analysis. As summarized in Table 1, WinDGF demonstrates clear superiority over current state-of-the-art approaches in four critical evaluation aspects.

Motivating example

In this part, we use WinAFL5 as an example to discuss the challenges of performing fuzzing on the Windows platform.

WinAFL is a port of AFL18 and it can be divided into two modules: the fuzzer and the instrumentor. The running process of WinAFL and the interaction between the two modules are illustrated in Fig. 1.

Fig. 1. The running process and module interaction of WinAFL.

Based on the above figure, we can observe that WinAFL achieves persistent-mode testing by executing the target code repeatedly within a single process context, owing to Windows’ lack of a native fork mechanism; this design still leaves it slower than fork-based fuzzers such as AFL. It focuses on specific functions that perform I/O and parsing, rather than on the whole program. Identifying the right target function in a closed-source binary requires reverse engineering skills and often involves selecting deep, non-entry-point functions, which makes this step challenging.

Additionally, the target function needs to return automatically via a “ret” instruction to ensure proper execution. However, in GUI programs, after completing event handling, the program enters an idle state and requires manual interaction to trigger the function’s return3. This limitation hinders the application of WinAFL on the Windows platform, which mainly consists of graphical applications. Furthermore, since shared library files cannot be executed directly as the testing target, WinAFL cannot fuzz shared libraries related to specific file formats directly.

Strategies of WinDGF

In this section, we describe our methodology and the main aspects of the WinDGF in detail. As an innovative vulnerability detection framework, WinDGF addresses three fundamental challenges in Windows-directed fuzzing: (1) platform-specific program adaptation for security testing, (2) precision-guided vulnerability discovery, and (3) sustainable optimization of exploration efficiency. Through the synergistic integration of static analysis, dynamic instrumentation, and evolutionary computation, WinDGF establishes an adaptive feedback loop that progressively refines testing focus while maintaining system stability.

Overview of WinDGF

To achieve directed fuzzing on the Windows platform and make it widely applicable to Windows applications, we have designed a directed greybox fuzzer called WinDGF. As shown in Fig. 2, it primarily consists of three components.

Fig. 2. The overview of WinDGF.

  • Program Adapter Component: It automatically identifies or generates target functions for fuzzing, while leveraging prior knowledge to detect vulnerability-prone regions within the target program.

  • Directed Greybox Fuzzing Component: It focuses on the design of directed strategies, including the selection of appropriate fitness metrics to evaluate the effectiveness of test cases, and the formulation of efficient power scheduling strategies to allocate resources.

  • Fuzzing Optimization Component: It utilizes a time slicing strategy that synergistically combines fitness-guided optimization with low-frequency path prioritization, effectively resolving incomplete path exploration through strategic execution scheduling.

WinDGF achieves efficient directed fuzzing on Windows platforms through the coordinated architecture of its Program Adapter, Directed Greybox Fuzzing, and Fuzzing Optimization components. Following the sequential technical workflow of environment adaptation, target guidance, and continuous optimization, this section methodically examines the implementation mechanisms of these core components.

Program adapter component

Windows applications can be broadly divided into three categories: Command-line programs, GUI programs, and shared libraries. In this component, we undertake preparatory work for fuzzing. We implement an analyzer for Command-line programs and some GUI programs that automatically generates a series of candidate target-function address offsets satisfying the conditions mentioned in section “Motivating example”. Besides, for the vast majority of GUI programs and shared libraries, we designed a framework to trace the execution of the programs automatically and generate a simple test program called a harness. Furthermore, we identify specific parts of the program that may contain security vulnerabilities as the target locations for directed greybox fuzzing.

Automated target offset candidates generation

In the dynamic instrumentation phase, we first register two callback functions: event_module_load, which records the module name and base address, and event_app_instruction, which sets callbacks for direct calls, returns, indirect calls, and indirect jumps. All calls related to file operations are marked as FR; when recording them, we dump certain values from memory whose addresses are stored as function parameters on the call stack, making it easier to determine later which APIs are related to the input file. In the static analysis script, we identify and count the functions within the main program that invoke file operation-related APIs. The procedure is shown in Algorithm 1.

Algorithm 1. Statistics of the main program functions calling the file-related APIs.

Its inputs are the function sequence callchunk, whose elements include the dump file, function name, callee address, caller address, type marker, tid, and callid, and the values stack_mem dumped from the stack and specific memory locations during the preceding dynamic analysis. Here, callid represents the function call index and tid represents the thread number. For each function in callchunk, we invoke ret_before_file; if it returns True, the function returns before calling any file-related API, so we remove it from the function sequence. Next, we call is_testfile_related_API to determine whether function i is a file-related API that performs operations on the input file. In the end, we obtain the sequence of main-program functions that call file handling-related APIs. In addition, we count the number of external library functions called within each main-program function in order to cover as many functionalities as possible.
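
To make this concrete, the plain-Python sketch below mirrors the Algorithm 1 logic. It assumes each callchunk record is a dictionary with at least "caller" and "type" fields and takes the two predicates named above (ret_before_file, is_testfile_related_API) as callables; these field names and signatures are illustrative assumptions rather than WinDGF's actual interfaces.

```python
# A minimal sketch of Algorithm 1; record layout and predicate signatures are
# assumptions made only for illustration.

def filter_file_related_functions(callchunk, stack_mem,
                                  ret_before_file, is_testfile_related_api):
    """Return per-caller statistics for main-program functions that reach
    file-related APIs operating on the input file."""
    stats = {}  # caller offset -> {"file_api_calls": n, "lib_calls": n}
    for record in list(callchunk):
        # Step 1: drop functions that return before any file-related API call.
        if ret_before_file(record, callchunk):
            callchunk.remove(record)
            continue
        entry = stats.setdefault(record["caller"],
                                 {"file_api_calls": 0, "lib_calls": 0})
        # Step 2: count calls to file-related APIs that touch the input file,
        # checked against the memory values dumped during dynamic tracing.
        if is_testfile_related_api(record, stack_mem):
            entry["file_api_calls"] += 1
        # Step 3: count external library calls so that candidates covering
        # more library functionality can be preferred later.
        if record.get("type") == "external_call":
            entry["lib_calls"] += 1
    return stats
```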

Graphical interactive program

When it comes to testing GUI programs, bypassing the graphical user interface to exercise the underlying functional code can be considerably more complex. Winnie adopts a general approach to overcome GUI limitations by creating a simple program called a harness6. The harness addresses the fuzzing issues by replacing the GUI with a Command-Line Interface (CLI), preparing the execution context such as parameters and memory, and directly invoking the target function.

Due to compatibility and execution issues that prevented Winnie from running successfully in our environment, we propose and develop a new automated path tracing framework inspired by its design principles, using dynamic instrumentation with Pin19 and static analysis via IDA Pro4. We select Pin for its stable API, active support, and extensive documentation, which provide more reliable analysis than DynamoRIO. This framework introduces optimizations that mitigate performance issues in Winnie’s original instrumentation, such as minimizing redundant instruction instrumentation, thereby boosting path tracing efficiency. Additionally, Winnie’s limited analysis of function calls and data flows often misses critical paths. We address this by integrating static analysis with dynamic data to enhance control-flow insight and path coverage.

The workflow of the automated path tracing framework is as follows: the tool first executes the user-provided program and inputs in an instrumented monitoring environment. The dynamic tracing framework is used to extract the path information of the process. This module monitors the loading of libraries using Pin and records the following information. (1) The names and base addresses of the loaded modules are recorded. (2) For calls and jumps between modules, it records the thread ID, caller and callee addresses, symbol information, and parameters; for intra-module calls, only the main program module is recorded. In the absence of function prototype information, the values on the stack are saved as possible parameters. (3) When a return instruction is encountered, the function return value is recorded. (4) If any value belongs to accessible memory, it is treated as a pointer, and the memory it points to is dumped for parameter recovery and data dependency analysis. For multi-level pointers, this process is repeated recursively.

The dynamic tracing primarily focuses on a library’s external interfaces, omitting internal control flow, because the original behavior is restored via exported functions. For multi-threaded applications, thread isolation is implemented to confine the analysis to threads exhibiting file-related API invocations, thereby preventing irrelevant calls that could affect the correctness of the harness.

In the critical behavior recovery phase, library-related function calls are first copied to the harness skeleton. IDA Pro is used to obtain missing symbol and address information, such as offset address, function type, return type, function name, parameter types, and number of parameters.

Data dependency relationships describe how function parameters and return values relate across calls. The following three cases are primarily considered (a simplified sketch follows the list).

  • The return value of a previous function is used as an input parameter for subsequent functions. This is determined by checking if a parameter consistently has the same value as a previous return value.

  • The output parameter of a previous function (typically in the form of a pointer) is used as an input parameter for subsequent functions.

  • If the same non-constant value is used as a parameter in two different function calls, then those two uses are considered aliases.
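
As a rough illustration of these checks, the sketch below scans a recorded call trace for value reuse. It assumes each traced call is a dictionary with "retval" and "args" entries holding the raw values captured on the stack; the representation and the simple equality comparisons are illustrative assumptions, and the second and third cases are collapsed into a single value-reuse check.

```python
# Simplified sketch of the data-dependency recovery described above; the trace
# format is an assumption, and real parameter typing comes from the IDA Pro pass.

def recover_data_dependencies(trace):
    deps = []
    seen_retvals = {}   # value -> index of the call that returned it
    seen_args = {}      # value -> (call index, argument position) of first use
    for i, call in enumerate(trace):
        for pos, arg in enumerate(call.get("args", [])):
            if arg in seen_retvals:
                # Case 1: an earlier return value flows into a later argument.
                deps.append(("ret_to_arg", seen_retvals[arg], (i, pos)))
            elif arg and arg in seen_args:
                # Cases 2/3: the same non-constant value (e.g. an output
                # pointer) reappears as an argument of a later call -> alias.
                deps.append(("alias", seen_args[arg], (i, pos)))
            seen_args.setdefault(arg, (i, pos))
        if "retval" in call:
            seen_retvals.setdefault(call["retval"], i)
    return deps
```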

If a shared library is loaded by some software and the exported functions are called by the software during its execution, the automated path tracing framework mentioned above can be used to generate a harness for fuzzing.

Target locator

Identifying target points for DGF is essential for focused testing. We employ several methods for this purpose: static analysis tools like BinDiff are used to compare program versions and identify changed functions; code auditing uses IDA Python scripts to detect dangerous functions through name matching; and crash reproduction targets known crash locations, such as the user-mode write access violation in XnView 2.51 at offset 0x39c1da.
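
For the code-auditing path, a small IDAPython sketch of the name-matching step might look as follows; it assumes an IDA 7.x IDAPython environment, and the list of risky API names is illustrative rather than the actual project-specific list.

```python
# Hedged IDAPython sketch: flag call sites of risky APIs by name matching.
import idautils
import idc

RISKY_NAMES = ("strcpy", "strcat", "sprintf", "memcpy")  # illustrative list

def find_risky_call_sites():
    """Yield (call_site_ea, matched_function_name) pairs."""
    for func_ea in idautils.Functions():
        name = idc.get_func_name(func_ea)
        if not any(r in name.lower() for r in RISKY_NAMES):
            continue
        # Every code cross-reference to the matched function (or its import
        # thunk) is a candidate target location for directed fuzzing.
        for call_site in idautils.CodeRefsTo(func_ea, 0):
            yield call_site, name
```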

Directed greybox fuzzing component

According to different fitness metrics, this paper proposes two directed guiding modes.

  • WinDGF_path: It is a directed greybox fuzzing mode based on path distance.

  • WinDGF_keyblock: It is a directed greybox fuzzing mode based on key-block coverage.

Both modes consist of a preprocessing stage, a fitness metric calculating stage and a power scheduling stage. The power scheduling stage will be specifically explained in the Fuzzing Optimization Component.

Preprocessing stage

We obtain the distance files of basic blocks or the offset lists of key basic blocks through static analysis in preprocessing. Defining the distance to the target point is a challenge because compile-time instrumentation, as used by AFLGo2 and Hawkeye20 to build detailed CFGs and FCGs, is not applicable to closed-source binary programs. Therefore, we use the IDA Python API to recursively generate cross-referenced CFGs and FCGs for the target points. Although these graphs are local with respect to the entire application, they contain all the basic blocks and functions in the reachable target areas, providing sufficient information to calculate the required path distance.

The distance from a basic block to the target block is defined as the minimum value of the sum between the shortest distance at the basic block level and a constant multiple of the shortest distance at the function level. The shortest distance calculations are performed using the Dijkstra algorithm21. The formula can be represented as follows:

$$\begin{aligned} d(b,T_b) = min(d_{ijk}(b,b_{next})+c \times d_{ijk}(f_{next},T_f)) \end{aligned}$$
(1)

Where b refers to the specific basic block within the function, \(T_b\) represents the target basic block, \(T_f\) denotes the target function that contains \(T_b\), \(f_{next}\) denotes the next function on the reachable path to the target within the function call graph, and \(b_{next}\) represents the jump basic block that leads to \(f_{next}\). The min function selects the minimum of the summed distances over the candidate jump basic blocks \(b_{next}\) and their corresponding next functions \(f_{next}\). We set c equal to 5 to account for the higher computational cost of function calls (e.g., parameter passing, stack operations) compared to basic block jumps, balancing intra-function efficiency and cross-function exploration. Experimental results demonstrate that setting parameter \(c=5\) optimizes tool performance: in terms of target convergence speed, it achieves 16% and 11% improvements over \(c=4\) and \(c=6\), respectively. For code coverage, it outperforms the global-priority strategy of \(c=4\) by 5% and surpasses the local-intensive strategy of \(c=6\) by 18%. This configuration balances path exploration efficiency and precision, validating its robustness in vulnerability detection across diverse code structures.

If the basic block and the target basic block are within the same function, the formula for calculating the basic block distance can be represented as follows:

$$\begin{aligned} d(b,T_b) = d_{ijk}(b,T_b) \end{aligned}$$
(2)

In the example shown in Fig. 3, the target basic block is located in f6. From bb1, there are two possible paths to reach the target: bb1 can either jump to f2 through bb8 or jump to f3 through bb9. Hence the distance for bb1 is calculated as min(3+2*5, 3+1*5) = 8.
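
A toy NetworkX reproduction of this example is shown below; the hard-coded graphs stand in for the CFG/FCG edges that WinDGF recovers through IDA Python, and their exact layout is an assumption made purely to mirror Fig. 3.

```python
# Minimal sketch of Eq. (1) on the Fig. 3 example, using Dijkstra via NetworkX.
import networkx as nx

C = 5  # cost of one function-call hop relative to one basic-block hop

# Function call graph: f2 needs two calls to reach f6, f3 needs one.
fcg = nx.DiGraph([("f1", "f2"), ("f1", "f3"), ("f2", "f4"),
                  ("f4", "f6"), ("f3", "f6")])
# f1's control-flow graph: bb1 reaches both jump blocks bb8 and bb9 in 3 hops.
cfg_f1 = nx.DiGraph([("bb1", "bb2"), ("bb2", "bb3"), ("bb3", "bb8"),
                     ("bb2", "bb4"), ("bb4", "bb9")])

def block_distance(cfg, fcg, b, jump_blocks, target_func):
    """d(b, T_b) = min over (b_next, f_next) of d(b, b_next) + C * d(f_next, T_f)."""
    return min(nx.dijkstra_path_length(cfg, b, b_next)
               + C * nx.dijkstra_path_length(fcg, f_next, target_func)
               for b_next, f_next in jump_blocks)

# min(3 + 2*5, 3 + 1*5) = 8, matching the worked example above.
print(block_distance(cfg_f1, fcg, "bb1", [("bb8", "f2"), ("bb9", "f3")], "f6"))
```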

Fig. 3. (a) f1’s control-flow graph, which contains the basic blocks bb8 (jumping to f2) and bb9 (jumping to f3). (b) Example function call graph, which includes the test entry function f1 and the function f6 containing the target basic block.

In the WinDGF_keyblock mode’s preprocessing phase, we determine key basic blocks by identifying parent nodes of the target in control-flow graphs. Each node in the FCG represents a specific function call operation, with its corresponding CFG composed of multiple basic blocks. The jump basic blocks along the edges marking paths to target nodes inherently characterize the concrete execution paths of function calls.

Here is an implementation of the mentioned functionality using an IDA Python script. The script uses recursive reverse traversal and iteration over the cross-references to identify the key basic blocks in the control-flow graphs. The idautils.CodeRefsTo() function is used to obtain the cross-reference addresses, where caller_func_addrs records the calling address and callee_func_addrs records the called address. \(T_p\) represents the target program, \(T_b\) represents the starting address of the target basic block, and block(caller_func_addrs) returns the starting offset address of the basic block that contains the caller_func_addrs address. The script ultimately outputs a file containing the start and end addresses of all key basic blocks.

Algorithm 2. Extracting key basic blocks in static analysis.

Fitness metric calculating stage

The WinDGF_path mode. The path distance is the average distance of the covered basic blocks on the path. In terms of implementation, we first set up two counters, one to accumulate the distance of all basic blocks and another to count the number of basic blocks on the path, both initially set to zero. For each basic block in the executed path, if it is part of a feasible path (i.e., it has a precomputed distance), we increment the accumulated distance by the distance of the basic block and increment the block count by one. Finally, the path distance is calculated by dividing the accumulated distance by the number of counted basic blocks.

The WinDGF_keyblock mode. In DGF based on key-block coverage, the fitness metric is set as the proportion of basic blocks covered by the test case in the key regions. This is because higher coverage indicates that the test case execution spends more time in the key regions, getting closer to the target point. Similarly, we set up two counters: one to calculate the number of key basic blocks identified during the preprocessing phase that are included in the test case execution path; the other to tally the total number of all basic blocks covered by the test case execution path. The ratio of them is the coverage rate of the test case.
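
A back-of-the-envelope sketch of both fitness metrics is given below. It assumes the instrumentation hands back the list of basic-block offsets covered by one execution, with `distances` and `key_blocks` coming from the preprocessing stage; the container types are assumptions for illustration.

```python
# Minimal sketch of the two fitness metrics computed from one execution trace.

def path_distance(trace, distances):
    """WinDGF_path: average precomputed distance over covered blocks that lie
    on a feasible path to the target (i.e. appear in the distance file)."""
    total, count = 0.0, 0
    for bb in trace:
        if bb in distances:
            total += distances[bb]
            count += 1
    return total / count if count else float("inf")

def key_block_coverage(trace, key_blocks):
    """WinDGF_keyblock: share of covered blocks that fall in the key regions."""
    covered = set(trace)
    return len(covered & key_blocks) / len(covered) if covered else 0.0
```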

Fuzzing optimization component

The simulated annealing power scheduling algorithm, which allocates more energy to test cases closer to the target location, has shown promising results in the context of DGF. However, it also has limitations. Over time, it can cause the fuzzer to repeatedly traverse the same or previously explored paths, consuming considerable time without necessarily improving path exploration or coverage. It may also overlook test cases with longer call chains or more complex execution paths22. These limitations stem from a fundamental trade-off in fuzzing optimization: the conflict between broad exploration (discovering new paths) and focused exploitation (optimizing high-value seeds).

Past research on AFL has addressed the issue of path fairness. AFLFast, which is based on Markov chains, proposes a strategy that biases AFL towards low-frequency paths23. FairFuzz introduces a mutation mask algorithm that improves the coverage of rare branches24. Both optimization approaches can be applied to WinAFL with appropriate modifications. However, these methods primarily operate in a single-phase manner, which may lead to suboptimal trade-offs between breadth-first exploration and depth-first exploitation.

To address the above challenges, we propose a multi-phase strategy that dynamically adapts to the fuzzing progress, where time t is defined as the transition point from the simulated annealing exploration phase to the exploitation phase. During the exploration phase, the fuzzer accepts all possible seeds, exploring a wide range of possibilities. In the exploitation phase, the emphasis shifts towards optimizing the fitness metric, and more energy is allocated to seeds with higher fitness scores.

The energy calculation formula based on path distance for a given test case s and target location \(T_b\) can be expressed as follows:

$$\begin{aligned} p(s,T_b) = (1 - \widetilde{d}(s,T_b))\times (1-T)+0.5T \end{aligned}$$
(3)
$$\begin{aligned} P(s,T_b) = P(s)\times 2^{10 \cdot p(s,T_b)-5} \end{aligned}$$
(4)

In the formulas above, T is the current temperature, P(s) represents the energy value determined by WinAFL based on the execution speed, coverage path count, and path depth of the test case, and \(factor = 2^{10 \cdot p(s,T_b)-5}\) is the power factor. The formula to calculate the normalized path distance \(\widetilde{d}(s,T_b)\) for a test case is as follows:

$$\begin{aligned} \widetilde{d}(s,T_b) = \frac{d(s,T_b)-minD}{maxD-minD} \end{aligned}$$
(5)

We can calculate the normalized distance for each seed within the range (0,1) based on the current maximum distance maxD and minimum distance minD.

In the WinDGF_keyblock mode, the corresponding energy calculation formula can be rewritten as follows:

$$\begin{aligned} p(s,T_b) = (1 - \widetilde{cov}(s,T_b))\times (1-T)+0.5T \end{aligned}$$
(6)
$$\begin{aligned} P(s,T_b) = P(s)\times 2^{10 \cdot p(s,T_b)-5} \end{aligned}$$
(7)

In the formulas above, \(\widetilde{cov}(s,T_b)\) is the normalized coverage rate of the test case, expressed as a percentage.

After 10*t time units, to address the issue of incomplete path exploration, additional energy is allocated to low-frequency paths, along with the fitness-driven allocation. Two relevant metrics are considered to measure low-frequency paths: the number of times a test case is selected from the queue and the frequency of execution paths. The total energy in this phase can be expressed as:

$$\begin{aligned} E(s) = P(s,T_b)+k\times \frac{1}{f_s+f_e+1} \end{aligned}$$
(8)

Where k is a coefficient used to balance the fitness energy and the low-frequency path energy, \(f_s\) represents the number of times the test case is selected, and \(f_e\) represents the frequency of execution paths.
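
The sketch below strings Eqs. (3)-(8) together into one scheduling routine. The seed fields, the use of elapsed time for the phase boundaries t and 10t, and the default value of k are illustrative assumptions; WinAFL's baseline energy P(s) and the annealing temperature T are taken as inputs.

```python
# Hedged sketch of the three-phase power schedule (Eqs. 3-8).
from dataclasses import dataclass

@dataclass
class SeedState:            # field names are illustrative, not WinDGF's structs
    base_energy: float      # P(s) as computed by the WinAFL baseline
    fitness: float          # d(s, T_b) in path mode, or coverage in keyblock mode
    min_f: float            # current minimum fitness over the queue
    max_f: float            # current maximum fitness over the queue
    temperature: float      # simulated-annealing temperature T
    times_selected: int     # f_s: how often the seed was picked from the queue
    path_frequency: int     # f_e: how often its execution path was seen

def annealed_energy(base_energy, norm_fitness, temperature):
    """Eqs. (3)/(6) and (4)/(7): anneal normalised fitness into a power factor."""
    p = (1.0 - norm_fitness) * (1.0 - temperature) + 0.5 * temperature
    return base_energy * 2 ** (10.0 * p - 5.0)

def schedule(s: SeedState, elapsed: float, t: float, k: float = 1.0) -> float:
    norm = (s.fitness - s.min_f) / max(s.max_f - s.min_f, 1e-9)     # Eq. (5)
    if elapsed < t:                 # exploration phase: no directed bias yet
        return s.base_energy
    energy = annealed_energy(s.base_energy, norm, s.temperature)    # Eqs. (3)-(4)
    if elapsed < 10 * t:            # exploitation phase: fitness-driven only
        return energy
    # Late phase: add a bonus for low-frequency paths, Eq. (8).
    return energy + k / (s.times_selected + s.path_frequency + 1)
```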

Implementation

We implement WinDGF based on WinAFL and Winnie. It primarily consists of two components: the adapter module and the fuzzer.

The adapter module. In the dynamic instrumentation part of the adapter module, we use Pin to instrument the program and gather function call paths and execution contexts. In the static analyzer, we rely on the IDA Python API to obtain information such as function types and parameter types, which are then serialized into a JSON file. Additionally, we have separately developed a script for generating harnesses and a script for performing statistical analysis on function calls to obtain candidate offsets for the target functions. The adapter module comprises approximately 1500 lines of C code and 2000 lines of Python code.

The fuzzer. In the preprocessing stage of the fuzzer, we utilized the Python interface provided by IDA Pro to construct cross-referenced function call graphs and control-flow graphs based on the predefined target locations in the test program. Then, using NetworkX, we built dominance trees for each function call graph and control-flow graph. For the fuzzing stage, we made the following modifications: first, we added distance files or key basic block address files as input parameters to the dynamic instrumentation part, which computes the accumulated distance of basic blocks, the number of covered basic blocks or key basic blocks, and the total number of basic blocks in the execution trace, and writes them into shared memory. Additionally, in the fuzzing execution, we added code to compute the fitness metrics and improved the original energy scheduling strategy by adopting the simulated annealing algorithm. The fuzzer component comprises approximately 200 lines of C code and 400 lines of Python code.
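
As an illustration of the dominance-tree step, the snippet below applies NetworkX's immediate_dominators() to a toy CFG; the graph itself is made up for illustration, whereas in WinDGF the edges come from the IDA Pro preprocessing.

```python
# Hedged sketch: build a dominator tree for a recovered CFG with NetworkX.
import networkx as nx

cfg = nx.DiGraph([("entry", "a"), ("entry", "b"), ("a", "c"), ("b", "c"),
                  ("c", "target")])

# immediate_dominators() maps every reachable node to its immediate dominator.
idom = nx.immediate_dominators(cfg, "entry")

# Rebuild the dominator tree as its own DiGraph (edge: idom -> node).
dom_tree = nx.DiGraph((dom, node) for node, dom in idom.items() if node != dom)
print(sorted(dom_tree.edges()))
# [('c', 'target'), ('entry', 'a'), ('entry', 'b'), ('entry', 'c')]
```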

It is worth noting that in the adapter module, we only focus on the main program and external libraries related to file processing. For multi-threaded applications, we only consider the threads that invoke APIs related to file operations. This is done to avoid adding irrelevant calls that could compromise the correctness of the harness. Additionally, we must balance graph completeness with time overhead in constructing CFGs and FCGs. Choosing proper iteration counts and algorithm parameters is essential for accurate and efficient fuzzing. To facilitate future research, we have open-sourced WinDGF at https://github.com/mineechor/WinDGF.git.

Evaluation

To evaluate the effectiveness of WinDGF, we conducted experiments aiming to answer three research questions:

RQ1: How does WinDGF’s automated target adaptation mechanism enable effective fuzzing across diverse Windows application types?

RQ2: How does WinDGF improve the capabilities of exploration, coverage, and crash triggering compared to WinAFL and Winnie?

RQ3: How does WinDGF perform in terms of reaching target code locations and discovering known crashes?

Evaluation setup

We selected 10 target programs across three typical categories of Windows applications, including Command-line programs, GUI programs, and shared libraries, to evaluate WinDGF’s performance, as listed in Table 2. Our evaluation compared WinDGF with two baseline fuzzers, WinAFL and Winnie. WinAFL is developed based on AFL and specifically designed for the Windows operating system, having been widely applied and proven to be extremely effective on the Windows platform. WinDGF further optimizes performance on top of WinAFL, especially in enhancing adaptability and directed testing capabilities. Additionally, Winnie can automatically generate harnesses for Windows binaries to bypass GUI code. The comparison with Winnie shows that WinDGF can produce higher quality harnesses. Due to the incompleteness of Winnie’s open-source code and its discontinuation of maintenance, we developed our own version by referencing Winnie’s semi-automated harness generation logic and WinAFL’s fuzzing framework. We conducted ten 24-hour rounds of comparative tests with WinDGF_path, WinDGF_keyblock, WinAFL, and Winnie. For the crash reproduction experiments, we replicated each crash five times in 3-hour runs, except for CVE-2023-27655, for which each run lasted four hours. We gathered various file types from GitHub as initial seeds.

Table 2 Target programs with version and reference links.

Test environment

The experimental environment was an AMD Ryzen 9 5900HS processor (with Radeon Graphics) at 3.30 GHz running a 64-bit Windows 10 22H2 operating system. Before testing, it is necessary to run the command “gflags /p /enable ImageFileName /full” to enable full page heap verification for the process, which helps detect heap-related memory corruption issues during testing.

Result analysis

RQ1: target program adaptation and target offset selection

Scheme 1 To answer RQ1, we conducted fuzzing using WinDGF on 10 Windows platform applications. Table 2 illustrates how these applications are adapted using WinDGF’s automated offset generation or automated harness generation capabilities. The table also describes which crash points we selected from the program for subsequent crash reproduction experiments. By testing different types of applications, we were able to evaluate the generality and applicability of WinDGF to assess its effectiveness and capabilities in fuzzing Windows applications.

Finding 1 WinDGF utilizes dynamic instrumentation and static script analysis to automatically identify target functions and generate customized harness programs. WinDGF can cover a wide range of testing scenarios for various Windows applications, such as CLI applications, GUI applications, and shared libraries. It realizes “one-click” Windows fuzzing through full automation of the testing workflow, from intelligent target identification to execution, eliminating manual GUI event simulation and noticeably boosting vulnerability discovery efficiency.

RQ2: overall performance evaluation

Scheme 2 The overall performance evaluation experiment evaluates the performance based on four aspects: unique_crashes, paths_total, bitmap_coverage, and max_depth. The average values of these four metrics are calculated from the ten rounds of experiments.

  • unique_crashes: It represents the total number of different crashes discovered through fuzzing, with “unique” denoting those arising from diverse execution paths or branch entries.

  • paths_total: It signifies the cumulative valid test cases generated through mutation up until now, reflecting the overall number of execution paths explored.

  • bitmap_coverage: It measures the extent of the program’s code executed during fuzzing, with higher bitmap coverage indicating greater code coverage by the fuzzer.

  • max_depth: It represents the maximum depth or generation number of the test cases.

The results presented in Table 3 and Fig. 4 show WinDGF outperforms WinAFL, with the two guided modes demonstrating superior performance across metrics: unique_crashes, paths_total, bitmap_coverage, and max_depth. On average, WinDGF_path shows improvements of 31.72%, 5.95%, 5.21%, and 14.30% in these metrics, while WinDGF_keyblock shows average improvements of 79.48%, 6.85%, 5.25%, and 16.42% respectively. Additionally, the marked rise in the number of unique crashes relative to the total number of execution paths and bitmap coverage rate indicates that WinDGF focuses more on the quality of code coverage rather than the quantity. By concentrating on specific paths and critical areas that are prone to vulnerabilities, WinDGF makes more effective use of test cases under limited resources, enhancing the efficiency of vulnerability detection. Furthermore, the increase in maximum depth demonstrates that WinDGF explores more intricate execution paths, potentially uncovering deeper vulnerabilities. Overall, WinDGF’s design enhances fuzzing effectiveness, particularly in improving code coverage quality and detection efficiency.

Table 3 Overall performance data.
Fig. 4. Performance comparison between WinDGF, WinAFL, and Winnie: (a) comparison of the unique_crashes metric, (b) comparison of the paths_total metric, (c) comparison of the bitmap_coverage metric, (d) comparison of the max_depth metric.

The results presented in Table 3 and Fig. 4 highlight that WinDGF outperforms Winnie across three metrics: unique_crashes, bitmap_coverage, and max_depth. On average, WinDGF_path shows improvements of 5.96%, 4.53% and 7.49% in these metrics, while WinDGF_keyblock shows average improvements of 33.25%, 5.68% and 5.32% respectively. These data indicate that harnesses generated by WinDGF can precisely cover key functions within the program, thereby enhancing the coverage of critical code areas and testing efficiency. It should be noted that the discovery of crash points in a program depends not only on the effectiveness of the testing strategy but also on the program’s inherent characteristics. For example, crash points in some programs may be located in shallower areas of the code, making them easier to detect. Additionally, WinDGF’s performance in the paths_total metric is not as strong as Winnie’s, which may be due to the fact that WinDGF-generated harnesses focus more on specific code paths. This focus aids in the in-depth exploration of critical paths, increasing coverage of these routes, but may also reduce the exploration of other paths, leading to a decrease in the total number of paths. Compared to Winnie, WinDGF produces higher quality harnesses that precisely cover key functions, boosting testing efficiency and depth. Although this may lead to fewer total test paths, experiments show that WinDGF more effectively identifies and locates vulnerabilities within limited testing cycles, validating its strategic value.

Comparing the two WinDGF modes, WinDGF_keyblock surpasses WinDGF_path in unique crashes for 5 out of 7 programs and in total paths for 7 out of 11 programs, also covering more critical areas in 5 out of 11 cases. WinDGF_keyblock targets key blocks vital for security and functionality, which often contain complex logic and are prone to vulnerabilities. This focus increases the likelihood of triggering unique crashes and encourages test cases that explore various paths within these areas, raising the potential for discovering diverse crashes. Conversely, WinDGF_path aims to minimize the path distance to targets, potentially concentrating on specific paths and reducing exploration of other vulnerable paths, leading to fewer unique crashes. WinDGF_path is ideal for complex programs requiring deep execution branch exploration, generating test cases to shorten paths to targets. Once the shortest path is found, it may limit further path exploration, resulting in a lower total path count. While WinDGF_keyblock enhances coverage in critical areas, overall bitmap coverage might be limited if these areas are a small part of the entire codebase. WinDGF_keyblock is better for scenarios needing extensive coverage of critical code areas, such as security-critical applications exploring various potential vulnerabilities. It provides broader and deeper exploration compared to WinDGF_path, which is more focused and suited for targeted testing tasks like specific function or vulnerability testing.

Finding 2 WinDGF demonstrates superior performance compared to WinAFL and Winnie, particularly in the four key metrics. Although WinDGF may have fewer total paths than Winnie, this reflects its more precise and targeted testing approach, especially in improving code coverage quality and vulnerability detection efficiency. WinDGF can more effectively discover and locate vulnerabilities within a limited testing cycle, making it a powerful fuzzer. Additionally, WinDGF_keyblock and WinDGF_path each have their strengths, fitting different testing needs. WinDGF_keyblock is better for extensive coverage of critical code areas, while WinDGF_path is suited for scenarios with specific target paths. The selection of an optimal mode hinges on specific testing objectives and required coverage depth, where a nuanced understanding of each mode’s focal mechanisms and comparative advantages proves critical for maximizing both testing efficacy and operational efficiency.

RQ3: crash reproduction

Scheme 3 In the crash reproduction experiment, the average time taken for the first test case to trigger a target error is recorded for each experiment. The experiment aims to assess whether the fuzzer can consistently reproduce target crashes within a reasonable timeframe. Crash reproduction experiments are commonly used to evaluate the directed nature of fuzzers. The results are presented in Table 4.

Table 4 Crash reproduction performance data.
  • \(\mu\)TTE(s): It represents the average time to reproduce the crash.

  • F: It represents the multiplicative relationship between \(\mu\)TTE(s) of the fuzzer and WinAFL.

  • A: It represents the probability of WinDGF_path and WinDGF_keyblock outperforming WinAFL.

Based on the experimental data, it can be observed that both WinDGF_path and WinDGF_keyblock are able to reproduce the majority of specified crashes and their corresponding stack traces. This demonstrates that both approaches have directional capabilities in targeting specific crashes. Furthermore, the experiments show that for each target point, both WinDGF_path and WinDGF_keyblock have a higher number of successful crash reproductions compared to WinAFL. Additionally, compared to WinAFL, WinDGF_path has a shorter average reproduction time for 91% of the target points, while WinDGF_keyblock has a shorter average reproduction time for 73% of the target points.

For the vulnerability point FORMATS!ReadG3_W+0x3bb, both WinDGF_path and WinDGF_keyblock have longer average reproduction times compared to WinAFL. In this study, the construction of program control-flow graphs and function call graphs is done using an iterative recursive reverse search for parent nodes. Sufficient iterations are required for this process, but due to limitations in device resources, time constraints, and the complexity of the control-flow graph at the vulnerability point, a smaller number of iterations is chosen. As a result, it is not possible to generate complete basic block distance files and key basic block address offset files, which further affects the calculation of fitness metrics and the precision of power scheduling during the fuzzing process.

F comparison: The F metric directly reflects and measures the difference in capabilities between two modes in terms of crash reproduction or discovery efficiency. Specifically, F is calculated as the ratio of the average time-to-exposure (TTE) of the compared modes to that of the baseline fuzzer WinAFL.

If F < 1, it indicates that the compared mode performs better, with a shorter TTE than the baseline fuzzer.

If F > 1, it suggests the opposite: the compared mode has a longer TTE and performs worse.

The F comparison chart between WinDGF_path and WinDGF_keyblock is shown in Fig. 5a. The numbers 1–11 correspond to the target sites with serial numbers 1–11 in Table 4. In all target site evaluations, WinDGF_path achieves consistently shorter average reproduction times than WinDGF_keyblock, demonstrating a more substantial improvement over the baseline. The F metric directly demonstrates this performance difference between the fuzzers.

Fig. 5. (a) Comparison chart of F-measure, (b) comparison chart of A-measure.

A comparison: A represents the probability of the two directional modes having shorter times than WinAFL in the five-round reproduction experiment. Based on Fig. 5b, which compares metric A, it can be observed that for 91% of the target sites, WinDGF_path has a higher effect size A compared to WinDGF_keyblock. A higher probability indicates better directed performance.

In summary, WinDGF_path outperforms WinDGF_keyblock overall in terms of reproduction count, average reproduction time \(\mu\)TTE, and effect size metric A. This indicates that WinDGF_path has better directed performance compared to WinDGF_keyblock. One possible reason for this difference could be that WinDGF_keyblock defines the fitness metric as the frequency of covering the key blocks. In this approach, there is no weight distinction for each basic block in the key region list generated through static analysis. This means that during the accumulation of frequencies, the fuzzer treats all key basic blocks equally. This equal treatment of key basic blocks weakens the directed performance to some extent.

Finding 3 WinDGF’s adaptive approach of optimizing the fitness functions and adopting a simulated annealing algorithm makes it highly capable of driving targeted code exploration and reliably reproducing known vulnerabilities.

Case study

XnView: Based on the information in the table generated by the automated script, a suitable target function offset is selected. The table contains five columns: the first column gives the target function offset and can be considered a candidate for the target_offset; the second column indicates the number of times the function calls library functions; the third column gives the linear address; the fourth column gives the number of times the function calls file-related APIs during program execution; and the fifth column lists the depth of each of those file-related API calls.

After analyzing the target offset information table corresponding to XnView, the offset 0x24aae0 has been selected as the target offset for fuzzing. Specifically, the target function at offset 0x24aae0 calls library functions 10 times, and the number of file-related API calls is 2, with a call depth of 11 each. Compared to other candidates, on the one hand, the target function at offset 0x24aae0 is known to call a substantial number of library functions, which ensures that fuzzing can cover more library function logic. On the other hand, it has a relatively low depth of file-related API calls, which can accelerate the testing process, making the fuzzing process more efficient.
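
A tiny sketch of this selection heuristic is shown below; the row layout mirrors the five-column table described above, while the scoring rule itself is an illustrative assumption rather than WinDGF's exact criterion.

```python
# Hedged sketch: rank candidate target offsets from the generated table.
def rank_candidates(rows):
    """rows: (offset, lib_calls, linear_addr, file_api_calls, depths).
    Prefer many library calls (broader coverage) and shallow file-API depth."""
    def score(row):
        _offset, lib_calls, _addr, file_api_calls, depths = row
        avg_depth = sum(depths) / len(depths) if depths else float("inf")
        return lib_calls + file_api_calls - avg_depth
    return sorted(rows, key=score, reverse=True)

# XnView candidate from the text: 10 library calls and 2 file-related API
# calls at depth 11 each (the linear address is a placeholder here).
best = rank_candidates([(0x24aae0, 10, 0x0, 2, [11, 11])])[0]
print(hex(best[0]))   # 0x24aae0
```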

ABC viewer: In this study, the automated path tracing framework is used to generate a harness for tracking the calls to the FreeImage.dll library for testing purposes. The target function selected for testing is fuzz_me.

Figure 6a and b represent the changes in test case energy over time during the fuzzing of the tested program XnView using the WinDGF_path and WinDGF_keyblock modes, respectively. Figure 6c and d depict the changes in test case energy over time during the fuzzing of the tested program ABC Viewer using the WinDGF_path and WinDGF_keyblock modes, respectively.

It can be observed that in the initial exploration phase, the distribution of test case energy is relatively even. However, in the later exploitation phase, there is noticeable fluctuation in the test case energy. This indicates that the fitness metric has a dominant influence on the energy scheduling.

Fig. 6. (a) The energy variation over time of test cases using WinDGF_path mode to fuzz XnView, (b) the energy variation over time of test cases using WinDGF_keyblock mode to fuzz XnView, (c) the energy variation over time of test cases using WinDGF_path mode to fuzz ABC Viewer, (d) the energy variation over time of test cases using WinDGF_keyblock mode to fuzz ABC Viewer.

Discussion

The closed nature of the Windows system, the widespread use of DLLs, the complexity of the import table structure, and the dependence on user interaction pose significant challenges for implementing DGF on Windows. To address these challenges, we designed and implemented two DGF modes for the Windows platform: WinDGF_path and WinDGF_keyblock.

The WinDGF_path mode is based on path distance calculation, where the principle is to optimize the effectiveness of testing by calculating the average path distance of basic blocks in the program execution path. This mode provides more precise directional guidance compared to simply recording basic block transition counts and is better suited for scenarios where a gradual approach towards the target code block is required. In such cases, path distance can effectively guide fuzzing towards certain specific paths of the program and iteratively reduce the distance to generate valuable test cases. The WinDGF_path mode is particularly useful for scenarios that require a deep understanding of the complex structure of programs, such as vulnerability discovery in large-scale system applications. In experiments, WinDGF_path achieved improvements of 31.72%, 5.95%, 5.21%, and 14.30% over WinAFL in terms of unique crash counts, total path counts, bitmap coverage, and maximum depth, respectively, demonstrating its advantages in complex path exploration.

The WinDGF_keyblock mode employs a key basic block-based DGF model, where the principle is to measure fitness by calculating the proportion of basic blocks covered in these key areas by test cases. Compared to the path distance-based method, it only focuses on the offset of key basic blocks on reachable paths, eliminating the need for complex distance calculations. Therefore, the WinDGF_keyblock mode incurs lower overhead during the static analysis phase and provides more accurate analysis results. By maximizing key block coverage, it helps fuzzing to find more potential defects in these areas as much as possible. This mode is suitable for scenarios requiring fast and targeted detection of specific functions or sensitive parts of the program, especially in security-sensitive applications. In experiments, the WinDGF_keyblock mode improved unique crash counts, total path counts, bitmap coverage, and maximum depth by 79.48%, 6.85%, 5.25%, and 16.42%, respectively, compared to WinAFL, demonstrating its efficiency in detecting key regions.

Despite the superior performance of both modes, they also have certain shortcomings. For example, current methods are still insufficient to fully cover all paths and scenarios when dealing with complex Windows security mechanisms and APIs. The dynamic instrumentation tool Pin also exhibits some instability. Additionally, WinDGF relies on IDA Pro to extract CFGs and FCGs from large-scale closed-source programs, which incurs considerable time costs. While IDA Pro, as a proprietary and non-open-source tool, entails high expenses, its substantial cost is directly tied to its irreplaceability in complex reverse engineering scenarios; for enterprise users, this investment can be justified through multi-project reuse and efficiency gains. To address these toolchain challenges, we plan to further optimize the existing tools and explore possible alternatives, including a potential migration from IDA Pro to modern binary analysis platforms such as Binary Ninja and angr, to enhance efficiency and stability in future work. We will also test a broader range of applications, especially larger-scale programs, to improve the stability and reliability of the tools. Moreover, future research will include collecting other mainstream fuzzing tools on the Windows platform for comparison with WinDGF to more accurately demonstrate the improvements and advantages of WinDGF over existing technologies.

Related work

This section reviews related work from methodological innovation and platform adaptation perspectives. First, we examine the core principles and optimization approaches of coverage-guided greybox fuzzing. Next, we analyze technical breakthroughs in directed greybox fuzzing for specific target triggering. Finally, we discuss unique challenges in Windows platform fuzzing. By examining the limitations of existing studies, we establish the theoretical foundation for proposing our WinDGF innovation.

Coverage-based Greybox fuzzing

Fuzzing is a robust, lightweight, and automated technique for discovering vulnerabilities. One of the most prominent examples is AFL (American Fuzzy Lop)18, which stands as a cornerstone in fuzzing research, sparking widespread adoption within the academic community and inspiring a multitude of derivative tools. As a highly modular framework, FOT35 achieves functional decoupling, addressing the issues of high code coupling and limited extensibility in AFL. FPAFL36 proposes a format protection method based on the Naive Bayes classifier, optimizing AFL’s mutation position selection. CDFuzz37 breaks through the coverage bottleneck with a targeted dictionary-based technique and proposes a lightweight path exploration solution. WingFuzz38 proposes the concept of data coverage to address the limitations of traditional code coverage guidance, where program structures fail to fully reflect runtime execution semantics.

Directed Greybox fuzzing

Most greybox fuzzing tools, like AFL, are coverage-oriented and aim to cover as many program states as possible. However, this blind exploration of the program state space often spends most of its time in bug-free code and struggles to focus on corner cases. Introduced in 2017, AFLGo pioneered directed greybox fuzzing, a method that focuses on specific target locations in the program. Owing to its lighter-weight program analysis compared with directed whitebox approaches, directed greybox fuzzing is widely used for patch testing, crash reproduction, static analysis report validation, and information flow detection.

Directed greybox fuzzing considers four main aspects: (1) implementation of direction, (2) acquisition and utilization of feedback information, (3) selection and scheduling of test cases, and (4) mutation strategy for test cases. Many follow-up works address these aspects. SDFuzz39 automatically extracts target states from vulnerability reports and static analysis results, employing selective dynamic instrumentation techniques to achieve precise test scope control, thereby enhancing vulnerability localization efficiency. AFLRun40 adds an additional raw bitmap to each covered target to track the seed coverage state of hit targets, enabling real-time monitoring of the state transition trajectories of seeds in target coverage dimensions. DeepGo41 effectively guides fuzzing to target paths by combining historical and predictive information, utilizing deep neural networks and reinforcement learning techniques. PDGF42 abstracts directed fuzzing as a path search optimization problem and proposes a predecessor region-aware guidance mechanism. Halo43 adopts program invariant inference techniques, deducing potential invariants based on input features of reachable and unreachable targets, thereby constraining the search space for subsequent input generation. Titan44 leverages correlations between different targets in programs to distinguish the potential of seeds in reaching each target, and identifies bytes whose mutation can alter these targets simultaneously.

In summary, directed greybox fuzzing has been explored extensively, but most research focuses on improving fuzzing for Linux applications.

Fuzzing on the Windows platform

Windows programs are often prone to memory safety issues, and in the past researchers have discovered numerous vulnerabilities through manual auditing. In fact, Windows holds a market share of up to 70%, and its applications, as user-facing endpoints, are a primary target for malicious attackers. To bring large-scale automated fuzzing to this ecosystem, Ivan Fratric created WinAFL, based on lcamtuf's AFL. In recent years, research results on Windows platform fuzzing have begun to appear at various conferences and in journals. SpotFuzzer12 proposes a static instrumentation tool for Windows binaries, which can be used to select instrumentation points or restrict the target area. SiCsFuzzer13 adopts a tracing strategy based on sparse instrumentation, combined with a “warming up” optimization, greatly reducing time consumption. WinFuzz14 proposes target-embedded snapshotting, which restores process state at the application layer and bypasses kernel dependencies, to address low fuzzing efficiency and kernel compatibility limitations in Windows’ closed-source environment.

WinAFL is a port of AFL to Windows. It uses DynamoRIO to collect code coverage and shared memory to report the coverage of each test case back to the fuzzer. This port brought coverage-guided fuzzing of closed-source programs to the Windows platform. However, as a typical coverage-guided fuzzing tool, it also inherits the problem of unreasonable resource allocation. Moreover, WinAFL cannot be directly applied to most Windows applications, because it relies on persistent testing, which repeatedly invokes a specific target function that parses the input file and returns on its own. Determining such a target requires substantial reverse engineering, and WinAFL itself has no mechanism for handling manual interaction and idling in GUI programs. SpotFuzzer, SiCsFuzzer and WinFuzz study the fuzzing object, instrumentation methods and fuzzing efficiency, respectively, but none of them introduces a directed guidance strategy or studies how to drive the program towards a predetermined target site faster.

On the other hand, WinDGF, built on the WinAFL framework, combines hybrid analysis to handle both the prerequisite conditions required by target functions and the characteristic behaviors of common GUI applications. By analyzing how the target program executes under given inputs, it can automatically identify functions or simple programs that can serve as test objects. This means WinDGF retains the extensibility of WinAFL and naturally adapts to most Windows applications. In addition, WinDGF extracts the cross-referenced control-flow graph and function call graph of closed-source programs with the IDA static analysis tool, computes fitness metrics under two modes, path distance and key-block coverage, and uses them to guide fuzzing towards the target site faster. This successfully brings directed fuzzing to the Windows platform.
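
As an illustration of this extraction step, the fragment below is a rough IDAPython sketch intended to run inside IDA Pro; it is not the script shipped with WinDGF, and the output format and file name are hypothetical. It dumps call-graph edges (which functions call each function) and CFG edges (basic-block successors) for every recognized function.

# Rough IDAPython sketch: dump call-graph and CFG edges of the loaded binary.
import idaapi
import idautils
import idc

def dump_graphs(out_path):
    with open(out_path, "w") as out:
        for func_ea in idautils.Functions():
            name = idc.get_func_name(func_ea)
            # Call-graph edges: every code reference into this function.
            for ref in idautils.CodeRefsTo(func_ea, 0):
                out.write("CALL %s -> %s\n" % (idc.get_func_name(ref), name))
            # CFG edges: successors of each basic block in the function.
            for block in idaapi.FlowChart(idaapi.get_func(func_ea)):
                for succ in block.succs():
                    out.write("CFG %#x -> %#x\n" % (block.start_ea, succ.start_ea))

dump_graphs("windgf_graphs.txt")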

Conclusion

This paper focuses on DGF for the Windows platform. It addresses the limitations of WinAFL by introducing automated generation of target offset candidates, and, building on the optimizations of Winnie, develops an automated path tracing framework to overcome the obstacles of testing GUI programs. The paper proposes two DGF strategies guided by path distance and key-block coverage, which demonstrate higher coverage and exploration capability than traditional methods. The approach is validated through performance evaluation and crash reproduction experiments, showing its effectiveness in identifying vulnerabilities and program anomalies. However, further validation of the tool in diverse application scenarios is needed, and improvements in constructing call graphs and control-flow graphs, as well as implementing and evaluating optimizations for path exploration, are recommended for future research. Overall, this study contributes to DGF on Windows, with potential applications in patch testing, vulnerability reproduction, and static analysis report verification.