Evaluating Fuzz Testing

TLDR

Fuzz testing has successfully uncovered security bugs, and recent research has focused on developing new techniques, strategies, and algorithms. This study asks what experimental setup is needed to produce trustworthy results when evaluating new fuzzing techniques. The authors surveyed the experimental evaluations in 32 fuzzing papers and found problems in every one. Their own extensive evaluation with an existing fuzzer showed that these problems can lead to wrong or misleading assessments, and they conclude with guidelines for making evaluations more robust.

Abstract

Fuzz testing has enjoyed great success at discovering security-critical bugs in real software. Recently, researchers have devoted significant effort to devising new fuzzing techniques, strategies, and algorithms. Such new ideas are primarily evaluated experimentally, so an important question is: What experimental setup is needed to produce trustworthy results? We surveyed the recent research literature and assessed the experimental evaluations carried out by 32 fuzzing papers. We found problems in every evaluation we considered. We then performed our own extensive experimental evaluation using an existing fuzzer. Our results showed that the general problems we found in existing experimental evaluations can indeed translate to actual wrong or misleading assessments. We conclude with some guidelines that we hope will help improve experimental evaluations of fuzz testing algorithms, making reported results more robust.
