TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time

Abstract

Is Android malware classification a solved problem? Published F1 scores of up to 0.99 appear to leave very little room for improvement. In this paper, we argue that results are commonly inflated due to two pervasive sources of experimental bias: "spatial bias" caused by distributions of training and testing data that are not representative of a real-world deployment; and "temporal bias" caused by incorrect time splits of training and testing sets, leading to impossible configurations. We propose a set of space and time constraints for experiment design that eliminates both sources of bias. We introduce a new metric that summarizes the expected robustness of a classifier in a real-world setting, and we present an algorithm to tune its performance. Finally, we demonstrate how this allows us to evaluate mitigation strategies for time decay such as active learning. We have implemented our solutions in TESSERACT, an open source evaluation framework for comparing malware classifiers in a realistic setting. We used TESSERACT to evaluate three Android malware classifiers from the literature on a dataset of 129K applications spanning over three years. Our evaluation confirms that earlier published results are biased, while also revealing counter-intuitive performance and showing that appropriate tuning can lead to significant improvements.
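
To make the two sources of bias concrete, the following is a minimal sketch of a time-aware split and a class-ratio check. It is not the TESSERACT API: the function names (`temporal_split`, `violates_temporal_constraint`, `malware_ratio`), the `(date, label)` sample layout, and the toy data are all assumptions made for illustration only.

```python
from datetime import date


def temporal_split(samples, split_date):
    """Split (timestamp, label) pairs so every training object
    predates every testing object (hypothetical helper)."""
    train = [s for s in samples if s[0] < split_date]
    test = [s for s in samples if s[0] >= split_date]
    return train, test


def violates_temporal_constraint(train, test):
    """Return True if any training object is dated on or after
    the earliest testing object, i.e. the classifier would be
    trained on knowledge from the 'future'."""
    if not train or not test:
        return False
    latest_train = max(t for t, _ in train)
    earliest_test = min(t for t, _ in test)
    return latest_train >= earliest_test


def malware_ratio(samples):
    """Fraction of objects labelled malware (label == 1); useful for
    checking whether a test set resembles the expected deployment mix."""
    return sum(label for _, label in samples) / len(samples)


# Hypothetical toy data: (date, label), 1 = malware, 0 = goodware.
samples = [
    (date(2014, 3, 1), 0),
    (date(2014, 9, 15), 1),
    (date(2015, 2, 10), 0),
    (date(2015, 8, 5), 1),
    (date(2016, 1, 20), 0),
]

train, test = temporal_split(samples, date(2015, 1, 1))
```

A random shuffle of the same samples would generally fail the `violates_temporal_constraint` check, which is the kind of impossible configuration the abstract refers to as temporal bias. Comparing `malware_ratio(test)` against the malware percentage expected in deployment gives a rough indication of spatial bias.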