Publication | Closed Access

MutantX-S: scalable malware clustering based on static features

Citations: 89

References: 16

Year: 2013

Abstract

The current lack of automatic and speedy labeling of the large number (thousands) of malware samples seen every day delays the distribution of malware signatures, leading to a low detection rate of new malware samples in the wild. In this paper, we design, implement and evaluate a novel, scalable framework, called MutantX-S, that can efficiently cluster a large number of samples into families based on programs' static features, i.e., code instruction sequences. MutantX-S is a unique combination of several novel techniques to address the practical challenges of malware clustering. Specifically, it exploits the instruction format of the x86 architecture and represents a binary program as a sequence of opcodes, facilitating the extraction of N-gram features. It also exploits the hashing trick recently developed in the machine learning community to reduce the dimensionality of the extracted feature vectors, thus significantly lowering the memory and computation costs of clustering. Our comprehensive evaluation of a MutantX-S prototype using a database of more than 100,000 malware samples has shown its ability to correctly cluster over 80% of input samples within 2 hours, achieving a good balance between accuracy and scalability. Applying MutantX-S to malware samples created at different times, we also demonstrate that MutantX-S achieves high accuracy (around 0.75–0.8) in predicting family labels for unknown malware.
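
The two feature steps named in the abstract, opcode N-gram extraction and the hashing trick, can be illustrated with a minimal sketch. This is not the MutantX-S implementation: the function names, the N-gram length, the vector dimension, the use of MD5 as the hash function, and the opcode sequence are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import hashlib
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return every run of n consecutive opcodes as a tuple."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def hashed_features(opcodes, n=4, dim=4096):
    """Fold opcode n-gram counts into a fixed-length vector (hashing trick).

    n, dim and the MD5-based index are illustrative assumptions,
    not values taken from the paper.
    """
    vec = [0] * dim
    for gram, count in Counter(opcode_ngrams(opcodes, n)).items():
        # A stable hash is used because Python's built-in hash() for
        # strings is randomized between interpreter runs.
        digest = hashlib.md5(" ".join(gram).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % dim
        # Distinct n-grams that collide on an index share one counter.
        vec[index] += count
    return vec


# Hypothetical opcode sequence, not taken from a real disassembly.
sample = ["push", "mov", "sub", "call", "mov", "add", "ret"]
vec = hashed_features(sample, n=2, dim=16)
```

The vector length stays at `dim` regardless of how many distinct N-grams appear in the corpus, which is the source of the memory and computation savings the abstract describes. The cost is that colliding N-grams become indistinguishable, so `dim` trades accuracy against size.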
