Publication | Closed Access
Software Vulnerability Discovery via Learning Multi-Domain Knowledge Bases
Citations: 118
References: 42
Year: 2019
Software Maintenance, Engineering, Machine Learning, Machine Learning Tool, Software Engineering, Source Code Analysis, Software Analysis, Vulnerability Assessment (Computing), Domain Characteristic, Data Science, Data Mining, Vulnerability-relevant Data Sources, Adversarial Machine Learning, Machine Learning Model, Knowledge Discovery, Computer Science, Deep Learning, Software Design, Program Analysis, Software Testing, Domain Adaptation, Software Vulnerability Discovery, Vulnerability Discovery, Vulnerability Detection, Transfer Learning
Machine learning (ML) has great potential in automated code vulnerability discovery. However, automated discovery applications driven by off-the-shelf machine learning tools often perform poorly because of a shortage of high-quality training data. Scarcity of vulnerability data is almost always a problem for a software project in its early stages of development; this is referred to as the cold-start problem. This article proposes a framework that utilizes transferable knowledge from pre-existing data sources. To improve detection performance, multiple vulnerability-relevant data sources were selected to form a broader base for learning transferable knowledge. The selected data sources are cross-domain: they include historical vulnerability data from different software projects and data from the Software Assurance Reference Dataset (SARD), which consists of synthetic vulnerability examples and proof-of-concept test cases. To extract the information applicable to vulnerability detection from the cross-domain data sets, we designed a deep-learning-based framework with long short-term memory (LSTM) cells. Our framework combines the heterogeneous data sources to learn unified representations of the patterns of vulnerable source code. Empirical studies showed that the unified representations generated by the proposed deep learning networks are feasible and effective, and are transferable to real-world vulnerability detection. Our experiments demonstrated that, by leveraging two heterogeneous data sources, our vulnerability detection approach outperformed the static vulnerability discovery tool Flawfinder. The findings of this article may stimulate further research in ML-based vulnerability detection using heterogeneous data sources.
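As a rough illustration of what a "unified representation" means here, the sketch below runs a single untrained LSTM cell over two token sequences, one standing in for a real-project function and one for a SARD-style synthetic case, and maps both to vectors of the same fixed length. It is not the authors' architecture: the class name `TinyLSTMEncoder`, the token-level input, the dimensions, and the random weights are all assumptions made for this example, and a real system would train the weights on labeled data from both sources.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMEncoder:
    """Hypothetical single-layer LSTM encoder with random, untrained weights."""

    def __init__(self, vocab, embed_dim=8, hidden_dim=6, seed=0):
        rng = random.Random(seed)
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.hidden_dim = hidden_dim
        # One embedding row per known token, plus a final row for unknown tokens.
        self.embed = [[rng.uniform(-0.5, 0.5) for _ in range(embed_dim)]
                      for _ in range(len(vocab) + 1)]
        # One weight matrix and bias per gate:
        # i = input, f = forget, o = output, g = candidate cell state.
        in_dim = embed_dim + hidden_dim
        self.W = {g: [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                      for _ in range(hidden_dim)] for g in "ifog"}
        self.b = {g: [0.0] * hidden_dim for g in "ifog"}

    def _gate(self, name, z, act):
        # Affine transform of the concatenated input, followed by an activation.
        return [act(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(self.W[name], self.b[name])]

    def encode(self, tokens):
        """Return the final hidden state as a fixed-length representation."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for tok in tokens:
            x = self.embed[self.vocab.get(tok, len(self.vocab))]
            z = x + h  # concatenate token embedding and previous hidden state
            i = self._gate("i", z, sigmoid)
            f = self._gate("f", z, sigmoid)
            o = self._gate("o", z, sigmoid)
            g = self._gate("g", z, math.tanh)
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h


# Illustrative token sequences (made up for this sketch, not taken from the paper).
project_sample = ["char", "buf", "[", "8", "]", ";",
                  "strcpy", "(", "buf", ",", "src", ")", ";"]
sard_sample = ["char", "data", "[", "10", "]", ";",
               "memcpy", "(", "data", ",", "src", ",", "n", ")", ";"]

vocab = sorted(set(project_sample) | set(sard_sample))
encoder = TinyLSTMEncoder(vocab)

# Sequences of different lengths from different sources share one vector space.
rep_project = encoder.encode(project_sample)
rep_sard = encoder.encode(sard_sample)
```

Because the same encoder processes both inputs, a downstream classifier can in principle consume examples from either source interchangeably, which is the property that makes knowledge learned on one source reusable on another.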