Internet traffic classification using machine learning approach: Datasets validation issues

Abstract

Internet traffic classification is an area of current research interest. The failure of port-based and payload-based classification has motivated researchers to move towards a machine learning (ML) approach. However, the validation of training and testing datasets has not been formally addressed. This paper discusses the problem of ML dataset validation and highlights three training issues to be considered in ML classification. The first issue is whether the training and testing datasets are collected from networks with the same or different characteristics. The second issue concerns training datasets whose classes do not represent the classes of the real online traffic. The third issue is the geographic location at which the network traffic is captured. Real Internet traffic datasets collected from a campus network are used to study the traffic features and classification accuracy for each of these issues. The experimental results demonstrate that some traffic features, such as inter-arrival time, differ when the training and testing data are collected from different networks. Furthermore, the experiment on the second issue shows that the online classifier achieved its highest accuracy (92.22%) when the ML classifier was trained on a dataset whose classes have the same ratio as the real online traffic. For the geographic capture level, the results indicate that the statistical features of the traffic differ when the capture level differs.
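The second issue, matching the class ratio of the training data to that of the online traffic, can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function name `resample_to_ratio`, the class labels, and the target ratios in the usage note are all hypothetical, and the sketch simply resamples a labelled flow set so that each class appears in a chosen proportion.

```python
import random
from collections import defaultdict


def resample_to_ratio(samples, target_ratio, total, seed=0):
    """Resample labelled flows so class proportions follow target_ratio.

    samples:      list of (features, label) pairs (hypothetical format)
    target_ratio: dict mapping label -> fraction of the output set
    total:        desired size of the resampled training set
    """
    rng = random.Random(seed)

    # Group the flows by their class label.
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append((features, label))

    resampled = []
    for label, fraction in target_ratio.items():
        pool = by_class.get(label, [])
        if not pool:
            raise ValueError(f"no training samples for class {label!r}")
        count = round(total * fraction)
        if count <= len(pool):
            # Enough flows available: draw without replacement.
            resampled.extend(rng.sample(pool, count))
        else:
            # Too few flows: draw with replacement (oversampling).
            resampled.extend(rng.choices(pool, k=count))

    rng.shuffle(resampled)
    return resampled
```

For example, given 50 flows in each of three assumed classes, calling `resample_to_ratio(flows, {"web": 0.6, "p2p": 0.3, "mail": 0.1}, total=100)` returns 100 flows in a 60/30/10 split, oversampling the "web" class. In practice the target ratios would have to be estimated from the online traffic itself.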
