An Unsupervised Detection Framework for Chinese Jargons in the Darknet

Abstract

With the continuous development of darknet technology, the scale of the darknet has increased rapidly in recent years, leading to rampant crime in its anonymous trading markets. Monitoring these markets can effectively combat the criminal forces that hide behind them. One of the difficulties in understanding the darknet is that criminals usually use jargons to disguise transactions and thus avoid surveillance. These jargons distort the original meaning of innocent-looking words in obscure ways, posing significant challenges for crime tracking. Current research on Chinese jargon detection mainly relies on keyword filtering; however, such methods have little effect against the complex and ever-changing structure of darknet jargons. We propose a Chinese jargon detection framework based on unsupervised learning. The main idea is to find jargons by comparing the similarity of high-dimensional word embedding features drawn from different corpora. Firstly, we collect data from six Chinese Tor websites to build a dark corpus dataset. Afterwards, we build a word-based pre-training model called DC-BERT, which can generate high-quality contextual word embeddings. Finally, we construct a cross-corpus jargon detection framework based on similarity analysis, which can effectively detect Chinese jargons in the darknet. The experimental results show that the proposed framework is both innovative and practical, reaching a detection accuracy of 91.5%.
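
The cross-corpus comparison described above can be illustrated with a minimal sketch. It is not the DC-BERT pipeline itself: it assumes each word already has one embedding per corpus in a shared vector space, and the function names, the toy vectors, and the threshold of 0.5 are all hypothetical choices made for illustration. A word whose dark-corpus embedding differs sharply from its general-corpus embedding is flagged as a jargon candidate.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_jargon_candidates(dark_emb, general_emb, threshold=0.5):
    """Return (word, similarity) pairs whose cross-corpus similarity is below threshold.

    dark_emb and general_emb map words to embedding vectors. Both are assumed
    to live in the same vector space; the threshold is an illustrative value.
    """
    candidates = []
    for word, dark_vec in dark_emb.items():
        general_vec = general_emb.get(word)
        if general_vec is None:
            continue  # word is absent from the general corpus, so nothing to compare
        sim = cosine_similarity(dark_vec, general_vec)
        if sim < threshold:
            candidates.append((word, sim))
    # Lowest similarity first: the strongest candidates lead the list.
    return sorted(candidates, key=lambda pair: pair[1])


# Toy vectors, invented for this example only.
dark = {"apple": [1.0, 0.0, 0.0], "bank": [0.0, 1.0, 0.0]}
general = {"apple": [0.0, 1.0, 0.0], "bank": [0.0, 1.0, 0.1]}

result = flag_jargon_candidates(dark, general)
print(result)
```

In this toy data, "apple" is used in unrelated contexts in the two corpora and is flagged, while "bank" keeps nearly the same embedding and is not. In practice the threshold would have to be tuned on held-out data.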
