Publication | Closed Access
An Unsupervised Detection Framework for Chinese Jargons in the Darknet
Year: 2022
Citations: 12
References: 12
Engineering, Information Forensics, Darknet Jargons, Semantics, Corpus Linguistics, Journalism, Text Mining, Word Embeddings, Applied Linguistics, Natural Language Processing, Information Retrieval, Data Science, Computational Linguistics, Chinese Jargon Detection, Language Engineering, Language Studies, Content Analysis, Named-entity Recognition, Machine Translation, Knowledge Discovery, Terminology Extraction, Computer Science, Distributional Semantics, Information Extraction, Chinese Jargons, Linguistics
With the continuous development of darknet technology, the scale of the darknet has increased rapidly in recent years, leading to rampant crime in its anonymous trading markets. Monitoring these markets can help combat the criminal forces that hide behind them. One difficulty in understanding the darknet is that criminals usually use jargon to disguise transactions and avoid surveillance. Jargon terms typically distort the original meaning of innocent-looking words in obscure ways, which poses significant challenges for crime tracking. Current research on Chinese jargon detection mainly relies on keyword filtering, but such methods cope poorly with the complex and ever-changing structure of darknet jargon. We propose a Chinese jargon detection framework based on unsupervised learning. The main idea is to find jargon by comparing the similarity of high-dimensional word embedding features obtained from different corpora. First, we collect data from six Chinese Tor websites to build a dark corpus dataset. Next, we build a word-based pre-trained model called DC-BERT, which can generate high-quality contextual word embeddings. Finally, we construct a cross-corpus jargon detection framework based on similarity analysis, which can effectively detect Chinese jargon in the darknet. The experimental results show that the proposed framework is both innovative and practical, reaching a detection accuracy of 91.5%.
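The cross-corpus comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each word already has one embedding per corpus, that both sets of embeddings live in a comparable vector space, and that cosine similarity is the similarity measure. The function names, the threshold of 0.5, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jargon_candidates(dark_vecs, general_vecs, threshold=0.5):
    """Return (word, similarity) pairs whose embeddings diverge across corpora.

    A word whose dark-corpus embedding is far from its general-corpus
    embedding is used differently in the two corpora, so it is flagged
    as a jargon candidate. The threshold is a hypothetical value.
    """
    candidates = []
    for word, dark_vec in dark_vecs.items():
        general_vec = general_vecs.get(word)
        if general_vec is None:
            continue  # word absent from the general corpus: nothing to compare
        similarity = cosine_similarity(dark_vec, general_vec)
        if similarity < threshold:
            candidates.append((word, similarity))
    # Most divergent words first
    return sorted(candidates, key=lambda pair: pair[1])


# Toy, made-up embeddings: "apple" keeps its meaning, "ticket" shifts.
dark = {"apple": [0.9, 0.1, 0.0], "ticket": [0.0, 0.2, 0.9]}
general = {"apple": [0.8, 0.2, 0.1], "ticket": [0.9, 0.1, 0.1]}
flagged = jargon_candidates(dark, general)
print([word for word, _ in flagged])
```

With contextual embeddings such as those a BERT-style model produces, each occurrence of a word yields a different vector, so a real pipeline would first need to aggregate occurrences into one vector per word and corpus (for example by averaging). That step is omitted here.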