Publication | Closed Access
Spam filters: bayes vs. chi-squared; letters vs. words
Citations: 57 | References: 7 | Year: 2003
Topics: Engineering, Information Forensics, Writer Identification, Communication, Corpus Linguistics, Text Mining, Natural Language Processing, Spam Filtering, Information Retrieval, Data Science, Data Mining, Document Classification, Content Analysis, Statistics, Automatic Classification, Statistical Methods, Knowledge Discovery, Author Profiling, Spam Email, Distributional Semantics, Information Filtering System, Spam Filters, Arts
We compare two statistical methods for identifying spam, or junk electronic mail. Spam filters are classifiers that determine whether an email is junk. The proliferation of spam email has made electronic filtering vitally important, and we discuss the magnitude of the problem. We examine the Naive Bayesian method in relation to the 'Chi by Degrees of Freedom' approach, which is used in the field of authorship identification. Both methods produce very promising results. However, 'Chi by Degrees of Freedom' has the advantage of providing significance measures, which will help to reduce false positives. Statistics based on character-level tokenization prove more effective than those based on word-level tokenization.
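The abstract does not give the formulas used, so the following is only a minimal sketch of how a chi-by-degrees-of-freedom comparison might work, under stated assumptions: the statistic is taken to be a Pearson chi-squared over the token counts of a message and a reference corpus, divided by the vocabulary size minus one, and a message is assigned to whichever class profile it deviates from least. The function names (`char_tokens`, `word_tokens`, `chi_by_df`, `classify`), the trigram length, and the toy corpora are all hypothetical, and no significance measure is computed here.

```python
from collections import Counter


def char_tokens(text, n=3):
    """Character-level tokens: overlapping n-grams (n=3 is an assumed choice)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_tokens(text):
    """Word-level tokens: lowercase whitespace split."""
    return text.lower().split()


def chi_by_df(sample_counts, reference_counts):
    """Pearson chi-squared between two token-count profiles, divided by
    degrees of freedom (vocabulary size minus one). Lower means more similar.
    This formulation is an assumption, not taken from the paper."""
    n_s = sum(sample_counts.values())
    n_r = sum(reference_counts.values())
    if n_s == 0 or n_r == 0:
        raise ValueError("both profiles must contain at least one token")
    total = n_s + n_r
    vocab = set(sample_counts) | set(reference_counts)
    chi2 = 0.0
    for tok in vocab:
        o_s = sample_counts.get(tok, 0)
        o_r = reference_counts.get(tok, 0)
        row = o_s + o_r
        # Expected counts if both profiles shared one underlying distribution.
        e_s = row * n_s / total
        e_r = row * n_r / total
        chi2 += (o_s - e_s) ** 2 / e_s + (o_r - e_r) ** 2 / e_r
    df = len(vocab) - 1
    return chi2 / df if df > 0 else 0.0


def classify(message, spam_texts, ham_texts, tokenize):
    """Label a message by the class profile it deviates from least."""
    msg = Counter(tokenize(message))
    spam = Counter(t for text in spam_texts for t in tokenize(text))
    ham = Counter(t for text in ham_texts for t in tokenize(text))
    return "spam" if chi_by_df(msg, spam) < chi_by_df(msg, ham) else "ham"


if __name__ == "__main__":
    # Toy corpora, invented purely for illustration.
    spam_texts = ["buy cheap pills now", "cheap pills cheap offer"]
    ham_texts = ["meeting agenda for monday", "lunch on monday"]
    print(classify("cheap pills offer", spam_texts, ham_texts, word_tokens))
    print(classify("cheap pills offer", spam_texts, ham_texts, char_tokens))
```

Passing `char_tokens` or `word_tokens` as the `tokenize` argument switches between the two tokenization levels the abstract compares, with everything else held fixed.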