Publication | Closed Access

Spam Corpus Creation for TREC.

Citations: 84

References: 2

Year: 2005

Abstract

2005) introduces a standard testing framework that is designed to model a spam filter’s usage as closely as possible, to measure quantities that reflect the filter’s effectiveness for its intended purpose, and to yield repeatable (i.e., controlled and statistically valid) results. The TREC Spam Filter Evaluation Toolkit is free software that, given a corpus and a filter, automatically runs the filter on each message in the corpus, compares the result to the gold standard for the corpus, and reports effectiveness measures with 95% confidence limits. The corpus consists of a chronological sequence of email messages and a gold-standard judgement for each message.

We are concerned here with the creation of appropriate corpora for use with the toolkit. It is a simple matter to capture all the email delivered to a recipient or a set of recipients. Using this captured email in a public corpus, as for the other TREC tasks, is not so simple. Few individuals are willing to publish their email, because doing so would compromise their privacy and the privacy of their correspondents. So we are left with a choice between an artificial public collection of messages and a more realistic collection that must be kept private.

Artificial collections (spamassassin.org, 2003; Androutsopoulos et al., 2000; Michelakis et al., 2004) may be created by using mailing list messages as opposed to personal email, by selecting non-sensitive messages from a real email collection, by mixing messages from diverse sources, or by obfuscating genuine messages [1]. All of these approaches conflict with our design criterion, that real filter usage be modelled as closely as possible, and may compromise the very information that filters use to discriminate ham from spam, either by removing pertinent details or by introducing extraneous information that may aid or hinder the filter.

[1] The majority of filters we have evaluated exhibit pathologies on the PU obfuscated corpora.
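The evaluation loop the abstract attributes to the toolkit (run the filter on each message in chronological order, compare with the gold standard, report error rates with 95% confidence limits) can be illustrated with a minimal Python sketch. This is not the toolkit's code or interface: the names `evaluate` and `wilson_interval`, the `(text, label)` corpus layout, and the choice of the Wilson score interval are all assumptions made for illustration, and the abstract does not say how the toolkit computes its confidence limits.

```python
import math
from typing import Callable, Dict, Iterable, Tuple

Interval = Tuple[float, float]


def wilson_interval(errors: int, total: int, z: float = 1.96) -> Interval:
    """Approximate 95% Wilson score interval for an error proportion.

    Assumption: the Wilson interval is used here only as one common choice.
    """
    if total == 0:
        return (0.0, 1.0)  # no evidence: the interval spans everything
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def evaluate(
    corpus: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> Dict[str, Tuple[float, Interval]]:
    """Run a filter over a chronological corpus and score it per gold label.

    `corpus` yields (message_text, gold_label) pairs, with gold_label being
    "ham" or "spam" (hypothetical layout). Returns, for each label, the
    misclassification rate and its confidence interval.
    """
    counts = {"ham": [0, 0], "spam": [0, 0]}  # label -> [errors, total]
    for text, gold in corpus:  # messages are processed in corpus order
        predicted = classify(text)
        counts[gold][1] += 1
        if predicted != gold:
            counts[gold][0] += 1
    report = {}
    for label, (errs, total) in counts.items():
        rate = errs / total if total else 0.0
        report[label] = (rate, wilson_interval(errs, total))
    return report
```

In this sketch the "ham" entry corresponds to legitimate mail wrongly flagged as spam and the "spam" entry to spam that was let through. Keeping the two rates separate reflects the fact that the two kinds of error have very different costs for a mail recipient.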
