Introducing the Enron Corpus

TLDR

The Enron corpus is a set of 619,446 emails from 158 users that was made public during the legal investigation of the Enron corporation. The raw data includes computer‑generated folders, some holding many duplicate messages, that the users did not appear to use directly. This paper introduces and analyzes the corpus to assess its suitability for studying how to classify messages the way a human organizes them. Before the analysis, the authors cleaned the corpus by removing computer‑generated folders such as “discussion_threads” from each user’s data.

Abstract

A large set of email messages, the Enron corpus, was made public during the legal investigation concerning the Enron corporation. This dataset, along with a thorough explanation of its origin, is available at http://www-2.cs.cmu.edu/~enron/. This paper provides a brief introduction to the dataset and an analysis of it. The raw Enron corpus contains 619,446 messages belonging to 158 users. Before this analysis, we cleaned the corpus by removing certain folders, such as “discussion_threads”, from each user’s data. These folders were present for most users and appeared to be computer generated rather than used directly by the users. Many, such as “all_documents”, also contained large numbers of duplicate emails that were already present in the users’ other folders. Our goal in this paper is to analyze the suitability of this corpus for exploring how to classify messages as organized by a human, so these folders would likely have been misleading.
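The cleaning step described above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes a hypothetical on-disk layout of `root/<user>/<folder>/<message file>`, and it skips only the two folder names the abstract mentions; the complete list of folders the authors removed is not given here. The function name `count_messages` and the constant `GENERATED_FOLDERS` are illustrative.

```python
import os
from collections import Counter

# Folder names the abstract cites as examples of computer-generated folders.
# Assumption: the authors' full removal list may contain other names.
GENERATED_FOLDERS = {"discussion_threads", "all_documents"}


def count_messages(root, skip=GENERATED_FOLDERS):
    """Count messages per (user, folder), ignoring folders named in `skip`.

    Assumes a hypothetical layout: root/<user>/<folder>/.../<message file>.
    """
    counts = Counter()
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        for folder in sorted(os.listdir(user_dir)):
            folder_dir = os.path.join(user_dir, folder)
            # Skip stray files and any top-level folder marked as generated.
            if not os.path.isdir(folder_dir) or folder in skip:
                continue
            # Count every file below the folder, including nested subfolders.
            counts[(user, folder)] = sum(
                len(files) for _, _, files in os.walk(folder_dir)
            )
    return counts
```

Calling `count_messages(root)` and `count_messages(root, skip=set())` on the same directory gives a rough sense of how many messages the cleaning step drops, though it does not by itself detect duplicates in the folders that remain.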