Publication | Closed Access
Building a test collection for complex document information processing
Citations: 278
References: 3
Year: 2006
Unknown Venue
Test Collection, Engineering, Document Image Analysis, Document Images, Text Mining, Information Retrieval, Data Science, Data Mining, Pattern Recognition, Text Recognition, Document Engineering, Data Integration, Character Recognition, Data Management, Optical Character Recognition, Terabyte Dataset, Knowledge Discovery, Computer Science, Software Testing, Scanned Paper Documents, Structured Document, Document Processing
Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) and component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.