Publication | Closed Access
Mining the web to create minority language corpora
Citations: 69
References: 22
Year: 2001
Unknown Venue
Automatic Language Classifier; Engineering; Intelligent Information Retrieval; Query Model; Corpus Linguistics; Text Mining; Natural Language Processing; Applied Linguistics; Language Documentation; Information Retrieval; Computational Linguistics; Language Engineering; Query Expansion; Language Studies; Machine Translation; Terminology Extraction; Query Analysis; Language Corpus; Minority Language Corpora; Web-search Queries; Minority Language; Linguistics; Interactive Information Retrieval
The Web is a valuable source of language-specific resources, but collecting, organizing, and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries that collect documents in a minority language. It differs from pseudo-relevance feedback in that an automatic language classifier labels each retrieved document as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query lengths to find inclusion and exclusion terms that help retrieve documents in the target language, and we find that odds-ratio scores calculated over the documents acquired so far were one of the most consistently accurate query-generation methods. We also describe experiments that start from a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Finally, we present experiments applying the same approach to multiple languages, showing that it generalizes to a variety of languages.
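As an illustration of the odds-ratio query generation mentioned in the abstract, the following is a minimal sketch and not the CorpusBuilder implementation. The abstract does not give the exact scoring formula, so the smoothed log odds ratio over document frequencies, the whitespace tokenization, the function names `odds_ratio_scores` and `build_query`, and the `+term` / `-term` query syntax are all assumptions made for this example.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, irrelevant_docs, smoothing=0.5):
    """Score each term by a smoothed log odds ratio (assumed formula).

    Positive scores mark terms more typical of documents the language
    classifier labeled relevant; negative scores mark the opposite.
    """
    n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)
    # Document frequency: count each term once per document.
    rel_df = Counter(t for d in relevant_docs for t in set(d.lower().split()))
    irr_df = Counter(t for d in irrelevant_docs for t in set(d.lower().split()))

    scores = {}
    for term in set(rel_df) | set(irr_df):
        # Smoothing keeps both probabilities strictly between 0 and 1.
        p_rel = (rel_df[term] + smoothing) / (n_rel + 2 * smoothing)
        p_irr = (irr_df[term] + smoothing) / (n_irr + 2 * smoothing)
        scores[term] = math.log((p_rel * (1 - p_irr)) / ((1 - p_rel) * p_irr))
    return scores


def build_query(relevant_docs, irrelevant_docs, k=2):
    """Pick k inclusion and k exclusion terms (hypothetical +/- syntax)."""
    scores = odds_ratio_scores(relevant_docs, irrelevant_docs)
    # Ties are broken alphabetically so the result is deterministic.
    include = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    exclude = sorted(scores, key=lambda t: (scores[t], t))[:k]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


# Toy data standing in for documents already labeled by a classifier.
target_docs = ["danes je lep dan", "je to lep dom"]
other_docs = ["today is a nice day", "this is a nice home"]
query = build_query(target_docs, other_docs, k=2)
```

In a feedback loop of the kind the abstract describes, the documents returned for `query` would be labeled by the language classifier, added to the two collections, and the scores recomputed before the next query is issued.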