Publication | Closed Access
Discovering parallel text from the World Wide Web
Citations: 51
References: 5
Year: 2004
Unknown Venue
Engineering, Parallel Text, Semantic Web, Corpus Linguistics, Text Mining, Natural Language Processing, Language Documentation, Information Retrieval, Data Science, Computational Linguistics, Language Studies, Parallel Corpus, Knowledge Discovery, Webometrics, Cross-language Retrieval, Text Indexing, Computer Science, Web Mining, Content Similarity Detection, Web Intelligence, Search Engine Indexing, Linguistics, Semantic Similarity
A parallel corpus is a rich linguistic resource for multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system that identifies and aligns parallel documents from the World Wide Web. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules then determine whether a candidate document pair is parallel. First, a filename comparison module checks how closely the filenames resemble each other. Second, a content analysis module measures the semantic similarity of the documents. An experiment conducted on a multilingual Web site shows the effectiveness of the system.
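The filename comparison step can be illustrated with a minimal sketch. The abstract does not specify how resemblance is computed, so everything here is an assumption made for illustration: the language marker list `LANG_MARKERS`, the helper names `normalize_filename` and `filename_resemblance`, and the use of a character-level similarity ratio are all hypothetical and not taken from the paper.

```python
import posixpath
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical language markers that sites often embed in filenames.
# The paper's actual marker list is not given in the abstract.
LANG_MARKERS = {"en", "eng", "english", "zh", "chi", "chinese", "big5", "gb"}


def normalize_filename(url: str) -> str:
    """Return the filename stem of a URL with language markers removed."""
    name = posixpath.basename(urlparse(url).path).lower()
    stem, _ext = posixpath.splitext(name)
    # Split on anything that is not a letter or digit, e.g. "_" or "-".
    tokens = [t for t in re.split(r"[^a-z0-9]+", stem) if t]
    return "_".join(t for t in tokens if t not in LANG_MARKERS)


def filename_resemblance(url_a: str, url_b: str) -> float:
    """Score in [0, 1]; higher means the two filenames look more alike."""
    a, b = normalize_filename(url_a), normalize_filename(url_b)
    if not a or not b:
        # Nothing left to compare after stripping markers.
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    pair = ("http://example.org/news/report_en.html",
            "http://example.org/news/report_zh.html")
    print(filename_resemblance(*pair))
```

In a pipeline like the one described, pairs scoring above some threshold would be passed on to content analysis; the threshold itself would need to be tuned on real data and is not something the abstract reports.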