Publication | Open Access
Extracting the names of genes and gene products with a hidden Markov model
Citations: 246
References: 16
Year: 2000
Unknown Venue
Engineering, Medline Abstracts, Genetics, Genomics, Technical Terminology, Gene Recognition, Bioinformatics Database, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Data Mining, Hidden Markov Model, Computational Linguistics, Computational Genomics, Biostatistics, Biomedical Text Mining, Named-entity Recognition, Machine Translation, Interpolation Model, Sequence Analysis, Knowledge Discovery, Statistical Genetics, Terminology Extraction, Gene Products, Information Extraction, Bioinformatics, Functional Genomics, Gene Sequence Annotation, Computational Biology, Keyword Extraction, Systems Biology, Medicine
We report the results of a study into the use of a linear interpolating hidden Markov model (HMM) for extracting technical terminology from MEDLINE abstracts and texts in the molecular-biology domain. This is the first stage in a system that will extract event information for automatically updating biology databases. We trained the HMM entirely with bigrams based on lexical and character features in a relatively small corpus of 100 MEDLINE abstracts that were marked up by domain experts with term classes such as proteins and DNA. Using cross-validation methods, we achieved an F-score of 0.73, and we examine the contribution made by each part of the interpolation model to overcoming data sparseness.
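The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea behind linear interpolation as a remedy for data sparseness: a sparse bigram estimate is mixed with a denser unigram estimate so that unseen class transitions still receive non-zero probability. The class labels, the toy sentences, the function names, and the fixed weight `lam` are all assumptions made for illustration; they are not the features or weights used in the paper.

```python
from collections import Counter

def train_counts(tagged_sentences):
    """Collect class unigram, context, and bigram counts from (word, class) sequences."""
    unigrams, contexts, bigrams = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for _, cls in sentence:
            unigrams[cls] += 1
            contexts[prev] += 1
            bigrams[(prev, cls)] += 1
            prev = cls
    return unigrams, contexts, bigrams

def interpolated_prob(prev_cls, cls, unigrams, contexts, bigrams, lam=0.7):
    """P(cls | prev_cls) as lam * bigram estimate + (1 - lam) * unigram estimate.

    lam is an assumed, hand-picked weight; in practice it would be tuned on held-out data.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[cls] / total if total else 0.0
    p_bi = bigrams[(prev_cls, cls)] / contexts[prev_cls] if contexts[prev_cls] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni

# Toy annotated data (invented for this sketch, not taken from the paper's corpus).
sentences = [
    [("IL-2", "PROTEIN"), ("gene", "DNA"), ("expression", "O")],
    [("NF-kappa", "PROTEIN"), ("B", "PROTEIN"), ("binds", "O")],
]
unigrams, contexts, bigrams = train_counts(sentences)

seen = interpolated_prob("PROTEIN", "DNA", unigrams, contexts, bigrams)
unseen = interpolated_prob("DNA", "PROTEIN", unigrams, contexts, bigrams)
print(f"seen transition:   {seen:.3f}")
print(f"unseen transition: {unseen:.3f}")
```

In this sketch the transition from DNA to PROTEIN never occurs in the toy data, yet it still gets a non-zero probability from the unigram term, which is the effect interpolation is meant to provide when training data is small.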