Publication | Closed Access
Text Features Extraction based on TF-IDF Associating Semantic
Citations: 67
References: 7
Year: 2018
Unknown Venue
Engineering, Word Vector, Text Feature Extraction, Text Features Extraction, Semantic Web, Corpus Linguistics, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Computational Linguistics, Language Studies, Document Clustering, Knowledge Discovery, Terminology Extraction, Information Extraction, Vector Space Model, Keyword Extraction, Linguistics, Word Statistics, Semantic Similarity
The TF-IDF (term frequency-inverse document frequency) algorithm extracts text features from word statistics. It treats words as the same only when their surface forms (for example, their ASCII strings) match across texts, and does not consider that a word may also be expressed by its synonyms. Treating words with the same or similar meanings as separate features loses part of the information when text features are extracted. Word representations therefore need to capture the similarity among words, and that similarity has to be derived from the meaning of the words in the texts. To improve the accuracy of text feature extraction, this paper uses the word2vec model to train word vectors on the corpus and obtain their semantic features. After words with low TF-IDF values are excluded, a density clustering algorithm clusters the remaining words by word-vector similarity. As a result, similar words are grouped together and can represent one another. Experiments show that applying the TF-IDF algorithm again and constructing a VSM (vector space model) with these clusters as feature units can effectively improve the accuracy of text feature extraction.
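The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the tiny corpus, the hand-written three-dimensional vectors (standing in for trained word2vec vectors), the thresholds `min_tfidf` and `min_sim`, and the simplified DBSCAN-style routine `density_cluster` are all assumptions chosen for illustration, since the abstract does not specify the clustering algorithm's parameters or the cut-off values.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per document (docs are token lists)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    return [{t: (c / len(d)) * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def density_cluster(words, vectors, min_sim=0.9, min_pts=2):
    """Simplified DBSCAN-style clustering; neighbours have cosine >= min_sim."""
    vec = {w: dict(enumerate(vectors[w])) for w in words}
    neigh = {w: [x for x in words if cosine(vec[w], vec[x]) >= min_sim] for w in words}
    label, next_id = {}, 0
    for w in words:
        if w in label or len(neigh[w]) < min_pts:
            continue  # already assigned, or not a core word
        label[w] = next_id
        stack = [w]
        while stack:  # grow the cluster outward from core words only
            for x in neigh[stack.pop()]:
                if x not in label:
                    label[x] = next_id
                    if len(neigh[x]) >= min_pts:
                        stack.append(x)
        next_id += 1
    for w in words:  # words left unassigned (noise) become singleton features
        if w not in label:
            label[w] = next_id
            next_id += 1
    return label

def cluster_features(docs, vectors, min_tfidf=0.05, min_sim=0.9):
    """Filter low TF-IDF words, cluster the rest, re-run TF-IDF over clusters."""
    weights = tfidf(docs)
    kept = sorted({t for w in weights for t, s in w.items()
                   if s >= min_tfidf and t in vectors})
    label = density_cluster(kept, vectors, min_sim)
    mapped = [[f"c{label[t]}" for t in d if t in label] for d in docs]
    return label, tfidf(mapped)

# Toy corpus and hypothetical word vectors (placeholders for word2vec output).
docs = [["car", "engine", "fast", "the"],
        ["automobile", "engine", "quick", "the"],
        ["banana", "fruit", "sweet", "the"]]
vectors = {"car": [1.0, 0.1, 0.0], "automobile": [0.9, 0.2, 0.0],
           "fast": [0.1, 1.0, 0.0], "quick": [0.2, 0.9, 0.0],
           "engine": [0.7, 0.0, 0.7], "banana": [0.0, 0.1, 1.0],
           "fruit": [0.0, 0.2, 0.9], "sweet": [0.0, 0.7, 0.7]}

word_weights = tfidf(docs)
label, cluster_weights = cluster_features(docs, vectors)
sim_before = cosine(word_weights[0], word_weights[1])
sim_after = cosine(cluster_weights[0], cluster_weights[1])
print(sim_before, sim_after)
```

In this toy setup the first two documents share only one literal word with nonzero weight, so their word-level TF-IDF vectors are nearly orthogonal. Once "car"/"automobile" and "fast"/"quick" fall into the same clusters, the cluster-level VSM places the two documents much closer together, which is the effect the abstract attributes to using clusters as feature units.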