Named Entity Recognition in Tweets: An Experimental Study

TLDR

People tweet more than 100 million times daily, producing a noisy, informal corpus of 140-character messages that reflects the zeitgeist, yet standard NLP tools perform poorly on such data. This study rebuilds the NLP pipeline, from part-of-speech tagging through chunking to named-entity recognition, to address the problem. The resulting T-ner system exploits the redundancy inherent in tweets, using LabeledLDA with Freebase dictionaries as a source of distant supervision. T-ner doubles the F1 score of the Stanford NER system, and LabeledLDA increases F1 by 25% over co-training across ten common entity types. The NLP tools are available at http://github.com/aritter/twitter_nlp.

Abstract

People tweet more than 100 million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by rebuilding the NLP pipeline, beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-ner system doubles the F1 score compared with the Stanford NER system. T-ner leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
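The results above are reported as F1, the harmonic mean of precision and recall. As a minimal sketch of how such a score can be computed for entity recognition, the Python snippet below assumes exact matching on (start, end, type) spans; the abstract does not state the evaluation protocol, so this matching rule, the function name `f1_score`, the entity labels, and the toy data are all illustrative assumptions rather than the authors' setup.

```python
def f1_score(gold, predicted):
    """Return (precision, recall, F1) for two collections of entity spans.

    Each entity is a (start, end, type) tuple. A prediction counts as
    correct only if it matches a gold entity exactly (an assumption made
    for this sketch, not a protocol taken from the paper).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy annotations: token offsets and labels are invented.
gold = {(0, 2, "PERSON"), (5, 6, "COMPANY"), (8, 9, "GEO-LOC")}
predicted = {
    (0, 2, "PERSON"),    # correct
    (5, 6, "BAND"),      # right span, wrong type: counted as an error
    (8, 9, "GEO-LOC"),   # correct
    (11, 12, "PRODUCT"), # spurious entity
}

precision, recall, f1 = f1_score(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy case two of four predictions are correct and two of three gold entities are found, so precision is 0.5, recall is about 0.667, and F1 is 4/7, about 0.571. Because F1 is a harmonic mean, doubling it requires substantial gains in precision, recall, or both.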
