Concepedia

Publication | Open Access

Towards a Better Understanding of Noise in Natural Language Processing

Citations: 44

References: 15

Year: 2021

Abstract

In this paper, we propose a definition and taxonomy of various types of non-standard textual content, generally referred to as "noise", in Natural Language Processing (NLP). While data pre-processing is undoubtedly important in NLP, especially when dealing with user-generated content, a broader understanding of the different sources of noise and how to deal with them has been largely neglected. We provide a comprehensive list of potential sources of noise, categorise and describe them, and show the impact of a subset of standard pre-processing strategies on different tasks. Our main goal is to raise awareness of non-standard content (which should not always be considered "noise") and of the need for careful, task-dependent pre-processing. This is an alternative to the blanket, all-encompassing solutions generally applied by researchers through "standard" pre-processing pipelines. The intention is for this categorisation to serve as a point of reference to support NLP researchers in devising strategies to clean, normalise or embrace non-standard content.
