A little goes a long way: Improving toxic language classification despite data scarcity
Access rights
openAccess, publishedVersion
A4 Article in conference proceedings
This publication is imported from the Aalto University research portal.
Language
en
Pages
18
Series
Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2991-3009
Abstract
Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation – generating new synthetic data from a labeled seed dataset – can help. However, the efficacy of data augmentation on toxic language classification has not been fully explored. We present the first systematic study of how data augmentation techniques affect performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT – a state-of-the-art pre-trained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.
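The abstract mentions GPT-2-generated sentences as one of the augmentation techniques. The sketch below is purely illustrative and is not the authors' pipeline: the "gpt2" model name, the generation parameters, the placeholder seed examples, and the TF-IDF + logistic regression classifier are all assumptions made for demonstration.

```python
# Illustrative sketch only: augmenting a scarce toxic-language seed set with
# GPT-2 continuations, then training a shallow classifier. Not the authors'
# exact pipeline; model choice, prompts, and hyperparameters are assumptions.
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny labeled seed set (hypothetical placeholder examples; 1 = toxic).
seed_texts = ["placeholder toxic example", "placeholder benign example"]
seed_labels = [1, 0]

# Sample continuations of the toxic seed sentences from off-the-shelf GPT-2.
generator = pipeline("text-generation", model="gpt2")
synthetic = []
for text, label in zip(seed_texts, seed_labels):
    if label != 1:
        continue
    outputs = generator(
        text,
        max_new_tokens=30,
        num_return_sequences=3,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id,
    )
    synthetic.extend(out["generated_text"] for out in outputs)

# Train a shallow logistic regression classifier on seed + synthetic data.
texts = seed_texts + synthetic
labels = seed_labels + [1] * len(synthetic)
features = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(features, labels)
```

In the paper, generated sentences are only one of three techniques combined to make shallow classifiers competitive with BERT; the sketch shows the generate-then-retrain loop in its simplest form.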
Citation
Juuti, M, Gröndahl, T, Flanagan, A & Asokan, N 2020, A little goes a long way: Improving toxic language classification despite data scarcity. in Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, pp. 2991-3009, Conference on Empirical Methods in Natural Language Processing, Virtual, Online, 16/11/2020. https://doi.org/10.18653/v1/2020.findings-emnlp.269