
A little goes a long way: Improving toxic language classification despite data scarcity


dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.author Juuti, Mika
dc.contributor.author Gröndahl, Tommi
dc.contributor.author Flanagan, Adrian
dc.contributor.author Asokan, N.
dc.date.accessioned 2020-12-31T08:37:48Z
dc.date.available 2020-12-31T08:37:48Z
dc.date.issued 2020-11-20
dc.identifier.citation Juuti, M., Gröndahl, T., Flanagan, A. & Asokan, N. 2020, 'A little goes a long way: Improving toxic language classification despite data scarcity', in Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, pp. 2991-3009, Conference on Empirical Methods in Natural Language Processing, Virtual, Online, 16/11/2020. https://doi.org/10.18653/v1/2020.findings-emnlp.269 en
dc.identifier.isbn 978-1-952148-90-3
dc.identifier.other PURE UUID: 1cc4aade-2fb6-4e84-bf57-7e5f7bcda162
dc.identifier.other PURE ITEMURL: https://research.aalto.fi/en/publications/1cc4aade-2fb6-4e84-bf57-7e5f7bcda162
dc.identifier.other PURE LINK: https://www.aclweb.org/anthology/2020.findings-emnlp.269/
dc.identifier.other PURE FILEURL: https://research.aalto.fi/files/54409179/Juuti_A_Little_Goes_a_Long_Way.2020.findings_emnlp.269.pdf
dc.identifier.uri https://aaltodoc.aalto.fi/handle/123456789/101418
dc.description.abstract Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation – generating new synthetic data from a labeled seed dataset – can help. The efficacy of data augmentation on toxic language classification has not been fully explored. We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT – a state-of-the-art pre-trained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed the best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints. en
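The abstract describes augmenting a scarce labeled seed set with GPT-2-generated sentences and training shallow classifiers on the result. Below is a minimal Python sketch of that general shape, not the authors' exact pipeline: the model choice, generation parameters, and toy seed data are illustrative assumptions.

# Sketch: GPT-2-based data augmentation for a scarce toxic-language seed set,
# followed by a shallow TF-IDF + logistic regression classifier.
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled seed set (1 = toxic, 0 = non-toxic); real seed data are scarce.
seed_texts = ["you are awful", "have a nice day",
              "I hate this group", "thanks for helping"]
seed_labels = [1, 0, 1, 0]

# Generate synthetic toxic-class sentences by sampling continuations of toxic seeds.
generator = pipeline("text-generation", model="gpt2")
synthetic = []
for text in (t for t, y in zip(seed_texts, seed_labels) if y == 1):
    outputs = generator(text, max_new_tokens=20,
                        num_return_sequences=2, do_sample=True)
    synthetic.extend(o["generated_text"] for o in outputs)

# Train a shallow classifier on the seed data plus the synthetic examples.
texts = seed_texts + synthetic
labels = seed_labels + [1] * len(synthetic)
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["what a terrible person you are"]))

In practice the paper combines several augmentation techniques and compares classifiers up to BERT; this sketch only illustrates the shallow-classifier-plus-generated-data idea from the abstract.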
dc.format.extent 18
dc.format.extent 2991-3009
dc.format.mimetype application/pdf
dc.language.iso en en
dc.relation.ispartof Conference on Empirical Methods in Natural Language Processing en
dc.relation.ispartofseries Findings of the Association for Computational Linguistics: EMNLP 2020 en
dc.rights openAccess en
dc.title A little goes a long way: Improving toxic language classification despite data scarcity en
dc.type A4 Article in conference proceedings en
dc.description.version Peer reviewed en
dc.contributor.department University of Waterloo
dc.contributor.department Adj. Prof Asokan N. group
dc.contributor.department Huawei Technologies
dc.contributor.department Helsinki Institute for Information Technology (HIIT)
dc.contributor.department Department of Computer Science en
dc.identifier.urn URN:NBN:fi:aalto-2020123160239
dc.identifier.doi 10.18653/v1/2020.findings-emnlp.269
dc.type.version publishedVersion

