Learning From Long-Tailed Data and Out-Of-Distribution Generalisation in NLP

School of Science | Bachelor's thesis
Electronic archive copy is available locally at the Harald Herlin Learning Centre. The staff of Aalto University has access to the electronic bachelor's theses by logging into Aaltodoc with their personal Aalto user ID.

Date

2024-04-26

Major/Subject

Data Science

Mcode

SCI3095

Degree programme

Aalto Bachelor’s Programme in Science and Technology

Language

en

Pages

21

Abstract

Natural Language Processing (NLP) has rapidly gained popularity during the 2020s, with new tools such as chatbots and translation services at the forefront of artificial intelligence. These machine learning models are not without flaws and are strongly influenced by the data they are given during training. Some of their problems are caused by long-tailed data: an uneven distribution of data points across the classes of a dataset. This distribution causes issues such as bias towards the head classes and a poor understanding of tail-class data points in context. This thesis compares methods used in research to tackle this issue and make NLP models more robust, taking inspiration from similar techniques used in visual recognition. The results show that i) class re-balancing can improve the accuracy of models trained on uneven datasets by 3.755% on average; ii) information augmentation shows similar results, with an average increase of 1.7%; iii) in-context learning can be a powerful tool for tackling long-tailed data problems. The thesis also discusses these results and suggests original ideas for techniques that could improve NLP models further.

Description

Supervisor

Korpi-Lagg, Maarit

Thesis advisor

Moisio, Anssi

Keywords

natural language processing, long-tailed data, out-of-distribution generalisation, machine learning, visual recognition
