Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Avela, Aleksi
dc.contributor.author: Ilmonen, Pauliina
dc.contributor.department: Department of Mathematics and Systems Analysis [en]
dc.contributor.groupauthor: Mathematical Statistics and Data Science [en]
dc.date.accessioned: 2026-03-18T09:12:26Z
dc.date.available: 2026-03-18T09:12:26Z
dc.date.issued: 2026
dc.description.abstract: Text classification is the task of automatically assigning text documents correct labels from a predefined set of categories. In real-life (text) classification tasks, observations and misclassification costs are often unevenly distributed between the classes - known as the problem of imbalanced data. Synthetic oversampling is a popular approach to imbalanced classification. The idea is to generate synthetic observations in the minority class to balance the classes in the training set. Many general-purpose oversampling methods can be applied to text data; however, imbalanced text data poses a number of distinctive difficulties that stem from the unique nature of text compared to other domains. One such factor is that when the sample size of text increases, the sample vocabulary (i.e., feature space) is likely to grow as well. We introduce a novel Markov chain based text oversampling method. The transition probabilities are estimated from the minority class but also partly from the majority class, thus allowing the minority feature space to expand in oversampling. We evaluate our approach against prominent oversampling methods and show that our approach is able to produce highly competitive results against the other methods in several real data examples, especially when the imbalance is severe. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 28
dc.format.mimetype: application/pdf
dc.identifier.citation: Avela, A & Ilmonen, P 2026, 'Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification', Journal of Machine Learning Research, vol. 27, 18, pp. 1-28. <https://www.jmlr.org/papers/v27/24-0428.html> [en]
dc.identifier.issn: 1532-4435
dc.identifier.issn: 1533-7928
dc.identifier.other: PURE UUID: fe5d51c5-7064-47b5-ba4a-710d79b0bae2
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/fe5d51c5-7064-47b5-ba4a-710d79b0bae2
dc.identifier.other: PURE LINK: https://www.jmlr.org/papers/v27/24-0428.html
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/213435947/Extrapolated_Markov_Chain_Oversampling_Method_for_Imbalanced_Text_Classification.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/143579
dc.identifier.urn: URN:NBN:fi:aalto-202603182921
dc.language.iso: en [en]
dc.publisher: Microtome Publishing
dc.relation.fundinginfo: The authors acknowledge support from the Academy of Finland via the Finnish Centre of Excellence in Randomness and Structures (decision number 346308). Moreover, Aleksi Avela acknowledges the personal grants from The Emil Aaltonen Foundation (Nuoren tutkijan apuraha / Young Researcher's Grant, number 230014, and työskentelyapuraha / working grant, number 240013).
dc.relation.ispartofseries: Journal of Machine Learning Research [en]
dc.relation.ispartofseries: Volume 27, pp. 1-28 [en]
dc.rights: openAccess [en]
dc.rights: CC BY
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject.keyword: Markov chain
dc.subject.keyword: imbalanced data
dc.subject.keyword: natural language processing
dc.subject.keyword: text classification
dc.subject.keyword: oversampling
dc.title: Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification [en]
dc.type: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) [fi]
dc.type.version: publishedVersion
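
The abstract describes estimating Markov chain transition probabilities from the minority class and partly from the majority class, so that synthetic documents can contain words outside the minority vocabulary. As a rough illustration only, and not the authors' algorithm, the Python sketch below blends word-to-word transition counts from both classes and samples a synthetic token sequence. The function names, the `majority_weight` parameter, the simple additive weighting, and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def transition_counts(docs):
    """Count word-to-word transitions (bigrams) in tokenized documents."""
    counts = defaultdict(Counter)
    for doc in docs:
        for prev, nxt in zip(doc, doc[1:]):
            counts[prev][nxt] += 1
    return counts


def mixed_transitions(minority_docs, majority_docs, majority_weight=0.1):
    """Blend minority counts with down-weighted majority counts.

    majority_weight is a hypothetical knob for this sketch: 0 uses the
    minority class only, larger values borrow more from the majority class.
    """
    mixed = defaultdict(lambda: defaultdict(float))
    for prev, nxts in transition_counts(minority_docs).items():
        for nxt, count in nxts.items():
            mixed[prev][nxt] += count
    if majority_weight > 0:
        for prev, nxts in transition_counts(majority_docs).items():
            for nxt, count in nxts.items():
                mixed[prev][nxt] += majority_weight * count
    return mixed


def generate(transitions, start_words, length, rng):
    """Random walk over the blended chain; stops early at a dead end."""
    word = rng.choice(start_words)
    doc = [word]
    while len(doc) < length:
        nxts = transitions.get(word)
        if not nxts:
            break
        candidates = list(nxts)
        weights = [nxts[w] for w in candidates]
        word = rng.choices(candidates, weights=weights, k=1)[0]
        doc.append(word)
    return doc


# Toy data (invented for illustration).
minority = [["cheap", "pills", "online"], ["cheap", "offer", "online"]]
majority = [["meeting", "online", "today"], ["offer", "accepted", "today"]]

chain = mixed_transitions(minority, majority, majority_weight=0.1)
starts = [doc[0] for doc in minority]
synthetic = generate(chain, starts, length=5, rng=random.Random(0))
print(synthetic)
```

In this toy setup the word "today" never occurs in the minority documents, yet it becomes reachable from "online" through the down-weighted majority counts, which mirrors the idea of letting the minority feature space expand during oversampling.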

Files

Original bundle

Name: Extrapolated_Markov_Chain_Oversampling_Method_for_Imbalanced_Text_Classification.pdf
Size: 1010.57 KB
Format: Adobe Portable Document Format