Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification
| dc.contributor | Aalto-yliopisto | fi |
| dc.contributor | Aalto University | en |
| dc.contributor.author | Avela, Aleksi | |
| dc.contributor.author | Ilmonen, Pauliina | |
| dc.contributor.department | Department of Mathematics and Systems Analysis | en |
| dc.contributor.groupauthor | Mathematical Statistics and Data Science | en |
| dc.date.accessioned | 2026-03-18T09:12:26Z | |
| dc.date.available | 2026-03-18T09:12:26Z | |
| dc.date.issued | 2026 | |
| dc.description.abstract | Text classification is the task of automatically assigning text documents correct labels from a predefined set of categories. In real-life (text) classification tasks, observations and misclassification costs are often unevenly distributed between the classes - known as the problem of imbalanced data. Synthetic oversampling is a popular approach to imbalanced classification. The idea is to generate synthetic observations in the minority class to balance the classes in the training set. Many general-purpose oversampling methods can be applied to text data; however, imbalanced text data poses a number of distinctive difficulties that stem from the unique nature of text compared to other domains. One such factor is that when the sample size of text increases, the sample vocabulary (i.e., feature space) is likely to grow as well. We introduce a novel Markov chain based text oversampling method. The transition probabilities are estimated from the minority class but also partly from the majority class, thus allowing the minority feature space to expand in oversampling. We evaluate our approach against prominent oversampling methods and show that our approach is able to produce highly competitive results against the other methods in several real data examples, especially when the imbalance is severe. | en |
| dc.description.version | Peer reviewed | en |
| dc.format.extent | 28 | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.citation | Avela, A & Ilmonen, P 2026, 'Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification', Journal of Machine Learning Research, vol. 27, 18, pp. 1-28. <https://www.jmlr.org/papers/v27/24-0428.html> | en |
| dc.identifier.issn | 1532-4435 | |
| dc.identifier.issn | 1533-7928 | |
| dc.identifier.other | PURE UUID: fe5d51c5-7064-47b5-ba4a-710d79b0bae2 | |
| dc.identifier.other | PURE ITEMURL: https://research.aalto.fi/en/publications/fe5d51c5-7064-47b5-ba4a-710d79b0bae2 | |
| dc.identifier.other | PURE LINK: https://www.jmlr.org/papers/v27/24-0428.html | |
| dc.identifier.other | PURE FILEURL: https://research.aalto.fi/files/213435947/Extrapolated_Markov_Chain_Oversampling_Method_for_Imbalanced_Text_Classification.pdf | |
| dc.identifier.uri | https://aaltodoc.aalto.fi/handle/123456789/143579 | |
| dc.identifier.urn | URN:NBN:fi:aalto-202603182921 | |
| dc.language.iso | en | en |
| dc.publisher | Microtome Publishing | |
| dc.relation.fundinginfo | The authors acknowledge support from the Academy of Finland via the Finnish Centre of Excellence in Randomness and Structures (decision number 346308). Moreover, Aleksi Avela acknowledges the personal grants from The Emil Aaltonen Foundation (Nuoren tutkijan apuraha [young researcher grant], number 230014, and työskentelyapuraha [working grant], number 240013). | |
| dc.relation.ispartofseries | Journal of Machine Learning Research | en |
| dc.relation.ispartofseries | Volume 27, pp. 1-28 | en |
| dc.rights | openAccess | en |
| dc.rights | CC BY | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject.keyword | Markov chain | |
| dc.subject.keyword | imbalanced data | |
| dc.subject.keyword | natural language processing | |
| dc.subject.keyword | text classification | |
| dc.subject.keyword | oversampling | |
| dc.title | Extrapolated Markov Chain Oversampling Method for Imbalanced Text Classification | en |
| dc.type | A1 Original article in a scientific journal | en |
| dc.type.version | publishedVersion | |
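
The abstract describes estimating Markov chain transition probabilities from the minority class and partly from the majority class, so that synthetic minority documents can contain words the minority sample never used. The Python sketch below illustrates that general idea only. It is not the authors' algorithm: the function names, the first-order (bigram) chain, the fixed `majority_weight` mixing parameter, and the toy documents are all assumptions made for illustration, and the paper itself should be consulted for the actual estimator.

```python
import random
from collections import defaultdict


def transition_counts(docs):
    """Count word-to-word transitions (bigrams) over tokenized documents."""
    counts = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        for prev, nxt in zip(doc, doc[1:]):
            counts[prev][nxt] += 1.0
    return counts


def mixed_transitions(minority_docs, majority_docs, majority_weight=0.1):
    """Blend minority counts with down-weighted majority counts.

    `majority_weight` is a hypothetical knob, not a parameter from the paper.
    """
    mixed = transition_counts(minority_docs)
    for prev, nxts in transition_counts(majority_docs).items():
        for nxt, count in nxts.items():
            mixed[prev][nxt] += majority_weight * count
    probs = {}
    for prev, nxts in mixed.items():
        total = sum(nxts.values())
        if total > 0:  # skip states that end up with no mass
            probs[prev] = {w: c / total for w, c in nxts.items() if c > 0}
    return probs


def generate(probs, start, max_length, rng):
    """Random walk on the chain until max_length or a word with no successor."""
    doc = [start]
    while len(doc) < max_length and doc[-1] in probs:
        words, weights = zip(*probs[doc[-1]].items())
        doc.append(rng.choices(words, weights=weights, k=1)[0])
    return doc


# Toy data (assumed): one minority and one majority document.
minority = [["rare", "disease", "case"]]
majority = [["disease", "outbreak", "reported"]]

probs = mixed_transitions(minority, majority, majority_weight=0.5)
synthetic = generate(probs, start="rare", max_length=5, rng=random.Random(0))
print(synthetic)
```

In this toy setup the word "outbreak" never occurs in the minority document, yet it becomes reachable from "disease" because the majority transitions contribute a down-weighted share of the counts. That is the sense in which the minority feature space can expand during oversampling.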
Files
Original bundle
- Name: Extrapolated_Markov_Chain_Oversampling_Method_for_Imbalanced_Text_Classification.pdf
- Size: 1010.57 KB
- Format: Adobe Portable Document Format