Addressing statistical and computational challenges in extreme multilabel classification with unbiased estimators, macro-averaged metrics, and hardware-aware implementations

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Babbar, Rohit, Prof., University of Bath, UK
dc.contributor.author: Schultheis, Erik
dc.contributor.department: Tietotekniikan laitos [fi]
dc.contributor.department: Department of Computer Science [en]
dc.contributor.school: Perustieteiden korkeakoulu [fi]
dc.contributor.school: School of Science [en]
dc.contributor.supervisor: Marttinen, Pekka, Prof., Aalto University, Department of Computer Science, Finland
dc.date.accessioned: 2025-09-26T09:00:43Z
dc.date.available: 2025-09-26T09:00:43Z
dc.date.defence: 2025-10-02
dc.date.issued: 2025
dc.description.abstract: This thesis tackles statistical and computational challenges in extreme multilabel classification (XMC) problems, that is, in tasks where the label space is gigantic, possibly comprising millions of labels. Such problems are plagued by missing labels and data scarcity, particularly in the form of tail labels, and the enormous label space turns operations that are cheap in typical machine learning problems, such as calculating the loss in the classification layer, into computationally challenging tasks. Towards addressing the missing-label problem, this thesis derives unbiased estimators for generic multilabel loss functions under the assumption that a propensity model is available. A critical look at the propensity model that is in widespread use in the current XMC literature is provided, in particular regarding the problematic double role of using propensities both to compensate for missing labels and as a measure of performance on infrequent tail labels. As an alternative, macro-averaged performance metrics are proposed, and prediction algorithms aiming to optimize these metrics in two different inference frameworks are presented. The thesis presents a new approach to train linear extreme classifiers, still an important baseline, significantly faster than before, owing to a new weight initialization scheme and code that is aware of the memory layout of modern NUMA processors. Additionally, it presents a novel way to exploit weight sparsity, already at the training stage, to reduce the on-device memory consumption. This is achieved by combining dynamic sparse training algorithms with an efficient weight storage format that at the same time allows for a fast implementation of matrix multiplication. [en]
dc.description.accessibilityfeature: navigointi mahdollista [fi]
dc.description.accessibilityfeature: strukturell navigation [sv]
dc.description.accessibilityfeature: structural navigation [en]
dc.format.extent: 93 + app. 209
dc.format.mimetype: application/pdf [en]
dc.identifier.isbn: 978-952-64-2730-0 (electronic)
dc.identifier.isbn: 978-952-64-2731-7 (printed)
dc.identifier.issn: 1799-4942 (electronic)
dc.identifier.issn: 1799-4934 (printed)
dc.identifier.issn: 1799-4934 (ISSN-L)
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/139176
dc.identifier.urn: URN:ISBN:978-952-64-2730-0
dc.language.iso: en [en]
dc.opn: Menon, Aditya Krishna, Research Scientist, Google, USA
dc.publisher: Aalto University [en]
dc.publisher: Aalto-yliopisto [fi]
dc.relation.haspart: [Publication 1]: Mohammadreza Qaraei, Erik Schultheis, Priyanshu Gupta, and Rohit Babbar. Convex surrogates for unbiased loss functions in extreme classification with missing labels. In WWW ’21: Proceedings of the Web Conference 2021, Ljubljana, pages 3711–3720, April 2021. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202108098273. DOI: 10.1145/3442381.3450139
dc.relation.haspart: [Publication 2]: Erik Schultheis and Rohit Babbar. Unbiased Loss Functions for Multilabel Classification with Missing Labels. Accepted for publication in Transactions on Machine Learning Research, September 2025. DOI: 10.48550/arXiv.2109.11282
dc.relation.haspart: [Publication 3]: Erik Schultheis, Rohit Babbar, Marek Wydmuch, Krzysztof Dembczyński. On Missing Labels, Long-tails and Propensities in Extreme Multi-label Classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1547–1557, August 2022. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202208244984. DOI: 10.1145/3534678.3539466
dc.relation.haspart: [Publication 4]: Erik Schultheis, Rohit Babbar. Speeding-up one-versus-all training for extreme classification via mean-separating initialization. Machine Learning, volume 111, issue 11, pages 3953–3976, November 2022. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202211096451. DOI: 10.1007/s10994-022-06228-2
dc.relation.haspart: [Publication 5]: Erik Schultheis, Rohit Babbar. Towards Memory-Efficient Training for Extremely Large Output Spaces–Learning with 670k Labels on a Single Commodity GPU. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 689–704, September 2023. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202408065234. DOI: 10.1007/978-3-031-43418-1_41
dc.relation.haspart: [Publication 6]: Nasib Ullah, Erik Schultheis, Mike Lasby, Yani Ioannou, Rohit Babbar. Navigating Extremes: Dynamic Sparsity in Large Output Spaces. In Advances in Neural Information Processing Systems, Vol. 37, 2024
dc.relation.haspart: [Publication 7]: Erik Schultheis, Marek Wydmuch, Wojciech Kotłowski, Rohit Babbar, Krzysztof Dembczyński. Generalized test utilities for long-tail performance in extreme multi-label classification. In Advances in Neural Information Processing Systems, Vol. 36, 2023. DOI: 10.48550/arXiv.2311.05081
dc.relation.haspart: [Publication 8]: Erik Schultheis, Wojciech Kotłowski, Marek Wydmuch, Rohit Babbar, Strom Borman, Krzysztof Dembczyński. Consistent algorithms for multilabel classification with macro-at-k metrics. In The Twelfth International Conference on Learning Representations, May 2024. DOI: 10.48550/arXiv.2401.16594
dc.relation.ispartofseries: Aalto University publication series Doctoral Theses [en]
dc.relation.ispartofseries: 180/2025
dc.rev: Menon, Aditya Krishna, Research Scientist, Google, USA
dc.rev: Dhillon, Inderjit, Prof., University of Texas at Austin, USA
dc.subject.keyword: multilabel classification [en]
dc.subject.keyword: missing labels [en]
dc.subject.keyword: classification with large output spaces [en]
dc.subject.keyword: longtailed prediction [en]
dc.subject.keyword: sparse neural networks [en]
dc.subject.other: Computer science [en]
dc.title: Addressing statistical and computational challenges in extreme multilabel classification with unbiased estimators, macro-averaged metrics, and hardware-aware implementations [en]
dc.type: G5 Artikkeliväitöskirja [fi]
dc.type.dcmitype: text [en]
dc.type.ontasot: Doctoral dissertation (article-based) [en]
dc.type.ontasot: Väitöskirja (artikkeli) [fi]
local.aalto.acrisexportstatus: checked 2025-10-02_1028
local.aalto.archive: yes
local.aalto.formfolder: 2025_09_26_klo_08_32
local.aalto.infra: Science-IT

Files

Name: isbn978952627300.pdf
Size: 1.53 MB
Format: Adobe Portable Document Format