Browsing by Author "Xiao, Han"
Now showing 1 - 10 of 10
Results Per Page
Sort Options
Item Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification(Springer Netherlands, 2020-11) Khandagale, Sujay; Xiao, Han; Babbar, Rohit; Department of Computer Science; Professorship Babbar Rohit; Helsinki Institute for Information Technology (HIIT); Adj. Prof. Gionis Aris group; Department of Computer ScienceExtreme multi-label classification (XMC) refers to supervised multi-label learning involving hundreds of thousands or even millions of labels.In this paper, we develop a suite of algorithms, called Bonsai, which generalizes the notion of label representation in XMC, and partitions the labels in the representation space to learn shallow trees.We show three concrete realizations of this label representation space including: (i) the input space which is spanned by the input features, (ii) the output space spanned by label vectors based on their co-occurrence with other labels, and (iii) the joint space by combining the input and output representations. Furthermore, the constraint-free multi-way partitions learnt iteratively in these spaces lead to shallow trees.By combining the effect of shallow trees and generalized label representation, Bonsai achieves the best of both worlds—fast training which is comparable to state-of-the-art tree-based methods in XMC, and much better prediction accuracy, particularly on tail-labels. On a benchmark Amazon-3M dataset with 3 million labels, Bonsai outperforms a state-of-the-art one-vs-rest method in terms of prediction accuracy, while being approximately 200 times faster to train. The code for Bonsai is available at https://github.com/xmc-aalto/bonsai.Item Concise and interpretable multi-label rule sets(Springer, 2023-12) Ciaperoni, Martino; Xiao, Han; Gionis, Aristides; Department of Computer Science; Helsinki Institute for Information Technology (HIIT); KTH Royal Institute of Technology; Department of Computer ScienceMulti-label classification is becoming increasingly ubiquitous, but not much attention has been paid to interpretability. In this paper, we develop a multi-label classifier that can be represented as a concise set of simple “if-then” rules, and thus, it offers better interpretability compared to black-box models. Notably, our method is able to find a small set of relevant patterns that lead to accurate multi-label classification, while existing rule-based classifiers are myopic and wasteful in searching rules, requiring a large number of rules to achieve high accuracy. In particular, we formulate the problem of choosing multi-label rules to maximize a target function, which considers not only discrimination ability with respect to labels, but also diversity. Accounting for diversity helps to avoid redundancy, and thus, to control the number of rules in the solution set. To tackle the said maximization problem, we propose a 2-approximation algorithm, which circumvents the exponential-size search space of rulesusing a novel technique to sample highly discriminative and diverse rules. In addition to our theoretical analysis, we provide a thorough experimental evaluation and a case study, which indicate that our approach offers a trade-off between predictive performance and interpretability that is unmatched in previous work.Item Data science for social good - Theory and applications in epidemics, polarization, and fair clustering(Aalto University, 2020) Xiao, Han; Tietotekniikan laitos; Department of Computer Science; Data Mining group; Perustieteiden korkeakoulu; School of Science; Gionis, Aristides, Adj. Prof., Aalto University, Department of Computer Science, FinlandTechnical innovations have transformed our lives fundamentally, in both positive and negative ways. In this thesis, we look at the negative side. We identify three problems to tackle, namely epidemics, online polarization, and bias in automatic decision-making processes, and approach them using data-driven approaches. Thanks to globalization, our world is more interconnected than before. While trade and exchange of ideas are happening at an unprecedented rate, the rapid spread of disease is happening globally, as evidenced by the pandemic of COVID-19. To contain epidemics effectively, it is crucial to identify as many infected persons as possible. In practice, however, it is almost impossible to obtain the complete information of who is infected. We study this challenge in the context of social networks, where a disease spreads via network edges. Specifically, we assume only a subset of all infections is observed and we seek to infer who else is infected. Furthermore, we consider two different settings: (1) temporal setting, in which infection time is also observed and, (2) probabilistic setting, in which infection probability of each individual is produced.Social-media platforms enable people to share and access information easily. Meanwhile, flawed designs in these platforms contribute to the formation of online polarization. As a result, people are unlikely to adopt new ideas that differ from their beliefs, which finally leads to a polarized society. To tackle online polarization, we argue that it is important to discover who is involved in the polarization. We consider a problem setting under social networks, in which the interaction between two persons is either friendly or antagonistic. Furthermore, given some seed nodes that represent different sides of a polarized subgraph, we seek to find the polarized subgraph that is relevant to the seeds. Finding such structures can be used to understand the nature of polarization, and to mitigate the degree of polarization. Machine-learning algorithms allow the automation of many decision-making processes, for example, deciding whether to grant a loan to a loan applicant. However, unfair results that favor one demographic group (e.g., male) over another (e.g., female) are witnessed. The unfair outcomes may further affect the well-being of the mistreated groups. In this thesis, we focus on the task of data clustering, which has applications in infrastructure design and online social media. We discuss potential fairness issues in existing clustering algorithms that are designed to be fair. As a result, we propose a new fair clustering formulation that captures a novel fairness notion. For all proposed problems, we study their complexity and design algorithms whose theoretical performance is analyzed. We evaluate all proposed algorithms' efficacy in both synthetic and real-world settings.Item Discovering topically- and temporally-coherent events in interaction networks(2016) Xiao, Han; Rozenshtein, Polina; Gionis, Aristides; Department of Computer Science; Adj. Prof. Gionis Aris group; Helsinki Institute for Information Technology (HIIT)With the increasing use of online communication platforms, such as email, Twitter, and messaging applications, we are faced with a growing amount of data that combine content (what is said), time (when), and user (by whom) information. Discovering meaningful patterns and understand what is happening in this data is an important challenge. We consider the problem of mining online communication data and finding top-k temporal events. A temporal event is a coherent topic that is discussed frequently in a relatively short time span, while its information flow respects the underlying network. Our method consists of two steps. We first introduce the notion of interaction meta-graph, which connects associated interactions. Using this notion, we define a temporal event to be a subset of interactions that (i) are topically and temporally close and (ii) correspond to a tree that captures the information flow. Finding the best temporal event leads to a budget version of the prize-collecting Steiner-tree (PCST) problem, which we solve using three different methods: a greedy approach, a dynamic-programming algorithm, and an adaptation to an existing approximation algorithm. Finding the top-k events maps to a maximum set-cover problem, and thus, solved by greedy algorithm. We compare and analyze our algorithms in both synthetic and real datasets, such as Twitter and email communication. The results show that our methods are able to detect meaningful temporal events. The software related to this paper are available at https://github.com/xiaohan2012/lst.Item Estimating the Effects of Public Health Measures by SEIR(MH) Model of COVID-19 Epidemic in Local Geographic Areas(Frontiers Research Foundation, 2022-01-04) Qiu, Tianyi; Xiao, Han; Brusic, Vladimir; Department of Computer Science; Helsinki Institute for Information Technology (HIIT); Adj. Prof. Gionis Aris group; Fudan University; University of Nottingham Ningbo ChinaThe COVID-19 pandemic of 2020–21 has been a major challenge to public health systems worldwide. Mathematical models of epidemic are useful tools for assessment of the situation and for providing decision-making support for relevant authorities. We developed and implemented SEIR(MH) model that extends the conventional SEIR model with parameters that define public lockdown (the level and start of lockdown) and the medical system capacity to contain patients. Comparative modeling of four regions in Europe that have similar population sizes and age structures, but different public health systems, was performed: Baden-Württemberg, Lombardy, Belgium, and Switzerland. Modeling suggests that the most effective measure for controlling epidemic is early lockdown (exponential effect), followed by the number of available hospital beds (linear effect if the capacity is insufficient, with diminishing returns when the capacity is sufficient). Dynamic management of lockdown levels is likely to produce better outcomes than strict lockdown.Item Media Attention to Science(2017) Garimella, Venkata; Xiao, Han; Adj. Prof. Gionis Aris group; Department of Computer ScienceHow has media attention to science changed over the past decade? Does media attention bring scientific attention too? To answer these questions, we collected media attention statistics from Altmetric.com for over 40,000 papers published in PNAS journal in the last 13 years. Our analysis reveals that (i) Media attention to science has slowly, but steadily increased over time, (ii) Media attention doesn't necessarily translate into scientific attention (citations) and, (iii) It is non-trivial to predict the amount of media attention a paper gets from attributes such as authors, author affiliations, abstract, and title.Item Mining Signed Networks(2020-04-20) Gionis, Aristides; Matakos, Antonis; Ordozgoiti, Bruno; Xiao, Han; Helsinki Institute for Information Technology (HIIT); Department of Computer ScienceSigned networks transform the information encoded by conventional graphs by attaching either a positive or a negative sign to every edge. This subtle modification vastly enhances the modelling capabilities of graphs. For instance, in a social network, where edges might represent interactions between users, the sign may determine whether an exchange was friendly or hostile. However, the introduction of edge signs invalidates many established methods and results from the graph-mining toolbox, and thus, problem formulations and algorithmic techniques must be studied anew. In this tutorial we aim to provide an overview of the literature in mining signed networks. We will present the most important theoretical results since their inception to the present day, we will discuss some of the most common applications, and we will reflect on emerging applications and directions for future work.Item Reconstructing a cascade from temporal observations(2018) Xiao, Han; Rozenshtein, Polina; Tatti, Nikolaj; Gionis, Aristides; School services,SCI; Department of Computer Science; Adj. Prof. Gionis Aris group; Helsinki Institute for Information Technology (HIIT)Given a subset of active nodes in a network can we reconstruct the cascade that has generated these observations? This is a problem that has been studied in the literature, but here we focus in the case that temporal information is available about the active nodes. In particular, we assume that in addition to the subset of active nodes we also know their activation time. We formulate this cascade-reconstruction problem as a variant of a Steiner-tree problem: we ask to find a tree that spans all reported active nodes while satisfying temporal-consistency constraints. For the proposed problem we present three approximation algorithms. The best algorithm in terms of quality achieves a O(√k)-approximation guarantee, where k is the number of active nodes, while the most efficient algorithm has linearithmic running time, making it scalable to very large graphs. We evaluate our algorithms on real-world networks with both simulated and real cascades. Our results indicate that utilizing the available temporal information allows for more accurate cascade reconstruction. Furthermore, our objective leads to finding the “backbone” of the cascade and it gives solutions of very high precision.Item Robust Cascade Reconstruction by Steiner Tree Sampling(IEEE, 2018-11) Xiao, Han; Aslay, Cigdem; Gionis, Aristides; Adj. Prof. Gionis Aris group; Department of Computer Science; Helsinki Institute for Information Technology (HIIT)We consider a network where an infection has taken place and a subset of infected nodes has been partially observed. Our goal is to reconstruct the underlying cascade that is likely to have generated these observations. We reduce this cascadereconstruction problem to computing the marginal probability that a node is infected given the partial observations, which is a #P-hard problem. To circumvent this issue, we resort to estimating infection probabilities by generating a sample of probable cascades, which span the nodes that have already been observed to be infected, and avoid the nodes that have been observed to be uninfected. The sampling problem corresponds to sampling directed Steiner trees with a given set of terminals, which is a problem of independent interest and has received limited attention in the literature. For the latter problem we propose two novel algorithms with provable guarantees on the sampling distribution of the returned Steiner trees. The resulting method improves over state-of-the-art approaches that often make explicit assumptions about the infection-propagation model, or require additional parameters. Our method provides a more robust approach to the cascadereconstruction problem, which makes weaker assumptions about the infection model, requires fewer additional parameters, and can be used to estimate node infection probabilities. We experimentally validate the proposed reconstruction algorithm on realworld graphs with both synthetic and real cascades. We show that our method outperforms all other baseline strategies in most cases.Item Searching for polarization in signed graphs: A local spectral approach(2020-04-20) Xiao, Han; Ordozgoiti, Bruno; Gionis, Aristides; Department of Computer Science; Helsinki Institute for Information Technology (HIIT); Adj. Prof. Gionis Aris groupSigned graphs have been used to model interactions in social networks, which can be either positive (friendly) or negative (antagonistic). The model has been used to study polarization and other related phenomena in social networks, which can be harmful to the process of democratic deliberation in our society. An interesting and challenging task in this application domain is to detect polarized communities in signed graphs. A number of different methods have been proposed for this task. However, existing approaches aim at finding globally optimal solutions. Instead, in this paper we are interested in finding polarized communities that are related to a small set of seed nodes provided as input. Seed nodes may consist of two sets, which constitute the two sides of a polarized structure. In this paper we formulate the problem of finding local polarized communities in signed graphs as a locally-biased eigen-problem. By viewing the eigenvector associated with the smallest eigenvalue of the Laplacian matrix as the solution of a constrained optimization problem, we are able to incorporate the local information as an additional constraint. In addition, we show that the locally-biased vector can be used to find communities with approximation guarantee with respect to a local analogue of the Cheeger constant on signed graphs. By exploiting the sparsity in the input graph, an indicator-vector for the polarized communities can be found in time linear to the graph size. Our experiments on real-world networks validate the proposed algorithm and demonstrate its usefulness in finding local structures in this semi-supervised manner.