Input variable selection methods for construction of interpretable regression models

Doctoral thesis (article-based)
Electronic publication (1039 KB, 70 pages)
TKK dissertations in information and computer science, 11
Large data sets are collected and analyzed in a variety of research problems. Modern computers make it possible to measure ever increasing numbers of samples and variables. Automated methods are required for the analysis, since traditional manual approaches are impractical due to the growing amount of data. In the present thesis, computational methods that are based on observed data, subject to modelling assumptions, are presented for producing useful knowledge from the data generating system. Input variable selection methods are proposed for both linear and nonlinear function approximation problems. Variable selection has gained increasing attention in many applications, because it assists in the interpretation of the underlying phenomenon: the selected variables highlight the most relevant characteristics of the problem. In addition, rejecting irrelevant inputs may reduce the training time and improve the prediction accuracy of the model.
Linear models play an important role in data analysis, since they are computationally efficient and form the basis for many more complicated models. This work especially considers the simultaneous estimation of several response variables as linear combinations of the same subset of inputs. Input selection methods originally designed for a single response variable are extended to the case of multiple responses. The assumption of linearity is not, however, adequate in all problems, so artificial neural networks are applied to model unknown nonlinear dependencies between the inputs and the response.
The first set of methods comprises efficient stepwise selection strategies that assess the usefulness of the inputs in the model. Alternatively, input selection is formulated as an optimization problem: an objective function is minimized subject to sparsity constraints that encourage the selection of few inputs. The trade-off between prediction accuracy and the number of input variables is adjusted by continuous-valued sparsity parameters.
Results from extensive experiments on both simulated functions and real benchmark data sets are reported. In comparisons with existing variable selection strategies, the proposed methods typically improve the results either by reducing the prediction error, by decreasing the number of selected inputs, or with respect to both criteria. The constructed sparse models are also found to produce more accurate predictions than models that include all the input variables.
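As a loose illustration of the sparsity-constrained formulation described above (a minimal sketch, not the thesis's exact algorithms), the following Python snippet fits a multiresponse linear model with a row-wise group penalty: an input variable is either used in the linear combinations for all responses or dropped from the model entirely, and a continuous parameter `lam` adjusts the trade-off between accuracy and sparsity. The proximal-gradient solver, the toy data, and the value of `lam` are all invented for this example.

```python
import numpy as np

def multiresponse_group_lasso(X, Y, lam, n_iter=500):
    """Proximal-gradient sketch for
        min_W  0.5 * ||Y - X W||_F^2  +  lam * sum_j ||W[j, :]||_2.
    The row-wise penalty zeroes out whole rows of W, i.e. it drops an
    input variable from the models of all responses at once."""
    p, q = X.shape[1], Y.shape[1]
    W = np.zeros((p, q))
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        V = W - step * (X.T @ (X @ W - Y))   # gradient step on the squared error
        row_norms = np.linalg.norm(V, axis=1, keepdims=True)
        # group soft-thresholding: shrink each row of W toward zero
        W = np.maximum(0.0, 1.0 - step * lam / np.maximum(row_norms, 1e-12)) * V
    return W

# Toy problem: 2 responses generated from only the first 2 of 5 inputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
W_true = np.zeros((5, 2))
W_true[0] = [1.5, -1.0]
W_true[1] = [0.0, 2.0]
Y = X @ W_true + 0.1 * rng.standard_normal((200, 2))

W_hat = multiresponse_group_lasso(X, Y, lam=5.0)
selected = np.flatnonzero(np.linalg.norm(W_hat, axis=1) > 1e-8)
print(selected)  # the penalty should retain only the relevant inputs 0 and 1
```

With a sufficiently large `lam`, the rows of `W_hat` belonging to the irrelevant inputs become exactly zero, which is what makes this family of models directly interpretable.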
Keywords: data analysis, machine learning, function approximation, multiresponse linear regression, artificial neural networks, input variable selection, model selection, constrained optimization
Publications
  • [Publication 1]: Timo Similä and Jarkko Tikka. 2005. Multiresponse sparse regression with application to multidimensional scaling. In: Włodzisław Duch, Janusz Kacprzyk, Erkki Oja, and Sławomir Zadrożny (editors). Proceedings of the 15th International Conference on Artificial Neural Networks: Formal Models and Their Applications (ICANN 2005). Part II. Warsaw, Poland. 11-15 September 2005. Springer-Verlag. Lecture Notes in Computer Science, volume 3697, pages 97-102. © 2005 by authors and © 2005 Springer Science+Business Media. By permission.
  • [Publication 2]: Timo Similä and Jarkko Tikka. 2006. Common subset selection of inputs in multiresponse regression. In: Proceedings of the 2006 International Joint Conference on Neural Networks (IJCNN 2006). Vancouver, BC, Canada. 16-21 July 2006, pages 1908-1915. © 2006 IEEE. By permission.
  • [Publication 3]: Timo Similä and Jarkko Tikka. 2007. Input selection and shrinkage in multiresponse linear regression. Computational Statistics & Data Analysis, volume 52, number 1, pages 406-422. © 2007 Elsevier Science. By permission.
  • [Publication 4]: Jarkko Tikka and Jaakko Hollmén. 2008. Sequential input selection algorithm for long-term prediction of time series. Neurocomputing, volume 71, numbers 13-15, pages 2604-2615. © 2008 Elsevier Science. By permission.
  • [Publication 5]: Jarkko Tikka and Jaakko Hollmén. 2008. Selection of important input variables for RBF network using partial derivatives. In: Michel Verleysen (editor). Proceedings of the 16th European Symposium on Artificial Neural Networks - Advances in Computational Intelligence and Learning (ESANN 2008). Bruges, Belgium. 23-25 April 2008. d-side publications, pages 167-172.
  • [Publication 6]: Jarkko Tikka. 2007. Input selection for radial basis function networks by constrained optimization. In: Joaquim Marques de Sá, Luís A. Alexandre, Włodzisław Duch, and Danilo Mandic (editors). Proceedings of the 17th International Conference on Artificial Neural Networks (ICANN 2007). Part I. Porto, Portugal. 9-13 September 2007. Springer-Verlag. Lecture Notes in Computer Science, volume 4668, pages 239-248. © 2007 by author and © 2007 Springer Science+Business Media. By permission.
  • [Publication 7]: Jarkko Tikka. 2008. Simultaneous input variable and basis function selection for RBF networks. Neurocomputing, accepted for publication. © 2008 by author and © 2008 Elsevier Science. By permission.