Correlation-compressed direct-coupling analysis

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Gao, Chen Yi [en_US]
dc.contributor.author: Zhou, Hai Jun [en_US]
dc.contributor.author: Aurell, Erik [en_US]
dc.contributor.department: Department of Applied Physics [en]
dc.contributor.department: Department of Computer Science [en]
dc.contributor.groupauthor: Centre of Excellence in Computational Inference, COIN [en]
dc.contributor.groupauthor: Complex Systems and Materials [en]
dc.contributor.groupauthor: Professorship Kaski Samuel [en]
dc.contributor.organization: Institute of Theoretical Physics of the Chinese Academy of Sciences [en_US]
dc.date.accessioned: 2018-10-16T08:55:28Z
dc.date.available: 2018-10-16T08:55:28Z
dc.date.issued: 2018-09-11 [en_US]
dc.description.abstract: Learning Ising or Potts models from data has become an important topic in statistical physics and computational biology, with applications to predictions of structural contacts in proteins and other areas of biological data analysis. The corresponding inference problems are challenging since the normalization constant (partition function) of the Ising or Potts distribution cannot be computed efficiently on large instances. Different ways to address this issue have resulted in a substantial amount of methodological literature. In this paper we investigate how these methods could be used on much larger data sets than studied previously. We focus on a central aspect, that in practice these inference problems are almost always severely undersampled, and the operational result is almost always a small set of leading predictions. We therefore explore an approach where the data are prefiltered based on empirical correlations, which can be computed directly even for very large problems. Inference is only used on the much smaller instance in a subsequent step of the analysis. We show that in several relevant model classes such a combined approach gives results of almost the same quality as inference on the whole data set. It can therefore provide a potentially very large computational speedup at the price of only marginal decrease in prediction quality. We also show that the results on whole-genome epistatic couplings that were obtained in a recent computation-intensive study can be retrieved by our approach. The method of this paper hence opens up the possibility to learn parameters describing pairwise dependences among whole genomes in a computationally feasible and expedient manner. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 1-15
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Gao, C Y, Zhou, H J & Aurell, E 2018, 'Correlation-compressed direct-coupling analysis', Physical Review E, vol. 98, no. 3, 032407, pp. 1-15. https://doi.org/10.1103/PhysRevE.98.032407 [en]
dc.identifier.doi: 10.1103/PhysRevE.98.032407 [en_US]
dc.identifier.issn: 2470-0045
dc.identifier.issn: 1550-2376
dc.identifier.other: PURE UUID: 87a87ad8-4c93-4e61-9b9f-bf8d6c9071f1 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/87a87ad8-4c93-4e61-9b9f-bf8d6c9071f1 [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85053241828&partnerID=8YFLogxK [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/28341525/PhysRevE.98.032407.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/34297
dc.identifier.urn: URN:NBN:fi:aalto-201810165374
dc.language.iso: en [en]
dc.relation.ispartofseries: Physical Review E [en]
dc.relation.ispartofseries: Volume 98, issue 3 [en]
dc.rights: openAccess [en]
dc.title: Correlation-compressed direct-coupling analysis [en]
dc.type: A1 Original article in a scientific journal (Finnish: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä) [fi]
dc.type.version: publishedVersion

Files