Statistical mechanics approaches to covariation in protein families
Extracting functional and structural information about protein families from the covariation of residues in multiple sequence alignments is an important challenge in computational biology. In this talk I will introduce a framework inspired by statistical physics for analyzing these covariations, which naturally unifies existing methods in the literature. Our approach allows us to identify statistically relevant ‘patterns’ of residues that are specific to a protein family. We show that many patterns involve only a small number of sites on the protein sequence, and that these sites are in close contact in the 3D fold. Hence, the patterns allow us to make accurate predictions about the contact map from sequence data alone. Furthermore, we show that the dimensionality reduction achieved by retaining only the statistically most significant patterns avoids overfitting for small sequence alignments and improves our ability to extract residue contacts in that regime.
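To make the idea of keeping only the most significant patterns concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the method presented in the talk: it assumes that a "pattern" can be approximated by a leading eigenvector of the covariance matrix of the one-hot encoded alignment, and that a pair of sites can be scored by the Frobenius norm of the corresponding block of the low-rank covariance. The function names (`one_hot`, `top_patterns`, `site_pair_scores`) and the alphabet ordering are hypothetical.

```python
import numpy as np

# Assumed alphabet: 20 amino acids plus the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(msa):
    """Encode an alignment (list of equal-length strings) as a 0/1 matrix.

    Column i * q + a is 1 when site i carries symbol a.
    """
    index = {symbol: k for k, symbol in enumerate(ALPHABET)}
    n_seq, n_sites, q = len(msa), len(msa[0]), len(ALPHABET)
    x = np.zeros((n_seq, n_sites * q))
    for s, seq in enumerate(msa):
        for i, symbol in enumerate(seq):
            x[s, i * q + index[symbol]] = 1.0
    return x


def top_patterns(msa, n_patterns):
    """Return the n_patterns largest eigenvalues and their eigenvectors.

    Assumption: a 'pattern' is approximated here by a leading eigenvector
    of the empirical covariance matrix of the one-hot encoded alignment.
    """
    cov = np.cov(one_hot(msa), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_patterns]
    return eigvals[order], eigvecs[:, order]


def site_pair_scores(eigvals, eigvecs, n_sites):
    """Score each pair of sites from the low-rank covariance.

    The score of (i, j) is the Frobenius norm of the q x q block coupling
    the symbols at site i with the symbols at site j.
    """
    q = len(ALPHABET)
    low_rank = (eigvecs * eigvals) @ eigvecs.T
    blocks = low_rank.reshape(n_sites, q, n_sites, q)
    scores = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)  # ignore self-pairs
    return scores
```

On a toy alignment such as `["AKC", "ARC", "DKE", "DRE"]`, where the first and third sites covary perfectly and the second varies independently, calling `top_patterns(msa, 1)` followed by `site_pair_scores(vals, vecs, 3)` gives the highest score to the pair of covarying sites. Real alignments would additionally need sequence reweighting and pseudocounts, which this sketch omits.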