Functional selection acting on regulatory regions such as cis-regulatory modules (CRMs) causes differential enrichment of nucleotide content therein across evolutionarily related organisms. The exact impact of such selection on gene regulatory mechanisms is not yet clearly known, but one important characteristic of CRMs in higher organisms is that they are often multi-functional: under different conditions and at different times, the same sequence in a CRM can drive different regulatory functions by recruiting different combinations of transcription regulatory proteins. Existing models of transcription factor binding sites (TFBSs), such as position weight matrices (PWMs) or single dictionaries of oligomers, cannot capture the multi-functionality of CRMs and offer no insight into the evolutionary mechanism of this phenomenon. In this paper, we develop a novel Admixture of Stochastic Dictionaries (ASD) model for CRMs and the motifs therein, which succinctly extracts and exposes the sequence-compositional basis of such multi-functionality.
We have developed algorithms for learning the ASD model within one organism and across multiple evolutionarily related organisms. These algorithms allow us to examine the multi-functionality of CRMs, and the way it evolves, by analyzing the extent of change in each functionality-specific dictionary of the ASD models across organisms. We show that the learned component dictionaries in our model are indeed functionally discriminative and can be used to predict regulatory regions. We further show that this discriminative power is based on their TF binding affinity scores. We find that corresponding functionality-specific dictionaries across species have similar (but non-identical) distributions over oligomers, so that regulatory information from one species can be used to predict regulatory regions in other species. We conclude that our model is easy to estimate and interpret, that it serves as a good platform for modeling the functional evolution of the regulatory genome, and that it is a useful tool for identifying regulatory function based on these properties.