Computational Methods for Analyzing the Architecture and Evolution of the Regulatory Genome

Pradipta Ray

Thesis Committee: Eric P. Xing (co-chair), Veronica F. Hinman (co-chair), Jaime Carbonell, Ziv Bar-Joseph, Martin Kreitman


Diversity in the forms of animal life, in the functions of an organism's cells, and in the function of a single cell over its lifetime or in response to different stimuli drives many processes in biology. Gene control circuitry dictates whether, when, and by how much a particular gene is expressed in a cell in order to create such diversity. Such control mechanisms are often present in the genome in the form of cis-regulatory modules: regions in the neighborhood of each gene that contain sequence motifs (particular genetic subsequences that are noisy copies of each other) where the proteins that regulate the gene's expression (called transcription factors) bind. The large amounts of genomic (and often other corresponding experimental) data involved, and the complexity of the resulting analysis, lend themselves well to a machine learning setting.

One goal of this thesis is to explore supervised motif detection in regulatory sequences by maximally utilizing the inherent grammar or structure of the cis-regulatory modules. We achieve this goal by using hierarchical and generalized Hidden Markov Models in a Bayesian setting. Another goal is to explore supervised motif detection using multiple sequence alignments, specifically modeling functional turnover, a confounding phenomenon in phylogenetic analysis in which orthologous sequences across even closely related species have varying functionality due to rapid coordinated evolutionary change. We developed a generative graphical model that treats the multiple sequence alignment as the output of a mixture of phylogenies, and we perform inference on it to identify regulatory regions and turnover events in related species. A third goal is to analyze diverse sources of evidence, determine which genetic and epigenetic features correlate well with binding site locations, and use such information to create a discriminative model for supervised prediction of binding sites. For this purpose we use a conditional random field, a discriminative framework that assigns weights to genomic, evolutionary, translational, compositional, and epigenetic features.
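As a much-simplified illustration of the idea of labeling sequence positions with a hidden-state model, the sketch below runs Viterbi decoding on a plain two-state Hidden Markov Model (background versus motif). It is not the hierarchical or generalized model developed in the thesis; the state names, the function name `viterbi`, and all probabilities are toy assumptions chosen only to make the example runnable.

```python
import math

# Hypothetical two-state model: every number below is an illustrative assumption.
STATES = ("background", "motif")
START = {"background": 0.9, "motif": 0.1}
TRANS = {
    "background": {"background": 0.9, "motif": 0.1},
    "motif": {"background": 0.2, "motif": 0.8},
}
EMIT = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "motif": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
}


def viterbi(seq, start_p, trans_p, emit_p):
    """Return the most probable state path for seq under a discrete HMM."""
    # Work in log space to avoid numerical underflow on long sequences.
    scores = [
        {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in STATES}
    ]
    backpointers = []
    for ch in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this position.
            prev = max(
                STATES, key=lambda p: scores[-1][p] + math.log(trans_p[p][s])
            )
            row[s] = (
                scores[-1][prev]
                + math.log(trans_p[prev][s])
                + math.log(emit_p[s][ch])
            )
            ptr[s] = prev
        scores.append(row)
        backpointers.append(ptr)

    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


path = viterbi("CGTAAAAAACGT", START, TRANS, EMIT)
for base, label in zip("CGTAAAAAACGT", path):
    print(base, label)
```

With these toy parameters the run of A's is long enough for the motif state to outscore the background state, so the decoder marks it as a candidate site. A hierarchical or generalized model would add further structure on top of this, for example explicit state durations or module-level states.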

A final goal is to model the evolutionary dynamics of regulatory regions. We modeled co-evolving regions inside cis-regulatory modules by applying spectral clustering to the evolutionary parameters of different regulatory subsequences. We also analyzed evolutionary forces in the regulatory genome by identifying which k-mers are preferentially present in regulatory regions across species, modeling regulatory regions as being constructed from evolving mixtures of stochastic dictionaries. This thesis provides novel statistical frameworks for identifying regulatory regions and for analyzing them in terms of their architecture, function, evolutionary properties, and correlation with other genomic and epigenomic features in a computationally optimal and statistically sound way.
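To make the notion of k-mers being "preferentially present" in regulatory regions concrete, the following sketch computes a smoothed log-odds score for each k-mer between a regulatory set and a background set. This is a simple enrichment statistic substituted for illustration, not the evolving mixture of stochastic dictionaries described above; the function names, the pseudocount, and the example sequences are all assumptions.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def kmer_log_odds(regulatory, background, k, pseudocount=1.0):
    """Log-odds of each k-mer's smoothed frequency: regulatory vs. background."""
    reg = kmer_counts(regulatory, k)
    bg = kmer_counts(background, k)
    vocab = set(reg) | set(bg)
    # Additive smoothing so k-mers absent from one set do not give log(0).
    reg_total = sum(reg.values()) + pseudocount * len(vocab)
    bg_total = sum(bg.values()) + pseudocount * len(vocab)
    return {
        kmer: math.log((reg[kmer] + pseudocount) / reg_total)
        - math.log((bg[kmer] + pseudocount) / bg_total)
        for kmer in vocab
    }


# Made-up toy sequences, used only to exercise the functions.
regulatory_seqs = ["TTGACGTCAA", "GGTGACGTCA"]
background_seqs = ["ACGTTGCAAT", "CCATGGTTAC"]

scores = kmer_log_odds(regulatory_seqs, background_seqs, k=4)
for kmer, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(kmer, round(score, 3))
```

Positive scores mark k-mers that are over-represented in the regulatory set. A cross-species analysis would additionally need to account for phylogenetic relationships, which this sketch ignores.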