Statistical Machine Learning
Our work emphasizes theory and algorithms for learning complex probabilistic models, learning with prior knowledge, and reasoning under uncertainty. Current projects include:
Bayesian statistics, nonparametric Bayesian analysis, algorithms and applications of Bayesian nonparametrics in data mining
In this project we develop nonparametric and semiparametric Bayesian models (based on the Dirichlet process and its extensions, sometimes known as generalized Pólya urn schemes) for analyzing time series data, hierarchical data, and other complex inputs with uncertain internal structure. Such data arise in temporal text mining (e.g., emails, news streams), object tracking (e.g., video surveillance, navigation and control), and biological data analysis. We develop probabilistic formalisms and sampling and variational inference algorithms, and we also address theoretical issues such as the consistency, bounds, and convergence of our models and algorithms.
Statistical Models and Algorithms of Networks and Relational Data
In this project we develop probabilistic generative models for the formation, growth, evolution, and dynamics of networks and relational data in general, together with inference and learning algorithms for tasks such as node labeling, link prediction, and latent theme extraction on such data. We also work on theoretical issues related to our models and algorithms, such as bounds and complexity, and on applications to social networks and biological networks. (in collaboration with Stephen Fienberg)
Semi-supervised and unsupervised learning of distance metrics
In this project we develop algorithms and theories for learning proper distance metrics underlying complex high-dimensional data based on weak auxiliary information regarding data distribution, similarity, continuity, conductivity, etc. We will explore techniques such as probabilistic modeling, dimensionality reduction, spectral graph analysis, kernel methods, and various optimization approaches; and we will apply our results to pattern recognition, classification, and clustering problems.
Variational inference/learning theory and development of turnkey approximate inference engines
In this project we develop algorithms and theories of variational approximations for probabilistic inference on large-scale directed and undirected graphical models and chain graphs, and methodologies for structure and parameter estimation for such models. The goal is to develop fully autonomous, distributed, turnkey software based on variational and sampling techniques for reasoning and learning under uncertainty in generic intelligent systems.
Applications of probabilistic graphical models in Computational Biology, IR, NLP, Multimedia and Control
We design task-specific generative, discriminative, and hybrid graphical models and algorithms for various biological and genetic problems (see below), for NLP problems such as statistical machine translation, for comprehending and categorizing text corpora, for segmenting, tracking, and interpreting video and caption streams from various sources (e.g., surveillance systems, robots), and for decision making and active learning in dynamic environments. (in collaboration with many faculty and students at CMU and other universities)
Computational Biology
Our work emphasizes developing formal models and algorithms that address problems of practical biological and medical concern. Current projects include:
Probabilistic evolutionary models of cis-regulatory modules in Drosophila
In this project we study the evolutionary relationships reflected in the sequence, ordering, position, spacing, and function of the regulatory motifs controlling body segmentation during early embryogenesis in 15 Drosophila species. We are interested in using this biological model to understand the biological driving forces, molecular mechanisms, and functional implications of motif evolution in general, and in developing comparative genomic algorithms for motif finding in unaligned non-coding sequences. (in collaboration with Martin Kreitman)
Nonparametric Bayesian models for genetic variations and their associations to diseases and genetic demography
In this project we develop nonparametric Bayesian models and computational algorithms for uncovering the chromosomal association (i.e., haplotypes), population distribution (i.e., diversity and frequency), inheritance process (i.e., recombination/substitution), and phenotypic association (i.e., linkage) of genetic polymorphisms such as SNPs, in order to address problems such as disease-gene discovery, chromosomal evolution, and genetic demography. (in collaboration with various faculty at UPMC and U of Chicago)
Computational systems biology of genome-microenvironment interactions in breast cancer
In this project we analyze molecular abundance profiles (e.g., microarray, CGH, ChIP-chip) measured in a designer microenvironment, realized in a 3D culture model that imitates the in vivo cellular context and dynamics of cancer progression, reversion, and apoptosis. We will develop algorithms to identify molecular determinants and markers of cancer states and to categorize cancers on the basis of signaling pathway characteristics. Using probabilistic graphical modeling approaches, we hope to infer stochastic network models for transcriptional regulation in response to combinations of signaling inhibitions in cancer cells. (in collaboration with Mina Bissell)
Biological sequence analysis: motif detection, gene finding and systems biology
In this project we develop models and algorithms for understanding and uncovering the structure of genomic sequences of higher organisms. We develop Bayesian models for DNA/protein motif detection and gene finding based on both sequence-level signatures and meta-sequence-level structural information reflecting protein-DNA binding, transcript stability, and prior knowledge of the organization rules of regulatory modules. We intend to integrate motif finding with systems biology research on gene regulatory networks.
A few snapshots of our research are shown below: top left, clustered Drosophila developmental embryo images; top right, inference schemata in the CSMET evolutionary motif finder; bottom, inferred population structure on HDMAP data.