Modeling Content and Users: Structured Probabilistic Representation and Scalable Online Inference Algorithms

Amr Ahmed

Thesis Committee: Eric Xing (chair), John Lafferty, Zoubin Ghahramani, Alexander J. Smola


Online content has become an important medium for disseminating information and expressing opinions. With the proliferation of online document collections, users risk missing the big picture in a sea of irrelevant and/or diverse content. In this thesis, we address the problem of organizing online document collections and provide algorithms that create a structured representation of otherwise unstructured content. We leverage the expressiveness of latent probabilistic models (e.g., topic models) and nonparametric Bayes techniques (e.g., Dirichlet processes), and give online and distributed inference algorithms that scale to terabyte datasets and adapt the inferred representation as new documents arrive. Throughout the thesis, we consider two domains, research publications and social media (news articles and blog posts), and focus on modeling two facets of content: temporal dynamics and structural correspondence.

To model the temporal dynamics of document collections, we introduce a nonparametric Bayes model that we call the recurrent Chinese restaurant process (RCRP). The RCRP is a framework for modeling complex longitudinal data in which the number of mixture components at each time point is unbounded. On top of this process, we develop a hierarchical extension and use it to build an infinite dynamic topic model that recovers the timeline of ideas in research publications. Despite its expressiveness, this model fails to capture the essential element of dynamics in social media: stories. To remedy this, we develop a multi-resolution model that treats stories as first-class objects and combines long-term, high-level topics with short-lived, tightly focused storylines. Inference in the new model is carried out via a sequential Monte Carlo algorithm that processes new documents in real time.
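The abstract does not spell out the RCRP's update rule, so the following is only a minimal sketch of how such a prior might behave, under stated assumptions: a first-order dependence in which a document at epoch t joins an existing cluster with probability proportional to that cluster's size in epochs t-1 and t combined, or opens a new cluster with probability proportional to a concentration parameter. The function name `rcrp_sample` and the parameters `epoch_sizes`, `alpha`, and `seed` are hypothetical, and the sketch draws cluster labels from the prior only; it involves no likelihood and no inference.

```python
import random
from collections import Counter


def rcrp_sample(epoch_sizes, alpha=1.0, seed=0):
    """Draw cluster labels per epoch from a simplified first-order RCRP-style prior.

    Assumption (not taken from the abstract): a cluster's weight at epoch t
    is its size at epoch t-1 plus its size so far at epoch t, and a new
    cluster is opened with weight `alpha`.
    """
    rng = random.Random(seed)
    prev_counts = Counter()  # cluster sizes from the previous epoch
    next_id = 0              # label to use for the next new cluster
    assignments = []
    for n_docs in epoch_sizes:
        cur_counts = Counter()  # cluster sizes within the current epoch
        labels = []
        for _ in range(n_docs):
            # Candidates: clusters used in the previous or the current epoch.
            candidates = sorted(set(prev_counts) | set(cur_counts))
            weights = [prev_counts[k] + cur_counts[k] for k in candidates]
            # The last option opens a brand-new cluster with weight alpha.
            choice = rng.choices(candidates + [next_id], weights + [alpha])[0]
            if choice == next_id:
                next_id += 1
            cur_counts[choice] += 1
            labels.append(choice)
        assignments.append(labels)
        # Clusters unused in this epoch drop out of the next epoch's prior.
        prev_counts = cur_counts
    return assignments


if __name__ == "__main__":
    for t, labels in enumerate(rcrp_sample([8, 8, 8], alpha=1.0, seed=42)):
        print(f"epoch {t}: {labels}")
```

In this sketch the number of clusters is not fixed in advance: clusters can be born at any epoch, persist while they keep attracting documents, and die out once an epoch passes without any assignment to them.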

We then consider the problem of structural correspondence in document collections, both across modalities and across communities. In research publications, this problem arises from the multimodal nature of research papers and the pressing need for systems that can retrieve relevant documents based on any of these modalities (e.g., figures, text, and named entities). In social media, this problem arises from the ideological bias of a document’s author, which mixes facts with opinions. For both problems, we develop a series of factored models. In research publications, the model represents ideas across modalities and can therefore solve the aforementioned retrieval problem. In social media, the model contrasts the same idea across different ideologies; it can thus explain the bias of a given document at a topical level and help users stay informed by providing documents that express alternative views.

Finally, we address the problem of inferring users’ intent when they interact with document collections, and how this intent changes over time. The induced user model can then be used to match users with relevant content.