Deconstructing Data Science David Bamman, UC Berkeley Info 290 Lecture 11: Topic models Feb 29, 2016
Topic models
Latent variables A latent variable is one that's unobserved, either because: we are predicting it (but have observed that variable for other data points), or it is unobservable
Latent variables
data | observed variables | latent variables
email | text, date, sender | topic
novels | text, author, pub date | genre, topic
social network | nodes, friendship structure | communities
fitbit data | accelerometer output | steps, sleep patterns
legislators | voting behavior, speeches | political preference
netflix users | watching behavior, ratings | genre preference
Probabilistic graphical models Nodes represent variables (shaded = observed, clear = latent). Arrows indicate conditional relationships (diagram: y → x). The probability of x here is dependent on y. Simply a visual way of writing the joint probability: $P(x, y) = P(y)\,P(x \mid y)$
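As a small illustration of the factorization P(x, y) = P(y) P(x | y), the sketch below builds a joint distribution from two tables. The variables (rain, umbrella) and all probabilities are hypothetical and are not from the lecture.

```python
# Hypothetical probability tables for a two-node model y -> x.
p_y = {"rain": 0.3, "dry": 0.7}
p_x_given_y = {
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
    "dry": {"umbrella": 0.2, "no umbrella": 0.8},
}


def joint(x, y):
    """P(x, y) = P(y) * P(x | y), as encoded by the arrow from y to x."""
    return p_y[y] * p_x_given_y[y][x]


# Summing the joint over every (x, y) pair should give 1.
total = sum(joint(x, y) for y in p_y for x in p_x_given_y[y])
```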
Topic Models A probabilistic model for discovering hidden topics or themes (groups of terms that tend to occur together) in documents. Unsupervised (find interesting structure in the data). Clustering algorithm: How do tokens cluster into topics?
Topic Models Input: set of documents, number of clusters to learn. Output: topics; topic ratio in each document; topic distribution for each word in doc
topic models cluster tokens into topics The messenger, however, does not reach Romeo and, instead, Romeo learns of Juliet's apparent death from his servant Balthasar. Heartbroken, Romeo buys poison from an apothecary and goes to the Capulet crypt. He encounters Paris who has come to mourn Juliet privately. Believing Romeo to be a vandal, Paris confronts him and, in the ensuing battle, Romeo kills Paris. Still believing Juliet to be dead, he drinks the poison. Juliet then awakens and, finding Romeo dead, stabs herself with his dagger. The feuding families and the Prince meet at the tomb to find all three dead. Friar Laurence recounts the story of the two "star-cross'd lovers". The families are reconciled by their children's deaths and agree to end their violent feud. The play ends with the Prince's elegy for the lovers: "For never was a story of more woe / Than this of Juliet and her Romeo."
topic models cluster tokens into topics (the Romeo and Juliet passage is repeated on successive slides, each labeled with one topic: Death, Love, Family, etc.)
tokens, not types (the Romeo and Juliet passage again, labeled with the topic People) A different Paris token might belong to a Place or French topic
Applications http://www.rci.rutgers.edu/~ag978/quiet/
x = feature vector, β = coefficients
Feature | Value (x) | β
follow clinton | 0 | -3.1
follow trump | 0 | 6.8
"republican" in profile | 0 | 7.9
"democrat" in profile | 0 | -3.0
"benghazi" | 1 | -1.7
topic 1 | 0.55 | 0.3
topic 2 | 0.32 | -1.2
topic 3 | 0.13 | 5.7
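A minimal sketch of how a document's topic proportions can sit alongside other features in a linear model, using the feature values and coefficients from this slide. The slide only defines x and β; the logistic link at the end is an assumption about how the score would be used.

```python
import math

features = [
    "follow clinton", "follow trump", "republican in profile",
    "democrat in profile", "benghazi", "topic 1", "topic 2", "topic 3",
]
x = [0, 0, 0, 0, 1, 0.55, 0.32, 0.13]              # feature values from the slide
beta = [-3.1, 6.8, 7.9, -3.0, -1.7, 0.3, -1.2, 5.7]  # coefficients from the slide

# Linear score: dot product of the feature vector and the coefficients.
score = sum(xi * bi for xi, bi in zip(x, beta))

# Assumption (not stated on the slide): a logistic link maps the score to a probability.
probability = 1.0 / (1.0 + math.exp(-score))
```

With these numbers only the "benghazi" indicator and the three topic proportions are nonzero, so the score is -1.7 + 0.165 - 0.384 + 0.741 = -1.178.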
Software Mallet http://mallet.cs.umass.edu/ Gensim (python) https://radimrehurek.com/gensim/ Visualization https://github.com/uwdata/termite-visualizations
(graphical model) α; θ = document distribution over topics; z = topic indicators for words; w = words; φ = topic distribution over words; γ
Topic Models A document has a distribution over topics θ (figure: bar chart of one document's topic proportions, 0.0 to 0.4, over the topics war, love, chases, boats, aliens, family)
Topic Models A topic is a distribution over words φ, e.g., $P(\text{adore} \mid \text{topic} = \text{love}) = .18$ (figure: bar chart of word probabilities, 0.00 to 0.20)
(graphical model with nodes α, γ, θ, φ; K = 20 topics)
(figure sequence, document 1) The document's topic distribution is a bar chart (0.0 to 0.4) over the topics war, love, chases, boats, aliens, family.
A topic is drawn for each word position from P(topic | topic distribution), one per slide: war, aliens, war, love.
Each of the K = 20 topics has its own distribution over words (bar charts, 0.00 to 0.20).
A word is then drawn from each chosen topic's word distribution: fights, alien, kills, marries.
(figure sequence, document 2) The document's topic distribution is a bar chart (0.0 to 0.4) over the topics war, love, chases, boats, aliens, family.
A topic is drawn for each word position from P(topic | topic distribution), one per slide: aliens, family, aliens, love.
A word is then drawn from each chosen topic's word distribution (bar charts, 0.00 to 0.20): ET, mom, space, friend.
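The two walk-throughs above can be sketched as a small generative program: draw a document's topic distribution, draw a topic for each word position, then draw a word from that topic. This is a sketch under assumptions: the three topics, the vocabulary, the word probabilities in `phi`, and the Dirichlet parameter are made-up toy values, and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(n_words, alpha, phi, vocab, topic_names, rng):
    theta = sample_dirichlet(alpha, rng)          # document distribution over topics
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # topic indicator for this position
        w = sample_categorical(phi[z], rng)       # word drawn from topic z
        topics.append(topic_names[z])
        words.append(vocab[w])
    return theta, topics, words


# Toy values (assumptions, not from the lecture).
topic_names = ["war", "love", "aliens"]
vocab = ["fights", "kills", "marries", "friend", "alien", "space"]
phi = [
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],  # war
    [0.03, 0.02, 0.50, 0.40, 0.02, 0.03],  # love
    [0.05, 0.05, 0.02, 0.08, 0.40, 0.40],  # aliens
]

rng = random.Random(0)
theta, topics, words = generate_document(4, [0.5, 0.5, 0.5], phi, vocab, topic_names, rng)
```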
Inferred Topics
Inference What are the topic distributions for each document (θ)? What are the topic assignments for each word in a document (z)? What are the word distributions for each topic (φ)? Find the parameters that maximize the likelihood of the data!
Inference Markov chain Monte Carlo (Gibbs sampling, Metropolis-Hastings, etc.) Variational methods Spectral methods (Anandkumar et al. 2012, Arora et al. 2013)
Gibbs Sampling Markov chain Monte Carlo method for approximating the joint distribution of a set of variables (Geman and Geman 1984; Metropolis et al. 1953; Hastings 1970) Josiah Gibbs
Gibbs Sampling 1. Start with some initial value for all the variables 2. Sample a value for a variable conditioned on all of the other variables around it (using Bayes' theorem): $P(\theta \mid X) = \frac{P(\theta)\,P(X \mid \theta)}{\sum_{\theta'} P(\theta')\,P(X \mid \theta')}$
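Inference (graphical model with nodes α, θ, z, w, φ, γ)
Inference $P(\theta_d \mid \alpha, z_d) \propto P(\theta_d \mid \alpha) \prod_i P(z_i \mid \theta_d) = \mathrm{Dir}(\theta_d \mid \alpha) \prod_i \mathrm{Cat}(z_i \mid \theta_d)$
Inference $P(z \mid \theta_d, w, \phi) \propto P(z \mid \theta_d)\,P(w \mid z, \phi) = \mathrm{Cat}(z \mid \theta_d)\,\mathrm{Cat}(w \mid z, \phi) = \theta_{d,z}\,\phi_{z,w}$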
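A minimal sketch of one Gibbs sweep built from these two conditionals. It relies on Dirichlet-categorical conjugacy: the conditional for a document's topic distribution given its topic assignments is a Dirichlet with the topic counts added to the prior. Several things here are assumptions for illustration: the prior is symmetric with a single value `alpha`, the word distributions `phi` are held fixed (a full sampler would also resample them), and the toy corpus and function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(weights, rng):
    """Draw an index with probability proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def gibbs_sweep(docs, z, phi, alpha, rng):
    """Resample theta_d given z_d, then every z given theta_d, w and phi.

    docs[d][i] is the word id at position i of document d;
    z[d][i] is its current topic and is updated in place.
    """
    n_topics = len(phi)
    thetas = []
    for d, words in enumerate(docs):
        counts = [0] * n_topics
        for k in z[d]:
            counts[k] += 1
        # theta_d | alpha, z_d is Dirichlet(alpha + topic counts in document d)
        theta = sample_dirichlet([alpha + c for c in counts], rng)
        for i, w in enumerate(words):
            # P(z = k | theta_d, w, phi) is proportional to theta_d[k] * phi[k][w]
            weights = [theta[k] * phi[k][w] for k in range(n_topics)]
            z[d][i] = sample_categorical(weights, rng)
        thetas.append(theta)
    return thetas


# Toy corpus: word ids in a 4-word vocabulary, 2 topics (all values hypothetical).
docs = [[0, 1, 0, 2], [3, 2, 3, 3]]
phi = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.3, 0.5]]
rng = random.Random(1)
z = [[rng.randrange(2) for _ in doc] for doc in docs]   # step 1: initial values
for _ in range(50):                                     # step 2: repeated resampling
    thetas = gibbs_sweep(docs, z, phi, alpha=0.5, rng=rng)
```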
Sampling
z | P(z | θ) | P(w | z) | P(z | θ) P(w | z) | norm
z=1 | 0.100 | 0.010 | 0.001 | 0.019
z=2 | 0.200 | 0.030 | 0.006 | 0.112
z=3 | 0.070 | 0.020 | 0.001 | 0.026
z=4 | 0.130 | 0.080 | 0.010 | 0.193
z=5 | 0.500 | 0.070 | 0.035 | 0.651
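The numbers on this slide can be reproduced by multiplying the two probability columns and dividing by their sum. A short sketch follows; the variable names are illustrative only.

```python
# Columns from the sampling slide, for topics z = 1..5.
p_z_given_theta = [0.100, 0.200, 0.070, 0.130, 0.500]
p_w_given_z = [0.010, 0.030, 0.020, 0.080, 0.070]

# Unnormalized conditional: P(z | theta) * P(w | z) for each topic.
unnormalized = [a * b for a, b in zip(p_z_given_theta, p_w_given_z)]

# Divide by the sum so the values form a distribution over the five topics.
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]
```

The products sum to 0.0538, so topic 5 ends up with 0.035 / 0.0538 ≈ 0.651 of the probability mass, matching the last column.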
Aside: sampling?
Sampling from a Multinomial Probability mass function (PMF): $P(z = x)$ exactly (figure: bar chart of $P(z = x)$ for x = 1, ..., 5)
Sampling from a Multinomial Cumulative distribution function (CDF): $P(z \le x)$ (figure: step plot of $P(z \le x)$ for x = 1, ..., 5)
Sampling from a Multinomial Sample p uniformly in [0,1]. Find the point $\mathrm{CDF}^{-1}(p)$ (figure: p = .78 marked on the CDF plot)
Sampling from a Multinomial Sample p uniformly in [0,1]. Find the point $\mathrm{CDF}^{-1}(p)$ (figure: p = .06 marked on the CDF plot)
Sampling from a Multinomial Sample p uniformly in [0,1]. Find the point $\mathrm{CDF}^{-1}(p)$ (figure: CDF values at x = 1, ..., 5 are 0.008, 0.059, 0.071, 0.703, 1.000)
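A minimal sketch of inverse-CDF sampling, using the five cumulative values shown on the last slide of this aside (0.008, 0.059, 0.071, 0.703, 1.000). The function names are hypothetical, and the binary search via `bisect` is an implementation choice rather than something the slides specify.

```python
import bisect
import random

# CDF values at x = 1..5, read off the slide.
cdf = [0.008, 0.059, 0.071, 0.703, 1.000]


def inverse_cdf(p, cdf):
    """Return the smallest outcome x (1-based) whose CDF value is >= p."""
    index = bisect.bisect_left(cdf, p)
    # Clamp in case rounding leaves the final CDF value slightly below p.
    return min(index, len(cdf) - 1) + 1


def sample(cdf, rng):
    """Draw p uniformly in [0, 1) and map it through the inverse CDF."""
    return inverse_cdf(rng.random(), cdf)


rng = random.Random(0)
draws = [sample(cdf, rng) for _ in range(1000)]
```

For the two values of p shown on the slides, `inverse_cdf(0.78, cdf)` lands on x = 5 and `inverse_cdf(0.06, cdf)` lands on x = 3.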
Assumptions Every word has one topic. Every document has one topic distribution. No sequential information (topics for words are independent of each other given the set of topics for a document). Topics don't have arbitrary correlations (Dirichlet prior). Words don't have arbitrary correlations (Dirichlet prior). The only information you learn from is the identities of words and how they are divided into documents.
What if you want to encode other assumptions or reason over other observations?
(graphical model with nodes α, θ, z, w, φ)
(the same graphical model with an added time variable t)
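Time is drawn from a Beta distribution [0,1] (Wang and McCallum 2006) (graphical model adds t with parameters α_t, β_t)
$P(z \mid \theta, w, t, \phi, \alpha_t, \beta_t) \propto P(z \mid \theta_d)\,P(w \mid z, \phi)\,P(t \mid z, \alpha_t, \beta_t) = \mathrm{Cat}(z \mid \theta_d)\,\mathrm{Cat}(w \mid z, \phi)\,\mathrm{Beta}(t \mid \alpha_t, \beta_t) = \theta_{d,z}\,\phi_{z,w}\,\frac{t^{\alpha_t - 1}(1 - t)^{\beta_t - 1}}{B(\alpha_t, \beta_t)}$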
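A sketch of how the time term changes the conditional for a topic assignment: each topic's weight is multiplied by a Beta density evaluated at the document's timestamp. This assumes each topic k has its own pair of Beta parameters and that timestamps are rescaled to lie strictly between 0 and 1; the function names and toy numbers are hypothetical.

```python
import math


def beta_pdf(t, a, b):
    """Beta density t^(a-1) (1-t)^(b-1) / B(a, b) for 0 < t < 1."""
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_b)


def topic_posterior(theta_d, phi_w, t, beta_params):
    """Normalized P(z = k | ...) from theta_d[k] * phi[k][w] * Beta(t | a_k, b_k).

    theta_d[k]     : P(z = k | theta_d)
    phi_w[k]       : P(w | z = k, phi) for the observed word w
    beta_params[k] : (a_k, b_k), assumed per-topic Beta parameters
    """
    weights = [
        theta_d[k] * phi_w[k] * beta_pdf(t, *beta_params[k])
        for k in range(len(theta_d))
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: two topics that are equally likely from the text alone,
# one concentrated early in time and one concentrated late.
posterior = topic_posterior(
    theta_d=[0.5, 0.5],
    phi_w=[0.1, 0.1],
    t=0.9,
    beta_params=[(2.0, 5.0), (5.0, 2.0)],
)
```

Because the text evidence is identical for both topics here, the late timestamp alone shifts nearly all of the probability to the second topic.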
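Time is drawn from a Normal distribution [-∞, ∞] (graphical model adds t with parameters μ, σ)
$P(z \mid \theta, w, t, \phi, \mu, \sigma) \propto P(z \mid \theta_d)\,P(w \mid z, \phi)\,P(t \mid z, \mu, \sigma) = \mathrm{Cat}(z \mid \theta_d)\,\mathrm{Cat}(w \mid z, \phi)\,\mathrm{Norm}(t \mid \mu, \sigma) = \theta_{d,z}\,\phi_{z,w}\,\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right)$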
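Time is drawn from a Multinomial distribution [1, ..., K] (graphical model adds t with parameter ψ)
$P(z \mid \theta, w, \phi, t, \psi) \propto P(z \mid \theta_d)\,P(w \mid z, \phi)\,P(t \mid z, \psi) = \mathrm{Cat}(z \mid \theta_d)\,\mathrm{Cat}(w \mid z, \phi)\,\mathrm{Cat}(t \mid z, \psi) = \theta_{d,z}\,\phi_{z,w}\,\psi_{z,t}$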
Goldstone and Underwood (2014), The Quiet Transformations of Literary Studies
Grimmer (2010), A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases