Deconstructing Data Science

Deconstructing Data Science. David Bamman, UC Berkeley. Info 290, Lecture 11: Topic models. Feb 29, 2016

Topic models

Latent variables. A latent variable is one that's unobserved, either because we are predicting it (but have observed that variable for other data points) or because it is unobservable.

Latent variables
data | observed variables | latent variables
email | text, date, sender | topic
novels | text, author, pub date | genre, topic
social network | nodes, friendship structure | communities
fitbit data | accelerometer output | steps, sleep patterns
legislators | voting behavior, speeches | political preference
netflix users | watching behavior, ratings | genre preference

Probabilistic graphical models. Nodes represent variables (shaded = observed, clear = latent). Arrows indicate conditional relationships: with an arrow from y to x, the probability of x is dependent on y. Simply a visual way of writing the joint probability: $P(x, y) = P(y)\,P(x \mid y)$
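A minimal sketch of that factorization for a two-node model y -> x. The topic and word names and all of the probabilities are made up for illustration; none of them come from the lecture.

```python
# Hypothetical two-node model y -> x: y is a topic, x is a word.
p_y = {"love": 0.3, "war": 0.7}
p_x_given_y = {
    "love": {"adore": 0.8, "fight": 0.2},
    "war": {"adore": 0.1, "fight": 0.9},
}

# Joint probability via the factorization P(x, y) = P(y) P(x | y).
joint = {
    (x, y): p_y[y] * p_x_given_y[y][x]
    for y in p_y
    for x in p_x_given_y[y]
}

# Marginal of the child node, P(x) = sum over y of P(x, y).
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
```

The joint table sums to 1 because each conditional table does, which is what lets the graph stand in for the full joint distribution.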

Topic Models. A probabilistic model for discovering hidden topics or themes (groups of terms that tend to occur together) in documents. Unsupervised (find interesting structure in the data). Clustering algorithm: how do tokens cluster into topics?

Topic Models. Input: set of documents, number of clusters to learn. Output: topics; topic ratio in each document; topic distribution for each word in doc.

topic models cluster tokens into topics. The messenger, however, does not reach Romeo and, instead, Romeo learns of Juliet's apparent death from his servant Balthasar. Heartbroken, Romeo buys poison from an apothecary and goes to the Capulet crypt. He encounters Paris who has come to mourn Juliet privately. Believing Romeo to be a vandal, Paris confronts him and, in the ensuing battle, Romeo kills Paris. Still believing Juliet to be dead, he drinks the poison. Juliet then awakens and, finding Romeo dead, stabs herself with his dagger. The feuding families and the Prince meet at the tomb to find all three dead. Friar Laurence recounts the story of the two "star-cross'd lovers". The families are reconciled by their children's deaths and agree to end their violent feud. The play ends with the Prince's elegy for the lovers: "For never was a story of more woe / Than this of Juliet and her Romeo."

topic models cluster tokens into topics (the same passage, repeated with the tokens of one topic highlighted each time): Death, Love, Family, etc.

tokens, not types (the same passage, with the tokens of the topic People highlighted). A different Paris token might belong to a Place or French topic.

Applications: http://www.rci.rutgers.edu/~ag978/quiet/

x = feature vector, β = coefficients
Feature | Value (x) | β
follow clinton | 0 | -3.1
follow trump | 0 | 6.8
"republican" in profile | 0 | 7.9
"democrat" in profile | 0 | -3.0
"benghazi" | 1 | -1.7
topic 1 | 0.55 | 0.3
topic 2 | 0.32 | -1.2
topic 3 | 0.13 | 5.7
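As a rough sketch of how the topic proportions enter a downstream model as ordinary features, the score below is the dot product of the feature values and coefficients on this slide. The slide does not say which model the coefficients belong to; passing the score through a logistic function is an assumption made only for illustration.

```python
import math

# Feature values (x) and coefficients (beta) copied from the slide.
features = [
    "follow clinton", "follow trump",
    '"republican" in profile', '"democrat" in profile',
    '"benghazi"', "topic 1", "topic 2", "topic 3",
]
x = [0, 0, 0, 0, 1, 0.55, 0.32, 0.13]
beta = [-3.1, 6.8, 7.9, -3.0, -1.7, 0.3, -1.2, 5.7]

# Linear score: sum of x_i * beta_i. Only the nonzero features contribute.
score = sum(xi * bi for xi, bi in zip(x, beta))

# Assumption (not stated on the slide): a logistic link turns the score
# into a probability for the positive class.
prob = 1.0 / (1.0 + math.exp(-score))
```

With these numbers the score works out to -1.178: the "benghazi" feature contributes -1.7 and the three topic proportions together contribute 0.522.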

Software. Mallet: http://mallet.cs.umass.edu/ Gensim (python): https://radimrehurek.com/gensim/ Visualization: https://github.com/uwdata/termite-visualizations

Graphical model. α and γ: Dirichlet prior parameters. θ: document distribution over topics. z: topic indicators for words. w: words. φ: topic distribution over words.

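Topic Models. A document has a distribution over topics. (Figure: graphical model with α, θ, φ, γ; bar chart of one document's distribution over the topics war, love, chases, boats, aliens, family; y-axis 0.0 to 0.4.)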

Topic Models. A topic is a distribution over words, e.g., $P(\text{adore} \mid \text{topic} = \text{love}) = .18$. (Figure: bar chart of word probabilities for one topic; axis 0.00 to 0.20.)

(Figure: graphical model with α, θ, φ, γ; K = 20 topics.)

Generating a document, step 1: for each token, draw a topic from P(topic | topic distribution). (Figure: the document's topic distribution over war, love, chases, boats, aliens, family; y-axis 0.0 to 0.4.) The four topic draws, one at a time: war, aliens, war, love.

Step 2: for each token, draw a word from its topic's distribution over words. (Figure: word distributions for the drawn topics; axis 0.00 to 0.20.) The four words: fights, alien, kills, marries.

Generating a second document the same way. Topic draws from P(topic | topic distribution), one at a time: aliens, family, aliens, love. Words then drawn from those topics' distributions over words: ET, mom, space, friend.
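The generative story walked through above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the topic names and words echo the slides, but the word probabilities, the Dirichlet parameter 0.5, and the helper names `sample_dirichlet` and `generate_document` are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_tokens, alpha, topic_word, rng):
    """Return the document's topic distribution and its (topic, word) pairs."""
    topics = list(topic_word)
    theta = sample_dirichlet(alpha, rng)  # document distribution over topics
    tokens = []
    for _ in range(n_tokens):
        # Draw a topic indicator from P(topic | topic distribution).
        z = rng.choices(topics, weights=theta)[0]
        # Draw a word from that topic's distribution over words.
        words, probs = zip(*topic_word[z].items())
        w = rng.choices(words, weights=probs)[0]
        tokens.append((z, w))
    return theta, tokens

# Toy word distributions (phi); the probabilities are made up.
topic_word = {
    "war": {"fights": 0.5, "kills": 0.5},
    "love": {"marries": 0.6, "friend": 0.4},
    "aliens": {"alien": 0.4, "ET": 0.3, "space": 0.3},
    "family": {"mom": 1.0},
}
rng = random.Random(0)
theta, tokens = generate_document(4, [0.5] * len(topic_word), topic_word, rng)
```

Each call produces a fresh topic distribution for the document, so two documents generated from the same topics can emphasize different themes, as in the two walkthroughs.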

Inferred Topics

Inference. What are the topic distributions for each document (θ)? What are the topic assignments for each word in a document (z)? What are the word distributions for each topic (φ)? Find the parameters that maximize the likelihood of the data!

Inference. Markov chain Monte Carlo (Gibbs sampling, Metropolis-Hastings, etc.). Variational methods. Spectral methods (Anandkumar et al. 2012, Arora et al. 2013).

Gibbs Sampling. Markov chain Monte Carlo method for approximating the joint distribution of a set of variables (Geman and Geman 1984; Metropolis et al. 1953; Hastings 1970). (Photo: Josiah Gibbs.)

Gibbs Sampling
1. Start with some initial value for all the variables.
2. Sample a value for a variable conditioned on all of the other variables around it (using Bayes' theorem):
$P(Z \mid X) = \frac{P(Z)\,P(X \mid Z)}{\sum_{Z'} P(Z')\,P(X \mid Z')}$

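Inference (figure: graphical model with α, θ, z, w, φ, γ).

Inference for θ: $P(\theta_d \mid \alpha, z_d) \propto P(\theta_d \mid \alpha) \prod_i P(z_i \mid \theta_d) = \mathrm{Dir}(\theta_d \mid \alpha) \prod_i \mathrm{Cat}(z_i \mid \theta_d)$

Inference for z: $P(z \mid \theta_d, w, \phi) \propto P(z \mid \theta_d)\,P(w \mid z, \phi) = \mathrm{Cat}(z \mid \theta_d)\,\mathrm{Cat}(w \mid z, \phi) = \theta_{d,z}\,\phi_{z,w}$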
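Putting the two conditionals above together gives a small Gibbs sampler. The sketch below is a toy under stated assumptions rather than the lecture's implementation. It uses the standard conjugacy fact that a Dirichlet prior times categorical likelihoods is a Dirichlet with the counts added. It also resamples φ with the analogous Dirichlet step, which the slides do not spell out. The function names, the default hyperparameters (0.5), and the tiny corpus are all hypothetical.

```python
import random

def dirichlet(params, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.5, gamma=0.5,
              n_iters=50, seed=0):
    """docs: list of lists of word ids. Returns (theta, phi, z)."""
    rng = random.Random(seed)
    # Start with some initial value for all the variables.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    theta = [[1.0 / n_topics] * n_topics for _ in docs]
    phi = [[1.0 / vocab_size] * vocab_size for _ in range(n_topics)]
    for _ in range(n_iters):
        # theta_d | alpha, z_d ~ Dir(alpha + topic counts in document d)
        for d, z_d in enumerate(z):
            counts = [alpha] * n_topics
            for k in z_d:
                counts[k] += 1
            theta[d] = dirichlet(counts, rng)
        # phi_k | gamma, z, w ~ Dir(gamma + counts of words assigned to k)
        counts = [[gamma] * vocab_size for _ in range(n_topics)]
        for doc, z_d in zip(docs, z):
            for w, k in zip(doc, z_d):
                counts[k][w] += 1
        phi = [dirichlet(c, rng) for c in counts]
        # z_i | theta_d, w_i, phi is proportional to theta[d][k] * phi[k][w_i]
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                weights = [theta[d][k] * phi[k][w] for k in range(n_topics)]
                z[d][i] = rng.choices(range(n_topics), weights=weights)[0]
    return theta, phi, z

# Toy corpus of word ids over a 4-word vocabulary.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, phi, z = gibbs_lda(docs, n_topics=2, vocab_size=4)
```

A practical sampler would discard early iterations and average over later ones; this sketch simply returns the final state.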

Sampling (figure: graphical model with α, θ, z, w, φ, γ)
z | P(z | θ) | P(w | z) | P(z | θ) P(w | z) | norm
1 | 0.100 | 0.010 | 0.001 | 0.019
2 | 0.200 | 0.030 | 0.006 | 0.112
3 | 0.070 | 0.020 | 0.001 | 0.026
4 | 0.130 | 0.080 | 0.010 | 0.193
5 | 0.500 | 0.070 | 0.035 | 0.651
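The last column of this slide can be reproduced by multiplying the first two columns and dividing by the sum of the products. A short check, with variable names chosen here for illustration:

```python
# First two numeric columns of the slide, for z = 1..5.
p_z_given_theta = [0.100, 0.200, 0.070, 0.130, 0.500]
p_w_given_z = [0.010, 0.030, 0.020, 0.080, 0.070]

# Unnormalized weight for each topic: P(z | theta) * P(w | z).
unnormalized = [a * b for a, b in zip(p_z_given_theta, p_w_given_z)]

# Normalize so the weights sum to 1; this is the distribution z is drawn from.
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]
rounded = [round(p, 3) for p in posterior]  # [0.019, 0.112, 0.026, 0.193, 0.651]
```

The products sum to 0.0538, so topic 5 ends up with about 65% of the mass even though its prior weight under θ is 50%.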

Aside: sampling?

Sampling from a Multinomial. Probability mass function (PMF): $P(z = x)$ exactly. (Figure: plot of $P(z = x)$ for x = 1, ..., 5; y-axis 0.0 to 0.6.)

Sampling from a Multinomial. Cumulative distribution function (CDF): $P(z \le x)$. (Figure: plot of $P(z \le x)$ for x = 1, ..., 5; y-axis 0.0 to 1.0.)

Sampling from a Multinomial. Sample p uniformly in [0,1]. Find the point $\mathrm{CDF}^{-1}(p)$. (Figure: plot of $P(z \le x)$ for x = 1, ..., 5, y-axis 0.0 to 1.0, shown for p = .78 and for p = .06; the last version labels the cumulative values 0.008, 0.059, 0.071, 0.703, 1.000.)
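A minimal sketch of this inverse-CDF procedure. It assumes the five labeled values on the last slide are the CDF at x = 1, ..., 5, so the PMF below is obtained by differencing them; the function name `sample_categorical` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate
import random

def sample_categorical(pmf, p):
    """Return the 1-based outcome x with the smallest CDF(x) >= p."""
    cdf = list(accumulate(pmf))
    cdf[-1] = 1.0  # guard against floating-point round-off in the running sum
    return bisect_left(cdf, p) + 1

# PMF implied by the cumulative values 0.008, 0.059, 0.071, 0.703, 1.000.
pmf = [0.008, 0.051, 0.012, 0.632, 0.297]

x_high = sample_categorical(pmf, 0.78)  # p = .78 lies above 0.703, giving x = 5
x_low = sample_categorical(pmf, 0.06)   # p = .06 lies between 0.059 and 0.071, giving x = 3

# In practice p comes from a uniform random number generator.
rng = random.Random(0)
draw = sample_categorical(pmf, rng.random())
```

Outcomes with a large probability own a wide slice of [0,1], so a uniform p lands in them proportionally often.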

Assumptions
Every word has one topic.
Every document has one topic distribution.
No sequential information (topics for words are independent of each other given the set of topics for a document).
Topics don't have arbitrary correlations (Dirichlet prior).
Words don't have arbitrary correlations (Dirichlet prior).
The only information you learn from is the identities of words and how they are divided into documents.

What if you want to encode other assumptions or reason over other observations?

(Figure: the graphical model with α, θ, φ, extended with a time variable t and its parameters α_t and β_t.) Time is drawn from a Beta distribution on [0,1] (Wang and McCallum 2006).

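(Figure: graphical model with α, θ, t, α_t, φ, β_t.) $P(z \mid \theta_d, w, t, \phi, \alpha_t, \beta_t) \propto P(z \mid \theta_d)\,P(w \mid z, \phi)\,P(t \mid z, \alpha_t, \beta_t) = \mathrm{Cat}(z \mid \theta_d)\,\mathrm{Cat}(w \mid z, \phi)\,\mathrm{Beta}(t \mid \alpha_t, \beta_t) = \theta_{d,z}\,\phi_{z,w}\,\frac{t^{\alpha_t - 1}(1 - t)^{\beta_t - 1}}{B(\alpha_t, \beta_t)}$, where $\alpha_t$ and $\beta_t$ are the Beta parameters for topic z.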
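To see how the extra Beta factor changes the topic draw, the sketch below multiplies each topic's weight by the Beta density of the token's timestamp. It assumes each topic has its own pair of Beta parameters. The two-topic numbers and the names `beta_pdf` and `topic_posterior` are hypothetical.

```python
import math

def beta_pdf(t, a, b):
    """Beta density t^(a-1) (1-t)^(b-1) / B(a, b) for 0 < t < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)  # log B(a, b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_norm)

def topic_posterior(theta_d, phi_w, t, beta_params):
    """Normalize theta[d][z] * phi[z][w] * Beta(t | a_z, b_z) over topics z."""
    weights = [
        th * ph * beta_pdf(t, a, b)
        for th, ph, (a, b) in zip(theta_d, phi_w, beta_params)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: the text alone cannot separate the two topics,
# but topic 0 is concentrated early in time and topic 1 late.
theta_d = [0.5, 0.5]
phi_w = [0.1, 0.1]
beta_params = [(2.0, 5.0), (5.0, 2.0)]
post = topic_posterior(theta_d, phi_w, 0.9, beta_params)
```

Because the text-based factors are tied here, the late timestamp t = 0.9 alone pushes almost all of the probability onto the second topic.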

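(Figure: graphical model with α, θ, t, μ, σ, φ.) Time is drawn from a Normal distribution on $(-\infty, \infty)$.

$P(z \mid \theta_d, w, t, \phi, \mu, \sigma) \propto P(z \mid \theta_d)\,P(w \mid z, \phi)\,P(t \mid z, \mu, \sigma) = \mathrm{Cat}(z \mid \theta_d)\,\mathrm{Cat}(w \mid z, \phi)\,\mathrm{Norm}(t \mid \mu, \sigma) = \theta_{d,z}\,\phi_{z,w}\,\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right)$, where $\mu$ and $\sigma$ are the Normal parameters for topic z.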

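(Figure: graphical model with α, θ, t, ψ, φ.) Time is drawn from a Multinomial distribution over [1, ..., K].

$P(z \mid \theta_d, w, t, \phi, \psi) \propto P(z \mid \theta_d)\,P(w \mid z, \phi)\,P(t \mid z, \psi) = \mathrm{Cat}(z \mid \theta_d)\,\mathrm{Cat}(w \mid z, \phi)\,\mathrm{Cat}(t \mid z, \psi) = \theta_{d,z}\,\phi_{z,w}\,\psi_{z,t}$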

Goldstone and Underwood (2014), The Quiet Transformations of Literary Studies

Grimmer (2010), A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases