CS485/685 Lecture 5: Jan 19, 2016


Statistical Learning
Readings: [RN] Sec. 20.1, 20.2; [M] Sec. 2.2, 3.2

Statistical Learning
View: we have uncertain knowledge of the world.
Idea: learning simply reduces this uncertainty.

Terminology
Probability distribution: a specification of a probability for each event in our sample space. Probabilities must sum to 1.
Assume the world is described by two (or more) random variables.
Joint probability distribution: a specification of probabilities for all combinations of events.

Joint distribution
Given two random variables X and Y:
Joint distribution: Pr(X = x ∧ Y = y) for all x, y
Marginalisation (sumout rule):
Pr(X = x) = Σ_y Pr(X = x ∧ Y = y)
Pr(Y = y) = Σ_x Pr(X = x ∧ Y = y)

Example: Joint Distribution

             sunny           ~sunny
           cold   ~cold    cold   ~cold
headache   0.108  0.012    0.072  0.008
~headache  0.016  0.064    0.144  0.576

P(headache ∧ sunny ∧ cold) = 0.108
P(~headache ∧ sunny ∧ ~cold) = 0.064
P(headache ∨ sunny) = 0.108 + 0.012 + 0.016 + 0.064 + 0.072 + 0.008 = 0.28
P(headache) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2 (marginalization)
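These queries are easy to check by enumerating the joint distribution in code. A minimal sketch (the dictionary layout and the helper name prob are illustrative, not from the slides):

```python
# Joint distribution over (headache, sunny, cold), keyed by truth values.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(event):
    """Sum the joint over all worlds where `event` holds."""
    return sum(p for world, p in joint.items() if event(*world))

print(prob(lambda h, s, c: h and s and c))          # P(headache ^ sunny ^ cold) = 0.108
print(prob(lambda h, s, c: not h and s and not c))  # P(~headache ^ sunny ^ ~cold) = 0.064
print(prob(lambda h, s, c: h or s))                 # P(headache v sunny) = 0.28
print(prob(lambda h, s, c: h))                      # P(headache) = 0.2 (marginalization)
```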

Conditional Probability
Pr(A|B): the fraction of worlds in which B is true that also have A true.
H = Have headache
F = Have flu
Pr(H) = 1/10, Pr(F) = 1/40, Pr(H|F) = 1/2
Headaches are rare and flu is rarer, but if you have the flu, then there is a 50-50 chance you will have a headache.

Conditional Probability
H = Have headache, F = Have flu
Pr(H) = 1/10, Pr(F) = 1/40, Pr(H|F) = 1/2
Pr(H|F) = fraction of flu-inflicted worlds in which you have a headache
= (# worlds with flu and headache) / (# worlds with flu)
= (area of H ∧ F region) / (area of F region)
= Pr(H ∧ F) / Pr(F)

Conditional Probability
Definition: Pr(A|B) = Pr(A ∧ B) / Pr(B)
Chain rule: Pr(A ∧ B) = Pr(A|B) Pr(B)
Memorize these!

Inference
One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with the flu."
H = Have headache, F = Have flu
Pr(H) = 1/10, Pr(F) = 1/40, Pr(H|F) = 1/2
Is your reasoning correct? No:
Pr(F|H) = Pr(F ∧ H) / Pr(H) = Pr(H|F) Pr(F) / Pr(H) = (1/2)(1/40) / (1/10) = 1/8
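Numerically this is just the chain rule followed by a division; a quick check in Python (variable names are mine):

```python
p_h = 1 / 10          # Pr(H): probability of a headache
p_f = 1 / 40          # Pr(F): probability of flu
p_h_given_f = 1 / 2   # Pr(H|F): probability of a headache given flu

p_f_and_h = p_h_given_f * p_f   # chain rule: Pr(F ^ H) = Pr(H|F) Pr(F)
p_f_given_h = p_f_and_h / p_h   # Pr(F|H) = Pr(F ^ H) / Pr(H)
print(p_f_given_h)              # 0.125, i.e. 1/8, not 1/2
```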

Example: Joint Distribution

             sunny           ~sunny
           cold   ~cold    cold   ~cold
headache   0.108  0.012    0.072  0.008
~headache  0.016  0.064    0.144  0.576

Pr(headache ∧ sunny) = 0.108 + 0.012 = 0.12
Pr(headache ∧ ~sunny) = 0.072 + 0.008 = 0.08
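Conditioning works the same way on the joint: restrict to the worlds where the condition holds and renormalize. A small self-contained sketch (the table layout and the helper name cond_prob are illustrative):

```python
# Joint distribution over (headache, sunny, cold), same table as above.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def cond_prob(event, given):
    """Pr(event | given) = Pr(event ^ given) / Pr(given)."""
    p_given = sum(p for w, p in joint.items() if given(*w))
    p_both = sum(p for w, p in joint.items() if event(*w) and given(*w))
    return p_both / p_given

print(cond_prob(lambda h, s, c: h, lambda h, s, c: s))      # Pr(headache | sunny)  = 0.12 / 0.2 = 0.6
print(cond_prob(lambda h, s, c: h, lambda h, s, c: not s))  # Pr(headache | ~sunny) = 0.08 / 0.8 = 0.1
```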

Bayes Rule
Note: Pr(A|B) Pr(B) = Pr(A ∧ B) = Pr(B ∧ A) = Pr(B|A) Pr(A)
Bayes rule: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
Memorize this!

Using Bayes Rule for inference
Often we want to form a hypothesis about the world based on what we have observed.
Bayes rule is vitally important when viewed as stating the belief given to hypothesis H in light of evidence e:
Pr(H|e) = Pr(e|H) Pr(H) / Pr(e)
Pr(H|e): posterior probability
Pr(e|H): likelihood
Pr(H): prior probability
Pr(e): normalizing constant

Bayesian Learning
Prior: Pr(H)
Likelihood: Pr(e|H)
Evidence: e = <e_1, e_2, ..., e_t>
Bayesian learning amounts to computing the posterior using Bayes' theorem:
Pr(H|e) = Pr(e|H) Pr(H) / Pr(e)

Bayesian Prediction
Suppose we want to make a prediction about an unknown quantity X:
Pr(X|e) = Σ_i Pr(X|e, h_i) Pr(h_i|e) = Σ_i Pr(X|h_i) Pr(h_i|e)
Predictions are weighted averages of the predictions of the individual hypotheses.
Hypotheses serve as intermediaries between the raw data and the prediction.

Candy Example
Favorite candy sold in two flavors:
Lime (ugh)
Cherry (yum)
Same wrapper for both flavors.
Sold in bags with different ratios:
100% cherry
75% cherry + 25% lime
50% cherry + 50% lime
25% cherry + 75% lime
100% lime

Candy Example
You bought a bag of candy but don't know its flavor ratio.
After eating some candies:
What's the flavor ratio of the bag?
What will be the flavor of the next candy?

Statistical Learning
Hypothesis H: probabilistic theory of the world
h_1: 100% cherry
h_2: 75% cherry + 25% lime
h_3: 50% cherry + 50% lime
h_4: 25% cherry + 75% lime
h_5: 100% lime
Examples E: evidence about the world
e_1: 1st candy is cherry
e_2: 2nd candy is lime
e_3: 3rd candy is lime
...

Candy Example
Assume prior Pr(h_i) = <0.1, 0.2, 0.4, 0.2, 0.1>
Assume candies are i.i.d. (identically and independently distributed): Pr(e|h) = Π_j Pr(e_j|h)
Suppose the first 10 candies all taste lime:
Pr(e|h_5) = 1^10 = 1, Pr(e|h_3) = (1/2)^10 ≈ 0.001, Pr(e|h_1) = 0^10 = 0

[Figure: posteriors Pr(h_i | e_1, ..., e_t) for i = 1..5 versus the number of samples t (0 to 10), given data generated from h_5: Pr(h_5|E) climbs toward 1 while the other posteriors decay to 0.]

[Figure: Bayes prediction that the next candy is lime, Pr(lime | e_1, ..., e_t), versus the number of samples t (0 to 10), with data generated from h_5: the probability starts at 0.5 and climbs toward 1.]
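Both plots can be reproduced in a few lines of Python. A sketch under the slide's assumptions (the prior <0.1, 0.2, 0.4, 0.2, 0.1> and ten lime candies in a row; variable names are mine):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # Pr(h_1), ..., Pr(h_5)
lime_frac = [0.0, 0.25, 0.5, 0.75, 1.0]  # Pr(lime | h_i)

posterior = priors[:]
for t in range(1, 11):                   # observe 10 lime candies, one at a time
    unnorm = [p * lf for p, lf in zip(posterior, lime_frac)]  # Pr(h_i|e) * Pr(lime|h_i)
    z = sum(unnorm)                      # normalizing constant
    posterior = [u / z for u in unnorm]  # updated Pr(h_i | e_1..e_t)
    # Bayesian prediction: Pr(next is lime | e_1..e_t) = sum_i Pr(lime|h_i) Pr(h_i|e)
    pred = sum(p * lf for p, lf in zip(posterior, lime_frac))
    print(t, [round(p, 3) for p in posterior], round(pred, 3))
```

After one lime candy the posterior is <0, 0.1, 0.4, 0.3, 0.2> and the prediction is 0.65; after ten, Pr(h_5|E) is roughly 0.9 and the prediction is close to 1, matching the plots.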

Bayesian Learning
Bayesian learning properties:
Optimal (i.e., given the prior, no other prediction is correct more often than the Bayesian one)
No overfitting (all hypotheses are considered and weighted)
There is a price to pay: when the hypothesis space is large, Bayesian learning may be intractable, i.e., the sum (or integral) over hypotheses is often intractable.
Solution: approximate Bayesian learning

Maximum a posteriori (MAP)
Idea: make predictions based on the most probable hypothesis:
h_MAP = argmax_h Pr(h|e) = argmax_h Pr(e|h) Pr(h)
Prediction: Pr(X|e) ≈ Pr(X|h_MAP)
In contrast, Bayesian learning makes predictions based on all hypotheses, weighted by their probability.

MAP properties
MAP prediction is less accurate than Bayesian prediction since it relies on only one hypothesis.
But MAP and Bayesian predictions converge as the data increases.
Controlled overfitting (the prior can be used to penalize complex hypotheses).
Finding h_MAP may be intractable:
h_MAP = argmax_h Pr(e|h) Pr(h)
Optimization may be difficult.

Maximum Likelihood (ML)
Idea: simplify MAP by assuming a uniform prior (i.e., Pr(h_i) = Pr(h_j) for all i, j):
h_ML = argmax_h Pr(e|h)
Make predictions based on h_ML only: Pr(X|e) ≈ Pr(X|h_ML)

ML properties
ML prediction is less accurate than Bayesian and MAP predictions since it ignores prior information and relies on only one hypothesis.
But ML, MAP and Bayesian predictions converge as the data increases.
Subject to overfitting (there is no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns).
Finding h_ML is often easier than finding h_MAP:
h_ML = argmax_h Σ_j log Pr(e_j|h)
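On the candy data the three predictions can be compared side by side. A hedged sketch under the same assumptions as above (the argmax over five hypotheses stands in for the possibly difficult optimization the slides mention):

```python
import math

priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # Pr(h_i), as assumed earlier
lime_frac = [0.0, 0.25, 0.5, 0.75, 1.0]  # Pr(lime | h_i)
data = ['lime'] * 10                     # first 10 candies all lime

def log_lik(i):
    """log Pr(e|h_i) = sum_j log Pr(e_j|h_i), with log 0 = -inf."""
    ll = 0.0
    for c in data:
        q = lime_frac[i] if c == 'lime' else 1 - lime_frac[i]
        ll += math.log(q) if q > 0 else float('-inf')
    return ll

unnorm = [priors[i] * math.exp(log_lik(i)) for i in range(5)]
posterior = [u / sum(unnorm) for u in unnorm]

h_map = max(range(5), key=lambda i: math.log(priors[i]) + log_lik(i))
h_ml = max(range(5), key=log_lik)

print('Bayes:', sum(posterior[i] * lime_frac[i] for i in range(5)))  # ~0.97
print('MAP:  ', lime_frac[h_map])   # h_MAP = h_5, so prediction 1.0
print('ML:   ', lime_frac[h_ml])    # h_ML  = h_5, so prediction 1.0
```

After ten lime candies all three predictors agree that the next candy is almost certainly lime, illustrating the convergence noted above; with less data the Bayesian prediction is more conservative.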