Computational Learning Theory: Agnostic Learning


Computational Learning Theory: Agnostic Learning. Machine Learning, Fall 2018. Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell, and others.

This lecture: Computational Learning Theory. The theory of generalization; Probably Approximately Correct (PAC) learning; positive and negative learnability results; agnostic learning; shattering and the VC dimension.


So far we have seen: the general setting for batch learning, PAC learning, and Occam's Razor. How good will a classifier that is consistent on a training set be? Two assumptions so far: 1. Training and test examples come from the same distribution. 2. For any concept, there is some function in the hypothesis space that is consistent with the training set. Is the second assumption reasonable?


What is agnostic learning? So far, we have assumed that the learning algorithm could find the true concept. What if we are trying to learn a concept f using hypotheses in H, but f ∉ H? That is, C is not a subset of H. This setting is called agnostic learning. Can we say something about sample complexity? This is a more realistic setting than before.


Agnostic Learning. Learn a concept f using hypotheses in H, but f ∉ H. Are we guaranteed that training error will be zero? No. There may be no consistent hypothesis in the hypothesis space! Our goal should be to find a classifier h ∈ H that has low training error, that is, a small fraction of misclassified training examples. What we want: a guarantee that a hypothesis with small training error will have good accuracy on unseen examples.
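To make the quantity precise (the slides state it in words; the explicit formula is supplied here), the training error of a hypothesis h on a sample S = {(x_1, y_1), ..., (x_m, y_m)} is the fraction of misclassified training examples:

\text{err}_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left[ h(x_i) \neq y_i \right]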

We will use tail bounds for our analysis: how far can a random variable get from its mean? In other words, how much probability mass lies in the tails of these distributions?

Bounding probabilities. Law of large numbers: as we collect more samples, the empirical average converges to the true expectation. E.g., suppose we have an unknown coin and we want to estimate its bias (i.e., the probability of heads). Toss the coin m times; then (number of heads) / m ≈ P(heads). As m increases, we get a better estimate of P(heads). What can we say about the gap between these two terms?
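A minimal simulation sketch of this convergence (not from the slides; the function name estimate_bias and the true bias of 0.3 are assumptions chosen for illustration):

import random

def estimate_bias(p_heads, m, seed=0):
    # Toss a coin with true P(heads) = p_heads a total of m times
    # and return the empirical fraction of heads.
    rng = random.Random(seed)
    heads = sum(1 for _ in range(m) if rng.random() < p_heads)
    return heads / m

if __name__ == "__main__":
    true_p = 0.3  # the "unknown" bias, fixed here only for the demo
    for m in (10, 100, 1000, 10000):
        p_hat = estimate_bias(true_p, m, seed=m)
        print(f"m = {m:6d}  estimate = {p_hat:.4f}  gap = {abs(p_hat - true_p):.4f}")

As m grows, the printed gap between the empirical average and the true bias shrinks; the tail bounds below quantify exactly how fast.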

Bounding probabilities. Markov's inequality bounds the probability that a nonnegative random variable exceeds a fixed value. Chebyshev's inequality bounds the probability that a random variable differs from its expected value by more than a fixed number of standard deviations. What we want: to bound sums of random variables. Why? Because the training error depends on the number of errors on the training set.
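For reference (the slides cite these inequalities by name only; the standard statements are): for a nonnegative random variable X and any a > 0, Markov's inequality gives

\Pr[X \geq a] \leq \frac{\mathbb{E}[X]}{a},

and for a random variable X with mean \mu and standard deviation \sigma, Chebyshev's inequality gives, for any k > 0,

\Pr[|X - \mu| \geq k\sigma] \leq \frac{1}{k^2}.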

Hoeffding's inequality gives an upper bound on how much the sum (equivalently, the average) of a set of bounded, independent random variables differs from its expected value. Here the expected mean is the true quantity of interest (e.g., for a coin toss, the probability of seeing heads), and the empirical mean is computed over m independent trials. What this tells us: the empirical mean will not be too far from the expected mean if there are many samples, and it quantifies the convergence rate as well.
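The equation on the slide is not captured in this transcription; the standard statement for the coin-toss setting (i.i.d. random variables X_1, ..., X_m taking values in [0, 1], with true mean p and empirical mean \hat{p} = \frac{1}{m}\sum_i X_i) is

\Pr\left[ \hat{p} - p > \epsilon \right] \leq e^{-2m\epsilon^2},

with a symmetric bound for the other tail (so the two-sided version carries an extra factor of 2).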

Back to agnostic learning. Suppose we consider the true error (a.k.a. generalization error) err_D(h) to be a random variable, and the training error over m examples, err_S(h), to be the empirical estimate of this true error. We can ask: what is the probability that the true error is more than ε away from the empirical error? Let's apply Hoeffding's inequality.
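Again, the slide's own equation is not in the transcription; applying the one-sided Hoeffding bound above to a single, fixed hypothesis h gives

\Pr\left[ \text{err}_D(h) > \text{err}_S(h) + \epsilon \right] \leq e^{-2m\epsilon^2}.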

Agnostic learning. The probability that a single hypothesis h has a training error that is more than ε away from the true error is bounded above. But the learning algorithm looks for the best one of the |H| possible hypotheses, so what we need to bound is the probability that there exists some hypothesis in H whose training error is more than ε away from the true error. This, too, is bounded above, via the union bound.
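Written out (a standard step; the slide's equation is not captured here), the union bound over the |H| hypotheses gives

\Pr\left[ \exists h \in H : \text{err}_D(h) > \text{err}_S(h) + \epsilon \right] \leq |H| \, e^{-2m\epsilon^2}.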

Agnostic learning. The probability that there exists a hypothesis in H whose training error is more than ε away from the true error is bounded above. Same game as before: we want this probability to be smaller than δ. Rearranging this gives us the sample complexity.
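Setting the bound |H| e^{-2m\epsilon^2} \leq \delta and solving for m (the rearrangement the slide refers to; the explicit expression is reconstructed here) gives

m \geq \frac{1}{2\epsilon^2} \left( \ln|H| + \ln\frac{1}{\delta} \right).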

Agnostic learning: Interpretations. 1. An agnostic learner makes no commitment to whether f is in H and returns the hypothesis with the least training error over at least m examples. It can guarantee with probability 1 − δ that its true error is not off by more than ε from its training error, provided m satisfies the bound above. Two quantities drive this bound: the difference between generalization and training errors (how much worse will the classifier be in the future than it is at training time?) and the size of the hypothesis class (again an Occam's razor argument: prefer smaller sets of functions). 2. We also get a generalization bound: a bound on how much the true error will deviate from the training error. If we have more than m examples, then with high probability (more than 1 − δ), the generalization error is bounded by the training error plus a term that shrinks as m grows.
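The slide's inequality is not captured in the transcription; rearranging the same bound in the other direction gives the standard form

\text{err}_D(h) \leq \text{err}_S(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2m}}.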

What we have seen so far. Occam's razor: when the hypothesis space contains the true concept. Agnostic learning: when the hypothesis space may not contain the true concept. In both cases, learnability depends on the log of the size of the hypothesis space. Have we solved everything? E.g., what about linear classifiers?