TÜ Information Retrieval


TÜ Information Retrieval, Übung 2 (Exercise 2)
Heike Adel, Sascha Rothe
Center for Information and Language Processing, University of Munich
May 8, 2014

Problem 1

Assume that
- machines in MapReduce have 100 GB of disk space each,
- the postings list of the term THE has a size of 180 GB for a particular collection,
- we do not use compression.

Then the MapReduce algorithm as described in class cannot be run to construct the inverted index. Why? How would you modify the algorithm so that it can handle this case?

Problem 1

Recap: MapReduce index construction as described in class. [Diagram of the MapReduce data flow omitted.]


Problem 1

Why can the MapReduce algorithm as described in class not be run to construct the inverted index?

The algorithm assumes that each term's postings list will fit on a single inverter. This is not true for the postings list of THE.


Problem 1

How would you modify the algorithm so that it can handle this case?

Let N be the largest docID for THE. The master defines two partitions for the postings list of THE:
(1) THE in documents with docID ≤ N/2
(2) THE in documents with docID > N/2
These shorter postings lists can be sorted by two separate inverters. Afterwards, the master links the end of list (1) to the beginning of list (2).
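To make the partitioning idea concrete, here is a minimal Python sketch (not part of the original exercise; the helper names and the toy docID list are illustrative). It splits the postings of an oversized term at N/2, sorts each half as a separate inverter would, and concatenates the results:

```python
def partition_postings(postings, largest_docid):
    # Split the docIDs of an oversized term's postings list at N/2
    # so that each half fits on a single inverter.
    n = largest_docid
    low = [d for d in postings if d <= n // 2]    # partition (1)
    high = [d for d in postings if d > n // 2]    # partition (2)
    return low, high

def build_oversized_postings(postings, largest_docid):
    low, high = partition_postings(postings, largest_docid)
    sorted_low = sorted(low)    # sorted by inverter 1
    sorted_high = sorted(high)  # sorted by inverter 2
    # The master links the end of list (1) to the beginning of list (2);
    # since every docID in (1) is <= N/2 and every docID in (2) is larger,
    # the concatenation is already fully sorted.
    return sorted_low + sorted_high

# Toy example with N = 9:
print(build_oversized_postings([7, 2, 9, 4, 1, 8], largest_docid=9))
# -> [1, 2, 4, 7, 8, 9]
```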

Problem 2

Given a collection with exactly 4 words a, b, c, d. The frequency order is a > b > c > d. The total number of tokens in the collection is 5000. Assume that Zipf's law holds exactly for this collection. What are the frequencies of the four words?

Problem 2

Recap: Zipf's law. The i-th most frequent term has a collection frequency cf_i proportional to 1/i. Hence there is a constant c such that cf_i = c · 1/i.
More generally: in natural language, there are a few very frequent terms and very many very rare terms.


Problem 2

What are the frequencies of the four words?

Assume a appears f_a times in the collection. Then:
f_a + (1/2) · f_a + (1/3) · f_a + (1/4) · f_a = 5000
⇒ f_a = 2400
f_b = (1/2) · f_a = 1200
f_c = (1/3) · f_a = 800
f_d = (1/4) · f_a = 600
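As a quick sanity check, the frequencies can be recomputed in a few lines of Python (illustrative, not part of the original exercise):

```python
# Zipf's law with 4 words: f_i = f_a / i, and the four frequencies sum to 5000.
total = 5000
harmonic = sum(1 / i for i in range(1, 5))          # 1 + 1/2 + 1/3 + 1/4
f_a = total / harmonic
freqs = {word: f_a / i for i, word in enumerate("abcd", start=1)}
print(freqs)  # a: 2400, b: 1200, c: 800, d: 600 (up to floating-point rounding)
```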

Problem 3

We define a hapax legomenon as a term that occurs exactly once in a collection. We want to estimate the number of hapax legomena using Heaps' law and Zipf's law.

(i) How many unique terms does a web collection of 600,000,000 web pages, containing 600 tokens each on average, have? Use the Heaps parameters k = 100 and b = 0.5.

Problem 3

Recap: Heaps' law: M = k · T^b, with M being the size of the vocabulary and T the number of tokens in the collection.


Problem 3

(i) Number of unique terms: M = 100 · (600,000,000 · 600)^0.5 = 60,000,000
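The same estimate in Python (illustrative):

```python
# Heaps' law: M = k * T**b with k = 100 and b = 0.5.
k, b = 100, 0.5
T = 600_000_000 * 600        # total number of tokens in the collection
M = k * T ** b               # estimated vocabulary size
print(f"{M:,.0f}")           # 60,000,000
```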

Problem 3

(ii) Use Zipf's law to estimate the proportion of the term vocabulary of the collection that consists of hapax legomena.
Hint: Σ_{i=1}^{n} 1/i ≈ ln(n)

Problem 3

(ii) Zipf's law: cf_i ∝ 1/i, i.e. there is a constant c such that cf_i = c · 1/i.
Calculate c: the sum of all collection frequencies equals the total number of tokens T:
T = 600,000,000 · 600 = Σ_{i=1}^{M} c · 1/i = c · Σ_{i=1}^{60,000,000} 1/i ≈ c · ln(60,000,000) ≈ 17.9 · c
⇒ c = T / 17.9 ≈ 2 · 10^10


Problem 3

(ii) With c = 2 · 10^10, Zipf's law gives cf_i = 2 · 10^10 · 1/i.
Calculate the frequency of the least frequent term (i.e. the term with rank i = 60,000,000):
cf_60,000,000 = 2 · 10^10 / 60,000,000 ≈ (1/3) · 1000 ≈ 333
The least frequent term appears more than once! Based on Heaps' law and Zipf's law, there are no hapax legomena in the collection! The proportion of hapax legomena is 0.
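The constant c and the frequency of the least frequent term can be reproduced with a short Python sketch (illustrative; it uses math.log for the ln approximation from the hint, so it yields about 335 rather than the slide's rounded 333):

```python
import math

# Zipf's law: cf_i = c / i, and all collection frequencies sum to T.
T = 600_000_000 * 600        # total number of tokens
M = 60_000_000               # vocabulary size from Heaps' law, part (i)
c = T / math.log(M)          # sum_{i=1}^{M} 1/i is approximated by ln(M) ~ 17.9
cf_min = c / M               # frequency of the rank-M (least frequent) term
print(f"c ~ {c:.2e}, least frequent term occurs ~ {cf_min:.0f} times")
# c ~ 2.0e10; the least frequent term occurs ~335 times (the slide rounds c
# to 2*10^10, giving ~333) -- far more than once, so no hapax legomena are predicted.
```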


Problem 3

(iii) Do you think that the estimate you get (no hapax legomena) is correct?
This prediction is not correct. Generally, roughly 50% of the vocabulary consists of hapax legomena (but this depends on the collection!).


Problem 3

(iv) Discuss what possible reasons there might be for the incorrectness of the estimate.
One of the two laws must be the reason for the incorrect prediction.
- Heaps' law: fairly accurate (see class), so Heaps' law is not the reason.
- Zipf's law: a bad fit, especially at the low-frequency end. This is the reason for the incorrect prediction!

Problem 4

γ-codes are inefficient for large numbers (e.g. > 1000) because they encode the length of the offset in unary code. δ-codes, on the other hand, use γ-codes for encoding this length.

Definitions:
- γ-code of G: unary-code(length(offset(G))), offset(G)
- δ-code of G: γ-code(length(offset(G + 1))), offset(G + 1)

Compute the δ-codes for 1, 2, 3, 4, 31, 63, 127, 1023.

Problem 4

Compute the δ-codes for 1, 2, 3, 4, 31, 63, 127, 1023.

number | δ-code
     1 | 0,0
     2 | 0,1
     3 | 10,0,00
     4 | 10,0,01
    31 | 110,01,00000
    63 | 110,10,000000
   127 | 110,11,0000000
  1023 | 1110,010,0000000000
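As a sanity check, here is a small Python sketch (illustrative, not part of the original exercise) of the convention defined above: the δ-code of G is the γ-code of length(offset(G + 1)) followed by offset(G + 1), with commas separating the parts as in the table. Running it reproduces each row of the table.

```python
def unary(n):
    # Unary code for n: n ones followed by a terminating zero.
    return "1" * n + "0"

def offset(g):
    # Binary representation of g without its leading 1 bit.
    return bin(g)[3:]          # bin(g) == '0b1...', drop '0b' and the leading 1

def gamma(g):
    # gamma-code of g: unary(length(offset(g))), offset(g)
    off = offset(g)
    parts = [unary(len(off))]
    if off:
        parts.append(off)
    return ",".join(parts)

def delta(g):
    # delta-code of g as defined on the slide:
    # gamma(length(offset(g + 1))), offset(g + 1)
    off = offset(g + 1)
    return ",".join([gamma(len(off)), off])

for g in [1, 2, 3, 4, 31, 63, 127, 1023]:
    print(g, delta(g))         # prints exactly the rows of the table above
```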

The end

Thank you for your attention! Do you have any questions?