Predictive Coding. CSE 390 Introduction to Data Compression, Fall 2004. Entropy. Bad and Good Prediction. Which Context to Use? PPM

Predictive Coding
CSE 390 Introduction to Data Compression, Fall 2004
- Predictive Coding (PPM, JBIG, Differencing, Move-To-Front)
- Burrows-Wheeler Transform (bzip2)

Predictive Coding
The next symbol can be statistically predicted from the past:
- Code with context, or
- Code the difference, or
- Move to front, then code.
Goals of prediction:
- The prediction should make the probability of the next symbol as high as possible.
- After prediction there is nothing left to know except the probabilities.

Bad and Good Prediction
From information theory: the lower the information, the fewer bits are needed to code the symbol.
$\mathrm{inf}(a) = \log_2\frac{1}{P(a)}$
Examples:
- P(a) = 1023/1024: inf(a) ≈ .0014
- P(a) = 1/2: inf(a) = 1
- P(a) = 1/1024: inf(a) = 10

Entropy
Entropy is the expected number of bits to code a symbol in the model, with symbol a_i having probability P(a_i):
$H = \sum_{i=1}^{m} P(a_i)\log_2\frac{1}{P(a_i)}$
Good coders should be close to this bound: Arithmetic, Huffman, Golomb, Tunstall.

PPM
Prediction with Partial Matching, Cleary and Witten (1984).
Tries to find a good context to code the next symbol.

context     a    e    i    r    s    y
the         0    0    5    7    4    7
he         10    1    7   10    9    7
e          12    2   10   15   10   10
<nil>      50   70   30   35   40   13

Uses adaptive arithmetic coding for each context.

Which Context to Use?
Using the previous table, which context codes the italicized letter?
- We pulled a heavy wagon.
- The theatre was fun.
- Twas theere haus!
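These numbers are easy to check. Below is a minimal Python sketch (mine, not part of the slides; the example probabilities in the last line are my own illustration) that evaluates the information and entropy formulas above:

```python
import math

def information(p):
    """Information content, in bits, of a symbol with probability p."""
    return math.log2(1 / p)

def entropy(probs):
    """Expected bits per symbol: H = sum of P(a_i) * log2(1 / P(a_i))."""
    return sum(p * information(p) for p in probs)

print(information(1023 / 1024))  # ~0.0014: a near-certain symbol is almost free
print(information(1 / 2))        # 1.0
print(information(1 / 1024))     # 10.0: a surprising symbol is expensive

# A skewed model has much lower entropy than log2(m) for m equiprobable symbols.
print(entropy([0.9, 0.05, 0.03, 0.02]))  # ~0.62 bits, versus 2 bits for 4 equal symbols
```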

JBIG
A JBIG coder is for binary images: documents and graphics.
It codes in scan line order, using a context taken from the same and previous scan lines.
[Figure: the context template, a window of already-coded pixels from the current and previous scan lines around the next bit to be coded.]
Uses adaptive arithmetic coding with context.

Example: counts observed in two different contexts.

Context 0 0 0 0 0 0 0 0 0 0:
  next bit:    0    1
  frequency: 100   10
$H = \frac{100}{110}\log_2\frac{110}{100} + \frac{10}{110}\log_2\frac{110}{10} \approx .44$

Context 0 1 1 0 1 1 1 0 0 1:
  next bit:    0    1
  frequency:  15   50
$H = \frac{15}{65}\log_2\frac{65}{15} + \frac{50}{65}\log_2\frac{65}{50} \approx .78$

Issues with Context
- Context dilution: if there are too many contexts, then too few symbols are coded in each context, making the contexts ineffective because of the zero-frequency problem.
- Context saturation: if there are too few contexts, then the contexts might not be as good as having more contexts.
- Wrong contexts give poor predictors.

Prediction by Differencing
Used for numerical data.
Example: 2 3 4 5 6 7 8 7 6 5 4 3 2
transforms to 2 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1,
which has much lower first-order entropy.

General Differencing
Let $x_1, x_2, \ldots, x_n$ be numerical data that is correlated, that is, $x_i$ is near $x_{i+1}$.
Better compression can result from coding $x_1,\; x_2 - x_1,\; x_3 - x_2,\; \ldots,\; x_n - x_{n-1}$.
This idea is used in signal coding, audio coding, and video coding.
There are fancier prediction methods based on linear combinations of previous data, but these can require training.
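The differencing transform just described is a prefix difference, and its inverse is a prefix sum. A minimal sketch (the function names are mine):

```python
def delta_encode(xs):
    """Keep the first value; replace each later value by its change from the previous one."""
    return [xs[0]] + [xs[i] - xs[i - 1] for i in range(1, len(xs))]

def delta_decode(ds):
    """Invert delta_encode with a running sum."""
    xs = [ds[0]]
    for d in ds[1:]:
        xs.append(xs[-1] + d)
    return xs

# The slide's example:
data = [2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2]
diffs = delta_encode(data)
assert diffs == [2, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
assert delta_decode(diffs) == data
```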

Move to Front Coding
MTF is part of Burrows-Wheeler, the basis for bzip2!
It is for non-numerical data where the data have a relatively small working set that changes over the sequence.
Move to Front algorithm:
- Symbols are kept in a list indexed 0 to m-1.
- To code a symbol, output its index and move the symbol to the front of the list.

Move to Front Example
Code the string a b a b a a b c c b b c c c c b d b c c, starting from the list (a, b, c, d).
The first steps: a → 0; b → 1 (list becomes b a c d); a → 1 (list becomes a b c d); b → 1; a → 1; a → 0; b → 1; c → 2 (list becomes c b a d).

The remaining steps: c → 0; b → 1; b → 0; c → 1; c → 0; c → 0; c → 0; b → 1; d → 3 (list becomes d b c a); b → 1; c → 2; c → 0. The final list is (c, b, d, a).
Output: 0 1 1 1 1 0 1 2 0 1 0 1 0 0 0 1 3 1 2 0
Frequencies of {a, b, c, d}: 4 7 8 1
Frequencies of {0, 1, 2, 3}: 8 9 2 1
The index frequencies are more skewed than the symbol frequencies, so the output codes in fewer bits.

Extreme Example
Input:  aaaaaaaaaabbbbbbbbbbccccccccccdddddddddd
Output: 0000000000100000000020000000003000000000
Frequencies of {a, b, c, d}: 10 10 10 10
Frequencies of {0, 1, 2, 3}: 37 1 1 1
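The list operations above translate directly into a small coder. A sketch mirroring the algorithm (the function names are mine):

```python
def mtf_encode(s, alphabet):
    """Code each symbol as its current list index, then move it to the front."""
    lst = list(alphabet)
    out = []
    for ch in s:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))
    return out

def mtf_decode(indices, alphabet):
    """Invert mtf_encode by maintaining the same list as the encoder."""
    lst = list(alphabet)
    out = []
    for i in indices:
        ch = lst[i]
        out.append(ch)
        lst.insert(0, lst.pop(i))
    return "".join(out)

# The worked example above:
codes = mtf_encode("ababaabccbbccccbdbcc", "abcd")
assert codes == [0, 1, 1, 1, 1, 0, 1, 2, 0, 1, 0, 1, 0, 0, 0, 1, 3, 1, 2, 0]
assert mtf_decode(codes, "abcd") == "ababaabccbbccccbdbcc"

# The extreme example: 37 zeros and one each of 1, 2, 3.
extreme = mtf_encode("a" * 10 + "b" * 10 + "c" * 10 + "d" * 10, "abcd")
assert extreme.count(0) == 37
```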

Burrows-Wheeler Transform
Burrows and Wheeler, 1994.
- The BW Transform creates a representation of the data which has a small working set.
- The transformed data is compressed with move-to-front coding.
- The decoder is quite different from the encoder.
- The algorithm requires processing the entire string at once (it is not on-line).
- It is a remarkably good compression method.

In-Class Exercise
Use Move-to-Front Coding with an initial ordering of { a, b, c, d } for the following string:
d c b a a b b c b c c b

Encoding abracadabra
1. Create all cyclic shifts of the string.
   0  abracadabra
   1  bracadabraa
   2  racadabraab
   3  acadabraabr
   4  cadabraabra
   5  adabraabrac
   6  dabraabraca
   7  abraabracad
   8  braabracada
   9  raabracadab
   10 aabracadabr
2. Sort the strings alphabetically into an array A.
3. Let L be the last column of A.
   A:
   0  aabracadabr
   1  abraabracad
   2  abracadabra
   3  acadabraabr
   4  adabraabrac
   5  braabracada
   6  bracadabraa
   7  cadabraabra
   8  dabraabraca
   9  raabracadab
   10 racadabraab
   L = rdarcaaaabb
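The three encoding steps translate into a short sketch. This is the quadratic teaching version that materializes every shift, not the prefix-sorting refinement mentioned in the closing notes (the function name is mine):

```python
def bwt_encode(s):
    """Return (L, X): the last column of the sorted cyclic shifts of s,
    and the row index X at which s itself appears."""
    shifts = [s[i:] + s[:i] for i in range(len(s))]  # step 1: all cyclic shifts
    A = sorted(shifts)                               # step 2: sort into the array A
    L = "".join(row[-1] for row in A)                # step 3: take the last column
    X = A.index(s)
    return L, X

assert bwt_encode("abracadabra") == ("rdarcaaaabb", 2)
```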

4. Transmit X, the index of the input string in A, and L (coding L with move to front).
L = rdarcaaaabb, X = 2

Why BW Works
Ignore decoding for the moment.
- The prefix of each shifted string is a context for the last symbol.
- The last symbol appears just before the prefix in the original.
- By sorting, similar contexts are adjacent.
- This means that the predicted last symbols are similar.

Decoding
We first decode assuming some extra information; we then show how to compute that information.
Let A^s be A with each string cyclically shifted by one:
0  raabracadab
1  dabraabraca
2  aabracadabr
3  racadabraab
4  cadabraabra
5  abraabracad
6  abracadabra
7  acadabraabr
8  adabraabrac
9  braabracada
10 bracadabraa

Assume we know the mapping T, where T[i] is the index in A^s of string i of A.

Let F be the first column of A; it is just L, sorted.
i:  0  1  2  3  4  5  6   7  8  9  10
F:  a  a  a  a  a  b  b   c  d  r  r
T:  2  5  6  7  8  9  10  4  1  0  3

Follow the pointers T through F to recover the input, starting at X.
X = 2: output F[2] = a.
i = T[2] = 6: output F[6] = b, giving ab.
i = T[6] = 10: output F[10] = r, giving abr.
Continuing the walk reproduces all of abracadabra.

Why does this work?
The first symbol of A[T[i]] is the second symbol of A[i], because A^s[T[i]] = A[i].

How do we compute T from L and X?
Note that L is the first column of A^s, and A^s is in the same order as A.
If i is the k-th x in F, then T[i] is the k-th x in L.

F: a a a a a b b c d r r
L: r d a r c a a a a b b

- The a's occupy F[0..4] and occur in L at indices 2, 5, 6, 7, 8, so T[0..4] = 2 5 6 7 8.
- The b's occupy F[5..6] and occur in L at indices 9, 10, so T[5] = 9 and T[6] = 10.
- The single c gives T[7] = 4, and the single d gives T[8] = 1.
- The r's occupy F[9..10] and occur in L at indices 0, 3, so T[9] = 0 and T[10] = 3.

T = 2 5 6 7 8 9 10 4 1 0 3
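This rule lets the decoder rebuild T from L alone and then walk the pointers from X. A sketch using the slide's names F, L, T, and X (the helper variables are mine):

```python
def bwt_decode(L, X):
    """Invert the BWT: build T from L, then follow the pointers starting at X."""
    F = sorted(L)  # the first column of A is just L, sorted
    # Record where each symbol occurs in L, in order of occurrence.
    where = {}
    for j, ch in enumerate(L):
        where.setdefault(ch, []).append(j)
    # If i is the k-th occurrence of symbol x in F, then T[i] is
    # the position of the k-th occurrence of x in L.
    seen, T = {}, []
    for ch in F:
        k = seen.get(ch, 0)
        T.append(where[ch][k])
        seen[ch] = k + 1
    # Follow T through F to reproduce the input.
    out, i = [], X
    for _ in range(len(L)):
        out.append(F[i])
        i = T[i]
    return "".join(out)

assert bwt_decode("rdarcaaaabb", 2) == "abracadabra"
```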

Notes on BW
- Alphabetic sorting does not need the entire cyclic shifted inputs; you just have to look at long enough prefixes. A bucket sort will work here.
- The transform requires the entire input. In practice that's impossible, so break the input into blocks.
- There are high quality practical implementations: Bzip, Bzip2 (seems to be public domain).
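For a concrete reference point, bzip2 itself (BWT, then move to front, then entropy coding) is available in Python's standard library:

```python
import bz2

data = b"abracadabra" * 1000
packed = bz2.compress(data)  # block-sorting compression, as described above
assert bz2.decompress(packed) == data
print(len(data), "->", len(packed))  # a repetitive input shrinks dramatically
```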