Who wrote the Letter to the Hebrews? Data mining for detection of text authorship

Madeleine Sabordo, Shong Y. Chai, Matthew J. Berryman, and Derek Abbott
Centre for Biomedical Engineering and School of Electrical and Electronic Engineering, The University of Adelaide, SA 5005, Australia.

ABSTRACT

This paper explores the authorship of the Letter to the Hebrews using a number of different measures of relationship between different texts of the New Testament. The methods used in the study include file zipping and compression techniques, the prediction by partial matching technique and the word recurrence interval technique. The long-term motivation is that the techniques employed in this study may find applicability in future-generation web search engines, email authorship identification, detection of plagiarism and filtering of terrorist email traffic.

Keywords: data mining, text authorship, text statistics, word recurrence interval, compression

1. INTRODUCTION

Although data mining has been around for many years, the term itself only became notable in the 1990s. Data mining is technically defined as the process of extracting information hidden within large volumes of raw data. It has found wide applicability in business management, being used in such fields as marketing, retail, finance and insurance. Recently, researchers have applied data mining algorithms to areas such as DNA analysis and text classification. Examples include Mantegna et al.'s exploitation of Shannon's concept of information entropy for the study of DNA sequences, [1-3] Ortuño et al.'s use of the standard deviation of the word recurrence interval (WRI) for extracting keywords, [4] and the application of this work to the question of authorship of the books of the New Testament. [5, 6]

The books of the New Testament were written around 45-90 CE. The authorship attribution of many books of the New Testament is a subject of continuing debate and investigation among Bible readers and researchers.
The authorship of the Letter to the Hebrews has been a long-standing debate and makes an interesting case study herein. We employed a GZip compression technique, [7, 8] the Prediction by Partial Matching (PPM) compression technique [9, 10] and the Word Recurrence Interval (WRI) technique [6] for detection of text authorship. We found that the GZip compression technique was not very effective in analysing similarities or differences in patterns, trends or relationships between texts with sizes similar to the texts of the New Testament. The Word Recurrence Interval method, on the other hand, can detect similarities between texts written by the same author and is thus able to assist in text authorship identification.

Send correspondence to Derek Abbott. E-mail: dabbott@eleceng.adelaide.edu.au, Telephone: +61 8 8303 5748

Smart Structures, Devices, and Systems II, edited by Said F. Al-Sarawi, Proceedings of SPIE Vol. 5649 (SPIE, Bellingham, WA, 2005) 0277-786X/05/$15 doi: 10.1117/12.58298

2. DATA MINING TECHNIQUES

We use three techniques for text classification. The first two involve compressing the file and, for each, using two related measures. The first of these is the delta value,

$$\Delta_{Ab} = L_{A \oplus b} - L_A, \qquad (1)$$

where $L_x$ is the compressed length (in bytes) of a file $x$, and $\oplus$ denotes string concatenation. Here, $A$ denotes a portion of the text to be compared with extract $b$ from source text $B$. The second is a distance metric,

$$S_{AB} = \frac{\Delta_{Ab} - \Delta_{Bb}}{\Delta_{Bb}} + \frac{\Delta_{Ba} - \Delta_{Aa}}{\Delta_{Aa}}, \qquad (2)$$

from Benedetto et al., [11] where $\Delta_{Xy}$ is defined as in Eq. 1. The third method compares the scaled standard deviation of the WRI, where the WRI is the interval between recurrences of a word. For example, in "The cat sat on the mat", the interval between the occurrences of "the" is three, because there are three words between (non-inclusive) the two occurrences.

2.1. GZip compression

For this method, the Hebrews text was taken as source text $A$. It was compressed separately to measure the value $L_A$. From each of the remaining 26 texts of the New Testament, taken in turn as source text $B$, a small sequence $b$ was randomly extracted and appended to source $A$. The new sequence $A \oplus b$ was then compressed, giving length $L_{A \oplus b}$. The value of $\Delta_{Ab}$ was obtained using Eq. 1. To obtain more reliable values, the random extraction of small sequences from each of the 26 texts to append to the Hebrews text was run five times, and the corresponding values of $\Delta_{Ab}$ were calculated in each case. The small sequences were varied in length (number of words) to assist in graphing and analysis. Plots of $\Delta_{Ab}$ values for texts with sizes greater than 2,000 words, compressed using the GZip software, are depicted in Figure 1.
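The delta measurement described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes Python's standard `gzip` module stands in for the GZip software, that texts are plain strings encoded as UTF-8, that a single space joins $A$ and $b$, and that an extract is a contiguous run of whitespace-separated words. All function names are hypothetical.

```python
import gzip
import random
import statistics


def compressed_length(text: str) -> int:
    """L_x: size in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def delta(a: str, b: str) -> int:
    """Eq. 1: growth in compressed size when extract `b` is appended to `a`.

    Joining with a single space is an assumption of this sketch.
    """
    return compressed_length(a + " " + b) - compressed_length(a)


def random_extract(words: list[str], n_words: int, rng: random.Random) -> str:
    """Pick a contiguous run of `n_words` words starting at a random position."""
    start = rng.randrange(len(words) - n_words + 1)
    return " ".join(words[start:start + n_words])


def delta_stats(a: str, source_b: str, n_words: int,
                trials: int = 5, seed: int = 0) -> tuple[float, float]:
    """Mean and standard deviation of delta over `trials` random extracts of B."""
    rng = random.Random(seed)
    words = source_b.split()
    values = [delta(a, random_extract(words, n_words, rng))
              for _ in range(trials)]
    return statistics.fmean(values), statistics.pstdev(values)
```

Calling `delta_stats(text_a, text_b, n_words)` for a range of `n_words` would give one curve with error bars of the kind plotted in the figures; the default of five trials mirrors the five runs mentioned in the text.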
It is clear from Figure 1 that 1 John has the smallest $\Delta_{Ab}$. We consider the closest text $B$ in terms of authorship to be that for which the value of $\Delta_{Ab}$ is minimized. This rather surprisingly suggests that the author of 1 John is likely to be the author of the Letter to the Hebrews.

[Figure 1: plot of delta values $\Delta_{Ab}$ (delta lengths) against the number of words appended to Hebrews, GZip compression. Line styles mark the traditional attribution of each text: solid, John the son of Zebedee (1 John, Revelation, John); dotted, Paul (Galatians, Ephesians, 2 Corinthians, Romans, 1 Corinthians, Hebrews); dashed, neither Paul nor John (Mark, Matthew, Acts, Luke).]

Figure 1. Using the GZip compression algorithm, we compressed Hebrews with appended portions of other books and calculated the $\Delta_{Ab}$ values using Eq. 1. The portion of the appended text was selected at random. Each point on the curves represents the average of five randomly selected portions of appended text, and the error bars represent +/- one standard deviation. A smaller value of $\Delta_{Ab}$ ideally indicates common authorship, or at least common style. Hebrews appended to itself is plotted as a reference line. The texts used were in the original Koine Greek.

Traditionally, the authorship of the Letter to the Hebrews was attributed to Paul. However, critical research by experts indicates that this attribution is incorrect. [12] Thus, in order to validate our results, we repeated the experiment using a text from the New Testament whose attribution to Paul has been confirmed by experts. We chose the Letter to the Romans as source $A$, since it is one of the seven epistles of Paul for which we know the author with great certainty. [12] Figure 2 shows plots of $\Delta_{Ab}$ values for all texts with sizes greater than 2,000 words appended to Romans.

[Figure 2: plot of delta values $\Delta_{Ab}$ (delta lengths) against the number of words appended to Romans, GZip compression. Line styles as in Figure 1.]

Figure 2. Using the GZip compression algorithm, we compressed Romans with portions of other books combined (at random, over repeated trials), and calculated the $\Delta_{Ab}$ values using Eq. 1. A smaller value of $\Delta_{Ab}$ ideally indicates common authorship, or at least common style. Romans appended to itself is plotted as a reference line.

Once again, Figure 2 shows that 1 John has the smallest $\Delta_{Ab}$. This suggests that the author of 1 John could be the author of Romans. However, there is no indication that 1 John was written by Paul. The authorship of 1 John was traditionally attributed to John, the son of Zebedee, although this attribution has been refuted by critical research. [12]

As shown in Figure 2, the plot of $\Delta_{Ab}$ for Hebrews is closest to those of Luke and Acts, whose authorship is traditionally attributed to Luke. A possible interpretation of these results is that the author of Romans and the author of Hebrews could not be the same, since Hebrews has a very high $\Delta_{Ab}$ value when appended to Romans. These inconclusive results prompted another experiment in which a text written by Luke was chosen as source $A$. We decided to repeat the experiment using Acts as source $A$. Figure 3 shows plots of $\Delta_{Ab}$ values for all texts with sizes greater than 2,000 words appended to Acts.

[Figure 3: plot of delta values $\Delta_{Ab}$ (delta lengths) against the number of words appended to Acts, GZip compression. Line styles as in Figure 1.]

Figure 3. Using the GZip compression algorithm, we compressed Acts with appended portions of other books combined (at random, over repeated trials), and calculated the $\Delta_{Ab}$ values using Eq. 1. A smaller value of $\Delta_{Ab}$ ideally indicates common authorship, or at least common style.

It is clear from Figure 3 that 1 John has the smallest $\Delta_{Ab}$. The Letter to the Hebrews, on the other hand, has the largest $\Delta_{Ab}$. Since 1 John has the minimum $\Delta_{Ab}$ when appended to the texts Hebrews, Romans and Acts, whereas the Letter to the Hebrews has large values of $\Delta_{Ab}$ when appended to Romans and Acts, we conclude that it is possible that the author of 1 John is not the same as the author of the Letter to the Hebrews.

The key results found from these investigations were:

- 1 John has the smallest $\Delta_{Ab}$ when compressed with Hebrews, Romans and Acts as source texts. This suggests that the author of 1 John is most likely the author of Hebrews, Romans and Acts. However, the authorship of 1 John was traditionally attributed to John. The authorship of Hebrews and Romans is traditionally attributed to Paul, and that of Acts to Luke. Of these, only Paul's authorship of Romans has been confirmed by critical research. [12]
- When compressed with Romans and Acts, 1 John has the smallest $\Delta_{Ab}$, whereas Hebrews has a relatively very high $\Delta_{Ab}$. This indicates that the author of 1 John could not be the same as the author of Hebrews.

Thus, confronted with the conflicting results presented above, it is not possible for us to reach a sound and reasonable conclusion as to who is the author of the Letter to the Hebrews using the delta parameter.

For the distance metric, Hebrews was again used as source $A$ and the other 26 texts of the New Testament were used as sources $B$. Small sequences $b$ were appended to the long sequences $A$ and $B$, which were then compressed using GZip. The corresponding values of $\Delta$ were then calculated.
Similarly, a small sequence $a$, selected randomly from Hebrews, was appended to the long sequences $A$ and $B$, the results were compressed, and the corresponding values of $\Delta$ were likewise calculated. The distance $S_{AB}$ between sources $A$ and $B$ was calculated using Eq. 2 above. The whole process was run repeatedly for each calculation of the distance between a pair of texts. Plots of $S_{AB}$ for books with sizes greater than 2,000 words, compressed using GZip, are shown in Figure 4. As is evident in Figure 4, the standard deviations of the distances between the Letter to the Hebrews and 1 and 2 Corinthians do not overlap with those of other texts of the New Testament not written by Paul. However, their distances from the Letter to the Hebrews are greater than those of texts not written by Paul. Thus, this experiment did not assist us in reaching a conclusion as to who wrote the Letter to the Hebrews.
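The distance calculation of Eq. 2 can be illustrated with a short sketch. This is an assumption-laden illustration rather than the authors' implementation: Python's standard `gzip` module stands in for the GZip software, a single space joins a long sequence and a small one, and the function names are hypothetical. The arithmetic of Eq. 2 is kept in its own function so that it can be checked independently of any compressor.

```python
import gzip


def compressed_length(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def delta(long_text: str, small_text: str) -> int:
    """Eq. 1 applied to a long sequence and a small appended sequence."""
    return (compressed_length(long_text + " " + small_text)
            - compressed_length(long_text))


def distance_from_deltas(d_Ab: float, d_Bb: float,
                         d_Ba: float, d_Aa: float) -> float:
    """Eq. 2: relative excess cost of coding each extract with the other source."""
    if d_Bb == 0 or d_Aa == 0:
        # Eq. 2 divides by the same-source deltas, so they must be non-zero.
        raise ValueError("Eq. 2 is undefined when a same-source delta is zero")
    return (d_Ab - d_Bb) / d_Bb + (d_Ba - d_Aa) / d_Aa


def distance(a_long: str, b_long: str, a_small: str, b_small: str) -> float:
    """S_AB from long sequences A, B and small sequences a, b."""
    return distance_from_deltas(
        delta(a_long, b_small),  # Delta_Ab
        delta(b_long, b_small),  # Delta_Bb
        delta(b_long, a_small),  # Delta_Ba
        delta(a_long, a_small),  # Delta_Aa
    )
```

For example, deltas of 30, 10, 40 and 20 bytes for $\Delta_{Ab}$, $\Delta_{Bb}$, $\Delta_{Ba}$ and $\Delta_{Aa}$ give $S_{AB} = (30-10)/10 + (40-20)/20 = 3$, while equal cross-source and same-source deltas give a distance of zero.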

[Figure 4: plot of the mean distance between Hebrews and other books of the New Testament against the number assigned to each book, GZip compression.]

Figure 4. Using the GZip compression algorithm, we compressed Hebrews with portions of other books combined (at random, over repeated trials), and calculated the $S_{AB}$ values using Eq. 2. A smaller value of $S_{AB}$ ideally indicates common authorship, or at least common style.

To help us analyse whether these ambiguous results are restricted to distances between Hebrews and other texts, we repeated the experiment using Luke as source $A$ and all other texts of the New Testament as sources $B$. The resulting distances are plotted in Figure 5. As seen in Figure 5, the standard deviations of the distances between Luke and other texts of the New Testament overlap. However, we noticed the similarity between the distances of the Letter to the Hebrews and 2 Corinthians from Luke. Despite this observation, it is still not possible for us to decide on the authorship attribution of the Letter to the Hebrews. Thus, based on the results presented above, we conclude that the GZip compression technique is not useful for investigating authorship attribution of texts with sizes similar to the books of the New Testament.

2.2. PPM Compression Technique

The Prediction by Partial Matching (PPM) data compression scheme, developed by Cleary and Witten, is capable of good compression on a large range of source data. [9] The scheme can encode English text in as little as 2.2 bits/character. [10] The PPM algorithm is based on the idea that the most effective way to predict the frequency of the next symbol, and consequently to compress data, is to bias the predictions according to the previous symbols in the uncompressed symbol stream.
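To make the idea of context-biased prediction concrete, here is a deliberately simplified sketch. It is not PPM itself: it uses a single fixed context order with add-one smoothing, omits PPM's escape mechanism and its blending of several context orders, and only estimates a code length in bits rather than producing a compressed file. The function name and its parameters are hypothetical.

```python
import math
from collections import defaultdict


def estimated_bits(text: str, order: int = 2, alphabet_size: int = 256) -> float:
    """Estimate the code length of `text` under an adaptive order-`order` model.

    Each byte is predicted from the `order` bytes that precede it, using
    counts gathered so far with add-one smoothing over `alphabet_size`
    symbols. The cost of a byte is -log2 of its predicted probability.
    """
    data = text.encode("utf-8")
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        probability = (counts[context][symbol] + 1) / (totals[context] + alphabet_size)
        bits -= math.log2(probability)
        # Update the model only after coding the symbol, as a decoder would.
        counts[context][symbol] += 1
        totals[context] += 1
    return bits
```

On strongly patterned input, a longer context eventually predicts the next symbol with higher probability than a context-free model, which is why the estimated length of a long run of "abab..." is lower with `order=2` than with `order=0`.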
We used PPM software to compress all 27 books of the Koine Greek New Testament. We again applied Eq. 1 to measure the relationship between the Letter to the Hebrews and other books of the New Testament. Once again, the Letter to the Hebrews was used as file $A$. A small sequence $b$ was randomly extracted from each of the remaining 26 books of the New Testament and appended to file $A$. The files $A$ and $A \oplus b$ were compressed and the difference of their compressed lengths, $\Delta_{Ab}$, was calculated. For the resulting $\Delta_{Ab}$ values we calculated the mean and standard deviation and plotted the results. We then looked for the curve with the smallest $\Delta_{Ab}$. Figure 6 shows plots of $\Delta_{Ab}$ values for all texts with sizes greater than 2,000 words appended to the Hebrews file. Similarly to the results obtained using GZip compression, Figure 6 shows that 1 John has the smallest value of $\Delta_{Ab}$. Notice that the plot for Romans tends to 1.1 as the number of words increases to 2,000. Eventually, Romans positions itself in the middle of the graph. Note also that Acts and Luke lie above all the other error bars, relatively far from 1 John and the Letter to the Hebrews.

[Figure 5: plot of the mean distance between Luke and other books of the New Testament against the number assigned to each book, GZip compression.]

Figure 5. Using the GZip compression algorithm, we compressed Luke with portions of other books combined (at random, over repeated trials), and calculated the $S_{AB}$ values using Eq. 2. A smaller value of $S_{AB}$ ideally indicates common authorship, or at least common style.

[Figure 6: plot of delta values $\Delta_{Ab}$ (delta lengths, axis scaled by $10^4$) against the number of words appended to Hebrews, PPM technique. Line styles mark the traditional attribution of each text: solid, John the son of Zebedee; dotted, Paul; dashed, neither Paul nor John.]

Figure 6. Using the PPM compression algorithm, we compressed Hebrews with portions of other books combined (at random, over repeated trials), and calculated the $\Delta_{Ab}$ values using Eq. 1. A smaller value of $\Delta_{Ab}$ ideally indicates common authorship, or at least common style.

In order to validate the results obtained, we repeated the experiment using Romans as source $A$. Figure 7 shows plots of $\Delta_{Ab}$ for books with sizes greater than 2,000 words. As is evident from Figure 7, 1 John once again has the smallest $\Delta_{Ab}$ when appended to the text Romans. Since Romans and the Letter to the Hebrews were both believed to be written by Paul, it would be acceptable for the plot of $\Delta_{Ab}$ for 1 John to be closest to both of them; we might then say that Paul is also the author of 1 John. In addition, the authorship of 1 John, traditionally attributed to John, the son of Zebedee, has been refuted by critical research. [12] Therefore, the hypothesis that Paul is the author of 1 John is not unreasonable at first sight. However, Figure 7 reveals that the plot of $\Delta_{Ab}$ for the Letter to the Hebrews is not closest to those of Romans and 1 John.

[Figure 7: plot of delta values $\Delta_{Ab}$ (delta lengths, axis scaled by $10^4$) against the number of words appended to Romans, PPM technique. Line styles as in Figure 6.]

Figure 7. Using the PPM compression algorithm, we compressed Romans with portions of other books combined (at random, over repeated trials), and calculated the $\Delta_{Ab}$ values using Eq. 1. A smaller value of $\Delta_{Ab}$ ideally indicates common authorship, or at least common style.
Note the closeness between the Letter to the Hebrews and 2 Corinthians in Figure 7. Confronted with the ambiguous and conflicting results described above, we repeated the experiment using Acts as file $A$. Figure 8 shows plots of $\Delta_{Ab}$ for books with sizes greater than 2,000 words. Notice that, once again, the text that has the minimum $\Delta_{Ab}$ is 1 John. This means that we cannot say that the authors of Hebrews, Romans and Acts are the same. A common observation from Figures 7 and 8 is that the plots of $\Delta_{Ab}$ for the Letter to the Hebrews and 2 Corinthians are always close to each other, regardless of whether they are appended to Romans or Acts. This may suggest that the author of 2 Corinthians is likely to be the author of the Letter to the Hebrews. We shall see the same pattern of relationship when we turn to the WRI technique in Section 2.3.

[Figure 8: plot of delta values $\Delta_{Ab}$ (delta lengths, axis scaled by $10^4$) against the number of words appended to Acts, PPM technique. Line styles mark the traditional attribution of each text: solid, John the son of Zebedee; dotted, Paul; dashed, neither Paul nor John.]

Figure 8. Using the PPM compression algorithm, we compressed Acts with portions of other books combined (at random, over repeated trials), and calculated the $\Delta_{Ab}$ values using Eq. 1. A smaller value of $\Delta_{Ab}$ ideally indicates common authorship, or at least common style.

As with the GZip compression technique, the key results obtained from the delta parameter using the PPM compression technique were:

- 1 John has the smallest $\Delta_{Ab}$ when compressed with Hebrews, Romans and Acts. Again, this suggests that the author of 1 John is most likely the author of Hebrews, Romans and Acts, mainly in conflict with traditional biblical research.
- When compressed with Romans and Acts, 1 John has the smallest $\Delta_{Ab}$, whereas Hebrews has a relatively higher $\Delta_{Ab}$, close to that of 2 Corinthians. This indicates that the author of 1 John could not be the same as the author of Hebrews.

Thus, faced with the ambiguous results presented above, it is not possible for us to reach a sound and reasonable conclusion as to who is the author of the Letter to the Hebrews using the delta parameter in conjunction with PPM compression. We therefore conclude that the delta parameter is not a useful tool for investigating the authorship of texts with sizes similar to the books of the New Testament. This may be because the delta formula does not satisfy the triangle inequality, as detailed in Benedetto et al. [4]

As with the GZip technique, we use the distance formula in Eq. 2 to investigate the authorship of the Letter to the Hebrews. The PPM software was used to compress the 27 books of the Koine Greek New Testament. Here, Hebrews was used as source $A$. Plots of $S_{AB}$ values for books with sizes greater than 2,000 words, compressed using the PPM algorithm, are shown in Figure 9.
A good result obtained from this graph is that the ranges of the standard deviations are small. Therefore, an unambiguous result can be read from the graph. It is clear from Figure 9 that, on the positive side, 2 Corinthians is closest to Hebrews and, on the negative side, since it conflicts with established authorship attributions, 1 Corinthians and Romans are closest to Hebrews. Note that the author of 1 Corinthians, 2 Corinthians and Romans is Paul. It was observed that Revelation is also close to Hebrews. In order to validate the results obtained above, we repeated the experiment with Luke as our source $A$. The results are shown in Figure 10. It is obvious from Figure 10 that Acts is closest to Luke. We also observed that the relative distances of Hebrews, 2 Corinthians, 1 Corinthians and Romans from Luke agree with their corresponding positions in Figure 9. In view of the above, the results support the hypothesis that the author of the Letter to the Hebrews is Paul. Furthermore, the distance metric appears useful in the detection of text authorship. Based on the results of our experiments, we believe that the PPM compression technique is more powerful than the GZip compression technique.

[Figure 9: plot of the mean distance between Hebrews and other books of the New Testament (1 John, Galatians, Ephesians, 2 Corinthians, 1 Corinthians, Romans, Revelation, Mark, John, Matthew, Luke) against the number assigned to each book, PPM compression technique.]

Figure 9. Using the PPM compression algorithm, we compressed Hebrews with portions of other books combined (at random, over repeated trials), and calculated the $S_{AB}$ values using Eq. 2. A smaller value of $S_{AB}$ ideally indicates common authorship, or at least common style. Negative values should not occur in practice; they appear because extra hash function (checksum) information is stored in the files, guaranteeing verifiable decompression but increasing the lengths of the compressed files.

2.3. WRI technique

We used the graphical method of scaled standard deviations of the WRI to identify texts similar in style to the Letter to the Hebrews. This method was first introduced by Ortuño et al. [4] and was shown to be useful in investigating text authorship by Berryman et al. [5, 6] Berryman et al. defined the WRI as the number of words between successive occurrences of a keyword (non-inclusive). [6] For each text of the New Testament, we automated the calculation of the scaled standard deviation of WRIs for each word that occurs in the text more than 5 times. These scaled standard deviations were then ranked in descending order, and graphs of scaled standard deviation versus log(rank) were plotted. For clarity and reference purposes, only the curves representing Hebrews, Luke, 2 Corinthians, Acts and 1 John are included in the graph shown in Figure 11. It is evident from Figure 11 that a close match between 2 Corinthians and Hebrews is obtained. Note that the curve representing 1 John is also close to the curve representing Hebrews.
However, the curves deviate for log(rank) less than 0.5 and for log(rank) greater than 1.5, thereby obscuring the similarities between the two texts. Hence, the result of this technique adds weight to the hypothesis that Paul is the author of Hebrews.

Having investigated the graphical method of scaled standard deviations of the WRI, we turned to the WRI linear regression method, in which we calculated the linear regression of the scaled standard deviations of WRIs. We examined the slopes of the linear regression equations to identify the similarity between the texts. The closer the values of the slopes, the more likely it is that the texts were written by the same author.
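The WRI calculations described in this section can be sketched as follows. Several points are assumptions of this illustration rather than details confirmed by the text: the scaled standard deviation is taken to be the population standard deviation of a word's intervals divided by their mean, the regression is an ordinary least-squares fit of scaled standard deviation against log10(rank), words are compared exactly as given (no case folding or stemming), and all function names are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def recurrence_intervals(words: list[str], keyword: str) -> list[int]:
    """Number of words strictly between successive occurrences of `keyword`."""
    positions = [i for i, w in enumerate(words) if w == keyword]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


def scaled_std(intervals: list[int]) -> float:
    """Assumed definition: population standard deviation divided by the mean."""
    mean = statistics.fmean(intervals)
    if mean == 0:
        raise ValueError("scaled standard deviation is undefined for a zero mean")
    return statistics.pstdev(intervals) / mean


def ranked_scaled_stds(words: list[str], min_count: int = 5) -> list[float]:
    """Scaled standard deviations, in descending order, for every word that
    occurs more than `min_count` times."""
    positions = defaultdict(list)
    for i, w in enumerate(words):
        positions[w].append(i)
    values = []
    for pos in positions.values():
        if len(pos) > min_count:
            intervals = [q - p - 1 for p, q in zip(pos, pos[1:])]
            if sum(intervals) > 0:  # skip words whose intervals are all zero
                values.append(scaled_std(intervals))
    return sorted(values, reverse=True)


def slope_vs_log_rank(values: list[float]) -> float:
    """Least-squares slope of the ranked values against log10(rank)."""
    if len(values) < 2:
        raise ValueError("need at least two ranked values to fit a line")
    xs = [math.log10(rank) for rank in range(1, len(values) + 1)]
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(values)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx
```

For the sentence "the cat sat on the mat", `recurrence_intervals` returns a single interval of three for "the", matching the definition of the WRI. Comparing two texts then amounts to comparing `slope_vs_log_rank(ranked_scaled_stds(words))` for each.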

[Figure 10: Plot of distances between Luke and other books of the New Testament (PPM compression technique). Vertical axis: mean distance between Luke and the books of the New Testament. Horizontal axis: number assigned to books of the New Testament (size > 2). Labelled points: 1 John, Galatians, Ephesians, 2 Corinthians, 1 Corinthians, Romans, Revelation, Mark, John, Matthew, Luke.]

Figure 10. Using the PPM compression algorithm, we compressed Luke with portions of other books combined (at random, over repeated trials), and calculated the $S_{AB}$ values using Eq. 2. This is a plot of the results, with a smaller value of $S_{AB}$ ideally indicating common authorship or at least a common style.

Figure 12 shows the linear regression of scaled standard deviations of WRIs for the texts Hebrews, 2 Corinthians, Acts, and Luke, together with the corresponding plots of scaled standard deviations of WRIs versus the rank of standard deviations in descending order. Figure 12 illustrates that there is a similarity in style between Hebrews and 2 Corinthians, and also between Luke and Acts. It is clear from the graph that Luke and Acts differ in style from Hebrews and 2 Corinthians, as is evident from the slopes of their regression lines and their distances from the other two curves. Hence, this observation supports the traditional opinion that Paul is the author of the Letter to the Hebrews.

3. CONCLUSIONS

The PPM compression technique and the WRI technique are valuable tools for authorship detection. The PPM compression technique, when used with the distance metric of Eq. 2, enabled us to identify 2 Corinthians, 1 Corinthians and Romans as the texts having the smallest distances from the Letter to the Hebrews. Since Paul's authorship of 2 Corinthians, 1 Corinthians and Romans has been confirmed by critical research, [12] our results add weight to the traditional opinion that Paul is the author of the Letter to the Hebrews. However, it should also be noted that Ephesians and Galatians are both authored by Paul and yet appear far apart.
Thus we cannot conclusively determine authorship using the PPM compression technique.

The WRI technique proved useful in comparing the styles of texts. Results from the scaled standard deviations of the WRI graphical method showed that 2 Corinthians and the Letter to the Hebrews are similar in style. The WRI linear regression method produced similar slopes for the curves representing 2 Corinthians and Hebrews. However, the differing sizes of the books of the New Testament might affect the validity of the results. Truncating the scaled standard deviations of all texts to the same size may distort the author's style, as this is analogous to removing the corresponding words from the texts. Thus, it may be a good idea to randomly extract a long sequence from each text first, before applying the WRI linear regression method.

The GZip compression technique is not an effective tool for text authorship identification. Graphical results produced by this technique showed overlapping standard deviations, giving rise to poor discrimination. The delta parameter, with both the GZip and PPM compression schemes, did not provide acceptable results. Thus, further work is needed to investigate the usefulness of this method in the area of authorship detection.

There is an indication that Revelation is also (to a lesser extent) close to Hebrews, as is evident in Figure 9, using PPM compression with the distance metric. The WRI method shown in Figure 12 showed a close match between Hebrews and 2 Corinthians, and a large separation from Luke and Acts.

[Figure 11: Graph of WRI scaled standard deviations vs log(rank of scaled standard deviations). Vertical axis: WRI scaled standard deviations. Horizontal axis: log(rank of scaled standard deviations). Curves: 2cor, 1joh, hebr, luke, acts.]

Figure 11. For each word (occurring more than 5 times) in the texts plotted, we calculated its WRIs and plotted the scaled (by mean) standard deviation for each word, ranked from highest to lowest. For a log(rank) less than 0.5, there is a noticeable discrepancy between the standard deviations. However, this accounts for a very small fraction of the total curve and can be treated as negligible. According to Berryman et al., [6] texts with similar style appear close together when the scaled standard deviation of WRI is plotted, so this figure indicates a close match between 2 Corinthians and Hebrews.

[Figure 12: Linear regression of STDs of WRIs. Vertical axis: polynomial value of the regression. Horizontal axis: rank of STDs in descending order. Data and fitted lines: 2cor, hebr, luke, acts. Fitted equations: yluke = .2x + 1.728; ycor2 = .5x + 1.746; yacts = .2x + 1.68; yhebr = .5x + 1.683.]

Figure 12. Here we fit straight lines to the WRI data, and indicate the functions obtained by linear regression. Notice that the slopes of the linear regression of standard deviations of WRIs for the texts Hebrews and 2 Corinthians are approximately -.5, whereas the slopes of the regression lines for Acts and Luke are about -.2.

4. ACKNOWLEDGMENTS

We gratefully acknowledge funding from The University of Adelaide.

REFERENCES

1. R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley, "Linguistic features of non-coding DNA sequences," Physical Review Letters 73, pp. 3169–3172, 1994.
2. R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley, "Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics," Physical Review E 52, pp. 2939–2950, 1995.
3. R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley, "Reply to comments on linguistic features of non-coding DNA sequences," Physical Review Letters 76, pp. 1979–1981, 1996.
4. M. Ortuño, P. Carpena, P. Bernaola-Galván, E. Muñoz, and A. M. Somoza, "Keyword detection in natural languages and DNA," Europhysics Letters 57(5), pp. 759–764, 2002.
5. M. J. Berryman, A. Allison, P. Carpena, and D. Abbott, "Signal processing and statistical methods in analysis of text and DNA," Proc. SPIE: Biomedical Applications of Micro- and Nanoengineering 4937, pp. 231–240, 2002.
6. M. J. Berryman, A. Allison, and D. Abbott, "Statistical techniques for text classification based on word recurrence intervals," Fluctuations and Noise Letters 3(1), pp. L1–L10, 2003.
7. A. Lempel and J. Ziv, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory 23(3), pp. 337–343, 1977.
8. J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Transactions on Information Theory 24(5), pp. 530–536, 1978.
9. J. Cleary and I. Witten, "Data compression using adaptive coding and partial string matching," IEEE Trans. Comms. 32, pp. 396–402, Apr. 1984.
10. A. Moffat, "Implementing the PPM data compression scheme," IEEE Trans. Comms. 38, pp. 1917–1921, Nov. 1990.
11. D. Benedetto, E. Caglioti, and V. Loreto, "Language trees and zipping," Physical Review Letters 88, pp. 048702/1–4, Jan. 2002.
12. R. Davidson and A. R. C. Leaney, The Penguin Modern Guide to Theology III: Biblical Criticism, Penguin, 1972.