Visual Analytics Based Authorship Discrimination Using Gaussian Mixture Models and Self Organising Maps: Application on Quran and Hadith


Halim Sayoud
USTHB University, Algiers, Algeria
halim.sayoud@uni.de, halim.sayoud@gmail.com

Abstract. Stylometry is an interesting way to analyse the authorship authenticity of a document. However, conventional features and classifiers have some disadvantages: an automatic authorship decision usually gives an opaque classification, often without any way to measure or interpret the consistency of the results. In this paper, we present a visual analytics based approach to the task of authorship discrimination. A specific application is dedicated to the authorship comparison between two ancient religious books: the Quran and the Hadith. An important question arises: could these ancient books have been written by the same Author? To address it, seven types of features are combined and normalized by PCA reduction, and three visual analytical clustering methods are employed and commented on, namely Principal Component Analysis, Gaussian Mixture Models and Self Organizing Maps. The new visual analytical approach appears interesting, since it not only shows the distinction between the author styles but also sheds light, visually, on how consistent that distinction is. Concerning the discrimination application on the ancient religious books, the results show two separated clusters: a Quran cluster and a Hadith cluster. This cluster distinction corresponds to a clear authorship difference between the two investigated documents, which implies that the two books (i.e. Quran and Hadith) come from two different Authors.
Keywords: Artificial intelligence · Data mining · Visual analytics · Natural language processing · Authorship attribution · Quran authorship

1 Introduction

Visual Analytics (VA) is defined as the graphical visualisation of the information resulting from Analytical Modelling (AM). This graphical visualisation is a bridge between the human and the mathematical results, and helps experts extract the information they need to make a decision [1]. VA cannot be dissociated from AM; on the contrary, the two have to be combined to help experts get clear information from the analysed data.

Springer International Publishing AG, part of Springer Nature 2018. M. Mouhoub et al. (Eds.): IEA/AIE 2018, LNAI 10868, pp. 158–164, 2018. https://doi.org/10.1007/978-3-319-92058-0_15

Authorship Discrimination (AD) [2], a sub-field of stylometry, consists in checking whether two text documents belong to the same author or not. This research field can efficiently respond to some literary disputes regarding the authentic writer of a document [3].

Mostly, stylometry (or authorship attribution) uses AM computations to evaluate the probability that a specific author could have written a given piece of text. With such numerical output alone, the user or expert can hardly decide who the real author of the document is.

The originality of this research work is that we propose a new way of authorship analysis based on the VA approach. Furthermore, we propose a new set of linguistic features that are also original in stylometry.

The principal application of our work is the analysis of the authorship authenticity of the Quran. This is done by applying authorship discrimination between the Quran, claimed to be from God [4], and the Hadith (i.e. statements of the Prophet). Our corpus consists of the two ancient books, which are segmented into text segments of the same size: 14 different text segments for the Quran and 11 different text segments for the Hadith. The segments have an average size of about 276 words per text.

2 Stylometric Features

Several linguistic features have been proposed in the field of authorship attribution. We can quote four main types: vocabulary based features, syntax based features, orthographic based features and character based features. In our investigation, a mixture of different features is proposed: Author Related Pronouns (ARP), Father Based Surname (FBS), Discriminative Words (DisW), COST value, Word Length Frequency (WLF), Coordination Conjunction (CC) and Starting Coordination Conjunction (SCC). All those features are original and some of them are used for the first time in stylometry (during the preparation of this work). They are described as follows.

2.1 Author's Pronoun Based Feature

In Arabic, the pronoun I (أنا - إني) is the one most used to represent the speaker (i.e. myself). Most speakers naturally use the pronoun I when speaking or writing, as in the sentence أنا سعيد لرؤيتك, meaning «I am happy to see you». However, in a few cases, at least in special circumstances, the author's pronouns He (هو) and We (نحن - إنا) are employed instead of I. This great variety of speaker's pronouns in Arabic makes it challenging to use them in stylometry.

2.2 On the Use of أبا (Father of) for Naming People

In the Arabic language, it is usual to call a person by the name of his oldest child. That is, if somebody has a son called Youssof, for instance, then it is possible to call him Aba-Youssof, which can be translated as Father-of-Youssof. This usage is often noticed in verbal communication, when somebody talks with his companions.
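As an illustration of Sects. 2.1 and 2.2, the following is a minimal sketch of how the rate of such marker words might be measured in one text segment. The paper does not state how these two features are computed, so everything here is an assumption: the helper names `normalise` and `marker_rate`, the marker lists `FIRST_PERSON` and `FATHER_OF`, whitespace tokenisation, the folding of hamza-carrying alif forms, and the normalisation as a percentage of tokens.

```python
import re

# Arabic short-vowel marks (tashkeel), U+064B..U+0652, removed before matching.
_DIACRITICS = re.compile("[\u064B-\u0652]")
# Fold the alif variants (U+0623, U+0625, U+0622) onto bare alif (U+0627).
# This is an assumed normalisation, so that both spellings of a pronoun match.
_ALIF_FOLD = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalise(token):
    """Strip diacritics and fold alif variants in a single token."""
    return _DIACRITICS.sub("", token).translate(_ALIF_FOLD)


def marker_rate(text, markers):
    """Percentage of whitespace-separated tokens equal to one of `markers`.

    Tokens with attached punctuation or clitics are not matched; a real
    system would need a proper Arabic tokenizer.
    """
    wanted = {normalise(m) for m in markers}
    tokens = [normalise(t) for t in text.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in wanted)
    return 100.0 * hits / len(tokens)


# Hypothetical marker lists built only from the examples quoted in the text.
FIRST_PERSON = ["أنا", "إني", "نحن", "إنا"]
FATHER_OF = ["أبا"]
```

Calling `marker_rate(segment, FIRST_PERSON)` on each text segment would give one number per segment, which could then serve as one coordinate of that segment's feature vector.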

2.3 Frequency of Some Discriminative Words

The key idea is to investigate the use of some words that are very discriminative. In practice, we observed that such words, for instance الذين (in English: THOSE or WHO in a plural form), are very commonly used by certain speakers. As another example, one can cite the word الأرض (in English: EARTH), which is frequently used in several Arabic religious books.

2.4 COST Parameter Based Feature

Usually, when poets write a series of poems, they give neighbouring sentences of a poem similar endings, such as the same final syllable or letter. To evaluate that termination similarity, a new parameter estimating the degree of text chain (in a text of several sentences) has been proposed: the COST parameter [5].

2.5 Word Length Frequency

The fifth feature is the word length frequency, where the length of a word is the number of letters composing it. The word length frequency F(n) for a specific length n is the percentage of words in the text that are composed of exactly n letters (in practice we choose n < 11).

2.6 Frequency of the Coordination Conjunction «و» (Meaning AND)

The coordination conjunctions represent an interesting type of feature, widely used in Arabic literature. In this study, we have limited our investigation to one of the most interesting conjunctions: و, which corresponds to the coordination conjunction AND in English.

2.7 Frequency of the Conjunction «و» at the Beginning of Sentence

Here we are still interested in the frequency of the conjunction و. However, in this case we only keep the conjunctions located at the beginning of sentences, as in the following sentence: «And now, what should we do?».

3 Visual Analytics Based Clustering Methods

In pattern recognition, cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (i.e. cluster) are more similar to each other than to those in other groups [6]. On the other hand, visual analytics [1, 7], which combines several fields (i.e. computer science, information visualization and graphic design), is often used in cluster analysis to make the analyst's judgment easier to develop and more objective. That is, the combination of those two research fields can lead to a strong and efficient analysis tool for handling classification tasks that could be extremely difficult to perform with conventional analytic tools.

Consequently, it appears that the association of visual analytics with clustering analysis may be interesting for solving some stylometric problems for which we have no training data or prior information with which to perform a supervised classification. It is therefore well motivated to apply them in our application of authorship discrimination (i.e. Quran vs Hadith).

In our investigation, we propose to use Gaussian Mixture Models and Self Organizing Maps separately, in order to find the possible clusters among the investigated text segments.

Our corpus consists of the two ancient books: Quran and Hadith. Since the sizes of the two books are different, we segmented them into segments of the same size: there are 14 different text segments for the Quran and 11 different text segments for the Hadith. The average segment size is about 276 words per text.

3.1 Principal Components Analysis

A PCA representation of the data, using the 3 most important eigenvectors, is given in Fig. 1. We can notice that all the Quran documents are grouped together on the right side, while all the Hadith ones are grouped separately on the left side.

[Figure: 3D scatter plot titled "PCA Analysis"; axes: feature 1, feature 2, feature 3; legend: Quran, Hadith]
Fig. 1. PCA representation of the Quran (circles) and Hadith (crosses).

3.2 Gaussian Mixture Model Based Clustering

A GMM based clustering is performed after PCA reduction to the 2 most important components. We notice that the text samples have been clustered into 2 main groups: a Quran cluster at the bottom left, gathering all the Quran texts, and a Hadith cluster at the top right, gathering all the Hadith texts. The Gaussian mixtures are represented by different 3D Gaussians surrounding the two clusters (Fig. 2). This confirms, once again, that the writing styles of the 2 books are probably different.

[Figure: 3D surface plot; axes: 1st Component, 2nd Component, pdf; labels: Cluster 1: Quran, Cluster 2: Hadith]
Fig. 2. GMM clustering in 3D. The 3rd dimension represents the probability density function.

3.3 Self-Organizing Map Based Clustering

In Fig. 3, a Self-Organizing Map (SOM) is built using 3 PCA components. The U-matrix is shown on the left, and a grid named Labels is shown on the right. In the left figure, the different cells have been labelled (according to the book of origin) by using 2 colours (red for the Quran and green for the Hadith). We notice that the Quran samples in red are well grouped together and separated from the Hadith samples in green by a sharp horizontal dark line representing a boundary between the two classes. Consequently, we can see that the SOM clustering leads to the same conclusion as before: the two books should have two different authors.

Fig. 3. 2D Self-Organizing Map (SOM). We can see 2 main clusters: one cluster is visible at the bottom right and another one at the top left. The different cells have been labelled by using 2 colours (green for the Hadith and red for the Quran). The dark lines represent boundaries.

4 Discussion

In this investigation, we have proposed a new set of linguistic features that are original and not used previously. Furthermore, we have proposed a new graphical way to analyse the authorship authenticity of a document by using three approaches: PCA, GMM and SOM techniques. The different results led to the following conclusions:

• Visual Analytics is interesting and promising in the field of authorship attribution.
• Although the first approach (i.e. PCA) is not a clustering method, the resulting 3D representation suggests that the two books have two different author styles.
• The second approach, namely GMM, is a clustering technique based on Gaussian mixture models. According to the 3D representation, the two books appear to have two different author styles, too.
• The third approach (i.e. SOM) is a self-organizing neural network, which makes a 2D representation of the different possible clusters. The resulting mapping shows that there are also two different author styles: one for the Quran and one for the Hadith.

Consequently, it appears that the two investigated books (Quran and Hadith) have 2 different writing styles, which suggests the hypothesis of 2 different authors.

References

1. Blascheck, T., John, M., Kurzhals, K., Koch, S., Ertl, T.: VA2: a visual analytics approach for evaluating visual analytics applications. IEEE Trans. Vis. Comput. Graph. 22(1), 61–70 (2016)
2. Sayoud, H.: Segmental analysis based authorship discrimination between the Holy Quran and Prophet's statements. Digital Stud. J. 2014–2015 (2015)
3. Sayoud, H.: A visual analytics based investigation on the authorship of the Holy Quran. In: International Conference on Information Visualization Theory and Applications (IVAPP 2015), 11–14 March 2015, pp. 177–181 (2015)

4. Ibrahim, I.A.: A brief illustrated guide to understanding Islam. Library of Congress, Darussalam Publishers, Houston. www.islam-guide.com/contents-wide.htm
5. Sayoud, H.: Author discrimination between the Holy Quran and Prophet's statements. Literary Linguist. Comput. 27(4), 427–444 (2012)
6. Norusis, M.: Cluster analysis. In: SPSS 17.0 Statistical Procedures Companion, Marija Norusis, pp. 361–391. Pearson editor (2008). Chap. 16
7. Ellis, G., Mansmann, F.: VisMaster, Visual Analytics. In: Mastering the Information Age. Scientific Coordinator of VisMaster. Daniel Keim Jörn Kohlhammer (2010). Chap. 2
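The word length frequency F(n) of Sect. 2.5 and the two conjunction frequencies of Sects. 2.6 and 2.7 are defined precisely enough to be sketched in code. The version below is only an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, tokens are split on whitespace (so attached punctuation counts towards word length), a token is taken to carry the conjunction whenever it begins with the letter و (which overcounts words whose root starts with that letter), sentences are split on a small set of punctuation marks, and the sentence-initial feature is normalised as a percentage of sentences. The default `max_len=10` follows the bound n < 11 as printed in Sect. 2.5.

```python
import re

WAW = "\u0648"  # the Arabic letter و, used as the conjunction "and"

# Assumed sentence boundaries: full stop, exclamation mark,
# Latin and Arabic question marks, and line breaks.
_SENTENCE_END = re.compile("[.!?\u061F\n]+")


def word_length_frequency(tokens, max_len=10):
    """Return [F(1), ..., F(max_len)]: percentage of tokens of exactly n characters."""
    if not tokens:
        return [0.0] * max_len
    counts = [0] * max_len
    for tok in tokens:
        if 1 <= len(tok) <= max_len:
            counts[len(tok) - 1] += 1
    return [100.0 * c / len(tokens) for c in counts]


def waw_frequency(tokens):
    """Percentage of tokens that begin with و (an assumed proxy for the conjunction)."""
    if not tokens:
        return 0.0
    return 100.0 * sum(t.startswith(WAW) for t in tokens) / len(tokens)


def sentence_initial_waw_frequency(text):
    """Percentage of sentences whose first character is و."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    if not sentences:
        return 0.0
    return 100.0 * sum(s.startswith(WAW) for s in sentences) / len(sentences)


def feature_vector(text, max_len=10):
    """Concatenate F(1..max_len) with the two conjunction frequencies."""
    tokens = text.split()
    return (word_length_frequency(tokens, max_len)
            + [waw_frequency(tokens), sentence_initial_waw_frequency(text)])
```

Applying `feature_vector` to each text segment would give one row per segment; such rows, extended with the other features, are the kind of input that the PCA reduction of Sect. 3 operates on.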