QUESTION ANSWERING SYSTEM USING SIMILARITY AND CLASSIFICATION TECHNIQUES

International Journal of Computer Systems (ISSN: 394-65), Volume 03, Issue 07, July 2016. Available at http://www.ijcsonline.com/

Nabeel Neamah Ȧ, Saidah Saad Ḃ
Ȧ Faculty of Information Sciences and Technology, UKM, Bangi, Malaysia
Ḃ School of IT, Faculty of Information Sciences and Technology, UKM, Bangi, Malaysia

Abstract

The main aim of a question answering system is to provide correct answers to users' queries. Question answering systems are developed to provide answers for various domains or for a restricted domain. The main challenges facing question answering systems include extracting answers from weakly specified user queries and retrieving accurate answers from a large corpus of documents. These challenges increase the difficulty of analysing questions and retrieving relevant, correct answers. This research applies several NLP methods, such as tokenization, stemming, and N-grams, in order to analyse the user's query effectively. Additionally, the SVM method is deployed to classify the answer documents by question type in order to reduce the search scope of the proposed answers. The findings revealed that the average answer accuracy using the CS technique is 67%, the average accuracy using the LCS technique is 66%, the average accuracy using the combination of CS and LCS is 70%, and the average accuracy using CS, LCS, and SVM is 80%. Results involving the SVM method are therefore more accurate than those of CS and LCS alone; SVM improves the system accuracy by 10 percentage points (from 70% to 80%) compared with the same methods without a classification step.

Keywords: Question Answering System, NLP, SVM, Hadiths, Classification, Similarity

I. INTRODUCTION

Nowadays, there is a large increase in information sources, such as online sources, that contain huge volumes of information related to various topics, e.g. economic, health, industrial, and educational information [, ]. Traditional retrieval systems, such as the Google search engine, retrieve information based on the keywords of the user's query rather than returning exact answers to the query [3, 4]. For example, given the query "what is the capital of Iraq", a traditional search engine will retrieve documents that contain words similar to "Iraq" and "capital" rather than the exact answer, which is "Baghdad". Therefore, users may spend considerable effort and time finding exact answers in large sources.

There are two important processes for ensuring the accuracy of a QAS: (1) analysing the user's query needs using methods such as Natural Language Processing (NLP), and (2) classifying and managing the documents that contain the candidate answers, for example using machine learning. In this way, accurate matching between the user's question and the proposed answers can be established effectively [5, 6]. The main aim of NLP methods here is to rewrite the concepts of the user's query in terms of the formal representation of the document concepts, which maximizes the chance of finding similarities between the query and the document contents. On the other hand, classifying the questions and documents supports matching between question types and the proposed answers [6, 7].
The user's questions are classified into types such as "What" (to inquire about facts and explanations), "Where" (to ask about places), and "Who" (to ask about persons); the documents are classified by the purpose of the information they contain so that they can be matched to question types, e.g. place information matches "Where" questions.

The main problem addressed in this research is the difficulty of retrieving accurate answers from Hadith documents, for two main reasons:

i. Difficulty of formulating formal Hadith query concepts: the Hadith documents are written in classical Arabic, whereas Arabic speakers today use modern Arabic concepts. This increases the difficulty of expressing a query using the formal concepts of the Hadiths. Non-Arabic speakers also struggle to provide the right concepts in English, owing to weak Arabic language skills and limited knowledge of the formal Hadith concepts in English.

ii. Large volume of Hadith documents provided by various resources: the Hadiths were spoken by Mohammad (the messenger of Islam) and were written down as texts decades later. Currently, large numbers of Hadiths are published through various sources such as the internet and books, which increases the difficulty of extracting the right Hadiths that match the user's needs.

The main objective of this research is to develop a question answering system using NLP and machine learning methods in order to retrieve accurate Hadith answers based on users' questions. The following section presents works related to this research. Section 3 explains the research methodology. Section 4 presents the experimental data of the proposed QAS. Section 5 discusses the experimental results. Lastly, Section 6 presents the conclusion and future works.

II. RELATED WORKS

This section presents literature on query analysis, sentence similarity matching, WordNet ontology, and document classification using machine learning.

A. Questions Analysis Using NLP

Queries are analysed and evaluated based on factors such as the question type and keywords. Reference [8] notes that question analysis has two parts: (1) concept analysis, which extracts the main or important concepts of the user's query, and (2) concept processing, which updates the analysed concepts to be compatible with the formal representation of the QA domain concepts. According to [9], the main NLP methods for this stage are the following (a small illustrative sketch is given after this list):

1. Normalization: text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since the input is guaranteed to be consistent before operations are performed on it. Text normalization requires knowing what type of text is to be normalized and how it will be processed afterwards; there is no all-purpose normalization procedure. Text normalization is frequently used when converting text to speech: numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context. For example, "$200" would be pronounced "two hundred dollars" in English but "lua selau tālā" in Samoan, and "vi" could be pronounced "vie", "vee", or "the sixth" depending on the surrounding words.

2. Stop-word removal, which removes unimportant keywords such as 'and', 'the', and 'has'. Stop words are words that are filtered out before or after the processing of natural language data (text) []. There is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list; some tools specifically avoid removing stop words in order to support phrase search. Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common short function words, such as "the", "is", "at", "which", and "on". Other search engines remove some of the most common words, including lexical words such as "want", from a query in order to improve performance.

3. Tokenization, which divides the text into sentences and then the sentences into tokens. In English, words are bounded by whitespace and optionally preceded and followed by parentheses, quotes, or punctuation marks. Tokenization therefore splits the character sequence at whitespace positions or other punctuation marks between words in the sentence, and strips off parentheses and punctuation marks to obtain the sequence of tokens.

4. N-grams, which divide the query sentence into word groups: the N-gram algorithm considers single words, pairs of words, and so on. A word n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words, or base pairs depending on the application, and the n-grams are typically collected from a text or speech corpus. An n-gram of size 1 is referred to as a "unigram", size 2 is a "bigram" (or, less commonly, a "digram"), and size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g. "four-gram", "five-gram", and so on [, ].
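As a concrete illustration of steps 2 to 4, the following is a minimal Python sketch (not the system described in this paper); the tiny stop-word list and the example query are assumptions for demonstration only.

```python
import re

# Illustrative (not exhaustive) stop-word list; real systems use much larger lists.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "for", "to", "do", "does"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Filter out tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return the contiguous word n-grams of the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

if __name__ == "__main__":
    query = "When does fasting begin?"           # one of the paper's test queries
    tokens = remove_stop_words(tokenize(query))  # -> ['when', 'fasting', 'begin']
    print(tokens)
    print(ngrams(tokens, 2))                     # bigrams, e.g. ('when', 'fasting')
```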
B. Similarity Measure

This section discusses the two most common sentence similarity measures: Cosine Similarity (CS) and Longest Common Subsequence (LCS). In QAS, the CS and LCS techniques are widely used to measure the similarity between user queries and system documents [], and they increase the chance of extracting accurate answers for the users' questions.

i. Cosine Similarity (CS)

Cosine similarity (CS) is a well-known vector-based similarity measure in the fields of text mining and information retrieval. This measure is extensively employed to estimate the relationship between words: the strength of association between the elements of two sets is determined by the cosine of the angle between their two feature vectors. When two vectors are exactly the same, the angle between them is 0 and the cosine is 1; when the vectors are orthogonal, the cosine is 0. After obtaining the term weights w_ij and w_lj of all words, it is easy to apply cosine similarity to compute the similarity of two sentences. The cosine similarity between two sentences s_i and s_l is defined as

sim_cs(s_i, s_l) = ( Σ_{j=1..m} w_ij · w_lj ) / ( sqrt(Σ_{j=1..m} w_ij^2) · sqrt(Σ_{j=1..m} w_lj^2) ),  i, l = 1, ..., n,  (1)

where m is the number of similar words between the two sentences, w_ij and w_lj are the weights of the words in sentences s_i and s_l, and n is the total number of sentences.

ii. Longest Common Subsequence (LCS)

Longest Common Subsequence (LCS) is a technique applied to the problem of deriving patterns from a set of sequences: given a set of (related or partially related) sequences as input, the goal is to find a set of patterns that are common to all or most of the sequences in the set. A good algorithm for this problem should output patterns of high sensitivity and specificity. Formally, given sequences S = s_1 ... s_m and T = t_1 ... t_n, S is a subsequence of T if there exist indices 1 ≤ i_1 < i_2 < ... < i_m ≤ n such that s_j = t_{i_j} for each 1 ≤ j ≤ m. Given a set of sequences S+ = {S_1, S_2, ..., S_k}, the LCS of S+ is the longest possible sequence T that is a subsequence of each and every sequence in S+ at the same time. The LCS of a given set of sequences represents a shared aspect of their profile, and it can be used for sequence comparison and analysis.
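To make the two measures concrete, here is a minimal Python sketch (an assumed implementation, not the authors' code): cosine similarity over simple term-frequency vectors, and the LCS length of two word sequences computed by dynamic programming. The example sentences are invented for illustration.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between term-frequency vectors of two token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lcs_length(tokens_a, tokens_b):
    """Length of the longest common subsequence of two token lists (DP table)."""
    m, n = len(tokens_a), len(tokens_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if tokens_a[i - 1] == tokens_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

if __name__ == "__main__":
    q = "when does fasting begin".split()
    h = "fasting begins with the sighting of the new moon".split()   # made-up document text
    print(cosine_similarity(q, h))   # vector-space overlap of the two word lists
    print(lcs_length(q, h))          # longest common word subsequence, here 1 ("fasting")
```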

C. Classification Based on Machine Learning

Since presenting all the syntactic and semantic rules of a language to an algorithm is a cumbersome task, different types of algorithms have been designed that can learn from examples and predict the user's expected response. Using machine learning, we can build systems that consider thousands of question features and classify the questions automatically, which increases the productivity of a QAS []. Any text classification algorithm, such as SVM, can be employed to classify texts according to the purpose of the information they contain [5, 6, ]. For example, text about places corresponds to "where" questions, text about dates and times corresponds to "when" questions, and so on.

i. Support Vector Machine (SVM)

SVM is a useful technique for data classification and is easier to implement than other classification methods such as neural networks. A classification task usually involves separating the data into training and testing sets. Each instance in the training set contains one target value (the class label) and several attributes (the features or observed variables). The goal of SVM is to produce a model, based on the training data, that predicts the target values of the test data given only the test data attributes. Given a training set of instance-label pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y_i ∈ {1, -1}, the support vector machine [8] requires the solution of the following optimization problem:

min_{w, b, ξ}  (1/2) w^T w + C Σ_{i=1..l} ξ_i
subject to  y_i ( w^T φ(x_i) + b ) ≥ 1 - ξ_i,  ξ_i ≥ 0.

Here the training vectors x_i are mapped into a higher (maybe infinite) dimensional space by the function φ, and SVM finds a linear separating hyperplane with the maximal margin in this higher-dimensional space. C > 0 is the penalty parameter of the error term. Furthermore, K(x_i, x_j) = φ(x_i)^T φ(x_j) is called the kernel function. Though new kernels are being proposed by researchers, beginners may find the following four basic kernels in SVM books:

Linear: K(x_i, x_j) = x_i^T x_j.
Polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0.
Radial basis function (RBF): K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0.
Sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r).

Here, γ, r, and d are kernel parameters.
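As an illustration of how an SVM text classifier of this kind could route documents to "when"/"where" classes, the following scikit-learn sketch is given under the assumption that scikit-learn is available; the tiny training set and its labels are invented and do not come from the paper's dataset.

```python
# Illustrative sketch of SVM text classification with scikit-learn (assumed available);
# the small training set below is invented and stands in for labelled Hadith documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_texts = [
    "the prayer was held at the mosque in Medina",    # place-oriented  -> "where"
    "fasting begins at dawn and ends at sunset",      # time-oriented   -> "when"
    "the first qibla was the mosque in Jerusalem",    # place-oriented  -> "where"
    "the maghrib prayer is performed after sunset",   # time-oriented   -> "when"
]
train_labels = ["where", "when", "where", "when"]

# TF-IDF features + RBF-kernel SVM, mirroring the kernel discussion above.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(train_texts, train_labels)

print(model.predict(["when does fasting end"]))  # expected to map to the "when" class
```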
III. RESEARCH DOMAIN

This research focuses on a question answering system for Hadiths. The domain of this research is as follows:

i. Hadith sources: the Al-Bukhari collection of Hadiths is the main document source of this research, for several reasons: it is considered one of the most trusted sources of Hadiths, its documents are standard references in the Islamic world, and it contains a large volume of Hadiths on various subjects. It is therefore necessary to classify and analyse the documents efficiently according to the user's query.

ii. Language of the Hadith documents: the formal English translation of the Hadiths is the document language in this research. Non-Arabic speakers face more difficulty than Arabic speakers in providing effective concepts for Hadith queries.

iii. Hadith subjects: pray and fasting are the two main Hadith subjects in this research. These subjects involve Muslims' daily activities, whereas other subjects such as Hajj, Zakat, and Al-shahadateen are performed by Muslims once in a lifetime or once a year.

iv. Question types: this research focuses on classifying user queries and Hadith documents according to two question types, (1) Where and (2) When, which correspond to place and time classes. Pray and fasting questions and documents are mostly about places and times.

IV. METHOD

The research method involves two important directions that affect the accuracy of a question answering system. Firstly, users may not have the skills to phrase their questions in the right way; for example, typing "what is the capital of Malaysia?" is better than typing "give me cities in Malaysia". The question answering system can address this challenge using NLP methods. Secondly, classifying the documents according to the query types (e.g. "when" questions) using machine learning methods can improve the accuracy of the provided answers.

The selection of NLP and machine learning methods for a Hadith question answering system takes several points into account, based on the aspects of question answering systems and the research scope. These points are as follows:

The selected domain: there are two main kinds of question answering domain, open domains and closed domains. The Hadith domain is considered a closed domain because it focuses on information within the same context, unlike, for example, tourism (an open domain), which covers many fields such as weather and hotel bookings. However, the Hadith domain contains a large amount of information about various related fields, e.g. pray, Zakat, fasting, Al-shahadateen, and pilgrimage. Thus, this research focuses on two main fields, pray and fasting, because of their relation to people's daily activities; other fields such as pilgrimage are performed once in a lifetime.

Type of user query: there are two main types of questions, open and restricted. In this research the restricted type is adopted in order to provide more accurate answers. Restricted questions help users to structure their queries using defined keywords such as "what", "when", "where", and "how". On the other hand, the documents or answers can be organized according to question type, which increases the chance of retrieving accurate answers from a large document collection. Specifically, this research focuses on "when" and "where" questions because of the nature of the selected Hadith fields: fasting is usually related to time ("when"), while pray is related to both time and places ("when" and "where").

Architecture of the question answering system: question answering systems involve two main directions for increasing the chance of providing accurate answers: (1) query analysis and (2) document or answer management. Query analysis can be accomplished effectively using methods such as tokenization, stop-word removal, and N-grams, while one of the most effective methods for classifying documents according to specific indicators (i.e. question types) is the SVM method. Another important process in a question answering system is concept similarity, which supports answer retrieval by measuring the similarity between the query and the system documents; similarity also makes it possible to enrich the user's query from an ontology that contains the standard (correct) concepts of the related domain. Cosine Similarity (CS) and Longest Common Subsequence (LCS) are the most common similarity measures in question answering systems, and WordNet is widely used to extract and replace weak query concepts with the right ones.

Consequently, Fig. 1 illustrates the methodology of this research according to the selected question answering methods. The methodology consists of three main phases: the preprocessing phase, the similarity measuring phase, and the classification phase. A compact sketch of how these phases fit together is given after Fig. 1.

Figure 1. Research methodology. Preprocessing and similarity matching phase: the question type is analysed through identifiers (e.g. "Mecca" is a place identifier that indicates a "where" question); insignificant words such as "and", "the", "is" are removed; the query is split into single tokens; an N-gram list is generated from the tokens; the query concepts are enriched with formal Hadith concepts using WordNet; and the similarity between the updated query and the Hadith documents is measured with CS and LCS. Classification phase: SVM classifies the corpus of pray and fasting Hadith documents by subject and by question type (where/when), the resulting groups are indexed in a database, and the answers are retrieved from the matching group.
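Below is a hedged Python sketch of how the three phases could be composed; the helper functions (question_type_of, cosine_similarity, lcs_length) stand for components like those sketched earlier, the candidate store is a plain dictionary, and the whole flow is an assumed reading of the methodology rather than the authors' implementation.

```python
# Hypothetical end-to-end flow: classify the question type, restrict the candidate
# Hadiths to that group, then rank them by the better of the CS and LCS scores.
def answer(query, hadiths_by_type, question_type_of, cosine_similarity, lcs_length):
    """Return the best-scoring Hadith for the query within its question-type group."""
    qtype = question_type_of(query)              # e.g. "when" or "where" (SVM or rules)
    q_tokens = query.lower().split()
    best_text, best_score = None, -1.0
    for text in hadiths_by_type.get(qtype, []):  # search only the matching group
        h_tokens = text.lower().split()
        # Keep the better of the two similarity scores (one simple way to combine
        # CS and LCS); the LCS length is normalized by the query length.
        lcs_norm = lcs_length(q_tokens, h_tokens) / max(len(q_tokens), 1)
        score = max(cosine_similarity(q_tokens, h_tokens), lcs_norm)
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score
```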
V. EXPERIMENTAL DATA

The dataset of the proposed system consists of Hadith documents about the pray and fasting subjects. These documents were selected from the Al-Bukhari reference of authentic Hadiths; Al-Bukhari is considered one of the most trusted references of Hadith documents because of the strong procedures followed by its compiler to verify the Hadiths spoken by the prophet Mohammad. The selected Hadiths were validated as the proposed pray and fasting dataset by Prof. Dr. Ahamad Shaker Mahmmod, who works at the Islamic College of Baghdad University and is considered an expert on Hadiths owing to his many years of experience with the sacred Hadiths. Table 1 lists the number of selected Hadiths per subject.

TABLE 1. PROPOSED DATASET
Subject | Number of Hadiths
Pray | 8
Fasting | 50
Total |

The pray subject includes Hadith documents related to both "when" and "where" questions, whereas the fasting subject includes only "when"-related documents; no fasting Hadiths are related to places ("where"). Table 2 summarizes the classification of the Hadith documents.

TABLE 2. SUMMARY OF THE HADITH CORPUS
Type of Question | Pray | Fasting
When | |
Where | | 0
Not related to when or where | 54 | 36
Total | 8 | 50

The proposed system was tested with 12 queries about the pray and fasting subjects, selected from questions provided by 5 students from UKM after a discussion of the proposed system objectives. Table 3 presents the proposed queries together with their question-type and subject classifications.

TABLE 3. DIRECTION OF THE TESTED QUERIES
Query | Subject | Question Type | Proposed Query
Q1 | Pray | When | When are the five times of prayer for Muslims?
Q2 | Pray | Where | Where was the first Friday prayer?
Q3 | Fasting | When | When is the fasting month of Muslims?
Q4 | Pray | When | When can Muslims pray the Eid prayer?
Q5 | Pray | When | When do you pray Maghrib?
Q6 | Pray | Where | Where was the first qibla of Muslims?
Q7 | Fasting | When | When does fasting begin?
Q8 | Pray | Where | When can Muslims pray the Taraweeh prayers?
Q9 | Pray | When | When should the traveler shorten the prayer?
Q10 | Fasting | When | When does fasting end?
Q11 | Pray | When | When does the Muslim pray for God?
Q12 | Pray | When | When is the time of the al-Fajr prayer?

VI. EXPERIMENTAL RESULTS

Test #1: Cosine Similarity

This test was conducted using only the cosine similarity technique to measure the similarity between the user query and the answer documents. The accuracy scores were computed from each query's precision and recall, where recall = T / (T + (Hadith - T)), precision = T / (T + (N.H - T)), and F_Score = 2 x (recall x precision) / (recall + precision). The most accurate F_score (0.74) belongs to the third query (When is the fasting month of Muslims?), while the lowest F_score (0.55) belongs to the 8th query (When can Muslims pray the Taraweeh prayers?). The accuracy results for the queries based on the cosine similarity technique are 0.7 for Q1, 0.70 for Q2, 0.74 for Q3, 0.6 for Q4, 0.73 for Q5, 0.6 for Q6, 0.56 for Q7, 0.55 for Q8, 0.60 for Q9, 0.7 for Q10, 0.7 for Q11, and 0.73 for Q12. The average answer accuracy over all tested queries is 67%. The accuracy of the cosine similarity technique is therefore considered acceptable, but it could be enhanced with other methods to provide more accurate answers.

Test #2: Longest Common Subsequence

This test was conducted using only the longest common subsequence technique to measure the similarity between the user query and the answer documents. The accuracy scores were computed from precision, recall, and F_Score in the same way as above. The most accurate F_score belongs to the 2nd query (Where was the first Friday prayer?), while the lowest F_score (0.55) belongs to the 6th query (Where was the first qibla of Muslims?) and the 9th query (When should the traveler shorten the prayer?). The accuracy results for the queries based on the LCS technique are 0.67 for Q1, the highest score for Q2, 0.69 for Q3, 0.57 for Q4, 0.67 for Q5, 0.55 for Q6, 0.7 for Q7, 0.7 for Q8, 0.55 for Q9, 0.7 for Q10, 0.75 for Q11, and 0.57 for Q12. The average answer accuracy over all tested queries is 66%. It can be noticed that the average accuracies of CS and LCS are approximately the same, so these results could be enhanced with other methods to provide more accurate answers.
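Reading T as the number of correctly retrieved Hadiths, "Hadith" as the number of relevant Hadiths for the query, and N.H as the number of Hadiths returned by the system (an interpretation implied by the formulas above), the per-query metrics can be computed with a small function such as the following sketch; the example counts are made up.

```python
def query_scores(t, relevant, returned):
    """Precision, recall and F-score for one query.

    t        : number of correctly retrieved Hadiths (true positives)
    relevant : total number of Hadiths relevant to the query ("Hadith" in the paper)
    returned : number of Hadiths returned by the system ("N.H" in the paper)
    """
    recall = t / (t + (relevant - t)) if relevant else 0.0     # = t / relevant
    precision = t / (t + (returned - t)) if returned else 0.0  # = t / returned
    f_score = (2 * recall * precision / (recall + precision)
               if (recall + precision) else 0.0)
    return precision, recall, f_score

# Example with made-up counts: 9 correct Hadiths out of 12 returned, 11 relevant overall.
print(query_scores(t=9, relevant=11, returned=12))
```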
Test #3: Combination of CS and LCS

The combination of the CS and LCS techniques was evaluated by testing each proposed query with both techniques and selecting the better F-score of the two. For example, if the F_score of the first query using the CS technique is higher than its F_score using LCS, the system selects the CS F-score. The F-score measurement and result selection were carried out by the proposed QA system, so the combination results represent the best possible answers obtainable with the similarity techniques. The most accurate F_score belongs to the 2nd query (Where was the first Friday prayer?), while the lowest F_score (0.60) belongs to the 9th query (When should the traveler shorten the prayer?). The accuracy results for the queries based on the CS and LCS combination are 0.7 for Q1, the highest score for Q2, 0.74 for Q3, 0.6 for Q4, 0.73 for Q5, 0.6 for Q6, 0.7 for Q7, 0.7 for Q8, 0.60 for Q9, 0.7 for Q10, 0.75 for Q11, and 0.73 for Q12. The average F_score using the combination of CS and LCS is 70%. It can be noticed that the combination of CS and LCS provides more accurate answers than either the CS or the LCS technique alone.
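The per-query selection in Test #3 amounts to keeping the better of the two F-scores for each query; a minimal Python sketch with hypothetical score values (not the measured ones) is:

```python
# Hypothetical per-query F-scores; in the paper these come from Tests #1 and #2.
cs_scores = {"Q1": 0.70, "Q2": 0.70, "Q3": 0.74}
lcs_scores = {"Q1": 0.67, "Q2": 0.90, "Q3": 0.69}

# Combination of CS and LCS: keep the better F-score for each query.
combined = {q: max(cs_scores[q], lcs_scores[q]) for q in cs_scores}

average = sum(combined.values()) / len(combined)
print(combined)              # {'Q1': 0.7, 'Q2': 0.9, 'Q3': 0.74}
print(round(average, 2))     # average F-score over the combined results
```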

Test #4: Combination of CS, LCS, and SVM

This test was conducted in two main steps: (1) classify the Hadith documents with the SVM method according to the question types, and (2) measure the similarity using the CS and LCS combination. SVM is used to classify the Hadith documents by subject (pray and fasting) and by the question type of the proposed answers (where and when). The similarity between the queries and the documents is then measured using the combination of the CS and LCS techniques before the F-score of the finally extracted answers is calculated. Table 4 summarizes the accuracy of the proposed question answering system based on the combination of CS and LCS together with SVM-based classification; the accuracy scores are computed from each query's precision and recall as before. As can be seen from Table 4, the average F_score using the combination of CS, LCS, and SVM is 80%. The most accurate F_score (0.86) belongs to the 2nd query (Where was the first Friday prayer?), while the lowest F_score (0.73) belongs to the 8th and 12th queries.

TABLE 4. ACCURACY MEASUREMENTS BASED ON CS, LCS, AND SVM
(Per-query output-system counts N.H and T, recall, precision, and F_Score for Q1 to Q12; highest F_Score 0.86 for Q2, lowest 0.73, average F_Score 0.80.)

VII. DISCUSSION ON THE FINDINGS

According to the experimental results, the combination of the CS, LCS, and SVM techniques records the highest answer accuracy (80%), followed by the combination of CS and LCS (70%), then the CS technique (67%), and finally the LCS technique (66%). The SVM technique plays an important role in improving the accuracy of the proposed question answering system: the average accuracy over all queries improved by 10 percentage points when SVM was applied together with the other techniques, and most individual query results are more accurate when SVM is used with the other techniques. SVM reduces the search space of the Hadith documents by classifying the Hadiths according to the proposed question types and document subjects, and reducing the search space increases the chance of retrieving true answers that match the user's query. This finding can be seen clearly by comparing the SVM results for the pray and fasting subjects: the pray subject can be classified into both "when" and "where" question-type documents, whereas the fasting subject can be classified only into "when" question-type documents. The results of the pray queries based on SVM are therefore more accurate than those of the fasting queries, because the search space of the pray subject can be reduced more than that of the fasting subject.

VIII. CONCLUSION

Several methods were applied to analyse the query's answer needs and to update the query to be more effective with respect to the formal concepts of the Hadiths. Preprocessing methods such as normalization, tokenization, stop-word removal, and N-grams were applied to analyse the concepts of the users' queries, and the WordNet tool was applied to replace weak query concepts with effective Hadith concepts or synonyms. The SVM technique was applied to reduce the search space of answers and improve the possibility of retrieving accurate answers.
SVM classifies the Hadith documents into four main clusters: pray documents for the "when" question type, pray documents for the "where" question type, fasting documents for the "when" question type, and fasting documents for the "where" question type. The results of the experimental tests show that the proposed methods are effective in improving the accuracy of a question answering system for the Hadith domain. In particular, the SVM technique reduces the search space of answers, which improves the accuracy of the provided answers.

REFERENCES

[1] Xu-Dong Lin, Hong Peng, Bo Liu, "Support Vector Machines for Text Categorization in Chinese Question Classification", College of Computer Science and Engineering, South China University of Technology, International Conference on Web Intelligence (WI 2006 Main Conference Proceedings), IEEE, 2006.
[2] Marcin Skowron, Kenji Araki, "Evaluation of the New Feature Types for Question Classification with Support Vector Machines", Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, International Symposium on Communications and Information Technologies (ISCIT), 2004.

[3] Hakan Sundblad, Question Classification in Question Answering Systems, Licentiate thesis, Department of Computer and Information Science, Linkopings University, Linkoping, 2007.
[4] Dell Zhang, Wee Sun Lee, "Question Classification using Support Vector Machines", National University of Singapore, Singapore-MIT Alliance, Toronto, Canada, 2003.
[5] Harb, A., Michel Beigbeder, Jean-Jacques, "Evaluation of Question Classification Systems Using Differing Features", IEEE, 2009.
[6] Tan, W., Jianrong Cao, Hongyan Li, "Algorithm of Shot Detection based on SVM with Modified Kernel Function", Shandong Jianzhu University, Jinan, China, International Conference on Artificial Intelligence and Computational Intelligence, IEEE, 2009.
[7] Gharehchopogh, Farhad Soleimanian, and Yaghoub Lotfi, "Machine Learning based Question Classification Methods in the Question Answering Systems", International Journal of Innovation and Applied Studies 4: 64-73.
[8] Srihari, R. & Li, W. (2000). Information extraction supported question answering. In Proceedings of the 8th Text REtrieval Conference (TREC-8), NIST Special Publication 500-246.
[9] Bhaskar, P., Pakray, P., Banerjee, S. and Banerjee, S., Question Answering System for QA4MRE, Department of Computer Science and Engineering, Jadavpur University, Kolkata, India.
[] Ullman, Jeffrey D., Jure Leskovec, and Anand Rajaraman. Mining of Massive Datasets, pp. 305-338.
[] Brants, T., Franz, A. 2006. Web 1T 5-gram Version 1. (www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13).
[] Jian-fang, S., Zong-tian, L., & Jian-feng, F. 2010. Event-network clustering using similarity. In Natural Computation (ICNC), 2010 Sixth International Conference on (Vol. 8, pp. 3970-3973). IEEE.
[] Madylova, A. & Oguducu, S. 2009. A Taxonomy Based Semantic Similarity of Documents Using the Cosine Measure. Computer and Information Sciences, ISCIS 2009, 24th International Symposium on, pp. 9-4.
[] Day, M., Chorng-Shyong Ong, "Question Classification in English-Chinese Cross-Language Question Answering: An Integrated Genetic Algorithm and Machine Learning Approach", Institute of Information Science, Academia Sinica, Taiwan; Department of Information Management, National Taiwan University, Taiwan, IEEE, 2007.
[5] Molina-González, M. D., Martínez-Cámara, E., Martín-Valdivia, M.-T., & Perea-Ortega, J. M. Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications, 40(8), 750-757.
[6] Turney, P. D. (2002). Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
[] Xu, T., Peng, Q., & Cheng, Y. Identifying the semantic orientation of terms using S-HAL for sentiment analysis. Knowledge-Based Systems, 35, 79-89.
[8] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.