ANAPHORA RESOLUTION IN HINDI LANGUAGE USING GAZETTEER METHOD

Smita Singh, Priya Lakhmani, Dr. Pratistha Mathur and Dr. Sudha Morwal
Department of Computer Science, Banasthali University, Jaipur, India

ABSTRACT

Anaphora resolution is an active research area in natural language processing, and resolving anaphoric reference is among its most challenging tasks. This paper focuses exclusively on pronominal anaphora resolution for Hindi. Of the various methodologies available for resolving anaphora, this paper presents a computational model based on the Gazetteer method, in which lists are created and operations are then applied to classify the elements present in those lists. Many salience factors can be used to resolve anaphora; the proposed model uses two, Animistic and Recency. The Animistic factor indicates whether an expression denotes a living or a non-living thing, whereas Recency means that referents mentioned in the current sentence tend to receive higher weights than those in previous sentences. The paper reports experiments conducted on short Hindi stories, news articles and biography content from Wikipedia, together with their results and future directions for improving accuracy.

KEYWORDS

Anaphora, Discourse, Centering approach, Lappin Leass approach, Gazetteer method

1. INTRODUCTION

Anaphora denotes the act of referring. It is the use of an expression whose interpretation depends upon another expression in the discourse. A discourse is a group of collocated and related sentences. The process of binding a referring expression to its correct antecedent in the discourse is called anaphora resolution or pronominal resolution. Consider the following: म म ल म गय In the above example, वह refers to म ल, whereas उसन refers to.
Since this type of understanding is still poorly implemented in software, resolution of anaphoric reference is one of the most challenging tasks in the field of Natural Language Processing (NLP). Consider the following example: phal In this example the pronoun व refers to either फल or. Such an anaphor is ambiguous and may resolve to either candidate or to both. Resolving pronouns is therefore a very complex task. The most common type of anaphora is pronominal anaphora, in which the task is to find the noun phrase to which a pronoun refers. It occurs with personal, possessive, demonstrative, reflexive and relative pronouns.

DOI:10.5121/ijcsa.2014.4307

2. RELATED WORK

Prior work on anaphora resolution relevant to the Gazetteer method is summarized below. Richard Evans and Constantin Orasan improved anaphora resolution by identifying animate entities in texts [4]. Ruslan Mitkov and Richard Evans resolved anaphora using the Gazetteer method in 2007 [1]. Tyne Liang and Dian-Song Wu used this approach for automatic pronominal anaphora resolution in English texts in 2002. Constantin Orasan and Richard Evans used NP animacy identification for anaphora resolution in 2007 [2]. Natalia N. Modjeska, Katja Markert and Malvina Nissim used the web in machine learning for other-anaphora resolution in 2003 [3]. Strube and Hahn presented a system for anaphora resolution for German based on an extension of Centering theory in 1991 [6]. S. Lappin and H. Leass proposed their algorithm for pronoun resolution in English in 1994 [7]. A. K. Joshi and S. Kuhn in 1979, and A. K. Joshi and S. Weinstein in 1981, introduced Centering theory for pronoun resolution [8]. Dev Bahadur Poudel and Bivod Aale Magar resolved pronominal anaphora in Nepali using the Lappin and Leass approach [9]. Thiago Thomes Coelho and Ariadne Maria Brito Rizzoni applied the Lappin and Leass algorithm to Portuguese [7]. Manuel Palomar, Lidia Moreno and Jesús Peral resolved anaphora in Spanish texts using the Centering approach [10]. S. Lappin and M. McCord developed a syntactic filter on pronominal anaphora for Slot Grammar using Lappin and Leass principles in 1990 [11]. Sobha and Patnaik gave a rule-based approach to anaphora resolution in Hindi and Malayalam [12]. Dutta et al. presented a modified Hobbs algorithm for Hindi [13]. J. Balaji applied Centering principles to Tamil [14].

3. APPROACH

A. Gazetteer Method

There are various approaches to resolving pronouns, each with its own constraints and features. In this research we use an approach called the Gazetteer method.
The Gazetteer method consists of creating different lists for different elements and then applying operations to classify those elements. Gazetteers are thus used to supply external knowledge to learners, or to supply data with a training source. In our system we have created five lists: animistic pronouns (pronouns that refer to living things), animistic nouns (nouns that represent living beings), non-animistic pronouns (pronouns that refer to non-living things), non-animistic nouns (nouns that represent non-living things) and middle animistic pronouns (pronouns that can refer to both living and non-living things). This external knowledge helps the system in resolving anaphors.

The main advantage of the Gazetteer method is that it gives very fast results. Its accuracy, however, depends on the completeness of the gazetteer used.
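The five lists described above can be pictured as simple lookup tables. The following is a minimal sketch in Python, not the authors' implementation: the entries, the `Animacy` enum and the `classify` helper are hypothetical and chosen only for illustration, and treating फूल (flower) as non-animistic is an assumption of this sketch.

```python
from enum import Enum


class Animacy(Enum):
    """Categories used by the gazetteer lists (names are illustrative)."""
    ANIMATE = "animistic"
    INANIMATE = "non-animistic"
    MIDDLE = "middle animistic"


# Hypothetical noun gazetteer: a real one would hold many more entries.
NOUN_GAZETTEER = {
    "राधा": Animacy.ANIMATE,     # Radha (person)
    "राम": Animacy.ANIMATE,      # Ram (person)
    "किताब": Animacy.INANIMATE,  # book
    "फूल": Animacy.INANIMATE,    # flower (assumed non-animistic here)
}

# Hypothetical pronoun gazetteer covering the three pronoun lists.
PRONOUN_GAZETTEER = {
    "उसने": Animacy.ANIMATE,     # he/she (ergative form)
    "अपने": Animacy.ANIMATE,     # reflexive "own"
    "इसमें": Animacy.INANIMATE,  # "in this"
    "वह": Animacy.MIDDLE,        # he/she/it/that
}


def classify(token, gazetteer):
    """Return the category of a token, or None if it is not listed."""
    return gazetteer.get(token)
```

A lookup is a single dictionary access, which is consistent with the speed noted above; a token missing from the lists simply returns `None`, which is why completeness of the gazetteer matters for accuracy.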

B. Salient Factors

There are various salience factors for resolving anaphors. Our anaphora resolution system incorporates Recency and Animistic knowledge as salient factors.

Recency: The Recency factor means that referents mentioned in the current sentence tend to have higher weights than those in previous sentences. Recency moves backwards through the text and adds noun phrases as candidates. For example:

राधा ने फूल देखा वह बहुत सुंदर था (Radha saw a flower. It was very beautiful.)

In the above example the pronoun वह can refer to either फूल or राधा. According to Recency, फूल is closer to वह than the noun राधा, so the pronoun वह refers to फूल.

Animistic Knowledge: Animistic knowledge filters candidates according to whether they represent living beings. Inanimate candidates are removed from consideration when the pronoun being resolved must refer to an animate co-referent, and animate candidates are removed for pronouns that must refer to inanimate co-referents. Consider the following:

र म र ज़ फल ख त थ और अपन क भ थ

In the above example the pronoun अपन refers to the noun र म, because अपन is an animistic pronoun. An animistic pronoun always refers to an animistic noun.

Besides Recency and the Animistic factor, other factors affect the anaphora resolution process. Although they are not considered in our system, they are expected to increase its accuracy. Two of them are described below.

Gender Agreement: Gender Agreement compares the gender of candidate co-referents with the gender required by the pronoun being resolved. Any candidate that does not match the required gender is removed from further consideration.

स हन न म ल स वह उस पस द करत ह
ग त न म ल स वह उस पस द करत ह

In Hindi, verbs are used to resolve pronouns based on gender agreement. In the above example, the verb forms करता है and करती है indicate that उस refers to a male and a female respectively.

Number Agreement: Number Agreement extracts the part of speech of candidates.
The part-of-speech label is checked for plurality. If the candidate is plural but the pronoun being resolved does not indicate a plural co-referent, the candidate is removed from consideration. Likewise, singular candidates are removed if the pronoun being resolved requires a plural co-referent. For example:

र म और व बह त बदम श ह

In the above example the pronoun व refers to र म और.

C. How it works

1. When the system encounters a pronoun, it first finds the referent noun based on the Recency factor, that is, it chooses the closest noun as the referent.

2. The system checks whether the pronoun falls under the animistic, non-animistic or middle animistic category.
3. If the pronoun is animistic, the system checks whether the referent selected by the Recency factor is an animistic or a non-animistic noun.
4. If the selected referent is an animistic noun, it is the final output for that pronoun. If it is a non-animistic noun, the referents are backtracked (at least up to three sentences) until the correct animistic referent for the animistic pronoun is found.
5. If the pronoun is non-animistic, the same process is applied until a non-animistic referent is found.
6. If the pronoun is middle animistic, the referent selected by the Recency factor is the final output.

Our computational model, based on the above approach, uses the recency and animistic factors as a baseline. The animistic factor is used to increase the accuracy of the system. We train our system to differentiate between animistic, non-animistic and middle animistic pronouns. We have created lists of animistic pronouns, animistic nouns, non-animistic pronouns, non-animistic nouns and middle animistic pronouns. This knowledge is helpful in resolving animistic pronouns. For resolving middle animistic pronouns (pronouns that can refer to both living and non-living things) we use recency as the salient factor. To resolve pronouns using recency we rely on the concept of the Centering approach.

Centering theory: Centering theory provides a framework for modelling what a sentence is about, which can be used to find the entities that pronouns in a given sentence refer to. The theory models the attentional salience of discourse entities and relates it to referential continuity. Centering defines transition rules, on the basis of which it resolves anaphora.
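The six steps of the procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: text is assumed to be pre-tokenized into sentences, nouns are assumed to be exactly the tokens found in the noun gazetteer, the backtracking window is read as the current sentence plus three previous ones, and all names and gazetteer entries (`NOUNS`, `PRONOUNS`, `resolve`) are hypothetical. Centering transitions are not modelled.

```python
ANIMATE, INANIMATE, MIDDLE = "animistic", "non-animistic", "middle"

# Hypothetical gazetteer entries, for illustration only.
NOUNS = {"राधा": ANIMATE, "राम": ANIMATE, "किताब": INANIMATE, "फूल": INANIMATE}
PRONOUNS = {"उसने": ANIMATE, "अपने": ANIMATE, "इसमें": INANIMATE, "वह": MIDDLE}

MAX_SENTENCES_BACK = 3  # assumed reading of the backtracking window


def candidates(sentences, sent_idx, tok_idx):
    """Yield gazetteer nouns that precede the pronoun, nearest first."""
    first = max(0, sent_idx - MAX_SENTENCES_BACK)
    for s in range(sent_idx, first - 1, -1):
        tokens = sentences[s]
        # In the pronoun's own sentence, only look to its left.
        end = tok_idx if s == sent_idx else len(tokens)
        for t in range(end - 1, -1, -1):
            if tokens[t] in NOUNS:
                yield tokens[t]


def resolve(sentences, sent_idx, tok_idx):
    """Return the antecedent chosen for the pronoun, or None if none fits."""
    pronoun = sentences[sent_idx][tok_idx]
    category = PRONOUNS.get(pronoun)
    if category is None:
        return None  # token is not a listed pronoun
    for noun in candidates(sentences, sent_idx, tok_idx):
        # Middle animistic pronouns accept the most recent noun (step 6);
        # otherwise backtrack until the animacy matches (steps 3 to 5).
        if category == MIDDLE or NOUNS[noun] == category:
            return noun
    return None


story = [["राधा", "ने", "फूल", "देखा"], ["वह", "बहुत", "सुंदर", "था"]]
print(resolve(story, 1, 0))  # nearest noun for the middle animistic pronoun
```

On the recency example from Section 3.B, the middle animistic pronoun वह resolves to the nearest noun, फूल. For an animistic pronoun such as उसने, the loop skips nearer non-animistic nouns and keeps backtracking until an animistic noun is found within the window.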
4. EXPERIMENT AND RESULT

We have performed experiments on three different types of data sets. The experiments measure the contribution of the recency and animistic factors to the overall accuracy of correctly resolved pronouns, and the accuracy of the system is calculated on that basis.

Data set 1: This experiment uses text from the children's story domain. We took short stories in Hindi from indif.com (http://indif.com/kids/hindi_stories/short_stories.aspx), a popular site for short Hindi stories, and performed anaphora resolution over them. Ideally this experiment represents a baseline performance, since the stories are written in a straightforward narrative style with extremely low sentence structure complexity. Each story contains approximately 10 to 25 sentences and 100 to 300 words. The results are summarized below.

Table 1. Results of the experiment performed on short stories

Data Set   Sentences   Words   Anaphors   Correctly Resolved Anaphors   Accuracy
Story1     11          129     13         11                            84%
Story2     11          133     11         9                             82%
Story3     23          275     21         7                             34%
Story4     17          213     19         15                            79%
Story5     21          227     20         9                             45%

The results show that the recency and animistic factors together account for 65% accuracy on this data set. Accuracy is observed to vary with the structure of the sentences. The stories are written in narrative style and Hindi has free word order, which affects the transition rules of the Centering approach. It is also observed that locative pronouns (वहाँ and यहाँ) are sometimes not resolved correctly, which lowers the accuracy.

Data set 2: This experiment uses text from the news article domain. We took news articles from webduniya.com (http://webduniya//hindi_news), a popular site for Hindi news.

Table 2. Results of the experiment performed on news articles

Data Set   Sentences   Words   Anaphors   Correctly Resolved Anaphors   Accuracy
News1      9           175     7          5                             72%
News2      8           207     6          3                             50%
News3      8           143     10         5                             50%
News4      13          247     19         13                            69%
News5      11          195     15         10                            72%

The results show that the recency and animistic factors together account for 63% accuracy on this data set. Certain pronouns refer to both animistic and non-animistic nouns, which causes the system to select the wrong antecedent and lowers the accuracy.

Data set 3: This experiment uses biography content from Wikipedia. We took biographies of famous leaders of India from wikipedia.com (http://en.wikipedia.org/wiki/) and calculated the accuracy.

Table 3. Results of the experiment performed on biographies

Data Set   Sentences   Words   Anaphors   Correctly Resolved Anaphors   Accuracy
Wiki1      16          329     16         12                            75%
Wiki2      20          347     15         13                            87%
Wiki3      22          374     15         13                            87%
Wiki4      14          284     10         8                             80%
Wiki5      28          348     19         16                            84%

The results show that the recency and animistic factors together account for 83% accuracy on this data set. The articles are about political leaders and are written in different styles, which affects the transition rules of the Centering approach and hence the accuracy of the system.
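The per-data-set figures of 65%, 63% and 83% are consistent with taking the mean of the per-document accuracies reported in Tables 1 to 3; the paper does not state the aggregation, so this is an inference. A small sketch of that arithmetic follows, with the rows copied from the tables and all names (`TABLES`, `recomputed`, `dataset_accuracy`) hypothetical.

```python
from statistics import mean

# (anaphors, correctly resolved, reported accuracy in %) per document,
# copied from Tables 1 to 3.
TABLES = {
    "stories": [(13, 11, 84), (11, 9, 82), (21, 7, 34), (19, 15, 79), (20, 9, 45)],
    "news": [(7, 5, 72), (6, 3, 50), (10, 5, 50), (19, 13, 69), (15, 10, 72)],
    "biography": [(16, 12, 75), (15, 13, 87), (15, 13, 87), (10, 8, 80), (19, 16, 84)],
}


def recomputed(anaphors, correct):
    """Accuracy in percent derived from the raw counts."""
    return 100.0 * correct / anaphors


def dataset_accuracy(rows):
    """Mean of the reported per-document accuracies (assumed aggregation)."""
    return mean([reported for _, _, reported in rows])


per_dataset = {name: round(dataset_accuracy(rows)) for name, rows in TABLES.items()}
overall = round(mean(list(per_dataset.values())))
print(per_dataset, overall)
```

Under this assumption the means come to 64.8, 62.6 and 82.6, which round to the reported 65%, 63% and 83%, and their average rounds to 70%. Recomputing individual rows from the counts can differ from the reported percentages; for example, 10 of 15 for News5 is about 66.7%, whereas the table reports 72%.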

From the above experiments, it is observed that the proposed system has an overall accuracy of 70%. The correctness of the system's output was judged by a language expert. Hindi has free word order, which indirectly affects the accuracy. It is also observed that pronouns are ambiguous with respect to person, number and gender features, and that some pronouns refer to both animate and inanimate things. All of these features affect the accuracy.

5. CONCLUSION

This paper presents the experimental results of anaphora resolution in Hindi using the Gazetteer method. Hindi has free word order, which makes pronoun resolution more complicated than in many other languages. The paper describes how the recency and animistic factors contribute to the accuracy of anaphora resolution, and shows how resolution is carried out through experiments on different data sets. We have taken recency and animacy as the constraint sources that form the baseline of our experiments, which were performed to determine the contribution of these constraint sources to pronoun resolution on different styles of written text. Apart from recency and animacy, gender agreement and number agreement also play a significant role in anaphora resolution. In the future we will try to incorporate these sources to further increase the accuracy.

REFERENCES

[1] Ruslan Mitkov, Richard Evans, (2007) Anaphora Resolution: To What Extent Does It Help NLP Applications? DAARC, LNAI 4410, pp. 179-190.
[2] Constantin Orasan and Richard Evans, (2007) NP Animacy Identification for Anaphora Resolution, Journal of Artificial Intelligence Research 29, 79-103.
[3] Razvan Bunescu, (2003) Associative anaphora resolution: A web-based approach, In Proceedings of EACL 2003 - Workshop on The Computational Treatment of Anaphora, Budapest.
[4] Barlow, M., (1998) Feature Mismatches and Anaphora Resolution. In Proceedings of DAARC2, University of Lancaster.
[5] Brent, (1993) From grammar to lexicon: unsupervised learning of lexical syntax. Computational Linguistics, 19(3):243-262.
[6] Strube & Hahn, A system for anaphora resolution for German based on extension of Centering theory.
[7] Thiago Thomes, Lappin and Leass algorithm for pronoun resolution in Portuguese, Institute of State University of Campinas, Campinas, SP, Brazil. EPIA'05 Proceedings of the 12th Portuguese conference on Progress in Artificial Intelligence, pages 680-692.
[8] Aravind K Joshi, Rashmi Prasad, Eleni Miltsakaki, Anaphora Resolution: A Centering Approach.
[9] Dev Bahadur Poudel and Bivod Aale Magar, Anaphoric Resolution in Nepali, Nepal Engineering College.
[10] Manuel Palomar, Lidia Moreno, Algorithm for Anaphora Resolution in Spanish Texts, University of Alicante, Valencia University of Technology.
[11] McCord, Michael, (1990) "Slot grammar: A system for simpler construction of practical natural language grammars." In Natural Language and Logic: International Scientific Symposium, edited by R. Studer, 118-145. Lecture Notes in Computer.
[12] L. Sobha and B.N. Patnaik, Vasisth: An anaphora resolution system for Malayalam and Hindi, Symposium on Translation Support Systems, 2002.
[13] K. Dutta, N. Prakash and S. Kaushik, Resolving Pronominal Anaphora in Hindi using Hobbs algorithm, Web Journal of Formal Computation and Cognitive Linguistics, Issue 10, 2008.
[14] Anaphora Resolution in Tamil using Universal Networking Language, 12/2011; In proceedings of: Indian International Conference on Artificial Intelligence (IICAI-2011), Tumkur, Karnataka, India.