Language Model for Cyrillic Mongolian to Traditional Mongolian Conversion

Feilong Bao, Guanglai Gao, Xueliang Yan, and Hongwei Wang
College of Computer Science, Inner Mongolia University, Hohhot 010021, China
{csfeilong,csggl,csyxl}@imu.edu.cn, wanghongwei636@63.com

G. Zhou et al. (Eds.): NLPCC 2013, CCIS 400, pp. 13-18, 2013. © Springer-Verlag Berlin Heidelberg 2013

Abstract. Traditional Mongolian and Cyrillic Mongolian are both Mongolian languages, used in China and Mongolia respectively. Although their pronunciation is similar, their written forms are totally different. A large part of Cyrillic Mongolian words have more than one correspondence in Traditional Mongolian, which makes conversion from Cyrillic Mongolian to Traditional Mongolian a hard problem. To overcome this difficulty, this paper proposes a language model based approach that takes advantage of context information. Experimental results show that, for Cyrillic Mongolian words that have multiple correspondences in Traditional Mongolian, the correct rate of this approach reaches 87.66%, thereby greatly improving the overall system performance.

Keywords: Cyrillic Mongolian, Traditional Mongolian, Language Model.

1 Introduction

Mongolian, a language used across several countries and regions, has a significant impact on the world. Its main users are distributed over China, Mongolia and Russia. A major difference between the Mongolian used in China (called Traditional Mongolian) and that used in Mongolia (called Cyrillic Mongolian or Modern Mongolian) is that they have the same pronunciation but different written forms. As a derivative language, Cyrillic Mongolian has grammar and vocabulary similar to those of Traditional Mongolian. This means that conversion between the two languages does not need to follow the traditional machine translation framework: we can simply convert word by word according to their correspondence relationship. A series of studies on the conversion from Cyrillic Mongolian to Traditional Mongolian has been carried out by Bao Sarina, Wuriliga, Hao Li et al. [1-4], using either dictionary-based or rule-based approaches, with acceptable results.

However, none of them considered the multiple correspondence problem. Observing that the correctly converted word has a strong relationship to its context, we propose a language model based approach to overcome the multiple correspondence problem.

The rest of the paper is organized as follows: Section 2 introduces the characteristics of Traditional Mongolian and Cyrillic Mongolian; Section 3 describes in detail the language model based conversion approach; Section 4 discusses the experiments and the corresponding results; finally, Section 5 concludes the paper.

2 Comparison between Traditional Mongolian and Cyrillic Mongolian

Although closely related, Traditional Mongolian and Cyrillic Mongolian, as two different languages, still have some significant differences, as follows:

1. Traditional Mongolian is composed of 35 characters, of which 8 are vowels and 27 are consonants [5]. Cyrillic Mongolian also has 35 characters, but 13 of them are vowels and 20 are consonants; in addition, it includes a hard sign and a soft sign [6]. The complete alphabets for the two languages are listed in Table 1 for comparison.

2. Cyrillic Mongolian is case-sensitive while Traditional Mongolian is not. In Cyrillic Mongolian, the usage of case is similar to English. Traditional Mongolian, although not case-sensitive, changes the form of a character according to its position in a word (top, middle or bottom) [7].

Table 1. Comparison of the characters of Cyrillic Mongolian and Traditional Mongolian

Cyrillic    Cyrillic    Cyrillic    Cyrillic
Аа          Ии          Рр          Шш
Бб          Йй          Сс          Щщ
Вв          Кк          Тт          Ъъ
Гг          Лл          Уу          Ыы
Дд          Мм          Үү          Ьь
Ее          Нн          Фф          Ээ
Ёё          Оо          Хх          Юю
Жж          Өө          Цц          Яя
Зз          Пп          Чч

[Traditional Mongolian column of Table 1 not recoverable]

3. The writing direction is different. In Cyrillic Mongolian, words are written from left to right and lines run from top to bottom; in Traditional Mongolian, words are written from top to bottom and lines run from left to right.

4. The degree of consistency between the written form and the pronunciation is different. Cyrillic Mongolian is well unified: it has a consistent correspondence between the written form and the pronunciation. For Traditional Mongolian, however, the mapping is not 1-to-1: sometimes a vowel or consonant is dropped, added or transformed when converting the written form to the pronunciation.

As a result, in some cases a Cyrillic Mongolian word has more than one corresponding Traditional Mongolian word, as shown in Fig. 1, where the three Traditional Mongolian words are different but all correspond to the Cyrillic word "асар".

Cyrillic Mongolian: асар
Corresponding Traditional Mongolian:
  ᠠᠰᠠᠷ    Latin transliteration: asar; meaning: pavilion, gate
  ᠠᠰᠤᠷᠤ   Latin transliteration: asvrv; meaning: most, especially
  ᠠᠰᠠᠷᠠ   Latin transliteration: asara; meaning: take care, concern

Fig. 1. An example of multiple correspondence for Cyrillic Mongolian to Traditional Mongolian

3 Language Model Based Conversion Approach

Generally speaking, Cyrillic Mongolian and Traditional Mongolian words are in one-to-one correspondence when converting. However, a large part of Cyrillic Mongolian words have more than one correspondence in Traditional Mongolian. Take the Cyrillic Mongolian sentence "Танай амар төвшинийг хамгаалхаар явсан юм." for example. The words "амар" and "юм" have more than one correspondence in Traditional Mongolian, as shown in Fig. 2, where the corresponding Traditional Mongolian is represented in Latin transliteration. More specifically, the Cyrillic word "амар" has four correspondences in Traditional Mongolian: "amara", "amar", "amar_a" and "amvr"; the Cyrillic word "юм" has two: "yagam_a" and "yvm".

The correct conversion for the whole sentence is denoted by the path drawn with the bold line, i.e., "tan-v amvr tobsin-i hamagalahv-bar yabvgsan yvm" (" ᠲᠠᠨ ᠤ ᠠᠮᠤᠷ ᠲᠥᠪᠰᠢᠨ ᠢ ᠬᠠᠮᠠᠭᠠᠯᠠᠬᠤ ᠪᠠᠷ ᠶᠠᠪᠤᠭᠰᠠᠨ ᠶᠤᠮ ").

If we consider the conversion as a stochastic process and make the final decision according to the probability of the Traditional Mongolian word sequence T conditioned on the Cyrillic Mongolian word sequence C, then the conversion problem can be represented as finding the word sequence that satisfies (1):

$$T' = \arg\max_{T \in Q} P(T \mid C) \tag{1}$$

where $T = \{t_1 t_2 \ldots t_m\}$ denotes a possible path, $Q$ denotes the set of candidate paths, and $C$ denotes the Cyrillic Mongolian sentence to be converted.
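Formula (1) can be read as a search over every path through the candidate lattice. The following is a minimal sketch of that reading, not the authors' implementation: it enumerates the set Q for the example sentence and returns the highest-scoring path. The names `CANDIDATES`, `ONE_TO_ONE`, `candidate_paths`, `best_path` and `toy_score` are hypothetical, and `toy_score` is only a placeholder for P(T | C); the candidate words themselves come from the example above.

```python
from itertools import product

# Hypothetical 1-to-n table (Cyrillic word -> Latin-transliterated candidates),
# filled with the two ambiguous words from the example sentence.
CANDIDATES = {
    "амар": ["amara", "amar", "amar_a", "amvr"],
    "юм": ["yagam_a", "yvm"],
}

# Hypothetical 1-to-1 table for the remaining words of the example sentence.
ONE_TO_ONE = {
    "Танай": "tan-v",
    "төвшинийг": "tobsin-i",
    "хамгаалхаар": "hamagalahv-bar",
    "явсан": "yabvgsan",
}

SENTENCE = ["Танай", "амар", "төвшинийг", "хамгаалхаар", "явсан", "юм"]


def candidate_paths(cyrillic_words, one_to_one):
    """Yield every path T in Q for the Cyrillic sentence C."""
    slots = [CANDIDATES.get(w) or [one_to_one[w]] for w in cyrillic_words]
    for path in product(*slots):
        yield list(path)


def best_path(cyrillic_words, one_to_one, score):
    """Return the path maximizing score(T); score stands in for P(T | C)."""
    return max(candidate_paths(cyrillic_words, one_to_one), key=score)


def toy_score(path):
    """Placeholder score (an assumption): rewards the two correct choices."""
    return sum(word in {"amvr", "yvm"} for word in path)


print(len(list(candidate_paths(SENTENCE, ONE_TO_ONE))))  # 4 * 2 = 8 paths
print(" ".join(best_path(SENTENCE, ONE_TO_ONE, toy_score)))
```

Exhaustive enumeration grows multiplicatively with the number of ambiguous words, so it is only practical for short sentences; a real scorer would replace `toy_score` with language model probabilities.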

Fig. 2. A conversion example for Cyrillic Mongolian to Traditional Mongolian

As is well known, the conditional probability for $T = \{t_1 t_2 \ldots t_m\}$ can be decomposed by the chain rule as:

$$P(T \mid C) = P(t_1 \mid C)\, P(t_2 \mid t_1, C)\, P(t_3 \mid t_1 t_2, C) \cdots P(t_m \mid t_1 t_2 \ldots t_{m-1}, C) = \prod_{j=1}^{m} P(t_j \mid t_1^{j-1}, C) \tag{2}$$

where $t_1^{j-1}$ denotes the word sequence $t_1 t_2 \ldots t_{j-1}$. Formula (1) can then be represented as:

$$T' = \arg\max_{T = t_1 t_2 \ldots t_m \in Q} \prod_{j=1}^{m} P(t_j \mid t_1^{j-1}, C) \tag{3}$$

If we further adopt the N-gram language model assumption [8], formula (3) can be simplified as:

$$T' = \arg\max_{T = t_1 t_2 \ldots t_m \in Q} \prod_{j=1}^{m} P(t_j \mid t_{j-N+1}^{j-1}, C) \tag{4}$$

We use Maximum Likelihood Estimation to estimate the parameters in (4) and adopt Kneser-Ney smoothing [8] to overcome the data sparseness problem.

4 Experiment

We take the Conversion Accurate Rate (CAR) as the evaluation metric, which is defined as:

$$\mathrm{CAR} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}} \tag{5}$$

where $N_{\mathrm{correct}}$ denotes the number of words that are correctly converted and $N_{\mathrm{total}}$ denotes the number of all words that need to be converted.

SRILM is adopted for training the language model [9]. A dictionary that maps each Cyrillic Mongolian word to its multiple correspondences in Traditional Mongolian is constructed for our experiment. This dictionary has 4679 Cyrillic Mongolian words in total. A Traditional Mongolian text corpus, which contains 54MB of text in international standard coding, is adopted for n-gram language model training. We use a Cyrillic Mongolian corpus which contains 0000 sentences to test our approach. This corpus is composed of 8794 words, among which 4663 have more than one corresponding Traditional Mongolian word.

Our conversion process can be divided into two steps. In the first step, we convert all the Cyrillic Mongolian words to their corresponding Traditional Mongolian words according to the rule-based approach. Then, for each word, we check whether only one Traditional Mongolian word was generated. If not, we further determine the best one according to the language model based approach proposed in Section 3.

The data set for the rule-based approach is composed of three parts:

- a mapping dictionary from Cyrillic Mongolian stems to Traditional Mongolian stems, which contains 52830 entries;
- a dictionary from Cyrillic Mongolian static inflectional suffixes to Traditional Mongolian static inflectional suffixes, which contains 336 suffixes;
- a dictionary from Cyrillic Mongolian verb suffixes to Traditional Mongolian verb suffixes, which contains 498 inflectional suffixes.

Based on the word formation rules of Traditional Mongolian and Cyrillic Mongolian, together with the above-mentioned stem mapping dictionary and suffix mapping dictionaries, we constructed a rule-based conversion system.

Fig. 3. Performance comparison between the LM based approaches (vertical axis: CAR; bars: rule-based, unigram, bigram, trigram)

For the words that have more than one Traditional Mongolian correspondence, we compare the language model based approach with different orders (unigram, bigram and trigram) to the rule-based approach. The experimental results are illustrated in Fig. 3, from which we can see that all the language model based approaches significantly outperform the rule-based approach, and that the bigram model achieves the best performance (CAR: 87.66%). Affected by the data sparseness problem, the trigram approach is slightly worse than the bigram approach, but still much better than the unigram one, which considers only occurrence frequency and no context information. This confirms that performance degrades badly when context information is not considered.
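The disambiguation step with a bigram model (N = 2 in formula (4)) can be sketched as a Viterbi search over the candidate lattice produced by the rule-based step. This is an illustrative sketch under stated assumptions, not the system evaluated above: it is written in plain Python instead of using SRILM, it swaps in simple add-k smoothing where the paper uses Kneser-Ney, and the class `BigramLM`, the function `decode`, the marker `BOS` and the three-sentence training corpus are all hypothetical.

```python
import math
from collections import Counter

BOS = "<s>"  # sentence-start marker (an assumption of this sketch)


class BigramLM:
    """Bigram model with add-k smoothing (a simpler stand-in for Kneser-Ney)."""

    def __init__(self, sentences, k=0.1):
        self.k = k
        self.context = Counter()  # how often a word appears as a left context
        self.bigrams = Counter()  # counts of (previous word, word)
        vocab = set()
        for sent in sentences:
            tokens = [BOS] + list(sent)
            vocab.update(sent)
            self.context.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def logprob(self, prev, word):
        """Smoothed log P(word | prev)."""
        num = self.bigrams[(prev, word)] + self.k
        den = self.context[prev] + self.k * self.vocab_size
        return math.log(num / den)


def decode(lattice, lm):
    """Viterbi search: lattice is a list of candidate lists, one per word."""
    # best maps the last chosen word to (log score, best path ending there).
    best = {BOS: (0.0, [])}
    for candidates in lattice:
        new_best = {}
        for word in candidates:
            score, path = max(
                ((s + lm.logprob(prev, word), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy training corpus in Latin transliteration (invented for illustration).
corpus = [
    ["tan-v", "amvr", "tobsin-i", "hamagalahv-bar", "yabvgsan", "yvm"],
    ["amar", "yabvgsan"],
    ["amvr", "tobsin-i"],
]
lm = BigramLM(corpus)

# Lattice for the example sentence: one list per Cyrillic word.
lattice = [
    ["tan-v"],
    ["amara", "amar", "amar_a", "amvr"],
    ["tobsin-i"],
    ["hamagalahv-bar"],
    ["yabvgsan"],
    ["yagam_a", "yvm"],
]
print(" ".join(decode(lattice, lm)))
```

Unlike exhaustive enumeration of all paths, the Viterbi search keeps only the best path per candidate at each position, so its cost grows linearly with sentence length.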
We also test the overall system performance of the rule-based approach and of the improved one on all the Mongolian words (both 1-to-1 and 1-to-n). The experimental results are illustrated in Fig. 4, where the conversion correctness of the rule-based approach is 8.66%. When it is integrated with the LM-based approach, the overall system correctness is greatly improved, reaching 88.4%.
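The two evaluation settings, CAR over the 1-to-n words only and CAR over all words, differ only in which words are counted. A minimal sketch follows, assuming the source words, system output and reference are aligned word by word; the function names `car` and `car_on_ambiguous` and the sample data are hypothetical.

```python
def car(predicted, reference):
    """Conversion Accurate Rate: correctly converted words / all words."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must be aligned word by word")
    if not reference:
        raise ValueError("no words to evaluate")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def car_on_ambiguous(source, predicted, reference, ambiguous):
    """CAR restricted to source words that have 1-to-n correspondences."""
    kept = [(p, r) for s, p, r in zip(source, predicted, reference) if s in ambiguous]
    return car([p for p, _ in kept], [r for _, r in kept])


# Invented sample: the system picked the wrong candidate for "амар".
source = ["Танай", "амар", "төвшинийг", "хамгаалхаар", "явсан", "юм"]
reference = ["tan-v", "amvr", "tobsin-i", "hamagalahv-bar", "yabvgsan", "yvm"]
predicted = ["tan-v", "amar", "tobsin-i", "hamagalahv-bar", "yabvgsan", "yvm"]

print(car(predicted, reference))
print(car_on_ambiguous(source, predicted, reference, {"амар", "юм"}))
```

Reporting both figures separates how well the disambiguation step works from how much it moves the overall score.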

Fig. 4. Overall system performance comparison (vertical axis: CAR; bars: rule-based, Rule+LM)

5 Conclusions

Many problems emerge when converting Cyrillic Mongolian to Traditional Mongolian. In this paper, we focus on the multiple correspondence problem and propose a language model based conversion approach which takes context information into consideration. The proposed approach effectively addresses this problem and thereby greatly improves the overall conversion system performance. However, there are still some issues to be considered, such as the conversion of newly added words and of words borrowed from other languages. We leave these problems as future work.

Acknowledgements. This work is supported by the Natural Science Foundation of China (NSFC) (NO. 6263037, NO. 763029) and the Natural Science Foundation of Inner Mongolia of China (NO. 20ZD).

References

1. Sarina, B.: The Research on Conversion of Noun and Its Case from Classic Mongolian into Cyrillic Mongolian. Inner Mongolia University, Hohhot (2009)
2. Wuriliga: The Electronic Dictionary Construction of the Traditional Mongolian-Chinese and Cyrillic Mongolian-Chinese. Inner Mongolia University, Hohhot (2009)
3. Li, H., Sarina, B.: The Study of Comparison and Conversion about Traditional Mongolian and Cyrillic Mongolian. In: 2011 4th International Conference on Intelligent Networks and Intelligent Systems, pp. 199-202 (2011)
4. Gao, H., Ma, X.: Research on text-transform of Cyrillic Mongolian to Traditional Mongolian conversion system. Journal of Inner Mongolia University for Nationalities 8(5), 7 8 (2012)
5. Quejingzhabu: Mongolian code. Inner Mongolia University Press, Hohhot (2000)
6. Galsenpengseg: Study Reader of Cyrillic Mongolian. Inner Mongolia Education Press, Hohhot (2006)
7. Qinggeertai: Mongolian Grammar. Inner Mongolia People's Publishing Press, Hohhot (1992)
8. Zong, C.: Statistical Natural Language Processing. Tsinghua University Press, Beijing (2008)
9. Stolcke, A.: SRILM - An Extensible Language Modeling Toolkit. In: Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado (2002)