Language Model for Cyrillic Mongolian to Traditional Mongolian Conversion

Feilong Bao, Guanglai Gao, Xueliang Yan, and Hongwei Wang

College of Computer Science, Inner Mongolia University, Hohhot 010021, China
{csfeilong,csggl,csyxl}@imu.edu.cn, wanghongwei636@163.com

Abstract. Traditional Mongolian and Cyrillic Mongolian are both Mongolian languages, used in China and Mongolia respectively. Although their spoken pronunciation is similar, their written forms are totally different. A large part of Cyrillic Mongolian words have more than one correspondence in Traditional Mongolian, which makes the conversion from Cyrillic Mongolian to Traditional Mongolian a hard problem. To overcome this difficulty, this paper proposes a language model based approach that takes advantage of context information. Experimental results show that, for Cyrillic Mongolian words that have multiple correspondences in Traditional Mongolian, the correct rate of this approach reaches 87.66%, thereby greatly improving the overall system performance.

Keywords: Cyrillic Mongolian, Traditional Mongolian, Language Model.

1 Introduction

Mongolian, as a language widely used across different countries and regions, has a significant impact on the world. Its main users are distributed over China, Mongolia and Russia. A major difference between the Mongolian used in China (called Traditional Mongolian) and that used in Mongolia (called Cyrillic Mongolian or Modern Mongolian) is that they have the same pronunciation but different written forms. As a derivative language, Cyrillic Mongolian has both grammar and vocabulary similar to Traditional Mongolian. This means that the conversion between the two languages does not need to follow the traditional machine translation framework; we can simply convert word by word according to their correspondence relationship. A series of studies on the conversion from Cyrillic Mongolian to Traditional Mongolian has been carried out by Bao Sarina, Wuriliga, Hao Li et al. [1-4] with either dictionary based or rule based approaches, and achieved acceptable results.
However, none of them considered the multiple correspondence problem. Observing that the correctly converted word has a strong relationship to its context, we propose a language model based approach to overcome the multiple correspondence problem.

The rest of the paper is organized as follows: section 2 introduces the characteristics of Traditional Mongolian and Cyrillic Mongolian; section 3 describes in detail the language model based conversion approach; in section 4, experiments and the corresponding results are discussed; at last, we conclude the paper in section 5.

G. Zhou et al. (Eds.): NLPCC 2013, CCIS 400, pp. 13-18, 2013. © Springer-Verlag Berlin Heidelberg 2013

2 Comparison between Traditional Mongolian and Cyrillic Mongolian

Although strongly related to each other, Traditional Mongolian and Cyrillic Mongolian, as two different languages, still have some significant differences, as follows:

1. Traditional Mongolian is composed of 35 characters, of which 8 are vowels and 27 are consonants [5]. Cyrillic Mongolian also has 35 characters, but 13 of them are vowels and 20 are consonants; in addition, it includes a hardening character and a softening character [6]. The complete alphabets of the two languages are listed in Tab. 1 for comparison.

2. Cyrillic Mongolian is a case-sensitive language while Traditional Mongolian is not. In Cyrillic Mongolian, the usage of case is similar to English. Traditional Mongolian, although not case-sensitive, writes each character in a different form according to its position (top, middle or bottom) in a word [7].

Table 1. Comparison of the characters of Cyrillic Mongolian and Traditional Mongolian

Cyril  Traditional    Cyril  Traditional    Cyril  Traditional    Cyril  Traditional
Аа                    Ии                    Рр                    Шш
Бб                    Йй                    Сс                    Щщ
Вв                    Кк                    Тт                    Ъъ
Гг                    Лл                    Уу                    Ыы
Дд                    Мм                    Үү                    Ьь
Ее                    Нн                    Фф                    Ээ
Ёё                    Оо                    Хх                    Юю
Жж                    Өө                    Цц                    Яя
Зз                    Пп                    Чч

3. The writing direction is different for Cyrillic Mongolian and Traditional Mongolian. In Cyrillic Mongolian, words are written from left to right and lines advance from top to bottom; in Traditional Mongolian, words are written from top to bottom and lines advance from left to right.
4. The degree of unification between the written form and the spoken pronunciation differs between Cyrillic Mongolian and Traditional Mongolian. Cyrillic Mongolian is a well-unified language: it has a consistent correspondence between the written form and the pronunciation. For Traditional Mongolian, on the other hand, this is not a 1-to-1 mapping; sometimes a vowel or consonant is dropped, added or transformed when converting the written form to the pronunciation. In some cases, a Cyrillic Mongolian word corresponds to more than one Traditional Mongolian word, as shown in Fig. 1, where the three Traditional Mongolian words are different but all correspond to the Cyril word "асар".

Cyril Mongolian: асар
The corresponding Traditional Mongolian:
  ᠠᠰᠠᠷ (Latin transliteration: asar, meaning: pavilion, gate)
  ᠠᠰᠤᠷᠤ (Latin transliteration: asvrv, meaning: most, especially)
  ᠠᠰᠠᠷᠠ (Latin transliteration: asara, meaning: take care, concern)

Fig. 1. An example of multiple correspondence for Cyrillic Mongolian to Traditional Mongolian

3 Language Model Based Conversion Approach

Generally speaking, Cyrillic Mongolian and Traditional Mongolian words are in one-to-one correspondence when converting. However, a large part of Cyrillic Mongolian words have more than one correspondence in Traditional Mongolian. Take the Cyrillic Mongolian sentence "Танай амар төвшинийг хамгаалхаар явсан юм." for example. The words "амар" and "юм" have more than one correspondence in Traditional Mongolian, as shown in Fig. 2, where the corresponding Traditional Mongolian is represented in Latin-transliteration form. More specifically, the Cyril word "амар" has four correspondences in Traditional Mongolian: "amara", "amar", "amar_a" and "amvr"; the Cyril word "юм" has two correspondences in Traditional Mongolian: "yagam_a" and "yvm".
The correct conversion for the whole sentence is denoted by the path drawn with the bolder line, i.e., "tan-v amvr tobsin-i hamagalahv-bar yabvgsan yvm" ("ᠲᠠᠨ ᠤ ᠠᠮᠤᠷ ᠲᠥᠪᠰᠢᠨ ᠢ ᠬᠠᠮᠠᠭᠠᠯᠠᠬᠤ ᠪᠠᠷ ᠶᠠᠪᠤᠭᠰᠠᠨ ᠶᠤᠮ"). If we consider the conversion as a stochastic process and make the final decision according to the probability of the Traditional Mongolian word sequence T conditioned on the Cyrillic Mongolian word sequence C, then the conversion problem can be represented as finding the word sequence that satisfies (1):

$$T' = \arg\max_{T \in Q} P(T \mid C) \tag{1}$$

where $T = \{t_1 t_2 \ldots t_m\}$ denotes a possible path, $Q$ denotes the set of all possible paths, and $C$ denotes the Cyrillic Mongolian sentence to be converted.
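The search over candidate paths can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: the candidate lists for "амар" and "юм" are the ones given in the text, the single candidates for the remaining words are taken from the correct path quoted above, and the names `CANDIDATES`, `TOY_BIGRAM_WEIGHTS`, `candidate_paths`, `toy_score` and `best_path` as well as the weight values are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Path = Tuple[str, ...]

# Candidate Traditional Mongolian words (Latin transliteration) per Cyrillic word.
# Only "амар" and "юм" are ambiguous in the paper's example sentence.
CANDIDATES: Dict[str, List[str]] = {
    "Танай": ["tan-v"],
    "амар": ["amara", "amar", "amar_a", "amvr"],
    "төвшинийг": ["tobsin-i"],
    "хамгаалхаар": ["hamagalahv-bar"],
    "явсан": ["yabvgsan"],
    "юм": ["yagam_a", "yvm"],
}

# Made-up weights standing in for P(T | C); a real system would use a trained model.
TOY_BIGRAM_WEIGHTS: Dict[Tuple[str, str], float] = {
    ("tan-v", "amvr"): 1.0,
    ("yabvgsan", "yvm"): 1.0,
}


def candidate_paths(words: Sequence[str], candidates: Dict[str, List[str]]) -> List[Path]:
    """Enumerate the path set Q: one candidate chosen for every source word."""
    slots = [candidates[word] for word in words]
    return [tuple(path) for path in product(*slots)]


def toy_score(path: Path) -> float:
    """Sum the toy weights of all adjacent word pairs on the path."""
    return sum(TOY_BIGRAM_WEIGHTS.get(pair, 0.0) for pair in zip(path, path[1:]))


def best_path(paths: Sequence[Path], score: Callable[[Path], float]) -> Path:
    """Return the arg max over Q of the given scoring function."""
    return max(paths, key=score)


sentence = ["Танай", "амар", "төвшинийг", "хамгаалхаар", "явсан", "юм"]
paths = candidate_paths(sentence, CANDIDATES)
print(len(paths))                       # 4 * 2 = 8 candidate paths
print(" ".join(best_path(paths, toy_score)))
```

Exhaustive enumeration grows exponentially with the number of ambiguous words, so it is only practical for short sentences; it is shown here because it mirrors the arg max formulation directly.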
Fig. 2. A conversion example for Cyrillic Mongolian to Traditional Mongolian

As we all know, the conditional probability for $T = \{t_1 t_2 \ldots t_m\}$ can be decomposed as:

$$P(T \mid C) = P(t_1 \mid C)\, P(t_2 \mid t_1, C)\, P(t_3 \mid t_1 t_2, C) \cdots P(t_m \mid t_1 t_2 \ldots t_{m-1}, C) = \prod_{j=1}^{m} P(t_j \mid t_1^{j-1}, C) \tag{2}$$

where $t_1^{j-1}$ denotes the word sequence $t_1 t_2 \ldots t_{j-1}$. Then formula (1) can be represented as:

$$T' = \arg\max_{T = t_1 t_2 \ldots t_m \in Q} \prod_{j=1}^{m} P(t_j \mid t_1^{j-1}, C) \tag{3}$$

If we further adopt the N-gram language model assumption [8], formula (3) can be further simplified as:

$$T' = \arg\max_{T = t_1 t_2 \ldots t_m \in Q} \prod_{j=1}^{m} P(t_j \mid t_{j-N+1}^{j-1}, C) \tag{4}$$

We use Maximum Likelihood Estimation to estimate the parameters in (4) and adopt the Kneser-Ney technique [8] to overcome the sample sparseness problem.

4 Experiment

We take the Conversion Accurate Rate (CAR) as the evaluation metric, which is defined as:

$$CAR = \frac{N_{correct}}{N_{total}} \tag{5}$$

where $N_{correct}$ denotes the total number of words that are correctly converted and $N_{total}$ denotes the number of all the words that need to be converted.

SRILM is adopted for training the language model [9]. A dictionary that maps each Cyrillic Mongolian word to its multiple correspondences in Traditional Mongolian is constructed for our experiment. This dictionary has 4679 Cyrillic Mongolian words in total. A Traditional Mongolian text corpus, which contains 54MB of text in the international standard coding, is adopted for n-gram language model training. We use a Cyrillic Mongolian corpus which contains 0000 sentences to test our approach. This corpus is composed of 8794 words, among which 4663 have more than one corresponding Traditional Mongolian word.

Our conversion process can be divided into two steps: in the first step, we convert all the Cyrillic Mongolian words to their corresponding Traditional Mongolian words with the rule-based approach; then, for each word, we check whether only one Traditional Mongolian word was generated. If not, we further determine the best one with the language model based approach proposed in section 3.

The data set for the rule-based approach is composed of three parts: a mapping dictionary from Cyrillic Mongolian stems to Traditional Mongolian stems, which contains 52830 entries; a dictionary from Cyrillic Mongolian static inflectional suffixes to Traditional Mongolian static inflectional suffixes, which contains 336 suffixes; and a dictionary from Cyrillic Mongolian verb suffixes to Traditional Mongolian verb suffixes, which contains 498 inflectional suffixes. Based on the word formation rules of Traditional Mongolian and Cyrillic Mongolian, together with the above-mentioned stem mapping dictionary and suffix mapping dictionaries, we constructed a rule-based conversion system.

[Figure: bar chart of CAR for the rule-based, unigram, bigram and trigram approaches]
Fig. 3. Performance comparison between the LM based approaches

For the words that have more than one Traditional Mongolian correspondence, we compare the language model based approach with different orders (unigram, bigram and trigram) to the rule-based approach. The experimental results are illustrated in Fig. 3, from which we can see that all the language model based approaches significantly outperform the rule-based approach, and that the bigram achieves the best performance (CAR: 87.66%). Affected by the sample sparseness problem, the trigram approach is slightly worse than the bigram approach, but still much better than the unigram one, which considers only the occurrence frequency but no context information. This again confirms that if the context information is not considered, the performance decreases badly.
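The second step of the conversion process, choosing among several candidates with a bigram model, can be sketched as a Viterbi search over the candidate lattice. This is a minimal sketch under loud assumptions: it uses add-one smoothing as a simple stand-in for the Kneser-Ney smoothing and the SRILM toolkit used in the paper, the training data `TOY_CORPUS` consists of a few fragments of the example sentence invented for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

BOS = "<s>"  # sentence-start marker


def train_bigram(corpus: Sequence[Sequence[str]]) -> Tuple[Counter, Counter, int]:
    """Count bigram contexts and bigrams; return them with the vocabulary size."""
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = [BOS, *sentence]
        vocab.update(tokens)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocab)


def log_prob(prev: str, word: str, context_counts: Counter,
             bigram_counts: Counter, vocab_size: int) -> float:
    """Add-one smoothed log P(word | prev); NOT Kneser-Ney, just a stand-in."""
    return math.log((bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size))


def viterbi(lattice: Sequence[Sequence[str]], context_counts: Counter,
            bigram_counts: Counter, vocab_size: int) -> List[str]:
    """Pick one candidate per slot so that the bigram log probability is maximal."""
    # best maps the last word of a partial path to (log score, partial path).
    best: Dict[str, Tuple[float, List[str]]] = {BOS: (0.0, [])}
    for slot in lattice:
        nxt: Dict[str, Tuple[float, List[str]]] = {}
        for word in slot:
            nxt[word] = max(
                ((score + log_prob(prev, word, context_counts, bigram_counts, vocab_size),
                  path + [word])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Toy training data (an assumption for illustration only).
TOY_CORPUS = [
    ["tan-v", "amvr", "tobsin-i", "hamagalahv-bar", "yabvgsan", "yvm"],
    ["amvr", "tobsin-i"],
    ["yabvgsan", "yvm"],
]
# Candidate lattice for the example sentence, as produced by the first step.
LATTICE = [
    ["tan-v"],
    ["amara", "amar", "amar_a", "amvr"],
    ["tobsin-i"],
    ["hamagalahv-bar"],
    ["yabvgsan"],
    ["yagam_a", "yvm"],
]

model = train_bigram(TOY_CORPUS)
decoded = viterbi(LATTICE, *model)
print(" ".join(decoded))
```

Compared with enumerating every path, the dynamic program only keeps the best partial path per candidate word, so its cost grows linearly with sentence length for a bigram model.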
We also test the overall system performance of the rule-based approach and the improved one on all the Mongolian words (both 1-to-1 and 1-to-n). The experimental results are illustrated in Fig. 4, where we can see that the conversion correctness for the rule-based approach is 8.66%. When it is integrated with the LM-based approach, the overall system correctness is greatly improved, reaching 88.4%.
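The two evaluations above (ambiguous words only, and all words) apply the same CAR metric to different subsets of the test words. The following sketch shows one way this could be computed; the function name, the `keep` mask and the toy prediction are assumptions for illustration and do not reproduce the paper's data.

```python
from typing import Optional, Sequence


def conversion_accurate_rate(predicted: Sequence[str], reference: Sequence[str],
                             keep: Optional[Sequence[bool]] = None) -> float:
    """CAR = N_correct / N_total over the positions selected by `keep`.

    With keep=None every word counts (overall system performance); a mask
    restricts the metric, e.g. to words with multiple correspondences.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if keep is None:
        keep = [True] * len(reference)
    if len(keep) != len(reference):
        raise ValueError("keep mask must have the same length as reference")
    n_total = sum(1 for flag in keep if flag)
    if n_total == 0:
        raise ValueError("no words selected for evaluation")
    n_correct = sum(1 for p, r, flag in zip(predicted, reference, keep) if flag and p == r)
    return n_correct / n_total


# Toy example: the reference is the correct path from section 3; the
# hypothetical system output picks the wrong candidate for "амар".
reference = ["tan-v", "amvr", "tobsin-i", "hamagalahv-bar", "yabvgsan", "yvm"]
predicted = ["tan-v", "amar", "tobsin-i", "hamagalahv-bar", "yabvgsan", "yvm"]
ambiguous = [False, True, False, False, False, True]  # "амар" and "юм"

overall_car = conversion_accurate_rate(predicted, reference)
ambiguous_car = conversion_accurate_rate(predicted, reference, ambiguous)
print(overall_car, ambiguous_car)
```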
[Figure: bar chart of CAR for the rule-based system and the Rule+LM system]
Fig. 4. Overall system performance comparison

5 Conclusions

When converting Cyrillic Mongolian to Traditional Mongolian, a lot of problems emerge. In this paper, we focus our attention on the multiple correspondence problem and propose a language model based conversion approach which takes the context information into consideration. The proposed approach effectively settles this problem and thereby greatly improves the overall conversion system performance. However, there are still some issues to be considered, such as the conversion of newly-added words and of words borrowed from other languages. We will address these problems in our future work.

Acknowledgements. This work is supported by the Natural Science Foundation of China (NSFC) (NO. 6263037, NO. 763029) and the Natural Science Foundation of Inner Mongolia of China (NO. 20ZD).

References

1. Sarina, B.: The Research on Conversion of Noun and Its Case from Classic Mongolian into Cyrillic Mongolian. Inner Mongolia University, Hohhot (2009)
2. Wuriliga: The Electronic Dictionary Construction of the Traditional Mongolian-Chinese and Cyrillic Mongolian-Chinese. Inner Mongolia University, Hohhot (2009)
3. Li, H., Sarina, B.: The Study of Comparison and Conversion about Traditional Mongolian and Cyrillic Mongolian. In: 2011 4th International Conference on Intelligent Networks and Intelligent Systems, pp. 199-202 (2011)
4. Gao, H., Ma, X.: Research on text-transform of Cyrillic Mongolian to Traditional Mongolian conversion system. Journal of Inner Mongolia University for Nationalities 8(5), 7 8 (2012)
5. Quejingzhabu: Mongolian code. Inner Mongolia University Press, Hohhot (2000)
6. Galsenpengseg: Study Reader of Cyrillic Mongolian. Inner Mongolia Education Press, Hohhot (2006)
7. Qinggeertai: Mongolian Grammar. Inner Mongolia People's Publishing Press, Hohhot (1992)
8. Zong, C.: Statistical Natural Language Processing. Tsinghua University Press, Beijing (2008)
9. Stolcke, A.: SRILM - An Extensible Language Modeling Toolkit. In: Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado (2002)