Year: 2022 Volume: 30 Issue: 1 Page Range: 184 - 199 Text Language: English DOI: 10.3906/elk-2102-57 Index Date: 30-06-2022

Evaluating the English-Turkish parallel treebank for machine translation

Abstract:
This study extends our initial efforts to build an English-Turkish parallel treebank corpus for statistical machine translation tasks. We manually generated parallel trees for about 17K sentences selected from the Penn Treebank corpus. The English sentences range in length from 15 to 50 tokens, including punctuation. We constrained the translation of trees by (i) reordering leaf nodes based on suffixation rules in Turkish, and (ii) gloss replacement. We aim to mimic a human annotator's behavior in a real translation task. To bridge the morphological and syntactic gap between the two languages, we perform morphological annotation and disambiguation. We also apply our heuristics to create the Nokia English-Turkish Treebank (NTB), which addresses technical document translation tasks. NTB includes 8.3K sentences of varying lengths. We validate the corpus both extrinsically and intrinsically, and report evaluation results from perplexity analysis and translation experiments. The results indicate that our heuristics yield promising perplexity values and are suitable for translation tasks in terms of BLEU scores.
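As a rough illustration of the intrinsic evaluation mentioned in the abstract, the sketch below computes perplexity for a unigram language model with add-one smoothing. This is not the authors' setup: the model order, the smoothing method, and the function name `unigram_perplexity` are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one smoothed unigram model.

    Hypothetical helper for illustration only; a real evaluation would
    normally use a higher-order model with stronger smoothing.
    """
    counts = Counter(train_tokens)
    # Reserve one extra vocabulary slot for any token unseen in training.
    vocab_size = len(counts) + 1
    total = len(train_tokens)

    log_prob_sum = 0.0
    for tok in test_tokens:
        # Add-one (Laplace) smoothed probability of the token.
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob_sum += math.log2(p)

    # Perplexity is 2 raised to the negative average log2-probability.
    return 2 ** (-log_prob_sum / len(test_tokens))
```

Lower perplexity on held-out text means the model assigns that text higher probability, which is why it can serve as an intrinsic check on corpus quality.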
Keywords:

Document Type: Article Article Type: Research Article Access Type: Open Access
APA Görgün O, Yildiz O (2022). Evaluating the English-Turkish parallel treebank for machine translation. Turkish Journal of Electrical Engineering and Computer Sciences, 30(1), 184 - 199. 10.3906/elk-2102-57
Chicago Görgün Onur, Yildiz Olcay Taner. Evaluating the English-Turkish parallel treebank for machine translation. Turkish Journal of Electrical Engineering and Computer Sciences 30, no. 1 (2022): 184 - 199. 10.3906/elk-2102-57
MLA Görgün Onur, Yildiz Olcay Taner. Evaluating the English-Turkish parallel treebank for machine translation. Turkish Journal of Electrical Engineering and Computer Sciences, vol. 30, no. 1, 2022, pp. 184 - 199. 10.3906/elk-2102-57
AMA Görgün O, Yildiz O. Evaluating the English-Turkish parallel treebank for machine translation. Turkish Journal of Electrical Engineering and Computer Sciences. 2022; 30(1): 184 - 199. 10.3906/elk-2102-57
Vancouver Görgün O, Yildiz O. Evaluating the English-Turkish parallel treebank for machine translation. Turkish Journal of Electrical Engineering and Computer Sciences. 2022; 30(1): 184 - 199. 10.3906/elk-2102-57
IEEE Görgün O, Yildiz O. "Evaluating the English-Turkish parallel treebank for machine translation." Turkish Journal of Electrical Engineering and Computer Sciences, 30, pp. 184 - 199, 2022. 10.3906/elk-2102-57
ISNAD Görgün, Onur - Yildiz, Olcay Taner. "Evaluating the English-Turkish parallel treebank for machine translation". Turkish Journal of Electrical Engineering and Computer Sciences 30/1 (2022), 184-199. https://doi.org/10.3906/elk-2102-57