Year: 2012 Volume: 20 Issue: 5 Page Range: 787 - 804 Text Language: English Index Date: 29-07-2022

Effects of diacritics on Turkish information retrieval

Abstract:
We investigate the effects of improper use of diacritics in the Turkish alphabet on information retrieval. A diacritic is a supplementary sign added to a letter to change its sound value, and the Turkish alphabet has 5 special letters derived from Latin letters by adding different diacritics. The statistical analysis performed in this study shows that retrieval performance decreases significantly when documents and queries use different letter forms, that is, when documents contain letters with diacritics while queries contain only standard Latin letters, or vice versa. To tackle this challenge, we propose 3 approaches: token normalization by equivalence classes, document expansion, and query expansion. The experimental evaluations carried out on the Bilkent Turkish information retrieval test collection suggest that the proposed approaches are promising remedies for this problem.
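The abstract names the approaches without giving their details, so the following is only a minimal sketch of how token normalization by equivalence classes and a simple variant-based query expansion could look. The letter mapping (ç, ğ, ö, ş, ü paired with c, g, o, s, u), the function names `normalize` and `expand`, and the restriction to already lowercased tokens are all assumptions made for illustration; the paper's actual letter set and expansion strategy may differ.

```python
from itertools import product

# Assumed equivalence classes: each diacritic letter is paired with its base
# Latin letter. The exact letter set used in the paper is not stated here.
DIACRITIC_TO_BASE = {"ç": "c", "ğ": "g", "ö": "o", "ş": "s", "ü": "u"}
BASE_TO_DIACRITIC = {base: marked for marked, base in DIACRITIC_TO_BASE.items()}

_TRANSLATION = str.maketrans(DIACRITIC_TO_BASE)


def normalize(token):
    """Map a lowercase token to the diacritic-free member of its class."""
    return token.translate(_TRANSLATION)


def expand(token):
    """Return every spelling variant of a lowercase token.

    Each letter that belongs to an equivalence class is toggled between
    its base form and its diacritic form.
    """
    choices = []
    for ch in normalize(token):
        if ch in BASE_TO_DIACRITIC:
            choices.append((ch, BASE_TO_DIACRITIC[ch]))
        else:
            choices.append((ch,))
    return {"".join(variant) for variant in product(*choices)}


if __name__ == "__main__":
    # Normalization: both spellings collapse to the same index term.
    print(normalize("gözlük"), normalize("gozluk"))
    # Expansion: a diacritic-free query term is expanded to all variants.
    print(sorted(expand("su")))
```

Normalization applies the same mapping at indexing and query time, so both spellings meet at one term. Expansion instead leaves the index untouched and enlarges the query (or, symmetrically, the document) with the variants; the number of variants doubles with each ambiguous letter, which is one reason a real system might prune them against the collection vocabulary.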

Subjects: Engineering, Electrical and Electronic
Document Type: Article Article Type: Research Article Access Type: Open Access
APA Alpkoçak A, CEYLAN M (2012). Effects of diacritics on Turkish information retrieval. Turkish Journal of Electrical Engineering and Computer Sciences, 20(5), 787 - 804.
Chicago Alpkoçak Adil, CEYLAN Meltem. Effects of diacritics on Turkish information retrieval. Turkish Journal of Electrical Engineering and Computer Sciences 20, no. 5 (2012): 787 - 804.
MLA Alpkoçak Adil, CEYLAN Meltem. Effects of diacritics on Turkish information retrieval. Turkish Journal of Electrical Engineering and Computer Sciences, vol. 20, no. 5, 2012, pp. 787 - 804.
AMA Alpkoçak A,CEYLAN M Effects of diacritics on Turkish information retrieval. Turkish Journal of Electrical Engineering and Computer Sciences. 2012; 20(5): 787 - 804.
Vancouver Alpkoçak A,CEYLAN M Effects of diacritics on Turkish information retrieval. Turkish Journal of Electrical Engineering and Computer Sciences. 2012; 20(5): 787 - 804.
IEEE Alpkoçak A, CEYLAN M "Effects of diacritics on Turkish information retrieval." Turkish Journal of Electrical Engineering and Computer Sciences, 20, pp. 787 - 804, 2012.
ISNAD Alpkoçak, Adil - CEYLAN, Meltem. "Effects of diacritics on Turkish information retrieval". Turkish Journal of Electrical Engineering and Computer Sciences 20/5 (2012), 787-804.