Year: 2020 Volume: 28 Issue: 4 Pages: 2244-2260 Text Language: English DOI: 10.3906/elk-1912-62 Index Date: 03-06-2022

NET-LDA: a novel topic modeling method based on semantic document similarity

Abstract:
Topic models such as latent Dirichlet allocation (LDA) allow us to categorize documents by their topics: a document is modeled as a mixture of topics, and each topic as a probability distribution over words. The key drawback of the traditional topic model, however, is that it cannot exploit the semantic knowledge hidden in documents, so semantically related, coherent, and meaningful topics cannot be obtained. Yet semantic inference plays a significant role in topic modeling, as it does in other text mining tasks. To tackle this problem, a novel NET-LDA model is proposed in this paper. In NET-LDA, semantically similar documents are merged to bring all semantically related words together, and the resulting semantic similarity knowledge is incorporated into the model through a new adaptive semantic parameter. The motivation of the study is to reveal the impact of semantic knowledge on topic modeling research: in a given corpus, different documents may contain different words yet speak about the same topic. For such documents to be identified correctly, the feature space of the documents must be enriched with more powerful features; to this end, the semantic space of documents is constructed from concepts and named entities. Two datasets, in English and Turkish and spanning 12 different domains, are evaluated to show that the model is independent of both language and domain. Compared to the baselines, the proposed NET-LDA performs better in terms of topic coherence, F-measure, and qualitative evaluation.
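The abstract's core preprocessing idea, merging semantically similar documents so that related words co-occur before topic inference, can be illustrated with a minimal sketch. This is not the paper's implementation: NET-LDA measures similarity over concepts and named entities (via Babelfy/BabelNet), whereas the stand-in below uses plain bag-of-words cosine similarity, and the function names `cosine` and `merge_similar` are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_similar(docs, threshold=0.5):
    # Greedily pool each document with later documents whose
    # similarity to it exceeds the threshold, so that related
    # words end up in the same pseudo-document before LDA runs.
    vecs = [Counter(d) for d in docs]
    merged, used = [], set()
    for i in range(len(docs)):
        if i in used:
            continue
        group = list(docs[i])
        for j in range(i + 1, len(docs)):
            if j not in used and cosine(vecs[i], vecs[j]) >= threshold:
                group.extend(docs[j])
                used.add(j)
        merged.append(group)
    return merged

docs = [
    ["match", "goal", "league", "team"],
    ["team", "goal", "coach", "league"],
    ["election", "vote", "party"],
]
print(len(merge_similar(docs)))  # the two sports documents merge -> 2 groups
```

The merged pseudo-documents would then be fed to a standard LDA inference procedure; NET-LDA additionally folds the similarity scores into the model through its adaptive semantic parameter, which this sketch does not attempt to reproduce.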
Keywords:

Document Type: Article Article Type: Research Article Access Type: Open Access
  • [1] Hofmann T. Probabilistic Latent Semantic Analysis. In: Fifteenth Conference on Uncertainty in Artificial Intelligence; Stockholm, Sweden; 1999. pp. 289-296.
  • [2] Hofmann T. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 2001; 42 (1-2): 177-196. doi: 10.1023/A:1007617005950
  • [3] Griffiths TL, Steyvers M. A probabilistic approach to semantic representation. In: Twenty-Fourth Annual Conference of the Cognitive Science Society; Fairfax, Virginia, USA; 2002. pp. 381-386.
  • [4] Griffiths TL, Steyvers M. Prediction and semantic association. In: Becker S, Thrun S, Obermayer K (editors). Advances in neural information processing systems. Cambridge, MA, USA: MIT Press, 2003, pp. 11-18
  • [5] Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. Journal of Machine Learning Research 2003; 3: 993-1022.
  • [6] Griffiths TL, Steyvers M. Finding scientific topics. Proceedings of the National Academy of Sciences 2004; 101 (suppl 1): 5228-5235. doi: 10.1073/pnas.0307752101
  • [7] Steyvers M, Griffiths TL. Probabilistic topic models. In: Landauer TK, McNamara DS, Dennis S, Kintsch W (editors). Handbook of latent semantic analysis. Washington, DC, USA: Lawrence Erlbaum Associates Publishers, 2007, pp. 427-448.
  • [8] Blei DM. Probabilistic topic models. Communications of the ACM 2012; 55 (4): 77-84. doi: 10.1145/2133806.2133826
  • [9] Chang J, Gerrish S, Wang C, Blei DM. Reading tea leaves: How humans interpret topic models. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A (editors). Advances in Neural Information Processing Systems 22. New York, NY, USA: Curran Associates Inc., 2009, pp. 288-296.
  • [10] Kim SM, Hovy E. Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text. In: Workshop on Sentiment and Subjectivity in Text; Sydney, Australia; 2006. pp. 1-8.
  • [11] Griffiths TL, Steyvers M, Tenenbaum J. Topics in semantic representation. Psychological Review 2007; 114 (2): 211-244. doi: 10.1037/0033-295X.114.2.211
  • [12] Chemudugunta C, Holloway A, Smyth P, Steyvers M. Modeling documents by combining semantic concepts with unsupervised statistical learning. In: Sheth A, Staab S, Paolucci M, Maynard D, Finin T et al. (editors). The Semantic Web - ISWC 2008. Heidelberg, Germany: Springer, 2008, pp. 229-244.
  • [13] Godin F, Slavkovikj V, De Neve W, Schrauwen B, Van de Walle R. Using topic models for twitter hashtag recommendation. In: 22nd International Conference on World Wide Web (WWW ’13 Companion); Rio de Janeiro, Brazil; 2013. pp. 593-596.
  • [14] Poria S, Chaturvedi I, Cambria E, Bisio F. Sentic LDA: Improving on LDA with semantic similarity for aspect-based sentiment analysis. In: International Joint Conference on Neural Networks (IJCNN); Budapest, Hungary; 2016. pp. 4465–4473.
  • [15] Zhang C, Wanga H, Caoc L, Wanga W, Xu F. A hybrid term–term relations analysis approach for topic detection. Knowledge-Based Systems 2016; 93: 109-120. doi: 10.1016/j.knosys.2015.11.006
  • [16] Moro A, Raganato A, Navigli R. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics 2014; 2: 231-244. doi: 10.1162/tacl_a_00179
  • [17] Blei DM, Lafferty JD. Dynamic Topic Models. In: 23rd International Conference on Machine Learning (ICML ’06); Pittsburgh, Pennsylvania, USA; 2006. pp. 113-120.
  • [18] Ramage D, Hall D, Nallapati R, Manning CD. Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-labeled Corpora. In: 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP’09); Singapore; 2009. pp. 248-256.
  • [19] Jelodar H, Wang Y, Yuan C, Feng X. Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimedia Tools and Applications 2019; 78 (11): 15169-15211. doi: 10.1007/s11042-018-6894-4
  • [20] Zhu J, Ahmed A, Xing EP. MedLDA: maximum margin supervised topic models for regression and classification. In: 26th Annual International Conference on Machine Learning (ICML ’09); Montreal, Quebec, Canada; 2009. pp. 1257-1264.
  • [21] Chang J, Blei DM. Relational topic models for document networks. In: 12th International Conference on Artificial Intelligence and Statistics (AISTATS); Clearwater Beach, Florida, USA; 2009. pp. 81-88.
  • [22] Zhai Z, Liu B, Xu H, Jia P. Constrained LDA for grouping product features in opinion mining. In: Huang JZ, Cao L, Srivastava J (editors). Advances in Knowledge Discovery and Data Mining. Heidelberg, Germany: Springer, 2011, pp. 448-459.
  • [23] Zhai K, Boyd-Graber J, Asadi N, Alkhouja M. Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce. In: 21st ACM International Conference on World Wide Web; Lyon, France; 2012. pp. 879-888.
  • [24] Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A. How to Effectively Use Topic Models for Software Engineering Tasks? An Approach Based on Genetic Algorithms. In: 2013 International Conference on Software Engineering (ICSE ’13); San Francisco, CA, USA; 2013. pp. 522-531.
  • [25] Bagheri A, Saraee M, Jong F. ADM-LDA: An aspect detection model based on topic modelling using the structure of review sentences. Journal of Information Science 2014; 40 (5): 621-636. doi: 10.1177/0165551514538744
  • [26] Zheng X, Lin Z, Wang X, Lin KJ, Song M. Incorporating appraisal expression patterns into topic modeling for aspect and sentiment word identification. Knowledge-Based Systems 2014; 61: 29-47. doi: 10.1016/j.knosys.2014.02.003
  • [27] Wang T, Cai Y, Leung H, Lau RYK, Li Q et al. Product aspect extraction supervised with online domain knowledge. Knowledge-Based Systems 2014; 71: 86-100. doi: 10.1016/j.knosys.2014.05.018
  • [28] Xie W, Zhu F, Jiang J, Lim EP, Wang K. Topicsketch: Real-time bursty topic detection from twitter. IEEE Transactions on Knowledge and Data Engineering 2016; 28 (8): 2216-2229. doi: 10.1109/TKDE.2016.2556661
  • [29] Li C, Cheung WK, Ye Y, Zhang X, Chu D, Li X. The Author-Topic-Community model for author interest profiling and community discovery. Knowledge and Information Systems 2015; 44 (2): 359-383. doi: 10.1007/s10115-014-0764-9
  • [30] Liu Y, Wang J, Jiang Y. PT-LDA: a latent variable model to predict personality traits of social network users. Neurocomputing 2016; 210: 155-163. doi: 10.1016/j.neucom.2015.10.144
  • [31] Zoghbi S, Vulic I, Moens MF. Latent Dirichlet allocation for linking user-generated content and e-commerce data. Information Sciences 2016; 367-368: 573-599. doi: 10.1016/j.ins.2016.05.047
  • [32] Yeh JF, Tan YS, Lee CH. Topic detection and tracking for conversational content by using conceptual dynamic latent Dirichlet allocation. Neurocomputing 2016; 216: 310-318. doi: 10.1016/j.neucom.2016.08.017
  • [33] Ekinci E, İlhan Omurca S. Concept-LDA: incorporating Babelfy into LDA for aspect extraction. Journal of Information Science 2019; 1: 1-20. doi: 10.1177/0165551519845854
  • [34] Rao Y. Contextual sentiment topic model for adaptive social emotion classification. IEEE Intelligent Systems 2016; 31 (1): 41-47. doi: 10.1109/MIS.2015.91
  • [35] Xie P, Yang D, Xing EP. Incorporating word correlation knowledge into topic modeling. In: The 2015 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2015); Denver, Colorado, USA; 2015. pp. 725-734.
  • [36] Alam H, Ryu WJ, Lee SK. Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences 2016; 339: 206-223. doi: 10.1016/j.ins.2016.01.013
  • [37] Yao L, Zhang Y, Chen Q, Qian H, Wei B et al. Mining coherent topics in documents using word embeddings and large-scale text data. Engineering Applications of Artificial Intelligence 2017; 64: 432-439. doi: 10.1016/j.engappai.2017.06.024
  • [38] Fu X, Sun X, Wu H, Cui L, Huang JZ. Weakly supervised topic sentiment joint model with word embeddings. Knowledge-Based Systems 2018; 147: 43-54. doi: 10.1016/j.knosys.2018.02.012
  • [39] Shams M, Baraani-Dastjerdi A. Enriched LDA (ELDA): combination of latent Dirichlet allocation with word co-occurrence analysis for aspect extraction. Expert Systems with Applications 2017; 80: 136-146. doi: 10.1016/j.eswa.2017.02.038
  • [40] Heng Y, Gao Z, Jiang Y, Chen X. Exploring hidden factors behind online food shopping from Amazon reviews: A topic mining approach. Journal of Retailing and Consumer Services 2018; 42: 161-168. doi: 10.1016/j.jretconser.2018.02.006
  • [41] Kandemir M, Kekeç T, Yeniterzi R. Supervising topic models with Gaussian processes. Pattern Recognition 2018; 77: 226-236. doi: 10.1016/j.patcog.2017.12.019
  • [42] Akın MD, Akın AA. Türk Dilleri için Açık Kaynaklı Doğal Dil İşleme Kütüphanesi: Zemberek. Elektrik Mühendisliği 2007; 431: 38-44.
  • [43] Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ et al. The stanford corenlp natural language processing toolkit. In: 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations; Baltimore, Maryland, USA; 2014. pp. 55-60.
  • [44] Navigli R, Ponzetto SP. BabelNet: Building a Very Large Multilingual Semantic Network. In: 48th Annual Meeting of the Association for Computational Linguistics; Uppsala, Sweden; 2010. pp. 216-225.
  • [45] Ehrmann M, Cecconi F, Vannella D, McCrae J, Cimiano P et al. Representing Multilingual Data as Linked Data: the Case of BabelNet 2.0. In: Ninth International Conference on Language Resources and Evaluation; Reykjavik, Iceland; 2014. pp. 401-408.
  • [46] Wallach HM, Mimno D, McCallum A. Rethinking LDA: Why Priors Matter. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A (editors). Advances in Neural Information Processing Systems 22. New York, NY, USA: Curran Associates Inc., 2009, pp. 1973-1981.
  • [47] Chen Z, Liu B. Topic modeling using topics from many domains, lifelong learning and big data. In: 31st international conference on machine learning (ICML ’14); Beijing, China; 2014. pp. 703-711.
  • [48] Chen Z, Liu B. Mining Topics in Documents: Standing on the Shoulders of Big Data. In: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Minnig (KDD’14); New York, NY, USA; 2014. pp. 1116-1125.
  • [49] Mimno D, Wallach HM, Talley E, Leenders M, McCallum A. Optimizing semantic coherence in topic models. In: 2011 Conference on Empirical Methods in Natural Language Processing; Edinburgh, Scotland, UK; 2011. pp. 262-272
APA Ekinci E, İlhan Omurca S (2020). NET-LDA: a novel topic modeling method based on semantic document similarity. Turkish Journal of Electrical Engineering and Computer Sciences, 28(4), 2244-2260. 10.3906/elk-1912-62
Chicago Ekinci Ekin, İlhan Omurca Sevinç. "NET-LDA: a novel topic modeling method based on semantic document similarity." Turkish Journal of Electrical Engineering and Computer Sciences 28, no. 4 (2020): 2244-2260. 10.3906/elk-1912-62
MLA Ekinci Ekin, İlhan Omurca Sevinç. "NET-LDA: a novel topic modeling method based on semantic document similarity." Turkish Journal of Electrical Engineering and Computer Sciences, vol. 28, no. 4, 2020, pp. 2244-2260. 10.3906/elk-1912-62
AMA Ekinci E, İlhan Omurca S. NET-LDA: a novel topic modeling method based on semantic document similarity. Turkish Journal of Electrical Engineering and Computer Sciences. 2020; 28(4): 2244-2260. 10.3906/elk-1912-62
Vancouver Ekinci E, İlhan Omurca S. NET-LDA: a novel topic modeling method based on semantic document similarity. Turkish Journal of Electrical Engineering and Computer Sciences. 2020; 28(4): 2244-2260. 10.3906/elk-1912-62
IEEE Ekinci E, İlhan Omurca S, "NET-LDA: a novel topic modeling method based on semantic document similarity," Turkish Journal of Electrical Engineering and Computer Sciences, vol. 28, no. 4, pp. 2244-2260, 2020. 10.3906/elk-1912-62
ISNAD Ekinci, Ekin - İlhan Omurca, Sevinç. "NET-LDA: a novel topic modeling method based on semantic document similarity". Turkish Journal of Electrical Engineering and Computer Sciences 28/4 (2020), 2244-2260. https://doi.org/10.3906/elk-1912-62