System ID: U0026-0108201808081200
Title (Chinese): 基於主題式句向量的中國古文段落匹配
Title (English): Chinese Ancient Paragraph Matching By Topical Sentence Representation
University: National Cheng Kung University (成功大學)
Department (Chinese): 資訊工程學系
Department (English): Institute of Computer Science and Information Engineering
Academic Year: 106
Semester: 2
Year of Publication: 107
Author (Chinese): 葉修宏
Author (English): Siou-Hong Yeh
Student ID: P76054494
Degree: Master's
Language: English
Pages: 58
Committee: Advisor - 高宏宇
Committee Member - 謝孫源
Committee Member - 莊坤達
Committee Member - 王惠嘉
Committee Member - 何建明
Keywords (Chinese): 相似文本, 句向量, 主題模型, 自編碼
Keywords (English): Similar Text, Sentence Embedding, Topic Model, Autoencoder
Subject Classification:
Abstract (Chinese): In the field of history, a vast body of literature has recorded several thousand years of history. Ancient histories usually contain rich information that helps us understand many things and draw on past experience to improve future policy. How different history books record the same historical event is one of the research topics of historiography. In the past, many ancient historians wrote manuscripts or books to describe the events of their time along with their own views. For the same event, however, different historians writing at different times could produce different accounts, differing in common wording, style, and stance. Moreover, this information is scattered across different paragraphs of many books. Modern historians must read through many books, find the similar paragraphs, and then consolidate them, a task that is both time-consuming and labor-intensive. Fortunately, with the development of artificial intelligence, large-scale text-processing problems can be handed over to machines.
In this thesis, our goal is to help historians find the common threads among many history books so that they can consolidate them more efficiently. Given history books from the same dynasty, we can automatically find paragraphs that describe the same events and identify the parts that are similar to one another. To match such similar paragraphs, we assume that when two paragraphs share the largest number of semantically similar sentences, the two paragraphs are very likely describing the same event. In classical Chinese, however, sentences are generally short and superficially similar, so conventional methods cannot produce adequate sentence vectors. We therefore adopt a special autoencoder architecture that combines the topic of the whole paragraph with the semantics of the sentence itself to produce better sentence vectors. These topic-aware sentence vectors address both the limited semantic content of short sentences and sentences that look similar despite belonging to different topics; with these better similar-sentence links, we can match similar paragraphs more accurately.
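A minimal sketch of this idea, assuming PyTorch and illustrative dimensions (this is not the thesis's exact architecture): the encoder sees a sentence's pre-trained vector concatenated with its paragraph's topic vector, so the learned code mixes local sentence semantics with the global paragraph topic, and the decoder is trained to reconstruct the sentence vector.

```python
# Sketch only: a topic-conditioned autoencoder for sentence vectors.
# All dimensions and the training setup are illustrative assumptions.
import torch
import torch.nn as nn

class TopicAwareAutoencoder(nn.Module):
    def __init__(self, sent_dim=300, topic_dim=50, hidden_dim=128):
        super().__init__()
        # Encoder input: sentence vector concatenated with paragraph topic vector.
        self.encoder = nn.Sequential(
            nn.Linear(sent_dim + topic_dim, hidden_dim),
            nn.Tanh(),
        )
        # Decoder reconstructs only the sentence vector from the hidden code.
        self.decoder = nn.Linear(hidden_dim, sent_dim)

    def forward(self, sent_vec, topic_vec):
        code = self.encoder(torch.cat([sent_vec, topic_vec], dim=-1))
        return self.decoder(code), code

model = TopicAwareAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sent = torch.randn(32, 300)    # stand-in for pre-trained sentence vectors
topic = torch.randn(32, 50)    # stand-in for paragraph-level topic vectors

# One training step: minimize reconstruction error of the sentence vector;
# the hidden `code` then serves as the topic-aware sentence embedding.
recon, code = model(sent, topic)
loss = nn.functional.mse_loss(recon, sent)
opt.zero_grad()
loss.backward()
opt.step()
```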
Abstract (English): In the field of history, a large body of literature has recorded thousands of years of history. Ancient histories usually contain rich information that helps us understand many things and learn from past experience to improve future policy. In the past, many ancient historians wrote notes or books describing the events of their time and their views. For the same event, different historians could write different accounts of the history, differing in word choice, style, and stance. Moreover, the similar paragraphs are located at different positions in different books. Modern historians have to read many books and find the similar paragraphs in order to survey them, which costs a great deal of time and labor. Fortunately, with the rapid development of artificial intelligence, machines can help historians deal with such large-scale text problems.
In this thesis, we aim to detect semantically similar paragraphs to help historians survey and conduct research efficiently. Given history books from the same dynasty, we want to automatically match all paragraphs that describe the same events. To match similar paragraphs, we assume that two paragraphs sharing the most semantically similar sentences are likely to describe the same events. However, the sentences of ancient histories are so short and morphologically similar that traditional methods cannot generate adequate embeddings. We therefore propose a method that uses a special autoencoder combining global and local information to embed sentences for matching. Our approach generates high-quality sentence representations that help us find correct similar sentences; with better similar sentence pairs, we can match semantically similar paragraphs more effectively.
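The matching criterion stated above can be sketched as follows, assuming generic sentence embeddings, a greedy one-to-one pairing, and an illustrative similarity threshold (none of these specifics are taken from the thesis): two paragraphs are scored by how many of their sentence pairs clear a cosine-similarity threshold, and a higher score suggests they describe the same event.

```python
# Sketch only: count one-to-one similar-sentence pairs between two paragraphs.
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def paragraph_match_score(para_a, para_b, threshold=0.8):
    """para_a, para_b: (n_sentences, dim) arrays of sentence embeddings.
    Greedily pair the most similar sentences (each sentence used at most once)
    and count the pairs whose similarity clears the threshold."""
    sims = cosine_matrix(para_a, para_b)
    count = 0
    while sims.size and sims.max() >= threshold:
        i, j = np.unravel_index(np.argmax(sims), sims.shape)
        count += 1
        sims[i, :] = -1.0   # remove both sentences from further matching
        sims[:, j] = -1.0
    return count
```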
Table of Contents:
Chinese Abstract I
Abstract II
Acknowledgments III
LIST OF TABLES VII
LIST OF FIGURES VIII
1 INTRODUCTION 1
1.1 Background 1
1.2 Motivation 5
1.3 Our Approaches 8
1.4 Paper structure 10
2 RELATED WORK 11
2.1 Segmentation 11
2.2 Topic model 14
2.3 Sentence embedding model 18
2.4 Context-Sensitive Autoencoder 20
3 METHOD 23
3.1 Pre-trained embedding Model 24
3.2 Topic-Aware AutoEncoder 28
3.3 Similar paragraph matching 31
4 EXPERIMENTS 34
4.1 Model Parameter 34
4.2 Model Performance Analysis 36
4.3 Similar Semantic Sentences 38
4.4 Performance of different parameters 43
4.5 Case study for the similar sentences 44
4.6 Similar Paragraph Detection 49
5 CONCLUSION 50
Future Work 51
Acknowledgments 51
6 REFERENCES 52
APPENDIX 56
Full-Text Access Rights:
  • The author agreed to authorize on-campus browsing/printing of the electronic full text, available from 2018-08-03.

