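System ID U0026-0108201808081200
Thesis Title (Chinese) 基於主題式句向量的中國古文段落匹配
Thesis Title (English) Chinese Ancient Paragraph Matching By Topical Sentence Representation
Institution National Cheng Kung University
Department (Chinese) 資訊工程學系
Department (English) Institute of Computer Science and Information Engineering
Academic Year 106
Semester 2
Publication Year 107
Author (Chinese) 葉修宏
Author (English) Siou-Hong Yeh
Student ID P76054494
Degree Master's
Language English
Pages 58
Oral Defense Committee Advisor - 高宏宇
Keywords (Chinese) 相似文本  句向量  主題模型  自編碼
Keywords (English) Similar Text  Sentence Embedding  Topic Model  Autoencoder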
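Abstract (Chinese) In the field of history, a large body of literature already records the events of the past several thousand years. Ancient history usually contains a wealth of information that helps us understand many things and draw on past experience to improve future policies. Examining how different history books record the same historical event is one of the research topics of historiography. In the past, many ancient historians wrote manuscripts or books to describe the events of their time and their own views. For the same event, however, historians writing at different times might record different histories, with differences in commonly used words, style, and stance. In addition, this information is scattered across different paragraphs of many books. Contemporary historians must read through many books, find the similar paragraphs, and then consolidate them, which is both time-consuming and laborious. Fortunately, with the development of artificial intelligence, text-processing problems on big data can be handed over to machines.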
Abstract (English) In the field of history, a large body of literature records thousands of years of events. Ancient histories usually contain rich information that helps us understand many things and learn from past experience to improve future policy. In the past, many historians wrote notes or books to describe events and their own views at the time. For the same event, different historians could write different accounts that vary in word usage, style, and stance. Moreover, similar paragraphs are located at different positions in different books. Modern historians have to read many books and find the similar paragraphs before they can survey them, which costs a great deal of time and labor. Fortunately, with the rapid development of artificial intelligence, machines can help historians deal with big data problems.
In this paper, we aim to detect semantically similar paragraphs to help historians survey and research efficiently. Given history books from the same dynasty, we want to automatically match all paragraphs that describe the same events. To match similar paragraphs, we assume that two paragraphs sharing the most semantically similar sentences likely describe the same event. However, sentences in ancient histories are usually so short and so morphologically similar that traditional methods cannot produce usable embeddings. We therefore propose a method that uses a specialized autoencoder, which combines global and local information, to embed sentences for matching. Our approach generates high-quality sentence representations that help us find the correct similar sentences. With better similar-sentence pairs, we can better match semantically similar paragraphs.
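The matching assumption stated in the abstract (paragraphs that share the most similar sentences likely describe the same event) can be illustrated with a minimal sketch. It takes sentence embeddings as already computed and does not reproduce the thesis's autoencoder. The cosine similarity measure, the 0.8 threshold, the scoring rule, and all function names are assumptions made for illustration and do not come from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def paragraph_score(para_a, para_b, threshold=0.8):
    """Fraction of sentences in para_a whose best match in para_b
    reaches the (hypothetical) similarity threshold."""
    if not para_a or not para_b:
        return 0.0
    hits = 0
    for s in para_a:
        best = max(cosine(s, t) for t in para_b)
        if best >= threshold:
            hits += 1
    return hits / len(para_a)

def match_paragraphs(book_a, book_b, threshold=0.8):
    """For each paragraph in book_a, return (index_a, index_b, score)
    for the highest-scoring paragraph in book_b."""
    if not book_b:
        return []
    matches = []
    for i, pa in enumerate(book_a):
        scores = [paragraph_score(pa, pb, threshold) for pb in book_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches

# Toy data: each book is a list of paragraphs, each paragraph a list of
# 2-dimensional sentence embeddings (made-up values, not real embeddings).
book_a = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0]]]
book_b = [[[0.1, 1.0]], [[1.0, 0.05], [0.8, 0.2]]]
matches = match_paragraphs(book_a, book_b)
```

In this toy case the first paragraph of `book_a` pairs with the second paragraph of `book_b`, and the second with the first. A real pipeline would replace the toy vectors with the learned sentence representations.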
Table of Contents Chinese Abstract
Abstract
Acknowledgments
1.1 Background
1.2 Motivation
1.3 Our Approaches
1.4 Paper Structure
2.1 Segmentation
2.2 Topic Model
2.3 Sentence Embedding Model
2.4 Context-Sensitive Autoencoder
3.1 Pre-trained Embedding Model
3.2 Topic-Aware AutoEncoder
3.3 Similar Paragraph Matching
4.1 Model Parameter
4.2 Model Performance Analysis
4.3 Similar Semantic Sentences
4.4 Performance of Different Parameters
4.5 Case Study for the Similar Sentences
4.6 Similar Paragraph Detection
Future Work
Acknowledgments
