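System ID U0026-0908201221593600
Title (Chinese) 英語短文語意相似度評估演算法
Title (English) A Semantic Similarity Evaluation Algorithm for English Short-Texts
University National Cheng Kung University
Department (Chinese) 工程科學系碩博士班
Department (English) Department of Engineering Science
Academic year 100 (ROC calendar)
Semester 2
Year of publication 101 (ROC calendar; 2012)
Author (Chinese) 張家瑋
Author (English) Jia-Wei Chang
Student ID n96994184
Degree Master's
Language Chinese
Pages 65
Committee Advisor: 王宗一
Committee member: 李明哲
Committee member: 林豪鏘
Committee member: 盧文祥
Committee member: 朱治平
Keywords (Chinese) 資訊檢索  自然語言處理  語法資訊  語意相似度
Keywords (English) Information Retrieval  Natural Language Processing  Syntactic Information  Semantic Similarity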
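Abstract (translated from Chinese) In recent years, people have come to rely increasingly on services such as information retrieval, creating strong demand for retrieval techniques based on natural language. This thesis therefore develops an algorithm that evaluates the semantic similarity between English short texts while offering both execution efficiency and retrieval accuracy.
Because an English short text carries very little information, directly applying well-known information retrieval models such as Latent Semantic Analysis (LSA) or the Hyperspace Analogue to Language (HAL) fails to quantify its semantics accurately. This thesis therefore uses the existing grammatical rules of natural language to obtain syntactic information, and proposes using that syntactic information together with WordNet-based word similarity algorithms to define and quantify the semantics of words, with the goals of evaluating semantics precisely and improving execution efficiency.
Experimental analysis shows that the proposed algorithm performs very well in comparison with related work: it obtains a Pearson correlation coefficient of 0.9111 in the small-scale experiment and an accuracy of 71.59% in the large-scale experiment.
The contribution of this thesis is to use the syntactic information of English short texts to resolve word ambiguity by comparing words that play the same syntactic role. The experimental data show that the proposed algorithm performs very well in both accuracy and execution efficiency. The author hopes that the method can be used in a variety of practical applications and benefit research in information retrieval and natural language processing.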
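The idea of comparing words that play the same syntactic role can be illustrated with a minimal Python sketch. It is not the thesis's actual formula, which the abstract does not give. The sketch assumes each short text has already been parsed into a role-to-word mapping (the role names are hypothetical), replaces WordNet with a tiny hand-made taxonomy (`PARENT`, hypothetical), and uses a Wu-Palmer style measure as one possible word similarity.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity", "dog": "animal", "cat": "animal",
    "action": "entity", "chase": "action", "follow": "action",
    "vehicle": "entity", "car": "vehicle",
}


def path_to_root(word: str) -> List[str]:
    """Return [word, parent, ..., root] for a word in the toy taxonomy."""
    path: List[str] = []
    node: Optional[str] = word
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def wu_palmer(a: str, b: str) -> float:
    """Wu-Palmer style similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    if a == b:
        return 1.0
    if a not in PARENT or b not in PARENT:
        return 0.0  # assumption: unknown words are treated as unrelated
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # Lowest common subsumer: first ancestor of a that is also above b.
    lcs = next(node for node in path_a if node in ancestors_b)
    return 2.0 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))


def role_similarity(text_a: Dict[str, str], text_b: Dict[str, str]) -> float:
    """Average word similarity over syntactic roles.

    Only words with the same role are compared; a role present in just
    one text contributes 0 (an assumption made for this sketch).
    """
    roles = set(text_a) | set(text_b)
    if not roles:
        return 0.0
    shared = set(text_a) & set(text_b)
    total = sum(wu_palmer(text_a[r], text_b[r]) for r in shared)
    return total / len(roles)


if __name__ == "__main__":
    s1 = {"subject": "dog", "verb": "chase", "object": "cat"}
    s2 = {"subject": "cat", "verb": "chase", "object": "dog"}
    print(role_similarity(s1, s2))
```

The two example texts contain exactly the same words, so a bag-of-words comparison would treat them as identical. Aligning words by role gives a score below 1 because the subject and object are swapped.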
Abstract (English) In recent years, there has been a growing need for precise and fast Information Retrieval (IR) services, which is gradually pushing information systems to adopt IR techniques based on natural language queries. This study presents an algorithm that evaluates the semantic similarity between English short texts and aims to improve both execution efficiency and retrieval accuracy.
Because the information embedded in a short text is limited, directly applying well-known IR models such as LSA and HAL may not accurately quantify the semantics of the short text. This study takes advantage of syntactic relationships derived by natural language processing techniques and proposes an algorithm that quantifies the words in a short text using word sense disambiguation and WordNet-based semantic similarity measures. The proposed algorithm proves comparable to the best-performing existing algorithms, achieving a Pearson correlation of 0.9111 on the small-scale dataset and an accuracy of 71.59% on the large-scale dataset.
This study uses syntactic information from short texts to resolve ambiguity in the roles of words and thereby improve the semantic similarity measure for short texts. The experimental results confirm that the algorithm has fair performance and good efficiency and could be useful for various practical applications in Information Retrieval and Natural Language Processing.
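The two reported figures correspond to two common evaluation measures: Pearson correlation between algorithm scores and human similarity ratings, and classification accuracy against binary labels. The following Python sketch shows how such measures are typically computed. The function names, the 0.5 decision threshold, and the sample numbers are assumptions made for illustration; the abstract does not state the thesis's datasets or decision rule.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def accuracy(scores: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of pairs whose thresholded score matches the 0/1 label.

    The threshold is a hypothetical choice for this sketch.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("need non-empty samples of equal length")
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Made-up numbers, not data from the thesis.
    algorithm_scores = [0.10, 0.35, 0.60, 0.90]
    human_ratings = [0.05, 0.40, 0.55, 0.95]
    print(pearson(algorithm_scores, human_ratings))
    print(accuracy(algorithm_scores, [0, 0, 1, 1]))
```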
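Table of contents Abstract (Chinese) I
Abstract II
Acknowledgements III
Contents IV
List of Figures V
List of Tables VI
Chapter 1 Introduction 1
Section 1 Research Motivation and Objectives 2
Section 2 Research Contributions 4
Section 3 Thesis Organization 5
Chapter 2 Literature Review 6
Section 1 Semantic Similarity Research on Long Texts 8
Section 2 Semantic Similarity Research on Short Texts 13
Section 3 Word-Level Semantic Similarity Computation 18
Section 4 Syntactic Parsers 23
Section 5 Discussion of the Literature 25
Chapter 3 A Semantic Similarity Evaluation Algorithm for English Short Texts 27
Section 1 Algorithm Architecture 28
Section 2 Semantic Analysis 29
Section 3 Semantic Evaluation 32
Section 4 Worked Example 36
Chapter 4 Experiments and Analysis 42
Section 1 Small-Scale Experiment Design 43
Section 2 Large-Scale Experiment Design 44
Section 3 Data Analysis and Results 45
Section 4 Error Analysis 59
Chapter 5 Conclusions and Suggestions 60
References 62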
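Full-text access rights
  • Authorized for on-campus browsing/printing of the electronic full text, publicly available from 2017-08-23.
  • Authorized for off-campus browsing/printing of the electronic full text, publicly available from 2017-08-23.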