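System ID U0026-1408201823142100
Title (Chinese) 使用離群偵測與實體辨識改進群眾生醫標注系統
Title (English) Using outlier detection and entity recognition to improve a crowdsourcing biocuration system
Institution National Cheng Kung University
Department (Chinese) 電機工程學系
Department (English) Department of Electrical Engineering
Academic year 106 (ROC calendar; 2017-2018)
Semester 2
Year of publication 107 (ROC calendar; 2018)
Author (Chinese) 鄭宇傑
Author (English) Yu-Jie Zheng
Student ID N26060313
Degree Master's
Language Chinese
Number of pages 34
Committee Advisor: 張天豪
Keywords (Chinese) 自然語言處理  生醫命名實體辨識  群眾外包
Keywords (English) natural language processing  biomedical named-entity recognition  crowdsourcing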
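Abstract (translated from Chinese) Rapidly advancing technology has accelerated exploration of the biomedical field, and the pace and volume of related publications are far beyond what they once were. To index this vast body of literature in databases for querying, natural-language information extraction models play a crucial role. In the past, the common approach was to process the literature with automatic tools and then hire domain experts to verify the results; the money spent and the experts' time thus became an unavoidable burden that is hard to sustain in the long term. The emergence of crowdsourcing has made it possible to collect large amounts of annotation data in a short time. If these annotations can be aggregated efficiently, valuable information can be obtained at relatively low cost and in less time.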
Abstract (English) Owing to the rapidly growing number of scientific publications, many automatic tools have been developed to extract the information buried in natural-language articles. Manual verification, usually by domain experts, is generally required to ensure quality. This step, however, incurs considerable cost if performed regularly. This work attempts to solve the problem with a crowdsourcing model that aggregates annotations from crowds. Furthermore, this work proposes methods, including outlier detection and entity recognition, to improve the system. Biomedical named-entity recognition was chosen for evaluation owing to its importance in various information extraction tasks. Experimental results demonstrate that aggregation via this system improves the performance of crowds by more than 15% and even reaches the level of a single expert. An annotator’s a priori quality, obtained by the outlier detection method, correlates with his or her individual biocuration performance, with a Spearman’s rank correlation coefficient higher than 0.66. In addition, taking the biocuration results of automatic tools as a reference and avoiding incorrect annotations caused by carelessness, as defined in the entity recognition method, also slightly improves the average individual performance of crowds by 2.7%. Consequently, crowdsourcing combined with the methods proposed in this work, which can detect outliers and prevent mistakes, is a promising model for biomedical information extraction tasks.
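Table of contents Chapter 1 Introduction
Chapter 2 Related Work
2.1 Crowdsourcing
2.2 Natural Language Processing
2.2.1 Stop Words
2.2.2 Edit Distance
2.3 Named-Entity Recognition
Chapter 3 Methods
3.1 Dataset
3.2 Literature Preprocessing
3.2.1 Article Parsing
3.2.2 Defining Sentence Value
3.3 Information Extraction
3.3.1 Aggregating Annotations
3.3.2 Filtering Outliers
3.4 Experimental Design 1: Validating Model Performance
3.4.1 Small-Scale Closed Test
3.4.2 Large-Scale Simulation
3.5 Experimental Design 2: Comparing Annotation Workflows
Chapter 4 Results
4.1 Experimental Results 1: Validating Model Performance
4.1.1 Small-Scale Closed Test
4.1.2 Large-Scale Simulation
4.2 Experimental Results 2: Comparing Annotation Workflows
4.3 Analysis and Discussion
4.3.1 Errors in Crowd Annotation
4.3.2 Data Collection Decisions
Chapter 5 Conclusion
References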
