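System ID U0026-1408201823142100
Title (Chinese) 使用離群偵測與實體辨識改進群眾生醫標注系統
Title (English) Using outlier detection and entity recognition to improve a crowdsourcing biocuration system
University National Cheng Kung University
Department (Chinese) 電機工程學系
Department (English) Department of Electrical Engineering
Academic year 106 (ROC calendar; 2017-18)
Semester 2
Year of publication 107 (ROC calendar; 2018)
Author (Chinese) 鄭宇傑
Author (English) Yu-Jie Zheng
Student ID N26060313
Degree Master's
Language Chinese
Pages 34
Oral defense committee Advisor - 張天豪
Committee member - 張文綺
Committee member - 吳謂勝
Committee member - 劉宗霖
Committee member - 解巽評
Keywords (Chinese) 自然語言處理  生醫命名實體辨識  群眾外包
Keywords (English) natural language processing  biomedical named-entity recognition  crowdsourcing
Subject classification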
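Abstract (translated from Chinese) Rapid advances in technology have accelerated exploration of the biomedical field, and related literature is now published at a far greater pace and volume than before. To index this large body of literature in databases for retrieval, natural-language information extraction models play a crucial role. In the past, the common approach was to process the literature with automatic tools and then hire domain experts to verify the results; the monetary cost and the experts' time thus become an unavoidable burden that is hard to sustain. The emergence of crowdsourcing has made it possible to collect a large amount of annotation data in a short time. If such data can be aggregated efficiently, valuable information can be obtained at relatively low cost and in less time.
This study applies the idea of crowdsourcing and builds an information extraction model on a mobile game equipped with a literature annotation tool. To reduce the influence of outlier data, the study first computes the similarity among the crowd's annotation results and uses it as an indicator for removing annotations that deviate too much. Through a closed test, it compares annotation performance against other automatic tools and methods, and further examines whether crowd data can match the performance of a single expert. The study also analyzes and improves the annotation workflow, using automatic tools as an aid so that the crowd can annotate the literature more accurately.
Given the importance of named-entity recognition in information extraction, this study focuses on named-entity recognition in biomedical literature and designs a series of experiments and analyses to verify the effectiveness of the model. Experimental results show that crowd annotations aggregated by the model improve overall performance by more than 15% and reach the level of a single domain expert. The similarity between an individual player's annotations and those of the crowd has a Spearman's rank correlation coefficient greater than 0.66 with the player's actual annotation ability, which supports the reliability of this method for filtering outlier data. In addition, modifying the annotation workflow and introducing automatic tools as an aid raises the players' average annotation performance by about 2.7%. Finally, based on the results and analysis of simulation experiments, the study offers guidance to related research groups on how to choose between collecting annotators of "higher ability" and annotators in "greater number".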
Abstract (English) Owing to the rapidly growing number of scientific publications, many automatic tools have been developed to extract the information buried in natural-language articles. Manual verification, usually by domain experts, is generally required to ensure quality. This step, however, incurs considerable cost if performed regularly. This work attempts to solve the problem through a crowdsourcing model that aggregates annotations from crowds. Furthermore, this work proposes methods, including outlier detection and entity recognition, to improve the system. Biomedical named-entity recognition was chosen for evaluation, owing to its importance in various information extraction tasks. Experimental results demonstrate that aggregation via this system improves the performance of crowds by more than 15% and even reaches the level of a single expert. The prior quality of an annotator, obtained by the outlier detection method, correlates with the annotator's individual biocuration performance, with a Spearman's rank correlation coefficient higher than 0.66. In addition, taking the biocuration results of automatic tools as a reference and avoiding incorrect annotations that result from carelessness, as defined in the entity recognition method, also slightly improves the average individual performance of crowds by 2.7%. Consequently, crowdsourcing combined with the methods proposed in this work, which are capable of detecting outliers and preventing mistakes, is a promising model for biomedical information extraction tasks.
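The abstract describes scoring each annotator by how closely their annotations agree with the rest of the crowd, dropping annotators who deviate too much, and checking that score against true performance with Spearman's rank correlation. The following is a minimal sketch of that idea only, not the thesis's implementation. The abstract does not state the similarity measure or the cutoff, so the Jaccard similarity over sets of annotated entity strings, the `threshold` parameter, and all function names and sample data here are assumptions made for illustration.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; two empty sets count as identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prior_quality(annotations):
    """Mean similarity of each annotator's entity set to every other annotator's set."""
    return {
        name: mean(jaccard(ents, other)
                   for other_name, other in annotations.items()
                   if other_name != name)
        for name, ents in annotations.items()
    }


def filter_outliers(annotations, threshold):
    """Keep annotators whose agreement score is at least `threshold` (hypothetical cutoff)."""
    quality = prior_quality(annotations)
    return {n: e for n, e in annotations.items() if quality[n] >= threshold}


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up sample: entity strings marked by four players in the same sentence.
annotations = {
    "p1": {"BRCA1", "TP53", "EGFR"},
    "p2": {"BRCA1", "TP53"},
    "p3": {"BRCA1", "TP53", "EGFR", "MYC"},
    "p4": {"the", "cell"},  # shares nothing with the others
}
kept = filter_outliers(annotations, threshold=0.2)
```

With true per-annotator scores available (for example F-scores against an expert gold standard), `spearman(list_of_agreement_scores, list_of_true_scores)` gives the kind of coefficient the abstract reports; the lists must be in the same annotator order.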
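Table of contents Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Crowdsourcing 4
2.2 Natural Language Processing 4
2.2.1 Stop Words 5
2.2.2 Edit Distance 5
2.3 Named-Entity Recognition 6
Chapter 3 Methods 7
3.1 Dataset 7
3.2 Literature Preprocessing 7
3.2.1 Article Parsing 8
3.2.2 Definition of Sentence Value 8
3.3 Information Extraction 10
3.3.1 Aggregating Annotation Data 10
3.3.2 Filtering Outlier Data 11
3.4 Experiment Design 1: Verification of Model Performance 12
3.4.1 Small-Scale Closed Test 12
3.4.2 Large-Scale Simulation Experiment 14
3.5 Experiment Design 2: Comparison of Annotation Workflows 15
Chapter 4 Results 17
4.1 Experiment Results 1: Verification of Model Performance 17
4.1.1 Small-Scale Closed Test 17
4.1.2 Large-Scale Simulation Experiment 23
4.2 Experiment Results 2: Comparison of Annotation Workflows 25
4.3 Analysis and Discussion 29
4.3.1 Errors in Crowd Annotation 29
4.3.2 Decisions on Data Collection 29
Chapter 5 Conclusion 31
References 32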
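Full-text access rights
  • Authorized for on-campus browsing/printing of the electronic full text, public from 2018-08-17.
  • Authorized for off-campus browsing/printing of the electronic full text, public from 2019-09-01.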