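System ID U0026-1408201823142100
Title (Chinese) 使用離群偵測與實體辨識改進群眾生醫標注系統
Title (English) Using outlier detection and entity recognition to improve a crowdsourcing biocuration system
Institution National Cheng Kung University
Department (Chinese) 電機工程學系
Department (English) Department of Electrical Engineering
Academic year 106 (ROC calendar)
Semester 2
Publication year 107 (ROC calendar; 2018)
Graduate student (Chinese) 鄭宇傑
Graduate student (English) Yu-Jie Zheng
Student ID N26060313
Degree Master's
Language Chinese
Pages 34
Oral defense committee Advisor: 張天豪
Chinese keywords 自然語言處理  生醫命名實體辨識  群眾外包
English keywords natural language processing  biomedical named-entity recognition  crowdsourcing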
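Abstract (translated from Chinese) Rapidly advancing technology has accelerated the exploration of the biomedical field, and related literature is now published at a far greater pace and volume than before. To build database indexes over this vast body of literature so that it can be queried, natural-language information extraction models play a crucial role. In the past, the common approach was to process the literature with automatic tools and then hire domain experts to verify the results; the money spent and the experts' time thus became an unavoidable burden that is hard to sustain over the long term. The emergence of crowdsourcing has made it possible to collect a large amount of annotation data in a short time. If these annotations can be aggregated efficiently, valuable information can be obtained at relatively low cost and in less time.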
Abstract (English) Owing to the rapid growth in the number of scientific publications, many automatic tools have been developed to extract the information buried in natural-language articles. Manual verification, usually by domain experts, is generally required to ensure quality. This step, however, incurs considerable cost if performed regularly. This work attempts to solve the problem with a crowdsourcing model that aggregates annotations from crowds. It also proposes methods, including outlier detection and entity recognition, to improve the system. Biomedical named-entity recognition was chosen for evaluation owing to its importance in various information extraction tasks. Experimental results demonstrate that aggregation via this system improves the performance of crowds by more than 15% and even reaches the level of a single expert. The prior quality of an annotator, obtained by the outlier detection method, correlates with the annotator's individual biocuration performance, with a Spearman's rank correlation coefficient higher than 0.66. In addition, the entity recognition method, which takes the biocuration results of automatic tools as a reference and prevents incorrect annotations caused by carelessness, slightly improves the average individual performance of crowds, by 2.7%. Consequently, crowdsourcing combined with the methods proposed in this work, which can detect outliers and prevent mistakes, is a promising model for biomedical information extraction tasks.
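Table of contents Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 Crowdsourcing 4
2.2 Natural Language Processing 4
2.2.1 Stop Word 5
2.2.2 Edit Distance 5
2.3 Named-Entity Recognition 6
Chapter 3 Methods 7
3.1 Dataset 7
3.2 Literature Preprocessing 7
3.2.1 Article Parsing 8
3.2.2 Sentence Value Definition 8
3.3 Information Extraction 10
3.3.1 Aggregating Annotation Data 10
3.3.2 Filtering Outlier Data 11
3.4 Experimental Design 1: Validation of Model Performance 12
3.4.1 Small-Scale Closed Test 12
3.4.2 Large-Scale Simulation Experiment 14
3.5 Experimental Design 2: Comparison of Annotation Workflows 15
Chapter 4 Results 17
4.1 Experimental Results 1: Validation of Model Performance 17
4.1.1 Small-Scale Closed Test 17
4.1.2 Large-Scale Simulation Experiment 23
4.2 Experimental Results 2: Comparison of Annotation Workflows 25
4.3 Analysis and Discussion 29
4.3.1 Errors in Crowd Annotation 29
4.3.2 Decisions in Data Collection 29
Chapter 5 Conclusion 31
References 32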
References 1. Cowie J, Lehnert W: Information extraction. Communications of the ACM 1996, 39(1):80-91.
2. Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St Pierre S: Big data: The future of biocuration. Nature 2008, 455(7209):47.
3. Wei C-H, Kao H-Y, Lu Z: PubTator: a web-based text mining tool for assisting biocuration. Nucleic acids research 2013, 41(W1):W518-W522.
4. Howe J: The rise of crowdsourcing. Wired magazine 2006, 14(6):1-4.
5. Nadeau D, Sekine S: A survey of named entity recognition and classification. Lingvisticae Investigationes 2007, 30(1):3-26.
6. Leser U, Hakenberg J: What makes a gene name? Named entity recognition in the biomedical literature. Briefings in bioinformatics 2005, 6(4):357-369.
7. Campos D, Matos S, Oliveira JL: Gimli: open source and high-performance biomedical name recognition. BMC bioinformatics 2013, 14(1):54.
8. Tsuruoka Y, Tateishi Y, Kim J-D, Ohta T, McNaught J, Ananiadou S, Tsujii Ji: Developing a robust part-of-speech tagger for biomedical text. In: Panhellenic Conference on Informatics: 2005. Springer: 382-392.
9. Wei C-H, Kao H-Y, Lu Z: GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed research international 2015, 2015.
10. Mika S, Rost B: NLProt: extracting protein names and sequences from papers. Nucleic acids research 2004, 32(suppl_2):W634-W637.
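11. 黃彥霖: A crowdsourcing mobile game for biomedical named-entity recognition (群眾外包手機遊戲用以生醫命名實體辨識). Master's thesis, Department of Electrical Engineering, National Cheng Kung University; 2015.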
12. Brabham DC: Crowdsourcing as a model for problem solving: An introduction and cases. Convergence 2008, 14(1):75-90.
13. Snow R, O'Connor B, Jurafsky D, Ng AY: Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks. In: Proceedings of the conference on empirical methods in natural language processing: 2008. Association for Computational Linguistics: 254-263.
14. Crowdflower. https://www.crowdflower.com/. Accessed 29 Mar 2018.
15. Buhrmester M, Kwang T, Gosling SD: Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on psychological science 2011, 6(1):3-5.
16. Khare R, Good BM, Leaman R, Su AI, Lu Z: Crowdsourcing in biomedicine: challenges and opportunities. Briefings in Bioinformatics 2016, 17(1):23-32.
17. White RW, Wang S, Pant A, Harpaz R, Shukla P, Sun W, DuMouchel W, Horvitz E: Early identification of adverse drug reactions from search log data. Journal of biomedical informatics 2016, 59:42-48.
18. White RW, Tatonetti NP, Shah NH, Altman RB, Horvitz E: Web-scale pharmacovigilance: listening to signals from the crowd. Journal of the American Medical Informatics Association 2013, 20(3):404-408.
19. Kawrykow A, Roumanis G, Kam A, Kwak D, Leung C, Wu C, Zarour E, Sarmenta L, Blanchette M, Waldispühl J: Phylo: a citizen science approach for improving multiple sequence alignment. PloS one 2012, 7(3):e31362.
20. Waldispühl J, Kam A, Gardner PP: Crowdsourcing RNA structural alignments with an online computer game. In: Pacific Symposium on Biocomputing Co-Chairs: 2014. World Scientific: 330-341.
21. Loguercio S, Good BM, Su AI: Dizeez: an online game for human gene-disease annotation. PLoS One 2013, 8(8):e71171.
22. Bow HC, Dattilo JR, Jonas AM, Lehmann CU: A crowdsourcing model for creating preclinical medical education study tools. Academic Medicine 2013, 88(6):766-770.
23. Chowdhury GG: Natural language processing. Annual review of information science and technology 2003, 37(1):51-89.
24. Brown PF, Pietra VJD, Pietra SAD, Mercer RL: The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 1993, 19(2):263-311.
25. Koehn P, Och FJ, Marcu D: Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1: 2003. Association for Computational Linguistics: 48-54.
26. Luong M-T, Sutskever I, Le QV, Vinyals O, Zaremba W: Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 2014.
27. Rosenthal S, Biswas J, Veloso M: An effective personal mobile robot agent through symbiotic human-robot interaction. In: Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1: 2010. International Foundation for Autonomous Agents and Multiagent Systems: 915-922.
28. Fasola J, Mataric M: A socially assistive robot exercise coach for the elderly. Journal of Human-Robot Interaction 2013, 2(2):3-32.
29. DeVault D, Artstein R, Benn G, Dey T, Fast E, Gainer A, Georgila K, Gratch J, Hartholt A, Lhommet M: SimSensei Kiosk: A virtual human interviewer for healthcare decision support. In: Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems: 2014. International Foundation for Autonomous Agents and Multiagent Systems: 1061-1068.
30. Russell MA: Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. O'Reilly Media, Inc.; 2013.
31. Ott M, Cardie C, Hancock J: Estimating the prevalence of deception in online review communities. In: Proceedings of the 21st international conference on World Wide Web: 2012. ACM: 201-210.
32. Elhadad N, Gravano L, Hsu D, Balter S, Reddy V, Waechter H: Information extraction from social media for public health. In: KDD at Bloomberg Workshop, Data Frameworks Track (KDD 2014): 2014.
33. Hirschberg J, Manning CD: Advances in natural language processing. Science 2015, 349(6245):261-266.
34. Munková D, Munk M, Vozár M: Influence of stop-words removal on sequence patterns identification within comparable corpora. In: ICT innovations 2013. Springer; 2014: 67-76.
35. Silva C, Ribeiro B: The importance of stop word removal on recall values in text categorization. In: Neural Networks, 2003 Proceedings of the International Joint Conference on: 2003. IEEE: 1661-1666.
36. Levenshtein VI: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady: 1966. 707-710.
37. Roberts RJ: PubMed Central: The GenBank of the published literature. National Acad Sciences; 2001.
38. PubMed FTP Service. https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/. Accessed 29 Mar 2018.
39. UniProt: the universal protein knowledgebase. Nucleic Acids Research 2017, 45(D1):D158-D169.
40. UniProt Citation List. http://www.uniprot.org/citations/. Accessed 29 Mar 2018.
41. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H: GeneCards Version 3: the human gene integrator. Database 2010, 2010.
42. Bray T, Paoli J, Sperberg-McQueen CM, Maler E, Yergeau F: Extensible markup language (XML). World Wide Web Journal 1997, 2(4):27-66.
43. Perl module Lingua::Sentence. http://search.cpan.org/~capoeirab/Lingua-Sentence-1.100/lib/Lingua/Sentence.pm. Accessed 29 Mar 2018.
44. Perl module Lingua::EN::StopWordList. http://search.cpan.org/~rsavage/Lingua-EN-StopWordList-1.02/lib/Lingua/EN/StopWordList.pm. Accessed 29 Mar 2018.
45. Perl module String::Similarity. http://search.cpan.org/~mlehmann/String-Similarity-1.04/Similarity.pm. Accessed 29 Mar 2018.
46. Allahbakhsh M, Benatallah B, Ignjatovic A, Motahari-Nezhad HR, Bertino E, Dustdar S: Quality control in crowdsourcing systems: Issues and directions. IEEE Internet Computing 2013, 17(2):76-81.
47. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bansal P, Bridge AJ, Poux S, Bougueleret L, Xenarios I: UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. In: Plant Bioinformatics. Springer; 2016: 23-54.
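  • On-campus browsing/printing of the electronic full text is authorized; available from 2018-08-17.
  • Off-campus browsing/printing of the electronic full text is authorized; available from 2019-09-01.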
