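System ID U0026-1511201918175900
Title (Chinese) 生物醫學領域中概念辨識的研究 以基因本體學為例
Title (English) A Study on Concept Recognition in Biomedical Field Using Gene Ontology as an Example
University National Cheng Kung University
Department (Chinese) 資訊工程學系
Department (English) Institute of Computer Science and Information Engineering
Academic year 107 (ROC calendar)
Semester 2
Year of publication 108 (ROC calendar; 2019)
Student (Chinese) 楊家融
Student (English) Chia-Jung Yang
Email jeroyang@gmail.com
Student ID P78011167
Degree Doctoral
Language English
Pages 56
Oral defense committee Advisor: 蔣榮先
Committee member: 張瑞紘
Committee member: 林鵬展
Committee member: 呂宗學
Committee member: 林昭維
Committee member: 郝沛毅
Committee member: 鄞宗賢
Committee member: 李宗儒
Keywords (Chinese) 自然語言處理  基因本體學  機器學習
Keywords (English) natural language processing  gene ontology  machine learning
Subject classification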
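Abstract (translated from Chinese) In recent years, natural language processing has run into obstacles in specialized biomedical domains, where language use differs greatly from that of general domains. Highly specialized fields such as Gene Ontology often lack large training datasets, which makes it hard to apply powerful deep learning techniques to the small datasets available.
The Colorado Richly Annotated Full-Text corpus used in this study contains 67 full-text documents annotated with Gene Ontology data by biologists. We identify where the difficulty of Gene Ontology concept recognition lies and use "named concepts" as a knife to split the problem in two, tackling the two halves with dictionary lookup and machine learning respectively. In the first step, we restructure the Gene Ontology concept data using named concepts; in the second step, we use the restructured Gene Ontology to perform concept recognition.
Our system improves the F1-measure by about 20% over the previous leading system, reaching a precision of 0.804 and a recall of 0.715. We also show that the idea of named concepts is effective and may extend to other specialized languages.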
Abstract (English) In recent years, natural language processing has faced obstacles in text mining for specialized biomedical fields, where language use differs markedly from that of general-domain text. Highly specialized areas of biological research, such as gene ontology, lack large training datasets, so powerful deep learning techniques are difficult to apply to the small datasets available.
The Colorado Richly Annotated Full-Text corpus used in this study contains 67 full-text documents annotated by biologists. In this research, we identified the key difficulty of the gene ontology concept recognition task and used named concepts to divide the problem into two parts, handled by dictionary matching and machine learning respectively. In the first step, we reconstructed the gene ontology concepts from the mined named concepts. In the second step, we used this reconstructed data to perform concept recognition with the proposed hybrid method.
The proposed concept recognizer achieved an improvement of approximately 20% in F1-measure over the state-of-the-art system, with 0.804 precision and 0.715 recall. The results suggest that the named-concept approach may also apply to concept recognition in other professional languages.
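The abstract reports precision and recall but not the F1 value itself. As a rough illustration, the sketch below derives F1 from the two reported figures, assuming the standard balanced F1 (the harmonic mean of precision and recall). The function name `f1_score` is a hypothetical helper, and the thesis may aggregate its scores differently, so the result is indicative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_f1 = f1_score(0.804, 0.715)
print(round(reported_f1, 3))
```

Under this assumption the reported figures correspond to an F1 of roughly 0.757.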
Table of contents Abstract (Chinese)
Abstract
Acknowledgments
Contents
List of Tables
List of Figures
Table of Abbreviations
Table of Symbols
Chapter 1. Introduction
1.1 Motivation
1.2 Purpose and Specific Aims
1.3 Terminology
1.4 Organization of the Dissertation
Chapter 2. Literature Review
2.1 Dictionary-Matching Approaches
2.2 Rule-Based Approaches
2.3 Hybrid and Other Approaches
Chapter 3. Organization of GO
3.1 The structure of GO
3.2 Mining the NCs from GO
3.3 Representation of GO Concepts by NCs
3.3.1 Aggregation of the NCs
3.3.2 Simplifying the GO statements
3.4 Summary
Chapter 4. Gene Ontology Concept Recognition System
4.1 Introduction of the CRAFT Corpus
4.2 Dictionary-Matching Component
4.2.1 Preprocessing: sentence segmentation
4.2.2 Dictionary matching
4.3 Machine Learning Component
4.3.1 Candidate Generation
4.3.2 Feature Extraction
4.3.3 Creating the Labels of the Candidates
4.3.4 The Choices of Machine-Learning Models
4.4 SN Boosting
4.5 Evaluation
4.6 Summary
Chapter 5. Experimental Results
5.1 Results of the Representation of GO with NCs
5.2 Results of the Concept Recognition Systems
5.3 Analysis of the System Components
5.4 Evaluation of the Machine Learning Classifiers
Chapter 6. Discussion
6.1 Principal Findings
6.2 Generalization of the Concept Recognition System
6.3 Limitations
Chapter 7. Conclusion and Future Studies
References
References Aho, A. V., & Corasick, M. J. "Efficient string matching: an aid to bibliographic search". Communications of the ACM, 18(6), 333–340, 1975.
Aronson, A. R., & Lang, F.-M. "An overview of MetaMap: historical perspective and recent advances". Journal of the American Medical Informatics Association, 17(3), 229–236, 2010.
Blake, J. A., Christie, K. R., Dolan, M. E., Drabkin, H. J., Hill, D. P., Ni, L., … Westerfield, M. "Gene ontology consortium: Going forward". Nucleic Acids Research, 43(D1), D1049–D1056, 2014.
Blake, J. A., Dolan, M., Drabkin, H., Hill, D. P., Ni, L., Sitnikov, D., … Westerfield, M. "Gene ontology annotations and resources". Nucleic Acids Research, 41(D1), 530–535, 2013.
Bodenreider, O. "The Unified Medical Language System (UMLS): Integrating biomedical terminology". Nucleic Acids Research, 32(Database issue), D267–D270, 2004.
Campos, D., Matos, S., & Oliveira, J. L. "A modular framework for biomedical concept recognition". BMC Bioinformatics, 14(1), 281, 2013.
Campos, D., Matos, S., & Oliveira, J. L. "Gimli: Open source and high-performance biomedical name recognition". BMC Bioinformatics, 14, 54, 2013.
Corbett, P., & Murray-Rust, P. "High-Throughput Identification of Chemistry in Life Science Texts". In Computational Life Sciences II (pp. 107–118), 2006.
Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A., … Ashburner, M. "ChEBI: A database and ontology for chemical entities of biological interest". Nucleic Acids Research, 36(Database issue), D344–D350, 2008.
Duck, G., Nenadic, G., Filannino, M., Brass, A., Robertson, D. L., & Stevens, R. "A survey of bioinformatics database and software usage through mining the literature". PLoS ONE, 11(6), e0157989, 2016.
Federhen, S. "The NCBI Taxonomy database". Nucleic Acids Research, 40(Database issue), D136–D143, 2012.
Ferrucci, D., & Lally, A. "UIMA: An architectural approach to unstructured information processing in the corporate research environment". Natural Language Engineering, 10(3–4), 327–348, 2004.
Funk, C., Baumgartner, W., Garcia, B., Roeder, C., Bada, M., Cohen, K., … Leser, U. "Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters". BMC Bioinformatics, 15(1), 59, 2014.
Funk, C. S., Cohen, K. B., Hunter, L. E., & Verspoor, K. M. "Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition". Journal of Biomedical Semantics, 7(1), 52, 2016.
Gobeill, J., Pasche, E., Vishnyakova, D., & Ruch, P. "Managing the data deluge: Data-driven GO category assignment improves while complexity of functional annotation increases". Database, 2013(2013), 1–9, 2013.
Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., … Gene Ontology Consortium. "The Gene Ontology (GO) database and informatics resource". Nucleic Acids Research, 32(Database issue), D258–D261, 2004.
Jonquet, C., Shah, N. H., Cherie, H., Musen, M. A., Callendar, C., & Storey, M.-A. "NCBO Annotator: Semantic Annotation of Biomedical Data". International Semantic Web Conference, Poster, 1–3, 2009.
Koopman, B., Zuccon, G., Nguyen, A., Bergheim, A., & Grayson, N. "Automatic ICD-10 classification of cancers from free-text death certificates". International Journal of Medical Informatics, 84(11), 956–965, 2015.
Mao, Y., Van Auken, K., Li, D., Arighi, C. N., McQuilton, P., Hayman, G. T., … Lu, Z. "Overview of the gene ontology task at BioCreative IV". Database : The Journal of Biological Databases and Curation, 2014, 1–14, 2014.
Miller, N., Lacroix, E. M., & Backus, J. E. "MEDLINEplus: building and maintaining the National Library of Medicine’s consumer health Web service". Bulletin of the Medical Library Association, 88(1), 11–17, 2000.
Mujtaba, G., Shuib, L., Raj, R. G., Rajandram, R., Shaikh, K., & Al-Garadi, M. A. "Automatic ICD-10 multi-class classification of cause of death from plaintext autopsy reports through expert-driven feature selection". PLoS ONE, 12(2), e0170242, 2017.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, É. "Scikit-learn: Machine Learning in Python". Journal of Machine Learning Research, 12(2011), 2825–2830, 2012.
Rebholz-Schuhmann, D., Arregui, M., Gaudan, S., Kirsch, H., & Jimeno, A. "Text processing through web services: Calling Whatizit". Bioinformatics, 24(2), 296–298, 2008.
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., … Mesirov, J. P. "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles". Proceedings of the National Academy of Sciences, 102(43), 15545–15550, 2005.
Tanenblatt, M., Coden, A., & Sominsky, I. "The ConceptMapper Approach to Named Entity Recognition". Proceedings of the Seventh Conference on International Language Resources and Evaluation LREC10, 546–551, 2010.
The Gene Ontology Consortium. "The graph view of GO:0019852", 2019.
Thomas, P. D. "Expansion of the gene ontology knowledgebase and resources: The gene ontology consortium". Nucleic Acids Research, 45(D1), D331–D338, 2017.
Van Auken, K., Schaeffer, M. L., McQuilton, P., Laulederkind, S. J. F., Li, D., Wang, S.-J. J., … Lu, Z. "BC4GO: a full-text corpus for the BioCreative IV GO task". Database: The Journal of Biological Databases and Curation, 2014(2014), 1–9, 2014.
Verspoor, K., & Baumgartner, W. A. "Unstructured Information Management Architecture (UIMA)". In Encyclopedia of Systems Biology (pp. 2320–2324), 2013.
Verspoor, K., Cohen, K. B., Lanfranchi, A., Warner, C., Johnson, H. L., Roeder, C., … Hunter, L. E. "A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools". BMC Bioinformatics, 13(2012), 207, 2012.
Weinberger, K., Dasgupta, A., Attenberg, J., Langford, J., & Smola, A. "Feature Hashing for Large Scale Multitask Learning". Proceedings of the 26th Annual International Conference on Machine Learning (ICML), 1113–1120, 2009.
Wishart, D. S., Knox, C., Guo, A. C., Cheng, D., Shrivastava, S., Tzur, D., … Hassanali, M. "DrugBank: A knowledgebase for drugs, drug actions and drug targets". Nucleic Acids Research, 36(Database issue), D901–D906, 2008.
Yang, C.-J., Chen, Y.-D., Li, W.-G., Huang, C.-Y., & Chiang, J.-H. "GREPC: Geneontology Concept Recognition by Entity, Pattern, and Constrain". BioCreative IV, 182–188, 2013.
Yang, C.-J., & Chiang, J.-H. "Cateye: A Hint-Enabled Search Engine Framework for Biomedical Classification Systems". In New Trends in Computer Technologies and Applications (pp. 758–763), 2018.
Yang, C.-J., & Chiang, J.-H. "Gene ontology concept recognition using named concept: understanding the various presentations of the gene functions in biomedical literature". Database, 2018(2018), 1–10, 2018.
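Full-text access rights
  • Authorized for on-campus browsing/printing of the electronic full text, public from 2019-11-27.
  • Authorized for off-campus browsing/printing of the electronic full text, public from 2019-11-27.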

