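The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.
(※ If the thesis cannot be found, or its holding status shows "closed stacks, not public", the copy is not in the stacks and cannot be accessed.)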
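System ID U0026-2708202013002800
Title (Chinese) 選擇性強化資料擴增之生物醫學文獻疾病命名實體識別
Title (English) Selective Data Augmentation for Disease Named Entity Recognition in Biomedical Literature
University National Cheng Kung University
Department (Chinese) 醫學資訊研究所
Department (English) Institute of Medical Informatics
Academic year 108 (ROC calendar)
Semester 2
Year of publication 109 (ROC calendar; 2020 CE)
Author (Chinese) 丁怡婷
Author (English) Yi-Ting Ding
Student ID Q56074093
Degree Master's
Language English
Pages 68
Oral defense committee Advisor: 高宏宇
Committee member: 蔡宗翰
Committee member: 王惠嘉
Committee member: 謝孫源
Committee member: 李政德
Keywords (Chinese) 疾病命名實體識別  資料擴增  條件隨機場域  類神經網路  生醫文獻文字探勘
Keywords (English) Disease named entity recognition  Data augmentation  Conditional random fields  Neural network  Biomedical text mining
Subject classification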
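Abstract (Chinese) In biomedical research, disease named entity recognition is a foundational task in biomedical literature mining and natural language processing. Recognizing the various named entities enables further research, such as the study of associations between diseases and genes. Most disease named entity recognition methods rely on hand-crafted features and specialized domain knowledge. In addition, collecting a training set is time-consuming and requires many human annotators to label the data. Recently, most named entity recognition systems have focused on using pre-trained word embeddings to avoid manual feature engineering, and on training neural network models to reduce the high computational complexity of earlier systems. Although feature engineering has become more efficient, this still does not solve the shortage of data in the biomedical domain. We propose a selective data augmentation named entity recognition system, called "SDA-LSTM-CRF", which uses the training set to generate a large amount of augmented data and applies a selection mechanism to evaluate the quality of the augmented data and its benefit to model training. Combining augmented data of different kinds increases the diversity of the original training set, while the selective evaluation mechanism retains the more effective augmented data, improving the model's overall learning and prediction performance.
We evaluate our method on two biomedical corpora: the CDR Corpus provided by the 2015 BioCreative challenge, and the disease corpus provided by the US National Center for Biotechnology Information (NCBI). Our system performs well in experiments on both corpora. We also analyze the system from several angles, comparing its augmentation methods and selection mechanisms against one another to assess the strengths and importance of each component that allows SDA-LSTM-CRF to perform well.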
Abstract (English) The recognition of disease named entities in biomedical literature is a foundational task in biomedical text mining and natural language processing. Its output supports further research, such as relation extraction between diseases and genes. Most disease entity recognition methods rely on hand-crafted features and specialized domain knowledge. In addition, compiling a training set is costly because it requires collecting raw data and labeling it. Recently, most named entity recognition systems avoid hand-crafted feature engineering by using pre-trained word embeddings, and reduce the high computational complexity of earlier systems by introducing neural network models. These methods make feature design more efficient but still cannot address the problem of insufficient data in biomedical literature. We propose a selective data augmentation approach to named entity recognition (NER), SDA-LSTM-CRF, that uses the original training dataset to generate a large amount of new training data. A selection function is then used to evaluate the quality of the augmented data and its benefit to model training. We evaluate on two biomedical corpora: the CDR corpus from the BioCreative V chemical disease relation (CDR) task and the NCBI disease corpus. SDA-LSTM-CRF achieves good performance on the NER task on both corpora.
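As an illustration of the general idea in the abstract (augment the training sentences, then keep only some of the augmented ones), the following is a minimal Python sketch and not the thesis's implementation. Everything in it is an assumption made for illustration: the toy `SYNONYMS` table, the function names, the BIO-style labels, and in particular the selection rule, which here is a simple threshold on the fraction of changed tokens. This record does not describe the actual selection criterion used by SDA-LSTM-CRF.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only).
SYNONYMS = {
    "causes": ["induces", "triggers"],
    "severe": ["serious", "acute"],
    "patients": ["subjects"],
}


def synonym_augment(tokens, labels, rng, p=0.5):
    """Replace non-entity tokens (label "O") with a synonym with probability p.

    Entity tokens are left untouched, so the label sequence stays aligned.
    """
    new_tokens = []
    for token, label in zip(tokens, labels):
        options = SYNONYMS.get(token.lower())
        if label == "O" and options and rng.random() < p:
            new_tokens.append(rng.choice(options))
        else:
            new_tokens.append(token)
    return new_tokens, list(labels)


def changed_ratio(original, augmented):
    """Fraction of token positions that differ between the two sentences."""
    if not original:
        return 0.0
    changed = sum(1 for a, b in zip(original, augmented) if a != b)
    return changed / len(original)


def select(original, candidates, min_ratio=0.1, max_ratio=0.5):
    """Placeholder selection rule (assumed): keep candidates that changed
    enough to add diversity but not so much that they drift from the source."""
    kept = []
    for tokens, labels in candidates:
        if min_ratio <= changed_ratio(original, tokens) <= max_ratio:
            kept.append((tokens, labels))
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["Cisplatin", "causes", "severe", "nephrotoxicity", "in", "patients"]
    labels = ["O", "O", "O", "B-Disease", "O", "O"]
    candidates = [synonym_augment(tokens, labels, rng) for _ in range(5)]
    for kept_tokens, kept_labels in select(tokens, candidates):
        print(list(zip(kept_tokens, kept_labels)))
```

The selected sentences would then be added to the original training set before training a sequence tagger; a model-based score could replace `changed_ratio` without changing the surrounding structure.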
Table of contents Chinese Abstract III
ABSTRACT IV
Acknowledgements V
LIST OF FIGURES VIII
LIST OF TABLES X
1. INTRODUCTION 1
1.1 Background 1
1.2 Motivation 4
1.3 Our approach 6
1.4 Paper structure 7
2. RELATED WORK 8
2.1 Conditional Random Field (CRF) 8
2.2 Machine Learning-Based System 9
2.3 Long Short-Term Memory (LSTM) 10
2.4 Neural Network-Based System 11
2.5 Supervised Model with Data Label Generation 13
2.6 Data Augmentation 16
3. METHOD 17
3.1 Overview 17
3.2 Pre-processing 18
3.3 Data Augmentation 18
3.3.1 Synonym Augmentation 19
3.3.2 Translation Augmentation 20
3.4 Comparative-relatively Selection 21
3.4.1 None-relatively Selective Data Augmentation 23
3.4.2 Single-relatively Selective Data Augmentation 24
3.4.3 Cross-relatively Selective Data Augmentation 28
3.5 BiLSTM-CRF 33
4. EXPERIMENTS AND RESULTS 34
4.1 Dataset Description 34
4.1.1 CDR Corpus 34
4.1.2 NCBI Disease Corpus 36
4.2 Evaluation Metrics 36
4.3 Result 37
5. ANALYSIS 40
5.1 Evaluation of Model Function 40
5.1.1 Evaluation of Augmented Data 40
5.1.2 Evaluation of Partial Augmented 42
5.2 Data Augmentation Performance 45
5.3 Comparative-relatively Selection Performance 48
5.3.1 Partial Analysis 48
5.3.2 Overview 51
5.4 Ablation Study 52
5.5 Comparison 54
5.5.1 General Sample 54
5.5.2 Random Sample 58
6. DISCUSSION 60
6.1 Synonym Augmentation 60
6.2 Translation Augmentation 62
7. CONCLUSIONS 65
REFERENCE 66
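Full-text access rights
  • Authorized for on-campus browsing/printing of the electronic full text, available from 2021-09-01.
  • Authorized for off-campus browsing/printing of the electronic full text, available from 2021-09-01.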