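System ID: U0026-2808201814555500
Title (Chinese): 語言表徵強化LSTM-CRF方法進行生醫文獻疾病命名實體辨識
Title (English): Linguistic Representation Enhanced LSTM-CRF for Disease Named Entity Recognition in Biomedical Literature
Institution: National Cheng Kung University (成功大學)
Department (Chinese): 醫學資訊研究所
Department (English): Institute of Medical Informatics
Academic year: 106 (ROC calendar)
Semester: 2
Year of publication: 107 (ROC calendar; 2018)
Graduate student (Chinese): 楊馥謙
Graduate student (English): Fu-Chien Yang
Student ID: Q56054027
Degree: Master's
Language: English
Number of pages: 39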
Oral defense committee: Advisor - 高宏宇
Committee member - 謝孫源
Committee member - 吳宗憲
Committee member - 李強
Committee member - 王惠嘉
Chinese keywords: 疾病命名實體辨識; 類神經網路; 條件隨機場域; 生醫文獻文字探勘
English keywords: disease named entity recognition; neural network; conditional random fields; biomedical text mining
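Abstract (translated from Chinese): In biomedical research, recognizing disease named entities in biomedical literature is a key task, because the recognized entities enable further research such as extracting relations between diseases and drugs. Earlier disease named entity recognition systems relied mostly on hand-crafted feature selection and domain-specific knowledge, and were largely built on machine learning methods such as conditional random fields, whose high computational complexity makes model training difficult. The diversity of disease names makes recognition harder still. In this study we propose a system, L-LSTM-CRF, which recognizes disease named entities with a neural network architecture and an effective linguistic representation. The linguistic representation consists of four vectors: a pre-trained word embedding, a character representation and a stem character representation obtained from a one-layer convolutional neural network, and a feature group embedding. The feature group comprises disease-name dictionary lookup, common ending words, and abbreviation detection, three features that earlier systems also used frequently. To select a useful feature group, we collected six features and compared many different combinations to find the best one for our system. After obtaining the linguistic representation, we train an LSTM-CRF model to obtain the hidden features and the scores of the corresponding labels, derive the best label sequence for each sentence, and extract the disease named entities from the text. For evaluation, we collected four disease-related datasets: the CDR corpus provided by the 2015 BioCreative challenge, the disease corpus provided by the US National Center for Biotechnology Information (NCBI), the DISAE corpus, and the miRNA corpus. Our approach achieves the best results on all four datasets and reaches an F-score of 91.16% on the CDR corpus, showing that L-LSTM-CRF is a highly accurate disease named entity recognition system.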
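As a rough illustration of the F-score mentioned in the abstract, the sketch below computes precision, recall, and their harmonic mean over sets of entity spans. It assumes exact span matching, which this record does not confirm as the thesis's criterion; the function name `f_score` and the example spans are hypothetical.

```python
def f_score(gold: set, predicted: set) -> float:
    """Harmonic mean of precision and recall over exactly matching spans.

    Assumption: an entity counts as correct only if its (start, end) span
    matches a gold span exactly. The record does not state the criterion.
    """
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical character-offset spans for one document.
gold_spans = {(0, 12), (30, 47), (60, 68)}
predicted_spans = {(0, 12), (30, 47), (70, 75)}
print(f"F-score: {f_score(gold_spans, predicted_spans):.4f}")
```

Here two of the three predicted spans match gold spans, so precision and recall are both 2/3 and the F-score is 2/3 as well.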
Abstract (English): The recognition of disease named entities in biomedical literature is a crucial task in biomedical research, since it facilitates further research such as disease-chemical relation extraction. Most disease named entity recognition systems rely heavily on hand-crafted features and domain knowledge, and are therefore usually based on machine learning methods such as conditional random fields. However, this method has high computational complexity at the training stage. In addition, the diversity of disease names makes recognition more difficult. We therefore propose a system, L-LSTM-CRF, which is based on a neural network architecture with an effective linguistic representation. The representation consists of a pre-trained word embedding, character and stem character representations obtained from a convolutional neural network, and a feature group embedding composed of three powerful features that are commonly used in current systems: dictionary lookup, disease ending word, and abbreviation detection. After obtaining the linguistic representation, we pass it into the LSTM-CRF layer, which predicts the labels of each sentence and extracts the disease named entities. In the evaluation stage, we collected four disease-related corpora: the CDR corpus from the BioCreative V CDR task, the NCBI disease corpus, the miRNA corpus, and the DISAE corpus. Our approach achieves state-of-the-art performance on these disease extraction corpora and reaches an F-score of 91.16% on the CDR corpus.
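The three-feature group named in the abstract can be pictured as a small binary vector per token. The sketch below is only an illustration under stated assumptions: the dictionary entries, the ending-word list, and the all-uppercase abbreviation heuristic are hypothetical stand-ins, not the thesis's actual resources or rules, and the step that embeds these flags is omitted.

```python
# Hypothetical resources: the thesis's real dictionary, ending-word list,
# and abbreviation detector are not given in this record.
DISEASE_DICT = {"hypertension", "diabetes", "asthma"}
ENDING_WORDS = {"disease", "syndrome", "cancer", "disorder"}


def is_abbreviation(token: str) -> bool:
    """Crude stand-in heuristic: 2 to 6 characters, all uppercase letters."""
    return 2 <= len(token) <= 6 and token.isalpha() and token.isupper()


def feature_group(tokens: list[str]) -> list[tuple[int, int, int]]:
    """Return (dictionary lookup, ending word, abbreviation) flags per token."""
    features = []
    for token in tokens:
        lower = token.lower()
        features.append((
            int(lower in DISEASE_DICT),   # token appears in the dictionary
            int(lower in ENDING_WORDS),   # token is a common disease ending word
            int(is_abbreviation(token)),  # token looks like an abbreviation
        ))
    return features


sentence = ["Patients", "with", "hypertension", "or", "Parkinson",
            "disease", "(", "PD", ")"]
for token, flags in zip(sentence, feature_group(sentence)):
    print(token, flags)
```

In a system like the one described, flags of this kind would be mapped to an embedding and concatenated with the word, character, and stem character representations before the LSTM-CRF layer.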
Table of contents: Chinese Abstract
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Our approach
1.4 Paper structure
2. RELATED WORK
2.1 Conditional Random Field (CRF)
2.2 Machine learning-based system
2.3 Long Short-Term Memory
2.4 Neural network-based system
3. METHOD
3.1 Pre-processing
3.2 Linguistic Representation
3.2.1 Feature groups
3.2.2 Character and stem character representations
3.3 LSTM-CRF
4. EXPERIMENTS AND RESULTS
4.1 Dataset Description
4.1.1 CDR Corpus
4.1.2 NCBI Disease Corpus
4.1.3 DISAE Corpus
4.1.4 miRNA Corpus
4.1.5 Random Split Corpora
4.2 Evaluation metrics
4.3 Evaluation of Feature Group
4.4 Results
5. DISCUSSION
5.1 Layer Effectiveness of L-LSTM-CRF
5.2 Embedding Effectiveness of L-LSTM-CRF
5.3 Error Analysis
5.4 Impact on follow-up research
6. CONCLUSIONS
REFERENCES
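Full-text access rights
  • Authorized for on-campus browsing and printing of the electronic full text, publicly available from 2018-09-03.
  • Authorized for off-campus browsing and printing of the electronic full text, publicly available from 2018-09-03.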

