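System ID U0026-0808201815585700
Title (Chinese) 透過生醫文獻探勘發現與分析測序基因集之特徵
Title (English) Gene Panel Characteristics Discovery and Analysis by Biomedical Literature Mining
University National Cheng Kung University (成功大學)
Department (Chinese) 資訊工程學系
Department (English) Institute of Computer Science and Information Engineering
Academic year 106 (ROC calendar; 2017–2018)
Semester 2
Year of publication 107 (ROC calendar; 2018)
Student (Chinese) 劉宸睿
Student (English) Chen-Ruei Liu
E-mail william0704@iir.csie.ncku.edu.tw
Student ID P76054648
Degree Master's
Language English
Number of pages 41
Oral defense committee Advisor: 蔣榮先
Co-advisor: 林鵬展
Committee member: 沈孟儒
Committee member: 李宗儒
Committee member: 郝沛毅
Keywords (Chinese, translated) gene panel; biomedical literature mining; topic modeling; decision tree; information retrieval
Keywords (English) biomedical text mining; gene panel test; decision tree; topic modeling; information retrieval
Subject classification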
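Abstract (Chinese, translated) A gene panel test is a gene sequencing test targeted at tumors, used to detect gene mutations in both common and rare cancers. Its results can help clinicians make more suitable diagnoses for their patients. However, which genes are selected into a gene panel is a question worth exploring, and the reasons are usually not clearly known. A gene may be included because of its pathogenicity or its signaling pathway, or simply because there is a wish to test it. This study therefore developed a biomedical literature mining system that can analyze genes in a rational manner and discover the characteristics of gene panels, using MSK-IMPACT™, the gene panel test developed by Memorial Sloan Kettering Cancer Center, as an example.
For the sake of biomedical explainability, this study analyzes gene panels with biomedical literature mining techniques. The system uses two machine learning algorithms: topic modeling and decision trees. We hope to use machine learning to uncover information hidden in a large body of literature. However, although some machine learning methods achieve good precision and recall, they cannot adequately explain their results, and explainability is especially needed in biomedical literature mining. We therefore chose decision trees, whose decision paths can be read directly, and topic modeling, whose results can be interpreted well, to analyze a large body of biomedical literature related to human genes.
The experimental results show that this study can reasonably represent genes of interest by the biomedical terms related to them, and can find different characteristics for different gene panels. In addition, for both the topic model and the decision tree, we can give biologically meaningful explanations of the results and verify their correctness against existing databases. In the case studies, we found clear similarities between the results of the decision tree and those of the topic model. These results show that the biomedical literature mining system we developed can analyze gene panels in an explainable way, and that its results can help clinicians and bioinformatics researchers analyze gene panels.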
Abstract (English) A gene panel test is a targeted tumor sequencing test used to detect gene mutations in both rare and common cancers. The test result allows doctors to quickly find out whether a patient’s tumor carries clinically useful mutations and to match patients with the available therapies or clinical trials that will benefit them most. A given gene panel typically contains hundreds of selected genes, but we usually do not know why or how those genes were selected. To better understand gene panels, we developed a biomedical literature mining pipeline that can analyze the function of a gene panel and of the genes in it. Our study used the gene panel test developed by Memorial Sloan Kettering Cancer Center, MSK-IMPACT™, as an example.
For the sake of biomedical explainability, our study used biomedical literature mining to perform characteristic analysis on the gene panel. We want to extract useful information from a large-scale corpus with machine learning algorithms. However, although many machine learning algorithms provide good precision and recall, their results are hard to interpret. We therefore chose decision trees and topic modeling to analyze the literature related to human genes, since decision trees provide a clear decision-making process and the results of topic modeling are well suited to interpretation in terms of biomedical concepts.
The experimental results show that our approach can not only represent the genes of interest in a rational manner but also find different characteristics of the gene panel. In addition, we can give an appropriate biomedical explanation for the results of both the decision tree and topic modeling, and verify them against a manually curated pathway database. In the case studies, we also find that the decision tree and topic modeling give similar results. We hope that our study can help doctors make decisions and help bioinformatics researchers understand gene panels in more detail.
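The idea of representing each gene by its related biomedical terms (Section 3.2.2 of the table of contents, "Gene-Feature TF-IDF matrix Construction") can be illustrated with a minimal sketch. This is not the thesis implementation: the gene-to-term lists below are hypothetical toy data, the function names build_tfidf and top_terms are made up for illustration, and the weighting is one plain TF-IDF variant (relative term frequency times log inverse gene frequency) chosen here as an assumption. The tagging steps (PubTator, MeSH) and the gene window are left out.

```python
import math
from collections import Counter

# Hypothetical toy input: for each gene, the biomedical terms tagged near
# its mentions in the literature (not real data from the thesis).
GENE_TERMS = {
    "BRCA1": ["breast neoplasms", "dna repair", "dna repair", "ovarian neoplasms"],
    "TP53": ["apoptosis", "dna repair", "breast neoplasms"],
    "EGFR": ["lung neoplasms", "signal transduction", "signal transduction"],
}


def build_tfidf(gene_terms):
    """Return (vocab, matrix); matrix[gene] is a TF-IDF row aligned with vocab."""
    n_genes = len(gene_terms)
    # Gene frequency: in how many genes' term lists each term appears.
    gene_freq = Counter()
    for terms in gene_terms.values():
        gene_freq.update(set(terms))
    vocab = sorted(gene_freq)
    matrix = {}
    for gene, terms in gene_terms.items():
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for genes with no terms
        matrix[gene] = [
            (counts[t] / total) * math.log(n_genes / gene_freq[t]) for t in vocab
        ]
    return vocab, matrix


def top_terms(gene, vocab, matrix, k=2):
    """Return up to k terms with the highest positive TF-IDF weight for a gene."""
    ranked = sorted(zip(vocab, matrix[gene]), key=lambda pair: pair[1], reverse=True)
    return [term for term, weight in ranked[:k] if weight > 0]


vocab, matrix = build_tfidf(GENE_TERMS)
for gene in GENE_TERMS:
    print(gene, top_terms(gene, vocab, matrix))
```

Rows of such a matrix could then be passed to a topic model or a decision tree. A term shared by every gene receives weight zero under this weighting, which is why terms specific to a gene rank highest.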
Table of Contents Chinese Abstract I
ABSTRACT III
ACKNOWLEDGEMENT V
CONTENTS VI
LIST OF TABLES VIII
LIST OF FIGURES IX
Chapter 1. Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Research Objective 2
1.4 Thesis Organization 3
Chapter 2. Related Work 4
2.1 MSK-IMPACT™ Gene Panel Test 4
2.2 Gene2Vec Based on Deep Learning 4
2.3 Biomedical Literature Mining 5
2.3.1 Named Entity Recognition 5
2.3.2 Machine-Learning-Based Approach 6
Chapter 3. Materials and Methods 7
3.1 Biomedical Term Tagging 8
3.1.1 Biomedical Term Tagging with Pubtator 9
3.1.2 Biomedical Term Tagging with MeSH 10
3.2 Gene-Feature Matrix Construction 11
3.2.1 Gene Window 12
3.2.2 Gene-Feature TF-IDF Matrix Construction 12
3.3 Feature Selection 13
3.4 Gene Panel Characteristics Discovering 14
3.4.1 Topic Modeling 14
3.4.2 Decision Tree 16
Chapter 4. Experiments 18
4.1 Experimental Design 18
4.2 Study of Gene Feature Extracting 19
4.3 Study of Feature Selection 21
4.4 Panel Characteristic Discovery 24
4.4.1 Topic Model 24
4.4.2 Decision Tree 29
4.5 Case Study 33
Chapter 5. Conclusion and Future Work 37
5.1 Conclusions 37
5.2 Future Work 38
References 40
References [1] M. Tischkowitz et al., “Gene-Panel Sequencing and the Prediction of Breast-Cancer Risk,” 2015.
[2] D. T. Cheng et al., “Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): A hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology,” J. Mol. Diagnostics, vol. 17, no. 3, pp. 251–264, 2015.
[3] D. M. Hyman et al., “Precision medicine at Memorial Sloan Kettering Cancer Center: Clinical next-generation sequencing enabling next-generation targeted therapy trials,” Drug Discov. Today, vol. 20, no. 12, pp. 1422–1428, 2015.
[4] A. Zehir et al., “Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients,” Nat. Med., vol. 23, no. 6, pp. 703–713, 2017.
[5] J. Du, P. Jia, Y. Dai, C. Tao, Z. Zhao, and D. Zhi, “Gene2Vec: Distributed Representation of Genes Based on Co-Expression,” bioRxiv, p. 13, 2018.
[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” pp. 1–9.
[7] C. H. Wei, H. Y. Kao, and Z. Lu, “GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains,” Biomed Res. Int., vol. 2015, 2015.
[8] R. Leaman, R. I. Doğan, and Z. Lu, “DNorm: Disease name normalization with pairwise learning to rank,” Bioinformatics, vol. 29, no. 22, pp. 2909–2917, 2013.
[9] C. H. Wei, H. Y. Kao, and Z. Lu, “PubTator: a web-based text mining tool for assisting biocuration,” Nucleic Acids Res., vol. 41, no. Web Server issue, pp. 518–522, 2013.
[10] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manag., vol. 24, no. 5, pp. 513–523, 1988.
[11] M. Ikonomakis, S. Kotsiantis, and V. Tampakas, “Text classification using machine learning techniques,” WSEAS Trans. Comput., vol. 4, no. 8, pp. 966–974, 2005.
[12] W. Xu, X. Liu, and Y. Gong, “Document clustering based on non-negative matrix factorization,” Proc. 26th Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retr. - SIGIR ’03, p. 267, 2003.
[13] J. Choo, C. Lee, C. K. Reddy, and H. Park, “UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization,” Vis. Comput. Graph. IEEE Trans., vol. 19, no. 12, pp. 1992–2001, 2013.
[14] H. Larson, Introduction to Probability Theory and Statistical Inference. New York: John Wiley & Sons, 1982.
[15] L. Yeganova, W. Kim, S. Kim, and W. J. Wilbur, “Retro: Concept-based clustering of biomedical topical sets,” Bioinformatics, vol. 30, no. 22, pp. 3240–3248, 2014.
[16] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2012.
[17] S. A. Forbes et al., “COSMIC: Somatic cancer genetics at high-resolution,” Nucleic Acids Res., vol. 45, no. D1, pp. D777–D783, 2017.
[18] A. Fabregat et al., “The Reactome Pathway Knowledgebase,” Nucleic Acids Res., vol. 46, no. D1, pp. D649–D655, 2018.
[19] N. Rappaport et al., “MalaCards: An amalgamated human disease compendium with diverse clinical and genetic annotation and structured search,” Nucleic Acids Res., vol. 45, no. D1, pp. D877–D887, 2017.
[20] M. Ashburner et al., “Gene ontology: Tool for the unification of biology,” Nat. Genet., vol. 25, no. 1, pp. 25–29, 2000.
[21] S. Carbon et al., “Expansion of the gene ontology knowledgebase and resources: The gene ontology consortium,” Nucleic Acids Res., vol. 45, no. D1, pp. D331–D338, 2017.
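Full-text access permissions
  • Authorized for on-campus browsing and printing of the electronic full text; publicly available from 2019-08-08.
  • Authorized for off-campus browsing and printing of the electronic full text; publicly available from 2019-08-08.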