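System ID U0026-0607201722443100
Title (Chinese) 文章探勘和資料探勘在癌症研究之運用
Title (English) Application of text mining and data mining in cancer research
University National Cheng Kung University
Department (Chinese) 基礎醫學研究所
Department (English) Institute of Basic Medical Sciences
Academic year 105 (ROC calendar)
Semester 2
Publication year 106 (ROC calendar; 2017 CE)
Author (Chinese) 陳疇丞
Author (English) Chou-Cheng Chen
Student ID S58971390
Degree Doctoral
Language English
Pages 93
Oral defense committee Advisor: 何中良
Convener: 謝達斌
Convener: 王憶卿
Convener: 賴明德
Committee member: 曾新穆
Committee member: 薛佑玲
Keywords (Chinese) 文章探勘 (text mining)  資料探勘 (data mining)  幹細胞 (stem cell)  大腸癌 (colorectal cancer)  肝癌 (liver cancer)
Keywords (English) data mining  text mining  stem cell  colorectal cancer  liver cancer  cancer stem cell  TCGA
Subject classification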
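Abstract (translated from Chinese) This study mainly used text mining and data mining to screen for genes related to liver cancer and colorectal cancer. We used EST data mining and literature review to identify fifty genes of unknown function, and found by experiment that ZNF496, RMI2 and U41 may be WNT target genes associated with liver cancer. For known genes, I used text-mining tools that I wrote myself, PubstractHelper and StemTextSearch, together with the existing data-mining resources GeneCards and NCBI GEO, to identify twenty candidate genes; experiments showed that IGF2BP1 may be an oncofetal circulating stem cell-like marker of recurrent liver cancer. I also mined TCGA data and found twenty-three genes possibly associated with colorectal cancer; we finally selected three genes for which antibodies could be purchased and used immunohistochemical staining to test whether they are expressed in colorectal cancer. This study shows that text mining and data mining can help scientists narrow down candidate genes that may be associated with cancer.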
Abstract (English) This study used text and data mining to select candidate genes associated with liver and colorectal cancer. Fifty candidate genes of unknown function were selected by data mining EST libraries, and experiments suggested that ZNF496, RMI2 and U41 may be WNT target genes associated with liver cancer. Twenty known candidate genes were selected by text mining with PubstractHelper and StemTextSearch and by data mining GeneCards and NCBI GEO. Experiments indicated that IGF2BP1 may be an oncofetal circulating cancer stem cell-like marker associated with the recurrence of hepatocellular carcinoma. Twenty-three candidate genes were selected by data mining TCGA (The Cancer Genome Atlas) data, and the three remaining candidate genes were examined by IHC (immunohistochemistry) for expression in colorectal cancer. This study shows that text and data mining are alternative methods for helping scientists narrow down candidate genes associated with cancer.
Table of contents
Contents
Introduction ..... 8
Methods ..... 12
Results ..... 25
Discussion ..... 32
Conclusion ..... 36
References ..... 36
Appendix 1 ..... 74
Appendix 2 ..... 78
Appendix 3 ..... 79
Figure Contents
Figure 1. The flowchart of the unknown gene selection and the main steps of the computational procedure ..... 41
Figure 2. An example query from PubstractHelper ..... 42
Figure 3. The flowchart of the StemTextSearch database and the main steps of the computational procedure ..... 43
Figure 4. The flowchart of the collection of stem-cell terms and the main steps of the computational procedure ..... 44
Figure 5. An example of a sentence in the new abstract which is generated by replacing the gene name by the symbol ‘*’ ..... 45
Figure 6. Example token sentences ..... 46
Figure 7. Example results defined by the R score ..... 47
Figure 8. Example training corpus ..... 48
Figure 9. The flowchart of gene selection from TCGA and the main steps of the computational procedure ..... 49
Figure 10. An example of the results obtained by the algorithm, with the algorithm shown in Appendix 2 ..... 50
Figure 11. Example of the results containing multiple gene names or multiple stem-cell terms ..... 51
Figure 12. An example of the results on the web interface from a user query ..... 53
Figure 13. RT-PCR results of twenty candidate genes in liver cancer cell lines and whole blood tissue ..... 54
Figure 14. RT-PCR results of twenty candidate genes in oncofetal pattern and IPS ..... 55
Figure 15. RT-PCR results of IGF2BP1 in normal (N) and cancer (T) tissue ..... 56
Table Contents
Table 1. Sixty-five genes associated with the Wnt pathway and EST libraries ..... 57
Table 2. The abbreviations of species terms were produced from taxdump.tar.gz ..... 61
Table 3. Twenty remaining genes were selected by text mining, data mining and literature review ..... 62
Table 4. Primers of each gene for RT-PCR ..... 63
Table 5. Primers of HNF4A and MSI1 for RT-qPCR ..... 64
Table 6. Twenty-three remaining genes selected by data mining ..... 65
Table 7. Fifty remaining unknown genes were selected by data mining and literature review ..... 67
Table 8. Precision and recall of each step ..... 69
Table 9. RT-qPCR results of HNF4A and MSI1 in whole blood ..... 70
Table 10. RT-qPCR results of HNF4A and MSI1 in normal tissue and cancer tissue ..... 71
Table 11. RT-qPCR results of IGF2BP1 (IMP1), U41 and Lin28B in liver tissue ..... 72
Table 12. The published information of 23 genes obtained by PubMed query ..... 73
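Full-text access permissions
  • Authorized for on-campus browsing/printing of the electronic full text; publicly available from 2017-07-27.
  • Authorized for off-campus browsing/printing of the electronic full text; publicly available from 2017-07-27.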