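System ID U0026-2408201011143800
Title (Chinese) 以基因本體論語意分析為基礎之必須基因預測研究
Title (English) Essential Gene Prediction Based on Gene Ontology Semantic Analysis
University National Cheng Kung University
Department (Chinese) 資訊工程學系碩博士班
Department (English) Institute of Computer Science and Information Engineering
Academic year 98 (ROC calendar)
Semester 2
Publication year 99 (ROC calendar)
Student (Chinese) 邱博義
Student (English) Po-I Chiu
Student ID p7697451
Degree Master's
Language Chinese
Pages 76
Oral defense committee Advisor: 曾新穆
Committee member: 謝孫源
Committee member: 高宏宇
Committee member: 辛致煒
Committee member: 黃宣誠
Keywords (Chinese) 資料探勘  基因本體論  必須基因  關聯規則
Keywords (English) Data Mining  Gene Ontology  Essential Gene  Association Rule Mining
Subject classification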
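Abstract (translated from Chinese) An essential gene is a gene whose knockout alone causes cell death. Such genes have been widely discussed and studied, and many researchers hope to build methods that predict essential genes in order to assist disease detection and drug development. In studies of essential genes, however, few have examined the relationship between gene function and essentiality. This study therefore focuses on the functional annotations provided by the Gene Ontology and uses association rules to mine the relationship between the Gene Ontology and essential genes in order to predict essential genes. First, we use the mined association rules to generate two features, GOARC and GOCBA, and build a CBA classification model. Next, we propose two combination methods: one combines GOARC and GOCBA with other features at the feature level, and the other combines a classification model built from the other features with the CBA classification model at the model level. Experiments show that the features generated from association rules and the resulting classification model do predict essential genes more accurately and also improve the predictive power of classification models built from other features. In addition, semantic analysis of the mined association rules through the Gene Ontology lets researchers better understand the relationship between essential genes and biological functions.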
Abstract (English) Essential genes are genes that an organism cannot survive without. They have been widely studied, and many researchers have proposed prediction methods that not only identify essential genes but also support pathogen discovery and drug development. However, few studies have focused on the relationship between biological functions and essential genes. In this study, we apply association rule mining to Gene Ontology annotations to predict essential genes. First, we propose two features, GOARC and GOCBA, and use them to enhance classifiers built with features proposed in previous studies. Second, we apply the CBA algorithm without rule pruning to predict essential genes. In addition, we propose a classification mechanism that combines the results of the CBA classifier and an SVM classifier. Experimental evaluation and semantic analysis show that our methods not only increase the precision of essential gene prediction but also speed up the understanding of the semantics of essential genes in terms of biological functions.
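The core idea in the abstract, treating each gene's Gene Ontology annotations as a transaction and mining rules of the form "set of GO terms implies essential", can be illustrated with a small sketch. This is not the thesis implementation: the gene names, the GO term labels, the thresholds, and the helper names `mine_class_rules` and `classify` are all hypothetical, and the rule search is brute force rather than an optimized frequent-itemset miner. Rules are ranked by confidence and then support, in the spirit of a CBA-style classifier without rule pruning, and the first matching rule decides the class.

```python
from itertools import combinations

# Toy data (made up for illustration): gene -> (set of GO term labels, class label).
GENES = {
    "gene1": ({"GO:a", "GO:b"}, "essential"),
    "gene2": ({"GO:a", "GO:b", "GO:c"}, "essential"),
    "gene3": ({"GO:a", "GO:c"}, "nonessential"),
    "gene4": ({"GO:c"}, "nonessential"),
    "gene5": ({"GO:b"}, "essential"),
    "gene6": ({"GO:c", "GO:d"}, "nonessential"),
}


def mine_class_rules(genes, min_support=0.3, min_confidence=0.8, max_len=2):
    """Enumerate rules (itemset -> label) that meet both thresholds.

    support    = genes containing the itemset AND carrying the label / all genes
    confidence = genes containing the itemset AND carrying the label
                 / genes containing the itemset
    """
    n = len(genes)
    terms = sorted(set().union(*(t for t, _ in genes.values())))
    labels = sorted({label for _, label in genes.values()})
    rules = []
    for k in range(1, max_len + 1):
        for itemset in combinations(terms, k):
            items = frozenset(itemset)
            covered = [label for t, label in genes.values() if items <= t]
            if not covered:
                continue
            for label in labels:
                hits = covered.count(label)
                support = hits / n
                confidence = hits / len(covered)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((items, label, support, confidence))
    # Rank: higher confidence, then higher support, then shorter antecedent.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0]), sorted(r[0])))
    return rules


def classify(terms, rules, default="nonessential"):
    """Return the label of the first ranked rule whose antecedent matches."""
    for items, label, _, _ in rules:
        if items <= terms:
            return label
    return default  # assumed default class when no rule fires


rules = mine_class_rules(GENES)
for items, label, support, confidence in rules:
    print(sorted(items), "->", label, f"sup={support:.2f}", f"conf={confidence:.2f}")
print(classify({"GO:b", "GO:d"}, rules))
```

On this toy data only two rules survive the thresholds, both predicting the essential class, so a gene annotated with the hypothetical term `GO:b` is labelled essential while a gene with no matching rule falls back to the assumed default class.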
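Table of contents Abstract (Chinese)
ABSTRACT
Contents
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Research Methods
1.5 Contributions
1.6 Thesis Organization
Chapter 2 Related Work
2.1 Studies Predicting Essential Genes with Different Features
2.1.1 Summary
2.2 Association Rules
2.2.1 Frequent Itemset Mining
2.2.2 Generating Association Rules
2.2.3 Frequent Closed Itemset Mining
2.3 Classification
2.3.1 SVM
2.3.2 CBA
2.4 Gene Ontology
2.4.1 Ontology
2.4.2 Molecular Function
2.4.3 Biological Process
2.4.4 Cellular Component
Chapter 3 Methodology and Design
3.1 Method Framework
3.2 GO Association Rules
3.2.1 Conversion to a Transaction Database
3.2.2 Association Rule Mining
3.3 Feature Extraction
3.3.1 Features from Related Work
3.3.2 Topological Features
3.3.3 GOARC and GOCBA
3.4 Undersampling
3.5 Building the Classification Model
Chapter 4 Experimental Analysis
4.1 Experimental Data and Basic Settings
4.2 Experimental Plan
4.3 Experimental Results
4.3.1 Basic Settings for the Classification Problem
4.3.2 Settings for GO Association Rules
4.3.3 Comparison of Classification Features
4.3.4 Comparison of Classification Models
4.3.5 Discussion of GO Association Rules
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References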
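Full-text access permissions
  • Authorized for on-campus browsing and printing of the electronic full text, publicly available from 2012-08-27.
  • Authorized for off-campus browsing and printing of the electronic full text, publicly available from 2012-08-27.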