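System ID U0026-0812200910373291
Title (Chinese) 應用資料挖掘技術推論未知PLTP序列相關文獻 (Applying data mining techniques to infer the literature related to unknown PLTP sequences)
Title (English) Using data mining technique to find unknown PLTP sequences literatures
University National Cheng Kung University (成功大學)
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 91 (ROC calendar)
Semester 2
Year of publication 92 (ROC calendar)
Student (Chinese) 郭弘志
Student (English) Hung-Chih Kuo
E-mail frankie@mail2000.com.tw
Student ID r7690403
Degree Master's
Language Chinese
Number of pages 48
Oral defense committee Advisor: 王惠嘉
Committee member: 陳虹樺
Committee member: 黃宇翔
Keywords (Chinese) bioinformatics  decision tree  data mining  sequence annotation
Keywords (English) family protein  data mining  sequences annotation  bioinformatics  decision tree
Subject classification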
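Abstract (translated from Chinese) In the post-genomic era, the focus of genome analysis has shifted from structural genomics to functional genomics. Structural genomics is mainly concerned with gene mapping and obtaining gene sequence data, and information technology at that stage mainly helps manage the data. Functional genomics uses those sequence data for further analysis to discover the functions that the sequences correspond to, and the main role of information technology shifts to acquiring and sharing knowledge. As the work moves from structured to unstructured data, from explicit to tacit knowledge, and from managing data to using data to generate knowledge, support from bioinformatics is becoming increasingly important for animal and plant gene research.
In plant gene research, plant lipid transfer protein (PLTP) has been shown to take part in the growth and development of many plant species, so the role PLTP plays in plants deserves investigation. PLTP is a family protein, and there is currently no efficient way to annotate gene sequences of family proteins. We identify four main problems that the traditional manual annotation of family-protein gene sequences faces: large numbers of sequences, complex analysis tools, literature that cannot be managed, and analysis results that cannot be reused. We reconsider this kind of gene annotation problem from a knowledge management perspective and develop a framework called Knowledge sharing for plant lipid transfer protein (KS-PLTP). KS-PLTP uses the rules produced by a data mining decision tree algorithm as the interface that links "unknown sequences" to the "PLTP knowledge base". The main work in the KS-PLTP framework is divided into three modules, Refinement, Automation, and Retrieval, which together address the four problems faced in traditional PLTP sequence annotation. Compared with traditional annotation, biologists using KS-PLTP can annotate family-protein genes more efficiently, and complex biological data become better managed.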
Abstract (English) Plant lipid transfer protein (PLTP) is a family protein that recent studies have shown to be involved in the growth and development of many plant species. However, there is currently no efficient approach to annotating family proteins. We identify four problems that the traditional annotation approach faces: high-throughput sequences, complex analysis tools, disorganized literature, and analysis results that cannot be reused. In this project, we revise the traditional annotation approach from a knowledge management perspective. We develop a framework for the genome annotation of PLTP, which we call the "knowledge sharing for plant lipid transfer protein" system (KS-PLTP). The main task of KS-PLTP is to provide an interface that maps "unknown sequences" to the "PLTP knowledge database" by means of data mining. KS-PLTP is divided into three modules: refinement, automation, and retrieval. These modules address the four main problems of the traditional PLTP annotation process. Compared with the traditional approach, annotation of family proteins becomes more productive and efficient, and biological data become more manageable.
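The idea of using decision tree rules as the interface between an unknown sequence and a knowledge base can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two attributes (sequence length and cysteine count), the thresholds, the group names, and the knowledge base entries are hypothetical and are not taken from the thesis, whose actual attributes and C4.5 rules are described in its chapter 4.

```python
from dataclasses import dataclass


@dataclass
class SequenceFeatures:
    length: int          # number of residues in the sequence
    cysteine_count: int  # number of 'C' residues


def extract_features(seq: str) -> SequenceFeatures:
    """Turn a raw protein sequence into the attributes the rules test."""
    seq = seq.upper()
    return SequenceFeatures(length=len(seq), cysteine_count=seq.count("C"))


def classify(features: SequenceFeatures) -> str:
    """Hypothetical rules in the if/else shape a decision tree learner emits.

    The thresholds below are illustrative only, not learned from data.
    """
    if features.cysteine_count < 8:
        return "not_PLTP"
    if features.length <= 100:
        return "group_A"
    return "group_B"


# Hypothetical knowledge base: group label -> literature notes for that group.
KNOWLEDGE_BASE = {
    "group_A": ["placeholder literature entry for group A"],
    "group_B": ["placeholder literature entry for group B"],
}


def retrieve(seq: str) -> tuple[str, list[str]]:
    """Map an unknown sequence to a group, then look up its literature."""
    group = classify(extract_features(seq))
    return group, KNOWLEDGE_BASE.get(group, [])


if __name__ == "__main__":
    group, literature = retrieve("C" * 8 + "A" * 50)
    print(group, literature)
```

The point of the design is that the rules are the only coupling between the two sides: the knowledge base can grow without changing how sequences are classified, and retrained rules can replace `classify` without touching the stored literature.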
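Table of contents English abstract
Chinese abstract
Acknowledgements
Table of contents
List of figures
List of tables
1. Introduction
1.1. Research background
1.2. Research motivation and objectives
1.3. Organization of the thesis
2. Literature review
2.1. Applications of information technology and knowledge management concepts to biological data
2.2. The importance of PLTP sequences
2.3. Information filtering
2.4. Decision tree methods in data mining
2.5. Preparing biological data for data mining
3. Research method
3.1. Research framework
3.1.1. Similarities and differences between related research and KS-PLTP
3.1.2. Research framework model
3.2. The Refinement module of KS-PLTP
3.3. The Automation module of KS-PLTP
3.4. The Retrieval module of KS-PLTP
4. System implementation and validation
4.1. Description of the system construction process
4.2. Building the PLTP sequences pattern database
4.2.1. Obtaining the data sources
4.2.2. Grouping PLTP sequences using biological software tools and the domain knowledge of biology researchers
4.2.3. Determining the attributes of PLTP sequences
4.2.4. Obtaining the possible values of the attributes
4.2.5. Recommendations and lessons learned from building the PLTP sequences pattern database
4.3. Building the PLTP knowledge database
4.4. Linking unknown sequences to the PLTP knowledge database
4.5. Learning decision tree rules with data mining
4.5.1. Rules and statistics produced by the C4.5 decision tree algorithm, and interpretation of the rules
4.5.2. Problems arising from applying the C4.5 decision tree algorithm to PLTP, and their solutions
4.5.3. Comparison of the C4.5 decision tree algorithm with the backpropagation neural network algorithm
4.6. Empirical validation: using KS-PLTP to find PLTP sequences in Phalaenopsis
5. Conclusions and future research directions
5.1. Research results
5.2. Future research directions

References
Appendix 1 Description of the KS-PLTP system programs
Appendix 2 Potential PLTP sequence patterns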
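Full-text access rights
  • Authorization granted for on-campus viewing/printing of the electronic full text, available from 2003-07-07.
  • Authorization granted for off-campus viewing/printing of the electronic full text, available from 2003-07-07.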