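System ID U0026-1806201312381200
Title (Chinese) 多項式馬可夫簡易貝氏分類器結合狄氏先驗分配於基因序列分類之研究
Title (English) Dirichlet Priors for Markov Naïve Bayesian Classifiers with Multinomial Model for Gene Sequence Data
Institution National Cheng Kung University (成功大學)
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic Year 101 (ROC calendar)
Semester 2
Publication Year 102 (ROC calendar; 2013 CE)
Student (Chinese) 韓昀達
Student (English) Yun-Da Han
Student ID r76001095
Degree Master's
Language Chinese
Pages 82
Oral Defense Committee Advisor: 翁慈宗
Keywords (Chinese) 狄氏分配  基因序列資料  馬可夫模型  簡易貝氏分類器
Keywords (English) Dirichlet distribution  gene sequence data  Markov model  naïve Bayesian classifier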
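Abstract (translated from Chinese) In recent years, with the development of metagenomics and sequencing technology, biologists no longer study ecological environments in the traditional way, because the species that can be cultured in a laboratory are very limited, amounting to only about one percent of those in an ecological environment. Metagenomic research makes it possible to collect microbial samples directly from the environment, and sequencing technology lets biologists learn more about those species and explore the species diversity of the environment. During classification, an N-mer sliding window is used to extract features from gene sequence data. Adjacent extracted features share (N-1) characters, so the extracted feature set is correlated, which contradicts the conditional independence assumption of the naïve Bayesian classifier. This study uses the Markov naïve Bayesian classifier to handle gene sequence data, a classification problem with high dimensionality and heavy computational demands. The choice is motivated not only by the classifier's computational efficiency but also by the fact that incorporating a Markov model mitigates the problem with the conditional independence assumption of the naïve Bayesian classifier. The study adopts the multinomial model as the probability model, which takes feature frequencies into account when computing likelihoods and is therefore expected to classify better. In addition, the study introduces a prior distribution, the Dirichlet distribution, with the aim of improving classification accuracy by combining it with the Markov naïve Bayesian classifier and setting two kinds of prior parameters: numerator prior parameters and denominator prior parameters. Two approaches, Dirichlet_numerator-denominator and Dirichlet_denominator-numerator, are tested on four gene sequence data sets. The empirical results show that the Dirichlet_numerator-denominator approach, which within a class value adjusts the numerator parameters first and the denominator parameters second, gives better classification results. After parameter tuning, both approaches achieve higher classification accuracy than the RDP classifier. Compared with the naïve Bayesian classifier combined with the Dirichlet distribution, they have an additional denominator prior parameter available for tuning and therefore achieve better classification results. The multinomial Markov naïve Bayesian classifier combined with the Dirichlet distribution thus does improve classification accuracy effectively through the setting of prior parameters.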
Abstract (English) With the development of metagenomics and sequencing, biologists no longer have to rely on culturing microbes in a laboratory, where fewer than one percent of the microbes living in an ecological environment can be grown. To explore the diversity of species, biologists use metagenomic technologies to extract samples directly from an ecological environment. In classifying gene sequence reads, an N-mer sliding window is generally used to extract features, and two adjacent features have N-1 letters in common. This strongly violates the conditional independence assumption of the naïve Bayesian classifier. The Markov naïve Bayesian classifier relaxes the conditional independence assumption and should be a more appropriate classifier for gene sequence data. In this study, we embed multinomial models and Dirichlet priors for the numerator and the denominator in the Markov naïve Bayesian classifier to enhance its accuracy in classifying gene sequence reads. Two methods, numerator-first and denominator-first, are tested on four gene sequence sets, and the experimental results show that the numerator-first method generally achieves higher prediction accuracy. Both methods outperform the well-known RDP classifier. Since the number of priors for a class value is two in the Markov naïve Bayesian classifier instead of one in the naïve Bayesian classifier, the best noninformative Dirichlet priors do enhance the performance of the Markov naïve Bayesian classifier.
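The two ideas in the abstract, overlapping N-mer features and separate prior terms for the numerator and the denominator, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the toy read, and the simple additive form of the smoothed estimate (count plus `alpha` over count plus `beta`) are assumptions made for illustration only.

```python
from collections import Counter


def extract_nmers(sequence, n):
    """Slide a window of width n along the sequence, one position at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def smoothed_transition(pair_counts, prev_counts, prev, curr, alpha, beta):
    """Estimate P(curr | prev) from counts with additive prior terms.

    alpha is added to the numerator count and beta to the denominator count.
    Both are hypothetical stand-ins for the numerator and denominator prior
    parameters mentioned in the abstract, not the thesis's exact definitions.
    """
    return (pair_counts[(prev, curr)] + alpha) / (prev_counts[prev] + beta)


read = "ACGTACGA"            # toy read, not taken from the thesis data sets
nmers = extract_nmers(read, 3)

# Adjacent N-mers share N-1 letters, which is why the features are dependent.
overlaps = [a[1:] == b[:-1] for a, b in zip(nmers, nmers[1:])]

# Count N-mer to N-mer transitions and how often each N-mer has a successor.
pair_counts = Counter(zip(nmers, nmers[1:]))
prev_counts = Counter(nmers[:-1])

# beta = 4 * alpha keeps the estimate normalized over the 4 possible next letters.
p = smoothed_transition(pair_counts, prev_counts, "ACG", "CGT", alpha=1.0, beta=4.0)
print(nmers, all(overlaps), p)
```

In this sketch, tuning `alpha` and `beta` independently mirrors the idea of having two adjustable priors per class value, at the cost of the estimate no longer summing to one unless `beta` is tied to `alpha`.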
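Table of Contents Chapter 1. Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 2
1.3 Research Structure 3
Chapter 2. Literature Review 4
2.1 Metagenomics 4
2.2 Naïve Bayesian Classifier 6
2.2.1 Operating Principles of the Naïve Bayesian Classifier 6
2.2.2 Probability Models of the Naïve Bayesian Classifier in Document Classification 7
2.2.3 Applications of the Naïve Bayesian Classifier to Gene Sequences 8
2.3 Markov Naïve Bayesian Classifier 10
2.3.1 Markov Model 10
2.3.2 Operating Principles of the Markov Naïve Bayesian Classifier 12
2.4 Dirichlet Distribution 13
2.5 Summary 15
Chapter 3. Research Method 16
3.1 Research Procedure 16
3.2 Data Preprocessing 18
3.3 Multinomial Markov Naïve Bayesian Classifier 19
3.4 Methods for Adjusting Dirichlet Prior Parameters 21
3.5 Validation Method 23
Chapter 4. Empirical Study 24
4.1 Data Set Description 24
4.2 Empirical Study of the Markov Naïve Bayesian Classifier 25
4.3 Empirical Study of the Dirichlet Distribution 27
4.3.1 Comparison of Classification Accuracy on the Bacteria2035 Data Set 27
4.3.2 Comparison of Classification Accuracy on the Bacteria3672 Data Set 29
4.3.3 Comparison of Classification Accuracy on the Fungi4954 Data Set 31
4.3.4 Comparison of Classification Accuracy on the Fungi7730 Data Set 34
4.4 Summary 39
Chapter 5. Conclusions and Suggestions 42
5.1 Conclusions 42
5.2 Suggestions 43
References 44
Appendix 1 Accuracy Change Table for Dirichlet_Numerator-Denominator, Bacteria2035 Data Set 48
Appendix 2 Accuracy Change Table for Dirichlet_Denominator-Numerator, Bacteria2035 Data Set 49
Appendix 3 Accuracy Change Table for Dirichlet_Numerator-Denominator, Bacteria3672 Data Set 50
Appendix 4 Accuracy Change Table for Dirichlet_Denominator-Numerator, Bacteria3672 Data Set 54
Appendix 5 Accuracy Change Table for Dirichlet_Numerator-Denominator, Fungi4954 Data Set 58
Appendix 6 Accuracy Change Table for Dirichlet_Denominator-Numerator, Fungi4954 Data Set 61
Appendix 7 Accuracy Change Table for Dirichlet_Numerator-Denominator, Fungi7730 Data Set 64
Appendix 8 Accuracy Change Table for Dirichlet_Denominator-Numerator, Fungi7730 Data Set 73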
References 黃于珊 (2009). Parameter setting methods for generalized Dirichlet priors in multinomial naïve Bayesian classifiers [in Chinese]. Master's thesis, Institute of Information Management, National Cheng Kung University.
Bazinet, A., & Cummings, M. (2012). A comparative evaluation of sequence classification programs. Bioinformatics, 13(1), 92-104.
Brady, A., & Salzberg, S. L. (2009). Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6(9), 673-678.
Chen, J., Huang, H., Tian, S., & Qu, Y. (2009). Feature selection for text classification with Naïve Bayes. Expert Systems with Applications, 36(3), 5432-5435.
Fan, I., McElroy, K., & Thomas, T. (2012). Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing. PLoS, 6, 1-9.
Gerlach, W., & Stoye, J. (2011). Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Research, 39(14), 1-11.
Ghosh, T., Gajjalla, P., Mohammed, M., & Mande, S. (2012). C16S — A hidden Markov model based algorithm for taxonomic classification of 16S rRNA gene sequences. Genomics, 99, 195-201.
Handelsman, J., Rondon, M. R., Brady, S., Clardy, J., & Goodman, R. M. (1998). Molecular biology provides access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry & Biology, 5, 245-249.
Horton, M., Bodenhausen, N., & Bergelson, J. (2010). MARTA: a suite of Java-based tools for assigning taxonomic status to DNA sequences. Bioinformatics, 26(4), 568-569.
Jeffrey, J., & Hugenholtz, P. (2012). Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys. International Society for Microbial Ecology, 6, 94-103.
Kotamarti, R., Hahsler, M., & Raiford, D. (2010). Analyzing taxonomic classification using extensible Markov models. Bioinformatics, 26(18), 2235-2241.
Krause, L., Diaz, N., Goesmann, A., Kelley, S., Nattkemper, T., & Robert, F. (2008). Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Research, 36, 2030-2239.
Lan, Y., & Wang, Q. (2012). Using the RDP classifier to predict taxonomic novelty and reduce the search space for finding novel organisms. PLoS, 7, 1-15.
Largeron, C., Moulin, C., & Gery, M. (2011). Entropy based feature selection for text categorization. Symposium on Applied Computing, TaiChung, Taiwan.
Baum, L. E., & Eagon, J. A. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73(3), 360-363.
Liu, K., Chain, P., Hengartner, N., Kuske, C., & Xie, G. (2007). A Markov naïve Bayesian classifier for improved bacterial and fungal rRNA taxonomic assignment using short rRNA sequences. A working paper.
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. Working Notes of the 1998 AAAI/ICML Workshop on Learning for Text Categorization, 41-48.
McHardy, A. C., Martin, H. G., Tsirigos, A., Hugenholtz, P., & Rigoutsos, I. (2007). Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4(1), 63-72.
Pati, A. & Heath, L. (2011). ClaMS: A classifier for metagenomic sequences. Genomic Sciences, doi:10.4056, 248-253.
Rosen, G., Garbarine, E., Caseiro, D., Polikar, R., & Sokhansanj, B. (2008). Metagenome fragment classification using N-mer frequency profiles. Advances in Bioinformatics, 2008, Article ID 205969, 12 pages.
Rosen, G. L., Reichenberger, E. R., & Rosenfeld, A. M. (2011). NBC: the naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics, 27(1), 127-129.
Rosen, G., & Lim, T. (2012). NBC update: The addition of viral and fungal databases to the naïve Bayes classification tool. BioMed Central, 4, 1-5.
Slabbinck, B., Waegeman, W., Dawyndt, P., Vos, P., & Baets, B. (2010). From learning taxonomies to phylogenetic learning: Integration of 16S rRNA gene data into FAME-based bacterial classification. BioMed Central, 11, 1-16.
Soergel, D., Dey, N., Knight, R., & Brenner, S. (2012). Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. International Society for Microbial Ecology, 6, 1440-1444.
Vilo, C., & Dong, Q. (2012). Evaluation of the RDP classifier accuracy using 16S rRNA gene variable regions. Metagenomics, Article ID 235551, 1-5.
Wong, T. T. (1998). Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97, 165-181.
Wong, T. T. (2007). Perfect aggregation of Bayesian analysis on compositional data. Statistical Papers, 48, 265-282.
Wong, T. T. (2009). Alternative prior assumptions for improving the performance of naïve Bayesian classifiers. Data Mining and Knowledge Discovery, 18(2), 183-213.
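  • Authorization granted for on-campus browsing/printing of the electronic full text, publicly available from 2015-06-25.
  • Authorization granted for off-campus browsing/printing of the electronic full text, publicly available from 2015-06-25.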
