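System ID: U0026-2701201516012800
Thesis title (Chinese): 應用多項式模型與無資訊廣義狄式先驗分配之簡易貝式分類器於基因序列資料分類之研究
Thesis title (English): Naive Bayesian classifiers with multinomial models and noninformative generalized Dirichlet priors for rRNA taxonomy assignment
University: National Cheng Kung University (成功大學)
Department (Chinese): 資訊管理研究所
Department (English): Institute of Information Management
Academic year: 103 (ROC calendar)
Semester: 1
Year of publication: 104 (ROC calendar; 2015)
Author (Chinese): 劉冠良
Author (English): Kuan-Liang Liu
Email: kliangliu@gmail.com
Student ID: R78961039
Degree: Doctoral
Language: English
Pages: 56
Oral defense committee: Advisor: 翁慈宗 (Tzu-Tsung Wong)
Keywords (Chinese): 廣義狄式先驗分配  簡易貝氏分類器  NGS  rRNA  序列分類
Keywords (English): generalized Dirichlet prior distribution  naïve Bayesian classifier  NGS  rRNA  taxonomy assignment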
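Abstract (translated from Chinese): Breakthroughs in high-throughput sequencing technology in recent years have fundamentally changed the way scientists study microorganisms. Species that could not be cultured in the laboratory can now be studied by sampling and sequencing the DNA of whole community samples, which yields microbial sequence data for a specific time and place and allows microbial communities to be analyzed comprehensively and systematically. To analyze these sequence data effectively, the Ribosomal Database Project (RDP) at Michigan State University built the RDP sequence classifier, which uses 8-mer nucleotide frequencies in DNA sequences together with Bayes' theorem to identify sequences from different microbial taxa. It classifies quickly and also achieves good accuracy on large volumes of short sequences. However, the binomial model used by the RDP classifier cannot account for 8-mers that occur repeatedly within a sequence, and the literature shows that, in document classification with naïve Bayesian classifiers, the multinomial model achieves better classification accuracy than the binomial model. This study therefore takes repeated 8-mers into account and builds a multinomial naïve Bayesian classifier for rRNA sequence classification. In experiments against the RDP classifier on 250-bp, 400-bp, 800-bp, and full-length sequences, the multinomial model achieved higher predictive accuracy in every case. To avoid a zero likelihood estimate for a feature whose 8-mer is absent from the training data, a naïve Bayesian classifier uses prior information when estimating the likelihood of each feature. However, features differ in importance, and the number of sequences per class varies considerably. This study therefore develops a method for finding the best parameters of generalized Dirichlet priors, allowing different features to carry different confidence levels under different classes. Experimental results show that this adjustment yields higher predictive accuracy than both the RDP classifier and fixed prior information. We also found during testing that the number-of-groups parameter set during the adjustment has a positive effect on predictive accuracy.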
Abstract (English): The introduction of next generation sequencing (NGS) has created a major revolution in biological ecology. Direct sequencing of hypervariable regions from rRNA genes can provide rapid and inexpensive analysis of ecological communities. To gain a deeper understanding of these data, the Ribosomal Database Project developed the ‘RDP Classifier’, which uses 8-mer nucleotide frequencies with Bayes' theorem to assign taxonomic affiliation. This classifier is computationally efficient and works well with massive numbers of short sequences. However, the binomial model employed in the RDP classifier does not consider repeated 8-mers in each reference sequence. Previous studies have pointed out that a multinomial model usually performs better than a binomial model. In this research, we present naïve Bayesian classifiers with multinomial models that take repeated 8-mers into account when classifying rRNA sequences. The results were compared with those obtained from the binomial RDP classifier on 250-bp, 400-bp, 800-bp, and full-length reads, demonstrating that the multinomial approach generally achieves higher predictive accuracy. The number of instances for a specific class value in an rRNA sequence set can be fewer than ten. In such a case, allowing different confidence levels on the features in a noninformative prior has the potential to improve the performance of a naïve Bayesian classifier. This study further develops a method to determine the best noninformative generalized Dirichlet priors for a naïve Bayesian classifier with multinomial models. The experimental results demonstrate that it can outperform the RDP classifier at all ranks and also suggest that the number of groups has a positive impact on the performance of the multinomial naïve Bayesian classifier.
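The multinomial idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`kmers`, `train_multinomial`, `log_likelihood`, `classify`), the uniform class prior, and the single flattening constant `alpha` applied to every feature are assumptions made for illustration. The thesis instead searches for generalized Dirichlet prior parameters, which this sketch does not attempt.

```python
import math
from collections import Counter


def kmers(seq, k=8):
    """Return every overlapping k-mer in seq, keeping repeated k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train_multinomial(training, k=8):
    """Count k-mer occurrences per taxon.

    training is a list of (taxon, sequence) pairs. Repeated k-mers are
    counted every time they occur, which is what separates a multinomial
    model from a presence/absence model.
    """
    counts = {}
    for taxon, seq in training:
        counts.setdefault(taxon, Counter()).update(kmers(seq, k))
    totals = {taxon: sum(c.values()) for taxon, c in counts.items()}
    return counts, totals


def log_likelihood(query, counts, totals, taxon, k=8, alpha=1.0):
    """Log-likelihood of the query k-mers under one taxon.

    alpha is a fixed flattening constant (an assumption of this sketch);
    it keeps k-mers unseen in training from producing a zero probability.
    """
    vocab = 4 ** k  # number of possible DNA k-mers
    c, n = counts[taxon], totals[taxon]
    return sum(math.log((c[w] + alpha) / (n + alpha * vocab))
               for w in kmers(query, k))


def classify(query, counts, totals, k=8, alpha=1.0):
    """Pick the taxon with the highest log-likelihood (uniform class prior)."""
    return max(counts,
               key=lambda t: log_likelihood(query, counts, totals, t, k, alpha))


# Toy usage with k=3 so the example stays readable; the taxon labels and
# sequences are made up.
toy = [("taxon_A", "ACGACGACGACG"), ("taxon_B", "TTGCATTGCATT")]
toy_counts, toy_totals = train_multinomial(toy, k=3)
print(classify("ACGACG", toy_counts, toy_totals, k=3))
```

Because `kmers` returns a list rather than a set, a k-mer that appears three times in a reference sequence contributes three counts. Swapping the list for a set would give a presence/absence feature set closer in spirit to the binomial model the abstract contrasts against.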
Abstract I
Acknowledgements III
Chapter 1. Introduction 1
1.1. Research background 1
1.2. Research motivation 2
1.3. Research objective 2
Chapter 2. Literature review 4
2.1. Ribosomal DNA complex 5
2.2. Next-Generation Sequencing – NGS 8
2.3. Genome signature 10
2.4. A rapid taxonomy placement for rRNA reads 12
2.5. Naïve Bayesian classifier 13
2.6. Flattening constant 15
2.7. Dirichlet distributions and generalized Dirichlet distributions 17
Chapter 3. Naïve Bayesian classifier with multinomial model for rRNA taxonomic assignment 19
3.1. rRNA training set preparation 20
3.2. Feature extraction from rRNA gene sequence data 22
3.3. Probabilistic framework of naïve Bayesian classifier for taxonomic assignment 23
3.3.1. Binomial model 24
3.3.2. Multinomial model 25
3.3.3. Demonstration of binomial and multinomial model on artificial data set 26
3.4. Experimental result 26
3.4.1. Full-length comparison 27
3.4.2. Short read fragment comparison 29
3.5. Discussion 33
Chapter 4. Noninformative generalized Dirichlet priors for rRNA taxonomic assignment 36
4.1. Classification methods with noninformative generalized Dirichlet priors 37
4.1.1. Searching for parameters for generalized Dirichlet priors 37
4.2. Experimental results 41
4.2.1. Bacterial 16S gene sequence data 41
4.2.2. Fungal 28S gene sequence data 43
4.3. Discussion 44
Chapter 5. Conclusion and future work 47
References 50
References
Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T., & Ikemura, T. (2003). Informatics for unveiling hidden genome signatures. Genome Research, 13(4), 693-702.
Araujo, J. F., de Castro, A. P., Costa, M. M., Togawa, R. C., Júnior, G. J., Quirino, B. F., et al. (2012). Characterization of soil bacterial assemblies in Brazilian savanna-like vegetation reveals acidobacteria dominance. Microbial Ecology, 64(3), 760-770.
Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D. R., et al. (2011). Enterotypes of the human gut microbiome. Nature, 473(7346), 174-180.
Bailly-Bechet, M., Danchin, A., Iqbal, M., Marsili, M., & Vergassola, M. (2006). Codon usage domains over bacterial chromosomes. PLoS Computational Biology, 2(4), e37.
Bauer, M., Schuster, S. M., & Sayood, K. (2008). The average mutual information profile as a genomic signature. BMC Bioinformatics, 9, 48.
Beck, D., Settles, M., & Foster, J. A. (2011). OTUbase: an R infrastructure package for operational taxonomic unit data. Bioinformatics, 27(12), 1700-1701.
Bentley, S. D., & Parkhill, J. (2004). Comparative genomic structure of prokaryotes. Annual Review of Genetics, 38, 771-792.
Bokulich, N. A., Joseph, C. M., Allen, G., Benson, A. K., & Mills, D. A. (2012). Next-generation sequencing reveals significant bacterial diversity of botrytized wine. PLoS One, 7(5), e36357.
Cole, J. R., Chai, B., Farris, R. J., Wang, Q., Kulam, S. A., McGarrell, D. M., et al. (2005). The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Research, 33(Database issue), D294-296.
Cole, J. R., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R. J., et al. (2009). The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research, 37(Database issue), D141-145.
Cole, J. R., Wang, Q., Fish, J. A., Chai, B., McGarrell, D. M., Sun, Y., et al. (2014). Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research, 42(Database issue), D633-642.
Connor, R. J., & Mosimann, J. E. (1969). Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution. Journal of the American Statistical Association, 64(325), 194-206.
Delcher, A. L., Bratke, K. A., Powers, E. C., & Salzberg, S. L. (2007). Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics, 23(6), 673-679.
DeLong, E. F. (2005). Microbial community genomics in the ocean. Nature Reviews Microbiology, 3(6), 459-469.
Deschavanne, P. J., Giron, A., Vilain, J., Fagot, G., & Fertil, B. (1999). Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Molecular Biology and Evolution, 16(10), 1391-1399.
Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine learning, 29(2-3), 103-130.
Dunbar, J., Eichorst, S. A., Gallegos-Graves, L. V., Silva, S., Xie, G., Hengartner, N. W., et al. (2012). Common bacterial responses in six ecosystems exposed to 10 years of elevated atmospheric carbon dioxide. Environmental Microbiology, 14(5), 1145-1158.
Egert, M., Marhan, S., Wagner, B., Scheu, S., & Friedrich, M. W. (2004). Molecular profiling of 16S rRNA genes reveals diet-related differences of microbial communities in soil, gut, and casts of Lumbricus terrestris L. (Oligochaeta: Lumbricidae). FEMS Microbiology Ecology, 48(2), 187-197.
Eichorst, S. A., & Kuske, C. R. (2012). Identification of cellulose-responsive bacterial and fungal communities in geographically and edaphically different soils by using stable isotope probing. Applied and Environmental Microbiology, 78(7), 2316-2327.
Fienberg, S. E., & Holland, P. W. (1972). On the choice of flattening constants for estimating multinomial probabilities. Journal of Multivariate Analysis, 2(1), 127-134.
Good, I. J. (1965). The Estimation of Probabilities. Cambridge, MA: MIT Press.
Hao, X., Jiang, R., & Chen, T. (2011). Clustering 16S rRNA for OTU prediction: a method of unsupervised Bayesian clustering. Bioinformatics, 27(5), 611-618.
Hugenholtz, P., Goebel, B. M., & Pace, N. R. (1998). Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. Journal of Bacteriology, 180(18), 4765-4774.
Jernigan, R. W., & Baran, R. H. (2002). Pervasive properties of the genomic signature. BMC Genomics, 3(1), 23.
Jumpponen, A., Jones, K. L., David Mattox, J., & Yaege, C. (2010). Massively parallel 454-sequencing of fungal communities in Quercus spp. ectomycorrhizas indicates seasonal dynamics in urban and rural sites. Molecular Ecology, 19 Suppl 1, 41-53.
Karlin, S., & Burge, C. (1995). Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics, 11(7), 283-290.
Kodratoff, Y., Cestnik, B., & Bratko, I. (1991). On estimating probabilities in tree pruning. Machine Learning: EWSL-91, 482, 138-150.
Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K., & Hugenholtz, P. (2008). A bioinformatician's guide to metagenomics. Microbiology and Molecular Biology Reviews, 72(4), 557-578.
Kuske, C. R., Yeager, C. M., Johnson, S., Ticknor, L. O., & Belnap, J. (2012). Response and resilience of soil biocrust bacterial communities to chronic physical disturbance in arid shrublands. The ISME Journal, 6(4), 886-897.
Lan, Y., Wang, Q., Cole, J. R., & Rosen, G. L. (2012). Using the RDP classifier to predict taxonomic novelty and reduce the search space for finding novel organisms. PLoS One, 7(3), e32491.
Li, J., & Sayood, K. (2005). A genome signature based on markov modeling. Conference Proceedings IEEE Engineering in Medicine and Biology Society, 3, 2832-2835.
Liu, K. L., Porras-Alfaro, A., Kuske, C. R., Eichorst, S. A., & Xie, G. (2012). Accurate, rapid taxonomic classification of fungal large-subunit rRNA genes. Applied and Environmental Microbiology, 78(5), 1523-1533.
Liu, K. L., & Wong, T. T. (2013). Naïve Bayesian Classifiers with Multinomial Models for rRNA Taxonomic Assignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
Liu, Z., DeSantis, T. Z., Andersen, G. L., & Knight, R. (2008). Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research, 36(18), e120.
Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, et al. (2004). ARB: a software environment for sequence data. Nucleic Acids Research, 32(4), 1363-1371.
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization (pp. 41-48). AAAI Press.
McHardy, A. C., Martín, H. G., Tsirigos, A., Hugenholtz, P., & Rigoutsos, I. (2007). Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4(1), 63-72.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
O'Brien, H. E., Parrent, J. L., Jackson, J. A., Moncalvo, J. M., & Vilgalys, R. (2005). Fungal community analysis by large-scale sequencing of environmental samples. Applied and Environmental Microbiology, 71(9), 5544-5550.
Olsen, G. J., Lane, D. J., Giovannoni, S. J., Pace, N. R., & Stahl, D. A. (1986). Microbial ecology and evolution: a ribosomal RNA approach. Annual Review of Microbiology, 40, 337-365.
Pace, N. R. (1997). A molecular view of microbial diversity and the biosphere. Science, 276(5313), 734-740.
Porras-Alfaro, A., Liu, K. L., Kuske, C. R., & Xie, G. (2014). From genus to phylum: large-subunit and internal transcribed spacer rRNA operon regions show similar classification accuracies influenced by database composition. Applied and Environmental Microbiology, 80(3), 829-840.
Pride, D. T., Meinersmann, R. J., Wassenaar, T. M., & Blaser, M. J. (2003). Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Research, 13(2), 145-158.
Ravel, J., Gajer, P., Abdo, Z., Schneider, G. M., Koenig, S. S., McCulle, S. L., et al. (2011). Vaginal microbiome of reproductive-age women. Proceedings of the National Academy of Sciences of the United States of America, 108 Suppl 1, 4680-4687.
Rosen, G., Garbarine, E., Caseiro, D., Polikar, R., & Sokhansanj, B. (2008). Metagenome fragment classification using N-mer frequency profiles. Advances in Bioinformatics, 2008, 205969.
Rosen, G. L., & Lim, T. Y. (2012). NBC update: The addition of viral and fungal databases to the Naïve Bayes classification tool. BMC Research Notes, 5, 81.
Rosen, G. L., Reichenberger, E. R., & Rosenfeld, A. M. (2011). NBC: the Naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics, 27(1), 127-129.
Sandberg, R., Winberg, G., Bränden, C. I., Kaske, A., Ernberg, I., & Cöster, J. (2001). Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Research, 11(8), 1404-1409.
Sanger, F., Air, G. M., Barrell, B. G., Brown, N. L., Coulson, A. R., Fiddes, C. A., et al. (1977). Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 265(5596), 687-695.
Sogin, M. L., Morrison, H. G., Huber, J. A., Mark Welch, D., Huse, S. M., Neal, P. R., et al. (2006). Microbial diversity in the deep sea and the underexplored "rare biosphere". Proceedings of the National Academy of Sciences of the United States of America, 103(32), 12115-12120.
Teeling, H., Meyerdierks, A., Bauer, M., Amann, R., & Glöckner, F. O. (2004). Application of tetranucleotide frequencies for the assignment of genomic fragments. Environmental Microbiology, 6(9), 938-947.
Teeling, H., Waldmann, J., Lombardot, T., Bauer, M., & Glöckner, F. O. (2004). TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics, 5, 163.
Turnbaugh, P. J., Hamady, M., Yatsunenko, T., Cantarel, B. L., Duncan, A., Ley, R. E., et al. (2009). A core gut microbiome in obese and lean twins. Nature, 457(7228), 480-484.
Van de Peer, Y., Chapelle, S., & De Wachter, R. (1996). A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucleic Acids Research, 24(17), 3381-3391.
Venter, J. C., Remington, K., Heidelberg, J. F., Halpern, A. L., Rusch, D., Eisen, J. A., et al. (2004). Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667), 66-74.
Walters, W. A., Caporaso, J. G., Lauber, C. L., Berg-Lyons, D., Fierer, N., & Knight, R. (2011). PrimerProspector: de novo design and taxonomic analysis of barcoded polymerase chain reaction primers. Bioinformatics, 27(8), 1159-1161.
Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261-5267.
Woese, C. R., & Fox, G. E. (1977). Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences of the United States of America, 74(11), 5088-5090.
Wong, T.-T. (1998). Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97(2-3), 165-181.
Wong, T.-T. (2009). Alternative prior assumptions for improving the performance of naïve Bayesian classifiers. Data Mining and Knowledge Discovery, 18(2), 183-213.
Wong, T.-T. (2014). Generalized Dirichlet priors for Naïve Bayesian classifiers with multinomial models in document classification. Data Mining and Knowledge Discovery, 28(1), 123-144.
Yang, F., Zeng, X., Ning, K., Liu, K. L., Lo, C. C., Wang, W., et al. (2012). Saliva microbiomes distinguish caries-active from healthy human populations. The ISME Journal, 6(1), 1-10.
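  • On-campus browsing/printing of the electronic full text is authorized; available from 2015-02-06.
  • Off-campus browsing/printing of the electronic full text is authorized; available from 2015-02-06.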
