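System ID U0026-0407201402354100
Title (Chinese) 結合多項式馬可夫貝氏分類器與廣義狄氏分配參數估算方法於基因序列分類之研究
Title (English) Methods for Setting Parameters of Generalized Dirichlet Priors for Markov Bayesian Classifiers with Multinomial Models in Gene Sequence Data
Institution National Cheng Kung University
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 102 (ROC calendar)
Semester 2
Year of publication 103 (ROC calendar; 2014)
Author (Chinese) 陳朝友
Author (English) Chao-Yu Chen
Student ID R76011155
Degree Master's
Language Chinese
Number of pages 49
Oral defense committee Advisor: 翁慈宗
Committee member: 蔡青志
Committee member: 謝佩璇
Committee member: 劉任修
Keywords (Chinese) 廣義狄氏分配  馬可夫貝氏分類器  基因序列資料  共變異數矩陣
Keywords (English) Generalized Dirichlet distribution  Markov Bayesian classifier  Gene sequence  Covariance matrix
Subject classification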
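Abstract (translated from Chinese) With the development of metagenomics and sequencing technologies, more attention is being paid to how well classifiers handle high-dimensional gene sequence data. Because preprocessing gene sequences produces a large number of features, the features must be grouped to reduce the dimensionality during processing. This thesis adopts the multinomial Markov Bayesian classifier, and not only for its computational efficiency: combining the Markov model relaxes the conditional independence assumption that limits the naïve Bayesian classifier, and the multinomial probability model takes the occurrence counts of features into account, which can improve classification accuracy. In addition, this thesis introduces generalized Dirichlet priors and replaces the time-consuming parameter search with parameter estimation, carried out per feature group. A covariance matrix is computed from each feature group, usable statistics are selected row by row from the covariance matrix, the parameter estimation method is applied to obtain the parameters, and the largest parameter combination is chosen. This combination maintains classification accuracy while reducing computational complexity. The experimental results show that, compared with RDP and with the method that searches for Dirichlet parameters, the proposed method greatly reduces computation time; its classification accuracy is higher than that of the RDP classifier but lower than that of the Dirichlet parameter search method.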
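The record does not give the estimation formulas, so the following is only a minimal sketch of the general idea of estimating prior parameters from sample moments instead of searching for them. It relies on one known property of the generalized Dirichlet distribution: its first coordinate has a Beta(alpha_1, beta_1) marginal, so a mean and a variance are enough to solve for that pair. The function name `beta_moment_estimate` and the "usable" check are assumptions made for illustration, not the thesis's procedure.

```python
from statistics import mean, pvariance


def beta_moment_estimate(samples):
    """Method-of-moments estimate of (alpha, beta) for a Beta distribution.

    `samples` are proportions in (0, 1), e.g. the relative frequency of the
    first feature of a feature group across training sequences (hypothetical
    input, chosen for illustration).
    """
    m = mean(samples)
    v = pvariance(samples, mu=m)
    # The moments are usable only if 0 < v < m(1 - m); otherwise the
    # solved parameters would not be positive.
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("moments are not usable for a Beta fit")
    s = m * (1 - m) / v - 1  # equals alpha + beta
    return m * s, (1 - m) * s


alpha_1, beta_1 = beta_moment_estimate([0.2, 0.4])
print(alpha_1, beta_1)
```

The closed-form solve is what makes estimation cheap compared with a search over candidate parameter values, which is the trade-off the abstract describes.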
Abstract (English) With the development of metagenomics and sequencing, biologists can culture microbes in an ecological environment. To explore the diversity of species, biologists use metagenomic technologies to extract samples directly from an ecological environment. When classifying gene sequences, an N-mer sliding window is generally used to extract features, so two adjacent features have N-1 letters in common. This violates the conditional independence assumption of the naïve Bayesian classifier. The Markov Bayesian classifier relaxes the conditional independence assumption and should therefore be a more appropriate classification tool for gene sequence data. The prior for a class value with at least ten instances is set to follow a generalized Dirichlet distribution, and the priors for the other class values are set to follow Dirichlet distributions. Several ways are proposed to set the parameters of the generalized Dirichlet and Dirichlet priors for the Markov Bayesian classifier with the multinomial model in order to enhance its prediction accuracy. The parameter estimation method is much faster than the searching method in setting the parameters of generalized Dirichlet priors. The experimental results on four gene sequence sets demonstrate that our parameter estimation method is superior to the RDP method in both prediction accuracy and computational efficiency.
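The N-mer sliding window mentioned in the abstract can be sketched as follows. This is an illustrative example only: the function names `extract_nmers` and `nmer_counts` and the sample sequence are assumptions, and the thesis's actual preprocessing may differ. It shows why adjacent features are dependent: each window shares its last N-1 letters with the start of the next one.

```python
from collections import Counter


def extract_nmers(sequence, n):
    """Slide a window of width n over the sequence, one letter at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def nmer_counts(sequence, n):
    """Occurrence counts of each N-mer, the input a multinomial model uses."""
    return Counter(extract_nmers(sequence, n))


nmers = extract_nmers("ACGTACG", 3)
# Each window's last n-1 letters equal the next window's first n-1 letters.
overlapping = all(a[1:] == b[:-1] for a, b in zip(nmers, nmers[1:]))
print(nmers, overlapping)
print(nmer_counts("ACGTACG", 3))
```

Because every feature determines most of its successor, treating the N-mers as conditionally independent is a poor fit, which is the motivation the abstract gives for the Markov Bayesian classifier.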
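Table of contents Chapter 1. Introduction
1.1 Research background and motivation
1.2 Research objectives
1.3 Thesis organization
Chapter 2. Literature review
2.1 Metagenomics
2.2 Data mining methods in metagenomics
2.3 Naïve Bayesian classifier
2.3.1 How the naïve Bayesian classifier works
2.3.2 Probability models of the naïve Bayesian classifier in document classification
2.4 Markov Bayesian classifier
2.4.1 Markov model
2.4.2 How the Markov Bayesian classifier works
2.5 Prior distributions
2.5.1 Dirichlet distribution
2.5.2 Generalized Dirichlet distribution
2.6 Summary
Chapter 3. Research method
3.1 Research process
3.2 Data preprocessing
3.3 Covariance matrix computation
3.4 Setting and estimating the parameters of prior distributions
3.4.1 Feature groups: three parameter estimation methods for the generalized Dirichlet distribution
3.4.2 Features: parameter estimation method for the prior distribution
3.5 Multinomial Markov Bayesian classifier
3.6 Validation method
Chapter 4. Empirical study
4.1 Data sets
4.2 Empirical study of the Markov Bayesian classifier combined with the generalized Dirichlet parameter estimation method
4.2.1 Classification accuracy on Bacteria2035
4.2.2 Classification accuracy on Fungi4954
4.2.3 Classification accuracy on Bacteria3672
4.2.4 Classification accuracy on Fungi7730
4.2.5 Comparison of classification time across data sets for different numbers of groups
4.3 Comparison of accuracy and computation time across methods
4.4 Summary
Chapter 5. Conclusions and suggestions
5.1 Conclusions
5.2 Suggestions and future work
References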
References 劉超瑞 (2013). Methods for deriving generalized Dirichlet distribution parameters for multinomial naïve Bayesian classifiers in document classification. Master's thesis, Institute of Information Management, National Cheng Kung University (in Chinese).
韓昀達 (2013). A study of multinomial Markov naïve Bayesian classifiers combined with Dirichlet priors for gene sequence classification. Master's thesis, Institute of Information Management, National Cheng Kung University (in Chinese).
吳沐穎 (2012). A study of generalized Dirichlet priors in naïve Bayesian classifiers applied to gene sequence data classification. Master's thesis, Institute of Information Management, National Cheng Kung University (in Chinese).
Aitchison J. (2003). A general class of distributions on the simplex. Journal of the Royal Statistical Society Series B, 7, 136-146.
Altschul S.F., Gish W., Miller W., Myers E.W. & Lipman D.J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
Bazinet A. & Cummings M. (2012). A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13(1), 92-104.
Brady A. & Salzberg S.L. (2009). Phymm and PhymmBL: metagenomics phylogenetic classification with interpolated Markov models. Nature Methods, 6(9), 673-678.
Edgar R.C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460-2461.
Ghosh T.S., Mohammed M.H., Komanduri D. & Mande S.S. (2011). ProViDE: a software tool for accurate estimation of viral diversity in metagenomic samples. Bioinformation, 6, 91-94.
Hao X., Jiang R. & Chen T. (2011). Clustering 16S rRNA for OTU prediction: a method of unsupervised Bayesian clustering. Bioinformatics, 27, 611-618.
Huson D.H., Auch A.F., Qi J. & Schuster S.C. (2007). MEGAN analysis of metagenomic data. Genome Research, 17, 377-386.
Handelsman J., Rondon M.R., Brady S., Clardy J. & Goodman R.M. (1998). Molecular biology provides access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry & Biology, 5, 245-249.
Jensen D.B., Vesth T.C., Hallin P., Pedersen A.G. & Ussery D.W. (2012). Bayesian prediction of bacterial growth temperature range based on genome sequences. BMC Genomics, 13(7), S3.
Liu K., Porras-Alfaro A., Kuske C.R., Eichorst S.A. & Xie G. (2012). Accurate, rapid taxonomic classification of fungal large subunit rRNA genes. Applied and Environmental Microbiology, 78(5), 1523-1533.
Mohammed M.H., Ghosh T.S., Reddy R.M., Reddy C.V., Singh N.K. & Mande S.S. (2011). INDUS - a composition-based approach for rapid and accurate taxonomic classification of metagenomic sequences. BMC Genomics, 12(3), S4.
Monzoorul H.M., Ghosh T.S., Komanduri D. & Mande S.S. (2009). SOrt-ITEMS: sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics, 25, 1722-1730.
McHardy A.C., Martin H.G., Tsirigos A., Hugenholtz P. & Rigoutsos I. (2007). Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods, 4(1), 63-72.
Nalbantoglu O.U., Way S.F., Hinrichs S.H. & Sayood K. (2011). RAIphy: phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles. BMC Bioinformatics, 12, 41.
Prabhakara S. & Acharya R. (2012). Unsupervised two-way clustering of metagenomic sequences. Biomedicine and Biotechnology, doi:153647.
Reddy R.M., Mohammed M.H. & Mande S.S. (2012). TWARIT: an extremely rapid and efficient approach for phylogenetic classification of metagenomic sequences. Gene, 505(2), 259-265.
Shruthi P. & Raj. (2012). Unsupervised two-way clustering of metagenomic sequences. Journal of Biomedicine and Biotechnology, vol. 2012, 2012.
Sharpton T.J., Riesenfeld S.J., Kembel S.W., Ladau J., O’Dwyer J.P., Green J.L., Eisen A.J. & Pollard K.A. (2011). PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Computational Biology, 7, e1001061.
Wang Y., Leung H.C.M., Yiu S. & Chin F.Y.L. (2012). MetaCluster4.0: a novel binning algorithm for NGS reads and huge number of species. Computers in Biology and Medicine, 19, 241-249.
Wang Q., Garrity G.M., Tiedje J.M. & Cole J.R. (2007). Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261-5267.
Wong T.T. (2009). Alternative prior assumptions for improving the performance of naïve Bayesian classifier. Data Mining and Knowledge Discovery, 18, 183-213.
Wong T.T. (2010). Parameter estimation for generalized Dirichlet distributions from the sample estimates of the first and the second moments of random variables. Computational Statistics & Data Analysis, 54(7), 1756-1765.
Wong T.T. (2014). Generalized Dirichlet priors for naïve Bayesian classifiers with multinomial models in document classification. Data Mining and Knowledge Discovery, 28, 123-144.
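Full-text access rights
  • On-campus browsing and printing of the electronic full text is authorized, with public release from 2019-07-14.
  • Off-campus browsing and printing of the electronic full text is authorized, with public release from 2019-07-14.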