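System ID U0026-2207201514261000
Title (Chinese) 混合式先驗分配參數設定方法結合多項式簡易貝氏分類器於基因序列資料之研究
Title (English) Parameter setting methods of hybrid priors for naïve Bayesian classifiers with multinomial model in gene sequence data
University National Cheng Kung University
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 103 (ROC calendar; 2014-2015)
Semester 2
Year of publication 104 (ROC calendar; 2015)
Student (Chinese) 姚佳佑
Student (English) Chia-Yu Yao
Student ID R76021045
Degree Master's
Language Chinese
Pages 55
Oral defense committee Advisor: 翁慈宗
Committee member: 蔡青志
Committee member: 張秀雲
Committee member: 陳榮泰
Keywords (Chinese) 狄氏分配 (Dirichlet distribution)  廣義狄氏分配 (generalized Dirichlet distribution)  簡易貝氏分類器 (naïve Bayesian classifier)  基因序列資料 (gene sequence data)  共變異數矩陣 (covariance matrix)
Keywords (English) Covariance matrix  Dirichlet distribution  gene sequence  generalized Dirichlet distribution  naïve Bayesian classifier
Subject classification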
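Abstract (Chinese) With advances in metagenomics and sequencing technology, the performance of classifiers on high-dimensional gene sequence data has received growing attention. Naïve Bayesian classifiers are efficient and easy to use, and have already been applied to gene sequence data. Using prior distribution parameters has also been shown to improve the classification accuracy of naïve Bayesian classifiers with a multinomial model on such data. Gene sequence data have a very large number of class values, and some class values have few instances, which limits how far classification accuracy can be raised. This study therefore adopts Dirichlet priors and replaces the conventional, time-consuming parameter search with parameter estimation. The prior parameters are estimated feature by feature: a covariance matrix is computed from each feature, suitable statistics for each class value are selected from the covariance matrix, and three methods for selecting parameter combinations and two methods for searching parameter combinations are used to obtain more suitable parameter combinations. The experimental results show that the proposed methods alleviate the problem of class values with too few instances, and that the two parameter search methods find more suitable parameter combinations and improve classification accuracy, at the cost of some additional computation time.
To improve classification accuracy further, class values for which the Dirichlet prior predicts poorly and which have relatively many instances are instead classified with generalized Dirichlet prior parameters. The experimental results show that this yields only limited further improvement, mainly because the number of class values is very large and most class values have few instances.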
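The abstract describes estimating Dirichlet prior parameters from a covariance matrix instead of searching for them. The record does not give the thesis's estimation formulas, so the following is only a minimal sketch of one standard approach, moment matching, which uses the Dirichlet identity Var[p_j] = m_j(1 - m_j)/(α_0 + 1). The function name `dirichlet_moment_estimate` and the choice of the median across categories are assumptions made for illustration, not the author's method.

```python
import numpy as np


def dirichlet_moment_estimate(proportions):
    """Estimate Dirichlet parameters from rows of proportions by moment matching.

    proportions: array of shape (n_samples, n_categories); each row sums to 1.
    Returns the estimated parameter vector and the sample covariance matrix.
    """
    p = np.asarray(proportions, dtype=float)
    mean = p.mean(axis=0)
    cov = np.cov(p, rowvar=False)  # sample covariance matrix of the proportions
    var = np.diag(cov)

    # Each category with non-zero variance yields its own estimate of the
    # concentration alpha_0 = m_j (1 - m_j) / Var[p_j] - 1.
    valid = var > 0
    alpha0_candidates = mean[valid] * (1.0 - mean[valid]) / var[valid] - 1.0

    # Assumption: combine the per-category estimates with the median.
    alpha0 = float(np.median(alpha0_candidates))
    return mean * alpha0, cov
```

Applied to a large sample drawn from a known Dirichlet distribution, the returned vector should lie close to the true parameters; with few instances per class value, as the abstract notes for gene sequence data, the variance estimates and hence the parameters become much less reliable.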
Abstract (English) Due to the development of metagenomics and sequencing, analysts pay more attention to the effectiveness of classification algorithms in processing high-dimensional gene sequence data. Naïve Bayesian classifiers are a popular tool for classifying high-dimensional gene sequence data because of their computational efficiency and easy implementation. Setting proper parameters for priors has been shown to be an effective way to improve the performance of the naïve Bayesian classifier with a multinomial model, called the multinomial naïve Bayesian classifier, in gene sequence classification. Since the number of class values in a gene sequence data set is huge, and the number of instances for many class values is less than ten, the possibility of improving the classification accuracy of gene sequence data is generally limited. In this study, the covariance matrices for features are first calculated from the available gene sequence data. Several ways are then proposed to set and search the parameters of Dirichlet priors for the multinomial naïve Bayesian classifier. The experimental results on two gene sequence sets demonstrate that the proposed methods can improve the prediction accuracy of the multinomial naïve Bayesian classifier in acceptable computational time. Generalized Dirichlet priors are then introduced for the class values with low accuracy and a large number of instances. The experimental results on the same gene sequence sets show that the improvement in prediction accuracy is limited because the number of class values is huge and the number of instances in many class values is small.
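To make the role of the Dirichlet prior concrete, here is a minimal sketch of a multinomial naïve Bayesian classifier over k-mer counts, where the prior parameters act as pseudo-counts added to the observed counts of each class. The class name `DirichletMultinomialNB`, the use of overlapping k-mers as features, and the symmetric default prior are illustrative assumptions; the record does not specify the thesis's feature extraction or parameter values.

```python
import math
from collections import defaultdict
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers (length-k substrings) in a sequence."""
    counts = defaultdict(int)
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts


class DirichletMultinomialNB:
    """Multinomial naive Bayes with a Dirichlet prior on k-mer probabilities."""

    def __init__(self, k=2, alpha=1.0, alphabet="ACGT"):
        self.k = k
        self.vocab = ["".join(p) for p in product(alphabet, repeat=k)]
        # Assumption: a symmetric prior; per-k-mer values could be set instead.
        self.alpha = {w: alpha for w in self.vocab}

    def fit(self, sequences, labels):
        totals = defaultdict(lambda: defaultdict(int))
        class_sizes = defaultdict(int)
        for seq, label in zip(sequences, labels):
            class_sizes[label] += 1
            for w, c in kmer_counts(seq, self.k).items():
                if w in self.alpha:  # skip k-mers with ambiguous bases
                    totals[label][w] += c

        n = len(labels)
        alpha_sum = sum(self.alpha.values())
        self.log_prior = {c: math.log(m / n) for c, m in class_sizes.items()}
        self.log_theta = {}
        for c in class_sizes:
            denom = sum(totals[c].values()) + alpha_sum
            # Posterior mean under the Dirichlet prior: (count + alpha) / denom.
            self.log_theta[c] = {
                w: math.log((totals[c][w] + self.alpha[w]) / denom)
                for w in self.vocab
            }
        return self

    def predict(self, sequence):
        counts = kmer_counts(sequence, self.k)

        def score(c):
            return self.log_prior[c] + sum(
                n * self.log_theta[c][w]
                for w, n in counts.items()
                if w in self.log_theta[c]
            )

        return max(self.log_prior, key=score)
```

A model is trained with `DirichletMultinomialNB(k=2).fit(sequences, labels)` and queried with `predict(sequence)`. Because the pseudo-counts dominate the estimates for class values with few instances, the choice of prior parameters matters most for exactly the small classes the abstract is concerned with.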
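Table of contents Abstract I
Acknowledgments VI
Contents VII
List of tables IX
List of figures X
Chapter 1 Introduction 1
1.1 Research background and motivation 1
1.2 Research objectives 2
1.3 Thesis organization 2
Chapter 2 Literature review 3
2.1 Naïve Bayesian classifiers 3
2.1.1 How naïve Bayesian classifiers work 3
2.1.2 Probability models of naïve Bayesian classifiers for document classification 4
2.2 Metagenomics 5
2.2.1 Form of gene sequence data and feature extraction 6
2.2.2 Data mining methods in metagenomics 7
2.3 Prior distributions 9
2.3.1 Dirichlet distribution 9
2.3.2 Generalized Dirichlet distribution 11
2.4 Summary 12
Chapter 3 Research methods 14
3.1 Data preprocessing 15
3.2 Covariance matrix computation 19
3.2.1 Feature covariance matrix computation 19
3.2.2 Feature-group covariance matrix computation 20
3.3 Estimation methods for prior distribution parameters 22
3.3.1 Parameter estimation for the Dirichlet distribution 22
3.3.2 Parameter estimation for the generalized Dirichlet distribution 26
3.4 Naïve Bayesian classification of gene sequence data 34
3.4.1 Naïve Bayesian classification with Dirichlet parameters 35
3.4.2 Naïve Bayesian classification with generalized Dirichlet parameters 35
3.5 Validation method 36
Chapter 4 Empirical study 37
4.1 Data sets 37
4.2 Empirical study of naïve Bayesian classifiers with Dirichlet parameter estimation methods 38
4.2.1 Comparison of the number of parameters across parameter selection methods 38
4.2.2 Comparison of accuracy and correctly predicted instances of small-sample class values across methods 39
4.2.3 Comparison of computation time across methods 44
4.3 Summary 46
Chapter 5 Conclusions and suggestions 48
5.1 Conclusions 48
5.2 Suggestions and future work 49
References 50
Appendix 1 Accuracy table for the Dirichlet distribution, BACTERIA data set 53
Appendix 2 Accuracy table for the Dirichlet distribution, FUNGI data set 54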
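References 吳沐穎 (2012). A study on applying generalized Dirichlet priors in naïve Bayesian classifiers to gene sequence data classification. Master's thesis, Institute of Information Management, National Cheng Kung University.
劉超瑞 (2013). Methods for deriving generalized Dirichlet distribution parameters for multinomial naïve Bayesian classifiers in document classification. Master's thesis, Institute of Information Management, National Cheng Kung University.
韓昀達 (2013). A study on multinomial Markov naïve Bayesian classifiers with Dirichlet priors for gene sequence classification. Master's thesis, Institute of Information Management, National Cheng Kung University.
陳朝友 (2014). A study on combining multinomial Markov Bayesian classifiers with generalized Dirichlet parameter estimation methods for gene sequence classification. Master's thesis, Institute of Information Management, National Cheng Kung University.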
Aitchison, J. (1985). A general class of distributions on the simplex. Journal of the Royal Statistical Society. Series B, 47(1), 136-146.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.
Bakhtiari, A. S., & Bouguila, N. (2014). A variational bayes model for count data learning and classification. Engineering Applications of Artificial Intelligence, 35, 176-186.
Bazinet, A. L., & Cummings, M. P. (2012). A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13, 92.
Chikhaoui, B., Wang, S., & Pigot, H. (2012). ADR-SPLDA: Activity discovery and recognition by combining sequential patterns and latent Dirichlet allocation. Pervasive and Mobile Computing, 8(6), 845-862.
Cole, J. R., Wang, Q., Fish, J. A., Chai, B. L., McGarrell, D. M., Sun, Y. N., Brown, C. T., Porras-Alfaro, A., Kuske, C. R., & Tiedje, J. M. (2014). Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research, 42(D1), 633-642.
Connor, R. J., & Mosimann, J. E. (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association, 64(325), 194-206.
Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19), 2460-2461.
Handelsman, J., Rondon, M. R., Brady, S. F., Clardy, J., & Goodman, R. M. (1998). Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry & Biology, 5(10), 245-249.
Liao, R. Q., Zhang, R. C., Guan, J. H., & Zhou, S. G. (2014). A New Unsupervised Binning Approach for Metagenomic Sequences Based on N-grams and Automatic Feature Weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(1), 42-54.
Liu, K. L., & Wong, T.-T. (2013). Naive Bayesian Classifiers with Multinomial Models for rRNA Taxonomic Assignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(5), 1334-1339.
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. Paper presented at the AAAI-98 workshop on learning for text categorization.
Newton, I. L. G., & Roeselers, G. (2012). The effect of training set on the classification of honey bee gut microbiota using the Naive Bayesian Classifier. BMC Microbiology, 12, 221.
Prabhakara, S., & Acharya, R. (2012). Unsupervised Two-Way Clustering of Metagenomic Sequences. Journal of Biomedicine and Biotechnology, 2012, 1-11.
Reddy, R. M., Mohammed, M. H., & Mande, S. S. (2012). TWARIT: An extremely rapid and efficient approach for phylogenetic classification of metagenomic sequences. Gene, 505(2), 259-265.
Rosen, G., Garbarine, E., Caseiro, D., Polikar, R., & Sokhansanj, B. (2008). Metagenome Fragment Classification Using N-Mer Frequency Profiles. Advances in Bioinformatics, 2008, 1-12.
Sanli, K., Karlsson, F. H., Nookaew, I., & Nielsen, J. (2013). FANTOM: Functional and taxonomic analysis of metagenomes. BMC Bioinformatics, 14, 38.
Tuzhikov, A., Panchin, A., & Shestopalov, V. I. (2014). TUIT, a BLAST-based tool for taxonomic classification of nucleotide sequences. Biotechniques, 56(2), 78-84.
Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261-5267.
Werner, J. J., Koren, O., Hugenholtz, P., DeSantis, T. Z., Walters, W. A., Caporaso, J. G., Angenent, L. T., Knight, R., & Ley, R. E. (2012). Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys. The ISME Journal, 6(1), 94-103.
Wong, T.-T. (1998). Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97(2), 165-181.
Wong, T.-T. (2007). Perfect aggregation of Bayesian analysis on compositional data. Statistical Papers, 48(2), 265-282.
Wong, T.-T. (2009). Alternative prior assumptions for improving the performance of naïve Bayesian classifiers. Data Mining and Knowledge Discovery, 18(2), 183-213.
Wong, T.-T. (2010). Parameter estimation for generalized Dirichlet distributions from the sample estimates of the first and the second moments of random variables. Computational Statistics & Data Analysis, 54(7), 1756-1765.
Zakrzewski, M., Bekel, T., Ander, C., Pühler, A., Rupp, O., Stoye, J., Schlüter, A., & Goesmann, A. (2013). MetaSAMS—a novel software platform for taxonomic classification, functional annotation and comparative analysis of metagenome datasets. Journal of Biotechnology, 167(2), 156-165.
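Full-text access rights
  • Authorized for on-campus browsing and printing of the electronic full text; publicly available from 2020-07-30.
  • Authorized for off-campus browsing and printing of the electronic full text; publicly available from 2020-07-30.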

