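The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.
(Note: if the thesis cannot be found, or its holding status shows "closed stacks, not public," the thesis is not in the stacks and cannot be retrieved.)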
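System ID U0026-1307201618422900
Title (Chinese) 用多項式簡易貝氏分類器分類基因序列資料時以遺傳密碼進行特徵萃取之研究
Title (English) Applying genetic code for feature extraction in classifying gene sequence data by multinomial naive Bayesian classifiers
University National Cheng Kung University
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 104 (ROC calendar)
Semester 2
Year of publication 105 (ROC calendar; 2016)
Author (Chinese) 林修弘
Author (English) Xiu-Hong Lin
Student ID R76033018
Degree Master's
Language Chinese
Number of pages 42
Oral defense committee Advisor: 翁慈宗
Committee member: 王維聰
Committee member: 胡政宏
Committee member: 陳榮泰
Keywords (Chinese) 遺傳密碼  宏基因體學  簡易貝氏分類器  特徵萃取  基因序列
Keywords (English) feature extraction  gene sequence  genetic code  multinomial naïve Bayesian classifier
Subject classification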
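Abstract (translated from Chinese) Traditionally, environmental microorganisms were studied by collecting samples from the environment and then culturing them in the laboratory. In recent years, however, scientists have found that laboratory conditions can culture only about one percent of the microorganisms found in natural environments, which limits the scope of research. Metagenomics, which sequences genes directly from sampled material, is therefore better suited to studying microbial populations. Naive Bayesian classifiers are widely used to process metagenomic sequences because they classify well and their computational cost is linear. Although they have achieved good results, three characteristics of metagenomic sequence data limit further gains in classification performance: many class values, high attribute dimensionality, and sparse distributions. Many researchers have studied this problem in depth and proposed remedies such as attribute selection, hierarchical processing, and optimization of prior distributions. To address the same problem, this study introduces the genetic code from biology to process the sequence data, improves the feature extraction step, and proposes a method for using combined features, with the aim of further raising the accuracy of naive Bayesian classifiers on metagenomic sequence data. The experimental results show that the proposed method not only improves accuracy slightly but also significantly increases computation speed.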
Abstract (English) Microorganism samples are often collected from the environment and cultivated in laboratories, but most of them cannot be cultivated well. Collecting gene sequences directly from their cells is therefore a better way to study environmental microbial populations. Multinomial naïve Bayesian classifiers are often used to analyze gene sequence data because of their computational efficiency and ease of implementation. In a gene sequence set, however, the dimensionality is high and the number of class values is large. Many studies have proposed approaches to improve the accuracy of the multinomial naïve Bayesian classifier, such as feature selection and prior setting methods. This study introduces the concept of the genetic code to transform gene sequences and proposes several ways to aggregate the features for classification. The experimental results on a gene sequence set show that the proposed approach can significantly accelerate the computation of the multinomial naïve Bayesian classifier while also improving its accuracy.
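The pipeline outlined in the abstract (translate nucleotide sequences with the genetic code, extract features from the resulting amino acid sequence, classify with a multinomial naive Bayesian classifier) can be illustrated with a minimal sketch. This is not the thesis's implementation. The function names, the choice of overlapping amino-acid 2-mers as features, the single reading frame, the Laplace parameter `alpha=1.0`, and the toy sequences are all assumptions made for illustration. Only the codon table is standard (NCBI translation table 1).

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1); bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Map each codon to its amino acid letter; 'X' marks unknown codons."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def kmer_counts(protein, k=2):
    """Count overlapping amino-acid k-mers (hypothetical feature choice)."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


def train(samples, alpha=1.0):
    """samples: iterable of (feature Counter, class label) pairs."""
    class_docs, class_feats, vocab = Counter(), {}, set()
    for feats, label in samples:
        class_docs[label] += 1
        class_feats.setdefault(label, Counter()).update(feats)
        vocab.update(feats)
    return class_docs, class_feats, vocab, alpha


def predict(model, feats):
    """Return the class with the highest multinomial log-posterior."""
    class_docs, class_feats, vocab, alpha = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label, counts in class_feats.items():
        total = sum(counts.values())
        score = math.log(class_docs[label] / n_docs)
        for feat, c in feats.items():
            if feat in vocab:  # features never seen in training are skipped
                # Laplace-smoothed estimate of P(feature | class)
                p = (counts[feat] + alpha) / (total + alpha * len(vocab))
                score += c * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
training = [("ATGGCCGCCGCC", "A"), ("ATGAAAAAAAAA", "B")]
model = train((kmer_counts(translate(s)), y) for s, y in training)
query = kmer_counts(translate("GCCGCCGCG"))  # GCC and GCG both encode alanine
predicted = predict(model, query)
```

Because synonymous codons map to the same amino acid, the 64 possible codons collapse to 20 amino acids plus a stop symbol, so a k-mer vocabulary over translated sequences is smaller than one over codons. That reduction is one plausible source of the speedup the abstract reports, though the record does not describe the mechanism.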
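Table of contents Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Research Structure
Chapter 2 Literature Review
2.1 Nucleic Acid Sequences and Their Biological Significance
2.2 Introduction to Metagenomics
2.3 Gene Sequence Data
2.4 The Genetic Code and the Amino Acid Sequences of Proteins
2.5 Naive Bayesian Classifiers
2.5.1 How Naive Bayesian Classifiers Work
2.5.2 The Probability Model of Naive Bayesian Classifiers
Chapter 3 Research Method
3.1 Processing Nucleic Acid Sequence Data with the Genetic Code
3.2 Feature Extraction from Amino Acid Sequences
3.3 Method for Using Combined Features
3.4 Application of Naive Bayesian Classifiers
3.5 Evaluation Method
Chapter 4 Empirical Study
4.1 Data Set Description
4.2 Comparison of Four Genetic Code Processing Approaches
4.3 Selecting the Laplace Correction Parameter for the Naive Bayesian Classifier
4.4 Accuracy Comparison of the Combined Feature Methods
4.5 Comparison of Algorithm Running Times
Chapter 5 Conclusions and Suggestions
5.1 Conclusions
5.2 Suggestions and Future Work
References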

Full-text access rights
  • Authorized for on-campus browsing and printing of the electronic full text, available from 2021-07-01.

