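System ID U0026-2607201815154900
Title (Chinese) 結合隱性特徵與明確特徵預測非激酶特異性磷酸化位點
Title (English) Combining implicit features and explicit features to predict non-kinase-specific phosphorylation sites
Institution National Cheng Kung University
Department (Chinese) 電機工程學系
Department (English) Department of Electrical Engineering
Academic year 106 (ROC calendar; 2017-2018)
Semester 2
Year of publication 107 (ROC calendar; 2018)
Student (Chinese) 方沛涵
Student (English) Pei-Han Fang
Student ID N26050156
Degree Master's
Language Chinese
Pages 28
Oral defense committee Advisor: 張天豪
Keywords (Chinese) 蛋白質磷酸化  位置特異性得分矩陣  XGBoost  深度學習
Keywords (English) protein phosphorylation  position-specific scoring matrix  XGBoost  deep learning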
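Abstract (translated from Chinese) Phosphorylation is one of the most important post-translational modifications in eukaryotes and plays a vital role in many cellular processes. Research on kinases and their substrates is essential for understanding intracellular signaling networks and supports the development of new treatments for diseases such as cancer. Because the relevant experiments are costly in time and labor, computational prediction of phosphorylation sites has become important, and many studies and tools have been developed over the years. They fall broadly into two categories: kinase-specific and non-kinase-specific. Kinase-specific prediction takes both a sequence and the kinase associated with it as input, and then predicts whether each serine (S), threonine (T), and tyrosine (Y) in the sequence is a phosphorylation site; non-kinase-specific prediction requires only the sequence. With the development of sequencing technology, the kinase has not yet been determined for many sequences, and for some kinases too few sequences are known, which makes kinase-specific methods hard to apply. Non-kinase-specific prediction of phosphorylation sites is therefore becoming increasingly important.
This study uses two broad classes of features for non-kinase-specific phosphorylation site prediction: explicit features and implicit features. The explicit features comprise Shannon entropy, relative entropy, predicted protein secondary structure, predicted protein disorder, solvent accessible area, overlapping properties, averaged cumulative hydrophobicity, the K-nearest-neighbor (KNN) score, and the position-specific scoring matrix (PSSM). The implicit features are generated by a convolutional neural network and a recurrent neural network. Both classes of features are then fed into XGBoost for prediction. On the test dataset, the method achieves AUC values of 0.8598 / 0.7547 / 0.6842 for S / T / Y phosphorylation sites, respectively, outperforming other current methods.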
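Two of the explicit features named in the abstract, Shannon entropy and relative entropy, can be illustrated with their textbook definitions. The following is a minimal sketch only: this record does not state how the thesis computes these features, so the per-column formulation, the function names, and the toy column and background distribution are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_distribution(column):
    """Observed residue frequencies of one alignment column (hypothetical helper)."""
    counts = Counter(column)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def shannon_entropy(column):
    """Shannon entropy, in bits, of the residue distribution in one column."""
    dist = column_distribution(column)
    return -sum(p * math.log2(p) for p in dist.values())


def relative_entropy(column, background):
    """Kullback-Leibler divergence, in bits, of the column from a background distribution.

    `background` maps each residue to its background probability; every residue
    seen in the column must have a non-zero background probability.
    """
    dist = column_distribution(column)
    return sum(p * math.log2(p / background[r]) for r, p in dist.items())


# Toy example (assumed data): a column with only S and T, uniform background.
toy_background = {"S": 0.5, "T": 0.5}
mixed_entropy = shannon_entropy("SSTT")
conserved_divergence = relative_entropy("SSSS", toy_background)
```

A fully conserved column has zero Shannon entropy, while its relative entropy grows as the conserved residue becomes rarer in the background, which is why the two scores are often used together as conservation features.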
Abstract (English) Phosphorylation is one of the most important post-translational modifications in eukaryotes, and it plays a vital role in many cellular reactions. Because the related experiments require a great deal of time and labor, the prediction of phosphorylation sites has become important. Many related studies and tools have been developed over the years, and they mostly fall into two categories: kinase-specific and non-kinase-specific. Kinase-specific prediction of phosphorylation sites takes both a sequence and its kinase as input, and then predicts whether each serine, threonine, and tyrosine is a phosphorylation site. Non-kinase-specific prediction of phosphorylation sites takes only the sequence as input. With the development of sequencing technology, the kinases of many sequences remain undetermined, and some kinases have too few corresponding sequences. The prediction of non-kinase-specific phosphorylation sites is therefore becoming more and more important.
In our research, we use explicit and implicit features to predict non-kinase-specific phosphorylation sites. Explicit features include Shannon entropy, relative entropy, predicted protein secondary structure, predicted protein disorder, solvent accessible area, overlapping properties, averaged cumulative hydrophobicity, KNN, and the position-specific scoring matrix. Implicit features are generated by a convolutional neural network and a recurrent neural network. Finally, we feed these features into XGBoost for prediction. On the test dataset, the AUC values of this method on S / T / Y phosphorylation sites are 0.8598 / 0.7547 / 0.6842, respectively, which is better than other current methods.
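The AUC figures quoted in the abstract are areas under the ROC curve. As a minimal sketch of what that number measures, the function below computes AUC through its rank interpretation: the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, with ties counted as one half. The function name and the toy labels and scores are assumptions for illustration; the record does not say which implementation the thesis used.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `labels` holds 1 for a phosphorylation site and 0 otherwise; `scores` holds
    the predicted scores. The pairwise loop is quadratic, so this is suitable
    only for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy example (assumed data): three of the four positive/negative pairs are ordered correctly.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of sites from non-sites.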
Table of contents List of figures
List of tables
Chapter 1 Introduction
Chapter 2 Related work
2.1 Phosphorylation sites
2.2 Phosphorylation site prediction tools
2.2.1 NetPhos
2.2.2 PPRED
2.2.3 Musite
2.2.4 PhosphoSVM
2.2.5 MusiteDeep
2.3 Model overview
2.3.1 eXtreme Gradient Boosting (XGBoost)
2.3.2 Convolutional neural network (CNN)
2.3.3 Attention mechanisms
2.3.4 Gated recurrent units (GRU)
Chapter 3 Methods
3.1 Dataset
3.2 Explicit features
3.2.1 Shannon entropy
3.2.2 Relative entropy
3.2.3 Predicted protein secondary structure
3.2.4 Predicted protein disorder
3.2.5 Solvent accessible area
3.2.6 Overlapping properties
3.2.7 Averaged cumulative hydrophobicity
3.2.8 K-nearest-neighbor profile score (KNN)
3.2.9 Position-specific scoring matrix (PSSM)
3.3 Implicit features
3.3.1 Implicit features generated by the CNN
3.3.2 Implicit features generated by the RNN
3.4 Model construction
Chapter 4 Results and discussion
4.1 Area under the curve (AUC)
4.2 Prediction results
4.2.1 Comparison of methods
4.2.2 Combining explicit features
4.2.3 Adding implicit features generated by the RNN
Chapter 5 Conclusion
5.1 Conclusion
5.2 Future work
References

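  • On-campus browsing and printing of the electronic full text is authorized, available from 2019-07-26.
  • Off-campus browsing and printing of the electronic full text is authorized, available from 2019-07-26.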
