Statistical Evaluation of DNA Sequence Alignment Based on BLAST and Dissimilarity Measures
Department of Statistics
gene sequence alignment
In bioinformatics, DNA sequences are commonly compared either with NCBI BLAST or with dissimilarity measures between sequences. However, the accuracy of these measures and the appropriate cutoff values for them have not yet been studied thoroughly. In this study, a new dissimilarity measure between DNA sequences is proposed, and the area under the ROC curve is used to assess the accuracy of gene sequence alignment based on BLAST and on the dissimilarity measures. The hit rate, sensitivity, and specificity are used to find suitable cutoff values. A simulation study was conducted to investigate the accuracy of the proposed procedure empirically. The simulation results show that sequence alignment based on the symmetric Kullback-Leibler discrepancy (SK-LD) is more accurate than alignment based on BLAST. The proposed measure is also more accurate than BLAST and performs on a par with SK-LD.
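The evaluation idea summarised above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the word length `k`, the pseudocount, the unscaled form of the symmetric Kullback-Leibler sum, and the use of Youden's index (sensitivity + specificity - 1) to pick a cutoff are all assumptions made for illustration, and the function names are hypothetical. The sketch assumes that a smaller dissimilarity indicates a related pair of sequences.

```python
import math
from collections import Counter
from itertools import product


def word_freqs(seq, k=2, pseudo=1.0):
    """Relative frequencies of overlapping k-words over the alphabet ACGT.

    A pseudocount (an assumption of this sketch) keeps every probability
    positive so that the logarithms below are always defined.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words) + pseudo * len(words)
    return {w: (counts[w] + pseudo) / total for w in words}


def symmetric_kl(seq_a, seq_b, k=2):
    """KL(p||q) + KL(q||p) between the k-word distributions of two sequences.

    This is the unscaled sum; the thesis may normalise it differently.
    """
    p, q = word_freqs(seq_a, k), word_freqs(seq_b, k)
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def auc(related, unrelated):
    """Area under the ROC curve in its Mann-Whitney form.

    Estimates the probability that a related pair has a smaller
    dissimilarity than an unrelated pair; ties count one half.
    """
    wins = sum((r < u) + 0.5 * (r == u) for r in related for u in unrelated)
    return wins / (len(related) * len(unrelated))


def best_cutoff(related, unrelated):
    """Return (cutoff, youden, sensitivity, specificity) for the best cutoff.

    A pair is called related when its dissimilarity is <= cutoff.
    Maximising Youden's index is an assumed selection rule.
    """
    best = None
    for c in sorted(set(related) | set(unrelated)):
        sens = sum(r <= c for r in related) / len(related)
        spec = sum(u > c for u in unrelated) / len(unrelated)
        youden = sens + spec - 1
        if best is None or youden > best[1]:
            best = (c, youden, sens, spec)
    return best
```

In use, one would compute `symmetric_kl` (or any other dissimilarity) for simulated pairs known to be related and for pairs known to be unrelated, pass the two lists of scores to `auc` to compare measures, and pass them to `best_cutoff` to obtain a candidate threshold.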
Chapter 1 Introduction
Chapter 2 Literature Review
2.1 Sequence alignment
2.2 Sequence alignment software: BLAST
2.3 Sequence dissimilarity based on BLAST
2.4 Sequence dissimilarity based on SK-LD
Chapter 3 Proposed Methods
3.1 Proposed dissimilarity measures
3.2 Applying the ROC curve to compare accuracy
3.3 Searching for the cut-off value of sequence dissimilarity based on SK-LD
Chapter 4 Simulation Study
4.1 Simulation process
4.2 Simulation results
4.2.1 Accuracy based on the ROC curve
4.2.2 The cut-off value based on SK-LD
Chapter 5 Conclusions and Further Research