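System ID U0026-2307201323484200
Title (Chinese) 次世代測序研究--藉由相似物種的比對以評估全新序列組裝
Title (English) Evaluating the De Novo Assembly by Related Species Alignment in Second-Generation Sequencing Study
University National Cheng Kung University
Department (Chinese) 統計學系碩博士班 (Department of Statistics, Master's and Doctoral Program)
Department (English) Department of Statistics
Academic year 101 (ROC calendar)
Semester 2
Publication year 102 (ROC calendar)
Student (Chinese) 林振宇
Student (English) Chen-Yu Lin
Student ID r26001142
Degree Master's
Language English
Pages 162
Committee Advisor: 鄭順林
Committee member: 馬瀰嘉
Committee member: 劉宗霖
Keywords (translated from Chinese) Second-Generation Sequencing  De Novo Genome Assembly  LAST alignment  Evaluation of genome assembly  Closely related species
Keywords (English) Second-Generation Sequencing  De Novo Assembly  LAST alignment  Evaluation of genome assembly  Closely related species
Subject classification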
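Abstract (translated from Chinese) To assess the quality of a genome assembly for a non-model species, we propose new indices that help judge whether an assembly is good or bad. The idea is to use the segments that the non-model species shares with closely related species: if one assembly contains more of these similar segments, we expect its quality may be better than that of assemblies containing fewer.

We assembled the model species Thaliana and the non-model species algae with three assembly tools (Velvet, SOAPdenovo, and ABySS). We then used the LAST alignment tool to align the Thaliana assemblies against both the Thaliana genome and the genome of the related species Lyrata, and computed the proposed indices to verify how well they assess assemblies. Finally, we applied the indices to our assemblies of the non-model species algae and to the algae assemblies provided by Wu (2012).

In the Thaliana study, we used the indices computed from Thaliana and those computed from Lyrata as criteria for selecting assemblies. In most cases, the assembly selected by the Thaliana-based indices was the same as the one selected by the indices based on the related species (Lyrata).

In the algae study, we examined the indices of the assemblies from the four data pre-processing methods provided by Wu (2012) (quality trimming, Hammer, HiTEC, and random shuffle). Judged by the new indices, Hammer performed slightly better than the other three methods.

We compare the proposed indices with the traditional assembly-evaluation indices (number of contigs, maximum contig length, N50, and total contig length) to assist the quality evaluation of de novo assemblies and to enlarge the set of candidate assemblies that may be of better quality.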
Abstract (English) To assess the quality of a non-model genome assembly, we propose new indices that help judge an assembly. The idea is to use the segments that the non-model species shares with its closely related species to aid the assessment. If one assembly contains more of these similar segments than the others, we expect that its quality may be better than theirs.
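
The idea of counting similar segments can be sketched as follows. This is only an illustration under stated assumptions: the record does not give the thesis's actual index definitions (they are in Section 3.3), so the measure here, the fraction of assembled bases that fall inside at least one aligned segment, is a hypothetical stand-in. The function names, the half-open interval convention, and the toy data are likewise assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a contig (assumed convention)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(contig_lengths: Dict[str, int],
                     hits: Dict[str, List[Interval]]) -> float:
    """Fraction of assembled bases lying inside at least one aligned segment.

    Hypothetical index for illustration; not the thesis's definition.
    """
    total = sum(contig_lengths.values())
    if total == 0:
        return 0.0
    covered = 0
    for intervals in hits.values():
        covered += sum(end - start for start, end in merge_intervals(intervals))
    return covered / total


# Invented toy data: two contigs and the segments that aligned to a related genome.
lengths = {"contig1": 100, "contig2": 50}
aligned = {"contig1": [(0, 30), (20, 60)], "contig2": [(10, 20)]}
fraction = covered_fraction(lengths, aligned)
```

Merging the intervals first matters because local aligners can report overlapping hits on the same contig; without it, the same bases would be counted more than once.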

We assembled the model species (Thaliana) and the non-model species (algae) using three assembly tools (Velvet, SOAPdenovo, and ABySS). The assemblies of the model species, Thaliana, were aligned against the Thaliana genome and against the genome of a related species, Lyrata, using the LAST alignment tool. We examined the consistency of the two alignments to check how well the proposed indices assess assemblies. We then applied the indices to the de novo assemblies of the non-model species, algae.

In the study of the model species, Thaliana, we separately used the indices calculated from the Thaliana genome and those calculated from the genome of the related species, Lyrata, as criteria for assembly quality. In most cases, the assemblies selected by the indices calculated from the Thaliana genome were the same as those selected by the indices calculated from the related species, Lyrata.
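
The consistency check described above amounts to asking whether two scoring criteria pick the same assembly. A minimal sketch, assuming each index is a single number where larger means better (an assumption; the record does not state the direction or form of the indices). The assembly names and values are invented for illustration and are not results from the thesis.

```python
from typing import Dict


def best_assembly(scores: Dict[str, float]) -> str:
    """Return the name of the assembly with the highest index value."""
    return max(scores, key=lambda name: scores[name])


# Invented values: the same three assemblies scored against two references.
index_vs_own_genome = {"assembly_a": 0.81, "assembly_b": 0.77, "assembly_c": 0.79}
index_vs_related = {"assembly_a": 0.52, "assembly_b": 0.47, "assembly_c": 0.50}

# The two criteria are consistent when they select the same assembly.
same_choice = best_assembly(index_vs_own_genome) == best_assembly(index_vs_related)
```

The absolute values are expected to differ between the two references, since a related genome shares fewer segments than the species' own genome; only the ranking is compared.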

In the study of the non-model species, algae, the indices of Wu’s (2012) assembly results for four data pre-processing methods (quality trimming, Hammer, HiTEC, and random shuffle) indicate that Hammer is slightly better than the other three.

We compare the proposed indices with the traditional indices (number of contigs, maximum contig length, N50, and total length of contigs) to assist the quality evaluation of de novo assemblies and to select candidate assemblies that may have better quality.
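
The four traditional indices named above can be computed from the list of contig lengths alone. The sketch below uses the common definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of all assembled bases); the function names and the example lengths are assumptions made for illustration.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Contig length at which the cumulative sum, longest first, reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs


def traditional_indices(lengths: List[int]) -> Dict[str, int]:
    """Number of contigs, maximum contig length, N50, and total contig length."""
    return {
        "num_contigs": len(lengths),
        "max_length": max(lengths, default=0),
        "n50": n50(lengths),
        "total_length": sum(lengths),
    }


# Invented contig lengths for illustration.
stats = traditional_indices([80, 70, 50, 40, 30, 20])
```

None of these four values uses a reference genome, which is why they describe contiguity rather than correctness.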
Table of contents 1. Introduction
1.1 Background and motivation
1.2 Introduction second-generation sequencing
1.3 Illumina sequencing procedure
1.4 Research procedure
1.5 Overview

2. Literature Review
2.1 Evaluation of denovo assembly
2.2 Alignment tools

3. Methodology
3.1 Dynamic trimming
3.2 Principle of LAST alignment tool
3.3 Proposed indices and traditional indices
3.3.1 Definition
3.3.2 LAST alignment and calculation steps

4. Verification and Comparison
4.1 Sequence format
4.2 Data description
4.3 Data analysis on Thaliana

5. Main Data Analysis
5.1 Data description
5.2 Analysis process
5.3 Analysis of Algae-NCKU
5.4 Analysis of Algae-NCBI

6. Conclusion and Future Work
6.1 Conclusion
6.2 Future work

Appendix A. The assembly results for Thaliana
A.1 Velvet
A.1.1 Original data
A.1.2 Trimmed data
A.2 SOAPdenovo
A.2.1 Original data
A.2.2 Trimmed data
A.3 ABySS
A.3.1 Original data
A.3.2 Trimmed data

Appendix B. The indices results for Thaliana

Appendix C. The assembly results for Algae-NCKU
C.1 Original data
C.2 Q30l50
C.3 HiTEC
C.4 Hammer
C.5 Random shuffle

Appendix D. The indices results for Algae-NCKU
D.1 Original
D.2 Q30l50
D.3 HiTEC
D.4 Hammer
D.5 Random shuffle
D.6 The mean indices of assemblies of the four preprocessed data

Appendix E. The assembly results for Algae-NCBI
E.1 Velvet
E.1.1 Original data
E.1.2 Trimmed data
E.2 SOAPdenovo
E.2.1 Original data
E.2.2 Trimmed data
E.3 ABySS
E.3.1 Original data
E.3.2 Trimmed data

Appendix F. The indices results for Algae-NCBI
F.1 Velvet
F.1.1 Original data
F.1.2 Trimmed data
F.2 SOAP
F.2.1 Original data
F.2.2 Trimmed data
F.3 ABySS
F.3.1 Original data
F.3.2 Trimmed data
F.4 The mean indices of assemblies for Algae-NCBI
F.4.1 Velvet
F.4.2 SOAP
F.4.3 ABySS
References Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J., “Basic local alignment search tool”, Journal of Molecular Biology, 1990, Vol 215, Pages 403-410.

Abouelhoda, Mohamed Ibrahim, Kurtz, Stefan, and Ohlebusch, Enno, “Replacing suffix trees with enhanced suffix arrays”, Journal of Discrete Algorithms, 2004, Vol 2, Pages 53-86.

Chang, Yu-Jung, Chen, Chien-Chih, Chen, Chuen-Liang, and Ho, Jan-Ming, “A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework”, BMC Genomics, 2012, Vol 13(Suppl 7), S28.

Dalloul, Rami A., Long, Julie A., Zimin, Aleksey V., Aslam, Luqman, Beal, Kathryn, Blomberg, Le Ann, Bouffard, Pascal, Burt, David W., Crasta, Oswald, Crooijmans, Richard P. M. A., Cooper, Kristal, Coulombe, Roger A., De, Supriyo, Delany, Mary E., Dodgson, Jerry B., Dong, Jennifer J., Evans, Clive, Frederickson, Karin M., Flicek, Paul, Florea, Liliana, Folkerts, Otto, Groenen, Martien A. M., Harkins, Tim T., Herrero, Javier, Hoffmann, Steve, Megens, Hendrik-Jan, Jiang, Andrew, Jong, Pieter de, Kaiser, Pete, Kim, Heebal, Kim, Kyu-Won, Kim, Sungwon, Langenberger, David, Lee, Mi-Kyung, Lee, Taeheon, Mane, Shrinivasrao, Marcais, Guillaume, Marz, Manja, McElroy, Audrey P., Modise, Thero, Nefedov, Mikhail, Notredame, Cedric, Paton, Ian R., Payne, William S., Pertea, Geo, Prickett, Dennis, Puiu, Daniela, Qioa, Dan, Raineri, Emanuele, Ruffier, Magali, Salzberg, Steven L., Schatz, Michael C., Scheuring, Chantel, Schmidt, Carl J., Schroeder, Steven, Searle, Stephen M. J., Smith, Edward J., Smith, Jacqueline, Sonstegard, Tad S., Stadler, Peter F., Tafer, Hakim, Tu, Zhijian, Tassell, Curtis P. Van, Vilella, Albert J., Williams, Kelly P., Yorke, James A., Zhang, Liqing, Zhang, Hong-Bin, Zhang, Xiaojun, Zhang, Yang, and Reed, Kent M., “Multi-Platform Next-Generation Sequencing of the Domestic Turkey (Meleagris gallopavo): Genome Assembly and Analysis”, PLoS Biology, 2010, Vol 8, e1000475.

Earl, Dent, Bradnam, Keith, John, John St., Darling, Aaron, Lin, Dawei, Fass, Joseph, Yu, Hung On Ken, Buffalo, Vince, Zerbino, Daniel R., Diekhans, Mark, Nguyen, Ngan, Ariyaratne, Pramila Nuwantha, Sung, Wing-Kin, Ning, Zemin, Haimel, Matthias, Simpson, Jared T., Fonseca, Nuno A., Birol, İnanç, Docking, T. Roderick, Ho, Isaac Y., Rokhsar, Daniel S., Chikhi, Rayan, Lavenier, Dominique, Chapuis, Guillaume, Naquin, Delphine, Maillet, Nicolas, Schatz, Michael C., Kelley, David R., Phillippy, Adam M., Koren, Sergey, Yang, Shiaw-Pyng, Wu, Wei, Chou, Wen-Chi, Srivastava, Anuj, Shaw, Timothy I., Ruby, J. Graham, Skewes-Cox, Peter, Betegon, Miguel, Dimon, Michelle T., Solovyev, Victor, Seledtsov, Igor, Kosarev, Petr, Vorobyev, Denis, Ramirez-Gonzalez, Ricardo, Leggett, Richard, MacLean, Dan, Xia, Fangfang, Luo, Ruibang, Li, Zhenyu, Xie, Yinlong, Liu, Binghang, Gnerre, Sante, MacCallum, Iain, Przybylski, Dariusz, Ribeiro, Filipe J., Yin, Shuangye, Sharpe, Ted, Hall, Giles, Kersey, Paul J., Durbin, Richard, Jackman, Shaun D., Chapman, Jarrod A., Huang, Xiaoqiu, DeRisi, Joseph L., Caccamo, Mario, Li, Yingrui, Jaffe, David B., Green, Richard E., Haussler, David, Korf, Ian, and Paten, Benedict, “Assemblathon 1: A competitive assessment of de novo short read assembly methods”, Genome Research, 2011, Vol 21, Pages 2224-2241.

Frith, Martin C., Wan, Raymond, and Horton, Paul, “Incorporating sequence quality data into alignment improves DNA read mapping”, Nucleic Acids Research, 2010, Vol 38, e100.

Gnerre, Sante, MacCallum, Iain, Przybylski, Dariusz, Ribeiro, Filipe J., Burton, Joshua N., Walker, Bruce J., Sharpe, Ted, Hall, Giles, Shea, Terrance P., Sykes, Sean, Berlin, Aaron M., Aird, Daniel, Costello, Maura, Daza, Riza, Williams, Louise, Nicol, Robert, Gnirke, Andreas, Nusbaum, Chad, Lander, Eric S., and Jaffe, David B., “High-quality draft assemblies of mammalian genomes from massively parallel sequence data”, Proc. Natl. Acad. Sci., 2011, Vol 108, Pages 1513-1518.

Holt, R.A., and Jones, S.J., “The new paradigm of flow cell sequencing”, Genome Research, 2008, Vol 18, Pages 839-846.

Kent, James, “BLAT The BLAST-Like Alignment Tool”, Genome Research, 2002, Vol 12, Pages 656-664.

Kelley, David R., Schatz, Michael C., and Salzberg, Steven L., “Quake: Quality-aware detection and correction of sequencing errors”, Genome Biology, 2010, Vol 11, R116.

Kielbasa, Szymon M., Wan, Raymond, Sato, Kengo, Horton, Paul, and Frith, Martin C., “Adaptive seeds tame genomic sequence comparison”, Genome Research, 2011, Vol 21, Pages 487-493.

Li, Ruiqiang, Zhu, Hongmei, Ruan, Jue, Qian, Wubin, Fang, Xiaodong, Shi, Zhongbin, Li, Yingrui, Li, Shengting, Shan, Gao, Kristiansen, Karsten, Li, Songgang, Yang, Huanming, Wang, Jian, and Wang, Jun, “De novo assembly of human genomes with massively parallel short read sequencing”, Genome Research, 2010, Vol 20, Pages 265-272.

Li, R, Fan, W, Tian, G, Zhu, H, He, L, Cai, J, Huang, Q, Cai, Q, Li, B, Bai, Y, Zhang, Z, Zhang, Y, Wang, W, Li, J, Wei, F, Li, H, Jian, M, Li, J, Zhang, Z, Nielsen, R, Li, D, Gu, W, Yang, Z, Xuan, Z, Ryder, OA, Leung, FC, Zhou, Y, Cao, J, Sun, X, Fu, Y, Fang, X, Guo, X, Wang, B, Hou, R, Shen, F, Mu, B, Ni, P, Lin, R, Qian, W, Wang, G, Yu, C, Nie, W, Wang, J, Wu, Z, Liang, H, Min, J, Wu, Q, Cheng, S, Ruan, J, Wang, M, Shi, Z, Wen, M, Liu, B, Ren, X, Zheng, H, Dong, D, Cook, K, Shan, G, Zhang, H, Kosiol, C, Xie, X, Lu, Z, Zheng, H, Li, Y, Steiner, CC, Lam, TT, Lin, S, Zhang, Q, Li, G, Tian, J, Gong, T, Liu, H, Zhang, D, Fang, L, Ye, C, Zhang, J, Hu, W, Xu, A, Ren, Y, Zhang, G, Bruford, MW, Li, Q, Ma, L, Guo, Y, An, N, Hu, Y, Zheng, Y, Shi, Y, Li, Z, Liu, Q, Chen, Y, Zhao, J, Qu, N, Zhao, S, Tian, F, Wang, X, Wang, H, Xu, L, Liu, X, Vinar, T, Wang, Y, Lam, TW, Yiu, SM, Liu, S, Zhang, H, Li, D, Huang, Y, Wang, X, Yang, G, Jiang, Z, Wang, J, Qin, N, Li, L, Li, J, Bolund, L, Kristiansen, K, Wong, GK, Olson, M, Zhang, X, Li, S, Yang, H, Wang, J, and Wang, J, “The sequence and de novo assembly of the giant panda genome”, Nature, 2010, Vol 463, Pages 311-317.

Miller, Jason R., Delcher, Arthur L., Koren, Sergey, Venter, Eli, Walenz, Brian P., Brownley, Anushka, Johnson, Justin, Li, Kelvin, Mobarry, Clark, and Sutton, Granger, “Aggressive assembly of pyrosequencing reads with mates”, Bioinformatics, 2008, Vol 24, Pages 2818-2824.

Morozova, Olena, and Marra, Marco A., “Applications of next-generation sequencing technologies in functional genomics”, Genomics, 2008, Vol 92, Pages 255-264.

Narzisi, Giuseppe, and Mishra, Bud, “Comparing de novo genome assembly: The long and short of it”, PLoS ONE, 2011, Vol 6, e19175.

Simpson, Jared T., Wong, Kim, Jackman, Shaun D., Schein, Jacqueline E., Jones, Steven J.M., and Birol, İnanç, “ABySS: A parallel assembler for short read sequence data”, Genome Research, 2009, Vol 19, Pages 1117-1123.

Simpson, Jared T., and Durbin, Richard, “Efficient de novo assembly of large genomes using compressed data structures”, Genome Research, 2012, Vol 22, Pages 549-556.

Schröder, Jan, Bailey, James, Conway, Thomas, and Zobel, Justin, “Reference-free validation of short read data”, PLoS ONE, 2010, Vol 5, e12681.

Sanger, F., Nicklen, S., and Coulson, A.R., “DNA sequencing with chain-terminating inhibitors”, Proc. Natl. Acad. Sci., 1977, Vol 74, Pages 5463-5467.

Shendure, J., Porreca, G.J., Reppas, N.B., Lin, X., McCutcheon, J.P., Rosenbaum, A.M., Wang, M.D., Zhang, K., Mitra, R.D., and Church, G.M., “Accurate multiplex polony sequencing of an evolved bacterial genome”, Science, 2005, Vol 309, Pages 1728-1732.

Salzberg, Steven L., Phillippy, Adam M., Zimin, Aleksey, Puiu, Daniela, Magoc, Tanja, Koren, Sergey, Treangen, Todd J., Schatz, Michael C., Delcher, Arthur L., Roberts, Michael, Marcais, Guillaume, Pop, Mihai, and Yorke, James A., “GAGE: A critical evaluation of genome assemblies and assembly algorithms”, Genome Research, 2012, Vol 22, Pages 557-567.

Vezzi, Francesco, Narzisi, Giuseppe, and Mishra, Bud, “Feature-by-Feature Evaluating De Novo Sequence Assembly”, PLoS ONE, 2012, Vol 7, e31002.

Warren, R.L., Sutton, G.G., Jones, S.J., and Holt, R.A., “Assembling millions of short DNA sequences using SSAKE”, Bioinformatics, 2007, Vol 23, Pages 500-501.

Wu, Yu-Fu, “Improving the De Novo Assembly by Quality Assessment and Error Correction of Second-Generation Sequencing Data”, Master’s thesis, National Cheng Kung University, 2012.

Zerbino, Daniel R., and Birney, Ewan, “Velvet: Algorithms for de novo short read assembly using de Bruijn graphs”, Genome Research, 2008, Vol 18, Pages 821-829.
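Full-text access permissions
  • Authorized for on-campus browsing/printing of the electronic full text; publicly available from 2018-07-29.
  • Authorized for off-campus browsing/printing of the electronic full text; publicly available from 2018-07-29.


  • If you have any questions, please contact the library.
    Phone: (06)2757575#65773
    E-mail: etds@email.ncku.edu.tw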