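System ID: U0026-0812200910183159
Title (Chinese): 基因表現時間序列的叢集分析方法與系統實作
Title (English): Clustering Time-Series Gene Expression: A New Method and Implementation
University: National Cheng Kung University
Department (Chinese): 資訊工程學系碩博士班
Department (English): Institute of Computer Science and Information Engineering
Academic year: 90 (ROC calendar)
Semester: 2
Year of publication: 91 (ROC calendar, i.e. 2002)
Author (Chinese): 陳延洛
Author (English): Len-Lo Chen
Student ID: p7689130
Degree: Master's
Language: Chinese
Pages: 56
Oral defense committee: Advisor: 曾新穆
Committee member: 李強
Committee member: 蔣榮先
Committee member: 何中良
Keywords (translated from Chinese): gene microarray; similarity measure; data mining; cluster analysis; time series; gene expression
Keywords (English): microarray; similarity measure; data mining; gene expression; time-series; cluster analysis
Subject classification: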
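Abstract (translated from Chinese): This study proposes a cluster analysis method suited to gene expression time-series data. Although several methods for analyzing time-series data already exist, they do not adequately handle the offset, scaling, shift, and noise found in gene expression time series. We therefore propose a new measure of similarity between gene expression time series, called GETSS. The measure accounts for offset, scaling, shift, and noise between two gene expression time series in order to find the portions in which two genes respond similarly. Experiments show that our method indicates the correlation between two genes better than the ordinary correlation coefficient measure.
In this thesis we combine GETSS with existing clustering methods, namely CAST, K-Medoids, and HAC, to design and implement a system that performs cluster analysis on gene expression time-series data and presents the clustering results through a graphical interface. With this system, biologists can analyze gene expression time-series data more conveniently and quickly.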

Abstract (English): This research presents a new clustering approach suited to gene expression time-series data. Although several methods have been proposed for time-series data, they do not adequately handle the offset, scaling, shift, and noise found in gene expression time series. We therefore propose a new similarity measure, GETSS, that addresses offset, scaling, shift, and noise when finding similar time-series expression patterns. In our experiments, GETSS reveals the correlation between two gene expression time series more accurately than other measures.
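The offset, scaling, and shift problems named in the abstract can be illustrated with a small sketch. This is not GETSS, whose algorithm this record does not describe; it is only a plain Pearson correlation scanned over a few time lags. The function names, the parameters `max_lag` and `min_overlap`, and the two toy profiles `s` and `t` are all assumptions made for illustration, not data from the thesis.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no shape information
    return cov / (sx * sy)

def best_lagged_pearson(x, y, max_lag=2, min_overlap=4):
    """Try every lag in [-max_lag, max_lag] and return (correlation, lag).

    A positive lag means y is delayed relative to x by that many time points.
    Only the overlapping part of the two sequences is compared.
    """
    best_r, best_lag = -math.inf, 0
    n = len(x)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[:n - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:n + lag]
        if len(xs) < min_overlap:
            continue  # too few shared time points to compare
        r = pearson(xs, ys)
        if r > best_r:
            best_r, best_lag = r, lag
    return best_r, best_lag

# Toy profiles (hypothetical values): t is s scaled by 2, offset by 5,
# and delayed by one time point (the first value of t is padding).
s = [0.0, 1.0, 3.0, 1.0, 0.0, -1.0, -3.0, -1.0]
t = [5.0, 5.0, 7.0, 11.0, 7.0, 5.0, 3.0, -1.0]

print("plain correlation:", pearson(s, t))
print("best lagged correlation and lag:", best_lagged_pearson(s, t))
```

Pearson correlation already ignores offset and scaling, because it centers each sequence and divides by its spread. A time shift is what lowers the plain correlation here, and scanning lags recovers the match. Handling noise, as the abstract says GETSS does, would need more than this sketch provides.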
Based on the proposed similarity measure, we also design and implement a system for clustering gene expression time-series data. The system integrates GETSS with representative clustering methods such as CAST, K-Medoids, and HAC, so that biologists can analyze time-series gene expression more effectively.
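A medoid-based method can work directly from pairwise similarities, with no coordinates needed. The sketch below is a generic alternating k-medoids over a precomputed dissimilarity matrix, not the thesis's system. The function names, the farthest-first initialization, and the hand-made similarity values are assumptions for illustration only; in the described system the similarities would come from GETSS.

```python
def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]

def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a square dissimilarity matrix (k <= n)."""
    n = len(dist)
    # Farthest-first initialization (an assumption of this sketch):
    # start from item 0, then add the item farthest from all chosen medoids.
    medoids = [0]
    while len(medoids) < k:
        candidates = (i for i in range(n) if i not in medoids)
        medoids.append(
            max(candidates, key=lambda i: min(dist[i][m] for m in medoids))
        )
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # The new medoid is the member closest, in total, to its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Hypothetical pairwise similarities for five gene profiles (not thesis data).
sim = [
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.10],
    [0.10, 0.15, 0.20, 1.00, 0.95],
    [0.20, 0.10, 0.10, 0.95, 1.00],
]
dist = [[1.0 - v for v in row] for row in sim]  # similarity -> dissimilarity

medoids, labels = k_medoids(dist, k=2)
print("medoids:", medoids)
print("labels:", labels)
```

Because each cluster is represented by an actual data item, the only input required is the matrix of pairwise similarities. That is why a custom similarity measure can be substituted without changing the clustering step.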

Table of contents:

English Abstract
Chinese Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Introduction to Gene Expression Time-Series Data Analysis
1.2 Motivation
1.3 Objectives and Contributions
1.4 Organization of the Thesis
Chapter 2 Related Work
2.1 Gene Expression Analysis
2.1.1 Cluster Analysis of Gene Expression Data
2.1.1.1 Overview of HAC
2.1.1.2 Overview of K-Means
2.1.1.3 Overview of CAST
2.1.2 Analysis of Gene Expression Time-Series Data
2.2 Time-Series Analysis
2.2.1 Sequence Normalization
2.2.2 Dynamic Time Warping
2.2.3 Longest Common Subsequence
Chapter 3 Clustering Method for Gene Expression Time Series
3.1 Time-Series Similarity
3.1.1 The Shift and Noise Problems
3.1.2 Analysis of Methods for Handling Shift and Noise
3.1.3 A New Method for Handling Shift and Noise: GETSS
3.2 Time-Series Cluster Analysis
3.2.1 Combining with CAST
3.2.2 Combining with K-Medoids
3.2.3 Combining with HAC
3.3 Multidimensional Time-Series Cluster Analysis
Chapter 4 Experimental Analysis
4.1 Data Description
4.2 Evaluation of the Time-Series Similarity Method
4.2.1 Similarity Comparison
4.2.2 Comparison of Similarity Improvement
4.2.3 Examples
4.3 Cluster Analysis
Chapter 5 System Design and Implementation
5.1 System Design
5.1.1 System Architecture
5.1.2 System Flow
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Applications
6.3 Future Work
References

List of Figures

Figure 1 Gene expression of YLR256W and YPL028W before and after shifting
Figure 2 Concept of HAC
Figure 3 Concept of density-based clustering
Figure 4 An example of dynamic time warping
Figure 5 An example of a singularity
Figure 6 Illustration of the warping steps
Figure 7 Concept of LCS
Figure 8 Concept of sequence matching
Figure 9 Concept of window stitching
Figure 10 Two sequences related by scaling, offset, shift, and noise
Figure 11 Comparison of shift and noise in similar sequences
Figure 12 An example of a movement path
Figure 13 Movement path of S and T from Table 1a with mismatch = 2
Figure 14 Similarity distribution of the 343 activation pairs
Figure 15 Gene expression of YLR256W and YPL028W at mismatch 1
Figure 16 Gene expression of YBL021C and YNL052W at mismatch 1
Figure 17 Gene expression of YEL009C and YMR300C at mismatch 2
Figure 18 Gene expression of YAL040C and YER111C at mismatch 3
Figure 19 Illustration of changes in cluster assignment
Figure 20 System architecture
Figure 21 System flowchart

List of Tables

Table 1 Gene expression time-series data of S and T
Table 2 The Cho / Spellman gene expression time-series data
Table 3 Execution time for computing the similarity matrix of dataset 1
Table 4 Similarity distribution of the 343 activated gene pairs
Table 5 Average similarity of the 343 activation pairs
Table 6 Distribution of similarity improvement
Table 7 Distribution of average similarity improvement
Table 8 Percentage of the 343 activation pairs with similarity improvement > 0.5
Table 9 Gene expression time-series data of YLR256W and YPL028W
Table 10 Gene expression time-series data of YBL021C and YNL052W
Table 11 Gene expression time-series data of YEL009C and YMR300C
Table 12 Gene expression time-series data of YAL040C and YER111C
Table 13 Clustering results for the 343 activation pairs

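Full-text access rights:
  • On-campus browsing and printing of the electronic full text is authorized; publicly available from 2002-07-10.
  • Off-campus browsing and printing of the electronic full text is authorized; publicly available from 2002-07-10.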

