進階搜尋


 
系統識別號 U0026-0812200911432901
論文名稱(中文) 基於符號表示法之高效能的時間子序列探勘方法
論文名稱(英文) Efficient Mining of Similar Subsequences in Time-Series Data by Using Symbolic Methods
校院名稱 成功大學
系所名稱(中) 資訊工程學系碩博士班
系所名稱(英) Institute of Computer Science and Information Engineering
學年度 93
學期 2
出版年 94
研究生(中文) 劉建杰
研究生(英文) Jian-Jie Liu
學號 p7692169
學位類別 碩士
語文別 中文
論文頁數 59頁
口試委員 召集委員-李建億
口試委員-李強
口試委員-盧文祥
口試委員-辛致煒
指導教授-曾新穆
中文關鍵字 資料探勘  時間序列  基因微陣列  符號表示法  相似度子序列探勘 
英文關鍵字 Time-Series  Symbolic method  Data Mining  Microarray  Similar Subsequence 
學科別分類
中文摘要   時間序列的分析隨著各種領域的重視,如:多媒體、生物資訊、經濟學及科學上的應用,如何有效地由時間序列中探勘出有用的資訊就變得越來越重要。而時間序列資料的特性為高維度及龐大的資料量,因此許多研究者為了效率上的考量,採用了近似法的方式來取代原有的時間序列資料。近似法方法的主要精神是將原有的時間序列資料,對應到另一個維度,以供後續的分析。雖然近似演算法最為人所垢病的就是必須犧牲一些準確性,但我們仍提出一以近似演算法為基礎的方法。在本篇論文中,我們主要的目標在於有效率地搜尋時間序列中的相似子序列向量,且希望能夠兼顧準確性及效率,進而提供不同領域對於時間序列的不同需要,如:負相關序列於生物晶片上的應用。我們所提出的方法也是採用符號表示法,但改進原有的方法及參考其它方法的特性。在我們的實驗中証實了,我們所提出的方法大多優於Agrawal等人所提出的方法及Time-Lagged。
英文摘要   As the modern applications in various kinds of domains, such as multimedia, bioinformatics, finance and science, intensively increase, an efficient method becomes extremely important for retrieving useful knowledge from time-series data. Those kinds of information are usually high-dimensionality and involve huge amount of data, such that many researchers use the approximation-like methods to reduce the dimensionality of the data for performance improvement. The main concepts of those popular solutions are to transform the original data into some representatives and use them in later analysis. We proposed the techniques which have good quality in searching similar subsequences, although most approximation-like methods always lead to the increasing error rate. In this paper, we focus on the efficient method of similar subsequences searching which both consider the balance between performance and accuracy and give the ability to find patterns with different domain knowledge, like negative effect in time-series microarray Data. We proposed a solution which uses symbolic method for searching similar subsequences, and integrate the advantages of other methods. The experiments on biological data show that the scalability compared to Agrawal’s and Time-lagged method is much better.
論文目次 英文摘要......................................................I
中文摘要....................................................III
誌謝.........................................................IV
目錄..........................................................V
表目錄.....................................................VIII
圖目錄.......................................................IX

第一章 導論..................................................1
 1.1 研究背景...............................................1
 1.2 研究動機...............................................2
 1.3 問題描述...............................................3
 1.4 研究方法及貢獻.........................................7
 1.5 論文架構...............................................8

第二章 文獻探討..............................................9
 2.1 時間序列分析...........................................9
  2.1.1 基本的序列測量方法.................................9
  2.1.2 動態時間變形......................................10
  2.1.3 最長相同子序列....................................11
  2.1.4 Agrawal...........................................11
 2.2 時間序列轉換..........................................14
  2.2.1 傅利葉轉換(Discrete Fourier Transform)............14
  2.2.2 PAA&APCA..........................................14
  2.2.3 小波轉換(Wavelet Transfrom).......................15
  2.2.4 Time-Legged Method................................16
  2.2.5 Symbolic Aggregate approXimation(SAX).............16
 2.3 相關研究總結..........................................17

第三章 時間序列的子序列探勘方法.............................18
 3.1 方法討論..............................................18
 3.2 角度轉換..............................................19
 3.3 符號轉換..............................................20
 3.4 相似子序列探勘........................................22
  3.4.1 Suffix Tree.......................................22
  3.4.2 改進方法..........................................26
 3.5 負相關序列之尋找及所能提供之彈性......................30

第四章 系統設計與實作.......................................33
 4.1 系統架構..............................................33
 4.2 系統參數之設定........................................34

第五章 實驗分析.............................................35
 5.1 實驗一................................................35
  5.1.1 資料簡介..........................................35
  5.1.2 查詢系統之展示....................................36
  5.1.3 時間序列方法之評估比較............................39
 5.2 實驗二................................................42
  5.2.1 資料簡介..........................................42
  5.2.2 實驗結果..........................................43
 5.3 時間序列長度對於效率的影響............................50
 5.4 實驗總結..............................................51

第六章 結論與未來研究方向...................................52
 6.1 結論..................................................52
 6.2 應用..................................................53
 6.3 未來研究方向..........................................54

參考文獻.....................................................55
參考文獻 [1] R. Agrawal, K. I. Lin, H. S. Sawhney, and K. Shim, "Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases," In Proc. the 21st Int'l Conf. on Very Large Data Bases, Zurich, Switzerland, pp. 490-501, Sept. 1995.
[2] V. Filkov, S. Skiena, and J. Zhi, "Analysis techniques for microarray time-series data," in Proceedings of the Fifth Annual International Conference on Computational Biology (RECOMB), Montreal, Canada, pp. 124-131, 2001.
[3] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. Lockhart, and Davis R.W. "A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle," Molecular Cell, 2, pp. 65-73, July 1998.
[4] Spellman, PT, Sherlock, G, Zhang, MQ, Iyer, VR, Anders, K, Eisen, MB, Brown, PO, Botstein, D, and B. Futcher, "Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization," Mol Biol Cell. 9, pp. 3273-3297, 1998.
[5] D. J. Berndt and J. Clifford, "Using Dynamic Time Warping to Find Patterns in Time Series," In KDD-94: AAAI Workshop on Knowledge Discovery in Databases, Seattle, Washington, pp. 359-370, July 1994.
[6] E. Keogh and M. J. Pazzani, "Derivative Dynamic Time Warping," In First SIAM International Conference on Data Mining (SDM') Chicago, IL, USA, pp. 5-7, April 2001.
[7] B. Bollobas, Gautam Das, Dimitrios Gunopulos, and H. Mannila., "Time-Series Similarity Problems and Well-Separated Geometric Sets," In Proceedings of the Association for Computing Machinery Thirteenth Annual Symposium on Computational Geometry, pp. 454--476, 1997.
[8] B. Ewing and P.. Green, "Analysis of expressed sequence tags indicates 35,000 human genes,". Nature Genetics 25, pp. 232-234, 2000.
[9] A. Brazma and J. Vilo, "Gene expression data analysis," FEBS Letters, 480, pp.17-24. BIOKDD01: Workshop on Data Mining in Bioinformatics (with SIGKDD01, Conference), pp. 29, 2000.
[10] A. Ben-Dor and Z. Yakhini, "Clustering gene expression patterns," In Proceedings of the Third Annual International Conference on Computational Molecular Biology( RECOMB99), Lyon, France, pp. 33-42, March 1999.
[11] P. Tamayo, D. Slonim, J. Mesirou, Q. Zhu, S. Kitareewan, E. Dmitrovsky, ES. Lander and TR. Golub, "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation," Proc Natl Acad Sci, USA, 96, pp. 2907, 1999.
[12] Vincent S. M. Tseng, C. P. Kao, "Efficiently Mining Gene Expression Data via Integrated Clustering and Validation Techniques," Proceedings of the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining, (PAKDD), Taipei, Taiwan, pp. 432-437, May 2002.
[13] M. Eisen, P. T. Spellman, D. Botstein and P. O. Brown, "Cluster analysis and display of genome-wide expression patterns," Proceedings of National Academy of Science USA, 95, pp. 14863—14867, 1998.
[14] D. Goldin and P. Kanellakis, "On similarity queries for time-series data: constraint specification and implementation," In proceedings of the 1st International Conference on the Principles and Practice of Constraint Programming. Cassis, France, pp. 137-153, Sept 19-22 1995.
[15] J. B. Kruskall and M. Liberman, "The symmetric time warping algorithm: From continuous to discrete," In Time warps, String Edits and Macromolecules: The Theory and Practice of String Comparison. Addison-Wesley, 1983.
[16] C. Myers, L. Rabiner and A. Roseneberg, "performance tradeoffs in dynamic time warping algorithms for isolated word recognition," IEEE Trans. Acoustics, Speech, and Signal Proc, 28, pp. 623-635, 1983.
[17] Tolga Bozkaya, Nasser Yazdani, and Meral Ozsoyoglu, "Matching and Indexing Sequences of Different Lengths," In Proceedings of the Association for Computing Machinery Sixth International Conference on Information and Knowledge Management(ACM), Las Vegas, NV, USA, pp. 128-135, November 1997.
[18] S. Raychaudhuri, P. D. Sutphin, J. T. Chang and R. B. Altman, "Basic microarray analysis: Grouping and feature reduction," Trends in Biotechnology, 19(5), pp. 189-193, 2001.
[19] J.B. McQueen, "Some Methods of Classification and Analysis of Multivariate Observations," Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1976.
[20] L. Kaufman and P. J. Rousseeuw, "Finding groups in data: an Introduction to cluster analysis," John Wiley & Sons, 1990.
[21] J. Aach and G. Church, "Aligning gene expression time series with time warping algorithms," Bioinformatics. 17, pp. 495-508, 2001.
[22] Alexander V. Lukashin and Rainer Fuchs. "Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters," Bioinformatics 17: 405-414., 2001.
[23] Z. Bar-Joseph, G. Gerber, D. Gifford, and T. Jaakkola. "A new approach to analyzing gene expression time series data," In the Sixth Annual International Conference on Research in Computational Molecular Biology, 2002.
[24] M. Schena, D. Shalon, R. W. Davis and P. O. Brown, "Quantitative monitoring of gene expression patterns with a complementary DNA microarray," Science, 270, pp. 467-470, 1995.
[25] J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su and J. M. Trent, "Use of a cDNA microarray to analyze gene expression patterns in human cancer," Nature Genetics, 14(4), pp. 457-60, 1996.
[26] J. L. DeRisi, V. Iyer and P. O. Brown, "Exploring the metabolic and genetic control of gene expression on a genomic scale," Science, 278, pp. 680-686, 1997.
[27] C. Faloutsos, M. Ranganathan and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," In proceedings of the ACM SIGMOD Int’l Conference on Management of Data, Minneapolis, MN, pp. 419-429, May 24-27 1994.
[28] E. Keogh, K. Chakrabarti, M. Pazzani and S. Mehrotra, "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases," In proceedings of ACM SIGMOD Conference on Management of Data. Santa Barbara, CA, pp. 151-162, May 21-24 2001.
[29] K. Chan and A. W. Fu, "Efficient Time Series Matching by Wavelets," In proceedings of the 15th IEEE Int'l Conference on Data Engineering. Sydney, Australia, pp. 126-133, Mar 23-26 1999.
[30] K. Chu and M. Wong, "Fast time-series searching with scaling and shifting," In Proceedings of ACM Principles on Database Systems, Philadelphia, PA, pp. 237–248, 1999.
[31] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, "Dimensionality reduction for fast similarity search in large time series databases," Journal of Knowledge and Information Systems, 3(3), pp. 263–286, 2000.
[32] T. Kahveci and A. Singh, "Variable length queries for time series data," In Proceedings of 17th Int’l Conf. on Data Engineering, Heidelberg, Germany, pp. 273–282, 2001.
[33] E.M. McCreight, "A space-economical suffix tree construction algorithm," Journal of the ACM, 23, pp. 262-272, 1976.
[34] E. Ukkonen, "On-line construction of suffix trees. Algorithmica," 14(3), pp. 249-260, September 1995.
[35] J. Lin and E. Keogh, "A Symbolic Representation of Time Series, with Implications for Streaming Algorithms," ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, June 13, 2003.
[36] T. Argyros and C. Ermopoulos. "Efficient Subsequence Matching in Time Series Databases Under Time and Amplitude Transformations," TKDE, 2003.
[37] L. Ji and K. L. Tan, "Identifying Time-Lagged Gene Clusters on Gene Expression Data," Bioinformatics, September 16 2004.
[38] 高慶斌, "應用於基因表現探勘之高效率叢集方法及其效能評估," 國立成功大學資訊工程研究所, 碩士論文, 民國九十年六月
論文全文使用權限
  • 同意授權校內瀏覽/列印電子全文服務,於2005-08-31起公開。
  • 同意授權校外瀏覽/列印電子全文服務,於2005-08-31起公開。


  • 如您有疑問,請聯絡圖書館
    聯絡電話:(06)2757575#65773
    聯絡E-mail:etds@email.ncku.edu.tw