||Time Series Data Preprocessing and Pattern Mining Techniques
||Institute of Computer Science and Information Engineering
time series analysis
In recent year, data mining techniques have been extensively used in data analysis due to the wide availability of huge amounts of data and imminent need for mining useful information from such data. The useful information is like King Solomon’s treasure which can be greatly helpful in many fields, such as business strategies, society analysis, scientific exploration, medical research and etc. However, real-world databases are highly susceptible to noisy and inconsistent due to the heterogeneous sources and huge size. Low-quality data is just like a blurring treasure map, explorers can never find the correct site. How to improve the quality of data and enhance the mining results thus presents a challenge. In addition, how to mine interesting patterns from the special case of transaction databases (i.e. data with small samples and large number of features) is also a critical issue.
In this dissertation, some effective preprocessing approaches are proposed for data analysis and pattern discovery. We address the issue on two types of data, time-series databases and transaction databases.
For the time series databases, we propose two discretization methods for pattern discovery. The first one is a segmentation-based discretization method, which combines the clustering techniques and genetic algorithm to derive patterns automatically. The second one is a reduction and symbolization-based discretization method, namely PIP-SAX. The proposed method is utilized in Emerging Patterns (EPs) mining. Experimental results on real financial dataset also show the effectiveness of the two proposed approaches.
Moreover, for the special case of transaction databases which has small samples and large number of features, a novel multi-information-based data reduction approach hybrid with a knowledge-based data integration approach is proposed. The proposed method has two main advantages. The first one is that the proposed approach can take multi-criterion into consideration in different order. The second one is that the proposed approach can get better results by integrating with heterogeneous knowledge databases.
摘 要 I
LIST OF FIGURES VIII
LIST OF TABLES IX
CHAPTER 1 INTRODUCTION 1
1.1 MOTIVATION 1
1.2 OVERVIEW OF THE DISSERTATION 3
1.2.1 Time Series Segmentation and Pattern Discovery Approaches 3
1.2.2 The PIP-SAX-Based Emerging Pattern Mining Approach 4
1.2.3 A Multi-Information-Based Gene Scoring Approach with Biological Knowledge 5
1.3 ORGANIZATION OF THE DISSERTATION 5
CHAPTER 2 TIME SERIES SEGMENTATION AND PATTERN DISCOVERY APPROACHES 6
2.1 INTRODUCTION 6
2.2 REVIEW OF DIMENSION REDUCTION APPROACHES 8
2.2.1 Discrete Wavelet Transformation 9
2.2.2 Perceptually Important Points (PIP) 10
2.3 REVIEW OF CLUSTER VALIDATION TECHNIQUES 11
2.4 REVIEW OF TIME SERIES SEGMENTATION APPROACHES 13
2.5 GENETIC COMPONENTS OF THE PREVIOUS APPROACH 15
2.5.1 Chromosome Representation and Initial Population 15
2.5.2 Clustering Segments into Groups 16
2.5.3 Fitness and Selection 17
2.5.4 Genetic Operators 19
2.6 THE PROPOSED PIP-BASED TIME SERIES SEGMENTATION AND PATTERN DISCOVERY ALGORITHM 19
2.6.1 Incorporating Perceptually Important Points (PIP) Approach 19
2.6.2 Enhanced Suitability Factor for Chromosome Evaluation 20
2.6.3 The Proposed Evolutionary Algorithm for PIP-based Time-Series Segmentation and Pattern Discovery 23
2.7 THE PROPOSED GRANULARITY-BASED TIME SERIES SEGMENTATION AND PATTERN DISCOVERY ALGORITHM 24
2.7.1 Incorporating Granular Computing Concept into Pattern Discovery 25
2.7.2 Improved Fitness and Selection 28
2.7.3 The Proposed Algorithm 31
2.8 EXPERIMENTAL RESULTS OF THE FIRST PROPOSED METHOD 33
2.8.1 The Efficiency of the Proposed Approach 34
2.8.2 Comparison of the Proposed and the Previous Approaches 36
2.8.3 The Impact of the Improved Suitability Factor 38
2.9 EXPERIMENTAL RESULTS OF THE SECOND PROPOSED METHOD 39
2.9.1 The Effectiveness of the Proposed Approach 39
2.9.2 Comparison of the Proposed and Previous approaches 42
2.10 SUMMARY 44
CHAPTER 3 THE PIP-SAX-BASED EMERGING PATTERN MINING APPROACH 47
3.1 INTRODUCTION 47
3.2 RELATED WORK 49
3.2.1 Symbolic Aggregative Approximation (SAX) 50
3.2.2 Perceptually Important Points (PIPs) 51
3.2.3 Mining Approaches for Emerging Patterns (EPs) 52
3.3 PROBLEM DEFINITION 53
3.4 THE PROPOSED APPROACH FOR MINING TIME SERIES EPS 55
3.4.1 The Proposed Framework 55
3.4.2 Data Transformation 56
3.4.3 EPs Mining Algorithm: TSEPsMiner 56
3.5 EXPERIMENTAL EVALUATION 59
3.5.1 Experimental Datasets 59
3.5.2 Number of Discovered EPs 60
3.5.3 Average Length of Discovered EPs 63
3.5.4 Average Growth Rate of Discovered EPs 64
3.6 SUMMARY 64
CHAPTER 4 A MULTI-INFORMATION-BASED GENE SCORING APPROACH WITH BIOLOGICAL KNOWLEDGE 66
4.1 INTRODUCTION 66
4.2 REVIEW OF PROPOSED FRAMEWORK 68
4.2.1 General Preprocessing Module 69
4.2.2 Multi-Information-Based Gene Scoring Method with Biological Knowledge Module 69
4.2.3 Classification Module 71
4.3 MULTI-INFORMATION-BASED GENE SCORING METHOD WITH BIOLOGICAL KNOWLEDGE 71
4.3.1 Gaussian Overlap Approach 71
4.3.2 Transformation of Gene Expression Values 71
4.3.3 Transformation of Biological Knowledge 72
4.3.4 Threshold Number of Misclassification (TNoM) Approach 74
4.4 EXPERIMENTAL RESULTS 75
4.4.1 Estimation of TNoM Score 76
4.4.2 Accuracy of Classification 78
4.4.3 Decision Tree Rules 79
4.4.4 The Most Informative GO Terms 79
4.5 SUMMARY 82
CHAPTER 5 CONCLUSIONS AND FUTURE WORK 83
 J. Aach and G. Church, Aligning gene expression time series with time warping algorithms, Bioinformatics, Vol. 17, pp 495-508, 2001.
 R. Agrawal, C. Faloutsos, & A. Swami, Efficient similarity search in sequence databases, Proc. of the 4th Conf. on Foundations of Data Organization and Algorithms, pp.69-84, 1993.
 R. Agrawal, T. Imielinski, & A. Swami, Mining association rules between sets of items in large databases, Proc. of 1993 ACM SIGMOD Int'l Conf. on Management of Data, pp.207-216, 1993.
 R. Agrawal, & R. Srikant, Fast algorithms for mining association rules, Proc. of the 20th VLDB Conference, pp.478–499, 1994.
 M. S. Aldenderfer and R. K. Blashfield, Cluster Analysis, Sage Publications, Beverly Hills, USA, 1984.
 J. M. Ale, & G. H. Rossi, An approach to discovering temporal association rules, Proc. of the 2000 ACM Symposium on Applied Computing, pp.294-300, 2000.
 U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, A.J. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays, in Proceeding of the National Academy Sciences USA 96 (1999) 6745-6750.
 F. Al-Shahrour, R. Daz-Uriarte, J. Dopazo, Fatigo: a web tool for finding significant associations of gene ontology terms with groups of genes, Bioinformatics 20 (2004)578-580.
 J. Ayres, J. Flannick, J. Gehrke, & T. Yiu, Sequential pattern mining using a bitmap representation, Proc. of the 8th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp.429-435, 2002.
 F. Azuaje, O. Bodenreider, Incorporating ontology-driven similarity knowledge into functional genomics: An exploratory study, in Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE’04) (2004) 317-324.
 J. Bailey, T. Manoukian, & K. Ramamohanarao, Fast algorithms for mining emerging patterns, Proc. of 6th European Conf. on Principles and Practice of Knowledge Discovery in Databases, pp.39-50, 2002.
 S. D. Bay, & M. J. Pazzani, Detecting change in categorical data: mining contrast sets, Proc. of the 5th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp.302-306, 1999.
 R. Bellman, On the approximation of curves by line segments using dynamic programming, Communications of the ACM, Vol. 4, No. 6, pp. 284, 1961.
 A. Ben-dor, R. Shamir, Z. Yakhini, Clustering gene expression patterns, Journal of Computational Biology 6 (1999) 281-297.
 A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, Z. Yakhini, Tissue classification with gene expression profiles, Journal of Computational Biology 7 (2000) 559-584.
 A. Ben-Dor, N. Friedman, Z. Yakhini, Overabundance analysis and class discovery in gene expression data, Agilent Technical Report (2002).
 B. J. Berme, P. Pechukas, Gaussian model potentials for molecular interactions, Journal of Chemical Physics 56 (1972) 4213-4216.
 D. Berrar, W. Dubitzky, M. Granzow, R. Ells, in Proceedings of Oral and Poster Presenters' Abstracts, Critical Assessment of Microarray Data Analysis (2001) 23-28.
 K. Chan, & W. Fu, Efficient time series matching by wavelets, Proc. of the 15th IEEE Int'l Conf. on Data Engineering, pp.126-133, 1999.
 S. Chan, B. Kao, C. L. Yip, & M. Tang, Mining emerging substrings, Proc. of the 8th Int'l Conf. on Database Systems for Advanced Applications, pp.119- 126, 2003.
 J. R. Chen, “Making subsequence time series clustering meaningful,” The IEEE International Conference on Data Mining, pp. 114-121, 2005.
 H. Y. Chuang, H.F. Liu, S. Brown, C. McMunn-Coffran, C.Y. Kao, D.F. Hsu, Identifying significant genes from microarray data, in Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE’04) (2004) 358-365.
 F. L. Chung, T. C. Fu, V. T. Y. Ng, & R. W. P. Luk, An evolutionary approach to pattern-based time series segmentation, IEEE Trans. Evolutionary Computation, 8 (5): pp.471-489, 2004.
 P. Cohen and N. Adams, Unsupervised Segmentation of Categorical Time Series into Episodes, Proceeding of IEEE International Conference on Data Mining, pp. 99-106, 2002.
 C. Creighton, S. Hanash, Mining gene expression databases for association rules, Bioinformatics 19 (2003) 79-86.
 X. Cui, G.A. Churchill, Statistical tests for differential expression in cdna microarray experiments, Genome Biology 4 (2003) pp. 210.1-210.10.
 D. L. Davies and D. W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1, No. 4, pp. 224 - 227, 2000.
 G. Dong, & J. Li, Efficient mining of emerging patterns: Discovering trends and differences, Proc. of the 5th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp.43–52, 1999.
 G. Dong, J. Li, & X. Zhang, Discovering jumping emerging patterns and experiments on real datasets, Proc. of the 9th Int'l Database Conference, pp.155-168, 1999.
 G. Dong, & J. Li, Mining border descriptions of emerging patterns from dataset pairs, Knowledge and Information Systems, 8, pp.178–202, 2005.
 G. Dong, X. Zhang, L. Wong, & J. Li, CAEP: classification by aggregating emerging patterns, Discovery Science, pp.30-42, 1999.
 J. Dunn, Well separated clusters and optimal fuzzy partitions, Journal of Cybernetics, Vol. 4, pp. 95 - 104, 1974.
 M. B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, in Proceeding of the National Academy Sciences USA 95 (1998) 14863-14868.
 S. Erdal, O. Ozturk, D. Armbruster, H. Ferhatosmanoglu and W.C. Ray, A time series analysis of microarray data, The IEEE Symposium on Bioinformatics and Bioengineering, pp. 366-375, 2004.
 H. Fan, M. Fan, K. Ramamohanarao, & M. Liu, Further improving emerging pattern based classifiers via bagging, Proc. of the 10th Pacific-Asia Conf. on Knowledge Discovery and Data, pp.91-96, 2006.
 C. L. Fancoua and J. C. Principe, A Neighborhood Map of Competing One Step Predictors for Piecewise Segmentation and Identification of Time Series, IEEE Internal Conference on Neural Networks, Vol. 4, pp.1906-1911, 1996.
 L. Feng, K. Ju and K. H. Chon, A method for segmentation of switching dynamic modes in time series, IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol. 35, No. 5, pp. 1058-1064, 2005.
 T. C. Fu, F. L. Chung, V. Ng and R. Luk, Evolutionary segmentation of financial time series into subsequences, The Congress on Evolutionary Computation, Vol. 1, pp. 426 - 430, 2001.
 G. Getz, H. Gal, I. Kela, D.A. Notterman, E. Domany, Coupled two-way clustering analysis of breast cancer and colon cancer gene expression data, Bioinformatics 19 (2003) 1079-1089.
 V. Guralnik and J. Srivastava, Event detection from time series data, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 33-42, 1999.
 I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Machine Learning 46 (2002) 389-422.
 J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database, The Internal Conference on Data Engineering, pp. 106-115, 1999.
 J. Himberg, K. Korpiaho, H. Mannila,J. Tikanmaki and H.T.T. Toivonen, Time series segmentation for context recognition in mobile devices, The IEEE International Conference on Data Mining, pp. 203-210, 2001.
 Y. W. Huang and P. S. Yu, Adaptive Query Processing for Time-Series Data, in Proceeding of the 5th Int’l Conference on Knowledge Discovery and Data Mining, pp. 282-286, San Diego, CA, Aug 15-18, 1999.
 L. Hubert and J. Schultz, Quadratic assignment as a general data-analysis strategy, British Journal of Mathematical and Statistical Psychology, Vol. 29, pp. 190 - 241, 1976.
 A. Icev, C. Ruiz, E. F. Ryder, Distance-enhanced association rules for gene expression, in Proceedings of the 3rd International Workshop on Data Mining in Bioinformatics,(BIOKDD’03) (2003) 34-40.
 W. Jiang, Y. Xu, D. Shi and K. Xia, A novel dynamic clustering algorithm and its application in fuzzy modeling, IEEE International Conference on Granular Computing, pp. 284 – 289, 2009.
 E. Keogh, http://www.cs.ucr.edu/~eamonn/discords/, 2005.
 E. Keogh, K. Chakrabarti, M. Pazzani, & S. Mehrotra, Dimensionality reduction for fast similarity search in large time series databases, Knowledge and Information Systems, 3 (3): pp.263-286, 2001.
 E. Keogh, S. Chu, D. Hart and M. Pazzani, An online algorithm for segmenting time series, The IEEE Internal Conference Data Mining, pp. 289-296, 2001.
 E. Keogh, J. Lin and W. Truppel, Clustering of time series subsequences is meaningless: implications for previous and future research, The IEEE International Conference on Data Mining, pp. 115 – 122, 2003.
 P. Kralj, N. Lavrac, D. Gamberger, & A. Krstacic, Contrast set mining for distinguishing between similar diseases, Proc. of the 11th Conf. on Artificial Intelligence in Medicine, pp.109-118, 2007.
 A. Lendasse, J. Lee, E. de Bodt, V. Wertz and M. Verleysen, Dimension reduction of technical indicators for the prediction of financial time series - Application to the BEL20 Market Index, European Journal of Economic and Social Systems, Vol. 15, No. 2, pp. 31-48, 2001.
 H. Li, R. Li, & F. Wang, Blind separation algorithm for spatial independent time series, ICIC Express Letters, vol.4, no.2, pp.553-558, 2010.
 J. Li, G. Dong, & K. Ramamohanarao, Making use of the most expressive jumping emerging patterns for classification, Knowledge and Information Systems, 3 (2): pp.131-145, 2001.
 J. Li, L. Wong, Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns, Bioinformatics 18 (2002) 725-734.
 J. Li, & Q. Yang, Strong compound-risk factors: efficient discovery through emerging patterns and contrast sets, IEEE Transactions on Information Technology in Biomedicine, 11 (5): pp.544-552, 2007.
 J. Lin, E. Keogh, S. Lonardi, & B. Chiu, A symbolic representation of time series, with implications for streaming algorithms, Proc. of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp.2-11, 2003.
 J. Lin, & E. Keogh, Group SAX: extending the notion of contrast sets to time series and multimedia data, Proc. of the 10th European Conf. on Principles and Practice of Knowledge Discovery in Databases, pp.284-296, 2006.
 B. Lkhagva, Y. Suzuki, & K. Kawagoe, Extended SAX: extension of symbolic aggregate approximation for financial time series data representation, Proc. of IEICE the 17th Data Engineering Workshop, 2006.
 Y. Matsumoto, & J. Watada, Knowledge acquisition from time series data through rough sets analysis, International Journal of Innovative Computing, Information and Control, vol.5, no.12(B), pp.4885-4898, 2009.
 J. B. McQueen, Some methods of classification and analysis of mutivariate observations, The Symposium on Mathematical Satistics and Probability, 1967, pp. 281-297.
 Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 1994.
 R. Ming, H. Ting, & J. Bailey, Mining minimal contrast sub-graph patterns, Proc. of the 3rd VLDB Workshop on Secure Data Management, pp.639-643, 2006.
 J. J. Oliver, R. A. Baxter and C. S. Wallace, Minimum Message Length Segmentation, The. Pacific-Asia Conference Knowledge Discovery Data Mining, pp. 222-233, 1998.
 W. Pan, A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments, Bioinformatics 18 (2002) 546-554.
 P. Pavlidis, C. Tang, W.S. Noble, Classification of genes using probabilistic models of microarray expression profiles, in Proceedings of the first international Workshop on Data Mining in Bioinformatics (BIOKDD’01) (2001) 15-21.
 W. Penny and S. Roberts, Dynamic models for nonstationary signal segmentation, Computers and Biomedical Research, Vol. 32, No. 6, pp. 483-502, 1999.
 D. B. Percival and A. T. Walden, Wavelet methods for time series analysis, published by Cambridge University Press, 2000.
 Jianlong Qi, Jian Tang, Integrating gene ontology into discriminative powers of genes for feature selection in microarray data, in Proceedings of the 22nd ACM Symposium on Applied Computing (SAC’07) (2007) 430-434.
 J. Rahnenfuhrer, F.S. Domingues, J. Maydt, T. Lengauer, Calculating the statistical significance of changes in pathway activity from gene expression data, Statistical Applications in Genetics and Molecular Biology 3 (2004) Article 16.
 H. Shatkay, Approximate Queries and Representations for Large Data Sequences, Technical Report cs-95-03, Department of Computer Science, Brown University, 1995.
 D.K. Slonim, P. Tamayo, J.P. Mesirov, T.R. Golub, E.S. Lander, Class prediction and discovery using gene expression data, in Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB’00) (2000) 263-272.
 URL of Taiwan Stock Exchange (TWSE). http://www.twse.com.tw/en/
 H. Tamura, K. Tanno, H. Tanaka, & Z. Zhang, Bi-directional function learning method for time series prediction, ICIC Express Letters, vol.2, no.4, pp.401-407, 2008.
 The Gene Ontology Consortium, The gene ontology (go) database and informatics resource, Nucleic Acids Research 32 (2004) D258-D261.
 V. S. Tseng, C. H. Chen, P. C. Huang and T. P. Hong, Segmentation of time series by the clustering and genetic algorithms, Pattern Recognition Letters, Vol. 30, 1190-1197, 2009.
 V. S. Tseng, C. H. Chen, P. C. Huang and T. P. Hong, A cluster-based genetic approach for segmentation of time series and pattern discovery, The IEEE Congress on Evolutionary Computation, pp. 1949-1953, 2008.
 Vincent S. Tseng and Ching-Ping Kao, Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 2, No. 4, pp. 355 - 365, 2005.
 J. P. C. Valente and I. L. Chavarrias, Discovering similar patterns in time series, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 497-505, 2000.
 L. J. van't Veer, H. Dai, Gene expression profiling predicts clinical outcome of breast cancer, Nature, 415 (2002) 530-536.
 H. Wakuya, Enrichment of inner information representations in bi-directional computing architecture for time series prediction, International Journal of Innovative Computing, Information and Control, vol.4, no.11, pp.3079-3090, 2008.
 C. Wang and S. Wang, Supporting content-based searches on time Series via approximation, Proceedings of the 12th International Conference on Scientific and Statistical Database Management, 2000.
 H. Wang, F. Azuaje, Gene expression correlation and gene ontology-based similarity: An assessment of quantitative relationships, in Proceeding of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB’04) (2004) 25-31.
 X. Y. Wang and Z. O. Wang, A Structure-Adaptive Piece-Wise Linear Segments Representation for Time Series, Proceeding of IEEE International Conference on Information Reuse and Integration, pp. 433 - 437, 2004.
 Weka [ http://www.cs.waikato.ac.nz/ml/weka/].
 X. Xu, A. Zhang, Selecting informative genes from microarray dataset by incorporating gene ontology, in Proceedings of IEEE Fifth Symposium on Bioinformatics and Bioengineering (BIBE’05) (2005) 241-245.
 L. Yan and Q. Liu, Granular resolution and granular reasoning, IEEE International Conference on Granular Computing, pp. 668 – 671, 2009.
 Y. H. Yang, S. Dudoit, P. Luu, T. P. Speed, Normalization for cDNA microarray data, in Proceeding of International Biomedical Optics Symposium (2001) 141-152.
 Y. Yao, Interpreting concept learning in cognitive informatics and granular computing, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 39 , No. 4, pp. 855 – 866, 2009.
 B. K. Yi, & C. Faloutsos, Fast Time Sequence Indexing for Arbitrary Lp Norms, Proc. of the VLDB, pp.385-394, 2000.
 L. T. H. Yu, F. L. Chung, S. C. F. Chan, & S. M. C. Yuen, Using emerging pattern based projected clustering and gene expression data for cancer detection, Proc. of 2nd conf. on Asia-Pacific bioinformatics, pp.75-84, 2004.
 H. H. Yu, Vincent S. Tseng, J. H. Chuang, A multi-Information based gene scoring model with applications on analysis of hepatocellular carcinoma, in Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE’04) (2004) 345-350.
 M. J. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Machine Learning, 42, pp.31-60, 2001.
 Z. Zhang and H. Wang, A fast subspace partition clustering algorithm for high dimensional data streams, IEEE International Conference on Intelligent Computing and Intelligent Systems, Vol. 1, pp. 491 – 495, 2009.