System ID U0026-0812200911492139
Title (Chinese) 不平衡類別關鍵特徵之偵測與蛋白質功能排比訊號之疊合
Title (English) Detection of Interesting Patterns on Imbalanced Class and Registration of Signal Alignments on Protein Function
University National Cheng Kung University (成功大學)
Department (Chinese) 工程科學系碩博士班
Department (English) Department of Engineering Science
Academic Year 94
Semester 1
Year of Publication 95
Author (Chinese) 洪俊銘
Author (English) Chun-Min Hung
E-mail goodmans@giga.net.tw
Student ID N9890116
Degree Doctorate
Language English
Pages 166
Committee Committee member: 洪宗貝
Committee member: 孫光天
Committee member: 陳澤生
Advisor: 黃悅民
Advisor: 張明熙
Committee member: 張素瓊
Keywords (Chinese) 不平衡類別  資料探勘  信用評估  類神經網路  衝突敏感性結構  類神經模糊邏輯  決策樹  關鍵特徵  基因程式規劃法  適應函數  貝式因果樹  生物資訊  序列排比  蛋白質功能  動態程式規劃  疊合訊號  幾何轉換  不變矩  薄片曲面  平滑法  QR-分解  一對多查詢策略  根式單向勝者全拿策略  映像對應及堆積最佳化  強韌點比對法  族群間交換策略  小波重建
Keywords (English) fitness function  Bayes Causal Tree  Genetic Programming  Registration Signal  Geometric Transformation  Protein Function  Dynamic Programming  Bioinformatics  Sequence Alignment  Neuro-fuzzy  Imbalanced Class  Data Mining  Credit Scoring  Neural Network  Entropy  Conflict-Sensitivity Contexture  Mapping Correspondence Heaping Optimization  Rooted One-way Winner-Take-All (RO-WTA)  Interesting Pattern  Decision Tree  One-to-Many Query Strategy  QR-decomposition  Smoothing  Wavelet Reconstruction  Interpopulation-exchanged Strategy  Robust Point Match (RPM)  Thin-plate Spline  Moment Invariant
Subject Classification
Abstract (Chinese) In recent years, data mining methods have been applied extensively across many fields. Real-world data, however, often exhibit skewed distributions, so ordinary probabilistic and statistical methods tend to rediscover rules that are already well known and rarely reveal the interesting patterns hidden in domain knowledge, such as critical credit-risk patterns in finance or novel protein-function patterns in molecular biology. This study uses data mining methods to effectively detect the interesting patterns of imbalanced-class datasets, and proposes a new registration method for sequence-alignment signals to predict protein functions.
 First, to effectively detect the interesting patterns of an imbalanced dataset, this study applied a wide range of general-purpose data mining tools, following the standard data mining procedure, to a real-world banking dataset that contains interesting credit-scoring patterns and an uneven class distribution. The tools included ten kinds of neural network models, C4.5, ID3, PRISM, NNge, IBk, Naive Bayes, Complement Naive Bayes, BayesNetB, Random Committee, and Voted Perceptron. The results show that all of these methods fail completely when predicting the risk class that carries the most interesting patterns; analysis of the failures indicates that the factor most affecting prediction accuracy is the ratio of instance counts among the classes, which directly governs how contradictory instances are judged. After a data cleaning step that integrates two of the methods, it was further found that combining PRISM, which handles non-contradictory instances well, as a filter with the easily understood ID3 decision tree as a classifier gives the best overall prediction. The interesting patterns of the minority class still could not be reliably confirmed, however, so this study proposes a hybrid system of an entropy-based neuro-fuzzy network and multiple decision trees that quantifies the conflict-sensitivity contexture, an index of decision contradiction. The system substantially improves the classification accuracy of the minority-class patterns while keeping the accuracy of the other classes within a reasonable range. In addition, this study derives several valuable business rules for industry.
 Next, the above results show that, with awareness of changes in space-time, many interesting patterns can indeed be mined from imbalanced real-world datasets. On this basis, this study further designs a flexible hybrid signal-registration system to predict complex protein functions. The hybrid system uses genetic programming as its main framework with a Bayes causal tree as the data structure; it takes several protein sequences of partially known function and one protein sequence of unknown function as input, and through simultaneous evolution with three fitness functions outputs the best causal tree for local-sequence function classification. The first fitness function converts protein sequences into signals, extracts their moment-invariant features, and matches geometric deformations with the robust point match (RPM) method derived from thin-plate spline interpolation with smoothing, registering signals of the same protein function under different biochemical environments; it computes a QR-decomposition to solve for an optimal mapping correspondence matrix, then heaps a better Bayes causal tree using a one-to-many query strategy and a rooted one-way winner-take-all strategy, retaining the causal tree with the smallest RPM difference. The second fitness function uses the Smith-Waterman optimal sequence alignment algorithm, based on dynamic programming, to sum the alignment scores of sequence fragments at the causal-tree nodes, retaining the tree with the largest total score. The third fitness function defines the start and end ranges of sequences from real-world classification data, constraining the first two fitness functions to a reasonable range. Finally, an inter-population exchange strategy that migrates individuals among the three populations selects the best causal tree as the result. Terminal nodes of the causal tree are fragments of the protein sequence with unknown function, and internal nodes are fragments of sequences with known partial function; the evolutionary path from the root to each terminal node reveals the set of functions evolved for each fragment of unknown function.
 The results demonstrate that signal registration can finely classify the biodiversity produced by spatio-temporal variation. Under suitable annealing-temperature control, the optimal local sequence alignment, the robust point feature matching, and the real-world classification sampling each converge simultaneously, confirming that the system can restore known sequences to their original real-world protein-function classification.
 Because the system requires extensive computation time, parallel processing was considered from the start of the implementation. For the most time-consuming Smith-Waterman step, the system first obtains the set of local signal minima by wavelet reconstruction and differentiation, cuts the sequence into small fragments at the corresponding positions, and reuses earlier results whenever an alignment over the same range is detected, allowing the system to handle sequences of thousands of amino acids, such as SARS virus sequences.
 In the future, combined with an ongoing study of novel gene-sequence discovery and with microarray gene-expression experiments, once the predicted protein functions are verified, the theory and methods proposed in this dissertation can be applied broadly in bioinformatics and should aid understanding of biochemical metabolism and signalling pathways in the proteome.
Abstract (English) Many data mining methods have recently been applied in a wide variety of fields. However, real-world data collected for mining often suffer from class imbalance. Such problems are difficult to solve with conventional probabilistic and statistical methods, which usually find well-known rules and rarely discover the patterns of interest in a specific field; for example, critical risk patterns for credit scoring in finance and function patterns of novel proteins in molecular biology are both hard to find with general algorithms owing to class imbalance. This study designs one hybrid system to effectively identify patterns of interest in imbalanced datasets, and a second, highly effective hybrid system to accurately predict protein functions from sequence alignment by signal registration.
 First, to effectively detect patterns of interest in an imbalanced dataset, this work applies widely adopted data mining approaches to credit scoring tasks following the standard data mining procedure. Many experiments are conducted with real-world banking data that comprise interesting patterns and imbalanced classes. The approaches include ten types of neural networks, C4.5, ID3, PRISM, NNge, IBk, Naive Bayes, Complement Naive Bayes, BayesNetB, Random Committee, and Voted Perceptron. Experimental results indicate that all of these methods have difficulty predicting the risk patterns of interest. The factor most responsible for the poor prediction accuracy on interesting patterns is the ratio of instance counts among classes, which directly influences the judgment of contradictory instances. After a data cleaning step that combines two of the above methods, this study concludes that the combination of PRISM, which successfully classifies non-contradictory instances, as a filter with the easily understood ID3 decision tree as a classifier yields the best overall prediction accuracy. However, the interesting patterns of the minority class still cannot be effectively confirmed. This study therefore develops the first hybrid system, an entropy-based neuro-fuzzy network with multiple decision trees that quantifies the conflict-sensitivity contexture, an index of contradictory decision-making. Experimental results reveal that the proposed method effectively improves prediction accuracy on interesting patterns in the minority class while keeping the accuracy of the other classes within a reasonable range. Additionally, this work derives many valuable business rules.
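The failure mode described above can be made concrete with a small sketch (not taken from the dissertation): a classifier that always predicts the majority class attains high overall accuracy yet never identifies the minority risk class. The 95/5 class ratio below is a hypothetical figure chosen only for demonstration.

```python
# Illustrative sketch (not from the dissertation): overall accuracy hides
# total failure on a minority "risk" class. The 95/5 class ratio is a
# made-up figure for demonstration only.

def evaluate(y_true, y_pred, minority_label):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    minority = [(t, p) for t, p in zip(y_true, y_pred) if t == minority_label]
    minority_recall = sum(t == p for t, p in minority) / len(minority)
    return accuracy, minority_recall

# 95 "good" cases, 5 "risk" cases; a classifier that always predicts "good".
y_true = ["good"] * 95 + ["risk"] * 5
y_pred = ["good"] * 100

accuracy, risk_recall = evaluate(y_true, y_pred, "risk")
print(accuracy)     # 0.95: looks excellent overall
print(risk_recall)  # 0.0: the interesting minority class is never detected
```

This is why per-class accuracy, rather than total accuracy, is the relevant measure throughout the credit-scoring experiments.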
 Building on these conclusions, the second part of this study holds that interesting patterns can be found in imbalanced real-world datasets once changes in space-time are taken into account. It therefore designs another flexible hybrid system for signal registration and applies it to predict complex protein functions. The framework of the hybrid system is built on genetic programming, with a Bayes causal tree as the data structure for individual representation. The system takes as input several protein sequences of known partial function and one target protein sequence of unknown function. The best causal tree for locally aligning the protein sequence against the multiple function classification is then produced by simultaneous evolution with three fitness functions. The first fitness function evaluates moment-invariant features of the signals into which the fragments of a protein sequence are translated. These features are matched by the robust point match (RPM) method, derived from thin-plate spline theory with smoothing interpolation; the RPM performs a geometric transformation to align signals of the same protein function under varying biochemical environments (signal registration), and a QR-decomposition inside RPM solves for an optimal mapping correspondence matrix. Furthermore, the RPM uses the one-to-many query and rooted one-way winner-take-all (RO-WTA) strategies to heap function nodes into a better Bayes causal tree based on the minimum difference between the known and unknown protein functions. The second fitness function evaluates the alignment score returned by the Smith-Waterman algorithm, an optimal sequence alignment method based on dynamic programming. The local alignment scores of all nodes in the causal tree are summed into a fitness value, and the causal tree with the largest value is retained. The third fitness function estimates the coverage ranges from the beginning and end positions of known functions in the real world and computes the Bayes probability of the causal tree, so that the first and second fitness functions are restricted to a reasonable range. Finally, the best causal tree is selected through an inter-population exchange strategy that migrates individuals among the three populations with different fitness functions. Each terminal node of the resulting causal tree is a fragment of the protein sequence with unknown function, while each internal node contains a fragment of a sequence with known partial function. The set of functions on the path from the root to each terminal node indicates an evolutionary motif for each fragment of unknown function.
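The Smith-Waterman recurrence at the heart of the second fitness function can be sketched as follows. This is a minimal score-only version with arbitrary illustrative match/mismatch/gap values, not the dissertation's implementation; real protein alignment would use a substitution matrix such as BLOSUM62 and typically an affine gap penalty.

```python
# Minimal score-only Smith-Waterman local alignment (illustrative sketch).
# match/mismatch/gap values are arbitrary placeholders, not the scoring
# scheme used in the dissertation.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1];
    # the 0 in the max() lets a local alignment restart anywhere.
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("HEAGAWGHEE", "PAWHEAE"))  # best local alignment score
```

In the system, this score is computed per causal-tree node and the per-node scores are summed into the second fitness value.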
 The experimental results confirm that signal registration can subdivide the classification of protein functions according to the biodiversity caused by changes in space-time. With appropriate control of the annealing temperature, the local sequence alignment, the RPM feature matching, and the real-world classification sampling all converge to a stable point, demonstrating that the proposed system can restore an unknown function to the original real-world classification of known protein functions.
 Because the hybrid system requires considerable computation time when many long sequences are involved, it distributes the evolutionary computation across many parallel processes; this was planned from the start of the implementation. The Smith-Waterman sequence alignment step is the most time-consuming part, so the system not only splits a long sequence into many small fragments at the sequence positions of locally minimum signal values, obtained by wavelet reconstruction followed by differentiation, but also reuses previous results whenever an alignment over the same sequence range is detected. The system can therefore handle sequences of thousands of amino acids, such as SARS virus sequences.
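The fragmentation idea can be sketched in simplified form. The dissertation obtains the cut points as local minima of a wavelet-reconstructed signal; in this sketch the signal values and the amino-acid string are simply made up for illustration, and the sequence is cut at every interior local minimum.

```python
# Simplified sketch of sequence fragmentation at local signal minima.
# The signal here is a made-up stand-in for the dissertation's
# wavelet-reconstructed signal; only the cutting logic is shown.

def local_minima(signal):
    # interior points strictly lower than both neighbours
    return [i for i in range(1, len(signal) - 1)
            if signal[i] < signal[i - 1] and signal[i] < signal[i + 1]]

def fragment(sequence, signal):
    cuts = local_minima(signal)
    pieces, start = [], 0
    for c in cuts:
        pieces.append(sequence[start:c])
        start = c
    pieces.append(sequence[start:])
    return pieces

seq    = "MFVFLVLLPLVSSQCVN"          # a made-up amino-acid string
signal = [5, 3, 4, 6, 2, 5, 7, 4, 6,  # hypothetical smoothed signal,
          3, 5, 8, 6, 2, 4, 6, 5]     # one value per residue

print(local_minima(signal))   # positions where the sequence is cut
print(fragment(seq, signal))  # small fragments to be aligned independently
```

Each fragment can then be aligned in its own process, and alignments over an already-seen range can be answered from a cache instead of being recomputed.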
 In the future, if the predicted protein functions can be confirmed by combining other ongoing studies, including novel gene discovery and microarray-based gene expression experiments, then the theory and methods proposed in this dissertation can be widely applied in bioinformatics to help understand biochemical pathways in proteomics.
Table of Contents
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . I
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V
CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Motivation and Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Class Imbalance Problem . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Interesting Patterns . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Identification of Protein Function . . . . . . . . . . . . . . . 14
1.3 Organization of this Dissertation . . . . . . . . . . . . . . . . . . . . 17
CHAPTER 2. DATA MINING . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Knowledge Discovery Process . . . . . . . . . . . . . . . . . . 20
2.1.2 Training Exercise . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Comparison of Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 Back-propagation Learning . . . . . . . . . . . . . . . . . . . . 27
2.2.2 Conjugate Gradient . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3 Levenberg-Marquardt algorithm . . . . . . . . . . . . . . . . . 31
2.2.4 RBF Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.5 C4.5 and ID3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.6 PRISM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.7 Conjunctive Rule . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2.8 NNge and IBk . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.9 Naive Bayes, Complement Naive Bayes, and BayesNetB . . . . 37
2.2.10 Random Committee and Voted Perceptron . . . . . . . . . . . 37
CHAPTER 3. NFLDR ALGORITHM . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.1 Entropy-based Contexture . . . . . . . . . . . . . . . . . . . . 42
3.1.2 Conflict Sensitivity . . . . . . . . . . . . . . . . . . . . . . 43
3.1.3 Two-Layer Decision Fusion . . . . . . . . . . . . . . . . . . . . 44
Row Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Column Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Cross Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Neuro-Fuzzy Logic with Decision Rules . . . . . . . . . . . . . . . . . 50
CHAPTER 4. AGCT MODEL . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 System Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1.1 Transformation of Problem . . . . . . . . . . . . . . . . . . . 62
4.1.2 Feature Matching . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.1 Moment Invariant . . . . . . . . . . . . . . . . . . . . . . . . . 65
Uniqueness Theorem Regarding Moments . . . . . . . . . . . 66
Central Moments . . . . . . . . . . . . . . . . . . . . . . . . . 66
Pattern Identification . . . . . . . . . . . . . . . . . . . . . 67
Absolute Moment Invariants . . . . . . . . . . . . . . . . . . . 69
4.2.2 Thin-Plate Splines . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.3 Optimization of Mapping, Correspondence, and Heaping . . . 73
One-to-many Query Strategy . . . . . . . . . . . . . . . . . . 74
Identification of Protein Function . . . . . . . . . . . . . . 81
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.1 Genetic Programming . . . . . . . . . . . . . . . . . . . . . . 84
4.3.2 Cooperation of Multiple Fitness Functions . . . . . . . . . . . 87
CHAPTER 5. EXPERIMENTAL RESULTS . . . . . . . . . . . . . . . . . . . . 91
5.1 Experimental Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.1 Banking Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.2 Protein Sequences . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 95
Experiment 1. . . . . . . . . . . . . . . . . . . . . . . . 95
Experiment 2. . . . . . . . . . . . . . . . . . . . . . . . 95
Experiments 3 and 4 . . . . . . . . . . . . . . . . . . . . . . 97
5.2.1 Interference by Noise . . . . . . . . . . . . . . . . . . . . . . . 101
Experiment 5. . . . . . . . . . . . . . . . . . . . . . . . 101
Experiment 6. . . . . . . . . . . . . . . . . . . . . . . . 101
5.2.2 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Experiment 7. . . . . . . . . . . . . . . . . . . . . . . . 102
Experiment 8. . . . . . . . . . . . . . . . . . . . . . . . 102
Experiment 9. . . . . . . . . . . . . . . . . . . . . . . . 103
Experiment 10. . . . . . . . . . . . . . . . . . . . . . . 103
Experiment 11. . . . . . . . . . . . . . . . . . . . . . . 103
5.2.3 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . 103
Experiment 12. . . . . . . . . . . . . . . . . . . . . . . 107
Experiment 13. . . . . . . . . . . . . . . . . . . . . . . 108
Experiment 14. . . . . . . . . . . . . . . . . . . . . . . 110
Experiment 15. . . . . . . . . . . . . . . . . . . . . . . 114
5.2.4 Experimental Results for NFLDR . . . . . . . . . . . . . 116
5.2.5 Interesting Mining in Two-layer Explanations . . . . . . . . . 120
5.2.6 Business Application . . . . . . . . . . . . . . . . . . . . . . . 123
Business rule 1. . . . . . . . . . . . . . . . . . . . . . . 123
Business rule 2. . . . . . . . . . . . . . . . . . . . . . . 124
5.2.7 Applying CSC with Non-Dominating Attributes . . . . . . . . 124
Business rule 3. . . . . . . . . . . . . . . . . . . . . . . 125
Business rule 4. . . . . . . . . . . . . . . . . . . . . . . 125
Business rule 5. . . . . . . . . . . . . . . . . . . . . . . 126
5.2.8 Analysis of the AGCT Model . . . . . . . . . . . . . . . . . . 127
CHAPTER 6. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . 136
6.1 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2 Perspective Part I: Interesting Patterns - NFLDR . . . . . . . . . 139
6.3 Perspective Part II: Signal Registration - AGCT . . . . . . . . . . 140
References

Aha, D. W., Kibler, D. and Albert, M. K., Instance-based learning algorithms, Machine Learning 6(1), 37-66, 1991.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J., Basic local alignment search tool, J. Mol. Biol. 215(3), 403-410, 1990.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25, 3389-3402, 1997.

An, A., Cercone, N. and Huang, X., A case study for learning from imbalanced data sets, Advances in Artificial Intelligence: Proc. 14th Conf. Canadian Soci. Compu. Studies of Intel. 1-15, 2001.

Bates, R. R., Sun, M., Scheuer, M. L. and Sclabassi, R. J., Detection of seizure foci by recurrent neural networks, Proc. of the 22nd Annual Int'l Conf. of the IEEE on Engineering in Medicine and Biology Society 1377-1379, 2000.

Batista, G. E. A. P. A., Prati, R. C. and Monard, M. C., A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations 6(1), 20-29, 2004.

Benediktsson, J. A., Sveinsson, J. R. and Swain, P. H., Hybrid consensus theoretic classification, IEEE Trans. Geosci. Remote Sensing 35(4), 833-843, 1997.

Bojarczuk, C. C., Lopes, H. S. and Freitas, A., Genetic programming for knowledge discovery in chest-pain diagnosis, IEEE Engineering in Medicine and Biology 19, 38-44, 2000.

Breiman, L., Bagging predictors, Machine Learning 24(2), 123-140, 1996.

Cai, D., Delcher, A., Kao, B. and Kasif, S., Modeling splice sites with Bayes networks, Bioinformatics 16(2), 152-159, 2000.

Cendrowska, J., PRISM: An algorithm for inducing modular rules, Int'l J. of Man-Machine Studies 27(4), 349-370, 1987.

Chawla, N. V., Bowyer, K. W., Hall, L. O. and Kegelmeyer, W. P., SMOTE: Synthetic minority over-sampling technique, J. of Artificial Intelligence Research (JAIR) 16, 321-357, 2002.

Chou, P. Y. and Fasman, G. D., Prediction of the secondary structure of proteins from their amino acid sequence, Adv. Enzymol. Relat. Areas Mol. Biol. 47, 45-148, 1978.

Chui, H. and Rangarajan, A., A new algorithm for non-rigid point matching, in Proc. IEEE Conference on Computer Vision and Pattern Recognition 2, 44-51, 2000.

Coifman, R. R. and Wickerhauser, M. V., Entropy-based algorithms for best basis selection, IEEE Trans. on Inf. Theory 38(2), 713-718, 1992.

Coolen, A. and Prete, V. D., Statistical mechanics beyond the Hopfield model: solvable problems in neural network theory, Reviews in the Neurosciences 14, 181-193, 2003.

Delcher, A. L., Kasif, S., Goldberg, H. R. and Hsu, W. H., Probabilistic prediction of protein secondary structure using causal networks, In Proc. 11th AAAI National Conference on Artificial Intelligence 316-321, 1993.

Delcoigne, A. and Hansen, P., Sequence comparison by dynamic programming, Biometrika 62(3), 661-664, 1975.

DeRouin, E., Brown, J., Fausett, L. and Schneider, M., Neural network training on unequally represented classes, Intel. Eng. Sys. Through Arti. Neu. Net. 135-140, 1991.

Domingos, P., MetaCost: A general method for making classifiers cost-sensitive, Proc. of the 5th Int'l Conf. on Knowledge Discovery and Data Mining 155-164, 1999.

Dreiseitl, S. and Ohno-Machado, L., Logistic regression and artificial neural network classification models: a methodology review, J. of Biomedical Informatics 35(5-6), 352-359, 2002.

Dresher, M., Moment spaces and inequalities, Duke Math. J. 20(2), 261-271, 1953.

Drummond, C. and Holte, R., C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling, Proc. ICML Workshop on Learning from Imbalanced Data Sets, 2003.

Durbin, R., Eddy, S., Krogh, A. and Mitchison, G., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.

Elkan, C., The foundations of cost-sensitive learning, Proc. of the Seventeenth Int'l Joint Conf. on Artificial Intelligence (IJCAI'01) 973-978, 2001.

Estabrooks, A., Jo, T. and Japkowicz, N., A multiple resampling method for learning from imbalanced data sets, Compu. Intel. 20(1), 18-37, 2004.

Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P., The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM 39, 27-34, 1996.

Fields, B. N., Knipe, D. M., Howley, P. M. and Griffin, D. E., Fields Virology, Lippincott Williams and Wilkins, Philadelphia, ed. 4, 2001.

Fletcher, R., Practical Methods of Optimization, Wiley, New York, 1987.

Forgy, E. W., Cluster analysis of multivariate data: Efficiency versus interpretability, Biometrics 21, 768-769, 1965.

Freund, Y. and Schapire, R. E., Experiments with a new boosting algorithm, Machine Learning: Proc. of the Thirteenth Int'l Conf. 148-156, 1996.

Freund, Y. and Schapire, R. E., A decision-theoretic generalization of on-line learning and an application to boosting, J. of Computer and System Sciences 55(1), 119-139, 1997.

Freund, Y. and Schapire, R. E., Large margin classification using the perceptron algorithm, in Proc. 11th Annu. Conf. on Comput. Learning Theory, ACM Press, New York, 1998.

Gerritsen, R., Assessing loan risks: A data mining case study, IEEE IT Prof. 1, 16-21, 1999.

Gold, S., Rangarajan, A., Lu, C. P., Pappu, S. and Mjolsness, E., New algorithms for 2-d and 3-d point matching: pose estimation and correspondence, Pattern Recognition 31(8), 1019-1031, 1998.

Goldberg, D. E., Genetic and evolutionary algorithms come of age, Communications of the ACM 37(3), 113-119, 1994.

Goldman, S. A. and Warmuth, M. K., Learning binary relations using weighted majority voting, Machine Learning 20(3), 245-271, 1995.

Griew, S., Information gain in tasks involving different stimulus-response relationships, Nature 182, 18-19, 1958.

Guo, H. and Viktor, H. L., Learning from imbalanced data sets with boosting and data generation: The databoost-IM approach, SIGKDD Explorations 6(1), 30-39, 2004.

Hall, P. and Titterington, D. M., Common structure of techniques for choosing smoothing parameters in regression problems, J. Roy. Statist. Soc. Ser. B 49(2), 184-198, 1987.

Haykin, S., Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice-Hall, Ontario, Canada, 1999.

Heckerman, D., Geiger, D. and Chickering, D. M., Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning 20(3), 197-243, 1995.

Hickey, R., Learning rare class footprints: the REFLEX algorithm, ICML Workshop on Learning from Imbalanced Data Sets 89-96, 2003.

Holmes, C. and Denison, D., Perfect sampling for wavelet reconstruction of signals, IEEE Trans. Signal Processing 50(2), 337-344, 2002.

Hu, M. K., Visual pattern recognition by moment invariants, IRE Transactions Information Theory 8, 179-187, 1962.

Hung, C. M., Huang, Y. M. and Chen, T. S., Assessing check credit with skewed data: A knowledge discovery case study, (ICS2002) Int'l Computer Symposium, Workshop on Artificial Intelligence, 2002.

Jang, J. R. and Sun, C. T., Neuro-fuzzy modeling and control, Proc. IEEE 83(3), 378-406, 1995.

Jang, J. S. R., ANFIS: Adaptive-network-based fuzzy inference systems, IEEE Trans. on Sys. Man Cybernetics 23(3), 665-685, 1993.

Japkowicz, N., Supervised versus unsupervised binary-learning by feedforward neural networks, Machine Learning 42(1-2), 97-122, 2001.

Japkowicz, N. and Stephen, S., The class imbalance problem: A systematic study, Intelligent Data Analysis 6(5), 429-450, 2002.

Jason, H. T. B. and Michael, P. Y., Applying fuzzy logic to medical decision making in the intensive care unit, Am. J Respir. Crit. Care Med. 167(7), 948-952, 2003.

Jiang, M., Zhu, X., Yuan, B., Tang, X., Lin, B., Ruan, Q. and Jiang, M., A fast hybrid algorithm of global optimization for feedforward neural networks, Proc. Signal Processing, WCCC-ICSP Int'l. Conf. 1(3), 1609-1612, 2000.

Jo, T. and Japkowicz, N., Class imbalances versus small disjuncts, SIGKDD Explorations 6(1), 40-49, 2004.

John, G. H. and Langley, P., Estimating continuous distributions in Bayesian classifiers, Proc. of the Eleventh Conf. on Uncertainty in Artificial Intelligence 338-345, 1995.

Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1997.

Koza, J. R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, 1992.

Koza, J. R., Bennett, F. H. and Andre, D., Classifying proteins as extracellular using pro-grammatic motifs and genetic programming, Evolutionary Computation, in Proc. IEEE World Congress on Computational Intelligence 212-217, 1998.

Koza, J. R., Bennett, F. H., Andre, D. and Keane, M. A., Genetic programming III: Darwinian invention and problem solving: Book review, Evolutionary Computation, IEEE Transactions on 3(4), 251-253, 1999.

Krogh, A., Brown, M., Mian, I., Sjolander, K. and Haussler, D., Hidden Markov models in computational biology: Applications to protein modeling, J. Mol. Biol. 235(5), 1501-1531, 1994a.

Krogh, A., Mian, S. and Haussler, D., A hidden Markov model that finds genes in E. coli DNA, Nucl. Acids Res. 22(22), 4768-4778, 1994b.

Kubat, M. and Matwin, S., Addressing the curse of imbalanced training sets: One-sided selection, in Proc. of the 14th International Conference on Machine Learning 179-186, 1997.

Kushchu, I., Genetic programming and evolutionary generalization, Evolutionary Computation, IEEE Transactions on 6(5), 431-442, 2002.

Ling, C. X. and Li, C., Data mining for direct marketing: problems and solutions, Proc. of 4th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD-98), ACM 73-79, 1998.

Martin, B., Instance-based learning: Nearest neighbor with generalization, Master's thesis, University of Waikato, 1995.

Meshoul, S. and Batouche, M., Ant colony system with extremal dynamics for point matching and pose estimation, Pattern Recognition, Proc. 16th International Conference on 3, 823-826, Aug., 2002.

Mittelman, D., Sadreyev, R. and Grishin, N., Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments, Bioinformatics 19(12), 1531-1539, 2003.

Mood, A. M., Graybill, F. A. and Boes, D. C., Introduction to the Theory of Statistics, McGraw-Hill, 1974.

Okamoto, S. and Satoh, K., An Average-Case Analysis of k-Nearest Neighbor Classifier, Springer Verlag, 1995.

Peterson, C. and Soderberg, B., A new method for mapping optimization problems onto neural networks, Internat. J. Neural Systems 1(1), 3-22, 1989.

Phua, C., Alahakoon, D. and Lee, V., Minority report in fraud detection: Classification of skewed data, SIGKDD Explorations 6(1), 50-59, 2004.

Powell, M. J. D., Restart procedures for the conjugate gradient method, Mathematical Programming 12, 241-254, 1977.

Powell, M. J. D., Radial basis functions for multivariable interpolation: A review, RMCS, IMA Conf. on Algorithms for the Approximation of Functions and Data 143-167, 1985.

Pytlak, R., A globally convergent conjugate gradient algorithm, Decision and Control, in Proc. 32nd IEEE Conf. on 3, 2890-2895, 1993.

Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, USA, 1993.

Rangarajan, A., Gold, S. and Mjolsness, E., A novel optimizing network architecture with applications, Neural Comput. 8(5), 1041-1060, 1996.

Reichard, K. and Kaufmann, M., EPPS: mining the COG database by an extended phylogenetic patterns search, Bioinformatics 19(6), 784-788, 2003.

Rennie, J. D., Shih, L., Teevan, J. and Karger, D., Tackling the poor assumptions of naive Bayes text classifiers, ICML-2003 616-623, 2003.

Ressom, H., Reynolds, R. and Varghese, R., Increasing the efficiency of fuzzy logic-based gene expression data analysis, Physiological Genomics 13(2), 107-123, 2003.

Riesz, F. and Sz.-Nagy, B., Functional Analysis, Frederick Ungar, New York, 1955.

Rosenblatt, F., The perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review 65, 386-407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)

Rumelhart, D. E., Hinton, G. E. and Williams, R. J., Learning representations by back-propagating errors, Nature (London) 323, 533-536, 1986.

Rychlewski, L., Jaroszewski, L., Li, W. and Godzik, A., Comparison of sequence profiles. strategies for structural predictions using sequence information, Protein Sci. 9(2), 232-241, 2000.

Sadreyev, R. and Grishin, N., COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance, J. Mol. Biol. 326(1), 317-336, 2003.

Saerens, M., Latinne, P. and Decaestecker, C., Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure, Neural Computation 14(1), 21-41, 2002.

Schoning, U., Logic for Computer Scientists, Birkhauser, Berlin, 1989.

Sinkhorn, R., A relationship between arbitrary positive matrices and doubly stochastic matrices, Ann. Math. Statist. 35(2), 876-879, 1964.

Smets, P., Belief functions: the disjunctive rule of combination and the generalized bayesian theorem, International Journal of Approximate Reasoning 9(1), 1-35, 1993.

Smith, T. F. and Waterman, M. S., Identification of common molecular subsequences, J. Mol. Biol. 147(1), 195-197, 1981.

Stan, O. and Kamen, E. W., New block recursive MLP training algorithms using the levenberg-marquardt algorithm, Neural Networks, in IJCNN '99. Int'l. Joint Conf. 3, 1672-1677, 1999.

Salzberg, S. L., Searls, D. B. and Kasif, S. (eds.), Computational Methods in Molecular Biology, Vol. 32, Elsevier, New York, 1998.

Sugeno, M., Industrial applications of fuzzy control, Elsevier Science, 1985.

Thompson, J. D., Higgins, D. G. and Gibson, T. J., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22(22), 4673-4680, 1994.

Tsakonas, A. and Dounias, G., Hybrid computational intelligence schemes in complex domains: An extended review, In Proc. Methods and Applications of Artificial Intelligence: Second Hellenic Conference on AI (SETN) 494-512, 2002.

Visa, S. and Ralescu, A., Learning imbalanced and overlapping classes using fuzzy sets, Proc. ICML-2003 Workshop on Learning from Imbalanced Data Sets, 2003.

Wahba, G., Spline models for observational data, SIAM, Philadelphia, PA, 1990.

Weiss, G. M. and Provost, F., Learning when training data are costly: The effect of class distribution on tree induction, J. Arti. Intel. Res. 19, 315-354, 2003.

Willett, P., Genetic algorithms in molecular recognition and design, Trends in Biotechnology 13(12), 516-521, 1995.

Witten, I. H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, California, 1999.

Wong, M. L., Lam, W., Leung, K. S., Ngan, P. S. and Cheng, J. C. Y., Discovering knowledge from medical databases using evolutionary algorithms, Engineering in Medicine and Biology Magazine IEEE 19(4), 45-55, 2000.

Yang, Z. R., Thomson, R., Hodgman, T. C., Dry, J., Doyle, A. K., Narayanan, A. and Wu, X., Searching for discrimination rules in protease proteolytic cleavage activity using genetic programming with a min-max scoring function, Biosystems 72(1-2), 159-176, 2003.

Yona, G. and Levitt, M., Within the twilight zone: a sensitive profile-profile comparison tool based on information theory, J. Mol. Biol. 315(5), 1257-1275, 2002.

Yuille, A. L. and Kosowsky, J. J., Statistical physics algorithms that converge, Neural Comput. 6(3), 341-356, 1994.

Zadrozny, B. and Elkan, C., Learning and making decisions when costs and probabilities are both unknown, Proc. Seventh Int'l Conf. Knowledge Discovery and Data Mining 204-213, 2001.

Zhang, J., Selecting typical instances in instance-based learning, Proc. 9th Int. Conf. Machine Learning 470-479, 1992.

Zheng, Q. and Chellappa, R., A computational vision approach to image registration, IEEE Trans. IP 2(3), 311-326, 1993.
Full-Text Availability
  • Authorized for on-campus browsing and printing of the electronic full text, available to the public from 2006-12-27.
  • Authorized for off-campus browsing and printing of the electronic full text, available to the public from 2006-12-27.

