

The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.
(Note: if no record is found, or the holding status shows "closed stacks, not open to the public", the thesis is not in the stacks and cannot be accessed.)
System ID: U0026-1506201813294100
Title (Chinese): 交叉驗證評估法對分類方法效能估計值之影響
Title (English): The Impact of Cross-Validation Methods on the Performance Estimates of Classification Algorithms
University: National Cheng Kung University
Department (Chinese): 資訊管理研究所
Department (English): Institute of Information Management
Academic Year: 106
Semester: 2
Publication Year: 107
Author (Chinese): 魏敏如
Author (English): Min-Ru Wei
Student ID: R76054030
Degree: Master
Language: Chinese
Pages: 61
Committee: Advisor: 翁慈宗
Committee member: 蔡青志
Committee member: 胡政宏
Keywords (Chinese): 分類正確率; 二元變數; 相依性分析; 評估法
Keywords (English): Accuracy; Binary variable; Dependency analysis; Evaluation method
Subject Classification:
Abstract (Chinese): In data mining, k-fold cross-validation is widely used to evaluate the performance of classification algorithms, but the accuracy estimates it produces are generally considered to have a large variance. Various evaluation methods that partition the data differently have therefore been developed in the hope of reducing this variance. The predictions obtained from these evaluation methods, however, may be dependent, and ignoring this dependency can distort performance comparisons between classification algorithms. The purpose of this study is therefore to determine whether the accuracies obtained from two evaluation methods are dependent, and to develop statistical methods for comparing the performance of two dependent evaluation methods. First, a method is proposed for testing on a single data set whether two binary variables are independent; it is used to examine whether the predictions produced by two evaluation methods are dependent, and statistical tests are developed for the dependent case. When dependency is detected, the statistical tests developed in this study are applied to single or multiple data files processed by different evaluation methods, to determine whether the accuracies obtained under the same classification algorithm differ significantly; these tests are limited to the mean classification accuracy. Four evaluation methods were chosen to process the data files, the nearest-neighbor and decision-tree classification algorithms were adopted, and the proposed tests were used to examine the significance of the accuracy differences among evaluation methods. The experimental results show that, for most data files, the predictions produced by different evaluation methods are dependent, and the resulting accuracies are not significantly different. In addition, since previous studies have shown that stratified cross-validation is more effective on imbalanced data files, imbalanced data files were also examined. Because parametric tests have their limitations, no new statistical tests were developed for the variance of the mean accuracy or for performance evaluation on imbalanced data files; nonparametric tests were used instead, and again no significant differences were found. In summary, whether on single or multiple ordinary data files, or on multiple imbalanced data files, the results obtained from different evaluation methods do not differ significantly.
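The motivating observation above, that a single run of k-fold cross-validation yields an accuracy estimate that shifts with the random partition, can be illustrated with a minimal sketch. The 1-nearest-neighbor classifier, the synthetic two-cluster data, and all names below are hypothetical illustrations, not the thesis' actual procedure or data files.

```python
import random

def one_nn_predict(train, x):
    # label of the Euclidean-nearest training point
    nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

def kfold_accuracy(data, k, seed):
    # one run of k-fold cross-validation; the random partition
    # depends on the seed, so the estimate varies across runs
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        correct += sum(one_nn_predict(train, data[i][0]) == data[i][1] for i in fold)
    return correct / len(data)

# hypothetical synthetic data: two Gaussian clusters, labels 0 and 1
gen = random.Random(0)
data = ([((gen.gauss(0, 1), gen.gauss(0, 1)), 0) for _ in range(30)]
        + [((gen.gauss(2, 1), gen.gauss(2, 1)), 1) for _ in range(30)])

# repeat 10-fold cross-validation over different random partitions
estimates = [kfold_accuracy(data, 10, seed) for seed in range(20)]
mean = sum(estimates) / len(estimates)
var = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)
print(f"mean accuracy {mean:.3f}, variance across runs {var:.6f}")
```

The nonzero variance across the 20 runs is exactly the variability that alternative partitioning schemes (e.g. stratification) try to reduce.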
Abstract (English): Cross-validation is a popular approach for evaluating the performance of classification algorithms. The variance of the accuracy estimate resulting from k-fold cross-validation is generally relatively large, and several evaluation methods have therefore been developed to reduce this variance. When a data set is processed by two evaluation methods for the same classification algorithm, the resulting accuracies may not be independent, which complicates performance comparison. The purpose of this research is to propose statistical methods for comparing the performance of various evaluation methods. When a data set is classified by the same algorithm, an independence test for two binary random variables is first introduced to identify whether the predictions of the same instance under two evaluation methods are independent. Statistical methods are then proposed for comparing the performance of a classification algorithm on a single data set or on multiple data sets processed by two dependent evaluation methods. Two classification algorithms, decision tree induction and k-nearest neighbor, are chosen to test the performance of four evaluation methods. The experimental results of the independence test on twenty ordinary data sets show that the predictions of instances under various evaluation methods are generally dependent, and the results of the proposed statistical tests suggest that the accuracy estimates resulting from various evaluation methods are not significantly different. Nonparametric statistical methods are employed to test the variance of accuracy for ordinary data sets and the mean and variance of the F-measure for imbalanced data sets. These tests also indicate that the performance of a classification algorithm is not significantly different when data sets are processed by various evaluation methods.
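The independence test mentioned in the abstracts can be sketched as a chi-square test on the 2×2 table of per-instance correctness indicators from two evaluation methods. This is a generic Pearson chi-square statistic for a 2×2 table, not necessarily the exact statistic developed in the thesis, and the two indicator sequences below are hypothetical data.

```python
def chi_square_2x2(a, b, c, d):
    # Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# correct (1) / wrong (0) indicators for the same 20 instances under
# two evaluation methods (hypothetical data)
m1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
m2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0]

# joint-outcome counts for the 2x2 contingency table
a = sum(x == 1 and y == 1 for x, y in zip(m1, m2))  # both correct
b = sum(x == 1 and y == 0 for x, y in zip(m1, m2))  # only method 1 correct
c = sum(x == 0 and y == 1 for x, y in zip(m1, m2))  # only method 2 correct
d = sum(x == 0 and y == 0 for x, y in zip(m1, m2))  # both wrong

stat = chi_square_2x2(a, b, c, d)
# compare against the chi-square critical value with 1 degree of
# freedom at the 5% level (3.841); exceeding it suggests dependence
print(f"table: [[{a}, {b}], [{c}, {d}]], chi2 = {stat:.3f}")
```

When the statistic exceeds the critical value, the two methods' predictions are treated as dependent, which is the case that calls for a paired rather than an independent-samples comparison of their accuracies.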
Contents:
Abstract..................................................I
Acknowledgements..........................................V
Table of Contents........................................VI
List of Tables..........................................VII
List of Figures........................................VIII
Chapter 1 Introduction....................................1
    1.1 Research Background and Motivation................1
    1.2 Research Objectives...............................3
    1.3 Research Framework................................3
Chapter 2 Literature Review...............................5
    2.1 Evaluation Methods for Classification Accuracy....5
    2.2 Association Measures for Binary Random Variables..8
    2.3 Performance and Evaluation of Classification Algorithms..10
    2.4 Summary..........................................11
Chapter 3 Research Methods...............................12
    3.1 Matching Scheme between Two Evaluation Methods...12
    3.2 Dependency Test for the Accuracies of Different Evaluation Methods..15
    3.3 Methods for Comparing Two Dependent Evaluation Methods on Ordinary Data Files..18
        3.3.1 Comparison on a Single Data File...........18
        3.3.2 Comparison on Multiple Data Files..........21
    3.4 Evaluation Methods...............................26
Chapter 4 Empirical Study................................29
    4.1 Ordinary Data Files..............................29
    4.2 Dependency Test..................................30
    4.3 Performance Evaluation on Ordinary Data Files....36
        4.3.1 Comparison on a Single Data File...........42
        4.3.2 Comparison on Multiple Data Files..........45
    4.4 Imbalanced Data Files............................49
    4.5 Summary..........................................54
Chapter 5 Conclusions and Suggestions....................56
    5.1 Conclusions......................................56
    5.2 Suggestions and Future Work......................57
References...............................................59
References: 葉柏揚 (2017). On the appropriateness of repeatedly executing k-fold cross-validation (in Chinese). Master's thesis, Institute of Information Management, National Cheng Kung University.

吳建昆 (2017). The impact of stratification on evaluating the performance of classification algorithms with k-fold cross-validation (in Chinese). Master's thesis, Institute of Information Management, National Cheng Kung University.

何儀珊 (2016). Methods for evaluating the performance of two dependent classification algorithms (in Chinese). Master's thesis, Institute of Information Management, National Cheng Kung University.

Alexander, R. A., Alliger, G. M., Carson, K. P. & Barrett, G. V. (1985). The empirical performance of measures of association in the 2×2 table. Educational and Psychological Measurement, 45(1), 79-87.

Bengio, Y. & Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5, 1089-1105.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1923.

Davenport Jr, E. C. & El-Sanhurry, N. A. (1991). Phi/phimax: review and synthesis. Educational and Psychological Measurement, 51(4), 821-828.

Diamantidis, N. A., Karlis, D. & Giakoumakis, E. A. (2000). Unsupervised stratification of cross-validation for accuracy estimation. Artificial Intelligence, 116(1-2), 1-16.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30.

Forman, G. & Scholz, M. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter, 12(1), 49-57.

Kaltenhauser, J. & Lee, Y. (1976). Correlation coefficients for binary data in factor analysis. Geographical Analysis, 8(3), 305-313.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of International Joint Conference on Artificial Intelligence, Montreal, Canada, 1137-1143.

López, V., Fernández, A., Moreno-Torres, J. G. & Herrera, F. (2012). Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification: open problems on intrinsic data characteristics. Expert Systems with Applications, 39(7), 6585-6608.

Moreno-Torres, J. G., Sáez, J. A. & Herrera, F. (2012). Study on the impact of partition-induced dataset shift on k-fold cross-validation. IEEE Transactions on Neural Networks and Learning Systems, 23(8), 1304-1312.

Rodríguez, J. D., Pérez, A. & Lozano, J. A. (2013). A general framework for the statistical analysis of the sources of variance for classification error estimators. Pattern Recognition, 46(3), 855-864.

Rodriguez, J. D., Perez, A. & Lozano, J. A. (2010). Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 569-575.

Tan, P. N., Steinbach, M. & Kumar, V. (2006). Introduction to Data Mining. Pearson Education.

Weiss, S. M. (1991). Small sample error rate estimation for k-NN classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3), 285-289.

Witten, I. H. & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Wong, T. T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48(9), 2839-2846.

Wong, T. T. (2017). Parametric methods for comparing the performance of two classification algorithms evaluated by k-fold cross validation on multiple data sets. Pattern Recognition, 65, 97-107.

Zeng, X. & Martinez, T. R. (2000). Distribution-balanced stratified cross-validation for accuracy estimation. Journal of Experimental & Theoretical Artificial Intelligence, 12(1), 1-12.
Full-Text Use Authorization:
  • Authorized for on-campus browsing/printing of the electronic full text, available from 2023-06-15.
  • Authorized for off-campus browsing/printing of the electronic full text, available from 2023-06-15.

