System ID: U0026-1206201712455000
Title (Chinese): The impact of stratification on the performance of classification methods evaluated by k-fold cross-validation
Title (English): The impact of stratification on the performance of classification algorithms evaluated by k-fold cross validation
University: National Cheng Kung University
Department (Chinese): Institute of Information Management
Department (English): Institute of Information Management
Academic year: 105 (ROC calendar)
Semester: 2
Publication year: 106 (ROC calendar; 2017 CE)
Student (Chinese): 吳建昆
Student (English): Jian-Kuen Wu
Student ID: R76031090
Degree: Master's
Language: Chinese
Pages: 71
Committee: Advisor - 翁慈宗
Committee member - 王維聰
Committee member - 劉任修
Committee member - 陳榮泰
Keywords (Chinese): k-fold cross-validation; stratification; classification accuracy; precision; recall
Keywords (English): K-fold Cross Validation; Stratification; Accuracy; Precision; Recall
Subject classification:
Abstract (Chinese): Most studies use k-fold cross-validation to compute accuracy estimates for classification methods, yet few apply stratification to make the training and test data in each fold more representative and thereby reduce the variance of the resulting estimates. Stratification allows the performance of a classification method to be estimated more faithfully, but prior work on stratification approaches general and imbalanced datasets from different directions. For general datasets, most studies examine the differences between stratified k-fold cross-validation and its variants in terms of bias and variance; for imbalanced datasets, studies use measures such as precision and recall to investigate whether stratified k-fold cross-validation can effectively reduce the variability of those measures on rare classes. Many studies have proposed their own stratification methods, but these have not been compared with appropriate parametric statistical tests, nor has it been clearly established under what conditions stratification is suitable. This study therefore compares these stratification methods under identical experimental conditions, using decision tree and nearest-neighbor classifiers, and examines the results with parametric statistical tests. The experimental results show that, on both general and imbalanced datasets and on both single and multiple datasets, the mean estimates obtained with and without stratification are close to those of plain k-fold cross-validation, being neither consistently better nor worse. In the comparison of the variances of the estimates on single datasets, however, the tests show only small differences: when computation time matters and a more stable estimate is desired, applying standard stratification to k-fold cross-validation suffices; when time is not a concern, an advanced stratification method that additionally measures the attribute values of the instances within each class can be applied, yielding estimates slightly more stable than those of standard stratification.
Abstract (English): K-fold cross-validation is one of the accuracy estimation methods used in many kinds of experimental research. Stratification, however, is seldom performed to obtain more representative data in each partition. Stratification has the advantage of reducing the variance of estimators and thus estimating the true accuracy more closely. Prior research treats stratification on general and imbalanced datasets from different perspectives: general datasets are used to develop new algorithms from standard stratified k-fold cross-validation or to investigate the estimator in terms of bias and variance, while imbalanced datasets are used to discuss the performance of stratification through recall, precision, and other measures in the presence of rare class values. Many studies recommend their own algorithms without an appropriate parametric method for statistical comparison. The purpose of this study is therefore to compare these stratified methods under the same experimental conditions, with decision tree and k-nearest-neighbor classifiers, through reasonable statistical tests. The results demonstrate that the estimates stay close to those of k-fold cross-validation whether or not stratification is implemented, on single or multiple datasets, general or imbalanced. Furthermore, when time complexity is a concern and a stable estimator is desired, standard stratification can be applied to k-fold cross-validation; an advanced stratification method that takes the features of the instances into account yields an estimator slightly more stable than standard stratification.
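
As a rough illustration of the comparison described in the abstracts, the following Python sketch (my own, not code from the thesis) contrasts plain and stratified 10-fold cross-validation for a decision tree on a synthetic imbalanced dataset. The dataset size, imbalance ratio, number of repetitions, and use of scikit-learn are all illustrative assumptions.

    # A minimal sketch, assuming scikit-learn is available; not the thesis's code.
    # It compares how stable the accuracy estimates of plain vs. stratified
    # 10-fold cross-validation are across repeated fold assignments.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic two-class dataset with a rare class (about 10% of instances);
    # the size and imbalance ratio are illustrative choices.
    X, y = make_classification(n_samples=300, n_features=10,
                               weights=[0.9, 0.1], random_state=0)

    def cv_estimates(make_cv, n_repeats=30):
        # Repeat 10-fold CV with different fold assignments and collect the
        # per-repetition mean accuracy estimates.
        estimates = []
        for seed in range(n_repeats):
            scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X, y, cv=make_cv(seed))
            estimates.append(scores.mean())
        return np.array(estimates)

    plain = cv_estimates(lambda s: KFold(n_splits=10, shuffle=True, random_state=s))
    strat = cv_estimates(lambda s: StratifiedKFold(n_splits=10, shuffle=True,
                                                   random_state=s))

    # The abstract's finding in miniature: the two means should be close, while
    # the stratified estimates are expected to vary less across repetitions.
    print(f"plain      k-fold: mean={plain.mean():.3f}, variance={plain.var():.6f}")
    print(f"stratified k-fold: mean={strat.mean():.3f}, variance={strat.var():.6f}")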
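
The abstracts also mention evaluating differences with parametric statistical tests. The thesis's exact test procedure is not reproduced here; as one plausible stand-in, Welch's test (Welch, 1938, listed in the references), which does not assume equal variances, can check whether the two schemes' mean estimates differ, continuing from the sketch above.

    # Continuing the sketch above (an illustrative stand-in, not the thesis's
    # exact procedure): Welch's t-test compares the mean accuracy estimates of
    # the two validation schemes without assuming equal variances.
    from scipy.stats import ttest_ind

    t_stat, p_value = ttest_ind(plain, strat, equal_var=False)  # Welch's t-test
    print(f"Welch t = {t_stat:.3f}, p = {p_value:.3f}")
    # A large p-value here is consistent with the finding that mean estimates
    # with and without stratification are similar.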
Table of Contents: Abstract I
Table of Contents VI
List of Tables VIII
List of Figures IX
Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 3
1.3 Research Framework 3
Chapter 2 Literature Review 4
2.1 K-fold Cross-Validation 4
2.1.1 Stratified K-fold Cross-Validation 5
2.1.2 Practical Applications of Stratified K-fold Cross-Validation 9
2.2 Bias and Variance 10
2.3 Evaluating Classification Accuracy 13
2.4 Summary 15
Chapter 3 Research Methods 17
3.1 Methods for Evaluating Classification Performance 18
3.1.1 K-fold Cross-Validation 18
3.1.2 Standard Stratified K-fold Cross-Validation 19
3.1.3 Advanced Stratified K-fold Cross-Validation 20
3.2 Comparison Methods for General Datasets 28
3.3 Comparison Methods for Imbalanced Datasets 33
3.4 Summary 38
Chapter 4 Empirical Study 39
4.1 General Datasets 39
4.1.1 Comparisons on a Single Dataset 41
4.1.2 Comparisons on Multiple Datasets 52
4.2 Imbalanced Datasets 55
4.2.1 Comparisons on a Single Dataset 56
4.2.2 Comparisons on Multiple Datasets 62
4.3 Summary 65
Chapter 5 Conclusions and Suggestions 67
5.1 Conclusions 67
5.2 Suggestions and Future Work 69
References 70
References: 林哲玄 (2016). A statistical method for comparing the performance of two classification algorithms on imbalanced datasets. Master's thesis, Institute of Information Management, National Cheng Kung University.
陳育生 (2015). A parametric statistical method for comparing the performance of two classification methods on multiple datasets. Master's thesis, Institute of Information Management, National Cheng Kung University.
Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning 6(1): 37-66.
Bengio, Y. & Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research 5(Sep): 1089-1105.
Cano, J. R., Herrera, F., & Lozano, M. (2005). Stratification for scaling up evolutionary prototype selection. Pattern Recognition Letters 26(7): 953-963.
Diamantidis, N., Karlis, D., & Giakoumakis, E. A. (2000). Unsupervised stratification of cross-validation for accuracy estimation. Artificial Intelligence 116(1): 1-16.
Forman, G. & Scholz, M. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter 12(1): 49-57.
Friedman, J. H. (1997). On bias, variance, 0/1—loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1(1): 55-77.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada, 1137-1145.
Kohavi, R. & Wolpert, D. H. (1996). Bias plus variance decomposition for zero-one loss functions. Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 275-283.
López, V., Fernández, A., Moreno-Torres, J. G., & Herrera, F. (2012). Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Systems with Applications 39(7): 6585-6608.
Mantyjarvi, J., Himberg, J., & Seppanen, T. (2001). Recognizing human motion with multiple acceleration sensors. Proceedings of the 2001 IEEE International Conference on Systems, Man, and Cybernetics, Tucson, USA, 747-752.
Mojena, R. (1977). Hierarchical grouping methods and stopping rules: An evaluation. The Computer Journal 20(4): 359-363.
Moreno-Torres, J. G., Saez, J. A., & Herrera, F. (2012). Study on the impact of partition-induced dataset shift on k-fold cross-validation. IEEE Transactions on Neural Networks and Learning Systems 23(8): 1304-1312.
Parker, B. J., Gunter, S., & Bedo, J. (2007). Stratification bias in low signal microarray studies. BMC Bioinformatics 8: 326.
Rodriguez, J. D., Pérez, A., & Lozano, J. A. (2013). A general framework for the statistical analysis of the sources of variance for classification error estimators. Pattern Recognition 46(3): 855-864.
Rodriguez, J. D., Pérez, A., & Lozano, J. A. (2010). Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(3): 569-575.
Webb, G. I. & Conilione, P. (2005). Estimating bias and variance from data. Pre-publication manuscript. Retrieved from http://www.csse.monash.edu/webb/-Files/WebbConilione06.pdf
Weiss, S. M. (1991). Small sample error rate estimation for k-NN classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(3): 285-289.
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika 29(3/4): 350-362.
Williams, D. P., Myers, V., & Silvious, M. S. (2009). Mine classification with imbalanced data. IEEE Geoscience and Remote Sensing Letters 6(3): 528-532.
Witten, I. H. & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
Wong, T.-T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition 48(9): 2839-2846.
Yang, P., Xu, L., Zhou, B. B., Zhang, Z., & Zomaya, A. Y. (2009). A particle swarm based hybrid system for imbalanced medical data sampling. BMC Genomics 10(3): 1.
Zeng, X. & Martinez, T. R. (2000). Distribution-balanced stratified cross-validation for accuracy estimation. Journal of Experimental & Theoretical Artificial Intelligence 12(1): 1-12.
Zhang, Y., Wu, L., & Wang, S. (2011). Magnetic resonance brain image classification by an improved artificial bee colony algorithm. Progress in Electromagnetics Research 116: 65-79.
Full-text access rights:
  • On-campus browsing/printing of the electronic full text is authorized, effective 2022-07-01.
  • Off-campus browsing/printing of the electronic full text is authorized, effective 2022-07-01.

