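System ID U0026-0706201311513500
Title (Chinese) 探討K等分交叉驗證法與全資料模型間分類正確性與一致性之研究
Title (English) A study for investigating classification accuracy and consistency between K-fold cross validation and complete-data model
University National Cheng Kung University
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 101 (ROC calendar; 2012-2013)
Semester 2
Publication year 102 (ROC calendar; 2013)
Author (Chinese) 陳映伊
Author (English) Ying-Yi Chen
Student ID r76001061
Degree Master's
Language Chinese
Pages 51
Committee Advisor: 翁慈宗
Committee member: 李昇暾
Committee member: 王維聰
Keywords (Chinese) K等分交叉驗證法  不一致率
Keywords (English) K-fold cross validation  inconsistent rate
Subject classification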
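Abstract (translated from Chinese) In classification problems in data mining, researchers typically use K-fold cross validation to select the best classifier, then train that classifier on the available data to obtain a model for predicting and interpreting new data. K-fold cross validation randomly partitions the data set into K mutually exclusive, equal-sized folds. Each fold serves in turn as the test data for a model trained on the other K-1 folds, and the K resulting classification accuracies are averaged to estimate the accuracy of the model learned from the available data. However, there is no guarantee that the classifier whose model learned from the available data performs best is the same classifier that K-fold cross validation selects. This study therefore proposes an experimental method to verify whether the mean accuracy from K-fold cross validation is a reasonable estimate of the accuracy of the model trained on the available data. It also proposes an inconsistent rate that measures how much the predictions on new data differ between the K models from K-fold cross validation and the model learned from the available data; a small inconsistent rate indicates that the model learned from the available data is more suitable for prediction and interpretation. The study uses thirty data sets. The experimental results show that the mean classification accuracy from K-fold cross validation and the accuracy of the full-data model are not significantly different statistically. However, because the chance that the two values differ by more than one percentage point exceeds 60%, selecting a classifier by this accuracy leads to a wrong choice more than 30% of the time. Finally, in the inconsistent rate verification, the decision tree is, among the four classifiers, the one whose classifications are least consistent with those estimated by K-fold cross validation, which means its full-data model has the weakest explanatory power for new data.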
Abstract (English) In classification applications, analysts generally use K-fold cross validation to find the classifier with the best performance. That classifier then learns a model from all available data for prediction and interpretation. K-fold cross validation randomly divides the available data into K mutually exclusive folds, and each fold is used in turn to test the model learned from the other K-1 folds. The average of the K resulting accuracies is an estimate of the prediction accuracy of the model learned from all available data. However, this procedure does not guarantee that the classifier ranked best by K-fold cross validation will also induce, from all available data, the model with the highest prediction accuracy on new data among the candidate classifiers.
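The K-fold procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names (`k_fold_indices`, `train_1nn`, `k_fold_mean_accuracy`), the `seed` parameter, and the 1-nearest-neighbour stand-in learner are all hypothetical and introduced here only to make the sketch runnable. The thesis itself evaluates decision trees, naive Bayes, support vector machines, and logistic regression.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k mutually exclusive folds
    whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_1nn(X, y):
    """Hypothetical stand-in learner: a 1-nearest-neighbour classifier.
    Returns a function that maps one instance to a predicted label."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
        return y[dists.index(min(dists))]
    return predict


def k_fold_mean_accuracy(X, y, k, train, seed=0):
    """Train on K-1 folds, test on the held-out fold, repeat K times,
    and return (mean accuracy, list of the K fold accuracies)."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(model(X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k, accuracies
```

The mean returned by `k_fold_mean_accuracy` is the quantity the abstract treats as an estimate of the accuracy of the model trained on all available data; that final model would be obtained by calling `train(X, y)` once on the full data set.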

This study first designs an experiment to investigate whether the mean accuracy from K-fold cross validation is a good estimate of the prediction accuracy of the model learned from all available data. It then introduces an inconsistent rate to measure the prediction consistency between the model learned from all available data and the K models induced during K-fold cross validation. When the inconsistent rate is small, the model learned from all available data is appropriate for prediction and interpretation.
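This record does not give the formula for the inconsistent rate, so the sketch below assumes one plausible reading: the fraction of (fold model, new instance) pairs on which a fold model's prediction differs from that of the model learned from all available data. The function name `inconsistent_rate` and the toy threshold models are hypothetical; the definition used in the thesis may differ.

```python
def inconsistent_rate(full_model, fold_models, new_X):
    """Assumed definition: share of (fold model, instance) pairs whose
    prediction differs from the full-data model's prediction."""
    disagreements = 0
    for x in new_X:
        reference = full_model(x)
        disagreements += sum(m(x) != reference for m in fold_models)
    return disagreements / (len(fold_models) * len(new_X))


# Toy illustration with hypothetical threshold classifiers.
def full_model(x):
    return int(x >= 5)


def fold_model_a(x):
    return int(x >= 5)  # agrees with the full-data model everywhere


def fold_model_b(x):
    return int(x >= 7)  # disagrees only for 5 <= x < 7


new_data = [1, 6, 8, 4]
rate = inconsistent_rate(full_model, [fold_model_a, fold_model_b], new_data)
print(rate)  # 0.125: one disagreement (fold_model_b at x = 6) out of 8 pairs
```

Under this assumed definition, a rate near zero means the K fold models and the full-data model predict new instances almost identically, which matches the abstract's reading that a small inconsistent rate supports using the full-data model.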

The experimental results on 30 data sets indicate that the mean accuracy from K-fold cross validation and the prediction accuracy on new data of the model induced from all available data are, on average, generally not significantly different. However, the probability that these two accuracies differ by more than one percentage point is approximately 0.60, so the probability of choosing a classifier with lower prediction accuracy on new data is generally larger than 0.3. The inconsistent rate shows that, among the four classifiers adopted in this study, decision tree learning is the least suitable for generating a model from all available data for prediction and interpretation.
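Table of contents
Chapter 1 Introduction
1.1 Research background and motivation
1.2 Research objectives
1.3 Research structure
Chapter 2 Literature review
2.1 K-fold cross validation
2.2 Classifiers
2.2.1 Decision tree
2.2.2 Naive Bayes classifier
2.2.3 Support vector machine
2.2.4 Logistic regression
2.3 Summary
Chapter 3 Research methods
3.1 Accuracy verification procedure
3.2 Inconsistent rate verification
3.3 Classifiers
Chapter 4 Empirical study
4.1 Data set characteristics
4.2 Accuracy verification
4.2.1 Decision tree
4.2.2 Naive Bayes classifier
4.2.3 Support vector machine
4.2.4 Logistic regression
4.2.5 Summary of accuracy verification
4.3 Inconsistent rate verification
4.3.1 Decision tree
4.3.2 Naive Bayes classifier
4.3.3 Support vector machine
4.3.4 Logistic regression
4.3.5 Summary of inconsistent rate verification
4.4 Summary of accuracy and inconsistent rate verification
Chapter 5 Conclusions and future work
5.1 Conclusions
5.2 Future work
References
Appendix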
References
Astudillo, C. A. and Oommen, B. J. (2013). On achieving semi-supervised pattern recognition by utilizing tree-based SOMs. Pattern Recognition, 46(1), 293-304.
Asuncion, A. and Newman, D. J. (2007). UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, School of Information and Computer Science.
Ballings, M. and Van den Poel, D. (2012). Customer event history for churn prediction: How long is long enough? Expert Systems with Applications, 39(18), 13517-13522.
Catal, C., Sevim, U., and Diri, B. (2011). Practical development of an Eclipse-based software fault prediction tool using Naive Bayes algorithm. Expert Systems with Applications, 38(3), 2347-2353.
Cawley, G. C. and Talbot, N. L. C. (2010). On over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11, 2079-2107.
Chattopadhyay, S., Davis, R. M., Menezes, D. D., Singh, G., Acharya, R. U., and Tamura, T. (2012). Application of Bayesian Classifier for the Diagnosis of Dental Pain. Journal of Medical Systems, 36(3), 1425-1439.
Chaves, R., Ramírez, J., Górriz, J. M., and Puntonet, C. G. (2012). Association rule-based feature selection method for Alzheimer’s disease diagnosis. Expert Systems with Applications, 39(14), 11766-11774.
Cheng, H. Y. K., Cheng, C. Y., and Ju, Y. Y. (2013). Work-related musculoskeletal disorders and ergonomic risk factors in early intervention educators. Applied Ergonomics, 44(1), 134-141.
de Lorimier, A. and El-Geneidy, A. M. (2013). Understanding the Factors Affecting Vehicle Usage and Availability in Carsharing Networks: A Case Study of Communauto Carsharing System from Montreal, Canada. International Journal of Sustainable Transportation, 7(1), 35-51.
Fushiki, T. (2011). Estimation of prediction error by using K-fold cross-validation. Statistics and Computing, 21(2), 137-146.
Hwang, K. S., Chen, Y. J., Jiang, W. C., and Yang, T. W. (2012). Induced states in a decision tree constructed by Q-learning. Information Sciences, 213, 39-49.
Karami, G., Attaran, N., Hosseini, S. M. S., and Hossein, S. M. S. (2012). Bankruptcy Prediction, Accounting Variables and Economic Development: Empirical Evidence from Iran. International Business Research, 5(8), 147-152.
Karimaldini, F., Teang Shui, L., Ahmed Mohamed, T., Abdollahi, M., and Khalili, N. (2012). Daily Evapotranspiration Modeling from Limited Weather Data by Using Neuro-Fuzzy Computing Technique. Journal of Irrigation and Drainage Engineering, 138(1), 21-34.
Marcot, B. G. (2012). Metrics for evaluating performance and uncertainty of Bayesian network models. Ecological Modelling, 230(0), 50-62.
Nahar, J., Imam, T., Tickle, K. S., and Chen, Y. P. P. (2013). Computational intelligence for heart disease diagnosis: A medical knowledge driven approach. Expert Systems with Applications, 40(1), 96-104.
Pan, S., Iplikci, S., Warwick, K., and Aziz, T. Z. (2012). Parkinson’s Disease tremor classification – A comparison between Support Vector Machines and neural networks. Expert Systems with Applications, 39(12), 10764-10771.
Rodriguez, J. D., Perez, A., and Lozano, J. A. (2010). Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 569-575.
Subramanian, J. and Simon, R. (2011). An evaluation of resampling methods for assessment of survival risk prediction in high-dimensional settings. Statistics in Medicine, 30(6), 642-653.
Sun, J. and Li, H. (2012). Financial distress prediction using support vector machines: Ensemble vs. individual. Applied Soft Computing, 12(8), 2254-2265.
Turrado García, F., García Villalba, L. J., and Portela, J. (2012). Intelligent system for time series classification using support vector machines applied to supply-chain. Expert Systems with Applications, 39(12), 10590-10599.
Valle, M. A., Varas, S., and Ruz, G. A. (2012). Job performance prediction in a call center using a naive Bayes classifier. Expert Systems with Applications, 39(11), 9939-9945.
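Full-text usage rights
  • On-campus browsing and printing of the electronic full text is authorized, open from 2014-06-13.
  • Off-campus browsing and printing of the electronic full text is authorized, open from 2015-06-13.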

