System ID U0026-0507201818281800
Title (Chinese) 簡易貝氏分類器在不平衡資料集上效能改善之研究
Title (English) A Study on the Performance Improvement of Naive Bayesian Classifier on Imbalanced Data Sets
University National Cheng Kung University
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic Year 106 (2017-2018)
Semester 2
Year of Publication 107 (2018)
Author (Chinese) 姚靜姍
Author (English) Ching-Shan Yao
Student ID R76054080
Degree Master
Language Chinese
Number of Pages 56
Committee Advisor - 翁慈宗 (Tzu-Tsung Wong)
Committee member - 蔡青志
Committee member - 胡政宏
Keywords (Chinese) 簡易貝氏分類器 (naive Bayesian classifier), 不平衡資料集 (imbalanced data set), 屬性排序 (attribute ranking), 特徵選取 (feature selection), 廣義狄氏分配 (generalized Dirichlet distribution)
Keywords (English) feature selection, generalized Dirichlet distribution, imbalanced data set, naive Bayesian classifier, selective naive Bayes
Chinese Abstract Among the many classification methods, the naive Bayesian classifier is widely used because it is simple to apply, computationally efficient, and achieves high predictive accuracy. Most classification methods, however, implicitly assume that the class distribution is not highly skewed, so they usually perform well; when most instances in a data set belong to one class while the class of interest accounts for only a small minority of the instances, the data set is said to be imbalanced. Since the naive Bayesian classifier computes its classification score by multiplying the class value probability by the conditional probabilities of all attributes, the large gap between the two class value probabilities in an imbalanced data set can lead it to misclassify minority-class instances as the majority class. This study therefore first tests the impact of the class value probabilities on the naive Bayesian classifier. In addition, a Bayesian attribute selection method is used to rank the attributes by importance, feature selection is performed, and prior distributions are introduced to adjust the attribute parameters, so as to raise the classification performance of the naive Bayesian classifier. Ten data sets downloaded from the UCI data repository were processed into imbalanced data sets for the experiments. The empirical results show that whether or not the class value probabilities are considered has little effect, whereas introducing prior distributions significantly improves the performance of the naive Bayesian classifier on imbalanced data sets; with this improvement the classifier becomes competitive with RIPPER, although it still falls slightly short of Random Forest.
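To make the mechanism concrete, the following is a minimal sketch, assuming categorical attributes and Laplace smoothing, of a naive Bayesian classifier in which the class value probability term can be switched off; this is the variation whose effect the study tests. The function names, smoothing constant, and toy data are illustrative assumptions, not the thesis's experimental setup.

```python
from collections import Counter, defaultdict

def train_nb(X, y, alpha=1.0):
    """Estimate class value probabilities and smoothed attribute conditional
    probabilities from categorical training data (add-alpha smoothing)."""
    class_counts = Counter(y)
    priors = {c: class_counts[c] / len(y) for c in class_counts}
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    domain = defaultdict(set)             # attribute index -> observed values
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            value_counts[(j, c)][v] += 1
            domain[j].add(v)
    def cond(j, v, c):                    # P(attribute j = v | class c)
        return (value_counts[(j, c)][v] + alpha) / (class_counts[c] + alpha * len(domain[j]))
    return priors, cond

def predict(x, priors, cond, use_prior=True):
    """Multiply the class value probability by all attribute conditional
    probabilities; use_prior=False drops the class value probability."""
    scores = {}
    for c, p in priors.items():
        s = p if use_prior else 1.0
        for j, v in enumerate(x):
            s *= cond(j, v, c)
        scores[c] = s
    return max(scores, key=scores.get)

# Toy imbalanced data: nine majority ('neg') instances, one minority ('pos').
X = [('a', 'x')] * 7 + [('b', 'x'), ('b', 'y'), ('a', 'y')]
y = ['neg'] * 9 + ['pos']
priors, cond = train_nb(X, y)
print(predict(('b', 'y'), priors, cond, use_prior=True))   # 'neg': the skewed prior dominates
print(predict(('b', 'y'), priors, cond, use_prior=False))  # 'pos': likelihoods alone favor the minority
```

On this toy set the class value probability flips the prediction against the minority class, which is the effect the abstract describes; the study's empirical finding, however, is that removing it makes little difference on real data sets.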
English Abstract The number of negative instances is generally far larger than the number of positive instances in an imbalanced data set. Since the positive instances are few, the probability estimates for calculating the classification probability of this class value can be unreliable when the naive Bayesian classifier is applied. This could be the main reason for the relatively poor performance of the naive Bayesian classifier on imbalanced data sets. This study first investigates whether the occurring probabilities of class values should be considered in calculating classification probabilities. Attributes are then ranked for introducing generalized Dirichlet priors to improve the performance of the naive Bayesian classifier on imbalanced data sets. The experimental results obtained from 10 data sets show that removing the occurring probabilities of class values from the calculation of classification probabilities is not necessary, and that introducing priors for attributes can generally achieve a higher F-measure on imbalanced data sets. The naive Bayesian classifier with priors is competitive with the RIPPER algorithm, while its F-measure is lower than that of Random Forest on most of the imbalanced data sets.
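The generalized Dirichlet priors follow the line of Wong (2009) in the reference list. As a simplified stand-in, the sketch below uses an ordinary symmetric Dirichlet prior of adjustable strength to show how pseudo-counts stabilize the probability estimates that are noisy for the minority class, together with the F-measure used in the comparisons. The prior_weight parameter and both function names are hypothetical, not the thesis's parameter-setting method.

```python
def dirichlet_estimate(count_vc, count_c, n_values, prior_weight=1.0):
    """P(value | class) under a symmetric Dirichlet prior: each of the
    n_values attribute values receives prior_weight pseudo-counts before
    normalization.  A heavier prior pulls the estimate toward uniform,
    which matters most for the minority class, whose raw counts are small
    and therefore unreliable."""
    return (count_vc + prior_weight) / (count_c + prior_weight * n_values)

def f_measure(y_true, y_pred, positive='pos'):
    """Harmonic mean of precision and recall for the minority (positive)
    class, the evaluation metric reported in the experiments."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A value seen once in two minority instances of a three-valued attribute:
print(dirichlet_estimate(1, 2, 3, prior_weight=1.0))    # 0.4
print(dirichlet_estimate(1, 2, 3, prior_weight=10.0))   # ~0.34, near the uniform 1/3
print(f_measure(['pos', 'neg', 'pos', 'neg'],
                ['pos', 'pos', 'neg', 'neg']))          # 0.5
```

A generalized Dirichlet prior additionally allows the pseudo-counts to differ across values and attributes, which is where the attribute ranking described in the abstracts comes in.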
Table of Contents Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 2
1.3 Research Process 3
Chapter 2 Literature Review 4
2.1 The Naive Bayesian Classifier 4
2.2 Classification Methods for Imbalanced Data Sets 7
2.2.1 Imbalanced Data Sets 7
2.2.2 Classification Approaches at Different Levels 9
2.3 Methods for Improving the Performance of the Naive Bayesian Classifier 12
2.3.1 Discretization Methods 13
2.3.2 Feature Selection and Attribute Ranking 14
2.3.3 Prior Distributions 14
2.4 Evaluation Metrics 17
2.5 Summary 19
Chapter 3 Research Methodology 20
3.1 Data Preprocessing 20
3.2 Estimation of Attribute Value Probabilities 25
3.3 Evaluation of Results 31
Chapter 4 Empirical Study 33
4.1 Characteristics of the Data Sets 33
4.2 Comparison with and without Class Value Probabilities 38
4.3 Comparison Before and After Attribute Parameter Adjustment 41
4.4 Comparison of GDRC-NB with Other Classification Methods 47
4.5 Summary 49
Chapter 5 Conclusions and Suggestions 50
5.1 Conclusions 50
5.2 Future Research and Development 51
References 52
References Addin, O., Sapuan, S. M., Mahdi, E., & Othman, M. (2007). A naive-Bayes classifier for damage detection in engineering materials. Materials and Design, 28(8), 2379-2386.

Barandela, R., Sanchez, J. S., Garcia, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849-851.

Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29.

Catlett, J. (1991). On changing continuous attributes into ordered discrete attributes. Proceedings of the Fifth European Working Session on Learning, 164-178.

Cestnik, B. & Bratko, I. (1991). On estimating probabilities in tree pruning. Proceedings of the Fifth European Working Session on Learning, 138-150.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1), 1-6.

Chen, J., Huang, H., Tian, S., & Qu, Y. (2009). Feature selection for text classification with naive Bayes. Expert Systems with Applications, 36(3), 5432–5435.

Cohen, G., Hilario, M., Sax, H., Hugonnet, S., & Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine, 37(1), 7–18.

Connor, R. J. & Mosimann, J. E. (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association, 64(325), 194-206.

Domingos, P. & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2), 103–130.

Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the Twelfth International Conference on Machine Learning, 194-202.

Eitrich, T., Kless, A., Druska, C., Meyer, W., & Grotendorst, J. (2007). Classification of highly unbalanced CYP450 data of drugs using cost sensitive machine learning techniques. Journal of Chemical Information and Modeling, 47(1), 92–103.

Estabrooks, A., Jo, T., & Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1), 18–36.

Fayyad, U. & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 1022–1027.

He, H. & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.

Kerber, R. (1992). ChiMerge: discretization of numeric attributes. Proceedings of the Tenth National Conference on Artificial Intelligence, 123–128.

Kohavi, R. & Sahami, M. (1996). Error-based and entropy-based discretization of continuous features. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 114-119.

Kotsiantis, S. B., & Pintelas, P. E. (2003). Mixture of expert agents for handling imbalanced data sets. Annals of Mathematics, Computing & Teleinformatics, 1(1), 46-55.

Kubat, M. & Matwin, S. (1997). Addressing the curse of imbalanced training sets: one-sided selection. Proceedings of the Fourteenth International Conference on Machine Learning, 179–186.

Kubat, M., Holte, R. C., & Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2), 195–215.

Kuncheva, L. I. & Rodríguez, J. J. (2014). A weighted voting framework for classifiers ensembles. Knowledge and Information Systems, 38(2), 259-275.

Langley, P. & Sage, S. (1994). Induction of selective Bayesian classifiers. Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, 399–406.

Li, Y., Guo H., Liu, X., Li, Y., & Li, J. (2016). Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowledge-Based Systems, 94, 88-104.

López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113-141.

Maragoudakis, M., Kermanidis, K., Garbis, A., & Fakotakis, N. (2000). Dealing with imbalanced data using Bayesian techniques. Proceedings of the International Conference on Language Resources and Evaluation, 1045–1050.

Mitchell, T. M. (1997). Bayesian learning. In Machine Learning (pp. 154-199). New York: McGraw-Hill.

Moreno-Torres, J. G. & Herrera, F. (2010). A preliminary study on overlapping and data fracture in imbalanced domains by means of genetic programming-based feature extraction. Proceedings of the Tenth International Conference on Intelligent Systems Design and Applications, 501–506.

Napierała, K., Stefanowski, J., & Wilk, S. (2010). Learning from imbalanced data in presence of noisy and borderline examples. Proceedings of the Seventh International Conference on Rough Sets and Current Trends in Computing, 158-167.

Orriols-Puig, A., Bernadó-Mansilla, E., Goldberg, D. E., Sastry, K., & Lanzi, P. L. (2009). Facetwise analysis of XCS for problems with class imbalances. IEEE Transactions on Evolutionary Computation, 13(5), 1093–1119.

Rijsbergen, C. J. van (1979). Information Retrieval. London: Butterworths.

Schneider, K. M. (2003). A comparison of event models for naive Bayes anti-spam e-mail filtering. Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, 307–314.

Sobran, N. M. M., Arfah, A., & Ibrahim, Z. (2013). Classification of imbalanced dataset using conventional naïve Bayes classifier. Proceedings of the International Conference on Artificial Intelligence and Computer Science (AICS2013), 35-42.

Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: a review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687-719.

Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., & Zhou, Y. (2015). A novel ensemble method for classifying imbalanced data. Pattern Recognition, 48(5), 1623-1637.

Tahir, M. A., Kittler, J., Mikolajczyk, K., & Yan, F. (2009). A multiple expert approach to the class imbalance problem using inverse random under sampling. Proceedings of the Eighth International Workshop on Multiple Classifier Systems, 82-91.

Tan, P. N., Steinbach, M., & Kumar, V. (2006). Classification: alternative techniques. In Introduction to Data Mining (pp. 207-315).

UCI (2018). Center for Machine Learning and Intelligent Systems. Retrieved from http://archive.ics.uci.edu/ml

Wang, J., You, J., Li, Q., & Xu, Y. (2012). Extract minimum positive and maximum negative features for imbalanced binary classification. Pattern Recognition, 45(3), 1136-1145.

Weiss, G. M. & Provost, F. (2003). Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19, 315-354.

Weiss, G. M. (2009). Mining with rare cases. Data Mining and Knowledge Discovery Handbook, 747-757.

Wong, T. T. (1998). Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97(2-3), 165-181.

Wong, T. T. (2009). Alternative prior assumptions for improving the performance of naïve Bayesian classifiers. Data Mining and Knowledge Discovery, 18(2), 183-213.

Wong, T. T. & Chang, L. H. (2011). Individual attribute prior setting methods for naïve Bayesian classifiers. Pattern Recognition, 44(5), 1041–1047.

Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. Proceedings of the Third IEEE International Conference on Data Mining, 435-442.

Zheng, Z., Wu, X., & Srihari, R. (2004). Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter, 6(1), 80-89.
Full-Text Usage Permissions
  • On-campus browsing/printing of the electronic full text authorized, open from 2021-05-26.
  • Off-campus browsing/printing of the electronic full text authorized, open from 2023-05-25.

