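System ID U0026-0306202013235900
Title (Chinese) 基於樣本過濾之混合型分類演算法
Title (English) Hybridization of Basic Classification Algorithms Based on Instance Filtering
Institution National Cheng Kung University
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 108 (ROC calendar)
Semester 2
Publication year 109 (ROC calendar, i.e. 2020)
Author (Chinese) 楊乃玉
Author (English) Nai-Yu Yang
Student ID R78031012
Degree Doctoral
Language English
Pages 68
Oral defense committee Advisor: 翁慈宗
Keywords (Chinese) 決策樹; 混合型分類方法; 樣本過濾; k最鄰近法; 簡易貝氏分類器; 支持向量機
Keywords (English) Decision tree induction; hybrid classification; instance filtering; k-nearest neighbors; naïve Bayesian classifier; support vector machine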
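Abstract (translated from Chinese) In an era in which large volumes of data are generated rapidly, data mining algorithms are widely applied because they can uncover valuable information and knowledge hidden in data. Among them, a basic classification algorithm learns from known data used as training data and builds a model to predict the class of new data. The quality of the training data therefore strongly affects the predictions of a classification model. If the training data contain noise or redundant attributes, overfitting may result and degrade the performance of the basic classification algorithm. Consequently, a prediction model built by a single basic classification algorithm tends to be unstable and limited. Later studies proposed ensemble algorithms to improve this situation: they apply one or more basic classification algorithms to generate a set of models and then predict the class of new data by majority vote, which raises the classification accuracy and stability of basic classification algorithms. However, ensemble algorithms construct diverse prediction models, which makes the classification results hard to interpret and raises the training cost. In addition, some studies have proposed hybrid classification methods that combine different basic classification algorithms for data preprocessing, deleting redundant attributes or filtering instances. These methods mainly treat misclassified data as noise and remove them from the training data to improve classification performance. The excluded data, however, may still carry important information that helps classification, so information is lost. This study therefore proposes hybrid classification algorithms based on instance filtering. By combining classification algorithms, they not only perform data preprocessing but also build multiple classification models; when new data are predicted, only a single model needs to be selected for classification, which makes the classification results easier to interpret. Moreover, different classification algorithms suit different data types. This study thus selects common basic classification algorithms to combine: the naïve Bayesian classifier and decision tree, which suit discrete data, and k-nearest neighbors and support vector machine, which suit continuous data, each tested on 20 data sets. The experimental results show that the classification accuracy of the proposed hybrid classification methods based on instance filtering is significantly better than that of the basic classification algorithms and of a previously proposed hybrid classification algorithm.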
Abstract (English) The quality of training data has a considerable influence on the learning results of a basic classification algorithm, and a model induced by a single basic classification algorithm generally exhibits a high degree of instability and limitations. Ensemble algorithms, which produce a set of models by employing one or more basic classification algorithms, have been proposed to resolve these deficiencies. However, a prediction made by the majority vote of a set of models is difficult to interpret, and the training cost of the models is relatively high. A hybrid classification algorithm integrates basic algorithms for data preprocessing and class prediction. Misclassified instances are generally considered noise and are thus excluded from learning, but the excluded data may contain information that is useful for classifying some new instances. This study proposes hybrid classification algorithms based on instance filtering, each of which is a combination of two basic algorithms: one filters instances, and the other builds three classification models. Every new instance is classified by only one of the three models, so every prediction remains easy to interpret. The naïve Bayesian classifier and decision tree induction are the two basic algorithms that compose the hybrid algorithms for discrete data, and the hybrid algorithms for continuous data are composed of k-nearest neighbors and support vector machine. The hybrid classification algorithms are tested on 20 data sets to demonstrate that they can outperform basic algorithms and the hybrid algorithm proposed by a previous study.
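The abstract does not state which three models are built or how one of them is chosen for a new instance, so the following is only a minimal Python sketch of the general idea under loudly stated assumptions. A leave-one-out 1-nearest-neighbor pass plays the filtering role; a nearest-centroid learner stands in for the model-building algorithm (it is not the SVM, decision tree, or naïve Bayesian classifier used in the thesis); the three models are assumed to be trained on all instances, on the correctly classified instances, and on the misclassified instances; and the routing rule in `predict_hybrid` is hypothetical. All function names are illustrative.

```python
from math import dist


def nearest_index(x, points, skip=None):
    """Index of the point closest to x, optionally ignoring one index."""
    candidates = [i for i in range(len(points)) if i != skip]
    return min(candidates, key=lambda i: dist(x, points[i]))


def fit_centroids(X, y):
    """Stand-in base learner (assumption): one mean vector per class."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: tuple(sum(col) / len(rows) for col in zip(*rows))
            for c, rows in groups.items()}


def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda c: dist(x, model[c]))


def fit_subset(pairs):
    """Fit on a subset, or return None if it holds fewer than two classes."""
    if len({c for _, c in pairs}) < 2:
        return None
    xs, cs = zip(*pairs)
    return fit_centroids(xs, cs)


def fit_hybrid(X, y):
    # Filtering step: leave-one-out 1-NN flags each training instance as
    # correctly classified (True) or misclassified (False).
    flags = [y[nearest_index(X[i], X, skip=i)] == y[i] for i in range(len(X))]
    kept = [(x, c) for x, c, f in zip(X, y, flags) if f]
    filtered = [(x, c) for x, c, f in zip(X, y, flags) if not f]
    # Three models (assumed split): all data, kept data, filtered-out data.
    models = {"all": fit_centroids(X, y),
              "kept": fit_subset(kept),
              "filtered": fit_subset(filtered)}
    return {"X": X, "flags": flags, "models": models}


def predict_hybrid(h, x):
    # Hypothetical selection rule: use the model that matches the group of
    # the nearest training instance; fall back to the full-data model when
    # that subset model could not be built.
    name = "kept" if h["flags"][nearest_index(x, h["X"])] else "filtered"
    if h["models"][name] is None:
        name = "all"
    return name, predict_centroid(h["models"][name], x)


# Toy data: two clean clusters plus two instances with suspicious labels.
X_demo = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6),
          (3, 0), (8, 5)]
y_demo = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "a"]
demo = fit_hybrid(X_demo, y_demo)
print(predict_hybrid(demo, (0.2, 0.3)))  # ('kept', 'a')
print(predict_hybrid(demo, (3.2, 0.1)))  # ('filtered', 'b')
```

Each prediction comes from exactly one model, and the returned model name shows which one was used, which is the interpretability property the abstract emphasizes. The misclassified instances are not discarded; they train their own model, which reflects the stated motivation that excluded data may still carry useful information.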
Table of contents Abstract
Abstract (Chinese)
Acknowledgements
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research background and motivation
1.2 Research objective
1.3 Organization
Chapter 2 Literature review
2.1 Basic classification algorithms
2.1.1 Naïve Bayesian classifier
2.1.2 Decision tree induction
2.1.3 Support vector machine
2.1.4 k-nearest neighbors
2.2 Hybrid classification and ensemble classification
2.2.1 Hybrid classification algorithms based on feature selection
2.2.2 Hybrid classification algorithms based on instance filtering
2.2.3 Ensemble algorithm
2.3 Summary
Chapter 3 Research methodology
3.1 Research mechanism
3.2 Data preprocessing
3.3 Model induction
3.4 Models selection
3.4.1 Models selection for discrete data
3.4.2 Models selection for continuous data
3.5 Performance assessment
Chapter 4 Empirical studies for discrete data
4.1 Introduction of data sets
4.2 Performance comparison
4.3 The benefit of instance filtering
Chapter 5 Empirical studies for continuous data
5.1 Introduction of data sets
5.2 Performance comparison
5.3 The benefit of instance filtering
Chapter 6 Conclusions and directions for future work
6.1 Conclusions
6.2 Directions for future research
References

References Abbasi, Z. and Rahmani, M. (2019). An instance selection algorithm based on ReliefF. International Journal on Artificial Intelligence Tools. 28, 1-14.
Abpeykar, S., Ghatee, M., and Zare, H. (2019). Ensemble decision forest of RBF networks via hybrid feature clustering approach for high-dimensional data classification. Computational Statistics and Data Analysis. 131, 12-36.
Aburomman, A. A. and Reaz, M. B. I. (2017). A survey of intrusion detection systems based on ensemble and hybrid classifiers. Computers and Security. 65, 135-152.
Amin, M. S., Chiam, Y. K., and Varathan, K. D. (2019). Identification of significant features and data mining techniques in predicting heart disease. Telematics and Informatics. 36, 82-93.
Andreola, R. and Haertel, V. (2010). Classification of hyperspectral images with support vector machines. Boletim de Ciências Geodésicas. 16, 210-231.
Basicevic, I., Kukolj, D., Ocovaj, S., Cmiljanovic, G., and Fimic, N. (2018). A fast channel change technique. IEEE Transactions on Consumer Electronics. 64, 418-423.
Breiman, L. (1996). Bagging predictors. Machine Learning. 24, 123-140.
Breiman, L. (2001). Random forests. Machine Learning. 45, 5-32.
Breiman, L., Friedman, J. H., Olshen, R., and Stone, C. J. (1984). Classification and regression trees. Chapman and Hall, New York.
Chang, J. H., Lai, C. F., Huang, Y. M., and Chao, H. C. (2010). 3PRS: a personalized popular program recommendation system for digital TV for P2P social networks. Multimedia Tools and Applications. 47, 31-48.
Chen, K., Kurgan, L., and Rahbari, M. (2007). Prediction of protein crystallization using collocation of amino acid pairs. Biochemical and Biophysical Research Communications. 355, 764-769.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning. 20, 273-297.
Damerau, F. J., Zhang, T., Weiss, S. M., and Indurkhya, N. (2004). Text categorization for a comprehensive time-dependent benchmark. Information Processing and Management. 40, 209-221.
De Caigny, A., Coussement, K., and De Bock, K. W. (2018). A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees. European Journal of Operational Research. 269, 760-772.
Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning. 29, 103-130.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Ebadati, O. M. E. and Ahmadzadeh, F. (2019). Classification spam email with elimination of unsuitable features with hybrid of GA-naive Bayes. Journal of Information & Knowledge Management. 18, 1950008.
Farid, D. M., Zhang, L., Rahman, C. M., Hossain, M. A., and Strachan, R. (2014). Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks. Expert Systems with Applications. 41, 1937-1946.
Florez-Lopez, R. and Ramon-Jeronimo, J. M. (2015). Enhancing accuracy and interpretability of ensemble strategies in credit risk assessment. A correlated-adjusted decision forest proposal. Expert Systems with Applications. 42, 5737-5753.
García, S., Luengo, J., and Herrera, F. (2016). Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowledge-Based Systems. 98, 1-29.
González Perea, R., Camacho Poyato, E., Montesinos, P., and Rodriguez Diaz, J. A. (2019). Prediction of irrigation event occurrence at farm level using optimal decision trees. Computers and Electronics in Agriculture. 157, 173-180.
Govindarajan, M. and Chandrasekaran, R. M. (2010). Evaluation of k-Nearest Neighbor classifier performance for direct marketing. Expert Systems with Applications. 37, 253-258.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations. 11, 11-18.
Jamjoom, M. and Hindi, K. E. (2016). Partial instance reduction for noise elimination. Pattern Recognition Letters. 74, 30-37.
Jena, R. K. (2018). Predicting students’ learning style using learning analytics: a case study of business management students from India. Behaviour and Information Technology. 37, 978-992.
Kearns, M. and Valiant, L. (1994). Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the Association for Computing Machinery. 41, 67-95.
Kotsiantis, S. (2014). A hybrid decision tree classifier. Journal of Intelligent & Fuzzy Systems. 26, 327-336.
Krawczyk, B., Triguero, I., García, S., Woźniak, M., and Herrera, F. (2019). Instance reduction for one-class classification. Knowledge and Information Systems. 59, 601-628.
Langley, P. and Sage, S. (1994). Induction of selective Bayesian classifiers. Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence. 399-406.
Melgani, F. and Bruzzone, L. (2004). Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing. 42, 1778-1790.
Miranda, E., Irwansyah, E., Amelga, A. Y., Maribondang, M. M., and Salim, M. (2016). Detection of cardiovascular disease risk’s level for adults using naive Bayes classifier. Healthcare Informatics Research. 22, 196-205.
Mohanty, M., Sahoo, S., Biswal, P., and Sabut, S. (2018). Efficient classification of ventricular arrhythmias using feature selection and C4.5 classifier. Biomedical Signal Processing and Control. 44, 200-208.
Noor, F., Shah, A., Akram, M. U., and Khan, S. A. (2019). Deployment of social nets in multilayer model to identify key individuals using majority voting. Knowledge and Information Systems. 58, 113-137.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning. 1, 81-106.
Quinlan, J. R. (1993). C4.5: programs for machine learning. Morgan Kaufmann, San Mateo.
Ramírez-Gallego, S., Krawczyk, B., García, S., and Woźniak, M. (2017). A survey on data preprocessing for data stream mining: Current status and future directions. Neurocomputing. 239, 39-57.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning. 5, 197-227.
Shah, A. A., Ehsan, M. K., Ishaq, K., Ali, Z., and Farooq, M. S. (2018). An efficient hybrid classifier model for anomaly intrusion detection system. International Journal of Computer Science and Network Security. 18, 127-136.
Singh, N. and Singh, P. (2019). A novel bagged naive Bayes-decision tree approach for multi-class classification problems. Journal of Intelligent & Fuzzy Systems. 36, 2261-2271.
Tan, P. N., Steinbach, M., and Kumar, V. (2006). Introduction to Data Mining. Addison Wesley, Massachusetts.
Timus, O. and Bolat, E. D. (2017). k-NN-based classification of sleep apnea types using ECG. Turkish Journal of Electrical Engineering and Computer Sciences. 25, 3008-3023.
Trovato, G., Chrupala, G., and Takanishi, A. (2016). Application of the naive Bayes classifier for representation and use of heterogeneous and incomplete knowledge in social robotics. Robotics. 5, 6-26.
Turhan, B. and Bener, A. (2009). Analysis of naive Bayes' assumptions on software fault data: an empirical study. Data and Knowledge Engineering. 68, 278-290.
Vural, M. and Gok, M. (2017). Criminal prediction using Naive Bayes theory. Neural Computing and Applications. 28, 2581-2592.
Wiharto, W., Kusnanto, H., and Herianto, H. (2016). Interpretation of clinical data based on C4.5 algorithm for the diagnosis of coronary heart disease. Healthcare Informatics Research. 22, 186-195.
Wijaya, A. and Bisri, A. (2016). Hybrid decision tree and logistic regression classifier for email spam detection. Proceedings of the 8th International Conference on Information Technology and Electrical Engineering. 1-4.
Wong, T. T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition. 48, 2839-2846.
Wong, T. T. (2017). Parametric methods for comparing the performance of two classification algorithms evaluated by k-fold cross validation on multiple data sets. Pattern Recognition. 65, 97-107.
Zareapoor, M., Shamsolmoali, P., Jain, D. K., Wang, H., and Yang, J. (2018). Kernelized support vector machine with deep learning: An efficient approach for extreme multiclass dataset. Pattern Recognition Letters. 115, 4-13.
Zhang, H., He, H., and Zhang, W. (2018). Classifier selection and clustering with fuzzy assignment in ensemble model for credit scoring. Neurocomputing. 316, 210-221.
Zhang, H., Yu, P., Zhang, T. G., Kang, Y. L., Zhao, X., Li, Y. Y., He, J. H., and Zhang, J. (2015). In silico prediction of drug-induced myelotoxicity by using Naïve Bayes method. Molecular Diversity. 19, 945-953.
Zhang, L., Hu, H., and Zhang, D. (2015). A credit risk assessment model based on SVM for small and medium enterprises in supply chain finance. Financial Innovation. 1, 1-21.
Zhang, M. L., Pena, J. M., and Robles, V. (2009). Feature selection for multi-label naive Bayes classification. Information Sciences. 179, 3218-3229.
Zhang, S., Li, X., Zong, M., Zhu, X., and Wang, R. (2018). Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems. 29, 1774-1785.
Zhang, X. and Mahadevan, S. (2019). Ensemble machine learning models for aviation incident risk prediction. Decision Support Systems. 116, 48-63.
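  • Authorized for on-campus browsing and printing of the electronic full text; publicly available from 2023-06-08.
  • Authorized for off-campus browsing and printing of the electronic full text; publicly available from 2023-06-08.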