System ID: U0026-0812200910200468
Title (Chinese): 以主成分分析應用在決策樹名目屬性之二元分割上 (Applying Principal Component Analysis to the Binary Partitioning of Nominal Attributes in Decision Trees)
Title (English): none
University: National Cheng Kung University (成功大學)
Department (Chinese): 資訊管理研究所
Department (English): Institute of Information Management
Academic year: 90
Semester: 2
Year of publication: 91
Author (Chinese): 徐芳玲
Author (English): Feng-lin Hsu
Student ID: r7689402
Degree: Master
Language: Chinese
Number of pages: 59
Committee: Advisor - 葉榮懋
Committee member - 林清河
Committee member - 耿伯文
Keywords (Chinese): 二元分割 (binary partition), 主成分分析 (principal component analysis), 決策樹 (decision tree), 不純度 (impurity)
Keywords (English): factorial analysis, binary partition, decision tree, entropy
Subject classification:
Abstract (Chinese): With the rapid advance of information technology, data are now stored and processed on a scale far beyond that of the past. How to extract useful information from huge volumes of data and provide it to decision makers has long been a central concern of data mining.
Because decision trees are computationally simple and yield clear rules, they are among the most widely used classification techniques in data mining. However, when the data volume is large and an attribute has many distinct values, the tree develops numerous, overly complex branches and processing efficiency suffers considerably. Methods for simplifying decision trees have therefore been proposed continually to improve their performance on large databases.
The simplification approach adopted in this study is to partition the values of each nominal attribute into two groups, using as the partitioning criterion the component scores that principal component analysis assigns to the attribute values, so as to reduce excessive and unnecessary branches. Principal component analysis is adopted mainly because a small number of latent variables or components can effectively represent the structure of many correlated variables; in addition, it forms linear combinations of a few variables so that the variance of the resulting components is maximized, making the data exhibit the greatest differences along those components. That is, this study takes the principal component that accounts for most of the variance and uses its standardized, variance-maximizing component scores as the criterion for binary-partitioning attribute values, in the hope of reducing the number of attribute values and thereby simplifying and improving the decision tree. Subsequent work verifies whether this binary-partition scheme gives satisfactory results in execution efficiency, impurity, and accuracy compared with other similar binary-partition schemes, and seeks suitable cases for practical application.
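To make the partitioning idea concrete, the following is a minimal Python sketch of how such a component-score split could be realized: each value of a nominal attribute is summarized by its class-distribution profile, the profiles are projected onto the first principal component, and the standardized scores decide which of two groups each value joins; a small entropy function illustrates the impurity measure named in the keywords. The function names and the use of scikit-learn's PCA are illustrative assumptions, not the implementation used in the thesis.

```python
# A minimal sketch, assuming scikit-learn and illustrative names
# (pca_binary_partition, entropy); not the thesis's own code.
import numpy as np
from sklearn.decomposition import PCA


def entropy(labels):
    """Entropy impurity of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def pca_binary_partition(attr, labels):
    """Split the distinct values of a nominal attribute into two groups using
    standardized first-principal-component scores of their class-distribution
    profiles, as a stand-in for the component-score criterion described above."""
    values = np.unique(attr)
    classes = np.unique(labels)
    # One row per attribute value: relative frequency of each class for that value.
    profiles = np.array([[np.mean(labels[attr == v] == c) for c in classes]
                         for v in values])
    scores = PCA(n_components=1).fit_transform(profiles).ravel()
    scores = (scores - scores.mean()) / (scores.std() + 1e-12)  # standardize
    # Cut at zero: one simple rule for turning scores into a binary partition.
    return set(values[scores <= 0]), set(values[scores > 0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attr = rng.choice(list("abcdefgh"), size=400)   # nominal attribute with 8 values
    labels = (np.isin(attr, list("abcd")) ^ (rng.random(400) < 0.15)).astype(int)

    left, right = pca_binary_partition(attr, labels)
    in_left = np.isin(attr, list(left))
    # Weighted impurity of the two branches produced by the binary partition.
    split_impurity = (in_left.mean() * entropy(labels[in_left])
                      + (~in_left).mean() * entropy(labels[~in_left]))
    print("partition:", sorted(left), "|", sorted(right))
    print("parent impurity:", round(entropy(labels), 3),
          "split impurity:", round(split_impurity, 3))
```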


Abstract (English): none
Table of Contents: Chapter 1 Introduction 1
Section 1 Research Background and Motivation 1
Section 2 Research Objectives 2
Section 3 Research Scope and Limitations 2
Section 4 Research Process and Steps 4
Chapter 2 Literature Review 5
Section 1 Data Mining 5
2.1.1. Application Scope of Data Mining Techniques 5
2.1.2. Decision Trees 6
2.1.3. Advantages of Decision Trees 8
2.1.4. Challenges Faced by Decision Trees 9
2.1.5. Decision Tree Complexity 11
Section 2 Decision Tree Simplification 12
2.2.1. Categories of Decision Tree Simplification 12
2.2.2. Binary Trees 16
2.2.3. Attribute Partitioning Theory 18
Chapter 3 Research Method 20
Section 1 Decision Trees 20
3.1.1. Decision Tree Methods 20
3.1.2. Feature Evaluation Criteria 21
Section 2 Attribute Partitioning 23
3.2.1. Eigenvalues and Eigenvectors 23
3.2.2. Principal Component Analysis 25
3.2.3. Binary Partitioning Scheme 27
3.2.4. Decision Tree Construction Steps and Procedure 29
Section 3 Evaluation Methods 32
Chapter 4 Empirical Study 34
Section 1 Overview of the Experimental Data 34
4.1.1. Dermatology Database 35
4.1.2. Lymphography Database 35
4.1.3. Mushrooms Database 35
4.1.4. The Insurance Company Benchmark 36
4.1.5. Data Preprocessing 36
Section 2 System Design 38
4.2.1. Testing Procedure 38
4.2.2. Experimental Control Settings 38
Section 3 Experimental Results 40
4.3.1. Dermatology Database 40
4.3.2. Lymphography Database 42
4.3.3. Mushrooms Database 43
4.3.4. The Insurance Company Benchmark (COIL 2000) 46
Section 4 Analysis of Results 48
4.4.1. Comparison of Unpartitioned and Binary-Partitioned Decision Trees 48
4.4.2. Comparison among Binary-Partitioned Decision Trees 50
Chapter 5 Conclusions and Suggestions 51
Section 1 Conclusions 51
Section 2 Suggestions 52
References 53
References: [1] Aha, D. W. & Breslow, L. A. Comparing simplification procedures for decision trees. Artificial Intelligence and Statistics, 5, 199-206, 1998.
[2] Allwein, E. L., Schapire, R. E. & Singer, Y. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113-141, 2000.
[3] Auer, P., Holte, R. C. & Maass, W. Theory and applications of agnostic PAC-learning with small decision trees. Proceedings of the twelfth international conference on machine learning, 21-29, 1995.
[4] Blake, C. L. & Merz, C. J. UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.
[5] Brodley, C. E. & Utgoff, P. E. Multivariate decision trees. Machine Learning, 19, 45-77, 1995.
[6] Cherkauer, K. J. & Shavlik, J. W. Growing simpler decision trees to facilitate knowledge discovery. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 315-318, 1996.
[7] Coppersmith, D. & Hong, S. J. Partitioning nominal attributes in decision trees. Data Mining and Knowledge Discovery, 3, 197-217, 1999.
[8] Cox, L. A., Qiu, Y. & Kuehner, W. Heuristic least-cost computation of discrete classification functions with uncertain argument values. Annals of Operations Research, 21(1), 1-30, 1989.
[9] Dechter, R. Decomposing a relation into a tree of binary relations. Journal of Computer and System Sciences, 41(1), 2-24, 1990.
[10] Deogun, J. S., Raghavan, V. V., Sarkar, A., & Sever, H. Data mining: research trends, challenges, and applications. Dordrecht: Kluwer Academic Publishers, 1997.
[11] Esposito, F., Malerba, D. & Semeraro, G. A further study of pruning methods in decision tree induction. Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, 211-218, 1995.
[12] Fayyad, U. M. & Irani, K. B. Multi-interval discretization of continuous attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1022-1027, 1993.
[13] Frederickson, G. N. Optimal algorithms for tree partitioning. Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, 168-177, 1991.
[14] Ganti, V., Gehrke, J. & Ramakrishnan, R. Mining very large databases. IEEE Computer, 32(8), 38-44, 1999.
[15] Gehrke, J., Ganti, V., Ramakrishnan, R. & Loh, W. Y. BOAT: optimistic decision tree construction. Proceedings of the 1999 ACM SIGMOD Conference, 169-180, 1999.
[16] Growe, G. A. Comparing algorithms and clustering data: components of the data mining process. Computer Science Department, Grand Valley State University, 1999.
[17] Heath, D. A geometric framework for machine learning. PhD thesis, Johns Hopkins, 1992.
[18] Ho, K. M. & Scott, P. D. Binary decision trees. Technical report CSM-313, Department of Computer Sciences, University of Essex, 1999.
[19] Hoeffgen, K. U., Simon, H. U. & Horn, K. S. Robust trainability of single neurons. Journal of Computer and System Sciences, 50(1), 114-125, 1995.
[20] Hyafil, L. & Rivest, R. L. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1), 15-17, 1976.
[21] John, G. Robust Decision Trees: Removing Outliers in Databases. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 174–179, 1995.
[22] Krishnaswamy, R., Alijani, G. S. & Su, S. C. On constructing binary space partitioning trees. New York: ACM Press, 230-235, 1990.
[23] Lim, T. S., Loh, W. Y. & Shih, Y. S. An empirical comparison of decision tree and other classification methods. Technical Report 979, Department of Statistics, University of Wisconsin, 1997.
[24] Marshall, R. J. Generation of Boolean classification rules. Proceedings of Computational Statistics 2000, 2000.
[25] Martin, J. K. & Hirschberg, D. S. The time complexity of decision tree induction. Department of Information and Computer Science, University of California, 1995.
[26] Matousek, J. Efficient partition trees. Discrete & Computational Geometry, 8, 315-334, 1992.
[27] Merckt, T. V. D. & Quinlan, J. R. Two-threshold splits of continuous attributes in decision trees. The Basser Department of Computer Science, University of Sydney, 1996.
[28] Miller, D., Rao, A., Rose, K. & Gersho, A. A global optimization technique for statistical classifier design. IEEE Transactions on Signal Processing, 4, 3108-3121, 1996.
[29] Mitchell, T. M. Generalization as search. Artificial Intelligence, 18, 203-226, 1982.
[30] Murphy, O. J. & McCraw, R. L. Designing storage efficient decision trees. IEEE Transactions on Computers, 40(3), 315-319, 1991.
[31] Murthy, S. K. Automatic construction of decision trees from data: a multi-disciplinary survey. Data Mining and Knowledge Discovery, 2, 345-389, 1998.
[32] Murthy, S. K. On growing better decision trees from data. Department of Computer Science, Johns Hopkins University, 1995.
[33] Nagaraj, S. V. Optimal binary search trees. Theoretical Computer Science, 188, 1-44, 1996.
[34] Naumov, G. E. NP-completeness of problems of construction of optimal decision trees. Soviet Physics, 36(4), 270-271, 1991.
[35] Oates, T. & Jensen, D. The effects of training set size on decision tree complexity. Proceedings of the Fourteenth International Conference on Machine Learning, 1997.
[36] Pagallo, G. & Haussler, D. Boolean feature discovery in empirical learning. Machine Learning, 5, 71–100, 1990.
[37] Peshkin, L. Dimensionality reduction – a primer. Department of Computer Science, Brown University, 1995.
[38] Putten, P. V. & Someren, M. V. CoIL Challenge 2000: The Insurance Company Case. Sentient Machine Research, Amsterdam; also Leiden Institute of Advanced Computer Science Technical Report 2000-09, June 22, 2000.
[39] Quinlan, J. R. C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann, 1993.
[40] Quinlan, J. R. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77-90, 1996.
[41] Ragavan, H. & Rendell, L. Lookahead feature construction for learning hard concepts. Proceedings of the Tenth International Conference on Machine Learning, 252-259, 1993.
[42] Bech-Larsen, T. & Nielsen, N. A. A comparison of five elicitation techniques for elicitation of attributes of low involvement products. Journal of Economic Psychology, 20, 315-341, 1999.
[43] Utgoff, P. E. & Clouse, J. A. A Kolmogorov-Smirnoff metric for decision tree induction. Technical Report 96-3, Department of Computer Science, University of Massachusetts, 1996.
[44] Utgoff, P. E. Decision tree induction based on efficient tree restructuring. Technical Report 95-18, Department of Computer Science, University of Massachusetts, 1996.
[45] Wuuthrich, B. & Karlapalem, K. Data mining opportunities in very large object oriented databases. ACM-SIGMOD workshop on Research Issues on Data Mining and Knowledge Discovery, 1996.
[46] Zheng, Z. Constructing nominal X-of-N attributes. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1064–1070, 1995.
[47] Zimenkov, A. Tree classifiers. Department of Information Technology, Lappeenranta University of Technology, 2000.
Full-text usage permission
  • On-campus browsing/printing of the electronic full text authorized; available from 2003-07-12.
  • Off-campus browsing/printing of the electronic full text authorized; available from 2005-07-12.

