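System ID U0026-1107201315395800
Title (Chinese) 利用混合式中文特徵選取法於知識文件分類
Title (English) A Hybrid Chinese Feature Selection Method for Knowledge Document Classification
Institution National Cheng Kung University
Department (Chinese) 工業與資訊管理學系專班
Department (English) Department of Industrial and Information Management (on the job class)
Academic year 101 (ROC calendar; 2012-2013)
Semester 2
Year of publication 102 (ROC calendar; 2013)
Student (Chinese) 郭冠忠
Student (English) Kuan-Chung Kuo
Student ID r37001333
Degree Master's
Language Chinese
Number of pages 45
Oral defense committee Advisor: 王惠嘉
Committee member: 陳偉凡
Committee member: 王維聰
Committee member: 劉任修
Keywords (Chinese) 中文分詞; 特徵選取; SVM; 混合式分類
Keywords (English) Chinese segmentation; feature selection; SVM; hybrid classification
Subject classification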
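Chinese abstract Most enterprises maintain their own knowledge management systems for training and for passing on experience, and collecting knowledge documents related to their industry is one of the key tasks in handling explicit knowledge. As information technology develops, the volume of information grows rapidly and becomes easier to obtain, so classifying knowledge documents has become an important part of how enterprises manage information and knowledge. Before knowledge documents can be classified, the text must be pre-processed so that features can be extracted, and the quality of the selected features affects classification accuracy. The methods and difficulty of text pre-processing vary by language. Chinese is harder to segment because there are no spaces between words. There are currently two main approaches: one relies on a dictionary, the other is purely statistical.
Dictionary-based Chinese word segmentation systems suffer from limited vocabulary coverage: new words keep appearing, and the corpora behind different segmentation systems do not contain the same words. This study therefore proposes a hybrid Chinese feature selection method whose final feature set combines the feature subsets obtained from the dictionary-based Stanford Word Segmenter and the CKIP (Chinese Knowledge Information Processing) Chinese word segmentation system with the feature subset obtained from the purely statistical n-grams method. Each feature subset is obtained through TF-ICF (Term Frequency-Inverse Category Frequency) weighting, and the result is validated with an SVM classifier.
The experiments show that the proposed method effectively improves document classification accuracy compared with using a single Chinese word segmentation method alone. TF-ICF, which takes differences between categories into account, also performs better than TF and TF-IDF (Term Frequency-Inverse Document Frequency). The proposed method can help enterprises classify Chinese knowledge documents more accurately.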
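The TF-ICF weighting named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the formula tf × log(|C| / cf), where cf is the number of categories containing the term, is one common definition and is assumed here. The function names `tf_icf` and `top_features` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def tf_icf(docs_by_category):
    """Score terms per category with an assumed TF-ICF: tf * log(|C| / cf).

    docs_by_category maps a category name to a list of tokenized documents.
    """
    # Term frequency: total occurrences of each term within a category.
    tf = {cat: Counter(tok for doc in docs for tok in doc)
          for cat, docs in docs_by_category.items()}
    # Category frequency: number of categories in which a term occurs.
    cf = Counter(term for counts in tf.values() for term in counts)
    n_categories = len(docs_by_category)
    return {cat: {term: count * math.log(n_categories / cf[term])
                  for term, count in counts.items()}
            for cat, counts in tf.items()}


def top_features(scores, k):
    """Keep the k highest-scoring terms per category (ties broken by term)."""
    return {cat: [t for t, _ in sorted(s.items(),
                                       key=lambda kv: (-kv[1], kv[0]))[:k]]
            for cat, s in scores.items()}


# Toy corpus (hypothetical): "company" occurs in every category.
docs = {
    "finance": [["stock", "market", "company"], ["stock", "invest"]],
    "sports": [["match", "player", "company"]],
}
scores = tf_icf(docs)
selected = top_features(scores, 1)
```

Under this definition a term that occurs in every category, such as "company" in the toy corpus, scores zero, which is how the weighting takes differences between categories into account.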
English abstract Enterprises maintain knowledge management systems for training employees, and industry knowledge documents are an important source of explicit knowledge. Classifying knowledge documents is therefore a significant task for enterprises today. Because the selected features affect classification accuracy, text pre-processing is necessary before knowledge documents are classified. Chinese sentences, however, are not easy to segment during text pre-processing, because there is no white space between Chinese terms. Two methods of Chinese segmentation are currently common: one is based on a dictionary, the other on statistics.
Unknown terms are a persistent problem for dictionary-based Chinese segmentation systems. A dictionary cannot cover all terms, because new terms are created continually. To address this problem, this study used two dictionary-based Chinese segmentation systems, the Stanford Chinese Word Segmenter and the CKIP segmentation system, together with one statistical method, n-grams, and calculated the TF-ICF (Term Frequency-Inverse Category Frequency) score of terms to select the final features, which were then classified and validated with an SVM classifier. The study found that the hybrid Chinese feature selection method achieves better classification accuracy than a method using a single Chinese segmentation system. TF-ICF also performs better than TF and TF-IDF. The hybrid Chinese feature selection method can improve the accuracy of Chinese knowledge document classification.
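Table of contents Chapter 1 Introduction
Section 1 Research Background and Motivation
Section 2 Research Objectives
Section 3 Research Process
Section 4 Research Scope and Limitations
Section 5 Thesis Outline
Chapter 2 Literature Review
Section 1 Information Retrieval
Section 2 Text Pre-processing
2.2.1 Chinese Word Segmentation
2.2.2 Part-of-Speech Tagging
2.2.3 Feature Selection
Section 3 Document Classification Methods
2.3.1 Support Vector Machine (SVM)
Section 4 Summary
Chapter 3 Research Method
Section 1 Research Framework
Section 2 Data Collection and Text Pre-processing
Section 3 Feature Selection on Training Data
Section 4 Classification and Validation Module
Section 5 Summary
Chapter 4 System Implementation and Validation
Section 1 System Environment
Section 2 Experimental Design
Section 3 Analysis of Experimental Results
Chapter 5 Conclusions and Future Research Directions
Section 1 Conclusions
Section 2 Future Research Directions
References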
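Full-text access rights
  • Authorized for on-campus browsing and printing of the electronic full text, publicly available from 2015-08-01.
  • Authorized for off-campus browsing and printing of the electronic full text, publicly available from 2015-08-01.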

