System ID U0026-2408201510201800
Title (Chinese) 以主題偵測與追蹤建置階層式知識檢索方法
Title (English) Hierarchical Knowledge Retrieval Based On Topic Detection and Tracking
Institution National Cheng Kung University
Department (Chinese) 工業與資訊管理學系碩士在職專班
Department (English) Department of Industrial and Information Management (on the job class)
Academic year 103
Semester 2
Publication year 104
Graduate student (Chinese) 吳克松
Graduate student (English) Ko-Sung Wu
Student ID R37021139
Degree Master's
Language Chinese
Pages 61
Oral examination committee Advisor: 王惠嘉
Chinese keywords 文字探勘  主題偵測與追蹤  知識檢索  特徵選取  文件分群
English keywords Text Mining  Topic Detection and Tracking  Knowledge Retrieval  Feature Selection  Document Clustering
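Chinese abstract Knowledge is an important asset of an enterprise. With the rapid development of the Internet and of information hardware, the non-numeric knowledge stored in systems keeps growing in both volume and complexity. As a result, users who run traditional keyword queries do find matching records, but there are often too many results to locate quickly the information they actually need. In response to this information overload and ineffective retrieval, concepts of topical relevance have been proposed and applied, where relevance means a degree of match between the query terms and the text of a document. Examining relevance from the perspective of topics better satisfies users' retrieval needs, but most such approaches analyze the full text, ignore the importance of specific parts of a document, and yield topics that are mostly single words with no relationships among them.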
English abstract As digitized documents grow in volume and complexity, finding the desired information and related topics accurately is becoming both more critical and more difficult. This study proposes a hierarchical knowledge retrieval approach based on topic detection and tracking, which retrieves relevant information from large volumes of documents and presents the main topics to users. In data preprocessing, part-of-speech tags are combined with bigrams to obtain meaningful compound terms. Unlike other feature selection practices, the method weights terms according to the document field in which they appear. It then calculates the similarity between documents and generates hierarchically related topics through Single-Pass clustering and agglomerative hierarchical clustering (AHC). The system is evaluated against a full-text search system on an intranet, and the results indicate that the approach improves both the precision rate and the F-measure, which helps raise the efficiency of knowledge retrieval.
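The field-weighted similarity and Single-Pass clustering steps named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the field names (`title`, `body`), the weights in `FIELD_WEIGHTS`, the similarity threshold, and the summed-vector centroid update are all assumptions chosen for illustration, and the AHC stage that would build the hierarchy on top of these clusters is omitted.

```python
import math
from collections import Counter

# Hypothetical field weights (assumption): terms in the title count more than body terms.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_vector(doc):
    """Build a term vector in which each occurrence is scaled by its field weight."""
    vec = Counter()
    for field, terms in doc.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in terms:
            vec[term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.5):
    """Assign each document, in arrival order, to the most similar existing
    cluster if the similarity reaches the threshold; otherwise open a new cluster."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indexes]}
    for index, doc in enumerate(docs):
        vec = weighted_vector(doc)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(index)
            # Summing vectors is enough here because cosine ignores vector length.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [index]})
    return [cluster["members"] for cluster in clusters]


# Toy documents, already tokenized per field (made-up data).
docs = [
    {"title": ["server", "backup"], "body": ["disk", "backup", "schedule"]},
    {"title": ["backup", "restore"], "body": ["server", "disk"]},
    {"title": ["vacation", "policy"], "body": ["leave", "request", "form"]},
]
print(single_pass(docs, threshold=0.5))  # [[0, 1], [2]]
```

Because Single-Pass clustering processes documents in arrival order, the result depends on that order and on the threshold. Raising the threshold to 0.9 in this toy example leaves every document in its own cluster.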
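Table of contents Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Research Scope and Limitations
1.4 Research Process
1.5 Thesis Organization
Chapter 2 Literature Review
2.1 Clustering Methods
2.1.1 Hierarchical Clustering Algorithms
2.1.2 Partitional Clustering Algorithms
2.1.3 Density-Based Clustering
2.1.4 Grid-Based Clustering
2.1.5 Model-Based Clustering
2.2 Information Retrieval (IR)
2.3 Feature Selection
2.4 Topic Detection and Tracking (TDT)
2.4.1 Related Tasks
2.4.2 Evaluation Methods
2.4.3 Topic Detection Task
2.4.4 Topic Tracking Task
2.5 Summary
Chapter 3 Research Method
3.1 Research Framework
3.2 Data Preprocessing Module
3.3 Feature Selection Module
3.4 Document Clustering Module
3.4.1 Computing Document Similarity
3.4.2 Document Clustering
3.5 Document Classification Module
3.6 Topic Detection Module
Chapter 4 System Implementation and Validation
4.1 System Implementation Design
4.1.1 Data Collection
4.1.2 Data Preprocessing
4.1.3 Document Clustering
4.1.4 Topic Retrieval
4.2 Experimental Method
4.2.1 Data Sources
4.2.2 Baseline for Comparison
4.2.3 Evaluation Metrics
4.2.4 Experimental Design
4.3 Experimental Results and Analysis
4.4 Sample System Screens
Chapter 5 Conclusions and Future Research Directions
5.1 Research Results
5.2 Future Research Directions
References
Appendix 1 Part-of-Speech Tag Reference Table
Appendix 2 Stop Word List (Stoplist)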
  • Authorized for on-campus browsing and printing of the electronic full text; publicly available from 2020-08-31.
