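System ID: U0026-0812200913555662
Title (Chinese): 利用增廣資訊的一個資訊熵值為基礎的階層式搜尋結果分群方法
Title (English): An Entropy-Based Hierarchical Search Result Clustering Method by Utilizing Augmented Information
Institution: National Cheng Kung University
Department (Chinese): 資訊工程學系碩博士班
Department (English): Institute of Computer Science and Information Engineering
Academic year: 95 (ROC calendar; 2006-2007)
Semester: 2
Publication year: 96 (ROC calendar; 2007)
Author (Chinese): 蕭新維
Author (English): Hsin-Wei Hsiao
E-mail: p7694103@mail.ncku.edu.tw
Student ID: p7694103
Degree: Master's
Language: English
Pages: 51
Oral defense committee: Advisor - 高宏宇
Committee member - 謝孫源
Committee member - 吳俊興
Committee member - 鄧維光
Keywords (Chinese): 增廣資訊, 資訊熵值, 片段內文, 分群, 搜尋引擎
Keywords (English): Clustering, Snippet, Entropy, Augmented Information, Search Engine
Subject classification: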
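Abstract (translated from Chinese): As search engine technology has advanced and the number of web pages has grown enormously, the results returned by search engines are often mixed and disordered. This is especially common for query keywords in which a single word may cover several topics. Techniques for clustering search results by topic have therefore been widely developed. Among traditional clustering methods, some researchers cluster according to the similarity between two or more documents, while others use machine-learning-based clustering that trains on a set of documents to derive clustering rules. However, the structure of general documents and the structure of web page content are not entirely the same, so it is uncertain whether techniques that cluster general documents well will work equally well on web pages.
Search engines can return the titles of hundreds to thousands of web pages, together with each page's snippet and URL. Almost all web page clustering techniques must derive further information from this returned content. Efficiency is also an important issue in search result clustering. In web page clustering we cannot analyze the full text of each document, as general document clustering does. If document clustering techniques were used for web page clustering, obtaining the final clusters would likely take a long time. For a real-time clustering system, an excessively long execution time is unacceptable. For this reason, more efficient methods must be developed to solve this problem.
In this thesis we propose several more efficient methods to solve this problem. We improve a previously proposed method. We use augmented information returned by search engines and integrate it with information entropy theory. We use these new methods to obtain better search results and to reduce execution time, and our experiments show that the proposed methods do improve the overall quality of the clustering results.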
Abstract (English): Because search engine technology has improved and the number of web pages has increased massively, the results returned by search engines are often mixed and disordered. The disorder is especially pronounced for queries with multiple topics. Techniques for clustering search results by topic have therefore been developed extensively. Among traditional clustering methods, some researchers cluster document sets using the similarity between two or more documents, while others use machine learning, training on a set of documents to derive clustering rules. However, web pages and general documents do not always share the same structure, so techniques that perform well on general document clustering cannot be assumed to perform equally well on web page clustering.
Search engines can return the titles, snippets, and URLs of several hundred to several thousand pages. Almost all search result clustering techniques must derive further information from the contents of these returned lists. Efficiency is also crucial for web page clustering. Unlike general document clustering, web page clustering cannot analyze the full content of every document to compute its cluster. Applying document clustering methods to web page clustering might take a long time to produce the clustered results, and long execution times are not acceptable for a real-time clustering system. For this reason, more efficient methods must be developed to address these issues.
In this paper we propose several more efficient methods that address these issues. We improve on a previously proposed technique. We utilize augmented information returned by search engines and integrate it with entropy calculation. We apply these new methods to obtain better clustered search results and to reduce execution time. Our experiments also indicate that the proposed methods produce clustered results of higher quality.
Table of contents:
Chinese Abstract
ABSTRACT
Acknowledgements
CONTENT
FIGURE LISTING
TABLE LISTING
1. INTRODUCTION
1.1 MOTIVATION
1.2 SUMMARIZATION OF OUR PROPOSED METHODS
1.3 PAPER SECTION DESCRIPTION
2. RELATED WORK
2.1 ONLINE SEARCH ENGINES WITH CLUSTERED RESULT
2.2 PREVIOUS TECHNOLOGY
2.3 CONCEPTS HIERARCHY
3. OVERALL DESCRIPTION OF PREVIOUS METHODS
3.1 PRE-REQUIREMENTS
3.2 DETAIL OF KRISHNA KUMMAMURU’S CLUSTERING METHOD
3.3 CONCEPTS HIERARCHY CONSTRUCTION
4. PROPOSED METHODS FOR CLUSTERING
4.1 OUR IMPROVEMENT
4.1.1 Different Weight for Terms Extracted from Titles and Snippets Respectively - M1
4.1.2 URL Information - M2
4.1.3 Integrate DisCover with Cluster Entropy - M3 and M4
4.1.4 Integration of Subtraction, Intersection and Cluster Entropy - M5
4.2 CHINESE SEARCH RESULT CLUSTERING
4.3 CONCEPT HIERARCHY MODIFICATION
5. SYSTEM DESCRIPTION
6. EXPERIMENT AND EVALUATION
6.1 TESTING DATA SET OF EXPERIMENTS
6.2 NEW EVALUATION MANNER FOR CLUSTERED RESULTS
6.3 EXPERIMENT RESULTS
6.3.1 Different Inputs for Clustering Method
6.3.2 Overall Comparison for Our Methods
6.3.3 Analysis of Clustered Results for Different Methods
6.3.4 Cluster Number Cutoff
6.3.5 Compare to Other Clustered Search Engines
6.3.6 Evaluation of Concepts Hierarchy
7. CONCLUSION AND FUTURE WORK
8. REFERENCES
References:
[1] Smola, A. J. and Schölkopf, B. A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report Series, NC2-TR-1998-030. October, 1998.
[2] G. H. Ball and D. J. Hall. A Clustering Technique for Summarizing Multivariate Data. Behavioral Science 1967, pages 153-155.
[3] Doug Beeferman and Adam Berger. Agglomerative Clustering of a Search Engine Query Log. In SIGKDD 2000, pages 407-416.
[4] Mo Chen, Jian-Tao Sun, Hua-Jun Zeng and Kwok-Yan Lam. A Practical System of Keyphrase Extraction for Web Pages. In CIKM 2005, pages 277-278.
[5] Lee-Feng Chien. PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval. In SIGIR 1997, pages 50-58.
[6] Paolo Ferragina and Antonio Gulli. A Personalized Search Engine Based on Web-Snippet Hierarchical Clustering. In WWW 2005, pages 801-810.
[7] Dawn J. Lawrie and W. Bruce Croft. Generating Hierarchical Summaries for Web Searches. In SIGIR 2003, pages 457-458.
[8] Xiang Ji, Wei Xu and Shenghuo Zhu. Document Clustering with Prior Knowledge. In SIGIR 2006, pages 405-411.
[9] In-Ho Kang and GilChang Kim. Query Type Classification for Web Document Retrieval. In SIGIR 2003, pages 64-71.
[10] Krishna Kummamuru and Raghu Krishnapuram. A Clustering Algorithm for Asymmetrically Related Data with Application to Text Mining. In CIKM 2001, pages 571-573.
[11] Krishna Kummamuru, Ajay Dhawale, and Raghu Krishnapuram. Fuzzy Co-clustering of Documents and Keywords. In FUZZIEEE 2003, pages 772-777.
[12] Krishna Kummamuru, Rohit Lotlikar, Shourya Roy, Karan Singal and Raghu Krishnapuram. A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results. In WWW 2004, pages 658-665.
[13] Raghu Krishnapuram and Krishna Kummamuru. Automatic Taxonomy Generation. In IFSA 2003, pages 52-63.
[14] Uichin Lee, Zhenyu Liu and Junghoo Cho. Automatic Identification of User Goals in Web Search. In WWW 2005, pages 391-400.
[15] Mark Sanderson. Word Sense Disambiguation and Information Retrieval. In SIGIR 1994, pages 142-151.
[16] Mark Sanderson and W. Bruce Croft. Deriving Concept Hierarchies from Text. In SIGIR 1999, pages 206-213.
[17] Jian-Tao Sun, Xuanhui Wang, Dou Shen, Hua-Jun Zeng and Zheng Chen. CWS: A Comparative Web Search System. In WWW 2006, pages 467-476.
[18] Hiroyuki Toda and Ryoji Kataoka. A Search Result Clustering Method using Informatively Named Entities. In WIDM 2005, pages 81-86.
[19] Anton V. Leouski and W. Bruce Croft. An Evaluation of Techniques for Clustering Search Results. Technical Report IR-76.
[20] Anton V. Leouski and James Allan. Improving Interactive Retrieval by Combining Ranked List and Clustering. In RIAO 2000, pages 665-681.
[21] Baeza-Yates and Ribeiro-Neto. Modern Information Retrieval.
[22] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma. Learning to Cluster Web Search Results. In SIGIR 2004, pages 210-217.
[23] Ying Zhao and George Karypis. Evaluation of Hierarchical Clustering Algorithms for Document Datasets. In CIKM 2002, pages 515-524.
[24] Oren Zamir and Oren Etzioni. Web Document Clustering: A Feasibility Demonstration. In SIGIR 1998, pages 46-54.
[25] Oren Zamir and Oren Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results. In WWW 1999, pages 1361-1374.
[26] http://www.google.com
[27] http://search.yahoo.com
[28] http://search.msn.com
[29] http://www.vivisimo.com
[30] http://dmoz.com
[31] http://ckipsvr.iis.sinica.edu.tw/
[32] http://clusty.com
[33] http://www.kartoo.com/
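Full-text access rights:
  • Authorized for on-campus browsing and printing of the electronic full text, publicly available from 2008-08-20.
  • Authorized for off-campus browsing and printing of the electronic full text, publicly available from 2008-08-20.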

