進階搜尋


 
系統識別號 U0026-0812200914370774
論文名稱(中文) 應用標題產生機制於文件摘要之設計與實作
論文名稱(英文) Design and Implementation of the Topic Generation Methods for Document Summarization
校院名稱 成功大學
系所名稱(中) 電腦與通信工程研究所
系所名稱(英) Institute of Computer & Communication
學年度 96
學期 2
出版年 97
研究生(中文) 許恒耀
研究生(英文) Hen-Yao Hsu
學號 q3695122
學位類別 碩士
語文別 英文
論文頁數 43頁
口試委員 口試委員-劉豐榮
口試委員-郭耀煌
指導教授-楊竹星
口試委員-曾黎明
口試委員-葉俊雄
中文關鍵字 文件摘要  搜尋引擎  標題  字彙分佈 
英文關鍵字 document summarization  search engine  topic  word distribution 
學科別分類
中文摘要 近年來,越來越多的網際網路使用者利用搜尋引擎以找尋所需要的資訊,然而在搜尋引擎回傳的資訊中常包含大量不相關的資訊,讓使用者需花上許多不必要的時間來找出真正所需要的資訊。為了幫助使用者能夠快速的找到正確的資訊,此論文提出一個高效率的演算法DiSco (Distribution Scoring),此演算法提供將搜尋引擎所回傳的資訊,自動產生摘要以回傳給使用者。DiSco以標題(Topic)作為摘要的單位,考慮字彙在文件中出現位置所代表的重要性。利用一評分機制設定權重,並依字彙的權重高低以擷取出標題集合作為文件摘要之用。
論文中使用Reuters-21578、Google Trends兩類資料集進行模擬實驗。衡量標準使用資料涵蓋率、資料重疊率以及計算時間。為了進一步提昇DiSco的處理效率,論文並根基於DiSco設計一更加快速的演算法,並討論在提昇計算效能後與文件涵蓋率、文件重疊率之間的權衡關係。
實驗結果顯示,本論文提出的自動產生摘要機制,在衡量標準上均有優異的表現。實驗結果亦顯示此機制所自動產生的標題集合,能更適合作為文件摘要之用,以更加滿足使用者快速得到資訊的需求。
英文摘要 In recently years, more and more Internet users rely on the search engines to help them find the information they need. However, the information they often consists of a large amount of irrelevant parts. It would often spend Internet users much time to achieve the correct information users need. To help Internet users find the information they are looking for quickly, an efficient algorithm for automatically building the summaries of a collection of documents found by a search engine in response to a user query, called DiSco (Distribution Scoring), is proposed. Topics are the basic units of the summaries DiSco generated. The main idea of DiSco is to consider the distribution of lexicons in a document, while the distribution is in practice thought to be related to different importance of lexicons. Using a scoring mechanism to score the weight of individual lexicons, DiSco could generate the topic sets to be the summaries of the document based on the weights.
To demonstrate the performance of the proposed algorithm in this thesis, Reuters-21578 text categorization collection and the search results of the hot queries from Google Trends are used in the simulation. Moreover, several measure methods such as coverage, overlap, and the computation time are employed in evaluating the performance of the proposed algorithm. To further improve the efficiency of the proposed algorithm, an alternative version of DiSco is designed and implemented. The tradeoff between computation time and the quality of the summarization is also discussed.
All the simulation results indicate that the proposed algorithm, which is based on the distribution of lexicons, outperforms all the existing algorithms in terms of not only the benchmark of data coverage rate, data overlap rate and the computation time. The simulation results also indicate that the topic sets generated by the proposed algorithm are better suited for document summarization to fit the requirement of getting information quickly.
論文目次 List of Tables iii
List of Figures iv
Chapter 1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2 Related Works 5
2.1 Automatic Taxonomy Generation Algorithms . . . . . . . . . . . . . . . . . 5
2.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 3 The Proposed Algorithm 11
3.1 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 The Alternative Version of DiSco . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 4 Simulation Results 16
4.1 Simulating Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.1 Google Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.2 Reuters-21578 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3.1 Google Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.2 Reuters-21578 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4 Simulation Results of the Alternative Version . . . . . . . . . . . . . . . . . 21
4.4.1 Google Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4.2 Reuters-21578 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4.3 Summary of the Simulation Results . . . . . . . . . . . . . . . . . . 24
Chapter 5 Conclusion and Future Work 30
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Appendix A Stop-word List 36
Appendix B The Categories of Google Trends, Feb 29, 2008 40
Appendix C The Categories of Reuters-21578 42
參考文獻 [1] R. Baeze-Yates and B. Ribeiro-Neto, Modern Information Retrieval. ACM Press, 1999.
[2] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,”
Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 107–117, 1998.
[3] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the
ACM, vol. 46, pp. 668–677, 1999.
[4] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins,
and J. Wiener, “Graph structure in the web,” in Proceedings of the 9th international
World Wide Web conference on Computer networks : the international journal of computer
and telecommunications netowrking, 2000, pp. 309–320.
[5] E. Agichtein, E. Brill, and S. Dumais, “Improving web search ranking by incorporating
user behavior information,” in SIGIR ’06: Proceedings of the 29th annual international
ACM SIGIR conference on Research and development in information retrieval, 2006,
pp. 19–26.
[6] H. Berghel, “Cyberspace 2000: dealing with information overload,” Commun. ACM,
vol. 40, no. 2, pp. 19–24, 1997.
[7] “Vivisimo,” http://search.vivisimo.com/.
[8] “Snaket,” http://snaket.di.unipi.it/.
[9] “Iboogie,” http://www.iboogie.com/.
[10] “Grokker,” http://www.grokker.com/.
[11] “Kartoo,” http://www.kartoo.com/flash04.php3.
[12] O. Zamir and O. Etzioni, “Grouper: a dynamic clustering interface to web search results,”
Computer Networks, vol. 31, no. 11–16, pp. 1361–1374, 1999.
[13] W. Lam and Y. Han, “Automatic textual document categorization based on generalized
instance sets and a metamodel,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 25, no. 5, pp. 628–633, 2003.
[14] K. M. Hammouda and M. S. Kamel, “Efficient phrase-based document indexing for web
document clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 16,
no. 10, pp. 1279–1296, 2004.
[15] H. Al-Mubaid and S. A. Umair, “A new text categorization technique using distributional
clustering and learning logic,” IEEE Transactions on Knowledge and Data Engineering,
vol. 18, no. 9, pp. 1156–1165, 2006.
[16] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, “Inductive learning algorithms and
representations for text categorization,” in Proceedings of the 7th international conference
on Information and knowledge management, 1998, pp. 148–155.
[17] H. Chen and S. Dumais, “Bringing order to the web: automatically categorizing search
results,” in Proceedings of the SIGCHI conference on Human factors in computing systems,
2000, pp. 145–152.
[18] S. T. Dumais and H. Chen, “Hierarchical classification of Web content,” in Proceedings
of SIGIR-00, 23rd ACM International Conference on Research and Development in
Information Retrieval, 2000, pp. 256–263.
[19] I. Mani and E. Bloedorn, “Multi-document summarization by graph search and matching,”
in Proceedings of the Fifteenth National Conference on Artificial Intelligence
(AAAI-97), 1997, pp. 622–628.
[20] K. R. McKeown, J. L. Klavans, V. Hatzivzssiloglou, R. Barzilay, and E. Eskin, “Towards
multidocument summarization by reformulation: Progress and prospects,” in Proceedings
of the Sixteenth National Conference on Artificial Intelligence, 1999, pp. 453–460.
[21] H.-J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma, and J. Ma, “Learning to cluster web search
results,” in Proceedings of the 27th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2004, pp. 210–217.
[22] J. Dean and M. R. Henzinger, “Finding related pages in theWorldWideWeb,” Computer
Networks (Amsterdam, Netherlands: 1999), vol. 31, no. 11–16, pp. 1467–1479, 1999.
[23] E. Amitay and C. Paris, “Automatically summarising web sites - is there a way around
it?” in Proceedings of the 9th international conference on Information and knowledge
management, 2000, pp. 173–179.
[24] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing
order to the web,” Stanford Digital Library Technologies Project, Tech. Rep., 1998.
[25] G. Carenini, R. T. Ng, and X. Zhou, “Summarizing email conversations with clue
words,” Proceedings of the 16th international conference on World Wide Web (WWW
2007), pp. 91–100, 2007.
[26] P. G. Anick and S. Tipirneni, “The paraphrase search assistant: terminological feedback
for iterative information seeking,” in Proceedings of the 22nd annual international ACM
SIGIR conference on Research and development in information retrieval, 1999, pp. 153–
159.
[27] K. Kummamuru, R. Lotlikar, S. Roy, K. Singal, and R. Krishnapuram, “A hierarchical
monothetic document clustering algorithm for summarization and browsing search results,”
in Proceedings of the 13th conference on World Wide Web (WWW 2004), 2004,
pp. 658–665.
[28] M. Sanderson andW. B. Croft, “Deriving concept hierarchies from text,” in Proceedings
of the 22nd annual international ACM SIGIR conference on Research and Development
in Information Retrieval, 1999, pp. 206–213.
[29] R. H. Thompson and W. B. Croft, “Support for browsing in an intelligent text retrieval
system,” Int. J. Man-Mach. Stud., vol. 30, no. 6, pp. 639–668, 1989.
[30] P. Pirolli, P. Schank, M. Hearst, and C. Diehl, “Scatter/gather browsing communicates
the topic structure of a very large text collection,” in CHI ’96: Proceedings of the
SIGCHI conference on Human factors in computing systems, 1996, pp. 213–220.
[31] D. Lawrie, W. B. Croft, and A. Rosenberg, “Finding topic words for hierarchical summarization,”
in Proceedings of the 24th annual international ACM SIGIR conference,
2001, pp. 349–357.
[32] D. J. Lawrie and W. B. Croft, “Generating hierarchical summaries for web searches,” in
Proceedings of the 26nd annual international ACM SIGIR conference on Research and
development in information retrieval, 2003, pp. 457–458.
[33] K. Kummamuru and R. Krishnapuram, “A clustering algorithm for asymmetrically related
data with applications to text mining,” in Proceedings of the tenth international
conference on Information and knowledge management, 2001, pp. 571–573.
[34] P. Ferragina and A. Gulli, “A personalized search engine based on web-snippet hierarchical
clustering,” In Special interest tracks and posters of the 14th international conference
on World Wide Web (WWW 2005), pp. 801–810, 2005.
[35] ——, “A personalized search engine based on web-snippet hierarchical clustering,” Software:
Practice and Experience, vol. 38, no. 2, pp. 189–225, 2008.
[36] “Stop word list,” http://www-fog.bio.unipd.it/waishelp/stoplist.html.
[37] M. Porter, “An algorithm for suffix stripping,” Program, vol. 14, no. 3, pp. 130–137,
1980.
[38] Google, “Hot trends, Feb 29, 2008,” http://www.google.com/trends/hottrends?sa=Xn
&date=2008-2-29.
[39] Reuters-21578, “text categorization test collection,” http://www.daviddlewis.com/
resources/testcollections/reuters21578/.
[40] U. M. L. Archive, http://kdd.ics.uci.edu/.
論文全文使用權限
  • 同意授權校內瀏覽/列印電子全文服務,於2009-09-03起公開。
  • 同意授權校外瀏覽/列印電子全文服務,於2009-09-03起公開。


  • 如您有疑問,請聯絡圖書館
    聯絡電話:(06)2757575#65773
    聯絡E-mail:etds@email.ncku.edu.tw