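System ID U0026-0812200910185954
Title (Chinese) 噢!別再出現404錯誤訊息了!
Title (English) 404 Error: Oh, not again!
University National Cheng Kung University
Department (Chinese) 資訊工程學系碩博士班
Department (English) Institute of Computer Science and Information Engineering
Academic year 90 (ROC calendar)
Semester 2
Year of publication 91 (ROC calendar)
Student (Chinese) 簡政傑
Student (English) Cheng-Chieh Chien
Student ID p7689403
Degree Master's
Language Chinese
Pages 67
Oral examination committee Committee member: 曾新穆
Advisor: 李強
Committee member: 陳銘憲
Committee member: 陳耀輝
Committee member: 貝若爾
Keywords (Chinese) 網頁比對  資訊擷取  搜尋引擎  遺失連結  404錯誤  全球資訊網
Keywords (English) page comparison  Information Retrieval  search engine  lost link  WWW  404 error
Subject classification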
Abstract (Chinese) 在過去十年中,World Wide Web(WWW)變成了Internet上用來獲取各種資訊的最重要媒介.然而,有些網頁可能被移動或是刪除,使得原先紀錄在我們的個人電腦或是搜尋引擎資料庫中的URL是過時的.

這個問題稱之為"lost link"並且產生HTTP 1.1 通訊協定中 "404"的錯誤代碼.一般來說,瀏覽網頁時發生"404 error"訊息的機率是非常頻繁的.在本篇論文中,我們提出一個新的ranking技術來解決這個問題,我們稱之為2-dimensional distance.

有別於已經提出的網頁比對技術只考慮了文字內容的distance,我們所提出的網頁間的2-dimensional distance則是同時考慮了style distance和text distance.我們的實驗也顯示了2-dimensional distance機制可以找到更正確的結果.我們也藉著2-dimensional distance設計了一個lost-link search engine的原型.

Abstract (English) Over the past decade, the World Wide Web (WWW) has become the most important medium for retrieving all kinds of information on the Internet. However, web pages may be moved or deleted, so the URLs recorded on our personal computers or in search engine databases become obsolete.

This problem is called a "lost link", and it produces the "404" error code defined by the HTTP 1.1 protocol. In general, "404 error" messages occur very frequently while browsing the web. In this thesis, we address this problem by proposing a novel ranking technique called 2-dimensional distance.

Unlike previously proposed page comparison techniques, which consider only the text distance, our 2-dimensional distance between two pages considers the style distance and the text distance simultaneously. Our experiments show that the 2-dimensional distance mechanism finds more accurate results. We also designed a prototype of a lost-link search engine based on 2-dimensional distance.
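
The idea of ranking candidate pages by a text distance and a style distance together can be illustrated with a small sketch. This record does not give the thesis's actual definitions, so everything below is an assumption made for illustration: the text distance is taken as a Jaccard distance over word shingles, the style distance as a Jaccard distance over HTML tag paths, and the two are combined with a Euclidean norm. The names `shingles`, `parse_page`, `two_dimensional_distance`, and `rank_candidates` are hypothetical and are not taken from the thesis.

```python
import math
import re
from html.parser import HTMLParser


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed text feature)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


class _PageParser(HTMLParser):
    """Collects visible text and the set of root-to-element tag paths."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        self.text.append(data)


def parse_page(html):
    """Return (visible text, set of tag paths) for an HTML string."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.paths


def two_dimensional_distance(html_a, html_b):
    """Return (text distance, style distance, combined distance).

    The Euclidean combination is an illustrative assumption.
    """
    text_a, paths_a = parse_page(html_a)
    text_b, paths_b = parse_page(html_b)
    text_d = jaccard_distance(shingles(text_a), shingles(text_b))
    style_d = jaccard_distance(paths_a, paths_b)
    return text_d, style_d, math.hypot(text_d, style_d)


def rank_candidates(lost_html, candidates):
    """Sort candidate URLs by combined distance to the lost page's old copy."""
    return sorted(
        candidates,
        key=lambda url: two_dimensional_distance(lost_html, candidates[url])[2],
    )
```

In this sketch, `rank_candidates` takes a saved copy of the lost page and a dictionary mapping candidate URLs to their HTML, and returns the URLs ordered from most to least similar. A page that keeps the same layout but changes its text still scores well on the style axis, which is the intuition the abstract gives for using two dimensions instead of text alone.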

Table of Contents Abstract
Acknowledgements
Table of Contents
Table of Figures
Table of Tables
Table of Algorithms
1 Introduction
2 Related Work
2.1 Overlap of shingles
2.2 Improve Performance of Shingle-method
2.3 Mirror Site and Web Collection
2.4 Drawback
2.5 Pages-clustering Base on Suffix Tree
2.6 Our work and the difference
3 Main Work
3.1 Definitions and Design Concepts
3.2 Phase 1: Text Comparison Phase
3.3 Phase 2: Style Comparison Phase
3.3.1 Adjust Topology of PST
3.3.2 Adjust Order of Paths
3.4 Short Summary
4 Performance Study
4.1 Environment
4.2 Experiment 1: original data
4.3 Experiment 2: modified data
5 Implementation Issue
5.1 Related Technique
5.2 Program Manual
6 Conclusions
6.1 Conclusions
6.2 Future Work
A Appendix
Biography
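Full-text access rights
  • Authorized for on-campus browsing/printing of the electronic full text, publicly available from 2002-07-10.
  • Authorized for off-campus browsing/printing of the electronic full text, publicly available from 2002-07-10.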

