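System ID U0026-0812200914014835
Title (Chinese) 以搜尋引擎進行剽竊模式之評估
Title (English) Assessing Plagiarism Patterns with Web Search Engines
Institution National Cheng Kung University
Department (Chinese) 工程科學系碩博士班
Department (English) Department of Engineering Science
Academic year 95
Semester 2
Publication year 96
Graduate student (Chinese) 劉奕廷
Graduate student (English) Yi-Ting Liu
E-mail n9694122@mail.ncku.edu.tw
Student ID n9694122
Degree Master's
Language English
Pages 40
Oral defense committee Advisor: 鄧維光
Keywords (Chinese) 剽竊偵測  軟體設計  搜尋引擎
Keywords (English) plagiarism detection  software design  search engine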
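Abstract (translated from Chinese) With the rapid advance of information technology, the amount of data accumulated on the Internet is growing at a striking pace. To cope with this data overload, people usually rely on search engines as a channel for finding the information they need. However, while search engines have become fast and effective tools, plagiarists can just as easily use them to find text written by others, then recombine and redistribute it. In this thesis, we develop an online detection system intended to reduce this misuse of search engines. More specifically, once the content of a suspicious document has been extracted, it is verified through the cooperation of our system and search engines. By design, every extracted text segment is assigned a priority before it is verified with a search engine, which eliminates much unnecessary and repetitive computation during plagiarism detection. Experimental studies show that the proposed method is effective and feasible both in theory and in practice.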
Abstract (English) As information technologies advance, the amount of data gathered on the Internet increases at an incredibly rapid pace. To cope with data overload, people commonly use web search engines to find what they need. However, as search engines have become efficient and effective tools, plagiarists can grab, reassemble, and redistribute text content without much difficulty. In this thesis, we develop an online detection system to reduce such misuse of search engines. Specifically, the content of a suspicious document is extracted and verified through the collaboration of our plagiarism detection system and search engines. By design, the extracted text segments are assigned different priorities before they are sent to search engines to ascertain plagiarism. This greatly reduces unnecessary and repetitive work during plagiarism detection. Empirical studies show that the proposed approach is not only theoretically effective but also practically feasible.
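The prioritized verification described in the abstract can be illustrated with a minimal sketch. This record does not state the thesis's actual segment ranking scheme, so the scoring below (segments built from rarer non-stop-word terms are queried first) is an assumed heuristic, and the names `rank_segments`, `detect`, and the `search` callable are hypothetical stand-ins, not the author's implementation or any real search engine API.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list would be longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on", "at", "as", "with"}


def split_segments(text):
    """Split a document into sentence-like text segments."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def content_words(segment):
    """Lowercased words of a segment with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", segment.lower()) if w not in STOP_WORDS]


def rank_segments(text):
    """Order segments so that those with more rare terms come first (assumed heuristic)."""
    segments = split_segments(text)
    counts = Counter(w for s in segments for w in content_words(s))

    def priority(segment):
        # Each term contributes the inverse of its frequency in the document.
        return sum(1.0 / counts[w] for w in content_words(segment))

    return sorted(segments, key=priority, reverse=True)


def detect(text, search, max_queries=3, hits_needed=2):
    """Query segments in priority order and stop once enough of them match.

    `search` is a hypothetical callable: it takes a query string and
    returns a list of URLs containing that exact phrase.
    """
    sources = Counter()
    queries = 0
    hits = 0
    for segment in rank_segments(text)[:max_queries]:
        queries += 1
        urls = search(f'"{segment}"')
        if urls:
            hits += 1
            sources.update(urls)
        if hits >= hits_needed:
            break  # Early stop: remaining low-priority segments are never sent.
    return {"queries": queries, "suspicious": hits >= hits_needed, "sources": sources.most_common()}
```

In this sketch the early stop is what saves queries: once the highest-priority segments already match, the lower-priority ones are skipped. URLs that match several segments rise to the top of `sources`, which is one simple way to point at a likely origin.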
Table of contents Chapter 1 Introduction
1.1 Motivation and Overview of the Thesis
1.2 Contributions of the Thesis
Chapter 2 Literature Survey
2.1 Use and Misuse of Web Search
2.2 Overview of the Plagiarism Problem
2.3 Software Tools for Plagiarism Detection
2.3.1 COPS
2.3.2 SCAM
2.3.3 SNITCH
2.3.4 Other Commercial Tools
2.4 Reusing Search Engine Results
Chapter 3 Developing an Online Plagiarism Detection System
3.1 System Flows for Online Plagiarism Detection
3.2 Schemes of Segment Ranking
3.3 Identification of Plagiarism Sources
Chapter 4 Empirical Studies
4.1 System Implementations
4.2 Testing Datasets
4.3 Experimental Results
4.3.1 Discussions of the Plagiarism Detection Process
4.3.2 Experiments on the Real Dataset
4.3.3 Experiments on the Synthetic Dataset
Chapter 5 Conclusions and Future Works
Bibliography
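  • Authorized for on-campus browsing/printing of the electronic full text, available from 2008-09-04.
  • Authorized for off-campus browsing/printing of the electronic full text, available from 2008-09-04.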
