Assessing Plagiarism Patterns with Web Search Engines
Department of Engineering Science
As information technologies advance, the amount of data gathered on the Internet increases at an incredibly rapid pace. To cope with this data overload, people commonly use web search engines to find what they need. However, because search engines are efficient and effective tools, plagiarists can also grab, reassemble, and redistribute text content without much difficulty. In this thesis, we develop an online detection system to reduce such misuse of search engines. Specifically, suspicious documents are extracted and verified through the collaboration of our plagiarism detection system and search engines. With a proper design, extracted text segments are assigned different priorities that determine the order in which they are sent to search engines to ascertain plagiarism. This greatly reduces unnecessary and repetitive work during plagiarism detection. Empirical studies show that the proposed approach is not only theoretically effective but also practically feasible.
Chapter 1 Introduction
1.1 Motivation and Overview of the Thesis
1.2 Contributions of the Thesis
Chapter 2 Literature Survey
2.1 Use and Misuse of Web Search
2.2 Overview of the Plagiarism Problem
2.3 Software Tools for Plagiarism Detection
2.3.1 COPS
2.3.2 SCAM
2.3.3 SNITCH
2.3.4 Other Commercial Tools
2.4 Reusing Search Engine Results
Chapter 3 Developing an Online Plagiarism Detection System
3.1 System Flows for Online Plagiarism Detection
3.2 Schemes of Segment Ranking
3.3 Identification of Plagiarism Sources
Chapter 4 Empirical Studies
4.1 System Implementations
4.2 Testing Datasets
4.3 Experimental Results
4.3.1 Discussions of the Plagiarism Detection Process
4.3.2 Experiments on the Real Dataset
4.3.3 Experiments on the Synthetic Dataset
Chapter 5 Conclusions and Future Works