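   The electronic full text has not yet been authorized for public release; for the print copy, please check the library catalog.
(※ If the thesis cannot be found, or its holdings status shows "closed stacks, not public", the thesis is not in the stacks and cannot be retrieved.)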
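System ID U0026-0707201511583400
Title (Chinese) 以Bootstrapping方法萃取網路優惠摘要
Title (English) Extracting Network Preferential Summary with Bootstrapping Method
University National Cheng Kung University (成功大學)
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 103 (ROC calendar)
Semester 2
Year of publication 104 (ROC calendar; 2015)
Graduate student (Chinese) 程彥輔
Graduate student (English) Yan-Fu Cheng
Student ID R76011074
Degree Master's
Language Chinese
Pages 56
Oral defense committee Advisor: 王惠嘉
Committee member: 盧文祥
Committee member: 高宏宇
Committee member: 劉任修
Keywords (Chinese) 文字探勘  資訊萃取  XML路徑語言  自助法  逐點交互訊息
Keywords (English) Text mining  Information extraction  XPath  Bootstrapping  Point-Wise Mutual Information
Subject classification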
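Abstract (translated from Chinese) The output value of Taiwan's e-commerce industry has grown markedly and continuously since 2008, and consumers are typically most interested in preferential (discount and promotion) information. With today's well-developed Internet, users usually turn to search engines to find the information they need. However, the explosive growth of online information and the freedom of web page design leave web pages full of noise, so it is difficult for search engines to keep their results both current and comprehensive. For topic-specific searches in particular, users often have to judge for themselves whether a result is what they need. An intelligent mechanism for document retrieval is therefore important.
This study uses a bootstrapping method combined with text mining techniques. After identifying preferential keywords, it takes websites with relatively complete preferential information as seed pages and uses the XML Path Language (XPath) to locate the Document Object Model (DOM) positions that hold preferential information, yielding patterns (templates) for extracting that information. Using these patterns, all pages of the selected websites are downloaded, processed with a word segmentation system, and analyzed with Distance Point-Wise Mutual Information (DPMI), a measure designed to take the distance between words into account. After the results are stored, the bootstrapping method continues to learn new keywords. Combinations of learned keywords with store or product names are submitted to a search engine to find more preferential websites, and the preceding steps are repeated to extract preferential summaries. Finally, a user interface is built that lets users query preferential information by keyword, for example "buy one get one free", "companion goes free", or "second item half price".
The experimental results show that using eight seed keywords gives the best recall and F-measure; precision after noun merging is 10.7% higher than before merging; DPMI with a distance of 2 achieves the highest precision, 29.4%, which is 9.4 percentage points higher than the 20% obtained with PMI; and in the final experiment, which uses keywords combined with store or product names to find new preferential websites, precision reaches as high as 59% with a recall of 32.9%.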
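The XPath step described in the abstract can be illustrated with a minimal sketch. The page fragment, the class names, and the XPath expression below are hypothetical and are not taken from the thesis; the sketch only assumes that a learned pattern is an XPath expression pointing at the DOM nodes that hold offer text. It uses Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset and needs well-formed markup, so real web pages would likely require an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not from the thesis data).
PAGE = """
<html><body>
  <div class="nav">Home</div>
  <div class="deal">
    <span class="title">Coffee</span>
    <span class="offer">buy one get one free</span>
  </div>
  <div class="deal">
    <span class="title">Movie ticket</span>
    <span class="offer">second item half price</span>
  </div>
</body></html>
"""

# Hypothetical extraction pattern: an XPath expression for the offer nodes.
OFFER_PATTERN = ".//div[@class='deal']/span[@class='offer']"


def extract_offers(page_source, pattern=OFFER_PATTERN):
    """Return the stripped text of every node matched by the XPath pattern."""
    root = ET.fromstring(page_source)
    return [node.text.strip() for node in root.findall(pattern) if node.text]


offers = extract_offers(PAGE)
print(offers)
```

Applying the same expression to other pages of the same site is what makes it act as a reusable pattern: navigation and other noise nodes are skipped because they do not match the path.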
Abstract (English) The output value of Taiwan's e-commerce industry has grown markedly since 2008. Consumers are most interested in discount and preferential information. It is difficult for search engines to keep their results both current and comprehensive.
This research uses a bootstrapping method combined with text mining. After determining preferential keywords, it sets websites that have complete preferential information as seed pages. It then finds the Document Object Model (DOM) positions of preferential information with the XML Path Language (XPath) to obtain patterns that can extract preferential information. The patterns are used to download web pages from the chosen websites. These pages are analyzed with a word segmentation system and Distance Point-Wise Mutual Information (DPMI), and new preferential keywords are learned with the bootstrapping method. Preferential keywords are combined with store or product names and submitted to a search engine to find new preferential websites. Finally, a user interface is developed that provides preferential information for keywords such as "buy one get one free" and "second item half price".
The experimental results show that DPMI with a word distance of two achieves the highest precision, 29.4%, which is 9.4 percentage points higher than the 20% obtained with PMI.
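The abstract does not give the DPMI formula, so the following is only a minimal sketch of one plausible reading: standard PMI computed over word pairs that co-occur within a fixed token distance. The function name `windowed_pmi`, the counting scheme, and the sample sentences are assumptions for illustration, not the thesis's definition.

```python
import math
from collections import Counter


def windowed_pmi(sentences, max_distance=2):
    """Return {(w1, w2): PMI} for pairs co-occurring within max_distance tokens.

    Assumption: a pair counts once for every position where the second word
    appears at most max_distance tokens after the first.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            # Only look ahead up to max_distance tokens.
            for right in tokens[i + 1 : i + 1 + max_distance]:
                pair_counts[tuple(sorted((left, right)))] += 1

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        p_pair = n / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return scores


# Hypothetical pre-segmented sentences (the thesis works on segmented Chinese text).
SAMPLE = [
    ["buy", "one", "get", "one", "free"],
    ["coffee", "buy", "one", "get", "one"],
]
scores = windowed_pmi(SAMPLE, max_distance=2)
```

With `max_distance=1` only adjacent words are paired, so the pair ("buy", "get") is not scored at all; widening the distance to 2 brings it in. This is the kind of parameter the reported comparison between distance settings would vary.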
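Table of contents 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Research Process
1.4 Research Scope and Limitations
1.5 Thesis Organization
2 Literature Review
2.1 Bootstrapping
2.1.1 DIPRE
2.1.2 Snowball
2.1.3 KnowItAll
2.2 Information Extraction
2.3 Machine Learning
2.4 Patterns
2.4.1 Similarity-Based Pattern Ranking
2.4.2 Document-Based Pattern Ranking
2.5 XML Path Language (XPath)
2.6 Chinese Word Segmentation
2.6.1 Ambiguity
2.6.2 Unknown Words
2.7 Document Analysis
2.8 Point-Wise Mutual Information (PMI)
2.9 Related Work
2.10 Summary
3 Research Method
3.1 Research Framework
3.2 Preprocessing Stage
3.3 Learning Extraction Rules and Downloading Web Pages
3.4 Word Segmentation Analysis and Learning
3.5 Learning New Websites
4 System Implementation and Experimental Validation
4.1 System Implementation
4.1.1 Experimental Environment
4.1.2 Packages and Modules Used
4.1.3 System Processing Flow
4.2 Experimental Method
4.2.1 Data Sources
4.2.2 Evaluation Metrics
4.3 Parameter Settings
4.4 Experimental Results and Analysis
4.4.1 Experiment 1
4.4.2 Experiment 2
4.4.3 Experiment 3
4.4.4 Experiment 4
4.4.5 Experiment 5
5 Conclusion
5.1 Research Results
5.2 Future Research Directions
References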
Full-text usage rights
  • Authorized for on-campus browsing and printing of the electronic full text, available from 2020-07-13.