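   The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.
(Note: if the search returns no result, or the holdings status shows "closed stacks, not available", the thesis is not in the stacks and cannot be retrieved.)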
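System ID U0026-2907201917105100
Title (Chinese) 自動化新聞事件內容策展方法之建立
Title (English) A Method for Automatic Content Curation of News Events
University National Cheng Kung University (成功大學)
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 107 (ROC calendar)
Semester 2
Year of publication 108 (ROC calendar; 2019)
Author (Chinese) 李亭葦
Author (English) Ting-Wei Li
Student ID R76064124
Degree Master's
Language Chinese
Pages 56
Committee Advisor: 王惠嘉
Committee member: 劉任修
Committee member: 高宏宇
Committee member: 李偉柏
Keywords (Chinese) 自動化內容策展  資訊檢索  自動化新聞摘要
Keywords (English) Automatic Content Curation  Information Retrieval  Automatic News Summary
Subject classification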
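Abstract (Chinese) The rapid growth of the Internet has changed how people obtain information. The audience that follows the news through traditional newspapers and radio keeps shrinking, replaced by online news articles, and more and more readers browse the news on computers or mobile devices. The news industry has therefore moved toward digitization, with news outlets publishing articles on the web to deliver information to the public. The large volume of online news gives readers variety, but readers also need a great deal of time to read and digest it. In addition, a news event usually lasts for some time as it develops or as the topic keeps drawing attention. When readers want to look up a specific event and understand how it unfolded, they rely on the search function of current news sites, and the results are usually a large number of articles, so readers must spend even more effort reviewing and organizing them one by one before they find the information they are actually looking for.
To address these problems, some platforms apply the concept of content curation: they aggregate large numbers of news articles by event or issue and present them to readers after selecting, refining, organizing, and adding value. Unlike earlier content curation platforms, where editors organize the material by hand, this study implements content curation of news events automatically. It first extracts the topics of the dataset and uses a hidden Markov model over word sequences to find the sequence of topic transitions. It then computes topic strength and the variation in strength to detect the important time points in the development of the event, and finally produces a concise summary of the articles. The curated result shown to readers combines the two features of chronological ordering and summarization, with the aim of helping readers read clearly and grasp the course of the event quickly.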
Abstract (English) Readers' reading habits have changed: more and more readers use computers or mobile devices to browse the news, and the news industry has digitized as well. News outlets publish news online to deliver information. The large volume of online news brings readers a variety of information, but readers also need to spend more time digesting it. When readers query a news event, the search often returns a large number of results, so readers must spend extra effort sorting them out.
To address this problem, some platforms apply the concept of content curation, which aggregates news articles by event and then organizes and presents them to readers. At present, most content curation is done manually. Unlike those platforms, this study proposes an automated method of news curation. We first extract topics from the dataset and use word sequences to find the topic sequence through a hidden Markov model. We then calculate topic strength and its variation to detect important time points during the development of the event. Finally, we generate a concise summary for each time point. We combine chronology and summarization to design the curation result, with the aim of helping readers quickly grasp the context of a news event.
Experiments show that the method performs well in each module and that the curation results are practically useful to readers. In terms of coherence, however, the results fall slightly short and leave room for improvement.
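Illustrative sketch (not taken from the thesis): the abstract describes detecting important time points from topic strength and its variation but gives no formulas. The Python below is one possible reading under stated assumptions. Topic strength is assumed to be the mean topic distribution of the articles in a time slice, variation is assumed to be the Jensen-Shannon distance between consecutive slices, and the threshold of 0.3 is arbitrary. The function names and sample data are hypothetical, and the hidden Markov model step is omitted.

```python
from math import log2, sqrt

def topic_strength(doc_topic_rows):
    """Assumed definition: average the per-article topic distributions of one time slice."""
    n_docs = len(doc_topic_rows)
    n_topics = len(doc_topic_rows[0])
    return [sum(row[k] for row in doc_topic_rows) / n_docs for k in range(n_topics)]

def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    # max() guards against tiny negative values caused by rounding.
    return sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

def detect_breakpoints(slices, threshold=0.3):
    """Return indices of time slices whose strength vector shifts sharply
    relative to the previous slice (threshold is a hypothetical choice)."""
    strengths = [topic_strength(rows) for rows in slices]
    return [
        t for t in range(1, len(strengths))
        if js_distance(strengths[t - 1], strengths[t]) > threshold
    ]

# Hypothetical data: three days, two topics, one row per article.
example_slices = [
    [[0.9, 0.1], [0.8, 0.2]],   # day 0: topic 0 dominates
    [[0.85, 0.15]],             # day 1: same mix as day 0
    [[0.1, 0.9], [0.2, 0.8]],   # day 2: topic 1 takes over
]
print(detect_breakpoints(example_slices))
```

In this toy input only the last day is flagged, because it is the only slice whose topic mix differs sharply from the day before.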
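Table of contents Chapter 1 Introduction 1
1.1 Research Background and Motivation 3
1.2 Research Objectives 7
1.3 Research Limitations and Assumptions 9
1.4 Research Process 9
1.5 Thesis Outline 10
Chapter 2 Literature Review 12
2.1 Content Curation 12
2.2 Topic Detection 14
2.2.1. Non-probabilistic Model 14
2.2.2. Probabilistic Model 15
2.3 Hidden Markov Model 17
2.4 Document Summarization 19
2.5 Summary 22
Chapter 3 Research Method 23
3.1 Research Framework 23
3.2 Data Collection and Preprocessing Module 25
3.2.1. Data Collection 25
3.2.2. Data Preprocessing 26
3.3 Document Filtering Module 27
3.4 Breakpoint Detection Module 28
3.4.1. Topic Extraction 29
3.4.2. Hidden Markov Model 31
3.4.3. Topic Strength 31
3.4.4. Topic Variation 32
3.5 Automatic News Summarization Module 33
3.5.1. Sentence Clustering 33
3.5.2. Selecting Representative Sentences 34
3.6 Summary 34
Chapter 4 System Implementation and Validation 35
4.1 System Environment Setup 35
4.2 Experimental Method 35
4.2.1. Data Sources 36
4.2.2. Experimental Design 38
4.2.3. Evaluation Metrics 38
4.3 Parameter Settings 40
4.4 Experimental Results and Analysis 43
4.4.1. Experiment 1 43
4.4.2. Experiment 2 45
4.4.3. Experiment 3 46
4.4.4. Experiment 4 47
Chapter 5 Conclusions and Future Directions 49
5.1 Research Results 49
5.2 Future Research Directions 51
References 52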
References
English-language references
Allan, J., Carbonell, J., Doddington, G., Yamron, J., & Yang, Y. (1998). Topic Detection and Tracking Pilot Study: Final Report. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998, 194-218.
Antoniou, G., & Harmelen, F. V. (2008). A Semantic Web Primer (2nd ed.). Cambridge, MA: The MIT Press.
Baralis, E., Cagliero, L., Mahoto, N., & Fiori, A. (2013). GraphSum: Discovering Correlations among Multiple Terms for Graph-based Summarization. Information Sciences, 249, 96-109.
Bawden, D., & Robinson, L. (2009). The Dark Side of Information: Overload, Anxiety and Other Paradoxes and Pathologies. Journal of Information Science, 35(2), 180-191.
Bhargava, R. (2009). Manifesto for the Content Curator: The Next Big Social Media Job of the Future? Retrieved from http://www.rohitbhargava.com/2009/09/manifesto-for-the-content-curator-the-next-big-social-media-job-of-the-future.html
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
Burnette-Lemon, J. (2012, Jan-Feb). The Collector: Pearltrees' Oliver Starr Explains How Content Curation Works for Both Individual Users and Companies. Communication World, 29, 24-27.
Carbonell, J., & Goldstein, J. (1998). The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 335-336.
Chen, J., Niu, Z., & Fu, H. (2015). A Multi-news Timeline Summarization Algorithm Based on Aging Theory. In R. Cheng, B. Cui, Z. Zhang, R. Cai, & J. Xu (Eds.), Web Technologies and Applications (pp. 449-460). Cham, Switzerland: Springer International Publishing.
Chen, K. Y., Liu, S. H., Chen, B., Wang, H. M., Jan, E. E., Hsu, W. L., & Chen, H. H. (2015). Extractive Broadcast News Summarization Leveraging Recurrent Neural Network Language Modeling Techniques. IEEE Transactions on Audio, Speech, and Language Processing, 23(8), 1322-1334.
Dale, S. (2014). Content Curation: The Future of Relevance. Business Information Review, 31(4), 199-205.
Dhillon, I. S., & Modha, D. S. (2001). Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning, 42(1), 143-175.
Endres, D. M., & Schindelin, J. E. (2003). A New Metric for Probability Distributions. IEEE Transactions on Information Theory, 49(7), 1858-1860.
Erkan, G., & Radev, D. R. (2004). LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, 22, 457-479.
Filatova, E., & Hatzivassiloglou, V. (2004). Event-based Extractive Summarization. Text Summarization Branches Out, 104-112.
Greenbacker, C. F. (2011). Towards a Framework for Abstractive Summarization of Multimodal Documents. Proceedings of the ACL 2011 Student Session, 75-80.
Haribhakta, Y., Malgaonkar, A., & Kulkarni, P. (2012). Unsupervised Topic Detection Model and Its Application in Text Categorization. Proceedings of the CUBE International Information Technology Conference, 314-319.
Herther, N. K. (2012, September). Content Curation: Quality Judgment and the Future of Media and Web Search. Searcher, 20, 30-41.
Hofmann, T. (1999). Probabilistic Latent Semantic Analysis. Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, 289-296.
Hu, P., Huang, M., Xu, P., Li, W., Usadi, A. K., & Zhu, X. (2011). Generating Breakpoint-based Timeline Overview for News Topic Retrospection. 2011 IEEE 11th International Conference on Data Mining, 260-269.
Indra, Winarko, E., & Pulungan, R. (in press). Trending Topics Detection of Indonesian Tweets Using BN-grams and Doc-p. Journal of King Saud University - Computer and Information Sciences.
Kessler, R., Tannier, X., Hagège, C., Moriceau, V., & Bittar, A. (2012). Finding Salient Dates for Building Thematic Timelines. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 1, 730-739.
Khan, A., Salim, N., & Kumar, Y. J. (2015). A Framework for Multi-document Abstractive Summarization Based on Semantic Role Labelling. Applied Soft Computing, 30, 737-747.
Lim, J. M., Kang, I. S., Bae, J. H. J., & Lee, J. H. (2005). Sentence Extraction Using Time Features in Multi-document Summarization. In S. H. Myaeng, M. Zhou, K. F. Wong, & H. J. Zhang (Eds.), Information Retrieval Technology (pp. 82-93). Berlin, Heidelberg: Springer.
Lin, C. Y., & Hovy, E. (2000). The Automated Acquisition of Topic Signatures for Text Summarization. Proceedings of the 18th conference on Computational linguistics, 1, 495-501.
Lloret, E., Plaza, L., & Aker, A. (2018). The Challenging Task of Summary Evaluation: An Overview. Language Resources and Evaluation, 52(1), 101-148.
Loan, F. A. (2011). Impact of Internet on Reading Habits of the Net Generation College Students. International Journal of Digital Library Services, 1(2), 43-48.
Marujo, L., Ling, W., Ribeiro, R., Gershman, A., Carbonell, J., de Matos, D., & Neto, J. P. (2016). Exploring Events and Distributed Representations of Text in Multi-document Summarization. Knowledge-Based Systems, 94, 33-42.
Mauá, D., Antonucci, A., & de Campos, C. (2016). Hidden Markov Models with Set-valued Parameters. Neurocomputing, 180, 94-107.
Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
Nenkova, A., & McKeown, K. (2012). A Survey of Text Summarization Techniques. In C. C. Aggarwal & C. Zhai (Eds.), Mining Text Data (pp. 43-76). New York, NY: Springer Science & Business Media.
Newman, N., Fletcher, R., Kalogeropoulos, A., & Levy, D. (2018). Reuters Institute Digital News Report 2018.
Newman, N., Fletcher, R., Kalogeropoulos, A., Levy, D. A., & Nielsen, R. K. (2017). Reuters Institute Digital News Report 2017.
Nicholson, N. (2012, Jan/Feb). An Opportunity to Add Value. Communication World, 29, 3.
Ohsawa, Y., Benson, N. E., & Yachida, M. (1998). KeyGraph: Automatic Indexing by Co-occurrence Graph Based on Building Construction Metaphor. Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-, 12-18. doi:10.1109/ADL.1998.670375
Petkos, G., Papadopoulos, S., Aiello, L., Skraba, R., & Kompatsiaris, Y. (2014). A Soft Frequent Pattern Mining Approach for Textual Topic Detection. Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), 1-10.
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-286.
Román-Gálvez, R., Román-Roldán, R., Martínez-Aroza, J., & Gómez-Lopera, J. F. (2015). Semi-hidden Markov Models for Generation and Analysis of Sequences. Mathematics and Computers in Simulation, 118, 320-328.
Sahoo, D., Bhoi, A., & Balabantaray, R. C. (2018). Hybrid Approach To Abstractive Summarization. Procedia Computer Science, 132, 1228-1237.
Salton, G., Wong, A., & Yang, C. S. (1975, November). A Vector Space Model for Automatic Indexing. Communications of the ACM, 18, 613-620.
Sankarasubramaniam, Y., Ramanathan, K., & Ghosh, S. (2014). Text Summarization Using Wikipedia. Information Processing & Management, 50(3), 443-461.
Sarkar, K., Nasipuri, M., & Ghose, S. (2011). Using Machine Learning for Medical Document Summarization. International Journal of Database Theory and Application, 4(1), 31-48.
Sayyadi, H., & Raschid, L. (2013). A Graph Analytical Approach for Topic Detection. ACM Transactions on Internet Technology, 13(2), 1-23.
Sun, J. (2012). ‘Jieba’ Chinese word segmentation tool. Retrieved from https://github.com/fxsjy/jieba
Sun, Y., Deng, H., & Han, J. (2012). Probabilistic Models for Text Mining. In C. C. Aggarwal & C. Zhai (Eds.), Mining Text Data (pp. 259-295). New York, NY: Springer Science & Business Media.
Tanaka, H., Kinoshita, A., Kobayakawa, T., Kumano, T., & Kato, N. (2009). Syntax-driven Sentence Revision for Broadcast News Summarization. Proceedings of the 2009 Workshop on Language Generation and Summarisation, 39-47.
Wartena, C., & Brussee, R. (2008). Topic Detection by Clustering Keywords. Proceedings of the 19th International Workshop on Database and Expert Systems Applications, 54-58.
Wu, Q., Zhang, C., Hong, Q., & Chen, L. (2014). Topic Evolution Based on LDA and HMM and Its Application in Stem Cell Research. Journal of Information Science, 40(5), 611-620.
Xu, J., & Yang, X. (2015). Generating the Theme Overview Based on Clue Chain from Online News. Proceedings of the 2015 IEEE International Conference on Systems, Man, and Cybernetics, 2730-2735.
Yang, Y., Pierce, T., & Carbonell, J. (1998). A Study of Retrospective and On-line Event Detection. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 28-36.
Zhai, C., Velivelli, A., & Yu, B. (2004). A Cross-Collection Mixture Model for Comparative Text Mining. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 743-748.
Zhang, C., Wang, H., Cao, L., Wang, W., & Xu, F. (2016). A Hybrid Term–term Relations Analysis Approach for Topic Detection. Knowledge-Based Systems, 93(1), 109-120.
Zhang, P. Y., & Li, C. H. (2009). Automatic Text Summarization Based on Sentences Clustering and Extraction. Proceedings of the 2nd IEEE International Conference on Computer Science and Information Technology, 167-170.
Zhao, T., Luo, X., Qin, W., Huang, S., & Xie, S. (2018). Topic Detection Model in a Single‐Domain Corpus Inspired by the Human Memory Cognitive Process. Concurrency and Computation: Practice and Experience, 30(19), e4642.
Chinese-language references
蔡尚勳. (2017). 沒想到? 2017最厲害閱讀關鍵字出爐. Retrieved from https://money.udn.com/money/story/10860/2900117
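Full-text access rights
  • Authorized for on-campus viewing and printing of the electronic full text, publicly available from 2024-08-06.
  • Authorized for off-campus viewing and printing of the electronic full text, publicly available from 2024-08-06.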

