

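   The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.
(※ If the thesis cannot be found, or its holding status shows "closed stacks, not public", the copy is not in the stacks and cannot be retrieved.)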
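System ID U0026-1106201418245100
Title (Chinese) 社群資訊主題偵測 - 以Twitter為例
Title (English) Topic detection and tracking on social network – A case study of Twitter
Institution National Cheng Kung University (成功大學)
Department (Chinese) 資訊管理研究所
Department (English) Institute of Information Management
Academic year 102 (ROC calendar)
Semester 2
Year of publication 103 (ROC calendar; 2014)
Graduate student (Chinese) 黃靖傑
Graduate student (English) Ching-Chieh Huang
Student ID R76014111
Degree Master's
Language Chinese
Pages 89
Oral defense committee Advisor - 王惠嘉
Committee member - 劉任修
Committee member - 盧文祥
Committee member - 高宏宇
Keywords (Chinese) 社群網站  文字探勘  主題偵測與追蹤  短文字分群
Keywords (English) social network sites  text mining  topic detection and tracking  short text clustering
Subject classification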
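Abstract (Chinese) With the rise of social network sites, people can easily interact across geographic boundaries and discuss the topics that interest them. Among the many social network sites, microblogs, which present information as short text messages, are the type most widely used today. Microblogs emphasize timeliness and immediacy, so a large volume of messages is produced every day. How users can quickly grasp the topics the public cares about from this volume of messages has become a popular research issue. Current topic detection and tracking (TDT) techniques can extract topics from large amounts of text. However, past TDT techniques focused on news datasets, and their methods do not suit microblogs, whose texts are short and highly variable. In addition, topic detection results usually represent a topic with a single word or a set of words, but topics change over time; this presentation is too simple for users to understand a topic's trend or to follow a topic event from beginning to end.
The purpose of this research is to help users quickly understand the topic trends that the public currently cares about online, and to browse topic-related information through the research system in order to understand and track events from beginning to end. To this end, the system applies text clustering and topic detection techniques to find the hot issues the public is following and to filter noise effectively. It then builds a topic tree of social information so that users can understand and discover the relations between topics. Experiments confirm that the proposed clustering and topic detection algorithms can process large volumes of online messages automatically, without human assistance; the resulting F-measure reaches 0.6 and outperforms methods proposed in earlier studies.
Finally, the topic-related information obtained by the proposed method is combined with a timeline to build a social topic information system that presents topic information and trends, so that users can quickly understand and track a topic from beginning to end.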
Abstract (English) Social networks play an important role in communication. People can discuss topics that interest them through social network sites (SNSs). SNSs emphasize instantaneous short texts, so a large number of messages is produced every day. How to extract the hot topics from this large number of messages is now a popular research issue. Topic detection and tracking (TDT) can extract topic information from large amounts of text data; however, most past TDT research focused on news corpora, which consist mainly of long texts. In many situations, methods designed for news cannot be applied to SNSs because SNS messages are too short and too varied for topics to be extracted. The goal of this research is to help users quickly track the hot events that most people on social networks are concerned about. To achieve this goal, probabilistic text clustering and TDT are used to find the hot topics and the relations between topics. According to our experimental results, our method achieves an F-measure above 0.6 on the TDT task, which is better than existing methods. Consequently, our research system can help users understand hot events on SNSs clearly and easily. Users can also focus on a single event to track its detailed information.
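Note on the reported metric: the abstract gives an F-measure but does not say which variant was used. The sketch below assumes the common balanced F1 (the harmonic mean of precision and recall) and exposes a `beta` parameter for other weightings. The function name `f_measure` and the counts in the usage line are hypothetical illustrations, not data from the thesis.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """Return the F-beta score computed from raw detection counts.

    Assumption: beta=1.0 (balanced F1); the thesis record does not
    state which F-measure variant was reported.
    """
    if true_pos == 0:
        # Precision and recall are both zero (or undefined), so report 0.0.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: 3 topics detected correctly, 1 spurious, 3 missed.
# Precision is 0.75 and recall is 0.5, giving an F1 of about 0.6.
score = f_measure(true_pos=3, false_pos=1, false_neg=3)
print(round(score, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, a single F-measure such as 0.6 can correspond to quite different precision and recall pairs.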
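Table of contents Chapter 1 Introduction
1.1 Research background and motivation
1.2 Research objectives
1.3 Research scope and limitations
1.4 Research process
1.5 Thesis outline
Chapter 2 Literature review
2.1 Social network analysis
2.2 Natural language processing
2.3 Information retrieval
2.3.1 Vector space model
2.3.2 Probabilistic model
2.4 Machine learning
2.5 Document clustering
2.6 Topic detection
2.7 Adjusted Rand Index (ARI)
2.8 Kappa reliability measure
2.9 Summary
Chapter 3 Research method
3.1 Research framework
3.2 Data preprocessing module
3.3 Tweets probabilistic clustering module
3.3.1 Tweets VSM and similarity construction
3.3.2 Decision scalable distance-based clustering
3.3.3 Probabilistic feedback mechanism
3.4 Topic detection module
3.4.1 daily topic detection
3.4.2 topic tree
3.5 Topic visualization module
3.5.1 Trend analysis
3.5.2 Social topic information system
Chapter 4 System implementation and validation
4.1 System implementation
4.2 Experimental design
4.2.1 Dataset
4.2.2 Experimental design
4.2.3 Parameter settings
4.3 Experimental results
4.3.1 Experiment 1: Determining the sample size
4.3.2 Experiment 2: Clustering accuracy comparison
4.3.3 Experiment 3: Daily topic detection
4.3.4 Experiment 4: Overall topic detection
4.4 System demonstration
Chapter 5 Conclusions and future research directions
5.1 Research results
5.2 Future research directions
Chapter 6 References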
Full-text access rights
  • Authorized for on-campus browsing and printing of the electronic full text, available from 2019-06-26.

