Hierarchical Multi-Dimensional Subjectivity-Lexicon Generation Model for Opinion Analysis
Institute of Computer Science and Information Engineering
natural language processing
Opinion mining and sentiment analysis is an emerging area of information retrieval and natural language processing concerned with opinion retrieval and with subjectivity classification and clustering, and it has recently attracted growing attention from academia and industry. Traditional approaches focus mainly on polarity classification, whose limitations are addressed in this thesis. Because of these limitations, polarity-based approaches are not adequate for domains such as criticism analysis, which require more refined analysis techniques and models. This thesis makes five major contributions: first, a Multi-Dimensional Opinion Analysis (MDOA) framework for criticism analysis; second, an unsupervised Multi-Dimensional Subjectivity-Lexicon (MDSL) generation scheme; third, a semi-supervised Hierarchical MDSL (H-MDSL) generation model; fourth, a modified Semi-Supervised Kernel k-Means clustering algorithm; fifth, an evaluation scheme that requires no human intervention, built on the General Inquirer (GI) and based on quantifying constraint agreement and constraint violation.
The MDOA framework consists of four major steps: first, creating a dataset by crawling blog posts that contain reviews; second, building a "subjectivity-term to object" matrix in which each subjectivity term is modeled as a vector in a high-dimensional feature space; third, transforming each subjectivity-term vector into a new, lower-dimensional feature space that still represents the subjectivity terms well, which yields the final MDSL; and fourth, employing the learned MDSL for opinion mining and sentiment analysis.
The experiments have four parts. First, the limitations of traditional polarity opinion analysis are demonstrated. Second, an entropy analysis of the learned MDSL and H-MDSL is performed in the transformed feature space; it shows that the feature transformation improves the entropy of the learned features by up to 31%. Third, the proposed models and algorithms are evaluated for constraint agreement and violation; the proposed model outperforms the others by at least 21% in error rate and hit rate. Fourth, the proposed framework is compared with traditional polarity approaches; the comparison shows that it is not only capable of traditional polarity classification but also provides more semantically meaningful information in criticism analysis.
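Steps two and three of the MDOA framework described in the abstract can be illustrated with a minimal sketch. It assumes a binary term-to-object matrix, TF-IDF weighting, and a truncated singular value decomposition, which are named in the thesis outline but not specified here; the toy data, the function names, and the choice of `k` are hypothetical and are not taken from the thesis.

```python
import numpy as np

# Hypothetical toy data: each "object" (e.g. a reviewed item) maps to the
# subjectivity terms found in reviews about it. Not taken from the thesis.
reviews = {
    "movie_a": ["moving", "slow", "beautiful"],
    "movie_b": ["slow", "boring"],
    "movie_c": ["beautiful", "moving", "moving"],
    "movie_d": ["boring", "predictable", "slow"],
}


def build_term_object_matrix(reviews):
    """Binary term-to-object matrix: rows are terms, columns are objects."""
    terms = sorted({t for ts in reviews.values() for t in ts})
    objects = sorted(reviews)
    matrix = np.zeros((len(terms), len(objects)))
    for j, obj in enumerate(objects):
        for t in reviews[obj]:
            matrix[terms.index(t), j] = 1.0
    return terms, objects, matrix


def tfidf_weight(matrix):
    """Scale each term row by log(N / df), df = number of objects with the term."""
    n_objects = matrix.shape[1]
    df = np.count_nonzero(matrix, axis=1)
    idf = np.log(n_objects / df)
    return matrix * idf[:, None]


def reduce_terms(matrix, k):
    """Represent each term by its coordinates on the top-k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


terms, objects, m = build_term_object_matrix(reviews)
weighted = tfidf_weight(m)
lexicon = reduce_terms(weighted, k=2)  # one 2-dimensional vector per term
for term, vec in zip(terms, lexicon):
    print(term, np.round(vec, 3))
```

Each row of `lexicon` is a low-dimensional vector for one subjectivity term. In this sketch such vectors would be the input to a later clustering step; how the thesis combines weighting, decomposition, and clustering is not stated in the abstract.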
List of Tables
List of Figures
Chapter 1
1. Introduction
1.1 Motivation
1.2 Issues and Challenges
1.3 Contributions
1.4 Organization
Chapter 2
2. Background and Related Work
2.1 Opinion Orientation Classification
2.2 Subjectivity Classification
2.3 Opinion Retrieval
Chapter 3
3. Multi-Dimensional Opinion Analysis
3.1 Data Collecting
3.2 Preprocessing
3.2.1 Binary Model
3.2.2 Likelihood Model
3.2.3 NLP-Enhanced Model
3.3 Transformation
3.3.1 TF-IDF Weighting
3.3.2 Singular Value Decomposition
3.3.3 Subjectivity-Clustering
3.3.4 Combination
3.4 Opinion Analysis
3.5 Hierarchical Multi-Dimensional Subjectivity Lexicon
3.6 Modified Semi-Supervised Kernel k-Means
Chapter 4
4. Experiments
4.1 Evaluation Design
4.2 Experimental Results
Chapter 5
5. Conclusions and Future Work