System ID: U0026-1808202000404000
Title (Chinese): 具情緒感知之音樂及影片共同特徵空間生成模型
Title (English): EMVGAN: Emotion-Aware Music-Video Common Representation Learning via Generative Adversarial Networks
University: National Cheng Kung University (成功大學)
Department (Chinese): 資訊工程學系
Department (English): Institute of Computer Science and Information Engineering
Academic Year: 108 (2019-2020)
Semester: 2
Year of Publication: 109 (2020)
Author (Chinese): 蔡雨芝
Author (English): Yu-Chih Tsai
Student ID: P76074389
Degree: Master's
Language: English
Number of Pages: 52
Oral Examination Committee:
Advisor - 朱威達
Advisor - 胡敏君
Committee member - 楊奕軒
Committee member - 朱宏國
Keywords (Chinese): 生成對抗網路, 跨模態對抗機制, 情緒識別, 共同特徵空間學習, 跨模態檢索
Keywords (English): Generative Adversarial Network, Cross-modal Adversarial Mechanism, Emotion Recognition, Common Representation Learning, Cross-modal Retrieval
Subject Classification:
Abstract (Chinese): Music can intensify our emotional response to videos and images, and videos and images can likewise deepen our emotional experience of music. Cross-modal retrieval can recommend suitable music for a given video and, conversely, match fitting video clips to a piece of music. However, data from different modalities differ considerably in their distributions and representations, which creates a heterogeneity gap between modalities and makes learning a cross-modal common representation space challenging. In this thesis, we propose an emotion-aware generative model for music-video common representation learning that builds an affective common representation space between music and video and bridges their heterogeneity gap. Experimental results show that the proposed model learns cross-modal affective common representations and outperforms existing related work. We further use the cross-modal common representations for bidirectional music-video retrieval. Forty participants were recruited for a subjective evaluation of the retrieval results; they judged music videos retrieved with the common representations to be comparable to the officially released music videos in terms of audio-visual fit and emotional coherence.
Abstract (English): Music can enhance our emotional reactions to videos and images, while videos and images can enrich our emotional response to music. Cross-modal retrieval technology can be used to recommend appropriate music for a given video and vice versa. However, the heterogeneity gap caused by the inconsistent distributions of different data modalities complicates learning a common representation space across modalities. Accordingly, we propose an emotion-aware music-video cross-modal generative adversarial network (EMVGAN) model that builds an affective common embedding space to bridge the heterogeneity gap between data modalities. The evaluation results revealed that the proposed EMVGAN model learns affective common representations with convincing performance and outperforms existing models. The satisfactory performance of the proposed network encouraged us to further undertake the bidirectional music-video retrieval task. Subjective evaluations by 40 recruited participants indicated that the retrieved music videos and the official music videos were rated similarly in terms of audio-visual consistency and emotional relevance.
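The abstract above describes the core technical idea: music and video are each projected into a shared, emotion-aware embedding space, an adversarial objective closes the heterogeneity gap between the two modalities, and bidirectional retrieval is then performed directly in that space. The sketch below illustrates, in PyTorch, one common way such a cross-modal adversarial embedding setup is wired together. Everything in it (the Encoder and ModalityDiscriminator modules, the 128-dimensional common space, the 512- and 1024-dimensional input features, the four emotion classes, and the unweighted loss sum) is an assumption made for illustration; it is a generic sketch of the technique, not the actual EMVGAN architecture or its hyperparameters.

# Minimal, illustrative sketch of cross-modal adversarial common-representation
# learning. All dimensions and module names are assumptions, not taken from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 128       # assumed size of the common embedding space
N_EMOTIONS = 4      # assumed number of emotion classes

class Encoder(nn.Module):
    """Projects one modality's features into the common embedding space."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, EMB_DIM))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)       # unit-length embeddings

class ModalityDiscriminator(nn.Module):
    """Tries to tell whether an embedding came from music (label 0) or video (label 1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)                # raw logits

music_enc = Encoder(in_dim=512)     # assumed music feature dimension
video_enc = Encoder(in_dim=1024)    # assumed video feature dimension
emotion_clf = nn.Linear(EMB_DIM, N_EMOTIONS)          # shared emotion classifier
disc = ModalityDiscriminator()

gen_params = (list(music_enc.parameters()) + list(video_enc.parameters())
              + list(emotion_clf.parameters()))
opt_g = torch.optim.Adam(gen_params, lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

def train_step(music_x, video_x, emotion_y):
    """One adversarial step on a batch of paired clips that share an emotion label:
    the discriminator learns to separate the modalities, while the encoders learn
    to fool it and to keep emotion information in the common space."""
    z_m, z_v = music_enc(music_x), video_enc(video_x)

    # Discriminator update: music embeddings -> 0, video embeddings -> 1.
    d_logits = torch.cat([disc(z_m.detach()), disc(z_v.detach())])
    d_labels = torch.cat([torch.zeros(len(z_m)), torch.ones(len(z_v))])
    d_loss = F.binary_cross_entropy_with_logits(d_logits, d_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Encoder ("generator") update: fool the discriminator and classify emotion.
    g_logits = torch.cat([disc(z_m), disc(z_v)])
    fool_loss = F.binary_cross_entropy_with_logits(g_logits, 1.0 - d_labels)
    emo_logits = emotion_clf(torch.cat([z_m, z_v]))
    emo_loss = F.cross_entropy(emo_logits, torch.cat([emotion_y, emotion_y]))
    g_loss = fool_loss + emo_loss                     # unweighted sum, for simplicity
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

def retrieve(query_music_x, candidate_video_x, top_k=5):
    """Bidirectional retrieval reduces to nearest-neighbor search in the common
    space; here, videos are ranked for a music query by cosine similarity."""
    with torch.no_grad():
        q = music_enc(query_music_x)                  # (1, EMB_DIM)
        c = video_enc(candidate_video_x)              # (N, EMB_DIM)
        scores = q @ c.t()                            # cosine similarity of unit vectors
    return scores.topk(min(top_k, c.size(0)), dim=-1).indices

In this kind of setup, the discriminator pushes the two encoders toward modality-invariant embeddings while the shared emotion classifier keeps the space organized by affect, so a music query and an emotionally matching video clip end up close together and retrieval becomes a simple similarity ranking.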
Table of Contents:
Cover
Oral presentation document
Chinese version
English version
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iii
List of Tables v
List of Figures vi
Chapter 1. Introduction 1
Chapter 2. Related Work 4
2.1 Tasks Related to Music Retrieval 4
2.2 Tasks Related to Visual Retrieval 5
2.3 Cross-Modal Correlation Learning Methods 6
2.4 Cross-Modal Correlation Learning Methods with GAN-Based Mechanism 8
Chapter 3. Methodology 10
3.1 Overview 10
3.2 Music Emotion Recognition (MER) Model 11
3.3 Video Emotion Recognition (VER) Model 11
3.4 Cross-Modal Generative Adversarial Network (GAN) Mechanism 12
3.4.1 Generative Model 13
3.4.2 Discriminative Model 14
3.5 Objective Functions 15
3.5.1 Discriminative Model 16
3.5.2 Generative Model 17
3.6 Training and Implementation 21
Chapter 4. Experimental Results 23
4.1 Dataset 23
4.2 Evaluation Metrics 25
4.3 Performance Evaluation on a Pre-Trained Network 26
4.3.1 Music Emotion Recognition (MER) Model 27
4.3.2 Video Emotion Recognition (VER) Model 27
4.4 Performance Evaluation on Cross-Modal Network 28
4.4.1 Qualitative Evaluation 30
4.4.2 Quantitative Evaluation 30
4.5 Ablation Studies 31
4.5.1 Performance of MER/VER Model Using Fine-Tuned Data 31
4.5.2 Performance of Pre-Trained Emotion Recognition (MER/VER) Models 32
4.5.3 Objective Functions 32
4.6 Application on Cross-Modal Retrieval 33
4.7 User Study 35
4.7.1 First Stage: Comparison with the Representations from the Pre-Trained Model 35
4.7.2 Second Stage: Comparison with Official Music Videos 36
4.7.3 Third Stage: Analysis of Retrieved Results Using the Common Representations 39
4.7.4 Fourth Stage: Comparison with Related Studies 40
4.8 Visualization 41
4.9 Discussion 43
4.9.1 Data Augmentation 43
4.9.2 Different Numbers of Classes 43
4.9.3 Different MER/VER Models 43
Chapter 5. Conclusion 45
References 46
References:
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Developing a benchmark for emotional analysis of music. PLOS ONE, 12(3):1–22, 2017.
[2] S. Amiriparian, M. Gerczuk, E. Coutinho, A. Baird, S. Ottl, M. Milling, and B. Schuller. Emotion and themes recognition in music utilising convolutional and recurrent neural networks. In Working Notes Proceedings of the MediaEval 2019 Workshop, 2019.
[3] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion in music task at MediaEval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[4] R. Arandjelovic and A. Zisserman. Objects that sound. In The European Conference on Computer Vision (ECCV), September 2018.
[5] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in neural information processing systems, pages 892–900, 2016.
[6] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. Liris-accede: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, 2015.
[7] J. Chao, H. Wang, W. Zhou, W. Zhang, and Y. Yu. Tunesensor: A semantic-driven music recommendation service for digital photo albums. In Proceedings of the 10th International Semantic Web Conference. ISWC2011 (October 2011), 2011.
[8] M. Chmulik, R. Jarina, M. Kuba, and E. Lieskovska. Continuous music emotion recognition using selected audio features. In 2019 42nd International Conference on Telecommunications and Signal Processing (TSP), pages 589–592, 2019.
[9] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
[10] A. J. Cohen. Congruence-association model of music and multimedia: Origin and evolution. The psychology of music in multimedia, pages 17–47, 2013.
[11] E. Coutinho, G. Trigeorgis, S. Zafeiriou, and B. Schuller. Automatically estimating emotion in music with deep long-short term memory recurrent neural networks. In CEUR Workshop Proceedings, volume 1436, 2015.
[12] T. Dahiru. Pvalue, a true test of statistical significance? a cautionary note. Annals of Ibadan postgraduate medicine, 6(1):21–26, 2008.
[13] E. Dellandréa, M. Huigsloot, L. Chen, Y. Baveye, Z. Xiao, and M. Sjöberg. The mediaeval 2018 emotional impact of movies task. 2018.
[14] P. Ekman and W. V. Friesen. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 1969.
[15] M. B. Er and I. B. Aydilek. Music emotion recognition by using chroma spectrogram and deep visual features. International Journal of Computational Intelligence Systems, 12(2):1622–1634, 2019.
[16] F. Eyben, M. Wöllmer, and B. Schuller. Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, pages 1459–1462, 2010.
[17] Y. Fan, X. Lu, D. Li, and Y. Liu. Video-based emotion recognition using cnn-rnn and c3d hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, ICMI '16, page 445–450, New York, NY, USA, 2016. Association for Computing Machinery.
[18] D. Gerónimo and H. Kjellström. Unsupervised surveillance video retrieval based on human action and appearance. In 2014 22nd International Conference on Pattern Recognition, pages 4630–4635. IEEE, 2014.
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[20] J. Gu, J. Cai, S. R. Joty, L. Niu, and G. Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[21] W. Gu, X. Gu, J. Gu, B. Li, Z. Xiong, and W. Wang. Adversary guided asymmetric hashing for cross-modal retrieval. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, ICMR '19, page 159–167, New York, NY, USA, 2019. Association for Computing Machinery.
[22] V. N. Gudivada and V. V. Raghavan. Content based image retrieval systems. Computer, 28(9):18–22, 1995.
[23] R. Gupta and S. S. Narayanan. Predicting affect in music using regression methods on low level features. In MediaEval, 2015.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[25] S. Hong, W. Im, and H. S. Yang. Cbvmr: Content-based video-music retrieval using soft intra-modal structure constraint. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, ICMR '18, page 353–361, New York, NY, USA, 2018. Association for Computing Machinery.
[26] T.-H. Hsieh, L. Su, and Y.-H. Yang. A streamlined encoder/decoder architecture for melody extraction. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 156–160. IEEE, 2019.
[27] S. Koelstra, C. Muhl, M. Soleymani, J.S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras. Deap: A database for emotion analysis; using physiological signals. IEEE transactions on affective computing, 3(1): 18–31, 2011.
[28] B. Kostiuk, Y. M. G. Costa, A. S. Britto, X. Hu, and C. N. Silla. Multi-label emotion classification in music videos using ensembles of audio and video features. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 517–523, 2019.
[29] B. Li, Z. Chen, S. Li, and W.-S. Zheng. Affective video content analyses by using cross-modal embedding learning features. 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 844–849, 2019.
[30] B. Li and A. Kumar. Query by video: Cross-modal music retrieval. In ISMIR, pages 604–611, 2019.
[31] J.-C. Lin, W.-L. Wei, and H.-M. Wang. Automatic music video generation based on emotion-oriented pseudo song prediction and matching. In Proceedings of the 24th ACM International Conference on Multimedia, MM '16, page 372–376, New York, NY, USA, 2016. Association for Computing Machinery.
[32] J.-C. Lin, W.-L. Wei, and H.-M. Wang. Demv-matchmaker: emotional temporal course representation and deep similarity matching for automatic music video generation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2772–2776. IEEE, 2016.
[33] C. Liu, T. Tang, K. Lv, and M. Wang. Multi-feature based emotion recognition for video clips. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI '18, page 630–634, New York, NY, USA, 2018. Association for Computing Machinery.
[34] H. Liu, Y. Fang, and Q. Huang. Music emotion recognition using a variant of recurrent neural network. In 2018 International Conference on Mathematics, Modeling, Simulation and Statistics Application (MMSSA 2018), pages 15–18. Atlantis Press, 2019.
[35] X. Liu, Q. Chen, X. Wu, Y. Liu, and Y. Liu. Cnn based music emotion classification. arXiv preprint arXiv:1704.05665, 2017.
[36] Y. Ma, X. Liang, and M. Xu. Thuhcsi in mediaeval 2018 emotional impact of movies task. 2018.
[37] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
[38] M. Malik, S. Adavanne, K. Drossos, T. Virtanen, D. Ticha, and R. Jarina. Stacked convolutional and recurrent neural networks for music emotion recognition. arXiv preprint arXiv:1706.02292, 2017.
[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
[40] R. Orjesek, R. Jarina, M. Chmulik, and M. Kuba. Dnn based music emotion recognition from raw audio signal. In 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA), pages 1–4. IEEE, 2019.
[41] R. Panda, R. M. Malheiro, and R. P. Paiva. Novel audio features for music emotion recognition. IEEE transactions on affective computing, 2018.
[42] Y. Peng and J. Qi. Cm-gans: Cross-modal generative adversarial networks for common representation learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(1):1–24, 2019.
[43] S. Qiao, R. Wang, S. Shan, and X. Chen. Deep heterogeneous hashing for face video retrieval. IEEE Transactions on Image Processing, 29:1299–1312, 2020.
[44] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
[46] G. Salton. Developments in automatic text retrieval. science, 253(5023):974–980, 1991.
[47] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[48] E. Schubert. Modeling perceived emotion with continuous musical features. Music perception, 21(4):561–585, 2004.
[49] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[50] J. J. Sun, T. Liu, and G. Prasad. Gla in mediaeval 2018 emotional impact of movies task. arXiv preprint arXiv:1911.12361, 2019.
[51] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6398–6407, 2020.
[52] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[53] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
[54] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, MM '17, page 154–162, New York, NY, USA, 2017. Association for Computing Machinery.
[55] H. Wang, D. Sahoo, C. Liu, E.p. Lim, and S. C. Hoi. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11572–11581, 2019.
[56] X. Wu, Y. Qiao, X. Wang, and X. Tang. Bridging music and image via cross-modal ranking analysis. IEEE Transactions on Multimedia, 18(7):1305–1318, 2016.
[57] M. Xu, X. Li, H. Xianyu, J. Tian, F. Meng, and W. Chen. Multi-scale approaches to the MediaEval 2015 "Emotion in Music" task. In MediaEval, 2015.
[58] Y. Yu, S. Luo, S. Liu, H. Qiao, Y. Liu, and L. Feng. Deep attention based music genre classification. Neurocomputing, 372:84–91, 2020.
[59] Z. Yu, X. Xu, X. Chen, and D. Yang. Temporal pyramid pooling convolutional neural network for cover song identification. In IJCAI, pages 4846–4852, 2019.
[60] D. Zeng, Y. Yu, and K. Oyama. Audiovisual embedding for cross-modal music video retrieval through supervised deep cca. In 2018 IEEE International Symposium on Multimedia (ISM), pages 143–150, 2018.
[61] K. Zhang, H. Zhang, S. Li, C. Yang, and L. Sun. The pmemo dataset for music emotion recognition. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, ICMR '18, page 135–142, New York, NY, USA, 2018. Association for Computing Machinery.
[62] L. Zhen, P. Hu, X. Wang, and D. Peng. Deep supervised crossmodal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
Full-Text Access Rights
  • On-campus electronic full-text viewing/printing is authorized; available to the public from 2025-08-15.
  • Off-campus electronic full-text viewing/printing is authorized; available to the public from 2025-08-15.

