Improving Transformer Performance Using Ensemble of Variations of One Model from Layer Permutation Training
Institute of Computer Science and Information Engineering
Ensemble methods combine the outputs of multiple models to improve performance and have achieved significant results in machine learning, including neural networks. Prior research has shown that the costs associated with requiring multiple models, such as training time and parameter count, can be reduced. This thesis focuses on a novel way to apply ensembling that improves performance without increasing the number of parameters. The proposed method operates by creating variations of a single model and ensembling these variations, where the variations are created by changing the order of the layers. Two ways of training such a model are proposed. The proposed method can also be combined with conventional ensemble techniques to further improve performance. In recent years, the Transformer has achieved great success in natural language processing, so the Transformer applied to machine translation was chosen for evaluation. With the same number of parameters, IWSLT 2014 German-to-English and French-to-English translation improved by at least 0.7 BLEU over a single baseline model. For ensembles of 3 and 5 models, the proposed method gained at least 0.3 BLEU overall without additional parameters. By comparison, an ensemble of 5 baseline models improved over an ensemble of 3 baseline models by about 0.42 BLEU, but required 66% more parameters.
Ensemble techniques combine the outputs of multiple models to improve performance. These methods have achieved great success in the field of machine learning, including neural networks. There has been research into reducing the costs associated with requiring multiple models, such as training time and number of parameters. This study focuses on a novel approach to ensembling that does not increase the number of parameters while still offering a performance gain. The proposed method operates by creating variations of a single model and ensembling these variations. The variations are created by changing the order of the layers. Two ways of training are proposed to accommodate this method. The proposed method can also be combined with common ensemble techniques to further improve performance. The Transformer was chosen as the model for this approach, as it has seen great success in natural language processing in recent years. For sentence-level translation on IWSLT 2014 German-to-English and French-to-English, the method improved by at least 0.7 BLEU over the single-model baseline with the same number of parameters. For ensembles of 3 and 5 models, the proposed method gained a minimum of 0.3 BLEU across the board with no additional parameters. For reference, an ensemble of 5 baseline models improved by about 0.42 BLEU over an ensemble of 3 baseline models while requiring 66% more parameters.
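The core idea above, one set of shared layers applied in several permuted orders, with the resulting outputs averaged, can be sketched in a few lines of framework-free Python. This is a minimal illustration, not the thesis implementation: the names `make_layer`, `run_order`, and `permutation_ensemble` are assumptions, and simple affine functions stand in for Transformer layers.

```python
import itertools
import random

def make_layer(weight):
    # Stand-in for one Transformer layer with its own parameters.
    return lambda x: [weight * v + 1.0 for v in x]

def run_order(layers, order, x):
    # One "variation" of the model: the same shared layers,
    # applied in a permuted order.
    for i in order:
        x = layers[i](x)
    return x

def permutation_ensemble(layers, x, num_orders=3, seed=0):
    # Sample a pool of layer orders and average their outputs.
    # Every order reuses the same layers, so the ensemble adds
    # no parameters beyond the single underlying model.
    rng = random.Random(seed)
    all_orders = list(itertools.permutations(range(len(layers))))
    orders = rng.sample(all_orders, num_orders)
    outputs = [run_order(layers, order, list(x)) for order in orders]
    return [sum(vals) / len(vals) for vals in zip(*outputs)]

layers = [make_layer(w) for w in (0.5, 1.0, 2.0)]
print(permutation_ensemble(layers, [1.0, 2.0]))
```

In a real Transformer the averaging would be done over output probability distributions per decoding step, and training must expose the shared layers to the sampled orders so each permutation remains a usable model.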
List of Figures
List of Tables
List of Equations
List of Algorithms
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Literature Review
1.3.1 Related Works
1.3.2 Neural Networks for Machine Translation
1.3.3 Transformer
1.4 Problem
1.5 Main Contribution
Chapter 2 Methods
2.1 Baseline
2.2 Inspiration
2.3 Proposed Method
2.3.1 Generalized Form
2.3.2 Training
2.3.3 Evaluation
2.3.4 Discussion
Chapter 3 Experimental Setup
3.1 Dataset and Evaluation Method
3.2 Data Preprocessing
3.3 Training and Hyperparameters
3.3.1 Correctness of Baseline Implementation
Chapter 4 Experiments and Results
4.1 Main Result
4.1.1 Performance of Individual Orders
4.1.2 Layer Permutation Training
4.2 Combining with Ensemble Techniques
4.3 Hyperparameter Selection
4.3.1 Random Order and Set Order
4.3.2 Pool Size
Chapter 5 Conclusion