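System ID U0026-2307202011350500
Title (Chinese) 以降維分析 DNA 甲基化與肺癌的關聯性
Title (English) Extracting and Analyzing Latent Space from Lung Cancer DNA Methylation Data with Dimensionality Reduction
University National Cheng Kung University
Department (Chinese) 資訊工程學系
Department (English) Institute of Computer Science and Information Engineering
Academic year 108 (ROC calendar; 2019-2020)
Semester 2
Year of publication 109 (ROC calendar; 2020)
Graduate student (Chinese) 林晉昇
Graduate student (English) Chin-Sheng Lin
Student ID P76071381
Degree Master's
Language English
Number of pages 40
Oral defense committee Advisor: 賀保羅
Committee member: 蔣榮先
Committee member: 黃則達
Keywords (Chinese) 肺癌  腫瘤  DNA 甲基化  降維  分類
Keywords (English) Lung Cancer  Tumor  DNA Methylation  Dimensionality Reduction  Classification
Subject classification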
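Abstract (Chinese) Lung cancer is the leading cause of cancer death in Taiwan and occurs mostly in people over 55. Based on its biological characteristics and clinical presentation, it is divided into two main types, small-cell lung cancer and non-small-cell lung cancer. Lung adenocarcinoma belongs to the latter, and nearly 70% of lung cancers are now lung adenocarcinoma. In recent decades, many studies have found abnormal gene methylation in lung cancer patients, and some methylation events have been found to be associated with poor prognosis or tumor recurrence. This study investigates the association between DNA methylation and lung cancer, analyzes the differences between tumor and non-tumor samples, and combines dimensionality reduction and classification methods on DNA methylation data to predict tumor status.
We analyze the DNA methylation data of lung adenocarcinoma patients from The Cancer Genome Atlas (TCGA) to characterize the differences between tumor and non-tumor samples and to identify associations between methylation and lung cancer. The study has two parts. The first part uses DNA methylation data to predict a patient's tumor status: we first perform statistical feature selection to pick the methylation sites that best separate tumor from non-tumor samples, then use dimensionality reduction to bring the methylation data down to a lower dimension, and finally train a classifier to predict each sample's tumor status. The second part analyzes the differences between tumor and non-tumor samples, including the distribution of methylation in the two conditions.
In the experiments, we use the DNA methylation data of 507 patients from the public TCGA lung adenocarcinoma dataset, comprising 475 tumor and 32 non-tumor samples. We use statistical methods to select methylation sites that distinguish tumor from non-tumor samples, and confirm that some of these sites are indeed associated with lung cancer or certain other diseases. Applying dimensionality reduction and classification to the DNA methylation data predicts tumor status with high accuracy. Further analysis of the methylation differences between tumor and non-tumor samples confirms that DNA methylation is strongly associated with lung cancer.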
Abstract (English) Lung cancer is the leading cause of cancer death in Taiwan and mostly affects people over 55 years old. According to its biological characteristics and clinical presentation, it is classified into two categories: small-cell lung carcinoma and non-small-cell lung carcinoma. Lung adenocarcinoma belongs to the latter, and nearly 70% of lung cancers are lung adenocarcinoma. In recent decades, many researchers have found that patients with lung cancer show abnormal gene methylation, and some methylation events have been found to be related to poor prognosis or tumor recurrence. In this study, we explore the relationship between DNA methylation and lung cancer, analyze the differences between tumor and normal samples, and combine dimensionality reduction and classification methods on DNA methylation data to predict tumor status.
We analyze the DNA methylation data of lung adenocarcinoma patients in The Cancer Genome Atlas (TCGA), compare tumor and normal samples, and identify relationships between DNA methylation and lung cancer. Our research is divided into two parts. The first part uses DNA methylation data to predict a patient's tumor status. We first use statistical methods to select the features that best distinguish tumor samples from normal samples, then use dimensionality reduction techniques to reduce the methylation data to a lower dimension, and finally train a classifier to predict the tumor status of each sample. In the second part, we analyze the differences between tumor and normal samples and discuss the distribution of DNA methylation in both cases.
In the experiments, we use the DNA methylation data of 507 patients from a public TCGA lung adenocarcinoma dataset, comprising 475 tumor and 32 normal samples. We use statistical methods to select methylation sites that distinguish tumor from normal samples, and confirm that some of these sites are indeed related to lung cancer or to certain other diseases. We also reduce the dimensionality of the DNA methylation data and use machine learning to predict tumor status, obtaining high accuracy. We then further analyze the differences in DNA methylation between tumor and normal samples, and confirm that DNA methylation and lung cancer are highly related.
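The first part of the study (statistical feature selection, then dimensionality reduction, then a classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the data are synthetic beta values rather than TCGA data, Welch's t statistic stands in for the statistical test, a single principal component computed by power iteration stands in for the dimensionality-reduction step, and a nearest-centroid rule stands in for the trained classifier. All function names are hypothetical.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )


def select_top_k(X, y, k):
    """Return indices of the k columns with the largest |t| between classes."""
    scores = []
    for j in range(len(X[0])):
        tumor = [row[j] for row, label in zip(X, y) if label == 1]
        normal = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(tumor, normal)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def first_principal_component(X, iters=200):
    """Column means and unit-length leading principal axis (power iteration)."""
    n, d = len(X), len(X[0])
    means = [statistics.fmean([row[j] for row in X]) for j in range(d)]
    centered = [[x - m for x, m in zip(row, means)] for row in X]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Multiply v by (centered^T centered) without forming the matrix.
        scores = [sum(c * w for c, w in zip(row, v)) for row in centered]
        v = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return means, v


def project(row, means, v):
    """1-D latent coordinate of one sample along the principal axis."""
    return sum((x - m) * w for x, m, w in zip(row, means, v))


def synthetic_sample(label, n_cpgs=30, n_informative=5):
    """Synthetic beta values in [0, 1]; only the first CpGs differ by class."""
    shifted = 0.7 if label == 1 else 0.3
    mus = [shifted if j < n_informative else 0.5 for j in range(n_cpgs)]
    return [min(1.0, max(0.0, random.gauss(mu, 0.05))) for mu in mus]


random.seed(0)
y = [1] * 40 + [0] * 10  # 1 = tumor, 0 = normal (assumed toy class sizes)
X = [synthetic_sample(label) for label in y]

# Step 1: keep the CpG columns that best separate the two classes.
top = select_top_k(X, y, k=5)
X_sel = [[row[j] for j in top] for row in X]

# Step 2: reduce the selected columns to a 1-D latent coordinate.
means, v = first_principal_component(X_sel)
z = [project(row, means, v) for row in X_sel]

# Step 3: nearest-centroid rule in the latent space (stand-in classifier).
c_tumor = statistics.fmean([zi for zi, t in zip(z, y) if t == 1])
c_normal = statistics.fmean([zi for zi, t in zip(z, y) if t == 0])
pred = [1 if abs(zi - c_tumor) < abs(zi - c_normal) else 0 for zi in z]

# In-sample accuracy only: this checks the mechanics, not generalization.
accuracy = sum(p == t for p, t in zip(pred, y)) / len(y)
print("selected CpG columns:", sorted(top))
print("in-sample accuracy:", accuracy)
```

Because the features are selected and the classifier is fitted on the same samples that are scored, the printed accuracy says nothing about held-out performance. A real evaluation would repeat the selection inside each training fold.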
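With 475 tumor and 32 normal samples, the classes are strongly imbalanced, so it can help to compare any reported accuracy against a trivial reference point. The short calculation below uses only the sample counts stated in the abstract. The function name `balanced_accuracy` is illustrative, and nothing here reproduces the thesis's actual results or evaluation protocol.

```python
n_tumor, n_normal = 475, 32
n_total = n_tumor + n_normal

# Accuracy of a rule that labels every sample as tumor.
majority_baseline = n_tumor / n_total


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (tumor recall) and specificity (normal recall)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# The same always-tumor rule scored with balanced accuracy.
always_tumor = balanced_accuracy(tp=n_tumor, fn=0, tn=0, fp=n_normal)

print("samples:", n_total)
print("majority-class accuracy:", round(majority_baseline, 3))
print("balanced accuracy of the same rule:", always_tumor)
```

A rule that ignores the data therefore already scores about 0.937 on plain accuracy but only 0.5 on balanced accuracy, which is why per-class metrics are often reported alongside accuracy on splits like this one.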
Table of Contents Abstract (Chinese) I
Abstract III
Acknowledgements VI
Contents VII
List of Figures IX
List of Tables X
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Research Objectives 3
Chapter 2 Related Work 4
2.1 DNA Methylation and Lung Cancer 4
2.2 Dimensionality Reduction on Biological Data 4
Chapter 3 Methods 6
3.1 Dataset 6
3.2 Feature Selection 8
3.3 Dimensionality Reduction 10
3.3.1 Variational Autoencoder 10
3.3.2 Principal Component Analysis 12
3.3.3 Independent Component Analysis 12
3.3.4 Non-negative Matrix Factorization 13
3.4 Classification 13
Chapter 4 Experiments and Results 15
4.1 Feature Selection 15
4.2 Classification 16
4.3 CpG Beta Value Distribution 19
4.4 Latent Space Visualization 22
4.4.1 T-test Latent Space Visualization 22
4.4.2 PCA Latent Space Visualization 23
4.4.3 NMF Latent Space Visualization 25
4.4.4 Outlier Discussion 26
4.5 Location of CpGs 27
4.6 Chromosomes 31
4.6.1 CpGs in Chromosomes 31
4.6.2 Top T-Test CpGs Discussion 32
4.6.3 Top T-Test Genomes 35
Chapter 5 Conclusions and Future Work 36
5.1 Conclusions 36
5.2 Future Work 37
References 38
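Full-text access permissions
  • On-campus browsing and printing of the electronic full text is authorized, publicly available from 2020-07-30.
  • Off-campus browsing and printing of the electronic full text is authorized, publicly available from 2020-07-30.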