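Chinese keywords (English translation)
biomedical data mining
clustering analysis
biclustering
gene expression analysis
fuzzy association rules
genetic algorithm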

English keywords
biomedical data mining
clustering
biclustering
gene expression analysis
fuzzy association rule mining
genetic algorithm

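Chinese abstract (English translation)
In recent years, the demand for and importance of knowledge discovery from biomedical data have grown steadily. For biologists, an analysis typically starts from a few targets of interest (for example, disease-related genes), proceeds to identify further related biological targets, and then moves on to much larger volumes of data. Among the various data mining techniques, clustering analysis is one of the most important and is widely used for data analysis in biology, medicine, and related fields, for example in gene expression microarray analysis. In this dissertation, we propose three clustering-based mining methods for different analysis needs: query-driven fuzzy biclustering, integrated clustering, and fuzzy association rule mining, with the aim of offering biologists mining analyses that scale from small to large.
First, to find biclusters in microarray data whose expression is similar to that of a gene of interest to the user, we propose the Weighted Fuzzy-based Maximum Similarity Biclustering method (WFMSB). The user supplies one gene of interest as the reference gene, and the method finds other genes whose expression is similar to that of the reference gene over a subset of the samples. Unlike conventional biclustering methods, it uses fuzzy theory to extract biclusters at different similarity levels; in particular, it can extract both the bicluster most similar and the bicluster most dissimilar to the reference gene. Gene Ontology annotations confirm that the genes within the same bicluster are highly related biologically. Experiments on simulated and real microarray data show that WFMSB yields results with more significant biological meaning and expression signals than other methods.
Second, to analyze microarray data of different types at the same time, we propose a method that integrates time-series and categorical (for example, drug-treated versus untreated samples) gene expression microarray data, called the mixture of Time-series and Group-comparative analysis algorithm (TGmix). With it we can determine which genes show similar expression patterns in both the time-series part and the categorical part. The method works as follows. For each gene, the time-series and categorical expression data are combined into one integrated expression profile. We then propose a new similarity measure for two integrated profiles and use a density-based clustering algorithm to partition the integrated profiles into groups. Finally, a filtering step selects the gene sets with the most significant associations. In experiments on real microarray data from rat oral wound healing, TGmix finds many biologically meaningful results.
Finally, we propose a cluster-based fuzzy association rule mining algorithm, the Cluster-based Divide-and-conquer Genetic-Fuzzy approach with Multiple Minimum Supports (CDGFMMS). It uses a genetic algorithm, clustering analysis, and fuzzy theory to analyze transaction-type biomedical data and to find, for each biomedical item, the best minimum support threshold and membership functions, together with the fuzzy association rules. Experiments confirm that CDGFMMS finds the most suitable minimum support thresholds, membership functions, and fuzzy association rules for the items, and that it reduces execution time substantially compared with other methods.
Overall, this dissertation offers biologists several mining algorithms for different analysis needs. Validation experiments on simulated and real data show that the proposed algorithms successfully solve the biomedical data analysis problems described above.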

English abstract
The importance of discovering knowledge from biomedical data has grown rapidly in recent years. In general, the analysis of biomedical data proceeds from the study of a few targeted biomarkers, such as disease-related genes, to the analysis of relationships among a large number of biomarkers. Among the various data mining techniques, clustering analysis is one of the most important methods applied to biomedical problems, such as gene expression microarray analysis. In this dissertation, we propose three clustering-based mining algorithms with different analysis purposes, namely query-driven fuzzy biclustering, integrated clustering, and fuzzy association rule mining, which allow biologists to investigate anything from a single biomarker to a large number of them.
First, we propose a query-driven fuzzy biclustering (or co-clustering) algorithm, Weighted Fuzzy-based Maximum Similarity Biclustering (WFMSB), which extracts biclusters at different similarity levels relative to a user-defined reference gene. In particular, it can extract both the bicluster most similar and the bicluster most dissimilar to the reference gene, and both are functionally meaningful according to Gene Ontology (GO) annotations. In experiments on simulated and real gene expression data sets, WFMSB substantially outperformed previous query-driven biclustering methods, in the sense that its biclusters contained more significant expression signals.
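The query-driven idea can be illustrated with a small sketch. It is not the WFMSB algorithm itself: the abstract does not give the weighting or the fuzzy similarity levels, so the code below assumes a simple per-condition similarity to the reference gene and a greedy removal of the weakest gene or condition. The names `similarity_matrix` and `greedy_bicluster` and the parameters `tol` and `threshold` are hypothetical.

```python
def similarity_matrix(data, ref, tol):
    """s[g][c] in [0, 1]: 1 when gene g equals the reference at condition c, 0 beyond tol."""
    ref_row = data[ref]
    return {g: [max(0.0, 1 - abs(v - r) / tol) for v, r in zip(row, ref_row)]
            for g, row in data.items()}

def greedy_bicluster(data, ref, tol=1.0, threshold=0.8):
    """Assumed stand-in: drop the weakest gene or condition until every
    remaining row mean and column mean reaches the threshold."""
    s = similarity_matrix(data, ref, tol)
    genes = list(data)
    conds = list(range(len(data[ref])))
    while len(genes) > 1 and len(conds) > 1:
        # The reference gene is never a removal candidate.
        row_mean = {g: sum(s[g][c] for c in conds) / len(conds)
                    for g in genes if g != ref}
        col_mean = {c: sum(s[g][c] for g in genes) / len(genes) for c in conds}
        worst_g = min(row_mean, key=row_mean.get)
        worst_c = min(col_mean, key=col_mean.get)
        if min(row_mean[worst_g], col_mean[worst_c]) >= threshold:
            break
        if row_mean[worst_g] <= col_mean[worst_c]:
            genes.remove(worst_g)
        else:
            conds.remove(worst_c)
    return genes, conds

# Toy expression matrix (made-up values): g1 and g2 track the reference
# gene on the first three conditions only; g3 does not track it at all.
data = {
    "ref": [1.0, 2.0, 3.0, 9.0],
    "g1": [1.1, 2.0, 3.1, 2.0],
    "g2": [1.0, 2.2, 2.9, 5.0],
    "g3": [7.0, 8.0, 0.0, 4.0],
}
genes, conds = greedy_bicluster(data, "ref")
print(genes, conds)
```

Raising or lowering `threshold` plays the role of a similarity level here; a crisp cutoff is used only to keep the sketch short.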
Second, we propose an integrated approach, called mixture of Time-series and Group-comparative analysis (TGmix), that works on both time-series and two-group comparative (drug-treated versus untreated samples) microarray datasets to find significant genes that are co-expressed in the time-series part and, at the same time, differentially expressed in the group-based part. For each gene, the time-series profile and the two-group comparative profile are combined into an integrated gene profile. A novel similarity measure is proposed to calculate the similarity between two integrated gene profiles. A density-based clustering algorithm is then used to group co-expressed genes into the same cluster. Finally, a filtering process selects the significant gene sets. In experiments on rat wound healing microarray datasets, TGmix was effective in finding biologically meaningful gene clusters.
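The pipeline described above (integrated profile, pairwise similarity, density-based clustering) can be sketched as follows. The similarity measure proposed in the dissertation is not given in the abstract, so the `similarity` function here is an assumed stand-in that mixes a rescaled Pearson correlation with the closeness of the group mean differences. The clustering step is a minimal DBSCAN; the abstract only says "density-based". The filtering step is omitted, and all function names and parameters are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def integrated_profile(time_series, treated, untreated):
    """Combine a time-series profile with a two-group summary (mean difference)."""
    return {"ts": list(time_series), "diff": mean(treated) - mean(untreated)}

def similarity(p, q, alpha=0.5):
    """Assumed stand-in similarity in [0, 1]; not the dissertation's measure."""
    ts_sim = (pearson(p["ts"], q["ts"]) + 1) / 2       # rescale correlation to [0, 1]
    diff_sim = 1 / (1 + abs(p["diff"] - q["diff"]))    # 1 when group differences match
    return alpha * ts_sim + (1 - alpha) * diff_sim

def dbscan(profiles, eps, min_pts):
    """Minimal DBSCAN with distance = 1 - similarity; label -1 marks noise."""
    n = len(profiles)
    def dist(i, j):
        return 1 - similarity(profiles[i], profiles[j])
    neighbors = [[j for j in range(n) if dist(i, j) <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbors[j])
    return labels

# Made-up profiles: three genes rise over time and are up in treated
# samples; the fourth falls over time and is down in treated samples.
profiles = [
    integrated_profile([1, 2, 3, 4], [5, 5], [2, 2]),
    integrated_profile([2, 4, 6, 8], [6, 6], [3, 3]),
    integrated_profile([1.1, 2.0, 3.2, 3.9], [5, 5.2], [2, 2.1]),
    integrated_profile([4, 3, 2, 1], [1, 1], [4, 4]),
]
print(dbscan(profiles, eps=0.1, min_pts=2))
```

With these toy values the first three genes fall into one cluster and the fourth is labeled as noise, since it has no neighbor within `eps`.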
Finally, we propose an efficient cluster-based fuzzy association rule mining algorithm, called Cluster-based Divide-and-conquer Genetic-Fuzzy approach with Multiple Minimum Supports (CDGFMMS), for discovering associated items in biomedical data. CDGFMMS combines a genetic algorithm (GA), a clustering technique, and fuzzy concepts to discover suitable minimum supports, membership functions, and useful fuzzy association rules from quantitative transactions. In experiments, CDGFMMS was more efficient than existing algorithms.
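To make the terms concrete, the sketch below computes fuzzy supports from quantitative transactions using triangular membership functions and checks each item against its own minimum support. In CDGFMMS the membership functions and minimum supports are discovered by the genetic algorithm; here they are fixed by hand, and the item names, region shapes, and thresholds are made-up illustrative values. The helper names are hypothetical.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_supports(quantities, regions):
    """Fuzzy support of each linguistic region: mean membership over all transactions."""
    n = len(quantities)
    return {name: sum(triangular(q, *abc) for q in quantities) / n
            for name, abc in regions.items()}

def frequent_regions(transactions, regions, min_support):
    """Keep (item, region) pairs whose fuzzy support reaches that item's own threshold."""
    result = {}
    for item, threshold in min_support.items():
        quantities = [t.get(item, 0) for t in transactions]
        for name, sup in fuzzy_supports(quantities, regions[item]).items():
            if sup >= threshold:
                result[(item, name)] = sup
    return result

# Hand-picked membership functions (assumption: the same three regions for both items).
shapes = {"Low": (0, 2, 5), "Middle": (2, 5, 8), "High": (5, 8, 11)}
regions = {"A": shapes, "B": shapes}
min_support = {"A": 0.4, "B": 0.2}   # one threshold per item
transactions = [{"A": 2, "B": 1}, {"A": 4, "B": 1}, {"A": 5}, {"A": 8, "B": 3}]

print(frequent_regions(transactions, regions, min_support))
```

In this toy data only `("A", "Middle")` and `("B", "Low")` pass. The region `("A", "Low")` has support 1/3, which would pass item B's threshold of 0.2 but fails item A's threshold of 0.4, which is the point of keeping a separate minimum support per item.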
In summary, we propose a set of clustering-based algorithms with different mining purposes for the analysis of biomedical data. Performance evaluations on simulated and real datasets show that the proposed methods successfully address the targeted problems in biomedical data mining.

