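System ID U0026-0812200914073086
Title (Chinese) 應用非對稱性分類分析改進少數類別的分類正確率-以通聯紀錄為例
Title (English) Use of Skewed Classification Analysis to Improve the Accuracy Ratio for Minority Classification: Exemplified by Call Detail Record
Institution National Cheng Kung University
Department (Chinese) Department of Electrical Engineering, Master's and Doctoral Program
Department (English) Department of Electrical Engineering
Academic year 96 (ROC calendar, 2007-2008)
Semester 1
Year of publication 97 (ROC calendar, 2008)
Author (Chinese) 張博淵
Author (English) Bor-Yuan Chang
Student ID N2691195
Degree Master's
Language English
Pages 86
Oral examination committee Examiner-鄭宇庭
Advisor-焦惠津
Examiner-斯國峰
Examiner-謝邦昌
Keywords (Chinese) Logistic Regression  Neural Network  Data Mining  Anomaly Detection  Decision Tree
Keywords (English) Fraud Detection  Data Mining  Logistic Regression  Decision Tree  Neural Network
Subject classification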
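Abstract (Chinese) Data mining has been applied widely across many fields: wherever a field has a data warehouse or database that is worth analyzing and for which analysis is needed, mining tools can be used for purposeful analysis. Call detail records are records of calls between telecom subscribers. In criminal investigation, analyzing them helps assess a suspect's associations, daily routine, areas of activity, and likelihood of involvement in a case. However, the essentials of call record analysis take a long time to learn before an analyst becomes familiar with the relevant techniques. This thesis therefore seeks to combine the rules of thumb accumulated by investigators with data mining techniques to build an anomalous-call analysis model that quickly picks out, from large and complex call data, the key numbers used by a small number of important persons, and then derives highly valuable anomalous call patterns from the call records of those key numbers. Cross-analysis and comparison against these patterns can then quickly identify the phone numbers of the few important targets, supporting investigators' case assessment and indicating which records to request. This not only saves record-request fees and avoids wasting public funds, but also helps investigators speed up the investigation.
To verify the feasibility of the proposed method, we use real call data from closed cases to build a composite model and evaluate its prediction accuracy and stability. Because data mining offers many classification tools, each with its own strengths and weaknesses, we examined the characteristics of the data in this study and chose four tools: C5.0, CART, neural networks, and logistic regression. Combined with random sampling at different proportions, each was used to build a single discriminant model; those with better predictive ability were then selected to build the composite model so as to improve prediction precision. After repeated testing and evaluation, this thesis proposes a composite model that integrates the C5.0 decision tree and a neural network, which effectively improves the precision of predicting the few key numbers.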
Abstract (English) Data mining has been widely applied in many domains. Wherever a domain has a data warehouse or database that is worth analyzing and needs analysis, a mining tool can be used to carry out a targeted analysis. A CDR (Call Detail Record) is a record of communication between telecom users. In a criminal investigation, CDRs can be used to analyze a suspect's social relationships, daily routine, areas of activity, and likelihood of involvement in a crime. However, CDR analysis takes long practice before personnel become familiar with the relevant investigative skills. This research therefore aims to establish a fraud CDR analysis model that combines the experience accumulated by investigators with data mining technology, in order to quickly identify the phone numbers used by a few important suspects within a large and varied body of call data and to derive the most valuable fraud call patterns from the records of those numbers. These fraud call patterns can later be used for cross-analysis and comparison to quickly identify the phone numbers of the few important suspects, so that investigators can assess the case and decide which records to request. This would not only reduce the cost of requesting call records and avoid wasting public money, but also help investigators speed up the investigation.
To test the feasibility of the proposed method, we use call data obtained after cases were closed to construct a multiple (composite) model and evaluate its prediction accuracy and stability. Because there are many classification tools for data mining, each with its own advantages and disadvantages, we examined the characteristics of the data and chose four: C5.0, CART, Back Propagation Neural Network, and Logistic Regression. Each was combined with random sampling at different proportions to build a single discriminant model, and the models with better predictive capability were selected to build the multiple model in order to raise prediction accuracy. After iterative testing and evaluation, the research combines the C5.0 decision tree and the Back Propagation Neural Network into a multiple model that improves the accuracy of predicting the few key phone numbers of important suspects.
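The abstract mentions random sampling at different proportions (for example 1:1) and a focus on the minority class, but it does not spell out the procedure. The following is a minimal sketch of one common reading: keep every minority record and randomly undersample the majority to a chosen ratio, then measure minority-class recall and precision rather than overall accuracy. The function names, the `number_id` field, and the toy data are illustrative assumptions, not taken from the thesis.

```python
import random
from collections import Counter

def undersample(rows, labels, ratio=1.0, minority=1, seed=0):
    """Keep every minority row and a random subset of majority rows.

    ratio is the desired number of majority rows per minority row,
    so ratio=1.0 corresponds to 1:1 sampling (an assumed interpretation).
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Never ask for more majority rows than exist.
    keep = min(len(majority_idx), round(ratio * len(minority_idx)))
    chosen = minority_idx + rng.sample(majority_idx, keep)
    rng.shuffle(chosen)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]

def minority_metrics(y_true, y_pred, minority=1):
    """Return (recall, precision) for the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority and p == minority)
    fn = sum(1 for t, p in pairs if t == minority and p != minority)
    fp = sum(1 for t, p in pairs if t != minority and p == minority)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Hypothetical toy data: 1000 phone numbers, 20 of them "key" numbers (label 1).
labels = [1] * 20 + [0] * 980
rows = [{"number_id": i} for i in range(1000)]

balanced_rows, balanced_labels = undersample(rows, labels, ratio=1.0)
counts = Counter(balanced_labels)

# A model that always predicts the majority class looks accurate on the
# full data yet finds none of the key numbers.
always_majority = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_majority)) / len(labels)
recall, precision = minority_metrics(labels, always_majority)
```

On this toy data the always-majority baseline is correct for 980 of 1000 records while its minority recall is zero, which is why a skewed problem of this kind is usually judged on minority-class measures such as those in a classification matrix.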
Table of contents 1 Introduction
1.1 Motivation
1.2 Objective
1.3 Research Scope
1.4 Research Limitations
1.5 Thesis Organization
2 Literature Review
2.1 The Literatures about the Applications of Call Detail Record
2.2 Data Mining
2.3 CRISP-DM
2.3.1 Business Understanding
2.3.2 Data Understanding
2.3.3 Data Preparation
2.3.4 Modeling
2.3.5 Evaluation
2.3.6 Deployment
2.4 Fraud
2.5 Modeling Methodology
2.5.1 Decision Trees
2.5.2 Neural Network
2.5.3 The Logistic Regression
3 Research Design and Research Method
3.1 Research Process and Frame
3.2 Data Source and Variable Explain
3.3 Data Preparation
3.3.1 Data Transform
3.3.2 Selecting the Available Variable
3.4 Sampling Design
3.5 Modeling
3.6 Evaluation
3.6.1 Lift Chart
3.6.2 Classification Matrix
3.7 Research Equipments and Tools
4 Experiment Result and Analysis
4.1 Sampling Design
4.2 Prediction Model by Neural Network Algorithm
4.3 Decision Tree
4.3.1 Prediction Model by C5.0 Algorithm
4.3.2 Prediction Model by CART Algorithm
4.4 Prediction Model by Logistic Regression Algorithm
4.5 Comprehensive Comparison of Prediction Results for 1:1 Random Sampling Models
4.6 Build Multiple Model to Raise Prediction Accuracy
4.7 Comprehensive Comparison of Prediction Results for Models by Random Sampling at Different Proportions
4.8 Comparison of the Prediction Results from the Multiple Model Established by Unifying 2 More Algorithms
4.9 Prediction Result from Models Without Using Biased Sampling Method
4.10 Prediction Results from Models Established under Original Call Data
4.11 EXCEL 2007-Assisted Call Data Analysis
5 Conclusion and Suggestions for Future Work
5.1 Conclusion
5.2 Contribution
5.3 Future Work

A C5.0-generated ruleset (only a few examples cited)
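Full-text access rights
  • Authorized for on-campus browsing/printing of the electronic full text, available from 2010-02-13.
  • Authorized for off-campus browsing/printing of the electronic full text, available from 2018-02-13.

  • If you have any questions, please contact the library.
    Phone: (06)2757575#65773
    E-mail: etds@email.ncku.edu.tw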