Generating Multi-modal Virtual Samples to Assess Product Lifetime Performance for Small Data Sets
Department of Industrial and Information Management
Small data set
Virtual sample generation
Virtual sample size
Several studies have used virtual sample generation approaches to enhance learning performance on small data sets. Appropriate estimation of the data distribution plays an important role in this process, and the resulting performance is usually better for data sets with a simple distribution than for those with a complex one. Mixed-type data sets, however, often have a multi-modal distribution rather than a simple, uni-modal one. To address this problem, this study assumes that a data set follows a two-parameter Weibull distribution and proposes the Maximal P-Value method to estimate the two Weibull parameters, so that a nonlinear and asymmetric distribution can be constructed for small data. The study also proposes a new approach for detecting multi-modality in data sets, which avoids inappropriately fitting a uni-modal distribution. The approach uses the common k-means clustering method to detect possible clusters and then estimates a Weibull variate for each clustered sample set to produce multi-modal virtual data. The degree of error variation in the Weibull skewness between the original and virtual data is measured and used as the criterion for determining the virtual sample size. Simulated data sets and two practical examples are provided to demonstrate that the Maximal P-Value method is a more appropriate technique for increasing the estimation accuracy of the data distribution with small sample sizes. In addition, six data sets with different training data sizes are used to check the performance of the proposed method, with comparisons based on classification accuracy. Finally, non-parametric tests on the experimental results show that the proposed method achieves better classification performance than the Mega-Trend-Diffusion method.
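The pipeline summarized in the abstract (cluster the data, fit a two-parameter Weibull distribution to each cluster, then draw virtual values by inverting the fitted CDF) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the thesis: the function names are hypothetical, the data are one-dimensional, the parameter fit uses ordinary least-squares on median ranks (Benard's approximation) as a stand-in for the Maximal P-Value method, and the number of virtual samples is a plain argument rather than the skewness-based criterion.

```python
import math
import random


def kmeans_1d(values, k=2, iters=50):
    """Plain Lloyd's k-means on scalars; returns the non-empty clusters."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return [c for c in clusters if c]


def fit_weibull_lse(sample):
    """Least-squares fit of (shape, scale) on the linearized Weibull CDF.

    Assumption: this replaces the Maximal P-Value method, which is not
    reproduced here. Needs at least two distinct positive values.
    """
    xs = sorted(sample)
    n = len(xs)
    if n < 2 or xs[0] <= 0 or xs[0] == xs[-1]:
        raise ValueError("need at least two distinct positive values")
    # Benard's approximation of the median rank of the i-th order statistic.
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    u = [math.log(x) for x in xs]
    w = [math.log(-math.log(1.0 - f)) for f in ranks]
    u_bar, w_bar = sum(u) / n, sum(w) / n
    sxx = sum((a - u_bar) ** 2 for a in u)
    sxy = sum((a - u_bar) * (b - w_bar) for a, b in zip(u, w))
    shape = sxy / sxx                          # slope of w on u
    scale = math.exp(u_bar - w_bar / shape)    # from intercept = -shape*ln(scale)
    return shape, scale


def weibull_inverse_cdf(p, shape, scale):
    """Inversion method: solve F(x) = 1 - exp(-(x/scale)**shape) = p for x."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def generate_virtual_samples(values, n_virtual, k=2, seed=0):
    """Draw n_virtual values from a mixture of per-cluster Weibull fits."""
    rng = random.Random(seed)
    clusters = kmeans_1d(values, k)
    fits = [fit_weibull_lse(c) for c in clusters]
    weights = [len(c) for c in clusters]  # mix in proportion to cluster size
    virtual = []
    for _ in range(n_virtual):
        shape, scale = rng.choices(fits, weights=weights)[0]
        virtual.append(weibull_inverse_cdf(rng.random(), shape, scale))
    return virtual


if __name__ == "__main__":
    # Made-up bimodal data, used only to exercise the sketch.
    data = [1.1, 1.4, 0.9, 1.7, 1.3, 9.8, 10.5, 11.2, 10.1, 9.4]
    print(generate_virtual_samples(data, n_virtual=5))
```

Weighting the mixture by cluster size is one simple design choice; it keeps the proportion of virtual values near each mode close to that of the original sample.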
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
1.1 Research Background
1.2 Research Motivation
1.3 Research Purposes
1.4 Research Structure
2. LITERATURE REVIEW
2.1 Related Studies
2.1.1 Virtual Sample Generation
2.1.2 The Mega-Trend-Diffusion Method
2.1.3 Least-squares Estimation for a Weibull Distribution
2.1.4 The Lifetime Performance Testing Procedure
2.2 Modality Tests
2.2.1 The Dip Test
2.2.2 The Excess Mass Test
2.3 Related Techniques for Clustering and Classification
2.3.1 K-means Clustering
2.3.2 Linear Discriminant Analysis
2.3.3 K-nearest Neighbors
2.3.4 Support Vector Machine
3. METHODOLOGY
3.1 The Scheme for Virtual Sample Generation
3.2 The Maximal P-Value Method
3.3 The Proposed Modality Test
3.3.1 The Relationship between PDF and CDF
3.3.2 The Procedure of Modality Test
3.4 The Decision of Virtual Sample Size
3.5 Multi-modal Virtual Sample Generation
3.5.1 Virtual Sample Generation
3.5.2 The Inversion Method
3.5.3 K-modality Selection for Attributes
3.6 The Detailed Steps of the Proposed Method
4. EXPERIMENTS
4.1 The Performance of Maximal P-Value Method
4.1.1 Simulated Data Sets
4.1.2 Two Types of Real Numerical Data
4.1.3 Experimental Results
4.2 The Six Data Sets
4.3 An Example of the Proposed Method
4.4 The Experiment Design
4.5 The Results for the Selection of Classifiers
4.6 The Results of the Experiment to Compare Methods
4.7 Summary
5. CONCLUSIONS AND SUGGESTIONS
5.1 Conclusions
5.2 Suggestions