TY - GEN
T1 - Estimating false negatives for classification problems with cluster structure
AU - Simon, György J.
AU - Kumar, Vipin
AU - Zhang, Zhi Li
PY - 2007
Y1 - 2007
N2 - Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy specified by the user. The basic idea is to cluster the data and divide the clusters into a set of "strata", such that the proportion of positive instances in the stratum is very low, very high or in between, respectively. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm - referred to as class-focused partitioning - which uses the (imperfect) labels predicted by the classifier as additional guidance. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.
AB - Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy specified by the user. The basic idea is to cluster the data and divide the clusters into a set of "strata", such that the proportion of positive instances in the stratum is very low, very high or in between, respectively. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm - referred to as class-focused partitioning - which uses the (imperfect) labels predicted by the classifier as additional guidance. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.
KW - Clustering
KW - False negatives estimation
KW - Random sampling
UR - http://www.scopus.com/inward/record.url?scp=70449106940&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70449106940&partnerID=8YFLogxK
U2 - 10.1137/1.9781611972771.68
DO - 10.1137/1.9781611972771.68
M3 - Conference contribution
AN - SCOPUS:70449106940
SN - 9780898716306
T3 - Proceedings of the 7th SIAM International Conference on Data Mining
SP - 599
EP - 604
BT - Proceedings of the 7th SIAM International Conference on Data Mining
PB - Society for Industrial and Applied Mathematics Publications
T2 - 7th SIAM International Conference on Data Mining
Y2 - 26 April 2007 through 28 April 2007
ER -