Semi-supervised approach to rapid and reliable labeling of large data sets

Research output: Chapter in Book/Report/Conference proceedingConference contribution

5 Scopus citations

Abstract

In this paper, we propose a method, where the labeling of the data set is carried out in a semi-supervised manner with user-specified guarantees about the quality of the labeling. In our scheme, we assume that for each class, we have some heuristics available, each of which can identify instances of one particular class. The heuristics are assumed to have reasonable performance but they do not need to cover all instances of the class nor do they need to be perfectly reliable. We further assume that we have an infallible expert, who is willing to manually label a few instances. The aim of the algorithm is to exploit the cluster structure of the problem, the predictions by the imperfect heuristics and the limited perfect labels provided by the expert to classify (label) the instances of the data set with guaranteed precision (specificed by the user) with regards to each class. The specified precision is not always attainable, so the algorithm is allowed to classify some instances as dontknow. The algorithm is evaluated by the number of instances labeled by the expert, the number of dontknow instances (global coverage) and the achieved quality of the labeling. On the KDD Cup Network Intrusion data set containing 500,000 instances, we managed to label 96.6% of the instances while guaranteeing a nominal precision of 90% (with 95% confidence) by having the expert label 630 instances; and by having the expert label 1200 instances, we managed to guarantee 95% nominal precision while labeling 96.4% of the data. We also provide a case study of applying our scheme to label the network traffic collected at a large campus network.

Original languageEnglish (US)
Title of host publicationKDD 2008 - Proceedings of the 14th ACMKDD International Conference on Knowledge Discovery and Data Mining
Pages641-649
Number of pages9
DOIs
StatePublished - Dec 1 2008
Event14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008 - Las Vegas, NV, United States
Duration: Aug 24 2008Aug 27 2008

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Other

Other14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008
CountryUnited States
CityLas Vegas, NV
Period8/24/088/27/08

Keywords

  • Labeling data sets
  • Performance guarantees
  • Semi-supervised learning

Fingerprint Dive into the research topics of 'Semi-supervised approach to rapid and reliable labeling of large data sets'. Together they form a unique fingerprint.

  • Cite this

    Simon, G. J., Kumar, V., & Zhang, Z-L. (2008). Semi-supervised approach to rapid and reliable labeling of large data sets. In KDD 2008 - Proceedings of the 14th ACMKDD International Conference on Knowledge Discovery and Data Mining (pp. 641-649). (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining). https://doi.org/10.1145/1401890.1401968