Statistical Hypothesis Testing in Positive Unlabelled Data
Konstantinos Sechidis, Borja Calvo, and Gavin Brown
European Conference on Machine Learning, Nancy, France 2014.

** Best Student Paper Prize **

Paper (PDF)   /   Supplementary material (PDF)   /   Bibtex
Matlab code (ZIP)   /   Slides (PDF)

Positive-unlabelled (PU) data is a special case of semi-supervised learning. In PU data we have a small number of "positive" examples and a large number of unlabelled examples, which could be either positive or negative. This scenario occurs surprisingly often in practice, and is of interest in text mining, bioinformatics, and many other areas.

In this paper we present an analysis of statistical hypothesis testing methodology in this scenario. We focus specifically on the G-test, a generalised likelihood ratio test, but the results have wider implications for the use of mutual information, for example to build Bayesian networks or decision trees, or to select features.

Properties of Hypothesis Testing in PU environments

One very common heuristic is to assume all unlabelled examples are negative. Our first result shows that a G-test performed under this heuristic:

    - is guaranteed to have the same false positive rate as a test on the fully labelled data
    - will have a higher false negative rate
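As a minimal sketch (not the paper's released Matlab code), the heuristic can be exercised with SciPy's implementation of the G-test, treating every unlabelled example as negative. The data-generating parameters below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2_contingency

def g_test_assume_negative(x, s):
    """G-test of independence between a binary feature x and labels s,
    where s=1 means 'labelled positive' and s=0 means 'unlabelled'.
    The heuristic treats every s=0 example as if it were negative."""
    table = np.zeros((2, 2))
    for xi, si in zip(x, s):
        table[xi, si] += 1
    # lambda_="log-likelihood" gives the G statistic rather than Pearson's X^2
    g, p, dof, _ = chi2_contingency(table, correction=False,
                                    lambda_="log-likelihood")
    return g, p

# Toy PU sample: x is a noisy copy of the hidden label y,
# and only a fraction of the positives receive a label s=1.
rng = np.random.default_rng(0)
y = rng.random(3000) < 0.2                                  # p(y=1) = 0.2
s = (y & (rng.random(3000) < 0.3)).astype(int)              # ~30% of positives labelled
x = np.where(rng.random(3000) < 0.7, y, ~y).astype(int)     # x agrees with y 70% of the time
g, p = g_test_assume_negative(x, s)
print(f"G = {g:.2f}, p = {p:.4f}")
```

Because x is genuinely associated with y, the test still rejects independence; with fewer labelled positives the G statistic shrinks, which is the loss of power the paper quantifies.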

We proceed to prove a correction factor, kappa, that yields a test with behaviour identical to the original, as if the fully labelled data had been observed.

The correction factor enables sample size determination in PU data, and additionally provides a new capability, supervision determination: automatically determining the number of labelled examples needed to obtain a statistical test with specified properties.

Supervision Determination: How many labels do I need?

Suppose we have N=3000 unlabelled examples. Ideally we would label all of them and then conduct a hypothesis test with a false positive rate of 1%. Suppose, however, that we cannot label them all, perhaps because labelling is expensive. If we can provide the prior knowledge that p(y=1) = 0.2, we need to label only a very small fraction of the examples, yet we can still recover the same statistical test with the specified FP/FN rates.

To detect a "medium" effect with power 0.99 (i.e. an FNR of 1%), we only need to find 66 positive examples out of the roughly 600 expected (0.2 × 3000). The small, medium and large effect sizes correspond to Cohen's standard interpretations of effect size, found in any statistics textbook.
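For intuition on the numbers involved, the classical fully supervised sample size behind such calculations can be sketched with a standard noncentral chi-square power analysis. This is an assumption-laden illustration of the textbook computation, not the paper's kappa-corrected supervision determination, which converts this figure into a required number of labelled positives:

```python
import math
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def sample_size(w, alpha=0.01, power=0.99, df=1):
    """Smallest n such that a chi-square/G test with df degrees of freedom
    detects Cohen's effect size w with the requested power.
    The test statistic under the alternative is approximately noncentral
    chi-square with noncentrality n * w**2."""
    crit = chi2.ppf(1 - alpha, df)  # rejection threshold under the null
    # Solve for the noncentrality that delivers the requested power.
    lam = brentq(lambda l: ncx2.sf(crit, df, l) - power, 1e-6, 1e4)
    return math.ceil(lam / w**2)

# Cohen's conventions: small w=0.1, medium w=0.3, large w=0.5
n_medium = sample_size(w=0.3, alpha=0.01, power=0.99)
print(n_medium)
```

The fully labelled requirement for a medium effect at these rates is a few hundred examples; the paper's result is that, given p(y=1), the same test can be recovered from far fewer labelled positives.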