How to Measure the Stability of Feature Selection

"On the Stability of Feature Selection Algorithms''   [PDF]
Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown
Journal of Machine Learning Research, (vol 18, page 1-54, 2018).

"Quantifying the Stability of Feature Selection"   [PDF]
Sarah Nogueira, PhD thesis, School of Computer Science, University of Manchester, 2018.

***** If you use any of the work below, please cite our JMLR paper above. *****

Feature selection (FS) algorithms are central to modern data science, from exploratory analysis to predictive model building. The ‘stability’ of an FS algorithm is the variation in its feature preferences under small perturbations of the original data: in effect, the reliability of the selection procedure.

We have developed statistical tools that put the measurement of stability on a solid foundation. The novel measure we propose has a number of desirable properties not held by any previous measure, and it lets you reliably estimate stability regardless of which feature selection algorithm you use. You can calculate a point estimate of stability or, if you want, construct confidence intervals and perform hypothesis tests too.


You can get the full source code, including experiments from our paper, at:
      https://github.com/nogueirs/JMLR2018

But the measure itself is really simple to calculate.

The input to the calculation is simply a matrix Z with M rows, where each row represents one run of the feature selection algorithm, each run on a different bootstrap (or some other random perturbation) of the original data. Each row should be of length d, the total number of features, with a 1 indicating the feature was chosen on that run, and a 0 that it was not.
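
For concreteness, here is one hedged sketch of how such a Z might be built in Python: run a selector on M bootstrap samples and record the chosen features. The selector (scikit-learn's SelectKBest with f_classif) and the helper name build_Z are illustrative assumptions, not part of our method.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def build_Z(X, y, M=50, k=10, seed=0):
    # One row per run: select k features on each of M bootstrap samples.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Z = np.zeros((M, d), dtype=int)
    for i in range(M):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample rows with replacement
        selector = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        Z[i, selector.get_support()] = 1   # mark the features chosen on this run
    return Z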

For the point estimate, you can do it in Matlab like this...

Z = [...];                % M-by-d binary matrix, one row per run
d = size(Z,2);            % total number of features
kbar = mean(sum(Z,2));    % average number of features selected per run
stability = 1 - mean(var(Z)) / (kbar/d*(1-kbar/d))   % var gives unbiased column variances

Or you can do it in Python like this...

import numpy as np

Z = np.array([...])      # M-by-d binary numpy array, one row per run
d = Z.shape[1]           # total number of features
kbar = Z.sum(1).mean()   # average number of features selected per run
stability = 1 - Z.var(0, ddof=1).mean() / ((kbar/d)*(1-kbar/d))
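
As a quick sanity check on a hypothetical toy matrix: when every run selects exactly the same features, the column variances are all zero and the estimate is 1.

import numpy as np

Z = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0]])   # three identical runs, d = 4 features
d = Z.shape[1]
kbar = Z.sum(1).mean()         # 2 features selected per run
print(1 - Z.var(0, ddof=1).mean() / ((kbar/d)*(1-kbar/d)))   # 1.0: perfectly stable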

For more complex operations, use either the full source code linked above, or our drag-and-drop interface below...


You do not have to upload your original features/data to us.
You do not have to upload the names of the features, or any other confidential information.

As with the Matlab/Python above, the only requirement is a matrix, delivered as a text file with M rows, where each row represents one run of a feature selection algorithm. Each row should be a binary string of length d, the total number of features, with a 1 indicating the feature was chosen, and 0 not chosen. Your file can contain tabs, spaces or commas if you wish, but they will be stripped out. You can use the example file if you want - download it, then drag it in.
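
As an illustration, a minimal Python sketch for reading such a file follows; the helper name load_Z is hypothetical, and it simply strips tabs, spaces, and commas before parsing the remaining 0s and 1s.

import numpy as np

def load_Z(path):
    # One run per line; ignore tabs, spaces, and commas.
    rows = []
    with open(path) as f:
        for line in f:
            bits = line.strip().replace(",", "").replace("\t", "").replace(" ", "")
            if bits:
                rows.append([int(c) for c in bits])
    return np.array(rows)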

Drop a file into either of the boxes below, and the response will be your estimated stability, along with confidence intervals.

Drop two files, and if the dimensions M and d are the same, a hypothesis test of equality will be performed.
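
If you would rather reproduce these locally, here is a hedged Python sketch. The helper names are hypothetical; the interval comes from a simple nonparametric bootstrap over the rows of Z, and the test is a plain permutation test, both stand-ins for, not reproductions of, the procedures derived in the paper.

import numpy as np

def stability(Z):
    # Point estimate, as in the snippet above.
    d = Z.shape[1]
    kbar = Z.sum(1).mean()
    return 1 - Z.var(0, ddof=1).mean() / ((kbar/d)*(1-kbar/d))

def stability_ci(Z, B=1000, alpha=0.05, seed=0):
    # Approximate interval: resample the M runs with replacement
    # and take percentile bounds on the recomputed estimates.
    rng = np.random.default_rng(seed)
    M = Z.shape[0]
    stats = [stability(Z[rng.integers(0, M, size=M)]) for _ in range(B)]
    return np.percentile(stats, [100*alpha/2, 100*(1 - alpha/2)])

def stability_perm_test(Z1, Z2, B=2000, seed=0):
    # Null hypothesis: the two procedures are equally stable. Shuffle
    # the pooled runs, re-split, and count differences at least as
    # extreme as the one observed.
    rng = np.random.default_rng(seed)
    observed = abs(stability(Z1) - stability(Z2))
    pooled = np.vstack([Z1, Z2])
    M = Z1.shape[0]
    count = 0
    for _ in range(B):
        perm = rng.permutation(len(pooled))
        diff = abs(stability(pooled[perm[:M]]) - stability(pooled[perm[M:]]))
        if diff >= observed:
            count += 1
    return (count + 1) / (B + 1)   # permutation p-value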
