How do we ensure that all our exams are equally "fair"?

Rather than just deciding to scale all our marks, I think we should first test
the marks to see if they are fair or not - it may be that, as a result of
this, we decide that we should scale all our marks, but at least we will have
some intellectual justification for it.

	What is "fairness"?

Given a set of students taking a set of exams, there will inevitably be
differences between students. However, assuming all modules are equally well
taught and aimed at average student abilities, once we have allowed for the
different subsets of students doing different exams, the (a) averages and (b)
distributions of marks obtained by students for different exams should be
similar.

However, they need not be identical - differences could be caused by:
1) different students preferring, or being best at, different subjects,
2) differences on the day (e.g. exam density, date or time, illness, stress),
3) differences between exams for the same subject.
We would hope to minimise (3), and expect that some of (1) and (2) would
tend to average out over large numbers of students. A source of (3)
which we should not try to remove would be weaker students "question spotting",
which could lead to higher variability depending on how well they anticipated
the questions.

	How could we test for fairness?

We need to compare the marks obtained by the subset of students taking a
specific exam, both in that exam and overall. Simply examining the marks
obtained by all students for all exams, e.g. by looking at overall averages,
is potentially unfair, as it does not allow for the different subsets of
students doing different exams.

Informally, we can simply plot and compare the expected and actual
distributions, or we can compute and examine some summary statistics:
(a) the averages obtained in the exam v. overall
(b) the standard deviations, skew, kurtosis, etc. for the exam v. overall
(c) the ordinary v. rank correlations of exam and overall marks
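
As a rough illustration of the summary statistics listed above, here is a
minimal Python sketch using only the standard library. The marks, the names
(compare, moments, ranks, pearson) and the data layout are all made up for
the example, and I have assumed that "overall" means each student's average
over all the exams they took, including the exam in question.

```python
from statistics import mean, pstdev

# Hypothetical marks (not real data): student -> {exam: mark}.
marks = {
    "s1": {"maths": 62, "physics": 58, "chemistry": 65},
    "s2": {"maths": 45, "physics": 50},
    "s3": {"maths": 71, "chemistry": 68, "biology": 74},
    "s4": {"physics": 55, "biology": 60},
    "s5": {"maths": 38, "physics": 42, "biology": 47},
}

def moments(xs):
    """Return (mean, population s.d., skew, excess kurtosis) of xs."""
    m, s = mean(xs), pstdev(xs)
    z = [(x - m) / s for x in xs]          # standardised values
    skew = mean([v ** 3 for v in z])
    kurt = mean([v ** 4 for v in z]) - 3.0
    return m, s, skew, kurt

def ranks(xs):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(xs, ys):
    """Ordinary (product-moment) correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def compare(marks, exam):
    """Summary statistics for one exam v. overall, for its takers only."""
    takers = [s for s in marks if exam in marks[s]]
    exam_marks = [marks[s][exam] for s in takers]
    overall = [mean(list(marks[s].values())) for s in takers]
    return {
        "exam": moments(exam_marks),
        "overall": moments(overall),
        "pearson": pearson(exam_marks, overall),
        # Rank correlation taken as the ordinary correlation of the ranks.
        "spearman": pearson(ranks(exam_marks), ranks(overall)),
    }

result = compare(marks, "maths")
print(result)
```

A large gap between the ordinary and rank correlations would suggest that a
few unusual marks, rather than the general ordering of students, are driving
the comparison.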

Formally, we need to set up and test hypotheses, with stated confidence
levels etc. It may be sufficient to have one or more "tripwire" tests that we
would expect most exams to pass, and to resort to more detailed analysis
(maybe informally, as above) only if one or more tests failed, e.g. we might
have a single test for poor distributions, which would not tell us what was
wrong with the distribution but only that it needed looking at.

We are interested in 3 sources of variation in the marks:
i) the variation between different students
ii) the variation between the standards of exams for the different subjects
iii) the variation in a particular student's ability at different subjects
The final mark that a particular student gets in a particular exam for a
particular subject depends on all of these, but we are only worried by (ii).
(We should also investigate the variation due to the students' backgrounds,
e.g. A-levels or honours programme, but that is not today's problem.)

We can get estimates of (i) and (ii+iii): (i) is the variance in the overall
averages for each student. If we subtract the overall average for each student
from their mark for each subject, we can then compute the variance over the
results to obtain (ii+iii).
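
The two estimates just described can be sketched in a few lines of Python.
This is only an illustration: the function name variance_components and the
data layout are invented, and I have assumed population variances (dividing
by n), which the description above does not specify.

```python
from statistics import mean, pvariance

def variance_components(marks):
    """Return (variance (i), variance (ii+iii)) for student -> {exam: mark}."""
    # Overall average for each student, over the exams they actually took.
    student_avg = {s: mean(list(m.values())) for s, m in marks.items()}
    # (i): variance of the students' overall averages.
    var_students = pvariance(list(student_avg.values()))
    # (ii+iii): variance of each mark minus that student's overall average.
    residuals = [mark - student_avg[s]
                 for s, m in marks.items()
                 for mark in m.values()]
    var_residual = pvariance(residuals)
    return var_students, var_residual

# Hypothetical marks, not real data.
example = {"a": {"x": 50, "y": 60}, "b": {"x": 70, "y": 80}}
var_i, var_ii_iii = variance_components(example)
print(var_i, var_ii_iii)
```

Note that this cannot separate (ii) from (iii) on its own; that needs the
marks to be grouped by exam as well as by student.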

We could do an analysis of variance (anova) F-test (single factor, unequal
sample sizes), taking the set of marks for each exam as a separate sample,
comparing the variance between and within samples (exams), and testing whether
the variation between exams is significant. However, this does not make use of
all the information in the data, as individual students should get similar
results in different exams (samples), so this would not test everything we
want to test, and would not distinguish between an exam that was marked too
high and an exam taken by better students. The solution is to use a two-factor
test (students v. exams) but, as each student only does a subset of exams, it
is not clear to what extent we can rely on the results. I think that, in
practice, this would be OK.
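
For reference, the single-factor F statistic with unequal sample sizes can be
computed as in the sketch below. The function name one_way_f and the sample
marks are invented for illustration. It only returns the statistic and its
degrees of freedom, which would then be looked up against the F distribution,
and it shares the limitation noted above: it ignores which student produced
which mark.

```python
from statistics import mean

def one_way_f(samples):
    """Single-factor ANOVA: return (F, df_between, df_within).

    samples is a list of lists, one list of marks per exam; the lists
    may have different lengths.
    """
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    # Variation of the exam means about the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Variation of the marks about their own exam mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical marks for three exams with different numbers of students.
exams = [[62, 45, 71, 38], [58, 50, 55, 42], [65, 68]]
print(one_way_f(exams))
```

The two-factor version with each student taking only a subset of exams is an
unbalanced design, so I would not attempt it by hand; a statistics package
that handles missing cells would be needed.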

We could test the differences between the averages for each exam and overall.
To apply Z- (?) or t-tests on the means we would need the variances such as
(i) to (iii) above. We could also test the variances (ii) between different
exams within the set of all exam marks. We need these formal tests to decide
when means and/or variances are significantly out of line, particularly with
the small subsets of students doing some of the exams.
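
One possible form of the test on the means is sketched below. It is my
assumption, not something settled above, that we pair each student's mark in
the exam with that same student's overall average and apply a t-test to the
differences. The function name paired_t and the marks are invented, and the
sketch needs at least two students whose differences are not all identical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(marks, exam):
    """Return (t, degrees of freedom) for exam mark minus overall average.

    marks maps student -> {exam: mark}; only students who took the
    given exam are used.
    """
    diffs = [m[exam] - mean(list(m.values()))
             for m in marks.values() if exam in m]
    n = len(diffs)
    # Standard error of the mean difference, using the sample s.d.
    std_err = stdev(diffs) / sqrt(n)
    return mean(diffs) / std_err, n - 1

# Hypothetical marks, not real data.
example = {
    "a": {"x": 60, "y": 50},
    "b": {"x": 70, "y": 66},
    "c": {"x": 55, "y": 45},
}
t, df = paired_t(example, "x")
print(t, df)
```

The statistic would then be compared with the t distribution on the given
degrees of freedom; with the small subsets this is where the degrees of
freedom matter most.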