How do we ensure that all our exams are equally "fair"?

Rather than just deciding to scale all our marks, I think we should first test the marks to see if they are fair or not. It may be that, as a result of this, we decide that we should scale all our marks, but at least we will have some intellectual justification for it.

What is "fairness"?

Given a set of students taking a set of exams, there will inevitably be differences between students. However, assuming all modules are equally well taught and aimed at average student abilities, once we have allowed for the different subsets of students doing different exams, the (a) averages and (b) distributions of marks obtained by students for different exams should be similar. They need not be identical, though. Differences could be caused by:
1) different students preferring, or being best at, different subjects
2) differences on the day (e.g. exam density, date or time, illness, stress)
3) differences between exams for the same subject

We would hope to minimise (3), and expect that some of (1) and (2) would tend to average out over large numbers of students. A source of (3) which we should not try to remove would be weaker students "question spotting", which could lead to higher variability depending on how well they anticipated the questions.

How could we test for fairness?

We need to compare the marks obtained by the subset of students taking a specific exam, both in that exam and overall. It is potentially unfair if we simply examine the marks obtained by all students for all exams, e.g. by looking at overall averages.

Informally, we can simply plot and compare the expected and actual distributions, or we can compute and examine some summary statistics:
(a) the averages obtained in the exam v. overall
(b) the standard deviations, skew, kurtosis, etc. for the exam v. overall
(c) the ordinary v.
rank correlations of exam and overall marks

Formally, we need to set up and test hypotheses to establish confidence levels etc. It may be sufficient to have one or more "tripwire" tests that we would expect most exams to pass, and resort to more detailed analysis (maybe informally, as above) only if one or more tests failed. For example, we might have a single test for poor distributions, which would not tell us what was wrong with the distribution but only that it needed looking at.

We are interested in 3 sources of variation in the marks:
i) the variation between different students
ii) the variation between the standards of exams for the different subjects
iii) the variation in a particular student's ability at different subjects

The final mark that a particular student gets in a particular exam for a particular subject depends on all of these, but we are only worried by (ii). (We should also investigate the variation due to the students' background e.g. A-levels, or honours programme etc., but that is not today's problem.)

We can get estimates of (i) and (ii+iii): (i) is the variance in the overall averages for each student. If we subtract the overall average for each student from their mark for each subject, we can then compute the variance over the results to obtain (ii+iii).

We could do an analysis of variance (anova) F-test (single factor, unequal sample sizes), taking the set of marks for each exam as a separate sample, comparing the variance between and within samples (exams), and testing whether the variation between exams is significant. However, this does not make use of all the information in the data, as individual students should get similar results in different exams (samples). It would therefore not test everything we want to test, and would not distinguish between an exam that was marked too high and an exam taken by better students. The solution is to use a two-factor test (students v.
exams) but, as each student only does a subset of exams, it is not clear to what extent we can rely on the results. I think that, in practice, this would be OK.

We could test the differences between the averages for each exam and overall. To apply Z- (?) or t-tests on the means we would need the variances such as (i) to (iii) above. We could also test the variances (ii) between different exams within the set of all exam marks.

We need these formal tests to decide when means and/or variances are significantly out of line, particularly with the small subsets of students doing some of the exams.
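
As a rough illustration of the estimates and the "tripwire" idea discussed above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the marks are invented, the names (marks, student_averages, residuals_by_exam, paired_t) are hypothetical, and the paired t statistic is only one possible choice of test. Note also that each student's overall average here includes the exam being examined, which pulls the differences towards zero, and that the variances use a simple n-1 divisor, so they are only crude estimates.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Invented data for illustration: marks[student][exam] = mark out of 100.
# Each student sits only a subset of the exams.
marks = {
    "s1": {"algebra": 62, "stats": 58, "logic": 70},
    "s2": {"algebra": 45, "stats": 50},
    "s3": {"algebra": 68, "stats": 71, "logic": 80},
    "s4": {"stats": 40, "logic": 55},
    "s5": {"algebra": 75, "logic": 85},
}


def student_averages(marks):
    """Overall average mark for each student, over the exams they sat."""
    return {s: mean(list(m.values())) for s, m in marks.items()}


def residuals_by_exam(marks):
    """For each exam, the list of (mark - that student's overall average)."""
    avg = student_averages(marks)
    out = {}
    for student, results in marks.items():
        for exam, mark in results.items():
            out.setdefault(exam, []).append(mark - avg[student])
    return out


def paired_t(diffs):
    """t statistic for the hypothesis that the mean difference is zero.

    Has len(diffs) - 1 degrees of freedom; needs at least two differences
    that are not all equal, otherwise the statistic is undefined.
    """
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are equal; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


avg = student_averages(marks)
res = residuals_by_exam(marks)

# Crude estimate of (i): variance of the students' overall averages.
var_students = variance(list(avg.values()))
# Crude estimate of (ii+iii): variance of the marks after removing
# each student's overall average.
var_residual = variance([d for diffs in res.values() for d in diffs])

print("variance between students (i):", round(var_students, 2))
print("residual variance (ii+iii):", round(var_residual, 2))
for exam in sorted(res):
    diffs = res[exam]
    print(exam, "n =", len(diffs),
          "mean difference =", round(mean(diffs), 2),
          "t =", round(paired_t(diffs), 2))
```

An exam whose mean difference is large relative to its spread (a large t for its degrees of freedom) would be one to look at in more detail; with the very small subsets of students doing some exams, the critical values are wide, which is exactly why a formal test is needed rather than a comparison of averages by eye.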