No Publication Without Evaluation
Bijan Parsia
What is evaluation?
- Well?
- The accumulation of evidence for and against a hypothesis
- Evaluation is
  - context sensitive and interest relative
  - subject to community norms
  - very, very difficult
  - very, very fallible
- We're talking about
  - "Performance" evaluation
  - Usability evaluation
A Case Study
- (What is a case study? What are its strengths and
weaknesses? What sort of conclusions does it support?)
- Taken from The Benchmark Handbook for Database and Transaction Systems
- We never expected this benchmark to become as popular as it did. In
retrospect, the reasons for this popularity were only partially due to
its technical quality. The primary reason for its success was that it
was the first evaluation containing impartial measures of real
products. By actually identifying the products by name, the benchmark
triggered a series of "benchmark wars" between commercial database
products. With each new release, each vendor would produce a new set of
numbers claiming superiority. With some vendors releasing their
numbers, other vendors were obliged to produce numbers for their own
systems. So the benchmark quickly became a standard which customers
knew about and wanted results for.
In retrospect, had the products not been identified by name, there
would have been no reason for the vendors to react the way they did,
and the benchmark would most likely have simply been dismissed as an
academic curiosity. We did not escape these wars completely
unscarred. In particular, the CEO of one of the
companies repeatedly contacted the chairman of the Wisconsin Computer
Sciences Department complaining that we had not represented his product
fairly.
Benchmarking Databases
- Analytic Benchmarks
  - Component or feature oriented
  - Transparent
  - Unrealistic
  - Combinatorics get bad
- Application Benchmarks
  - Real or artificial
  - More applicable
  - Selective of feature interactions
  - Coding and application sensitive
- Both can lead to tuning to benchmarks
- Both are evaluating systems
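
As a rough illustration of an analytic benchmark, the sketch below isolates one feature (an index) and times one operation (a point lookup) against an in-memory SQLite database. The table name `t`, the index name `idx_t_id`, the row count, and the repeat counts are all hypothetical choices made for this example. It is transparent and component oriented, and also unrealistic in exactly the way the slide warns about: no concurrency, no disk, one query shape.

```python
import sqlite3
import timeit

N_ROWS = 20_000  # hypothetical table size, chosen only to keep the run short


def build_db(with_index):
    """Create an in-memory table; optionally add an index on the lookup column."""
    conn = sqlite3.connect(":memory:")
    # 'id' is a plain INTEGER column (not a primary key), so it has no implicit index.
    conn.execute("CREATE TABLE t (id INTEGER, payload TEXT)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        ((i, f"row-{i}") for i in range(N_ROWS)),
    )
    if with_index:
        conn.execute("CREATE INDEX idx_t_id ON t (id)")
    conn.commit()
    return conn


def time_point_lookup(conn, repeats=5, number=50):
    """Return the best observed time (seconds) for a single point lookup."""
    query = "SELECT payload FROM t WHERE id = ?"
    target = (N_ROWS - 1,)
    runs = timeit.repeat(
        lambda: conn.execute(query, target).fetchone(),
        repeat=repeats,
        number=number,
    )
    # Take the minimum over repeats to reduce noise from other processes.
    return min(runs) / number


results = {
    "no index": time_point_lookup(build_db(with_index=False)),
    "with index": time_point_lookup(build_db(with_index=True)),
}
for label, seconds in results.items():
    print(f"{label:>10}: {seconds * 1e6:10.1f} microseconds per lookup")
```

Whatever numbers this prints say something about one feature in one configuration; crossing it with other features (joins, updates, buffer sizes) is where the combinatorics get bad.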
Theories and Data
- (Empirical) Data never conclusively determine a conclusion
  - Performance was bad because it wasn't tuned
  - Performance was bad because of Java
  - Performance was bad because....
  - This result doesn't matter because it's unrealistic
  - This result doesn't matter because...
- Analysis typically is too coarse-grained
  - Worst-case complexity isn't the last word
- Judgement is key
  - What nay-saying is sensible and helpful?
  - What ignoring of nay-saying is Pollyannaish?
Know thy statistics (1)
- "An opinion poll released yesterday found Mr. Kerry had the support of
49 per cent of voters, compared with 47 per cent for Mr. Bush, a
statistical tie...." (WashMonthly)
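
One way to probe the "statistical tie" claim is to compare the 2-point lead with its margin of error. The quote does not give the sample size, so the sketch below assumes a hypothetical 1,000 respondents and a simple random sample; both are assumptions, as are the function names. It uses the usual normal approximation at the 95% level (z = 1.96), and treats the two shares as coming from the same sample, so the variance of the lead is (p1 + p2 - (p1 - p2)^2) / n.

```python
import math

N_RESPONDENTS = 1000  # ASSUMPTION: the quoted poll does not report its sample size
Z_95 = 1.96           # normal critical value for a 95% interval


def margin_of_error(p, n, z=Z_95):
    """Margin of error for a single share p from a simple random sample of n."""
    return z * math.sqrt(p * (1.0 - p) / n)


def lead_margin_of_error(p1, p2, n, z=Z_95):
    """Margin of error for the lead p1 - p2 when both shares come from one sample."""
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z * math.sqrt(variance)


kerry, bush = 0.49, 0.47  # shares quoted in the slide
lead = kerry - bush
share_moe = margin_of_error(kerry, N_RESPONDENTS)
lead_moe = lead_margin_of_error(kerry, bush, N_RESPONDENTS)

print(f"margin of error on one share: +/- {share_moe * 100:.1f} points")
print(f"lead: {lead * 100:.1f} points, margin on the lead: +/- {lead_moe * 100:.1f} points")
print("lead distinguishable from zero at 95%:", lead > lead_moe)
```

Under these assumptions the margin on the lead is roughly twice the margin on a single share, which is easy to overlook when reading a headline figure. A different sample size or sampling design would change the numbers.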
Know thy statistics (2)
- Suppose I test two systems and, for both of them, their mean and
median performance are about 30. Is the performance of these two
systems, AFAWCT, the same?
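
A small sketch of why the answer can be "no": the two made-up samples below share a mean and a median of 30 but differ greatly in spread. The measurements are invented purely for illustration and do not come from any real system.

```python
import statistics

# HYPOTHETICAL measurements (for example, response times), invented for illustration.
system_a = [29, 30, 30, 30, 31]
system_b = [5, 10, 30, 50, 55]

for name, sample in (("A", system_a), ("B", system_b)):
    print(
        f"system {name}: "
        f"mean={statistics.mean(sample)}, "
        f"median={statistics.median(sample)}, "
        f"stdev={statistics.stdev(sample):.2f}, "
        f"min={min(sample)}, max={max(sample)}"
    )
```

Mean and median agree, yet system B is sometimes far faster and sometimes far slower than system A. Reporting spread (standard deviation, range, or the full distribution) alongside the central tendency is what exposes the difference.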
Usability Evaluation
- Usability is an aspect of Human-Computer
Interaction (HCI)
- Usability testing involves experimenting with human subjects
  - You must ensure that you have had appropriate ethics board review
  - Always! Sometimes, it's not necessary, but you should verify that
- Usability can be evaluated
  - Analytically
  - Experimentally
- Measuring usability is expensive
  - Testing is less so
- "Luckily, you don't
have to measure usability to improve it. Usually, it's enough to test
with a handful of users and revise the design in the direction
indicated by a qualitative analysis of their behavior. When you see
several people being stumped by the same design element, you don't
really need to know how much the users are being delayed. If it's
hurting users, change it or get rid of it."
Q vs. Q
- Quantitative measures
  - Time it takes to complete a task
  - Number of errors
  - Number of completed tasks
  - Number of clicks to find a command
- Qualitative measures
  - Satisfaction or Enjoyment
    - "I'm bored"
  - General feedback
    - "I felt that I was in control", "I would use this...can I get a copy?"
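
The quantitative measures listed above could be collected per session and summarised along the following lines. This is only a sketch: the `Session` record, its field names, and the three sessions are hypothetical and not taken from any real study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    """One participant attempting one task (hypothetical record layout)."""
    participant: str
    seconds: float    # time spent on the task
    errors: int       # number of errors observed
    completed: bool   # whether the task was finished
    clicks: int       # clicks needed to find the command


def summarize(sessions):
    """Compute the four quantitative measures over a list of sessions."""
    finished = [s for s in sessions if s.completed]
    return {
        "completed_tasks": len(finished),
        # Time-to-complete is only meaningful for sessions that finished.
        "mean_time_completed_s": mean([s.seconds for s in finished]) if finished else None,
        "total_errors": sum(s.errors for s in sessions),
        "mean_clicks": mean([s.clicks for s in sessions]),
    }


# INVENTED data, for illustration only.
sessions = [
    Session("P1", 42.0, 1, True, 5),
    Session("P2", 65.5, 3, True, 9),
    Session("P3", 120.0, 6, False, 14),
]

summary = summarize(sessions)
for measure, value in summary.items():
    print(f"{measure}: {value}")
```

Qualitative remarks such as "I'm bored" do not fit this table and are better kept as notes next to each session.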
Different Sorts of Study
- Task oriented user study
- Longitudinal study
- Ethnographic study
- Case study
It is a good idea to...
- Read. A lot.
  - Take experiments and analyses apart
  - Try to replicate them at least a little
- Check your interests
  - What would you do on the basis of your results?
- Check your context
- ANALYZE before you experiment
  - Is your investigation grounded?
  - Have you thought of the different ways it might go?
  - Aim not to be surprised, but hope that you are
- Preliminaries!
  - Try it yourself, or have a colleague try it
  - Sanity checks
  - Pilot studies
  - Detect problems early
- Do it!
  - Best is the enemy of the good