The Problem with Numbers
The problem with numbers is that people put too much faith in them. The human brain is geared towards short-cuts, which is why branding is so important in marketing. It is much easier to trust a brand you know, and that everyone else says is great, than to research every product and compare them objectively. Is Coca-Cola really better than a supermarket cola at 30% of the price? Is an iPad better than a generic tablet? The point I'm making here is that we are geared not to think: thinking is hard work and time-consuming, and it is far easier to learn a few labels and go with the herd.
When it comes to statistics this is even more the case. Quantify a measurement and everyone (with the possible exception of physicists) will believe it is precisely that value. National Curriculum levels are a good example. If I assessed a piece of work at level 5b, then took a random sample of 100 teachers across the nation and asked each to assess the same piece of work "blind", how likely is it that we would all arrive at 5b? I wouldn't like to bet money on us arriving at an average of level 5, never mind the sub-level. And if we repeated the exercise with another 100 teachers, would the variation between the averages produced by the two groups be bigger than one sub-level? If so, it is very dubious to use such fine divisions in any summative exercise such as accountability measures or access to courses: it misleads people into thinking these things are more precise than they actually are. Incidentally, the same really applies to exams and qualifications. This does not mean that criteria-based descriptions are necessarily a bad thing for formative assessment; I'm not trying to start an argument about the desirability or otherwise of levels, but rather to illustrate that the purpose makes a big difference to how the numbers generated from such things should be used.
So this brings me on to Effect Size. Effect size is a calculation of how effective one line of action is compared to not doing it. It inherently assumes that some dependent variable is being influenced by an independent variable. Take a group of teachers and get them to set regular homework, and take a similar control group who set no homework. Follow this up and see what difference there is between the two groups in some test afterwards. The score in the test is the dependent variable, the homework the independent variable, and the standardised difference between the groups is the effect size. Straightforward? Or is it? All scientific measurements and subsequent generalisations rest on assumptions, e.g. assuming no air resistance, assuming everyone is more or less the same, assuming the speed is less than 50% of the speed of light, and so on. So let's list the assumptions in the homework example.
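To make the arithmetic concrete, here is a minimal sketch of one common effect-size calculation, Cohen's d: the difference between the two group means divided by the pooled standard deviation. The test scores are invented purely for illustration; nothing here reflects real homework data.

```python
import math

def cohens_d(treatment, control):
    """Standardised mean difference between two groups (Cohen's d)."""
    n1, n2 = len(treatment), len(control)
    m1 = sum(treatment) / n1
    m2 = sum(control) / n2
    # Sample variances (n - 1 denominators), then the pooled SD
    v1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Invented test scores for the two groups in the homework example
homework = [62, 70, 68, 75, 66, 71, 69, 73]
no_homework = [60, 65, 64, 70, 63, 67, 62, 68]
print(round(cohens_d(homework, no_homework), 2))
```

Note that the number it produces is in units of standard deviations of the test scores, so it inherits every weakness of the test itself, which is exactly the point of the assumptions below.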
The first two assumptions are that the measurement used in the follow-up to determine the effect size is both valid and reliable, and that all other variables have been eliminated. Let's say it was an exam in the subject matter. Was it the same exam for all pupils in both groups? Does the exam genuinely reflect the learning under scrutiny in all its facets? The problem with saying an effect size indicates an effect on learning lies in agreeing what learning was being considered in the first place. As an example, let's say I spend two weeks teaching physics lessons about how reflection and refraction work in glass objects, using real pieces of glass. Someone else spends the same time teaching this from diagrams in textbooks. I then set an exam in which the questions are framed as diagrams, and the second group does better than the first. Teaching with diagrams gets a significantly bigger effect size than teaching with real pieces of glass. Is it valid to conclude that teaching with diagrams is more effective, with Effect Size <N>? If what we are interested in is how well learners can answer questions about reflection and refraction framed as diagrams, then perhaps so. If we are interested in how well the candidates understand the practical properties of glass, we can't be sure, because if we had asked questions that required experience of real glass the results might have been different. I'm not saying in this example that one method is or is not better than the other; I'm simply illustrating that the effect size is likely to be less generalisable than is commonly accepted. Often the way we make the assessment is limited to what is easy to assess rather than the things the intervention is most geared to affect.
Even if the Effect Size is positive, there is likely to be considerable uncertainty in the result, and that uncertainty is not just a matter of juggling statistics; it is more fundamental than that, because there is almost certainly some uncertainty in the validity of the measures themselves.
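Even the purely statistical part of that uncertainty is easy to underestimate. A standard textbook approximation for the standard error of Cohen's d (the kind found in meta-analysis texts) shows how wide the confidence interval around a "moderate" effect is at typical class sizes. The d value and group sizes below are illustrative, not taken from any real study, and all the deeper validity worries sit on top of this.

```python
import math

def effect_size_se(d, n1, n2):
    """Approximate standard error of Cohen's d for two groups of size n1, n2."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))

d = 0.4  # a hypothetical "moderate" effect
for n in (15, 30, 100):
    se = effect_size_se(d, n, n)
    low, high = d - 1.96 * se, d + 1.96 * se  # approximate 95% interval
    print(f"n={n} per group: d = 0.4, 95% CI roughly ({low:.2f}, {high:.2f})")
```

With 15 pupils per group the interval comfortably includes zero, so quoting a single effect size to two decimal places from a small study is precisely the false precision the levels example illustrates.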
Another assumption is that there is no limiting factor which, if removed, would completely change the result. In the homework example, let's say most teachers were inept at setting meaningful homework because the training in targeting it well on learning, and/or the time for planning, was insufficient. The effect size of homework then comes out as near zero, and the implication drawn is to scrap all homework as a waste of time. An equally valid possibility is simply that for homework to have a significant effect, teachers need a specific type of training or a minimum amount of planning time. What if that is also specific to the context? It could be that teaching and teacher training always limit the effect size of any particular line of action (not too hard to imagine really), and that this would be very difficult to pin down because the required training was always dependent on the context, i.e. difficult to generalise. If that were true, spending a lot of time arguing about (a) being more effective than (b), when both would be massively more effective with a marginal effort on (c), really is fiddling while Rome burns.
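The limiting-factor point can be sketched as a toy threshold model. Everything here is hypothetical: the function, the 10-hour threshold, and the 0.4 effect are invented to show the shape of the argument, namely that a study sampling only under-trained teachers would measure a true effect of zero and conclude homework is useless.

```python
def homework_effect(training_hours):
    """Hypothetical moderator model: homework only produces a measurable
    effect once teachers have enough targeted training (threshold invented)."""
    TRAINING_THRESHOLD = 10  # hours; purely illustrative
    return 0.0 if training_hours < TRAINING_THRESHOLD else 0.4

# A study drawn entirely from under-trained teachers sees no effect at all
under_trained = [homework_effect(h) for h in (2, 5, 8)]
print(under_trained)           # all zero
print(homework_effect(12))     # past the threshold the effect appears
```

The measured effect size is a property of the whole system (teacher training, planning time, context), not of "homework" in the abstract.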
Yet another assumption is that these effects are independent variables. By this I mean that if we measure an effect size in a controlled context it will always be the same size, and that if other factors are added they should, as discrete independent variables, simply sum. That doesn't seem very plausible. If we take, for example, web-based learning, which has a very low effect size in Hattie's list, and providing formative evaluation, which is very high, how do we know the effect size of combining them? Let's say we have a formative web-based system where learners can present their work and get formative evaluation and feedback on it. How would that be classified? How much does each factor contribute to the effect size? Would the result be an average of the two, a sum, or perhaps the lower subtracted from the higher? We just don't know until we do such an experiment. It also raises the question of what we mean by web-based learning. Since web-based learning is relatively new, would it be a surprise if it proved ineffective because teachers were not yet familiar with the best ways of using it? And if the measure of success is to write about it on paper, how valid is that assessment? We are back at assumptions one and two.
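Statisticians call this an interaction effect, and a deterministic toy model shows why simple addition fails. All the numbers below are invented: the model simply assumes web-based delivery pays off mainly when formative feedback is also present.

```python
def mean_score(feedback, web_based):
    """Hypothetical average outcome (arbitrary units) under an invented
    model in which the two factors interact rather than simply add."""
    score = 50.0                    # baseline with neither intervention
    if feedback:
        score += 5.0                # formative evaluation alone helps a lot
    if web_based:
        score += 0.5                # web-based learning alone helps little
    if feedback and web_based:
        score += 3.0                # interaction: the combination adds extra
    return score

baseline = mean_score(False, False)
gain_feedback = mean_score(True, False) - baseline
gain_web = mean_score(False, True) - baseline
gain_both = mean_score(True, True) - baseline
print(gain_both, gain_feedback + gain_web)  # prints: 8.5 5.5
```

Measured separately, the two gains sum to 5.5, yet the combination yields 8.5; neither the sum, the average, nor "the lower from the higher" recovers the true combined effect, which is exactly why only the combined experiment can tell us.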
Now I'm not saying that knowing these effect sizes is not useful, but I think we need extreme caution in how the numbers are used. Those with strong political or curriculum axes to grind will immediately cherry-pick the numbers that best suit their arguments; we have already seen this in a number of blogs. The snag with numbers is that people believe them to be absolutes. If I measured my life processes in terms of their contribution to my survival I might say sight 0.2, hearing 0.15, taste 0.1, respiration 1.0, food 1.0. Clearly people can't live without respiration and food, but we don't then simply dismiss the others as unimportant. What gives quality of life is the optimisation of those factors in relation to each other. I think this is likely to be true of education too.