I’ve been meaning to write a note about teacher evaluations. College profs, specifically. The problem is that I think all the stuff I want to say about it is both obvious and irrelevant; the use of the evaluations is so ludicrously bad that nobody can take them seriously, and yet they are used and relied on without being taken seriously.
Let’s just go over the situation, as briefly as I can manage it, what with brevity never being my strong suit. Universities routinely give students a paper form on which to fill in little bubbles and write semi-anonymous comments about their profs. These are given near the end of the semester, to cover a semester they haven’t finished; there are problems with that, but of course that’s when you can actually find the students. Or most of them, anyway; the students are not compelled to fill out the forms, and usually some of them don’t, either because they skipped that class or because they saw evaluation day as a chance to leave early. So there’s that.
There’s no national standard form, but generally students are asked a half-dozen questions about the course and a half-dozen about the prof, with available ratings from (in the versions I’ve seen) one to five. These are counted and averaged, and the averages are compared against each other.
Your Humble Blogger loathes averages. Averages can, it’s true, form one component of a useful analysis, but averages by themselves are pernicious, misleading and stupid. And averages are most obviously stupid when the quantity of numbers being averaged is small. If you live in a town of 25,000 households, the average household income will tell you something (although without even a standard deviation, that something is almost certain to be wrong), and a list of all 25,000 numbers won’t tell you anything at all. If you have the income of six households on the block, listing the six numbers is going to tell you much more than the average number. In a lecture class of two hundred, there’s some excuse for looking at the averages, but in a discussion section of eighteen, there isn’t. And there is no excuse for comparing the averages of the two hundred to the eighteen.
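To put numbers on that, here’s a minimal sketch (in Python, with made-up ratings, not from any real class) of two six-student sections that produce identical averages while being nothing alike:

```python
from statistics import mean, stdev

# Made-up ratings, purely for illustration.
section_a = [3, 3, 3, 3, 3, 3]   # six lukewarm students
section_b = [1, 1, 1, 5, 5, 5]   # three grudges, three crushes

for name, ratings in [("A", section_a), ("B", section_b)]:
    print(f"Section {name}: mean={mean(ratings):.2f}, "
          f"stdev={stdev(ratings):.2f}, raw={ratings}")
# Section A: mean=3.00, stdev=0.00, raw=[3, 3, 3, 3, 3, 3]
# Section B: mean=3.00, stdev=2.19, raw=[1, 1, 1, 5, 5, 5]
```

Same average, wildly different rooms; the six raw numbers tell you which is which, and the average alone never will.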
Also, the students mostly do not take the evaluations very seriously. Some do, some don’t. Some just fill in fours and turn the paper in. Some have a crush on the prof and fill in fives. Some have a grudge and fill in ones. You don’t see a lot of twos or threes, though. That’s a big problem if you are working with the averages.
Now, I think it could be helpful to keep track of how many students are unhappy enough to give ones. A prof who consistently has more than one angry student is probably doing something not altogether good; a prof who never has any angry students is likewise a little suspicious, no? Of course, when the prof has the opportunity to schedule the evaluations, which is likely enough, there’s a chance to try to slide out from under one or two of those grudge-holders. Still, if we were going to be at all serious about analyzing the results of the evaluations, we would see the percentage of ones (and twos and threes and so on) in addition to an average; but then, in a small class, it’s just as easy to look at the raw numbers.
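If an institution did want to report that breakdown, the tally is trivial. A sketch, again with invented numbers:

```python
from collections import Counter

# A hypothetical discussion section of twelve forms.
ratings = [5, 4, 4, 1, 5, 4, 5, 3, 5, 4, 1, 5]

counts = Counter(ratings)
for score in range(1, 6):
    pct = 100 * counts[score] / len(ratings)
    print(f"{score}: {counts[score]:2d} form(s) ({pct:.0f}%)")
# Two ones out of twelve (17%) is a fact the average of 3.8
# would never surface on its own.
```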
What else… the scantron numbers are not really reliable. One year, I was looking at a set of original forms and the averages, and, thinking they seemed odd, started running the numbers and realized that the averages simply didn’t match up to the original forms at all. I never found out what the mistake was, either. The institution might have mislaid a bunch of forms, or put one prof’s forms into a different prof’s class, or run a bunch of forms through the machine together. So there’s that, too. And then: in a recent semester, a department I saw the numbers for averaged 4.7 out of 5 on the profs’ enthusiasm question. 4.7! Out of 5! Across the department! How could that even mean anything? It’s not just nonsense on stilts, it’s nonsense in a giant hovercraft filled with eels.
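That kind of mismatch is easy to check for, by the way, if you ever get your hands on the raw forms; here’s a sketch of the recompute-and-compare step (a hypothetical audit function, with invented numbers):

```python
from statistics import mean

def audit(raw_forms, reported_avg, tol=0.05):
    """Recompute the average from the original forms and flag a mismatch."""
    recomputed = mean(raw_forms)
    if abs(recomputed - reported_avg) > tol:
        return f"MISMATCH: forms give {recomputed:.2f}, report says {reported_avg:.2f}"
    return "OK"

# Six forms that plainly don't support a 4.7 average.
print(audit([4, 5, 3, 4, 4, 5], reported_avg=4.7))
# MISMATCH: forms give 4.17, report says 4.70
```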
And here’s the amazing thing, the thing that really gets to me: everybody knows all this stuff. Nobody seriously thinks that all the students, or even three-quarters of the students, are filling these forms out carefully and thoughtfully. Nobody thinks that all the students fill out the forms at all, and nobody thinks that the students who do fill out the forms form a representative sample. Nobody really thinks it makes sense to compare averages between small and large classes. Nobody who has put any thought into it thinks it makes any sense to look at averages without standard deviations. Nobody thinks that the scantrons are perfectly accurate. And even if all of the number stuff were perfect, nobody really thinks that the reaction of students a few weeks before the exam is a good way of telling you how effective the prof was. Seriously. No-one. Nobody defends this stuff.
And everybody uses it. Universities spend big money having it done, and profs spend big time spinning the results to their promotion and tenure committees, and profs spend big energy fretting over a number dipping or rising. And committees spend big time and energy pretending to look at the numbers and interpret them.
I know screwier things happen in corporations; I’ve seen some. But come on.
Tolerabimus quod tolerare debemus,
-Vardibidian.

I am sure that what you describe is typical of many institutions, but I thought it might be worthwhile to describe the way we handle it at my, admittedly exceptional, institution. I know something about how this works since I wrote some of the code that makes it work.
First, no scantrons. The evaluations are all done online.
Second, no early evaluations and no we-did-evaluations-on-the-day-you-were-out. The online evaluation period opens on the day after classes end.
Third, it’s not optional. At least, you can’t look at your grades online until you fill out your evaluations, and looking at grades online is something a lot of students like to do. At any rate, most students fill out the forms.
Now, as to how the evaluation scores are used, I can’t say. We don’t compute averages across a class, but we do provide other, more useful ways of summarizing the data, such as a frequency-table graph. We’ve also just added a new feature (requested by a statistics prof and implemented by me) which shows, for each answer, the average and standard deviation of that particular student’s answers on that particular question, across all of their previous classes.
That is to say, if a given student typically answers “4” to the question about the professor’s energy, but for this particular class answered “1”, the professor will be able to see that (a small standard deviation will show that the student’s answers cluster closely around the mean of 4), which may make that answer more meaningful than a “1” from a student who always answers “1”.
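For the curious, the arithmetic behind that feature is simple. Here’s a minimal sketch of the idea in Python; this is not our actual code, and the function name and data are invented for illustration:

```python
from statistics import mean, stdev

def contextualize(history, this_answer):
    """Set one answer against the student's history on the same question."""
    if len(history) < 2:
        return None  # not enough history to say anything useful
    return {
        "student_mean": round(mean(history), 2),
        "student_stdev": round(stdev(history), 2),
        "this_answer": this_answer,
    }

# A student who usually answers 4 and suddenly answers 1:
print(contextualize([4, 4, 5, 4, 4], 1))
# {'student_mean': 4.2, 'student_stdev': 0.45, 'this_answer': 1}
```

A tight cluster around 4 makes the outlying 1 worth a second look; a history of all ones would make it unremarkable.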
Yeah, and what’s up with tying the same kind of bullshit, cockamamie, nonsense, bullshit, ineffective, infuriating, bullshit, and did I mention bullshit system to public grade school teachers? BULLSHIT!
Says the public grade school teacher-to-be. Taxes and tags not included, your mileage may vary, grain of salt firmly in cheek.
But not the tongue, ’cause it’s fucking bullshit.
peace
Matt