What is inter-rater reliability? Colloquially, it is the level of agreement between people completing any rating of anything. A high level of inter-rater reliability is said to exist when there is a high level of agreement between raters; a low level of agreement indicates low reliability.
How is inter-rater reliability measured? At its simplest, by percentage agreement or by correlation. More robust measures include Cohen's kappa, which corrects for the agreement that would occur by chance.
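To make those two measures concrete, here is a minimal sketch in Python, using invented ratings, of percentage agreement and Cohen's kappa for two raters. Kappa discounts the agreement the raters would reach by chance alone, based on how often each uses each category.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two raters give the same rating."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance, given each rater's marginal rating frequencies."""
    assert len(a) == len(b)
    n = len(a)
    p_o = percent_agreement(a, b)
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over every category either rater used.
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(a) | set(b))
    if p_e == 1:  # both raters used a single category; agreement is trivial
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two raters score ten candidates as "pass" or "fail" (hypothetical data).
rater1 = ["pass", "pass", "fail", "pass", "fail",
          "pass", "pass", "fail", "pass", "pass"]
rater2 = ["pass", "fail", "fail", "pass", "fail",
          "pass", "pass", "pass", "pass", "pass"]

print(f"Percent agreement: {percent_agreement(rater1, rater2):.2f}")  # 0.80
print(f"Cohen's kappa:     {cohens_kappa(rater1, rater2):.2f}")       # 0.52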
A note of caution: if you had asked Western religious leaders of the 15th century whether the sun revolves around the Earth, you would have received 100% agreement. A rating on whether the Earth was flat would have received the same result, as would asking 17th-century slavers whether theirs was a legitimate business.
That is, high inter-rater reliability can itself be a red flag: a sign that much, much more serious problems are afoot.
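To see how that can happen numerically, here is a hedged sketch (wholly invented data) of two raters who both rely on the same salient but misleading cue instead of the truth. They agree with each other almost perfectly, yet their ratings match reality no better than a coin flip.

```python
import random

random.seed(0)

n_items = 100
truth = [random.choice(["A", "B"]) for _ in range(n_items)]  # ground truth
# A salient but misleading cue, unrelated to the truth, that both raters use.
cue = [random.choice(["A", "B"]) for _ in range(n_items)]

def biased_rater(cue, error_rate=0.05):
    """Rates by the shared cue, with a little individual noise."""
    return [c if random.random() > error_rate else ("B" if c == "A" else "A")
            for c in cue]

r1, r2 = biased_rater(cue), biased_rater(cue)

agree = sum(x == y for x, y in zip(r1, r2)) / n_items
valid = sum(x == t for x, t in zip(r1, truth)) / n_items
print(f"Inter-rater agreement: {agree:.2f}")  # high: the bias is shared
print(f"Agreement with truth:  {valid:.2f}")  # near chance (about 0.50)
```

The high agreement here reflects the shared bias, not accuracy; exactly the red flag the historical examples above illustrate.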
When is the method of rating used? Usually when objective facts and scientifically robust measures cannot be used, or are not available; when the matter being rated is subjective, a matter of opinion. Ratings are also used when it is not time- or cost-effective to conduct an objective or scientific assessment. Where there is room for opinion, there is near certainty of difference.
Even eyewitness statements are hugely unreliable. Ten people witness the same event at the same time, from the same place, and you'll have 15 different opinions of what happened. That is, even when a fixed set of facts was physically observed, there are differences of opinion. How much more variation is there when no clear facts are available? What causes such variation?
Attribution
Attribution theory (and the studies behind it) explores how and why people ascribe different causes to behaviour, their own and others'. People tend to attribute external causes for their own behaviour, and internal causes for the behaviour of others. In other words: 'I am not responsible for how I behave, outside events made me do it, but you are responsible for how you behave.'
If the rater is sympathetic to the person being rated, they will be more likely to attribute any failings to external causes. If they are unsympathetic, they will attribute internal causes.
That leads us to the nature of the rating. Does the rating question focus the attention of the rater on internal or external causality? If it does, it skews the results and the reliability. If it doesn’t, it leaves the unreliability in the hands of the attributional preferences of the rater. Since we each have our own attributional preferences, in different contexts, there is built-in, and inevitable, inter-rater unreliability.
Attention
When people have their minds on one aspect of events, they are blind to other aspects taking place simultaneously. Even if the other aspects are screamingly loud, most people will simply not process them if their attention is on an aspect of concern to them.
You may have seen the famous video of a gorilla walking through a scene: people asked to attend to something else (counting basketball passes) simply do not see the gorilla.
Even more revealing is that when people see the video and are told in advance that others cannot see the gorilla, they have difficulty understanding how anyone can have missed such an obvious event.
That tells us much about inter-rater reliability:
- People can rate only what they have given their attention to.
- They are blind to that to which they do not attend.
- People can’t understand how anyone could miss the aspect to which they have given their attention.
Processing
Even when people attend to the same aspects of something being rated, they process what exists in objective reality through their own paradigms and filters. For instance, if someone who believes all politicians are liars is asked to rate a politician, then however skilled the person being rated, the lens of “consummate liar” will impact the rating.
Equally, a person who is widely perceived to be very attractive will be rated higher than someone of average appearance for exactly the same performance. This rater bias is known as the halo effect. It seems that the more attractive a rater finds someone, the stronger the halo effect, and the greater the positive bias in the rating.
Prior emotional experience
Advertisers all over the world seek to give their target audiences pleasant emotional experiences. Why? Because they know that if the audience can be imprinted with positive emotional experiences, the target buyers will rate the product or service higher than the competition's, even if the competitors objectively have a better product or service. Positive emotional experiences predispose us when we come to make buying decisions. We are largely unaware that we are buying because the product or service evokes pleasant feelings; an emotional halo effect.
In a rating situation, as with buying decisions, prior emotional experience of what is being rated matters. If a manager is being evaluated using 360-degree feedback, and just before the rating they announced something that all staff appreciate, it will be no surprise that the ratings come back higher than they otherwise would have.
Motives
People denigrate, belittle and undermine what opposes them, and aggrandise what they believe is in their best interests.
Politicians are rated at the polls, and know that to get support they need do no more than tell people what they want to hear; they pander to people's motives. If raters have different motives, we should find that reflected in the ratings they give. When rating something that aligns with their motives, raters will score it higher than something, or someone, opposed to those motives. Since we all have motives, inter-rater unreliability is built in.
Language meaning
One of the most misused words in English is “strategy”. It is used, misused and abused to mean all sorts of things. Many, if not most, words have multiple meanings, and few people can give dictionary-accurate definitions of the words we all use so loosely. In any rating system, if any two raters have even slightly different understandings of any one word, the rating is subject to yet another impediment to reliability. Most rating systems consist of a large number of words, multiplying the opportunities for inter-rater reliability problems.
Other factors impacting inter-rater reliability
There are many, many more factors that impact inter-rater reliability. For instance, to name but a few:
- Time of day (some people are more generous in the mornings than when they are depleted in the evenings, and vice versa)
- Season of the year (in winter, moods are generally lower in some people than in others)
- Major events (people are more generous in the lead-up to major celebrations)
- Events in raters' lives (if someone has just been falsely accused of something, or has lost a loved one, their depressed spirits may deflate any ratings they give)
Inter-rater unreliability seems built-in and inherent in any subjective evaluation. Even when the rating appears to be 100% ‘right’, it may be 100% ‘wrong’.
If inter-rater reliability is high, it may be because we have asked the wrong question, or based the questions on a flawed construct.
If inter-rater reliability is low, it may be because the rating is seeking to “measure” something so subjective that the inter-rater reliability figures tell us more about the raters than what they are rating.
Professor Nigel MacLennan runs the performance coaching practice PsyPerform.