Four principles for analysing agreement between metrics and peer review

In discussions about the use of citation-based indicators (‘metrics’) in research evaluations, the degree to which these indicators yield results that are in agreement with the outcomes of peer review exercises plays a key role. Unfortunately, there is a lack of consensus on how agreement between metrics and peer review can best be analysed. As a consequence, different studies take different approaches, obtain different results, and sometimes draw completely opposite conclusions.

This is clearly visible in discussions about the use of metrics in the Research Excellence Framework (REF) in the UK, where some studies report an almost perfect correlation between metrics and peer review while other studies, most notably the influential Metric Tide report, suggest that metrics and peer review are only weakly correlated. Based on a paper that we published last week, we propose four principles for studying agreement between metrics and peer review in a systematic manner. Below, we illustrate each of these principles with concrete examples obtained from our analysis of the UK REF.

[Figure: scatter plot of the size-independent correlation between metrics and peer review, Physics]

1. Choose the proper level of aggregation

The agreement between metrics and peer review depends on the level of aggregation. The Metric Tide report analysed agreement between metrics and peer review at the level of individual publications. However, this is not the appropriate level of aggregation in the context of the REF. The goal of the REF is not to evaluate individual publications but to evaluate institutions. Even though metrics and peer review are only weakly correlated at the individual publication level, they may still be quite strongly correlated at the institutional level, because errors at the publication level may ‘cancel out’ when many publications are aggregated. Indeed, correlations turn out to be substantially higher at the institutional level (often around 0.7) than at the publication level (always below 0.4).
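The ‘cancelling out’ of errors can be illustrated with a small simulation. The sketch below is not the analysis from our paper: the numbers of institutions and publications, the noise levels, and the assumption that peer review and metrics are both a latent quality plus independent noise are all made up for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n_institutions=50, pubs_per_institution=100, seed=1):
    """Return (publication-level, institution-level) correlation.

    Assumed toy model: each publication has a latent quality; peer review
    and metrics each observe that quality with independent noise.
    """
    rng = random.Random(seed)
    pub_peer, pub_metric = [], []
    inst_peer, inst_metric = [], []
    for _ in range(n_institutions):
        inst_mean = rng.gauss(0.0, 0.5)  # institutions differ in average quality
        peer, metric = [], []
        for _ in range(pubs_per_institution):
            quality = inst_mean + rng.gauss(0.0, 1.0)
            peer.append(quality + rng.gauss(0.0, 1.5))    # noisy peer review score
            metric.append(quality + rng.gauss(0.0, 1.5))  # noisy citation score
        pub_peer += peer
        pub_metric += metric
        # Aggregate to the institutional level by averaging over publications.
        inst_peer.append(sum(peer) / len(peer))
        inst_metric.append(sum(metric) / len(metric))
    return pearson(pub_peer, pub_metric), pearson(inst_peer, inst_metric)

pub_corr, inst_corr = simulate()
print(f"publication level: {pub_corr:.2f}")
print(f"institution level: {inst_corr:.2f}")
```

In this toy setting, averaging over the publications of an institution shrinks the independent noise much more than it shrinks the differences between institutions, which is why the institutional correlation comes out higher.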

2. Distinguish between size-dependent and size-independent perspectives

The REF has multiple objectives: it aims to provide a reputational yardstick and a basis for distributing funding. A reputational yardstick is usually related to the average scientific quality of the publications of an institution. As such, it is size-independent: it does not depend on the size of an institution. On the other hand, in the REF, funding is allocated based on the total scientific quality of the publications of an institution, which means that funding is size-dependent: institutions with more output or staff generally receive more funding. Given the multiple objectives of the REF, agreement between metrics and peer review in the REF should be analysed from both a size-dependent and a size-independent perspective.

Importantly, correlations between metrics and peer review tend to be much higher from a size-dependent perspective than from a size-independent perspective. This is because in the size-dependent perspective metrics and peer review share a common factor, namely the size of an institution. Size-dependent correlations can be very high even when the corresponding size-independent correlations are low. For example, in the field of mathematics in the REF, the size-dependent correlation between metrics and peer review equals 0.96, whereas the size-independent correlation equals only 0.39.
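The role of size as a common factor can be seen in a short simulation. This is only a sketch under invented assumptions (60 institutions, log-normally distributed sizes, and shares of top publications that agree only weakly between peer review and metrics); it does not use REF data.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def clamp(share):
    """Keep a simulated share within the valid range [0, 1]."""
    return min(1.0, max(0.0, share))

rng = random.Random(7)
sizes, peer_share, metric_share = [], [], []
for _ in range(60):  # hypothetical number of institutions
    sizes.append(rng.lognormvariate(4.0, 1.0))  # institution sizes vary widely
    quality = rng.gauss(0.30, 0.03)             # latent share of top publications
    peer_share.append(clamp(quality + rng.gauss(0.0, 0.05)))
    metric_share.append(clamp(quality + rng.gauss(0.0, 0.05)))

# Size-independent perspective: compare shares (averages per publication).
size_independent = pearson(peer_share, metric_share)

# Size-dependent perspective: compare totals, i.e. share multiplied by size.
peer_total = [s * p for s, p in zip(sizes, peer_share)]
metric_total = [s * m for s, m in zip(sizes, metric_share)]
size_dependent = pearson(peer_total, metric_total)

print(f"size-independent correlation: {size_independent:.2f}")
print(f"size-dependent correlation:   {size_dependent:.2f}")
```

Because both totals are multiplied by the same institution size, the size-dependent correlation is driven largely by size rather than by agreement on quality.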

3. Use an appropriate measure of agreement

Correlation coefficients provide only limited insight into agreement between metrics and peer review and, especially in size-dependent analyses, they can lead to misleading conclusions. Alternative measures of agreement should therefore be considered, for example the median absolute difference and the median absolute percentage difference.

The median absolute difference gives an indication of the absolute increase or decrease that can be expected when switching from peer review to metrics. We consider this measure to be especially informative when taking a size-independent perspective. On average, about 30% of the publications in the UK REF are marked as “world-leading”. In fields such as physics and clinical medicine, we find a median absolute difference of about 3 percentage points for the percentage of “world-leading” publications. Hence, in these fields, switching from peer review to metrics will typically lead to an increase or decrease in the percentage of “world-leading” publications of an institution by about 3 percentage points.

Taking a size-dependent perspective and focusing on the funding received by institutions, we consider the median absolute percentage difference to be an insightful measure of agreement. In physics and clinical medicine, the median absolute percentage difference between metrics and peer review in the REF is about 15%. This essentially means that in these fields the funding allocated to an institution will typically increase or decrease by about 15% when switching from peer review to metrics.
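Both measures are straightforward to compute. The sketch below assumes that the percentage difference is taken relative to the peer review outcome, which may differ in detail from the definition in our paper, and all institution values are hypothetical.

```python
import statistics

def median_absolute_difference(peer, metric):
    """Median of |metric - peer| across institutions, in the units of the input."""
    return statistics.median(abs(m - p) for p, m in zip(peer, metric))

def median_absolute_percentage_difference(peer, metric):
    """Median of 100 * |metric - peer| / peer across institutions.

    Assumption: the peer review value is used as the reference point.
    """
    return statistics.median(100.0 * abs(m - p) / p for p, m in zip(peer, metric))

# Hypothetical percentages of "world-leading" publications for five institutions
# (size-independent perspective, values in percentage points).
peer_share = [30.0, 25.0, 40.0, 35.0, 20.0]
metric_share = [33.0, 24.0, 36.0, 38.0, 22.0]
mad = median_absolute_difference(peer_share, metric_share)

# Hypothetical funding amounts for the same institutions
# (size-dependent perspective, arbitrary currency units).
peer_funding = [100.0, 200.0, 50.0, 400.0, 80.0]
metric_funding = [115.0, 180.0, 60.0, 380.0, 68.0]
mapd = median_absolute_percentage_difference(peer_funding, metric_funding)

print(f"median absolute difference: {mad} percentage points")
print(f"median absolute percentage difference: {mapd}%")
```

Unlike a correlation coefficient, both values can be read directly as the typical change an institution would see when switching from peer review to metrics.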

4. Acknowledge uncertainty in peer review

Regardless of the level of aggregation, the perspective (i.e. size-dependent or size-independent), and the measure of agreement, it is crucial to acknowledge that peer review is subject to uncertainty. Hypothetically, if the REF peer review had been carried out twice, based on the same publications but with different experts, the outcomes would not have been the same. Comparing the outcomes of two independent peer review exercises provides an indication of the internal agreement of peer review. This provides a sensible baseline with which to compare the agreement between metrics and peer review. If agreement between metrics and peer review is similar to internal peer review agreement, metrics and peer review yield essentially indistinguishable results.
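To make the idea of such a baseline concrete, the sketch below simulates two independent peer review exercises and reports their median absolute difference. The binomial model used here (each publication is independently rated “world-leading” with a probability equal to the institution's true share) is an assumption made for illustration and is not necessarily the model used in our paper; all input values are hypothetical.

```python
import random
import statistics

def simulate_share(n_pubs, true_share, rng):
    """Percentage of publications rated world-leading in one simulated exercise."""
    hits = sum(1 for _ in range(n_pubs) if rng.random() < true_share)
    return 100.0 * hits / n_pubs

def internal_agreement(n_pubs_per_institution, true_shares, seed=3):
    """Median absolute difference (in percentage points) between two
    independent simulated peer review exercises on the same institutions."""
    rng = random.Random(seed)
    diffs = []
    for n_pubs, share in zip(n_pubs_per_institution, true_shares):
        first = simulate_share(n_pubs, share, rng)
        second = simulate_share(n_pubs, share, rng)
        diffs.append(abs(first - second))
    return statistics.median(diffs)

# Hypothetical setting: 50 institutions, each with a true share of 30%.
true_shares = [0.30] * 50
baseline_small = internal_agreement([100] * 50, true_shares)    # 100 publications each
baseline_large = internal_agreement([1000] * 50, true_shares)   # 1000 publications each

print(f"baseline, small institutions: {baseline_small:.1f} percentage points")
print(f"baseline, large institutions: {baseline_large:.1f} percentage points")
```

An observed median absolute difference between metrics and peer review can then be judged against such a baseline: if the two are of similar size, the disagreement is no larger than what this model of peer review would produce on its own.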

Unfortunately, there are no empirical measurements of peer review uncertainty in the REF. In our paper, we therefore study peer review uncertainty based on a simple mathematical model. We recommend that peer review uncertainty be measured empirically in the next edition of the REF in 2021.

Our model of peer review uncertainty suggests that in some fields, in particular in physics, clinical medicine, and public health, agreement between metrics and peer review is quite close to internal peer review agreement. In these fields, differences between metrics and peer review seem to be of a similar magnitude to differences between two peer review exercises, from both a reputational (size-independent) and a funding (size-dependent) perspective.

Conclusion

A large number of scientometric studies have analysed agreement between metrics and peer review. Many of these studies take one specific perspective, often without a clear motivation, leading to apparent inconsistencies in the literature. To develop a systematic understanding of agreement between metrics and peer review, we consider it essential to take into account the four principles presented above.

Finally, we emphasize that, even if metrics are found to agree well with peer review, there may be other arguments against replacing peer review with metrics, such as arguments related to goal displacement. The various arguments should be carefully weighed in discussions about the use of metrics as an alternative to peer review. We therefore do not suggest that metrics should replace peer review in the REF. However, in some REF fields, the argument that metrics should not be used because of their low agreement with peer review does not stand up to closer scrutiny.

About Vincent Traag

Senior researcher and bibliometric consultant. Vincent's research focuses on complex networks and social dynamics. He holds a Master's degree in sociology and a PhD in applied mathematics, and tries to combine the two in his work.

About Ludo Waltman

Ludo Waltman is professor of Quantitative Science Studies and scientific director at the Centre for Science and Technology Studies (CWTS) at Leiden University. He is a coordinator of the Information & Openness focal area and a member of the Evaluation & Culture focal area. Ludo is co-chair of the Research on Research Institute (RoRI).

3 comments

Vincent Traag November 6th, 2018 10:42 am

You are quite correct, Loet, that citations also carry some uncertainty. There are in fact multiple types of uncertainty: uncertainty regarding citation counts (some references may not have been correctly matched), uncertainty regarding coverage (not all relevant citations may be included), uncertainty regarding normalisation (which publications would be considered comparable), and, perhaps most prominently, uncertainty with regard to the extent to which citations capture any underlying notion of "value".
When using metrics in an evaluative context, we would indeed want to avoid false precision (see also point 8 of the Leiden Manifesto). However, it would be equally falsely precise to blindly rely on any estimate of uncertainty, exactly because we do not know all sources of uncertainty. So, any estimate of uncertainty should not be used in a dichotomous way (i.e. significant difference or not), and should be interpreted and judged in specific contexts.
In the context of the comparison of peer review uncertainty and metrics, I believe the inclusion of any uncertainty in metrics would not change our conclusion.

Loet Leydesdorff September 24th, 2018 8:39 am

5. Acknowledge uncertainty in institutional rankings
In addition to the four principles for analyzing agreement between metrics and peer review, a fifth principle can be formulated, in my opinion: uncertainty in the metrics themselves, or misspecification of the error in the measurement. The metrics, for example, often suggest a precision which cannot be justified. In Leydesdorff, Bornmann, & Mingers (2018, forthcoming), for example, we concluded the following on the basis of a meta-evaluation of the Leiden Rankings 2017:
“The more detailed analysis of universities at the country level suggests that distinctions beyond three or perhaps four groups of universities (high, middle, low) may not be meaningful. Given similar institutional incentives, isomorphism within each eco-system of universities should not be underestimated. Our results suggest that networks based on overlapping stability intervals can provide a first impression of the relevant groupings among universities. However, the clusters are not well-defined divisions between groups of universities.”
Figure 1: Grouping of 13 Dutch universities into non-significantly different sets based on the Leiden Rankings 2018 using the potentially overlapping stability intervals as a statistic.
In summary, referees cannot be expected to be able to distinguish among universities which belong to the same group.
Reference:
Leydesdorff, L., Bornmann, L., & Mingers, J. (2018, forthcoming). Statistical Significance and Effect Sizes of Differences among Research Universities at the Level of Nations and Worldwide based on the Leiden Rankings. Journal of the Association for Information Science and Technology.

- Loet Leydesdorff September 24th, 2018 8:48 am
  
  PS. For Figure 1, see https://www.leydesdorff.net/lr2018/
  
