Ten principles for the responsible use of university rankings

University rankings are controversial. Well-known rankings such as the Academic Ranking of World Universities (ARWU), commonly referred to as the Shanghai Ranking, and the World University Rankings of Times Higher Education (THE) and QS attract a lot of criticism. Nevertheless, universities often feel under pressure to perform well in these rankings and may therefore pay considerable attention to them in their decision making.

Today the 2017 edition of our CWTS Leiden Ranking was released. We use this opportunity to make a statement on appropriate and inappropriate ways of using university rankings. Below we present ten principles that are intended to guide universities, students, governments, and other stakeholders in the responsible use of university rankings. The ten principles relate to the design of university rankings and their interpretation and use. In discussing these principles, we pay special attention to the Leiden Ranking and the way in which it differs from the ARWU, THE, and QS rankings.

Proposals of principles that could guide the responsible use of scientometric tools are not new. In 2015, we contributed to the Leiden Manifesto, in which ten principles for the proper use of scientometric indicators in research evaluations were presented. The Leiden Manifesto considers scientometric indicators in general, while the principles presented below focus specifically on university rankings. Some of the principles proposed below can be seen as elaborations of ideas from the Leiden Manifesto in the context of university rankings.

Many complex questions can be raised in discussions on university rankings. The ten principles that we present certainly do not answer all questions. The principles are based on our experience with the Leiden Ranking, our participation in discussions on university rankings, and also on scientific literature and policy reports in which university rankings are discussed. We are well aware that the principles may need to be refined and amended. Critical feedback on the principles is therefore very much appreciated, including suggestions on alternative principles. To provide feedback, the form at the bottom of this page can be used.

Design of university rankings

1. A generic concept of university performance should not be used

The THE ranking claims to “provide the definitive list of the world’s best universities”. Similar claims are sometimes made by other major university rankings. This is highly problematic. Different users of university rankings are interested in different dimensions of university performance, and therefore a shared notion of ‘best university’ does not exist. Whether a university is doing well or not depends on the dimension of university performance that one is interested in. Some universities for instance may be doing well in teaching, while others may be doing well in research. There is no sensible way in which a good performance in one dimension can be weighed against a less satisfactory performance in another dimension.

The problematic nature of a generic concept of university performance is also visible in the composite indicators that are used in university rankings such as ARWU, THE, and QS. These composite indicators combine different dimensions of university performance in a rather arbitrary way. The fundamental problem of these indicators is the poorly defined concept of university performance on which they are based.
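The arbitrariness of composite indicators can be illustrated with a minimal Python sketch. The two universities, their scores, and the weights are all made up for illustration; none of them come from ARWU, THE, or QS, and the weighted sum is only one assumed way of building a composite.

```python
# Hypothetical scores (0-100) on two dimensions; not real ranking data.
scores = {
    "University A": {"teaching": 90, "research": 60},
    "University B": {"teaching": 65, "research": 85},
}


def composite_ranking(scores, weights):
    """Return university names sorted by weighted sum of scores, best first."""
    def composite(name):
        return sum(weights[dim] * value for dim, value in scores[name].items())
    return sorted(scores, key=composite, reverse=True)


# Two equally defensible (and equally arbitrary) weighting schemes.
teaching_heavy = composite_ranking(scores, {"teaching": 0.6, "research": 0.4})
research_heavy = composite_ranking(scores, {"teaching": 0.4, "research": 0.6})

print(teaching_heavy)
print(research_heavy)
```

With these numbers, a modest shift in the weights reverses the order of the two universities, even though neither university's performance has changed.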

The Leiden Ranking considers only the scientific performance of universities and does not take into account other dimensions of university performance, such as teaching performance. More specifically, based on the publications of a university in international scientific journals, the Leiden Ranking focuses on the scientific impact of a university and on the participation of a university in scientific collaborations. Different aspects of the scientific performance of universities are quantified separately from each other in the Leiden Ranking. No composite indicators are constructed.

2. A clear distinction should be made between size-dependent and size-independent indicators of university performance

Size-dependent indicators focus on the overall performance of a university. Size-independent indicators focus on the performance of a university relative to its size or relative to the amount of resources it has available. Size-dependent indicators can be used to identify universities that make a large overall contribution to science or education. Size-independent indicators can be used to identify universities that make a large contribution relative to their size. Size-dependent and size-independent indicators serve different purposes. Combining them in a composite indicator, as is done for instance in the ARWU ranking, therefore makes no sense. In the Leiden Ranking, size-dependent and size-independent indicators are clearly distinguished from each other.

Users of university rankings should be aware that constructing proper size-independent indicators is highly challenging. These indicators require accurate data on the size of a university, for instance internationally standardized data on a university’s number of researchers or its amount of research funding. This data is very difficult to obtain. In the Leiden Ranking, no such data is used. Instead, size-independent indicators are constructed by using the number of publications of a university as a surrogate measure of university size.
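The distinction can be sketched in a few lines of Python. The universities and counts below are hypothetical. P(top 10%) denotes the number of a university's publications that belong to the top 10% most cited, and PP(top 10%) the same number divided by the university's total number of publications, so that the publication count serves as the surrogate for size.

```python
from dataclasses import dataclass


@dataclass
class PublicationCounts:
    name: str
    publications: int        # P: total number of publications
    top10_publications: int  # P(top 10%): size-dependent indicator

    @property
    def pp_top10(self) -> float:
        """PP(top 10%): size-independent share of highly cited publications."""
        return self.top10_publications / self.publications


# Hypothetical counts, chosen only to make the contrast visible.
universities = [
    PublicationCounts("Large U", publications=20000, top10_publications=2400),
    PublicationCounts("Small U", publications=4000, top10_publications=640),
]

best_size_dependent = max(universities, key=lambda u: u.top10_publications).name
best_size_independent = max(universities, key=lambda u: u.pp_top10).name

print(best_size_dependent, best_size_independent)
```

In this example the larger university leads on the size-dependent indicator, while the smaller one leads on the size-independent indicator. Both statements are correct; they simply answer different questions.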

3. Universities should be defined in a consistent way

In order to make sure that universities can be properly compared, they should be defined as much as possible in a consistent way. When a university ranking relies on multiple data sources (bibliometric databases, questionnaires, statistics provided by universities themselves, etc.), the definition of a university should be consistent between the different data sources. However, even when relying on a single data source only, achieving consistency is a major challenge. For instance, when working with a bibliometric data source, a major difficulty is the consistent treatment of hospitals associated with universities. There is a large worldwide variation in the way in which hospitals are associated with universities, and there can be significant discrepancies between the official relation of a hospital with a university and the local perception of this relation. Perfect consistency at an international level cannot be achieved, but as much as possible a university ranking should make sure that universities are defined in a consistent way. Rankings should also explain the approach they take to define universities. The Leiden Ranking offers such an explanation. Unfortunately, major university rankings such as ARWU, THE, and QS do not make clear how they define universities.

4. University rankings should be sufficiently transparent

Proper use of a university ranking requires at least a basic level of understanding of the design of the ranking. University rankings therefore need to be sufficiently transparent. They need to explain their methodology in sufficient detail. University rankings such as ARWU, THE, and QS offer a methodological explanation, but the explanation is quite general. The Leiden Ranking provides a significantly more detailed methodological explanation. Ideally, a university ranking should be transparent in a more far-reaching sense by making available the data underlying the ranking. This for instance could enable users of a ranking to see not only how many highly cited publications a university has produced, but also which of its publications are highly cited. Or it could enable users to see not only the number of publications of a university that have been cited in patents, but also the specific patents in which the citations have been made. Most university rankings, including the Leiden Ranking, do not reach this level of transparency, both because of the proprietary nature of some of the underlying data and because of commercial interests of ranking producers.

Interpretation of university rankings

5. Comparisons between universities should be made keeping in mind the differences between universities

Each university is unique in its own way. Universities have different missions and each university has a unique institutional context. Such differences between universities are reflected in university rankings and should be taken into account in the interpretation of these rankings. A university in the Netherlands for instance can be expected to be more internationally oriented than a university in the US. Likewise, a university focusing on engineering research can be expected to have stronger ties with industry than a university active mainly in the social sciences. To some extent, university rankings correct for differences between universities in their disciplinary focus. So-called field-normalized indicators are used for this purpose, but these indicators are used only for specific aspects of university performance, for instance for quantifying scientific impact based on citation statistics. For other aspects of university performance, no correction is made for the disciplinary profile of a university. The collaboration indicators in the Leiden Ranking for instance do not correct for this. In the interpretation of the indicators provided in a university ranking, one should carefully consider whether the disciplinary profile of a university has been corrected for or not.
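To give a rough idea of what field normalization does, the following Python sketch divides the citation count of each publication by an average citation count for its field and then averages the resulting ratios. This is only one simple form of normalization, and it is an assumption that it resembles what any particular ranking does. The field averages and publication data are invented.

```python
# Hypothetical average citation counts per field.
field_average = {"cell biology": 20.0, "mathematics": 4.0}


def mean_raw_citations(publications):
    """Average citation count, ignoring field differences."""
    return sum(cites for _, cites in publications) / len(publications)


def mean_normalized_score(publications, field_average):
    """Average of each publication's citations divided by its field average."""
    ratios = [cites / field_average[field] for field, cites in publications]
    return sum(ratios) / len(ratios)


# Each publication is a (field, citations) pair; all values are made up.
biology_focused = [("cell biology", 20), ("cell biology", 30), ("cell biology", 10)]
math_focused = [("mathematics", 6), ("mathematics", 4), ("mathematics", 8)]

raw_bio = mean_raw_citations(biology_focused)
raw_math = mean_raw_citations(math_focused)
norm_bio = mean_normalized_score(biology_focused, field_average)
norm_math = mean_normalized_score(math_focused, field_average)

print(raw_bio, raw_math, norm_bio, norm_math)
```

Without normalization the biology-focused university looks far stronger; after normalization the mathematics-focused university scores higher. An indicator that has not been normalized in this way, such as a collaboration indicator, partly reflects a university's disciplinary profile.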

6. Uncertainty in university rankings should be acknowledged

University rankings can be considered to be subject to various types of uncertainty. First, the indicators used in a university ranking typically do not exactly represent the concept that one is interested in. For instance, citation statistics provide insight into the scientific impact of the research of a university, but they reflect this impact only in an approximate way. Second, a university ranking may have been influenced by inaccuracies in the underlying data or by (seemingly unimportant) technical choices in the calculation of indicators. Third, there may be uncertainty in a university ranking because the performance of a university during a certain time period may have been influenced by coincidental events and may therefore not be fully representative of the performance of the university in a more general sense. It is important to be aware of the various types of uncertainty in university rankings. To some extent it may be possible to quantify uncertainty in university rankings (e.g., using stability intervals in the Leiden Ranking), but to a large extent one needs to make an intuitive assessment of this uncertainty. In practice, this means that it is best not to pay attention to small performance differences between universities. Likewise, minor fluctuations in the performance of a university over time can best be ignored. The focus instead should be on structural patterns emerging from time trends.
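The third type of uncertainty can be made tangible with a small Python sketch. It assumes that an interval is obtained by bootstrap resampling of a university's publications; this post does not describe how stability intervals are actually computed, so the function below is an illustration under that assumption rather than a description of the Leiden Ranking procedure. The function name, its parameters, and the data are hypothetical.

```python
import random


def stability_interval(is_top10, n_resamples=1000, level=0.95, seed=0):
    """Approximate percentile bootstrap interval for the share of highly cited publications.

    is_top10 holds one 0/1 flag per publication (1 = highly cited).
    """
    rng = random.Random(seed)
    n = len(is_top10)
    # Resample the publications with replacement and recompute the share each time.
    shares = sorted(
        sum(rng.choices(is_top10, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return shares[lower_index], shares[upper_index]


# Two hypothetical universities with the same 12% share but different sizes.
small_university = [1] * 24 + [0] * 176      # 200 publications
large_university = [1] * 600 + [0] * 4400    # 5000 publications

small_low, small_high = stability_interval(small_university)
large_low, large_high = stability_interval(large_university)

print(small_low, small_high)
print(large_low, large_high)
```

The interval for the small university is much wider than the one for the large university, which is one reason why small performance differences, especially between universities with few publications, deserve little attention.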

7. An exclusive focus on the ranks of universities in a university ranking should be avoided; the values of the underlying indicators should be taken into account

The term 'university ranking' is somewhat unfortunate, since it implies a focus on the ranks of universities, which creates the risk of overlooking the values of the underlying indicators. Focusing on the ranks of universities can be misleading because universities with quite similar values for a certain indicator may have very different ranks. For instance, when universities in the Leiden Ranking are ranked based on their proportion of highly cited publications, the university at rank 300 turns out to have just 10% fewer highly cited publications than the university at rank 200. By focusing on the ranks of universities, one university may seem to perform much better than another, while the performance difference may in fact be relatively small.

Users of university rankings should also be aware that the rank of a university may drop when the number of universities included in a university ranking is increased. Such a drop in rank may be incorrectly interpreted as a decline in the performance of the university. The value of the underlying indicator may show that there actually has been no performance decline and that the drop in rank is completely due to the increase in the number of universities included in the ranking.
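The effect of adding universities can be shown with a short Python sketch. The university names and PP(top 10%) values are hypothetical.

```python
def rank_of(name, indicator):
    """1-based rank of a university when sorted by indicator value, highest first."""
    ordered = sorted(indicator, key=indicator.get, reverse=True)
    return ordered.index(name) + 1


# Hypothetical PP(top 10%) values in one edition of a ranking.
edition_1 = {"U1": 0.15, "U2": 0.12, "U3": 0.10}
# A later edition: identical values, but two universities have been added.
edition_2 = {**edition_1, "U4": 0.14, "U5": 0.13}

rank_before = rank_of("U2", edition_1)
rank_after = rank_of("U2", edition_2)
value_change = edition_2["U2"] - edition_1["U2"]

print(rank_before, rank_after, value_change)
```

U2 falls from rank 2 to rank 4 while its indicator value is exactly the same in both editions. Looking only at the rank would suggest a decline that the underlying indicator does not show.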

Use of university rankings

8. Dimensions of university performance not covered by university rankings should not be overlooked

University rankings focus on specific dimensions of university performance, typically dimensions that are relatively easy to quantify. The Leiden Ranking for instance has a quite narrow scope focused on specific aspects of the scientific performance of universities. Some other university rankings have a broader scope, with U-Multirank probably being the most comprehensive ranking system. However, there is no university ranking that fully covers all relevant dimensions of university performance. Teaching performance and societal impact are examples of dimensions that are typically not very well covered by university rankings. Within the dimension of scientific performance, scientific impact and collaboration can be captured quite well, but scientific productivity is much more difficult to cover. Dimensions of university performance that are not properly covered by university rankings should not be overlooked. Users of university rankings should be aware that even the most comprehensive rankings offer only a partial perspective on university performance. The information needs of users should always be leading, not the information supply by university rankings.

9. Performance criteria relevant at the university level should not automatically be assumed to have the same relevance at the department or research group level

Performance criteria that are relevant at the level of universities as a whole are not necessarily relevant at the level of individual departments or research groups within a university. It may for instance be useful to know how often articles published by a university are cited in the international scientific literature, but for a specific research group within the university, such as a research group in the humanities, this may not be a very useful performance criterion. Similarly, one may want to know how many publications of a university have been co-authored with industrial partners. However, for research groups active in areas with little potential of commercial application, this may not be the most appropriate performance criterion. It may be tempting for a university to mechanically pass on performance criteria from the university level to lower levels within the organization, but this temptation should be resisted. This is especially important when the distribution of resources within a university is partially dependent on key performance indicators, as is often the case.

10. University rankings should be handled cautiously, but they should not be dismissed as being completely useless

When used in a responsible manner, university rankings may provide relevant information to universities, researchers, students, research funders, governments, and other stakeholders. They may offer a useful international comparative perspective on the performance of universities. The management of a university may use information obtained from university rankings to support decision making and to make visible the strengths of the university. However, when doing so, the limitations of university rankings and the caveats in their use should be continuously emphasized.

We are grateful to Martijn Visser and Alfredo Yegros for helpful comments.

About Ludo Waltman

Ludo Waltman is professor of Quantitative Science Studies and scientific director at the Centre for Science and Technology Studies (CWTS) at Leiden University. He is a coordinator of the Information & Openness focal area and a member of the Evaluation & Culture focal area. Ludo is co-chair of the Research on Research Institute (RoRI).

About Paul Wouters

Paul Wouters is emeritus professor of scientometrics. Paul is interested in how evaluation systems have developed and are creating new constraints for the development of knowledge. He is also interested in the history of science in general and the role of information systems in these histories in particular.

About Nees Jan van Eck

Senior researcher, head of data science, and coordinator of the Information & Openness focal area. Nees Jan's research focuses on infrastructures and the development of tools and algorithms to support research assessment, science policy, and scholarly communication.

10 comments

Alex William September 13th, 2017 2:38 am

Dear Mr. Waltman,
Our first short comment shall now be followed by a more complete critical appraisal of your work.
The Leiden Ranking promises to “offer key insights into the scientific performance of over 900 major universities” “by using bibliometric indicators and by ranking universities according to these indicators”. The Ten Principles for the Responsible Use of University Rankings exhibits how limited and obliquitous the design, interpretation and use of university rankings can be. Even the slogan “moving beyond just rankings” used by the Leiden Ranking shows that the CWTS is not wholly convinced that university rankings represent “responsible metrics”.
Following these ten principles, it is obvious that the design of rankings raises serious issues; among them, we can mention the most worrisome ones:
1.-The concept of university and university performance can only be “generic” or it will be methodologically impossible to rank these institutions (principle 1 and 3). That is because every form of comparison presupposes that the compared entities are similar along one “generic”, i.e. non-particular, dimension.
For example, principle 9 states that indicators, which may be of relevance to a university as a whole, may not be so to an individual department. The Leiden Ranking does not take into account conference proceedings publications and book publications and you point out that this constitutes "an important limitation in certain research fields, especially in computer science, engineering, the social sciences and humanities." Considering that universities are composed of different departments with varying subject cultures and missions how can a comparison of different universities be justified methodologically and how can lacking data of departments be of relevance to universities as a whole?
Furthermore, taking into account the specific missions and highly diverse contexts of these institutions (principles 5 and 3) is especially hard when priority is given to non-composite indicators (principle 1).
More worrisome still is that universities that focus on developing departments and scientific disciplines that are well represented by the particular criteria selected by rankings excel in this discriminating type of evaluation. This comes at the cost of leaving at a clear disadvantage all universities with a more universal interest in the development of all scientific disciplines and the humanities, as well as those with a diverse mission of both advancing knowledge and transmitting it.
2.-In your response to my last comment you also mention that citation statistics are “widely regarded as important indicators of scientific performance”. I think there is considerable controversy about whether this methodology is a good way to measure scientific quality. That everybody is using it does not mean that there is no discussion about whether these measurements are meaningful and truly representative.
3.-Another big problem is that data and methods for all rankings are “proprietary” or “commercial” (principle 4), or demand a high level of expertise before they are sufficiently transparent for the ranked institutions to check whether they are ranked appropriately. Moreover, the need for recommendations on how the rankings should be interpreted and used is a clear indication that transparency in general is lacking.
4.-Interpretation of rankings is non-trivial: Principles 5-7 suggest that rankings should be handled with utmost care (5 and 7) and it even seems that rankings fail in acknowledging the uncertainty of their indicators (6).
5.-That the Leiden ranking is not moving from individual ranks to rank classes even though stability intervals are deemed necessary, illustrates nicely that the pressure to provide a sports-like league table is more pronounced than professional standards in scientometrics. And even if rankings are not made to hide any information, highlighting certain information probably results in making other information even more difficult to see.
6.-The use of rankings is non-trivial: When university rankings only provide a “partial perspective on university performance” and when it is obvious that evaluation regimes force institutions to conform to the indicators, why are rankings then useful? What is the justification for providing partial rankings, when the effects of these rankings are unintended? Are these rankings really serving as a benchmark to improve universities? What do you think about the process of homogenization? Who exactly are the beneficiaries of these rankings and which concrete benefits do they have when using this partial information presented in such a manner? Even when the Leiden Ranking openly limits the significance of its results, to what extent can it ensure that its users will do so too?
When confronted with all the considerations mentioned above, is it possible to say that the production of a university ranking is at all a responsible and meaningful task?

Alex William June 12th, 2017 3:08 pm

To principle 1:
In order to compare institutions of science to each other in a list of “universities”, you have to assume that they are sufficiently equal and you thereby apply some kind of generic concept. What is this concept in the case of the Leiden Ranking and can four bibliometric indicators alone represent it? On the one hand, constructing a composite indicator definitely creates a very arbitrary picture of “university”. But on the other hand, choosing a few indicators that represent only one dimension is arbitrary, too. There are definitely “different dimensions of university performance” and for many observers the modern university is characterized by the combination of these different dimensions, such as research, teaching and service. If “users of university rankings are interested in different dimensions of university performance”, how can a limitation to research provide “key insights into the scientific performance” to them? Can the dimension of “research” even be represented appropriately by measuring only journals of one source, excluding for example books and all non-English publications? Can a “poorly defined concept of university performance” be replaced by a narrower glance at the institution?

- Ludo Waltman June 12th, 2017 7:31 pm
  
  Thank you for your feedback on the principles we propose.
  You are completely right that the scope of the Leiden Ranking is quite narrow, since the ranking provides only publication, citation, and co-authorship statistics. Although we acknowledge the narrow scope of our ranking, we do feel it is fair to state that the statistics we make available offer “key insights into the scientific performance” of universities. Especially citation statistics are widely regarded as important indicators of scientific performance.
  As we make clear in principle 8, we definitely do not claim that the statistics made available in the Leiden Ranking cover all relevant aspects of the scientific performance of universities. The Leiden Ranking just provides a number of helpful statistics, without the intention to comprehensively cover all relevant aspects of universities’ scientific performance. In my view, your comments nicely illustrate the importance of principle 8: Dimensions of university performance not covered by university rankings should not be overlooked. In other words, the statistics made available in the Leiden Ranking should be combined with other sources of information on the performance of universities.
  Recognizing the narrow scope of the Leiden Ranking, I do want to mention that in the future we plan to broaden the scope of the ranking by providing a more diverse set of indicators.
  
Lutz Bornmann June 6th, 2017 1:03 pm

I enjoyed reading your principles for the responsible use of university rankings. These are well-formulated principles. However, I wonder why you do not follow your own principles more strongly in the Leiden ranking. For example, you say that the term 'university ranking' is unfortunate (principle 7). Thus, you could simply change the name of the Leiden ranking (e.g., to Leiden metrics). Furthermore, you could sort the universities only by name without allowing any further sorting. Another point is the consideration of differences between universities in rankings (principle 5). You say that the collaboration indicators in the Leiden ranking do not correct for the disciplinary profile of a university and thus violate the principle. I wonder why you provide these indicators; you could drop them if they do not accord with your own principles. Ludo Waltman criticizes Nature Research on the grounds that its signing of the San Francisco Declaration on Research Assessment (DORA) is not consistent with the aims of its Nature Index (doi:10.1038/545412a). Do we have a similar situation with the Leiden principles on university rankings and the Leiden ranking?

- Ludo Waltman June 6th, 2017 7:55 pm
  
  Thank you Lutz for raising these interesting points. Let me respond to them one by one.
  Indeed there is a tension between the statement we make in principle 7 and the name ‘Leiden Ranking’. During the past two years, we have had extensive discussions at CWTS on whether or not to change the name ‘Leiden Ranking’. In the end, we have chosen to keep the name, at least for the near future. Our argument is pragmatic. The Leiden Ranking is highly visible in discussions on university rankings. By changing the name, we expect there would be a significant risk of our tool losing its visibility in these discussions, which is something we believe would be regrettable. Nevertheless, I acknowledge that a name which does not include the word ‘ranking’ may be considered more in line with principle 7, and it is possible that in the future such a name will be adopted.
  I believe that it is helpful to allow users to sort universities based on the different indicators available in the Leiden Ranking. I do not see how this violates any of our ten principles. At CWTS, we have thought about sorting universities by default based on their name. In a sense, this would be the most neutral way of sorting universities. On the other hand, it is also a non-informative way of sorting them. Our choice now is to sort universities by default based on their number of publications. This is an informative way of sorting universities, while it avoids expressing a preference between size-dependent and size-independent impact indicators, such as P(top 10%) and PP(top 10%). However, I think we may consider using name-based sorting as the default choice in the future.
  I do not consider the collaboration indicators in the Leiden Ranking to violate principle 5. The principle does not state that indicators should be field normalized. In my view, performing field normalization has both advantages and disadvantages. It is perfectly ok to use non-normalized indicators, provided that one is aware that the indicators have not been field normalized.
  
Loet Leydesdorff May 19th, 2017 4:40 pm

Thirteen Dutch universities and ten principles in the Leiden Ranking 2017.
Under principle 6, you formulate as follows: “To some extent it may be possible to quantify uncertainty in university rankings (e.g., using stability intervals in the Leiden Ranking), but to a large extent one needs to make an intuitive assessment of this uncertainty. In practice, this means that it is best not to pay attention to small performance differences between universities.”
It seems to me of some relevance whether minor differences are significant or not. The results can be counter-intuitive. On the occasion of the Leiden Ranking 2011, Lutz Bornmann and I therefore developed a tool in Excel that enables the user to test (i) the significance of the difference between two universities and (ii) for each university the difference between its participation in the top-10% cited publications versus the ceteris-paribus expectation of 10% participation (Leydesdorff & Bornmann, 2012). Does the university perform above or below expectation?
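For readers unfamiliar with such tests, the two comparisons can be sketched in Python with textbook z-statistics for proportions. This is an assumption about the general technique only; the exact formulas implemented in the Excel tool are not given in this comment, and the function names and publication counts below are hypothetical.

```python
import math


def z_against_expectation(top10, publications, expected=0.10):
    """z-statistic for an observed top-10% share versus an expected share."""
    observed = top10 / publications
    standard_error = math.sqrt(expected * (1 - expected) / publications)
    return (observed - expected) / standard_error


def z_between(top10_a, pubs_a, top10_b, pubs_b):
    """Two-proportion z-statistic using a pooled standard error."""
    share_a, share_b = top10_a / pubs_a, top10_b / pubs_b
    pooled = (top10_a + top10_b) / (pubs_a + pubs_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / pubs_a + 1 / pubs_b))
    return (share_a - share_b) / standard_error


# Hypothetical counts: 10,000 publications each, 1,200 versus 1,150 highly cited.
z_expectation = z_against_expectation(1200, 10000)
z_difference = z_between(1200, 10000, 1150, 10000)

print(round(z_expectation, 2), round(z_difference, 2))
```

With these made-up numbers, the first university lies clearly above the 10% expectation (|z| well above the conventional 1.96 threshold), while the difference between the two universities stays below that threshold.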
The Excel sheet containing the test can be retrieved at http://www.leydesdorff.net/leiden11/leiden11.xls . In response to concerns similar to yours about using significance tests expressed by (Cohen, 1994; Schneider, 2013; Waltman, 2016), we added effect sizes to the tool (Cohen, 1988) . However, the weights of effect sizes are more difficult to interpret than p-values indicating a significance level.
For example, one can raise the question of whether the relatively small differences among Dutch universities indicate that they can be considered as a homogenous set. This is the intuitive assessment which dominates in the Netherlands. Using the stability intervals on your website, however, one can show that there are two groups: one in the western part of the country (the “randstad”) and another in more peripheral regions with significantly lower scores in terms of the top-10 publication (PP10). Figure 1 shows the division.
Figure 1: Thirteen Dutch universities grouped into two statistically homogenous sets on the basis of the Leiden Rankings 2017. Stability intervals used as methodology. (If not visible, see the version at http://www.leydesdorff.net/leiden17/index.htm )

You add to principle 6 as follows: “Likewise, minor fluctuations in the performance of a university over time can best be ignored. The focus instead should be on structural patterns emerging from time trends.”

Figure 2: Thirteen Dutch universities grouped into two statistically homogenous sets on the basis of the Leiden Rankings 2016. Methodology: z-test. (If not visible, see the version at http://www.leydesdorff.net/leiden17/index.htm )
Using the z-test in the excel sheet, Dolfsma & Leydesdorff (2016) showed a similar pattern in 2016 (Figure 2). Only the position of the Radboud University in Nijmegen was changed: in 2017, this university is part of the core group.
Using two subsequent years and two different methods, therefore, we have robust results and may conclude that there is a statistically significant division in two groups among universities in the Netherlands. This conclusion can have policy implications since it is counter-intuitive.
In summary, the careful elaboration of statistical testing enriches the Leiden Rankings which can without such testing be considered as descriptive statistics.
Amsterdam, May 19, 2017.
References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1994). The Earth is Round (p

- Ludo Waltman May 19th, 2017 5:30 pm
  
  Dear Loet,
  Statistical significance is not useful in the interpretation of the CWTS Leiden Ranking. In fact, it is likely to lead to misleading conclusions.
  Let me give an extreme example. Suppose hypothetically that we construct the Leiden Ranking not based on four years of publications (2012-2015) but based on 100 years of publications (1916-2015). We will then have a much larger number of publications for most of the Dutch universities. Consequently, it is likely that the differences between all, or at least between most, pairs of universities will be statistically significant. Following your reasoning, the conclusion then would be that each university finds itself individually in its own 'homogeneous set'.
  The other way around, suppose that we construct the Leiden Ranking not based on four years of publications (2012-2015) but based on just one year of publications (e.g., 2015). We will then have a much smaller number of publications for the Dutch universities. Consequently, it is likely that the differences between most pairs of universities will not be statistically significant. Following your reasoning, the conclusion then would be that all universities find themselves together in a single 'homogeneous set'.
  Hence, depending on the length of the publication window that is used in the Leiden Ranking, we may find that the Dutch university landscape consists of two 'homogeneous sets', a separate 'homogeneous set' for each university, or just a single 'homogeneous set'.
  Clearly, the idea of 'homogeneous sets' derived from significance tests is of no policy relevance. It gives results that mainly reflect the amount of data used in the calculation of the indicators.
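The dependence on the amount of data can be sketched in Python with a textbook two-proportion z-statistic. The shares and publication counts are hypothetical and chosen only to mimic publication windows of different lengths; this is not a reanalysis of the Leiden Ranking data.

```python
import math


def two_proportion_z(share_a, share_b, pubs_per_university):
    """z-statistic for two universities that have equal publication counts."""
    pooled = (share_a + share_b) / 2
    standard_error = math.sqrt(2 * pooled * (1 - pooled) / pubs_per_university)
    return (share_a - share_b) / standard_error


# Same hypothetical shares (12.0% versus 11.5%); only the amount of data differs.
windows = {"1 year": 2500, "4 years": 10000, "100 years": 250000}
z_by_window = {
    label: two_proportion_z(0.120, 0.115, pubs) for label, pubs in windows.items()
}

for label, z in z_by_window.items():
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"{label}: z = {z:.2f} ({verdict} at the 5% level)")
```

The performance difference is identical in all three cases, yet only the longest window crosses the conventional threshold, because the z-statistic grows with the square root of the number of publications.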
  For additional arguments why statistical inference is not helpful in the CWTS Leiden Ranking, and in similar types of bibliometric exercises, please read http://dx.doi.org/10.1016/j.joi.2016.09.012.
  Ludo
  
  - Loet Leydesdorff May 20th, 2017 8:22 am
    
    Dear Ludo,
    Your paper about significance (Waltman, 2016) was on the reference list, but unfortunately cut off by the system. (See the full text at http://www.leydesdorff.net/leiden17/index.htm .) In order to accommodate your objections, I used the stability intervals which you provide in the Leiden Ranking in the 2017 analysis. In my opinion, significance testing is legitimate:
    1. This is descriptive statistics for exploring the data. I am not studying hypothetical data sets since 1916, but exploring the set that is under study as a data source.
    2. When the stability intervals on your website do not overlap, the sets are significantly different at the level of significance chosen by you and your team (1%, 5%?).
    3. PP10 is size-normalized. The expected value for PP10 is 10% because of the definition. When one has an expected value, an observed value can be tested for significance using a number of tests. We use the z-test.
    4. The statistics add to the interpretation of the Leiden Rankings which are otherwise under-utilized as a rich source of information.
    The conclusion about the Dutch universities is robust, relatively independent of the statistics used.
    Best,
    Loet
    
