The Initiative for Open Citations (I4OC) encourages scholarly publishers to make the references found in their journals and books openly available through Crossref. With a few exceptions (most notably the American Chemical Society, Elsevier, IEEE, and Wolters Kluwer Health), almost all large publishers support the initiative. So far, this support has resulted in approximately half of all references deposited in Crossref being openly available, yielding about half a billion open references.
I4OC has attracted widespread attention. The initiative is of particular importance for the scientometric community. Thanks to I4OC, Crossref has the potential to become an openly available source of citation data covering a large share of all scholarly literature. I4OC has been endorsed by CWTS and the International Society for Scientometrics and Informetrics. Last month, an open letter from the scientometric community was published calling for publishers to open their references. This letter has already been signed by nearly 300 individuals.
At present, scientometricians typically obtain citation data from Web of Science (WoS) and Scopus, two proprietary data sources. In this post, we provide empirical insights into the value of Crossref as a new source of citation data. We compare Crossref with WoS and Scopus, focusing on the citation data that is available in the different data sources. Our analysis shows that more than three-quarters of the references in WoS and more than two-thirds of the Scopus references can be found in Crossref, with about half of these references being openly available. On the other hand, we also show that millions of references are missing in Crossref. These references occur in publications that have been deposited in Crossref without their references.
The statistics presented in this post are based on WoS and Scopus data provided to CWTS in September 2017 and May 2017, respectively. The Crossref data was downloaded through the Crossref API in August 2017. For WoS, we consider the Science Citation Index Expanded, the Social Sciences Citation Index, the Arts & Humanities Citation Index, and the Conference Proceedings Citation Index. Other citation indices included in WoS, in particular the Emerging Sources Citation Index and the Book Citation Index, are not taken into account, as we do not have access to them.
Matching Crossref with Web of Science and Scopus
To compare Crossref with WoS and Scopus, we matched publications using Digital Object Identifiers (DOIs). However, as we will see, this matching approach is not perfect. Every publication in Crossref has a DOI, but only a selection of the publications in WoS and Scopus have such an identifier. Furthermore, not all publications with a DOI in WoS and Scopus have a matching DOI in Crossref.
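To make the matching procedure concrete, a minimal sketch in Python is given below. It illustrates the general idea and is not the code used for the analysis in this post. The record layout (a dictionary with a `doi` key) and the function names `normalize_doi` and `match_by_doi` are hypothetical. The normalization step (lowercasing and stripping resolver prefixes) is also an assumption on our part: DOIs are case-insensitive, so some normalization of this kind is usually needed before comparing them.

```python
# Illustrative sketch only: record layout and function names are hypothetical.

RESOLVER_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Return a lowercase DOI without resolver prefix, or None if empty."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in RESOLVER_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi or None


def match_by_doi(index_records, crossref_dois):
    """Split WoS or Scopus records into three groups:
    records without a DOI, records with a DOI but no Crossref match,
    and records with a Crossref match."""
    crossref = {normalize_doi(d) for d in crossref_dois}
    crossref.discard(None)
    no_doi, unmatched, matched = [], [], []
    for rec in index_records:
        doi = normalize_doi(rec.get("doi"))
        if doi is None:
            no_doi.append(rec)
        elif doi in crossref:
            matched.append(rec)
        else:
            unmatched.append(rec)
    return no_doi, unmatched, matched


# Toy data, not taken from any of the data sources.
records = [
    {"id": 1, "doi": "10.1000/ABC"},
    {"id": 2, "doi": None},
    {"id": 3, "doi": "https://doi.org/10.9999/zzz"},
]
no_doi, unmatched, matched = match_by_doi(records, ["10.1000/abc"])
print(len(no_doi), len(unmatched), len(matched))
```

The three groups correspond to the three situations distinguished above: publications without a DOI, publications with a DOI that has no match in Crossref, and publications with a Crossref match.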
We begin by analyzing the extent to which WoS and Scopus provide DOIs, and in particular DOIs that can be used to match publications with Crossref. We consider publications in WoS and Scopus from the period 2012–2016, because recent publications are more likely to have DOIs in WoS and Scopus. Table 1 provides statistics for both WoS and Scopus. Statistics are presented for all document types (research and review articles as well as proceedings papers, letters, editorials, book reviews, etc.) and separately for research and review articles only. We note that the total number of publications in Crossref in the period 2012–2016 equals 19.1 million, which is substantially more than the 11.9 and 13.9 million publications reported in Table 1 for WoS and Scopus, respectively.
Table 1. Number of publications in WoS and Scopus, with a breakdown based on whether a publication has a DOI and whether it has a Crossref match (in millions; period 2012–2016).
| | WoS, all document types | Scopus, all document types | WoS, research and review articles | Scopus, research and review articles |
|---|---|---|---|---|
| All pub. | 11.9 (100.0%) | 13.9 (100.0%) | 7.6 (100.0%) | 9.9 (100.0%) |
| Pub. with DOI | 8.3 (69.6%) | 11.3 (80.9%) | 6.8 (90.2%) | 8.3 (83.8%) |
| Pub. with Crossref match | 8.2 (68.3%) | 10.7 (76.9%) | 6.7 (88.9%) | 7.9 (79.7%) |
As shown in Table 1, 68.3% of the publications in WoS and 76.9% of the publications in Scopus have a DOI match with Crossref. Focusing on research and review articles, these figures increase to 88.9% for WoS and 79.7% for Scopus. The large increase for WoS demonstrates that a relatively large share of the WoS publications not classified as research or review articles lack a DOI.
Matching based on DOIs involves various difficulties. When a publication does not have a DOI in WoS or Scopus, there are two possibilities. Either the publication truly does not have a DOI or it does have a DOI, but the DOI is missing in WoS or Scopus. Based on a manual examination of a small sample of publications, we estimate that about 75% of the publications without a DOI in WoS or Scopus truly do not have a DOI. The other 25% do have a DOI, but the DOI is missing in WoS or Scopus.
Duplicate DOIs also cause problems in matching. DOIs are assumed to be unique. One would not expect to have multiple publications with the same DOI. However, duplicate DOIs can be found both in WoS and in Scopus. The problem is particularly sizeable in Scopus. In the period 2012–2016, there are 161,446 duplicate DOIs in Scopus, some of them assigned to more than 100 publications. There are 8,087 duplicate DOIs in WoS in the same period.
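Duplicate DOIs of this kind are straightforward to detect. The sketch below is an illustration only, not the procedure used to obtain the counts above. It assumes that a "duplicate DOI" is a DOI that occurs on more than one publication record, and that DOIs are compared after trimming whitespace and lowercasing; the function name `duplicate_dois` is hypothetical.

```python
from collections import Counter


def duplicate_dois(dois):
    """Map each DOI that occurs on more than one record to its number of
    occurrences. Empty or missing DOIs are ignored (assumption)."""
    counts = Counter(d.strip().lower() for d in dois if d and d.strip())
    return {doi: n for doi, n in counts.items() if n > 1}


# Toy data: one DOI per publication record, None if the record has no DOI.
dois = ["10.1/a", "10.1/A", "10.1/b", None, "10.1/c", "10.1/c", "10.1/c"]
dups = duplicate_dois(dois)
print(len(dups), "duplicate DOIs; largest group:", max(dups.values()))
```

The number of keys in the result corresponds to the number of duplicate DOIs, and the values show how many publications share each of them.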
In addition to Crossref, there are also other organizations that register DOIs. This causes another complication in matching. As shown in Table 1, of all publications in Scopus, 4.0% cannot be matched with Crossref even though they do have a DOI (80.9% have a DOI, but only 76.9% have a Crossref match). We performed a manual examination of a small sample of these publications. In about half of the cases, a DOI was registered not with Crossref but with another organization, such as the China National Knowledge Infrastructure (CNKI). In other cases, DOIs in Scopus are incorrect (i.e., different from DOIs reported on publishers’ websites), DOIs were never registered, or registration was not yet completed when the Crossref data was downloaded. In WoS, the number of publications with a non-matching DOI is more limited: as can be seen in Table 1, 69.6% of all WoS publications have a DOI and 68.3% have a Crossref match, a difference of only 1.3 percentage points.
Comparing citation data in Crossref with Web of Science and Scopus
How many of the references in WoS and Scopus are also available in Crossref? As discussed above, matching Crossref with WoS and Scopus involves various challenges, and we therefore cannot give a precise answer to this question. However, by matching publications based on DOIs, an approximate lower bound can be provided for the number of references in WoS and Scopus that can also be found in Crossref.
Figure 1 shows how many of the references in WoS and Scopus publications from the period 2012–2016 have a matching reference in Crossref. In addition, the figure also shows how many of the matching references are openly available in Crossref. A reference in WoS or Scopus is considered to have a matching reference in Crossref if the citing publication has a DOI match with a publication in Crossref that also has references. All citing publications in WoS and Scopus are taken into account, irrespective of their document type.
Figure 1. Number of references in WoS and Scopus, with a breakdown based on whether a reference has a Crossref match and whether it is open or closed in Crossref (in millions; period 2012–2016).
As shown in Figure 1, 77.1% of the references in WoS have a matching reference in Crossref, but only 39.7% of the references in WoS have a matching reference that is openly available. For Scopus these statistics are somewhat lower, 69.1% and 34.8%, respectively. It needs to be emphasized that these results are likely to underestimate the true overlap in terms of references between WoS and Scopus on the one hand and Crossref on the other hand: because of missing and incorrect DOIs in WoS and Scopus, our matching of Crossref with WoS and Scopus is incomplete. Both for WoS and for Scopus, Figure 1 shows that slightly more than half of all references with a Crossref match are openly available (39.7% out of 77.1% for WoS and 34.8% out of 69.1% for Scopus). This is in line with overall statistics reported for Crossref, where about 50% of all references are found to be open.
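The matching rule behind Figure 1 can be sketched as follows. This is a simplified illustration, not the code used for the analysis. It assumes that each WoS or Scopus publication is given as a `(doi, n_refs)` pair and that each Crossref publication is summarized by its number of deposited references and a single open/closed flag. Both data layouts and the function name `reference_shares` are hypothetical.

```python
def reference_shares(index_pubs, crossref_pubs):
    """Return (share matched, share matched and open) of the references
    in WoS or Scopus.

    index_pubs: iterable of (doi, n_refs) pairs from WoS or Scopus.
    crossref_pubs: dict mapping doi -> (n_refs_in_crossref, refs_are_open).
    """
    total = matched = matched_open = 0
    for doi, n_refs in index_pubs:
        total += n_refs
        cr = crossref_pubs.get(doi)
        if cr is None or cr[0] == 0:
            # No DOI match, or the matched Crossref publication
            # has no references: the references count as unmatched.
            continue
        matched += n_refs
        if cr[1]:
            matched_open += n_refs
    if total == 0:
        return 0.0, 0.0
    return matched / total, matched_open / total


# Toy data, not taken from any of the data sources.
index_pubs = [("10.1/a", 30), ("10.1/b", 20), (None, 10), ("10.1/c", 40)]
crossref_pubs = {
    "10.1/a": (28, True),   # matched, references open
    "10.1/b": (20, False),  # matched, references closed
    "10.1/c": (0, False),   # deposited without references
}
share_matched, share_open = reference_shares(index_pubs, crossref_pubs)
print(f"matched: {share_matched:.1%}, matched and open: {share_open:.1%}")
```

Note that the references are counted on the WoS or Scopus side: all references of a citing publication count as matched as soon as the publication has a DOI match with a Crossref publication that has references, even if the two reference lists differ in length.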
We note that the total number of references in Crossref in the period 2012–2016 is 339.2 million, counting both open and closed references. Incidentally, this is very close to the 337.5 million references in WoS reported in Figure 1. With 437.0 million references, Scopus provides the largest number of references. In fact, since there is an overlap of about 300 million references between Crossref and Scopus, almost 90% of the references in Crossref are also available in Scopus. Hence, in terms of references, the content of Crossref that is unique relative to Scopus is fairly small. Relative to WoS, the unique content of Crossref is more substantial. Of the 339.2 million references in Crossref, about 75% can also be found in WoS.
The statistics presented in Figure 1 can be further broken down by field. Figure 2 shows such a breakdown for WoS. Five main fields are distinguished, following the definitions used in the CWTS Leiden Ranking. Substantial differences between fields can be observed. For instance, in the physical sciences and engineering, more than 90% of the WoS references have a match with Crossref, while this is the case for less than 75% of the WoS references in the social sciences and humanities. Nevertheless, almost half of the WoS references in the social sciences and humanities are openly available in Crossref. In mathematics and computer science, just around 35% of the WoS references are open in Crossref.
Figure 2. Breakdown for the percentage of references in WoS that (i) have a Crossref match and are openly available, (ii) have a Crossref match but are not openly available, and (iii) do not have a Crossref match (period 2012–2016).
For the period 2012–2016, for each source (i.e., journal, book series, or conference proceedings) in WoS and Scopus, the number of publications and references that can be matched with Crossref is reported in this Excel file (see the worksheets Crossref vs. WoS and Crossref vs. Scopus; sources with fewer than 100 publications are omitted). The statistics reveal an interesting result. Focusing on the sources with the largest numbers of references that cannot be matched with Crossref, we observe that for most of these sources a large share of their publications do have a match with Crossref. Hence, publications in these sources have been deposited in Crossref, but the corresponding references have not. We investigate this issue in more detail below.
Missing references in Crossref
Discussions about the Initiative for Open Citations (I4OC) have focused mostly on references that have been submitted to Crossref. So far, little attention has been paid to references that are missing in Crossref. These are references that have not been deposited in Crossref, even though the publications in which they occur have been deposited. It is clear that missing references may significantly reduce the value of Crossref as a source of citation data. To a certain extent, the issue of missing references may also explain why some references in WoS and Scopus do not have a match with Crossref, as discussed in the previous section.
For each publication in Crossref in the period 2012–2016, we tried to find a publication in WoS or Scopus with the same DOI. We then identified publications that do not have references in Crossref while they do have references in WoS or Scopus. By counting the number of references in these publications in WoS or Scopus, a lower bound is obtained for the number of references that are missing in Crossref. The results are presented in Table 2. (We note that, because of duplicate DOIs in WoS or Scopus, a publication in Crossref may sometimes have multiple matching publications in WoS or Scopus. We then used the publication with the largest number of references.)
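The procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the analysis. The data layouts (a dictionary of Crossref reference counts keyed by DOI, and `(doi, n_refs)` pairs for WoS or Scopus) and the function name `missing_reference_lower_bound` are hypothetical.

```python
def missing_reference_lower_bound(crossref_pubs, index_pubs):
    """Lower bound for the number of references missing in Crossref.

    crossref_pubs: dict mapping doi -> number of references in Crossref.
    index_pubs: iterable of (doi, n_refs) pairs from WoS or Scopus.
    """
    # With duplicate DOIs, keep the matching publication that has
    # the largest number of references.
    best = {}
    for doi, n_refs in index_pubs:
        if doi is not None:
            best[doi] = max(best.get(doi, 0), n_refs)
    # Count references only for publications that are in Crossref
    # but have no references there.
    return sum(n for doi, n in best.items() if crossref_pubs.get(doi) == 0)


# Toy data, not taken from any of the data sources.
crossref_pubs = {"10.1/a": 0, "10.1/b": 25, "10.1/c": 0}
index_pubs = [
    ("10.1/a", 12),  # duplicate DOI, fewer references
    ("10.1/a", 30),  # duplicate DOI, used for the count
    ("10.1/b", 40),  # Crossref has references: not counted as missing
    ("10.1/c", 0),   # no references on either side
    ("10.1/d", 50),  # not in Crossref at all
    (None, 5),       # no DOI, cannot be matched
]
print(missing_reference_lower_bound(crossref_pubs, index_pubs))
```

The result is a lower bound because publications without a usable DOI are left out, and because publications that do have some references in Crossref are never counted, even if their reference lists there are incomplete.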
Table 2. Number of references in Crossref, and number of missing references, based on comparisons with WoS and Scopus (in millions; period 2012–2016).
| | References (millions) |
|---|---|
| References in Crossref, both open and closed | 339.2 |
| Missing references in Crossref, based on a comparison with WoS | 34.2 |
| Missing references in Crossref, based on a comparison with Scopus | 64.5 |
Table 2 makes clear that the number of missing references in Crossref is substantial. The comparison with Scopus shows that at least 64.5 million references are missing in Crossref. If publishers take the initiative to deposit these references in Crossref, the number of references in Crossref will increase by 64.5M / 339.2M = 19.0%. Combining this with the statistics presented in Figure 1, it follows that the share of references in Scopus with a Crossref match will increase from 69.1% to (302.1M + 64.5M) / 437.0M = 83.9%. For WoS, there will be an increase from 77.1% to (260.2M + 34.2M) / 337.5M = 87.2%.
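The arithmetic above can be reproduced with a few lines of Python. All input values (in millions) are taken from the text; the variable and function names are ours and serve only as an illustration.

```python
# All figures in millions of references, as reported in this post.
crossref_refs = 339.2

# source: (total references, references with Crossref match, missing references)
sources = {
    "WoS": (337.5, 260.2, 34.2),
    "Scopus": (437.0, 302.1, 64.5),
}


def current_share(total, matched, missing):
    """Share of references that currently have a Crossref match."""
    return matched / total


def projected_share(total, matched, missing):
    """Share with a Crossref match if all missing references were deposited."""
    return (matched + missing) / total


# Relative growth of Crossref if the references found missing in the
# Scopus-based comparison were deposited.
growth = sources["Scopus"][2] / crossref_refs
print(f"Growth of Crossref: {growth:.1%}")

for name, figures in sources.items():
    now = current_share(*figures)
    later = projected_share(*figures)
    print(f"{name}: {now:.1%} -> {later:.1%}")
```

Running this gives 19.0% for the growth of Crossref, an increase from 77.1% to 87.2% for WoS, and an increase from 69.1% to 83.9% for Scopus, in agreement with the figures reported above.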
A list of publishers with a large number of missing references in Crossref can be found in this Excel file (see the worksheet Crossref missing ref.; the worksheet includes all publishers with at least 1000 references in total, counting both missing and non-missing references). Figure 3 shows the top 15 publishers with the largest number of missing references. The numbers reported in this figure are based on a comparison with Scopus, which yields more comprehensive statistics on missing references than a WoS-based comparison. Publishers that support I4OC are presented in yellow in the figure, while those that do not support the initiative are presented in red. Interestingly, various publishers have a large number of missing references, even though they support I4OC. These publishers make the references they deposit in Crossref openly available, but a sizeable share of their references have not been deposited in Crossref at all.
As can be seen in Figure 3, the most significant example of such a publisher is Springer Nature, with more than 10 million references that are missing in Crossref. A more detailed examination of the missing references of Springer Nature revealed that these are mostly references in books and book chapters. For journal articles published by Springer Nature, the number of missing references is much more limited. We are in contact with Springer Nature about their missing references in Crossref. They informed us that they are investigating why so many of their references are missing, and they assured us that it is their intention to make these references available.
Figure 3. Top 15 publishers with the largest number of missing references in Crossref, based on a comparison with Scopus (in millions; period 2012–2016).
Conclusions
A large share of the scholarly literature indexed in WoS and Scopus is also available in Crossref. For recent years, 68% of the WoS publications and 77% of the Scopus publications can be matched with Crossref using DOIs as a crosswalking mechanism. These figures are likely to underestimate the true overlap between the data sources, since matching based on DOIs presents several difficulties, such as missing, incorrect, and duplicate DOIs. To improve matching, publishers and data providers need to work together to offer more comprehensive and more accurate DOI data.
The coverage of references is a critical concern for scientometricians. For recent years, more than three-quarters of the references in WoS and more than two-thirds of the Scopus references can be found in Crossref. Slightly more than half of the matched references are openly available. Our analysis also demonstrates that millions of references are missing in Crossref. These missing references occur in publications that have been deposited in Crossref without their references. We estimate that the deposit of missing references in Crossref would increase the share of matched references to 87% for WoS and 84% for Scopus. In order to create a comprehensive source of citation data, publishers must not only open their deposited references, but also attend to missing references.
Several next steps are needed to take full advantage of the infrastructure offered by Crossref. Many references are either closed or missing in Crossref. We therefore call for publishers to deposit their references in Crossref and to make them openly available. Moreover, the quality of reference data in Crossref can be improved. For instance, there is no standardized format for author names, and DOIs appear to be missing for a significant share of the references. Also, quite a lot of references are incomplete, with missing data for some or even all elements of a reference (e.g., author name, journal title, or publication year). Publishers should work together with Crossref to improve the quality of reference data. Finally, we urge scientometricians to perform more in-depth studies of the data available in Crossref, to investigate possible systematic differences between references that are open and closed (e.g., in terms of geography, language, and research area), and to assess the suitability of Crossref data for different types of scientometric analyses.
Thanks to David Shotton for his comments on an earlier version of this post.