The Unpaywall designation should be corrected for a number of "Open Archive" records
Recently, an excellent publication by researchers at the ScholCommLab in Canada estimated the amount of article processing charges (APCs) brought in by the “Big 5” publishers from 2015–2018 [3]. This analysis depended heavily on identifying publications from those years that could have been made openly available as a result of paying an APC, namely articles of the Gold or Hybrid OA types.
The methodology section in the paper detailed the search query and filters that were used to collect data from the Web of Science database (document type = articles or review articles). Web of Science has strict indexing requirements and inclusion criteria, and as a result, not all journals will appear in the database search results. I wondered what might have been missing because of this.
I wanted to look at the underlying data and run some numbers myself, and because the lab adheres to good open science practices, I was able to easily find and download the data from the analysis.
I wanted to compare this dataset to a similar search run in Dimensions, a database that indexes more broadly but has slightly messier data quality.
(Note: all data collection and analysis described here were run between December 11–20, 2023.)
To keep things simple at the start, I focused only on Elsevier journal titles and their counts of Hybrid or Gold publications in 2015. The resulting comparison is shown in Figure 1 below. Each dot is a journal title, with that title's 2015 count from Web of Science on the x-axis and its count from Dimensions on the y-axis.
Perfect agreement between the two databases would mean all data points lie on the dotted 45-degree line.
Instead, we see a mismatch, especially at the left side of the graph. Some titles that returned low numbers in WoS have a corresponding Dimensions number that is much higher.
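For readers who want to reproduce that kind of figure, here is a minimal sketch of the plotting step, assuming the per-journal counts have already been merged into a CSV; the file name and the columns journal, wos_2015, and dims_2015 are my own placeholder names, not from the original analysis.

```python
# Sketch of a Figure 1-style comparison: one dot per Elsevier journal,
# WoS count on the x-axis, Dimensions count on the y-axis.
# The CSV file name and column names are assumptions.
import matplotlib.pyplot as plt
import pandas as pd

counts = pd.read_csv("elsevier_2015_counts.csv")

fig, ax = plt.subplots()
ax.scatter(counts["wos_2015"], counts["dims_2015"], alpha=0.5)

# Dotted 45-degree line marking perfect agreement between the two databases.
limit = max(counts["wos_2015"].max(), counts["dims_2015"].max())
ax.plot([0, limit], [0, limit], linestyle=":", color="grey")

ax.set_xlabel("Web of Science count (2015, Gold/Hybrid)")
ax.set_ylabel("Dimensions count (2015, Gold/Hybrid)")
plt.show()
```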
I chose to zoom in on Neuron (WoS = 30, Dimensions = 386), a Hybrid journal, to dig deeper and see what was going on. Why was Dimensions returning more than ten times as many articles for the same search query?
My first approach was to visit the journal page of Neuron and manually click through all 65 articles from one volume in 2015 (v.85). I consider this approach to give the ground-truth breakdown of the articles’ OA types.
Each article’s landing page displays one of two classifications:
“Open Archive”—articles originally published behind a paywall, but now made openly available after an expired embargo through Elsevier’s Open Archive program (12 months for Neuron)
“Open Access”—articles published openly immediately at the time of publication through the payment of an APC ($5000 in 2015 for Neuron)
I found 7/65 (10.7%) articles in v.85 to be Open Access, with the remaining 58 as Open Archive. This matches the result from WoS data more closely than the result from Dimensions. I had looked at about 1/4 of the articles from Neuron in 2015 (which consisted of 4 volumes), and found about 1/4 of the Hybrid articles reported in the WoS dataset (7 out of 30).
Manually clicking through and investigating each article’s landing page gives me the highest quality data, but it is far too labor- and time-intensive. If I wanted to scale this up to look at multiple years, or even at a complete single year (2015 had 384 articles across four volumes, v.85–88), this approach would not be feasible.
Therefore, my next idea was to automate the manual clicking and data collection using the Python library Selenium. I scripted a web browser to start at dx.doi.org, input each DOI from WoS, resolve to the article page, and look for the HTML element that displays the OA type, as seen in Figure 2.
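A minimal sketch of that scraping loop is below. The example DOI, the CSS selector, and the wait time are my own assumptions; the real OA-label element depends on the publisher's page markup.

```python
# Sketch of the Selenium approach: resolve each DOI via dx.doi.org and
# read the OA label from the article landing page.
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

dois = ["10.1016/example-doi"]  # placeholder; real DOIs come from the WoS export

driver = webdriver.Firefox()
scraped_labels = {}
for doi in dois:
    driver.get(f"https://dx.doi.org/{doi}")  # dx.doi.org redirects to the article page
    time.sleep(5)  # crude wait for the landing page and its scripts to load
    try:
        element = driver.find_element(By.CSS_SELECTOR, ".access-label")  # hypothetical selector
        scraped_labels[doi] = element.text  # e.g. "Open Access" or "Open Archive"
    except NoSuchElementException:
        scraped_labels[doi] = "unknown"
driver.quit()
```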
This process worked but was slow, taking about 7 seconds per DOI to run. It also ran into a problem—the webdriver got blocked after requesting more than one DOI in a session.
I struggled to complete even the single volume v.85, but eventually did: 9 of 96 (9.4%) articles were classified as Open Access, with the remainder Open Archive. The higher article count (96 rather than 65) came from including a wider set of WoS article types (articles, reviews, and reports) to match Butler’s dataset.
In talking with other ScholCommLab members, it seemed likely that the core issue was that Unpaywall’s algorithm for classifying open access modes had changed between the construction of the open dataset and my new run. Unpaywall’s published help page defines Hybrid OA as “free to read at the time of publication, with an open license. These are usually published in exchange for an article processing charge, or APC.”
However, we can see that many of these Neuron articles were classified as Hybrid, but are in fact delayed OA and were not free to read immediately at the time of publication.
Unpaywall’s API provides details about any open copies of a DOI it can find. Specifically, the published_date field is the timestamp of when the article was published, and each open copy has an oa_date field showing when Unpaywall first found that open copy (alarmingly, the oa_date field may be going away in 2024).
I wrote a Python script to query the Unpaywall API for information about each of the 96 DOIs in Neuron v.85. Taking the delta between the oa_date and published_date fields should tell us whether an open copy was available immediately upon (or shortly after) publication, denoting Open Access, or whether the open copy was only found after Neuron’s 12-month embargo period, denoting Open Archive.
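Here is a minimal sketch of that check, assuming a registered contact email, a placeholder DOI list, and a 365-day embargo threshold (all of which are my own assumptions rather than the original script).

```python
# Sketch of the Unpaywall API check: compare each record's published_date
# with the oa_date of its open copies.
from datetime import date

import requests

UNPAYWALL_EMAIL = "you@example.org"   # Unpaywall requires an email query parameter
EMBARGO_DAYS = 365                    # Neuron's Open Archive embargo is 12 months
dois = ["10.1016/example-doi"]        # placeholder; real DOIs come from the WoS export

api_deltas = {}
for doi in dois:
    record = requests.get(
        f"https://api.unpaywall.org/v2/{doi}",
        params={"email": UNPAYWALL_EMAIL},
        timeout=30,
    ).json()

    published = record.get("published_date")
    oa_dates = [
        loc["oa_date"] for loc in record.get("oa_locations", []) if loc.get("oa_date")
    ]
    if not published or not oa_dates:
        continue  # skip records with no usable dates

    # Days between publication and the earliest open copy Unpaywall found.
    delta = (date.fromisoformat(min(oa_dates)) - date.fromisoformat(published)).days
    api_deltas[doi] = delta
    label = "Open Access" if delta < EMBARGO_DAYS else "Open Archive"
    print(doi, delta, label)
```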
I was pleased to see total agreement between this timestamp delta approach and the webscraping approach shown above: 9 Open Access, 87 Open Archive.
I next expanded this approach to test all articles in 2015. To my surprise, the two approaches did not match. Fifteen articles designated as Open Archive actually return a published_date equal to the oa_date field! It is unclear how this is possible, since it is inconsistent with the definition, but it means that relying on Unpaywall data through the API is not a reliable way to detect these types of articles.
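Reusing the hypothetical scraped_labels and api_deltas dictionaries from the sketches above, surfacing those contradictory records is a short check:

```python
# Flag records labelled "Open Archive" on the landing page whose Unpaywall
# oa_date equals the published_date (a delta of zero days).
contradictions = [
    doi
    for doi, label in scraped_labels.items()
    if label == "Open Archive" and api_deltas.get(doi) == 0
]
print(len(contradictions), "Open Archive records with oa_date == published_date")
```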
So what does all this mean?
Based on this analysis and work over the past two weeks, I believe that the Unpaywall designation is incorrect for most of these records. Every record tested here was classified in Unpaywall as Hybrid, but as we saw, the vast majority are in fact Open Archive, only open after the expiration of a 12-month embargo period. This does not match Unpaywall’s own definition of what Hybrid means. Instead, Bronze would be a more accurate categorization.
Elsevier lists 140 journals in its Open Archive program, covering a wide range of years, so the number of affected articles is not insignificant. I have sent a summary of this investigation to Unpaywall support. I suspect that fixing this imprecise classification would resolve many of the inconsistencies shown in Figure 1, as Dimensions would no longer return those publications when searching for Gold or Hybrid articles.
This matters because Unpaywall is the commonly accepted standard for open access classifications in bibliometric analyses. When researching the activity around past APC payments and rates, the first step is to narrow in on publications that may have been made openly available as a result of paying an APC.
It is frustrating that I can’t easily run additional analyses and comparisons against the open dataset at scale. My initial question, what WoS might be missing that Dimensions can return, could not be answered, since I couldn’t even match results for known journals between the two databases.
This also points to the fact that Unpaywall’s data is always changing and served live, and there is no real “archive” to speak of that can easily tell us about past OA statuses. “Studies relying on Unpaywall data should be aware that the reproducibility and comparability of their results are time dependent.”[1] Jahn [2] looks at the same problem from a wider angle and appears to be keeping track of past Unpaywall snapshots. In fact, Butler [3] ran into this issue, with Unpaywall designations changing halfway through the analysis and writing of their paper!
Overall, Unpaywall is a living service that continues to evolve, and the possibility of errors like these should be kept in mind in any analysis that relies on it.
Disclosure: I am affiliated with the ScholCommLab as a Research Associate.