
When Is a Year Complete?

Scholarly metadata takes time to get into databases. How long do I need to wait before analyzing a calendar year?

Published on Oct 05, 2023

Part of my job as the Collection Analysis Librarian at Iowa State University is to analyze where our researchers publish each year and track changes over time. Are author habits shifting? What journals do they choose, and what publishers own those journals? How does researcher behavior compare to what the library spends on collections, subscriptions, and APCs?

I've been curious about how long I need to wait before the records from a completed calendar year are available in bibliographic databases. I know there is some delay in databases ingesting the metadata, but how long? When do December 2022 papers reliably show up in databases in 2023? I have typically waited until the following September before looking at the most recent year, but is that enough?

Database Indexing Criteria

It is important to note that different databases take fundamentally different approaches to indexing. Web of Science is selective, only including material from journals that meet specified criteria.

Dimensions takes a more inclusive indexing approach, with no inclusion criteria beyond ingesting what’s available from Crossref and other metadata sources.

OpenAlex takes this comprehensive approach as well but goes a step further, indexing so widely that it includes things that may not always be considered “scholarly.” OpenAlex is also a very new database that is still under development, so its counts are more volatile (as we will see).

Automated API Pulls

To collect some actual data on record availability, I set up an automated API call in February 2022. Every day for the past 18 months, the program has launched at 9:00am (assuming my computer is turned on and plugged in) and collected the number of 2022 ISU publications in Web of Science, Dimensions, and OpenAlex. The program records the data, saves it, and updates the plot.
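As a rough illustration of what such a daily pull could look like, here is a minimal Python sketch. It is not the code from my repo: the function names, the CSV layout, and the `ror_id` parameter are assumptions made for this example, and only the OpenAlex request is spelled out (Web of Science and Dimensions need credentials, so they are left as pluggable functions).

```python
import csv
import json
import urllib.parse
import urllib.request
from datetime import date
from pathlib import Path
from typing import Callable, Dict


def openalex_count_url(ror_id: str, year: int) -> str:
    """Build an OpenAlex works query; the response reports a total in meta.count."""
    # ror_id is a placeholder argument: pass the institution's own ROR identifier.
    filters = f"institutions.ror:{ror_id},publication_year:{year}"
    query = urllib.parse.urlencode({"filter": filters, "per-page": 1})
    return "https://api.openalex.org/works?" + query


def fetch_openalex_count(ror_id: str, year: int) -> int:
    """Return the number of matching works (needs network access)."""
    url = openalex_count_url(ror_id, year)
    with urllib.request.urlopen(url, timeout=30) as response:
        return int(json.load(response)["meta"]["count"])


def log_daily_counts(
    path: Path, fetchers: Dict[str, Callable[[], int]], today: date
) -> dict:
    """Append one row of counts (one column per database) to a CSV file."""
    row = {"date": today.isoformat()}
    for name, fetch in fetchers.items():
        row[name] = fetch()
    is_new_file = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        if is_new_file:
            writer.writeheader()  # write the header only once
        writer.writerow(row)
    return row
```

A scheduler (cron, Windows Task Scheduler, or similar) would then call `log_daily_counts` once a day with one fetcher per database, and a separate step would redraw the plot from the CSV.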

Now that we have collected some results, there are lots of interesting things to see (Figure 1).

Graph on number of records found over time in three databases, running from March 2022 to October 2023.

Figure 1. Progression of Iowa State University authored publications in three databases.

How Quickly Do We Reach Today’s Number?

If we were to pull the 2022 data as soon as the calendar flips to 2023 (January 4 is the closest data point collected), we would only see the following percentages of coverage (using the Oct. 5, 2023 data counts as the denominator):

  • Dimensions: 3584 / 3827, or 93.7%

  • Web of Science: 3157 / 3699, or 85.3%

  • OpenAlex: 2745 / 3920, or 70.0%

Dimensions gets to the “true value” (for lack of a better term) the most quickly, already achieving over 93% at the turn of the calendar year. I was surprised to see coverage that high that early. WoS takes until early March to match the same 93%, and OpenAlex needs until June.
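The percentages above are simple ratios of the January 4 count to the October 5 count. A short Python sketch of that arithmetic, using only the counts quoted in the list (the `coverage` helper name is just for this example):

```python
# (count on Jan. 4, 2023, count on Oct. 5, 2023), as quoted in the list above.
counts = {
    "Dimensions": (3584, 3827),
    "Web of Science": (3157, 3699),
    "OpenAlex": (2745, 3920),
}


def coverage(early: int, final: int) -> float:
    """Fraction of the final (Oct. 5) count that was already present early on."""
    return early / final


for name, (early, final) in counts.items():
    print(f"{name}: {early} / {final} = {coverage(early, final):.1%}")
```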

Comparisons

The next thing I notice is the gap between the Dimensions (blue) and Web of Science (red) lines. This represents the articles that ISU authors published in journals that are *not* indexed by Web of Science. The gap widens gradually over time, peaking in early January 2023 at 427 records (about 12% of the Dimensions count at that point, or roughly 13.5% above the WoS count), before coming back down.
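The size of that gap as a percentage depends on which count is used as the base. A small Python sketch, assuming the peak corresponds to the January 4 counts quoted earlier (3584 for Dimensions, 3157 for Web of Science); the `gap_summary` name is made up for this example:

```python
def gap_summary(dimensions: int, wos: int) -> dict:
    """Describe the Dimensions-minus-WoS gap against both possible bases."""
    gap = dimensions - wos
    return {
        "gap": gap,
        "share_of_dimensions": gap / dimensions,  # gap as a share of Dimensions
        "excess_over_wos": gap / wos,  # how far Dimensions sits above WoS
    }


# Jan. 4, 2023 counts from the coverage list earlier in the article.
summary = gap_summary(dimensions=3584, wos=3157)
print(summary["gap"])
print(f"{summary['share_of_dimensions']:.1%} of the Dimensions count")
print(f"{summary['excess_over_wos']:.1%} above the WoS count")
```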

OpenAlex’s indexing philosophy is similar to that of Dimensions, so the two lines should end up in about the same place. They eventually do, but the OpenAlex (green) line is jumpier. In early 2022, it operated on a 14-day cycle, updating records only once every two weeks. Even so, there appears to have been a long pause in adding new records around June 2022, as evidenced by the flat line.

OpenAlex’s total publication counts were much lower than the others over 2022, but it races to catch back up starting in May 2023. Some sort of re-classification happened around July 2023, as the count spikes up and then back down before settling in around the Dimensions number.

My rule of thumb has always been to wait until September to pull WoS data for analysis, but this data indicates that WoS reached 98% of its Oct. 5 record count on June 9. Dimensions indexes even more quickly, reaching 98% on May 11. I could probably pull the yearly data sooner, though the count of records will continue to increase slightly even into 2024.
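Finding the date on which a database first reaches a given share of its latest count is a simple scan over the daily series. A minimal Python sketch follows; the function name is an assumption for this example, and the toy numbers are made-up placeholders, not values from my data.

```python
from datetime import date
from typing import Iterable, Optional, Tuple


def first_date_reaching(
    series: Iterable[Tuple[date, int]],
    final_count: int,
    threshold: float = 0.98,
) -> Optional[date]:
    """Return the earliest date whose count is at least threshold * final_count."""
    target = threshold * final_count
    for day, count in sorted(series):  # scan in chronological order
        if count >= target:
            return day
    return None  # the threshold was never reached


# Toy series with invented counts, only to show the call.
toy_series = [
    (date(2023, 1, 1), 900),
    (date(2023, 2, 1), 975),
    (date(2023, 3, 1), 980),
    (date(2023, 4, 1), 1000),
]
print(first_date_reaching(toy_series, final_count=1000))
```

Because counts can dip as well as rise (as the OpenAlex line shows), this returns the first crossing only; a stricter version might require the count to stay above the threshold afterwards.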

Next Steps

I have already started collecting the same data for 2023, and that graph will continue to update over time.

You can watch that process and find more information, including the code and an interactive HTML version of the graphs, at the GitHub repo.
