Signmap: Providing an Inventory of the Scholarly Objects Managed by a Repository

Prepared by: Herbert Van de Sompel, Patrick Hochstenbach, Michael L. Nelson, Martin Klein, Enno Meijers, ...
This version, created 20240624: https://signposting.org/Signmap/
Please provide feedback in the GitHub Signposting repository, using label Signmap

Historically, repositories have provided an inventory of the scholarly objects they manage by making descriptive metadata available via an OAI-PMH machine interface. However, differing approaches to providing metadata can make it hard for harvesting applications to unambiguously determine where an object's content files reside, what its persistent identifier is, etc. Recent innovations in metadata formats, exemplified by RIOXX version 3, have significantly improved on the status quo.

Over time, repositories have also started to publish an inventory using the Sitemaps Protocol, which has been the dominant approach to help web crawlers find a server's resources since 2009. In a typical repository implementation, the Sitemap has an entry for each scholarly object managed by the repository, and that entry provides the object's landing page URL. Given a landing page URL and the HTML that is available there, a crawler can attempt to discover the URLs of other resources associated with the scholarly object, e.g. metadata resources, content resources, and the persistent identifier. For a long time, this has been a laborious task that depended on heuristics. Support for FAIR Signposting removes uncertainty by providing distinct typed links on the landing page that make discovering the constituent resources of a scholarly object unambiguous. This significantly simplifies the task for any bot that interacts with landing pages, including crawlers intent on collecting all resources associated with each repository object. But if a crawler is only interested in, for example, PDF content resources or BibTeX metadata resources, it must still visit the landing page URL of each object and check the appropriate typed links to determine whether any of the linked resources fall within its scope.

Signmaps, specified in this document, leverage the convenience of the long-established Sitemaps Protocol and extend it with the ability to associate Signposting links with each landing page URL listed in a Sitemap. As such, Signmaps allow crawlers to discover URLs of resources that meet their scope without having to visit each landing page URL.
[Image: Signposting the Scholarly Web: Signmap]

Image courtesy of Patrick Hochstenbach.

Table of Contents

  1. Introduction
  2. Building Blocks
    1. The Sitemaps Protocol
    2. The Robots Exclusion Protocol
    3. Links in Sitemaps
    4. Link Relation Types for Links in Sitemaps

1. Introduction

This specification details the Signmap approach that repositories can use to publish an inventory of the scholarly objects they manage. The approach consists of:

A Signmap showing an entry for a single scholarly object managed by a repository. It shows the URL of the object's landing page in the <loc> element and several Signposting links associated with the landing page in consecutive <rs:ln> elements. As per Signposting conventions, the describedby link points at a metadata resource, the item link at a content resource, and the cite-as link at the object's persistent identifier. Note that the first two links also provide the media type of the linked resources in their type attribute, i.e. JSON-LD (application/ld+json) and PDF (application/pdf), respectively. The first link additionally expresses the profile of the media type in its profile attribute.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
      <loc>https://example.com/res1</loc>
      <rs:ln rel="describedby"
             href="https://example.com/metadata/res1.json"
             type="application/ld+json"
             profile="https://w3id.org/ro/crate"/>
      <rs:ln rel="item"
             href="https://example.com/content/res1.pdf"
             type="application/pdf"/>
      <rs:ln rel="cite-as"
             href="https://doi.org/123.457643"/>
  </url>
</urlset>
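
As an illustration of how a crawler might consume such a Signmap, the following is a minimal sketch in Python that uses only the standard library. The function names links_by_landing_page and select_hrefs are hypothetical and not part of this specification. The sketch assumes each rel attribute carries a single relation type and that the type attribute is matched exactly.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the XPath-like queries below.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# A copy of the example Signmap shown above.
SIGNMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>https://example.com/res1</loc>
    <rs:ln rel="describedby"
           href="https://example.com/metadata/res1.json"
           type="application/ld+json"
           profile="https://w3id.org/ro/crate"/>
    <rs:ln rel="item"
           href="https://example.com/content/res1.pdf"
           type="application/pdf"/>
    <rs:ln rel="cite-as"
           href="https://doi.org/123.457643"/>
  </url>
</urlset>
"""


def links_by_landing_page(signmap_xml):
    """Map each landing page URL to a list of its rs:ln attribute dicts."""
    root = ET.fromstring(signmap_xml.encode("utf-8"))
    result = {}
    for url in root.iterfind("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        result[loc] = [dict(ln.attrib) for ln in url.iterfind("rs:ln", NS)]
    return result


def select_hrefs(signmap_xml, rel, media_type=None):
    """Return link targets with the given rel and, optionally, media type."""
    hrefs = []
    for links in links_by_landing_page(signmap_xml).values():
        for link in links:
            if link.get("rel") != rel:
                continue
            if media_type is not None and link.get("type") != media_type:
                continue
            hrefs.append(link.get("href"))
    return hrefs
```

With this sketch, a crawler interested only in PDF content resources could call select_hrefs(SIGNMAP_XML, "item", "application/pdf") and obtain the PDF URL without fetching the landing page.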
The Signmap approach leverages the following specifications:

2. Building Blocks

This section provides a description of the building blocks of the Signmap approach that can be used to publish an inventory of the scholarly objects managed by a repository.

2.1. The Sitemaps Protocol

The Signmap approach rigorously follows all aspects of the Sitemaps Protocol with the following implementation guidelines for repositories:

2.2. The Robots Exclusion Protocol

The Sitemaps Protocol uses the Robots Exclusion Protocol (also known as robots.txt) to make a Sitemap (or a Sitemap Index, if applicable) discoverable. The Signmap approach uses the Robots Exclusion Protocol in the same way, with the following implementation guidelines for repositories:
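
As an illustration of the discovery step only, the following is a minimal sketch of how a crawler might read Sitemap locations from a robots.txt file with Python's standard urllib.robotparser module. The robots.txt content, the URL https://example.com/signmap.xml, and the function name discover_sitemaps are assumptions made for the example. The site_maps() method requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it over HTTP.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/signmap.xml
"""


def discover_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in robots.txt text (possibly empty)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []
```

The sketch parses text that is already in memory, so it stays independent of any network access; fetching robots.txt is left to the crawler.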

2.3. Links in Sitemaps

The Sitemaps Protocol can be extended through the inclusion of XML elements from namespaces other than the Sitemaps XML Namespace. The ResourceSync Framework Specification uses this extensibility mechanism to support the inclusion of typed links pertaining to the resource for which the URL is provided in a Sitemap's <loc> element. The Signmap approach uses the same extensibility mechanism to provide typed links pertaining to the landing page URLs provided in a Sitemap's <loc> elements. This is achieved by:
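
From the repository side, the extension mechanism can be illustrated with a minimal sketch that serializes landing page URLs and their typed links using Python's standard xml.etree.ElementTree module. The function name build_signmap and its input shape (a list of landing page URL and link-attribute pairs) are hypothetical; the two namespace URIs are the ones used in the example Signmap in Section 1.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"

# Serialize Sitemap elements in the default namespace and
# ResourceSync elements with the conventional "rs" prefix.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)


def build_signmap(entries):
    """Build a Signmap string from (landing_page_url, links) pairs.

    Each item in links is a dict of rs:ln attributes, e.g.
    {"rel": "item", "href": "...", "type": "application/pdf"}.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, links in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        for link in links:
            # One rs:ln element per typed link of the landing page.
            ET.SubElement(url, f"{{{RS_NS}}}ln", link)
    return ET.tostring(urlset, encoding="unicode")
```

A usage example: build_signmap([("https://example.com/res1", [{"rel": "item", "href": "https://example.com/content/res1.pdf", "type": "application/pdf"}])]) yields a <urlset> with one <url> entry that holds a <loc> element and one <rs:ln> element.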

2.4. Link Relation Types for Links in Sitemaps

The ResourceSync Framework Specification supports using link relation types that are registered in the IANA Link Relation Type Registry or that are expressed as URIs, as specified in RFC 8288, Sec. 2.1.2. Signmaps offer the same flexibility regarding the provision of typed links pertaining to the landing page, with a focus on Signposting links and especially on those link relation types that can guide a web crawler intent on limiting its scope, e.g. describedby to link to metadata resources and item to link to content resources.
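
The distinction between the two kinds of relation types can be sketched as a purely syntactic check. This is a minimal sketch and an assumption of this example, not something the specification prescribes. The name pattern follows the reg-rel-type grammar of RFC 8288, a value with a URI scheme is treated as an extension relation type, and the function name classify_rel is hypothetical. The check does not consult the IANA registry, so it cannot confirm that a name is actually registered.

```python
import re
from urllib.parse import urlparse

# Syntax of a registered relation type name per the RFC 8288 grammar:
# a lowercase letter followed by lowercase letters, digits, "." or "-".
REG_REL_TYPE = re.compile(r"[a-z][a-z0-9.\-]*")


def classify_rel(rel):
    """Classify a rel value by syntax only (no registry lookup)."""
    # Registered names are compared case-insensitively, so normalize first.
    if REG_REL_TYPE.fullmatch(rel.lower()):
        return "registered-name-syntax"
    # Extension relation types are URIs, so they carry a scheme.
    if urlparse(rel).scheme:
        return "extension"
    return "invalid"
```

For example, classify_rel("describedby") and classify_rel("cite-as") fall in the first category, while a value such as "https://example.com/rel/custom" (a made-up URI) is classified as an extension relation type.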