Signmap - Signposting the Scholarly Web

Signmap: Providing an Inventory of the Scholarly Objects Managed by a Repository

Prepared by: Herbert Van de Sompel, Patrick Hochstenbach, Michael L. Nelson, Martin Klein, Enno Meijers, ...

This version, created 20240624: https://signposting.org/Signmap/

Please provide feedback in the GitHub Signposting repository, using label Signmap

Historically, repositories have provided an inventory of the scholarly objects they manage by making descriptive metadata available via an OAI-PMH machine interface. However, differing approaches to providing metadata can make it hard for harvesting applications to determine unambiguously where an object's content files reside, what its persistent identifier is, etc. Recent innovations in metadata formats, exemplified by RIOXX version 3, have significantly improved on the status quo.

Over time, repositories have also started to publish an inventory using the Sitemaps Protocol, which has been the dominant approach to help web crawlers find a server's resources since 2009. In a typical repository implementation, for each scholarly object managed by a repository, the Sitemap has an entry that provides the object's landing page URL. Given a landing page URL and the HTML that is available there, a crawler can attempt to discover the URLs of other resources associated with each scholarly object, e.g. metadata resources, content resources, persistent identifier. For the longest time, this has been a laborious heuristic-bound task. Support for FAIR Signposting removes uncertainty by providing distinct typed links on the landing page that make discovering the constituent resources of a scholarly object unambiguous. This significantly simplifies the task for any bot that interacts with landing pages, including crawlers intent on collecting all resources associated with each repository object. But if a crawler is only interested in, for example, PDF content resources or BibTeX metadata resources, it must still visit the landing page URL of each object and, by checking the appropriate typed links, determine whether any of the linked resources fall within its scope.

Signmaps, specified in this document, leverage the convenience of the long-established Sitemaps Protocol and extend it with the ability to associate Signposting links with each landing page URL listed in a Sitemap. As such, Signmaps allow crawlers to discover URLs of resources that meet their scope without having to visit each landing page URL.

Image courtesy of Patrick Hochstenbach.

Introduction
Building Blocks

1. Introduction

This specification details the Signmap approach that repositories can use to publish an inventory of the scholarly objects they manage. The approach consists of:

Using the Sitemaps Protocol to publish the URLs of the landing page of each scholarly object.
Providing the URLs of other constituent resources of each scholarly object (e.g. content resources, metadata resources) by means of unambiguous Signposting links associated with the object's landing page URL.

A Signmap showing an entry for a single scholarly object managed by a repository. It shows the URL of the object's landing page in the <loc> element and several Signposting links associated with the landing page in consecutive <rs:ln> elements. As per Signposting conventions, the describedby link points at a metadata resource, the item link at a content resource, and the cite-as link at the object's persistent identifier. Note that the first two links also provide information on the media type of the linked resources, i.e. JSON and PDF, respectively. The first link additionally expresses the profile of the media type.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
      <loc>https://example.com/res1</loc>
      <rs:ln rel="describedby"
             href="https://example.com/metadata/res1.json"
             type="application/ld+json"
             profile="https://w3id.org/ro/crate"/>
      <rs:ln rel="item"
             href="https://example.com/content/res1.pdf"
             type="application/pdf"/>
      <rs:ln rel="cite-as"
             href="https://doi.org/123.457643"/>
  </url>
</urlset>
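
To illustrate how a crawler with a limited scope might consume such a Signmap, the following is a minimal Python sketch using only the standard library. The function name links_by_landing_page and its parameters are assumptions made for this illustration and are not defined by this specification; the XML is the example entry shown above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as used in the example Signmap; the prefixes are local choices.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

SIGNMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
      <loc>https://example.com/res1</loc>
      <rs:ln rel="describedby"
             href="https://example.com/metadata/res1.json"
             type="application/ld+json"
             profile="https://w3id.org/ro/crate"/>
      <rs:ln rel="item"
             href="https://example.com/content/res1.pdf"
             type="application/pdf"/>
      <rs:ln rel="cite-as"
             href="https://doi.org/123.457643"/>
  </url>
</urlset>
"""


def links_by_landing_page(signmap_xml, rel=None, media_type=None):
    """Yield (landing page URL, link target URL) pairs for matching links.

    Hypothetical helper: a link matches when its rel and/or type attribute
    equals the requested value; a filter set to None is not applied.
    """
    root = ET.fromstring(signmap_xml)
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", namespaces=NS) or "").strip()
        for ln in url.findall("rs:ln", NS):
            if rel is not None and ln.get("rel") != rel:
                continue
            if media_type is not None and ln.get("type") != media_type:
                continue
            yield loc, ln.get("href")


# A crawler interested only in PDF content resources:
pdf_links = list(
    links_by_landing_page(
        SIGNMAP.encode("utf-8"), rel="item", media_type="application/pdf"
    )
)
```

In this sketch the crawler obtains the PDF URL directly from the Signmap and never has to request the landing page itself.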

The Signmap approach leverages the following specifications:

The Sitemaps Protocol that has been widely adopted as a means to allow crawlers to discover a web server's resources.
The Robots Exclusion Protocol (RFC 9309) aka robots.txt, which is used by the Sitemaps Protocol to support discovery of Sitemaps.
The ResourceSync Framework Specification (ANSI/NISO Z39.99-2017) that extends Sitemaps with the ability to include typed links to point at resources related to those listed in a Sitemap.
The Signposting conventions that detail which link relation type to use to point at various constituent resources of a scholarly object on the web.

2. Building Blocks

This section provides a description of the building blocks of the Signmap approach that can be used to publish an inventory of the scholarly objects managed by a repository.

2.1. The Sitemaps Protocol

The Signmap approach rigorously follows all aspects of the Sitemaps Protocol with the following implementation guidelines for repositories:

A Sitemap must have one <url> element per scholarly object managed by the repository. The URL of the object's landing page must be provided in the <loc> element. Other elements may be provided as intended by the Sitemaps Protocol.
Landing page URLs that contain characters that are reserved in XML must be encoded as follows:
- & must be encoded as &amp;
- ' must be encoded as &apos;
- " must be encoded as &quot;
- < must be encoded as &lt;
- > must be encoded as &gt;
When the number of scholarly objects managed by the repository (and hence the number of landing page URLs) exceeds 50,000, multiple Sitemaps must be provided as well as a Sitemap Index file that lists the URL of each Sitemap.
In order to meet the restrictions regarding the location of a Sitemap specified by the Sitemaps Protocol, a Sitemap must be provided at a URL that allows listing the contained landing page URLs. For example:
- a Sitemap <https://myuniversity.edu/sitemap.xml> cannot list landing page URLs located at <https://repository.myuniversity.edu/> but it can list landing page URLs located at <https://myuniversity.edu/repository/>;
- when landing pages are located at <https://repository.myuniversity.edu/>, the Sitemap must be located there too.
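
The guidelines above can be sketched in Python as three small helpers: one that entity-escapes a landing page URL, one that splits an inventory into chunks of at most 50,000 URLs, and one that checks whether a Sitemap at a given URL may list a given landing page. All function names are hypothetical and chosen for this illustration. The location check is a simplified reading of the Sitemaps Protocol rule (same scheme and host, landing page path at or below the Sitemap's directory) and does not cover every case in that protocol.

```python
from typing import Iterator, List
from urllib.parse import urlsplit
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50000  # limit stated in the guidelines above


def escape_url(url: str) -> str:
    """Entity-escape the five characters that are reserved in XML."""
    # escape() handles &, < and >; the extra mapping covers both quote characters.
    return escape(url, {"'": "&apos;", '"': "&quot;"})


def split_into_sitemaps(
    landing_pages: List[str], size: int = MAX_URLS_PER_SITEMAP
) -> Iterator[List[str]]:
    """Yield consecutive chunks of landing page URLs, one chunk per Sitemap."""
    for start in range(0, len(landing_pages), size):
        yield landing_pages[start:start + size]


def sitemap_may_list(sitemap_url: str, landing_page_url: str) -> bool:
    """Simplified check of the Sitemap location restriction (an assumption)."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(landing_page_url)
    same_origin = (sitemap.scheme, sitemap.netloc.lower()) == (
        page.scheme,
        page.netloc.lower(),
    )
    # Directory that contains the Sitemap, e.g. "/" for "/sitemap.xml".
    directory = sitemap.path.rsplit("/", 1)[0] + "/"
    return same_origin and (page.path or "/").startswith(directory)


# Hypothetical inventory of 120,001 landing pages: needs three Sitemaps,
# which would then be listed in a Sitemap Index file.
inventory = ["https://repo.org/object/%d" % i for i in range(120001)]
chunk_sizes = [len(chunk) for chunk in split_into_sitemaps(inventory)]
```

A repository with exactly 50,000 objects still fits in a single Sitemap under this sketch; a Sitemap Index becomes necessary only once the chunking yields more than one list.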

2.2. The Robots Exclusion Protocol

The Sitemaps Protocol uses the Robots Exclusion Protocol aka robots.txt to make a Sitemap (or a Sitemap Index, if applicable) discoverable. The Signmap approach uses the Robots Exclusion Protocol in the same way, with the following implementation guidelines for repositories:

A robots.txt file must be provided at the repository-entry-URL of the repository. What the repository-entry-URL is depends on how/where the repository was installed. For example, it could be <https://myuniversity.edu/repository/home>, or <https://repository.myuniversity.edu/home>, or <https://repo.org/>. Generally speaking it is the de-facto entry page to a repository.
The URL of the Sitemap (or Sitemap Index, if applicable) must be provided in the robots.txt file by means of a Sitemap: line, e.g. by adding the line Sitemap: https://myuniversity.edu/sitemap.xml. Other lines may be added to the robots.txt file, as described in the Robots Exclusion Protocol.
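
As an illustration of the discovery step from the crawler's side, the following minimal Python sketch extracts the URLs announced on Sitemap: lines from the text of a robots.txt file. The function name sitemap_urls and the Disallow rule in the sample file are assumptions made for this example; fetching the file over HTTP is left out.

```python
from typing import List


def sitemap_urls(robots_txt: str) -> List[str]:
    """Return the URLs announced on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments; this sketch assumes Sitemap URLs carry no '#'.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The field name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Sample robots.txt; the Disallow rule is purely illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://myuniversity.edu/sitemap.xml
"""

discovered = sitemap_urls(ROBOTS_TXT)
```

Each discovered URL may point at either a Sitemap or a Sitemap Index, so a crawler would need to inspect the root element of the retrieved document before processing it.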

2.3. Links in Sitemaps

The Sitemaps Protocol can be extended through inclusion of XML elements from namespaces other than the Sitemaps XML Namespace. The ResourceSync Framework Specification uses this extensibility mechanism to support the inclusion of typed links pertaining to the resource for which the URL is provided in a Sitemap's <loc> element. The Signmap approach uses the same mechanism to provide typed links pertaining to the landing page URLs provided in a Sitemap's <loc> elements. This is achieved by:

Including the ResourceSync XML Namespace URL in the opening <urlset> element of a Sitemap as xmlns:rs="http://www.openarchives.org/rs/terms/"
Conveying a link pertaining to a landing page by means of a <rs:ln> element that is a child of the <url> element whose <loc> element contains the landing page's URL.
Using the following attributes for the <rs:ln> element to provide link information:
- rel: conveys the link relation type;
- href: conveys the URL (absolute, not relative URL) of the resource that is the target of the link;
- type: conveys the media type of the resource that is the target of the link;
- profile: conveys a profile of the media type by means of a Profile URI.
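
From the repository's side, producing such entries can be sketched in Python with the standard library's ElementTree module. The helper name url_entry is hypothetical. The sketch places each <rs:ln> element inside the <url> element, next to <loc>, mirroring the example entry in the Introduction, and rejects relative href values because the href attribute must carry an absolute URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"

# Serialize Sitemap elements without a prefix and ResourceSync elements as rs:.
ET.register_namespace("", SM_NS)
ET.register_namespace("rs", RS_NS)


def url_entry(landing_page, links):
    """Build one <url> element with a <loc> and one <rs:ln> per link dict."""
    url = ET.Element("{%s}url" % SM_NS)
    ET.SubElement(url, "{%s}loc" % SM_NS).text = landing_page
    for link in links:
        target = urlsplit(link["href"])
        if not (target.scheme and target.netloc):
            raise ValueError("href must be an absolute URL: %r" % link["href"])
        # Attribute names rel, href, type and profile are passed through as given.
        ET.SubElement(url, "{%s}ln" % RS_NS, link)
    return url


urlset = ET.Element("{%s}urlset" % SM_NS)
urlset.append(
    url_entry(
        "https://example.com/res1",
        [
            {
                "rel": "describedby",
                "href": "https://example.com/metadata/res1.json",
                "type": "application/ld+json",
                "profile": "https://w3id.org/ro/crate",
            },
            {
                "rel": "item",
                "href": "https://example.com/content/res1.pdf",
                "type": "application/pdf",
            },
            {"rel": "cite-as", "href": "https://doi.org/123.457643"},
        ],
    )
)

# ElementTree entity-escapes reserved characters in text and attributes.
signmap_text = ET.tostring(urlset, encoding="unicode")
```

Relying on an XML library for serialization, rather than string concatenation, also takes care of the entity escaping required for landing page URLs and link targets.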

2.4. Link Relation Types for Links in Sitemaps

The ResourceSync Framework Specification supports using link relation types registered in the IANA Link Relation Type Registry or expressed as URIs as specified in RFC 8288, Sec. 2.1.2. Signmaps offer the same flexibility regarding the provision of typed links pertaining to the landing page, with a focus on Signposting links and especially on those link relation types that can guide a web crawler intent on limiting its scope, e.g. describedby to link to metadata resources and item to link to content resources.
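
The distinction between registered link relation types and URI-based ones can be sketched as a simple check in Python. The function name is hypothetical, and the set of names below is only a small illustrative subset of registered relation types used by Signposting; the IANA Link Relation Type Registry remains the authoritative source.

```python
from urllib.parse import urlsplit

# Illustrative subset only; consult the IANA registry for the full list.
REGISTERED_SUBSET = {
    "author",
    "cite-as",
    "collection",
    "describedby",
    "item",
    "license",
    "type",
}


def is_acceptable_rel(rel: str) -> bool:
    """Accept a registered name (from the subset above) or a URI."""
    # RFC 8288 compares registered relation type names case-insensitively.
    if rel.lower() in REGISTERED_SUBSET:
        return True
    # Anything else is treated as acceptable only if it has a URI scheme.
    return bool(urlsplit(rel).scheme)
```

A crawler could apply such a check before filtering on describedby or item, so that unknown bare tokens are ignored rather than misinterpreted.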