This is a supporting document for the proposed Linked Data Signatures Working Group Charter, providing some extra explanation of the problem space and associated use cases and requirements.

Terminology

Linked Data and RDF Relationships

The term “Linked Data” was originally defined in 2006. However, this was never a formal definition and the term’s use has evolved over time. These days it can encompass resolvable URI schemes other than HTTP (notably Decentralized Identifiers) and may be used informally as a general term for any set of facts linked over the Web. Different communities may use these terms differently, although the underlying data model and semantics are close, if not identical. The work proposed by the charter necessarily focuses specifically on the underlying RDF technology. For this reason, in the proposed charter and in the terminology to be used by the Working Group, the terms “Linked Data” and “RDF” are used largely as synonyms.

Canonicalization terminology

For a precise definition of the various terms and concepts, the reader should refer to the formal RDF specification [[rdf11-concepts]].

RDF Datasets

R, R' and S each denote an RDF Dataset [[rdf11-concepts]].

Identical RDF Datasets

R = S denotes that R and S are identical RDF Datasets.

Two RDF Datasets are identical if and only if they have the same default graph (under set equality) and the same set of named graphs (under set equality).

If R and S are identical, we may equivalently say that they are the same RDF Dataset.

Isomorphic RDF Datasets

R ≈ S denotes that R and S are isomorphic RDF Datasets.

In particular, R is isomorphic with S if and only if it is possible to map the blank nodes of R to the blank nodes of S in a one-to-one manner, generating an RDF dataset R' such that R' = S.
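The definition above can be illustrated with a minimal Python sketch. The representation is an assumption made for this example only: a dataset is a set of `(subject, predicate, object, graph)` tuples of N-Quads-style term strings, blank nodes start with `_:`, and an empty graph string stands for the default graph. The helper names are hypothetical, and the search tries every one-to-one mapping, so it is only usable on very small datasets.

```python
from itertools import permutations


def blank_nodes(dataset):
    """Return the sorted blank node labels occurring anywhere in the dataset."""
    return sorted({t for quad in dataset for t in quad if t.startswith("_:")})


def relabel(dataset, mapping):
    """Build R' by replacing blank node labels according to `mapping`."""
    return {tuple(mapping.get(t, t) for t in quad) for quad in dataset}


def isomorphic(r, s):
    """Brute-force test of R ≈ S: look for a bijection on blank nodes with R' = S."""
    r_nodes, s_nodes = blank_nodes(r), blank_nodes(s)
    if len(r_nodes) != len(s_nodes) or len(r) != len(s):
        return False
    for perm in permutations(s_nodes):
        if relabel(r, dict(zip(r_nodes, perm))) == set(s):
            return True
    return False


KNOWS = "<http://example.org/knows>"
NAME = "<http://example.org/name>"

r = {("_:a", KNOWS, "_:b", ""), ("_:b", NAME, '"Bob"', "")}
s = {("_:y", KNOWS, "_:x", ""), ("_:x", NAME, '"Bob"', "")}
t = {("_:x", KNOWS, "_:y", ""), ("_:x", NAME, '"Bob"', "")}

print(isomorphic(r, s))  # same structure, different labels
print(isomorphic(r, t))  # the name is attached to a different node
```

Here `r` and `s` differ only in their blank node labels, whereas in `t` the name is attached to the subject of the `knows` statement rather than its object, so no relabelling can make it identical to `r`.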

RDF Dataset Canonicalization

RDF Dataset Canonicalization is a function C that maps an RDF Dataset to an RDF Dataset in a manner that satisfies the following two properties for all RDF Datasets R and S:

  • R ≈ C(R); and
  • C(R) = C(S) if and only if R ≈ S.

We may refer to C(R) as the canonical form of R (under C).

Such a canonicalization function can be implemented, in practice, as a procedure that deterministically labels all blank nodes of an RDF Dataset in a one-to-one manner, without depending on the particular set of blank node identifiers used in the serialization of the input RDF Dataset.
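As an illustration only, the following Python sketch shows one naive function that satisfies both properties. It is not either of the algorithms discussed later in this document: it simply tries every assignment of fresh labels and keeps the lexicographically smallest result, which takes factorial time in the number of blank nodes. The quad representation (tuples of N-Quads-style term strings, with `""` for the default graph), the function name, and the `_:c0`, `_:c1`, … label scheme are all assumptions made for the example.

```python
from itertools import permutations


def canonicalize(dataset):
    """Naive canonical labelling: smallest relabelled quad list over all labellings."""
    nodes = sorted({t for quad in dataset for t in quad if t.startswith("_:")})
    labels = [f"_:c{i}" for i in range(len(nodes))]
    best = None
    for perm in permutations(labels):
        mapping = dict(zip(nodes, perm))
        # Relabel every term, then sort so candidates can be compared as lists.
        candidate = sorted(tuple(mapping.get(t, t) for t in quad) for quad in dataset)
        if best is None or candidate < best:
            best = candidate
    return frozenset(best)


KNOWS = "<http://example.org/knows>"
NAME = "<http://example.org/name>"

r = {("_:a", KNOWS, "_:b", ""), ("_:b", NAME, '"Bob"', "")}
s = {("_:y", KNOWS, "_:x", ""), ("_:x", NAME, '"Bob"', "")}

# Isomorphic inputs yield identical canonical forms.
print(canonicalize(r) == canonicalize(s))
```

Because the set of candidate labellings depends only on the structure of the dataset and not on the input labels, isomorphic datasets share the same minimum, and the result is always a one-to-one relabelling of the input.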

It is important to emphasize that the term “canonicalization” is used here in its very generic form, described as:

In computer science, canonicalization […] is a process for converting data that has more than one possible representation into a “standard”, “normal”, or canonical form.
Source: Wikipedia.

Canonicalization, as used in the context of this document and the proposed charter, is defined on an abstract data model (i.e., on RDF Datasets [[rdf11-concepts]]), regardless of any specific serialization. (It could also be referred to as a “canonical labelling scheme”.) It is therefore very different from the usage of the term in, for example, the “Canonical XML” [[xml-c14n11]] or the “JSON Canonicalization Scheme” [[rfc8785]] specifications. Any comparison with those documents can be misleading.

The General Problem Space

Though canonical labeling procedures for directed and undirected graphs have been studied for several decades, only in the past 10 years have two generalized approaches been proposed for RDF Graphs and RDF Datasets:

  1. The algorithms defined by Aidan Hogan in [[aidan-2017]], reviewed through the anonymous scholarly peer review process, implemented by the author.
  2. The algorithm defined by Rachel Arnold and Dave Longley, see [[arnold-longley-2020]], reviewed by experts at Mirabolic Consulting, implemented and deployed via, e.g., the JSON-LD Signatures package used in several JSON-LD Signature suites.

The introduction of Aidan Hogan’s paper [[aidan-2017]] also contains a more thorough description of the underlying mathematical challenges.

Defining an RDF Dataset Hash and/or Signing RDF Datasets

Signing an RDF Dataset follows, roughly, the same approach as for XML [[xmldsig-core1]]. For an RDF Dataset R this implies the following steps:

  1. use an RDF Dataset Canonicalization function C to calculate C(R);
  2. serialize C(R) to quads [[n-quads]] and sort the resulting set of quads;
  3. apply a (traditional) hashing function h on the result of the serialization to yield h(R) (the cryptographic hash of the Dataset); and
  4. apply a digital signature function on h(R).
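Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a normative procedure: the input is assumed to be already canonicalized (step 1), quads are represented as tuples of N-Quads-style term strings with `""` for the default graph, SHA-256 is chosen here merely as an example of a traditional hash function, and the function names are hypothetical. Step 4 is omitted because it depends on the chosen signature scheme.

```python
import hashlib


def to_nquad(quad):
    """Serialize one quad as an N-Quads line; the graph term is omitted for the default graph."""
    s, p, o, g = quad
    terms = [s, p, o] + ([g] if g else [])
    return " ".join(terms) + " .\n"


def dataset_hash(canonical_dataset):
    """Sort the serialized quads, concatenate them, and hash the resulting bytes."""
    lines = sorted(to_nquad(quad) for quad in canonical_dataset)
    return hashlib.sha256("".join(lines).encode("utf-8")).hexdigest()


canonical = [
    ("_:c0", "<http://example.org/knows>", "_:c1", ""),
    ("_:c1", "<http://example.org/name>", '"Bob"', "<http://example.org/g>"),
]

# The order in which quads are supplied does not affect the result.
print(dataset_hash(canonical) == dataset_hash(list(reversed(canonical))))
```

Sorting the serialized quads removes any dependence on the order in which they happen to be stored or transmitted, so two parties holding the same canonical form compute the same digest.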

The main challenge for the Working Group is to provide a standard for the RDF Dataset Canonicalization function. That paves the way for uniquely identifying RDF Datasets, as well as methods to digitally sign RDF Datasets.

When a digitally signed file is transferred from one system to another and used as is, there is no need for any processing other than checking the digital signature. This is as true for RDF as for any other data format, and existing signature and other cryptographic proof methods may be used. However, RDF has many serializations for datasets, notably TriG, JSON-LD, N-Quads and, informally, CBOR-LD, and data may be converted from one serialization to another in transit. The constrained data transfer use case provides an example of this, where data is transferred using an optical or RF data carrier. The space-efficient verification use case, again, points to a need in some circumstances to transform (usually to minimize) the data that is transferred. In these scenarios, a signature on the original file, such as a JSON signature on a JSON-LD file, is not appropriate, as the conversion will make it invalid. A signature of the abstract dataset, on the other hand, will still be valid.

Out-of-scope Issues for the Working Group

Linked Data/RDF is commonly understood to encode descriptions of real-world objects and claims about their properties and relationships. Some users of RDF are also aware of, and possibly rely upon, the fact that systems can derive additional claims implied by instance data based on schemas and ontologies. There are also different approaches to how the semantic features of RDF are adapted to RDF Datasets. These are but a few examples showing that the complexities around the usage of Linked Data/RDF are real, and may affect the usage patterns of signed RDF content and what the social meaning of those signatures is.

However, these complexities are out of scope for this WG, which concentrates only on the exchange and integration of simple factual data expressed in RDF Graphs and Datasets. The approach taken by this WG is that its minimalist deliverables should provide the foundational technology components as building blocks, while tackling the complexities of the more challenging use cases is left to other groups and communities. In particular, the Linked Data Integrity specification will provide an extensible framework for them to build upon.

Separation of Concepts in the Deliverables

This section provides some background for the structure of deliverables in the charter.

An attentive reader of the Linked Data Signatures Working Group Charter may realize that two deliverables (i.e., the “RDF Dataset Hash” and the “Linked Data Integrity” specifications) rely on the same Linked Data Proofs 1.0 draft Community Group Report. The main reason is a separation of concepts, both to make the deliverable structure more readable and to reflect the different possible use cases.

The RDF Dataset Canonicalization forms the basis of the deliverables in this charter. While, in practice, it is usually combined with a hash function and, possibly, a digital signature scheme, it is also useful in its own right; see, for example, the “Generating canonical Skolem IRIs for blank nodes” use case below.

The RDF Dataset Hash algorithm is necessary for a traditional digital signature that relies on the creation of the calculated hash value. However, a hash can also be used directly; see, for example, the “Space-efficient verification of the contents of Datasets” or the “Semantic consistency of multi-part datasets” use cases in the next section.

The integrity of Linked Data may be secured by a digital signature, as described in the previous paragraph. But that is not the only possible way of proving data integrity. For example, the BBS+ Signatures 2020 scheme, used to ensure zero knowledge proof disclosure of statements, relies on the canonicalization of an RDF Dataset, but it does not rely on the RDF Dataset Hash. Instead, it has a scheme to hash individual RDF triples. The existence of such schemes is the reason why the Linked Data Integrity deliverable aims to define a more general framework where such schemes can also be expressed.

Use Cases and Requirements

Some typical use cases for RDF Dataset Canonicalization and/or signatures are:

Detecting changes in Datasets
When processing RDF Datasets over a period of time, determining if information has changed is helpful. For example, knowing if information has changed helps with data cache invalidation, detecting if expected data has been tampered with or modified, or when debugging unexpected changes in source RDF Datasets.
Requirement: RDF Dataset Canonicalization and Hash algorithms.
Space-efficient verification of the contents of Datasets
If unique identification of RDF Datasets is possible, one can cryptographically hash the information to establish a storage-efficient way to verify that the information has not changed over time. One property of cryptographic digests is that one can verify data integrity. For example, a small device sending an RDF Dataset to a remote storage location can compute a cryptographic digest for later use in verifying that all the data arrived intact and has not been tampered with.
Requirement: RDF Dataset Canonicalization and Hash algorithms.
(Contributed by Alan Karp.)
Secret confirmation of the contents of Datasets
Since a cryptographic digest is the output of a one-way function and serves as an abbreviation for the entire RDF Dataset, one can use it in places where secrecy is desired. For example, when ensuring that the transaction history on a distributed ledger is the same between two services, two systems could keep track of the list of transactions in their respective ledgers. Canonicalizing and cryptographically hashing the list of transactions should result in the same cryptographic hash without either party needing to share the list of transactions with the other.
Requirement: RDF Dataset Canonicalization and Hash algorithms.
(Contributed by Alan Karp.)
Annotating Datasets with digital signatures and other digital proofs
When publishing or transmitting an RDF Dataset, clearly articulating the entity that published the data and protecting it from undetected modification is useful for mission critical systems. For example, understanding the issuer of a Verifiable Credential and ensuring that it is evident when a Verifiable Presentation has been tampered with underlies the trustworthiness of the encoded information.
Requirement: A way of encoding and verifying a digital signature on an RDF Dataset.
Anchoring the existence of Datasets to a Distributed Ledger
New forms of digital proofs, such as proof of work, proof of stake, proof of existence, and proof of elapsed time, have demonstrated that there are useful forms of cryptographic digital proofs that go beyond digital signatures. For example, anchoring an RDF Dataset that expresses a land deed to a Distributed Ledger can establish a proof of existence in a way that does not depend on a single point of failure, such as a local government office.
Requirement: A way of encoding and verifying a digital proof that is not a digital signature on an RDF Dataset.
Generating canonical Skolem IRIs for blank nodes
Skolem IRIs have been proposed in RDF 1.1 as a way to replace blank nodes with IRIs in application scenarios where it is preferable to avoid the use of blank nodes. Rather than using an ad hoc scheme to generate Skolem IRIs to replace blank nodes, an alternative is to generate Skolem IRIs in a deterministic manner, such that compliant implementations will generate the same IRIs to replace the same blank nodes in isomorphic copies of an RDF graph or dataset. Such a procedure will produce a canonical version of a Skolemized RDF graph that can then be used in the context of several of the use cases mentioned previously.
Requirement: An RDF Dataset Canonicalization algorithm that produces canonical labels for blank nodes and a convention to compute Skolem IRIs from these labels.
Constrained data transfer
The space-efficient and digital proofs use cases above both hint at scenarios where data transfer is severely constrained. Such transfer might be achieved using an optical data carrier, such as a QR code or Data Matrix, or a radio frequency data carrier such as an NFC tag.
Requirement: Lossless conversion between highly compact and less compact forms of informationally identical RDF Datasets.
Semantic consistency of multi-part datasets
The detecting changes and space-efficient verification use cases above can be leveraged in situations where a graph or dataset semantically relies on one or several other graph(s), which it refers to through links. Attaching cryptographic hashes to these links would make it possible to verify the overall integrity of the set of interconnected graphs. One such example is the import mechanism of OWL: the ontology consumer may wish to verify that the imported ontology is the same as the one used by the author of the importing ontology, otherwise the resulting inferences may differ. Another such example is EARL test reports: the consumer may wish to ensure that the test description pointed to by the report is the one that was actually used for the test.
Requirement: A standard way of computing and attaching a cryptographic hash to a graph or dataset.
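
For the “Generating canonical Skolem IRIs for blank nodes” use case above, one possible convention can be sketched in Python as follows. The sketch assumes the dataset has already been canonicalized, so that its blank node labels are canonical, and that quads are tuples of N-Quads-style term strings with `""` for the default graph. The `/.well-known/genid/` path is the one recommended for Skolem IRIs in RDF 1.1 Concepts; the `https://example.org` authority, the function name, and the rule of appending the canonical label are assumptions made for this example, not a convention defined by the charter.

```python
SKOLEM_PATH = "/.well-known/genid/"


def skolemize(canonical_dataset, authority="https://example.org"):
    """Replace each canonically labelled blank node with a Skolem IRI built from its label."""

    def convert(term):
        if term.startswith("_:"):
            # Drop the "_:" prefix and append the canonical label to the genid path.
            return f"<{authority}{SKOLEM_PATH}{term[2:]}>"
        return term

    return {tuple(convert(t) for t in quad) for quad in canonical_dataset}


canonical = {
    ("_:c0", "<http://example.org/knows>", "_:c1", ""),
    ("_:c1", "<http://example.org/name>", '"Bob"', ""),
}

for quad in sorted(skolemize(canonical)):
    print(quad)
```

Because the labels come from the canonical form, two implementations that start from isomorphic copies of the same dataset and use the same authority would produce the same Skolem IRIs.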