SQR-013: LSST DocHub Design

  • Jonathan Sick

Latest Revision: 2016-11-30

DOI: http://dx.doi.org/10.5281/zenodo.189431

1   Abstract

LSST DocHub is a proposed solution to information discovery for LSST Data Management, and the LSST project in general. LSST documentation and information artifacts are published through a variety of platforms by virtue of the way information is created — from documents archived on DocuShare, to source code on GitHub, to conversations on community.lsst.org. Currently, staff and users must go to each platform to find information. This has an overall effect of slowing, and even preventing, knowledge sharing. LSST DocHub can solve this problem by decoupling information publication from information discovery. DocHub consists of a unified web front-end for documentation browsing, filtering, and search. The front-end is fed by a web API to centralized metadata and full-text databases. These databases are populated by adapters that monitor each of LSST’s information platforms for new and updated artifacts. DocHub stores metadata as JSON-LD, which is a community-standard, extensible, and self-describing schema. This technote establishes the basic design concept for DocHub, including its architecture and JSON-LD metadata patterns.

Note

LSST DocHub, as described in this technote, is a work in progress. Details may change, and many design decisions need to be made.

2   DocHub’s purpose

LSST continuously produces a vast number of information artifacts across numerous platforms. Software and associated documentation repositories are published through GitHub. Documents and presentations may also be archived in DocuShare, but also on Zenodo to enable scientific citation. Conversations about tasks may happen in JIRA, while more general conversations happen on the Community.lsst.org forum. Papers may be written on GitHub, but are ultimately made available through ADS and a publisher’s website. Ultimately, finding information is more difficult than it should be.

One response is to pare down the number of places where LSST’s information artifacts can exist. While laudable — and using platforms with identical use cases and feature sets simultaneously is certainly confusing — retreating to a single platform is not a solution. Each platform facilitates different types of work. Arbitrarily forcing work onto platforms that aren’t a good fit for that work reduces productivity.

Instead, our approach is to decouple information discovery from the platforms that information exists on. We will index LSST information artifacts, and make their metadata centrally available through a website and search API. Then, finding LSST information involves searching through a single website rather than iteratively trawling individual platforms. This system for indexing LSST information artifacts, storing metadata in a database, and making metadata available through an API and website is called DocHub. This technote describes LSST DocHub’s design.

3   Existing metadata systems

Information discovery is a common issue across large, distributed organizations like LSST. These are some implementations of DocHub-like systems by other open software and data organizations:

  • 18F embeds .about.yml metadata files in their repositories. These metadata are indexed and published on 18F’s Dashboard.
  • Code for America similarly implements a civic.json metadata format. These are used by Code for America’s project search page. Primarily it is used to denote the status of a project and to provide tags for search. Much of the information for the search page is also automatically obtained from GitHub metadata, like the project description.
  • Code.gov embeds a code.json file in federal repositories. This metadata is tracked and published by the Code.gov website and API. Additional discussion about code.gov’s metadata schema is taking place on a GitHub issue. Earlier discussion also happened on a White House source-code-policy issue.
  • CodeMeta is a minimal metadata schema, written in JSON-LD, that can describe scientific software repositories. CodeMeta’s schema is designed to be cross-walked to other metadata schemas including Dublin Core, schema.org, Python’s PyPI, and DataCite.
  • The EU’s Asset Description Metadata Schema for Software. See also CodeMeta issue #41.

These metadata systems share a common pattern: metadata is embedded with the information artifact, centrally indexed, and made available through a search API and website. This approach scales well since it federates metadata definition and maintenance to the source repositories themselves. DocHub builds upon this design pattern.

4   DocHub architecture

DocHub uses a microservice architecture (Figure 1) to gain flexibility. The components are:

  • A metadata schema. DocHub uses JSON-LD since it is extensible yet self-describing. Like code.gov and similar implementations, this metadata is embedded in source repositories whenever possible. The same metadata format is used in the database.
  • A metadata database. DocHub uses a MongoDB database to store all metadata. MongoDB is a document database that works natively with JSON. The JSON-LD that’s embedded in source repositories is available through MongoDB.
  • A full-text database. While MongoDB is well-suited to querying semi-structured data like JSON-LD, its full-text search capabilities are more limited. Where possible, the content of documents will be stored and made available through Elasticsearch.
  • Ingest adapters. Each adapter is a microservice built to transform content and metadata for a particular type of resource into a JSON-LD record and full-text entry stored in the MongoDB and Elasticsearch databases. This adapter architecture helps DocHub scale: indexing a new, arbitrary information source involves deploying a new adapter service. Adapters can either be pushed to (say, by a GitHub webhook) or can poll a platform for new and updated records. Each adapter handles the platform-specific challenge of transforming either templated JSON-LD stored in a source repository or a platform’s native metadata into standardized DocHub JSON-LD.
  • An API server. The web API server allows applications to query against DocHub’s metadata and full-text databases.
  • A web front end. This front end is how people typically use DocHub. This website will allow users to browse and filter DocHub information artifacts, and also provide a generic search against the full-text and metadata databases. The website will be editorially designed to some extent. For example, the front page will show featured projects, papers, and documents, in addition to giving entry points to search and browse against usefully-selected categories. Generally, the website (and API) will allow anonymous access. DocHub can be designed to facilitate authorization-based access to non-public documentation (for example, private GitHub repositories), though this will depend on a centralized user database that doesn’t exist in the needed form yet.

Figure 1 DocHub’s architecture. Adapter microservices pull metadata and content from information artifacts, which could be GitHub repositories, JIRA issues, forum topics, and ADS entries. The adapters build JSON-LD metadata documents and persist them into a MongoDB metadata database. An Elasticsearch cluster stores full text content from the adapters, where possible. The API server provides GraphQL and RESTful interfaces to the MongoDB and Elasticsearch databases.

5   DocHub’s JSON-LD metadata

All DocHub metadata records share a common JSON-LD (linked data) schema. Through a @context, JSON-LD documents map the names of fields to semantic definitions in http://schema.org and other vocabularies. Specifically, DocHub adopts and extends codemeta, which is a minimal schema of concepts needed to describe a scientific software repository. CodeMeta JSON-LD objects can be cross-walked to other repository metadata schemas to enable automatic submission pipelines from GitHub to a repository like Zenodo, for example.

DocHub metadata exists in two contexts: the metadata database, and artifact repositories (such as GitHub repositories). Metadata at rest in DocHub’s database is intended to be complete and authoritative, while metadata embedded in repositories is templated. Metadata templates are transformed by ingest adapters into complete JSON-LD stored by DocHub. This section describes DocHub metadata as it is authoritatively stored in the metadata database.

See also: 9   Appendix: JSON-LD reading list.

5.1   JSON-LD in MongoDB

DocHub’s metadata database is MongoDB so that JSON-LD documents can be persisted and queried natively. This design greatly simplifies the RESTful API server by allowing it to return documents in essentially the same form as they are stored.

MongoDB also obviates schema migrations. By building upon JSON-LD and CodeMeta, the API server is inherently backwards-compatible with any JSON-LD document, even metadata records with new fields not originally known by the API server. As new types of fields are added to metadata records, the API server and front-end can evolve independently to provide new functionality based on this data.
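
As an illustration, the sketch below stores and retrieves a JSON-LD record with pymongo. The database name (dochub), collection name (metadata), and the record’s fields are assumptions for this example, not part of the design.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["dochub"]["metadata"]

# A truncated CodeMeta-style JSON-LD record, expressed as a Python dict.
record = {
    "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
    "@type": "SoftwareSourceCode",
    "name": "sqr-013",
    "version": "master",
}

# The JSON-LD document is stored as-is; no fixed relational schema is needed.
collection.replace_one({"name": "sqr-013", "version": "master"}, record, upsert=True)

# The API server can return the stored document in essentially the same form.
stored = collection.find_one({"name": "sqr-013", "version": "master"}, {"_id": False})
print(stored)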

Todo

How are collections structured? One collection per data class? Or, one collection for everything?

How should artifacts that appear in multiple forms be stored? For example, a technote can have multiple Git branches and tags on GitHub, multiple published editions on LSST the Docs, multiple DOIs, and an ADS entry. CodeMeta JSON-LD tends to capture single versions of a project (a snapshot of a Git branch/tag, LSST the Docs edition and DOI), see 5.2.1   Representing versioned resources in JSON-LD and the metadata database. Is there a need for a special class of MongoDB document that combines and caches this versioned metadata in a way that DocHub’s API and front-end can efficiently use to build, for example, a page listing all technotes?

5.2   JSON-LD Applications

This section explores how different types of metadata can be encoded in CodeMeta JSON-LD (and DocHub’s extension of it):

5.2.1   Representing versioned resources in JSON-LD and the metadata database

From a user’s perspective, DocHub is a way to browse software and documentation projects, and see what versions are published on LSST the Docs.

CodeMeta JSON-LD is best suited for describing single versions of a project in individual JSON-LD metadata objects. But a software or documentation artifact (especially one backed by GitHub) is not a single version:

  • There are multiple versions of the software and documentation (and of its corresponding metadata) on individual branches and tags.
  • Multiple editions on LSST the Docs, corresponding to GitHub branches and tags.
  • Zenodo depositions corresponding to tags.
  • An ADS entry
  • JIRA conversations
  • Community.lsst.org conversations.

Although it could be possible to combine all of these resources and versions in a single MongoDB document, treating a MongoDB document as a holistic description of a project, the schema for combining several JSON-LD resources in a MongoDB document would be ad hoc. Instead, DocHub maps MongoDB documents one-to-one with JSON-LD documents.

In this case, a JSON-LD and MongoDB document would refer to a single branch HEAD or tagged commit.

Note

In this design, DocHub only tracks the HEAD of Git branches and tags. Individual commits aren’t tracked. Tracking commits would enable interesting software provenance tracking, but this would also be significant scope creep for DocHub. Since LSST the Docs editions only track branches and tags, it makes sense for DocHub to also work at that level.

CodeMeta’s relationships field enables one metadata document to refer to another. For one JSON-LD document to refer to its parent Git repository:

{
  "@context": "...",
  "version": "master"
  "relationships": [
    {
      "relationshipType": "wasRevisionOf",
      "namespace": "http://www.w3.org/ns/prov#",
      "relatedIdentifier": "https://github.com/lsst-sqre/sqr-013.git",
      "relatedIdentifierType": "URL"
    }
  ]
}

The wasRevisionOf relationship type is defined in PROV. The PROV ontology includes other relationship types, though CodeMeta does not restrict relationships to use only PROV types.

Given this relationship, the MongoDB query for all JSON-LD records belonging to a GitHub project is:

find({
  relationships: {$elemMatch: {relationshipType: "wasRevisionOf",
                               relatedIdentifier: "https://github.com/lsst-sqre/sqr-013.git"}}
})

It makes sense to use the metadata for the master branch as the ‘main’ record for a GitHub repository. The master metadata is queried with:

find({
  version: "master",
  relationships: {$elemMatch: {relationshipType: "wasRevisionOf",
                               relatedIdentifier: "https://github.com/lsst-sqre/sqr-013.git"}}
})

5.2.3   Relationships to projects

relationships can support linking an artifact to larger multi-repository projects. For example, we want to associate Science Pipelines packages with the Science Pipelines project itself.

For this, we’d use an isPartOf relationship:

{
  "@context": "...",
  "version": "master"
  "relationships": [
    {
      "relationshipType": "isPartOf",
      "relatedIdentifier": "https://github.com/lsst/pipelines_docs.git",
      "relatedIdentifierType": "URL"
    }
  ]
}

In this example, the metadata record is declared as a part of the pipelines_docs GitHub repo, since pipelines_docs ‘represents’ the LSST Science Pipelines. (See below for additional relationship types).

Alternatively, it might be useful to create JSON-LD metadata records corresponding to a project or product, such as lsst_apps.

Note

isPartOf is a schema.org term. It is also in the Zenodo relationship vocabulary.

5.2.4   Representing people in JSON-LD

In CodeMeta JSON-LD, authors are specified in an agents field. For example:

{
   "@context": "...",
   "agents": [
     {
       "@id": "https://orcid.org/0000-0003-3001-676X",
       "@type": "person",
       "email": "jsick@lsst.org",
       "name": "Jonathan Sick",
       "affiliation": "AURA/LSST",
       "mustbeCited": true,
       "isMaintainer": true,
       "isRightsHolder": false,
     }
   ]
}

Note that the @id field is an ORCiD. From a linked-data perspective, adopting ORCiDs as identifiers for people allows us to leverage other data sources, including journals and ADS, more effectively.

ORCiD is not currently required by LSST. An alternative to ORCiD is to treat metadata records served through DocHub’s RESTful API as authoritative records. The DocHub URL for a person’s record becomes their @id.

5.2.6   Describing organizational hierarchy

One search pattern for DocHub, especially for LSST staff, is to browse artifacts by the organization that made them (LSST subsystems and teams). The subOrganization and parentOrganization terms build an organizational hierarchy:

{
   "@context": "...",
   "agents": [
     {
       "@type": "organization",
       "name": "Association of Universities for Research in Astronomy",
       "isRightsHolder": true,
       "isMaintainer": false,
       "role": {
         "namespace": "http://www.ngdc.noaa.gov/metadata/published/xsd/schema/resources/Codelist/gmxCodelists.xml#CI_RoleCode",
         "roleCode": "rightsHolder"
       }
      },
      {
        "@type": "organization",
        "name": "Large Synoptic Survey Telescope",
        "parentOrganization": "Association of Universities for Research in Astronomy",
        "isRightHolder": false,
        "isMaintainer": false
      },
      {
        "@type": "organization",
        "name": "Data Management",
        "parentOrganization": "Large Synoptic Survey Telescope",
        "isRightHolder": false,
        "isMaintainer": false
      },
      {
        "@type": "organization",
        "name": "Science Quality and Reliability Engineering Team",
        "parentOrganization": "Data Management",
        "isRightHolder": false,
        "isMaintainer": true
      }

   ]
}

5.2.7   Types for non-software artifacts

CodeMeta JSON-LD was designed to represent software projects, as reflected in the @type:

{
  "@context":"https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@type": "SoftwareSourceCode",
}

5.2.7.1   schema.org types

SoftwareSourceCode is a schema.org @type: http://schema.org/SoftwareSourceCode. It derives from schema.org’s CreativeWork.

Other schema.org types derived from CreativeWork may also be useful for describing non-software artifacts.

See also: 5.2.8   Representation of publications.

5.2.7.2   Zenodo types

These are artifact types defined by the Zenodo deposition schema:

  • publication: Publication, with publication_type:
    • book: Book
    • section: Book section
    • conferencepaper: Conference paper
    • article: Journal article
    • patent: Patent
    • preprint: Preprint
    • report: Report
    • softwaredocumentation: Software documentation
    • thesis: Thesis
    • technicalnote: Technical note
    • workingpaper: Working paper
    • other: Other
  • poster: Poster
  • presentation: Presentation
  • dataset: Dataset
  • image: Image, with image_type:
    • figure: Figure
    • plot: Plot
    • drawing: Drawing
    • diagram: Diagram
    • photo: Photo
    • other: Other
  • video: Video/Audio
  • software: Software

In a JSON-LD sense, DocHub will use schema.org types, but should be capable of cross-walking metadata to and from these Zenodo types.
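
As an illustration, part of such a cross-walk could be expressed as a simple lookup table. The pairings below are assumptions rather than a finalized mapping; the upload_type and publication_type fields come from Zenodo’s deposition schema.

# A partial, illustrative cross-walk from schema.org @type values to Zenodo
# deposition types. The pairings shown are assumptions, not a final mapping.
SCHEMA_ORG_TO_ZENODO = {
    "SoftwareSourceCode": {"upload_type": "software"},
    "Dataset": {"upload_type": "dataset"},
    "ScholarlyArticle": {"upload_type": "publication",
                         "publication_type": "article"},
    "TechArticle": {"upload_type": "publication",
                    "publication_type": "technicalnote"},
}


def to_zenodo_type(schema_org_type):
    """Return Zenodo deposition type fields for a schema.org @type."""
    return SCHEMA_ORG_TO_ZENODO.get(schema_org_type, {"upload_type": "other"})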

5.2.8   Representation of publications

schema.org has full support for describing scholarly articles using JSON-LD:

This is Example 2 from schema.org’s ScholarlyArticle documentation:

{
  "@context": "http://schema.org",
  "@graph": [
    {
        "@id": "#issue",
        "@type": "PublicationIssue",
        "issueNumber": "5",
        "datePublished": "2012",
        "isPartOf": {
            "@id": "#periodical",
            "@type": [
                "PublicationVolume",
                "Periodical"
            ],
            "name": "Cataloging & Classification Quarterly",
            "issn": [
                "1544-4554",
                "0163-9374"
            ],
            "volumeNumber": "50",
            "publisher": "Taylor & Francis Group"
        }
    },
    {
        "@type": "ScholarlyArticle",
        "isPartOf": "#issue",
        "description": "The library catalog as a catalog of works was an infectious idea, which together with research led to reconceptualization in the form of the FRBR conceptual model. Two categories of lacunae emerge--the expression entity, and gaps in the model such as aggregates and dynamic documents. Evidence needed to extend the FRBR model is available in contemporary research on instantiation. The challenge for the bibliographic community is to begin to think of FRBR as a form of knowledge organization system, adding a final dimension to classification. The articles in the present special issue offer a compendium of the promise of the FRBR model.",
        "sameAs": "http://dx.doi.org/10.1080/01639374.2012.682254",
        "about": [
            "Works",
            "Catalog"
        ],
        "pageEnd": "368",
        "pageStart": "360",
        "name": "Be Careful What You Wish For: FRBR, Some Lacunae, A Review",
        "author": "Smiraglia, Richard P."
    }
  ]
}

And Example 3 from ScholarlyArticle:

{
  "@context": "http://schema.org",
  "@graph": [
    {
      "@id": "#issue4",
      "@type": "PublicationIssue",
      "datePublished": "2006-10",
      "issueNumber": "4"
    },
    {
      "@id": "#volume50",
      "@type": "PublicationVolume",
      "volumeNumber": "50"
    },
    {
      "@id": "#periodical",
      "@type": "Periodical",
      "name": "Library Resources and Technical Services"
    },
    {
      "@id": "#article",
      "@type": "ScholarlyArticle",
      "author": "Carlyle, Allyson.",
      "isPartOf": [
        {
          "@id": "#periodical"
        },
        {
          "@id": "#volume50"
        },
        {
          "@id": "#issue4"
        }
      ],
      "name": "Understanding FRBR as a Conceptual Model: FRBR and the Bibliographic Universe",
      "pageEnd": "273",
      "pageStart": "264"
    }
  ]
}

Example 3 establishes bibliographic information with a @graph containing PublicationIssue, PublicationVolume, and Periodical objects. These three objects are connected to the publication with isPartOf; however, there’s no explicit relationship between the issue, volume, and periodical.

Alternatively, Example 2 has two objects in its @graph: a PublicationIssue (which nests the PublicationVolume and Periodical metadata through its own isPartOf), and a ScholarlyArticle. The ScholarlyArticle links to the PublicationIssue through an isPartOf relationship. Thus Example 2 establishes a complete semantic relationship between the article, issue, volume, and periodical. Example 2 is preferred.

CodeMeta’s approach is slightly different from schema.org’s since CodeMeta encapsulates several simultaneous relations in a single relationships array. This is ideal since it allows us to connect a paper not only to its journal context, but also to associated source code and datasets.

Another difference is that DocHub JSON-LD does not tend to use @graphs; instead, each resource maps to one MongoDB document. One possible approach is to use relationships and fold the journal information into the relationship object:

{
  "@context": "...",
  "@type": "ScholarlyArticle",
  "relationships": [
    {
      "relationshipType": "isSupplementTo",
      "relatedIdentifier": "https://github.com/lsst/example_analysis_software.git",
      "relatedIdentifierType": "URL"
    },
    {
      "relationshipType" "isPartOf",
      "@id": "#issue",
      "@type": [
          "PublicationVolume",
          "Periodical",
          "PublicationIssue"
      ],
      "name": "Cataloging & Classification Quarterly",
      "volumeNumber": "50",
      "issueNumber": "5",
      "publisher": "Taylor & Francis Group"
      "pageEnd": "368",
      "partStart": "360",
    },
    {
      "relationshipType": "isIdenticalTo",
      "relatedIdentifier": "doi:...",
      "relatedIdentifierType": "DOI"
    }
  ],
  "name": "Article's Name",
  "description": "Article's abstract ..."
}

6   JSON-LD metadata templates

Although complete JSON-LD metadata documents can be embedded in GitHub (and similar) repositories, managing metadata this way may not be sustainable. First, some metadata changes with each commit and with the time of the commit (such as dateModified). Second, a lot of metadata is inherent to a repository and its content. Git commit trees contain information to build contributor metadata, the LICENSE file authoritatively defines the repository’s license, and the document’s text authoritatively describes its content. Repeating information inherent to the GitHub repository in a metadata file introduces fragility.

DocHub’s approach is to shift the responsibility of building a complete metadata record to the ingest adapter. To help the ingest adapter, and to store metadata that can be statically managed, we store metadata templates in the Git repository.

6.1   Interpolation objects

For example, consider the licenseId field in a DocHub JSON-LD metadata object:

{
  "@context": "...",
  "licenseId": "MIT"
}

Instead of hard-coding the license’s SPDX Id, we can direct the adapter to interpolate a metadata template to include license information from the GitHub API:

{
  "@context": "...",
  "licenseId": {"@template": "GitHubLicenseId"}
}

An object with a @template field is an interpolation object. The value of @template is the name of a metadata interpolator known to the ingest adapter.

The interpolation object may contain additional fields that act as arguments to the interpolation function. For example, the GitContributors interpolator can take additional agents who aren’t reflected in a Git repository’s history:

{
  "@context": "...",
  "agents": {"@template": "GitContributors",
             "additionalAgents": [
               {
                 "@type": "organization",
                 "name": "Science Quality and Reliability Engineering Team",
                 "parentOrganization": "Data Management",
                 "isRightHolder": false,
                 "isMaintainer": true
               }
             ]
  }
}

These additional agents can be organizations (shown in this example), or additional authors that aren’t Git contributors.
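
One possible way for an adapter to resolve these templates is to walk the JSON-LD document and replace each interpolation object with the output of the named interpolator, passing the object’s other fields as arguments. The sketch below is illustrative: the INTERPOLATORS registry and the repo argument are hypothetical, and real interpolators would call the GitHub API or inspect the Git history.

# A sketch of template resolution in an ingest adapter. The interpolator
# registry and the ``repo`` argument are hypothetical placeholders.
INTERPOLATORS = {
    "GitHubLicenseId": lambda repo, **kwargs: repo.get("license_id"),
    "GitContributors": lambda repo, additionalAgents=(), **kwargs:
        list(repo.get("contributors", [])) + list(additionalAgents),
}


def resolve(node, repo):
    """Recursively replace {"@template": ...} objects with interpolated values."""
    if isinstance(node, dict):
        if "@template" in node:
            name = node["@template"]
            kwargs = {key: value for key, value in node.items() if key != "@template"}
            return INTERPOLATORS[name](repo, **kwargs)
        return {key: resolve(value, repo) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, repo) for item in node]
    return node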

7   Ingest Adapters

Ingest adapters are microservices that take an artifact in its native form and index it in the DocHub databases. That is, each adapter transforms the artifact’s native metadata into DocHub JSON-LD metadata. Each type of artifact has a dedicated ingest adapter microservice. This way all platform-specific logic is contained within individual ingest adapter code bases. The DocHub API server largely does not need to know about platforms; it only needs to interpret metadata in DocHub’s schema.

Ingest adapters can either pull artifact updates or be pushed updates from the artifact’s platform. For example, GitHub repositories can emit webhook events that trigger ingest adapters. Alternatively, ingest adapters can poll for updates from platforms that do not support webhooks.
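
For the polling style, the core loop of an adapter might look like the sketch below. The fetch_updated_records and to_jsonld functions are hypothetical placeholders for the platform-specific logic, and the database URL is an assumption.

import time

from pymongo import MongoClient

metadata = MongoClient("mongodb://mongo:27017")["dochub"]["metadata"]


def fetch_updated_records(since):
    """Placeholder: query the platform's API for records updated after ``since``."""
    return []


def to_jsonld(native_record):
    """Placeholder: transform a platform-native record into DocHub JSON-LD."""
    return native_record


def poll_forever(interval=900):
    """Periodically index new and updated artifacts from a platform."""
    last_checked = None
    while True:
        for native_record in fetch_updated_records(since=last_checked):
            record = to_jsonld(native_record)
            metadata.replace_one({"@id": record["@id"]}, record, upsert=True)
        last_checked = time.time()
        time.sleep(interval)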

7.1   Kubernetes deployment pattern

Since DocHub is deployed with Kubernetes, adapters are expected to be deployed as Kubernetes pods in the same cluster as the API server and databases.

Adapters that receive HTTP POST requests from webhooks are configured with Kubernetes ingress resources, which give them an external IP.

Being in the same cluster, the adapters can connect directly to the MongoDB and Elasticsearch instances, which removes any need for an intermediate API layer. This arrangement does require that adapters be trusted. Every adapter will need to be managed by DocHub’s DevOps team.

7.2   Example: Sphinx Technote Adapter

This section explores how adapters work through the example of DM’s Sphinx technotes. Technotes are GitHub repositories published through LSST the Docs.

This adapter is a web (HTTP) server. It needs a public ingress and should run in the same Kubernetes cluster as the MongoDB and Elasticsearch databases.

The adapter has a HTTP POST endpoint that receives a GitHub webhook that is configured directly in the technote’s GitHub repository. GitHub triggers webhooks for different events; the PushEvent is useful since it’s triggered whenever the repository is updated with new content, regardless of the branch. From the webhook POST, the adapter receives a payload of information about the commits in the push, including:

  • ref: The Git ref that was pushed to (typically a branch name),
  • head: The SHA of the HEAD commit of the push. For GitHub repositories, DocHub only tracks the HEAD of each branch or tag, not individual commits.
  • commits: an array of commit objects, including commits[][url], the API URL of each commit in the push.

From this commit information, the adapter begins to build a metadata record for the repository. First, the adapter looks at the lsstmeta.json file in the repository. Most likely, this is a templated JSON-LD file, which requires the adapter to run metadata interpolators to build a complete lsstmeta.json JSON-LD document. To facilitate this, the adapter performs a shallow clone of the repository so that the adapter’s interpolation pipeline can scrape metadata from the repository content (such as the document’s title and abstract). The adapter can also use GitHub’s API to query for structured information that GitHub has about the repository, such as the committers (to build authorship metadata) or parsed license information. Once built, the adapter inserts the JSON-LD object into the resource’s MongoDB document.

In addition, the adapter also extracts text from the technote’s reStructuredText and inserts that content into Elasticsearch.
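
The sketch below condenses this flow into a single, hypothetical Flask application. The shallow_clone, build_jsonld, and extract_text helpers are placeholders for the cloning, interpolation, and text-extraction steps described above, and the service URLs and index name are assumptions.

from elasticsearch import Elasticsearch
from flask import Flask, request
from pymongo import MongoClient

app = Flask(__name__)
metadata = MongoClient("mongodb://mongo:27017")["dochub"]["metadata"]
fulltext = Elasticsearch(["http://elasticsearch:9200"])


def shallow_clone(repo_url, ref):
    """Placeholder: shallow-clone the repository at the pushed ref."""
    return "/tmp/checkout"


def build_jsonld(workdir, payload):
    """Placeholder: load lsstmeta.json and resolve its interpolation objects."""
    return {"@context": "...", "version": payload["ref"]}


def extract_text(workdir):
    """Placeholder: extract plain text from the technote's reStructuredText."""
    return ""


@app.route("/webhooks/github", methods=["POST"])
def handle_push():
    payload = request.get_json()
    ref = payload["ref"]                          # e.g. "refs/heads/master"
    repo_url = payload["repository"]["clone_url"]

    workdir = shallow_clone(repo_url, ref)
    record = build_jsonld(workdir, payload)
    metadata.replace_one({"relatedIdentifier": repo_url, "version": ref},
                         record, upsert=True)

    # Store the document's text for full-text search.
    fulltext.index(index="technotes", id="{0}#{1}".format(repo_url, ref),
                   body={"content": extract_text(workdir)})
    return "", 202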

8   DocHub API server

8.1   Authentication and authorization

DocHub’s API will require auth infrastructure:

  • Some resources will be embargoed (particularly, draft papers in private GitHub repositories) or classified (for example, access-controlled documents in DocuShare).
  • Some fields within resources may be access controlled. For example, there may be a desire to make email addresses in records of people available only to authenticated project and science collaboration users.

LSST does not currently have a general purpose authentication system and user database capable of supporting authorization tasks. There are some workarounds for this:

  • Permit DocHub to only index public information. The metadata of a classified DocuShare document may be considered public and indexed, but the content would not be indexed by Elasticsearch. In this case, the metadata adapters are required to enforce data classification.
  • Use GitHub. GitHub OAuth would authenticate users, and GitHub’s permissions model would be used for authorization. That is, only those who can see a GitHub repository would be able to view it on DocHub. One problem here is that not everyone in LSST is on GitHub. Second, access controls on DocuShare do not map to GitHub organizations.
  • Use Slack. This is a tenable authentication solution since everyone in the project and science collaborations has (or can have) an https://lsstc.slack.com Slack account, making Slack-based OAuth authentication possible. The https://slack.com/api/users.identity endpoint can include information about a user’s Slack team memberships. This could be a convenient way of establishing authorization (a minimal sketch follows this list).
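
As an illustration of the Slack option, the sketch below calls the users.identity Web API method, assuming an OAuth access token with the identity.basic scope has already been obtained; the function name is hypothetical.

import requests


def get_slack_identity(access_token):
    """Return the Slack user and team identity for an authenticated user."""
    response = requests.get(
        "https://slack.com/api/users.identity",
        headers={"Authorization": "Bearer {0}".format(access_token)},
    )
    payload = response.json()
    if not payload.get("ok"):
        raise RuntimeError(payload.get("error", "Slack identity lookup failed"))
    return payload["user"], payload["team"]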

In the long term, an ideal solution would be a central LSST and community user database. That database would provide universal user authentication. It would also be the best place to establish groups that define permissions. Indeed, DocuShare, GitHub, and Slack permissions and groups ought to be derived from this central database.

In the near term, we can launch DocHub as a completely open system, though a system for checking authorizations should be anticipated in the original design.

8.2   RESTful API

The DocHub API server will provide a basic RESTful API to access JSON-LD documents:

GET https://dochub.lsst.codes/metadata/identifier.json

This provides two important features for linked-data datasets:

  1. The URL for a JSON-LD document serves as the universal identifier for a resource, in a linked-data sense. For example, a relationships field in one JSON-LD document can use a DocHub REST API URL of another artifact as the relatedIdentifier.
  2. Third-party metadata services can ingest this JSON-LD.

8.2.1   Implementation

For consistency with LSST Data Management’s technology stack, the RESTful API will be deployed as a Flask application.

The ID of a DocHub JSON-LD document can be derived from its MongoDB ObjectId, which is a universally unique identifier for every MongoDB document.
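
A minimal sketch of such an endpoint, assuming documents are addressed by their MongoDB ObjectId, is shown below; the route shape, database name, and error handling are illustrative assumptions.

from bson import ObjectId
from bson.errors import InvalidId
from flask import Flask, abort, jsonify
from pymongo import MongoClient

app = Flask(__name__)
metadata = MongoClient("mongodb://mongo:27017")["dochub"]["metadata"]


@app.route("/metadata/<identifier>.json", methods=["GET"])
def get_metadata(identifier):
    try:
        object_id = ObjectId(identifier)
    except InvalidId:
        abort(404)
    record = metadata.find_one({"_id": object_id}, {"_id": False})
    if record is None:
        abort(404)
    # Return the stored JSON-LD essentially unchanged.
    return jsonify(record)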

8.2.2   Additional questions

  1. Should DocHub fully resolve the metadata of all related resources (as much as is possible) by walking the link tree? This could be controlled by an argument to the HTTP GET request.
  2. Should the RESTful API provide JSON-LD transformation functionality, like framing (customizing the representation of a JSON-LD document), expansion (removing the context and expanding field names into full IRIs), and flattening (collecting all of a document’s node objects into a single flat array)?

8.3   GraphQL API

In addition to the RESTful API, DocHub should provide a GraphQL API through a /graphql endpoint. Whereas RESTful APIs are oriented towards CRUD operations on resources, GraphQL is designed to efficiently populate data in user interfaces, which usually iterate over a subset of data in many resources. In REST, it’s often necessary to build custom endpoints that efficiently provide data to populate a UI. With GraphQL, the query specifies exactly what the shape of the output dataset is.

8.3.1   Implementation

DocHub’s GraphQL API will be implemented with the Graphene package within the Flask application. All GraphQL queries are served from a single /graphql endpoint.

8.3.2   Type system

GraphQL uses a type system so that the server can validate and resolve GraphQL’s arbitrary requests. DocHub’s GraphQL implementation will need to distill the various types of information expressed in JSON-LD as basic GraphQL types like Person and Organization, and interfaces like Artifact for hierarchies that include types like SoftwareRepository, GitRef, LsstTheDocsEdition, DocuShareDeposition, ZenodoDeposition, and so forth.
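
As an illustration, the type system might begin with Graphene declarations like the sketch below. The type and field names are assumptions chosen to show the pattern, not the final schema.

import graphene


class Person(graphene.ObjectType):
    """A person drawn from an artifact's agents metadata."""
    name = graphene.String()
    orcid = graphene.String()
    affiliation = graphene.String()


class Artifact(graphene.Interface):
    """Fields shared by all artifact types."""
    name = graphene.String()
    url = graphene.String()
    agents = graphene.List(Person)


class SoftwareRepository(graphene.ObjectType):
    """One concrete artifact type implementing the Artifact interface."""
    class Meta:
        interfaces = (Artifact,)

    default_branch = graphene.String()


class Query(graphene.ObjectType):
    artifact = graphene.Field(Artifact, identifier=graphene.String(required=True))

    def resolve_artifact(self, info, identifier):
        # In production this would query MongoDB; a stub is returned here.
        return SoftwareRepository(name="sqr-013", url=identifier, agents=[])


schema = graphene.Schema(query=Query, types=[SoftwareRepository])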

Overall, the GraphQL API should be designed to efficiently populate DocHub’s front-end user interface (whereas the REST and JSON-LD API is designed to be cross-walked to other metadata systems).

9   Appendix: JSON-LD reading list