AIRR Ontologies and Vocabularies Sub-WG#
Summary#
The “Ontologies and Vocabularies Team” was initially formed as a joint interest group of the Common Repository (ComRepo) and the Minimal Standards (MiniStd) working groups (WGs) of the AIRR Community. When the two WGs merged into the current Standards WG in December 2020, OntoVoc became a Sub-WG of it. The long-term aim of the Sub-WG is to define standard vocabularies and ontologies to be used by AIRR-compliant repositories.
Ontology Data Representation#
The nodes in an ontology are typically either concepts (e.g., capital)
or instances thereof (e.g., Paris). These nodes have local IDs (often
numbers), which are unique within an ontology. They also typically have
labels, which are the human-readable names of the nodes. Ontology
entities in the AIRR Data Standard reflect this model, with each AIRR
field that is represented as an ontology recorded with a global
ontology ID (id) and the corresponding label (label).
Within the AIRR Standards, Compact URIs (CURIEs) are used to represent ontology IDs or persistent IDs. CURIEs are a standardized way to abbreviate Internationalized Resource Identifiers (IRIs, [RFC3987]), which include URIs and URLs as subsets. They were originally conceived to simplify the handling of attributes, e.g., in XML or SPARQL, by making them more compact and readable. CURIEs are also used by IEDB databases to reduce redundancies (mainly in the leading part of IRIs).
For example, a typical CURIE would look like NCBITAXON:9258. In this
case, NCBITAXON is the prefix, a custom string that will be
replaced by a repository-defined IRI component (e.g.,
http://purl.obolibrary.org/obo/NCBITaxon_). Note that there is no
connection between NCBITAXON in the CURIE and NCBITaxon in the
IRI; the former is just a placeholder. Although common, it is not
always the case that a resolved CURIE (the IRI prefix plus the
local ID) can be used as a URL directly to look up the CURIE using a
web browser.
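The expansion described above can be sketched in a few lines of Python. This is only an illustration: the table `PREFIX_TO_IRI` and the helper `expand_curie` are hypothetical names chosen for this example and are not part of the AIRR Schema; the IRI components are the ones quoted in this document.

```python
from typing import Mapping

# Hypothetical prefix table; the IRI components are taken from this document.
PREFIX_TO_IRI = {
    "NCBITAXON": "http://purl.obolibrary.org/obo/NCBITaxon_",
    "DOID": "http://purl.obolibrary.org/obo/DOID_",
}


def expand_curie(curie: str, prefix_to_iri: Mapping[str, str] = PREFIX_TO_IRI) -> str:
    """Replace the CURIE prefix with its IRI component and append the local ID."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_to_iri:
        raise KeyError(f"unknown CURIE prefix: {prefix!r}")
    return prefix_to_iri[prefix] + local_id


print(expand_curie("NCBITAXON:9258"))
# prints http://purl.obolibrary.org/obo/NCBITaxon_9258
```

Note that the lookup is purely textual: the prefix `NCBITAXON` is a dictionary key, so its spelling does not need to match anything inside the IRI.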
The AIRR Schema provides a CURIEMap, a list of AIRR approved CURIE
prefixes along with a map of at least one iri_prefix (i.e.,
a replacement string to construct the complete IRI) for each prefix.
As the iri_prefix might differ between provider-specific
implementations of an ontology (e.g., NCBI Taxonomy), the CURIEMap
supports multiple iri_prefix entries for a given prefix. Finally,
the CURIEMap should also provide a default map and provider
for each prefix. Complementary to this, the InformationProvider
list describes the mechanism to computationally look up a resolved IRI
(i.e., the iri_prefix plus the local ID) by specifying how to make
a request to the provider as well as describing the format in which
the request response will be provided.
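To make the idea of several iri_prefix entries and a default map concrete, here is a minimal Python sketch. The nested dictionary is an illustrative stand-in, not the exact AIRR Schema layout; the map name `BioPortal`, the choice of `OBO` as the default for this prefix, and the helper `resolve` are assumptions made for this example. The two IRI prefixes are the ones listed for NCBITAXON in this document.

```python
# Illustrative structure only: several iri_prefix entries per CURIE prefix
# plus a default map. Map names and the default are assumptions.
CURIE_MAP = {
    "NCBITAXON": {
        "default": {"map": "OBO"},
        "map": {
            "OBO": {"iri_prefix": "http://purl.obolibrary.org/obo/NCBITaxon_"},
            "BioPortal": {"iri_prefix": "http://purl.bioontology.org/ontology/NCBITAXON/"},
        },
    },
}


def resolve(curie, map_name=None, curie_map=CURIE_MAP):
    """Expand a CURIE using the named map, or the prefix's default map."""
    prefix, _, local_id = curie.partition(":")
    entry = curie_map[prefix]
    chosen = map_name or entry["default"]["map"]
    return entry["map"][chosen]["iri_prefix"] + local_id


print(resolve("NCBITAXON:9606"))
# prints http://purl.obolibrary.org/obo/NCBITaxon_9606
print(resolve("NCBITAXON:9606", map_name="BioPortal"))
# prints http://purl.bioontology.org/ontology/NCBITAXON/9606
```

Both results are different IRIs that are meant to identify the same concept, which is why a single prefix can carry more than one iri_prefix.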
The CURIEMap serves several purposes:
It provides a controlled namespace for CURIE prefixes used in the AIRR Schema. For now, custom additions to or replacements of these prefixes in the schema are prohibited. This does not affect the ability of repositories to use such custom prefixes internally.
It simplifies resolution of CURIEs. The iri_prefix lists for each prefix should not be considered to be exhaustive. However, when using a custom iri_prefix, it must be ensured that the expanded IRI still refers to the same concept/instance as when using the default iri_prefix.
It simplifies computation using CURIEs. It is possible to use the provider for a prefix as a mechanism to look up a CURIE from a provider with a defined response (see below).
It should be explicitly noted that the CURIEMap should not be
interpreted as any kind of recommendation for certain providers. It is
left up to users to decide how to resolve the resulting IRIs, e.g., via
DNS/HTTP (if possible) or by using a provider of their choice.
General Policies#
Criteria#
Ontologies used within AIRR standards
MUST [1] cover the majority of the required terms, but complete coverage is OPTIONAL
MUST have a structure that is scientifically correct and logically coherent
MUST NOT feature complexity that makes it hard to use for queries and data representation
SHOULD already be widely adopted
MUST be actively maintained
MUST be available under a free license
SHOULD comply with the OBO Foundry Principles. This does not imply a preference.
Comments on criteria:
ad (1): For most fields it will be difficult to find complete and accurate ontologies. Therefore picking the best available ontology and working with its maintainers to include missing terms is expected to be the most sustainable approach.
ad (5): This requirement follows from (1), as there needs to be a way to submit term requests.
ad (6): A number of ontologies need to be licensed from their respective copyright holders. This results in potential barriers for implementation and distribution of such ontologies. Therefore only ontologies available under a free license are considered suitable for AIRR-compliant databases. The list of suitable licenses is not final, but includes: CC0 and CC BY.
ad (7): This is an endorsement of the OBO Foundry Principles, not of the OBO Foundry Ontologies in general. Hence, non-OBO ontologies have equal standing if they comply with the Principles.
Approved Ontologies#
Cell ontology (CL)
used in:
Cell subset (cell_subset, Tissue and Cell Processing)
CURIE summary
CURIE Prefix: CL
CURIE IRI Prefix: http://purl.obolibrary.org/obo/CL_
example AIRR use
“cell_subset.id” : “CL:0000542”
“cell_subset.label” : “lymphocyte”
default root node
label: lymphocyte
local ID: CL_0000542
path:
license: CC BY
latest release (as of 2020-05-20): 2020-03-02
maintainer: Alexander Diehl, Buffalo, NY, US (addiehl@buffalo.edu)
Human disease ontology (DOID)
used in:
Diagnosis (disease_diagnosis, Diagnosis)
CURIE summary
CURIE Prefix: DOID
CURIE IRI Prefix: http://purl.obolibrary.org/obo/DOID_
example AIRR use
“disease_diagnosis.id” : “DOID:9538”
“disease_diagnosis.label” : “multiple myeloma”
default root node
label: disease
local ID: DOID:4
path: disease
license: CC0
latest release (as of 2020-05-20): 2020-04-20
maintainer: Lynn Schriml, U Maryland, MD, US (lynn.schriml@gmail.com)
notes: Features ICD cross-reference
NCBI organismal taxonomy (NCBITAXON)
used in:
Species (species, Subject)
Cell species (cell_species, Tissue and Cell Processing)
CURIE summary
CURIE Prefix: NCBITAXON
CURIE IRI Prefixes: http://purl.obolibrary.org/obo/NCBITaxon_, http://purl.bioontology.org/ontology/NCBITAXON/
example AIRR use
“species.id” : “NCBITAXON:9606”
“species.label” : “Homo sapiens”
default root node
label: Gnathostomata
local ID: 7776
path: cellular organisms/Eukaryota/Opisthokonta/Metazoa/Eumetazoa/Bilateria/Deuterostomia/Chordata/Craniata/Vertebrata/Gnathostomata
license: UMLS
latest release (as of 2020-05-20): 2020-04-18
repo: obophenotype/ncbitaxon
maintainer: NCBI (info@ncbi.nlm.nih.gov)
NCI thesaurus (NCIT)
used in:
Study type (study_type, Study)
CURIE summary
CURIE Prefix: NCIT
CURIE IRI Prefixes: http://purl.obolibrary.org/obo/NCIT_, http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#
example AIRR use
“study_type.id” : “NCIT:C15197”
“study_type.label” : “Case-Control Study”
default root node
label: Study
local ID: C63536
path: Activity/Clinical or Research Activity/Research Activity/Study
license: Public domain, credit of NCI is requested
latest release (as of 2020-05-20): 2020-05-04
maintainer: NCI (ncicbiitappssupport@mail.nih.gov)
Units of measurement ontology (UO)
used in:
Age unit (age_unit, Subject)
CURIE summary
CURIE Prefix: UO
CURIE IRI Prefix: http://purl.obolibrary.org/obo/UO_
example AIRR use
“age_unit.id” : “UO:0000036”
“age_unit.label” : “year”
default root node
label: time unit
local ID: UO_0000003
path: unit/time unit
license: CC BY (per GitHub repo)
latest release (as of 2020-05-20): 2020-05-18
maintainer: unknown
Uber-anatomy ontology (Uberon)
used in:
Tissue (tissue, Sample)
CURIE summary
CURIE Prefix: UBERON
CURIE IRI Prefix: http://purl.obolibrary.org/obo/UBERON_
example AIRR use
“tissue.id” : “UBERON:0002371”
“tissue.label” : “bone marrow”
default root node
label: multicellular anatomical structure
local ID: UBERON:0010000
path: /BFO_0000002/BFO_0000004/anatomical entity/material anatomical entity/anatomical structure/multicellular anatomical structure
license: CC BY
repo: obophenotype/uberon
latest release (as of 2020-05-20): 2019-11-22
maintainer: Chris Mungall, LBL, CA, US (cjmungall@lbl.gov)
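The "used in" entries above imply a simple consistency check: the CURIE prefix of a field's id should be the one approved for that field. The following Python sketch shows one way to express this. The table `FIELD_PREFIX` is transcribed from the entries above, while the function name `has_approved_prefix` is hypothetical and not part of any AIRR library.

```python
# Field-to-prefix table transcribed from the "used in" entries above.
FIELD_PREFIX = {
    "cell_subset": "CL",
    "disease_diagnosis": "DOID",
    "species": "NCBITAXON",
    "cell_species": "NCBITAXON",
    "study_type": "NCIT",
    "age_unit": "UO",
    "tissue": "UBERON",
}


def has_approved_prefix(field: str, curie: str) -> bool:
    """Return True if the CURIE's prefix is the one approved for the field."""
    prefix, sep, local_id = curie.partition(":")
    return bool(sep and local_id) and FIELD_PREFIX.get(field) == prefix


print(has_approved_prefix("tissue", "UBERON:0002371"))  # prints True
print(has_approved_prefix("tissue", "CL:0000542"))      # prints False
```

This only checks the prefix. It does not verify that the term exists in the ontology or that it lies beneath the default root node listed for the field.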
Computing with Ontologies#
One of the key goals of using ontologies is to enable analysis tools
to perform computation using the information in those ontologies. The
AIRR Schema’s CURIEMap lists one or more providers for each CURIE
prefix that can be used programmatically by analysis tools. Although
the AIRR Schema lists multiple providers for each ontology, this section
focuses on the use of the EBI OLS provider’s OLS Web API
interface for querying ontologies.
If we consider the DOID prefix from the CURIEMap, the section
below defines the use of the Human Disease Ontology (DOID) within the
AIRR Standard:
DOID:
    type: ontology
    default:
        map: OBO
        provider: OLS
    map:
        OBO:
            iri_prefix: "http://purl.obolibrary.org/obo/DOID_"
We see that the default map for DOID is the OBO map, and the OBO
map’s iri_prefix is http://purl.obolibrary.org/obo/DOID_. Thus
the mapping of the CURIE DOID:9538 (the CURIE for disease “multiple
myeloma”) will yield the resolved string
http://purl.obolibrary.org/obo/DOID_9538. By the strictest of
definitions, this is a valid IRI and should only be considered an
identifier, but in this case this IRI is also a URL and can be used to
look up the CURIE.
If we consider the default DOID provider in the CURIEMap, we see
that it is OLS. Then, in the InformationProvider object of the
AIRR Schema, under provider we see:
InformationProvider:
    provider:
        OLS:
            request:
                url: "https://www.ebi.ac.uk/ols/api/ontologies/{ontology_id}/terms?iri={iri}"
                response: application/json
And later we see that the parameters for OLS are:
    parameter:
        CL:
            Ontobee:
                ontology_id: CL
            OLS:
                ontology_id: cl
        DOID:
            Ontobee:
                ontology_id: DOID
            OLS:
                ontology_id: doid
The above tells us that we can use the OLS provider to look up
ontology terms. The {iri} component of the url string tells us
that we need to use the resolved IRI and the {ontology_id} component
tells us that we need to replace the ontology_id parameter in the
URL with the DOID OLS parameter in the specification, which is the
string doid. Thus the fully resolved URL to query for the CURIE
DOID:9538 would be:
https://www.ebi.ac.uk/ols/api/ontologies/doid/terms?iri=http://purl.obolibrary.org/obo/DOID_9538
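The substitution steps above can be sketched in Python as follows. The URL template, the `doid` and `cl` parameters and the DOID iri_prefix are copied from the excerpts above; the helper `ols_query_url` and its `encode_iri` option are hypothetical additions for this example. Whether the provider expects the IRI to be percent-encoded is not stated here, so encoding is offered only as an option.

```python
from urllib.parse import quote

# Values copied from the schema excerpts above.
OLS_URL_TEMPLATE = "https://www.ebi.ac.uk/ols/api/ontologies/{ontology_id}/terms?iri={iri}"
OLS_ONTOLOGY_ID = {"CL": "cl", "DOID": "doid"}
IRI_PREFIX = {"DOID": "http://purl.obolibrary.org/obo/DOID_"}


def ols_query_url(curie: str, encode_iri: bool = False) -> str:
    """Fill the OLS request template for a CURIE (hypothetical helper)."""
    prefix, _, local_id = curie.partition(":")
    iri = IRI_PREFIX[prefix] + local_id
    if encode_iri:
        # Percent-encode the IRI so it is safe as a query parameter value.
        iri = quote(iri, safe="")
    return OLS_URL_TEMPLATE.format(ontology_id=OLS_ONTOLOGY_ID[prefix], iri=iri)


print(ols_query_url("DOID:9538"))
# prints https://www.ebi.ac.uk/ols/api/ontologies/doid/terms?iri=http://purl.obolibrary.org/obo/DOID_9538
```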
Again, referring to the OLS provider we see that we can expect an
application/json response to the above query, and indeed the
response we receive from the above starts with a JSON object as follows.
{
  "_embedded" : {
    "terms" : [ {
      "iri" : "http://purl.obolibrary.org/obo/DOID_9538",
      "label" : "multiple myeloma",
      "description" : [ "A myeloid neoplasm that is located_in the plasma cells in bone marrow." ],
      "annotation" : {
        "comment" : [ "OMIM mapping confirmed by DO. [SN]." ],
        "database_cross_reference" : [ "ICD10CM:C90.0", "MESH:D009101", "ICD9CM:203.0", "GARD:7108", "NCI:C3242", "OMIM:254500", "ORDO:29073", "EFO:0001378", "SNOMEDCT_US_2020_09_01:94705007", "UMLS_CUI:C0026764" ],
        "has_obo_namespace" : [ "disease_ontology" ],
        "id" : [ "DOID:9538" ]
      },
      "synonyms" : [ "plasma cell myeloma" ],
      "ontology_name" : "doid",
      "ontology_prefix" : "DOID",
      "ontology_iri" : "http://purl.obolibrary.org/obo/doid.owl",
      "is_obsolete" : false,
      "term_replaced_by" : null,
      "is_defining_ontology" : true,
      "has_children" : true,
      "is_root" : false,
      "short_form" : "DOID_9538",
      "obo_id" : "DOID:9538",
      [Content edited because of length]
In this response, you can see that the Ontology object that we requested
has a label field that contains the value multiple myeloma and
that the id field has a value of DOID:9538.
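As a rough illustration of reading such a response, the Python sketch below parses a hand-trimmed copy of the excerpt above and pulls out the label and the CURIE. The sample has been closed by hand so that it is valid JSON and keeps only a few of the fields shown; a real response carries many more. The helper name `first_term` is hypothetical.

```python
import json

# Hand-trimmed copy of the response above, closed so that it parses.
SAMPLE_RESPONSE = """
{
  "_embedded": {
    "terms": [
      {
        "iri": "http://purl.obolibrary.org/obo/DOID_9538",
        "label": "multiple myeloma",
        "is_obsolete": false,
        "obo_id": "DOID:9538"
      }
    ]
  }
}
"""


def first_term(response_text: str) -> dict:
    """Return the first term object from an OLS-style response body."""
    payload = json.loads(response_text)
    return payload["_embedded"]["terms"][0]


term = first_term(SAMPLE_RESPONSE)
print(term["obo_id"], term["label"])
# prints DOID:9538 multiple myeloma
```

A tool consuming live responses would also want to handle an empty `terms` list and check `is_obsolete` before trusting the label.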
It is beyond the scope of this document to describe in detail the JSON
structure of each of the providers, but this information can be
discovered through the provider web sites. It should be noted that
all Ontology objects in the AIRR specification have OLS as a
provider and therefore the method above can be used for any of the
ontologies in the AIRR specification. Please see the OLS Web API
documentation for details of the JSON response for the OLS provider.
Sprint Reports#
[RFC3987] Internationalized Resource Identifiers (IRIs). DOI:10.17487/RFC3987