AIRR Data Standards#

AIRR Data Standards are versioned specifications that consist of a file format and a well-defined schema. The schema is provided in a machine-readable YAML document that follows the OpenAPI v2.0 specification. The schema defines the data model, field names, data types, and encodings for AIRR standard objects. Strict typing enables interoperability and data sharing between different AIRR-seq analysis tools and repositories, and some fields use a controlled vocabulary or an ontology for value restriction. Specification extensions are utilized to define AIRR-specific attributes.
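
As a rough illustration of what strict typing and controlled vocabularies buy a consumer of the schema, the sketch below checks a record against a small OpenAPI-style definition. The definition is hand-written for this example (it is not copied from the AIRR schema), and `EXAMPLE_DEFINITION`, `PYTHON_TYPES`, and `check_record` are hypothetical names rather than part of any AIRR library.

```python
# Hypothetical fragment in the style of an OpenAPI v2 definition, written as
# the dict a YAML parser would produce. Fields and enum values are
# illustrative assumptions, not the real AIRR schema.
EXAMPLE_DEFINITION = {
    "type": "object",
    "required": ["sequence_id"],
    "properties": {
        "sequence_id": {"type": "string"},
        "productive": {"type": "boolean"},
        "junction_length": {"type": "integer"},
        "locus": {"type": "string", "enum": ["IGH", "IGK", "IGL"]},
    },
}

# Map OpenAPI primitive type names to Python types.
PYTHON_TYPES = {
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
}


def check_record(record, definition):
    """Return a list of human-readable problems found in `record`."""
    problems = []
    for name in definition.get("required", []):
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, value in record.items():
        spec = definition["properties"].get(name)
        if spec is None:
            continue  # unknown fields are ignored in this sketch
        expected = PYTHON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and spec["type"] != "boolean":
            problems.append(f"{name}: expected {spec['type']}")
        elif not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: {value!r} is not in the controlled vocabulary")
    return problems


print(check_record({"productive": "T", "locus": "XYZ"}, EXAMPLE_DEFINITION))
```

A record that passes such a check can be exchanged between tools without either side guessing at field types or permitted values, which is the interoperability argument made above.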

Schema Definitions#

Data Model#

The MiAIRR standard defines the minimal information for submission and publication of AIRR-seq datasets. The standard defines a set of data elements for this information and organizes them into six high-level sets:

Study, Subject and Diagnosis
Sample Collection
Sample Processing and Sequencing
Raw Sequences
Data Processing
Processed AIRR Sequences with Annotations

However, beyond these sets, MiAIRR does not define any structure, data model or relationship between the data elements. This provides flexibility for the information to be stored in various database repositories but is problematic for interoperability and reusability of that information by computer programs. The AIRR Data Model overcomes these issues by defining a schema for the MiAIRR data elements, structuring them within schema objects, defining the relationship between those objects, and defining a file format.

Here are the primary schema objects of the AIRR Data Model:

Schema Object	Description
`Study`	Information about the experimental study design, including the title of the study, laboratory contact information, funding, and linked publications.
`Subject`	Information about the study cohorts and individual subjects, including species, sex, age, and ancestry.
`Diagnosis`	Information about disease state(s), therapies, and study group membership (e.g., control versus disease).
`Sample`	Information about the origin and expected composition of the biological sample(s). This set aims to capture essential information about the collection of a sample, including its source (e.g., anatomical site), its provenance (provider), and the experimental condition (e.g., the time point during the course of a disease or treatment).
`CellProcessing`	Information about the cell subset being profiled, as defined by the investigator, and the flow cytometry or other markers used to select the subset. Additional information includes the number of cells per sample and whether cells were prepared in bulk or captured as single cells.
`NucleicAcidProcessing`	Information about nucleic acid sample type (e.g., RNA versus DNA) and how immune-receptor gene rearrangements were amplified and sequenced (for example, RACE-PCR versus multiplex PCR, paired PCR, and/or varying read length and sequencing chemistries).
`SequencingRun`	Information about the sequencing run, such as the number of reads, read lengths, quality control parameters, the sequencing kit and instrument(s) used, and run batch number. Also includes information about the raw data for the sequencing run (e.g., FASTQ files).
`DataProcessing`	Information about the data processing to transform the raw sequencing data into `Rearrangements`.
`Repertoire`	Composite object that combines the schema objects `Study`, `Subject`, `Diagnosis`, `Sample`, `CellProcessing`, `NucleicAcidProcessing`, `SequencingRun`, and `DataProcessing`. Each `Repertoire` has a unique identifier `repertoire_id` for linking with other data files, e.g. `Rearrangements`. `Repertoires` have their own schema and file format described here.
`RepertoireGroup`	Composite object that combines multiple `Repertoires` (as `RepertoireFilters`) for further analysis.
`RepertoireFilter`	Object with a pointer to an original `Repertoire` with descriptions of how it was filtered for inclusion in a `RepertoireGroup` and optional additional metadata such as time point. `RepertoireFilters` have their own schema and file format described here.
`Rearrangements`	Annotated sequences describing adaptive immune receptor chains. `Rearrangements` have their own schema and file format described here.
`Clones`	Information about inferred clones/lineages from a study. `Clones` have their own schema and file format described here.
`Cells`	Information about an observed Cell in a study. `Cells` have their own schema and file format described here.
`CellExpression properties`	Information about expression properties observed for a specific cell. `CellExpression` properties have their own schema and file format described here.
`Receptor`	Information about adaptive immune receptors (i.e., Ig and TCR) that are linked to observed Cells in a study. `Receptors` have their own schema and file format described here.
`GermlineSet`	Lists the receptor germline sequences that have been identified for a single locus within a particular species or sub-species, together with supporting evidence and additional metadata to assist with sequence annotation. Brings together the subsidiary objects `AlleleDescription`, `SequenceDelineationV`, `RearrangedSequence`, `UnrearrangedSequence`, and `Acknowledgement`.
`GenotypeSet`	Lists the receptor germline sequences that have been identified within a single subject, including both those that are listed within `GermlineSets` and those that have not been so listed. References the subsidiary object `Genotype`, which covers a single locus.
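
The role of `repertoire_id` as the link between a `Repertoire` and its `Rearrangements` can be sketched as a simple join. The records below are hypothetical and heavily trimmed, the placement of `subject_id` under a `subject` key is an assumption made for this example, and `group_by_repertoire` is an illustrative helper, not an AIRR library function.

```python
from collections import defaultdict

# Hypothetical, trimmed records; real objects carry many more fields.
repertoires = [
    {"repertoire_id": "rep-1", "subject": {"subject_id": "S1"}},
    {"repertoire_id": "rep-2", "subject": {"subject_id": "S2"}},
]
rearrangements = [
    {"sequence_id": "seq-a", "repertoire_id": "rep-1"},
    {"sequence_id": "seq-b", "repertoire_id": "rep-1"},
    {"sequence_id": "seq-c", "repertoire_id": "rep-2"},
]


def group_by_repertoire(rows):
    """Index rearrangement records by the repertoire_id they point to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["repertoire_id"]].append(row)
    return grouped


grouped = group_by_repertoire(rearrangements)
for rep in repertoires:
    rows = grouped.get(rep["repertoire_id"], [])
    # Repertoire metadata is now available alongside its sequence records.
    print(rep["repertoire_id"], rep["subject"]["subject_id"], len(rows))
```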

Relationship between Schema Objects#

The MiAIRR categories are hierarchical and include information about the study, the subjects, the collected samples and how they are processed, details of the sequencing protocol, and information about the data analysis. The top-down relationships are either 1-to-n, indicating that the top-level object can be related to any number of sub-level objects, or n-to-n, indicating that any number of top-level objects can be related to any number of sub-level objects. Lastly, 1-to-1 indicates that the top-level object is related to a single sub-level object.

Study 1-to-n with Subject. A study may contain any number of subjects.
Subject 1-to-n with Diagnosis. Each subject may contain any number of diagnoses.
Subject 1-to-n with Sample. Each subject may contain any number of samples.
Sample 1-to-n with CellProcessing. A sample may have any number of cell processing records.
CellProcessing 1-to-n with NucleicAcidProcessing. A cell processing record may have any number of nucleic acid processing records.
NucleicAcidProcessing 1-to-n with SequencingRun. A nucleic acid processing record may have any number of sequencing runs.
SequencingRun n-to-n with DataProcessing. Multiple sequencing runs can be combined in one data processing, and a single sequencing run can undergo multiple data processings.

However, this hierarchy is deep and complicated. Therefore, to simplify the processing of this information, we denormalized the hierarchy around the conceptual Repertoire object. This denormalization represents many relationships as 1-to-1, which simplifies the structure. A single Repertoire has these relationships with the primary schema objects:

Repertoire 1-to-1 with Study. A repertoire is for a single study, though a study may have multiple repertoires.
Repertoire 1-to-1 with Subject. A repertoire is for a single subject, though a subject may have other repertoires defined.
SampleProcessing 1-to-1 with Sample, CellProcessing, NucleicAcidProcessing, and SequencingRun. A sample processing is a single chain from initial collection, through cell and nucleic acid processing, to sequencing.
Repertoire 1-to-n with SampleProcessing. Generally a repertoire has a single sample processing, but sometimes studies perform technical replicates or re-sequencing to generate additional data, and these studies will have multiple sample processings, which are to be combined and analyzed together as part of the same repertoire.
Repertoire 1-to-n with DataProcessing. A repertoire can be analyzed multiple times. More details about multiple data processings are provided below.

The trade-off with denormalization of the hierarchy is that it causes duplication of data. For example, two repertoires for the same study will have the Study information duplicated within each of the two repertoire records; likewise multiple repertoires for the same subject will have the Subject information duplicated.

While the denormalized Repertoire simplifies read-only access to the MiAIRR information, it complicates data entry and write access to the information because updates need to be propagated to all of the duplicate records. Therefore, Repertoire was designed to be easily transformed into a normalized form, representing the full hierarchy of the objects, by utilizing the study_id, subject_id, sample_id, and sample_processing_id fields to uniquely identify the Study, Subject, Sample, and SampleProcessing objects across multiple repertoires. The exception is that CellProcessing and NucleicAcidProcessing do not have their own unique identifiers, so they are included within SampleProcessing.
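
A minimal sketch of that transformation, assuming each repertoire is a dict with `study` and `subject` objects and a `sample` list whose entries each carry `sample_id` and `sample_processing_id`. This nesting is an assumption made for the example, and `normalize` is a hypothetical helper. Each sample processing record is kept whole because, as noted above, CellProcessing and NucleicAcidProcessing have no identifiers of their own.

```python
def normalize(repertoires):
    """Split denormalized repertoires into de-duplicated tables keyed by ID."""
    studies, subjects, sample_processings = {}, {}, {}
    links = []
    for rep in repertoires:
        study_id = rep["study"]["study_id"]
        subject_id = rep["subject"]["subject_id"]
        # Keep the first copy seen; later duplicates are assumed to be identical.
        studies.setdefault(study_id, rep["study"])
        subjects.setdefault(subject_id, rep["subject"])
        keys = []
        for sp in rep["sample"]:
            key = (sp["sample_id"], sp["sample_processing_id"])
            sample_processings.setdefault(key, sp)
            keys.append(key)
        # The repertoire itself shrinks to a row of references.
        links.append({
            "repertoire_id": rep["repertoire_id"],
            "study_id": study_id,
            "subject_id": subject_id,
            "sample_processing": keys,
        })
    return studies, subjects, sample_processings, links


# Two hypothetical repertoires from the same study and subject.
example = [
    {
        "repertoire_id": "rep-1",
        "study": {"study_id": "ST1", "study_title": "Example study"},
        "subject": {"subject_id": "S1", "sex": "female"},
        "sample": [{"sample_id": "sample-1", "sample_processing_id": "sp-1"}],
    },
    {
        "repertoire_id": "rep-2",
        "study": {"study_id": "ST1", "study_title": "Example study"},
        "subject": {"subject_id": "S1", "sex": "female"},
        "sample": [{"sample_id": "sample-2", "sample_processing_id": "sp-1"}],
    },
]

studies, subjects, sample_processings, links = normalize(example)
print(len(studies), len(subjects), len(sample_processings), len(links))
```

In the normalized form an update to the Study or Subject record is made once, instead of being propagated to every repertoire that duplicates it.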

As a Repertoire is limited to a single sample, many analyses will involve multiple Repertoires, which may be combined into a RepertoireGroup.

AIRR extension properties#

The OpenAPI V2 and V3 specifications provide the ability to define extension properties on schema objects. These are additional properties on the schema definition itself, not to be confused with additional properties on the data. These extension properties allow schema definitions to be annotated with MiAIRR- and AIRR-specific information. Instead of creating a separate extension for each property, a single extension property, x-airr, is defined; it is an object that contains any number of properties. Within the AIRR schema, AIRR_Extension defines the schema for the x-airr object and the properties within it. Here is a list of the currently supported AIRR extension properties:

Extension	Description
`miairr`	Present if the annotated property is a MiAIRR data standard element. Always has a requirement level assigned to it.
`nullable`	Assumes `miairr`. False if the annotated property must not be `NULL` by the MiAIRR standard, otherwise True or null. This extension is not valid for OpenAPI V3 as the `nullable` builtin property should be used.
`set`	Assumes `miairr`. The MiAIRR set for the annotated property.
`subset`	Assumes `miairr`. The MiAIRR subset for the annotated property.
`name`	Assumes `miairr`. The MiAIRR field name.
`format`	Describes the format for the annotated property. Value is either `free text`, `controlled vocabulary` or `ontology`.
`ontology`	If `format=ontology` then this provides additional information about the ontology including draft status, name, URL and top node term.
`identifier`	True if the field is an identifier required to link metadata and/or individual sequence records across objects in the complete AIRR Data Model and ADC API.
`adc-query-support`	True if an ADC API implementation must support queries on the field. If false, query support for the field in ADC API implementations is optional.
`adc-api-optional`	True if the field is specific to the ADC API and is not part of the AIRR specification proper. These are typically “convenience” fields that make finding data easy or efficient (can be optimized by a repository).
`deprecated`	True if the field has been deprecated from the schema.
`deprecated-description`	Information regarding the deprecation of the field.
`deprecated-replaced-by`	The deprecated field is replaced by this list of fields.
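
To show how a tool might consume these annotations, the sketch below reads the `x-airr` object from a schema fragment represented as the dict a YAML parser would return. The fragment is hypothetical: `old_field`, `new_field`, and every annotation value are invented for illustration, and the helper functions are not part of any AIRR library.

```python
# Hypothetical schema fragment. Field names and x-airr values are
# illustrative assumptions, not taken from the AIRR schema.
EXAMPLE_PROPERTIES = {
    "study_id": {
        "type": "string",
        "x-airr": {
            "miairr": "important",
            "nullable": True,
            "name": "Study ID",
            "identifier": True,
            "adc-query-support": True,
        },
    },
    "old_field": {
        "type": "string",
        "x-airr": {
            "deprecated": True,
            "deprecated-description": "Renamed for illustration.",
            "deprecated-replaced-by": ["new_field"],
        },
    },
    "new_field": {"type": "string"},
}


def x_airr(spec):
    """Return the x-airr object of a property, or an empty dict if absent."""
    return spec.get("x-airr", {})


def miairr_levels(properties):
    """Map each MiAIRR field to its requirement level."""
    return {
        name: x_airr(spec)["miairr"]
        for name, spec in properties.items()
        if "miairr" in x_airr(spec)
    }


def flagged(properties, flag):
    """List the fields whose x-airr object sets `flag` to True."""
    return sorted(
        name for name, spec in properties.items()
        if x_airr(spec).get(flag) is True
    )


def replacements(properties):
    """Map each deprecated field to the fields that replace it."""
    return {
        name: x_airr(spec).get("deprecated-replaced-by", [])
        for name, spec in properties.items()
        if x_airr(spec).get("deprecated") is True
    }


print(miairr_levels(EXAMPLE_PROPERTIES))
print(flagged(EXAMPLE_PROPERTIES, "adc-query-support"))
print(replacements(EXAMPLE_PROPERTIES))
```

Because all annotations live under the single `x-airr` key, a reader that does not understand them can ignore that key and still process the definition as ordinary OpenAPI.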

FAIR Principles#

We desire AIRR standard objects to be FAIR (findable, accessible, interoperable and reusable) [Wilkinson_2016]:

Findable: by giving AIRR standard objects a globally unique identifier.
Accessible: by providing an API where AIRR standard objects can be queried and downloaded.
Interoperable: by defining an OpenAPI schema for the AIRR standard objects.
Reusable: by linking the AIRR standard objects together into standard formats.
