AIRR Data Representations#
AIRR Data Representations are versioned specifications that consist of a file format and a well-defined schema. The schema is provided in a machine-readable YAML document that follows the OpenAPI v2.0 specification. The schema defines the data model, field names, data types, and encodings for AIRR standard objects. Strict typing enables interoperability and data sharing between different AIRR-seq analysis tools and repositories, and some fields use a controlled vocabulary or an ontology for value restriction. Specification extensions are utilized to define AIRR-specific attributes.
FAIR Principles#
We desire AIRR standard objects to be FAIR (findable, accessible, interoperable and reusable) [Wilkinson_2016]:
findable: by giving AIRR standard objects a globally unique identifier
accessible: by providing an API where AIRR standard objects can be queried and downloaded
interoperable: by defining a OpenAPI schema for the AIRR standard objects
reusable: by linking the AIRR standard objects together into a standard formats
AIRR Data Model#
The MiAIRR standard defines the minimal information for submission and publication of AIRR-seq datasets. The standard defines a set of data elements for this information and organizes them into six high-level sets.
Study, Subject and Diagnosis
Sample Collection
Sample Processing and Sequencing
Raw Sequences
Data Processing
Processed Sequences with Annotations
However beyond these sets, MiAIRR does not define any structure, data model or relationship between the data elements. This provides flexibility for the information to be stored in various database repositories but is problematic for interoperability and reusability of that information by computer programs. The AIRR Data Model overcomes these issues by defining a schema for the MiAIRR data elements, structuring them within schema objects, defining the relationship between those objects, and defining a file format.
Here are the primary schema objects of the AIRR Data Model:
Schema Object |
Description |
---|---|
|
Information about the experimental study design, including the title of the study, laboratory contact information, funding, and linked publications. |
|
Information about the study cohorts and individual subjects, including species, sex, age, and ancestry. |
|
Information about disease state(s), therapies, and study group membership (e.g., control versus disease). |
|
Information about the origin and expected composition of the biological sample(s). This set aims to capture essential information about the collection of a sample, including its source (e.g., anatomical site), its provenance (provider), and the experimental condition (e.g., the time point during the course of a disease or treatment). |
|
Information about the cell subset being profiled, as defined by the investigator, and the flow cytometry or other markers used to select the subset. Additional information includes the number of cells per sample and whether cells were prepared in bulk or captured as single cells. |
|
Information about nucleic acid sample type (e.g., RNA versus DNA) and how immune-receptor gene rearrangements were amplified and sequenced (for example, RACE-PCR versus multiplex PCR, paired PCR, and/or varying read length and sequencing chemistries). |
|
Information about the sequencing run, such as the number of reads, read lengths, quality control parameters, the sequencing kit and instrument(s) used, and run batch number. Also includes information about the raw data for the sequencing run (e.g., FASTQ files). |
|
Information about the data processing to transform the raw sequencing data into |
|
Composite object that combines the schema objects |
|
Annotated sequences describing adaptive immune receptor chains. |
|
Information about inferred clones from a study. |
|
Information about an observed Cell in a study. |
|
Information about expression properties observed for a specific cell. |
|
Information about adaptive immune receptors (i.e., Ig and TCR) that are
linked to observed Cells in a study. |
|
Lists the receptor germline sequences that have been identified for a single locus within a particular species or sub-species, together with supporting evidence and additional metadata to assist with sequence annotation. Brings togteher the subsidiary objects |
|
Lists the receptor germline sequences that have been identified within a single subject, including both those that are listed within |
Relationship between Schema Objects#
The MiAIRR categories are hierarchical, and includes information about the study, the subjects, the collected samples and how they are processed, details of the sequencing protocol, and information about the data analysis. The top-down relationships are either 1-to-n indicating the top level object can be related to any number of sub-level objects, or n-to-n indicating any number of top level object can be related to any number of sub-level objects. Lastly, 1-to-1 indicates the top level object is related to a single sub-level object.
Study
1-to-n withSubject
. A study may contain any number of subjects.Subject
1-to-n withDiagnosis
. Each subject may contain any number of diagnoses.Subject
1-to-n withSample
. Each subject may contain any number of samples.Sample
1-to-n withCellProcessing
. A sample may have any number of cell processing records.CellProcessing
1-to-n withNucleicAcidProcessing
. A cell processing record may have any number of nucleic acid processing records.NucleicAcidProcessing
1-to-n withSequencingRun
. A nucleic acid processing records may have any number of sequencing runs.SequencingRun
n-to-n withDataProcessing
. Multiple sequencing runs can be combined in a data processing, and multiple data processing can be done on a sequencing run.
However, this hierarchy is deep and complicated. Therefore to simplify
the processing of this information, we denormalized the hierarchy
around the conceptual Repertoire
object. This denormalization
represents many relationships as 1-to-1 which simplifies the
structure. A single Repertoire
has these relationships with the
primary schema objects.
Repertoire
1-to-1 withStudy
. A repertoire is for a single study, though a study may have multiple repertoires.Repertoire
1-to-1 withSubject
. A repertoire is for a single subject, though a subject may have other repertoires defined.SampleProcessing
1-to-1 withSample
,CellProcessing
,NucleicAcidProcessing
, andSequencingRun
. A sample processing is a single chain from initial collection, through cell and nucleic acid processing, to sequencing.Repertoire
1-to-n withSampleProcessing
. Generally a repertoire has a single sample processing, but sometimes studies perform technical replicates or re-sequencing to generate additional data, and these studies will have multiple sample processings, which are to be combined and analyzed together as part of the same repertoire.Repertoire
1-to-n withDataProcessing
. A repertoire can be analyzed multiple times. More details about multiple data processing is provided below.
The trade-off with denormalization of the hierarchy is that it causes
duplication of data. For example, two repertoires for the same study
will have the Study
information duplicated within each of the two
repertoire records; likewise multiple repertoires for the same subject
will have the Subject
information duplicated.
While the denormalized Repertoire
simplifies read-only access to
the MiAIRR information, it complicates data entry and write access to
the information because updates need to be propagated to all of the
duplicate records. Therefore, Repertoire
was designed to be easily
transformed into a normalized form, representing the full hierarchy of
the objects, by utilizing the study_id
, subject_id
, sample_id
, and
sample_processing_id
fields to uniquely identify the Study
, Subject
, Sample
, and
SampleProcessing
objects across multiple repertoires. The exception is that
CellProcessing
and NucleicAcidProcessing
do not have their own
unique identifiers, so they are included within SampleProcessing
.
AIRR extension properties#
The OpenAPI V2 and V3 specification provides the ability to define extension
properties on schema objects. These are additional properties on
the schema definition directly, not to be confused with additional
properties on the data. These extension properties allow those schema
definitions to be annotated with MiAIRR and AIRR specific
information. Instead of creating separate extensions for each
property, a single extension x-airr
property is defined, which is
an object that contains any number of properties. Within the AIRR
schema, AIRR_Extension
defines the schema for the x-airr
object and the properties within it. Here is a list of the currently
supported AIRR extension properties:
Extension |
Description |
---|---|
|
Present if the annotated property is a MiAIRR data standard element. Always has a requirement level assigned to it. |
|
Assumes |
|
Assumes |
|
Assumes |
|
Assumes |
|
Describes the format for the annotated property. Value is either
|
|
If |
|
True if the field is an identifier required to link metadata and/or individual sequence records across objects in the complete AIRR Data Model and ADC API. |
|
True if an ADC API implementation must support queries on the field. If false, query support for the field in ADC API implementations is optional. |
|
True if the field is specific to the ADC API and is not part of the AIRR specification proper. These are typically “convenience” fields that make finding data easy or efficient (can be optimized by a repository). |
|
True if the field has been deprecated from the schema. |
|
Information regarding the deprecation of the field. |
|
The deprecated field is replaced by this list of fields. |
Schema Definitions#
- Requirement levels of fields
- Repertoire Schema
- Multiple Data Processing on a Repertoire
- Linking Data
- Duality between Repertoires and Rearrangements
- File Format Specification
- Repertoire Fields
- Study Fields
- Subject Fields
- Diagnosis Fields
- Sample Fields
- Tissue and Cell Processing Fields
- Nucleic Acid Processing Fields
- PCR Target Locus Fields
- Raw Sequence Data Fields
- Sequencing Run Fields
- Data Processing Fields
- Rearrangement Schema
- Alignment Schema (Experimental)
- Clone and Lineage Tree Schema (Experimental)
- Cell Schema (Experimental)
- Cell Expression Schema (Experimental)
- Cell Reactivity Schema (Experimental)
- Germline Schema (Experimental)
- Motivation
- Receptor Germline Schema
- Gene and Allele Naming
- Genotypes
- MHC Genotypes
- File Format Specification
- GermlineSet Fields
- AlleleDescription Fields
- RearrangedSequence Fields
- UnrearrangedSequence Fields
- SequenceDelineationV Fields
- GenotypeSet Fields
- Genotype Fields
- MHCGenotypeSet Fields
- MHCGenotype Fields
- Receptor Schema