Repertoire Schema#
A Repertoire
is an abstract organizational unit of analysis that
is defined by the researcher and consists of study metadata, subject
metadata, sample metadata, cell processing metadata, nucleic acid
processing metadata, sequencing run metadata, a set of raw sequence
files, data processing metadata, and a set of Rearrangements
. A
Repertoire
gathers all of this information together into a
composite object, which can be easily accessed by computer programs
for data entry, analysis and visualization.
A Repertoire is specific to a single subject, but otherwise it can
consist of any number of samples (which can be processed in different
ways), any number of raw sequence files, and any number of
rearrangements. It can also consist of any number of data processing
metadata objects that describe the processing of raw sequence files
into Rearrangements.
Typically, a Repertoire
corresponds to the biological concept of
the immune repertoire for that single subject which the researcher
experimentally measures and computationally analyzes. However,
researchers can have different interpretations about what constitutes
the biological immune repertoire; therefore, the Repertoire
schema
attempts to be flexible and broadly useful for all AIRR-seq studies.
Another researcher can take the same raw sequencing data and
associated metadata and create their own Repertoire
that is
different from the original researcher’s. A common example is to
define a repertoire that is a subset such as “productive
rearrangements for IGHV4” whereas the original researcher defined a
more generic “B cell repertoire”. This new Repertoire
would have
much of the same metadata as the original Repertoire
, except
associated with a different study, and with additional information in
the data processing metadata that describes how the rearrangements
were filtered down to just the “productive rearrangements for
IGHV4”. Likewise, another researcher may get access to the original
biosample material and perform their own sample processing and
sequencing, which also would be a new Repertoire
. That new
Repertoire
could combine samples from the original researcher’s
Repertoire
with the new sample data as a large dataset for the
subject.
Multiple Data Processing on a Repertoire#
Data processing can be a complicated multi-stage
process. Documenting the process in a formal way is challenging
because of the diversity of actions that may be performed. The MiAIRR
standard requires documentation of the process but in an informal way
with free text descriptions. A Repertoire might undergo multiple
different data processing runs for any number of reasons, e.g. to
compare the results from different toolchains, or to compare different
settings for the same toolchain.
It is expected that all of the Samples
of a Repertoire
will be
processed together within a DataProcessing
. That is, a
DataProcessing
that only uses some but not all samples in a
Repertoire
could be confusing to users and appear as though data
is missing. Likewise, processing some samples within a Repertoire
with one DataProcessing
and the remaining samples with a
different DataProcessing
could also confuse users. Because
DataProcessing
is unstructured information, it is not possible
to validate that all Samples
in a Repertoire
are being
processed together, so this expectation cannot be strictly
enforced.
Having multiple DataProcessing
for a Repertoire
will
create multiple sets of Rearrangements
that are distinct and
separate from each other. Analysis tools need to be careful not to mix
these sets of Rearrangements
from different DataProcessing
because it can generate incorrect results. The identifier
data_processing_id
was added so Rearrangements
can
identify their specific DataProcessing
.
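The separation described above can be sketched in Python. This is a minimal illustration, not part of the AIRR reference tooling: the function name `group_by_processing`, the record layout as plain dictionaries, and the `sequence_id` values are assumptions made for the example. Only `repertoire_id` and `data_processing_id` come from the text.

```python
from collections import defaultdict


def group_by_processing(rearrangements):
    """Group rearrangement records by (repertoire_id, data_processing_id).

    Hypothetical helper: records are assumed to be plain dicts.
    A record without data_processing_id is grouped under None.
    """
    groups = defaultdict(list)
    for rec in rearrangements:
        key = (rec["repertoire_id"], rec.get("data_processing_id"))
        groups[key].append(rec)
    return dict(groups)


# Illustrative records; the identifiers are made up.
rearrangements = [
    {"sequence_id": "s1", "repertoire_id": "rep-1", "data_processing_id": "dp-A"},
    {"sequence_id": "s2", "repertoire_id": "rep-1", "data_processing_id": "dp-B"},
    {"sequence_id": "s3", "repertoire_id": "rep-1", "data_processing_id": "dp-A"},
]

groups = group_by_processing(rearrangements)
for (rep_id, dp_id), records in sorted(groups.items()):
    print(rep_id, dp_id, [r["sequence_id"] for r in records])
```

Analysing each group on its own keeps the output of one toolchain or setting from being counted together with another.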
Linking Data#
Each Repertoire
has a unique repertoire_id
identifier. This
identifier should be globally unique so that repertoires from multiple
studies can be combined together without conflict. The
repertoire_id
is used to link other AIRR data to a
Repertoire
. Specifically, the Rearrangements Schema includes repertoire_id
for referencing the
specific Repertoire
for that Rearrangement
.
If a Repertoire has multiple DataProcessing then data_processing_id
should be used to distinguish the appropriate DataProcessing within the
Repertoire. The Rearrangements schema contains data_processing_id for
this purpose. The data_processing_id is only unique within a
Repertoire, so repertoire_id should first be used to get the
appropriate Repertoire object and then data_processing_id used to
acquire the appropriate DataProcessing.
It is expected that typical Repertoires
might only have a single
DataProcessing
, in which case repertoire_id
and
data_processing_id
will be semantically equivalent and only the
former should be used.
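The two-step lookup can be sketched as follows. This is an illustration under stated assumptions: the function name `find_data_processing` is hypothetical, repertoires are assumed to be plain dictionaries, and the key `data_processing` for the array of DataProcessing objects is assumed for the example.

```python
def find_data_processing(repertoires, repertoire_id, data_processing_id=None):
    """Resolve a DataProcessing object in two steps.

    Step 1: find the Repertoire by its globally unique repertoire_id.
    Step 2: find the DataProcessing by data_processing_id, which is only
    unique within that Repertoire.
    The 'data_processing' key name is an assumption of this sketch.
    """
    rep = next((r for r in repertoires if r["repertoire_id"] == repertoire_id), None)
    if rep is None:
        raise KeyError(f"unknown repertoire_id: {repertoire_id}")

    processings = rep.get("data_processing", [])
    if data_processing_id is None:
        # With a single DataProcessing, repertoire_id alone is sufficient.
        if len(processings) == 1:
            return processings[0]
        raise ValueError("data_processing_id is needed to choose a DataProcessing")

    for dp in processings:
        if dp.get("data_processing_id") == data_processing_id:
            return dp
    raise KeyError(f"unknown data_processing_id: {data_processing_id}")


# Illustrative data; the identifiers are made up.
repertoires = [
    {
        "repertoire_id": "rep-1",
        "data_processing": [
            {"data_processing_id": "dp-A"},
            {"data_processing_id": "dp-B"},
        ],
    },
    {"repertoire_id": "rep-2", "data_processing": [{"data_processing_id": "dp-A"}]},
]

chosen = find_data_processing(repertoires, "rep-1", "dp-B")
only = find_data_processing(repertoires, "rep-2")
```

Note that `dp-A` appears in both repertoires without conflict, because the Repertoire is always resolved first.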
If a Repertoire has multiple sample processing objects in the sample
array then sample_processing_id should be used to distinguish the
appropriate sample processing object within the Repertoire. The
Rearrangement object can contain a sample_processing_id to uniquely
identify a sample processing object within a Repertoire. Like
data_processing_id, the sample_processing_id is only unique within
the Repertoire, so repertoire_id should first be used to get the
appropriate Repertoire object and then sample_processing_id should
be used to determine the appropriate sample processing object that is
associated with the Rearrangement. If the Rearrangement object does not
have a sample_processing_id then it can be assumed that the
rearrangement is associated with all of the samples in the Repertoire
(e.g. the rearrangement is a collapsed rearrangement across multiple
samples).
It is expected that Repertoires
might often have a single
sample processing object, in which case repertoire_id
and
sample_processing_id
will be semantically equivalent and only the
former should be used.
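The rule for sample_processing_id, including the fallback when it is absent, can be sketched as below. The function name `samples_for_rearrangement` and the `sample_id` values are hypothetical; the sketch assumes the Repertoire has already been resolved through repertoire_id and that objects are plain dictionaries.

```python
def samples_for_rearrangement(repertoire, rearrangement):
    """Return the sample processing objects associated with a rearrangement.

    If the rearrangement has no sample_processing_id, it is taken to be
    associated with every entry of the Repertoire's sample array.
    """
    samples = repertoire.get("sample", [])
    sp_id = rearrangement.get("sample_processing_id")
    if sp_id is None:
        return list(samples)
    return [s for s in samples if s.get("sample_processing_id") == sp_id]


# Illustrative data; the identifiers are made up.
repertoire = {
    "repertoire_id": "rep-1",
    "sample": [
        {"sample_processing_id": "sp-1", "sample_id": "blood-d0"},
        {"sample_processing_id": "sp-2", "sample_id": "blood-d7"},
    ],
}

single = samples_for_rearrangement(
    repertoire, {"repertoire_id": "rep-1", "sample_processing_id": "sp-2"}
)
collapsed = samples_for_rearrangement(repertoire, {"repertoire_id": "rep-1"})
```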
Finally, if it is necessary to link a Rearrangement
object with a unique
pairing of sample processing and DataProcessing
, the repertoire_id
of
the Rearrangement
object should be used to identify the correct Repertoire
object and then the data_processing_id
should be used to identify the correct
DataProcessing
metadata and the sample_processing_id
should be used to
identify the correct sample processing metadata within that Repertoire
.
Duality between Repertoires and Rearrangements#
There is an important duality relationship between Repertoires
and
Rearrangements
, specifically with the experimental protocols
described in the Repertoire
versus the annotations on
Rearrangements
. A Repertoire
defines an experimental design
for what a researcher intends to measure or observe, while the
Rearrangements
are what was actually measured and
observed. Technically, the border between the two occurs at
sequencing, that is when the biological physical entity (prepared DNA)
is measured and recorded as information (nucleotide sequence).
This duality is important when considering how to answer certain
questions. For example, locus
for Rearrangements
may have the
value “IGH” which indicates that B cell heavy chain receptors were
measured, yet the Repertoire
might have “T cell” in
cell_subset
which indicates the researcher intended to measure T
cells. This conflict between the two indicates something is
wrong. Differences can arise in many ways: for example, from errors in
the experimental protocol, or from data processing that incorrectly
processed the raw sequencing data, leading to invalid annotations.
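A consistency check of this kind can be sketched in a few lines. This is a deliberately simplified illustration: the function name `locus_conflicts` is hypothetical, cell_subset is treated as a plain text label, and the rule that T cell experiments should yield loci starting with "TR" and B cell experiments loci starting with "IG" is an assumption of the sketch rather than something the schema enforces.

```python
def locus_conflicts(cell_subset, loci):
    """Return the observed loci that contradict the intended cell subset.

    Simplifying assumptions: cell_subset is a free-text label, and the
    expected locus prefix is 'TR' for T cells and 'IG' for B cells.
    An unrecognised label yields no conflicts, since nothing can be checked.
    """
    label = (cell_subset or "").lower()
    if "t cell" in label:
        expected_prefix = "TR"
    elif "b cell" in label:
        expected_prefix = "IG"
    else:
        return []
    return sorted({locus for locus in loci if not locus.startswith(expected_prefix)})


# The Repertoire says "T cell", but some Rearrangements report IGH.
conflicts = locus_conflicts("T cell", ["IGH", "IGH", "TRB"])
print(conflicts)
```

A non-empty result does not say which side is wrong; it only flags that the experimental description and the annotations disagree and should be reviewed.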
File Format Specification#
Files are YAML/JSON with a structure defined below. Files should be
encoded as UTF-8. Identifiers are case-sensitive. Files should have the
extension .yaml, .yml, or .json.
File Structure#
The file as a whole is considered a dictionary (key/value pair) structure with the keys Info and Repertoire.
- The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI V2 specification. If provided, version in Info should reference the version of the AIRR schema for the file.
- The file should correspond to a list of Repertoire objects, using Repertoire as the key to the list.
- Each Repertoire object should contain a top-level key/value pair for repertoire_id that uniquely identifies the repertoire.
- Some fields require the use of a particular ontology or controlled vocabulary.
The structure is the same regardless of whether the data is stored in a file or a data repository. For example, the ADC API will return a properly structured JSON object that can be saved to a file and used directly without modification.
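As a rough illustration of this layout, the sketch below parses a small JSON document and checks only the structural rules listed above. It is not a schema validator: the function name `check_repertoire_file` is hypothetical, and the `title` and `version` values in the sample Info object are placeholders.

```python
import json


def check_repertoire_file(doc):
    """Return a list of structural problems found in a parsed repertoire file.

    Only the top-level layout is checked: a dictionary with a 'Repertoire'
    list whose entries each carry a distinct, non-empty repertoire_id.
    """
    if not isinstance(doc, dict):
        return ["top level is not a dictionary"]

    problems = []
    reps = doc.get("Repertoire")
    if not isinstance(reps, list):
        problems.append("missing 'Repertoire' list")
        reps = []

    seen = set()
    for i, rep in enumerate(reps):
        rid = rep.get("repertoire_id") if isinstance(rep, dict) else None
        if not rid:
            problems.append(f"Repertoire[{i}] has no repertoire_id")
        elif rid in seen:
            problems.append(f"duplicate repertoire_id: {rid}")
        else:
            seen.add(rid)
    return problems


# Placeholder content; the Info object is optional.
text = """
{
  "Info": {"title": "Example repertoire metadata", "version": "0.0"},
  "Repertoire": [
    {"repertoire_id": "rep-1"},
    {"repertoire_id": "rep-2"}
  ]
}
"""

problems = check_repertoire_file(json.loads(text))
print(problems or "structure looks fine")
```

The same function applies unchanged to a JSON object returned by a repository, since the structure does not depend on where the data is stored.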
Repertoire Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
optional, identifier, nullable |
Identifier for the repertoire object. This identifier should be globally unique so that repertoires from multiple studies can be combined together without conflict. The repertoire_id is used to link other AIRR data to a Repertoire. Specifically, the Rearrangements Schema includes repertoire_id for referencing the specific Repertoire for that Rearrangement. |
|
string |
optional, nullable |
Short generic display name for the repertoire |
|
string |
optional, nullable |
Generic repertoire description |
|
required |
Study object |
|
|
required |
Subject object |
|
|
array of SampleProcessing |
required |
List of Sample Processing objects |
|
array of DataProcessing |
required |
List of Data Processing objects |
Study Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
required, nullable |
Unique ID assigned by study registry such as one of the International Nucleotide Sequence Database Collaboration (INSDC) repositories. |
|
string |
required, nullable |
Descriptive study title |
|
required, nullable |
Type of study design |
|
|
string |
optional, nullable |
Generic study description |
|
string |
required, nullable |
List of criteria for inclusion/exclusion for the study |
|
string |
required, nullable |
Funding agencies and grant numbers |
|
string |
optional, nullable |
Full contact information of the contact persons for this study. This should include an e-mail address and a persistent identifier such as an ORCID ID. |
|
string |
required, nullable |
Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address and a persistent identifier such as an ORCID ID. |
|
string |
required, nullable |
Department of data collector |
|
string |
required, nullable |
Institution and institutional address of data collector |
|
string |
required, nullable |
Full contact information of the data depositor, i.e., the person submitting the data to a repository. This should include an e-mail address and a persistent identifier such as an ORCID ID. This is supposed to be a short-lived and technical role until the submission is released. |
|
string |
required, nullable |
Publications describing the rationale and/or outcome of the study. Wherever possible, a persistent identifier should be used such as a DOI or a PubMed ID. |
|
array of string |
required, nullable |
Keywords describing properties of one or more data sets in a study |
|
string |
optional, nullable |
Date the study was first published in the AIRR Data Commons. |
|
string |
optional, nullable |
Date the study data was updated in the AIRR Data Commons. |
Subject Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
required, nullable |
Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used. |
|
boolean |
required |
TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display) |
|
required |
Binomial designation of subject’s species |
|
|
DEPRECATED |
Binomial designation of subject’s species |
|
|
string |
required, nullable |
Biological sex of subject |
|
number |
required, nullable |
Specific age or lower boundary of age range. |
|
number |
required, nullable |
Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null. |
|
required, nullable |
Unit of age range |
|
|
string |
required, nullable |
Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time, hence the chosen events should be in temporal proximity. |
|
string |
DEPRECATED |
|
|
string |
required, nullable |
Broad geographic origin of ancestry (continent) |
|
string |
required, nullable |
Ethnic group of subject (defined as cultural/language-based membership) |
|
string |
required, nullable |
Racial group of subject (as defined by NIH) |
|
string |
required, nullable |
Non-human designation of the strain or breed of animal used |
|
string |
required, nullable |
Subject ID to which Relation type refers |
|
string |
required, nullable |
Relation between subject and linked_subjects, can be genetic or environmental (e.g. exposure) |
|
array of Diagnosis |
optional |
Diagnosis information for subject |
|
object |
optional, nullable |
Genotype for this subject, if known |
Diagnosis Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
required, nullable |
Designation of study arm to which the subject is assigned |
|
required, nullable |
Diagnosis of subject |
|
|
string |
required, nullable |
Time duration between initial diagnosis and current intervention |
|
string |
required, nullable |
Stage of disease at current intervention |
|
string |
required, nullable |
List of all relevant previous therapies applied to subject for treatment of Diagnosis |
|
string |
required, nullable |
Antigen, vaccine or drug applied to subject at this intervention |
|
string |
required, nullable |
Description of intervention |
|
string |
required, nullable |
Medical history of subject that is relevant to assess the course of disease and/or treatment |
Sample Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
required, nullable |
Sample ID assigned by submitter, unique within study. If possible, a persistent sample ID linked to INSDC or similar repository study should be used. |
|
string |
required, nullable |
The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture |
|
required, nullable |
The actual tissue sampled, e.g. lymph node, liver, peripheral blood |
|
|
string |
required, nullable |
The anatomic location of the tissue, e.g. Inguinal, femur |
|
string |
required, nullable |
Histopathologic evaluation of the sample |
|
number |
required, nullable |
Time point at which sample was taken, relative to Collection time event |
|
required, nullable |
Unit of Sample collection time |
|
|
string |
required, nullable |
Event in the study schedule to which Sample collection time relates |
|
string |
required, nullable |
Name and address of the entity providing the sample |
Tissue and Cell Processing Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
required, nullable |
Enzymatic digestion and/or physical methods used to isolate cells from sample |
|
required, nullable |
Commonly-used designation of isolated cell population |
|
|
string |
required, nullable |
List of cellular markers and their expression levels used to isolate the cell population |
|
optional, nullable |
Binomial designation of the species from which the analyzed cells originate. Typically, this value should be identical to species, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g., chimeric animal models. If set, this key will overwrite the species information for all lower layers of the schema. |
|
|
boolean |
required, nullable |
TRUE if single cells were isolated into separate compartments |
|
integer |
required, nullable |
Total number of cells that went into the experiment |
|
integer |
required, nullable |
Number of cells for each biological replicate |
|
boolean |
required, nullable |
TRUE if cells were cryo-preserved between isolation and further processing |
|
string |
required, nullable |
Relative amount of viable cells after preparation and (if applicable) thawing |
|
string |
required, nullable |
Description of the procedure used for marker-based isolation or enrichment of cells |
|
string |
required, nullable |
Description of the methods applied to the sample including cell preparation/ isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript. |
Nucleic Acid Processing Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
required |
The class of nucleic acid that was used as primary starting material for the following procedures |
|
string |
required, nullable |
Description and results of the quality control performed on the template material |
|
number |
required, nullable |
Amount of template that went into the process |
|
required, nullable |
Unit of template amount |
|
|
string |
required |
Generic type of library generation |
|
string |
required, nullable |
Description of processes applied to substrate to obtain a library that is ready for sequencing |
|
string |
required, nullable |
When using a library generation protocol from a commercial provider, provide the protocol version number |
|
array of PCRTarget |
optional |
If a PCR step was performed that specifically targets the IG/TR loci, the target and primer locations need to be provided here. This field holds an array of PCRTarget objects, so that multiplex PCR setups amplifying multiple loci at the same time can be annotated using one record per locus. PCR setups not targeting any specific locus must not annotate this field but select the appropriate library_generation_method instead. |
|
string |
required |
To be considered complete, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered complete & untemplated, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. mixed should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value. |
|
string |
required |
In case an experimental setup is used that physically links nucleic acids derived from distinct Rearrangements before library preparation, this field describes the mode of that linkage. All hetero_* terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. *_head-head refers to techniques that link the 5’ ends of transcripts in a single-cell context. *_tail-head refers to techniques that link the 3’ end of one transcript to the 5’ end of another one in a single-cell context. This term does not provide any information whether a continuous reading-frame between the two is generated. *_prelinked refers to constructs in which the linkage was already present on the DNA level (e.g. scFv). |
PCR Target Locus Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
required, nullable |
Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature. |
|
string |
required, nullable |
Position of the most distal nucleotide templated by the forward primer or primer mix |
|
string |
required, nullable |
Position of the most proximal nucleotide templated by the reverse primer or primer mix |
Raw Sequence Data Fields#
Sequencing Run Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
required, nullable |
ID of sequencing run assigned by the sequencing facility |
|
integer |
required, nullable |
Number of usable reads for analysis |
|
string |
required, nullable |
Designation of sequencing instrument used |
|
string |
required, nullable |
Name and address of sequencing facility |
|
string |
required, nullable |
Date of sequencing run |
|
string |
required, nullable |
Name, manufacturer, order and lot numbers of sequencing kit |
|
SequencingData |
optional |
Set of sequencing files produced by the sequencing run |
Data Processing Fields#
Name |
Type |
Attributes |
Definition |
---|---|---|---|
|
string |
optional, identifier, nullable |
Identifier for the data processing object. |
|
boolean |
optional, identifier |
If true, indicates this is the primary or default data processing for the repertoire and its rearrangements. If false, indicates this is a secondary or additional data processing. |
|
string |
required, nullable |
Version number and / or date, include company pipelines |
|
string |
required, nullable |
How paired end reads were assembled into a single receptor sequence |
|
string |
required, nullable |
How sequences were removed from (4) based on base quality scores |
|
string |
required, nullable |
How primers were identified in the sequences, were they removed/masked/etc? |
|
string |
required, nullable |
The method used for combining multiple sequences from (4) into a single sequence in (5) |
|
string |
required, nullable |
General description of how QC is performed |
|
array of string |
optional, nullable |
Array of file names for data produced by this data processing. |
|
string |
required, nullable |
Source of germline V(D)J genes with version number or date accessed. |
|
string |
optional, nullable |
Unique identifier of the germline set and version, in standardized form (Repo:Label:Version) |
|
string |
optional, nullable |
Identifier for machine-readable PROV model of analysis provenance |