Repertoire Schema#

A Repertoire is an abstract organizational unit of analysis that is defined by the researcher and consists of study metadata, subject metadata, sample metadata, cell processing metadata, nucleic acid processing metadata, sequencing run metadata, a set of raw sequence files, data processing metadata, and a set of Rearrangements. A Repertoire gathers all of this information together into a composite object, which can be easily accessed by computer programs for data entry, analysis and visualization.

A Repertoire is specific to a single subject. Otherwise, it can consist of any number of samples (which can be processed in different ways), any number of raw sequence files, and any number of rearrangements. It can also consist of any number of data processing metadata objects that describe the processing of raw sequence files into Rearrangements.

Typically, a Repertoire corresponds to the biological concept of the immune repertoire for that single subject, which the researcher experimentally measures and computationally analyzes. However, researchers can have different interpretations of what constitutes the biological immune repertoire; therefore, the Repertoire schema attempts to be flexible and broadly useful for all AIRR-seq studies.

Another researcher can take the same raw sequencing data and associated metadata and create their own Repertoire that is different from the original researcher’s. A common example is to define a repertoire that is a subset such as “productive rearrangements for IGHV4” whereas the original researcher defined a more generic “B cell repertoire”. This new Repertoire would have much of the same metadata as the original Repertoire, except associated with a different study, and with additional information in the data processing metadata that describes how the rearrangements were filtered down to just the “productive rearrangements for IGHV4”. Likewise, another researcher may get access to the original biosample material and perform their own sample processing and sequencing, which also would be a new Repertoire. That new Repertoire could combine samples from the original researcher’s Repertoire with the new sample data as a large dataset for the subject.

Multiple Data Processing on a Repertoire#

Data processing can be a complicated multi-stage process. Documenting the process in a formal way is challenging because of the diversity of actions that may be performed. The MiAIRR standard requires documentation of the process, but only informally, as free-text descriptions. A Repertoire might undergo data processing several times for any number of reasons, e.g. to compare the results from different toolchains, or to compare different settings for the same toolchain.

It is expected that all of the Samples of a Repertoire will be processed together within a DataProcessing. That is, a DataProcessing that only uses some but not all samples in a Repertoire could be confusing to users and appear as though data is missing. Likewise, processing some samples within a Repertoire with one DataProcessing and the remaining samples with a different DataProcessing could also confuse users. Because DataProcessing is unstructured information, it is not possible to validate that all Samples in a Repertoire are being processed together, so this expectation cannot be strictly enforced.

Having multiple DataProcessing for a Repertoire will create multiple sets of Rearrangements that are distinct and separate from each other. Analysis tools need to be careful not to mix these sets of Rearrangements from different DataProcessing because it can generate incorrect results. The identifier data_processing_id was added so Rearrangements can identify their specific DataProcessing.
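One way to avoid mixing these sets is to partition Rearrangements by the pair (repertoire_id, data_processing_id) before any analysis. The following is a minimal sketch, assuming Rearrangements have already been loaded as Python dictionaries; the records and the helper name group_by_processing are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical rearrangement records; real records carry many more fields.
rearrangements = [
    {"sequence_id": "seq1", "repertoire_id": "rep-1", "data_processing_id": "dp-a"},
    {"sequence_id": "seq2", "repertoire_id": "rep-1", "data_processing_id": "dp-b"},
    {"sequence_id": "seq3", "repertoire_id": "rep-1", "data_processing_id": "dp-a"},
]


def group_by_processing(records):
    """Group records by (repertoire_id, data_processing_id).

    Records without a data_processing_id are grouped under None for
    their repertoire.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["repertoire_id"], rec.get("data_processing_id"))
        groups[key].append(rec)
    return dict(groups)


groups = group_by_processing(rearrangements)
for key, members in groups.items():
    print(key, [rec["sequence_id"] for rec in members])
```

Each group can then be analyzed on its own, so results from different toolchains or settings are never pooled by accident.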

Linking Data#

Each Repertoire has a unique repertoire_id identifier. This identifier should be globally unique so that repertoires from multiple studies can be combined together without conflict. The repertoire_id is used to link other AIRR data to a Repertoire. Specifically, the Rearrangements Schema includes repertoire_id for referencing the specific Repertoire for that Rearrangement.

If a Repertoire has multiple DataProcessing then data_processing_id should be used to distinguish the appropriate DataProcessing within the Repertoire. The Rearrangements Schema contains data_processing_id for this purpose. The data_processing_id is only unique within a Repertoire, so repertoire_id should first be used to get the appropriate Repertoire object and then data_processing_id used to acquire the appropriate DataProcessing.

It is expected that typical Repertoires might only have a single DataProcessing, in which case repertoire_id and data_processing_id will be semantically equivalent and only the former should be used.

If a Repertoire has multiple sample processing objects in the sample array then sample_processing_id should be used to distinguish the appropriate sample processing object within the Repertoire. The Rearrangement object can contain a sample_processing_id to uniquely identify a sample processing object within a Repertoire. Like data_processing_id, the sample_processing_id is only unique within the Repertoire, so repertoire_id should first be used to get the appropriate Repertoire object and then sample_processing_id should be used to determine the appropriate sample processing object that is associated with the Rearrangement. If the Rearrangement object does not have a sample_processing_id then it can be assumed that the rearrangement is associated with all of the samples in the Repertoire (e.g. the rearrangement is a collapsed rearrangement across multiple samples).

It is expected that Repertoires might often have a single sample processing object, in which case repertoire_id and sample_processing_id will be semantically equivalent and only the former should be used.

Finally, if it is necessary to link a Rearrangement object with a unique pairing of sample processing and DataProcessing, the repertoire_id of the Rearrangement object should be used to identify the correct Repertoire object and then the data_processing_id should be used to identify the correct DataProcessing metadata and the sample_processing_id should be used to identify the correct sample processing metadata within that Repertoire.
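The lookup order described in this section can be sketched in a few lines. This is an illustration only: it assumes Repertoire objects have been parsed into dictionaries, that each entry in the sample array carries a sample_processing_id key, and it uses hypothetical helper names (index_repertoires, resolve_links).

```python
def index_repertoires(repertoires):
    """Map repertoire_id to its Repertoire object."""
    return {rep["repertoire_id"]: rep for rep in repertoires}


def resolve_links(rearrangement, repertoires_by_id):
    """Return (repertoire, data_processing, samples) for one rearrangement."""
    # Step 1: repertoire_id selects the Repertoire.
    rep = repertoires_by_id[rearrangement["repertoire_id"]]

    # Step 2: data_processing_id is only unique within that Repertoire.
    dp_id = rearrangement.get("data_processing_id")
    entries = rep["data_processing"]
    if dp_id is None:
        # Assumption: unambiguous only when there is a single DataProcessing.
        data_processing = entries[0] if len(entries) == 1 else None
    else:
        data_processing = next(
            (dp for dp in entries if dp.get("data_processing_id") == dp_id), None
        )

    # Step 3: a missing sample_processing_id means "all samples".
    sp_id = rearrangement.get("sample_processing_id")
    if sp_id is None:
        samples = list(rep["sample"])
    else:
        samples = [s for s in rep["sample"] if s.get("sample_processing_id") == sp_id]

    return rep, data_processing, samples


# Hypothetical, heavily trimmed Repertoire object.
repertoires = [{
    "repertoire_id": "rep-1",
    "sample": [{"sample_processing_id": "sp-1"}, {"sample_processing_id": "sp-2"}],
    "data_processing": [{"data_processing_id": "dp-a"}, {"data_processing_id": "dp-b"}],
}]
by_id = index_repertoires(repertoires)

rep, dp, samples = resolve_links(
    {"repertoire_id": "rep-1", "data_processing_id": "dp-b"}, by_id
)
```

Here the rearrangement has no sample_processing_id, so samples holds both sample processing objects of the Repertoire.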

Duality between Repertoires and Rearrangements#

There is an important duality relationship between Repertoires and Rearrangements, specifically with the experimental protocols described in the Repertoire versus the annotations on Rearrangements. A Repertoire defines an experimental design for what a researcher intends to measure or observe, while the Rearrangements are what was actually measured and observed. Technically, the border between the two occurs at sequencing, that is when the biological physical entity (prepared DNA) is measured and recorded as information (nucleotide sequence).

This duality is important when considering how to answer certain questions. For example, locus for Rearrangements may have the value “IGH”, which indicates that B cell heavy chain receptors were measured, yet the Repertoire might have “T cell” in cell_subset, which indicates the researcher intended to measure T cells. This conflict between the two indicates something is wrong. Such differences can arise in many ways, for example from errors in the experimental protocol, or from data processing that handled the raw sequencing data incorrectly and produced invalid annotations.
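A consistency check of this kind could look roughly like the sketch below. It is a heuristic under stated assumptions, not part of the schema: it assumes the cell_subset ontology value exposes a human-readable label, that locus names start with "IG" or "TR", and it matches the label by simple substring search. The function name is hypothetical.

```python
def locus_matches_cell_subset(locus, cell_subset_label):
    """Compare the measured locus with the intended cell subset.

    Returns True if they agree, False on a clear conflict, and None when
    either value is missing or cannot be classified by this heuristic.
    """
    if not locus or not cell_subset_label:
        return None

    # What was actually measured, judged from the locus prefix.
    if locus.startswith("IG"):
        measured = "B"
    elif locus.startswith("TR"):
        measured = "T"
    else:
        return None

    # What the researcher intended, judged from the label text.
    label = cell_subset_label.lower()
    if "b cell" in label:
        intended = "B"
    elif "t cell" in label:
        intended = "T"
    else:
        return None

    return measured == intended


conflict = locus_matches_cell_subset("IGH", "T cell")
consistent = locus_matches_cell_subset("TRB", "T cell")
```

A False result only flags the record for review; deciding whether the protocol or the data processing is at fault still requires looking at the experiment.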

File Format Specification#

Files are YAML/JSON with a structure defined below. Files should be encoded as UTF-8. Identifiers are case-sensitive. Files should have the extension .yaml, .yml, or .json.

File Structure#

  • The file as a whole is considered a dictionary (key/value pair) structure with the keys Info and Repertoire.

  • The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI V2 specification. If provided, version in Info should reference the version of the AIRR schema for the file.

  • The file should correspond to a list of Repertoire objects, using Repertoire as the key to the list.

  • Each Repertoire object should contain a top-level key/value pair for repertoire_id that uniquely identifies the repertoire.

  • Some fields require the use of a particular ontology or controlled vocabulary.

  • The structure is the same regardless of whether the data is stored in a file or a data repository. For example, the ADC API will return a properly structured JSON object that can be saved to a file and used directly without modification.
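The structural rules listed above can be checked with the standard library alone. The sketch below assumes a JSON-encoded file that has already been read into a string; the Info values are placeholders, and check_file_structure is a hypothetical helper rather than part of any AIRR library.

```python
import json


def check_file_structure(data):
    """Return a list of structural problems in a parsed Repertoire file."""
    if not isinstance(data, dict):
        return ["top level is not a dictionary"]

    repertoires = data.get("Repertoire")
    if not isinstance(repertoires, list):
        return ["missing 'Repertoire' list"]

    problems = []
    seen = set()
    for i, rep in enumerate(repertoires):
        rid = rep.get("repertoire_id") if isinstance(rep, dict) else None
        if rid is None:
            problems.append(f"Repertoire[{i}] has no repertoire_id")
        elif rid in seen:
            problems.append(f"duplicate repertoire_id: {rid}")
        else:
            seen.add(rid)
    return problems


# Placeholder content: the Info values and identifiers are made up.
example_text = json.dumps({
    "Info": {"title": "Example repertoire file", "version": "placeholder"},
    "Repertoire": [{"repertoire_id": "rep-1"}, {"repertoire_id": "rep-1"}],
})
problems = check_file_structure(json.loads(example_text))
print(problems)
```

The same function works on a response body from a data repository, since the structure is identical in both cases.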

Repertoire Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| repertoire_id | string | optional, identifier, nullable | Identifier for the repertoire object. This identifier should be globally unique so that repertoires from multiple studies can be combined together without conflict. The repertoire_id is used to link other AIRR data to a Repertoire. Specifically, the Rearrangements Schema includes repertoire_id for referencing the specific Repertoire for that Rearrangement. |
| repertoire_name | string | optional, nullable | Short generic display name for the repertoire |
| repertoire_description | string | optional, nullable | Generic repertoire description |
| study | Study | required | Study object |
| subject | Subject | required | Subject object |
| sample | array of SampleProcessing | required | List of Sample Processing objects |
| data_processing | array of DataProcessing | required | List of Data Processing objects |
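As an illustration of how the required attributes in this table might be checked, the sketch below tests a parsed Repertoire dictionary for the four required top-level fields and their container types. The mapping and function name are hypothetical, and the check does not descend into the nested objects.

```python
# Required top-level fields and the Python container expected after parsing.
REQUIRED_REPERTOIRE_FIELDS = {
    "study": dict,
    "subject": dict,
    "sample": list,
    "data_processing": list,
}


def missing_or_mistyped(repertoire):
    """List required fields that are absent or have the wrong container type."""
    errors = []
    for name, expected in REQUIRED_REPERTOIRE_FIELDS.items():
        if name not in repertoire:
            errors.append(f"{name}: missing")
        elif not isinstance(repertoire[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


ok = missing_or_mistyped(
    {"study": {}, "subject": {}, "sample": [], "data_processing": []}
)
bad = missing_or_mistyped({"study": {}, "subject": {}, "sample": {}})
```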

Study Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| study_id | string | required, nullable | Unique ID assigned by study registry such as one of the International Nucleotide Sequence Database Collaboration (INSDC) repositories. |
| study_title | string | required, nullable | Descriptive study title |
| study_type | Ontology | required, nullable | Type of study design |
| study_description | string | optional, nullable | Generic study description |
| inclusion_exclusion_criteria | string | required, nullable | List of criteria for inclusion/exclusion for the study |
| grants | string | required, nullable | Funding agencies and grant numbers |
| study_contact | string | optional, nullable | Full contact information of the contact persons for this study. This should include an e-mail address and a persistent identifier such as an ORCID ID. |
| collected_by | string | required, nullable | Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address and a persistent identifier such as an ORCID ID. |
| lab_name | string | required, nullable | Department of data collector |
| lab_address | string | required, nullable | Institution and institutional address of data collector |
| submitted_by | string | required, nullable | Full contact information of the data depositor, i.e., the person submitting the data to a repository. This should include an e-mail address and a persistent identifier such as an ORCID ID. This is supposed to be a short-lived and technical role until the submission is released. |
| pub_ids | string | required, nullable | Publications describing the rationale and/or outcome of the study. Wherever possible, a persistent identifier should be used such as a DOI or a PubMed ID |
| keywords_study | array of string | required, nullable | Keywords describing properties of one or more data sets in a study |
| adc_publish_date | string | optional, nullable | Date the study was first published in the AIRR Data Commons. |
| adc_update_date | string | optional, nullable | Date the study data was updated in the AIRR Data Commons. |

Subject Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| subject_id | string | required, nullable | Subject ID assigned by submitter, unique within study. If possible, a persistent subject ID linked to an INSDC or similar repository study should be used. |
| synthetic | boolean | required | TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display) |
| species | Ontology | required | Binomial designation of subject’s species |
| organism | Ontology | DEPRECATED | Binomial designation of subject’s species |
| sex | string | required, nullable | Biological sex of subject |
| age_min | number | required, nullable | Specific age or lower boundary of age range. |
| age_max | number | required, nullable | Upper boundary of age range or equal to age_min for specific age. This field should only be null if age_min is null. |
| age_unit | Ontology | required, nullable | Unit of age range |
| age_event | string | required, nullable | Event in the study schedule to which Age refers. For NCBI BioSample this MUST be sampling. For other implementations submitters need to be aware that there is currently no mechanism to encode the potential delta between Age event and Sample collection time, hence the chosen events should be in temporal proximity. |
| age | string | DEPRECATED | |
| ancestry_population | string | required, nullable | Broad geographic origin of ancestry (continent) |
| ethnicity | string | required, nullable | Ethnic group of subject (defined as cultural/language-based membership) |
| race | string | required, nullable | Racial group of subject (as defined by NIH) |
| strain_name | string | required, nullable | Non-human designation of the strain or breed of animal used |
| linked_subjects | string | required, nullable | Subject ID to which Relation type refers |
| link_type | string | required, nullable | Relation between subject and linked_subjects, can be genetic or environmental (e.g. exposure) |
| diagnosis | array of Diagnosis | optional | Diagnosis information for subject |
| genotype | object | optional, nullable | Genotype for this subject, if known |
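The relationship between age_min and age_max stated in the table lends itself to a small check. The sketch below is hypothetical: the first rule comes directly from the age_max definition, while the ordering rule (age_max not smaller than age_min) is an inference from "upper boundary" and is labeled as such in the code.

```python
def check_age_range(subject):
    """Return a list of problems with the age_min / age_max pair."""
    age_min = subject.get("age_min")
    age_max = subject.get("age_max")
    problems = []

    # Stated in the table: age_max should only be null if age_min is null.
    if age_max is None and age_min is not None:
        problems.append("age_max is null but age_min is set")

    # Inferred, not stated verbatim: an upper boundary should not be
    # smaller than the lower boundary.
    if age_min is not None and age_max is not None and age_max < age_min:
        problems.append("age_max is smaller than age_min")

    return problems


specific_age = check_age_range({"age_min": 30, "age_max": 30})
open_range = check_age_range({"age_min": 30, "age_max": None})
```

A specific age is encoded with both fields equal, so specific_age is an empty list, while open_range reports the missing upper boundary.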

Diagnosis Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| study_group_description | string | required, nullable | Designation of study arm to which the subject is assigned |
| disease_diagnosis | Ontology | required, nullable | Diagnosis of subject |
| disease_length | string | required, nullable | Time duration between initial diagnosis and current intervention |
| disease_stage | string | required, nullable | Stage of disease at current intervention |
| prior_therapies | string | required, nullable | List of all relevant previous therapies applied to subject for treatment of Diagnosis |
| immunogen | string | required, nullable | Antigen, vaccine or drug applied to subject at this intervention |
| intervention | string | required, nullable | Description of intervention |
| medical_history | string | required, nullable | Medical history of subject that is relevant to assess the course of disease and/or treatment |

Sample Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| sample_id | string | required, nullable | Sample ID assigned by submitter, unique within study. If possible, a persistent sample ID linked to INSDC or similar repository study should be used. |
| sample_type | string | required, nullable | The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture |
| tissue | Ontology | required, nullable | The actual tissue sampled, e.g. lymph node, liver, peripheral blood |
| anatomic_site | string | required, nullable | The anatomic location of the tissue, e.g. Inguinal, femur |
| disease_state_sample | string | required, nullable | Histopathologic evaluation of the sample |
| collection_time_point_relative | number | required, nullable | Time point at which sample was taken, relative to Collection time event |
| collection_time_point_relative_unit | Ontology | required, nullable | Unit of Sample collection time |
| collection_time_point_reference | string | required, nullable | Event in the study schedule to which Sample collection time relates |
| biomaterial_provider | string | required, nullable | Name and address of the entity providing the sample |

Tissue and Cell Processing Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| tissue_processing | string | required, nullable | Enzymatic digestion and/or physical methods used to isolate cells from sample |
| cell_subset | Ontology | required, nullable | Commonly-used designation of isolated cell population |
| cell_phenotype | string | required, nullable | List of cellular markers and their expression levels used to isolate the cell population |
| cell_species | Ontology | optional, nullable | Binomial designation of the species from which the analyzed cells originate. Typically, this value should be identical to species, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g., chimeric animal models. If set, this key will overwrite the species information for all lower layers of the schema. |
| single_cell | boolean | required, nullable | TRUE if single cells were isolated into separate compartments |
| cell_number | integer | required, nullable | Total number of cells that went into the experiment |
| cells_per_reaction | integer | required, nullable | Number of cells for each biological replicate |
| cell_storage | boolean | required, nullable | TRUE if cells were cryo-preserved between isolation and further processing |
| cell_quality | string | required, nullable | Relative amount of viable cells after preparation and (if applicable) thawing |
| cell_isolation | string | required, nullable | Description of the procedure used for marker-based isolation or enrichment of cells |
| cell_processing_protocol | string | required, nullable | Description of the methods applied to the sample including cell preparation/isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript. |

Nucleic Acid Processing Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| template_class | string | required | The class of nucleic acid that was used as primary starting material for the following procedures |
| template_quality | string | required, nullable | Description and results of the quality control performed on the template material |
| template_amount | number | required, nullable | Amount of template that went into the process |
| template_amount_unit | Ontology | required, nullable | Unit of template amount |
| library_generation_method | string | required | Generic type of library generation |
| library_generation_protocol | string | required, nullable | Description of processes applied to substrate to obtain a library that is ready for sequencing |
| library_generation_kit_version | string | required, nullable | When using a library generation protocol from a commercial provider, provide the protocol version number |
| pcr_target | array of PCRTarget | optional | If a PCR step was performed that specifically targets the IG/TR loci, the target and primer locations need to be provided here. This field holds an array of PCRTarget objects, so that multiplex PCR setups amplifying multiple loci at the same time can be annotated using one record per locus. PCR setups not targeting any specific locus must not annotate this field but select the appropriate library_generation_method instead. |
| complete_sequences | string | required | To be considered complete, the procedure used for library construction MUST generate sequences that 1) include the first V gene codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J gene (i.e. 1 bp 5’ of the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered complete & untemplated, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. mixed should only be used if the procedure used for library construction will likely produce multiple categories of sequences in the given experiment. It SHOULD NOT be used as a replacement of a NULL value. |
| physical_linkage | string | required | In case an experimental setup is used that physically links nucleic acids derived from distinct Rearrangements before library preparation, this field describes the mode of that linkage. All hetero_* terms indicate that in case of paired-read sequencing, the two reads should be expected to map to distinct IG/TR loci. *_head-head refers to techniques that link the 5’ ends of transcripts in a single-cell context. *_tail-head refers to techniques that link the 3’ end of one transcript to the 5’ end of another one in a single-cell context. This term does not provide any information whether a continuous reading-frame between the two is generated. *_prelinked refers to constructs in which the linkage was already present on the DNA level (e.g. scFv). |

PCR Target Locus Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| pcr_target_locus | string | required, nullable | Designation of the target locus. Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature. |
| forward_pcr_primer_target_location | string | required, nullable | Position of the most distal nucleotide templated by the forward primer or primer mix |
| reverse_pcr_primer_target_location | string | required, nullable | Position of the most proximal nucleotide templated by the reverse primer or primer mix |

Raw Sequence Data Fields#

Sequencing Run Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| sequencing_run_id | string | required, nullable | ID of sequencing run assigned by the sequencing facility |
| total_reads_passing_qc_filter | integer | required, nullable | Number of usable reads for analysis |
| sequencing_platform | string | required, nullable | Designation of sequencing instrument used |
| sequencing_facility | string | required, nullable | Name and address of sequencing facility |
| sequencing_run_date | string | required, nullable | Date of sequencing run |
| sequencing_kit | string | required, nullable | Name, manufacturer, order and lot numbers of sequencing kit |
| sequencing_files | SequencingData | optional | Set of sequencing files produced by the sequencing run |

Data Processing Fields#

| Name | Type | Attributes | Definition |
| --- | --- | --- | --- |
| data_processing_id | string | optional, identifier, nullable | Identifier for the data processing object. |
| primary_annotation | boolean | optional, identifier | If true, indicates this is the primary or default data processing for the repertoire and its rearrangements. If false, indicates this is a secondary or additional data processing. |
| software_versions | string | required, nullable | Version number and / or date, include company pipelines |
| paired_reads_assembly | string | required, nullable | How paired end reads were assembled into a single receptor sequence |
| quality_thresholds | string | required, nullable | How sequences were removed from (4) based on base quality scores |
| primer_match_cutoffs | string | required, nullable | How primers were identified in the sequences, were they removed/masked/etc? |
| collapsing_method | string | required, nullable | The method used for combining multiple sequences from (4) into a single sequence in (5) |
| data_processing_protocols | string | required, nullable | General description of how QC is performed |
| data_processing_files | array of string | optional, nullable | Array of file names for data produced by this data processing. |
| germline_database | string | required, nullable | Source of germline V(D)J genes with version number or date accessed. |
| germline_set_ref | string | optional, nullable | Unique identifier of the germline set and version, in standardized form (Repo:Label:Version) |
| analysis_provenance_id | string | optional, nullable | Identifier for machine-readable PROV model of analysis provenance |
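When a Repertoire carries several DataProcessing objects, a tool may want to pick the one flagged by primary_annotation. The sketch below shows one possible selection rule; the function name is hypothetical, and the fallback to a lone unflagged entry is an assumption of this example rather than something the schema prescribes.

```python
def primary_data_processing(repertoire):
    """Return the primary DataProcessing object, or None if ambiguous."""
    entries = repertoire.get("data_processing", [])
    primaries = [dp for dp in entries if dp.get("primary_annotation") is True]

    if len(primaries) == 1:
        return primaries[0]
    # Assumption: a single DataProcessing is the default even when unflagged.
    if not primaries and len(entries) == 1:
        return entries[0]
    # Zero or several candidates among multiple entries: let the caller decide.
    return None


# Hypothetical, heavily trimmed Repertoire object.
repertoire = {
    "repertoire_id": "rep-1",
    "data_processing": [
        {"data_processing_id": "dp-a", "primary_annotation": False},
        {"data_processing_id": "dp-b", "primary_annotation": True},
    ],
}
chosen = primary_data_processing(repertoire)
```

Returning None instead of guessing keeps an ambiguous Repertoire from silently feeding the wrong set of Rearrangements into an analysis.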